SOURCE RECORDS

The Center collects intelligence data from multiple sources for each cybercrime or security threat that it seeks to study or seeks to encourage others to study. These sources typically identify an Internet identifier such as a domain name, URL or IP address as a threat, and provide metadata (artifacts, indicators) that can be used in analysis or measurements.

We evaluate sources using these commonly accepted industry indicators of confidence:

Reputation. The operational, cybersecurity, and academic communities trust that these sources are rigorously prepared and accurate, e.g., that they have low false-positive rates.

Practical, widespread adoption. Commercial and non-commercial entities worldwide rely on these sources to mitigate threats or risk for their networks and their users.

Transparency and accountability. These sources actively maintain their data sets (lists) and provide guidelines and methods for reconciling false-positives (e.g., de-listing).

Service quality and integrity. These sources have a positive reputation for high availability and are recognized for the quality of their methodology and scale of their detection infrastructures.

Metadata. These sources group identifiers (URLs, domain names) into the security threat categories that we intend to study, e.g., phishing domains, malware domains, fake site domains, botnet command-and-control domains, and spam domains. Some sources provide additional metadata, for example, the domain registrar, the date and time when a threat was discovered (reported), and the brand targeted.

From these source records, we produce several kinds of records.

Visit the Contributors page for a list of the publicly accessible, non-commercial, and commercial sources that contribute intelligence data.

We collect threat intelligence data

The Center collects threat intelligence data, that is, reputation or block list data for domain names, hyperlinks (URLs), Internet addresses (IP), and Autonomous Systems (AS), from DNS, URL, and IP address sources. The aggregate and derived records that we produce from this data are available to download from the Records Repository page.

Here, we explain the relationships between the many data sets that we collect, process, and publish.

COMPOSITE RECORDS

We create composite records from threat intelligence records and their metadata, from public domain name and IP Whois, and from DNS query responses. Threat intelligence data may vary according to the kind of threat - for example, phishing data may identify the targeted brand, and malware data may provide a hash of the malware executable - so our records may have different schemas depending on the threat.

Generally, our composite record is anchored by an identifier, e.g., a URL or domain name. For each identifier, we append metadata to that identifier's record. For example, if the identifier is a phishing URL, we parse the host, registered domain name, TLD, and path. We collect DNS and Whois data for the registered domain name, and for both the Autonomous System Numbers (ASNs) and the IP addresses where the URL was hosted. We also append discovery or reporting dates and classifiers; e.g., for phishing, we determine whether the domain is a malicious or legitimate registration and which brand was targeted.
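The parsing step described above could be sketched as follows. This is an illustration only, not the Center's implementation: the function name, the field names, and the example URL are hypothetical, and the registered domain is derived with a naive "last two labels" rule that a real pipeline would replace with a Public Suffix List lookup.

```python
from urllib.parse import urlsplit


def parse_identifier(url: str) -> dict:
    """Split a URL into the fields a composite record might be anchored on."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".")

    # ASSUMPTION: the registered domain is the last two labels of the host.
    # This is wrong for suffixes such as "co.uk"; a production system would
    # consult the Public Suffix List instead.
    registered_domain = ".".join(labels[-2:]) if len(labels) >= 2 else host
    tld = labels[-1] if host else ""

    return {
        "identifier": url,
        "host": host,
        "registered_domain": registered_domain,
        "tld": tld,
        "path": parts.path,
    }


# Hypothetical phishing URL, used only to show the shape of the result.
record = parse_identifier("http://login.example.com/account/verify.php")
```

DNS, Whois, ASN, and classifier fields would then be appended to the same dictionary, keyed by the identifier.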

Each composite record is tagged with its data origin (i.e., the threat feed from which we collected the record). Some of our sources are provided under licenses that prevent us from sharing records derived from these sources. Other sources may permit us to share composite records (which contain some elements of source data). We maintain data origin so that we can explore circumstances where we are able to share subsets of composite records that are not encumbered by license.
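One way to picture origin tagging is a record type that carries the feed name, plus a filter that keeps only records whose origin permits sharing. The sketch below is an assumption about how this could be modeled; the class, the feed names, and the license table are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompositeRecord:
    identifier: str
    threat_type: str
    origin: str  # name of the threat feed the record was collected from


# ASSUMPTION: hypothetical feeds and whether their license permits sharing.
SHARING_PERMITTED = {
    "feed_a": True,
    "feed_b": False,
}


def shareable(records):
    """Return only the records whose origin is known to permit sharing."""
    # An origin missing from the table is treated as encumbered.
    return [r for r in records if SHARING_PERMITTED.get(r.origin, False)]


records = [
    CompositeRecord("http://a.example/login", "phishing", "feed_a"),
    CompositeRecord("b.example", "malware", "feed_b"),
    CompositeRecord("c.example", "spam", "feed_c"),
]
unencumbered = shareable(records)
```

Treating unknown origins as encumbered is a deliberately conservative default for this sketch.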

At this time, we continue to explore opportunities to share composite records with providers of our source data.

AGGREGATE RECORDS

When we conduct analyses or measurements, we also generate aggregate records. Aggregate records are typically measurements (e.g., counts of occurrences of a threat within a registry, registrar, hosting operator) that we use to create tables or charts. These records do not contain data from our threat intelligence sources. We publish certain aggregate records in CSV format. The CSV data sets can be downloaded by visitors who wish to refine or examine underlying data; for example, we may publish a CSV data set that contains counts of phishing sites for several hundred registries, and some visitors may want to study only the legacy, new, or country code TLDs and can do so with the CSV data.
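As an illustration of how a visitor might narrow a downloaded CSV data set, the sketch below keeps only country code TLD rows. The column names and the sample values are invented for this example and do not describe a published file; the filter relies on the fact that two-letter ASCII TLDs are country code TLDs.

```python
import csv
import io

# Invented sample data; column names are assumptions, not a published schema.
SAMPLE_CSV = """tld,phishing_count
com,120
uk,15
xyz,40
de,9
"""


def country_code_rows(csv_text: str) -> list:
    """Keep rows whose TLD is a two-letter ASCII label (a country code TLD)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if len(row["tld"]) == 2 and row["tld"].isascii() and row["tld"].isalpha()
    ]


cc_rows = country_code_rows(SAMPLE_CSV)
cc_total = sum(int(row["phishing_count"]) for row in cc_rows)
```

Separating legacy from new TLDs would need an explicit list of legacy TLDs, which this sketch does not attempt.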

Aggregate records contain several types of measurements.

Occurrence measurements. Counters that help us identify concentrations of cybercrime or threats; for example, how many phishing, spam, or malware domains were registered through a given registrar, hosted at a particular operator, or delegated from a given top-level domain (TLD).

Elapsed times. Measures of the elapsed time between occurrences of two related events; for example, the time elapsed between the registration of a domain name (the creation date) and the time when that domain name is identified or reported as a threat, e.g., a phishing site.

Longitudinal measurements. Measures of cumulative misuses or abuses of identifiers over (long) periods of time, e.g., the number of unique malicious, spam, phishing, or malware domains registered through a given registrar or delegated from a given TLD over 365 days. For example, we collect phishing URLs daily so that we can examine the number of URLs over a period of time, e.g., a month or year. With such measurements, we can support analyses that show flocking and migration behavior (e.g., when a criminal targets a given TLD for malicious registrations that they will use to support a spam infrastructure) or repetitive behavior (e.g., to identify the days of the week on which phishing most frequently occurs).
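The three measurement types above can be sketched together over a handful of rows. Everything here is hypothetical: the row layout, the registrar names, and the dates are invented for illustration, and the day-of-week tally stands in for the kind of repetitive-behavior analysis described.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Invented rows: (domain, registrar, creation date, date reported as a threat)
SAMPLE_ROWS = [
    ("a.example", "registrar_1", "2024-01-01T00:00:00", "2024-01-01T06:00:00"),
    ("b.example", "registrar_1", "2024-01-02T00:00:00", "2024-01-05T00:00:00"),
    ("c.example", "registrar_2", "2024-01-03T12:00:00", "2024-01-08T12:00:00"),
]


def measure(rows):
    """Compute occurrence, elapsed-time, and day-of-week measurements."""
    occurrences = Counter()   # threats per registrar
    elapsed_hours = {}        # hours from creation to report, per domain
    by_weekday = Counter()    # reports per day of the week

    for domain, registrar, created, reported in rows:
        created_at = datetime.fromisoformat(created)
        reported_at = datetime.fromisoformat(reported)

        occurrences[registrar] += 1
        elapsed_hours[domain] = (reported_at - created_at).total_seconds() / 3600
        by_weekday[DAY_NAMES[reported_at.weekday()]] += 1

    return occurrences, elapsed_hours, by_weekday


occurrences, elapsed_hours, by_weekday = measure(SAMPLE_ROWS)
```

A longitudinal view over a month or a year would apply the same tallies to a longer window of daily collections.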