Frequently Asked Questions

What kinds of data do you collect?

The Center collects intelligence data from multiple sources for each cybercrime or security threat that it intends to study or seeks to encourage others to study. Specifically, the Center collects:

Threat intelligence data includes reputation or block list data for domain names, hyperlinks (URLs), Internet addresses (IPs), and Autonomous Systems (ASes) from repositories that we determine to be highly accurate and reliable sources based on commonly accepted indicators of confidence. These sources group identifiers (URLs, domain names) into the security threat categories that we intend to study, e.g., phishing domains, malware domains, fake site domains, botnet command-and-control domains, and spam domains.

Metadata includes operational data - for example, address and domain name registration records, domain name system (DNS) zone records, and routing data - as well as artifacts or indicators of criminality or compromise, for example, the date and time when a threat was reported, the brand targeted, and the malware type and family.

Whose threat intelligence data do you collect?

Currently, the Center collects fraud-related threat intelligence from

  • APWG eCrime eXchange (eCx),

  • Invaluement DNSBLs (invaluement.com),

  • Malware Patrol (Business Protect),

  • MalwareURL (database),

  • OpenPhish (Premium),

  • PhishTank (database),

  • Spamhaus Domain Block List (DBL),

  • SURBL intelligence reputation data (SURBL), and

  • URLhaus (API).

What metadata do you collect?

We currently collect metadata from public domain name and IP address Whois services, from DNS query responses, and from routing data.

We also use independently compiled lists of operators (e.g., subdomain service providers), lists of known sinkhole operators, and public suffix lists.

Why do you use these data?

We select threat intelligence data as “candidates” for inclusion based on several criteria, including:

Reputation in the operational, cybersecurity, and academic communities, which trust that these sources are rigorously prepared and accurate, e.g., have low false-positive rates.

Practical, widespread adoption. Commercial and non-commercial entities worldwide rely on these sources to mitigate threats or risks to their networks and their users.

Transparency and accountability. These sources actively maintain their data sets (lists) and provide guidelines and methods for reconciling false positives (e.g., de-listing).

Service quality and integrity. These sources have a positive reputation for high availability and are recognized for the quality of their methodology and the scale of their detection infrastructures.

Availability over time. Our project offers historical views of cybercrime, so choosing a threat feed that has its own long history of availability, evidence of long-term committed funding, or commercial viability is essential for our needs.

We establish a business or collaborative relationship with the service operators, and then collect, review and experiment with candidate data for months to establish confidence in the service before we include each set.

What records does the Center create?

We create composite records from threat intelligence records and their metadata, from public domain name and IP Whois, and from DNS query responses. Metadata provided in threat intelligence data may vary according to the kind of threat - for example, phishing data may identify the targeted brand, and malware data may provide a hash of the malware executable - so our records may contain different sets of data elements depending on the threat.

Generally, our composite record is anchored by an identifier, e.g., a URL or domain name. For each identifier, we append metadata to that identifier's record. For example, if the identifier is a phishing URL, we parse the host, registered domain name, TLD, and path. We collect DNS and Whois data for the registered domain name, and both the Autonomous System Numbers (ASNs) and IP addresses where the URL was hosted. We also append discovery or reporting dates and classifiers; e.g., for phishing, we determine whether the domain is a malicious or compromised (legitimate) registration and which brand was targeted.
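As a rough sketch, assembling a composite record from a phishing URL might look like the following Python. The `CompositeRecord` fields, the tiny two-label suffix set, and the example URL are all illustrative assumptions; the Center's actual schema is not specified here, and a real implementation would use a full public suffix list.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit

# Tiny illustrative stand-in for a real public suffix list.
TWO_LABEL_SUFFIXES = {"co.uk", "com.br"}

@dataclass
class CompositeRecord:
    # The identifier that anchors the record, plus its parsed parts.
    url: str
    host: str
    registered_domain: str
    tld: str
    path: str
    # Metadata appended later: Whois, DNS, ASNs/IPs, dates, classifiers.
    metadata: dict = field(default_factory=dict)

def split_registered_domain(host: str) -> str:
    """Split off the registered domain, honoring two-label suffixes."""
    labels = host.lower().split(".")
    n = 3 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-n:])

def build_record(url: str) -> CompositeRecord:
    parts = urlsplit(url)
    host = parts.hostname or ""
    return CompositeRecord(
        url=url,
        host=host,
        registered_domain=split_registered_domain(host),
        tld=host.rsplit(".", 1)[-1],
        path=parts.path,
    )

rec = build_record("http://login.example.co.uk/verify/account")
print(rec.host)               # login.example.co.uk
print(rec.registered_domain)  # example.co.uk
```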

Each composite record is tagged with its data origin (i.e., the threat feed from which we collected the source record). Some of our sources are provided under licenses that prevent us from sharing records derived from them. Other sources may permit us to share composite records (which contain some elements of source data). We maintain data origin so that we can identify circumstances where we are able to share subsets of composite records that are not encumbered by license.
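A minimal sketch of how data-origin tags enable license-aware sharing; the feed names, license flags, and record fields below are hypothetical, not actual source licenses.

```python
# Hypothetical per-feed flags: True means the license permits sharing.
SHAREABLE_FEEDS = {"feed_a": True, "feed_b": False}

records = [
    {"domain": "example.com", "origin": "feed_a"},
    {"domain": "example.net", "origin": "feed_b"},
]

# Keep only composite records whose source license permits sharing.
shareable = [r for r in records if SHAREABLE_FEEDS.get(r["origin"], False)]
print([r["domain"] for r in shareable])  # ['example.com']
```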

When we conduct analyses or calculate measurements, we also generate aggregate records. Aggregate records are typically measurements (e.g., counts of occurrences of a threat within a registry, registrar, hosting operator) that we use to create tables or charts. These records do not contain data from our threat intelligence sources. We publish certain aggregate records in CSV format. The CSV data sets can be downloaded by visitors who wish to refine or examine underlying data; for example, we may publish a CSV data set that contains counts of phishing sites for several hundred registries, and some visitors may want to study only the legacy, new, or country code TLDs and can do so with the CSV data.
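For instance, a visitor could filter a published aggregate CSV with a few lines of Python; the column names and counts below are illustrative assumptions, not the Center's actual published schema.

```python
import csv
import io

# Hypothetical excerpt of a published aggregate CSV data set.
csv_text = """tld,tld_type,phishing_domains
com,legacy,1200
shop,new,300
uk,country-code,150
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# A visitor interested only in the new gTLDs can filter the rows:
new_gtlds = [r for r in rows if r["tld_type"] == "new"]
print([(r["tld"], int(r["phishing_domains"])) for r in new_gtlds])
# [('shop', 300)]
```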

What do you measure?

Generally, we measure occurrences and times elapsed between occurrences.

Occurrence measurements are counters that help us identify concentrations of cybercrime or threats; for example, how many phishing, spam, or malware domains were registered through a given registrar, hosted at a particular operator, or delegated from a given top-level domain (TLD).
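An occurrence measurement of this kind is essentially a counting exercise; a sketch with hypothetical threat records and registrar names:

```python
from collections import Counter

# Hypothetical reported threat domains tagged with their sponsoring registrar.
threats = [
    {"domain": "a.example", "registrar": "R1"},
    {"domain": "b.example", "registrar": "R2"},
    {"domain": "c.example", "registrar": "R1"},
]

# Occurrence measurement: how many threat domains per registrar.
counts = Counter(t["registrar"] for t in threats)
print(counts.most_common())  # [('R1', 2), ('R2', 1)]
```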

We use data that we collect and warehouse over time to measure the elapsed time between occurrences of two related events, for example, the time elapsed between the registration of a domain name (the creation date) and the time when that domain name is identified or reported as a threat, e.g., a phishing site. We can also use historical data to measure cumulative misuse or abuse of identifiers, e.g., the number of unique malicious, spam, phishing, or malware domains registered through a given registrar or delegated from a given TLD over 365 days.
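The elapsed-time measurement can be sketched as follows; the record fields, timestamps, and registrar name are hypothetical, and real feeds report these details in varying formats.

```python
from collections import Counter
from datetime import datetime

# Hypothetical threat records with registration and report timestamps.
records = [
    {"domain": "a.example", "registrar": "R1",
     "created": "2023-04-01T00:00:00", "reported": "2023-04-01T06:30:00"},
    {"domain": "b.example", "registrar": "R1",
     "created": "2023-03-15T00:00:00", "reported": "2023-04-20T12:00:00"},
]

def hours_to_report(rec):
    """Elapsed hours between domain creation and the threat report."""
    created = datetime.fromisoformat(rec["created"])
    reported = datetime.fromisoformat(rec["reported"])
    return (reported - created).total_seconds() / 3600

print([round(hours_to_report(r), 1) for r in records])  # [6.5, 876.0]

# Cumulative misuse: unique threat domains per registrar over the window.
per_registrar = Counter(r["registrar"] for r in records)
print(per_registrar["R1"])  # 2
```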

Lastly, we use data that we collect daily to take longitudinal measurements (to look at historical patterns). For example, we collect phishing URLs daily so that we can examine the number of URLs over a period of time, e.g., a month or a year. With such measurements, we can, for example, support analyses that show flocking and migration behavior (e.g., when a criminal targets a given TLD for malicious registrations to support a spam infrastructure) or repetitive behavior (e.g., identifying the days of the week when phishing most frequently occurs).
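A day-of-week longitudinal measurement might be sketched like this, using hypothetical report dates rather than real feed data:

```python
from collections import Counter
from datetime import date

# Hypothetical phishing-report dates collected daily over time.
report_dates = [date(2023, 4, d) for d in (3, 4, 10, 11, 17, 18, 19)]

# Longitudinal view: on which days of the week do reports cluster?
by_weekday = Counter(d.strftime("%A") for d in report_dates)
print(by_weekday["Monday"])  # 3
```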