Mirror, Mirror, on the Wall, Who’s the Fairest (website) of Them all?
Back in the dot-com days of the internet, a company called Alexa started collecting statistics on the websites that users visited, using a plugin people installed in their browsers. Alexa aggregated that data into a list of the “Top 1 Million” sites on the internet: the domains most requested by users of their plugin. They then gave that list away for use by the internet community.
Because it was available, and especially because it was free, the “Alexa Top Million” list became widely used across the internet. In the security world, being listed on the Alexa Top Million list was often used as a proxy for whether a domain should be considered “safe” by default.
The reasoning behind that typically went something like this:
- If a domain is highly ranked in the Alexa Top Million, then
- The domain must be quite popular (even if you haven’t personally heard of it)
- If the domain is popular, blocking it may result in a material number of complaints
- Security engineers should “think twice” before blocking access to it from their organization.
DomainTools has historically provided the Alexa rank of a domain in Iris as one indicator to help investigators make this sort of calculation themselves.
The Conflict
Alexa, now owned by Amazon, announced they are discontinuing the Alexa Top Million list as of 1 May, 2022. This leaves us with a bit of a quandary: do we keep using a “frozen”/outdated list? Do we switch to someone else’s list? Do we drop the Alexa ranking entirely? Or do we try to generate our own ranking of the top domains on the Internet?
We went with the last option: generating our own. We recently acquired Farsight Security, whose DNSDB has a great deal of information about DNS requests, so we were confident we could build a good replacement.
Of course, it’s never that easy.
What Does “Top Million” Even Mean?
When we started researching whether to generate our own list, we ran into a fundamental question: top million domains by what criteria? That question has a lot of answers, and each of them has an interesting bias:
- Domains users request in their browser — This is what the original Alexa list focused on, and what Netcraft still tracks. Collecting information directly from a user’s browser is a good way to understand user interests in the browser. This was Alexa’s original business focus. However, there’s a lot of traffic on the internet that doesn’t involve web browsers. A browser-centric view of the top million domains risks missing things like patch mirrors, content delivery networks (CDNs), DNS resolvers, and other internet infrastructure that are important, but which are transparent to users in their day-to-day browsing.
- Domains requested by a user’s system in DNS: This is what the Cisco/Umbrella Top Million list tracks. Cisco collects and aggregates DNS queries sent to the OpenDNS resolvers to create the list. This data includes CDNs and some other internet infrastructure, which is an improvement over a browser-based approach if you want to understand the internet-facing scope of user traffic. However, it is still user-centric: unless an organization has pointed all of its DNS at the OpenDNS servers, this list will not see server-related things like patch mirrors. It is also somewhat biased, because it can only see traffic from users who have chosen to point their DNS to a non-default resolver. The behavior shown in this traffic comes from a self-selected group of users and may not be a true reflection of trends on the general internet.
- Domains seen in requests across an entire organization’s DNS: This is similar to the Cisco/Umbrella OpenDNS data, but isn’t dependent on users pointing their DNS to a particular resolver. Instead, it depends on the user’s organization or ISP sharing the DNS queries it sends to the internet with a third-party aggregator. This is the Farsight Security DNSDB dataset. It has an advantage over the OpenDNS dataset in that the users in this dataset are following the default behavior from their organization or ISP, and server DNS traffic is in this dataset as well. It has a similar limitation to the Cisco/Umbrella dataset, though: it is still biased towards organizations that have chosen to share data with Farsight Security. Also, since this feed only sees traffic that goes out to the internet, a request for a domain that the organization’s nameservers already have cached won’t appear in it.
- Domains scored by their connections to each other — So far we’ve only talked about user-facing lists. The user-request-tracking approaches suffer from a limitation that users need to visit a site before it appears in the list. For example, a site may be important but rarely visited, and that won’t be reflected in ratings that rely on user traffic. Rather than relying on user requests, you can look at how websites link to each other to identify popular sites. This is how Google’s PageRank algorithm works to identify important domains to return in a search. The DomCop and Majestic Million domain lists follow this PageRank-like approach. This approach has the advantage of not requiring users to be clairvoyant about any site they may ever want to visit in the future, but has the disadvantage of being easily attacked by spammers.
With all of those options, which one to choose? The best answer from our point of view is “all of the above.”
Enter Tranco
We are not the first people to think about this problem. In 2019, a group of researchers looked into building lists of the top domains for research purposes, and into identifying the problems with these sorts of lists (churn, misclassifying a popular but malicious domain, etc.). Their paper analyzed the overlap of the various “top domain” lists with each other and with Alexa, and concluded that a combination approach was best suited to their purposes. We agree, and think it’s well-suited to ours as well.
The approach they put forward uses the position of a domain on each list to generate a “score” for each domain, then takes the average of the scores from each list to generate the position of a domain in the final list. (In practice it’s a bit more complicated than that, but that’s the core idea.) The practical effect of this averaging is that domains that are missing from one or more lists will be pushed down in the final list, since they’ll get “0” votes from lists that don’t have the domain. Conversely, domains that are in all of the lists will be pushed up. This rewards domains that appear consistently across all the collection types, which we feel is a good thing — a domain that is ranked highly across multiple sampling methods is one that is likely legitimately popular.
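As a rough illustration of the averaging idea described above, here is a minimal Python sketch. It assumes a reciprocal-rank score (1 divided by the domain's position on each list), which is one common way to turn positions into scores; the article does not spell out the exact formula, and the function name and sample domains are hypothetical.

```python
from collections import defaultdict


def combine_rankings(ranked_lists):
    """Merge several ranked domain lists into one combined ranking.

    Each input list is ordered from most to least popular.
    ASSUMPTION: a domain scores 1/position on each list it appears in,
    and 0 on lists where it is missing.
    """
    totals = defaultdict(float)
    for ranked in ranked_lists:
        for position, domain in enumerate(ranked, start=1):
            totals[domain] += 1.0 / position

    # Average over ALL lists, so a missing domain is penalized with a 0.
    list_count = len(ranked_lists)
    averages = {domain: total / list_count for domain, total in totals.items()}

    # Highest average first; ties are broken alphabetically for stability.
    return sorted(averages, key=lambda domain: (-averages[domain], domain))


# Hypothetical sample data standing in for three differently sampled lists.
browser_list = ["example.com", "news.example", "shop.example"]
dns_list = ["cdn.example", "example.com", "news.example"]
link_list = ["example.com", "cdn.example", "wiki.example"]

combined = combine_rankings([browser_list, dns_list, link_list])
print(combined)
```

In this toy input, cdn.example is missing from one list yet still outranks news.example, while the domains that appear on only one list fall to the bottom.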
The researchers have put up a website that automatically does this combination of multiple lists, and in theory we could have just used their list. We chose not to, mostly because we wanted to control our own fate. We are going to be working with the Tranco team, but the actual list that appears in Iris will be generated by DomainTools internally.
Homegrown Popularity
Having decided to build the list ourselves, the next question becomes: which data sources are we using? We already knew we wanted to use the Farsight Security dataset, and that we wanted to average it against multiple other datasets to address blind spots in that data, but which ones? When making this decision, we wanted a mix of sampling methods to ensure we got a good cross-section of the different ways of looking at this problem. We also needed to consider the license terms for each dataset to make sure we were allowed to use them. In the end, we chose four datasets to use for our “top” list:
- The Majestic Million: This is a PageRank-like list of 1 million domains, updated regularly, under a Creative Commons Attribution license (CC BY).
- The Cisco/Umbrella top million: This is a list based on DNS requests from users’ systems, released “free of charge” by Cisco.
- The Netcraft topsites top 100: Netcraft runs a browser plugin that collects much the same data that Alexa did. However, they publish only the top 100 sites on their website, not the top 1 million. Their data is offered under a license that allows republishing if Netcraft is attributed.
- Farsight Security’s passive DNS data: This is also a passive DNS list, but it is collected by organization rather than by individual user. We definitely have a license to use this information, since Farsight Security is now a part of DomainTools.
We feel that this combination of lists is a good, broad mix of sampling methods, and the Tranco averaging methodology gives us a good way to combine them.
Why Should You Care?
By the end of Q2 this year, DomainTools will be shifting the rank scores presented in our API and Iris to use this newly-generated ranking. What does this mean to you, our customers? Practically, it means:
- The “Alexa” column in Iris will be replaced with a “Rank” column. That column will contain the domain’s popularity rank according to this new list.
- The Iris API will have a new attribute added to responses, also called “rank”. We will leave the “alexa” attribute in place for the time being, but we expect to remove the “alexa” attribute by the end of 2022.
If you are using the Iris API, and are using the Alexa rank field in those queries, we recommend that you shift to the new “rank” field soon. Beyond that, we do not anticipate any other changes to the user experience. We have confidence that the data generated in this list will be fairly stable, and will be a transparent replacement for the Alexa Top Million list.
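For API users planning that shift, a small fallback helper can smooth the transition while both attributes are present. The sketch below is only illustrative: the attribute names “rank” and “alexa” come from the description above, but the shape of the record (a flat dictionary parsed from a response) and the function name are assumptions, not the documented response format.

```python
def popularity_rank(domain_record):
    """Return a domain's popularity rank from a parsed response record.

    ASSUMPTION: `domain_record` is a flat dict for one domain. Only the
    attribute names "rank" and "alexa" are taken from the article.
    Prefers the new "rank" attribute and falls back to the legacy "alexa"
    attribute; returns None when neither is present.
    """
    for key in ("rank", "alexa"):
        value = domain_record.get(key)
        if value is not None:
            return value
    return None


# Hypothetical records, as they might look during the transition period.
print(popularity_rank({"domain": "example.com", "rank": 120, "alexa": 135}))
print(popularity_rank({"domain": "example.com", "alexa": 135}))
print(popularity_rank({"domain": "unranked.example"}))
```

Once the “alexa” attribute is removed, the fallback key can simply be dropped from the tuple.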