Farsight DNSDB: Mapping Privacy in the DNS
Farsight Security Distinguished Scientist and Director of Research Joe St. Sauver recently wrote a post on how to describe passive DNS to friends and family. I thought I’d follow up and share one of the analogies that I use to explain how Farsight safeguards end-user privacy in the DNS data we collect.
There are many approaches to collecting Passive DNS. Each approach has a different privacy profile. Some aim for fine-grained visibility of end-user activity. At Farsight, we aim to map the DNS without capturing any user-attributable behavior.
I’m a bit of a cartophile, so most of my go-to analogies revolve around maps. I like to think of Farsight’s DNSDB, our historical passive DNS database, as a giant map of the Internet’s domain namespace. Just as a street map might show us where Third Avenue meets Elm Street, DNSDB shows us that the name www.example.com intersects with the IP address 93.184.216.34.
To explain further, let’s consider a turn-by-turn GPS navigation system. Modern systems exchange location data with “the cloud” in real time and provide updates based on traffic and changes to the road network, such as road closures. Older systems, on the other hand, rely on static map data stored in the unit, plus GPS input, to work out where you are headed.
The privacy difference between these two models should be readily apparent. In the real-time model, the service providing you with data knows exactly where you started, where you are, and where you are going at any point in time. Conversely, in the older model, only someone in physical possession of the device could determine where you’ve been.
Now, let’s follow this analogy to look at how we collect Passive DNS data. At Farsight, we want the most complete and accurate map of the DNS possible. How can we do that without collecting data on where any one user is or where they are going? We do this by carefully selecting where and how we collect data.
Let me explain. There are four common “collection points” for capturing passive DNS. The first, and most invasive, is on the client. This is where host agent software usually does its thing, and every query is tied directly to the user who made it. The second collection point, which we refer to as “Below the recursive”, captures data between the client (stub) and the recursive resolver. Here, we know the IP address that the query came from as well as the full query and response. We also know the sequence and timing of the queries made by that system.
We encourage use of this “below the recursive” type within enterprises with a stated expectation of user privacy, as it provides lots of valuable telemetry when searching for “patient-0”. This model is not appealing when applied to general populations without strict privacy controls in place. Most of the newer encrypted DNS technologies focus on eliminating this privacy exposure below the recursive. Encrypting the data between the client and the recursive resolver does make it harder to work out who is doing what. Unfortunately, it also makes legitimate DNS-based security controls, such as a corporate DNS firewall, nearly impossible to enforce. [Note: encrypting DNS traffic to the recursive with protocols such as DoH, or using a third-party open recursive such as 8.8.8.8, will do nothing to hide other, non-DNS signals that a user is emitting.]
The third mode captures data at the authority servers and is a valuable telemetry source for root server and large-scale authority server operators. Here, we see lots of the same query/answer pairs from a diverse set of recursive servers, so privacy is well protected. However, we are only looking at the namespace covered by the instrumented authority servers in question.
The final mode, and the model we use at Farsight, is referred to as “Above the recursive”. Here, we monitor the traffic between the recursive resolver and the authority servers. It’s similar to the third approach; however, we are looking at the other end of the connection. A recursive server usually serves a large user population. We are specifically looking at the queries that the recursive has to make to the authority servers in order to satisfy queries from that population. Thanks to this architecture, we see all of the unique pattern space, giving us the entire DNS map, but we never get to see who made the query, the order of a user query stream, or how many users behind a given server asked the same question. Seeing the same query from multiple sensors, and the frequency with which a resolver refreshes a given query, still gives us a general idea of the query volume.
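To make the difference between “below” and “above” the recursive concrete, here is a minimal Python sketch of a toy caching resolver. It is an illustration only, not how any real resolver or Farsight sensor is implemented: the class name, the log fields, and the fixed TTL are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CachingResolver:
    """Toy recursive resolver with a TTL cache (hypothetical, illustration only)."""
    ttl: int = 300
    cache: dict = field(default_factory=dict)       # (name, rrtype) -> expiry time
    below_log: list = field(default_factory=list)   # view of a below-the-recursive sensor
    above_log: list = field(default_factory=list)   # view of an above-the-recursive sensor

    def resolve(self, client_ip, name, rrtype, now):
        # Below the recursive: every query is seen, with client IP and timing.
        self.below_log.append((now, client_ip, name, rrtype))
        key = (name, rrtype)
        expiry = self.cache.get(key)
        if expiry is None or now >= expiry:
            # Cache miss: the resolver queries the authority on behalf of its
            # whole user population. No client IP is part of this record.
            self.above_log.append((now, name, rrtype))
            self.cache[key] = now + self.ttl


def demo():
    r = CachingResolver(ttl=300)
    r.resolve("10.0.0.1", "www.example.com.", "A", now=0)
    r.resolve("10.0.0.2", "www.example.com.", "A", now=10)
    r.resolve("10.0.0.3", "www.example.com.", "A", now=20)
    r.resolve("10.0.0.1", "www.example.com.", "A", now=400)  # cache entry has expired
    return r
```

In this toy run, the below-the-recursive log holds four entries, each tied to a client address, while the above-the-recursive log holds only the two cache misses and contains no client information at all.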
Let’s go back to my GPS analogy. If you think of this in terms of the GPS giving instructions, the result of our approach will, at first, look like an unconnected stream of turn instructions: “turn left onto Main St. from 3rd Ave.” or “take exit 4 from I-95 North onto Pleasant St.”. By deduplicating and analyzing this stream, you can see how we can develop a detailed and accurate map without any information about the travelers who took those specific turns.
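The deduplication step can be sketched in a few lines of Python. This is a simplified illustration, not Farsight’s actual pipeline: the observation tuple layout, the `MapEntry` class, and its field names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    count: int        # how many times this "turn" was observed
    time_first: int   # earliest observation timestamp
    time_last: int    # latest observation timestamp


def build_map(observations):
    """Collapse (timestamp, rrname, rrtype, rdata) observations into one
    entry per unique (rrname, rrtype, rdata) tuple."""
    dns_map = {}
    for ts, rrname, rrtype, rdata in observations:
        key = (rrname, rrtype, rdata)
        entry = dns_map.get(key)
        if entry is None:
            dns_map[key] = MapEntry(count=1, time_first=ts, time_last=ts)
        else:
            entry.count += 1
            entry.time_first = min(entry.time_first, ts)
            entry.time_last = max(entry.time_last, ts)
    return dns_map


# Hypothetical observation stream; note that no client identity appears anywhere.
stream = [
    (100, "www.example.com.", "A", "93.184.216.34"),
    (250, "www.example.com.", "A", "93.184.216.34"),
    (175, "example.com.", "NS", "a.iana-servers.net."),
    (900, "www.example.com.", "A", "93.184.216.34"),
]
dns_map = build_map(stream)
```

Each unique name-to-answer pairing becomes one “intersection” on the map, annotated only with how often and over what period it was seen.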
The map we build into DNSDB is compiled from data collected over 10 years at rates often exceeding 250,000 (above the recursive) observations per second. No map is perfect (we can’t see data related to roads or alleys not taken, for instance), but, as we have seen in the recent investigation into the SolarWinds compromise and other attacks, it is an invaluable tool for investigators who have to piece together how and when a given attack unfolded and what collateral damage it caused. Farsight Passive DNS gives you most of the picture: you see the streets that are in use, while the privacy of the “drivers”, the users, is protected. Farsight technology was built with privacy in mind, long before GDPR was a talking point, and privacy will continue to be a mainstay of our architecture into the future.
Ben April is the Chief Technology Officer for Farsight Security, Inc.