Farsight's Real-time DNSDB, Part One
Background
I remember back in 2008 when Paul Vixie introduced PassiveDNS replication to me: a real-time stream of names and answers scrolling by in a terminal window. Every now and then I could pause the terminal and see what obviously looked like a “pharma” domain or some kind of phishing. It was exciting and magical, but it wasn’t quite useful yet. Until an indexed database was available, it wasn’t easy to take the IP addresses a criminal used for one campaign and use them to identify other campaigns. We also wanted to ask questions like “What other hostnames are being used inside this domain?” Until we were able to build an index based on the live data, all we could do was “grep”.
We’ve had friends try to use standard SQL databases, and they had difficulty keeping up with inserts from the flood of incoming data while still being able to perform queries. That inspired us to look to time-delimited NoSQL solutions. We’ve gone through several iterations of NoSQL database design on the back end, ranging from:
- a simple DB4 file (disk was too slow), to
- hybrid CDB sorted indices (not scalable long-term), to
- Cassandra clusters (reliability/speed issues), to
- TokyoCabinet on PCIe SSD (generally good performance), to
- developing a generic MTBL and specific DNStable implementation of sorted string tables.
Each iteration took advantage of technology available at the time to make lookups as efficient as possible. The last two were developed to take advantage of SSD to write data once and let clients read as many times as needed without any spinning media bottlenecks. We generated hourly databases from live data that were merge-sorted into daily databases and then monthly databases and yearly databases. A “fileset” gives an access client the list of databases to open in parallel to query for their answer. The process scales well on RAID arrays of generic 2.5″ SSD drives, and we can replicate linearly if needed.
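To make the merge-sort step concrete, here is a minimal sketch in Python. It is not the MTBL or DNStable implementation; the row layout `(key, count, time_first, time_last)`, the function name `merge_tables`, and the sample rows are all assumptions made for illustration. It shows how several already-sorted tables can be combined in one pass, with rows that share a key folded into a single row.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_tables(*tables):
    """Merge sorted tables of (key, count, time_first, time_last) rows.

    Rows that share a key are combined: counts are added, time_first
    keeps the earliest value and time_last keeps the latest.
    """
    # heapq.merge streams the inputs in key order without loading
    # everything into memory, which is the point of sorted tables.
    merged = heapq.merge(*tables, key=itemgetter(0))
    for key, rows in groupby(merged, key=itemgetter(0)):
        rows = list(rows)
        yield (
            key,
            sum(r[1] for r in rows),
            min(r[2] for r in rows),
            max(r[3] for r in rows),
        )


# Two hypothetical hourly tables, each already sorted by key.
hour_1 = [("a.example.", 2, 100, 150), ("c.example.", 1, 120, 120)]
hour_2 = [("a.example.", 3, 3700, 3900), ("b.example.", 1, 3800, 3800)]

daily = list(merge_tables(hour_1, hour_2))
for row in daily:
    print(row)
```

Because every input is sorted, the merged output is sorted too, so the same routine could in principle be reused to roll daily tables into monthly ones.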
If researchers wanted or needed to perform lookups for data in the last few minutes, we usually directed them toward the raw Passive DNS real-time data feed that’s available on the Security Information Exchange (SIE). They could develop their own methods to utilize the real-time data, but they found it complex and time-consuming to both use a database and write their own processing scripts for the real-time data.
As announced last week, we’ve upgraded our DNSDB Export service to operate in real time. Users no longer have to wait for the next hourly update to get more information, because we now make updates available for DNSDB Export every minute. We developed a TLS-based download manager to speed up transfers and manage consistency on the client side based on what local files are available.
How do I use it?
First, understand that with PassiveDNS replication, we gather matched questions and answers between recursive nameservers from all over the Internet. We have a waterfall computing model where raw uploads from sensors are deduplicated by their query and answer, deduplicated again based on where in the DNS hierarchy the answer arrived from, and then filtered to remove superfluous data. The process is documented in Passive DNS Architecture. The live data includes information that looks like the following:
type: EXPIRATION
count: 1
time_first: 2015-10-22 03:20:19
time_last: 2015-10-22 03:20:19
bailiwick: tumblr.com.
rrname: ziegast.tumblr.com.
rrclass: IN (1)
rrtype: A (1)
rrttl: 30
rdata: 66.6.41.21
rdata: 66.6.42.21
rdata: 66.6.43.21

type: INSERTION
count: 10
time_first: 2015-10-22 07:16:26
time_last: 2015-10-22 18:56:16
response_ip: 194.85.252.62
bailiwick: ru.
rrname: 1f.ru.
rrclass: IN (1)
rrtype: NS (2)
rrttl: 345600
rdata: ns3.nic.ru.
rdata: ns4.nic.ru.
rdata: ns8.nic.ru.
The entries are inserted as tuples into the database DNStable. The indices are built to enable queries based mostly on rrname and rdata. We can make simple direct queries like:
- “What is the history of NS records for 1f.ru?”
- “What other names are hosted at 66.6.41.21?”
- “What other domains are hosted by ns3.nic.ru?”
Rdata types like IP addresses have their indices optimized for CIDR lookups, and rrname or rdata names have their indices optimized for wildcard searches. As such, we can quickly provide answers to:
- “What other names are in the *.tumblr.com domain?”
- “What other names point their addresses into 66.6.32.0/20?”
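To show what these two kinds of question mean, the sketch below checks individual names and addresses against a wildcard and a CIDR block using only the Python standard library. It illustrates the matching semantics only, not how the DNStable indices work internally. The helper names `in_cidr` and `matches_wildcard` are hypothetical, and excluding the bare domain from a `*.` match is a choice made for this sketch.

```python
import ipaddress


def in_cidr(address, network):
    """True if the IP address falls inside the CIDR network."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(network)


def matches_wildcard(name, pattern):
    """True if name falls under a left-hand wildcard such as *.tumblr.com.

    Trailing dots and letter case are ignored. In this sketch the bare
    domain itself (tumblr.com) does not match *.tumblr.com.
    """
    name = name.rstrip(".").lower()
    pattern = pattern.rstrip(".").lower()
    if not pattern.startswith("*."):
        return name == pattern
    # Keep the leading dot so "nottumblr.com" does not match.
    return name.endswith(pattern[1:])


print(in_cidr("66.6.41.21", "66.6.32.0/20"))
print(matches_wildcard("ziegast.tumblr.com.", "*.tumblr.com"))
```

Both calls print `True`: 66.6.32.0/20 covers 66.6.32.0 through 66.6.47.255, and ziegast.tumblr.com. ends in .tumblr.com.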
The databases are created each and every minute. For example, all of the new data from Oct 22, 2015 at 18:51 UTC get stored in a file named dns.20151022.1851.m.mtbl. We merge-sort databases into combined databases at 10-minute, 1-hour, 1-day, 1-month and 1-year intervals. The collection of files forms a set that the dnstable library can open and access in parallel to gather answers.
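As a small illustration of the per-minute naming, the following Python sketch converts between a UTC timestamp and a minute-file name. The pattern is inferred from the single example name above, so treat it as an assumption; the names used for the 10-minute, hourly and longer files are not covered here. `MINUTE_PATTERN`, `minute_filename` and `minute_from_filename` are hypothetical names.

```python
from datetime import datetime, timezone

# Pattern inferred from the example name dns.20151022.1851.m.mtbl.
MINUTE_PATTERN = "dns.%Y%m%d.%H%M.m.mtbl"


def minute_filename(when):
    """Return the minute-file name for a timezone-aware datetime."""
    return when.astimezone(timezone.utc).strftime(MINUTE_PATTERN)


def minute_from_filename(fname):
    """Return the UTC minute that a minute-file name refers to."""
    parsed = datetime.strptime(fname, MINUTE_PATTERN)
    return parsed.replace(tzinfo=timezone.utc)


stamp = datetime(2015, 10, 22, 18, 51, tzinfo=timezone.utc)
print(minute_filename(stamp))
print(minute_from_filename("dns.20151022.1851.m.mtbl"))
```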
A command line lookup tool, dnstable_lookup, can use the DNSTABLE_FNAME environment variable to look up answers in one database file, or the DNSTABLE_SETFILE environment variable to look up answers across a list of files named in a setfile.
Another command line tool, dnstable_dump, can take the binary format stored in the databases and convert it to rows of JSON. We’ll provide examples of both commands below.
Brand name / counterfeit example
Back in April I wrote a blog post about how to look up counterfeit names using SIE access and enhancing it with DNSDB lookups. This time, we’ll just use our DNSDB Export files.
Consider the Burberry line of clothing and accessories. As a popular luxury brand, it is often targeted by counterfeiters. Counterfeiters often make use of freshly created domain names, since they tend to have their wares taken down from established online sales platforms (Amazon, eBay, etc.) and are unable to establish long-lived domain names because rights holders can easily take down domain names with tools like the U.S.’s DMCA and ICANN’s UDRP. The examples below show freshly created domain names that would appear at first glance to fit into this pattern.
Let’s look at the latest minute…
$ dnstable_dump -r dns.20151022.1941.m.mtbl | grep burberry | grep -v ';'
burberrybags808.tumblr.com. IN A 66.6.41.21
burberrybags808.tumblr.com. IN A 66.6.43.21
burberryoutletstores.xyz. IN NS f1g1ns1.dnspod.net.
burberryoutletstores.xyz. IN NS f1g1ns2.dnspod.net.
burberryoutletstores.xyz. IN NS f1g1ns1.dnspod.net.
burberryoutletstores.xyz. IN NS f1g1ns2.dnspod.net.
burberryoutletstores.xyz. IN SOA f1g1ns1.dnspod.net. freednsadmin.dnspod.com. 1444295154 3600 180 1209600 180
www.burberryoutletstores.xyz. IN CNAME burberryoutletstores.xyz.
Looking up over the last year, we can find other merchandise hosted there. Using a larger set of DNSDB history, here’s another lookup:
$ ls dns.2015* > dns.fileset
$ export DNSTABLE_SETFILE=dns.fileset
$ dnstable_lookup rrset burberryoutletstores.xyz A
;; bailiwick: burberryoutletstores.xyz.
;; count: 5
;; first seen: 2015-10-04 03:40:30 -0000
;; last seen: 2015-10-07 17:14:44 -0000
burberryoutletstores.xyz. IN A 142.54.172.171

;; bailiwick: burberryoutletstores.xyz.
;; count: 18
;; first seen: 2015-10-10 21:15:43 -0000
;; last seen: 2015-10-21 19:46:38 -0000
burberryoutletstores.xyz. IN A 151.237.189.86

;;; Dumped 2 entries.
Looking up prior addresses finds other trademark names being hosted on the same servers now:
$ dnstable_lookup rdata ip 142.54.172.171
louisvuittonoutletonline.pw. IN A 142.54.172.171
raybaneyeglasses.us.com. IN A 142.54.172.171
3gp-ds.ytconv.net. IN A 142.54.172.171
coachoutletonline.top. IN A 142.54.172.171
michaelkorshandbags.xyz. IN A 142.54.172.171
burberryoutletstores.xyz. IN A 142.54.172.171
;;; Dumped 6 entries.

$ dnstable_lookup rdata ip 151.237.189.86
raybaneyeglasses.us.com. IN A 151.237.189.86
abercrombieandfitchoutletsonline.com. IN A 151.237.189.86
furlaoutletsonline.in.net. IN A 151.237.189.86
burberryoutlet.top. IN A 151.237.189.86
discountnfljerseys.top. IN A 151.237.189.86
burberryoutletonline.top. IN A 151.237.189.86
burberryoutletstores.top. IN A 151.237.189.86
guccioutletonline.xyz. IN A 151.237.189.86
burberryoutletonline.xyz. IN A 151.237.189.86
burberryoutletstores.xyz. IN A 151.237.189.86
;;; Dumped 10 entries.
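If you want to post-process text output like this in a script, a minimal Python sketch follows. It assumes only what is visible in the output above: comment lines start with `;`, and each record line reads `name class type rdata`. The function names `parse_records` and `names_in` are hypothetical, and the sample text is copied from the first lookup.

```python
def parse_records(text):
    """Yield (rrname, rrclass, rrtype, rdata) tuples from lookup output.

    Blank lines and lines starting with ';' (comments) are skipped.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        # Split at most 3 times so multi-field rdata (e.g. SOA) stays whole.
        rrname, rrclass, rrtype, rdata = line.split(None, 3)
        yield rrname, rrclass, rrtype, rdata


def names_in(text):
    """Return the sorted, de-duplicated owner names found in the output."""
    return sorted({record[0] for record in parse_records(text)})


sample = """\
louisvuittonoutletonline.pw. IN A 142.54.172.171
raybaneyeglasses.us.com. IN A 142.54.172.171
3gp-ds.ytconv.net. IN A 142.54.172.171
coachoutletonline.top. IN A 142.54.172.171
michaelkorshandbags.xyz. IN A 142.54.172.171
burberryoutletstores.xyz. IN A 142.54.172.171
;;; Dumped 6 entries.
"""

for name in names_in(sample):
    print(name)
```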
We don’t have to use command line tools to look at the data. There are C and Python bindings for easily doing lookups against DNStable files.
Consider the following script, lookup_ip.py:
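#!/usr/bin/python
import sys
import dnstable

# Open the fileset that lists the DNStable databases to search.
d = dnstable.reader('dns.fileset')
# Build an rdata query for the IP address given on the command line.
q = dnstable.query(dnstable.RDATA_IP, sys.argv[1])
# Print each matching entry as a row of JSON.
for res in d.query(q):
    print(res.to_json())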
Running it gives us JSON tuples for all of the names that reference that IP address:
$ ./lookup_ip.py 151.237.189.86
{"rrtype": "A", "time_last": 1445413846, "time_first": 1444238290, "count": 15, "rrname": "raybaneyeglasses.us.com.", "rdata": "151.237.189.86"}
{"rrtype": "A", "time_last": 1445478986, "time_first": 1443590863, "count": 56, "rrname": "abercrombieandfitchoutletsonline.com.", "rdata": "151.237.189.86"}
{"rrtype": "A", "time_last": 1445061327, "time_first": 1444404624, "count": 13, "rrname": "furlaoutletsonline.in.net.", "rdata": "151.237.189.86"}
{"rrtype": "A", "time_last": 1444735169, "time_first": 1444302797, "count": 42, "rrname": "burberryoutlet.top.", "rdata": "151.237.189.86"}
{"rrtype": "A", "time_last": 1444948864, "time_first": 1444948864, "count": 2, "rrname": "discountnfljerseys.top.", "rdata": "151.237.189.86"}
{"rrtype": "A", "time_last": 1445428527, "time_first": 1444273015, "count": 7, "rrname": "burberryoutletonline.top.", "rdata": "151.237.189.86"}
{"rrtype": "A", "time_last": 1444132851, "time_first": 1444097643, "count": 6, "rrname": "burberryoutletstores.top.", "rdata": "151.237.189.86"}
{"rrtype": "A", "time_last": 1445492367, "time_first": 1444473075, "count": 9, "rrname": "guccioutletonline.xyz.", "rdata": "151.237.189.86"}
{"rrtype": "A", "time_last": 1445424613, "time_first": 1444273016, "count": 22, "rrname": "burberryoutletonline.xyz.", "rdata": "151.237.189.86"}
{"rrtype": "A", "time_last": 1445456798, "time_first": 1444511743, "count": 18, "rrname": "burberryoutletstores.xyz.", "rdata": "151.237.189.86"}
rrname is the name that was queried, and rrtype is the DNS type (“A”, “NS”, “MX”, etc.) found in the answer.
The tuple of time_first, time_last and count shows how many times the name was seen within a given period. The time values are Unix epoch seconds (the number of seconds since midnight Jan 1 1970 UTC). A count of “0” means it was seen once in an INSERTION record. Actual counts are made on EXPIRATION records.
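To read the epoch values more comfortably, the short Python sketch below decodes one JSON row from the lookup_ip.py output and prints its time range in UTC. The function name `describe` is hypothetical; the row is the burberryoutletstores.xyz entry shown earlier.

```python
import json
from datetime import datetime, timezone


def describe(row_json):
    """Summarize one JSON row with its epoch times rendered as UTC."""
    row = json.loads(row_json)
    fmt = "%Y-%m-%d %H:%M:%S"
    # Pass tz explicitly so the result does not depend on the local zone.
    first = datetime.fromtimestamp(row["time_first"], tz=timezone.utc)
    last = datetime.fromtimestamp(row["time_last"], tz=timezone.utc)
    return "{} seen {} times between {} and {} UTC".format(
        row["rrname"], row["count"], first.strftime(fmt), last.strftime(fmt)
    )


row = (
    '{"rrtype": "A", "time_last": 1445456798, "time_first": 1444511743, '
    '"count": 18, "rrname": "burberryoutletstores.xyz.", '
    '"rdata": "151.237.189.86"}'
)
print(describe(row))
# burberryoutletstores.xyz. seen 18 times between 2015-10-10 21:15:43 and 2015-10-21 19:46:38 UTC
```

The decoded times match the “first seen” and “last seen” values that dnstable_lookup printed for the same address.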
The bailiwick is the place in the DNS hierarchy from which we received an answer. Sometimes a registry nameserver and the domain’s authoritative nameserver can be out of sync. If they are out of sync, they will list different bailiwick and rdata for the same rrname and rrtype.
The rdata is an array of answers returned for the given rrname/rrtype/bailiwick during the timeframe. In DNS, the order of answers doesn’t matter, so it may make sense to sort the answers before importing them into a database.
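One way to do that sorting, sketched in Python under the assumption that each record has already been decoded into a dict with rrname, rrtype, bailiwick and rdata fields: build a canonical key in which the answers are sorted, so that two observations of the same record set compare equal regardless of answer order. The function name `canonical_key` is hypothetical.

```python
def canonical_key(record):
    """Return a hashable key that ignores the order of rdata answers."""
    rdata = record["rdata"]
    # Some JSON rows carry a single answer as a plain string.
    if isinstance(rdata, str):
        rdata = [rdata]
    return (
        record["rrname"].lower(),
        record["rrtype"],
        record.get("bailiwick", "").lower(),
        tuple(sorted(rdata)),
    )


seen_first = {
    "rrname": "ziegast.tumblr.com.",
    "rrtype": "A",
    "bailiwick": "tumblr.com.",
    "rdata": ["66.6.41.21", "66.6.42.21", "66.6.43.21"],
}
# Same answers, returned in a different order.
seen_later = dict(seen_first, rdata=["66.6.43.21", "66.6.41.21", "66.6.42.21"])

print(canonical_key(seen_first) == canonical_key(seen_later))
```

This prints `True`. A key like this could serve as a unique constraint when loading rows into a database.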
Conclusion
In the next article, I will provide more use-case examples.
Eric Ziegast is a Senior Distributed Systems Engineer for Farsight Security, Inc.
Read the next part in this series: Farsight’s Real-time DNSDB, Part Two