The term latency is used a lot in networking and most commonly refers to how long it takes a packet to reach a destination and come back again. The most common tools for measuring network latency are ping and traceroute, but there are more. When I speak to operators around Asia Pacific about DNS though, it's interesting to hear that latency is not often used when benchmarking or measuring their DNS service quality.
Many operators overlook the importance of DNS latency and how directly it affects their end customers' user experience. The most common metric I come across for measuring DNS is Queries Per Second (QPS), usually combined with CPU utilization, but this is really just a measurement of how many queries a given server can process while still responding to clients at given CPU levels. When CPU spikes, for whatever reason (usually a DNS DDoS), the servers can become unresponsive, which is an outage. When the server is still considered healthy, however, what is often missing is a measurement of how long those DNS responses take to come back to a subscriber. There's a big difference between 10ms and 200ms when it comes to DNS responses.
Why is Low DNS Latency Important?
DNS latency is important to measure because it shows how responsive the DNS service appears to customers. Most customers won't know it's the DNS that is slow, but it has a direct impact on how they perceive the speed of the internet service in general. Slow DNS = slow internet. If DNS latency is high, the more technical customers may switch to other DNS services like Google's 8.8.8.8. When customers change their DNS settings to alternate DNS service providers, operators have in many ways lost that customer as a subscriber, because they've lost the ability to enforce any DNS-based policy, which is becoming an increasingly common method of enforcing network policy. This is a situation operators should avoid. Less technical subscribers may simply feel the internet service is slow and churn to another operator.
If we think about a heavily trafficked site like Facebook or YouTube, or any site that has a lot of rich, embedded content, the number of DNS queries involved in just retrieving one page can be staggering. Loading Facebook in a browser today can take more than 30 DNS queries while it pulls in embedded content from various sources (mostly CDNs, often with very low TTLs), depending on liked feeds of course. If we add unnecessary milliseconds to each query, the page can load more slowly than it should. This situation is worse on mobile networks, where propagation latency is usually higher anyway. Latency can be above the normal range even while the servers are otherwise performing within normal CPU utilization levels.
The following is a very quick check on DNS response times for www.google.com comparing a local ISP DNS and Google DNS. This would obviously be coming directly from the DNS server's cache, so no recursion should be involved. The query time below can vary from 5ms up to 50ms (local ISP) depending on what the network is doing, but most should be in the sub-20ms range.
ISP DNS (can you guess who?):
$ dig @22.214.171.124 +noall +stats www.googe.com A
;; Query time: 9 msec
;; SERVER: 126.96.36.199#53(188.8.131.52)
;; WHEN: Wed Jun 14 22:44:48 2017
;; MSG SIZE rcvd: 47
Google's DNS service:
$ dig @184.108.40.206 +noall +stats www.googe.com A
;; Query time: 142 msec
;; SERVER: 220.127.116.11#53(18.104.22.168)
;; WHEN: Wed Jun 14 22:48:02 2017
;; MSG SIZE rcvd: 47
You can see the impact that additional DNS latency could have if a site or app needed to make 20+ DNS queries per load.
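To put rough numbers on that, here is a minimal sketch in Python using the two query times measured above (9 ms and 142 ms). It assumes the worst case, where every lookup is resolved one after another; real browsers resolve many names in parallel and cache results, so treat the result as an upper bound rather than a measurement. The function name and the 20-query count are illustrative only.

```python
def added_dns_wait_ms(queries: int, fast_ms: float, slow_ms: float) -> float:
    """Extra time spent waiting on DNS if every lookup runs sequentially."""
    return queries * (slow_ms - fast_ms)


# Query times taken from the two dig runs above.
extra = added_dns_wait_ms(queries=20, fast_ms=9, slow_ms=142)
print(f"Extra DNS wait for 20 sequential lookups: {extra:.0f} ms")
```

Under that worst-case assumption the slower resolver adds 2660 ms, a little over two and a half seconds, to a single page or app load.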
What are Some Causes of High Latency DNS?
There are a number of factors that influence DNS latency. Some of the main factors to consider are:
- DNS server location - Where the DNS servers are located (in relation to subscribers) impacts latency. A distributed DNS infrastructure (servers closest to subscribers) generally provides lower DNS latency than a centralized one. This, of course, plays a more important part in large geographic regions (such as Australia), or regions where transit links may be less reliable. Additionally, using DNS servers outside your ISP's network, including Google's DNS, can increase latency.
- Wireless networks - Wireless networks will most often have higher DNS latency than fixed-line networks. LTE and newer networks have dramatically improved propagation latency over the radio interfaces (as compared to 3G), and this will continue to improve over time, but in general, most wireless networks will add to DNS latency.
- Malicious DNS traffic - The amount of malicious DNS traffic the servers are processing can impact DNS latency. PRSD attacks that trigger high recursion, botnet/C&C queries, and malware queries all consume (waste) CPU cycles on the server. Please see this blog post on the importance of DNS security (DNS Security, part 1).
- DNS server under-scaling - If the DNS infrastructure is not scaled correctly, this can result in higher overall CPU utilization, which has a negative impact on DNS latency. Generally, the higher the CPU utilization, the higher the latency. Should the CPU utilization reach 100%, DNS can, of course, become unusably slow or even unresponsive.
- CDNs with low TTLs - CDNs play a huge role in delivering content today. The internet is largely made up of content owners (Apple, Facebook) and content providers (Akamai, Facebook CDNs). The way CDNs operate relies on low-TTL DNS records. The TTL determines how long DNS servers are meant to retain the data in cache before expiring it (in case it is changed by the content owner). When the TTL expires, recursion is then required to retrieve the new data. As more and more content is moved to CDNs, there will be more DNS transactions for low-TTL records, which increases recursion. Recursion increases latency because the DNS server has to retrieve the data. This is a big topic and out of scope for this blog post, but something to consider nonetheless. Some DNS software (including CacheServe) uses various techniques to mitigate this impact, such as pre-fetching, prioritized pre-fetching, and recursion pooling. With pre-fetching, DNS data is refreshed before it actually expires from cache, meaning that recursion is mostly not subscriber-initiated.
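The pre-fetching idea in the last point can be sketched as follows. This is only a toy illustration in Python, not how CacheServe or any other resolver actually implements it: the class name, the `resolve` callback, the injectable `clock`, and the 10% pre-fetch window are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class PrefetchingCache:
    """Toy TTL cache that refreshes an entry shortly before it expires."""

    def __init__(
        self,
        resolve: Callable[[str], Tuple[str, float]],  # name -> (answer, ttl_seconds)
        prefetch_fraction: float = 0.1,               # hypothetical: last 10% of the TTL
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._prefetch_fraction = prefetch_fraction
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float, float]] = {}  # answer, stored_at, ttl

    def _store(self, name: str, now: float) -> str:
        answer, ttl = self._resolve(name)  # stands in for recursion
        self._entries[name] = (answer, now, ttl)
        return answer

    def lookup(self, name: str) -> Tuple[str, bool]:
        """Return (answer, served_from_cache)."""
        now = self._clock()
        entry = self._entries.get(name)
        if entry is None or now - entry[1] >= entry[2]:
            # Miss or expired: the subscriber has to wait for recursion.
            return self._store(name, now), False
        answer, stored_at, ttl = entry
        remaining = ttl - (now - stored_at)
        if remaining <= ttl * self._prefetch_fraction:
            # Close to expiry: refresh the entry. A real server would do this
            # in the background; it is done inline here to keep the sketch short.
            self._store(name, now)
        return answer, True
```

With a 30-second TTL, a lookup that arrives 28 seconds after the record was stored is still answered from cache and also triggers a refresh, so a lookup at 40 seconds is a cache hit instead of a subscriber-initiated recursion. Note that this only helps names that are queried often enough to land inside the pre-fetch window.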
How do You Monitor DNS Latency?
In practice, this may be easier said than done. In order to accurately monitor DNS latency, you may have to deploy probes in the network, ideally at the subscriber end of connections like ADSL, cable, etc., and periodically send DNS queries to measure the response times. You may need a small script (pick your language) to run the actual query and measure the response. Most NMS platforms will allow you to add this into graphing fairly easily. I would recommend setting the threshold at around 20ms for a desirable DNS response (and you might need several samples above that to trigger an alert). Response times consistently higher than that would start to impact user experience.
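As one possible shape for such a script, here is a minimal sketch in Python using only the standard library. It hand-builds a single A-record query, times the UDP round trip, and applies the 20ms threshold over several consecutive samples. The function names, the fixed query ID, the 2-second timeout, and the choice of three consecutive samples are assumptions for illustration; a production probe would more likely use a DNS library and randomize the query ID.

```python
import socket
import struct
import time
from typing import List, Optional


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def measure_ms(server: str, name: str, timeout: float = 2.0) -> Optional[float]:
    """Return the round-trip time in ms for one UDP query, or None on failure."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # covers timeouts and unreachable servers
            return None
        elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Only accept a reply that carries the same ID as our query.
    if reply[:2] != query[:2]:
        return None
    return elapsed_ms


def should_alert(
    samples: List[Optional[float]],
    threshold_ms: float = 20.0,
    consecutive: int = 3,
) -> bool:
    """True if the last `consecutive` samples all exceeded the threshold.

    A failed query (None) is counted as exceeding the threshold.
    """
    if len(samples) < consecutive:
        return False
    recent = samples[-consecutive:]
    return all(s is None or s > threshold_ms for s in recent)
```

A probe would call something like `measure_ms("<your resolver IP>", "www.example.com")` on a schedule, append each result to a list, and pass that list to `should_alert`. Querying a name that is almost certainly cached measures the server and network path rather than recursion, which matches the dig comparison earlier.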