
TSI Marc
Premium Member
join:2006-06-23
Chatham, ON

TSI Marc to koitsu

Re: Google DNS versus ours

I'm sure Gabe will chime in but I think it's pretty straightforward what the graph says...

85-90% of queries take 10ms or less to return a response and all requests always take less than 200ms...

your graphs show queries per second and load.. we're highlighting how quickly a query is returned, not how many it can return, which is also an important stat no doubt, but given we have 4 servers.. load is less of an issue for us.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

I don't find this graph straightforward in any way, shape, or form.

"85-90% of all queries take 10ms to get a result". Okay, that's because you look at the graph and see that the point where the graph "shoots off horizontally" starts at 85%, with the vertical axis being at 10ms, correct? That's the only way I can see how you reached that conclusion.

Except if you apply the same logic to the data shown on the right side of the graph, you could safely say that 97% of all queries took 200ms to get a result...

The following graph (X axis = duration, Y axis = nameserver IP) makes perfect sense but doesn't really provide any hard data, though as I said, that one does make sense. It's the first graph that doesn't.

TSI Marc
Premium Member
join:2006-06-23
Chatham, ON

it's a simple graph...

x axis = time in ms
y axis = % of queries..

if 100 queries were sent, order the results from shortest to longest time, then for each one put a dot at its rank on the y axis and at the time it took on the x axis, and that's the distribution you would get.

TSI Marc to koitsu

and no axis of evil

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu to TSI Marc

Marc, politely: I've had two other senior systems engineers (like myself) look at the graph. Both of them are equally as perplexed, and in the same way I am.

I'll let Gabe respond from here on out, but I'll explain more verbosely:

What you've described *makes sense* (as in conceptually what you want is doable), but what your first graph actually shows doesn't jibe with what you claim the results are -- and it's because of the type of graph being used + how the data is being graphed.

(Readers should note I ABSOLUTELY believe TekSavvy's claims that their nameservers take ~9ms on average vs. Google's 20-30ms. The reason for that is, quite honestly, network round trip time between a TekSavvy customer and Google's DNS servers, also taking into consideration authoritative nameservers on the Internet that do not work with large EDNS packets (this adds time to the response).)

I believe the data you have is confusing because you're using a line graph rather than a scatter graph or scatter plot.

Honestly what should be happening under the hood:

Loop iteration #1:

1. Issue 100 DNS queries and keep track of the response time of each query. Query types will vary (different zones, TLDs, A vs. NS vs. PTR etc.), and response times will vary (some will be cached results, some won't be -- those which aren't should be much higher in response time).

2. Get an average response time: add up all 100 query response times, divide by 100. Result: average response time of 100 queries.

3. Graph result on Y axis, with Y axis label "average response time (in ms) of 100 DNS queries". X axis should be incremental based on time, or simply an incrementing variable ($loopcount++).

Loop iteration #2: repeat step 1/2/3, except in step 3, the X axis location should be further to the right than before, and that you can draw a line from iteration plot data point #1 to iteration plot data point #2.

The resulting graph would look roughly something like this.

The first loop iteration -- assuming all the nameservers it's querying have *no cached records* -- should be very slow (high response times due to recursive, non-cached lookups). The 2nd loop iteration should be much faster (cached results), the 3rd as well, etc. etc...

The 2nd to Nth results should be "roughly" all within the same amount of time -- however, this greatly depends on the data set being measured (more specifically: what the per-record TTL is of something being resolved, or the SOA TTL associated with that record's zone).

If you were to take all the graphed averages (how many depends on how many loop iterations you let things run for -- it matters! If just one loop, then the results are worthless!) and put them in their own data set, you could then graph those using a bar graph or bar chart, where each bar would represent a response time range, e.g. 0-10ms, 11-20ms, 21-30ms, etc., and let people see what the "general average" response time is for everything. This is akin (mostly) to the 2nd graph you listed in your post (the blue horizontal bars), except with more granularity.
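
The loop and bar-chart idea described above can be sketched in a few lines of Python. This is only an illustration: the function names, the 10ms bucket width, and the sample timings are all made up, and the buckets here are half-open ranges ([0,10), [10,20), ...) rather than the 0-10/11-20 ranges mentioned above.

```python
def iteration_average(times_ms):
    """Average response time (ms) for one loop iteration's queries."""
    if not times_ms:
        raise ValueError("an iteration needs at least one response time")
    return sum(times_ms) / len(times_ms)


def bucket_counts(averages_ms, width_ms=10):
    """Count how many per-iteration averages fall into each bucket.

    Keys are the lower bound of each bucket, so with the default width
    0 means [0, 10), 10 means [10, 20), and so on.
    """
    counts = {}
    for avg in averages_ms:
        lower = int(avg // width_ms) * width_ms
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical timings: the first iteration is slow (nothing cached yet),
# later iterations are mostly answered from cache.
iterations = [
    [120, 80, 100],
    [9, 11, 10],
    [8, 9, 10],
]
averages = [iteration_average(times) for times in iterations]
buckets = bucket_counts(averages)
```

Each entry in `averages` would be one point on the line graph (one per loop iteration), and `buckets` is the data a bar chart of response time ranges would be drawn from.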

And trust me, I am quite familiar with data/metrics graphing -- I wrote all of what you see there, sans the dygraphs library, and have had to write an entire code base (all perl + dealing with the mess that is RRDTool) to graph VirtualHost bandwidth usage on Apache (using no third-party modules). Not trying to troll or give you a headache, mate!

TSI Gabe
Router of Packets
Premium Member
join:2007-01-03
Gatineau, QC

The graph is being generated by a tool called namebench; I believe it's Google themselves that released it. This isn't something I created.
TSI Gabe

I understand what you are saying though; there are more details in the namebench report that are missing here, and I didn't necessarily want to publish them for fear of releasing internal network info.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

Understood. And yeah, in my original/first reply, I linked to the namebench site -- their graphs are identical in layout (see "Response Distribution Chart"), meaning the use of a line plot model.

I had a 4th colleague of mine (better educated than myself, especially in mathematics) look at the graphs as well, and he agrees the presentation model is incorrect for the kind of data being plotted (not that the data itself is wrong!). There are better presentation/layout models (scatter, etc.) that would present the information in a way that makes more sense, but that's not your fault -- it's the fault of namebench. Although since it uses the Google Chart API, the HTTP arguments could be changed to refer to a different model.

The part that shocks me the most is that namebench was written by a pair of Google employees. I'm surprised that someone would write such a useful tool then completely botch the visual representation part. "It's open source, so go fix it, koitsu!" Yeah, and it's Python; I'd rather swallow hot coals.

Anyway, thanks for chiming in and clarifying a bit, TSI Gabe, very much appreciated!
MaynardKrebs
We did it. We heaved Steve. Yipee.
Premium Member
join:2009-06-17

MaynardKrebs to TSI Gabe

Gabe,

You might want to invest in a copy of this bible
»www.edwardtufte.com/tuft ··· oks_vdqi

TSI Marc
Premium Member
join:2006-06-23
Chatham, ON

TSI Marc to koitsu

said by koitsu:

Marc, politely: I've had two other senior systems engineers (like myself) look at the graph. Both of them are equally as perplexed, and in the same way I am.

...

Not trying to troll or give you a headache, mate!

Hey no worries didn't mean to come off like that.. I'm a mechanical engineer and built and ran our network for 10 years.. the graph makes sense to me, I just assumed it did to others too.

All good though man, appreciate the feedback, I know it's coming from a good place. I'm happy we were able to tweak a bit more performance on this front. Seems we're all excited about that

Teddy Boom
k kudos Received
Premium Member
join:2007-01-29
Toronto, ON

1 recommendation

Teddy Boom to koitsu

said by koitsu:

Except if you apply the same logic to the data shown on the right side of the graph, you could safely say that 97% of all queries took 200ms to get a result...

It is essentially a Cumulative Distribution Function:
»en.wikipedia.org/wiki/Cu ··· function

The right side of the graph says that 97% of all queries took less than 200ms to get a result.
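
As a rough illustration of how such a curve is read (a sketch with made-up response times, not namebench's actual code): sort the samples, and the height of the curve at a given time is the fraction of queries that finished within that time. The function names below are invented for the example.

```python
def cdf_points(latencies_ms):
    """Return (time_ms, cumulative_fraction) pairs, sorted by time."""
    ordered = sorted(latencies_ms)
    total = len(ordered)
    return [(t, (rank + 1) / total) for rank, t in enumerate(ordered)]


def fraction_within(latencies_ms, threshold_ms):
    """Fraction of queries answered in threshold_ms or less."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    fast_enough = sum(1 for t in latencies_ms if t <= threshold_ms)
    return fast_enough / len(latencies_ms)


# Made-up response times in milliseconds: mostly cached answers,
# plus one slow recursive lookup.
samples = [4, 6, 7, 8, 8, 9, 9, 10, 10, 150]
points = cdf_points(samples)
share_10ms = fraction_within(samples, 10)
share_200ms = fraction_within(samples, 200)
```

Plotting `points` as a line gives the same kind of curve as the graph being discussed: it rises steeply while most queries complete, then flattens out into a long tail for the slow ones.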
mlord
join:2006-11-05
Kanata, ON

mlord to TSI Marc

Member

I've been re-testing TSI DNS since this thread began, and thus far it hasn't failed on any sites for us (a record for TSI DNS here), and seems plenty quick enough now.

So TSI is now number one on the "Forwarders" list for our local DNS.
Good stuff, guys!

TSI Gabe
Router of Packets
Premium Member
join:2007-01-03
Gatineau, QC

I can take a look but it would be really useful if you guys could provide me with a hostname to test against.
wally_walrus
join:2009-10-07
Orleans, ON

Member

callcentric.com

or

srv.callcentric.com
wally_walrus

Member

Also could you please provide us with a method to test for this in the future? I'd really like to have all devices behind my router use the default servers (hopefully Teksavvy), instead of configuring different DNS servers on each and every one

neko
All Hail Canada
Premium Member
join:2006-08-11
Canada

neko to TSI Gabe

callcentric.com works & resolves through Teksavvy DNS, but that isn't the recommended solution from CallCentric.

They recommend using: srv.callcentric.com

That does not get resolved & causes my device to fail in registering with CallCentric. I had to use different DNS in the configuration of my device to have it correctly resolve the srv.callcentric.com

As Wally_Walrus said, I'd prefer to have it resolve using Teksavvy DNS rather than having to hardcode an alternate into my device.

For reference:

Callcentric Problems Using ISP Assigned DNS

Callcentric DDOS Mega Thread
mlord
join:2006-11-05
Kanata, ON

Member

MMmm.. the problem appears to be not with Teksavvy, but rather with callcentric's DNS provider (telengy.net):

$ whois callcentric.com
...
Domain servers in listed order:
NS1.TELENGY.NET 66.193.176.41
NS2.TELENGY.NET 204.11.192.20
NS3.TELENGY.NET 204.11.192.68
...

$ nslookup
> server ns1.telengy.net
Default server: ns1.telengy.net
Address: 66.193.176.41#53

> callcentric.com
Server: ns1.telengy.net
Address: 66.193.176.41#53

Name: callcentric.com
Address: 204.11.192.22
Name: callcentric.com
Address: 204.11.192.23
Name: callcentric.com
Address: 204.11.192.31
Name: callcentric.com
Address: 204.11.192.34
Name: callcentric.com
Address: 204.11.192.35
Name: callcentric.com
Address: 204.11.192.36
Name: callcentric.com
Address: 204.11.192.37
Name: callcentric.com
Address: 204.11.192.38
Name: callcentric.com
Address: 204.11.192.39
Name: callcentric.com
Address: 204.11.192.135
Name: callcentric.com
Address: 204.11.192.159
Name: callcentric.com
Address: 204.11.192.160

> srv.callcentric.com
Server: ns1.telengy.net
Address: 66.193.176.41#53

Non-authoritative answer:
*** Can't find srv.callcentric.com: No answer

> server ns2.telengy.net
Default server: ns2.telengy.net
Address: 204.11.192.20#53

> srv.callcentric.com
Server: ns2.telengy.net
Address: 204.11.192.20#53

*** Can't find srv.callcentric.com: No answer

neko
All Hail Canada
Premium Member
join:2006-08-11
Canada

Here is a guy explaining what's happening:

»DNS SRV - Callcentric

Hopefully you'll understand what he's on about, as I do not.

All I know for sure is Tekk's DNS doesn't allow my device to register & make calls; using an alternate DNS provider does work.

I'm sorry I can't be more helpful, as I have no clue about all this stuff.
mlord
join:2006-11-05
Kanata, ON

Member

The discussion at that link is NOT looking for srv.callcentric.com at all.
Instead, they are using _sip._udp.callcentric.com. for the hostname.

Perhaps that's part of your problem?
The (successful) query below is using Teksavvy DNS:

[~] dig @206.248.154.22 _sip._udp.callcentric.com SRV
;; Truncated, retrying in TCP mode.

; <<>> DiG 9.7.0-P1 <<>> @206.248.154.22 _sip._udp.callcentric.com SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54216
;; flags: qr rd ra; QUERY: 1, ANSWER: 24, AUTHORITY: 3, ADDITIONAL: 12

;; QUESTION SECTION:
;_sip._udp.callcentric.com. IN SRV

;; ANSWER SECTION:
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha1.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha2.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha3.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha4.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha5.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha6.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha7.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha8.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha9.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha10.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha11.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha12.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha1.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha2.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha3.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha4.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha5.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha6.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha7.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha8.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha9.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha10.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha11.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha12.callcentric.com.

;; AUTHORITY SECTION:
callcentric.com. 58 IN NS ns2.telengy.net.
callcentric.com. 58 IN NS ns3.telengy.net.
callcentric.com. 58 IN NS ns1.telengy.net.

;; ADDITIONAL SECTION:
alpha1.callcentric.com. 51 IN A 204.11.192.22
alpha2.callcentric.com. 58 IN A 204.11.192.23
alpha3.callcentric.com. 58 IN A 204.11.192.31
alpha4.callcentric.com. 51 IN A 204.11.192.34
alpha5.callcentric.com. 51 IN A 204.11.192.35
alpha6.callcentric.com. 51 IN A 204.11.192.36
alpha7.callcentric.com. 51 IN A 204.11.192.37
alpha8.callcentric.com. 51 IN A 204.11.192.38
alpha9.callcentric.com. 51 IN A 204.11.192.39
alpha10.callcentric.com. 59 IN A 204.11.192.135
alpha11.callcentric.com. 51 IN A 204.11.192.159
alpha12.callcentric.com. 51 IN A 204.11.192.160

;; Query time: 16 msec
;; SERVER: 206.248.154.22#53(206.248.154.22)
;; WHEN: Sun Oct 21 15:37:16 2012
;; MSG SIZE rcvd: 1401
mlord

Member

Ah, but even with the correct hostname (_sip._udp.callcentric.com), there's still a problem with Teksavvy DNS: they don't return ANY results when "truncating" the response. This prevents voip devices (ATAs) from working at all unless the device supports TCP DNS lookups (not all do).

[~] dig @206.248.154.22 _sip._udp.callcentric.com SRV +notcp +noignore

; <<>> DiG 9.7.0-P1 <<>> @206.248.154.22 _sip._udp.callcentric.com SRV +notcp +noignore
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32860
;; flags: qr tc rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;_sip._udp.callcentric.com. IN SRV

;; Query time: 30 msec
;; SERVER: 206.248.154.22#53(206.248.154.22)
;; WHEN: Sun Oct 21 15:43:19 2012
;; MSG SIZE rcvd: 43
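
To make the flags line above concrete, here is a small sketch (Python, standard library only) that decodes the 12-byte DNS header and checks the TC bit. The header bytes are rebuilt by hand from the dig output rather than captured from the wire, and the helper names are made up for illustration.

```python
import struct

TC_BIT = 0x0200  # "truncated" flag in the DNS header flags word (RFC 1035)


def parse_dns_header(message):
    """Decode the fixed 12-byte header at the start of a DNS message."""
    if len(message) < 12:
        raise ValueError("a DNS message starts with a 12-byte header")
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!6H", message[:12]
    )
    return {
        "id": ident,
        "truncated": bool(flags & TC_BIT),
        "rcode": flags & 0x000F,
        "qdcount": qdcount,
        "ancount": ancount,
        "nscount": nscount,
        "arcount": arcount,
    }


def udp_answer_is_unusable(header):
    """True when the reply is truncated and carries no answer records."""
    return header["truncated"] and header["ancount"] == 0


# Header rebuilt from the dig output above: id 32860, flags "qr tc rd ra"
# (0x8380), QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0.
sample = struct.pack("!6H", 32860, 0x8380, 1, 0, 0, 0)
header = parse_dns_header(sample)
```

A client that cannot retry over TCP has nothing to work with when `udp_answer_is_unusable(header)` is true, which is the situation described for the ATAs here.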

neko
All Hail Canada
Premium Member
join:2006-08-11
Canada

said by mlord:

Ah, but even with the correct hostname (_sip._udp.callcentric.com), there's still a problem with Teksavvy DNS: they don't return ANY results when "truncating" the response. This prevents voip devices (ATAs) from working at all unless the device supports TCP DNS lookups (not all do).

So it is their problem, & not ours?

That's all we wanted Gabe to know. That Tek's DNS doesn't work for our VOIP device settings.

Hopefully a fix can be made.
mlord
join:2006-11-05
Kanata, ON

Member

Well, the real problem is callcentric.com putting "too much" data into their DNS entry. Teksavvy *could* help out by having their DNS return partial (truncated) results to the UDP query, but they're not required to by any internet standards.

TSI Gabe
Router of Packets
Premium Member
join:2007-01-03
Gatineau, QC

honestly I kinda hate doing this but I agree with mlord... the problem is that the record is too large therefore requires the use of TCP....knowing that most ATAs don't support this they are kind of shooting themselves in the foot.

What's even more questionable is that they expect the DNS entry to be truncated...so why not enter fewer SRV records in to begin with to allow UDP to work?

I'm not saying I don't want to fix this...fixing it though on the other end would be IMO a big hack and not even sure that this would really be RFC compliant....

luckily though I'm at NANOG right now and am surrounded by geeks that deal with this on a daily basis...I'll ask around when I get the chance.
jabley
join:2012-10-21
London, ON

Member

There are several things going on, here.

The resource record set (RRSet) that the ATA is looking for is unusually large. Large responses in the DNS are generally accommodated by either negotiating a large UDP response buffer with EDNS0 (see RFC 2671) or by setting the truncate bit (TC) to 1 in a response and forcing a second DNS request using TCP.

Both these approaches have problems. Large UDP buffer sizes result in fragmentation, and fragmentation can be problematic. Many firewalls and other middleware make bad assumptions about 53/tcp, and hence TCP requests don't always work.

So, solution 1: if I were Callcentric, I would reduce my response to that particular query to ensure that it fits in a 512 byte DNS response message without truncation. If they chose their server names more carefully they could still pack a good number of resource records in the ANSWER section by taking better advantage of label compression.
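
For a rough sense of the numbers, here is a back-of-the-envelope size estimate in Python. It is a sketch under stated assumptions, not a reimplementation of any server: it counts only the header, question and ANSWER section, assumes each record's owner name is a 2-byte compression pointer back to the question, and assumes each SRV target is written out in full. All function names are made up.

```python
HEADER_LEN = 12          # fixed DNS header
CLASSIC_UDP_LIMIT = 512  # message size limit without EDNS0


def wire_name_len(name):
    """Length of an uncompressed domain name on the wire."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    # One length byte per label, plus the terminating root byte.
    return sum(len(label) + 1 for label in labels) + 1


def question_len(qname):
    """QNAME plus 2 bytes QTYPE and 2 bytes QCLASS."""
    return wire_name_len(qname) + 4


def srv_rr_len(target):
    """One SRV record: pointer(2) + type/class/ttl/rdlength(10)
    + priority/weight/port(6) + target name (assumed uncompressed)."""
    return 2 + 10 + 6 + wire_name_len(target)


def response_len(qname, targets):
    """Header + question + ANSWER section only."""
    return HEADER_LEN + question_len(qname) + sum(srv_rr_len(t) for t in targets)


QNAME = "_sip._udp.callcentric.com"
hosts = [f"alpha{i}.callcentric.com" for i in range(1, 13)]
targets = hosts + hosts  # each host appears for two ports in the dig output
answer_only_size = response_len(QNAME, targets)
budget = CLASSIC_UDP_LIMIT - HEADER_LEN - question_len(QNAME)
records_that_fit = budget // srv_rr_len("alpha1.callcentric.com")
```

Under these assumptions the header plus question come to 43 bytes, which matches the "MSG SIZE rcvd: 43" of the empty truncated reply shown earlier in the thread, and the 24-record answer on its own is roughly twice the 512 byte limit. The estimate will not match the 1401 bytes dig reported for the full reply, since that also includes the AUTHORITY and ADDITIONAL sections.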

The ATA described here appears not to support EDNS0 or TCP, so it has no capability of receiving large (complete) DNS responses. Not supporting TCP means not following the specification. The ATA is definitively broken, here. It violates RFC 1035. (I realise it's not unique in that. There are lots of bad DNS implementations in the world.)

Solution 2: fix the ATA. It's broken. The fact that it has ever worked is a happy accident.

BIND9's behaviour when a response is too large for UDP is to set TC=1 in the response header, and to populate the answer section with as much as will fit. This response is intended to be interpreted as "this is not an accurate response, but here is a partial answer and you should use TCP to get the rest of it".

Unbound's behaviour is not to return partial responses. It says "I can't give you a complete response, and I'm not going to risk giving you a partial answer because that might be bad, so you need to use TCP".

Needless to say, this level of detail (how to populate the ANSWER section in a truncated response) is not really specified in RFC 1035, which is old. Technically, I think it's fair to say that both unbound and BIND9 are following the specification, as far as it goes.

BIND9's behaviour here is more forgiving of the broken DNS code in the ATA. I don't see an option in unbound to emulate the BIND9 approach to this.

Solution 3: choose different nameservers that behave as BIND9 does.

Unbound is good, polished software in my opinion. It has performance advantages and is far harder to fool with cache poisoning attacks than BIND9. It's hard to argue that the correct solution here is to replace unbound with BIND9; in effect, that would be throwing out the benefits of unbound for all users simply to accommodate one buggy ATA that is used by a tiny minority.
mlord
join:2006-11-05
Kanata, ON

mlord to TSI Gabe

Member

said by TSI Gabe:

luckily though I'm at NANOG right now and am surrounded by geeks that deal with this on a daily basis...I'll ask around when I get the chance.

Sounds like the Right Crowd to find a solution with, but I'm with you on this one -- risky to modify the DNS behaviour to accommodate a clueless voip provider, especially as there's a definite risk of breaking other stuff.

Don't forget to go out for some Real Beef BBQ down that way, and let us all know if you manage to reverse engineer the famous fountains down town (if you grok the pattern, you can stride up the middle without getting wet!).

Cheers!
OTIS3
join:2011-09-29

OTIS3 to jabley

Member

said by jabley:

simply to accommodate one buggy ATA that is used by a tiny minority.

I don't think anyone has specifically mentioned any ATA models yet in this thread. For me personally, I'm using a Linksys PAP2T which is probably the most widely deployed ATA for home users. I'm not saying it is good or that Linksys/Cisco isn't known for having buggy devices. They are also not likely to fix it in a new firmware at this point.
wally_walrus
join:2009-10-07
Orleans, ON

Member

+1. Even though some / most models are "buggy" they are widely used, so efforts should be made to support them. I am using an SPA-3102
mlord
join:2006-11-05
Kanata, ON

Member

said by wally_walrus:

+1. Even though some / most models are "buggy" they are widely used, so efforts should be made to support them.

+100

The PAP2T are very likely the most common "non locked" ATA devices out there, with their cousins the SPA-3102 also fairly prevalent.

So it's not feasible to simply ignore them. Callcentric.com needs to do better.

Meanwhile, anyone affected by this can just use a different DNS service, or run a copy of bind9 locally to relay from Teksavvy DNS without the issue of Teksavvy DNS.

I just checked here, and my local bind9 service does return partial results just fine, but the stripped down DNS in my router does not. I imagine that folks running OpenWRT on their routers would have the option of adding bind9 service onto those, which would take care of it as well.

Cheers
mlord

1 edit

Member

said by mlord:

Meanwhile, anyone affected by this can just use a different DNS service, or run a copy of bind9 locally to relay from Teksavvy DNS without the issue of Teksavvy DNS.

Or maybe TSI Gabe could channel the spirit of Teksavvy Past, and run bind9 on one server internally (doesn't need to be accessible outside of Teksavvy), and have it act as the authority for Callcentric.com for use by Teksavvy's public DNS servers. Hacky, and there are probably other similar/better workarounds that TSI Gabe could dream up.

TSI Gabe
Router of Packets
Premium Member
join:2007-01-03
Gatineau, QC

Well jabley is with me at NANOG and his reply here is what came out of the conversation we had. The reality here is that Callcentric is knowingly serving a large RRSet that doesn't fit in a 512 byte buffer, and they also know that this results in a half-broken DNS reply when using a few well-known ATAs. I've also talked to a few more people, some of whom work for TLDs, and to be honest the opinion that I've heard loud and clear so far is: what the heck is Callcentric doing?

By far the easiest way to fix this would be for Callcentric to shorten the RRSet reply by using the various methods that jabley highlighted above.

While I'm not against "fixing" this in the spirit of being nice, this issue is specific to using certain ATAs on the Callcentric service.