 TSI MarcPremium,VIP join:2006-06-23 Chatham, ON kudos:16 | reply to koitsu
Re: Google DNS versus ours it's a simple graph...
x axis = time in ms, y axis = % of queries.
If 100 queries were sent, order the results from shortest to longest response time, then for each one put a dot at the time it took (x axis) and the percentage of queries answered within that time (y axis). That's the distribution you would get. -- Marc - CEO/TekSavvy |
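The distribution described in the post above can be sketched in a few lines of Python. This is only an illustration with made-up response times, not namebench's actual code; `response_distribution` is a hypothetical helper name.

```python
def response_distribution(times_ms):
    """Return (time_ms, percent_of_queries) points for a cumulative plot.

    Sorting the response times and pairing each one with the share of
    queries answered within that time gives the curve described above.
    """
    ordered = sorted(times_ms)
    total = len(ordered)
    return [(t, 100.0 * (i + 1) / total) for i, t in enumerate(ordered)]


# Made-up sample timings in milliseconds (an assumption, not measured data).
sample = [9.0, 35.0, 8.0, 120.0, 10.0]
points = response_distribution(sample)
```

Each point is one query, so plotting the points as dots (a scatter plot) rather than joining them with a line keeps the individual measurements visible.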
|
 koitsuPremium,MVM join:2002-07-16 Mountain View, CA kudos:20 | Marc, politely: I've had two other senior systems engineers (like myself) look at the graph. Both of them are equally perplexed, and in the same way I am.
I'll let Gabe respond from here on out, but I'll explain more verbosely:
What you've described *makes sense* (as in conceptually what you want is doable), but what your first graph actually shows doesn't jibe with what you claim the results are -- and it's because of the type of graph being used + how the data is being graphed.
(Readers should note I ABSOLUTELY believe TekSavvy's claims that their nameservers take ~9ms on average vs. Google's 20-30ms. The reason, quite honestly, is the network round trip time between a TekSavvy customer and Google's DNS servers, plus the authoritative nameservers on the Internet that do not work with large EDNS packets (this adds time to the response).)
I believe the data you have is confusing because you're using a line graph rather than a scatter graph or scatter plot.
Honestly what should be happening under the hood:
Loop iteration #1:
1. Issue 100 DNS queries and keep track of the response time of each query. Query types will vary (different zones, TLDs, A vs. NS vs. PTR etc.), and response times will vary (some will be cached results, some won't be -- those which aren't should be much higher in response time)
2. Get an average response time: add up all 100 query response times, divide by 100. Result: average response time of 100 queries.
3. Graph result on Y axis, with Y axis label "average response time (in ms) of 100 DNS queries". X axis should be incremental based on time, or simply an incrementing variable ($loopcount++).
Loop iteration #2: repeat steps 1/2/3, except that in step 3 the X axis location should be further to the right than before, so that you can draw a line from iteration data point #1 to iteration data point #2.
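The loop described in the steps above could be sketched like this. It is a minimal sketch, not a real benchmark: `simulated_query_ms` is a hypothetical stand-in that makes up timings instead of sending DNS queries, and the cold-cache/warm-cache ranges are assumptions chosen only for illustration.

```python
import random
from statistics import mean


def simulated_query_ms(cached, rng):
    """Stand-in for timing one DNS query (hypothetical, no network I/O)."""
    return rng.uniform(5, 15) if cached else rng.uniform(80, 200)


def run_iterations(iterations, queries_per_iteration=100, seed=0):
    """Return one average response time (ms) per loop iteration."""
    rng = random.Random(seed)
    averages = []
    for loop_count in range(iterations):
        # Assume a cold cache on the first pass and a warm one afterwards.
        cached = loop_count > 0
        times = [simulated_query_ms(cached, rng)
                 for _ in range(queries_per_iteration)]
        averages.append(mean(times))
    return averages


averages = run_iterations(5)
# Plot averages[i] on the Y axis against i (the loop count) on the X axis.
```

Because each Y value is an average over one iteration and the X axis is ordered by iteration, joining the points with a line is meaningful here.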
The resulting graph would look roughly something like this.
The first loop iteration -- assuming all the nameservers it's querying have *no cached records* -- should be very slow (high response times due to recursive, non-cached lookups). The 2nd loop iteration should be much faster (cached results), the 3rd as well, etc. etc...
The 2nd to Nth results should be "roughly" all within the same amount of time -- however, this greatly depends on the data set being measured (more specifically: what the per-record TTL is of something being resolved, or the SOA TTL associated with that record's zone).
If you were to take all the graphed averages (how many depends on how many loop iterations you let things run for -- it matters! If just one loop, then the results are worthless!) and put them in their own data set, you could then graph those using a bar graph or bar chart, where each bar would represent a response time range, e.g. 0-10ms, 11-20ms, 21-30ms, etc., and let people see what the "general average" response time is for everything. This is akin (mostly) to the 2nd graph you listed in your post (the blue horizontal bars), except with more granularity.
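A rough sketch of that bucketing step, assuming the per-iteration averages are already collected in a list. `bucket_counts` is a hypothetical helper and the sample values are made up. It uses half-open 10ms buckets (0-9, 10-19, ...), which differs slightly from the 0-10/11-20 ranges mentioned above.

```python
from collections import Counter


def bucket_counts(averages_ms, width_ms=10):
    """Count how many averages fall into each width_ms-wide bucket."""
    counts = Counter(int(avg // width_ms) * width_ms for avg in averages_ms)
    return {
        f"{start}-{start + width_ms - 1}ms": counts[start]
        for start in sorted(counts)
    }


# Made-up per-iteration averages in milliseconds (an assumption).
histogram = bucket_counts([142.0, 9.6, 10.2, 9.9, 24.5])
```

Each key would become one bar in the bar chart, with the count as its length.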
And trust me, I am quite familiar with data/metrics graphing -- I wrote all of what you see there, sans the dygraphs library, and have had to write an entire code base (all perl + dealing with the mess that is RRDTool) to graph VirtualHost bandwidth usage on Apache (using no third-party modules). Not trying to troll or give you a headache, mate!  -- Making life hard for others since 1977. I speak for myself and not my employer/affiliates of my employer. |
|
 TSI GabePremium,VIP join:2007-01-03 Chatham, ON kudos:2 | The graph is being generated by a tool called namebench; I believe it was Google themselves that released it. This isn't something I created. |
|
 TSI GabePremium,VIP join:2007-01-03 Chatham, ON kudos:2 | I understand what you are saying though. There are more details in the namebench report that are missing here; I didn't necessarily want to publish them for fear of releasing internal network info. |
|
 koitsuPremium,MVM join:2002-07-16 Mountain View, CA kudos:20 | Understood. And yeah, in my original/first reply, I linked to the namebench site -- their graphs are identical in layout (see "Response Distribution Chart"), meaning the use of a line plot model.
I had a 4th colleague of mine (better educated than myself, especially in mathematics) look at the graphs as well, and he agrees the presentation model is incorrect for the kind of data being plotted (not that the data itself is wrong!). There are better presentation/layout models (scatter, etc.) that would present the information in a way that makes more sense, but that's not your fault -- it's the fault of namebench. Although since it uses the Google Chart API, the HTTP arguments could be changed to refer to a different model.
The part that shocks me the most is that namebench was written by a pair of Google employees. I'm surprised that someone would write such a useful tool then completely botch the visual representation part. "It's open source, so go fix it, koitsu!" Yeah, and it's Python; I'd rather swallow hot coals. 
Anyway, thanks for chiming in and clarifying a bit, TSI Gabe , very much appreciated! -- Making life hard for others since 1977. I speak for myself and not my employer/affiliates of my employer. |
|
 | reply to TSI Gabe Gabe,
You might want to invest in a copy of this bible »www.edwardtufte.com/tufte/books_vdqi |
|
 TSI MarcPremium,VIP join:2006-06-23 Chatham, ON kudos:16 | reply to koitsu said by koitsu:Marc, politely: I've had two other senior systems engineers (like myself) look at the graph. Both of them are equally as perplexed, and in the same way I am.
...
Not trying to troll or give you a headache, mate!
Hey, no worries, didn't mean to come off like that. I'm a mechanical engineer and built and ran our network for 10 years. The graph makes sense to me; I just assumed it did to others too.
All good though man, appreciate the feedback, I know it's coming from a good place. I'm happy we were able to tweak a bit more performance on this front. Seems we're all excited about that  -- Marc - CEO/TekSavvy |
|
 mlord join:2006-11-05 Nepean, ON kudos:10 Reviews:
·Start Communicat..
·TekSavvy Cable
·TekSavvy DSL
| I've been re-testing TSI DNS since this thread began, and thus far it hasn't failed on any sites for us (a record for TSI DNS here), and seems plenty quick enough now.
So TSI is now number one on the "Forwarders" list for our local DNS. Good stuff, guys! |
|
 TSI GabePremium,VIP join:2007-01-03 Chatham, ON kudos:2 | I can take a look but it would be really useful if you guys could provide me with a hostname to test against. |
|
 | callcentric.com
or
srv.callcentric.com |
|
 | Also could you please provide us with a method to test for this in the future? I'd really like to have all devices behind my router use the default servers (hopefully Teksavvy), instead of configuring different DNS servers on each and every one |
|
 nekoAll Hail CanadaPremium join:2006-08-11 Canada | reply to TSI Gabe callcentric.com works & resolves through Teksavvy DNS, but that isn't the recommended solution from CallCentric.
They recommend using: srv.callcentric.com
That does not get resolved & causes my device to fail in registering with CallCentric. I had to use different DNS in the configuration of my device to have it correctly resolve the srv.callcentric.com
As Wally_Walrus said, I'd prefer to have it resolve using Teksavvy DNS rather than having to hardcode an alternate into my device.
For reference:
Callcentric Problems Using ISP Assigned DNS
Callcentric DDOS Mega Thread -- ...virtue gives you heraldry. |
|
 mlord join:2006-11-05 Nepean, ON kudos:10
| MMmm.. the problem appears to be not with Teksavvy, but rather with callcentric's DNS provider (telengy.net):
$ whois callcentric.com
...
Domain servers in listed order:
NS1.TELENGY.NET 66.193.176.41
NS2.TELENGY.NET 204.11.192.20
NS3.TELENGY.NET 204.11.192.68
...
$ nslookup
> server ns1.telengy.net
Default server: ns1.telengy.net
Address: 66.193.176.41#53
> callcentric.com
Server: ns1.telengy.net
Address: 66.193.176.41#53
Name: callcentric.com
Address: 204.11.192.22
Name: callcentric.com
Address: 204.11.192.23
Name: callcentric.com
Address: 204.11.192.31
Name: callcentric.com
Address: 204.11.192.34
Name: callcentric.com
Address: 204.11.192.35
Name: callcentric.com
Address: 204.11.192.36
Name: callcentric.com
Address: 204.11.192.37
Name: callcentric.com
Address: 204.11.192.38
Name: callcentric.com
Address: 204.11.192.39
Name: callcentric.com
Address: 204.11.192.135
Name: callcentric.com
Address: 204.11.192.159
Name: callcentric.com
Address: 204.11.192.160
> srv.callcentric.com
Server: ns1.telengy.net
Address: 66.193.176.41#53
Non-authoritative answer:
*** Can't find srv.callcentric.com: No answer
> server ns2.telengy.net
Default server: ns2.telengy.net
Address: 204.11.192.20#53
> srv.callcentric.com
Server: ns2.telengy.net
Address: 204.11.192.20#53
*** Can't find srv.callcentric.com: No answer |
|
 nekoAll Hail CanadaPremium join:2006-08-11 Canada | Here is a guy explaining what's happening:
»DNS SRV - Callcentric
Hopefully you'll understand what he's on about, as I do not.
All I know for sure is Tekk's DNS doesn't allow my device to register & make calls; using an alternate DNS provider does work.
I'm sorry I can't be more helpful, as I have no clue about all this stuff. -- ...virtue gives you heraldry. |
|
 mlord join:2006-11-05 Nepean, ON kudos:10
| The discussion at that link is NOT looking for srv.callcentric.com at all. Instead, they are using _sip._udp.callcentric.com. for the hostname.
Perhaps that's part of your problem? The (successful) query below is using Teksavvy DNS:
[~] dig @206.248.154.22 _sip._udp.callcentric.com SRV
;; Truncated, retrying in TCP mode.
; <<>> DiG 9.7.0-P1 <<>> @206.248.154.22 _sip._udp.callcentric.com SRV
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54216
;; flags: qr rd ra; QUERY: 1, ANSWER: 24, AUTHORITY: 3, ADDITIONAL: 12
;; QUESTION SECTION:
;_sip._udp.callcentric.com. IN SRV
;; ANSWER SECTION:
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha1.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha2.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha3.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha4.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha5.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha6.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha7.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha8.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha9.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha10.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha11.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha12.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha1.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha2.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha3.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha4.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha5.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha6.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha7.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha8.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha9.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha10.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha11.callcentric.com.
_sip._udp.callcentric.com. 51 IN SRV 20 0 10123 alpha12.callcentric.com.
;; AUTHORITY SECTION:
callcentric.com. 58 IN NS ns2.telengy.net.
callcentric.com. 58 IN NS ns3.telengy.net.
callcentric.com. 58 IN NS ns1.telengy.net.
;; ADDITIONAL SECTION:
alpha1.callcentric.com. 51 IN A 204.11.192.22
alpha2.callcentric.com. 58 IN A 204.11.192.23
alpha3.callcentric.com. 58 IN A 204.11.192.31
alpha4.callcentric.com. 51 IN A 204.11.192.34
alpha5.callcentric.com. 51 IN A 204.11.192.35
alpha6.callcentric.com. 51 IN A 204.11.192.36
alpha7.callcentric.com. 51 IN A 204.11.192.37
alpha8.callcentric.com. 51 IN A 204.11.192.38
alpha9.callcentric.com. 51 IN A 204.11.192.39
alpha10.callcentric.com. 59 IN A 204.11.192.135
alpha11.callcentric.com. 51 IN A 204.11.192.159
alpha12.callcentric.com. 51 IN A 204.11.192.160
;; Query time: 16 msec
;; SERVER: 206.248.154.22#53(206.248.154.22)
;; WHEN: Sun Oct 21 15:37:16 2012
;; MSG SIZE rcvd: 1401 |
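For anyone unsure how to read the SRV answers in the dig output above, each line carries priority, weight, port and target after the record type. A small sketch that splits one such line into named fields; `SrvRecord` and `parse_srv_line` are hypothetical names, and it only handles this dig-style text layout.

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    owner: str
    ttl: int
    priority: int
    weight: int
    port: int
    target: str


def parse_srv_line(line):
    """Parse one answer line in dig's presentation format."""
    owner, ttl, rr_class, rr_type, priority, weight, port, target = line.split()
    if rr_class != "IN" or rr_type != "SRV":
        raise ValueError(f"not an IN SRV record: {line!r}")
    return SrvRecord(owner, int(ttl), int(priority), int(weight),
                     int(port), target)


record = parse_srv_line(
    "_sip._udp.callcentric.com. 51 IN SRV 20 0 5080 alpha1.callcentric.com."
)
```

Here the device would be expected to contact `record.target` on `record.port`, which is why the ATA needs the SRV answer and not just an A record.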
|
 mlord join:2006-11-05 Nepean, ON kudos:10
| Ah, but even with the correct hostname (_sip._udp.callcentric.com), there's still a problem with Teksavvy DNS: they don't return ANY results when "truncating" the response. This prevents voip devices (ATAs) from working at all unless the device supports TCP DNS lookups (not all do).
[~] dig @206.248.154.22 _sip._udp.callcentric.com SRV +notcp +noignore
; <<>> DiG 9.7.0-P1 <<>> @206.248.154.22 _sip._udp.callcentric.com SRV +notcp +noignore
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32860
;; flags: qr tc rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;_sip._udp.callcentric.com. IN SRV
;; Query time: 30 msec
;; SERVER: 206.248.154.22#53(206.248.154.22)
;; WHEN: Sun Oct 21 15:43:19 2012
;; MSG SIZE rcvd: 43 |
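The `tc` flag in the dig output above lives in the 16-bit flags word of the 12-byte DNS header (RFC 1035). A minimal sketch of how a client could detect it, using a header rebuilt by hand from the values dig printed; `parse_header` and `needs_tcp_retry` are hypothetical helper names and nothing here touches the network.

```python
import struct

# Flag bit masks within the 16-bit flags word of the DNS header (RFC 1035).
QR, TC, RD, RA = 0x8000, 0x0200, 0x0100, 0x0080


def parse_header(message):
    """Decode the 12-byte DNS header into the fields used here."""
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!6H", message[:12]
    )
    return {
        "id": ident,
        "truncated": bool(flags & TC),
        "rcode": flags & 0x000F,
        "answers": ancount,
    }


def needs_tcp_retry(message):
    """True when the server set TC, so the client should retry over TCP."""
    return parse_header(message)["truncated"]


# Header rebuilt from the dig output: id 32860, flags qr tc rd ra,
# one question, no answers.
header = struct.pack("!6H", 32860, QR | TC | RD | RA, 1, 0, 0, 0)
info = parse_header(header)
```

A device that cannot retry over TCP is left with `info["answers"] == 0`, which matches the failure described in this post.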
|
 nekoAll Hail CanadaPremium join:2006-08-11 Canada | said by mlord: Ah, but even with the correct hostname (_sip._udp.callcentric.com), there's still a problem with Teksavvy DNS: they don't return ANY results when "truncating" the response. This prevents voip devices (ATAs) from working at all unless the device supports TCP DNS lookups (not all do).
So it is their problem, & not ours?
That's all we wanted Gabe to know. That Tek's DNS doesn't work for our VOIP device settings.
Hopefully a fix can be made. -- ...virtue gives you heraldry. |
|
 mlord join:2006-11-05 Nepean, ON kudos:10 | Well, the real problem is callcentric.com. putting "too much" data into their DNS entry. Teksavvy *could* help out by having their DNS return partial (truncated) results to the UDP query, but they're not required to by any internet standards. |
|
 TSI GabePremium,VIP join:2007-01-03 Chatham, ON kudos:2 | honestly I kinda hate doing this but I agree with mlord... the problem is that the record is too large therefore requires the use of TCP....knowing that most ATAs don't support this they are kind of shooting themselves in the foot.
What's even more questionable is that they expect the DNS entry to be truncated...so why not enter fewer SRV records in to begin with to allow UDP to work?
I'm not saying I don't want to fix this...fixing it though on the other end would be IMO a big hack and not even sure that this would really be RFC compliant....
luckily though I'm at NANOG right now and am surrounded by geeks that deal with this on a daily basis...I'll ask around when I get the chance. -- TSI Gabe - TekSavvy Solutions Inc. Authorized TSI employee ( »TekSavvy FAQ »Official support in the forum )
|
|
 jabley join:2012-10-21 London, ON | There are several things going on, here.
The resource record set (RRSet) that the ATA is looking for is unusually large. Large responses in the DNS are generally accommodated by either negotiating a large UDP response buffer with EDNS0 (see RFC 2671) or by setting the truncate bit (TC) to 1 in a response and forcing a second DNS request using TCP.
Both these approaches have problems. Large UDP buffer sizes result in fragmentation, and fragmentation can be problematic. Many firewalls and other middleware make bad assumptions about 53/tcp, and hence TCP requests don't always work.
So, solution 1: if I were Callcentric, I would be reducing my response to that particular query to ensure that it fits in a 512 byte DNS response message without truncation. If they chose their server names more carefully they could still pack a good number of resource records in the ANSWER section by taking better advantage of label compression.
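To put rough numbers on the 512 byte limit, here is a back-of-the-envelope size estimate for the ANSWER section seen earlier in the thread. It is a sketch under stated assumptions: every owner name is a 2-byte compression pointer back to the question, every SRV target is written out in full, and the AUTHORITY and ADDITIONAL sections are ignored. A real server may encode things differently, so treat the results as estimates. All function names are hypothetical.

```python
def wire_name_length(name):
    """Length in bytes of an uncompressed domain name on the wire."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return sum(len(label) + 1 for label in labels) + 1  # +1 for root label


HEADER = 12                # fixed DNS header size
POINTER = 2                # compression pointer replacing a repeated name
FIXED_RR = 2 + 2 + 4 + 2   # type, class, TTL, rdlength
SRV_FIXED = 2 + 2 + 2      # priority, weight, port


def question_size(qname):
    return wire_name_length(qname) + 4  # qtype + qclass


def srv_rr_size(target):
    # Assumes a compressed owner name and an uncompressed target.
    return POINTER + FIXED_RR + SRV_FIXED + wire_name_length(target)


qname = "_sip._udp.callcentric.com."
base = HEADER + question_size(qname)
targets = [f"alpha{n}.callcentric.com." for n in range(1, 13)] * 2
answer_bytes = sum(srv_rr_size(t) for t in targets)
fits_in_512 = (512 - base) // srv_rr_size("alpha1.callcentric.com.")
```

Under these assumptions `base` comes to 43 bytes, which agrees with the "MSG SIZE rcvd: 43" of the empty truncated reply posted earlier, the 24 SRV records add a further 1014 bytes, and only about 11 of them would fit in a 512 byte message.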
The ATA described here appears not to support EDNS0 or TCP, so it has no capability of receiving large (complete) DNS responses. Not supporting TCP means not following the specification. The ATA is definitively broken, here. It violates RFC 1035. (I realise it's not unique in that. There are lots of bad DNS implementations in the world.)
Solution 2: fix the ATA. It's broken. The fact that it has ever worked is a happy accident.
BIND9's behaviour when a response will not fit in a UDP reply is to set TC=1 in the response header, and to populate the answer section with as much as will fit. This response is intended to be interpreted as "this is not an accurate response, but here is a partial answer and you should use TCP to get the rest of it".
Unbound's behaviour is not to return partial responses. It says "I can't give you a complete response, and I'm not going to risk giving you a partial answer because that might be bad, so you need to use TCP".
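The difference between the two behaviours can be modelled with a toy function. This is not code from BIND9 or unbound, only a sketch of the two policies as described here; `build_udp_answer` is a hypothetical name, and the 42 byte record size and 43 byte base size are rough assumptions.

```python
def build_udp_answer(record_sizes, base_size, limit=512, partial=True):
    """Return (tc_flag, number_of_records_included) for a UDP reply.

    record_sizes: wire size in bytes of each answer record, in order.
    partial=True  models the "as much as will fit" policy;
    partial=False models the "empty answer, use TCP" policy.
    """
    if base_size + sum(record_sizes) <= limit:
        return False, len(record_sizes)
    if not partial:
        return True, 0
    used, included = base_size, 0
    for size in record_sizes:
        if used + size > limit:
            break
        used += size
        included += 1
    return True, included


sizes = [42] * 24  # rough per-record estimate (an assumption)
bind_like = build_udp_answer(sizes, base_size=43, partial=True)
unbound_like = build_udp_answer(sizes, base_size=43, partial=False)
```

Both policies set TC; the only difference is whether a client that ignores TC still finds some usable records in the reply.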
Needless to say, this level of detail (how to populate the ANSWER section in a truncated response) is not really specified in RFC 1035, which is old. Technically, I think it's fair to say that both unbound and BIND9 are following the specification, as far as it goes.
BIND9's behaviour here is more forgiving of the broken DNS code in the ATA. I don't see an option in unbound to emulate the BIND9 approach to this.
Solution 3: choose different nameservers that behave as BIND9 does.
Unbound is good, polished software in my opinion. It has performance advantages and is far harder to fool with cache poisoning attacks than BIND9. It's hard to argue that the correct solution here is to replace unbound with BIND9; in effect, that would be throwing out the benefits of unbound for all users simply to accommodate one buggy ATA that is used by a tiny minority. |
|