aryoba (MVM, join:2002-08-22)

One app didn't work over Frame Relay; worked over ISDN

I know that it has been years since my last thread posting. I'm usually the one solving other people's problems. This time, however, I'm hoping people can help me, or at least throw out ideas as to why the following problem occurred. Here it goes.

We have two sites connected via Frame Relay as primary link and ISDN as backup link, as follows:

          Frame Relay (primary)
Site A ----------------------------- Site B
   |                                    |
   +------------------------------------+
               ISDN (backup)

Nothing fancy on the network setup. Both sites have a Cisco 1751 router. On each router's Fast Ethernet port there is a layer-2 switch connecting several hosts (both clients and servers).
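
For context, here is a rough sketch of what this kind of Frame Relay primary / ISDN backup setup often looks like on this class of router; the DLCI, IP addresses, ISDN number, and interface names below are placeholders, not the actual configuration from either site:

! Site A router - illustrative only
isdn switch-type basic-ni
!
interface Serial0
 encapsulation frame-relay
!
interface Serial0.1 point-to-point
 description Frame Relay PVC to site B (primary)
 ip address 10.1.12.1 255.255.255.252
 frame-relay interface-dlci 100
!
interface BRI0
 description ISDN backup to site B
 ip address 10.1.13.1 255.255.255.252
 encapsulation ppp
 dialer map ip 10.1.13.2 name SITEB broadcast 5551234
 dialer-group 1
!
dialer-list 1 protocol ip permit
!
! Primary route points across the FR PVC; the floating static
! (administrative distance 200) only takes over when the PVC goes down
ip route 192.168.2.0 255.255.255.0 10.1.12.2
ip route 192.168.2.0 255.255.255.0 10.1.13.2 200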

Everything went fine until one day when the Frame Relay link went down. Checking with the telco, it turned out there was a PVC mapping problem on the telco's frame relay switch, causing frame/packet loss between the two PVCs. The telco rebuilt the PVC, which seemed to fix the problem.

Ever since the telco rebuilt the PVC, one specific http application has had connection problems whenever the Frame Relay link is used. Site A was the client site and site B was the server site. Even though site A was able to telnet to the site B server on port 80 (indicating end-to-end network connectivity), site A could not load the application.

We originally thought it might be a browser or application problem. However, the connection went fine when the backup ISDN link was used.

Please note that on the Frame Relay link, only one http application was having a problem. Other applications (including other http applications) did not have any connection problems at all. When using the ISDN link, all applications (including the problematic http application) worked fine.

To further investigate, we checked exactly what problem the http application experienced over the Frame Relay link. From the server, we saw the following error message:

"HTTP Error 304
Not Modified. The server has identified from the request information that the client's copy of the information is up-to-date and the requested information does not need to be sent again."

We then did packet captures on both the Frame Relay link (the non-working one) and the ISDN link (the working one) to compare. From the captures we noticed that the problem occurred because not all of the http data was making it back to site A. Following are the details.

Up to the request and the first response, everything was good. In the working capture (via ISDN), there was an extra packet that carried the 2nd half of the http payload.

In the non-working capture, that extra packet did not exist. As a result, site A acknowledged only the first half of the http payload. The web server at site B could not send just the 2nd half of the http payload, so it backed up and tried to resend the full payload again. That in turn caused an out-of-order http payload problem at site A, since site A was expecting only the 2nd half of the payload.

The task then was to find out why the 2nd half of the payload was not showing up over Frame Relay but was showing up over ISDN. Please note that before we did the packet captures, we had already checked both routers' configurations for possible routing or firewall issues, power cycled both routers, etc.; nothing seemed to fix the application problem.

Until we fixed the problem, we had to use the ISDN backup link since the application was considered critical. The problem was that 24x7 ISDN usage was expensive.

At this point, does anybody have a thought as to why such a problem occurred?

Covenant (MVM, join:2003-07-01, England)

Interesting... is it fixed and what did you do to fix it?

Can you attach the sniffer captures in a text file if possible, the one working and the one not working? Where were those sniffer outputs captured, on the LAN side of the routers?

Was the other HTTP application that was not having the problem via FR hosted on the same webserver as the one having the problem?

How big is the FR link (64K, 128K, 256K, etc.), are you performing any FR fragmentation on it, and what is the MTU?

Sorry to ask questions but we need a better understanding of the issue before we can progress this.

Thinking out loud, from your symptoms I would say MSS was the issue, but without seeing it for myself that is just a guess.

rovernet (Premium Member, join:2004-02-11, Richardson, TX) to aryoba

Also, check with your frame-relay provider to make sure the PVC was rebuilt correctly. Let them know that your end checks out and is missing some data on its way back across the link.

Policing might have been activated while rebuilding the PVC in the provider's switches. It would be great if they can check the frame relay end2end and see if they are having any packet loss along the frame-relay/atm cloud.
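
One way to sanity-check this from the router side (a sketch; DLCI 100 is a placeholder for whatever DLCI is actually in use) is to watch the per-PVC counters on both routers while the application is failing:

! Compare "output pkts" on the site B router with "input pkts" on the
! site A router for the same PVC, and watch "dropped pkts" plus the
! DE/FECN/BECN counters for signs of policing or congestion in the cloud
Router# show frame-relay pvc 100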

Hope it helps.

aryoba (MVM) to Covenant

"Can you attach the sniffer captures in a text file if possible, the one working and the one not working?"

Answer:
I was trying to do that. However the output was not readable.

"Where were those sniffer outputs capured, on the LAN side of the routers?"

Answer:
Both captures were taken between the two routers (on the WAN side).

"Did the other HTTP application that was not having the problem via FR being hosted on the same webserver as the one having the problem"

Answer:
The other HTTP applications that were not affected were not hosted on the same physical server.

"How big is the FR link, 64K, 128K, 256K, etc"

Answer:
I believe it was about 128k.

"Are you performing any FR fragmentation on it"

Answer:
No. We also do not run any voice; strictly data.

"what is the MTU"

Answer:
Frankly, I don't really remember. I would have to check the router configuration for this.

"It would be great if they can check the frame relay end2end and see if they are having any packet loss along the frame-relay/atm cloud."

Comment:
There was no packet loss at all. As mentioned, we could successfully telnet to the server on port 80.

"Check with your frame-relay provider to make sure the pvc was rebuilt correctly"

Question:
How can an improper PVC rebuild cause a problem for ONLY ONE specific http application and affect no other?

rolande (Certifiable MVM, join:2002-05-24, Dallas, TX)

You mentioned an HTTP response from the webserver having the second packet of HTTP payload dropped. Can you see this second packet of data coming out of the server before it gets dropped? If so, what are its length and contents? Can you isolate which device or media is dropping the packet?

You need to run a full sniffer capture with something like Ethereal on the server and client for both scenarios, filter the data down to just the broken TCP session in question, and post the .cap file. At the same time you should use 'debug ip packet detail [acl#]' on both of the routers, using the ACL to filter for destination port 80 to the webserver's IP and source port 80 coming from the webserver's IP. If this second packet of HTTP data is making it to either of the routers, you should see how each router is handling it and whether it actually makes it out the Frame Relay interface or gets dropped/blackholed, etc.
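
As a sketch of what that might look like (the webserver address 192.168.2.10 and ACL number 150 are placeholders, not details from this thread):

! Match only HTTP traffic to and from the webserver, so the debug
! output stays small enough to run on a production router
access-list 150 permit tcp any host 192.168.2.10 eq 80
access-list 150 permit tcp host 192.168.2.10 eq 80 any

! Then, from enable mode on each router (undo with "undebug all"):
Router# debug ip packet 150 detail

One caveat: debug ip packet only shows process-switched packets, so with fast switching or CEF enabled you may need to temporarily disable route caching on the interfaces (no ip route-cache) for transit traffic to show up in the debug.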

aryoba (MVM)

Rolande,

We could see the 2nd http payload on the LAN side, leaving the server. Neither the server nor the client host dropped the payload. We also saw the payload entering the site B router from the LAN side.

As to the ACL, again we do not use one. There is no firewall, rules, nothing; just a simple router configuration.

When the Frame Relay link was used, no 2nd http payload entered the site A router (the client site), even though the payload was seen leaving the site B router (the server site). When the ISDN link was used, the payload was seen leaving router B and also entering router A.

In addition, site A is basically one of multiple branch offices and site B is the corporate site, in a hub-and-spoke Frame Relay network with ISDN links as backup. The routers in all the other branch offices have the same router model, the same router configuration, the same IOS version and feature set, the same Frame Relay PVC parameters (CIR and burst), the same ISDN setup, the same line bandwidth, and so on. All the other branch offices also run the same applications against site B (the corporate site). However, only site A had the problem with that one specific http application over Frame Relay; no other branch office did.

Further, we have also already triple-checked everything, from comparing the router configuration against the working branch offices to comparing the PVC and telco setup, etc. Nothing seemed to explain why only site A had the problem.

PA23 (Member, join:2001-12-12, East Hanover, NJ) to aryoba

Just a thought: disable CEF on both ends and see if it works.

rolande (MVM) to aryoba

I didn't ask if you had an ACL on the interface. I was asking you to use the command 'debug ip packet detail [acl#]' to actually see specifically what each router was doing with that particular packet. If the packet goes out the frame interface on Router B on the correct PVC without an encapsulation failure and it does not get received by Router A's Frame interface, then there is something strange going on at Layer 1 in the Frame cloud.

Just as a point of reference, there used to be known issues with specific DSUs on point-to-point 56K circuits where certain types of traffic would match the remote loopback pattern and send the DSU into a test loop. Of course, the last time I saw that was in 1995, but stranger things have happened at Layer 1/2.

Can you do an extended ping with a 1500-byte payload from router to router and make sure nothing is dropped? Somewhere there was a white paper I found on bit-pattern testing across circuits to test for certain anomalies. If I find it, I'll post it for you to test with.

aryoba (MVM) to PA23

N251EA,

As mentioned, we already compared the site A router configuration to all the other branch site routers. We also already compared everything else, like the telco settings, PVC, etc.

The telco also informed us that they tested the line clean. All other applications (including a different http application) were running fine via the Frame Relay, indicating that the Frame Relay link itself was good at carrying data.

In addition, site A was not a new site. We had this running fine for years. The problem started when the telco fixed the PVC mapping problem.

aryoba (MVM) to rolande

Rolande,

Yes, the packet goes out the frame interface on Router B on the correct PVC without an encapsulation failure, and it does not get received by Router A's Frame interface.

We also already did extended pings from router to router and nothing was dropped. We could even establish a telnet session from site A (the client) to site B (the server) on port 80, as mentioned previously.

When you say there is something strange going on at Layer 1 in the Frame cloud, I would agree. We just don't know which strange event it is.

Yes, if you do find the white paper, please post it. I guess this will educate all of us.

rolande (MVM)

Here is some of the info. I am still looking for the full list of Data Patterns and what problem each different type of data pattern can validate for you.

Complete these steps to perform serial line extended ping tests:

Type: ping ip
Target address = enter the IP address of the local interface to which IP address was just assigned
Repeat count = 50
Datagram size = 1500
Timeout = press ENTER
Extended cmds = yes press ENTER
Source Address = press ENTER
Type of service = press ENTER
Set Df bit in ip header = press ENTER
Validate reply data = press ENTER
Data pattern: 0x0000
Press ENTER once
Sweep range of sizes = y press ENTER.

Notice that the ping packet size is 1500 bytes and that we perform an all-zeros ping (0x0000). Also, the ping count is set to 50, so in this case fifty 1500-byte ping packets are sent out.

Here is a sample output:

Router#ping ip
Target IP address: 172.22.53.1
Repeat count [5]: 50
Datagram size [100]: 1500
Timeout in seconds [2]:
Extended commands [n]: yes
Source address or interface:
Type of service [0]:
Set DF bit in IP header? [no]:
Validate reply data? [no]:
Data pattern [0xABCD]: 0x0000
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 50, 1500-byte ICMP Echos to 172.22.53.1, timeout is 2 seconds:
Packet has data pattern 0x0000
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (50/50), round-trip min/avg/max = 4/4/8 ms
Router#

Perform additional extended ping tests with different data patterns.

For example:

Repeat step 1, but use a Data Pattern of 0x0001
Repeat step 1, but use a Data Pattern of 0x0101
Repeat step 1, but use a Data Pattern of 0x1111
Repeat step 1, but use a Data Pattern of 0x4040
Repeat step 1, but use a Data Pattern of 0x5555
Repeat step 1, but use a Data Pattern of 0xaaaa
Repeat step 1, but use a Data Pattern of 0xffff

Data Pattern of 0x0000 = 0000000000000000
Tests for line-code mismatches; used to test B8ZS substitution of an all-zeros pattern. With a sweep range of sizes you can find out whether the problem occurs with specific frame sizes.

Data Pattern of 0x4040 = 0100000001000000
Test for suspected timing errors. The 0x4040 extended ping pattern also enables you to detect jitter and wander. T1 phase variations greater than or equal to 10Hz are considered jitter, and variations less than 10Hz are considered wander.

Data Pattern of 0xffff = 1111111111111111
Identifies repeater power problems.

Verify that all the extended ping tests were 100 percent successful.
Examine the show interface serial command output to determine whether input errors have increased.
If input errors have not increased, the local hardware (DSU, cable, router interface card) is probably in good condition. Also look for cyclic redundancy check (CRC), frame, or other errors; these appear on the fifth and sixth lines from the bottom of the show interface serial output (see the sample lines after this list).
If all pings are 100 percent successful and input errors have not increased, the equipment in this portion of the circuit is probably in good condition. Move on to the next loopback test to be performed.
Remove the loopback from the interface. To do so, remove the loopback plug or the software loopback commands, or request that the telco remove their loopback. Then restore your router to its original settings.
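
For reference, the error-counter lines being referred to look roughly like this (a generic excerpt; the values are illustrative and the exact layout varies a little between IOS versions):

Router# show interfaces serial 0
  ...
  0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
  0 output errors, 0 collisions, 2 interface resets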

Here is a link to another article I found regarding strange intermittent CRCs on a T-1.

»www.pmg.com/tip_archive/ ··· ping.htm

WAN Troubleshooting
»www.cisco.com/warp/publi ··· over.pdf

PA23 (Member) to aryoba

aryoba,

You said that you are losing packets:

"In the non-working capture, the extra packet did not exist. As a result, site A acknowledged the first half of the http payload only"

A common side effect of CEF problems that I have seen is dropping every other packet. It's worth a shot: disable CEF and see if the problem goes away. If it does, then you can try troubleshooting CEF.
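
If anyone wants to try that, a rough sketch of the relevant commands (from memory of 12.x IOS; worth double-checking against the exact version running on the 1751s):

! First see whether CEF is already dropping or punting packets
Router# show cef drop
Router# show cef not-cef-switched

! Then disable CEF globally (or per interface with "no ip route-cache cef")
Router# configure terminal
Router(config)# no ip cef
Router(config)# end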

aryoba (MVM)

N251EA,

"A common side effect of CEF problems that I have seen is dropping every other packet. its worth a shot, disable CEF see if the problem goes away. if it does then you can try troubleshooting CEF".

Comment:
As mentioned previously, all the other branch sites did not have the issue, and they were all running the very same http application over the same FR setup. If this were a router configuration problem, then it should affect all the branch sites and not just one.

Covenant (MVM)

You mention:

"Until we fixed the problem", what did you to fix the issue if it is fixed?

As regards the CEF issue, PA23 is right that CEF could cause this, but I would have expected a reload to cure it. aryoba, with CEF it might not affect multiple routers but just the one, as may be the case here. Did you also disable fast switching, on the LAN and WAN?
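
On the fast-switching point, disabling it is done per interface; a quick sketch (the interface names are placeholders for whatever the 1751s actually use):

Router# configure terminal
Router(config)# interface FastEthernet0
Router(config-if)# no ip route-cache
Router(config-if)# interface Serial0
Router(config-if)# no ip route-cache
Router(config-if)# end

This makes traffic arriving on those interfaces process switched, which is also what lets debug ip packet see transit packets.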

Also, during your sniffer trace, what did the TCP look like? Were there multiple attempts at negotiating the window size, etc.? What kind of sniffer traces do you have? If you can get them into Ethereal format so everyone has access to them and upload them, that would be brill. I am thinking that if it did get dropped in the cloud, it may have been because the packet was too big, which can happen when two hosts try to negotiate a window size and can't.

I have had mixed results with the different data patterns in pings. The most accurate way to do this is to perform a BERT test using the built-in CSU/DSU, but I am not sure if you have a plain vanilla serial WIC or one with an integrated CSU/DSU.

aryoba (MVM)

Rolande,

I did all the extended ping tests. They were all 100% successful. And yes, we had already examined the "show interface serial" output, cleared the counters, and everything else, as the usual first step of troubleshooting.

We even went to the extreme. We took all the "working equipment" from a working branch site to site A, and we still saw the same error. When we installed the "non-working equipment" from site A at the working branch site, all the equipment worked as it was supposed to; no missing http payload anywhere. This indicated that all the related physical equipment from site A (DSU, router, cable, interface cards, etc.) was in working condition.

As you said Rolande, there is something strange going on at Layer 1 in the Frame cloud. We just don't know what.

mmedford (Member, join:2006-01-14, South Ozone Park, NY) to aryoba

Well, a stupid suggestion here, but since you say there are other working sub-sites, why not just pull the config off one of those routers, tailor it to the connection details needed, and drop it into the site A router? This could possibly eliminate the router as the problem, meaning it could be the telco.

aryoba (MVM)

mmedford,

As I already explained previously, we did compare the router configurations (three times, actually). We even went to the extreme of swapping all the equipment at site A with the equipment from a working branch.

When you say the problem could be the telco, where do you think the problem lies then?

aryoba (MVM) to Covenant

Here is the update.

We have been working with the telco to solve the problem. This time we asked the telco to recheck their previous PVC rebuild work.

The telco engineer hypothesized that the previous PVC rebuild was probably done only between the frame relay switches. So this time the engineer rebuilt the PVC between the switches, plus he completely removed the logical ports before adding them back in.

Since the 2nd PVC rebuild, it has been two days and the problematic http application is not "problematic" anymore. Site A has kept using the Frame Relay link and has not hit the http payload problem once. So far, the 2nd PVC rebuild seems to have cured the problem.

Here is my question. Let us say that the improper PVC rebuild was the culprit. How could this situation affect only ONE SPECIFIC HTTP APPLICATION and no other?

Anybody have a thought?

rolande (MVM)

There may have been a bit pattern in that specific packet that disappeared in the Frame cloud, causing an issue at Layer 1/2 based on the previous PVC config. I am not pointy-headed enough in that arena to even begin to take a stab at what it could have been. All I know is that the whole reason for pattern substitution like B8ZS at Layer 1 is the physical issues with circuit signaling. There are other things in the Layer 1 facilities that react to specific bit patterns as well, and those must be avoided lest you have strange problems like this occur. I have no idea, though, what is going on at Layer 2 in that Frame cloud.

rovernet (Premium Member) to aryoba

See my post on 3/23.
It looks like the telco cloud was dropping some of your data, maybe due to policing, which is not indicated on any of the counters in the routers because your physical and data-link hookup to the telco CO was/is fine, AND the application was receiving an incomplete response because frame relay does not retransmit those dropped packets. It doesn't happen that often, but I've seen more than my share of those.

Then again, that's all I can provide from several hundred (or thousand) miles away, based on the symptoms described at the beginning of the thread.

Take it or leave it. I usually don't answer "why" questions except for a very select group.

aryoba (MVM)

Keep in mind that this is a problematic http payload transaction (a layer 7 problem), not just a layer 2 or layer 3 problem.

From my understanding, if the problem were at layer 1, then it should affect ALL applications and not just one. Otherwise it is as if a layer 1 problem directly caused a layer 7 problem without affecting layers 2 through 6 at all.

What I don't understand is how an improper PVC rebuild in the frame relay cloud could affect only one specific http application.

In addition, the other http applications were not affected; just this one specific http application.

Weird?

Covenant (MVM) to aryoba

As rovernet put it, it is very hard to work out what went wrong remotely, but my money is on a software issue rather than a physical one, as the second rebuild, this time including the logical ports, would follow the same physical path as the previous PVC, with the only difference being the "logical" or "software" path. I've seen loads of those on ATM PVCs but rarely seen "stale" PVCs on frame relay circuits; then again, it depends on the hardware platform being used, and not all FR switches are created equal.

What I tell my team is that whenever you have anything unusual going on, or something that can't be explained by logic or RFC/IEEE standards, look into the possibility of it being a software issue, because nothing fails to make sense quite like a bug.

aryoba (MVM)

Yes, I quite agree, Covenant.

However, if it were a software issue, then we would have had the problem from the very beginning. The application had been stable for years. The problem started when the telco rebuilt the PVC.

How could the application become so picky when the others had no complaints?

Covenant (MVM)

I guess we will never know, as happens quite often with logical paths. I do not know the platform your FR network is built on, but in the same vein we can ask why we get stale dialers, or ATM PVCs that get frozen in Cisco routers, where the only way around it is to reconstruct the PVC, with no documented bugs to explain it. Take into account that most of the core infrastructure will not have been reloaded in a very long time, so things don't always act as expected. I guess what I am saying is that with software, some things unfortunately can't be explained. That is probably not what you want for your inquisitive mind, but it is one of those things you file away as "experience". Glad it got sorted in the end, but I would have been interested to see the result of applying ip tcp adjust-mss 1300 on the routers, as I had a similar issue over an MPLS network and, short of rebuilding the ATM PVC and causing an outage, I applied that command and the application started working again. Just a thought.
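
For anyone curious, applying that is a one-liner on the WAN interface (the interface name below is a placeholder). The command rewrites the MSS option in TCP SYNs passing through the interface, so the end hosts never offer a segment size larger than the configured value:

Router# configure terminal
Router(config)# interface Serial0
Router(config-if)# ip tcp adjust-mss 1300
Router(config-if)# end

A value like 1300 leaves headroom under a 1500-byte MTU for whatever extra encapsulation the path adds, which is why it can help when something in the path silently drops full-size segments.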