myxiplx
Posts: 132
Joined: 16.Mar.2001
Hey everyone (Tom, I really, really hope you read this one!),

Over the past 6 weeks we've had major e-mail problems on our network. This is going to be a long post so I'll split it into several parts: first I'll describe the symptoms we've been seeing, then I'll go over our network topology and the steps we've taken in troubleshooting. After that I'll cover what we've discovered, before finally giving my current theory on what's happening in the hope that somebody reading this can confirm or deny it.

We've been working with Microsoft's Exchange Server support team for about 6 weeks on this and have just recently passed the case on to their ISA team, but they're already at the stage where they want netmon traces, and that's going to be incredibly difficult. The problem I suspect takes about an hour to replicate (if I can replicate it at all), and it's so intermittent that it only happens about once every week to ten days. Short of installing a dedicated server for netmon traces this is going to be difficult, so I'm coming here for a 2nd opinion (and then I'll build that server).

Symptoms of the Problem
- Our mail server starts generating a backlog of messages.
- Under Exchange server message delivery is failing with "The connection was dropped by the remote host."
- On a MIMEsweeper appliance message delivery fails with "I/O error".
- (Yes, we swopped mail servers while troubleshooting this)
- Small messages can send ok, but any large message will jam up.
- Once you've sent a large message to a domain, further messages to the same domain just get stuck in the queue.
- The firewall server starts logging denied SMTP packets from multiple servers with error code:
0xc0040017 FWX_E_TCP_NOT_SYN_PACKET_DROPPED
- We have no other problems on the network. Internet access works fine and I can establish a telnet session to the affected mail servers.
- The problem occurs every week to ten days. Other than this we have no problems with mail: we routinely send hundreds of MB of mail a day and receive similar amounts.

Our Network - route taken by E-mails
1. SMTP Mail Server (Exchange or MIMEsweeper)
|
2. ISA Server 2004 Firewall
|
3. Q-Balancer (QoS and Load Balancer). Restricts outbound SMTP to 128kb/s
|
4. DSL Router
|
5. Remote SMTP server (recipient's server or smarthost)

We tried rebooting all the devices, starting & stopping services, pretty much everything we could think of. In the end, we had to eliminate items one at a time, leaving ISA for last:
1. Two different SMTP servers exhibit the exact same symptoms.
2. ....
3. Bypassing the Q-Balancer entirely has no effect.
4. We've tried two different DSL routers, both known to work perfectly.
5. Multiple domains are affected, including big names like hotmail.com, yahoo.com. Changing delivery to a smart host also has no effect.

However:
2. Moving the mail appliance outside the ISA server immediately solves the problem (we plugged it directly into 4. the DSL router). ISA has been working for 2 years without ever causing any kind of problems.

ISA Configuration
Windows 2000 Server, SP4. ISA 2004, fully patched with the exception of SP2. (Some of the problems reported here with SP2 will affect us directly so I have not even considered installing this yet.)

What we have found so far
- Moving the MIMEsweeper appliance outside the firewall solves the problem immediately (we moved it to plug directly into the DSL router).
- Running windows update on the firewall server didn't seem to have any effect.
- Since installing the MIMEsweeper appliance we have been able to access reports on mail traffic, and we discovered that when this problem hits we are sending 2.9GB of traffic to a single e-mail address over a 24 hour period. This is our first clear indication of the cause of the problem.
- We checked our logs and discovered that the problem starts as soon as we send large mails to this address.
- (In this case we were sending fifteen 3MB e-mails.)
- Reading up on mail loops, I found two possible causes:
- a forwarder sending the same messages back and forth between two servers
- the sending server not receiving the final 'OK' response from the recipient server
- Checking our mail logs, we are not receiving large amounts of mail, so I suspect the 2nd option.
- This could tie in with the "TCP_NOT_SYN" message on the ISA server - if the final OK is being blocked our mail server will simply keep trying to send messages.
- We have found out that this particular recipient had an infinite mail loop: their e-mail filtering company has reported that they saw over 200,000 messages flowing from their server on that day.
- Once delivery has failed to this domain, all large e-mails to any domain fail.
- Some mail does appear to flow: our message counts and queue size vary by up to 3-4 messages and 1-2MB throughout the day, but the general trend is always upwards. (I believe this is small messages being transmitted ok.)

Some Figures
- Our mail server allows up to 50 concurrent connections.
- At the peak of the problem we have upwards of 200 messages queued, totalling over 150MB.
- Those 15 original messages, sharing the 128kb/s limit, reduce the outbound transfer rate to 8.5kb/s per message.
- This means it will take 48 minutes to transmit these messages assuming no errors occur.
- During those 48 minutes, many other messages will arrive and I believe we quickly hit the 50 connection limit.
- 50 concurrent connections drops the speed per message to 2.56kb/s.
- That means a 1MB message will take 53 minutes to send. The original 3MB messages will now need 2hrs 40 mins.
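
For anyone who wants to check the figures above, here is a rough Python sketch of the arithmetic. It assumes kb/s means kilobits per second and that 1MB = 1024 kilobytes (those assumptions reproduce the numbers in the list); the function names are just made up for illustration.

```python
# Assumption: 1 MB = 1024 kB = 8192 kilobits, and "kb/s" means kilobits per second.
KBIT_PER_MB = 1024 * 8


def per_connection_rate(link_kbps, connections):
    """Share of the outbound limit each connection gets if it is split evenly."""
    return link_kbps / connections


def transfer_minutes(size_mb, rate_kbps):
    """Minutes needed to send size_mb at rate_kbps, with no errors or overhead."""
    return size_mb * KBIT_PER_MB / rate_kbps / 60


LINK_KBPS = 128.0  # Q-Balancer outbound SMTP limit from the post

rate_15 = per_connection_rate(LINK_KBPS, 15)   # rate per message with 15 messages in flight
batch = transfer_minutes(15 * 3, LINK_KBPS)    # all fifteen 3MB messages at the full limit
rate_50 = per_connection_rate(LINK_KBPS, 50)   # rate per message at the 50 connection limit
one_mb = transfer_minutes(1, rate_50)          # a 1MB message at that rate
three_mb = transfer_minutes(3, rate_50)        # a 3MB message at that rate

print(f"{rate_15:.2f} kb/s, {batch:.0f} min, {rate_50:.2f} kb/s, "
      f"{one_mb:.0f} min, {three_mb:.0f} min")
```

This ignores SMTP and TCP overhead and assumes the limit is shared evenly between connections, so treat the results as ballpark figures only.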

As far as we can tell, these kinds of delays are pretty normal for our mail servers and they normally cope fine. Even with 200 messages in the queue, once we move the server outside the ISA box the queue drops fairly rapidly, matching the peak transfer rates we would expect on this connection.

My Theory
- The load on the original recipient mail server could conceivably have delayed their server in sending that final 'OK'.
- Standard SMTP servers appear to handle this fine (we have no problem when the mail server is placed outside ISA).
- I think ISA may however have dropped the connection due to this long delay, causing it to block that final 'OK'. That would cause the mail server to retry these messages (and repeated attempts would add up to 3GB over 24 hours).
- After this problem has happened once, I think ISA also begins dropping any other connections that have been open for a while. At this point that will include large e-mails to any domain: Exchange will have between 15 and 50 connections open, and will want them kept open for a minimum of 30 minutes, probably nearer 60.
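
As a rough cross-check of the retry idea, the sketch below works out how many times the original batch (fifteen 3MB messages) would have to be resent to reach the 2.9GB reported by MIMEsweeper, and how much a 128kb/s link could carry in 24 hours. It assumes binary units (1GB = 1024MB) and kilobits per second, as in the other figures; the helper names are hypothetical.

```python
# Assumptions: 1 GB = 1024 MB, 1 MB = 8192 kilobits, "kb/s" = kilobits per second.
MB_PER_GB = 1024
KBIT_PER_MB = 1024 * 8


def retries_needed(total_gb, batch_mb):
    """How many complete sends of the batch add up to total_gb."""
    return total_gb * MB_PER_GB / batch_mb


def link_capacity_gb(link_kbps, hours):
    """Most data the link could carry in the given time if it ran flat out."""
    return link_kbps * hours * 3600 / KBIT_PER_MB / MB_PER_GB


batch_mb = 15 * 3                               # fifteen 3MB messages
attempts = retries_needed(2.9, batch_mb)        # full resends needed to reach 2.9GB
minutes_between = 24 * 60 / attempts            # average spacing of those resends
capacity = link_capacity_gb(128.0, 24)          # ceiling at the 128kb/s limit

print(f"{attempts:.0f} resends, one every {minutes_between:.0f} min; "
      f"link ceiling {capacity:.2f} GB per day")
```

On these assumptions the 2.9GB works out to roughly 66 complete resends, and it is more than a 128kb/s link could carry in a day. That may just mean the limit was not in force the whole time, or that the report counts traffic differently; this is a guess, not something the logs confirm.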

What I would like to know
- Could ISA be blocking the packet in this way? Microsoft are adamant that ISA will not cause this, however after the chaos caused by their Exchange team I am not inclined to simply take their word for it.
- Many, many people have reported the TCP_NOT_SYN error but nobody seems to have a definitive answer.
- I have seen one report of almost exactly the same symptoms - SMTP connections dropping with TCP_NOT_SYN errors from multiple domains:
(see http://www.mcse.ms/archive99-2004-11-1007098.html) If you have seen or are affected by this problem, please let me know.
- If ISA is blocking this packet, what then causes the problem with the other domains? When this problem occurs we see that same TCP_NOT_SYN error for multiple domains, and usually we have no problem with any of them. Could there be a problem with ISA causing it to drop all these other connections once this error has occurred?
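
One cheap way to test the "connection dropped after a long delay" idea without a full netmon setup might be a small probe like the sketch below. It opens a TCP connection to an SMTP server, leaves it idle for a while, then sends NOOP and checks whether a reply still comes back. This is only a sketch using Python's standard socket module: the function name, host and idle time are placeholders, and it has not been run against ISA.

```python
import socket
import time


def probe_idle_connection(host, port, idle_seconds, timeout=10.0):
    """Open a TCP connection, leave it idle, then check it still carries data.

    Returns True if the NOOP sent after the idle period gets any reply,
    False if the connection was reset, closed or timed out.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.recv(1024)              # read the SMTP greeting (220 ...)
        time.sleep(idle_seconds)     # simulate the long wait for the final 'OK'
        try:
            sock.sendall(b"NOOP\r\n")
            reply = sock.recv(1024)
        except OSError:              # reset, broken pipe or timeout
            return False
        return bool(reply)           # empty bytes means the peer closed


# Hypothetical usage (host name is a placeholder, not a real server):
# print(probe_idle_connection("mail.example.com", 25, idle_seconds=120))
```

Many SMTP servers close idle sessions themselves after a few minutes, so a False result on its own proves nothing. The useful comparison would be running the same probe from inside and outside the ISA box against the same server with the same idle time.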

If anybody has any ideas or theories on this, please let me know.

Ross