Friday, January 6, 2012

Log Jam


It’s been a long while since I posted anything.  Frankly, that’s because things have been quiet and I’ve not run into anything unusual or noteworthy.  One of my resolutions this year is to start posting again, so I’ll be digging back though my notes. I’m also wrapping up some projects that I’ll post.

Today on of our Exchange 2007 servers hosting our public folders started consuming log file space at the rate of about 1 new log file per second.  This corresponded with unexpected database growth. I’ve seen this before.  There are quite a few blogs and KB articles for troubleshooting the problem.  One of the best is here: http://blogs.technet.com/b/mikelag/archive/2009/07/12/troubleshooting-store-log-database-growth-issues.aspx

Unfortunately none of the suggestions actually pointed to my problem. It starts by looking for looping mail using Exchange User Monitor (Exmon) to look for a looping mail. Exmon runs on servers that host a database.  It cannot be used on Hub/CAS servers unless they are also hosting a database. Not the output below shows that the local HUB/CAS box is the top CPU consumer. So that’s a start


 
Moving to the Hub/CAS server I can see in the screenshot below that I have quite a few emails in retry that are heading for my journaling server and I have a small message that seems to be stuck.  In reality it’s looping faster than the GUI can keep up. If I double click the message or try to right-click and delete it I get a message indicating that the system has processed it already.



The first time I saw this I spent time trying to delete it using the Exchange management shell and the remove-message commandlet:

Remove-message -filter {Subject -eq "Win Big"} -WithNDR false

Yeah….That didn’t work either. I punted and called Microsoft. By now the system, which is located in EU had eaten 250 gigs. The data drive and log drive are both on the SAN so we were able to expand them but it was consuming space at such a rate that we would soon need to expand the drive again.  To make matters worse space on the SAN was already constrained so I would not be able to expand the drives enough to continue to troubleshoot. It was using 5 GB every 20 minutes.  Of course we had to rerun all the troubleshooting  :)

While on hold, as he discussed the problem with the next level engineer, I continue to scour the net for a solution. I found a vague reference to the queues becoming corrupt. I’ve recreated the queues before when I had a huge mail backup which caused hub throttling to kick in.  The queue database does not shrink once it’s expanded.  The easiest way to reclaim that drive space is to create a new queue. So I gave that a try.

Fixing a corrupt queue on a hub transport server is fairly simple:

•    Stop the MS transport service on the Hub/CAS server
•    Rename  the queue folder

(Note that most instructions indicate that you should recreate a new "queues" folder.  My experience has been that you should let the transport server create this folder itself when you start it)

•    Restart the MS transport service
•    Test mail transport using a mailbox in that site sent to/from a mailbox in another site
•    Monitor the queues, database and log file drives for growth.
Fixed. No more mail loop. new logs dropped from 130 a second to about 1 a minute.