Showing posts with label windows. Show all posts

Wednesday, September 24, 2008

Debugging after a power outage

Here at Hive7 we host all our servers with a hybrid co-location/hosting provider. When you host your servers in a colo facility there are a few key things you look for: stuff like multiple redundant internet routes, clean power, zero-interruption backup power systems, adequate cooling, and decent security. Our host has all of these (after all, they come standard in any decent colo), and they also provide a few services above and beyond a bare-bones colo, like hardware load balancers and good prices on rented servers of any configuration. We've had our share of small mishaps, but things have been pretty smooth sailing. That is, until about 36 hours ago.

Sometime around 5am PDT PG&E had a major power outage. Normally when this (rarely) happens in a colo, the backup batteries and generator carry you through without even a hiccup. Well, not this time. Backup power systems faltered and servers went down. Things quickly came back up, but there was a catch: the air conditioner in our data center did not! Servers rapidly began overheating. Quick to react, the colo's on-site engineers hard powered off a bunch of our servers (I'm still not quite sure why we didn't get a phone call to do this on our own the safe way).

Text messages flew (Zenoss is your friend) and our chief sys admin dude rushed down to the data center to assess the damages as soon as he was alerted. On the surface there were a few major issues. Many of our larger RAID Arrays were running degraded and needed some love. Some servers had not powered back on and were stuck. He spent all day trying to right the wrongs made by our colo's (lack of) backup power.

While he was doing that, guess what I was doing? Yeah, trying to make Knighthood run. Knighthood is a pretty big web app. We have millions of users, and roughly 10,000 actively playing at any given point in time. With a few database servers running with degraded arrays, virtual machine hosts not running, and some other systems still not powered on I set forth scouring logs and troubleshooting. One by one we got the necessary systems back up and running. First it was the email service, then the email invitation service, then the Active Directory domain controllers. With all of these up and running properly, the game was still performing poorly. It was exhibiting performance traits I'd seen before. It'd go fast for a bit, then completely hang, then go fast again.

In the past, a performance pattern like this has been caused by some reader/writer lock contention where an upgrade to a write lock causes everything to stop. It's also been caused by transactions hanging in Distributed Transaction Coordinator, or SQL Blocking, or contention on a big cache item in Memcached, or some code that is infinitely recursing. So, I did my normal troubleshooting in this scenario.

I popped open perfmon to look at request rates, bandwidth, cpu, threads, memory, etc across the farm. Every single front end IIS server (7 of them for this game) was processing absolutely nothing for a 20 second period. And by nothing, I mean nothing. IIS wouldn't even serve up an image! And CPU was 0% utilized. But, once that 20 seconds was over we'd get a good 5 seconds of processing done. Everything was queued and no requests were dropped. I thought for sure the power outage had severely damaged one of our databases, DNS servers, or a big chunk of cache (we run about 60GB of memcached), or something else obvious like that. It really felt like something was timing out and then letting the flood gates open.

At this point the logs weren't showing any errors to lead me down a debugging path. Since the behavior was happening consistently I decided to grab a hang dump of the w3wp process. The dump was, well, completely surprising! Guess what was happening?

No really. Guess.

Give Up? Nothing! Yeah, nothing. During that 20 seconds all the managed and unmanaged threads were completely idle doing absolutely nothing. No locks were held. No pages were processing. I know that because of Tess's neat blog post about which threads you can ignore. It's as if IIS just decided we didn't really want to process any more requests. This had me scratching my head. I repeated the process another 4 or 5 times with the same result. It seemed the problem must be somewhere in kernel space. None of my user mode dumps found anything at all. That scared me.

I have no experience doing kernel debugging. So, well, this is where I partially threw in the towel. I called Microsoft support and opened a premier support case. It had been about 8 hours and we had a LOT of pissed off paying users. After a few hours on the phone we captured some more user mode dumps (they didn't believe me that there wasn't anything interesting there) and uploaded them. I wish saying "I am experienced doing production crash dump debugging" meant something to these guys... I'm not sure how many times I had to say "No, you see, when it hangs it uses 0% CPU!".

The Microsoft crash dump engineers went about their business and said they'd call me back when they found something (though I knew they wouldn't find anything and no amount of whining could make them skip this step). To their credit, since it was production and affecting our main line of business, they offered to do the debugging immediately rather than within 2 business days, which is the normal turnaround.

A couple hours later everything magically started working perfectly. I changed nothing. I called our sysadmin and he said he changed nothing that should have affected Knighthood. It had already been a long day and we decided to wait for the next day to find what had fixed it. Sysadmin dude sends me an IM this morning to let me know he figured out the problem. Guess what it was?

No really. Guess.

Well of course, it was IIS logging! We have all our IIS logs pointed at a NAS. I never thought this would be an issue since, well, logging happens in a background thread right? It can't possibly interfere with actual request processing. Turns out that is incorrect! The NAS was one of the last things that Sysadmin dude had brought up at the end of the day because it was only used for archiving and log files, a very low priority in a crisis scenario. Well, apparently not!

Microsoft called me back about an hour later to let me know that the hang dumps did not uncover anything and it appeared that requests were simply being delayed before they could be processed. "Well, no $@*! I told you that 12 hours ago", I thought. So, I let them know we had fixed the problem and closed the case. I'm sure we could have seen it in a kernel hang dump in the IIS kernel mode stuff, but the problem was not reproducing anymore and I didn't want to bother...

Now you know. IIS logging can clog up all your requests. So, if you're logging to a remote system over a Windows share, make sure it never goes down! Or, well, don't do that: log locally and ship the logs out on occasion.
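For the "log locally and ship on occasion" option, here is a minimal sketch of what the shipping side could look like, run from a scheduled task. It is an illustration and not what we actually run: the function name `ship_logs`, its parameters, and the example paths are all made up for this post. The idea is that the web server only ever writes to local disk, and a separate job copies finished files to the NAS. If the share is down, the job gives up quietly and tries again on the next run.

```python
import shutil
from pathlib import Path


def ship_logs(local_dir, remote_dir, keep_newest=1, pattern="*.log"):
    """Copy finished log files from local_dir to remote_dir, then delete the local copy.

    Returns the names of the files that were shipped. If the remote side is
    unreachable, nothing is deleted and the files are retried on the next run.
    (Hypothetical helper: names and defaults are assumptions for illustration.)
    """
    local = Path(local_dir)
    remote = Path(remote_dir)

    # Oldest first. The newest file(s) may still be open for writing, so skip them.
    files = sorted(local.glob(pattern), key=lambda p: (p.stat().st_mtime, p.name))
    candidates = files[:-keep_newest] if keep_newest > 0 else files

    shipped = []
    for src in candidates:
        dst = remote / src.name
        try:
            shutil.copy2(src, dst)
            # Only trust the copy if the sizes match.
            copied_ok = dst.stat().st_size == src.stat().st_size
        except OSError:
            # Share is down or unreachable: stop and leave the rest for later.
            break
        if copied_ok:
            src.unlink()
            shipped.append(src.name)
    return shipped
```

Something like `ship_logs(r"D:\logs\w3svc1", r"\\nas\iislogs\web01")` every hour or so would do it (both paths are placeholders). The point of the design is that a dead NAS only delays archiving and never sits in the request path.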

Saturday, May 24, 2008

Windows Server 2008

Anybody who's been watching will have noticed that I have had some fun trying to get x64 Windows Vista stable on my workstation. Well, because of all that fun I've been running good ole trusty XP Pro 32 bit (and Ubuntu) for the last 8 months or so. I noticed both Server 2008 and Vista SP1 came out and I thought, "hey, it's time for an upgrade!"

After my last episode installing Vista x64 on my workstation, I decided I should instead go for Windows Server 2008 x64 – I know, it shouldn't really make a difference, but it made me feel better! So I log in to MSDN, download it, burn it to disc, and away I go. Before I know it, everything is installed and working. It's been over a week now and it's still working. Not a single crash! Amazing... It's almost like, dare I say it, I got a Mac crossed with OpenBSD and a touch of Linux! ;)

I have to say, they did something right with Windows Server 2008 for us hardcore workstation users. It is the perfect blend of security, customizability, and sexiness. You gotta love a server operating system with all the IIS 7 goodness that lets you turn on Aero. :) Both of my printers even have drivers now!

However, I must confess, I did cheat a little bit. I disabled my on-board sound card, which was the culprit for many of my BSODs in the prior attempts at Vista x64, and bought a PCI sound card. Ah well, Server 2008 rules, Vista sucks! There. I said it.

Oh yeah, I did have one issue. For some stupid reason it didn't want to activate, giving me the stupid error:

Windows Activation Error: A problem occurred when Windows tried to activate. Error Code 0x8007232B. For a possible resolution, click More Information. Contact your system administrator or technical support department for assistance. DNS name does not exist.


Luckily there are about a million hits on Google on the subject. Here's the most concise one. Yeah, you read that right, enter the same exact product key and click activate again. You would think that's an error that would have been fixed in over a year...

Sunday, October 7, 2007

Vista Rant

Next month Vista will have been out in RTM for a year (it was released on 11/30/2006 to the volume/dev world). Why is it then, that it still doesn't work?

When I originally bought my current workstation, almost two years ago, I set it up with four partitions (I had a terabyte RAID). I installed XP Pro 32 bit for games and Windows Server 2003 x64 for development. I left two open partitions, one for Vista, and one for some flavor of Linux. For the first six months of its life I spent most of my time in XP Pro x86. It worked, but I was wasting my fancy 64 bit hardware!

The beginnings — x64, take 1

On June 6 2006 I blogged about my experience installing Vista on my workstation. At the time it went reasonably well. Heck, my system passed the hardware compatibility wizard with flying colors! However, I really should have posted some follow-ups.

A week later — x64, take 2

After I started actually using Vista, it started crashing, a lot. I'd get at least one blue screen a day. I looked at the minidumps and they were all related to one of my NVidia drivers. It would be either the NIC or the sound or the SATA or the RAID — yeah, pretty much everything that comes with my motherboard (I bought it for the pretty colors). One day I noticed a Windows update for my NVidia RAID drivers. "Cool", I thought. Maybe they sped them up or improved stability. The drivers installed; my system rebooted; it bluescreened; repeat. After much frustration I switched to XP Pro x64.

A month later — x64, screw you

XP Pro x64 was fairly stable, but I still had serious issues with the interaction of the drivers. If I was using the network heavily and then started using the sound card heavily: Kablammo! BSOD! I sort of learned how to work around this and would mute my music before doing anything intensive over the network. Needless to say this was very annoying. I switched to XP Pro x86.

December, 2006 — Vista x64 RTM install

"Ok, Vista is RTM now", I thought. "Surely NVidia has got their act together now!" They're working closely with Microsoft on this stuff, right?!

December, 2006 — Vista x86 RTM install

"Ok, x64 is the bastard child of hardware", I thought. "Surely this will work."

December, 2006 — Back to the trusty XP Pro x86 partition

Yeah, back on XP. I'll spare you the details.

August, 2007 — Vista x86 install

"Ok, it's been a million computer years now", I thought. "Surely all my crash reports and support incidents have caused some bug fixes." Not so much. I even installed it on a PATA drive, no RAID required. To add to the pain, I also purchased a D-Link DWA-130 USB Wireless N Adapter for my desktop at the same time. I can't pass on the promise of speed with no wires. "What?! No Vista drivers?! I quit!" What's with a new product (released in July 2007) not having Vista drivers? Oh, there are beta ones available now — they don't work either.

August, 2007 — XP x86 Pro for life

After over a year of struggling and giving it a chance, I've given up. Either Microsoft or these hardware/driver vendors need to get their heads out of their asses, take some initiative, and fix this crap. I'm officially a hateful bastard until someone shows me some reason not to be.

October, 2007 — The world should know

I wrote this blog. It feels good to vent. Much cheaper than a therapist.

Thursday, December 14, 2006

Installing Coversant Products on Vista

Due to the enhanced security in Windows Vista, not all Coversant products can be installed out of the box. Luckily, this is really easy to work around, and rest assured, future versions of our installation packages will not suffer from these issues.

The symptoms show up as an error message dialog with code 2869:

And then a series of empty dialog boxes:


This is apparently some sort of permissions issue. To solve it, the MSI needs to be run as an administrator. The easiest way to do this is to create a bat file to run the msi manually. It would be something like this:

msiexec.exe /i "c:\SoapBoxServer2007\files\SoapBox.Server.Enterprise.x64.3.0.213.69.msi"

Then you right click on the bat file and choose "Run As Administrator". Presto, a working installation in Vista.

Friday, June 2, 2006

Fun Installing Vista Beta 2 on AMD x64

As a self proclaimed geek and MSDN subscriber I feel as though it's my duty to explore all the new software that Microsoft comes out with. This last week I have been embarking on one such journey. Working with beta software is always a bit trying, but tack on a beta driver model and a "new" hardware platform (x64) and things get really interesting.

Vista Beta 2 was released to MSDN about a week ago. The next day I fired up my DVD burner and started messing around. About four hours later I had a working installation. Why so long? Well, because my workstation is an AMD NForce 4 x64 system and boots from the onboard SATA RAID. Apparently this is not one of Microsoft's test platforms. I had an experience similar to this guy.

I had to run the Vista install from an existing Windows installation. It simply would not work when I attempted to boot from the DVD. I never got an option to load drivers. Nvidia recently released beta Forceware drivers for Vista x64. I assumed these would have the RAID drivers I needed to install Vista; after all, they did have the appropriate txtsetup.oem file and seemed to be correct. After a few attempted installations, blue screens, automatic reboots, and hangs, I decided that my assumption was bad. Lesson learned: to install Vista x64 on an Nforce 4 RAID use the XP x64 Nforce4 RAID and SATA drivers. Yup, that's right. Well, almost.

If you're like me and want to use the latest Nvidia XP x64 drivers you'll be greeted with a black screen telling you that your drivers are corrupt after the first setup reboot. Say wha? Lucky for you, they aren't. This is a feature of Vista x64. Hit F8 at the boot screen and choose to disable driver signature verification. Of course, hitting F8 EVERY time you boot your computer is not going to be very fun. Luckily (for now) there is a tool called bcdedit (just run "bcdedit.exe -set nointegritychecks ON") that you can use to disable signature verification after you get into your desktop. Oh, don't forget to right click on the Command Prompt link in your start menu and choose "Run as Administrator" before trying to run this command, or it will tell you "Access is Denied". Yay security! I should definitely mention that Vista prompted me to allow this action (VPMTATA), at least once.

Oh yeah, I almost forgot, after you finally get to your desktop Vista will keep telling you that it has found an unknown device. This is your RAID controller; the one with XP drivers. Point the device wizard thingy to the inf file (VPMTATA) of the Forceware Vista x64 Beta drivers and this annoyance will go away. You'll also need to install drivers (VPMTATA) for the Nforce4 audio chipset.

On the plus side, Vista had drivers for my Geforce 7800GT and Nforce4 gigabit network card. It even figured out I had dual monitors (VPMTATA), picked the max resolution for both (VPMTATA), and presented a neat little dialog that let me choose the desktop layout (VPMTATA). Of course, I wanted to upgrade to the latest ones from Nvidia. This is usually straightforward.

I ran the setup exe (VPMTATA) for the Nvidia Vista Beta 2 Geforce drivers. It extracted stuff (VPMTATA), ran the second installer exe (VPMTATA), and then failed with some cryptic error messages I probably should have written down and submitted as bug reports. Subsequent attempts to run the installation package (VPMTATA) resulted in an error about running 32 bit uninstaller code on a 64 bit platform. I was very confused, didn't want to spend much time on it, and gave up.

A few days later I had an epiphany - "I should just try to update the Microsoft Geforce driver with the inf". Duh. Well, I opened up the Device Manager (VPMTATA) and clicked update drivers (VPMTATA). Voilà! The drivers were upgraded. Though, I have no idea if there are any control panels with these drivers (as there are in XP) since I couldn't run the full setup. Ah well, at least I have better video acceleration.

I'll be back later for my accounts of fun with Vista. After a week of use, I think I could write a book. However, it definitely hasn't all been bad (though VPMTATA) and I will continue to use Vista as my primary OS until it does something very mean or simply won't allow me to get things done.

About the Author

JD Conley is an entrepreneur and hacker, currently working away his golden handcuffs at Playdom, a subsidiary of the Walt Disney Company, since Hive7 was acquired. We make social games. The views and opinions expressed on this post are his and do not necessarily represent or reflect those of The Walt Disney Company.