Wednesday, September 24, 2008

Debugging after a power outage

Here at Hive7 we host all our servers with a hybrid co-location/hosting provider. When you host your servers in a colo facility there are a few key things you look for: multiple redundant internet routes, clean power, zero-interruption backup power systems, adequate cooling, and decent security. Our host has all of these (they come standard in any decent co-location, after all), and also provides a few services above and beyond a bare-bones colo, like hardware load balancers and good prices on rented servers of any configuration. We've had our share of small mishaps, but things have been pretty smooth sailing. That is, until about 36 hours ago.

Sometime around 5am PDT PG&E had a major power outage. Normally when this (rarely) happens in a colo, the backup batteries and generator carry you through without even a hiccup. Well, not this time. Backup power systems faltered and servers went down. Things quickly came back up, but there was a catch. The air conditioner in our data center did not! Servers rapidly began overheating. Quick to react, the colo's on-site engineers hard powered off a bunch of our servers (I'm still not quite sure why we didn't get a phone call to do this on our own the safe way).

Text messages flew (Zenoss is your friend) and our chief sys admin dude rushed down to the data center to assess the damage as soon as he was alerted. On the surface there were a few major issues. Many of our larger RAID arrays were running degraded and needed some love. Some servers had not powered back on and were stuck. He spent all day trying to right the wrongs made by our colo's (lack of) backup power.

While he was doing that, guess what I was doing? Yeah, trying to make Knighthood run. Knighthood is a pretty big web app. We have millions of users, and roughly 10,000 actively playing at any given point in time. With a few database servers running with degraded arrays, virtual machine hosts not running, and some other systems still not powered on, I set forth scouring logs and troubleshooting. One by one we got the necessary systems back up and running. First it was the email service, then the email invitation service, then the Active Directory domain controllers. With all of these up and running properly, the game was still performing poorly. It was exhibiting performance traits I'd seen before. It'd go fast for a bit, then completely hang, then go fast again.

In the past, a performance pattern like this has been caused by some reader/writer lock contention where an upgrade to a write lock causes everything to stop. It's also been caused by transactions hanging in Distributed Transaction Coordinator, or SQL Blocking, or contention on a big cache item in Memcached, or some code that is infinitely recursing. So, I did my normal troubleshooting in this scenario.

I popped open perfmon to look at request rates, bandwidth, CPU, threads, memory, etc. across the farm. Every single front-end IIS server (7 of them for this game) was processing absolutely nothing for a 20 second period. And by nothing, I mean nothing. IIS wouldn't even serve up an image! And CPU was 0% utilized. But, once that 20 seconds was over we'd get a good 5 seconds of processing done. Everything was queued and no requests were dropped. I thought for sure the power outage had severely damaged one of our databases, DNS servers, or a big chunk of cache (we run about 60GB of memcached), or something else obvious like that. It really felt like something was timing out and then letting the flood gates open.

At this point the logs weren't showing any errors to lead me down a debugging path. Since the behavior was happening consistently I decided to grab a hang dump of the w3wp process. The dump was, well, completely surprising! Guess what was happening?

No really. Guess.

Give Up? Nothing! Yeah, nothing. During that 20 seconds all the managed and unmanaged threads were completely idle doing absolutely nothing. No locks were held. No pages were processing. I know that because of Tess's neat blog post about which threads you can ignore. It's as if IIS just decided we didn't really want to process any more requests. This had me scratching my head. I repeated the process another 4 or 5 times with the same result. It seemed the problem must be somewhere in kernel space. None of my user mode dumps found anything at all. That scared me.

I have no experience doing kernel debugging. So, well, this is where I partially threw in the towel. I called Microsoft support and opened a premier support case. It had been about 8 hours and we had a LOT of pissed off paying users. After a few hours on the phone we captured some more user mode dumps (they didn't believe me that there wasn't anything interesting there) and uploaded them. I wish saying "I am experienced doing production crash dump debugging" meant something to these guys... I'm not sure how many times I had to say "No, you see, when it hangs it uses 0% CPU!".

The Microsoft crash dump engineers went about their business and said they'd call me back when they found something (though I knew they wouldn't find anything and no amount of whining could make them skip this step). To their credit, since it was production and affecting our main line of business, they offered to do the debugging immediately rather than within 2 business days, which is the normal turnaround.

A couple hours later everything magically started working perfectly. I changed nothing. I called our sysadmin and he said he changed nothing that should have affected Knighthood. It had already been a long day and we decided to wait for the next day to find what had fixed it. Sysadmin dude sends me an IM this morning to let me know he figured out the problem. Guess what it was?

No really. Guess.

Well of course, it was IIS logging! We have all our IIS logs pointed at a NAS. I never thought this would be an issue since, well, logging happens in a background thread right? It can't possibly interfere with actual request processing. Turns out that is incorrect! The NAS was one of the last things that Sysadmin dude had brought up at the end of the day because it was only used for archiving and log files, a very low priority in a crisis scenario. Well, apparently not!

Microsoft called me back about an hour later to let me know that the hang dumps did not uncover anything and it appeared that requests were simply being delayed before they could be processed. "Well, no $@*! I told you that 12 hours ago", I thought. So, I let them know we had fixed the problem and closed the case. I'm sure we could have seen it in a kernel hang dump in the IIS kernel mode stuff, but the problem was not reproducing anymore and I didn't want to bother...

Now you know. IIS logging can clog up all your requests. So, if you're logging to a remote system over a Windows share, make sure it never goes down! Or, well, don't do that: log locally and ship the files out on occasion.
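For the curious, the "log locally and ship" idea can be sketched roughly like this. It's a minimal sketch, not what we actually run: the LogShipper class, its method name, and the age cutoff are all made up for illustration. The point is that IIS only ever writes to local disk, and a separate scheduled job moves files that haven't been touched for a while over to the share. If the share is down, the job fails quietly and the web servers never notice.

```csharp
using System;
using System.IO;

public static class LogShipper
{
    // Moves *.log files that have not been written to for at least minAge
    // from localDir to archiveDir. Returns the number of files shipped.
    // The class, method, and parameters are hypothetical.
    public static int ShipOldLogs(string localDir, string archiveDir, TimeSpan minAge)
    {
        int shipped = 0;
        DateTime cutoff = DateTime.UtcNow - minAge;

        foreach (string path in Directory.GetFiles(localDir, "*.log"))
        {
            // Skip files that were written recently; IIS may still have them open.
            if (File.GetLastWriteTimeUtc(path) > cutoff)
                continue;

            string target = Path.Combine(archiveDir, Path.GetFileName(path));
            try
            {
                // Copy first, delete second, so a failure part way through
                // never loses the local copy.
                File.Copy(path, target, true);
                File.Delete(path);
                shipped++;
            }
            catch (IOException)
            {
                // Share unreachable or file locked: leave the file in place
                // and pick it up on the next run.
            }
            catch (UnauthorizedAccessException)
            {
                // Same treatment for permission problems.
            }
        }
        return shipped;
    }
}
```

You'd call something like LogShipper.ShipOldLogs(localLogFolder, shareFolder, TimeSpan.FromHours(1)) from a scheduled task, with whatever folders and cutoff suit your log rollover settings.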

Thursday, July 17, 2008

ASP.NET - It's not just for DataGrids after all

Last night I gave a talk at the San Francisco chapter of the Bay.NET User's group. It was a lot of fun. Thanks for the great interaction everyone! I am also thoroughly impressed that I finished nearly on time and didn't have 10 slides left! Usually I have way too much material for these things.

As promised, here are the slides from the talk.

If any other groups out there are interested in this talk, let me know!

Saturday, May 24, 2008

Windows Server 2008

Anybody who's been watching will have noticed that I have had some fun trying to get x64 Windows Vista stable on my workstation. Well, because of all that fun I've been running good ole trusty XP Pro 32-bit (and Ubuntu) for the last 8 months or so. I noticed both Server 2008 and Vista SP1 came out and I thought, "hey, it's time for an upgrade!"

After my last episode installing Vista x64 on my workstation, I decided I should instead go for Windows Server 2008 x64 – I know, it shouldn't really make a difference, but it made me feel better! So I log in to MSDN, download it, burn it to disc, and away I go. Before I know it, everything is installed and working. It's been over a week now and it's still working. Not a single crash! Amazing... It's almost like, dare I say it, I got a Mac crossed with OpenBSD and a touch of Linux! ;)

I have to say, they did something right with Windows Server 2008 for us hardcore workstation users. It is the perfect blend of security, customizability, and sexiness. You gotta love a server operating system with all the IIS 7 goodness that lets you turn on Aero. :) Both of my printers even have drivers now!

However, I must confess, I did cheat a little bit. I disabled my on-board sound card, which was the culprit for many of my BSODs in the prior attempts at Vista x64, and bought a PCI sound card. Ah well, Server 2008 rules, Vista sucks! There. I said it.

Oh yeah, I did have one issue. For some stupid reason it didn't want to activate, giving me the stupid error:

Windows Activation Error: A problem occurred when Windows tried to activate. Error Code 0x8007232B. For a possible resolution, click More Information. Contact your system administrator or technical support department for assistance. DNS name does not exist.


Luckily there are about a million hits on Google on the subject. Here's the most concise one. Yeah, you read that right, enter the same exact product key and click activate again. You would think that's an error that would have been fixed in over a year...

Wednesday, March 19, 2008

A SQL Table Manhunt

As I mentioned in my last post I recently took on the exciting new position of Chief Software Architect at Hive7, Inc. We're building all kinds of great stuff. Our most popular game Knighthood has over a million registered users and over 100,000 daily actives. This game is growing quickly. Over 125,000 people added the game two weeks ago, and over 150,000 added it in the last week. The game came into existence in December.

This massive growth leads to some exciting scalability challenges. I'll be spending a lot of time talking about that in the future. Today's post is a simple tidbit related to databases. Our current performance bottleneck is with database write I/O. We have enough memory in the systems and a caching layer, so the disks barely need to read. Tracking this down is a whole other post, but it's fairly simple. Once we knew we were write I/O limited we set out to find out why.

The original DB physical layout started out pretty simple. There was one file for data, one for logs. In the next 3 or 4 revisions more and more files were created. Why? Well, so we could run this nifty little query and find out which of our db tables/indexes/etc. were causing the write bottlenecks:

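select
    db.name as DbName,
    mf.name as FileName,
    mf.physical_name as FilePhysicalName,
    vf.TimeStamp,
    vf.NumberReads,
    vf.BytesRead,
    vf.IoStallReadMS,
    vf.NumberWrites,
    vf.BytesWritten,
    vf.IoStallWriteMS,
    vf.BytesOnDisk
from fn_virtualfilestats(-1,-1) vf
inner join sys.databases db on db.database_id = vf.DbId
-- sys.master_files lists the files of every database on the instance.
-- sys.database_files only covers the current database, so joining it on
-- file_id alone mislabels files that belong to other databases.
inner join sys.master_files mf on mf.database_id = vf.DbId and mf.file_id = vf.FileId
order by vf.NumberWrites desc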

If you have physically separated your various database tables and indexes into different files, the output from this function will give you all kinds of useful information about which ones are most accessed, and which put the most strain on your I/O subsystem. Optimizing it, of course, is up to you. :)
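One caveat: the counters coming out of fn_virtualfilestats are cumulative since the SQL Server instance started, so a file that got hammered last week can still top the list today. Taking two snapshots a few minutes apart and diffing them shows what is hurting right now. Here's a rough C# sketch of that idea. The FileStat, FileWriteDelta, and FileStatsDiff types are made up for illustration, and actually loading the snapshots from the query is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One row of the query output, reduced to the write-related columns.
// Hypothetical type; PhysicalName corresponds to FilePhysicalName.
public class FileStat
{
    public string PhysicalName { get; set; }
    public long NumberWrites { get; set; }
    public long IoStallWriteMS { get; set; }
}

// What happened to one file between two snapshots.
public class FileWriteDelta
{
    public string PhysicalName { get; set; }
    public long Writes { get; set; }
    public double AvgStallMsPerWrite { get; set; }
}

public static class FileStatsDiff
{
    // Pairs up two snapshots by physical file name and reports, for the
    // interval between them, how many writes each file took and how long
    // the average write stalled. Files with no writes in the interval are
    // dropped. Worst average stall comes first.
    public static List<FileWriteDelta> Diff(
        IEnumerable<FileStat> before, IEnumerable<FileStat> after)
    {
        var query =
            from a in after
            join b in before on a.PhysicalName equals b.PhysicalName
            let writes = a.NumberWrites - b.NumberWrites
            let stall = a.IoStallWriteMS - b.IoStallWriteMS
            where writes > 0
            orderby (double)stall / writes descending
            select new FileWriteDelta
            {
                PhysicalName = a.PhysicalName,
                Writes = writes,
                AvgStallMsPerWrite = (double)stall / writes
            };
        return query.ToList();
    }
}
```

Sorting by average stall per write rather than raw write count is a judgment call: a file with few but very slow writes is often the more interesting one.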

If you enjoy big scale, fast moving, tough problems, we're hiring for a Lead Web Designer, a Brilliant Lead DBA/Sysadmin, and a Web Games Developer (.NET)!

Thursday, February 28, 2008

C# 3.0 Overview

It's been forever since my last post. I promise I'll do better. I've just been juggling three jobs. ;) But that has changed (more on that soon)!

The last couple of nights I did the same talk at two different user groups: the Sacramento .NET User's Group and the Central California .NET User's Group. Thank you guys for having me, and not throwing any tomatoes. I think we had a good time at both events. Though, I did take up the whole two hours both times.

The talk was based on Jon Skeet's upcoming book titled C# in Depth, which I had the pleasure of reviewing and providing technical feedback on. We went through the evolution of C# from 1.0 to 3.0, explored a bunch of the new features, and played Human LINQ (hilarious). Oh yeah, it was pointed out to me in the Sacramento group that the word "jumped" should really be "jumps". That's what I get for copying my work! hah! If anybody in the Sacramento area wants to help, I'd like to do it again and videotape it...

Stuff to Download
  • C# 3.0 Overview Presentation – In both talks I didn't have enough time to bore you guys with the "in depth" slides. Pick up Jon's book to learn the nitty gritty about how all that stuff works.
  • A Sorted Affair 2 – A few weeks ago I published the first version of this on my blog. This one is much cooler.
  • Human LINQ – The code we executed with our Human LINQ provider.
  • Sort Performance – A quick exploration of the relative sorting speeds using different sort methods.

I'll write up another quick blog in a bit on the sort performance. It's quite interesting, indeed.

Monday, February 4, 2008

A New Kind of Application Server

As you probably know, I'm a cofounder of Coversant which, at its heart, is an XMPP development platform. Most of our larger customers (thousands of simultaneous users) are ISVs that have built on the SoapBox Platform®. We allow you to easily develop XMPP applications using .NET technology.

A really long time ago, I wrote about some possibilities for using the SoapBox Platform including examples of what our customers were doing at the time. This was before microblogging was popular, or I probably would have used that example too. :)

The last couple of weeks there seems to be quite a bit of buzz around the subject of using XMPP as an application server, and that gets me really excited! A friend/competitor Matt Tucker of Jive Software wrote in his company blog about how XMPP is the future for cloud services. A "real" online author (aka not a member of an XMPP company) even picked up Matt's article and ran with it. Yesterday, a little buzz hit Slashdot when another friend/competitor Mickael Raymond of Process One wrote about introducing the XMPP application server (when I wrote this, it seems Process One was experiencing a bit of the Slashdot effect -- hopefully by the time you read this it will be gone and you can read his article), which is an exploration of building a Twitter-like microblogging system on top of their XMPP server. Great stuff, indeed!

This is wonderful news and very validating for me personally! It seems after six years of committing to the infant technology, I wasn't crazy after all, and XMPP is a good platform for presence/messaging systems! And if you're in the market for .NET based XMPP solutions, head on over to the SoapBox Developer site. :)

Wednesday, January 30, 2008

A Sorted Affair: History of the C# Sort

Next month I'll be giving a talk at the Sacramento .NET User's group titled C# 3.0 Overview where I'll be presenting the great new features in the third version of C#. I've been developing .NET/C# software since the first pre-release copies of Visual Studio .NET reached MSDN. It's been fun to see the fledgling C# language evolve with the times. In just a few short years it's gone from what many considered to be an uninspired Java clone to a highly productive, unique language.

Today I'm going to spoil the opener code sample to my user group presentation. Let's take a journey through the life of C# as a developer with a simple task: sorting a list! I even made a super 1337 Winforms app you can play with.



For the purposes of this example, we're going to use a simple class called Person. We'll be sorting the list by the person's first name.

using System.Collections.Generic;

public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }

    public override string ToString()
    {
        return FirstName + " " + LastName;
    }

    public override int GetHashCode()
    {
        return FirstName.GetHashCode() ^ LastName.GetHashCode();
    }

    public static ICollection<Person> GetPeople()
    {
        return new List<Person>
        {
            new Person{FirstName="Ray",LastName="Ozzie"},
            new Person{FirstName="Larry",LastName="Ellison"},
            new Person{FirstName="Steve",LastName="Jobs"},
            new Person{FirstName="Bill",LastName="Gates"},
            new Person{FirstName="Britney",LastName="Spears"}
        };
    }
}

C# 1.0

In the beginning, there was C# 1.0. C++, VB, and COM be damned! For the majority of Microsoft based development, this new environment was a godsend. We enjoyed the benefits of a garbage collected, managed runtime, with the speed of native code. Life was good, but more than a bit clunky.

For the prim and proper developer out there, the ArrayList just didn't cut it. We used CollectionBase and implemented our own strongly typed collections. Here's my collection, in all its glory!

using System.Collections;

public class PersonCollection
    : CollectionBase
{
    public PersonCollection()
    {
    }

    public PersonCollection(IEnumerable people)
    {
        foreach (Person p in people)
            base.InnerList.Add(p);
    }

    public int Add(Person p)
    {
        return base.InnerList.Add(p);
    }

    public void Remove(Person p)
    {
        base.InnerList.Remove(p);
    }

    public Person this[int index]
    {
        get
        {
            return (Person)base.InnerList[index];
        }
        set
        {
            base.InnerList[index] = value;
        }
    }

    public void Sort()
    {
        base.InnerList.Sort();
    }

    public void Sort(IComparer comparer)
    {
        base.InnerList.Sort(comparer);
    }

    public void Sort(int index, int count, IComparer comparer)
    {
        base.InnerList.Sort(index, count, comparer);
    }
}

Ok, we've got a collection of people. Now, we can finally sort them, C# 1.0 style! There's a class called Sorting in the download that these methods are a part of.

public PersonCollection SortCS1(PersonCollection people)
{
    PersonCollection sortedList = new PersonCollection();
    foreach (Person p in people)
        sortedList.Add(p);
    sortedList.Sort(new PersonFirstNameComparer());
    return sortedList;
}

private class PersonFirstNameComparer : IComparer
{
    public int Compare(object x, object y)
    {
        return ((Person)x).FirstName.CompareTo(((Person)y).FirstName);
    }
}

Goodness, that's a lot of code! No wonder we didn't sort anything in C# 1.0. Not only do you have to create a custom collection class, there is also a custom Comparer class! And there's no getting around this when you want to sort by a member of a complex business object.

C# 2.0

C# 2.0 came along with the .NET Framework 2.0 and a host of improvements. Features like generics, anonymous methods, custom iterators, and partial classes changed the face of the language for the better. No longer did we have to spend time writing the same code, like collection classes, over and over; we could focus more on getting things done.

public IEnumerable<Person> SortCS2(IEnumerable<Person> people)
{
    List<Person> sortedList = new List<Person>(people);
    sortedList.Sort(delegate(Person x, Person y) { return x.FirstName.CompareTo(y.FirstName); });
    return sortedList;
}

Wow, so compared to C# 1.0's 60+ lines of code, in 2.0 the sort itself is one line. Very nice. Though, that anonymous method syntax still really bugs me. It's ugly! Why should I have to create a full method signature just to do a simple one line function? Ah yes...

C# 3.0

Enter C# 3.0. This is by far the biggest leap forward in C# to date. With the inclusion of lambda expressions we're now able to do some real functional programming, without the anonymous method gunk to get in the way. Check out the list sorting now!

public IEnumerable<Person> SortCS3(IEnumerable<Person> people)
{
    return people.OrderBy(p => p.FirstName);
}

Doesn't that look so much better than the 2.0 version? And it's light years better than the 1.0 version. Clear, concise code. It isn't cluttered up with extra, unnecessary syntax. It's just business. The way it should be.
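One difference from the earlier versions is worth spelling out: OrderBy doesn't touch the collection you hand it, and nothing actually gets sorted until you enumerate the result. Here's a small sketch of sorting on a second key with ThenBy. The SortingExtras class and SortByFullName method aren't in the download; they're made up for this example, and Person is trimmed down to keep the snippet self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down copy of Person so the snippet compiles on its own.
public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public static class SortingExtras
{
    // Hypothetical companion to SortCS3: first name, then last name.
    // The input is left untouched and the sort runs lazily, when the
    // returned sequence is enumerated.
    public static IEnumerable<Person> SortByFullName(IEnumerable<Person> people)
    {
        return people
            .OrderBy(p => p.FirstName)
            .ThenBy(p => p.LastName);
    }
}
```

If you need a snapshot you can index into, call ToList() on the result. Otherwise each enumeration re-runs the sort against whatever is in the source at that moment.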

Since 3.0 also comes with the LINQ syntax, I thought I'd throw that example in there as well.

public IEnumerable<Person> SortCS3Linq(IEnumerable<Person> people)
{
    return from p in people orderby p.FirstName select p;
}

I admit, in this case the LINQ syntax weighs you down, but you can't say it's not clear what's going on. Throw some cross joins and advanced filters in there and it will start to look a lot more appealing.
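To make that a bit more concrete, here's a sketch of a cross join with a filter, written both ways. It lists every pair of people exactly once, say for a round-robin tournament. The Pairings class and both method names are invented for the example, and it assumes first names are unique. The method-syntax version is roughly what the query turns into, not the exact compiler output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down copy of Person so the snippet compiles on its own.
public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public static class Pairings
{
    // Query syntax: two "from" clauses make the cross join, the "where"
    // keeps each pair once (and drops a person paired with themselves).
    public static IEnumerable<string> AllPairs(IEnumerable<Person> people)
    {
        return from a in people
               from b in people
               where string.CompareOrdinal(a.FirstName, b.FirstName) < 0
               orderby a.FirstName, b.FirstName
               select a.FirstName + " vs " + b.FirstName;
    }

    // Method syntax: same result, but the pair has to be carried
    // through every step by hand in an anonymous type.
    public static IEnumerable<string> AllPairsMethodSyntax(IEnumerable<Person> people)
    {
        return people
            .SelectMany(a => people, (a, b) => new { a, b })
            .Where(x => string.CompareOrdinal(x.a.FirstName, x.b.FirstName) < 0)
            .OrderBy(x => x.a.FirstName)
            .ThenBy(x => x.b.FirstName)
            .Select(x => x.a.FirstName + " vs " + x.b.FirstName);
    }
}
```

Once there are two range variables in play, the query syntax keeps both names in scope for free, which is where it starts to earn its keep.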

The End

C# has changed quite a bit over the last six years, and I can't say I dislike any of it.

About the Author

Wow, you made it to the bottom! That means we're destined to be life long friends. Follow Me on Twitter.

I am an entrepreneur and hacker. I'm a Cofounder at RealCrowd. Most recently I was CTO at Hive7, a social gaming startup that sold to Playdom and then Disney. These are my stories.

You can find far too much information about me on LinkedIn: http://linkedin.com/in/jdconley. No, I'm not interested in an amazing Paradox DBA role in the Antarctic with an excellent culture!