Starin' at the Wall: How to Build Scalable .NET Server Applications: Memory Management

I'll get this out of the way from the start: this series of blogs will have nothing to do with ASP.NET or web services. However, if you plan on writing your own implementation of IIS in managed code, this would probably be a good place to start. :) I also won't be providing very many code examples, as I'd be flogged by our intellectual property lawyers. You will not be able to copy and paste your way to your own scalable server. However, I hope to provide enough insight that you can avoid a big list of gotchas we had to figure out the hard way. This post covers one piece of a huge puzzle: memory management. Yes, you do have to think about that in .NET, at least if you want to build a large-scale application.

For those who don't already know, SoapBox Server is part of our SoapBox Collaboration Platform. It supports the XMPP protocol as well as most of the interesting JEP extensions. At the core of SoapBox Server is a highly efficient socket server and thread machine capable of scaling to hundreds of thousands of simultaneous users, and it's built 100% on .NET (C# now, but it used to be VB).

SoapBox Server is the first multithreaded, socket-based server application I've had the pleasure of working on. Over the course of building SoapBox Server into the extremely scalable and reliable system it is today, I've learned a few things (as has the rest of the team, I hope). Thanks to Chris (who already had tons of experience with such things in Win32/C++), a few bloggers out there, some books, customers finding very interesting bugs, Windbg with Son of Strike, oh, and Starbucks, I'd say I'm pretty well versed in the land of building scalable server applications. I'm no Jeff Richter, mind you, but I feel I've learned enough to at least speak intelligently about it.

In that spirit, I'd like to share the fruits of our tuning and debugging work, which, if history repeats itself, will keep evolving as we begin work on our next major revision of the product. First, I'd like to repeat something I said a couple of paragraphs ago: SoapBox now scales to hundreds of thousands of simultaneous connections on a single piece of server hardware. Think about that for a second. A user brings up an IM client, connects to SoapBox Server, and then holds that connection open until they log out. Repeat hundreds of thousands of times. This is no simple task. The .NET CLR does not provide a magic "Process.Scalable = true" property. We have invested hundreds (maybe thousands) of hours in tuning over the life of the server, on hardware ranging from single-processor laptops to 16-way Itanium2 systems with 64GB RAM. We've been through four distinct processing models as well as quite a few iterative improvements to our socket interaction layer. Basically, we have run the server under a bunch of different profilers in many scenarios, found slow bits of code, and fixed them. But I'm not going to talk about profiling and performance tuning; perhaps another time. I'm going to talk about memory and scalable applications.

Every time your application creates a new Socket, Windows allocates memory from its nonpaged kernel memory (the nonpaged pool), which is physical memory reserved by the kernel that is never paged out to disk. This pool has a finite limit, and the kernel sizes it based on the amount of physical RAM available. I don't know the exact algorithm, but with 4GB RAM it usually works out to somewhere around 150,000 TCP socket connections, give or take. Want to see this in action? Simply write a loop that keeps creating sockets. Eventually it will fail with a SocketException telling you there isn't enough buffer space. On top of this hard kernel-level limit, you also have to worry about how much memory each concurrent connection uses in your own application. In SoapBox we keep a lot of information about each connection in memory in order to improve performance and reduce IO. This includes the user's contact list, their last presence (available, away, busy, etc.), authorization information, culture information, user directory information, and so on. If we didn't hold this in memory, we'd have to hit a file, a database, or some other out-of-process persistent store every time we needed it. Being IO bound is no fun. Believe me, we started out that way.
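If you want to try that experiment, here is a minimal probe, sketched in Python rather than .NET. It opens sockets in a loop until the operating system refuses or a safety cap is reached. The function name `probe_socket_limit` and the cap are hypothetical. It also carries two assumptions. First, these sockets are never connected, so on Windows they may not draw on the nonpaged pool the way live connections do. Second, on Linux the per-process file-descriptor limit (EMFILE) will usually stop the loop first. On Windows the failure typically surfaces as WSAENOBUFS (error 10055), the "no buffer space available" error.

```python
import socket

def probe_socket_limit(cap=100_000):
    """Open TCP sockets until the OS refuses or `cap` is reached.

    Returns (count, error); error is None if the cap was reached.
    Every socket opened is closed again before returning.
    """
    opened = []
    error = None
    try:
        while len(opened) < cap:
            try:
                opened.append(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
            except OSError as exc:  # e.g. WSAENOBUFS on Windows, EMFILE on Linux
                error = exc
                break
        return len(opened), error
    finally:
        for s in opened:
            s.close()

if __name__ == "__main__":
    count, err = probe_socket_limit(cap=500)
    print(f"opened {count} sockets; stopped because: {err or 'cap reached'}")
```

Raise the cap only on a test machine. The point of the exercise is to see where the operating system, not your code, says no.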

However, because of our extensive caching, SoapBox Server 2005 can only reliably handle about 20,000 simultaneous connections on the beefiest 32-bit hardware. (On 64-bit it's much, much, much higher. I also have to admit we haven't stress tested the 2007 build on 32-bit hardware; it would probably be much higher now.) It doesn't matter if you have 64GB RAM and sixteen 32-bit processors, we can still only handle 20,000 connections. Why, you ask? Because of the 2GB (well, really 3GB with the /3GB boot.ini switch) per-process virtual address space limit in 32-bit Windows. Without resorting to managing your own memory, your process only gets up to 3GB to play with. Typically, we use that up, or rather, .NET thinks we use it up, somewhere between 20,000 and 30,000 connections. Now why would I say ".NET thinks we use it up"? Story time!
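Dividing the address space by the connection count where the server tops out gives a rough per-connection ceiling. This Python arithmetic assumes the full 3GB is available to the process. The result lumps together cached user data, socket buffers, runtime overhead, and any wasted (fragmented) space, so treat it as an upper bound on what each connection appears to cost, not a measurement. The helper name is hypothetical.

```python
GIB = 1024 ** 3

def per_connection_ceiling(address_space, connections):
    """Average bytes of address space per connection when the process
    tops out at `connections` (includes all overhead and fragmentation)."""
    return address_space / connections

for conns in (20_000, 30_000):
    kib = per_connection_ceiling(3 * GIB, conns) / 1024
    print(f"{conns:,} connections -> about {kib:.0f} KiB of address space each")
```

That works out to roughly 157 KiB per connection at 20,000 connections and 105 KiB at 30,000. Those figures are large enough that wasted space, not just real data, deserves a look.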

A little over a year ago, one of our customers kept running into a very bad situation. As the Event Log showed, SoapBox Server was crashing (insert shock and awe here). It happened only occasionally, but it did happen, and we did not take it lightly. This customer was running about 2,500 simultaneous connections on a dual Xeon with Hyper-Threading, 4GB RAM, and the /3GB switch set. That was plenty of hardware for the job, probably overkill, yet the service was still crashing. We set them up with the Debugging Tools for Windows and had them start up the process under the debugger and wait for a crash (another blog we'll have to write some day). After a few tries we got a dump with some useful information in it. The result? We were out of memory, sort of.

In .NET, when you call any socket operation and pass it a buffer, whether it's a send or a receive, synchronous or asynchronous, the buffer is pinned before it is handed to the Winsock APIs. Pinning, in a nutshell, tells the CLR memory manager not to move a .NET object until it is explicitly unpinned. The memory manager in the CLR is smart. As you allocate and release objects, the garbage collector periodically compacts the heap, sliding live objects together so the overall memory footprint stays low. There are quite a few really good/long/complicated articles on how this works, so I won't bore you. However, pinning throws a wrench into this, and the memory manager isn't quite smart enough to deal with it well (though it has gotten a lot better in 2.0). Basically, the buffer you hand to the socket cannot move in memory (physically, in terms of your virtual address space) from the time the socket IO operation begins until it ends. If you look at the Winsock2 APIs this is obvious, since the buffer is passed as a pointer. Anybody who's built this type of application with Winsock2 is probably saying "DUH!". I'd consider this a very leaky abstraction. Because of this behavior, it is quite easy to write a socket application in .NET that runs out of memory.
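To see why a single pinned buffer matters, here is a toy model of sliding compaction, written in Python. It is not the CLR's real algorithm. The real GC has generations, segments, and a separate large object heap, and it can reuse some of the holes left around pinned objects. All names (`Obj`, `compact`, `demo_heap`) are hypothetical. The model shows only the core constraint: live unpinned objects slide down, but a pinned object stays put, so the heap cannot shrink below it.

```python
from dataclasses import dataclass, replace

@dataclass
class Obj:
    name: str
    addr: int        # start offset in the toy heap, in bytes
    size: int
    live: bool = True
    pinned: bool = False

def compact(heap):
    """Toy sliding compaction: drop dead objects, slide live unpinned objects
    down to the lowest free offset, and leave pinned objects where they are.

    Returns (new_layout, high_water_mark, free_bytes_below_mark).
    """
    cursor = 0       # next free offset after compaction
    free_gaps = 0    # holes stranded below pinned objects
    layout = []
    for obj in sorted(heap, key=lambda o: o.addr):
        if not obj.live:
            continue                          # garbage is reclaimed
        if obj.pinned:
            free_gaps += obj.addr - cursor    # may not move it, so the hole stays
            layout.append(obj)
            cursor = obj.addr + obj.size
        else:
            layout.append(replace(obj, addr=cursor))  # slide it down
            cursor += obj.size
    return layout, cursor, free_gaps

def demo_heap(pin_buffer):
    # 1,000 dead 1 KB temporaries, then a 2 KB socket buffer, then a live session object
    heap = [Obj(f"temp{i}", i * 1024, 1024, live=False) for i in range(1000)]
    heap.append(Obj("recv_buffer", 1000 * 1024, 2048, pinned=pin_buffer))
    heap.append(Obj("session", 1000 * 1024 + 2048, 512))
    return heap

_, top, holes = compact(demo_heap(pin_buffer=True))
print(f"pinned:   top={top} bytes, free holes={holes} bytes")
_, top, holes = compact(demo_heap(pin_buffer=False))
print(f"unpinned: top={top} bytes, free holes={holes} bytes")
```

With the buffer pinned, the heap still reaches 1,026,560 bytes and 1,024,000 of them are free holes. Unpinned, the same live data compacts into 2,560 bytes.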

Back to the story! Not only were we out of memory, but there was only about 200MB worth of data structures in the heap. For those of you who, like me, use calc.exe for all your basic math, let me figure that out for you: 200MB < 3GB. Uhh, say what? How the heck were we out of memory? Well, we ran into the downside of pinning: memory fragmentation. The cause was a small number of small pinned buffers, 2KB each in our case, that sat high enough in the heap to cause fragmentation spanning over 2.8GB. Where did the other 2.8GB go, you ask? Well, it was there, allocated by our process, but not being used by our code. In Son of Strike (SoS, a command-line extension for the Windbg debugger that I hope you never have to use) it showed up as free, empty, unused space! It was just sitting there waiting to be used, but we still ran out of memory. I mentioned earlier that the .NET memory manager isn't so smart when it comes to fragmentation and pinning; well, this is what happens in the worst case.

Lucky for you, the answer to all your fragmentation and pinning woes is quite simple: pre-allocate buffers for anything that will cause pinning, and do it early, before there is a lot of memory thrash (your application rapidly allocating and releasing a lot of memory). We created a simple class called BufferPool that pre-allocates a certain number of buffers. The pool can grow as needed, but it does so in large chunks and forces a garbage collection each time it grows, before the new buffers are actually used. This considerably reduces the chances of fragmentation caused by pinned memory. For example, if the pool starts with 500 buffers and a 501st buffer is needed, it grows by a configurable amount, typically another 500 buffers, and the induced garbage collection shifts these buffers to the lowest possible point on the heap.
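Here is a minimal sketch of a pool along those lines, written in Python as a language-neutral model. It is not SoapBox's BufferPool, whose code isn't shown here. The method names `rent` and `release`, the defaults, and the slab layout are all assumptions. CPython neither pins nor compacts, so the `gc.collect()` call only marks where a .NET version would force a collection.

```python
import gc
import threading

class BufferPool:
    """Hypothetical pre-allocating pool of fixed-size I/O buffers.

    Each growth step makes ONE large bytearray ("slab") and hands out
    memoryview slices of it, so buffers are allocated together, early,
    and in big chunks rather than one at a time under load.
    """

    def __init__(self, buffer_size=2048, initial=500, grow_by=500):
        self.buffer_size = buffer_size
        self.grow_by = grow_by
        self.capacity = 0
        self._free = []
        self._slabs = []
        self._lock = threading.Lock()
        self._grow(initial)

    def _grow(self, count):
        slab = bytearray(self.buffer_size * count)
        view = memoryview(slab)
        self._slabs.append(slab)  # the pool owns its slabs for its whole lifetime
        for i in range(count):
            start = i * self.buffer_size
            self._free.append(view[start:start + self.buffer_size])
        self.capacity += count
        # Stand-in for the .NET step of forcing a collection (GC.Collect())
        # before the new buffers are first used. CPython's gc.collect() only
        # frees reference cycles; it never moves objects.
        gc.collect()

    def rent(self):
        with self._lock:
            if not self._free:
                self._grow(self.grow_by)  # grow in a large chunk, never one by one
            return self._free.pop()

    def release(self, buf):
        if len(buf) != self.buffer_size:
            raise ValueError("buffer has the wrong size for this pool")
        with self._lock:
            self._free.append(buf)

    @property
    def available(self):
        with self._lock:
            return len(self._free)
```

A caller rents a buffer, passes it to something like `socket.recv_into`, and releases it in a `finally` block once the I/O completes. Carving buffers out of one large array is a variation on the approach described above, not necessarily what SoapBox does. In .NET, arrays of 85,000 bytes or more live on the large object heap, which the 2.0-era GC never compacts, so pinning part of such an array does not block compaction of the regular heap.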

Interestingly enough, when we found this bug we already knew about the pinning behavior of socket operations, but had only solved half of the problem. All of our BeginReceive calls used the BufferPool, because we knew those buffers would stay pinned until a client sent us data, but our BeginSend calls did not. We had not even considered that sending a few KB of data might take long enough to pin memory, fragment the heap, and cause an OutOfMemoryException. But there is one case where it does: timeouts. The Windows TCP subsystem is very forgiving. If a client loses its connection and the server isn't explicitly told about it, the next piece of data you try to send to that client socket stays pinned while the TCP subsystem waits for the client to respond. With the default Windows configuration, it can take up to 5 minutes for TCP to figure out the client isn't really there. During that entire time your buffer is pinned in memory. *poof* OutOfMemoryException.
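The 5-minute window matters because pinned send buffers accumulate. Little's law (items in flight = arrival rate × time in the system) gives a rough feel for how many are pinned at once. This Python sketch assumes every send to a dead client stays pinned for the full 300-second worst case. The send rates are hypothetical, not measurements.

```python
def pinned_sends_in_flight(dead_sends_per_sec, pin_seconds=300.0):
    """Little's law (L = lambda * W): average number of send buffers pinned
    at once = rate of sends to dead peers * how long each stays pinned."""
    return dead_sends_per_sec * pin_seconds

for rate in (0.5, 2, 10):
    n = pinned_sends_in_flight(rate)
    print(f"{rate:>4} sends/s to dead clients -> ~{n:.0f} buffers pinned at any moment")
```

Even a modest trickle of dead clients leaves hundreds of pinned buffers scattered through the heap. If those buffers were allocated ad hoc, each one is a potential fragmentation point.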

Unfortunately, pre-allocating buffers does not completely fix the problem of running out of memory. There are also other, very complicated limits on how much of a .NET process's virtual address space you can actually use, which I won't go into. Basically, only 1/2 to 2/3 of it is usable without risking an OutOfMemoryException. So, if you have 2GB of virtual address space (the standard on a 32-bit machine), you end up with about 1.3GB you can reliably use. Some applications will be able to use more, some less. Your mileage may vary.
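Applying that 1/2 to 2/3 rule of thumb is simple arithmetic, sketched here in Python. The 2GB row comes from the text. Applying the same fractions to the /3GB case is an extrapolation and an assumption, not something measured.

```python
GIB = 1024 ** 3

def reliably_usable(address_space, low_fraction=1/2, high_fraction=2/3):
    """Range of address space usable before OutOfMemoryException becomes
    a risk, using the 1/2 to 2/3 rule of thumb."""
    return address_space * low_fraction, address_space * high_fraction

for label, space in (("default 2GB", 2 * GIB), ("/3GB switch", 3 * GIB)):
    low, high = reliably_usable(space)
    print(f"{label}: {low / GIB:.2f} to {high / GIB:.2f} GiB")
```

The 2GB case gives 1.00 to 1.33 GiB, matching the "about 1.3GB" above. Under the stated assumption, the /3GB case would give 1.50 to 2.00 GiB.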

Don't fret: all of the issues I've talked about here have been fixed since SoapBox Server 2005 SR1. And with the most common usage patterns, people were not actually affected to begin with.

I hope this was at least marginally interesting to someone. :) Next up, I'll probably talk about limitations we discovered in the Windows Socket infrastructure, or maybe async IO, IOCP, and worker threadpools, or maybe how in the world we actually test at this scale. Only time will tell, unless Chris beats me to it.

Monday, June 26, 2006
