Showing posts with label tuning. Show all posts
Showing posts with label tuning. Show all posts

Tuesday, March 9, 2010

Mr. Sprite Sheet, Meet Ms. MovieClip

In Youtopia we wanted to have animations. We also wanted to have thousands of buildings on the screen at once. Anyone who has done a lot of Flash development will tell you that these two things are not compatible. You simply cannot create that many movie clip instances and have them playing. But, all is not lost! With a sprite sheet animation system like the one in PushButton Engine (PBE) you can have your cake and eat it too!

Sprite sheets are as old as video games that have pre-rendered art. The basic idea is you draw a bunch of frames of an animation (or multiple animations) and put them on a single image. You can see some examples online. The software knows how to read and draw a frame of the desired animation from that one big graphic, and that's what you see on the screen. Easy!

However, sprite sheets also have their down sides. First, they aren't made to scale. We're talking bitmap graphics here. Your resized image is only going to look as good as your resizing code can make it look (which usually is not very good). Second, animations with a lot of frames can create very large file sizes. A five second animation at 30 frames per second means you end up with all 150 frames of the animation on a single image.

This is where movie clips come in. "But wait," you say with bewilderment, "didn't you tell me earlier we couldn't use movie clips." Why yes I did, so let me explain. Flash was designed from its inception to create small file sizes for downloads. It was also designed to support resizing. This is done using vector graphics which can be animated in a movie clip. We should take advantage of this feature.

One of the bits of code I contributed to PBE was the SWFSpriteSheetComponent. It takes an instance of a MovieClip and converts each frame into a bitmap. It then exposes these bitmaps to the PBE rendering engine as if they were part of a sprite sheet. Viola! You get the best of both worlds. A small download size, the option to render your animation at any scale, and super awesome performance. Using this technique and well drawn vector graphics we were able to save over 5X on the download size of Youtopia as compared to sprite sheets and get the same performance!

NOTE: If you're not familiar with PBE you might want to skim through the docs before diving into my code below. I'd highly recommend the video talks. The composite entity architecture can throw you for a loop if you're not used to seeing it.

Using the SWFSpriteSheetComponent is really easy and I've created an example application to show it off. The application spawns a sleeping Ben Garney every second and makes him move. There are some animated zzz's to indicate that, regardless of the smile, he is in fact sleeping. Ben is the lead developer on PBE, so he's getting picked on. ;) Click here to see the app. You can also download the source and follow along at home.

The process starts by creating and exporting a MovieClip in your fla with a flash.display.MovieClip base class. Give it a class name that you're not going to forget. In this example we call it "z_fx". Then publish the swf and put it somewhere that your PBE application can find it. The example has a "res" folder where I put the effects.swf. The CS3 format effects.fla (created for me by Jesse Tudela here at Hive7) is in the res folder in the download if you want to check it out. Now on to the code!

For easy deployment we embed the effects.swf and the garney.png file using PBE's ResourceBundle like so:

package
{
import com.pblabs.engine.resource.ResourceBundle;

public class Resources extends ResourceBundle
{
public static const EFFECTS_PATH:String = "../res/effects.swf";
public static const GARNEY_PATH:String = "../res/garney.png";

[Embed(source="../res/effects.swf",mimeType='application/octet-stream')]
public var effects:Class;

[Embed(source="../res/garney.png",mimeType='application/octet-stream')]
public var garney:Class;
}
}

Now on to our "main" method. We start off by creating a SceneView, which is the target where PBE will draw stuff. Then we call startup, load our embedded resources, and create the scene. This is all standard PBE initialization. A ThinkingComponent is added to the scene entity in order to spawn Garney instances based on the game's virtual time. And finally, we register our garney entity factory with the TemplateManager and spawn a Garney!

// The SceneView is where PBE will draw to
var sv:SceneView = new SceneView();
addChild(sv);

// Start the logger, processmanager, etc
PBE.startup(this);

// Embed my resources
PBE.addResources(new Resources());

// Create a basic scene through code
var scene:IEntity = PBE.initializeScene(sv);

// ThinkingComponent is an efficient Timer based on virtualTime rather than real time
scene.addComponent(new ThinkingComponent(), "spawnThinker");

// Register the callback for my "garney" template that will be used to instantiate garneys
setupTemplate();

// Spawn a garney then kick off the timer. They keep coming, eeek!
spawn();

The setupTemplate method is where all the interesting stuff happens. In here we choose all the components that make up our entity and determine how they relate to each other.

private function setupTemplate():void
{
PBE.templateManager.registerEntityCallback("garney",
function():IEntity
{
var e:IEntity = PBE.allocateEntity();

// Spatial component knows where to put the garney
var spatial:SimpleSpatialComponent = new SimpleSpatialComponent();
spatial.spatialManager = PBE.spatialManager;

// Rendering component knows how to draw the garney
var render:SpriteRenderer = new SpriteRenderer();
render.fileName = Resources.GARNEY_PATH;
render.positionProperty = new PropertyReference("@spatial.position");
render.scene = PBE.scene;

// Here's the SWFSpriteSheet magic!
var fxSheet:SWFSpriteSheetComponent = new SWFSpriteSheetComponent();
fxSheet.swf = PBE.resourceManager.load(Resources.EFFECTS_PATH, SWFResource) as SWFResource;
fxSheet.clipName = "z_fx";

// Fx Rendering component knows how to draw the z's from the spritesheet
var fxRender:SpriteSheetRenderer = new SpriteSheetRenderer();
fxRender.positionProperty = new PropertyReference("@spatial.position");
fxRender.positionOffset = new Point(30, 10);
fxRender.scene = PBE.scene;

// Need an animation controller to assign and animate the sprite sheet on the renderer
var animator:AnimationController = new AnimationController();
animator.spriteSheetReference = new PropertyReference("@fxRender.spriteSheet");
animator.currentFrameReference = new PropertyReference("@fxRender.spriteIndex");

var idle:AnimationControllerInfo = new AnimationControllerInfo();
idle.loop = true;
idle.spriteSheet = fxSheet;
idle.frameRate = 30; // In PBE your animation framerate can be independent of your stage framerate

animator.animations["idle"] = idle;
animator.defaultAnimation = "idle";

// Garneys self destruct after a random amount of time
var suicide:ThinkingComponent = new ThinkingComponent();

// Add all the components to the entity
e.addComponent(spatial, "spatial");
e.addComponent(render, "render");
e.addComponent(fxSheet, "fxSheet");
e.addComponent(fxRender, "fxRender");
e.addComponent(animator, "animator");
e.addComponent(suicide, "suicide");

e.initialize();
return e;
});
}

The "garney" template is made up of six distinct components. Each of these components perfmorm a small piece of highly specialized work. I created two components for rendering, one for the garney sprite and one for the animated z's. The z's use a SpriteSheetRenderer and a SWFSpriteSheetComponent. Both renderers are positioned based on the position property on the spatial component. The AnimationController is a very powerful class that lets you do things like automatically change out the animation being rendered based on an event firing. But, that's probably a post for another day. In this case it just plays the z's animation on the fxRender component.

All of this gets tied together in the spawn method, which creates a new garney entity based on the template, assigns it some random values for position and velocity, makes sure it draws the most recently spawned entity on top, picks a random time for the entity to commit suicide, and schedules the next spawn.

private function spawn():void
{
// Create a garney!
var garney:IEntity = PBE.templateManager.instantiateEntity("garney");

// Randomly position a garney!
var spatial:SimpleSpatialComponent = garney.lookupComponentByName("spatial") as SimpleSpatialComponent;
spatial.position = new Point(Math.random() * 800, Math.random() * 600);
spatial.velocity = new Point(Math.random() * 50 * (Math.random() < .5 ? -1 : 1), Math.random() * 50 * (Math.random() < .5 ? -1 : 1));

// Choose when this garney commits suicide, up to 20,000 virtual MS from now
var suicide:ThinkingComponent = garney.lookupComponentByName("suicide") as ThinkingComponent;
suicide.think(garney.destroy, Math.random() * 20000);

// Set the zIndex so our components render consistently in spawn-order
var render:DisplayObjectRenderer = garney.lookupComponentByName("render") as DisplayObjectRenderer;
render.zIndex = _zIndex;

var fxRender:DisplayObjectRenderer = garney.lookupComponentByName("fxRender") as DisplayObjectRenderer;
fxRender.zIndex = _zIndex;

// Grab the global spawn thinking component and schedule a think
var thinker:ThinkingComponent = PBE.lookupComponentByName("SceneDB", "spawnThinker") as ThinkingComponent;
thinker.think(spawn, 1000);
}

That's all there is to it! There are some caveats, though. Make sure you keep your source MovieClips really simple. Just like it is CPU intensive to play a complex MovieClip, it is CPU intensive to render each frame to a bitmap. In addition, nested clips with separate timelines and as3 code in the clip that is not based on the timeline will not be executed. This works really well for simple frame based animations, but is not designed for complex interactive clips with tweening. YMMV.

Let me know if you have any questions or if you're interested in any other PBE related topics. Also, PBE 1.0 is out! Download it from the project site. This example includes the 1.0 release swc.

Sunday, November 22, 2009

Angry Players Make Sunday More Interesting

Youtopia has been growing quickly the last couple of weeks. It's fun to watch and the team is really excited about it. Of course, with the growth comes a lot of performance tuning with our code. Today we hit an issue I wasn't expecting at all. . .

We've been running Windows 2008, IIS7, and ASP.NET 3.5 in production for a while now, but haven't had to do much of any performance tuning. It just works, and is fast. Which is awesome!

But today, Youtopia was running slowly and requests were hanging so I investigated. The databases were performing normally and not having any locking issues. The network looked good. The memcached cluster was healthy. The queueing service looked great. The ASP.NET performance counters even looked good at first glance.

None of the diagnostic performance monitors I'd used in the past (such as Requests in Application Queue) showed the issue, but requests were absolutely being queued -- or otherwise not processed immediately. There were also plenty of free worker and IOCP threads. The only thing that clued me in was the Pipeline Instance Count and Requests Executing counters were exactly the same (96) on all the servers. So I started investigating from there.

It turns out that due to the way IIS7 ASP.NET integrated mode threading model functions there is a (configurable) request limit of 12 per CPU. We hit this limit in Youtopia today because we hold open requests for asynchronous Comet-like communications and there were over 288 people online simultaneously. Our three eight core web servers each had 96 (8*12) people connected to them and weren't really serving any other requests. We aren't running into any thread configuration limits as the long running requests are asynchronous and not using ASP.NET worker threads.

Here are a few great links that came out of my research.

With ASP.NET 3.5 SP1 it boils down to a simple configuration file change. Use something like this in the aspnet.config file (in x64 it's at C:\Windows\Microsoft.NET\Framework64\v2.0.50727\aspnet.config). This is the default. Adjust maxConcurrentRequestsPerCPU to suit your needs.

<system.web>
<applicationPool maxConcurrentRequestsPerCPU="12" maxConcurrentThreadsPerCPU="0" requestQueueLimit="5000"/>
</system.web>

In addition, the application pool needs to be configured to allow more requests. By default it only allows 1000 concurrent requests. This is done under the Advanced Settings for the application pool in the IIS 7 manager. Set Queue Length to 5000 to match this system level configuration.

Monday, November 16, 2009

Ditch Your Events (Part 1)

About four months ago Max, Hive7's Lawful Evil CEO, decided we needed to take our games to the next level and build something fun and accessible that everyone who plays "those farming games" would want to play. We all brainstormed, pitched our ideas to the company, and everyone voted by comparing every idea against every other – I wish we had a digital photo of the giant matrix on the whiteboard. There were a bunch of great ideas, but in the end... I won! Youtopia was born.

Youtopia was released to the public about three months from its inception. Hats off to the dev and art team for pulling this one together. A new technology for the developers and fully animated objects for the art team led to much blood, sweat, and tears, but we got 'er done! Of course, we're still actively developing Youtopia, and there are lots of great things planned for the future! But, back to my tech article...

It's been a long time since I've stepped out of my comfort zone and learned a new (to me) technology. Don't get me wrong, I'm always experimenting with the lastest .NET based thingie-ma-bobbers out there, but I haven't used a completely foreign development environment since C#/.NET came out over eight years ago. But for this project I needed to learn Flash/AS3, and it needed to be done yesterday. Luckily for me nobody else on our dev team knew Flash so I could still pretend like I knew what I was talking about and make lots of (un)educated architectural decisions without anyone being the wiser!

One such recent decision was to use an event driven property binding system. Youtopia's engine is based on a great open source game engine, brought to you from some of the Dyamix/GarageGames people, called the PushButton Engine (or PBE). In PBE there is a class called PropertyReference. This class facilitates a late-bound approach for one component to read the value of a property (member variable or getter/setter) on another component. It's a pretty cool pattern, but requires you to poll the target component whenever you want to know if the property changed. This works fine when you're talking about 10's or 100's of components. But in Youtopia we have thousands of entities in the scene at once. We needed this binding to be event-driven.

Of course, with my .NET background I immediately reached for the INotifyPropertyChanged pattern used in .NET's data binding infrastructure. With INotifyPropertyChanged it is the responsibility of the object owning the property to raise an event whenever a property value changes. Any listeners will then immediately know they need to poll for the new value if they want it.

This works great in .NET and is very performant. But in Flash, events are a whole other story. They are an extremely feature-rich subsystem that I don't really want to get into. In the end, all the features and memory allocations when you raise an event lead to poorer performance than we needed for Youtopia. We need every bit of CPU power on that single Flash thread and really shouldn't be wasting it raising events.

So, I shamelessly copied the .NET patterns and brought them over to AS3. Let's start at the core. In order for things to perform their best, I couldn't use the built-in Events. Though Troy did the benchmarking legwork, he didn't provide an implementation we could use to register callbacks and call multiple functions. So, I wrote a MulticastFunction that behaves a whole lot like the MulticastDelegate in .NET. Usage is really straightforward.

var func:MulticastFunction = new MulticastFunction();

//register my listener callback
func.add(
function():void
{
//this callback does amazingly cool stuff
trace("hello from the callback");
});

//calls all the callbacks that have been added, in the order they were added
func.apply();

As you can see, dealing with the MulticastFunction is a lot like the EventDispatcher, but each MulticastFunction is only designed to be used for a single event. So, to use it for events, create a public getter on your class named something reasonable and add your callbacks to it. Done!

Ok, I realize I keep talking about event dispatching speed, but haven't put my money where my mouth is. I wrote some benchmarks of my own and here is the output with a release build, in the latest standalone Flash 10 player. It does five test runs. Download the Source

running tests...
Event dispatching took 848ms
MulticastFunction took 355ms

running tests...
Event dispatching took 846ms
MulticastFunction took 351ms

running tests...
Event dispatching took 834ms
MulticastFunction took 352ms

running tests...
Event dispatching took 836ms
MulticastFunction took 351ms

running tests...
Event dispatching took 823ms
MulticastFunction took 343ms

Yup, that's right. MulticastFunction is nearly 2.5x faster, and I haven't spent much time tuning it. For example, it's using an Array under the hood and doing more work than it needs to during the apply call. Events will also become less performant over time as you have to create (and potentially clone) Event objects for every dispatch, causing a lot of garbage collection pressure. Here's the MulticastFunction, with lots of comments or you can download the source

package com.jdconley
{
/**
* A wrapper that mimics the synchronous behavior of the MulticastDelegate used in .NET for events.
* This doesn't support any of the async methods, as we don't have free threading here.
* It also doesn't support return values.
* See: http://msdn.microsoft.com/en-us/library/system.multicastdelegate.aspx
*/
public class MulticastFunction
{
private var _functions:Array = [];
private var _iterators:int = 0;

/**
* Adds a function to be called when apply is called.
* If the function is already in the list it won't be added twice.
* Returns true if the function was added.
**/
public function add(func:Function):Boolean
{
var i:int = _functions.indexOf(func);
if (i > -1)
return false;

//add new functions to the end so they are picked up live during an apply
_functions.push(func);
return true;
}

/**
* Removes a function to be called when apply is called.
* Returns true if the function was removed.
**/
public function remove(func:Function):Boolean
{
var i:int = _functions.indexOf(func);
if (i < 0)
return false;

if (_iterators == 0)
_functions.splice(i, 1);
else
_functions[i] = null;

return true;
}

/**
* Synchronously applies all functions that have been added.
* Functions can be safely added or removed during an apply and changes will take effect immediately.
* Added functions will be called, and removed functions will not.
**/
public function apply(thisArg:*=null, argArray:*=null):void
{
_iterators ;
var holes:Boolean = false;

for (var i:int = 0; i < _functions.length; i )
{
var f:Function = _functions[i];
if (f == null)
holes = true;
else
f.apply(thisArg, argArray);
}

//cleanup holes left by removing functions during this apply call.
//if any of the function apply's throw an error the state of _iterators will be off.
//but, we'll only leak array slot memory if functions are removed.
//putting a try/finally or try/catch block here significantly decreases performance.
if (--_iterators == 0 && holes)
{
for (i = _functions.length - 1; i >= 0; i--)
{
if (_functions[i] == null)
_functions.splice(i, 1);
}
}
}

/**
* Removes all functions from the list. Stops the current apply call, if there is one.
**/
public function clear():void
{
_functions = [];
}
}
}

Although capture, bubble, weak references, and priority are handy features of the Flash eventing system, they're not always necessary and will hurt your performance when you might have thousands of them firing per frame.

In Part 2 we'll put this MulticastFunction to use in a more meaningful way with the INotifyPropertyChanged implementation.

Thursday, March 12, 2009

ioDrive, Changing the Way You Code

Introduction

In my lifetime there have been very few technologies that have created a paradigm shift in the software industry – I was born just after the spinning magnetic hard drive was created. Off the top of my head I can think of: the Internet (thanks Al!), optical disks, Windows, and parallel computing. From each of these technologies entirely new software industries were born and development methodology drastically changed. We're at the beginning of another such change, this time in data storage.

Contents
Random Story for Context

At Hive7 we make web based social games for platforms like Facebook and Myspace. We're a tiny startup, but producing a successful game on these platforms means we're writing code to deal with millions of monthly users, and thousands of simultaneous users pounding away at our games. Because our games are web based they're basically written like you'd write any other web application. They're stateless, with multiple RDBMS back end servers for most of the data storage. Game state is pretty small so we don't really store that much data per user. We don't have Google sized problems to solve or anything. Our main problem is with speed.

When you're surfing the web you want it to be fast but can live with a page taking a few second to load here and there. When you're playing a game, on the other hand, you want instant gratification. A full second is just way too long to wait to see the results of your action. Your character's life might be on the line!

To accomplish this speed in our games we currently buy high end commodity hardware for our database servers, and have a huge cluster of memcached that we tap into. It works. But, properly implementing caching is complex. And those DB servers are big 3U power hungry monsters! Here's a typical disk configuration of one of our DB servers:



Each of those drives are 15k RPM 72 GB SAS (or whatever the fastest is at the time of build). And the RAID controllers are very high end with loads of cache. And here's the kicker! We can only use about 25% of the capacity of these arrays before the database write load gets too high and performance starts to suffer. They cost us about $10k a piece. Sure, there are much more complex architectures we could use to gain performance. Or we could spend a few hundred grand and pick up a good SAN of some sort. Or we could drop some coin for Samsung SSD's. But, those options are a bit out of the price we want to spend for our hardware, not to mention the necessary rack space and power requirements.

Enter the ioDrive. With read/write speeds that are very close to the 24 SSD monster that Samsung recently touted, at a way lower price, I have a hard time imagining choosing the 24 drive option. Maybe if you had massive storage requirements, but for pure performance you can't beat the ioDrive price/performance ratio right now. I don't remember if I'm allowed to comment on pricing, but you can contact a sales rep at Fusion-io for more info.

Last month we picked up one of these bad boys for testing. In summary, "WOW!" I spent a few hours this week putting the ioDrive through the ringer and comparing it to a couple different disk configurations in our datacenter. My main goal was to see if this is a viable option to help us consolidate databases and/or speed up existing servers.

The Configuration

ioDrive System (my workstation)

  • Windows Server 2008 x64 Standard Edition
  • 4 CPU Cores
  • 6 GB Ram
  • 80 GB ioDrive
  • Log and Data files on same drive

Fast Disk System

  • Windows Server 2008 x64 Standard Edition
  • 8 CPU Cores
  • 8 GB Ram
  • 16 15k RPM 72 GB SAS Drives (visualization above)
  • Log and Data files on different arrays

Big and Slow Disk System

  • Windows Server 2008 x64 Standard Edition
  • 4 CPU Cores
  • 8 GB Ram
  • 12 7200 RPM 500 GB SATA Drives
  • Log and Data files on different arrays
Test Configuration

For this test I used SQLIOSim with two five minute test runs. We were really only interested in simulating database workloads. If you want a more comprehensive set of tests check out Tom's Hardware. I should also mention that this was obviously not a test of equals. Both disk based systems have a clear RAM advantage and the fast disk system has a clear CPU advantage. The hardware chipsets and CPU's are also slightly different, but they're the same generation of Intel chips. In any case, when you see the results you'll see how this had a negligible effect. We're talking orders of magnitude differences in performance here...

I ran two different configurations through SQLIOSim. One was the "Default" configuration that ships with the tool. It represents a pretty typical load on a SQL Server disk system for a general use SQL server. The other was one I created called "Write Heavy Memory Constrained". The write heavy one was designed to simulate the usage in a typical game, where, due to caching, we have way more writes than reads to a database. Also, the write heavy one is much more parallel. It uses 100 simulated simultaneous random access users where the default one has only 8. And, with the write heavy one there is no chance the entire data set can be cached in memory. It puts a serious strain on the disk subsystem.

I took the output from SQLIOSim and imported it into Excel to do some analysis. I was primarily concerned with two metrics: IO Duration and IO Operation count. These two things tell me all I need to know. First, how long does it take the device to perform IO on average, and how many can it get done in the given time period.

Test Results

Write Heavy Memory Constrained Workload
Metric ioDrive Slow Disks Fast Disks
Total IO Operations 10,625,381 1,309,673 3,260,725
Total IO Time (ms) 17,625,337 1,730,147,612 356,839,912
Cumulative Avg IO Duration (ms) 1.66 1,321.05 109.44


Wow, 100x faster IO's on average!


Over 20x less time spent doing IO operations!


And over 3x more operations performed. This would have been way higher, but the ioDrive system was CPU constrained, taking 100% CPU. Looks like we'll be loading up at least 8 cores in any database servers we build with these cards!

Default Workload
Metric ioDrive Slow Disks Fast Disks
Total IO Operations 690,753 287,180 456,300
Total IO Time (ms) 3,616,903 231,859,576 93,991,055
Cumulative Avg IO Duration (ms) 5.24 807.37 205.99


40x faster on average in this workload! Looks like the bulk operations and larger IO's present in this workload narrowed the gap a bit.


This time, a little under 30x less time spent doing IO operations!


Only 1.5x more total operations this round. This time we weren't CPU constrained, and I didn't take the time to dig in to the "why" on this one. Based on the raw data I would guess this is caused by IO blocking a lot more often for ioDrive than the fast RAID system. This probably has to do with the caching system in the RAID cards under this mixed write workload. You'll notice if you look at the raw report, that the ioDrive has no read or write cache at the device level. It doesn't really need it.

In case you want to see the raw data or the SQLIOSim configuration files, you can download the package here: ioDrive Test Results

Conclusion

Wow! ioDrive is going to be scary fast in a database server, especially when it comes to tiny random write IO's, parallelism, and memory constraints. I think we'll be seeing a lot of new interesting software development and system architectures due to this type of technology. The industry is changing. You no longer need either tons of cache (or cash) or tons of RAM to get great performance out of your data store. We're talking 100x better performance than our fast commodity arrays. I think it's safe to say we'll be using these devices in production in the near future. Since this device is currently plugged into my workstation, maybe I'll post another review about how it's improving my development productivity so you can convince your boss to buy you one. :)

Wednesday, March 19, 2008

A SQL Table Manhunt

As I mentioned in my last post I recently took on the exciting new position of Chief Software Architect at Hive7, Inc. We're building all kinds of great stuff. Our most popular game Knighthood has over a million registered users and over 100,000 daily actives. This game is growing quickly. Over 125,000 people added the game two weeks ago, and over 150,000 added it in the last week. The game came into existence in December.

This massive growth leads to some exciting scalability challenges. I'll be spending a lot of time talking about that in the future. Today, is a simple tidbit related to databases. Our current performance bottleneck is with database write I/O. We have enough memory in the systems and a caching layer, so the disks barely need to read. Tracking this down is a whole other post, but it's fairly simple. Once we knew we were write I/O limited we set out to find out why.

The original DB physical layout started out pretty simple. There was one File for data, one for logs. In the next 3 or 4 revisions more and more files were created. Why? Well, so we could run this nifty little query and find out which of our db tables/indexes/etc were causing the write bottlenecks:

select
db.name as DbName,
f.name as FileName,
f.physical_name as FilePhysicalName,
vf.TimeStamp,
vf.NumberReads,
vf.BytesRead,
vf.IoStallReadMS,
vf.NumberWrites,
vf.BytesWritten,
vf.IoStallWriteMS,
vf.BytesOnDisk
from fn_virtualfilestats(-1,-1) vf
inner join sys.databases db on db.database_id = vf.DbId
inner join sys.database_files f on f.file_id = vf.FileId
order by vf.NumberWrites desc

If you have physically separated your various database tables and indexes into different files, the output from this function will give you all kinds of useful information about which ones are most accessed, and which put the most strain on your I/O subsystem. Optimizing it, of course, is up to you. :)

If you enjoy big scale, fast moving, tough problems, we're hiring for a Lead Web Designer and Brilliant Lead DBA/Sysadmin and Web Games Developer (.NET)!

Wednesday, November 28, 2007

Linq to Sql Surprise Performance Hit

A couple days ago I built Photo Feeds. I wrote about it here. After using the app a bit I noticed something peculiar. The page took a really long time to load. From my local web server it was on the order of 2-5 seconds. That's way too slow.

Having written the code a few hours prior, I immediately knew what was wrong. I thought "Well, duh, either Facebook sucks, I'm hitting my database too often, or I'm calling the Facebook service too often." I/O is, after all, usually what causes an application this simple to be slow. I glanced over the code and thought to myself "Holy crap, who wrote this junk? It's calling the database waaaaay too much. That has to be the problem."

Having been down this path many times I decided to analyze a bit more before officially jumping to conclusions. I popped open task manager to see if I was in fact I/O bound (i.e. to see if my CPU was racing or idle) – usually a good rough indicator of what's going on. To my jaw-dropped amazement the CPU was hitting 100% for the majority of the page request time. "Could it be the JIT?" I thought. So I hit refresh 50 times and watched in amazement as the CPU spiked each time. I switched over to the Processes tab and, again to my amazement, noticed it was the web server chewing up the cycles, and NOT SQL server. Did I mention I was amazed?

Ok, so apparently I was wrong [insert shock and awe]. The web site was way too CPU hungry! Good thing I didn't put any bets on that one. I then theorized that maybe Cassini just sucked (my code couldn't be that bad). I whipped up a virtual directory in IIS and ran my tests again.

Nope, it wasn't Cassini. Alright, so now let's get scientific. I pop open ANTS and setup a profiling session on my virtual directory. I loaded the page a few times and, lo and behold, the worst offender screen pops up:


Wow! That's a long time (and a lot of hits). I was definitely right. I'm hitting the database way too often. But, I know I saw the CPU racing... What the heck?

Here's the offending code:

public static Feed GetOrCreateFeed(
FeedsDataContext dc,
string owner,
FeedSourceType sourceType,
string facebookId)
{
var q = from f in dc.Feeds
where f.OwnerId == owner &&
f.FeedSourceType == (byte)sourceType &&
f.FacebookId == facebookId
select f;

Feed existingItem = q.SingleOrDefault();

...
}

So I dug into the call graph a bit and found out the code causing by far the most damage was the creation of the LINQ query object for every call! The actual round trip to the database paled in comparison. Now that was, again, a huge surprise. Check out the hit counts on this call – holy cow!


1,176,879 calls to SqlVisitor.Visit. Over a million calls to anything for loading a page three times can't be good.

I started doing some research. I remembered reading about using compiled LINQ to SQL queries for optimization a few months ago. I did a little more searching and ran into this gem for anybody scared of lambda functions. Warning, scary .NET 3.5 lambda functions ahead. Here's the magic fix:

private static readonly Func<
FeedsDataContext, string, FeedSourceType,
string, IQueryable<Feed>> _compiledGet =
CompiledQuery.Compile(
(FeedsDataContext dc, string owner,
FeedSourceType sourceType, string facebookId) =>
from f in dc.Feeds
where f.OwnerId == owner &&
f.FeedSourceType == (byte)sourceType &&
f.FacebookId == facebookId
select f);

public static Feed GetOrCreateFeed(
FeedsDataContext dc,
string owner,
FeedSourceType sourceType,
string facebookId)
{
Feed existingItem = _compiledGet(
dc, owner, sourceType, facebookId)
.SingleOrDefault();

...
}

If that's your first time seeing lambdas in C#, I feel for you. I'd suggest trying them out and doing a lot of research. That simple change yielded the following result:



That method is still by far the worst offender in the application, but is five times faster than it was and barely registers on the CPU when the page loads. This one change netted sub second page rendering times which is good enough for me. Obviously I still have major issues, hitting the database way too often, but I'm a big believer in not doing any premature optimization. But now, at least I'm back in my comfort zone and am I/O bound.

So, what have we learned? Never, ever, ever, ever, ever, ever, ever, ever, ever blindly edit code to fix performance issues. Always run it through a profiler. No matter how much experience you have performance tuning, the worst offenders will still surprise you 99% of the time. Oh, and sometimes LINQ to SQL is slow at creating its query objects. Use compiled queries in these cases!

Friday, September 28, 2007

Asyncify Your Code

Asyncify your code. Everybody's doing it. (Chicks|Dudes)'ll dig it. It'll make you cool.

Pretty much everything I build these days is asynchronous in nature. In SoapBox products we are often waiting on some sort of IO to complete. We wait for XMPP data to be sent and received, database queries to complete, log files to be written, DNS servers to respond, .NET to negotiate Tls through a SslStream, and much more. Today I'll be talking about a recent walk down Asynchronous Lane: the AsynchronousProcessGate (if you don't like reading just download the package for source code goodness).

I ran into a problem while working on a new web application for Coversant. I needed to execute an extremely CPU and IO intesive process: creating and digitally signing a self extracting compressed file -- AKA The Package Service. This had to happen in an external process, and it had to scale (this application is publicly available on our consumer facing web site). Here's a basic sequence of the design I came up with:


Do you notice the large holes in the activation times? That's because we're asynchronous! The BeginCreatePackage web service method the page calls exits as soon as the BeginExecute method exits, which is as right when the process starts. That means we're not tying up any threads in our .NET threadpools at any layer of our application during the time a task is executing. That's a Good Thing™.

At this point I'm used to writing highly asynchronous/threaded code. However, I still wouldn't call it easy. Why do it? I'd say there are three main reasons.

  1. To provide a smooth user experience. The last thing a developer wants is for his/her software to appear sluggish. There's nothing worse than opening Windows Explorer and watching your screen turn white (that application is NOT very asynchronous).
  2. To fully and most appropriately utilize the resources of the platform (Runtime/OS/Hardware). To scale vertically, you might call it.
  3. Because it makes you cool. AKA: To bill a lot more on consulting engagements.

Microsoft recommends two asynchronous design patterns for .NET developers exposing Asynchronous interfaces. These can be found on various classes throughout the framework. The Event Based pattern comes highly recommended from Microsoft and can be found all over new components they build (like the BackgroundWorker). Personally I think the event based pattern is overrated. The hassle of managing events and not knowing if the completed event will even fire typically steers me away from this one. However, it is certainly easier for those who are new to the asynchronous world. This pattern is also quite useful in many situations in Windows Forms and ASP.NET applications, leaving the responsibility of the thread switching to the asynchronous implementation (the events are supposed to be called in the thread/context that made the Async request -- determined by the AsyncOperationsManager). If you've ever used the ISynchronizeInvoke interface on a Winforms Control or manually done Async ASP.NET Pages you can really appreciate the ease of use of this new pattern...

The second recommended pattern, and usually my preference, is called the IAsyncResult pattern. IAsyncResult and I have a very serious love/hate relationship. I've spent many days with my IM status reading "Busy - Asyncifying" due to this one. But, in the end, it produces a simple interface for performing asynchronous operations and a callback when the operation is complete (or maybe timed out or canceled). Typically you'll find IAsyncResult interfaces on the more "hard core" areas of the framework exposing operations such as Sockets, File IO, and streams in general. This is the pattern I used for the Asynchronous Process Gate in the Package Service.

The Package Service has a user interface (an AJAXified asynchronous ASP.NET 2.0 page) which calls an asynchronous web service. The web service calls another asynchronous class which wraps a few asynchronous operations through the AsynchronousProcessGate and other async methods (i.e. to register a new user account) and exposes a single IAsyncResult interface to the web service.

Confused yet? Read that last paragraph again and re-look at the sequence. In order to make this whole thing scale it had to be asynchronous or we'd be buying a whole rack of servers to support even a modest load. Also because of the nature of the asynchronous operation (high cpu/disk IO) it had to be configurably queued/throttled. I went through a few possible designs on paper. But in the end I chose to push it down as far as possible. The AsynchronousProcessGate, quite simply, only allows a set number of processes to execute simultaneously, the number of CPU's reported by System.Environment.ProcessorCount by default. It does this by exposing the IAsyncResult pattern for familiar consumption. The piece of magic used internally is something we came up with after writing a lot of asynchronous code: LazyAsyncResult<T>.

LazyAsyncResult<T> provides a generic implementation of IAsyncResult. It manages your state, your caller's state, and the completion events. It also uses Joe Duffy's LazyInit stuff for better performance (initializing the WaitHandle is relatively expensive and usually not needed).

Using the asynchronous process gate is straight forward if you're used to the Begin/End IAsyncResult pattern. You create an instance of the class, and call BeginExecuteProcess with your ProcessStartInfo. When the process is complete you will get your AsyncCallback, or you can also wait on the IAsyncResult.WaitHandle that is returned from BeginExecuteProcess. You then call EndExecuteProcess and the instance of Process that was used is returned. If an exception occurred asynchronously, it will be thrown when you call EndExecuteProcess.

The Begin Code:
static void StartProcesses()
{
AsynchronousProcessGate g = new AsynchronousProcessGate();
while (!_shutdown)
{
//keep twice as many queued as we have cpu's.
//for a real, CPU or IO intensive, operation
//you shouldn't do any throttling before the gate.
//that's what the gate is for!
if (g.PendingCount < g.AllowedInstances * 2)
g.BeginExecuteProcess(
new ProcessStartInfo("notepad.exe"),
10000,
ProcessCompleted,
g);
else
System.Threading.Thread.Sleep(100);
}
}
The End Code:
static void ProcessCompleted(IAsyncResult ar)
{
try
{
AsynchronousProcessGate g =
(AsynchronousProcessGate)ar.AsyncState;

using (Process p = g.EndExecuteProcess(ar))
Console.WriteLine("Exited with code: " +
p.ExitCode + ". " +
g.PendingCount + " notepads pending.");
}
catch (Exception ex)
{
Console.WriteLine("("
+ ex.GetType().ToString()
+ ") - " ex.Message);
}
}

Phew! After all that, the end result for SoapBox: a single self extracting digitally signed file someone can download. Oh, and a simple library you can use as an Asynchronous Process Gate! Enjoy. Look, another download link so you don't even have to scroll back up. How nice am I?

Tuesday, April 10, 2007

Building Reactive User Interfaces in .NET: ISynchronizeInvoke on Idle Time

One of the most common things to do in a multi-threaded .NET Windows Forms application is to use the ISynchronizeInvoke interface on a Control to marshal things into the UI thread. The basic jist of things is: you can have the Windows Forms engine call your delegate on the thread that created the Control instance you're using. It does this using the window message pump. Typical uses of the ISynchronizeInvoke are: you want to read some data from a file, database, socket, or web service and you want your application to be responsive while you do it. When you're done you do a BeginInvoke/Invoke on the Control to update some UI elements. Well, sometimes using a Control's ISynchronizeInvoke leads to hung GUI's.

If you have a lot of asynchronous operations pending and they complete in bunches the affect on your UI can be the appearance of a hang or sluggishness as drawing is queued behind the delegates you registered with BeginInvoke on your control. During the login sequeunce in SoapBox Communicator there are potentially thousands of events that need to be processed on the UI thread. Yup, thousands. Your roster arrives from the server. You receive the current avialability from every online person on your roster. You got 10 messages while you were offline. Each of your contact's cached profiles (avatars, names, client capabilities, etc) are read from disk and compared with the user's current presence. This onslought of activity can be handled a number of ways. The basic principles I stumbled into after much trial and error is:

  1. Never, ever, ever, ever, ever do IO on a UI thread. Even something as simple as reading a 1KB image from disk can bring your UI to a halt under the right circumstances.
  2. Use timers and update common pieces of UI (like list views) in batches whose changes were caused changes by background operations. This will reduce flicker as you invalidate areas to redraw.
  3. Use Application.Idle to your advantage. (see my take on this below)
  4. Be careful how many window handles you create. Controls are useful, but sometimes you've just gotta draw your own.
  5. Profile your code where it seems sluggish.

I mentioned above that you should use Application.Idle to your advantage. The articles I've found on it all mention coding precise things you want to do in that event, like updating a single form. I wanted to do it more generically. So, I created an implementation of ISynchronizeInvoke that uses the Application.Idle event to process qeued items. I've created a (only slightly) contrived example that uses a Pi Calculator I found and an animated Gif to demonstrate a hanging GUI.


Both buttons calculate Pi to 50 digits on the UI thread using 100 separate ISynchronizeInvoke.BeginInvoke calls. The "Idle Invoke" button does so using my ApplicationIdleSynchronizer. The "Direct Invoke" button calls BeginInvoke on the Form directly. You can click the buttons any number of times and more and more Pi calculation runs will be queued. The completed label is incremented after every run. Here's the code:

    public partial class Form1 : Form
{
private const int WorkItems = 100;
private const int PiDigitsToCalc = 50;

private int _workItemsCompleted = 0;

private ApplicationIdleSynchronizer _idleSynchronizer = new ApplicationIdleSynchronizer();

public Form1()
{
InitializeComponent();
}

protected override void OnClosed(EventArgs e)
{
base.OnClosed(e);
_idleSynchronizer.Dispose();
}

private void button1_Click(object sender, EventArgs e)
{
System.Threading.ThreadPool.QueueUserWorkItem(QueueOnBackgroundThread, _idleSynchronizer);
}

private void button2_Click(object sender, EventArgs e)
{
System.Threading.ThreadPool.QueueUserWorkItem(QueueOnBackgroundThread, this);
}

private void QueueOnBackgroundThread(object state)
{
for (int i = 0; i < WorkItems; i )
{
((ISynchronizeInvoke)state).BeginInvoke(new ThreadStart(MyWorkItem), null);
}
}

private void MyWorkItem()
{
PiCalculator.CalculatePi(PiDigitsToCalc);
_workItemsCompleted++;
label2.Text = _workItemsCompleted.ToString();
}
}

Click the buttons. Move the form around. Resize it. You might be suprised by the result (or not). Both very adequately use your CPU, but the Idle Invoke produces a much more reactive UI. You can download the full source code here: ApplicationIdleInvoker.zip. Note, this exact code isn't in use in production. I wrote it for this blog. YMMV. Let me know if it has any issues.

Monday, June 26, 2006

How to Build Scalable .NET Server Applications: Memory Management

I'll get this out of the way from the start. This series of blogs will have nothing to do with ASP.NET or web services. However, if you plan on writing you own implementation of IIS in managed code this would probably be a good place to start. :) I also won't be providing very many code examples, as I'd be flogged by our intellectual property lawyers. You will not be able to copy and paste and create your own scalable server. However, I hope to provide enough insight so you can avoid a big list of gotchas we have had to figure out the hard way. This is one piece of a huge puzzle, memory management. Yes, you do have to think about that in .NET, at least if you want to build a large scale application.

For those who don't already know, SoapBox Server is a part of our SoapBox Collaboration Platform that supports the XMPP protocol as well most of the interesting JEP extensions. At the core of SoapBox Server is a highly efficient Socket server and thread machine capable of scaling into the hundreds of thousands of simultaneous users, and it's built 100% on .NET (C# now, but used to be VB).

SoapBox Server is the first multithreaded Socket based server application I've had the pleasure of working on. During the course of building the SoapBox Server into the extremely scalable and reliable system it is today I've learned a few things (as has the rest of the team, I hope). Thanks to Chris (who already had tons of experience with such things in Win32/C++), a few bloggers out there, some books, customers finding very interesting bugs, Windbg with Son of Strike, oh and Starbucks, I'd say I'm pretty well versed in the land of building scalable server applications. I'm no Jeff Richter, mind you, but I feel I have now learned enough to at least speak intelligently about it.

In that spirit I'd like to share the fruits of our tuning and debugging work, which, if history repeats itself, will continue to evolve as we begin work on our next major revision of the product. First, I'd like to repeat something I said a couple paragraphs ago, SoapBox now scales to hundreds of thousands of simultaneous connections with a single piece of server hardware. Think about that for a second. A user brings up an IM client, connects to SoapBox Server, and then holds that connection open until they Log Out. Repeat hundreds of thousands of times. This is no simple task. The .NET CLR does not provide a magic "Process.Scalable = true" property. We have invested hundreds of hours into tuning (maybe thousands) over the life of the server on classes of hardware varying from single processor laptops to 16-way Itanium2 systems with 64GB RAM. We've been through four distinct processing models as well as quite a few iterative improvements on our Socket interaction layer. Basically we have ran the server under a bunch of different profilers under many scenarios, found slow bits of code, and fixed them. But I'm not going to talk about profiling and performance tuning; perhaps another time. I'm going to talk about memory and scalable applications.

Every time your application creates a new Socket, Windows pulls memory from it's Nonpaged Kernel memory, which is simply physical memory that is reserved by the kernel and will never be paged out to disk. This block of memory has a finite limit and the kernel picks the limit based on the amount of phsyical RAM available to it. I don't know the exact algorithm, but with 4GB RAM it's usually somewhere around 150,000 TCP Socket connections, give or take. Want to see this in action? Simply create a loop that instantiates sockets. It will stop working eventually with a SocketException telling you there isn't enough buffer space. On top of this hard kernel level limitation, you also have to worry about how much memory each concurrent connection uses in your own application. In SoapBox we store a lot of information about each connection in memory in order to improve performance and decrease our IO operations. This includes things like the user's contact list, their last presence (available, away, busy, etc), authorization information, culture information, user directory information, etc. If we didn't hold this in memory we'd have to hit a file, database, or some other out of process persistent store for the information every time we needed it. Being IO bound is no fun. Believe me, we started out that way.

However, because of our extensive caching, SoapBox Server 2005 can only reliably handle about 20,000 simultaneous connections on the beefiest of 32 bit hardware (on 64 bit it's much, much, much higher -- I also have to admit we haven't stress tested the 2007 build on 32 bit hardware, it would probably be much higher now). It doesn't matter if you have 64GB RAM and 16 32bit processors, it we can still only handle 20,000 connections. Why, you ask? Well, it's because of the 2GB (well, really 3GB with a boot.ini switch) virtual memory limit per process in 32bit Windows. Without delving into managing your own memory your process is only allowed up to 3GB to play with. Typically, we use that up, or rather, .NET thinks we use it up, somewhere between 20,000 and 30,000 connections. Now why would I say ".NET thinks we use it up?" Story time!

A little over a year ago one of our customers kept running into a very bad situation. As evidenced by the Event Log, SoapBox Server was crashing (insert shock and awe here). It was an irregular occurance, but it did happen. However, we did no take this lightly. This customer was running about 2,500 simultaneous connections on a Dual Xeon with Hyperthreading and 4GB ram and the /3GB switch set. It was plenty of hardware for the job, and probably overkill. However, the service was still crashing. We set them up with the Debugging Tools For Windows and had them startup the process to wait for a crash (another blog we'll have to write some day). After a few tries we got a dump with some useful information in it. The result? We were out of memory, sort of.

In .NET when you call any socket operation and pass it a buffer, whether it be a send or receive, synchronous or asyncronous, it takes that buffer and pins it before giving it to the Winsock API's. Pinning, in a nutshell, is taking a .NET data structure and telling the .NET CLR memory manager not to move it, until it is explicitly un-pinned. The memory manager in the CLR is smart. As you allocate and deallocate memory it is constantly defragmenting it for you so the overall memory footprint is lower. There are quite a few really good/long/complicated articles on how this works so I won't bore you. However, pinning throws a wrench in this and the memory manager isn't quite smart enough to deal with it well (though it has gotten a lot better in 2.0). Basically, that buffer you want to put on the socket cannot move in memory (physically -- in terms of you virtual memory space) from the time the socket IO operation begins until it ends. If you look at the Winsock2 API's this is obvious, since the buffer is passed as a pointer. Anybody who's built this type of application in Winsock2 is probably saying "DUH!". I'd consider this a very leaky abstraction. Due to this behavior, it is quite easy to write a socket application in .NET that runs out of memory.

Back to the story! Not only were we out of memory, but the there was only about 200MB worth of data structures in the heap. For those of you like me that use calc.exe for all your basic math let me figure that out for you, 200MB > 3GB. Uhh, say what? How the heck were we out of memory? Well, we ran into the shortfall of pinning and memory fragmentation. The cause of this was a small number of small pinned buffers, in our case 2KB each, that were high enough in the heap to cause fragmentation spanning over 2.8GB. Where did the other 2.8GB go, you ask? Well, is was there, allocated by our process, but not being used by our code. In Son Of Strike (SoS -- a command line plug-in to the Windbg debugging tool I hope you never have to use) this showed up as free, empty, unused space! It was just sitting there waiting to be used, but we still ran out of memory. I think I mentioned earlier the memory manager in .NET isn't so smart when it comes to fragmented memory and pinning, well, this is what happens in the worst case.

Good thing for you, the answer to all your memory fragmentation and pinning woes is quite simple. Pre-allocate buffers for use by anything that will be causing pinning, and do it early on before there is a lot of memory thrash (when your application is rapidly allocating and deallocating a lot of memory). We created a simple class called a BufferPool that we use to pre-allocate a certain number of buffers. This pool can grow as need be, but it does so in large chunks and forces a garbage collection each time before the buffers are actually used. This considerably reduces the chances of fragmentation caused by pinned memory. If the pool starts off with 500 buffers, but then the 501st buffer is needed it will grow by a configurable value, typically another 500 buffers, and the induced garbage collection will cause these buffers to shift to the lowest possible point on the heap.

Interestingly enough when we found this bug we already knew about the pinning behavior of socket operations, but had only solved half of it. All of our BeginReceive calls were using the BufferPool because we knew the buffers would remain pinned until we received data from a client, but the BeginSend calls were not using the pool. We had not even considered the fact that sending a few KB of data might take long enough to pin memory, fragment the heap, and cause an OutOfMemoryException. But there is one case where they do, timeouts. The Windows TCP subsystem is very forgiving. If a client loses its connection and the server isn't explicitly told about it, the next piece of data you try to send to that client socket will end up being pinned while the TCP subsystem waits for the client to respond. It can take up to 5 minutes with the default configuration of Windows for the TCP subsystem to figure out the client isn't really there. During that entire time your buffer is pinned in memory. *poof* OutOfMemoryException.

Unfortunately, pre-allocating buffers does not completely fix the issue of running out of memory. There are also some other limits to the size of a .NET process's virtual memory space that are very complicated and I won't talk about, but basically you end up with anywhere from 1/2 to 2/3 usable virtual memory without running the risk of OutOfMemoryException. So, if you have 2GB virtual memory available (standard on a 32bit machine), you end up with about 1.3GB you can actually use reliably. Of course, this varies, and some applications will be able to use more, or maybe less. Your mileage may vary.

Don't fret, all of the issues I've talked about in here have been fixed since SoapBox Server 2005 SR1. And with the most common usage patterns people were not actually affected to begin with.

I hope this was at least marginally interesting to someone. :) Next up, I'll probably talk about limitations we discovered in the Windows Socket infrastructure, or maybe async IO, IOCP, and worker threadpools, or maybe how in the world we actually test at this scale. Only time will tell, unless Chris beats me to it.

Monday, May 8, 2006

FakeOutTheUserToThinkWeDontUseAnyMemory

There comes a time in every project where the developers realize we are building software for the users, rather than for ourselves. A user's perception can be the difference between a good and a bad reference, and we all know how detrimental bad word of mouth can be. This unfortunate reality hit me square in the face recently when I was told by a customer that "your application is bloatware".

Any desktop application with a user interface, written in .NET, that does anything interesting, can easily be mistaken for bloatware. It's quite easy to create a super elegant application with no memory leaks that appears to use 50MB or more of memory. I say appears, because the figure everyone sees in Task Manager is the "Working Set" size. Users (myself included, up until recently) see large working set sizes as a sign of bloatware and poor programming.

This is simply not the case. The working set is more along the lines of the amount of physical memory the OS thinks your application might need or needed at one point, including shared memory and all sorts of other complicated things that .NET developers aren't supposed to have to think about. If your system was in need of physical memory for other processes much of this perceived bloat would either be reclaimed and put to better use, or paged out to disk.

At the end of the day, the reality of the situation doesn't matter. Your users think your application is bloated. What do you do? Well, you FakeOutTheUserToThinkWeDontUseAnyMemory.


using System.Diagnostics;

namespace Coversant.Utility {
public static class MemoryUtility
{
private static volatile bool _enabled = true;

public static void FakeOutTheUserToThinkWeDontUseAnyMemory()
{
if (!_enabled)
return;

try
{
Process curProc = Process.GetCurrentProcess();
curProc.MaxWorkingSet = curProc.MaxWorkingSet;
}
catch
{
//Some users won't have permission to adjust their working set.
_enabled = false;
}
}
}
}

Yep, that's it. Call that method (.NET 2.0 only -- in 1.x you had to P/Invoke) and watch the magic happen. In our applications we set it up on a timer that runs every 30 seconds and after any events we know will raise the working set, usually after loading new assemblies or after a window is closed . Running this code causes Windows to free up as much of the working set as possible. Usually this sends most of your bloat to the page file where it will remain forever. In our case, the application had a 50MB working set and really only needed about 10MB of physical memory after it was running. There is, however, one big gotcha. An application can only attempt to adjust its working set if it is running with appropriate permissions (typically a local Administrator).

Yes, your users perceptions are reality. Using this trick/hack helps keep reality in line.

About the Author

JD Conley is an entrepreneur and hacker, currently working away his golden handcuffs at Playdom, a subsidiary of the Walt Disney Company, since Hive7 was acquired. We make social games. The views and opinions expressed on this post are his and do not necessarily represent or reflect those of The Walt Disney Company.