We’ve started to research how best to scale our WCF service, especially in the event of a middleware failure that causes the kind of latency typical of IO activity. The initial problem we found was that when an IIS-hosted WCF service receives a request, an IIS worker thread handles the request and hands the work off to an IO thread, then blocks until the IO thread completes. Under high, unexpected IO latency, all the worker threads can quickly end up blocked on IO threads, causing the server to reject new requests. One way to alleviate this problem is to reduce the number of worker threads used and increase the number of IO threads spawned. This would allow the server to continue servicing new requests while the middleware recovers.

So I began to investigate how to improve performance in this area: not necessarily to increase throughput, but rather to protect a server from crashing in the event of middleware failure.

A few weeks back, .NET 3.5 SP1 was released in beta. SP1 is claimed to improve performance specifically for WCF RESTful services. The next step in our research will be to compare the results below against the SP1 Beta bits.

Performance Testing Parameters

The tests below do not fully reproduce real-world scenarios; IO latency is simulated. Since WCF 3.5 does not support the Asynchronous Programming Model (APM) for REST services, all service calls are handled synchronously. The test therefore compares the normal out-of-the-box model against a modified HttpModule solution that uses the APM found internally in the ServiceModel namespace. This code is based on code provided by the WCF team. It makes use of reflection, which may introduce a performance hit.
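The module itself isn't reproduced here, but the general shape of the APM hook an HttpModule can register looks roughly like the following sketch. This is illustrative only, not the WCF team's actual code: the middleware endpoint URL and the use of WebRequest are assumptions standing in for whatever real IO the service performs.

```csharp
// Illustrative sketch only -- not the WCF team's module. It shows the
// Begin/End (APM) pair an HttpModule can register so that the IIS worker
// thread is released while real IO completes on an IO completion port.
using System;
using System.Net;
using System.Web;

public class AsyncRequestModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        // Register an asynchronous Begin/End event pair (the APM shape).
        app.AddOnPreRequestHandlerExecuteAsync(BeginWork, EndWork);
    }

    private IAsyncResult BeginWork(object sender, EventArgs e,
                                   AsyncCallback cb, object state)
    {
        // Hypothetical middleware endpoint. BeginGetResponse returns
        // immediately, freeing the worker thread; completion arrives
        // on an IO thread via the callback.
        WebRequest request = WebRequest.Create("http://middleware.example/op");
        return request.BeginGetResponse(cb, request);
    }

    private void EndWork(IAsyncResult ar)
    {
        var request = (WebRequest)ar.AsyncState;
        using (request.EndGetResponse(ar)) { } // rethrows any IO exception
    }

    public void Dispose() { }
}
```

The key property is that between BeginWork returning and EndWork running, no worker thread is blocked on the outstanding IO.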

These tests also take into account new features provided by the integrated pipeline of IIS 7. Specifically, I varied performance testing on a registry value named MaxConcurrentRequestsPerCpu. This DWORD lives under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ASP.NET\2.0.50727.0 and does not exist by default; when it is absent, IIS uses the default value of 12. MaxConcurrentRequestsPerCpu specifies how many requests per CPU IIS will handle concurrently. The values we chose for testing are 12, 24, 48, and 0 (0 means there is no maximum). More information on MaxConcurrentRequestsPerCpu can be found in blog posts by two members of the WCF product team, Thomas Marquardt and Wenlong Dong.
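For reference, the key can be created with a .reg file along these lines (the value shown, 0x18 = 24, is one of our test settings):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ASP.NET\2.0.50727.0]
"MaxConcurrentRequestsPerCpu"=dword:00000018
```

Deleting the value restores the default behavior of 12 concurrent requests per CPU.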

The test tool used to produce simulated load is the Web Application Stress Tool. I set the stress settings to have 50 threads and 10 sockets per thread.

The actual work done by the service call is a simple addition of two numbers. The simulated IO is conducted using a Thread.Sleep.
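As a sketch, the service under test looks something like the following. The names and UriTemplate are illustrative, not our exact code; the latencyMs parameter is how the IO latency is simulated.

```csharp
using System.ServiceModel;
using System.ServiceModel.Web;
using System.Threading;

[ServiceContract]
public interface ICalcService
{
    // Query-string parameters are converted to ints by WCF's
    // QueryStringConverter in .NET 3.5.
    [OperationContract]
    [WebGet(UriTemplate = "add?a={a}&b={b}&latencyMs={latencyMs}")]
    int Add(int a, int b, int latencyMs);
}

public class CalcService : ICalcService
{
    public int Add(int a, int b, int latencyMs)
    {
        Thread.Sleep(latencyMs); // simulated middleware IO latency
        return a + b;            // the actual "work"
    }
}
```

Because the call is synchronous, the Thread.Sleep holds a worker thread for the full simulated latency in the out-of-the-box model.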

Virtual Machine

I tested on a virtual machine running Windows Server 2008 with 1 CPU and 1 GB RAM, and the difference in worker thread usage between an out-of-the-box service and a service with our asynchronous module was stark. The out-of-the-box service used more worker threads and more CPU, while the async module service used fewer worker threads and, at higher latencies, less CPU.


Here you can see a clear difference in worker thread usage between the two models. This is a virtual machine, however, so the data cannot be conclusive. The graphs above reflect behavior with a MaxConcurrentRequestsPerCpu setting of 12; higher settings showed a similar disparity.

Real Machine

For the more accurate testing, we used a more robust machine with the following specifications:

· Intel Xeon L5320 1.86 GHz, dual quad-core (8 CPUs)

· 32-bit Windows Server 2008

· .NET 3.5

Having such a robust machine greatly increased performance. The interesting result is that our initial tests using the asynchronous HttpModule on the VM produced markedly different worker thread behavior than on the real machine. Both the async and non-async modes showed consistently low worker thread usage. At higher latencies, the AsyncModule actually hurt performance more than the out-of-the-box model did. Below is a more detailed breakdown of the data.

I chose to group the graphs by mode rather than by MaxConcurrentRequestsPerCpu setting, because I believe there is more performance to be gained by manipulating this registry setting than by using an AsyncModule the way we used it.

The metrics recorded were averages over a 5-minute load testing period:

· Execution Time

· CPU %

· IO Threads

· Worker Threads

· Concurrent Requests (will not exceed 8 × MaxConcurrentRequestsPerCpu)

· Requests/Sec

· Requests Queued

[Graphs: the metrics above, plotted for each mode across the tested latencies and MaxConcurrentRequestsPerCpu settings]


There’s an obvious general trend that as latency increases, execution time naturally increases and throughput decreases. We also see from the data that as latency increases, more requests are handled concurrently through a higher volume of IO threads.

We also notice that as latency increases, there is a slight tendency toward better throughput with a higher MaxConcurrentRequestsPerCpu. I say slight because once MaxConcurrentRequestsPerCpu passes some threshold, the opposite starts to occur: we get the expected increase in concurrent requests, but also an enormous increase in execution time and a drastic decrease in throughput.

Another interesting finding is that worker thread usage in all test scenarios was incredibly low. I attribute this to the hardware: I suspect we would see higher worker thread counts for the out-of-the-box scenario if we had a more robust load testing system. Based on the VM results, I also suspect the async module would reduce worker thread usage there as well. On this machine it wasn't a factor at all; in fact, in some cases the reflection and APM overhead introduced by the AsyncModule actually reduced performance.

Also note the case of no maximum on concurrent requests. It does eliminate queuing entirely, but the overhead of managing all those concurrent requests destroys the server's performance.

From the data above, the best performance came from an out-of-the-box service with MaxConcurrentRequestsPerCpu set to 24. At 1000 ms latency it had lower execution times than all other scenarios, and its throughput was nearly the highest among the 1000 ms scenarios, though it still experienced high request queuing. Going up to a MaxConcurrentRequestsPerCpu of 48 cut queuing to a third, but also tripled execution time. At this point we need to decide which tradeoff matters more. If we expect constant latency, we might live with a controllable, higher request queue to keep execution times lower. However, if we expect latency issues to come in uncontrollable spikes, we might opt for slower overall execution times to safeguard our servers from running their queues up too high.


These tests investigated performance tweaks to a WCF RESTful service, both out of the box and with an asynchronous module. With less-than-adequate hardware, the async module reduces the number of blocked worker threads and thus increases performance. With more-than-adequate hardware, introducing the asynchronous module adds unnecessary overhead that in some cases actually reduces performance.

We still need to investigate the performance boosts in the recently released WCF 3.5 SP1 Beta bits, which are claimed to make RESTful WCF services 5 to 10 times faster. Since the bits are still in beta and were only recently released, results based on them are not included in this paper.

To dive deeper into performance, I recommend we conduct similar tests using one or a combination of the following frameworks:

· Generic ASHX Handler

· WCF 3.5 SP1


· Using middleware that follows the APM

· Microsoft Parallel Extensions

· Microsoft Robotics