Abstract
This article looks at performance and scalability issues on a
dynamic website and how the Transactional Cost Analysis
methodology can be used to see where real bottlenecks and
performance issues lie. This methodology allows you to accurately
model the performance of your website and see how potential changes
and optimizations affect the overall capacity of the site.
A step-by-step guide to using the Transactional Cost Analysis
methodology with the freely available Web Application Stress tool is
given. The metrics that should be monitored and how these relate to
real-life performance are explained in detail.
Myths and misconceptions regarding ASP performance and
scalability are dealt with and example results discussed. The key
differences between performance and scalability are looked at along
with strategies that will work to improve both. In addition, real
world experiences with a large e-commerce site are given as well as
strategies that will help capacity planning and how resources are
best used.
Web stress performance overview
The performance of a web application should be thought about
during planning, technical design, architecture design, development
and testing. It cannot be "bug-fixed" on at the end.
It is still common for new sites to launch and be overloaded, or
even existing sites to come grinding to a halt. As anyone who tried
to place a bet on the English Grand National this year at most of
the online bookmakers will testify, their sites were unusable for
most of the day. The effects of such poor planning and design have a
major impact on company revenue. Given the maturity of the web, such
incidents should be and can be avoided.
All major websites should be assigned a person who is responsible
for capacity planning and ensuring that the website matches
predicted peaks in load. This is an ongoing job; it requires
detailed analysis of past trends and future marketing work. This
position is often overlooked or ignored, assuming that stress
testing during development has ironed out all problems. What is
often forgotten is the fact that different shopper behavior patterns
have a significant effect on site performance. This is where the TCA
methodology is so powerful: it allows us to run already-gathered
performance data against potential shopper behavior models and see
how performance and capacity are affected.
Scalability and performance: the difference
The terms scalability and performance are often used
interchangeably. This is a mistake because a high performance
website is not necessarily a scalable one.
- Performance denotes the speed and efficiency with which the
system performs its tasks
- The ability to scale a website means the ability to add new
servers (horizontal scalability) or upgrade existing hardware
(vertical scalability) to increase capacity. A well written,
scalable application will allow the addition of new resources to
increase capacity without any changes being required to the
application itself and with little impact on system behavior or
performance.
For a large website with growing demands on capacity, the most
critical measure is the ability to scale out effectively by adding
new servers. Windows 2000 Advanced Server comes with built in
load-balancing software that can be complemented by the addition of
Application Center to help manage and monitor the web-farm. Other
worthwhile alternatives include hardware load balancers such as the
F5 BIG-IP or the Cisco LocalDirector.
If a website application scales effectively, i.e. doubling the
number of servers results in a near doubling of capacity, this is an
extremely cost-effective way of scaling a website. A high
specification rack-mounted web server costs in the region of £2000
today, comparable to a couple of days of a consultant programmer's
fees. There is also the cost of hosting, but often a company will
already have space free if it has rented half a rack or a full rack.
In addition, the current state of the IT market means that most
data centers have vast amounts of unused space, so rentals are
coming down.
This emphasizes how critical it is to ensure, during development of
a web application, that care is taken over scalability (by
performance testing against a single server, and then against a web
farm of two or more servers), and that expensive development time is
focused on areas that will significantly and measurably improve
performance and the ability to scale.
Using the Web Application Stress tool
The Microsoft Web Application Stress tool is currently one of the
most used free tools, and with good reason. It has a lot of the
features of more powerful and costly software suites.
To fully cover the WAS would take a full article (there are
others on the http://www.asptoday.com/
site and I've added some links at the end of the article), but here
are a few relevant pointers to get up and running with the TCA
methodology.
- 1. Start up WAS and select "Manual Script".
- 2. You will be presented with the script configuration screen.
- 3. Enter your server name where "localhost" appears (don't prefix
it with http:// because this will cause WAS to error). It is not
recommended to run your tests from the same machine as your web
server: generating the test hits takes a significant amount of CPU
time and will give you false results, so tests should always be run
from other machines.
- 4. Select "Get" as the Verb (this is the HTTP method used to
access the page) and enter the path to a page on your server.
Because the TCA methodology stress tests only one page at a time in
order to build the model, we can ignore the "Group" and "Delay"
options.
- 5. Now expand the new script item in the tree and enter the
"Settings" as follows:
Stress Level and Stress Multiplier can be used to adjust the
stress on the server. These are often misunderstood - fortunately
for the TCA methodology, their actual values are irrelevant as far
as the results are concerned because we are not trying to model a
real web load (100s of concurrent users), just stressing a single
page at a time and applying those results to a mathematical model.
For those who are interested, "Threads" are the number of NT threads
that WAS starts up to send requests from; if you are not using the
TCA methodology and are trying to simulate a "real" load, you will
likely need multiple client machines, in which case the threads are
divided amongst these. Each thread can then contain multiple
"sockets" which are concurrent connections to the web server.
Microsoft's website gives the following formula, which helps explain
the relationship between these values:
Stress level (threads) x Stress multiplier (sockets per thread) =
Total number of sockets = Total concurrent requests
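The relationship is pure arithmetic, so it can be sketched in a few lines (Python here, purely for illustration):

```python
def total_concurrent_requests(stress_level: int, stress_multiplier: int) -> int:
    """Stress level = number of WAS threads; stress multiplier =
    sockets per thread. Each socket is one concurrent connection, so
    the product is the total number of concurrent requests that WAS
    directs at the server."""
    return stress_level * stress_multiplier

# e.g. 2 threads with 1 socket each -> 2 concurrent requests
# e.g. 4 threads with 10 sockets each -> 40 concurrent requests
```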
For those who want to delve deeper into this particular topic, I
have included a link at the end of the article.
The values you have entered are good to start with, but it is
worth taking some time to experiment with the two settings to see
how it affects how much stress you can put on an individual page. To
fine-tune the stress you're applying to the page, you can adjust the
random delay. It will take some time playing with these settings
before you get a feel for the best initial settings to avoid having
to keep re-running tests as you look for the maximum requests per
second that a given page can handle. Generally I suggest starting
off with two threads and one socket and a delay of between 100 and
150ms. If the site is not getting sufficiently stressed, increase
the stress multiplier by 1 and re-examine the results. If the site
is just over capacity and showing 100% CPU usage, you can lower the
stress by increasing the random delay gradually until you reach peak
capacity without over-loading.
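The tuning loop described above is manual in WAS, but the decision logic can be sketched as follows; `run_test` is a hypothetical stand-in for performing one WAS run at a given random delay and reading off the results:

```python
def find_peak_settings(run_test, delays_ms=(100, 125, 150, 200, 300)):
    """run_test(delay_ms) -> (requests_per_sec, avg_cpu_percent) for
    one stress run. Increase the random delay until the server is no
    longer pinned at 100% CPU, then report that run as the usable
    peak for the page."""
    last = None
    for delay in delays_ms:
        rps, cpu = run_test(delay)
        if cpu < 100:  # server no longer saturated: this is peak capacity
            return delay, rps, cpu
        last = (delay, rps, cpu)
    return last  # still overloaded even at the longest delay tried
```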
- For the TCA methodology, you also need to monitor CPU usage
during stress testing. Usually you will find that the processor
usage during peak requests per second is between 80% and 95%
usage. If the CPU usage is 100% for the duration of the test, it
means that the CPU is overloaded and the requests per second will
drop. To add performance counters, click on the "Perf Counters" link
and add the processor usage monitor.
- You are now ready to run the tests. Click on the "New Script"
label to get back to the first screen and press the "Run" button on
the toolbar. The test will initialize and run.
- To view the report that has been generated for that run of the
stress tool, click on the "View" menu and then on "Reports".
Where to run the tests from
It is important to run the tests from somewhere local to your web
servers, otherwise you will be severely restricted in how
effectively you can stress the site and both bandwidth limitations
and latency will skew the results. The best solution is to be
physically next to the machine on a fast LAN. If you don't do this,
and instead try to perform the tests on a live site over the
Internet, you are just wasting your time. If you host your site at a
data centre such as Exodus, you can take a powerful laptop to the
data centre and attach it to the network; alternatively, if you have
a backup or management server hosted there, you can use Terminal
Services or pcAnywhere and run the tests from that.
Monitor CPU usage on the WAS machine
The WAS tool can run across multiple machines to provide enough
hits to properly stress a web server. With the TCA methodology,
where each page is tested individually, this is usually not an issue
because it does not require the same volume of hits. However, it is
still important to monitor the CPU usage on the WAS machine; if the
CPU utilization is consistently above 80%, the test may well be
invalid because the WAS machine is likely incapable of providing
enough hits to the server.
Double-check that everything ran as expected
There are two other values you should look at:
- Result Codes (on the main report screen): you should always check
that you are getting the correct result codes back from your web
pages! The most common cause of erroneous results in WAS is the web
server not returning the page as expected. Ensure that you have
pointed to the right pages (you're not getting 404s) and that the
page is not reporting an error (500s).
- Downloaded Content Length (under "Page Data" on the reporting
screen): check that the downloaded content is approximately what you
would expect for the page you are testing against.
If either of the above results suggests that your tests are not
running as expected, check the IIS log files to see whether the
correct page is being requested, and also what values are being
passed to it.
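The two checks above amount to a simple sanity gate before any stress run; a minimal sketch (in Python, with a hypothetical 20% tolerance on content length):

```python
def sanity_check(status, body_len, expected_bytes, tolerance=0.2):
    """Validate one trial request before stress testing: the status
    code must be 200 (not a 404 or a 500 error page), and the
    downloaded content length must be within tolerance of what we
    expect for this page."""
    if status != 200:
        return False, f"unexpected status code {status}"
    if abs(body_len - expected_bytes) > tolerance * expected_bytes:
        return False, f"content length {body_len} far from expected {expected_bytes}"
    return True, "ok"
```

A short error page returned with status 200 (a "soft" error) would be caught by the length check even though the status code looks healthy, which is exactly the failure mode the article warns about.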
Aims of stress testing
Stress testing (or "load testing" - these terms are often used
interchangeably) is a term used to describe the stressing of a real
application under simulated load provided by virtual users.
Properly conducted stress testing should have a number of
aims:
- Checking performance during development
- Testing before site launch
- Finding page response times
- Finding peak capacity
- Finding how well the site scales
Page Response Times (TTFB and TTLB)
Page response times measure how long it takes to get the
first byte of a page and the last byte of a page under different
loads. Page response times are critical because users will not wait
long before they get impatient and try to visit another web site.
There are two metrics to page response times, Time To First
Byte (TTFB) and Time To Last Byte (TTLB). When performing
stress tests, one measure of how stressed the server is and how
close to peak capacity you are is the time between these two values.
As an example, you may find that up to a point, TTFB and TTLB hold
fairly consistently at 230 and 244 ms respectively. There
will be a point as you increase the load on the server, either by
increasing threads and sockets or reducing the delay between
requests, that TTFB and TTLB start to move apart; they now record
250 and 420 ms. You will also find that hits per second decreases.
It is often useful to agree a maximum page load time (for example, 1
second) and stress the server until this maximum is reached (by
TTLB). That way, you know what the maximum site capacity is while
still keeping acceptable response times.
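Using the gap between TTFB and TTLB as a saturation signal can be expressed as a tiny sketch (Python; the baseline gap and the "several times baseline" factor are assumptions to tune for your own site):

```python
def saturation_signal(ttfb_ms, ttlb_ms, baseline_gap_ms=20, factor=3):
    """Under light load TTFB and TTLB sit close together (e.g. 230 ms
    and 244 ms). When the gap between them grows to several times the
    light-load baseline, the server is approaching peak capacity."""
    return (ttlb_ms - ttfb_ms) > factor * baseline_gap_ms

# light load: 244 - 230 = 14 ms gap -> not saturated
# heavy load: 420 - 250 = 170 ms gap -> saturated
```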
Requests Per Second
Requests Per Second is probably the key metric you should
monitor. Very crudely, it tells you how well the page you are
stressing is performing. You will see significant differences
between pages on your site depending on how they are written and how
they interact with back-end systems.
Testing for Scalability
Testing for the ability of a site to scale out is often
forgotten about. On large sites with multiple servers, it is of
critical importance. To test for scalability, first test against one
server only, then against two servers, then against your full
web-farm: This will show how scalable each page in the site is, i.e.
how adding new servers improves its ability to serve pages. You will
find that plain HTML pages or ASP pages with no back-end calls scale
almost linearly, so going from one server to two servers results in
a doubling of the number of pages served. Conversely, if you have a
poorly thought out site with multiple database calls per page, you
will find that adding new servers results in poor scaling - perhaps
improving performance by 1.1 times rather than doubling as
expected!
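The scaling comparison above reduces to a single ratio; sketched in Python with illustrative numbers:

```python
def scaling_efficiency(rps_one_server, rps_n_servers, n):
    """1.0 means perfectly linear scale-out (n servers serve n times
    the pages of one); values well below 1.0 point to a shared
    bottleneck, typically the database."""
    return (rps_n_servers / rps_one_server) / n

# plain HTML page: 100 rps on one box, 198 rps on two -> 0.99
# database-heavy page: 100 rps on one box, 110 rps on two -> 0.55
```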
The Transactional Cost Analysis Methodology
Overview
The TCA methodology models the CPU load that a single user exerts
on your web server. The methodology works because although it may
seem absurd to express the cost of a single browser session as a
slice of CPU capacity, e.g. 1.5 MHz, when there are hundreds of
shoppers on your site this holds true. The TCA methodology as
applied to websites is a product of the Microsoft Research
Laboratory, and there are some interesting papers on Microsoft's
MSDN site that use the methodology.
Why Use the TCA Methodology?
The Transactional Cost Analysis methodology is a theoretical
method of determining the CPU cost of transactions (think
"sessions") on a website. One of the key benefits of the TCA
methodology is it provides a mathematical model of your website
performance and capacity.
Two examples of the power of the TCA model are given below:
- It is possible to apply a different shopper profile to the
data to see how site capacity changes. During Christmas or
promotional periods, shopper behavior can differ significantly
(the ratio of the number of buyers to the number of shoppers
usually goes up). This means that the site is handling more
transactions, which usually involves more processing power. It is
easy to put the Christmas shopper profile into the TCA model and
see exactly how great this effect is.
- By examining CPU usage per page and how this affects overall
site capacity, it is possible to focus on areas that really affect
site capacity rather than those that the raw throughput values
alone would suggest.
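The profile-swapping idea in the first example can be made concrete with a small self-contained sketch (Python; the transaction names, per-request CPU costs and profile figures are all hypothetical):

```python
# Hypothetical CPU costs (MHz per request) for three transactions, and
# two shopper profiles giving requests per session for each.
COST_MHZ = {"browse": 1.2, "search": 2.5, "checkout": 8.0}

NORMAL    = {"browse": 10, "search": 3, "checkout": 0.2}
CHRISTMAS = {"browse": 10, "search": 3, "checkout": 0.6}  # more buyers

def cost_per_session(profile):
    """Total CPU cost that one average session exerts on the web tier."""
    return sum(COST_MHZ[t] * hits for t, hits in profile.items())

# At the same visitor numbers, each Christmas session costs more CPU
# (more checkouts), so the number of sessions the farm can carry drops.
```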
Performing the Testing
1st Step: Break down the site into
"transactions"
There are some operations on the site that involve more than one
request to an ASP page. This most commonly happens when one ASP page
posts to another or redirects to another. For example, search.asp
may call search_results.asp (2 requests form part of the
transaction). Similarly, product.asp may call add_to_basket.asp and
redirect to basket.asp (3 requests form part of the transaction).
For the most part, however, pages will be distinct units: for
example the home page, category pages, the product page, etc.
2nd Step: Determine Usage Profile
In order to use the TCA methodology, we need a profile of shopper
behaviour. This is the "average" behaviour exhibited by the users of
your site. If you are logging page accesses to a SQL database or
have web site reporting tools, this is simple for a given time
period:
Total Hits To Page / Total Number Of User Sessions
The supplied spreadsheet contains calculations for determining
usage profile given page hits, number of sessions and average
session length.
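The calculation in the formula above, applied per page, can be sketched as follows (Python; the page names and figures are illustrative):

```python
def usage_profile(page_hits, total_sessions):
    """Average requests per session for each page or transaction:
    Total Hits To Page / Total Number Of User Sessions."""
    return {page: hits / total_sessions for page, hits in page_hits.items()}

# e.g. 50,000 home page hits over 10,000 sessions -> 5.0 per session
```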
3rd Step: Perform the Stress Tests
Run the stress tests on a page-by-page basis. Stress each
page (or group of pages if they form part of a "transaction") until
you find the maximum number of requests per second they can handle.
When you find the peak requests per second, note the:
- Requests Per Second
- Average CPU Usage
Those are all that is required for the TCA model. You may also
find it useful to make a record of TTFB and TTLB.
4th Step: Use the TCA Model
Once the results have been gathered, they can be entered into
the TCA model using the supplied spreadsheet (see below).
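The core arithmetic behind the model can be sketched in a few lines (Python, assuming a single-CPU web server and a hypothetical 70% CPU headroom ceiling; all figures are illustrative, not measurements):

```python
# Minimal sketch of the TCA calculation for one web server.
CPU_MHZ = 1000  # clock speed of the (single) web server CPU

def cost_per_request(avg_cpu_pct, peak_rps):
    """CPU cost of one request to a page, in MHz: the CPU consumed at
    peak load divided by the requests served per second."""
    return (avg_cpu_pct / 100.0) * CPU_MHZ / peak_rps

def sessions_supported(profile, costs, session_secs, headroom_pct=70):
    """Concurrent sessions one server can carry while staying below
    the headroom ceiling. profile maps page -> requests per session;
    costs maps page -> MHz per request (from cost_per_request)."""
    mhz_per_session = sum(profile[p] * costs[p] for p in profile)
    load_per_session = mhz_per_session / session_secs  # MHz drawn each second
    return (headroom_pct / 100.0) * CPU_MHZ / load_per_session

# e.g. a page peaking at 90 requests/sec with 90% CPU costs 10 MHz per
# request; if a 10-minute session makes 5 such requests, one server
# held below 70% CPU carries 8,400 concurrent sessions.
```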
The Download
I have provided a fully worked out Transactional Cost Analysis
spreadsheet model as the download to this article. The data in the
spreadsheet is imaginary, but you can use it to easily model the
real data from your own web site. It is worth taking time to examine
the spreadsheet and the example data and graphs that I've
included.
Common Myths about ASP performance
The most common myth regarding ASP performance is that
translating a page to use COM rather than ASP functions or VBScript
classes will result in a significant performance increase. Under
Windows NT this held true, but on Windows 2000 the ASP VBScript
scripting engine is as fast as, if not faster than, compiled code in
most situations. If there are performance issues with a particular
page, look for bottlenecks in the page. Usually these will be to do
with looping or the dynamic creation of HTML (both of which reduce
performance but not the ability to scale) or database calls (which,
more worryingly, affect both performance and the ability to scale
out). It is tempting to think that coding trickery can turn a poorly
performing, poorly scaling page into a high performance, scalable
one. This is rubbish: most performance problems are introduced as
fundamental flaws in the page design or overall site architecture.
As I have already stated, performance cannot be bug-fixed on at the
end of development!
Wherever possible, use cached data rather than making
database calls. This is the single biggest improvement you can make
to site performance and scalability. The overall issue of
performance is huge and encompasses web server, program
architecture, program implementation, logical database design,
physical implementation, hardware, network, connectivity, etc. The
best thing you can do is to test and keep testing.
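As an illustration of the caching principle, here is a sketch in Python; an ASP site would keep the data in Application state rather than a module-level dictionary, and the five-minute TTL is an assumption:

```python
import time

_cache = {}  # product data cached in application memory

def get_product(product_id, fetch_from_db, ttl_secs=300):
    """Serve from the cache when the entry is fresh; hit the database
    only when the entry is missing or older than the TTL. This turns
    a per-request database call into a once-per-TTL call."""
    entry = _cache.get(product_id)
    now = time.monotonic()
    if entry is not None and now - entry[1] < ttl_secs:
        return entry[0]  # cache hit: no database round trip
    value = fetch_from_db(product_id)
    _cache[product_id] = (value, now)
    return value
```

Because the cache removes the shared database from the request path, it improves both raw performance and the page's ability to scale out across servers.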
Capacity planning
Capacity planning is the process of measuring a web site's
ability to serve content to its visitors at an acceptable speed for
foreseeable visitor numbers and usage trends.
The most important aspect of capacity planning is ensuring that
your website and associated systems can deal with peaks in
traffic. The usual assumption is that the 80/20 rule should be
followed - i.e. 80% of the traffic will occur only 20% of the time.
This indicates that a website should be able to handle peaks 4x
greater than the average traffic levels. From personal experience I
would say that this is optimistic. Prominent online promotions or
television features can result in peaks of between 10 and 20 times
average. If you have used the TCA methodology wisely and know both
the capacity of your website and how well it scales with the
addition of new servers, you should be well equipped to deal with
future demands.
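To make the peak arithmetic concrete, a small sketch (Python, with made-up figures) of turning average load, an assumed peak factor, and per-server capacity into a server count:

```python
import math

def servers_needed(avg_sessions, peak_factor, sessions_per_server):
    """Servers required to carry the predicted peak: average
    concurrent sessions scaled by the peak factor (4x under the 80/20
    rule, 10-20x for promotion- or TV-driven spikes), divided by the
    per-server capacity from the TCA model."""
    return math.ceil(avg_sessions * peak_factor / sessions_per_server)

# 2,000 average sessions, 4x rule, 1,500 sessions/server -> 6 servers
# the same site planning for a 10x television spike -> 14 servers
```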
On an ongoing basis, the web servers and database servers should
be monitored for performance. I would advise monitoring regularly
with Performance Monitor during normal browsing periods and peak
periods. I would also advise setting up alerts which will inform you
if certain thresholds are exceeded, for example processor usage
averaging greater than 80%, excessive memory usage, etc. You can
have these generate an email, which can then trigger a pager or an
SMS text message, allowing you to respond to the situation or, at
the very least, monitor it.
Conclusion
This article has covered many aspects of measuring performance
and scalability using the Transactional Cost Analysis Methodology.
The process of capacity planning has also been touched on. The
recent spectacular failure of the UK government's census website (a
link is provided at the end of the article) has again shown how
important capacity planning and stress testing are: They should be a
fundamental part of the development project through requirements
capture, architecture, design and implementation. Even when the
application has been deployed, its performance should be measured on
an ongoing basis and present and predicted usage trends analyzed to
ensure that demand can be met!
Links
The Microsoft Web Application Stress Tool site (the tool is also
available on the Windows 2000 Resource Kit Companion CD):
http://webtool.rte.microsoft.com/
Detailed information on WAS Threads and Sockets
http://webtool.rte.microsoft.com/Threads/WASThreads.htm
Detailed Article on using the WAS Tool
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnduwon/html/d5wast_2.asp
Excellent article on site design for performance. Compares an
all-ASP implementation to COM and COM+ under Windows 2000 and
presents results and conclusions that will be surprising to some
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnnile/html/docu2kbench.asp
"Census website goes offline": BBC news article about the recent
high profile failure of a UK government website to cope with
demand.
http://news.bbc.co.uk/hi/english/uk/newsid_1749000/1749045.stm