THE NEED FOR APPLICATION SPEED

How often have your customers complained about slow applications and lost productivity? More often than not, the cause lies in the storage tier, which is hard to detect, let alone fix quickly. Your servers may be equipped with state-of-the-art CPUs and networking gear, and your applications may be tuned to perform at their best, yet end users still see sluggish response times. The weakest link in your infrastructure chain is what constrains application performance, and that link is the spinning disk. During peak or relatively high server load, CPU power may be sufficient to process the incoming work, but the disk subsystem is limited by its maximum IOPS and data access latency. The problem is compounded if your application workloads are primarily random in nature.

DETECTING THE I/O BOTTLENECK

How do you detect an I/O bottleneck?  Is high CPU I/O_Wait time an appropriate indicator? Not really, if you look at this metric in isolation. High IO_Wait can be seen in a perfectly healthy system; it may simply mean that your system has more CPU horsepower than it can actually put to use. One way to determine this is to analyze application transaction time, which comprises time spent in both CPU and I/O processing. In a system with one or more fast CPUs but an average disk or I/O subsystem, I/O processing accounts for a much higher percentage of overall application transaction time. So your system may appear perfectly healthy, yet still show a high IO_Wait percentage.
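As a rough sketch of that reasoning (the figures below are illustrative assumptions, not measurements from this article), you can estimate how a transaction's time splits between CPU and I/O and see why IO_Wait can be high even on a healthy, fast-CPU system:

# Hypothetical transaction-time breakdown; every figure here is an assumption.
cpu_time_ms = 0.5          # assumed CPU work per transaction on a fast core
io_ops_per_txn = 4         # assumed random disk accesses per transaction
disk_latency_ms = 8.0      # assumed average latency of a 7200 RPM disk

io_time_ms = io_ops_per_txn * disk_latency_ms    # 32.0 ms spent waiting on disk
txn_time_ms = cpu_time_ms + io_time_ms           # 32.5 ms total per transaction

io_fraction = io_time_ms / txn_time_ms
print(f"I/O share of transaction time: {io_fraction:.0%}")   # ~98%, so IO_Wait dominates

With numbers like these the CPU is nearly idle, which is exactly the "healthy-looking system with high IO_Wait" pattern described above.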

In truth, slow application response times are usually attributable to a combination of factors, not the CPU IO_Wait metric alone. Note that IO_Wait is a CPU performance metric: it measures how your CPU is spending its time, but it can point you in the direction of your system I/O and its constraints. If your application server CPU utilization is fairly low and nowhere near 100%, a CPU bottleneck is ruled out. Assuming you have already ruled out the network as the bottleneck, if your application response times are higher than, say, 20 ms or higher than your average disk I/O response times, and the CPU IO_Wait metric is 25% or above, then you need to take a closer look at your disk subsystem performance statistics over a reasonable period, say a couple of hours. If you see a combination of significantly long I/O queues, high IO_Wait times, high disk utilization and very high application latency (see Figure 1), you have an I/O bottleneck. So in addition to the CPU IO_Wait metric, you need to take a holistic view and examine your application latency, disk subsystem I/O queues and overall disk utilization before you can pin your application slowness on an I/O bottleneck.
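As a minimal, Linux-specific sketch (not part of any product described here, and assuming a device name such as "sda"), the following Python snippet samples /proc/diskstats twice and derives the same signals iostat reports as %util and avgqu-sz; sustained high values for both, alongside high application latency, are the pattern described above:

import time

DEVICE = "sda"        # assumed device name; adjust for your system
INTERVAL = 5.0        # seconds between the two samples

def read_diskstats(device):
    # /proc/diskstats fields 10 and 11 are time spent doing I/O (ms)
    # and weighted time spent doing I/O (ms), respectively.
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                io_time_ms = int(parts[12])     # field 10
                weighted_ms = int(parts[13])    # field 11
                return io_time_ms, weighted_ms
    raise ValueError(f"device {device} not found")

t1 = read_diskstats(DEVICE)
time.sleep(INTERVAL)
t2 = read_diskstats(DEVICE)

util_pct = (t2[0] - t1[0]) / (INTERVAL * 1000) * 100   # roughly iostat's %util
avg_queue = (t2[1] - t1[1]) / (INTERVAL * 1000)        # roughly iostat's avgqu-sz
print(f"disk utilization: {util_pct:.1f}%  average queue depth: {avg_queue:.2f}")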

ADDRESSING IO BOTTLENECKS

What are your options when faced with such an I/O bottleneck?  There are several, depending on how deep your pockets are and how time-critical the bottleneck is to your business. Many resort to upgrading servers or scaling out the existing application infrastructure, dividing the load among a larger number of servers. This is not only disruptive but also increases your overall costs: in addition to capital expenditure, you now have more operational expenses (think power, cooling, infrastructure and software management) to worry about. It may solve your application slowness temporarily, or rather postpone it until your I/O workload or user count grows beyond current levels. When the bottleneck is perceived to be I/O, most sysadmins try to tune the service generating the most I/O and cache more of the important application data in RAM. That may not be sufficient given the amount of free server RAM, and the problem becomes more pronounced when the data grows so large that effective memory caching becomes impossible. For large databases where data is accessed more or less randomly, you can expect at least one disk seek per read and a couple of disk seeks per write, as the sketch below illustrates. The only way to minimize this cost is to use a storage tier that offers lower data seek times.
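To put the seek-time limit in perspective, here is a back-of-the-envelope sketch; the per-I/O service time and workload demand are illustrative assumptions, not figures from this article:

# Rough capacity math for random I/O on spinning disks; all inputs are assumptions.
avg_service_time_ms = 8.0                         # assumed seek + rotational latency per random I/O
iops_per_spindle = 1000 / avg_service_time_ms     # ~125 random IOPS per disk

workload_iops = 5000                              # assumed random IOPS the application needs
spindles_needed = workload_iops / iops_per_spindle

print(f"~{iops_per_spindle:.0f} IOPS per spindle -> ~{spindles_needed:.0f} disks "
      f"just to sustain {workload_iops} random IOPS")

The calculation shows why throwing more spindles or servers at a random-I/O workload gets expensive quickly, and why a lower-latency storage tier is the more direct fix.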

Today, another very promising and widely discussed option for addressing an I/O bottleneck is flash storage. It comes in various form factors and solutions, offering a very good alternative to a spinning-disk-based storage subsystem with blazingly high IOPS and microsecond latencies that are orders of magnitude lower than those of spinning disks.  However, it is 10-50 times more expensive per GB than spinning disk, and it is also a disruptive solution, since you would need to migrate data from disk-based storage over to flash-based devices, unless of course yours is a greenfield application deployment or a new application setup.   Traditional or legacy enterprise applications are also not architected to exploit the full I/O potential of flash-based storage. So dedicating flash storage to a few enterprise applications or a few users by replacing spinning disk outright can be a highly ineffective use, and an underutilization, of a very expensive resource.  This is comparable to server underutilization before the advent of consolidation and virtualization in the last decade.

For legacy and traditional applications that already have a lot of business data entrenched on spinning disks, a very viable and smarter option for solving your application I/O bottleneck is SSD caching.  It is not only cost-effective but also lets you leverage your existing investment in spinning disks.  An application-aware SSD caching solution can automatically determine the hot data your application accesses most frequently and serve it at the speed of SSD.  It requires no data migration and can solve your I/O bottleneck with a much smaller amount of flash than replacing all your disks. An all-flash solution would require flash capacity equal to your current dataset plus projected growth, whereas SSD caching typically needs only 10-15% of total dataset capacity as cache. This cache can grow dynamically and scale to meet your future needs, giving you the best of both worlds: spinning disk for capacity and flash for speedy access to data. Figure 2 shows how an SSD caching solution not only reduces application latency but also alleviates CPU IO_Wait, giving your application better response times.
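A simple way to reason about why a cache of only 10-15% of the dataset helps so much is an effective-latency estimate. The hit ratio, latencies and dataset size below are illustrative assumptions, not figures from this article:

# Back-of-the-envelope SSD-cache estimate; every input here is an assumption.
dataset_gb = 2000                 # assumed working dataset on spinning disk
cache_fraction = 0.12             # cache sized at ~12% of the dataset (within the 10-15% range)
cache_gb = dataset_gb * cache_fraction

hit_ratio = 0.90                  # assumed fraction of I/Os served from the SSD cache
ssd_latency_ms = 0.2              # assumed SSD read latency
hdd_latency_ms = 10.0             # assumed spinning-disk read latency

effective_latency_ms = hit_ratio * ssd_latency_ms + (1 - hit_ratio) * hdd_latency_ms
print(f"cache size: {cache_gb:.0f} GB, effective latency: {effective_latency_ms:.2f} ms "
      f"vs {hdd_latency_ms} ms on disk alone")

With these assumed numbers, a 240 GB cache brings average latency down from 10 ms to a little over 1 ms, because the hot data that dominates the I/O stream is served from flash.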

CONCLUSION

If your application is behaving sluggishly and you see a very high CPU IO_Wait metric, coupled with disk latency on the order of 10-20 ms and long disk I/O queues, it may be worthwhile to evaluate SSD caching solutions. These solutions can often resolve your application performance challenges at a fraction of the cost of an all-flash deployment or a server upgrade.  Our lab results show a 10-18X transaction boost for MySQL and MongoDB databases with SSD caching software, compared to a purely spinning-disk-based storage subsystem. The best part is that I/O latencies drop by almost 93%, making your I/O queues disappear.  Your application servers' CPUs no longer have to wait for I/O to complete and can instead drive your business-critical workloads, getting more work done in far less time than a disk-based subsystem allows.  Your applications will run faster, and users will be able to accomplish more than your spinning-disk-based application setups can achieve today.