In my previous blog, I showed how seriously your CDN can affect the performance of your website, even though many teams leave it out of their monitoring strategy. To tune performance and troubleshoot quickly, you must monitor your CDN effectively, along with the rest of your systems. In this blog I'll show you how to do just that.
You may already use a Real User Monitoring (RUM) tool for the front end and an Application Performance Monitoring (APM) tool to track backend performance. The next step is to use your CDN's monitoring tools to monitor the CDN itself. Akamai, for example, offers several data feeds (like Cloud Monitor, LDS and DLRs), which hold metrics about all aspects of your site's traffic as it passes through the Akamai platform.
Now you have all your bases covered: RUM on the front end, CDN monitoring in the middle and APM in the back. But you are left with two big obstacles to success. The first is making sense of the massive amount of metric data, which we will discuss in the rest of this blog. The second is relating the metrics from one monitoring system to those from another, which we will cover in the next blog.
Too much data means that important information gets lost in the noise. The solution is to identify KPIs (Key Performance Indicators) that allow your performance team to spot problems easily and react. You need to identify these KPIs for each part of your site, collect them from your monitoring tools, and display them in an easy-to-read dashboard for all your teams to track.
Let's start with the CDN. In the previous blog, we saw that the key contributions your CDN makes to performance are edge caching and accelerated routing, both of which shorten render times and so improve the user experience. The metrics behind these actions should be available from your CDN system. The key performance indicators (KPIs) to collect from your CDN are:
- Cache hit ratio - how many of the requests were serviced by edge cache
- Edge processing time - the time actually spent by the edge server, both retrieving the content from cache and manipulating it. This replaces the usual backend processing time
- Distance from browser to edge - this should be as small as possible, ideally one hop, because it affects the transport time in both cached and non-cached scenarios. This can expose how useful a specific edge server is in various geographies
- Transport time from edge, through intermediate servers, to backend servers in the data center or cloud - this should be faster than just letting the browser send requests directly to the backend server across the public internet. This will show the acceleration due to routing optimization by the CDN. This metric may be hard to collect unless you match each request with its backend metrics (the CDN knows when it sent a request, the backend knows when it was received)
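As a rough illustration of the first and last items, here is a minimal Python sketch. It assumes you have already parsed your CDN log feed into simple records and that each request carries an ID that also appears in your backend logs. The record fields (`request_id`, `cache_hit`, `sent_to_origin_at`) and both function names are hypothetical, not the schema of any particular CDN feed, and the transport time is only meaningful if the edge and backend clocks are reasonably well synchronized.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EdgeRecord:
    """One request as seen by the edge (hypothetical fields, not a real feed schema)."""
    request_id: str
    cache_hit: bool
    edge_time_ms: float
    # Epoch seconds when the edge forwarded the request; None if served from cache.
    sent_to_origin_at: Optional[float] = None


def cache_hit_ratio(records: List[EdgeRecord]) -> float:
    """Fraction of requests that were served from edge cache."""
    if not records:
        return 0.0
    return sum(r.cache_hit for r in records) / len(records)


def edge_to_origin_transport_ms(
    records: List[EdgeRecord],
    origin_received_at: Dict[str, float],
) -> Dict[str, float]:
    """Match edge records to backend receive times (epoch seconds) by request ID.

    Returns the transport time in milliseconds for every request found on both sides.
    """
    result = {}
    for r in records:
        received = origin_received_at.get(r.request_id)
        if r.sent_to_origin_at is None or received is None:
            continue  # cache hit, or no matching backend entry
        result[r.request_id] = (received - r.sent_to_origin_at) * 1000.0
    return result
```

Requests that appear on only one side are skipped rather than guessed at, so the count of unmatched requests is itself worth tracking.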
Now let us look at the KPIs from your front end (RUM) monitoring:
- Performance affected by various factors - load time across various dimensions
  - By page - how long each web page took to render in the browser
  - By mobile device - on mobile you have less bandwidth, so are your pages smaller?
  - By browser
  - By geography
  - By edge cache hit or cache miss (might require integration with your CDN metrics)
- Error frequency and types of errors, by page
- Performance of various steps in the browser's functionality
  - Download time - how long did it take the browser to suck in all the content
  - DOM time - once the data is sucked in, how long to parse it and build the DOM tree
  - Render time - once all the data was read in and parsed, how long to actually render it
- Size of content - ensure pages are not too heavy
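To make the list above a little more concrete, here is a small Python sketch of how the load-time KPIs might be derived from raw RUM samples. It assumes each sample is a dictionary of millisecond timestamps plus a few dimension fields. The field names (`nav_start`, `response_end`, `dom_complete`, `load_end`, `page`) and the function names are made up for this example, so map them to whatever your RUM tool actually exports.

```python
from collections import defaultdict
from statistics import median


def phase_times(sample):
    """Split one page-load sample into download, DOM and render phases (ms)."""
    return {
        "download_ms": sample["response_end"] - sample["nav_start"],
        "dom_ms": sample["dom_complete"] - sample["response_end"],
        "render_ms": sample["load_end"] - sample["dom_complete"],
    }


def median_load_by(samples, dimension):
    """Median total load time (ms), grouped by a dimension such as
    'page', 'browser' or 'geo' (hypothetical field names)."""
    groups = defaultdict(list)
    for s in samples:
        groups[s[dimension]].append(s["load_end"] - s["nav_start"])
    return {key: median(times) for key, times in groups.items()}
```

I used the median here because a handful of very slow page loads can distort an average. A high percentile is another common choice if you care most about your slowest users.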
And now the KPIs for the backend (APM, etc.):
- Transaction time - how long do various transactions take (e.g. "purchase" or "search")
- Error rates and breakdown - what kinds of errors are backend servers encountering
- Load - how many requests per second are hitting the systems and for what transactions
- Response times per subsystem - how much of the transaction time is each subsystem contributing (e.g. DB call, message queue send, each app module, etc.)
- Infrastructure performance - how the underlying hardware is performing, e.g. CPU usage, memory consumption, disk I/O, etc.
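As one example from the list above, the per-subsystem breakdown is simple arithmetic once your APM tool gives you the time spent in each subsystem for a transaction. The Python sketch below assumes that data arrives as a plain dictionary of milliseconds. The function names and the subsystem labels in the usage note are illustrative only.

```python
def subsystem_shares(timings_ms):
    """Return each subsystem's fraction of the total transaction time.

    timings_ms maps a subsystem name to the milliseconds spent in it.
    """
    total = sum(timings_ms.values())
    if total == 0:
        return {name: 0.0 for name in timings_ms}
    return {name: ms / total for name, ms in timings_ms.items()}


def error_rate(total_requests, failed_requests):
    """Fraction of requests that ended in an error."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests
```

For instance, calling `subsystem_shares({"db": 60, "queue": 10, "app": 30})` attributes 60% of the transaction time to the database, which tells you where to look first.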
You may of course have a few of your own KPIs to add to this list (add them in the comments section below!), but hopefully this will get you close to a final list that works for your site. Ideally, you will build a dashboard that shows all the KPIs above on one easy-to-read web page. With quick access to these KPIs, your team should always have a better grasp of site performance.
You are now off to a great start, but there's still work to be done. One gap you may notice is that all these metrics, while important in themselves, do not give you a view into the business. The metrics above will alert you that a page is getting too big or that errors are increasing, but they don't tell you whether there is any impact on the revenue of the site. This is why you must also work with your business analysts to ensure you are collecting business metrics (revenue per hour, sales per minute, abandoned shopping carts, etc.) and that your operations / devops organizations have access to them.
Another problem is that these three KPI groups give you different views of the same request/response flow in your site. In the next blog of this series, I will go over techniques for correlating these metrics and establishing a unified view of your site's performance.
Note: I work in Akamai's Advance Solutions and Services consulting team. In this team, web performance experts like myself help our customers improve the performance of their websites and applications. For help with your site, please contact us at email@example.com