Beijing Summer Olympics, 2008 - I remember the butterflies in my stomach as Michael Phelps touched the wall to win his 7th gold in the 100m butterfly. He set an Olympic record of 50.58, but everyone was biting their nails because the difference between winning and losing was just a hundredth of a second.
That's not just true for sports anymore. Digital businesses today win or lose by knowing the right things at the right time, and that, too, can come down to a matter of seconds.
As businesses evolve into high-throughput dev shops pushing code multiple times a day, rather than once or twice a year as in the early days, programmatic access to real-time log data becomes paramount to enabling their high-velocity, streamlined development and operational workflows.
Businesses need to know what's happening at the 'edge' of their networks in real time and bake that information into a holistic system health monitoring view across their sites, with integrated log feeds from other layers of the stack. This visibility and control become increasingly indispensable as content and application logic progressively move to the edge, away from congested origins.
In the following sections, I will address three ways we see leading organizations implement real-time logging -
For Holistic Site Health View in Real Time
When downtime is expensive, from a revenue as well as a brand perspective, organizations invest in an always-on, continuous monitoring system for operational health.
For end-to-end visibility at the HTTP layer, most organizations invest in Application Performance Monitoring (APM) tools on the server side, real user monitoring and customer analytics on the browser side, and real-time logs for the health and effectiveness of their CDN-powered middle mile.
These tools help organizations answer questions such as: How did the backend respond to the end user's browser activity? What are the average latencies between origin, edge, and end users? How many requests, and how much content, are served from the edge cache, and how many go to origin? How much room is there to fine-tune offload configuration and optimize cache utilization?
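To make the offload question concrete, here is a minimal sketch of computing a cache hit ratio and average origin latency from a batch of CDN log records. The field names (`cacheStatus`, `turnAroundTimeMSec`) are illustrative assumptions, not an exact DataStream schema.

```python
# Sketch: deriving offload metrics from a batch of CDN log records.
# Field names are assumptions for illustration, not a real log schema.

def offload_metrics(records):
    """Return cache hit ratio and average origin latency for a log batch."""
    total = len(records)
    hits = sum(1 for r in records if r["cacheStatus"] == "hit")
    # Only misses travel to origin, so only they contribute origin latency.
    origin_latencies = [r["turnAroundTimeMSec"] for r in records
                       if r["cacheStatus"] == "miss"]
    return {
        "cache_hit_ratio": hits / total if total else 0.0,
        "avg_origin_latency_ms": (sum(origin_latencies) / len(origin_latencies)
                                  if origin_latencies else 0.0),
    }

sample = [
    {"cacheStatus": "hit", "turnAroundTimeMSec": 2},
    {"cacheStatus": "hit", "turnAroundTimeMSec": 3},
    {"cacheStatus": "miss", "turnAroundTimeMSec": 180},
    {"cacheStatus": "miss", "turnAroundTimeMSec": 220},
]
metrics = offload_metrics(sample)
```

A dashboard plotting `cache_hit_ratio` per minute makes offload regressions visible at a glance.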
These data streams can easily be turned into customized dashboards by piping them into a logging tool such as Sumo Logic or Datadog, an open-source tool such as Grafana, or a data warehouse such as Google BigQuery.
The dashboards are handy for getting a feel for your traffic and the overall pulse of site health and performance right after deploying changes, or during expected seasonal traffic spikes. These are the times when you expect things to break, and you need real-time visibility and control for immediate remediation.
Here's a relevant story in which a publisher caught a smoking gun with real-time logs. Most publishers push new content for their readers every few hours. As part of each update, they need to invalidate the edge caches. They can easily and intelligently automate content invalidation by URL using the Akamai CLI for Purge. Ideally, a purge completes, edge caches repopulate with new content, and performance metrics stabilize within minutes. In one instance, however, DataStream's real-time log streams kept showing high cache-miss rates for an extended duration. The operations team immediately dug into the purge logs and found a minor syntax error that could have led to major downstream issues had it not been discovered in time.
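The post-purge check in this story can be sketched as a simple recovery test over per-minute cache-miss ratios. The baseline and settle window below are illustrative assumptions, not DataStream defaults.

```python
# Sketch: flagging a purge whose cache-miss rate never recovers.
# The baseline miss rate and settle window are illustrative assumptions.

def purge_recovered(miss_rates, baseline=0.10, settle_windows=3):
    """miss_rates: per-minute cache-miss ratios observed after a purge.

    Returns True once the miss rate has stayed at or below `baseline`
    for `settle_windows` consecutive minutes, i.e. edge caches have
    repopulated; False if it stays elevated through the whole sample.
    """
    streak = 0
    for rate in miss_rates:
        streak = streak + 1 if rate <= baseline else 0
        if streak >= settle_windows:
            return True
    return False

# A healthy purge: the miss rate spikes, then settles.
healthy = [0.9, 0.6, 0.3, 0.08, 0.05, 0.04]
# The incident above: the miss rate stays elevated.
stuck = [0.9, 0.85, 0.8, 0.82, 0.79, 0.81]
```

Wiring this check to an alert turns "the dashboard looked wrong" into an automated signal minutes after every purge.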
As Proactive Alerting Systems
Streaming, storing, and processing logs for every request/response cycle can be overwhelming, not to mention expensive. It also makes little sense to invest time and money in log lines for periods when everything is working fine. This is where aggregation on meaningful metrics offers a significant benefit.
Most organizations set up an automated API push for aggregated metrics and pipe them into their SIEM or log analytics engine of choice, such as Sumo Logic or Datadog, to trigger anomaly alerts in real time, for example when the count of HTTP error codes exceeds a predefined threshold. Operators stay informed and can take immediate remedial action.
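A threshold check of this kind is only a few lines. Here is a minimal sketch over aggregated per-minute metrics; the metric shape and the 5% threshold are illustrative assumptions.

```python
# Sketch: a simple threshold alert over aggregated per-minute metrics.
# The metric shape and the 5% threshold are illustrative assumptions.

def check_error_threshold(window, threshold=0.05):
    """window: per-minute dicts with request and 5xx counts.

    Returns alert messages for minutes whose 5xx rate exceeds threshold."""
    alerts = []
    for minute in window:
        rate = minute["5xx"] / minute["requests"] if minute["requests"] else 0.0
        if rate > threshold:
            alerts.append(f"{minute['ts']}: 5xx rate {rate:.1%} "
                          f"exceeds {threshold:.0%}")
    return alerts

window = [
    {"ts": "12:00", "requests": 10_000, "5xx": 120},   # 1.2% - fine
    {"ts": "12:01", "requests": 10_000, "5xx": 900},   # 9.0% - alert
]
alerts = check_error_threshold(window)
```

In practice the same rule would be expressed inside the SIEM or analytics tool rather than in custom code, but the logic is identical.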
For one of our customers, a new deployment caused broken internal links, impacting end users as well as search engine crawlers. The number of 404 errors jumped, and 2xx responses went down. This early warning system allowed them to intervene immediately in response to the alerts.
When alerted, developers can pull raw logs to drill down to the cause and correlate them with data from other layers of the stack for the period leading up to the anomaly. More advanced use cases include not only threshold-violation alerts, but also anomaly scoring or pattern detection to surface unusual error patterns compared to a window in the past.
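One common form of anomaly scoring is a z-score of the current window against comparable past windows. This is a minimal sketch under that assumption; the three-sigma cutoff is a conventional illustrative choice, not a product setting.

```python
# Sketch: scoring the current error count against a historical window.
# The z-score approach and the 3-sigma cutoff are illustrative choices.
import statistics

def anomaly_score(history, current):
    """history: error counts from comparable past windows (e.g. the same
    minute on previous days). Returns how many standard deviations the
    current count sits from the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # avoid division by zero
    return (current - mean) / stdev

history = [40, 42, 38, 41, 39, 40]  # typical per-minute 404 counts
```

A count of 41 scores well under 3 sigma and is ignored, while a sudden jump to 200 scores far above it and would page the on-call engineer.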
For customers who need to stream, store, and aggregate raw logs themselves, push APIs enable a low-cost, scalable serverless architecture with no servers forever polling APIs for data. Log collection runs automatically and regularly, with DataStream pushing raw log streams through the processing pipeline and providing the controls to customize log fields and turn individual streams on or off.
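The receiving end of such a push architecture can be as small as a single serverless function. Below is a hedged sketch of a handler that accepts a pushed batch of newline-delimited JSON log lines and forwards each record; the event shape and payload format are assumptions for illustration, not the DataStream push format.

```python
# Sketch: a serverless-style handler for pushed log batches.
# The event shape and newline-delimited JSON payload are assumptions,
# not the actual DataStream push format.
import json

def handle_push(event, sink):
    """Parse a pushed batch of JSON log lines and forward each record."""
    records = [json.loads(line)
               for line in event["body"].splitlines() if line]
    for record in records:
        sink.append(record)  # stand-in for a queue or warehouse write
    return {"statusCode": 200, "processed": len(records)}

sink = []
body = '{"status": 200}\n{"status": 404}\n'
result = handle_push({"body": body}, sink)
```

Because the function runs only when a batch arrives, there is nothing to keep warm or poll, which is what makes the push model cheap at idle.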
The flexibility to choose between push and pull architectures, aggregation at the source, and access to raw logs (up to 24 hours in the past) without having to store them locally are a few characteristics to look for in real-time logging APIs.
To Help Their Teams Work Together
Large organizations often have fragmented development teams building code that must be pieced together, and those teams rarely have the same operational visibility of the overall system that Ops teams do. The goal is to give Dev and Ops the same visibility into how the individual parts each team develops and owns behave in the larger ecosystem, translated into metrics such as error rates (4xx and 5xx) by URL pattern or user agent. This single pane of glass, traced back to code ownership, augments DevOps agility.
User analytics engines can perform detailed aggregation on raw log streams by attaching useful qualifiers such as URL pattern IDs or user agents. These qualifiers categorize the logs by page group (product pages, search page, category page, etc.) so that they can be piped to the respective code owners on the dev teams, building meaningful alerting for the right people.
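The routing step above can be sketched as a small tagging function that maps URL patterns to page groups and owning teams. The patterns, group names, and team names below are hypothetical.

```python
# Sketch: tagging log records with a page group and owning team so
# alerts reach the right people. All patterns and names are hypothetical.
import re

PAGE_GROUPS = [
    (re.compile(r"^/product/"), "product-pages", "catalog-team"),
    (re.compile(r"^/search"), "search-page", "search-team"),
    (re.compile(r"^/category/"), "category-page", "catalog-team"),
]

def tag_record(record):
    """Attach page_group and owner fields based on the request path."""
    for pattern, group, owner in PAGE_GROUPS:
        if pattern.match(record["path"]):
            return {**record, "page_group": group, "owner": owner}
    return {**record, "page_group": "other", "owner": "platform-team"}

tagged = tag_record({"path": "/search?q=shoes", "status": 404})
```

With every record carrying an `owner` field, a spike of 404s on search pages alerts the search team directly instead of landing in a shared Ops queue.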
At the end of the day, all digital businesses share the same high-level challenge: to build differentiated apps and services and bring them to their users faster. Sounds simple? Not really, because the systems that enable this goal have many moving parts. IT teams often worry about losing visibility and control when they add a cloud-based solution to the delivery chain. At Akamai, we are constantly striving to put that control back in your hands while snapping right into your CI/CD workflows.
Akamai DataStream, which provides near real-time, middle-mile visibility through customized data logs and aggregated metrics on CDN health, latency, offload, errors, and events, is a big step in this direction.
Yes, it can be a matter of seconds between winning and losing, and we at Akamai are rooting for you to have the 'Winning Edge'.
Learn more about Akamai DataStream by watching a Demo HERE.