Frontend API Instrumentation: A Deep Dive into User-Centric Monitoring
Decoding Frontend API Instrumentation: What It Is and Why It Matters
In today's digital age, the performance and reliability of web applications play a pivotal role in ensuring a seamless user experience. As developers and operations teams strive to optimize and monitor their applications, there's a growing emphasis on understanding not just the backend mechanics, but also how these applications perform from the user's perspective. Enter the realm of frontend API instrumentation—a powerful approach to gauging the real-world performance and behavior of APIs as experienced by the end-users.
While backend monitoring gives us insights into server health, database performance, and internal service interactions, it's only one piece of the puzzle. To truly grasp the user experience, we need to dive into frontend instrumentation, which captures metrics from the client's vantage point. From measuring actual response times seen at the client end to tracking the errors users encounter, frontend API instrumentation provides invaluable data that can shape our optimization strategies.
In this blog, we'll explore the nuances of frontend API instrumentation, its significance, and how tools like Prometheus and Grafana can be leveraged to visualize these metrics. We will take a technical deep dive into our frontend API monitoring architecture. We'll also discuss the differences between frontend and backend instrumentation, shedding light on why both are essential for a holistic understanding of application performance.
Some Background
We have a module called Inbox which live agents use to respond to end customers. Systems serving live agents generally have a very low tolerance for service disruption, because any downtime directly affects ongoing customer conversations.
Before we implemented our frontend API telemetry solution, when a customer reported an error, the only data point we had came from our backend monitoring system. The great thing about the backend monitoring system is that it can tell you, in granular detail, what is going on inside our backend services. But as soon as a packet leaves our infrastructure, we lose all visibility into how it arrives at the customer's system.
Hence, while everything might look green on our backend monitoring system, unwanted network latency added by our proxy, cloud service provider, or the customer's ISP remains in our blind spot. In light of that, our team started on a solution to increase our visibility around API monitoring.
Technical problems aside, the lack of API telemetry coverage also caused operational hazards:
A lot of time wasted triaging issues reported by our customers
Customer dissatisfaction
Wasted product development bandwidth and focus spent looking at the wrong areas
Significance of Frontend API Instrumentation
In the intricate web of application development and monitoring, frontend API instrumentation emerges as a beacon of clarity. It's the lens through which developers can view the real-world performance and behavior of their applications from the user's perspective. While backend monitoring offers insights into server operations, it often misses the nuances of user experience, network latency, and client-side processing. Frontend instrumentation fills this gap.
By focusing on the frontend, we can capture metrics that truly matter to the end-users. For instance, even if a server processes a request swiftly, the user might still experience delays due to network issues or client-side rendering. Only through frontend instrumentation can such discrepancies be identified and addressed.
Frontend API telemetry can help us with the following:
User-Centric Monitoring: Frontend instrumentation offers insights directly related to the user experience, ensuring that applications meet user expectations.
Identifying Discrepancies: It helps in pinpointing differences between server-side performance and actual user-perceived performance.
Filling the Blind Spots: With the increasing complexity of web applications, frontend metrics ensure no performance issue goes unnoticed.
Network Latency Insights: Captures the impact of network conditions on application performance, which backend metrics might overlook.
Holistic Understanding: When combined with backend metrics, frontend instrumentation provides a comprehensive view of application health and areas for optimization.
The Architecture of the System
Overview
We have a web application where our users log in and access different modules of the platform. The frontend telemetry SDK is integrated with this web application. The SDK collects metrics from the web application and reports them to a custom backend metric aggregator. The aggregator then exposes a metrics endpoint which is scraped by a Prometheus instance. We use Grafana to query the data collected by Prometheus. The Grafana stack is also used to set up alerts on the API metrics, and the alerts are delivered to Slack.
Frontend SDK
The actual frontend SDK is bundled as a Web Worker. The main web application reports data to the Web Worker using postMessage. The SDK runs a custom-built Prometheus client that converts each event received via postMessage into predefined Prometheus metrics. The worker stores the collected metrics in memory and sends them to a backend metric aggregator every 15 seconds. Once the data is relayed to the backend service, the in-memory metric data is flushed out.
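As an illustration, the worker side might look roughly like this; the message shape, metric handling, and aggregator URL are assumptions for the sketch, not our actual SDK:

```typescript
// worker.ts -- illustrative sketch of the SDK's Web Worker, not the real implementation.

type MetricEvent = {
  name: string;                    // e.g. "frontend_api_response_time_ms" (assumed name)
  type: "histogram" | "counter";
  value: number;                   // latency in ms, or the increment for a counter
  labels: Record<string, string>;  // e.g. { projectId, path, statusCode }
};

// In-memory store, keyed by metric name + label combination.
// Real histograms also track buckets; sum/count keeps this sketch small.
const store = new Map<string, MetricEvent & { sum: number; count: number }>();

self.onmessage = (msg: MessageEvent<MetricEvent>) => {
  const event = msg.data;
  const key = `${event.name}|${JSON.stringify(event.labels)}`;
  const entry = store.get(key) ?? { ...event, sum: 0, count: 0 };
  entry.sum += event.value;
  entry.count += 1;
  store.set(key, entry);
};

// Every 15 seconds, relay the accumulated metrics to the aggregator and flush
// the in-memory data so the garbage collector can reclaim it.
setInterval(async () => {
  if (store.size === 0) return;
  const payload = [...store.values()];
  store.clear();
  try {
    // Hypothetical aggregator endpoint.
    await fetch("https://metrics.example.com/report", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    });
  } catch {
    // Reporting is best-effort; losing one 15-second window is acceptable.
  }
}, 15_000);
```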
The main web application interacts with the Web Worker started by the SDK. The underlying framework uses the popular JavaScript library Axios to make API calls to our backend system. One cool thing about Axios is that it allows us to add request and response interceptors.
We added a request interceptor that extracts certain metadata from the request headers and attaches the API call start time to the request config. We also added a response interceptor that uses this metadata to compute the latency, records the response status code, and, if the request failed, captures the error code. This metadata about the request is then posted to the Web Worker.
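A minimal sketch of this interceptor setup, assuming a worker handle and the message shape from the worker sketch above (the `meta` field and metric names are illustrative):

```typescript
import axios from "axios";

// Hypothetical handle to the Web Worker started by the SDK.
const metricsWorker = new Worker(new URL("./worker.js", import.meta.url));

// Request interceptor: stamp the API call start time onto the request config.
axios.interceptors.request.use((config) => {
  (config as any).meta = { startTime: performance.now() };
  return config;
});

// Response interceptor: compute the latency from the stamped start time and
// post the result (plus status code, or error details on failure) to the worker.
axios.interceptors.response.use(
  (response) => {
    const start = (response.config as any).meta?.startTime ?? performance.now();
    metricsWorker.postMessage({
      name: "frontend_api_response_time_ms", // assumed metric name
      type: "histogram",
      value: performance.now() - start,
      labels: { path: response.config.url ?? "", statusCode: String(response.status) },
    });
    return response;
  },
  (error) => {
    metricsWorker.postMessage({
      name: "frontend_api_error_count", // assumed metric name
      type: "counter",
      value: 1,
      labels: {
        path: error.config?.url ?? "",
        statusCode: String(error.response?.status ?? "network_error"),
        errorType: error.code ?? "unknown",
      },
    });
    return Promise.reject(error);
  }
);
```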
Metric Aggregator
The backend metric collector is a simple NodeJS application that receives the metrics data from the frontend. The frontend SDK is generic and sends telemetry data for every API called through Axios, so the backend filters the reported metrics using a set of custom filters. After filtering, the backend stores the metrics in memory. The backend service also aggregates the metrics reported from the frontend to avoid storing incorrect data in Prometheus; more about the aggregation will be discussed later in the article. Finally, the service exposes an endpoint for Prometheus to scrape. Prometheus scrapes the metric endpoint every 30s, and once it does, the data in memory is flushed.
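A simplified sketch of the aggregator's shape, using Express; the filtering rule, port, and exposition format are illustrative, and a real implementation would also emit proper histogram buckets and type metadata:

```typescript
import express from "express";

// One aggregated series per metric name + label combination.
type Sample = { name: string; labels: Record<string, string>; sum: number; count: number };
const samples = new Map<string, Sample>();

const app = express();
app.use(express.json());

// Receive metric batches from the frontend SDK.
app.post("/report", (req, res) => {
  for (const incoming of req.body as Sample[]) {
    // Illustrative custom filter: keep only the metrics we care about.
    if (!incoming.name.startsWith("frontend_")) continue;
    const key = `${incoming.name}|${JSON.stringify(incoming.labels)}`;
    const existing = samples.get(key);
    if (existing) {
      // Aggregate with previously reported data instead of overwriting it
      // (see the aggregation discussion later in the article).
      existing.sum += incoming.sum;
      existing.count += incoming.count;
    } else {
      samples.set(key, { ...incoming });
    }
  }
  res.sendStatus(204);
});

// Expose the data in Prometheus' text exposition format.
// Prometheus scrapes this every 30s; the in-memory data is flushed afterwards.
app.get("/metrics", (_req, res) => {
  const lines: string[] = [];
  for (const s of samples.values()) {
    const labels = Object.entries(s.labels)
      .map(([k, v]) => `${k}="${v}"`)
      .join(",");
    lines.push(`${s.name}_sum{${labels}} ${s.sum}`);
    lines.push(`${s.name}_count{${labels}} ${s.count}`);
  }
  samples.clear();
  res.type("text/plain").send(lines.join("\n") + "\n");
});

app.listen(9091);
```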
Collected Metrics
Histograms are used for all latency-related metrics, such as API response time and WebSocket ping latency, while counters are used for tracking error counts, request counts, and so on.
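Our SDK uses a custom-built Prometheus client, but the distinction is easiest to see with the standard prom-client library; the metric names, bucket boundaries, and label values below are illustrative:

```typescript
import { Histogram, Counter } from "prom-client";

// Latency metrics are histograms, so percentiles can be computed later in Grafana.
const apiResponseTime = new Histogram({
  name: "frontend_api_response_time_ms", // assumed metric name
  help: "API latency as observed in the browser",
  labelNames: ["projectId", "path"],
  buckets: [50, 100, 250, 500, 1000, 2500, 5000], // illustrative bucket boundaries
});

// Error counts are plain counters that only ever go up.
const apiErrorCount = new Counter({
  name: "frontend_api_error_count", // assumed metric name
  help: "API errors observed in the browser",
  labelNames: ["projectId", "path", "statusCode", "errorType"],
});

// Example observations:
apiResponseTime.labels("proj-42", "/v1/messages").observe(320);
apiErrorCount.labels("proj-42", "/v1/messages", "502", "server_error").inc();
```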
Primarily, we collect the following metrics:
API response time: This is one of the key metrics we use to monitor end-customer experience for our most critical APIs. The metric carries project ID and API path labels to allow granular data inspection.
Error count: We also track error counts as a Prometheus counter metric. This allows us to understand what kinds of errors our end users are facing, and we have alerts set up on this metric. The error count metric carries project ID, API path, status code, and error type labels to allow detailed investigation.
WebSocket ping latency: As our web app supports real-time communication, we also keep tabs on the latency of our real-time messaging servers. This metric is generally a good indicator of network bottlenecks, and it gives us a rough baseline of our users' network latency.
API call throughput: This is a metric derived from our API response time metric, which also records the API call count. It allows us to understand the load pattern originating from our users' browsers, as shown in the query below.
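Because a Prometheus histogram automatically tracks a `_count` series, throughput can be derived with a query along these lines (metric name assumed from the sketch above):

```promql
# Requests per second per API path, derived from the latency histogram's count.
sum by (path) (rate(frontend_api_response_time_ms_count[5m]))
```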
Dashboards and Alerting
We use the Grafana stack for creating beautiful dashboards to present the collected metric data. The dashboards allow us to filter the data by region, project ID, path, etc.
We also use the alerting stack provided by Grafana to set alerts on our dashboard panels. The alerts are then channeled to the respective Slack channels for our engineers to take corrective action.
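As an example of the kind of rule this enables, an alert could fire on an elevated client-side error rate; the threshold and metric name here are purely illustrative:

```promql
# Alert when a project sees more than one frontend API error per second,
# averaged over the last five minutes.
sum by (projectId) (rate(frontend_api_error_count[5m])) > 1
```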
Challenges
Most of our challenges revolved around protecting the performance of the web application:
SDK Resource Footprint: One of the most important considerations in building the frontend SDK was keeping its resource footprint (CPU and memory) as low as possible. Since the SDK runs in a separate Web Worker, its execution is inherently isolated from the main thread, which keeps processing efficient.
User Experience: Ensuring no disruption to the user's experience from the integration of the new SDK.
Data Reporting Frequency: The SDK reports data to the backend every 15 seconds. This interval was chosen heuristically: not so long that we miss data points, yet not so short that we create noise. In our experiments, 15 seconds sat right in the sweet spot.
Memory Management: Data collected for metrics is flushed out after each report, allowing the GC to reclaim the unused memory.
Adopting Prometheus
Prometheus stands out as one of the most popular databases for storing telemetry data. It's not only well-accepted but also deeply understood by the SRE and developer community. While we initially toyed with the idea of using other time-series databases like TimescaleDB or ClickHouse, we ultimately settled on Prometheus. This decision was driven by its ease of use and its alignment with the type of metrics we aimed to track.
However, implementing Prometheus was not as straightforward as it might seem. Those familiar with Prometheus might already sense the challenges we faced. Collecting telemetry data for thousands of users from their browsers running our custom Prometheus client is similar to collecting telemetry data from thousands of ephemeral servers. At its core, Prometheus operates on a pull-based system. This means that instead of pushing data to Prometheus, it's Prometheus that pulls data from its sources. Given the nature of browsers, it's impractical to scrape data directly from clients running in them. This is where our metric aggregator steps in.
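On the Prometheus side, the aggregator is just another scrape target; the job name and address below are placeholders, while the interval matches the 30-second scrape cadence described earlier:

```yaml
scrape_configs:
  - job_name: "frontend-metric-aggregator"  # placeholder job name
    scrape_interval: 30s                    # matches the cadence described earlier
    static_configs:
      - targets: ["metric-aggregator:9091"] # placeholder address
```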
Browsers are dynamic environments: users can close tabs, open new ones, or refresh pages at any moment. Consequently, every time our browser SDK reports data, its metrics start from scratch, which is problematic because handling these resets naively would skew the reported data. To address this, our metric aggregator backend plays a crucial role. It keeps track of the data reported by the browser clients and, rather than starting afresh, aggregates each new report (for the same set of labels) with previously reported metrics, storing the combined data in memory.
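In effect, the aggregator treats each browser report as a delta and folds it into a running total for that label set. A minimal sketch of that fold, reusing the `Sample` shape from the aggregator sketch above:

```typescript
type Sample = { name: string; labels: Record<string, string>; sum: number; count: number };

// Fold a freshly reported sample into the running totals for its label set.
// Because each browser report starts from zero, it is treated as a delta and
// added to what has already been seen, rather than overwriting the stored value.
function mergeSample(samples: Map<string, Sample>, incoming: Sample): void {
  const key = `${incoming.name}|${JSON.stringify(incoming.labels)}`;
  const existing = samples.get(key);
  if (!existing) {
    samples.set(key, { ...incoming });
    return;
  }
  existing.sum += incoming.sum;     // accumulate histogram sums / counter values
  existing.count += incoming.count; // accumulate observation counts
}
```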
Our Learnings
Backend metrics do not tell the full story
You only get half the story from your backend observability stack. While it can tell you a great deal about what is going on inside the server and what is causing an issue, it cannot tell you the impact a degraded network has on user experience. The API metrics collected at the frontend have allowed us to close the loop, giving us the full spectrum of data needed to make informed decisions.
Isolating the problem
With API telemetry data available from both the frontend and the backend, segregated by project ID and region, we can quickly determine whether a problem is caused by our backend or is an isolated incident disrupting only a fraction of our users.
Fast support turnaround time
With the full spectrum of data available, our on-call engineers and support team can now look at the frontend telemetry data and work with our customers toward a faster resolution. Previously, a significant amount of time was spent digging through various APM and infrastructure dashboards to identify the root cause of an issue.
Designing UX with latency in mind
With this system in place, the team has access to the latency observed by our end customers, which allows the entire product team to craft solutions with the end-user experience in mind.
Way Forward
Our vision for the future of frontend telemetry is not just about collecting data, but about harnessing that data to create tangible, positive impacts on user experience.
Imagine a world where our applications are so attuned to their environment that they can provide visual cues to users the moment network degradation is detected, ensuring transparency and trust. We're committed to refining our frontend SDK, pushing the boundaries of performance to ensure that our telemetry tools are not just observers but active enhancers of the user experience.
Furthermore, by adopting OpenTelemetry standards, we're looking to elevate our telemetry practices. This move promises better integration, richer visibility, and a seamless bridge between our frontend metrics and the broader ecosystem of application performance monitoring. But we won't stop there. Our ambition extends to diving deeper into the intricacies of frontend performance, capturing a plethora of metrics that will offer us a crystal-clear understanding of the end-user experience.
The path ahead is filled with challenges, innovations, and opportunities. But with a clear vision and unwavering commitment, we're poised to redefine the standards of frontend telemetry, ensuring that every user interaction is not just monitored but optimized, celebrated, and enhanced.