Analyzing web logs
Web logs capture traffic metadata, such as the request time and route, how long the server took to respond, the response size, and so on. Analyzing web logs sheds light on both server performance and client behavior. Yet the common practice of reporting summary statistics (e.g., 95th-percentile latency) often hides interesting patterns! This is because performance varies wildly with the nature of the request, and unusual clients such as bots can easily hide in a sea of “natural” traffic.
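To make the examples in this post concrete, here’s a rough sketch of the data shape we’ll assume: each request parsed into an object with date, route, duration, and size fields. The newline-delimited JSON input and the field names are hypothetical; adapt them to whatever your server actually logs.

```ts
// Hypothetical shape of a parsed log entry, used by the sketches below.
interface Request {
  date: Date;       // when the request arrived
  route: string;    // matched route pattern, e.g. "/document/{id}"
  duration: number; // server response time in milliseconds
  size: number;     // response size in bytes
}

// Parse newline-delimited JSON logs into that shape.
function parseLogs(ndjson: string): Request[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const entry = JSON.parse(line);
      return {
        date: new Date(entry.time),
        route: entry.route,
        duration: entry.duration,
        size: entry.size
      };
    });
}
```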
What if — instead of summarizing — we plotted every request as a dot with time along x→ and latency (on a log scale) along y↑?
Response latency, color by route
The plot above shows a sample of requests.
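Here’s roughly how such a chart might be built with Observable Plot, assuming the parsed logs array from the sketch above:

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { date: Date; route: string; duration: number }[];

// Every request as a dot: time along x, latency along a log-scaled y,
// colored by route.
const chart = Plot.plot({
  width: 960,
  height: 480,
  y: { type: "log", label: "duration (ms)" }, // log scale spreads out fast requests
  color: { legend: true },
  marks: [
    Plot.dot(logs, {
      x: "date",
      y: "duration",
      stroke: "route",
      r: 1,        // tiny dots, so dense regions read as texture
      opacity: 0.4 // transparency reveals overlapping requests
    })
  ]
});

document.body.append(chart);
```

Tiny, semi-transparent dots are the key design choice here: dense regions read as texture rather than a solid blob.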
The detail in this plot is astonishing: it shows the varying performance of different routes, and intriguing temporal patterns. We’ll tease apart these patterns in a bit. First let’s better understand what we’re looking at.
Collapsing x→ (time) gives a more traditional view of latency: a stacked histogram colored by route. This view focuses on performance.
Response latency histogram
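A sketch of this histogram, under the same assumptions: Plot’s binX transform bins requests by duration, and the fill channel stacks the counts by route.

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { route: string; duration: number }[];

// Bin requests by duration; counts stack by route via the fill channel.
const histogram = Plot.plot({
  x: { label: "duration (ms)" },
  y: { label: "requests" },
  color: { legend: true },
  marks: [
    Plot.rectY(logs, Plot.binX({ y: "count" }, { x: "duration", fill: "route" })),
    Plot.ruleY([0])
  ]
});
```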
This analysis lets us focus on optimizing routes that are both slow and popular.
Top routes by total count
Top routes by total duration
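Both rankings are a group transform away. In this sketch, ranking by total duration weights each request by its latency, so a route can rank high by being slow, popular, or both:

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { route: string; duration: number }[];

// Top routes by total count: group by route, count, keep the top 10.
const byCount = Plot.plot({
  marginLeft: 220, // room for long route names
  marks: [
    Plot.barX(
      logs,
      Plot.groupY({ x: "count" }, { y: "route", sort: { y: "x", reverse: true, limit: 10 } })
    )
  ]
});

// Top routes by total duration: sum response times per route instead.
const byDuration = Plot.plot({
  marginLeft: 220,
  marks: [
    Plot.barX(
      logs,
      Plot.groupY(
        { x: "sum" },
        { x: "duration", y: "route", sort: { y: "x", reverse: true, limit: 10 } }
      )
    )
  ]
});
```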
But back to those temporal patterns. These are fascinating because they don’t just show server performance — they show how clients behave “in the wild.”
We can use a dense scatterplot to visualize any quantitative request metric. Below we show response size in bytes along y↑. Response sizes are also important for performance, especially if latency measurements only consider the time it takes the server to send the response and not user-perceived latency across the network.
Response size, color by route
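This is the same dense scatterplot with the y channel swapped from duration to size; a sketch:

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { date: Date; route: string; size: number }[];

// Same scatterplot as before, but with response size (bytes) along y.
const sizes = Plot.plot({
  y: { type: "log", label: "size (bytes)" },
  color: { legend: true },
  marks: [
    Plot.dot(logs, { x: "date", y: "size", stroke: "route", r: 1, opacity: 0.4 })
  ]
});
```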
The response size heatmap also highlights individual paths, visible as horizontal striations.

The daily pattern turns out to be a scraper. By filtering on the affected routes, we can see its periodic behavior more clearly.
Response latency of /documents/{public} and /document/{id}@{version}
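A sketch of that filter, using the two route patterns named in the chart above:

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { date: Date; route: string; duration: number }[];

// Keep only the two document routes to isolate the scraper's pattern.
const documentRoutes = new Set(["/documents/{public}", "/document/{id}@{version}"]);

const filtered = Plot.plot({
  y: { type: "log", label: "duration (ms)" },
  color: { legend: true },
  marks: [
    Plot.dot(
      logs.filter((d) => documentRoutes.has(d.route)),
      { x: "date", y: "duration", stroke: "route", r: 1, opacity: 0.4 }
    )
  ]
});
```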
Above, we color by frequency instead of route. This better reveals the density of requests, though since we can no longer differentiate routes, we can’t infer as much about client behavior. Notice the spike in the early morning hours of January 31. By looking at the associated logs, we determined this to be a bot scanning for vulnerabilities.
Response latency, color by frequency
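Coloring by frequency amounts to binning in two dimensions and mapping counts to color; a sketch using Plot’s rect mark and bin transform:

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { date: Date; duration: number }[];

// Bin in two dimensions and color each cell by how many requests fall
// in it; dense regions stand out where individual dots would overplot.
const frequency = Plot.plot({
  y: { type: "log", label: "duration (ms)" },
  color: { scheme: "YlGnBu", label: "requests" },
  marks: [
    Plot.rect(logs, Plot.bin({ fill: "count" }, { x: "date", y: "duration" }))
  ]
});
```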
Let’s look at a couple more routes of interest. The first is /avatar/{hash}, which serves user avatars.
Response latency of /avatar/{hash}
Unfortunately, avatars are slow. Serving an avatar requires fetching an image from S3 and rescaling it to the requested size. S3 is slow, and images are often large and expensive to resize. Furthermore, avatars are often requested in bulk — for example, an activity feed might show hundreds of avatars! The vertical streaks here represent individual clients spawning many simultaneous requests. We have room for improvement here.
The second is the /documents/@{login} route, shown below.
Response latency of /documents/@{login}
This route is slower than we like, mostly due to complicated permissions that make the underlying database queries difficult to index. But the temporal pattern is interesting: at midnight UTC, latency noticeably increases for an hour or two. We believe an internal scheduled batch job is causing resource contention. We want to optimize this route, too.
Web log analysis has been fruitful for the Observable team, helping us prioritize optimization work and manage traffic. Using these granular heatmaps, we’ve identified numerous opportunities for improvement that would otherwise have gone unnoticed.