Analyzing web logs
Web logs capture traffic metadata, such as the request time and route, how long the server took to respond, the response size, and so on. Analyzing web logs sheds light on both server performance and client behavior. Yet the common practice of reporting summary statistics (e.g., 95th-percentile latency) often hides interesting patterns! This is because performance varies wildly with the nature of the request, and unusual clients such as bots can easily hide in a sea of “natural” traffic.
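To make the examples in this post concrete, here’s a rough sketch of the data shape we’ll assume: each request parsed into an object with date, route, duration, and size fields. The newline-delimited JSON input and the field names are hypothetical; adapt them to whatever your server actually logs.

```ts
// Hypothetical shape of a parsed log entry, used by the sketches below.
interface Request {
  date: Date;       // when the request arrived
  route: string;    // matched route pattern, e.g. "/document/{id}"
  duration: number; // server response time in milliseconds
  size: number;     // response size in bytes
}

// Parse newline-delimited JSON logs into that shape.
function parseLogs(ndjson: string): Request[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const entry = JSON.parse(line);
      return {
        date: new Date(entry.time),
        route: entry.route,
        duration: entry.duration,
        size: entry.size
      };
    });
}
```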
What if — instead of summarizing — we plotted every request as a dot with time along x→ and latency (on a log scale) along y↑?
Response latency, color by route
The plot above shows a sample of requests.
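Here’s roughly how such a chart might be built with Observable Plot, assuming the parsed logs array from the sketch above:

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { date: Date; route: string; duration: number }[];

// Every request as a dot: time along x, latency along a log-scaled y,
// colored by route.
const chart = Plot.plot({
  width: 960,
  height: 480,
  y: { type: "log", label: "duration (ms)" }, // log scale spreads out fast requests
  color: { legend: true },
  marks: [
    Plot.dot(logs, {
      x: "date",
      y: "duration",
      stroke: "route",
      r: 1,        // tiny dots, so dense regions read as texture
      opacity: 0.4 // transparency reveals overlapping requests
    })
  ]
});

document.body.append(chart);
```

Tiny, semi-transparent dots are the key design choice here: dense regions read as texture rather than a solid blob.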
The detail in this plot is astonishing: it shows the varying performance of different routes, and intriguing temporal patterns. We’ll tease apart these patterns in a bit. First let’s better understand what we’re looking at.
Collapsing x→ (time) gives a more traditional view of latency: a stacked histogram colored by route. This view focuses on performance.
Response latency histogram
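A sketch of this histogram, under the same assumptions: Plot’s binX transform bins requests by duration, and the fill channel stacks the counts by route.

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { route: string; duration: number }[];

// Bin requests by duration; counts stack by route via the fill channel.
const histogram = Plot.plot({
  x: { label: "duration (ms)" },
  y: { label: "requests" },
  color: { legend: true },
  marks: [
    Plot.rectY(logs, Plot.binX({ y: "count" }, { x: "duration", fill: "route" })),
    Plot.ruleY([0])
  ]
});
```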
This analysis lets us focus on optimizing routes that are both slow and popular.
Top routes by total count
Top routes by total duration
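Both rankings are a group transform away. In this sketch, ranking by total duration weights each request by its latency, so a route can rank high by being slow, popular, or both:

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { route: string; duration: number }[];

// Top routes by total count: group by route, count, keep the top 10.
const byCount = Plot.plot({
  marginLeft: 220, // room for long route names
  marks: [
    Plot.barX(
      logs,
      Plot.groupY({ x: "count" }, { y: "route", sort: { y: "x", reverse: true, limit: 10 } })
    )
  ]
});

// Top routes by total duration: sum response times per route instead.
const byDuration = Plot.plot({
  marginLeft: 220,
  marks: [
    Plot.barX(
      logs,
      Plot.groupY(
        { x: "sum" },
        { x: "duration", y: "route", sort: { y: "x", reverse: true, limit: 10 } }
      )
    )
  ]
});
```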
But back to those temporal patterns. These are fascinating because they don’t just show server performance — they show how clients behave “in the wild.”
We can use a dense scatterplot to visualize any quantitative request metric. Below we show response size in bytes along y↑. Response sizes are also important for performance, especially if latency measurements only consider the time it takes the server to send the response and not user-perceived latency across the network.
Response size, color by route
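This is the same dense scatterplot with the y channel swapped from duration to size; a sketch:

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { date: Date; route: string; size: number }[];

// Same scatterplot as before, but with response size (bytes) along y.
const sizes = Plot.plot({
  y: { type: "log", label: "size (bytes)" },
  color: { legend: true },
  marks: [
    Plot.dot(logs, { x: "date", y: "size", stroke: "route", r: 1, opacity: 0.4 })
  ]
});
```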
The response size heatmap also highlights individual paths, visible as horizontal striations.

The daily pattern turns out to be a scraper. By filtering on the affected routes, we can see its periodic behavior more clearly.
Response latency of /documents/{public} and /document/{id}@{version}
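A sketch of that filter, using the two route patterns named in the chart above:

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { date: Date; route: string; duration: number }[];

// Keep only the two document routes to isolate the scraper's pattern.
const documentRoutes = new Set(["/documents/{public}", "/document/{id}@{version}"]);

const filtered = Plot.plot({
  y: { type: "log", label: "duration (ms)" },
  color: { legend: true },
  marks: [
    Plot.dot(
      logs.filter((d) => documentRoutes.has(d.route)),
      { x: "date", y: "duration", stroke: "route", r: 1, opacity: 0.4 }
    )
  ]
});
```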
Above, we color by frequency instead of route. This better reveals the density of requests, though since we can no longer differentiate routes, we can’t infer as much about client behavior. Notice the spike in the early morning hours of January 31. By looking at the associated logs, we determined this to be a bot scanning for vulnerabilities.
Response latency, color by frequency
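Coloring by frequency amounts to binning in two dimensions and mapping counts to color; a sketch using Plot’s rect mark and bin transform:

```ts
import * as Plot from "@observablehq/plot";

declare const logs: { date: Date; duration: number }[];

// Bin in two dimensions and color each cell by how many requests fall
// in it; dense regions stand out where individual dots would overplot.
const frequency = Plot.plot({
  y: { type: "log", label: "duration (ms)" },
  color: { scheme: "YlGnBu", label: "requests" },
  marks: [
    Plot.rect(logs, Plot.bin({ fill: "count" }, { x: "date", y: "duration" }))
  ]
});
```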
Let’s look at a couple more routes of interest. The first is /avatar/{hash}, which serves user avatars.
Response latency of /avatar/{hash}
Unfortunately, avatars are slow. Serving an avatar requires fetching an image from S3 and rescaling it to the requested size. S3 is slow, and images are often large and expensive to resize. Furthermore, avatars are often requested in bulk — for example, an activity feed might show hundreds of avatars! The vertical streaks here represent individual clients spawning many simultaneous requests. We have room for improvement here.
The second is the /documents/@{login} route, shown below.
Response latency of /documents/@{login}
This route is slower than we like, mostly due to complicated permissions that make the underlying database queries difficult to index. But the temporal pattern is interesting: at midnight UTC, latency noticeably increases for an hour or two. We believe an internal scheduled batch job is causing resource contention. We want to optimize this route, too.
Web log analysis has been fruitful for the Observable team, helping us prioritize optimization work and manage traffic. Using these granular heatmaps, we’ve identified numerous opportunities for improvement that would otherwise have gone unnoticed.