Web performance can mean a lot of different things to a lot of different people. Fundamentally, it's a question of how fast a web page is. But fast to whom?
When this page loaded moments ago, was it fast? If so, congratulations, you had a fast experience. So ask yourself, does that make this a fast page? Not so fast! Just because you had a fast experience doesn't mean everyone else does too. You might even revisit this page and have yourself a slow experience.
Let's say that you and everyone else who load this page all have fast experiences. Surely that makes it a fast page, right? Most people would agree. Hypothetically, what if everyone's internet speeds get 100x slower overnight? Now, all experiences on this page are suddenly slow. Is this page, which is byte-for-byte identical to what it was yesterday, still fast?
Fast is a concept that exists in the minds of users as they browse the web. It's not that the page is fast—the experience is fast.
OK, that's enough philosophy. Why does it matter? Because there's a difference between a page that's built for speed and a page that feels fast. A svelte page could feel slow to someone having network issues. A heavily unoptimized page could feel fast to someone on high-end hardware. The mix of those types of users can determine how fast the page feels in aggregate, even more than how well optimized it actually is.
How you measure a web page's performance determines which of the two you learn: whether it's built for speed or whether it feels fast. The two kinds of measurement tools are called lab and field tools. Lab tools are the microscopes that inspect a page for all possible points of friction. Field tools are the binoculars that give you an overview of how users are experiencing the page.
A lab tool like WebPageTest or Lighthouse can tell you thousands of facts about how the page was built and how quickly the page loaded from its perspective. This makes lab tools irreplaceable for inspecting and diagnosing performance issues. You can visualize every step of the page load and drill down into what's holding it up. Lab tools can even make informed recommendations for things they think you should fix, saving you the investigative time and effort. But despite their advantages, lab tools can lead you astray in subtle ways.
Similar to the problem of your fast experience not necessarily reflecting everyone else's, your lab test might differ from most users in two important ways: access and behavior. A lab tool accesses a web page from a specific hardware and network configuration, which can greatly affect the page's loading performance. A lab tool might not behave the way real users do either. For example, the test might not be logged in, and it might not scroll the page after it loads or click on any buttons.
This problem is becoming more and more apparent as developers rightly focus on user-centric metrics. Core Web Vitals represent a few distinct aspects of a good user experience: loading performance, input responsiveness, and layout stability. These are measured by Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS) respectively. So what could go wrong with measuring these metrics in the lab?
LCP is the time at which the largest piece of content is rendered on screen. The times at which things load can be highly dependent on how fast the network is, so the lab configuration can produce wildly different LCP values based on its bandwidth and latency settings. Large content like images may also be cached and immediately available for some users, but lab tests tend to run with empty caches, necessitating another trip over the network.
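To make the network dependence concrete, here is a deliberately crude sketch of how long one large resource might take to arrive under different conditions. The `NetworkProfile` type, the `estimateFetchMs` function, and the profile numbers are all illustrative assumptions, not the presets of any real lab tool, and the model ignores DNS, TLS, TCP slow start, and server time.

```typescript
// Hypothetical network profile; the values used below are illustrative only.
interface NetworkProfile {
  name: string;
  rttMs: number;        // round-trip latency in milliseconds
  downlinkKbps: number; // download bandwidth in kilobits per second
}

// Crude estimate: one round trip to request the resource plus the time
// needed to transfer its bytes. A cached resource costs no network time.
function estimateFetchMs(bytes: number, net: NetworkProfile, cached = false): number {
  if (cached) return 0;
  const transferMs = (bytes * 8) / net.downlinkKbps; // bits / (kbit/s) = ms
  return net.rttMs + transferMs;
}

const heroImageBytes = 200000; // assumed size of the largest image
const cable: NetworkProfile = { name: "cable", rttMs: 28, downlinkKbps: 5000 };
const slow3g: NetworkProfile = { name: "slow 3G", rttMs: 400, downlinkKbps: 400 };

console.log(estimateFetchMs(heroImageBytes, cable));        // 348
console.log(estimateFetchMs(heroImageBytes, slow3g));       // 4400
console.log(estimateFetchMs(heroImageBytes, slow3g, true)); // 0 (cache hit)
```

Even in this toy model, the same image lands more than ten times later on the slow profile, and a warm cache removes the cost entirely. A lab test has to pick one of these outcomes.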
FID is the delay from first interacting with the page, like a click, to the time that the browser is ready to respond to it. The main thread could be so busy with script execution or DOM construction that the event handler has to wait its turn. The obvious limitation with testing a page in the lab is that there aren't any users to interact with it! There are diagnostic metrics for interactivity in the lab, like Total Blocking Time (TBT), but these don't actually measure the user experience. We can fake FID and simulate a user's click, but the questions of what to click and when to click it can be very subjective.
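The sketch below illustrates why the timing of a simulated click matters so much. It models the main thread as a list of non-overlapping tasks and assumes the event handler runs as soon as the task in progress finishes, which is a simplification. The `MainThreadTask` type, both function names, and the task timings are hypothetical. TBT is computed as the portion of each task beyond 50 ms, assuming every listed task falls inside the window in which TBT is measured.

```typescript
// A simplified main-thread task: when it starts and how long it runs (ms).
interface MainThreadTask {
  start: number;
  duration: number;
}

// Delay before the browser can begin handling an input at `inputTime`.
// If the input lands while a task is running, it waits for that task to end.
function inputDelayMs(tasks: MainThreadTask[], inputTime: number): number {
  for (const task of tasks) {
    const end = task.start + task.duration;
    if (inputTime >= task.start && inputTime < end) {
      return end - inputTime;
    }
  }
  return 0; // the main thread was idle
}

// Total Blocking Time: the sum of each task's time beyond the 50 ms threshold.
function totalBlockingTimeMs(tasks: MainThreadTask[]): number {
  return tasks.reduce((sum, task) => sum + Math.max(0, task.duration - 50), 0);
}

// Hypothetical page load with one short task and two long ones.
const tasks: MainThreadTask[] = [
  { start: 0, duration: 30 },
  { start: 100, duration: 250 },
  { start: 400, duration: 120 },
];

console.log(totalBlockingTimeMs(tasks)); // 270
console.log(inputDelayMs(tasks, 150));   // 200 (clicked mid-task)
console.log(inputDelayMs(tasks, 380));   // 0   (clicked while idle)
```

The page and its TBT are identical in every run, yet the simulated delay swings from 0 to 200 ms depending only on when the click happens.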
CLS is roughly the proportion of the viewport that shifted as a result of layout instability. A layout could have a moment of instability when elements are suddenly added or removed and neighboring content changes position. Because the layout shift score is a proportion of the viewport, CLS can be very different between phones and desktops. The type of device used or emulated in the lab directly affects how CLS is calculated. There's another issue having to do with user behavior: when to stop measuring. Lab tools tend to stop when the page is loaded, but real users are just getting started interacting with the page and potentially incurring many more layout shifts. Real users scroll and click and trigger new sorts of conditions that contribute to layout instability. Simulating these behaviors in the lab would be closer to reality, but it has similar challenges to FID.
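The viewport dependence can be sketched with the score for a single layout shift, which is the impact fraction (the share of the viewport covered by the element's before and after positions combined) multiplied by the distance fraction (the distance moved divided by the viewport's larger dimension). The code below handles only one shifting element and does not aggregate multiple shifts the way CLS does. The `Rect` and `Viewport` types, the function names, and the element sizes are assumptions for illustration.

```typescript
interface Rect { x: number; y: number; width: number; height: number; }
interface Viewport { width: number; height: number; }

// Clip a rectangle to the visible viewport.
function clip(r: Rect, vp: Viewport): Rect {
  const x1 = Math.max(0, r.x);
  const y1 = Math.max(0, r.y);
  const x2 = Math.min(vp.width, r.x + r.width);
  const y2 = Math.min(vp.height, r.y + r.height);
  return { x: x1, y: y1, width: Math.max(0, x2 - x1), height: Math.max(0, y2 - y1) };
}

const area = (r: Rect): number => r.width * r.height;

// Area shared by two rectangles (0 if they do not overlap).
function overlapArea(a: Rect, b: Rect): number {
  const w = Math.min(a.x + a.width, b.x + b.width) - Math.max(a.x, b.x);
  const h = Math.min(a.y + a.height, b.y + b.height) - Math.max(a.y, b.y);
  return Math.max(0, w) * Math.max(0, h);
}

// Score for one element moving from `before` to `after`.
function layoutShiftScore(before: Rect, after: Rect, vp: Viewport): number {
  const a = clip(before, vp);
  const b = clip(after, vp);
  const impactFraction = (area(a) + area(b) - overlapArea(a, b)) / (vp.width * vp.height);
  const moved = Math.max(Math.abs(after.x - before.x), Math.abs(after.y - before.y));
  const distanceFraction = moved / Math.max(vp.width, vp.height);
  return impactFraction * distanceFraction;
}

// The same 300px-tall block pushed down 100px by a late banner.
const phone = layoutShiftScore(
  { x: 0, y: 0, width: 360, height: 300 },
  { x: 0, y: 100, width: 360, height: 300 },
  { width: 360, height: 640 },
);
const desktop = layoutShiftScore(
  { x: 560, y: 0, width: 800, height: 300 },
  { x: 560, y: 100, width: 800, height: 300 },
  { width: 1920, height: 1080 },
);
console.log(phone.toFixed(3), desktop.toFixed(3)); // "0.098" "0.008"
```

The same 100px shift scores roughly twelve times higher on the phone-sized viewport, so the choice of emulated device changes the result before user behavior even enters the picture.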
This is why field data is the ground truth for how a page is experienced. At best we can only simulate user experiences in the lab, and we'd still be hypothesizing how a user would access a page and how they'd behave once they get there.
But wait! What if we calibrate our lab configurations based on real-user data from the field? This isn't a new idea; developers have been calibrating access factors like geographic location, browser, and network speed based on field data for years. But now it's more important than ever to calibrate behavior as well. For example, we can use analytics to see what users tend to click on first and when they click on it.
Some lab tools like WebPageTest are advanced enough to be able to script that behavior into the test. But a popular tool like PageSpeed Insights (PSI) has no configurability beyond plugging in the URL you want to test, so you need to take its lab results with a grain of salt. Keep in mind that performance is a distribution, and one lab test is just a single contrived data point.
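As a rough sketch of what calibration could look like, the code below derives a lab profile from field samples by taking percentiles. The sample values, the `LabProfile` shape, and the function names are hypothetical. The 75th percentile is used for latency because that is where Core Web Vitals are assessed, and the median is used for the first click as a stand-in for what users "tend" to do. Both choices are assumptions you may want to revisit.

```typescript
// Nearest-rank percentile: the smallest value with at least p% of samples
// at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Hypothetical shape for a lab test configuration.
interface LabProfile {
  rttMs: number;          // network latency to emulate
  firstClickAtMs: number; // when the scripted click should fire
}

function calibrate(fieldRttMs: number[], fieldFirstClickMs: number[]): LabProfile {
  return {
    rttMs: percentile(fieldRttMs, 75),
    firstClickAtMs: percentile(fieldFirstClickMs, 50),
  };
}

// Made-up field samples standing in for real analytics data.
const fieldRttMs = [40, 55, 60, 80, 120, 150, 200, 400];
const fieldFirstClickMs = [900, 1200, 1500, 1800, 2500, 3000, 4200, 8000];

console.log(calibrate(fieldRttMs, fieldFirstClickMs));
// { rttMs: 150, firstClickAtMs: 1800 }
```

A single profile like this is still one point on the distribution. Running the lab test at several percentiles gives a better sense of its shape.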
Fear not, even unrealistic lab tests can still be useful. One practical application is to test for worst-case scenarios. You may not be able to say with certainty that everyone who visits your page will have a fast experience, but if you can make it seem fast under even the slowest conditions, that goes a long way. Stress testing your page's performance by using (or emulating) low-end hardware over strained network speeds is a great way to magnify the power of the microscope to bring more performance problems into focus. This is an opportunity to fix issues before users ever experience them.
What if users aren't experiencing this slow performance because they're conditioned not to? An experience can be so poor that the user abandons it before it gets any worse. They may never come back to the site at all, in which case your field data suffers from survivorship bias: only the bearably slow experiences are measured. How many unbearably slow experiences aren't you measuring? And you thought we were done with the philosophical questions!
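A toy example shows how much survivorship bias can flatter field data. It assumes a metric is only recorded if the user is still around when the page finishes loading, and that every user gives up after the same fixed amount of time. Real abandonment is far messier, and the `patienceMs` threshold, the function names, and the load times below are all made up.

```typescript
// Split the true load times into those that were measured and those lost
// because the user left before the page finished loading.
function splitByAbandonment(
  loadTimesMs: number[],
  patienceMs: number,
): { measured: number[]; lost: number[] } {
  return {
    measured: loadTimesMs.filter((t) => t <= patienceMs),
    lost: loadTimesMs.filter((t) => t > patienceMs),
  };
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical load times for eight visits, including two very slow ones.
const trueLoadTimesMs = [800, 1200, 1500, 2500, 4000, 9000, 15000, 30000];
const { measured, lost } = splitByAbandonment(trueLoadTimesMs, 10000);

console.log(mean(trueLoadTimesMs));      // 8000
console.log(Math.round(mean(measured))); // 3167
console.log(lost.length);                // 2 visits never show up in the data
```

In this made-up sample, the field data would report an average load time well under half of what visitors actually faced, simply because the worst experiences never got recorded.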
Let's stop here and recap:
- Individual experiences are just data points along a distribution. What feels fast depends on the conditions under which it was experienced. Everyone's conditions are different.
- Lab tests may not be configured to be representative of the most common experiences on the curve, or any experience on the curve for that matter.
- User-centric metrics require extra care to ensure that behaviors are emulated faithfully in the lab.
As a web development community, we need to change more than just our mindset about "fast" web pages. It's not enough to be aware of the pitfalls that can lead us astray: avoiding them requires a concerted effort between the makers and users of performance testing tools.
Lab tools must be configurable to access and behave like real users. There is no one-size-fits-all lab configuration that represents how users experience all pages. Developers need to be active participants in the configuration process—not necessarily down to the Mbps of bandwidth, but they should make high-level decisions about what type of user they're simulating in the lab. This could be a manual guessing game, but at least developers are made more conscious of the relevance of the results.
An even better solution would be to build stronger data bridges between field and lab tools, so that the lab tool itself can make informed recommendations about the most realistic user profiles to simulate.
We're at an exciting inflection point in the power of developer tooling. As newer metrics focus on how pages are experienced from users' perspectives, we have an opportunity to rethink and reshape the ways our tools help us to measure and optimize them. By instrumenting lab tools with the behavioral characteristics of real users, we can unlock new opportunities to improve experiences beyond the page load.