DEV Community

HTTP Archive

Correlation between Core Web Vitals and web characteristics

Sixing Chen
I am a data scientist at Google working on analysis questions related to Chrome and the web.
・8 min read

Introduction

Core Web Vitals (CWV) are the metrics that Google considers to be the most important indicators of the quality of experience on the web. The process of identifying and optimizing CWV issues has typically been a reactive one. Site owners usually settle on which technologies to use or which metrics to watch through trial and error rather than empirical research. A site may be built or rebuilt using a new technology, only to discover that it creates UX issues in production.

In this analysis, we examine the correlation between CWV and many different types of web characteristics simultaneously, rather than a single type in isolation, because web development choices are not made in a vacuum but in the context of many parts of a website. We hope that these results will provide additional reference points to teams as they assess various web development choices, and we invite the community to help further the understanding of the interplay between CWV and web characteristics.

  • Notable negative associations with largest contentful paint:
    • TTFB, bytes of JavaScript, CSS, and images
    • JavaScript frameworks - AngularJS, GSAP, MooTools, and RequireJS
    • JavaScript libraries - Hammerjs, Lodash, momentjs, YUI, Zepto, jQueryUI, and prettyPhoto
    • CMS - Joomla and Squarespace
    • UI frameworks - animatecss
    • Web frameworks - MicrosoftASPNet
    • Widgets - FlexSlider and OWLCarousel
  • Notable negative associations with cumulative layout shift:
    • Bytes of images
    • JavaScript frameworks - AngularJS, Handlebars, and Vuejs
    • JavaScript libraries - FancyBox, Hammerjs, Modernizr, and Slick
    • Widgets - Flexslider and OWLCarousel

Methodology

Data source

This analysis is based on data from HTTP Archive. The HTTP Archive dataset is generated in a lab environment and contains detailed information on many characteristics of a website as well as performance data. Because the data is lab-generated on a single set of hardware, it is not completely reflective of real usage, and it only allows us to analyze LCP (largest contentful paint) and CLS (cumulative layout shift), as there is no user input for FID (first input delay). However, an advantage of lab generation is that all data is gathered on a single set of hardware with no bias in the types of websites that are loaded, which shields us from confounding due to user/device characteristics that we do not measure. Although we are not shielded from all confounding between website characteristics and web performance, this choice leaves us with far less confounding than a user-generated dataset, where we often have no information on the user and only limited device information.

Web characteristics

We conferred with domain experts and established a list of web characteristics of interest:

  • TTFB, font requests, and bytes of content of various types
  • Counts of various types of third party requests
  • Web technologies (coded as binary to represent whether technology is used)
    • JavaScript frameworks
    • JavaScript libraries
    • CMS
    • UI frameworks
    • Web frameworks
    • Widgets
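As a sketch of how the binary coding of web technologies might look in practice (the page names and detections below are made up; real detections come from HTTP Archive's technology data):

```python
import pandas as pd

# Hypothetical per-page technology detections in long format; the pages
# and technologies here are illustrative, not HTTP Archive data.
detections = pd.DataFrame({
    "page": ["a.com", "a.com", "b.com", "c.com"],
    "technology": ["WordPress", "jQuery", "React", "jQuery"],
})

# Pivot to one binary indicator column per technology: 1 = used, 0 = not.
indicators = (
    detections.assign(used=1)
    .pivot_table(index="page", columns="technology", values="used", fill_value=0)
    .astype(int)
)
print(indicators)
```

Each row of `indicators` is then one website, and each technology column can enter the model alongside the numerical characteristics.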

These characteristics represent the ways that web pages are built and experienced. Pages may be built using various technologies such as content management systems (CMSs), JavaScript libraries and frameworks, etc. According to the Web Almanac, 40% of websites are built with a CMS, which makes CMSs a useful category to inspect qualitatively for meaningful correlations with CWV. Alongside these, we use quantitative metrics that represent how users experience the page, including performance and page weight data.

The list of technologies we include in the analysis is only a subset of all technologies employed by sites in HTTP Archive. We have restricted the analysis to websites that employ only technologies used by at least 50,000 websites (about 1% of sites in HTTP Archive). This removes underused technologies for which we may not have sufficient data. The presence of certain technologies also overlaps highly, with the CMS Wix and the JavaScript library Zepto overlapping almost completely. Such high overlap creates modeling issues, so we have chosen to remove Wix from this analysis.
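A minimal sketch of this filtering step, assuming a synthetic indicator matrix (the technology names and usage rates are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical indicator matrix: rows = sites, columns = technologies.
# Real indicators would come from HTTP Archive; these names are placeholders.
n_sites = 200_000
X = pd.DataFrame({
    "PopularCMS": (rng.random(n_sites) < 0.40).astype(int),
    "CommonLib": (rng.random(n_sites) < 0.60).astype(int),
    "RareWidget": (rng.random(n_sites) < 0.01).astype(int),  # ~2,000 sites
})

MIN_SITES = 50_000  # the article's threshold, roughly 1% of sites
usage = X.sum()
X = X[usage[usage >= MIN_SITES].index]  # drop underused technologies
```

Highly overlapping pairs (like Wix and Zepto in the article) would be spotted separately, e.g. via pairwise co-occurrence rates, before dropping one member of each pair.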

Analysis

With LCP and CLS as the outcomes and the web characteristics as the predictors, we model the relationship between the outcomes and the predictors with random forest. Random forest is a learning algorithm for both regression and classification based on a set of decision trees, each trained on a bootstrap sample of the dataset using randomly chosen subsets of predictors.
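As a rough sketch of this setup for the LCP model, with synthetic stand-in data (the predictor names, scales, and the simulated relationship are illustrative, not the article's actual data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-ins for the article's predictors; real values come
# from HTTP Archive, and these names/scales are purely illustrative.
n = 1_000
ttfb_ms = rng.gamma(2.0, 300.0, n)        # time to first byte
js_bytes = rng.gamma(2.0, 200_000.0, n)   # bytes of JavaScript
framework = rng.integers(0, 2, n)         # binary: technology present?
X = np.column_stack([ttfb_ms, js_bytes, framework])

# LCP is modeled on the log scale, as described in the Results section.
lcp_ms = 500 + 0.8 * ttfb_ms + 2e-3 * js_bytes + rng.gamma(2.0, 50.0, n)
y = np.log(lcp_ms)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
```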

To assess the correlation between the outcome and each predictor as well as their individual effects on the outcome, we derived a measure of correlation (% of higher >= split mean, %HSM) and a measure of effect size (mean split difference, MSD). Both measures are based on the types of splits the trained decision trees make based on the predictors. See appendix for more details.

%HSM is bounded between 0 and 1: values close to 0 indicate negative correlation, values close to 1 indicate positive correlation, and values close to 0.5 indicate little correlation. MSD's magnitude is unbounded, and a large positive value indicates that the predictor appears to contribute positively to the mean of the outcome. Note that "positive" here is meant in the numerical sense, not as a value judgment.

Results

Here, we present results on association and make note of specific characteristics that appear especially impactful on CWV.

When interpreting these results on association, it is important to note that the positive or negative impact of a particular web characteristic should only be interpreted relative to that of other web characteristics, and in the context of websites that employ an array of web technologies, various types of content, and different third party requests. For instance, if a given web technology shows a strong positive impact, it should be read as "this technology appears to be good for performance relative to other technologies," not as "adding this technology to a website will improve its web performance."

LCP

LCP is modeled as the log of its numerical value, so higher values are worse.

A %HSM value close to 1 means that higher values of a numerical/count characteristic, or the presence of a technology, are strongly associated with higher values of LCP, and vice versa for %HSM close to 0 (high %HSM is worse).

Likewise, a relatively large and positive MSD means that higher values of a numerical/count characteristic, or the presence of a technology, show a strong negative impact on LCP, and vice versa for a relatively large and negative MSD (large positive MSD is worse).

Higher values of TTFB and bytes of JavaScript, CSS, and images show the strongest positive correlations with LCP and the most negative impact, though TTFB is not always actionable.

In general, third party requests do not show strong correlation with or impact on LCP in the context of the other predictors we consider. This could be because most websites in HTTP Archive have a fair number of third party requests, so their effect could not be well ascertained.

The presence of most JavaScript frameworks shows strong positive correlation with LCP and negative impact, with the exception of AMP. AngularJS, GSAP, MooTools, and RequireJS stand out the most.

As with JavaScript frameworks, the presence of most JavaScript libraries also shows strong positive correlation with LCP and negative impact. Hammerjs, Lodash, momentjs, YUI, and Zepto stand out in terms of both correlation and effect size, while jQueryUI and prettyPhoto stand out in terms of correlation.

Among CMSs, Joomla and Squarespace show strong positive correlation with LCP and negative impact. On the other hand, WordPress shows low correlation and impact.

Animatecss stands out among UI frameworks, while MicrosoftASPNet stands out among web frameworks.

Among widgets, FlexSlider and OWLCarousel both show strong positive correlation with LCP, and FlexSlider also shows a strong negative effect size.

CLS

CLS is modeled as a binary indicator of whether a given threshold is met. 1 indicates a website has CLS < 0.1, and 0 otherwise, so 1s are better than 0s.
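A sketch of this binary coding and the corresponding classifier, using synthetic CLS values (the data and predictors below are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical raw CLS values; the analysis codes the outcome as
# 1 when CLS < 0.1 (threshold met) and 0 otherwise.
raw_cls = rng.gamma(1.0, 0.08, size=1_000)
y = (raw_cls < 0.1).astype(int)

# Placeholder binary technology indicators as predictors.
X = rng.integers(0, 2, size=(1_000, 5))
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```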

A %HSM value close to 1 means that higher values of a numerical/count characteristic, or the presence of a technology, are strongly associated with meeting the CLS threshold, and vice versa for %HSM close to 0 (low %HSM is worse).

Likewise, a relatively large and positive MSD means that higher values of a numerical/count characteristic, or the presence of a technology, show a strong positive impact on meeting the CLS threshold, and vice versa for a relatively large and negative MSD (large negative MSD is worse).

Most of these characteristics show only weak correlation with CLS compliance and low impact, except bytes of images, which shows a negative correlation with CLS compliance and a negative impact.

As with LCP, third party requests seem to have low correlation with and impact on CLS compliance.

The presence of several JavaScript frameworks shows strong negative correlation with CLS compliance and negative impact, while AMP, GSAP, and React show low correlation and impact. AngularJS, Handlebars, and Vuejs appear to have the most negative impact.

JavaScript libraries appear less bad for CLS compliance than frameworks, though most still show a negative impact. FancyBox, Hammerjs, Modernizr, and Slick are the most notable.

None of the CMSs have a notable negative impact, with WordPress showing a fairly positive correlation.

UI frameworks all show low impact. Among web frameworks, RubyonRails shows a fairly positive correlation with CLS compliance.

Among widgets, Flexslider and OWLCarousel both show a fairly negative impact on CLS compliance.

Conclusion

This analysis is a first step in an effort to more comprehensively understand the impact of web characteristics on CWV. While the results point out strongly associated characteristics, it would be valuable for the web community to delve further into the associations identified here to ascertain which are truly causal and which are merely associative, so as to better inform web developers. In the meantime, web characteristics with strong negative correlations/effects should be seen as a signal of things that require more attention and/or planning. Finally, it would be of interest to refresh these analyses in the future to see whether the associations identified here still hold.

Appendix

Random forest trains decision trees by making binary splits of the data. Each split is based on a particular predictor X and is of the form X <= c versus X > c, with the split chosen according to some purity criterion. All data points with X <= c go into the corresponding branch, and likewise for data points with X > c. The data points in each branch can then be further split on other predictors in the same way. The measures of correlation and effect size we use exploit these splits.

Specifically, for a given predictor, we look at the splits that are based on that predictor. For each such split, we compute the outcome mean of the data points in the <= branch and in the > branch. %HSM (% of higher >= split mean) is the proportion of such splits in which the outcome mean in the > branch is higher than that in the <= branch; this checks how frequently larger outcome means are associated with higher predictor values. MSD (mean split difference) is the outcome mean of the <= branch subtracted from that of the > branch, averaged across all relevant splits on the predictor; this checks the difference in outcome means between data points with higher values of the predictor and those with lower values.
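These two measures can be read directly off a fitted forest's split structure. Below is a sketch using scikit-learn's tree internals on synthetic demo data; note one simplification: the node values scikit-learn stores are outcome means over each tree's bootstrap sample, which stand in here for the branch means described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def hsm_and_msd(forest, feature_idx):
    """Compute %HSM and MSD for one predictor from a fitted forest.

    For every internal node that splits on `feature_idx`, compare the
    outcome mean stored at the > branch (right child) with that at the
    <= branch (left child).
    """
    diffs = []
    for est in forest.estimators_:
        t = est.tree_
        for node in range(t.node_count):
            if t.children_left[node] == -1:     # leaf: no split here
                continue
            if t.feature[node] != feature_idx:  # split on another predictor
                continue
            mean_le = t.value[t.children_left[node]][0][0]   # <= branch mean
            mean_gt = t.value[t.children_right[node]][0][0]  # > branch mean
            diffs.append(mean_gt - mean_le)
    if not diffs:
        return float("nan"), float("nan")
    diffs = np.asarray(diffs)
    hsm = float(np.mean(diffs > 0))  # share of splits where > branch is higher
    msd = float(np.mean(diffs))      # mean split difference
    return hsm, msd

# Tiny demo: the outcome rises with feature 0 and ignores feature 1.
# Shallow trees keep the split means signal-dominated for the demo.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 2))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=1_000)
forest = RandomForestRegressor(n_estimators=30, max_depth=4,
                               random_state=0).fit(X, y)
hsm0, msd0 = hsm_and_msd(forest, 0)
```

For the positively correlated feature 0, %HSM should come out near 1 and MSD clearly positive, matching the interpretation given above.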

Discussion (5)

Todd H. Gardner

Wow, this is fantastic. I've anecdotally shared these observed correlations before, and now there is real data to back it up!

Tim Bednar

Hi -- I'm interested in the first chart. Did "size" work better than counting requests? I see the count for fonts, but then size is used for the other types. This is so interesting.

Sixing Chen Author

Hi Tim, sorry for the super delayed response!

This was simply a modeling choice that we made; there was some extra interest regarding fonts, so I added an additional request count for fonts.

For third party requests, though, we did use the count of requests, and did not find much strong correlation there.

Tim Bednar

@csxgg I find this very interesting. Right now I'm starting to research how to create a performance budget which would result in passing Core Web Vitals metrics. Basically, I want to analyze pages that already "pass" these benchmarks, then look at the breakdown (count and size) of files loaded by the browser before FCP and LCP (as well as other metrics), and see how many scripts with "long tasks" are loaded before FCP and LCP, etc. The result would be a budget that could be "predictive", meaning if you build the page to load files in this way, we can predict that it will be fast. For example, do fast pages load 3 images before LCP or 20? Do fast pages load 1 script with a long task before LCP, or 5? How many render blocking resources are loaded before FCP? Third parties would be good to know as well. The idea is that if you use the app shell model, the critical rendering path, and this type of "budget", then you have a boilerplate for building a fast page. And once you know this, you can evaluate existing pages. All that said, I find this research really interesting and I return to it often as I work on waterfaller.dev.

Sixing Chen Author

There are so many intricacies in web development that could affect performance! I had no understanding at all of these things under the hood, and merely tried to represent characteristics of a web page with numbers. I'm glad this work can potentially be useful, but clearly there are a lot of things that are not captured by the training data.