Web Assets & Benford's Law

TL;DR: My pseudo-research shows that web asset file sizes pretty much follow Benford's Law, using a very small sample of the top 100 websites.

Background

I first discovered Benford's Law on an episode of Radiolab. Simply put, Benford's Law states that in many data sets, smaller digits appear more often than larger ones as the leading digit of the numbers in that set. For example, a "1" appears as the first digit about 30% of the time in the numbers of some data sets, a "2" appears as the first digit about 18% of the time, and so on down to "9" at under 5%. A graph of the trend is shown below.
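The expected frequencies come from the formula P(d) = log10(1 + 1/d), where d is a leading digit from 1 to 9. Here is a minimal Python sketch that computes them; the function name `benford_expected` is only illustrative.

```python
import math


def benford_expected():
    """Return the Benford's Law probability for each leading digit 1-9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


if __name__ == "__main__":
    # Print each digit with its expected share as a percentage.
    for digit, probability in benford_expected().items():
        print(f"{digit}: {probability:.1%}")
```

The nine probabilities sum to 1, so a surplus in one digit always means a shortfall somewhere else.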

After learning about Benford's Law, I wanted to find out whether web asset file sizes follow this law. Web assets are defined here as HTML documents, external CSS files, external JavaScript files, and images.

Methodology

The technologies used included Python, SQLite, Dataset, BeautifulSoup, and Requests.

I wrote a Python script to fetch the URLs of the top 100 sites from Alexa. These URLs were stored in a SQLite database accessed via Dataset. I then wrote another Python script that fetched the documents and their external assets for each of the top 100 sites and stored the size of each asset in the aforementioned SQLite database using Dataset. The asset sizes were stored in bytes.
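The storage step could look roughly like the sketch below. To stay dependency-free it uses Python's built-in `sqlite3` module rather than Dataset, and the table and column names (`assets`, `site`, `url`, `kind`, `size_bytes`) are assumptions, not the schema from the original scripts.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the assets table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assets ("
        "site TEXT, url TEXT, kind TEXT, size_bytes INTEGER)"
    )
    return conn


def record_asset(conn, site, url, kind, body):
    """Store the size in bytes of one fetched asset body."""
    conn.execute(
        "INSERT INTO assets (site, url, kind, size_bytes) VALUES (?, ?, ?, ?)",
        (site, url, kind, len(body)),
    )
    conn.commit()


def sizes_for(conn, kind):
    """Return all non-zero sizes recorded for one kind of asset."""
    rows = conn.execute(
        "SELECT size_bytes FROM assets WHERE kind = ? AND size_bytes > 0",
        (kind,),
    )
    return [row[0] for row in rows]
```

Here `body` is assumed to be the raw response bytes, so `len(body)` is the size in bytes. Zero-byte responses are filtered out on read, since a size of 0 has no leading digit from 1 to 9.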

Results

The following charts display the frequencies of leading digits in asset file sizes versus the frequencies expected under Benford's Law.
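The comparison behind each chart can be sketched as follows: take the first digit of every size, count how often each digit occurs, and set the observed share next to the Benford expectation. The function names and the sample sizes at the bottom are made up for illustration and are not taken from the collected data.

```python
import math
from collections import Counter


def leading_digit(n):
    """Return the first decimal digit of a positive integer."""
    return int(str(n)[0])


def observed_frequencies(sizes):
    """Share of each leading digit 1-9 among the positive sizes."""
    digits = [leading_digit(s) for s in sizes if s > 0]
    if not digits:
        raise ValueError("need at least one positive size")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


def compare(sizes):
    """Return (digit, observed share, Benford expected share) rows."""
    observed = observed_frequencies(sizes)
    return [(d, observed[d], math.log10(1 + 1 / d)) for d in range(1, 10)]


if __name__ == "__main__":
    sample_sizes = [43, 43, 120, 1500, 18234, 2711, 950]  # hypothetical bytes
    for digit, seen, expected in compare(sample_sizes):
        print(f"{digit}: observed {seen:.1%}, expected {expected:.1%}")
```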

HTML Document Sizes

The chart of HTML Document Sizes below represents 96 of the top 100 sites from Alexa's top sites list. Four of the top 100 returned byte sizes of 0, so those were not included in the results. The URLs used were for home pages only. Despite a rise in frequency from leading digit 3 to leading digit 4, where Benford's Law predicts a drop, the overall trend follows the Benford's Law curve fairly well.

Image Sizes

External image files were found in each HTML document using BeautifulSoup, then fetched with individual HTTP requests using the Requests library. The graph below represents 3,443 image file sizes. Overall, the trend follows the Benford's Law curve, but something very strange happens with 4. Looking at the data revealed a large number of images with a file size of 43 bytes. Many of these files were GIF files that were probably tracking pixels.

Removing the files that were 43 bytes smoothed out the trend a little bit, but 4 still appears more than predicted.
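The filtering step is simple enough to sketch directly. The helper names and the sample list below are hypothetical; only the 43-byte threshold comes from the data described above.

```python
from collections import Counter

TRACKING_PIXEL_BYTES = 43  # size of the suspected tracking pixels


def digit_counts(sizes):
    """Count leading digits across all positive sizes."""
    return Counter(int(str(s)[0]) for s in sizes if s > 0)


def without_size(sizes, excluded=TRACKING_PIXEL_BYTES):
    """Drop every file whose size is exactly the excluded value."""
    return [s for s in sizes if s != excluded]


if __name__ == "__main__":
    sizes = [43, 43, 43, 4200, 150, 97]  # made-up sample, in bytes
    print("before:", digit_counts(sizes)[4])
    print("after: ", digit_counts(without_size(sizes))[4])
```

Filtering on exact size is a blunt rule: it also drops any legitimate image that happens to be 43 bytes, and it misses tracking pixels of other sizes.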

CSS Sizes

External CSS files and other <link> tags were found in each HTML document using BeautifulSoup, then requested as individual HTTP requests, the same as with the images mentioned above. So, the results below include CSS files, favicon images, and any other asset that can be linked to with a <link> tag. Here, 5 appears to be overrepresented, but from looking at the raw data it was not immediately clear what may be causing the spike.

Script Sizes

External script files were found in each HTML document using BeautifulSoup, then requested as individual HTTP requests, the same as the images and the CSS files mentioned above. The numbers 5 and 9 are slightly overrepresented here, but the overall curve follows the Benford's Law curve well.
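The extraction step shared by the image, <link>, and script sections can be sketched as below. This version uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages; the class and function names are assumptions and do not come from the original scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, <link> targets, and external scripts."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = {"img": [], "link": [], "script": []}

    def handle_starttag(self, tag, attrs):
        if tag not in self.assets:
            return
        attributes = dict(attrs)
        # <link> points at its file via href; <img> and <script> use src.
        ref = attributes.get("href" if tag == "link" else "src")
        if ref:  # inline scripts have no src and are skipped
            self.assets[tag].append(urljoin(self.base_url, ref))


def collect_assets(html, base_url):
    """Parse one HTML document and return its asset URLs grouped by tag."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    return parser.assets
```

Each collected URL would then be requested individually and the length of the response body recorded as the asset size. Relative references are resolved against the page URL with `urljoin`, so the same code handles both `/a.css` and fully qualified CDN links.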

All File Sizes

Looking at all file sizes together, 4 and 5 are overrepresented because they appear so often for image file sizes (lots of 4s) and for script and CSS file sizes (lots of 5s). Because the frequencies must sum to 100%, the overrepresentation of these digits pushes the frequencies of the other digits down, so the rest of the trend line sits below the Benford's Law curve.

Conclusion

Despite some outliers, the leading digits of web asset file sizes appear to follow Benford's Law to a certain degree.

Limitations and Future Research

Considering the size of the web, 100 websites is a very small sample size to use when collecting data. Future research would benefit from using a much larger sample size of sites. Also, guidelines should be created up front to consider how to handle outliers, such as transparent tracking pixels.

Code and Data

All the scripts used and the original database are available in a GitHub repository.