Background
I first discovered Benford's Law on an episode of Radiolab. Simply put, Benford's Law states that in many data sets, smaller digits appear more often than larger ones as the leading digit of the numbers in that set. For example, a "1" appears as the first digit about 30% of the time in the numbers of some data sets, a "2" appears as the first digit about 18% of the time, and so on. A graph of the trend is shown below.
After learning about Benford's Law, I wanted to find out whether web asset file sizes follow this law. Web assets are defined here as HTML documents, external CSS files, external JavaScript files, and images.
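The percentages above come from the standard statement of Benford's Law, in which a leading digit d is expected with probability log10(1 + 1/d). A minimal sketch of that formula follows; the function name `benford_expected` is only an illustrative choice and is not taken from the scripts used for this article.

```python
import math


def benford_expected(digit):
    """Expected share of numbers whose leading digit is `digit` (1 through 9)."""
    return math.log10(1 + 1 / digit)


# Expected frequency for every possible leading digit.
expected = {d: benford_expected(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p:.1%}")
# The first two lines printed are "1: 30.1%" and "2: 17.6%".
```

The nine probabilities sum to 1, which is why an excess in one digit necessarily pushes the shares of the others down.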
Methodology
The following technologies were used:
- Python
- Requests, a Python library for HTTP requests
- Dataset, a Python library for simple database interaction
- BeautifulSoup, a Python library for reading and manipulating HTML and XML
- Highcharts JS, a JavaScript library for displaying the charts in this document
I wrote a Python script to fetch the URLs of the top 100 sites from Alexa and stored those URLs in a SQLite database accessed via Dataset. I then wrote a second Python script that fetched the document and its external assets for each of the top 100 sites and stored the size of each asset, in bytes, in the same SQLite database.
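The asset-discovery step might look roughly like the sketch below. It is not the code from the repository: to stay dependency-free it uses Python's built-in `html.parser` as a stand-in for BeautifulSoup, and the class name `AssetCollector`, the helper `find_assets`, and the example.com URLs are all made up for illustration. Fetching and database storage are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag name -> attribute that holds the external asset's URL.
ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}


class AssetCollector(HTMLParser):
    """Collects absolute URLs of images, external scripts, and <link> targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []  # (tag, absolute_url) pairs in document order

    def handle_starttag(self, tag, attrs):
        attr_name = ASSET_ATTRS.get(tag)
        if attr_name is None:
            return
        url = dict(attrs).get(attr_name)
        if url:  # inline scripts and tags without a URL are skipped
            self.assets.append((tag, urljoin(self.base_url, url)))


def find_assets(html, base_url):
    """Return (tag, absolute_url) pairs for the external assets in `html`."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    return collector.assets


# Hypothetical page used only to show the output shape.
page = (
    '<html><head><link rel="stylesheet" href="/main.css">'
    '<script src="app.js"></script><script>var x = 1;</script></head>'
    '<body><img src="https://cdn.example.com/pixel.gif"></body></html>'
)
assets = find_assets(page, "https://example.com/")
```

Each collected URL would then be requested on its own, and the asset's size taken as the byte length of the response body (with Requests, that would be `len(response.content)`).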
Results
The following charts display the frequencies of leading digits in asset file sizes versus the frequencies expected under Benford's Law.
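The observed frequencies in these charts can be computed along the following lines. This is a sketch under assumptions rather than the script actually used: the function names are illustrative, and `sample_sizes` is made-up data, not values from the database. Sizes of 0 are skipped because they have no leading digit.

```python
from collections import Counter
import math


def leading_digit(size):
    """First decimal digit of a positive integer byte size."""
    return int(str(size)[0])


def observed_frequencies(sizes):
    """Share of each leading digit 1-9 among the positive sizes."""
    digits = [leading_digit(s) for s in sizes if s > 0]  # drop 0-byte responses
    if not digits:
        raise ValueError("no positive sizes to analyze")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


# Made-up byte sizes, for illustration only.
sample_sizes = [43, 43, 1520, 18211, 0, 2048, 911, 130, 5120, 17]
observed = observed_frequencies(sample_sizes)

for d in range(1, 10):
    benford = math.log10(1 + 1 / d)  # expected share under Benford's Law
    print(f"{d}: observed {observed[d]:.1%}, Benford {benford:.1%}")
```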
HTML Document Sizes
The chart of HTML Document Sizes below represents 96 of the top 100 sites from Alexa's top sites list. Four of the top 100 returned byte sizes of 0, so those were not included in the results. The URLs used were for home pages only. Despite an increase in frequency from the digit 3 to the digit 4, the overall trend follows the Benford's Law curve fairly well.
Image Sizes
External image files were found in each HTML document using BeautifulSoup, then requested as individual HTTP requests using the Requests library. The graph below represents 3443 image file sizes. Overall, the trend follows the Benford's Law curve, but something very strange happens with 4. Looking at the data revealed a large number of images with a file size of 43 bytes. Many of these files were GIF files that were probably tracking pixels.
Removing the files that were 43 bytes smoothed out the trend a little bit, but 4 still appears more than predicted.
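The effect of that filter can be illustrated with a small sketch. The `image_sizes` list is hypothetical and deliberately exaggerated, not data from the study, and the helper name `share_with_leading_digit` is likewise just for illustration.

```python
TRACKING_PIXEL_SIZE = 43  # bytes; the heavily repeated size noted above


def share_with_leading_digit(sizes, digit):
    """Fraction of positive sizes whose first digit equals `digit`."""
    positive = [s for s in sizes if s > 0]
    return sum(str(s)[0] == str(digit) for s in positive) / len(positive)


# Hypothetical image sizes in bytes, with a cluster of 43-byte files.
image_sizes = [43, 43, 43, 1204, 2980, 15873, 412, 1107, 36012, 198]

# Drop every file that is exactly 43 bytes, then compare the share of 4s.
filtered = [s for s in image_sizes if s != TRACKING_PIXEL_SIZE]

before = share_with_leading_digit(image_sizes, 4)
after = share_with_leading_digit(filtered, 4)
print(f"leading 4 before: {before:.1%}, after: {after:.1%}")
```

Filtering on an exact size is a blunt rule: it also drops any legitimate file that happens to be 43 bytes and leaves tracking pixels of other sizes in place.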
CSS Sizes
External CSS files and other <link> tags were found in each HTML document using BeautifulSoup, then requested as individual HTTP requests, the same as with the images mentioned above. So, the results below include CSS files, favicon images, and any other asset that can be linked to with a <link> tag. Here, 5 appears to be overrepresented, but from looking at the raw data it was not immediately clear what may be causing the spike.
Script Sizes
External script files were found in each HTML document using BeautifulSoup, then requested as individual HTTP requests, the same as the images and the CSS files mentioned above. The numbers 5 and 9 are slightly overrepresented here, but the overall curve follows the Benford's Law curve well.
All File Sizes
Looking at all file sizes together, 4 and 5 are overrepresented because they appear so often for image file sizes (lots of 4s) and for script and CSS file sizes (lots of 5s). Because the frequencies must sum to 100%, the overrepresentation of these two digits lowers the frequencies of the other digits, so the rest of the trend line sits below the Benford's Law curve.
Conclusion
Despite some outliers, the leading digits of web asset file sizes appear to follow Benford's Law to a certain degree.
Limitations and Future Research
Considering the size of the web, 100 websites is a very small sample. Future research would benefit from using a much larger sample of sites. Also, guidelines for handling outliers, such as transparent tracking pixels, should be defined up front.
Code and Data
All the scripts used and the original database are available in a GitHub repository.