Releases: privacy-tech-lab/gpc-web-crawler
August 2025 Crawl
Differences from May 2025 Crawl:
- The REST API Dockerfile now uses node:18 instead of node:16, fixing a bug caused by an archived version of Debian. The actual crawl for August 2025 was performed with node:16, but this change should not affect the data collected (see #182 for more details).
To pull the exact image versions used in this release:
docker pull ghcr.io/privacy-tech-lab/crawl-driver@sha256:0a304d6a105da5a01e45ea6462255187d0b1bab3f5ba2571489815958b425c31
docker pull ghcr.io/privacy-tech-lab/well-known-crawl@sha256:6b3cf17d156566159826eeebde0f495d7d096d13e09d29b3922eefc6f21c4469
docker pull ghcr.io/privacy-tech-lab/rest-api@sha256:a9bcac0bbfc35b05bfa6f6c536b1d20fa4d46b7d3fa6b432f6c0dc035de3a509
docker pull ghcr.io/privacy-tech-lab/mariadb-custom@sha256:24b7cb1fb51433b2149043de27182dd66c14329c7620ac6c417c52e9b235acf8
May 2025 Crawl
Differences from Feb/Mar 2025 Crawl:
- Updated the well-known Python script to write the string "None" to the well-known-data.csv file rather than the value None, which prevents blank cells.
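The following is a minimal sketch of why this change matters, not the repository's actual script. Python's `csv.writer` writes a `None` value as an empty field, so converting it to the string "None" first keeps the cell populated. The helper names `none_to_text` and `write_rows` and the sample row are hypothetical.

```python
import csv
import io


def none_to_text(value):
    """Return the string "None" for a missing value, otherwise the value itself."""
    return "None" if value is None else value


def write_rows(rows, out):
    """Write rows to a CSV stream, replacing None cells with the string "None"."""
    writer = csv.writer(out)
    for row in rows:
        writer.writerow([none_to_text(cell) for cell in row])


# Hypothetical row: a site for which no well-known data was found.
buffer = io.StringIO()
write_rows([["example.com", None]], buffer)

# For comparison, csv.writer alone leaves the second cell blank.
raw = io.StringIO()
csv.writer(raw).writerow(["example.com", None])
```

Here `buffer` holds a row whose second cell reads `None`, while `raw` holds a row whose second cell is empty.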
To pull the exact image versions used in this release:
docker pull ghcr.io/privacy-tech-lab/crawl-driver@sha256:1955a5a8e9dd06a84e92e87c44eecaaa248d44ce03c1a155888b00a82b1833df
docker pull ghcr.io/privacy-tech-lab/well-known-crawl@sha256:06110d2c0c56a6e214a03c93d9e88f0ea4c25953318587f8fa9d939789cad191
docker pull ghcr.io/privacy-tech-lab/rest-api@sha256:932cd43ebb38d4ff9e806822c5ecd8e886484e77bdf0361c3749a44f7ba7daf8
docker pull ghcr.io/privacy-tech-lab/mariadb-custom@sha256:0f08aa392b890a33f8f747c123d45577be2184115461385729c7fcab86691ab0
February/March 2025 Crawl
Introduced Docker containerization for the web crawler.
- Dockerization: The crawler is now fully containerized, including the addition of Dockerfiles for MariaDB and the complete isolation of containers from the local machine
- Improved Efficiency: Reduced wait times and added a retry policy for persistent services, enhancing crawl speed and reliability
- Enhanced Error Handling and Cleanup: More robust error handling implemented, along with automatic container shutdown and cleanup of volumes to ensure a cleaner environment after crawls
June 2024 Crawl
Differences from April 2024 Crawl:
- addition of a GPP version field that identifies whether the site is using GPP v1.0 or v1.1
April 2024 Crawl
Differences from February 2024 crawl:
- well-known data is no longer collected by the crawler. We use a Python script instead, which is also included in this repo.
- longer database values are now stored as TEXT instead of varchar
- addition of OneTrustWPCCPAGoogleOptOut and OTGPPConsent cookies
February 2024 Crawl
This is largely the same as the December 2023 crawl code.
Differences:
- well-known data is collected by the crawler
- column values in the debugging table are capped at 4,000 characters, matching the column length specified in our table
- one new human check regular expression
December 2023 Crawl
This is the code we used to perform our crawl on 11,708 sites in December 2023.
The extension collects data from Firefox's urlClassification object in order to determine whether a site is subject to the CCPA. It collects the US Privacy String (USPS), the GPP string, and the OptanonConsent cookie to determine whether sites recognize GPC signals. This version uses a SQL database to store the data.
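As an illustration of one of the collected values, here is a minimal sketch of reading a US Privacy String. It assumes the four-character IAB format (a version digit followed by notice, opt-out sale, and LSPA flags, each `Y`, `N`, or `-`). The function names and sample strings are hypothetical, and this is not the extension's own logic for deciding whether a site recognizes GPC.

```python
def parse_us_privacy_string(usps):
    """Split a US Privacy String such as "1YNN" into its four fields."""
    if len(usps) != 4 or not usps[0].isdigit():
        raise ValueError(f"not a valid US Privacy String: {usps!r}")
    flags = usps[1:].upper()
    if any(ch not in "YN-" for ch in flags):
        raise ValueError(f"unexpected flag in US Privacy String: {usps!r}")
    return {
        "version": int(usps[0]),
        "notice_given": flags[0],
        "opt_out_sale": flags[1],
        "lspa_covered": flags[2],
    }


def has_opted_out(usps):
    """Return True when the opt-out-of-sale flag is set to Y."""
    return parse_us_privacy_string(usps)["opt_out_sale"] == "Y"


# Hypothetical values observed before and after a GPC signal is sent.
before_gpc = "1YNN"
after_gpc = "1YYN"
changed_to_opt_out = not has_opted_out(before_gpc) and has_opted_out(after_gpc)
```

In this sketch, a change of the opt-out flag from `N` to `Y` after the GPC signal is sent is treated as one possible indication that the site honored the signal.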
Firefox-analysis-mode-crawler
The Firefox-analysis-mode-crawler is used to crawl the top 1000 sites of the US Privacy String Test Set.