Introducing the wide crawl initiative
In March 2011, we embarked on an ambitious project known as the wide crawl, built on a carefully curated seed list, a dedicated crawler configuration, and new HQ software designed by Kenji Nagahashi to manage the distribution of URLs to the crawlers.
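HQ's internals are not described in this post, so the sketch below is only a rough illustration of the kind of job such a component performs, not a description of HQ itself: one common approach in distributed crawling is to partition URLs across crawler nodes by hashing the host name, so that every URL from a given host is handled by the same node. The function name and node count are hypothetical.

```python
# Illustrative only: one common way a crawl frontier can assign URLs to
# crawler nodes -- hash the host so each host is owned by a single node.
# This is NOT a description of HQ; names and the node count are invented
# for the example.
import hashlib
from urllib.parse import urlparse

NUM_CRAWLER_NODES = 8  # hypothetical cluster size

def assign_node(url: str) -> int:
    """Return the index of the crawler node responsible for this URL."""
    host = (urlparse(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CRAWLER_NODES

if __name__ == "__main__":
    for u in ("http://example.org/a", "http://example.org/b", "http://archive.org/"):
        print(u, "-> node", assign_node(u))
```

Because the assignment depends only on the host, both example.org URLs above land on the same node, which also makes per-host politeness limits straightforward to enforce.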
What the data reveals
The crawl ran from March 2011 until December 23, 2011, collecting 2,713,676,341 captures of 2,273,840,159 unique URLs from 29,032,069 hosts. Our starting point was a list of the top 1 million sites from Alexa, gathered shortly before we commenced the crawl.
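For readers who want to reproduce summary figures like these on their own copy of the data, here is a minimal sketch. It assumes the captures are exposed through a CDX-style index in the common column order (field 0 the canonicalized URL key, field 2 the original URL); that packaging is an assumption, not something stated above.

```python
# Sketch: summarize a crawl from a CDX-style index file.
# Assumes whitespace-separated lines with the canonicalized URL key in
# field 0 and the original URL in field 2; adjust the indices if your
# index uses a different column order.
import sys
from urllib.parse import urlparse

def summarize(cdx_path):
    captures = 0
    unique_urls = set()
    hosts = set()
    with open(cdx_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith(" CDX"):   # skip an optional header line
                continue
            fields = line.split()
            if len(fields) < 3:
                continue
            captures += 1
            unique_urls.add(fields[0])    # canonicalized URL key
            host = urlparse(fields[2]).hostname
            if host:
                hosts.add(host)
    return captures, len(unique_urls), len(hosts)

if __name__ == "__main__":
    c, u, h = summarize(sys.argv[1])
    print(f"captures={c} unique_urls={u} hosts={h}")
```

At the scale of billions of captures the in-memory sets would not fit; external sorting or a probabilistic counter such as HyperLogLog would be needed, but the logic is the same.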
Challenges and observations
Despite its breadth, this crawl was largely experimental, and we ran into several operational challenges with the new software that fed URLs to the crawlers. In many cases the URL queues grew beyond the crawl's intended capacity, so not all of a page's embedded and linked resources were captured, and some elements are missing as a result. In addition, certain Argentinian government sites were crawled repeatedly, which introduces a bias into any analysis of the results by country; one possible mitigation is sketched below.
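One way to blunt the bias from repeatedly recrawled sites, sketched here under the same CDX-style assumption as above, is to collapse captures to one per unique URL before aggregating by country-code TLD. The TLD heuristic is itself an approximation, since many country-specific sites live on .com or .org.

```python
# Sketch: count unique URLs per top-level domain, collapsing repeated
# captures of the same URL so heavily recrawled sites (such as the
# Argentinian government sites mentioned above) are counted only once.
from collections import Counter
from urllib.parse import urlparse

def per_tld_unique(cdx_lines):
    seen = set()
    counts = Counter()
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        urlkey, original = fields[0], fields[2]
        if urlkey in seen:            # skip repeated captures of this URL
            continue
        seen.add(urlkey)
        host = urlparse(original).hostname or ""
        tld = host.rsplit(".", 1)[-1] if "." in host else "(none)"
        counts[tld] += 1
    return counts
```

With this in hand, `counts["ar"]` reflects distinct Argentinian URLs rather than the number of times they were fetched, which is usually the more meaningful figure for country-level comparisons.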
Improvements and future plans
Since this initial foray into wide crawling, we have made numerous improvements to our methodology. We are making this dataset available as-is, quirks and complexities of the crawling process included, to encourage experimentation and exploration. We have also carried out further analysis of the gathered content to provide additional insights.
Accessing the dataset
We welcome inquiries about access to this crawl data. If you would like to use it, please write to us at info at archive dot org, telling us who you are and how you intend to use the dataset. We cannot guarantee approval for every request, but each one will be considered thoughtfully.