Crawling a billion web pages in just over 24 hours
SMRTR summary
A developer crawled over one billion web pages in 25.5 hours for $462 using a cluster of 12 AWS machines, showing that large-scale web crawling has become dramatically cheaper: similar experiments cost $41,000 in 2012. The project also found that HTML parsing has become a major bottleneck as average page size has grown from 51KB to 242KB, and that SSL encryption now consumes 25% of CPU time during crawling.
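To put the figures above in perspective, here is a minimal back-of-the-envelope sketch in Python. It uses only the numbers quoted in the summary and treats "over one billion" as exactly one billion, so the throughput values are lower bounds. The function name `crawl_stats` and the even split of work across the 12 machines are assumptions made for illustration, not details from the original article.

```python
def crawl_stats(pages=1_000_000_000, hours=25.5, cost_usd=462.0, machines=12,
                cost_2012_usd=41_000.0, page_kb_then=51.0, page_kb_now=242.0):
    """Derive rough rates from the summary's figures (hypothetical helper)."""
    seconds = hours * 3600
    pages_per_sec = pages / seconds
    return {
        # Aggregate throughput across the whole cluster.
        "pages_per_sec": pages_per_sec,
        # Assumes every machine contributed equally, which the summary does not state.
        "pages_per_sec_per_machine": pages_per_sec / machines,
        # Cost normalised to one million crawled pages.
        "usd_per_million_pages": cost_usd / (pages / 1_000_000),
        # How many times cheaper than the 2012 figure.
        "cost_reduction_factor": cost_2012_usd / cost_usd,
        # Growth in average page size, which drives the parsing bottleneck.
        "page_size_growth_factor": page_kb_now / page_kb_then,
    }


if __name__ == "__main__":
    for name, value in crawl_stats().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the cluster sustained roughly 10,900 pages per second (about 900 per machine) at under 50 cents per million pages, which is close to 89 times cheaper than the 2012 figure, while the average page grew by a factor of about 4.7.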
SMRTR provides this summary for quick context. The original article belongs to lobste.rs.