Scraping is evergreen

Web scraping and crawling played a major role in creating the Internet we know today. Although the technology, process and results remain invisible to most, it is all here to stay. I would even say that scraping will never be “outdated”, barring extreme regulatory changes.
Of course, over its history, web scraping has undergone significant changes, primarily due to the ever-increasing complexity of the Internet. I think relatively few people remember the beautiful simplicity of web pages from the 90s. Scraping was a little easier back then.
Tandem start
If you were to ask around for the origin story of web scraping, most people would point to relatively new inventions or products. Most likely, you’ll get the answer everyone knows: Google. It is certainly the most successful crawling-based company, but far from the first.
To our knowledge, the first web crawler application was developed in 1993. Matthew Gray built the aptly named “Wanderer”, which was used to discover new websites and estimate the size of the World Wide Web. It’s no surprise that Matthew is now director of engineering for search at Google.
Web scraping began soon after the internet (or, to be exact, the World Wide Web) was created in 1989. It took only a few years before someone started collecting the data stored on it.
Of course, it was mainly driven by curiosity and passion. There was probably little financial value on the Internet in 1993. In the era of Netscape Navigator, many websites were still far from anything close to a business.
It wasn’t long before the usefulness of web scraping was discovered: in the same year, JumpStation was launched – the first crawling-based search engine. Upgrades, competitors and new technologies followed suit.
Most search engines used rudimentary scraping to collect and index pages. Rankings could usually be gamed by stuffing keywords all over the place – a problem that arose from the lack of sophisticated data analysis.
What could be considered the most significant early advance in scraping is Larry Page’s PageRank algorithm, upon which Google was built. Instead of relying on keywords alone, inbound and outbound links became a measure of a website’s importance.
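To make the link-based idea concrete, here is a minimal sketch of a PageRank-style power iteration in Python. The tiny three-page link graph, the damping factor and the iteration count are illustrative assumptions, not how Google actually computes rankings.

```python
# A toy link graph: each page maps to the pages it links out to.
# The graph and parameters below are made-up examples.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    # Start with an even score for every page.
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            # Each page passes its current score evenly to the pages it links to.
            share = rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

print(pagerank(links))  # pages with more (and better-ranked) inbound links score higher
```

The key design point the paragraph describes is visible here: a page’s score is not derived from its own keywords but from the scores of the pages linking to it.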
The professional web
However, web scraping never really caught on back then. Search engines and companies that profited from data were the only ones truly engaged in scraping and crawling. For much of this early history, no one else had a reason to scrape.
As the Internet moved away from glorified TXT files, Geocities and AngelFire towards professionally designed pages with payment gateways and products, business interest grew. An opportunity to reach new audiences and buyers emerged, and companies began to go digital.
Suddenly, monitoring specific pages on the Internet became genuinely useful. Data on the Internet was no longer just information; it could be analyzed for profit or for research purposes.
There was (and still is) a problem, however. While regular internet users created simplistic websites back then, doing business online meant doing marketing and sales. Companies pulled the best practices from traditional advertising and put them online. That meant shiny, sleek websites optimized for viewing, browsing and purchasing.
The professionalization of the Internet led to websites that were more than glorified Excel spreadsheets. As a result, the underlying HTML became more complex, making data extraction much more difficult.
We found ourselves with an interesting dilemma. On the one hand, the Internet had become a treasure trove of incredibly useful data. On the other hand, accessing that data had become unreasonably difficult, made even more complicated by the ever-increasing pace of website changes.
Dedicated scraping
As a result, scraping had to become highly specialized and dedicated. Scrapers and parsers had to be written for specific websites, as in the sketch below, and many homebrew projects still follow the same process.
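As an illustration of that process, here is a minimal sketch of a dedicated scraper and parser written against one assumed page layout. The URL and CSS selectors are hypothetical placeholders rather than any real site’s structure.

```python
# A minimal dedicated scraper: it only understands one imagined product-page
# layout. The URL and selectors are hypothetical examples.
import requests
from bs4 import BeautifulSoup

def scrape_product(url: str) -> dict:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # These selectors are tied to one specific HTML structure; if the site
    # changes its markup, the parser breaks and has to be rewritten.
    return {
        "title": soup.select_one("h1.product-title").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    }

if __name__ == "__main__":
    print(scrape_product("https://example.com/product/123"))
```

The fragility is the point: the extraction logic encodes assumptions about one site’s markup, which is exactly why scrapers historically had to be written per website or per page type.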
Oddly enough, many industry-level scrapers don’t go much further: at best, a dedicated scraper supports a specific page type rather than a single website. For example, at Oxylabs we have the SERP Scraper API, E-Commerce Scraper API and Web Scraper API – dedicated scrapers for search engines, e-commerce pages and generic websites respectively.
These splits are necessary due to the nature of the pages. Product pages, by their very purpose, differ greatly from search engine results pages, which makes their structures different by necessity. As the differences between page structures grow, so does the complexity of an all-in-one scraper and parser. Given how many types and variations of pages exist, the complexity of an all-in-one scraper and parser that never breaks would be nearly endless.
In practice, this means that dedicated scrapers and parsers are, and will be, needed for the foreseeable future. There is some hope that solutions based on AI and machine learning might make the process easier; our own tests have shown promising results for ML-based parsing.
Scraping is (now) forever
Some would say that there is a growing global demand for data. I think that framing is a little misleading. The demand for data has always existed and always will. There is nothing more valuable to any organization, business or otherwise, than being able to understand its environment.
Sentiments about “increasing demand for data” are not unlike a distorted mirror. What they reflect exists (and is true), but not in its entirety. Data has always been the foundation of business, research and government. Even relatively simple businesses use ledgers, write invoices and manage inventory.
As such, data has always had its place. What has changed with the advent of the internet and the evolution of digital businesses is the break with the constraints of geographic space (and, in a way, time). Businesses no longer need to be tied to a physical location.
Companies were, in a way, liberated and gained better access to other markets. On the other hand, more data sources became relevant as the scope of competition and resources also increased. Thus, digitization accelerated the demand for data.
Previously, there was no reason to compete with a company on the other side of the world. Any data about them would have been interesting at best and useless at worst. Now, such data is at worst interesting and at best vital.
Web scraping is the way to meet this demand. There’s no reason to believe demand will slow either. Digitization, the opening of new markets and the importance of having even more data go hand in hand. So web scraping, barring extreme regulatory oversight or a global apocalypse, is now forever.

Juras Juršėnas is COO at Oxylabs.io