How to Create a Reusable Web Scraper
Stop constantly modifying your web scrapers
Web scrapers are a ton of fun. What isn’t so fun is constantly modifying code as elements change on the web pages. That is why I set out to create a better web scraping project — one that could be updated to scrape new websites with minimal edits.
The first step was separating the web scraper into distinct logical pieces:
- Page requester
- Page validator
- Templated page processor
Page Requester
The page requester has a few tricks up its sleeve. There is a lot to consider when downloading sites. You want to make sure you are randomizing your user agent and not requesting too frequently from the same domain.
Also, the cost of stopping to analyze why a page did not download can be really expensive, especially if you have a scraper that runs for several hours at a time across several sites. For that reason, we will pickle the requests and save them off as files.
Saving requests into a file has another big benefit. You don’t have to worry about a missing tag blowing up your web scraper. If your processor is separate and you already have the pages downloaded, you can process them as quickly and frequently as you want. What if you discover there is another data element you want to scrape? No worries. Just add the tag and rerun your processor on your already downloaded page.
Below is some sample code for what this looks like in practice:
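A minimal sketch of the idea, assuming the requests library and placeholder values for the user-agent list and the save directory:

import pickle
import random
import time
from pathlib import Path

import requests

# A couple of example user agents -- in practice you would keep a much longer list
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

def download_page(url, save_dir='downloads', delay=2.0):
    """Request a page with a random user agent and pickle the raw response."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers)

    # Save the response so it can be validated and processed later
    # without ever having to hit the site again
    Path(save_dir).mkdir(exist_ok=True)
    file_name = url.replace('://', '_').replace('/', '_') + '.pickle'
    with open(Path(save_dir) / file_name, 'wb') as f:
        pickle.dump(response, f)

    # Be polite: pause before requesting from the same domain again
    time.sleep(delay)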
Page Validator
The page validator goes through the files and unpickles the requests. It reads off the status code of each request, and if that code is something along the lines of a 408 (timeout), you can have it re-queue the site for downloading. Otherwise, the validator moves the file along to be processed by the actual web scraping module.
You could also collect data on why a page didn’t download. Maybe you requested pages too quickly and were banned. This data can be used to tune your page downloader so that it can run as fast as possible with the minimum amount of errors.
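A sketch of that flow, with the folder names here standing in as placeholders, could look like this:

import pickle
from pathlib import Path

def validate_downloads(download_dir='downloads', ready_dir='to_process'):
    """Unpickle each saved response and decide what to do with it."""
    retry_urls = []
    Path(ready_dir).mkdir(exist_ok=True)

    for path in Path(download_dir).glob('*.pickle'):
        with open(path, 'rb') as f:
            response = pickle.load(f)

        if response.status_code == 200:
            # Good download: hand the file over to the page processor
            path.rename(Path(ready_dir) / path.name)
        elif response.status_code == 408:
            # Timeout: re-queue the URL for another download attempt
            retry_urls.append(response.url)
            path.unlink()
        # Anything else stays put so you can analyze why it failed

    return retry_urls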
Templated Page Processor
Here is the magic sauce you have been waiting for. The first step is creating our data model. We start with our URL. Each site/path will likely need a different method for extracting its data, so we start with a dict keyed by domain, like so:

models = {
    'finance.yahoo.com': {},
    'news.yahoo.com': {},
    'bloomberg.com': {}
}
In our use case, we want to extract the article content for these sites. To do this, we will create a selector for the smallest outer element that still contains all our data. For example, here is a sample page for finance.yahoo.com:
<div>
    <a>some link</a>
    <p>some content</p>
    <article class="canvas-body">
        <h1>Heading</h1>
        <p>article paragraph 1</p>
        <p class="ad">Ad Link</p>
        <p>article paragraph 2</p>
        <li>list element</li>
        <a>unrelated link</a>
    </article>
</div>
In the snippet above, we want to target the article. So we will use both the article tag and its class as the identifier, since that is the smallest element containing all of the article content:

models = {
    'finance.yahoo.com': {
        'root-element': [
            'article',
            {'class': 'canvas-body'}
        ]
    },
    'news.yahoo.com': {},
    'bloomberg.com': {}
}
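At processing time, that root-element entry can be handed straight to a parser such as BeautifulSoup (the parser choice here is an assumption; any library that can match a tag plus attributes works). A minimal sketch:

from bs4 import BeautifulSoup

def get_root(page_html, model):
    """Find the smallest element that wraps all of the article content."""
    soup = BeautifulSoup(page_html, 'html.parser')
    tag, attrs = model['root-element']
    return soup.find(tag, attrs)  # e.g. <article class="canvas-body">...</article>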
Next, we want to identify which elements inside our article are garbage. We can see one has the ad class (keep in mind it will never be this simple in real life). To mark it for removal, we add an unwanted_elements entry to our config model:

models = {
    'finance.yahoo.com': {
        'root-element': [
            'article',
            {'class': 'canvas-body'}
        ],
        'unwanted_elements': [
            'p',
            {'class': 'ad'}
        ]
    },
    'news.yahoo.com': {},
    'bloomberg.com': {}
}
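Removing them at processing time then just means decomposing every match inside the root element found earlier (again assuming BeautifulSoup, as in the sketch above):

def strip_unwanted(root, model):
    """Delete ads and other junk from the root element in place."""
    if 'unwanted_elements' in model:
        tag, attrs = model['unwanted_elements']
        for element in root.find_all(tag, attrs):
            element.decompose()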
Now that we have weeded out some of the garbage, we need to note which elements we want to keep. Since we are only after the article text, we just need to specify the p and li elements:

models = {
    'finance.yahoo.com': {
        'root-element': [
            'article',
            {'class': 'canvas-body'}
        ],
        'unwanted_elements': [
            'p',
            {'class': 'ad'}
        ],
        'text_elements': [
            ['p'],
            ['li']
        ]
    },
    'news.yahoo.com': {},
    'bloomberg.com': {}
}
Now for the final piece — the master aggregator! I am going to disregard the unpickling and loading of the config file. If I wrote the whole code, this article would never end:
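In outline, assuming the BeautifulSoup-based sketches above and a models dict whose keys match the page hostnames, it could look something like this:

from urllib.parse import urlparse

from bs4 import BeautifulSoup

def process_page(page_html, url, models):
    """Run the templated extraction for whichever site the URL belongs to."""
    model = models[urlparse(url).netloc]

    # Narrow the document down to the smallest element holding the article
    soup = BeautifulSoup(page_html, 'html.parser')
    root_tag, root_attrs = model['root-element']
    root = soup.find(root_tag, root_attrs)

    # Throw away anything the template flags as garbage
    if 'unwanted_elements' in model:
        bad_tag, bad_attrs = model['unwanted_elements']
        for element in root.find_all(bad_tag, bad_attrs):
            element.decompose()

    # Keep only the element types listed in the template, in document order
    wanted_tags = [selector[0] for selector in model['text_elements']]
    return '\n'.join(
        element.get_text(strip=True) for element in root.find_all(wanted_tags)
    )

Point it at the pages the validator approved and you can re-run it as often as you like whenever a template changes.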
Conclusion
With this code, you can create a template for extracting the article text from any website. You can see the full code and how I have started implementing this on my GitHub.