If You are Web Scraping Don’t Do These Things

Create better web scrapers by avoiding these common mistakes.


Web scraping is a favorite pastime of many programmers. I feel like 2 out of 3 projects I get involved with end up needing me to do some web scraping. That being said, I have seen a LOT of bad web scraping scripts. Even worse, there are people actually charging for code with these issues.

1. Don’t Hard Code Session Cookies


Stop. Just. Stop. Anything you hard code has the potential to fail miserably. Here is an example of what this looks like, versus what you should do instead.

Your client has a site they want to scrape that requires a login. No problem, right? Just log in to the site from a browser, grab the session cookie, and send it with every request your code makes to the server. Ez pz. What you don't know is the TTL (time to live) of that session. What if the session expires after one month? That means once you have handed your client their script, you have at most one month before their code is dead in the water.

import requests

# This is bad. Don't do this
headers = {
    'Cookie': '_session=23ln4teknl4iowgel'
}

for url in url_list:
    response = requests.get(url, headers=headers)

So what should you do instead? Code your program to log in, and use a requests session to ensure your cookies get sent with every request!

s = requests.Session()
s.post("https://fakewebsite.com/login", data=login_data)

for url in url_list:
    response = s.get(url)

It takes just a little extra work, but it will save you from having to constantly update the code.

2. Don’t DOS Websites


Not that type of DOS. I mean Denial of Service. If you don't think you are doing this, you should read this section, because I'm about to blow your mind. Writing a bare for loop that hits a website as fast as it can is a DOS. Take this code for example:

for page in range(1000):
    response = requests.get("https://search.com?page=" + str(page))

This is the type of stuff that will get your IP banned, and then you will either have to switch to a rotating proxy (potentially expensive and time consuming) or get a new public IP every time you are banned. Depending on your ISP, that could be an issue.

Instead, just add some friendly delays! WOW SO SIMPLE! (I am literally copying and pasting my solution from Stack Overflow, and that is okay.)

from random import randint
from time import sleep

for page in range(1000):
    response = requests.get("https://search.com?page=" + str(page))
    sleep(randint(2, 5))

This way you can avoid being banned for making too many requests too fast.

3. Don’t Copy and Paste Reusable Code


I keep seeing people copy and paste their HTTP validation logic all over the place. Write it once and forget it! Here is a sample gist you can use as a starting point. Side note: to learn more about logging, check out my article on it.
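Something along these lines is what I have in mind. The exact status checks and the logging setup below are placeholders of my own, and the successful_request name simply matches the usage example further down.

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def successful_request(response):
    # Return False on any problem so the scraper simply skips this endpoint
    if response is None:
        logger.warning("No response received")
        return False
    if response.status_code != 200:
        logger.warning("Got status %s for %s", response.status_code, response.url)
        return False
    if not response.text:
        logger.warning("Empty body for %s", response.url)
        return False
    return True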

In the above code you can pass in the response and boom! You are off to the races. Note that, for my own sake, I only return False when there is an issue with the request, which makes the script simply pass over that endpoint. So much can go wrong while scraping that, unless you are tracking every status, it is better to just move on and come back later to correct the misses.

Here is how you could leverage something like this:

for url in url_list:
    response = requests.get(url)
    if successful_request(response):
        process_data(response)

4. Don’t Write Single Threaded Scrapers


That went 0–100 real quick. I know this sounds challenging, but hear me out. In step 2 we talked about not DOS'ing websites. The time you lose to the random delays between requests can be made back by splitting your work across several threads. I would recommend keeping each domain on a single thread so that our work from step 2 is not undone.

In the following example we are scraping 4 websites, and each of them has a list of 10 URLs we want to scrape. Here is what that might look like.
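The sketch below runs on my own placeholders. The four site names, the scrape_site helper, and the per-request delay are all assumptions; the important part is the shape, one thread per domain.

import requests
from threading import Thread
from random import randint
from time import sleep

# Placeholder domains: each one gets its own list of 10 URLs to scrape
websites = {
    "site-one.com": ["https://site-one.com/page/" + str(i) for i in range(10)],
    "site-two.com": ["https://site-two.com/page/" + str(i) for i in range(10)],
    "site-three.com": ["https://site-three.com/page/" + str(i) for i in range(10)],
    "site-four.com": ["https://site-four.com/page/" + str(i) for i in range(10)],
}

def scrape_site(urls):
    # Keeping one domain per thread preserves the friendly delays from step 2
    for url in urls:
        response = requests.get(url)
        # hand the response off to your validation and processing here
        sleep(randint(2, 5))

threads = [Thread(target=scrape_site, args=(urls,)) for urls in websites.values()]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()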

Above we give each website its own thread. Note that more threads doesn't always mean better performance, since all of these threads live on the same core. Confusing, I know, but it is something you will likely come across in testing.

5. Don’t Use the Same Pattern for Scraping


Many websites will ban you if you do the same thing over and over again. There are some strategies you can use to circumvent this.

First is scrambling the order in which you access pages. Here is a short way to do this with a list. As always, credit to Stack Overflow for the scrambled function.

import random

url_list = ['fakesite.com/1',
            'fakesite.com/2',
            'fakesite.com/3']

def scrambled(orig):
    dest = orig[:]
    random.shuffle(dest)
    return dest

for url in scrambled(url_list):
    pass  # do something with url

This way we never access resources in the same order, so we cannot be tracked by the pattern in which we hit the website. It works even better if you are scraping several websites at once.

The second method for escaping pattern matching is to randomize your user agent. The user agent is the string your client sends to tell the web server what it is. For example, Chrome would send something like the following as its user agent:

Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36

In this string there are several pieces of information, such as the browser version and so on. You can find a list of user agents on developers.whatismybrowser.com. An example of randomizing user agents can be found in the NewsTicker project on my GitHub. See the get_random_ua function.
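If you don't feel like digging through that project, the idea boils down to something like the sketch below. The user agent list and the example URL are placeholders of my own, and the real get_random_ua may look a little different.

import random
import requests

# Placeholder list; pull real strings from developers.whatismybrowser.com
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0',
]

def get_random_ua():
    # Pick a different user agent for each request
    return random.choice(user_agents)

response = requests.get("https://fakewebsite.com",
                        headers={'User-Agent': get_random_ua()})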

The third method for avoiding pattern detection is to run your script at random intervals with random subsets of the URLs you want to scrape.
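As a quick illustration of the random-subset half of that idea (the url_list name and the half-the-list split are my own assumptions):

import random
import requests

# On each scheduled run, hit only a random subset of the full URL list
subset = random.sample(url_list, k=len(url_list) // 2)

for url in subset:
    response = requests.get(url)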

Conclusion

Web scraping doesn't have to be hard. The best thing you can do for yourself is build good tools that you can reuse, and your web scraping life will be much easier. If you need assistance with a web scraping project, feel free to reach out to me on Twitter, as I do consulting.