/robots.txt … huh?!

Ryan McDonald
3 min read · Apr 7, 2021

So, you’ve got a great idea! Maybe it’s an app, or perhaps some time series analysis. But… you need data.

It’s likely your data won’t be sitting on Kaggle, or in a clean, up-to-date .csv file within easy reach. So, what do you do?


You SCRAPE!

Assuming you already know how to set up a scraper (or use a third-party API), AND you’ve decided to follow the website’s scraping policy, what are you actually allowed to scrape?

That’s where /robots.txt comes into play. Here’s a quick summary of what to look for when reviewing it. Load up the site you’re looking to scrape in your browser, and add ‘/robots.txt’ to the end of the address. This will bring up the file, which spells out the site’s crawling rules.
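If you’d rather pull it up from a script, a couple of lines of Python will do the trick (example.com is just a stand-in for whatever site you’re checking):

import urllib.request

# Fetch and print the site's robots.txt (example.com is a placeholder URL)
with urllib.request.urlopen("https://example.com/robots.txt") as response:
    print(response.read().decode("utf-8"))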

What you’re hoping to see:
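A fully permissive file is about as simple as it gets:

User-agent: *
Disallow: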

The wildcard (*) next to ‘User-agent’ means the rules apply to all users/bots. And when there’s no path listed after ‘Disallow’, we’re in the clear! You’re allowed to access all information on the site!
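If you want to check a specific path without eyeballing the file, Python’s built-in urllib.robotparser will do it for you. A minimal sketch, assuming a made-up bot name and placeholder URLs:

from urllib import robotparser

# Point the parser at the site's robots.txt (example.com is a placeholder)
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether a bot with this (hypothetical) user-agent may fetch a given path
print(rp.can_fetch("MyScraperBot", "https://example.com/some-page"))  # True or False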

This isn’t always the case, though. Some sites don’t want to give scrapers free rein. There are several reasons for this, but the main one is performance: at any given time, a popular site may have hundreds (or more) of bots scraping information…
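A more restrictive file spells out exactly which paths are off limits, and sometimes how often you’re allowed to knock. The paths below are made up, purely for illustration:

User-agent: *
Disallow: /private/
Disallow: /search
Crawl-delay: 10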
