The Manners of Webscraping - The Short Version for Busy Analysts

EDI, a robot character from Mass Effect, explains she cannot turn to crime for money, because that would look bad on a resume.
Robots are crime averse

The ethics of web scraping is not an unknown topic in the discourse of development and coding (even I've mentioned this before), and there is a lot of material out there explaining the different moral positions on scraping. At work, our shorthand to new analysts is often to be polite - fucking up someone else's server is rude. But when you're new to something, how do you know what's going to fuck it up? You're learning about APIs while you're learning about a thousand other things, and those 70 pages of documentation for an endpoint you just want to look into seem like a lot, because right now even the best written API material is 90% completely meaningless to you. The next refrain is to check the terms and conditions or robots.txt, but they too, whether by legalese or googlese, are 90% meaningless to a new analyst.

I was on GitHub looking at something else (as always) when I saw someone I follow forking an R repository called Polite. It was written by Dmytro Perepolkin and is a small package that enables R users to scrape web pages politely. Dmytro's markdown explaining the pillars of politeness is well written and absolutely applicable to the wider corners of scraping that a lot of analysts do using tools like Alteryx.

  1. Introduce Yourself (if required)
  2. Seeking Permission
  3. Taking Slowly
  4. Never Ask Twice

Within Alteryx and other "no code" type data handling tools, API query workflows can become quite complex very quickly, particularly if the workflow has to handle a particular authentication method (such as session tokens) or forms of pagination, which generally entail writing iterative calls. I have occasionally been in discussions with colleagues and clients where someone will say "why don't we just...." and propose something that violates one of the above rules. Seeking permission is the most commonly broken one, because people often feel that since they can look at the website as a user, what's the difference between looking at it as a user and as a machine?

EDI, a robotic character from Mass Effect 3, says "I am not formally employed. I have no legal standing in Citadel space." to explain the difference between humans and robots using an electronic service.
You can sell advertising to a user to keep the site running, but machines (at the moment) cannot participate in our ad marketing system. Picture from Mass Effect 3, copyright EA.

Although in that previous blog post I pointed at another article on Medium where James Densmore writes a set of guidelines for scrapers and scrapees, I like this four-item format for its brevity and its plain rules, which make sense to any analyst (though please read those guidelines!).

The idea of introducing yourself is the hardest one to teach a new analyst. Partly because, as above, I don't necessarily need an account to browse a site like AirBnB, but also because sessions, handshakes, "users" and the like are the concealed cookie crumbs we drop across the internet, disguised by our browsers - AirBnB might not know exactly who you are, but they do know if you've been there before. This isn't something we particularly notice until our AirBnB searches start following us around the web.
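
For a scraper, introducing yourself usually comes down to sending a User-Agent header that says who is calling and how to reach them. Here is a minimal sketch in Python rather than Alteryx, using only the standard library. The function name, bot name, contact address and header wording are all made up for illustration, and no request is actually sent.

```python
from urllib.request import Request


def build_polite_request(url: str, bot_name: str, contact: str) -> Request:
    """Build a request whose User-Agent says who we are and how to reach us.

    bot_name and contact are hypothetical values; use whatever identifies
    your own team. The "(contact: ...)" wording is just one possible format.
    """
    user_agent = f"{bot_name} (contact: {contact})"
    return Request(url, headers={"User-Agent": user_agent})


# Example values only. Nothing is sent until the request is passed to
# urllib.request.urlopen().
request = build_polite_request(
    "https://example.com/listings",
    "acme-analytics-bot",
    "data-team@example.com",
)
```

If something does go wrong on the other end, the site owner now has someone to email instead of just an IP address to block.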

"Ask Permission" is a bit easier to understand because permission generally isn't concealed to users when interacting with software. Working out which permissions you need can be a different story depending on the complexity of the API (e.g. if permissions are tied to specific user profiles), but most analysts can start working through this quickly.

"Taking slowly" is one of my favourites because this is so many analysts practical introduction to the requirement of iterative or paced functions, and therefore their first real use case of how to build these in something like Alteryx. It's also a very useful way to establish the idea of data as a commodity which takes time and space to be transmitted, which can be very hard to explain when, for obvious reasons, very few people have actually seen server infrastructure in real life.

Lastly, "Never Ask Twice". The cheer that went up in our office when Alteryx unveiled their cache in workflow feature in 2018.3 was genuine, because when you're mindful of taking slowly, you know that every time you re-run that workflow you are eating into precious politeness points just because something downstream doesn't look quite right. As a practice, it is also an invaluable way of teaching our consultants to consider the longevity of their data pipeline, by ensuring that it has exception handling or accommodations for situations which might otherwise result in "asking twice" - such as diverting a copy of the raw data to a lake before transformation.
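
The "divert a copy of the raw data" idea can be sketched as a small on-disk cache: keep the raw response the first time, and read it back on every later run. This is a Python illustration and not how the Alteryx feature works internally. fetch_once, the cache folder and the .raw extension are hypothetical choices, and a real pipeline would also want some rule for when a cached copy is too old to trust.

```python
import hashlib
from pathlib import Path
from typing import Callable


def fetch_once(url: str, fetch: Callable[[str], bytes], cache_dir: Path) -> bytes:
    """Return the raw response for url, asking the server only on a cache miss.

    The raw bytes are stored untouched, before any transformation, so that
    downstream fixes can be re-run without another request.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a filename that is safe on any filesystem.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cached_file = cache_dir / f"{key}.raw"
    if cached_file.exists():
        return cached_file.read_bytes()
    body = fetch(url)
    cached_file.write_bytes(body)
    return body
```

With this in place, fixing a typo in a downstream formula costs the other server nothing.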