ScrapingHub: Deploying & Scheduling Scrapy Spiders
In this Scrapy tutorial, we are going to cover deploying spider code to Scrapinghub. What is it? scrapinghub.com is a cloud-based web crawling platform where we can upload our spider code and run it from there.
Scrapinghub is an advanced platform for deploying and running web crawlers (also known as spiders or scrapers). It allows you to build crawlers easily, deploy them instantly and scale them on demand, without having to manage servers, backups or cron jobs. Everything is stored in a highly available database and retrievable using an API.
Scrapinghub provides users with a variety of web crawling and data processing services. Its APIs allow users to schedule scraping jobs, retrieve scraped items, retrieve the log for a job, and retrieve information about their spiders.
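To make that concrete, here is a minimal sketch using the python-scrapinghub client library (pip install scrapinghub) — the tutorial itself only uses the web UI, so treat this as optional; the API key, project ID, and spider name are placeholders you would replace with your own:

    from scrapinghub import ScrapinghubClient

    # Placeholder credentials -- use your own API key and project ID.
    client = ScrapinghubClient('YOUR_API_KEY')
    project = client.get_project(123456)

    # Schedule a run of one of the project's spiders.
    job = project.jobs.run('myspider')

    # Once the job has finished, iterate over the scraped items...
    for item in job.items.iter():
        print(item)

    # ...and over the job's log lines.
    for line in job.logs.iter():
        print(line)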
At Scrapinghub, register for free or sign in with Google or GitHub. On the overview page, we can create our projects: name your project, select Scrapy as the tool (since we built our spider with Scrapy), and click Create. Finally, we can deploy our spider; the page gives you instructions on how to actually do this. The tool we need is the Scrapinghub command line client, shub, which can be installed by simply typing pip install shub in the terminal.
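Installing the client and logging in looks like this (shub login prompts you for the Scrapinghub API key shown in your account settings):

    pip install shub
    shub login    # prompts for your Scrapinghub API key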
Make sure you are in the Scrapy project folder, then type shub deploy followed by the project ID. In a few seconds we get the status, and once it is OK, the "Code & Deploys" page on Scrapinghub is updated. On the Scrapinghub dashboard, there is a Run button to run our Scrapy spider. Once the scraping job finishes, we can export the data as CSV, JSON, or XML and download the file.
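A typical deploy looks like the following; the project ID 123456 and the job ID are placeholders for your own values. As an aside, shub can also pull a finished job's items straight from the terminal as JSON lines:

    cd myspider_project    # the folder containing scrapy.cfg
    shub deploy 123456

    # After running the spider from the dashboard, fetch the scraped
    # items of job 1/1 in that project:
    shub items 123456/1/1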
One of the important features of Scrapinghub is "Periodic Jobs". You select a Scrapy spider, a priority, and the day and hour to run it. For example, if you want to run your spider code each day at around 12 o'clock, you would select 12 o'clock and click Save. On the dashboard you will then see it listed under "Next Jobs", and at around 12 o'clock it will run; after 30 or so seconds, for example, it will move to Completed Jobs.
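Periodic jobs are configured in the web UI, but for completeness, the shub client can also kick off an immediate one-off run from the terminal; the project ID and spider name here are again placeholders:

    # Schedule an immediate run of "myspider" in project 123456
    shub schedule 123456/myspider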
Scrapinghub offers other scraping tools as well. One is a partially free visual web scraping service (Portia). Another, Crawlera, is a good fit when you are scraping a website that throws captchas: it integrates your existing spider code with a pool of different IPs, and once an IP gets banned or starts hitting captchas, it rotates to the next one.