About this project
Open
**Project Description:** We are seeking an experienced Python developer to create a web scraping script that extracts information from a specific website containing approximately 5 million pages. The extracted data will be stored in a MySQL database and visualized in a dashboard for reporting, filtering, and data views.
The website blocks direct urllib requests, so the scraper needs a reliable way to avoid being blocked. Strategies such as rotating user agents and adding time delays between requests may be necessary. Please note that Selenium, Requests, and Beautiful Soup have already been tested without success.
Requirements:
Web Scraping Framework and Anti-Bot Avoidance Techniques: Implement measures to prevent being blocked by the website, such as randomized user agents and time delays between requests.
Data Storage: Store scraped data in a structured MySQL database. Database design should optimize for both data storage and retrieval efficiency.
Error Handling: Implement error handling to retry failed requests or skip pages after multiple attempts.
Scalability and Distributed Execution: The script should be capable of running on multiple computers simultaneously, allowing the scraping process to be distributed across different machines for faster completion.
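To make the anti-bot and error-handling requirements above concrete, here is a minimal sketch in Python of a retry loop that picks a random user agent and pauses for a random interval before each attempt, then gives up on the page after a fixed number of failures. Everything named here is an assumption for illustration: `USER_AGENTS`, `fetch_with_retries`, the delay bounds, and the `fetch` callable (which stands in for whatever HTTP client ends up working against the site). It is not a tested solution for the target website.

```python
import random
import time

# Hypothetical pool of user-agent strings; a real run would need a larger, current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def fetch_with_retries(url, fetch, max_attempts=3, min_delay=1.0, max_delay=3.0,
                       sleep=time.sleep):
    """Call fetch(url, headers) up to max_attempts times.

    Returns the fetch result on success, or None so the caller can skip the page.
    """
    for attempt in range(1, max_attempts + 1):
        # Randomized pause before every request; it grows with the attempt number.
        sleep(random.uniform(min_delay, max_delay) * attempt)
        # Pick a different user agent at random for each attempt.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            return fetch(url, headers)
        except Exception:
            # Any failure (block, timeout, bad status raised by fetch) triggers a retry.
            continue
    return None
```

Passing `fetch` and `sleep` in as parameters keeps the retry logic independent of the HTTP library, so the same loop can wrap whichever client proves able to get past the site's blocking.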
Deliverables:
Script for scraping with comments and documentation.
MySQL database structure and schema for storing the scraped data.
Instructions for running the script on multiple computers and setting up the MySQL database for concurrent data collection.
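One possible way to split the work across several computers without a central coordinator is to assign each URL to a machine by hashing it. The sketch below assumes each machine is given a hypothetical `worker_id` (0 to `num_workers - 1`) and the same list of URLs; the function names are illustrative only, and other designs (for example a shared job table in MySQL) would also meet the requirement.

```python
import hashlib


def shard_for(url, num_workers):
    """Map a URL to a worker index in [0, num_workers) using a stable hash."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; the result is the same on every machine.
    return int.from_bytes(digest[:8], "big") % num_workers


def urls_for_worker(urls, worker_id, num_workers):
    """Return only the URLs this worker is responsible for."""
    return [u for u in urls if shard_for(u, num_workers) == worker_id]
```

Because the hash is deterministic, no two machines scrape the same page, and each machine can write its results to the shared MySQL database independently.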
Please provide:
Your relevant experience with large-scale scraping projects.
Estimated timeline and cost for the project.
Thank you!
Category: IT & Programming
Subcategory: Web development
What is the scope of the project? Medium-sized change
Is this a project or a position? Project
I currently have: Specifications
Required availability: As needed
Roles needed: Developer
Delivery term: Not specified