Migrate data from a PostgreSQL database to MySQL, using Docker Compose to set up the containers.
Develop ETL scripts in Python to manage the extraction, transformation, and loading process, including data cleaning to handle missing, duplicate, and inconsistent values. Use a Kaggle dataset as the source, prioritizing large and complex datasets.
Kaggle dataset:
https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
Tasks:
Setting up Databases in Containers:
Configure PostgreSQL and MySQL in separate containers using Docker Compose.
Create databases and tables in both databases, ensuring compatibility for migration.
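One way to keep the table definitions compatible is to derive the MySQL DDL from the PostgreSQL column types. The sketch below is a minimal illustration, not a complete mapping: `PG_TO_MYSQL`, `to_mysql_type`, and `create_table_sql` are hypothetical names, and the type choices (for example `boolean` to `TINYINT(1)`) are assumptions to adapt to the actual schema.

```python
import re

# Assumed, non-exhaustive mapping from PostgreSQL base types to MySQL types.
PG_TO_MYSQL = {
    "boolean": "TINYINT(1)",
    "smallint": "SMALLINT",
    "integer": "INT",
    "bigint": "BIGINT",
    "real": "FLOAT",
    "double precision": "DOUBLE",
    "numeric": "DECIMAL",
    "text": "TEXT",
    "character varying": "VARCHAR",
    "date": "DATE",
    "timestamp without time zone": "DATETIME",
}


def to_mysql_type(pg_type: str) -> str:
    """Translate a PostgreSQL type such as 'character varying(32)' to MySQL."""
    match = re.fullmatch(r"\s*([a-z ]+?)\s*(\(.*\))?\s*", pg_type.lower())
    if match is None or match.group(1) not in PG_TO_MYSQL:
        raise ValueError(f"no MySQL mapping defined for {pg_type!r}")
    base, args = match.group(1), match.group(2) or ""
    if base == "character varying" and not args:
        # MySQL requires a length for VARCHAR, so fall back to TEXT.
        return "TEXT"
    return PG_TO_MYSQL[base] + args


def create_table_sql(table: str, columns: dict[str, str]) -> str:
    """Build a MySQL CREATE TABLE statement from PostgreSQL column types."""
    cols = ",\n  ".join(
        f"`{name}` {to_mysql_type(pg_type)}" for name, pg_type in columns.items()
    )
    return f"CREATE TABLE `{table}` (\n  {cols}\n);"
```

For example, `create_table_sql("orders", {"order_id": "character varying(32)"})` produces a statement declaring `order_id` as `VARCHAR(32)`. Types missing from the mapping raise a `ValueError` so that gaps surface before the migration runs.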
ETL Development:
Extraction: Import data from the selected Kaggle dataset into PostgreSQL.
Transformation: Perform cleaning and standardization, addressing:
Identifying and handling missing values.
Removing or adjusting duplicate records.
Fixing data inconsistencies.
Loading: Migrate the transformed data to MySQL.
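The three cleaning concerns in the transformation step can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: `clean_rows` is a hypothetical helper, rows are dictionaries as returned by `csv.DictReader` or a dict-style database cursor, and the policy shown (drop rows missing a required field, keep the first row per key, lowercase selected text fields) is only one possible choice.

```python
def clean_rows(rows, key, required=(), text_fields=()):
    """Return cleaned copies of `rows` (a list of dicts).

    - inconsistencies: strings are stripped; `text_fields` are lowercased
    - missing values: empty strings become None; rows missing the key or
      any `required` field are dropped
    - duplicates: only the first row seen for each `key` value is kept
    """
    must_have = set(required) | {key}
    seen = set()
    cleaned = []
    for row in rows:
        # Normalize whitespace, then turn empty strings into explicit NULLs.
        row = {k: (v.strip() if isinstance(v, str) else v) for k, v in row.items()}
        row = {k: (None if v == "" else v) for k, v in row.items()}
        for name in text_fields:
            if isinstance(row.get(name), str):
                row[name] = row[name].lower()
        if any(row.get(name) is None for name in must_have):
            continue  # missing a mandatory value
        if row[key] in seen:
            continue  # duplicate key
        seen.add(row[key])
        cleaned.append(row)
    return cleaned


# Made-up sample rows; the column names follow the customers table of the dataset.
sample = [
    {"customer_id": "c1", "customer_city": " Sao Paulo ", "customer_state": "SP"},
    {"customer_id": "c1", "customer_city": "sao paulo", "customer_state": "SP"},
    {"customer_id": "c2", "customer_city": "", "customer_state": "RJ"},
    {"customer_id": "", "customer_city": "curitiba", "customer_state": "PR"},
]
result = clean_rows(
    sample, key="customer_id", required=["customer_state"], text_fields=["customer_city"]
)
```

Here `result` keeps the first `c1` row with the city normalized to `sao paulo`, keeps `c2` with `customer_city` set to `None`, and drops both the duplicate and the row without an identifier.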
Data Validation:
Validate the migration with checks such as:
Quantitative: Compare record counts for each table between PostgreSQL and MySQL to ensure consistency.
Qualitative (optional): Review data samples to confirm that the transformations were applied correctly.
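The quantitative check can be sketched with any two DB-API connections. In the example below, `count_rows` and `compare_counts` are hypothetical helpers, and two in-memory SQLite databases stand in for PostgreSQL and MySQL so the snippet runs without any server; in the real project the connections would come from the PostgreSQL and MySQL drivers. If the cleaning step removes rows, the source count should be taken from the cleaned data rather than the raw import.

```python
import re
import sqlite3

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def count_rows(conn, table):
    """Return the number of rows in `table` using a DB-API connection."""
    # Table names cannot be bound as query parameters, so validate them first.
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unsafe table name: {table!r}")
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    (count,) = cur.fetchone()
    cur.close()
    return count


def compare_counts(source_conn, target_conn, tables):
    """Compare per-table row counts between two databases."""
    report = {}
    for table in tables:
        src = count_rows(source_conn, table)
        dst = count_rows(target_conn, table)
        report[table] = {"source": src, "target": dst, "match": src == dst}
    return report


# Demo: SQLite stands in for both databases; the target is missing one row.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn, n in ((source, 3), (target, 2)):
    conn.execute("CREATE TABLE orders (order_id TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(f"o{i}",) for i in range(n)])

report = compare_counts(source, target, ["orders"])
print(report)  # {'orders': {'source': 3, 'target': 2, 'match': False}}
```

A report like this can be written directly into the data validation report, one line per table.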
Modeling and Architecture:
Structure the project based on a star schema or snowflake schema diagram, as appropriate for the chosen dataset.
Document the overall architecture, including table relationships and ETL processes.
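Before drawing the diagram, the table relationships can be written down and checked in code. The sketch below is only an illustration: the `Table` class, `check_star_schema`, and the table names `fact_order_items`, `dim_customer`, `dim_product`, and `dim_seller` are hypothetical, and the actual split into facts and dimensions depends on the modeling decisions made for the dataset.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    columns: list[str]
    # Maps a column of this table to the name of the table it references.
    foreign_keys: dict[str, str] = field(default_factory=dict)


def check_star_schema(fact, dimensions):
    """Return a list of problems; an empty list means the layout is a star."""
    by_name = {dim.name: dim for dim in dimensions}
    problems = []
    for column, target in fact.foreign_keys.items():
        if column not in fact.columns:
            problems.append(f"{fact.name}.{column} is not a declared column")
        if target not in by_name:
            problems.append(f"{fact.name}.{column} references unknown table {target}")
    for dim in dimensions:
        if dim.foreign_keys:
            # Dimensions that reference other tables make this a snowflake schema.
            problems.append(f"{dim.name} has foreign keys (snowflake, not star)")
    return problems


# Hypothetical layout: one fact table surrounded by three dimensions.
dimensions = [
    Table("dim_customer", ["customer_id", "customer_city", "customer_state"]),
    Table("dim_product", ["product_id", "product_category_name"]),
    Table("dim_seller", ["seller_id", "seller_city", "seller_state"]),
]
fact = Table(
    "fact_order_items",
    ["order_id", "customer_id", "product_id", "seller_id", "price", "freight_value"],
    foreign_keys={
        "customer_id": "dim_customer",
        "product_id": "dim_product",
        "seller_id": "dim_seller",
    },
)
problems = check_star_schema(fact, dimensions)
```

With this layout `problems` is empty. Adding a foreign key to a dimension, for instance from `dim_product` to a separate category table, would be reported as a snowflake relationship.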
Deliverables:
Present Python scripts, the project diagram, and a data validation report.
Provide well-separated scripts for execution on your machine, along with an installation tutorial.
The overall objective is a more professional, well-structured, visually appealing, and thoroughly documented ETL process.
Delivery deadline: November 28, 2024