
This repo is designed to show how to read and write data from/to Google Cloud Storage with PySpark. The raw data is ingested, transformed, and stored in the data lake in snapshot format.

Google Cloud Storage (GCS) Data Ingestion and Building a Data Lake Example 🚀

This is a personal project 🚀 to try Pythonic code standards and PySpark 🐍

Aim of the Project 🎯

There is e-commerce data stored on Google Cloud Storage in JSON format. I am aiming to achieve the following within this project:

  • Ingesting data from GCS with PySpark (a session setup sketch follows this list)
  • Making the intended transformations with PySpark
  • Designing a data lake adopting the Medallion Architecture
  • Snapshotting the data to track new additions and to be able to revert to a previous version
  • Implementing code quality standards like CI/CD and testing
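
As a starting point, here is a minimal sketch of a SparkSession configured for GCS access. The connector version and the service-account key-file path are assumptions for illustration, not values taken from this repository.

```python
from pyspark.sql import SparkSession

# Sketch: a SparkSession with the GCS Hadoop connector configured.
# The connector version and key-file path below are placeholders.
spark = (
    SparkSession.builder
    .appName("gcs-data-ingestion")
    .config(
        "spark.jars.packages",
        "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.11",
    )
    .config(
        "spark.hadoop.fs.gs.impl",
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    )
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config(
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile",
        "/path/to/service-account.json",
    )
    .getOrCreate()
)
```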

How to run? 🏃‍♀️

Please follow these steps:

Running with Docker

  1. Clone this project to your local environment
  2. Run docker-compose build in the terminal. Please make sure you are in the same directory as the docker-compose YAML file.
  3. Run docker-compose up in the terminal.
  4. Then you can reach the Swagger documentation of the application via http://0.0.0.0:8000/docs

Google Cloud Storage Structure ☁️

The source files are stored in the hierarchy shown below:

+-- Buckets
|   +-- webshop-simulation-streaming-landing
|   |   +-- prod
|   |   |   +-- webshop.public.category/
|   |   |   |   +-- file1.json
|   |   |   |   +-- file2.json
|   |   |   |   +-- ...
|   |   |   +-- webshop.public.customer/
|   |   |   |   +-- file1.json
|   |   |   |   +-- file2.json
|   |   |   |   +-- ...
|   |   |   +-- webshop.public.customeradress/
...
|   |   |   +-- webshop.public.customerpaymentprovider/
|   |   |   +-- webshop.public.event/
|   |   |   +-- webshop.public.orders/
|   |   |   +-- webshop.public.paymentprovider/
|   |   |   +-- webshop.public.product/
|   |   |   +-- webshop.public.productbrand/
|   |   |   +-- webshop.public.productcategory/
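
Given this layout, ingesting one source table is a single read call. This is a sketch that reuses the `spark` session from the setup above; the gs:// path is assembled from the folder hierarchy, and read options (e.g. multiLine) would depend on how the JSON files were produced.

```python
# Sketch: read one source table from the landing bucket.
source_path = (
    "gs://webshop-simulation-streaming-landing/prod/webshop.public.category/"
)

category_df = spark.read.json(source_path)
category_df.printSchema()
category_df.show(5, truncate=False)
```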

Data is divided into three layers inside the data lake, from raw to the most mature version, and stored by snapshotting. I assumed that storage cost is not a concern and that being able to revert to a previous version is essential for this imaginary multi-billion-dollar company.

The sink (the data lake) is stored in a structure like this:

+-- Buckets
|   +-- webshop-simulation-streaming-landing
|   |   +-- <name-of-datalake> (I used my name :d)
|   |   |   +-- bronze/silver/gold (a folder for each layer)
|   |   |   |   +-- webshop.public.category/
|   |   |   |   |   +-- <processTime>
|   |   |   |   |   |   +-- file1.json
|   |   |   |   |   |   +-- file2.json
|   |   |   |   |   |   +-- ...
|   |   |   |   +-- webshop.public.customer/
|   |   |   |   |   +-- <processTime>
|   |   |   |   |   |   +-- file1.json
|   |   |   |   |   |   +-- file2.json
|   |   |   |   |   |   +-- ...
...
|   |   |   |   +-- webshop.public.customerpaymentprovider/
|   |   |   |   +-- webshop.public.event/
|   |   |   |   +-- webshop.public.orders/
|   |   |   |   +-- webshop.public.paymentprovider/
|   |   |   |   +-- webshop.public.product/
|   |   |   |   +-- webshop.public.productbrand/
|   |   |   |   +-- webshop.public.productcategory/
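
Writing a snapshot then means writing under a processTime folder inside the chosen layer. This is a sketch of my reading of the layout above, not code from the repo; <name-of-datalake> stays a placeholder, and the timestamp format is an assumption.

```python
from datetime import datetime, timezone

# Sketch: snapshot one table into the bronze layer of the data lake.
process_time = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
bronze_path = (
    "gs://webshop-simulation-streaming-landing/<name-of-datalake>/bronze/"
    f"webshop.public.category/{process_time}/"
)

# category_df comes from the read sketch above.
category_df.write.mode("overwrite").json(bronze_path)
```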

Data Lake Design 🌊

I followed the Medallion Architecture when designing the data lake.

There will be three layers when it is complete:

  1. Bronze: Raw data is read directly from the source and kept in this layer without any transformation. This gives us the ability to track differences between the source system and object storage when there is a problem in production. Metadata information is also kept here.

  2. Silver: Basic transformations, such as converting epoch milliseconds to a human-readable timestamp and aliasing columns, are made in this layer. Metadata information is excluded from this layer (see the sketch after this list).

  3. Gold: I have not started building this layer yet, but my plan is to adopt star-schema modelling to create 2-3 large summary tables. These tables will be ready to be used directly in BI tools and for analytical purposes. Importantly, this layer depends heavily on the requirements and needs of the business.
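
The Silver-layer sketch below shows the kind of transformation described above: converting an epoch-millis column to a timestamp, aliasing a column, and dropping metadata. The input DataFrame and the column names (bronze_df, created_at_millis, id, ingestion_metadata) are illustrative assumptions, not names from the actual schema.

```python
from pyspark.sql import functions as F

# Sketch: a Bronze -> Silver transformation.
silver_df = (
    bronze_df
    # epoch millis -> timestamp (divide by 1000 to get seconds, then cast)
    .withColumn(
        "created_at",
        (F.col("created_at_millis") / 1000).cast("timestamp"),
    )
    # alias a column to a clearer name
    .withColumnRenamed("id", "customer_id")
    # metadata columns are excluded in the Silver layer
    .drop("ingestion_metadata")
)
```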
