This project focuses on ingesting, transforming, and analyzing structured (CSV) and semi-structured (JSON) YouTube trending video data across multiple regions. The primary goal is to automate a cloud-based ETL pipeline using AWS services for scalable data engineering and business insights.
| Service | Purpose |
|---|---|
| Amazon S3 | Data lake to store raw, cleansed, and analytics data |
| AWS IAM | Access and role management for secure and granular service permissions |
| AWS CLI | Upload data to S3 with Hive-style partitioning (by region) |
| AWS Glue | Serverless ETL: Crawlers, Jobs, and Studio used for transformation |
| AWS Lambda | On-the-fly JSON transformation triggered by S3 events |
| Amazon Athena | Serverless querying of both raw and enriched datasets |
| Amazon QuickSight | Visualization and dashboarding on final analytics data |
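
The Hive-style partitioning used for the upload can be sketched as a small helper that derives the S3 object key from a file name. This is only an illustration: it assumes the raw CSV files are named like `CAvideos.csv`, with the first two letters giving the region code, and `partitioned_key` is a hypothetical helper rather than part of the project.

```python
def partitioned_key(filename: str, prefix: str = "youtube/raw_statistics") -> str:
    """Build a Hive-style S3 key (region=xx) for a raw CSV file.

    Assumption: the first two characters of the file name are the
    region code, e.g. "CAvideos.csv" -> "ca".
    """
    region = filename[:2].lower()
    if len(region) != 2 or not region.isalpha():
        raise ValueError(f"cannot derive a region code from {filename!r}")
    return f"{prefix}/region={region}/{filename}"


if __name__ == "__main__":
    for name in ["CAvideos.csv", "GBvideos.csv", "USvideos.csv"]:
        # Each key can be appended to s3://<bucket-name>/ when uploading.
        print(partitioned_key(name))
```

Because the `region=xx` segment is part of the key, a Glue Crawler can register `region` as a partition column without it being stored inside the files.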

Dataset fields: video_id, title, views, likes, dislikes, comments, tags, publish_time, category_id, region, etc.

Kaggle Data → S3 (Landing Bucket)
→ AWS Glue Crawlers → Glue Catalog Tables
→ AWS Lambda (Trigger on JSON) → S3 (Cleansed Bucket)
→ AWS Glue Jobs (ETL for CSV) → S3 (Cleansed Bucket)
→ AWS Glue Studio (Join JSON + CSV) → S3 (Analytics Bucket)
→ Athena + QuickSight Dashboards
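
The JSON step in the flow above can be sketched as a pure function that flattens the nested `items` array into flat rows. This is a minimal sketch, not the project's actual Lambda: the function names are hypothetical, the sample document only assumes the general shape of the category reference files (an `items` list with a nested `snippet` object), and reading from the S3 event and writing Parquet (for example with a library such as AWS SDK for pandas) are left out.

```python
from typing import Any


def flatten(record: dict[str, Any], parent: str = "", sep: str = "_") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining key names with `sep`."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def extract_items(document: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one flat row per entry in the document's `items` list."""
    return [flatten(item) for item in document.get("items", [])]


if __name__ == "__main__":
    # Assumed sample shape, for illustration only.
    sample = {
        "kind": "youtube#videoCategoryListResponse",
        "items": [
            {
                "kind": "youtube#videoCategory",
                "id": "1",
                "snippet": {"title": "Film & Animation", "assignable": True},
            }
        ],
    }
    for row in extract_items(sample):
        print(row)
```

Flat column names such as `snippet_title` are easier to query in Athena than nested structures, which is the main reason to normalize before writing to the cleansed bucket.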
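- Raw CSV statistics are stored under `s3://<bucket-name>/youtube/raw_statistics/region=XX/`.
- Raw JSON reference data is stored under `s3://<bucket-name>/youtube/raw_statistics_reference_data/`.
- The Lambda function takes the JSON `items` and writes to S3 in Parquet format.
- The Glue job uses `predicate_pushdown` for regions and partitions by `region`.
- Glue Studio joins `raw_statistics` with `cleaned_statistics_reference_data` on `category_id = id`, using an Inner Join node to merge CSV and JSON sources.
- The joined output is written to `s3://<analytics-bucket>/`.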
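
The region predicate and the inner join on `category_id = id` can be illustrated in plain Python. This is a sketch under assumptions, not the project's Glue code: `region_predicate` and `inner_join` are hypothetical helpers, the sample rows are made up, and both join keys are cast to integers here because the raw JSON `id` is assumed to arrive as a string.

```python
from typing import Any


def region_predicate(regions: list[str]) -> str:
    """Build a partition filter string such as "region in ('ca', 'gb')"."""
    for region in regions:
        if not region.isalpha():
            raise ValueError(f"unexpected region code: {region!r}")
    quoted = ", ".join(f"'{region}'" for region in regions)
    return f"region in ({quoted})"


def inner_join(stats: list[dict[str, Any]],
               categories: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Inner join on stats.category_id = categories.id."""
    by_id = {int(cat["id"]): cat for cat in categories}
    joined = []
    for row in stats:
        match = by_id.get(int(row["category_id"]))
        if match is not None:  # inner join: rows without a match are dropped
            joined.append({**row, **match})
    return joined


if __name__ == "__main__":
    print(region_predicate(["ca", "gb", "us"]))
    # Made-up rows, for illustration only.
    stats = [
        {"video_id": "a", "category_id": 1, "region": "ca"},
        {"video_id": "b", "category_id": 99, "region": "ca"},
    ]
    categories = [{"id": "1", "snippet_title": "Film & Animation"}]
    print(inner_join(stats, categories))
```

In a Glue job, a string like the one returned by `region_predicate` is typically passed as the `push_down_predicate` argument of `create_dynamic_frame.from_catalog`, so that only the matching `region=` partitions are read.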
Use Amazon QuickSight to create a dashboard with some basic visualizations.

This project is inspired by the work of Darshil Parmar.