- Codelabs Documentation: [https://codelabs-preview.appspot.com/?file_id=1TyTEfFZ0RGRb6SoKJ_QyV0ByJaUFGLQ3JKNmdKlni_g/edit?tab=t.5cpih9qtxm58#6]
- Project Submission Video (5 Minutes): [https://drive.google.com/drive/u/1/folders/1898HGutXjQIxwx3OVnr_Yvx9Uq_SKAE1]
- Hosted Application Links:
- Frontend (Streamlit): [Insert Frontend URL]
- Backend (FastAPI): [Insert Backend URL]
This project focuses on building a master financial statement database to support analysts conducting fundamental analysis of US public companies. The database is built using Snowflake and sourced from SEC Financial Statement Data Sets.
The project includes:
- Data Scraping from the SEC Markets Data page to retrieve dataset links.
- Data Storage Design exploring three approaches (Raw Staging, JSON Transformation, Denormalized Fact Tables).
- Data Validation using Data Validation Tool (DVT) for schema integrity.
- Operational Data Pipelines with Apache Airflow and S3 for staging.
- Post-Upload Testing ensuring correctness across all three storage solutions.
- A client-facing Streamlit application with a FastAPI backend, giving analysts access to the stored data (a minimal page is sketched below).
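As an illustration of the client-facing layer, here is a minimal sketch of a Streamlit page that calls the backend. The backend URL, endpoint path, and default CIK are assumptions for illustration, not the project's actual API surface.

```python
# Frontend sketch (hypothetical): a Streamlit page that queries the
# FastAPI backend. The URL and endpoint are assumptions.
import requests
import streamlit as st

st.title("SEC Financial Statements Explorer")

# CIK 320193 is Apple Inc.; used here only as a convenient default.
cik = st.number_input("Company CIK", min_value=1, value=320193)
if st.button("Fetch balance sheet"):
    resp = requests.get(f"http://localhost:8000/balance-sheet/{cik}", timeout=30)
    resp.raise_for_status()
    st.dataframe(resp.json())
```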
The challenge is to design a scalable financial data storage solution by:
- Scraping and extracting SEC Financial Statement Data (a scraper is sketched after this list).
- Comparing multiple data storage strategies (Raw, JSON, and RDBMS).
- Ensuring data integrity with validation techniques before ingestion.
- Developing Airflow pipelines to automate the ETL process.
- Building a front-end interface for analysts to query and visualize data.
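A minimal scraping sketch, assuming requests and BeautifulSoup and that the quarterly archives appear as plain `.zip` links; the page URL and selector logic are assumptions to verify against the live page:

```python
# Hedged sketch: collect quarterly dataset links from the SEC
# Financial Statement Data Sets page. URL and page structure are
# assumptions; adjust against the live page.
import requests
from bs4 import BeautifulSoup

URL = "https://www.sec.gov/dera/data/financial-statement-data-sets"
# The SEC asks automated clients to send a descriptive User-Agent.
HEADERS = {"User-Agent": "financial-db-project admin@example.com"}

resp = requests.get(URL, headers=HEADERS, timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
# Keep every link that points at a quarterly ZIP archive (e.g. 2024q1.zip).
zip_links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].endswith(".zip")]
print(zip_links)
```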
The solution delivers:
- Efficient storage and retrieval of financial statement data.
- A robust Snowflake-based infrastructure supporting real-time queries.
- A Streamlit and FastAPI-powered interface enabling intuitive access.
- A pipeline architecture that ensures data consistency and reliability.
Key challenges include:
- Handling large SEC datasets efficiently in Snowflake.
- Managing schema transformation from the raw SEC format to JSON and denormalized tables.
- Ensuring data validation before ingestion and transformation (a minimal check is sketched below).
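DVT itself is driven by its own CLI and connection configs; as a minimal stand-in, the kind of parity check it automates can be sketched in plain Python. The connection settings and table/file names here are assumptions:

```python
# Hedged stand-in for a DVT-style check: compare the row count of a
# staged SEC tab-delimited file against its Snowflake table.
import csv
import snowflake.connector

def count_tsv_rows(path: str) -> int:
    """Count data rows in an SEC tab-delimited file, excluding the header."""
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.reader(f, delimiter="\t")) - 1

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM SEC_FINANCIALS.PUBLIC.RAW_NUM")
table_rows = cur.fetchone()[0]

file_rows = count_tsv_rows("data/2024q1/num.txt")
assert file_rows == table_rows, f"Row mismatch: file={file_rows}, table={table_rows}"
```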
The project implements three data storage strategies (table definitions are sketched after this list):
- Raw Staging: Storing data as-is from SEC Financial Statement Datasets.
- JSON Transformation: Converting data into JSON format for faster access.
- Denormalized Fact Tables: Structuring data into Balance Sheet, Income Statement, and Cash Flow tables.
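A sketch of what one table per strategy might look like, created through the official Snowflake Python connector. Column choices follow the SEC `num.txt` layout loosely; all names are illustrative, not the project's actual schema.

```python
# Hedged sketch: one table per storage strategy (names are assumptions).
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="COMPUTE_WH", database="SEC_FINANCIALS", schema="PUBLIC",
)
cur = conn.cursor()

# 1. Raw staging: columns mirror the SEC num.txt file, loaded as-is.
cur.execute("""
    CREATE TABLE IF NOT EXISTS RAW_NUM (
        adsh VARCHAR, tag VARCHAR, version VARCHAR,
        ddate DATE, qtrs INTEGER, uom VARCHAR, value FLOAT
    )
""")

# 2. JSON transformation: one VARIANT column holding a per-filing document.
cur.execute("""
    CREATE TABLE IF NOT EXISTS JSON_FILINGS (
        adsh VARCHAR, filing VARIANT
    )
""")

# 3. Denormalized fact table: one row per company/period/line item,
#    split by statement (balance sheet shown here).
cur.execute("""
    CREATE TABLE IF NOT EXISTS FACT_BALANCE_SHEET (
        cik INTEGER, company_name VARCHAR, period_end DATE,
        tag VARCHAR, value FLOAT
    )
""")
cur.close()
conn.close()
```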
The implementation spans five areas:
- Data Extraction: Scraping SEC links and retrieving datasets.
- Schema Design: Creating table structures for all three storage solutions.
- ETL Pipelines: Using Airflow and S3 staging for efficient data handling (see the DAG sketch after this list).
- Data Validation: Implementing DVT checks for schema and data integrity.
- Frontend & Backend: Developing Streamlit (UI) + FastAPI (API) + Snowflake (DB).
Prerequisites:
- Python 3.10 or higher
- Snowflake Account (for database operations)
- Airflow (for automated ETL workflows)
- Streamlit & FastAPI (for client-server communication)
- AWS S3 (for intermediate data storage)
- DVT (Data Validation Tool) for schema checks
git clone <repository-url>
cd <repository-name>

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt
Create a `.env` file in the project root with your credentials:
SNOWFLAKE_ACCOUNT=your_snowflake_account
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
AWS_ACCESS_KEY=your_aws_key
AWS_SECRET_KEY=your_aws_secret
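One way the backend might load these values (python-dotenv is an assumed dependency; the project may read `os.environ` directly):

```python
# Hedged sketch: load .env credentials with python-dotenv (assumed dependency).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
SNOWFLAKE_ACCOUNT = os.environ["SNOWFLAKE_ACCOUNT"]
SNOWFLAKE_USER = os.environ["SNOWFLAKE_USER"]
SNOWFLAKE_PASSWORD = os.environ["SNOWFLAKE_PASSWORD"]
AWS_ACCESS_KEY = os.environ["AWS_ACCESS_KEY"]
AWS_SECRET_KEY = os.environ["AWS_SECRET_KEY"]
```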
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
airflow dags trigger financial_data_pipeline
python backend/main.py
streamlit run frontend/streamlit_app.py
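The backend exposes query endpoints over the Snowflake tables; below is a minimal sketch of one such endpoint. The route, table, and column names are assumptions matching the illustrative schema shown earlier, not the project's actual API.

```python
# Hypothetical backend endpoint sketch (backend/main.py); names are assumptions.
import snowflake.connector
from fastapi import FastAPI

app = FastAPI()

def get_conn():
    return snowflake.connector.connect(
        account="your_account", user="your_user", password="your_password",
        database="SEC_FINANCIALS", schema="PUBLIC",
    )

@app.get("/balance-sheet/{cik}")
def balance_sheet(cik: int, limit: int = 100):
    """Return balance-sheet line items for one company, identified by CIK."""
    cur = get_conn().cursor(snowflake.connector.DictCursor)
    cur.execute(
        "SELECT tag, period_end, value FROM FACT_BALANCE_SHEET "
        "WHERE cik = %s ORDER BY period_end DESC LIMIT %s",
        (cik, limit),
    )
    return cur.fetchall()

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```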
Assignment02/
├── airflow/
│   ├── dags/
│   ├── airflow.sh
│   ├── cleanup.sh
│   └── requirements.txt
├── backend/
│   ├── controllers/
│   ├── models/
│   ├── routes/
│   ├── utils/
│   ├── config.py
│   ├── main.py
│   └── requirements.txt
├── dbt/
│   └── txt.txt
├── docs/
│   └── txt.txt
├── frontend/
│   ├── app.py
│   └── requirements.txt
├── json_converter/
│   ├── airflow_forconverter/
│   ├── dags/
│   └── logs/scheduler/
├── snowflake/
│   ├── sql/
│   ├── raw/
│   ├── setup/
│   ├── __init__.py
│   ├── config.yml.template
│   ├── requirements.txt
│   └── run_pipeline.py
├── src/
│   └── scrape_to_json.py
├── tests/
│   └── txt.txt
├── .gitignore
├── LICENSE
└── README.md
- airflow/: Contains DAGs and scripts for automating ETL workflows.
- backend/: FastAPI-based backend providing API endpoints, data processing, and database interaction.
- dbt/: Configuration files for Data Build Tool (DBT) transformations.
- docs/: Documentation, reports, and system diagrams.
- frontend/: Streamlit-based web application for user interaction.
- json_converter/: Utilities for extracting and converting JSON data.
- snowflake/: SQL scripts, raw extracted data, and setup configurations for Snowflake.
- src/: Scripts for web scraping and JSON conversion.
- tests/: Unit and integration tests to validate system functionality.
- .gitignore: Specifies files and directories to ignore in version control.
- LICENSE: Licensing details for the project.
- README.md: Main project documentation and setup guide.
Team contributions:
- Sai Priya Veerabomma - 33.3% (Frontend, API Development)
- Sai Srunith Silvery - 33.3% (Backend, Airflow)
- Vishal Prasanna - 33.3% (Deployment, Snowflake Configuration)
This project is open-source under the MIT License. See LICENSE for details.