Personal Reporting Documentation
Personal data integration and analytics platform using dlt and dbt against BigQuery, orchestrated with GitHub Actions.
Overview
This repository contains a VS Code Dev Container configuration and the infrastructure and workflows for my personal data platform. It leverages Google Cloud Platform services, including BigQuery for data warehousing and Secret Manager for secure credential management, with automated orchestration through GitHub Actions.
Data Sources
The following data will be ingested from my personal systems into a BigQuery warehouse for automation and analysis.
Notion
HubSpot
Fitbit
Warehouse Data Flow
```mermaid
graph TB
    %% Sources
    S1[Notion]
    S2[HubSpot]

    subgraph raw
        direction TB
        R1[Daily Habits]
        R2[Weekly Habits]
        R3[Monthly Habits]
        R4[Contacts]
        R5[Companies]
        R6[Engagements]
    end

    %% Source to Raw Flows
    S1 --> R1
    S1 --> R2
    S1 --> R3
    S2 --> R4
    S2 --> R5
    S2 --> R6

    subgraph staging
        C1[Notion Habits]
        C2[CRM Contacts]
        C3[CRM Companies]
        C4[CRM Engagements]
    end

    %% Raw to Staging Flows
    R1 --> C1
    R2 --> C1
    R3 --> C1
    R4 --> C2
    R5 --> C3
    R6 --> C4
```
Architecture
Data Pipeline Stack
dlt - Extract and load source data into the BigQuery raw layer (a minimal pipeline sketch follows this list)
dbt core - Transform raw data into analytics-ready models and views
BigQuery - Cloud data warehouse for storage and analysis
GCP Secret Manager - Secure credential management for API keys and connections
GitHub Actions - Automated orchestration and scheduling of data pipelines
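To make the dlt-to-BigQuery flow concrete, here is a minimal, hedged sketch of a pipeline loading into the raw layer. The resource, sample rows, and names are illustrative placeholders, not the repository's actual pipeline code.

```python
# Minimal sketch only: resource, rows, and names are illustrative placeholders.
import dlt


@dlt.resource(name="contacts", write_disposition="merge", primary_key="id")
def contacts():
    # A real pipeline would page through the HubSpot API here;
    # static rows keep the sketch self-contained and runnable.
    yield {"id": 1, "email": "a@example.com"}
    yield {"id": 2, "email": "b@example.com"}


pipeline = dlt.pipeline(
    pipeline_name="hubspot",
    destination="bigquery",   # credentials resolved from the environment / Secret Manager
    dataset_name="raw_data",  # the raw layer described in Setup below
)

load_info = pipeline.run(contacts())
print(load_info)  # dlt returns execution metadata that can be logged
```

The merge write disposition with a primary key is what keeps repeated incremental loads idempotent at the destination.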
Project Structure
The project follows modern data engineering best practices with clear separation of concerns:
```
├── pipelines/          # dlt data extraction pipelines
│   ├── hubspot.py      # HubSpot CRM data pipeline
│   ├── fitbit.py       # Fitbit health data pipeline
│   ├── notion.py       # Notion habits data pipeline
│   └── common/         # Shared utilities and helpers
├── dbt/                # dbt transformation models
│   └── michael/        # Personal dbt project
├── .github/            # GitHub Actions workflows
│   └── workflows/      # CI/CD and orchestration
├── scripts/            # Utility scripts and helpers
└── config/             # Configuration files and templates
```
Naming Conventions
dlt pipelines: {source}__{entity} (e.g., hubspot__contacts, fitbit__sleep)
dbt models: {layer}_{source}__{entity} (e.g., staging_hubspot__contacts, contacts)
GitHub Actions: {actions}-{frequency} (e.g., dlt-daily)
Setup
Prerequisites
Google Cloud Platform Account with BigQuery and Secret Manager APIs enabled
GitHub Account with repository access
Python 3.12+ and pip for local development
Docker (optional, for containerized development)
GCP Setup
Create a GCP Project and enable required APIs:
BigQuery API
Secret Manager API
Set up BigQuery:
Create a dataset for raw data (raw_data)
Create a dataset for transformed data (analytics); a sketch of creating both datasets follows this list
Ensure proper IAM permissions for service accounts
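The two datasets can be created in the Cloud Console or with the Python client. A sketch assuming application-default credentials, with a placeholder project ID and an arbitrary location:

```python
# Hedged sketch: create the raw and analytics datasets with google-cloud-bigquery.
# The project ID is a placeholder; dataset names come from this README.
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

for name in ("raw_data", "analytics"):
    dataset = bigquery.Dataset(f"{client.project}.{name}")
    dataset.location = "EU"  # pick the location that matches your project
    client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists
    print(f"Dataset {name} is ready")
```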
Configure Secret Manager:
Configure a GCP service account key as the application default credentials
Store static keys and credentials for each data source in TOML fragments:
```toml
[sources.hubspot]
api_key = "your-key"

[sources.notion]
api_key = "your-key"

[sources.fitbit]
client_id = "your-id"
client_secret = "your-secret"
```
After authorising the Fitbit OAuth client, store the refresh token in a dedicated secret (read back at runtime as sketched below):
sources-fitbit-refresh_token
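At runtime a pipeline can read that secret with the google-cloud-secret-manager client. A minimal sketch, with the project ID as a placeholder:

```python
# Minimal sketch of reading a secret at runtime with google-cloud-secret-manager.
# The secret ID matches the one named above; the project ID is a placeholder.
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
name = "projects/your-project-id/secrets/sources-fitbit-refresh_token/versions/latest"
response = client.access_secret_version(request={"name": name})
refresh_token = response.payload.data.decode("utf-8")
```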
Create Service Account:
Service account with BigQuery Admin and Secret Manager Secret Accessor roles
Download JSON key file for GitHub Actions
Local Development Setup
Clone and setup:
```bash
git clone <repository-url>
cd personal-reporting-pipelines
make install
```
Configure credentials:
```bash
# Set GCP project
gcloud config set project YOUR_PROJECT_ID

# Authenticate with GCP
gcloud auth application-default login
```
Set environment variables:
```bash
export GCP_PROJECT_ID=your-project-id
export RAW_SCHEMA_NAME=raw_data
```
GitHub Actions Setup
Add repository secrets:
GOOGLE_APPLICATION_CREDENTIALS: Service account JSON key
Configure workflow schedules in .github/workflows/:
Daily pipelines for HubSpot and Fitbit
Weekly pipelines for Notion
Manual triggers for full refresh scenarios
Integrations
Google Cloud BigQuery - Data warehouse using dbt BigQuery adapter
GCP Secret Manager - Secure credential management for API keys and service accounts
Pipeline Refresh Patterns
The pipelines support flexible refresh modes for data loading (a mode-detection sketch follows this list):
Incremental (default): Only loads new/changed data since last run
Full refresh: Completely reloads all data, useful for data quality issues or schema changes
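A hypothetical helper showing how these two modes could be derived from the environment variables documented below; the function name and precedence are illustrative, not necessarily what the shared pipeline code implements.

```python
# Hypothetical helper: decide whether a pipeline run should be incremental,
# based on the environment variables documented in this section.
import logging
import os

logger = logging.getLogger(__name__)


def is_incremental_run(pipeline_name: str) -> bool:
    """Return False when a full refresh has been requested via the environment."""
    if os.getenv("FORCE_FULL_REFRESH", "").lower() == "true":
        logger.info("Global FORCE_FULL_REFRESH set: running full refresh")
        return False
    if os.getenv(f"{pipeline_name.upper()}_FULL_REFRESH", "").lower() == "true":
        logger.info("%s full refresh requested: running full refresh", pipeline_name)
        return False
    logger.info("Running %s incrementally", pipeline_name)
    return True
```

Logging the chosen mode here also satisfies the debugging practice listed under Best Practices.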
How to Trigger Full Refresh
Method 1: Environment Variable Override (Global)
```bash
export FORCE_FULL_REFRESH=true
pipenv run python -m pipelines.hubspot
```
Method 2: Pipeline-Specific Override
```bash
# Force full refresh for HubSpot only
export PIPELINE_NAME=HUBSPOT
export HUBSPOT_FULL_REFRESH=true
pipenv run python -m pipelines.hubspot
```
Method 3: Direct Function Parameter
```python
from pipelines.hubspot import refresh_hubspot

# Force full refresh
refresh_hubspot(is_incremental=False)

# Use environment-based detection (default)
refresh_hubspot()  # or refresh_hubspot(is_incremental=None)
```
Method 4: GitHub Actions Integration
```yaml
# .github/workflows/hubspot-pipeline.yml
name: HubSpot Pipeline

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM UTC
  workflow_dispatch:
    inputs:
      force_full_refresh:
        description: 'Force full refresh'
        required: false
        default: 'false'
        type: boolean

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Install dependencies
        run: make install

      - name: Run HubSpot Pipeline
        env:
          FORCE_FULL_REFRESH: ${{ inputs.force_full_refresh }}
        run: |
          pipenv run python -m pipelines.hubspot
```
Environment Variables Reference
| Variable | Description | Example |
|---|---|---|
| FORCE_FULL_REFRESH | Global override for all pipelines | FORCE_FULL_REFRESH=true |
| PIPELINE_NAME | Pipeline identifier for specific overrides | PIPELINE_NAME=HUBSPOT |
| {PIPELINE}_FULL_REFRESH | Pipeline-specific full refresh flag | HUBSPOT_FULL_REFRESH=true |
Best Practices
Default to incremental: Always use incremental loading for regular operations
Use full refresh sparingly: Only when necessary for data quality or schema changes
Environment-based control: Use environment variables for production deployments
Logging: Always log the refresh mode being used for debugging
GitHub Actions integration: Use workflow inputs and environment variables to control refresh mode dynamically
Secret management: Store all credentials in GCP Secret Manager, never in code
Monitoring: Set up alerts for pipeline failures and data quality issues
Development and Testing
Environments
Development: Local development with GCP dev project
Staging: GitHub Actions testing environment
Production: Automated production pipelines via GitHub Actions
Local Development
Build the Dev Container in VS Code; this runs script/setup to install dependencies
Run pipelines locally for testing:
```bash
pipenv run python -m pipelines.hubspot
pipenv run python -m pipelines.fitbit
pipenv run python -m pipelines.notion
```
Test dbt models:
```bash
cd dbt/michael
pipenv run dbt run --target dev
pipenv run dbt test --target dev
```
Testing
Unit tests for pipeline functions and utilities (see the sketch after this list)
Integration tests for dlt sources and dbt models
End-to-end tests for complete pipeline execution
Data quality tests using dbt’s built-in testing framework
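As an example of the unit-test layer, a pytest case for the hypothetical refresh-mode helper sketched earlier (module path and names are assumptions, not the repository's actual tests):

```python
# Hypothetical pytest cases for the refresh-mode helper sketched earlier.
# Assumes it lives at pipelines.common.refresh, which is an illustrative path.
from pipelines.common.refresh import is_incremental_run


def test_defaults_to_incremental(monkeypatch):
    monkeypatch.delenv("FORCE_FULL_REFRESH", raising=False)
    monkeypatch.delenv("HUBSPOT_FULL_REFRESH", raising=False)
    assert is_incremental_run("hubspot") is True


def test_global_override_forces_full_refresh(monkeypatch):
    monkeypatch.setenv("FORCE_FULL_REFRESH", "true")
    assert is_incremental_run("hubspot") is False
```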
dbt Development
Open the dbt project as the editor's root directory so SQLFluff and other utilities pick up the project config
Local profile: Copy the BigQuery service account key to ~/.dbt/profiles.yml
Model development: Use pipenv run dbt run --select model_name for iterative development
Documentation: Generate with pipenv run dbt docs generate and pipenv run dbt docs serve
GitHub Actions Orchestration
Workflow Structure
The project uses GitHub Actions for automated pipeline execution:
Scheduled runs: Daily/weekly automated data ingestion
Manual triggers: On-demand pipeline execution with full refresh options
Pull request validation: Automated testing on code changes
Deployment: Automated deployment to production environment
Pipeline Workflows
HubSpot Pipeline (.github/workflows/hubspot-pipeline.yml)
Daily execution at 2 AM UTC
Manual trigger with full refresh option
Uses make install and pipenv run python -m pipelines.hubspot
dlt state files (~/.dlt/**/state.json) uploaded for debugging
Fitbit Pipeline (.github/workflows/fitbit-pipeline.yml)
Daily execution at 3 AM UTC
Health data synchronization
Uses make install and pipenv run python -m pipelines.fitbit
dlt state files uploaded for debugging and monitoring
Notion Pipeline (.github/workflows/notion-pipeline.yml)
Weekly execution on Sundays at 9 AM UTC
Habits and productivity data
Uses make install and pipenv run python -m pipelines.notion
dlt state files uploaded for data quality validation
dbt Transform (.github/workflows/dbt-transform.yml)
Daily execution at 4 AM UTC (after data ingestion)
Automatically triggered after successful pipeline runs
Runs dbt models, tests, and generates documentation
Uses make install and pipenv for dbt operations
Pipeline Tests (.github/workflows/pipeline-tests.yml)
Runs on pull requests and main branch pushes
Executes make test for comprehensive testing
Uploads test coverage reports as artifacts
Monitoring and Alerts
Pipeline status: GitHub Actions workflow status monitoring
Data quality: dbt test results and alerts
Performance: Execution time tracking and optimization
Error handling: Automated retry logic and failure notifications
dlt Logging and Artifacts
dlt pipelines handle logging and artifacts differently than traditional applications:
Console logging: All pipeline output goes to stdout/stderr (captured by GitHub Actions)
Pipeline state: Stored in the ~/.dlt/ folder (home directory) with metadata, progress, and state information
No log files: dlt doesn't create separate log files in a logs/ directory
Targeted artifacts: GitHub Actions uploads only state.json files for debugging and monitoring
Pipeline info: The pipeline.run() method returns execution information that gets logged
Note: The ~/.dlt/ folder location is dlt’s default behavior. You can customize this using the DLT_DATA_DIR environment variable if needed, but for GitHub Actions, the default home directory location works well.
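If the location ever does need to move (for example, onto the runner's workspace), the override is just an environment variable set before any pipeline object is created. A tiny sketch with a hypothetical path:

```python
# Optional override sketch: relocate dlt's working data before any pipeline is
# created in the process. The path is hypothetical; the default ~/.dlt/ location
# described above works fine on GitHub Actions.
import os

os.environ.setdefault("DLT_DATA_DIR", os.path.join(os.getcwd(), ".dlt_data"))
```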
What’s in state.json Files
The state.json files contain essential pipeline information without the bulk of raw data:
Pipeline execution status: Success/failure states and timestamps
Resource progress: Which data sources were processed and their status
Incremental state: Cursor values and last processed timestamps
Error information: Any failures and their context
Schema changes: Data structure modifications detected during execution
This targeted approach provides debugging insights while keeping artifact sizes manageable.
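To poke at a downloaded artifact locally, plain JSON handling is enough. A sketch with a placeholder folder name, since the exact top-level keys vary by dlt version:

```python
# Sketch: inspect downloaded state.json artifacts with plain JSON handling.
# "artifacts" is a placeholder for wherever the workflow artifact was unzipped.
import json
from pathlib import Path

for path in Path("artifacts").rglob("state.json"):
    state = json.loads(path.read_text())
    print(path)
    print(sorted(state.keys()))  # top-level sections differ between dlt versions
```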