personal-reporting-pipelines

Personal data integration and analytics platform using dlt and dbt against BigQuery, orchestrated with GitHub Actions.

Overview

This repository contains the infrastructure and workflows for my personal data platform, along with a Docker Dev Container for VS Code. It leverages Google Cloud Platform services, including BigQuery for data warehousing and Secret Manager for secure credential management, with automated orchestration through GitHub Actions.

Data Sources

The following data is ingested from my personal systems into a BigQuery warehouse for automation and analysis:

  1. Notion

  2. HubSpot

  3. Fitbit

Architecture

Data Pipeline Stack

  1. dlt - Extract, normalise, and load source data into the BigQuery raw layer (see the sketch after this list)

  2. dbt Core - Transform raw data into analytics-ready models and views

  3. BigQuery - Cloud data warehouse for storage and analysis

  4. GCP Secret Manager - Secure credential management for API keys and connections

  5. GitHub Actions - Automated orchestration and scheduling of data pipelines
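
To make the stack concrete, here is a minimal sketch of the extract-and-load step, assuming a toy record and table name (the real pipelines live under pipelines/):

    import dlt

    # Minimal sketch: load records into the BigQuery raw layer.
    # Credentials are resolved from GOOGLE_APPLICATION_CREDENTIALS.
    pipeline = dlt.pipeline(
        pipeline_name="notion__habits",  # follows the {source}__{entity} convention
        destination="bigquery",
        dataset_name="raw_data",
    )

    pipeline.run(
        [{"id": 1, "habit": "sleep", "completed": True}],  # stand-in for API data
        table_name="habits",
    )

dbt then reads these raw tables and builds the staging, intermediate, and mart models.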

Project Structure

The project follows modern data engineering best practices with clear separation of concerns:

├── pipelines/          # dlt data extraction pipelines
│   ├── hubspot.py      # HubSpot CRM data pipeline
│   ├── fitbit.py       # Fitbit health data pipeline
│   ├── notion.py       # Notion habits data pipeline
│   └── common/         # Shared utilities and helpers
├── dbt/                # dbt transformation models
│   ├── models/         # personal dbt models (staging, intermediate, marts)
│   ├── macros/         # custom macros for adapters and mocking
│   ├── tests/          # custom tests when missing from packages
│   └── seeds/          # reference tables and mock sources
└── .github/            # github actions workflows
    └── workflows/      # CI/CD and refresh orchestration

Naming Conventions

  • dlt pipelines: {source}__{entity} (e.g., hubspot__contacts, fitbit__sleep)

  • dbt models: {layer}_{source}__{entity} (e.g., stg_hubspot__contacts); mart models drop the prefix (e.g., contacts)

Setup

Prerequisites

  1. Google Cloud Platform Account with BigQuery and Secret Manager APIs enabled

  2. GitHub Account with repository access

  3. Python 3.12+ and pip for local development

  4. Docker (optional, for containerized development)

GCP Setup

  1. Create a GCP Project and enable required APIs:

    • BigQuery API

    • Secret Manager API

  2. Set up BigQuery:

    • Create a dataset for raw data (e.g., raw_data)

    • Create a dataset for transformed data (e.g., analytics)

    • Ensure proper IAM permissions for service accounts

  3. Configure Secret Manager:

    • Configure the GCP service account key as Application Default Credentials

    • Store static keys and credentials for each data source as TOML fragments (see the access sketch after this list):

      [sources.hubspot]
      api_key = "your-key"
      
      [sources.notion]
      api_key = "your-key"
      
      [sources.fitbit]
      client_id = "your-id"
      client_secret = "your-secret"
      
    • After authorising the Fitbit OAuth client, store the refresh token in a dedicated secret:

      • sources-fitbit-refresh_token

  4. Create Service Account:

    • Service account with BigQuery Admin and Secret Manager Secret Accessor roles

    • Download JSON key file for GitHub Actions
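
As an illustration, here is how a pipeline might read one of the secrets above at runtime. This is a minimal sketch using the google-cloud-secret-manager client; the helper name and project ID are assumptions:

    from google.cloud import secretmanager

    def read_secret(project_id: str, secret_id: str) -> str:
        # Fetch the latest version of a secret's payload as UTF-8 text
        client = secretmanager.SecretManagerServiceClient()
        name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
        response = client.access_secret_version(request={"name": name})
        return response.payload.data.decode("UTF-8")

    # e.g. the dedicated Fitbit refresh-token secret stored above
    refresh_token = read_secret("your-project-id", "sources-fitbit-refresh_token")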

Local Development Setup

  1. Clone and setup:

    git clone <repository-url>
    cd personal-reporting-pipelines
    make install
    
  2. Set environment variables:

    export GCP_PROJECT_ID=your-project-id
    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service/account/key.json
    export RAW_SCHEMA_NAME=raw_data
    

GitHub Actions Setup

  1. Add repository secrets:

    • GOOGLE_APPLICATION_CREDENTIALS: Service account JSON key

    • GCP_PROJECT_ID: Google Cloud project identifier

  2. Configure workflow schedules in .github/workflows/:

    • Scheduled pipelines for source refreshes: Notion, HubSpot, Fitbit

    • Scheduled pipeline for data transformations after source refresh

    • Manual trigger option for full refresh scenarios

  3. Gemini workflows are reused from run-gemini-cli

Integrations

  1. Google Cloud BigQuery - Data warehouse using dbt BigQuery adapter

  2. GCP Secret Manager - Secure credential management for API keys and service accounts

Pipeline Refresh Patterns

The pipelines support flexible refresh modes for data loading:

  • Incremental (default): Only loads new/changed data since the last run (see the resource sketch below)

  • Full refresh: Completely reloads all data; useful for data quality issues or schema changes
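
As a rough sketch, an incremental dlt resource looks like the following. The endpoint, field names, and initial value are illustrative assumptions, not the actual HubSpot pipeline:

    import dlt
    from dlt.sources.helpers import requests

    @dlt.resource(write_disposition="merge", primary_key="id")
    def contacts(
        updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z"),
    ):
        # dlt persists the cursor in pipeline state, so each run only
        # requests records changed since the last stored value
        response = requests.get(
            "https://api.example.com/contacts",  # placeholder endpoint
            params={"since": updated_at.last_value},
        )
        yield response.json()["results"]

A full refresh reloads everything by ignoring or resetting this stored cursor state.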

How to Trigger Full Refresh

Method 1: Environment Variable Override (Global)

    export FORCE_FULL_REFRESH=true
    pipenv run python -m pipelines.hubspot

Method 2: Pipeline-Specific Override

    # Force full refresh for HubSpot only
    export HUBSPOT_FULL_REFRESH=true
    pipenv run python -m pipelines.hubspot

Method 3: Direct Function Parameter

    from pipelines.hubspot import refresh_hubspot

    # Force full refresh
    refresh_hubspot(is_incremental=False)

    # Use environment-based detection (default)
    refresh_hubspot()  # or refresh_hubspot(is_incremental=None)

Method 4: GitHub Actions Integration

    # .github/workflows/hubspot-pipeline.yml
    name: HubSpot Pipeline
    on:
      schedule:
        - cron: '0 2 * * *'  # Daily at 2 AM UTC
      workflow_dispatch:
        inputs:
          force_full_refresh:
            description: 'Force full refresh'
            required: false
            default: 'false'
            type: boolean

    jobs:
      run-pipeline:
        runs-on: ubuntu-latest
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
          - name: Install dependencies
            run: make install
          - name: Run HubSpot Pipeline
            env:
              FORCE_FULL_REFRESH: ${{ inputs.force_full_refresh }}
            run: |
              pipenv run python -m pipelines.hubspot

Environment Variables Reference

Variable                         Description                                   Example
FORCE_FULL_REFRESH               Global override for all pipelines             export FORCE_FULL_REFRESH=true
PIPELINE_NAME                    Pipeline identifier for specific overrides    export PIPELINE_NAME=HUBSPOT
{PIPELINE_NAME}_FULL_REFRESH     Pipeline-specific full refresh flag           export HUBSPOT_FULL_REFRESH=true
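
A minimal sketch of how a pipeline entry point might resolve these variables (the helper name is an assumption):

    import os

    def is_incremental_run(pipeline_name: str) -> bool:
        # The global override takes precedence over pipeline-specific flags
        if os.getenv("FORCE_FULL_REFRESH", "").lower() == "true":
            return False
        # Pipeline-specific flag, e.g. HUBSPOT_FULL_REFRESH
        if os.getenv(f"{pipeline_name.upper()}_FULL_REFRESH", "").lower() == "true":
            return False
        return True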

Best Practices

  1. Default to incremental: Always use incremental loading for regular operations

  2. Use full refresh sparingly: Only when necessary for data quality or schema changes

  3. Environment-based control: Use environment variables for production deployments

  4. Logging: Always log the refresh mode being used for debugging (see the sketch after this list)

  5. GitHub Actions integration: Use workflow inputs and environment variables to control refresh mode dynamically

  6. Secret management: Store all credentials in GCP Secret Manager, never in code

  7. Monitoring: Set up alerts for pipeline failures and data quality issues
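
For the logging practice above, a sketch of what that could look like (the logger name is illustrative):

    import logging

    logger = logging.getLogger("pipelines.hubspot")  # illustrative logger name

    def log_refresh_mode(is_incremental: bool) -> None:
        # Surface the active refresh mode in pipeline output for debugging
        mode = "incremental" if is_incremental else "full refresh"
        logger.info("Running pipeline in %s mode", mode)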

Development and Testing

Environments

  • Development: Local development with a DuckDB database

  • Testing: Local tests with non-production BigQuery schemas

  • Production: Automated production pipelines via GitHub Actions

Local Development

  1. Build the Dev Container in VS Code; this runs make install to install dependencies

  2. Run pipelines locally for testing:

    # Local unit tests
    make test-local
    # BigQuery end-to-end tests
    make test-e2e
    
  3. Test dbt models:

    # Local duckdb test
    make dbt-build target=dev
    # BigQuery non-prod schema test
    make dbt-build target=test
    

Testing

  1. Unit tests for pipeline functions and utilities (see the example after this list)

  2. Integration tests for dlt sources and dbt models

  3. End-to-end tests for complete pipeline execution

  4. Data quality tests using dbt’s built-in testing framework
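
For example, a pytest-style unit test for the environment-based refresh detection sketched earlier might look like this (the pipelines.common.refresh module is hypothetical):

    from pipelines.common.refresh import is_incremental_run  # hypothetical helper

    def test_force_full_refresh_overrides_default(monkeypatch):
        monkeypatch.setenv("FORCE_FULL_REFRESH", "true")
        assert is_incremental_run("hubspot") is False

    def test_defaults_to_incremental(monkeypatch):
        monkeypatch.delenv("FORCE_FULL_REFRESH", raising=False)
        monkeypatch.delenv("HUBSPOT_FULL_REFRESH", raising=False)
        assert is_incremental_run("hubspot") is True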

dbt Development

  1. Model development: Use make dbt-run target=dev select="model_name" for iterative development

  2. Model testing: Use make dbt-test target=dev select="model_name" for iterative testing

  3. Documentation: Generate with make dbt-docs target=dev

GitHub Actions Orchestration

Workflow Structure

The project uses GitHub Actions for automated pipeline execution:

  • Scheduled runs: Daily/weekly automated data ingestion

  • Manual triggers: On-demand pipeline execution with full refresh options

  • Pull request validation: Automated testing on code changes

  • Deployment: Automated deployment to production environment

Monitoring and Alerts

  • Pipeline status: GitHub Actions workflow status monitoring

  • Data quality: dbt test results and alerts

  • Performance: Execution time tracking and optimization

  • Error handling: Automated retry logic and failure notifications

dlt Logging and Artifacts

dlt pipelines handle logging and artifacts differently from traditional applications:

  • Console logging: All pipeline output goes to stdout/stderr (captured by GitHub Actions)

  • Pipeline state: Stored in ~/.dlt/ folder (home directory) with metadata, progress, and state information

  • No log files: dlt doesn’t create separate log files in a logs/ directory

  • Targeted artifacts: GitHub Actions uploads only state.json files for debugging and monitoring

  • Pipeline info: The pipeline.run() method returns execution information that gets logged (see the sketch after the note below)

Note: The ~/.dlt/ folder location is dlt’s default behavior. You can customize this using the DLT_DATA_DIR environment variable if needed, but for GitHub Actions, the default home directory location works well.
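
For example, a pipeline entry point can stringify the returned run info, which GitHub Actions captures on stdout (a sketch; the source factory import is hypothetical):

    import dlt
    from pipelines.hubspot import hubspot_source  # hypothetical source factory

    pipeline = dlt.pipeline(
        pipeline_name="hubspot__contacts",
        destination="bigquery",
        dataset_name="raw_data",
    )

    # pipeline.run() returns a LoadInfo object; stringifying it logs
    # package status, load ids, and timings
    load_info = pipeline.run(hubspot_source())
    print(load_info)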

What’s in state.json Files

The state.json files contain essential pipeline information without the bulk of raw data:

  • Pipeline execution status: Success/failure states and timestamps

  • Resource progress: Which data sources were processed and their status

  • Incremental state: Cursor values and last processed timestamps

  • Error information: Any failures and their context

  • Schema changes: Data structure modifications detected during execution

This targeted approach provides debugging insights while keeping artifact sizes manageable.