I wanted to know where my money was going. So I built an entire data warehouse.
Look, I know there are budgeting apps. I've tried them. But when you're an engineer with access to your bank's API and a GCP account, you don't download an app -- you over-engineer a solution. And honestly? It turned out to be one of the most useful things I've built.
The problem was simple
Investec gives you a programmable banking API. Every transaction -- bank transfers, card swipes, debit orders -- gets pushed via PubSub as raw JSON into BigQuery. That's great. But raw JSON in a table is about as useful as a receipt drawer you never open.
I needed structure. I needed categories. I needed to know if my golf membership had actually gone up 40% or if I was imagining things (spoiler: it did).
So I built four layers
The architecture follows a classic data warehouse pattern, but tuned for a single-person use case running entirely on BigQuery:
Raw (PubSub JSON) --> Staging (parse + dedup) --> Curated (facts + dimensions) --> Marts (reporting views)
- Raw: Untouched JSON from Investec's API. PubSub dumps it straight into BigQuery. I don't manage this layer -- it's the source of truth.
- Staging: Views that parse the JSON, cast types, and deduplicate. This is where PARSE_JSON() and ROW_NUMBER() do the heavy lifting.
- Curated: Proper tables with schemas -- fact tables for transactions, dimension tables for categories, budgets, and a calendar.
- Marts: The reporting layer. Four views that answer actual questions about my money.
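A staging view along these lines handles the parse-and-dedup step. This is a sketch -- the table and column names here are illustrative, not the actual schema:

```sql
-- Hypothetical staging view: parse raw PubSub JSON, cast types, dedupe.
CREATE OR REPLACE VIEW `project.investec_staging.stg_bank_transactions` AS
WITH parsed AS (
  SELECT
    JSON_VALUE(PARSE_JSON(data), '$.transactionId') AS transaction_id,
    SAFE_CAST(JSON_VALUE(PARSE_JSON(data), '$.amount') AS NUMERIC) AS amount_zar,
    SAFE_CAST(JSON_VALUE(PARSE_JSON(data), '$.transactionDate') AS DATE) AS transaction_date,
    JSON_VALUE(PARSE_JSON(data), '$.description') AS description,
    publish_time
  FROM `project.investec_raw.transactions`
)
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    -- PubSub delivery is at-least-once; keep only the latest copy
    -- of each transaction_id
    ROW_NUMBER() OVER (
      PARTITION BY transaction_id
      ORDER BY publish_time DESC
    ) AS rn
  FROM parsed
)
WHERE rn = 1;
```

Because it's a view, there's nothing to schedule at this layer -- the parse happens at query time, and the curated layer's scheduled loads read from it.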
Everything is managed by Terraform. One bigquery.tf file, a handful of SQL files, and GitLab CI handles the rest.
The category system was the hardest part
Categorizing transactions sounds easy until you try it. "WOOLWORTHS" is groceries, but "WOOLWORTHS FOOD" is also groceries, and "WW WATERKLOOF" is... still Woolworths. Then there's "INVESTECPB" which is a bond payment, not an Investec fee.
I built a data-driven category engine using a dimension table called dim_category_rule. It holds 144+ pattern-matching rules with three match types:
(cr.match_type = 'EXACT' AND UPPER(bt.description) = UPPER(cr.pattern))
OR (cr.match_type = 'CONTAINS' AND UPPER(bt.description) LIKE CONCAT('%', UPPER(cr.pattern), '%'))
OR (cr.match_type = 'LIKE' AND UPPER(bt.description) LIKE UPPER(cr.pattern))
Each rule has a priority. EXACT matches win over CONTAINS, and specific patterns beat broad ones. When multiple rules match the same transaction, ROW_NUMBER() picks the winner by priority.
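Putting the match predicate and the priority together, the category assignment looks roughly like this (a sketch -- table names are illustrative, and I'm assuming lower priority numbers mean higher precedence):

```sql
-- Sketch: resolve the winning category per transaction.
SELECT * EXCEPT (rule_rank)
FROM (
  SELECT
    bt.transaction_id,
    bt.description,
    cr.category_name,
    -- When several rules match, the best-priority rule wins
    ROW_NUMBER() OVER (
      PARTITION BY bt.transaction_id
      ORDER BY cr.priority
    ) AS rule_rank
  FROM `project.investec_staging.stg_bank_transactions` AS bt
  JOIN `project.investec_curated.dim_category_rule` AS cr
    ON (cr.match_type = 'EXACT'    AND UPPER(bt.description) = UPPER(cr.pattern))
    OR (cr.match_type = 'CONTAINS' AND UPPER(bt.description) LIKE CONCAT('%', UPPER(cr.pattern), '%'))
    OR (cr.match_type = 'LIKE'     AND UPPER(bt.description) LIKE UPPER(cr.pattern))
)
WHERE rule_rank = 1;
```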
The best part? Adding a new category doesn't need a Terraform deploy. I just run a SQL MERGE:
MERGE `project.investec_curated.dim_category_rule` AS target
USING (
SELECT * FROM UNNEST([
STRUCT('transaction_category' AS category_group, 'Daily Living' AS parent_category,
'Coffee' AS category_name, 'CONTAINS' AS match_type,
'SEATTLE COFFEE' AS pattern, CAST(NULL AS STRING) AS type_filter, 5 AS priority),
-- add more rules here
])
) AS source
ON target.category_group = source.category_group
AND target.match_type = source.match_type
AND target.pattern = source.pattern
WHEN NOT MATCHED THEN INSERT (...)
No IDs. No auto-increment (BigQuery doesn't have one anyway). The natural key (category_group, match_type, pattern) is the identity. I spent way too long debating whether to add surrogate IDs before realising nothing downstream ever joins on an ID -- so why maintain one?
Budget tracking with a twist
I wanted budget vs actual reporting. But budgets change mid-year. My golf membership went from R500/month to R700/month in July 2025. How do you model that without overwriting history?
SCD Type 2. Each budget row has an effective_from date:
STRUCT(2025, 'Golf', DATE '2025-01-01', 500.00), -- Jan-Jun
STRUCT(2025, 'Golf', DATE '2025-07-01', 700.00), -- Jul-Dec
The mart view resolves the correct version per month using a window function:
ROW_NUMBER() OVER(
PARTITION BY year, category_name, month
ORDER BY effective_from DESC
) AS rn
-- then filter WHERE rn = 1
January through June gets R500. July onwards gets R700. The annual budget isn't stored -- it's computed dynamically as the sum of 12 resolved months. Simple, auditable, and it handles mid-year changes without breaking a sweat.
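In SQL terms, the annual figure is just an aggregate over the twelve resolved rows. A sketch, assuming a `resolved_budget` relation shaped like the `rn = 1` output of the window function above:

```sql
-- Sketch: annual budget computed from month-resolved amounts.
-- Assumes resolved_budget has one winning row per (year, category, month).
SELECT
  year,
  category_name,
  SUM(amount) AS annual_budget
FROM resolved_budget
GROUP BY year, category_name;
```

For Golf in 2025 that's six months at R500 plus six at R700 -- the annual number falls out of the data instead of being maintained by hand.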
The card-to-bank join that broke my brain
Bank transactions and card transactions come from different feeds. A card swipe at Woolworths creates one record in the card feed (with merchant details like category code and city) and a separate record in the bank feed (with the actual amount deducted). I needed to join them to enrich bank transactions with merchant data.
The join key? Date + amount. But what happens when you buy two things for the same price on the same day?
Duplicates. Cartesian products. Chaos.
The fix was dual ROW_NUMBER() -- one on each side:
-- Bank side
ROW_NUMBER() OVER(
PARTITION BY transaction_date, abs_amount_zar
ORDER BY posted_order
) AS bank_seq
-- Card side
ROW_NUMBER() OVER(
PARTITION BY DATE(create_datetime), transaction_amount_zac
ORDER BY create_datetime
) AS card_seq
-- Join
ON bk.transaction_date = DATE(ce.create_datetime)
AND SAFE_CAST(bk.abs_amount_zar * 100 AS INT64) = ce.transaction_amount_zac
AND bk.bank_seq = ce.card_seq
First bank transaction matches first card transaction. Second matches second. No duplicates. This one took me an embarrassingly long time to figure out.
Terraform keeps it sane
The entire infrastructure lives in Terraform. Datasets, tables, views, scheduled queries, PubSub subscriptions, IAM bindings -- all of it. One terraform apply and everything exists.
The trick that makes it work across environments is templatefile():
view {
query = templatefile("bigquery/investec_marts/mrt_transactions.sql", {
project_id = local.gcp_project_id
})
}
Every SQL file uses ${project_id} as a placeholder. Dev gets my-project-dev, prod gets my-project-prod. Same SQL, different targets. GitLab CI handles the routing -- push to main and it's production. Push to anything else and it's dev.
Scheduled ETL queries run nightly at 04:00, 04:05, and 04:10. In dev, they're disabled by default (disable_scheduled_queries = true) to keep things clean. In prod, they run every night and my data is fresh by morning. The whole thing runs within GCP's free tier -- zero cost.
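A scheduled query in this setup would look something like the following Terraform sketch. The resource name, query path, and variable wiring are illustrative, not lifted from the repo:

```hcl
resource "google_bigquery_data_transfer_config" "nightly_etl" {
  display_name   = "nightly-fact-load"
  data_source_id = "scheduled_query"
  schedule       = "every day 04:00"
  # Dev keeps scheduled queries off; prod runs them nightly
  disabled       = var.disable_scheduled_queries

  params = {
    query = templatefile("bigquery/investec_curated/load_fact_transactions.sql", {
      project_id = local.gcp_project_id
    })
  }
}
```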
What the marts actually tell me
Four reporting views answer the questions I actually care about:
- mrt_transactions: Every transaction with its category, parent category, cash flow direction, and card merchant details. This feeds the main Looker Studio dashboard.
- mrt_budget: Monthly actual vs budget by category. Shows me exactly where I'm overspending and by how much.
- mrt_rental_pnl: Per-property profit and loss for my two rental properties. Income, bond payments, levies, repairs -- broken down monthly.
- mrt_debit_order_inflation: Year-over-year tracking of every recurring payment. This is the one that showed me my insurance went up 12% while inflation was 5%.
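The inflation mart boils down to a self-join of yearly figures per recurring payment. A sketch, assuming a `yearly_avg` relation with one row per `(category_name, year)` and an `avg_amount` column:

```sql
-- Sketch: year-over-year change per recurring debit order.
SELECT
  cur.category_name,
  cur.year,
  cur.avg_amount,
  prev.avg_amount AS prior_year_amount,
  ROUND(100 * (cur.avg_amount - prev.avg_amount) / prev.avg_amount, 1) AS yoy_pct
FROM yearly_avg AS cur
JOIN yearly_avg AS prev
  ON  prev.category_name = cur.category_name
  AND prev.year = cur.year - 1;
```

Put that next to an inflation figure and every quietly creeping debit order has nowhere to hide.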
The bugs that taught me things
A few hard-earned lessons:
- FLOAT64 will betray you. Financial amounts need NUMERIC. I found rounding errors in my reports that traced back to 0.1 + 0.2 != 0.3 in BigQuery. Classic floating point.
- Deduplication keys matter. I forgot account_id in my PARTITION BY clause and was deduplicating across accounts. Transactions disappeared. Took me a day to notice.
- Mid-month date logic is tricky. Rental income that arrives on the 17th -- does it belong to this month or next? I needed a binary split at day 15 to get reporting dates right. Days 16-19 were returning NULL before I fixed it.
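The FLOAT64 pitfall is easy to reproduce in the BigQuery console:

```sql
-- Float literals use binary floating point: this is false
SELECT 0.1 + 0.2 = 0.3;
-- NUMERIC is exact decimal arithmetic, safe for money: this is true
SELECT NUMERIC '0.1' + NUMERIC '0.2' = NUMERIC '0.3';
```

The fix was casting amounts to NUMERIC in staging, so nothing downstream ever touches a float.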
Was it worth it?
Absolutely. Not because it saves me time -- a budgeting app would do that faster. But because I now understand exactly how my money moves. I can see that my debit orders have inflated 8% year-over-year. I can see that I spend more on coffee than I'd like to admit.
More than the analytics though, this project taught me that BigQuery is a genuinely good platform for small-scale data warehousing. The combination of scheduled queries, PubSub ingestion, and views-as-transformations means you can build a proper ELT pipeline without any external orchestration. No Airflow. No dbt. Just SQL files and Terraform.
Sometimes the best tool is the one you build yourself -- even if it takes ten times longer than downloading an app.