I got tired of manually pulling financial statements off Google Finance, so I built an entire data platform to do it for me.
Look, I know what you're thinking. "You could just use a spreadsheet." And you'd be right. But I'm a software engineer who happens to be obsessed with value investing on the JSE, and when those two worlds collide, you get an over-engineered cloud data platform called Contrarian.
What does it actually do?
Contrarian is a fully automated pipeline that ingests stock market data from the Johannesburg Stock Exchange, calculates Buffett-Munger value investing metrics, tracks my portfolio, and spits everything out into Looker Studio dashboards. Every quarter, it wakes up, pulls fresh financials for 17 JSE stocks, and tells me which ones are screaming "value."
The whole thing runs on Google Cloud Platform -- Cloud Functions, BigQuery, Cloud Workflows, Firestore, and Terraform holding it all together.
Why build this at all?
I've always been drawn to value investing. The idea that you can find companies trading below their intrinsic value, buy them, and wait for the market to catch up -- that's the game. But doing this properly on the JSE means digging through financials for banks like Capitec, Nedbank, and Standard Bank, comparing them against mining stocks like Thungela and Exxaro, and somehow making sense of it all.
I was doing this manually. Every quarter. In a spreadsheet. It was painful.
So I thought: what if I could automate the data collection and have BigQuery calculate every ratio I care about? What if I could just open a dashboard and see which stocks pass my value criteria right now?
The architecture -- how the data flows
The pipeline follows a medallion architecture pattern (raw, staging, curated, marts). Data flows through four layers, each adding more structure and meaning:
Cloud Scheduler (Daily 10pm SAST)
|
Cloud Workflows (loads enabled tickers, filters already-ingested ones)
|
Cloud Function (calls SerpAPI, stores raw JSON in BigQuery)
|
BigQuery Scheduled Queries (transform raw JSON into typed fact tables)
|
BigQuery Mart Views (calculate ratios, value flags, portfolio P&L)
|
Looker Studio Dashboards
The clever bit is the scheduling logic. Financials only change quarterly, so the workflow checks fct_ingestion_log to see when each stock was last ingested. If it was this quarter already, skip it. This keeps API costs down -- I'm running on a 250 calls/month SerpAPI budget.
Here's the actual SQL from the workflow that decides which tickers need fresh data:
SELECT
  t.ticker,
  t.exchange,
  COALESCE(CAST(JSON_VALUE(t.source_config, '$.include_history') AS BOOL), FALSE) AS include_history
FROM contrarian_curated.dim_ticker t
LEFT JOIN (
  SELECT ticker, exchange, MAX(completed_at) AS last_ingested
  FROM contrarian_curated.fct_ingestion_log
  WHERE status = 'success'
  GROUP BY ticker, exchange
) l ON t.ticker = l.ticker AND t.exchange = l.exchange
WHERE t.enabled = TRUE
  AND t.source_name = 'serpApi'
  AND (
    l.last_ingested IS NULL
    OR DATE(l.last_ingested) < DATE_TRUNC(CURRENT_DATE(), QUARTER)
  )
No hardcoded schedule. The data tells the system when to run. Add a new ticker to dim_ticker, and it gets picked up on the next run automatically.
The worker function -- where the API calls happen
The actual data fetching is a Python Cloud Function. It's surprisingly simple -- validate the request, check the API budget in Firestore, call SerpAPI, dump the raw JSON into BigQuery. That's it.
import functions_framework
from flask import Request


@functions_framework.http
def handle(request: Request):
    data = request.get_json(silent=True) or {}
    ticker = data.get("ticker")
    exchange = data.get("exchange")
    include_history = data.get("include_history", False)

    # Check the remaining API budget in Firestore before spending any calls
    usage_tracker = get_usage_tracker()
    calls_needed = 2 if include_history else 1
    has_budget, calls_remaining = usage_tracker.check_budget(calls_needed)
    if not has_budget:
        return {"status": "skipped", "reason": "budget_exceeded"}, 200

    # Fetch from SerpAPI and store the raw JSON in BigQuery
    serpapi = _get_serpapi_client()
    default_result = make_call("default", lambda: serpapi.fetch_stock(ticker, exchange))
    # ... store responses, update usage tracker
One design decision I'm proud of: if the Firestore usage update fails after the data is already stored in BigQuery, the function still returns success. Data integrity matters more than a counter being slightly off. I can always fix a counter. I can't un-lose data.
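That ordering guarantee -- persist first, count second -- can be sketched in a few lines. This is an illustrative outline of the idea, not the actual function; `store_raw` and `increment_usage` stand in for the real BigQuery and Firestore calls:

```python
def ingest_and_record(store_raw, increment_usage, payload, calls_made):
    """Persist the data first; treat a usage-counter failure as non-fatal."""
    store_raw(payload)  # if this raises, the whole request fails -- correct
    try:
        increment_usage(calls_made)
    except Exception as exc:
        # Data is already safe in BigQuery; log the drift and carry on.
        print(f"WARNING: usage update failed, counter may drift: {exc}")
    return {"status": "success"}
```

The asymmetry is the point: a failed write aborts the run, but a failed counter update only produces a log line.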
The budget tracking uses Firestore's atomic Increment() operation, so even if multiple tickers are being processed in parallel (the workflow fans out), there's no race condition:
def increment_calls(self, count: int = 1) -> None:
    doc_ref = self._get_doc_ref()
    # Increment() is applied server-side, so concurrent writers can't
    # clobber each other; merge=True creates the doc if it doesn't exist yet
    doc_ref.set({
        "calls_made": firestore.Increment(count),
        "calls_limit": self.default_limit,
        "updated_at": _utc_now()
    }, merge=True)
The Buffett checklist -- five flags that matter
This is where it gets interesting. The mrt_value_metrics mart view calculates a five-flag system inspired by Buffett and Munger's approach. Every stock gets scored on:
- has_high_roe: Is Return on Equity above the industry minimum? This is Buffett's favourite metric -- management earning good returns on your capital.
- has_low_debt: Is Debt-to-Equity below the industry maximum? Companies drowning in debt at South African interest rates (11-12%) are a red flag.
- has_good_margins: Is the net profit margin healthy for its industry? A bank at 25% is fine. A retailer at 25% is suspiciously good.
- has_positive_fcf: Is Free Cash Flow positive? Earnings can be manipulated. Cash flow is harder to fake.
- has_reasonable_pe: Is the P/E ratio within a sane range for the industry? I don't want to overpay, but I also don't want a stock at a P/E of 2 (that usually means something is broken).
Here's one of those flags in SQL -- industry-aware, not a hardcoded threshold:
-- ROE above industry minimum
CASE WHEN SAFE_DIVIDE(i.net_income, b.total_equity) * 100 > ib.min_roe_pct
     THEN TRUE ELSE FALSE END AS has_high_roe,

-- P/E within industry-reasonable range
CASE WHEN SAFE_DIVIDE(p.close_price, i.earnings_per_share)
         BETWEEN ib.min_pe_ratio AND ib.max_pe_ratio
     THEN TRUE ELSE FALSE END AS has_reasonable_pe,
The thresholds come from dim_industry_benchmark, which stores JSE-specific ranges. Because a bank with a Debt-to-Equity of 10 is normal (that's how banking works), but a construction company at 10 is in trouble. Context matters.
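If it helps to see the whole checklist in one place, here's the five-flag logic re-expressed in Python. This is a sketch of the mart view's behaviour, not production code -- the field names in `fin` and `bench` are my own shorthand for the fact-table columns and benchmark thresholds:

```python
def value_flags(fin: dict, bench: dict) -> dict:
    """Score one stock-period against its industry benchmarks."""
    roe_pct = 100 * fin["net_income"] / fin["total_equity"]
    net_margin_pct = 100 * fin["net_income"] / fin["revenue"]
    pe = fin["close_price"] / fin["earnings_per_share"]
    return {
        "has_high_roe": roe_pct > bench["min_roe_pct"],
        "has_low_debt": fin["total_debt"] / fin["total_equity"] < bench["max_debt_to_equity"],
        "has_good_margins": net_margin_pct > bench["min_net_margin_pct"],
        "has_positive_fcf": fin["free_cash_flow"] > 0,
        "has_reasonable_pe": bench["min_pe_ratio"] <= pe <= bench["max_pe_ratio"],
    }
```

A bank-shaped example -- ROE 16.7%, D/E of 10, 25% margin, P/E of 10 -- passes all five flags against the banking thresholds, even though that same D/E would fail almost any other industry.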
The JSE is not the S&P 500
This was a big lesson. Most value investing frameworks you find online are calibrated for the US market. South Africa is a different beast:
- Interest rates are 11-12%. A P/E of 25 makes no sense when you can get 10% risk-free from government bonds. The earnings yield has to beat the bond yield, or why bother?
- The rand depreciates. Your returns need to compensate for currency risk. A 12% return that would be great in the US barely keeps up in SA.
- Banks are leveraged by design. A D/E of 12 for Standard Bank is normal. Screening these out with a generic "D/E must be below 1" filter would eliminate the entire banking sector.
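The earnings-yield rule from the first point is simple enough to write down directly. An illustrative helper (not from the codebase), treating earnings yield as the inverse of P/E:

```python
def beats_bond_yield(pe_ratio: float, bond_yield_pct: float,
                     margin_pct: float = 0.0) -> bool:
    """Earnings yield (100 / P/E) must clear the risk-free bond yield,
    optionally by an extra safety margin."""
    earnings_yield_pct = 100.0 / pe_ratio
    return earnings_yield_pct > bond_yield_pct + margin_pct
```

At a 10% bond yield, a P/E of 8 (12.5% earnings yield) clears the bar; a P/E of 25 (4%) doesn't come close -- which is exactly why US-calibrated P/E ranges don't transplant to the JSE.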
So I built industry-specific benchmarks right into the data model:
STRUCT('Banking' AS industry, 'JSE' AS exchange,
NUMERIC '12.0' AS min_roe_pct,
NUMERIC '12.0' AS max_debt_to_equity,
NUMERIC '20.0' AS min_net_margin_pct,
NUMERIC '0.0' AS min_pe_ratio,
NUMERIC '12.0' AS max_pe_ratio,
'JSE banks: highly leveraged by design (D/E 5-15x normal)' AS notes),
STRUCT('Mining', 'JSE',
8.0, 1.0, 5.0, 0.0, 15.0,
'JSE mining: cyclical -- earnings swing with commodity prices.'),
Eight industries, all with JSE-specific thresholds. And because they're versioned with an effective_date, I can update them over time without losing history.
So, where did it all fall apart?
Currency conversion. SerpAPI returns JSE prices in ZAC (cents), not ZAR (rands). I didn't catch this immediately, so for a glorious few days my dashboards showed Capitec trading at R454,926 per share. That was fun to debug.
The fix lives in the ETL layer -- every price from SerpAPI gets divided by 100 before landing in fct_stock_price. A simple rule, but it took me longer than I'd like to admit to figure out why my P/E ratios were off by two orders of magnitude.
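The rule itself is one conditional. A minimal Python version of what the ETL layer does (illustrative -- the real transform lives in SQL):

```python
def normalize_price(value: float, currency: str) -> tuple[float, str]:
    """JSE prices arrive quoted in ZAC (cents); convert them to ZAR.
    Any other currency passes through untouched."""
    if currency.upper() == "ZAC":
        return value / 100.0, "ZAR"
    return value, currency
```

So a raw SerpAPI quote of 454,926 ZAC lands in `fct_stock_price` as R4,549.26 -- instead of the R454,926 my dashboards briefly claimed.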
The other headache was Cloud Workflows. The YAML syntax for parallel execution with shared state is... not intuitive. Getting the workflow to fan out across tickers, collect logs from each worker, and batch-insert them into BigQuery took several iterations. But once it worked, it just worked.
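For anyone fighting the same YAML, the shape that eventually worked looks roughly like this. It's a simplified sketch with illustrative step and variable names, not the production workflow:

```yaml
main:
  params: [args]
  steps:
    - init:
        assign:
          - log_rows: []
    - fanOut:
        parallel:
          shared: [log_rows]   # each branch appends; Workflows serializes access
          for:
            value: ticker
            in: ${args.tickers}
            steps:
              - callWorker:
                  call: http.post
                  args:
                    url: ${args.worker_url}
                    auth:
                      type: OIDC
                    body: ${ticker}
                  result: resp
              - collect:
                  assign:
                    - log_rows: ${list.concat(log_rows, resp.body)}
```

The non-obvious part was that mutable state crossing branch boundaries must be declared in `shared`; forget that, and each branch happily writes to its own copy.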
Everything is Terraform
The entire infrastructure -- BigQuery datasets, table schemas, scheduled queries, Cloud Functions, service accounts, IAM bindings, the workflow, the scheduler -- all defined in Terraform. About 500 lines of bigquery.tf alone.
This was non-negotiable for me. I have a dev environment and a production environment, both deployed from the same code with different tfvars. Tearing down and rebuilding is a terraform destroy and terraform apply away.
It also means I can share the project. If someone wanted to run this for their own portfolio, they'd just need a GCP project, a SerpAPI key, and a terraform apply.
What I learned
Building this taught me that the best investment tools aren't the ones with the most features -- they're the ones that remove friction from your process. I used to spend hours every quarter pulling numbers. Now I open a dashboard.
But more importantly, it forced me to codify my investing framework. Writing SQL to calculate "is this a good value stock?" made me think much harder about what "good value" actually means. When you have to express your thesis in code, the fuzzy thinking disappears fast.
The platform tracks 17 JSE stocks across banking, mining, technology, retail, construction, and insurance. It calculates over 20 financial ratios per stock per period, flags them against industry benchmarks, and tracks my actual portfolio P&L. And it costs nothing to run -- the entire thing fits within GCP's free tier.
If you're a value investor who also happens to write code, I'd encourage you to build something like this. Not because the technology is impressive, but because the process of building it will make you a better investor. You'll stop hand-waving about "good fundamentals" and start defining exactly what you mean.