# Engineering Audit — data.aykhan.net

_Audited as a senior/staff engineering review. Scope: the whole repository, with
emphasis on the only real source file, `generate_index.py`._

## 1. Architecture summary

`data.aykhan.net` is a **static JSON "API"** served by GitHub Pages. There is no
server, database, authentication, or runtime — every response is a committed
static file under `data/`.

A single Python script, `generate_index.py`, has two jobs that run on every push
(via `.github/workflows/generate_index.yml`, which commits the result back):

1. **`generate_index_html('.')`** — walks the entire repo and writes a browsable
   `index.html` directory listing into every folder. It protects hand-written
   listings via an `<!-- Auto-generated by Python script -->` marker check
   (`is_auto_generated`).
2. **`generate_data_index()`** — produces the two public, machine-readable files
   consumed by the [Terminal Gateway](https://aykhan.net/terminal):
   `data-index.json` (metadata for every indexed `.json` endpoint) and
   `build-report.json`. This half is **whitelist-based**: only top-level folders
   in `WHITELIST_DIRS = ["data", "public"]` are scanned, only `.json` files are
   indexed, and `DENY_DIR_NAMES` (`private`, `drafts`, `secrets`, …) are pruned.
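
The marker check in step 1 can be pictured roughly as follows. This is a minimal sketch, not the script's actual code: only the marker string and the name `is_auto_generated` come from this audit. The signature, the `should_write` helper, and the choice to treat unreadable files as hand-written are assumptions made for illustration.

```python
import os

# Marker string quoted in this audit; everything else below is illustrative.
MARKER = "<!-- Auto-generated by Python script -->"


def is_auto_generated(path: str) -> bool:
    """Return True if the file at `path` contains the generator's marker."""
    try:
        with open(path, encoding="utf-8") as fh:
            return MARKER in fh.read()
    except (OSError, UnicodeDecodeError):
        # Missing, unreadable, or non-UTF-8 file: not recognisably generated.
        return False


def should_write(path: str) -> bool:
    """Hypothetical helper: overwrite only missing or previously generated listings."""
    return not os.path.exists(path) or is_auto_generated(path)
```

Under these assumptions a hand-written `index.html` (no marker) is never overwritten, while a missing or previously generated one is.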

**Data flow:** files on disk → `generate_index.py` → `data-index.json` /
`build-report.json` → fetched read-only by the terminal. File contents are never
inlined; only `name/path/url/sizeBytes` metadata is published.
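
To make the whitelist-based pass concrete, here is a rough sketch of how such an indexer could be structured. It is not the code in `generate_index.py`: the function name `build_entries`, the `BASE_URL` value, and the exact entry layout are assumptions; only `WHITELIST_DIRS`, the deny-list idea, and the four metadata fields come from this audit.

```python
import os

WHITELIST_DIRS = ["data", "public"]                  # as quoted in this audit
DENY_DIR_NAMES = {"private", "drafts", "secrets"}    # only the names the audit lists
BASE_URL = "https://data.aykhan.net"                 # assumed public origin


def build_entries(root: str = ".") -> list[dict]:
    """Collect metadata (never contents) for .json files under whitelisted dirs."""
    entries = []
    for top in WHITELIST_DIRS:
        top_path = os.path.join(root, top)
        if not os.path.isdir(top_path):
            continue
        for current, dirs, files in os.walk(top_path):
            # Assigning in place prunes denied folders so os.walk never enters them.
            dirs[:] = sorted(d for d in dirs if d not in DENY_DIR_NAMES)
            for name in sorted(files):
                if not name.endswith(".json"):
                    continue
                full = os.path.join(current, name)
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                entries.append({
                    "name": name,
                    "path": rel,
                    "url": f"{BASE_URL}/{rel}",
                    "sizeBytes": os.path.getsize(full),
                })
    return entries
```

Because the walk starts only from whitelisted roots, anything outside them is skipped by construction instead of by exclusion.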

The design intent (read-only, public-metadata-only, whitelist-based) is sound.
The issues below are about **determinism, the listing/whitelist mismatch, and
output-injection hardening**, not about the core security model.

## 2. Findings

### Critical
_None._ The security model (read-only, whitelist JSON indexing, deny-list) holds;
no secret-exposure or injection path is reachable with the current repo contents.

### High

| ID | File | Issue | Why it matters |
|----|------|-------|----------------|
| H1 | `generate_index.py` `generate_index_html` | **Non-deterministic row order.** `os.walk` yields `dirs`/`files` in arbitrary filesystem order and they are emitted unsorted. | Every CI run can reshuffle table rows, producing large, meaningless diffs and an auto-commit on every push even when nothing changed. Verified: re-running reordered `task1/2/4/7`. |
| H2 | `generate_index.py` `format_date` | **Machine-local timezone.** `dt_utc.astimezone()` formats "Last Modified" in the runner's local time with no TZ label. | Output depends on _where_ the script runs (dev machine vs GitHub Actions UTC). Combined with H1 this guarantees churn, and the displayed time is ambiguous (no zone shown). |
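
For H1, one way to get a stable order is to sort both lists that `os.walk` yields at every level. The sketch below is illustrative: `walk_sorted` is a hypothetical helper name, and `EXCLUDED_DIRS` is assumed to hold the three names cited under M1.

```python
import os

EXCLUDED_DIRS = {".git", ".github", "__pycache__"}  # names cited under M1


def walk_sorted(root: str):
    """Yield (dirpath, dirs, files) with both lists sorted at every level."""
    for dirpath, dirs, files in os.walk(root):
        # In a top-down walk, assigning to dirs[:] both prunes the excluded
        # folders and fixes the order in which subdirectories are visited.
        dirs[:] = sorted(d for d in dirs if d not in EXCLUDED_DIRS)
        yield dirpath, dirs, sorted(files)
```

`sorted` compares strings by code point, so `task10` would sort before `task2`; whether a natural sort is preferable is a separate decision that this sketch does not make.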

### Medium

| ID | File | Issue | Why it matters |
|----|------|-------|----------------|
| M1 | `generate_index.py` `EXCLUDED_DIRS` vs `DENY_DIR_NAMES` | **The HTML listing is _not_ whitelist-based.** `generate_index_html` walks everything from `.`, excluding only `.git/.github/__pycache__`. The documented "whitelist / nothing sensitive published by accident" guarantee only covers the JSON indexer. | If a `private/`, `drafts/`, or `secrets/` folder is ever added, the JSON index would correctly skip it but the browsable `index.html` tree would still list it. Defense-in-depth gap against the stated security posture. |
| M2 | `generate_index.py` `generate_index_html` | **No HTML escaping** of file/dir names interpolated into rows, `href`s, `<title>`, and `<h1>`. | A filename containing `<`, `>`, `&`, or `"` would break the markup or inject arbitrary HTML into the listing. Currently low-risk (no such names exist, verified), but it is unsafe output construction in the one place that renders untrusted-ish input. |
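
The two Medium findings can be illustrated together. In the sketch below, `visible_dirs`, `render_row`, `LISTING_EXCLUDED`, and the row markup are hypothetical, and `DENY_DIR_NAMES` contains only the names this audit lists. Percent-encoding the `href` with `urllib.parse.quote` goes slightly beyond the `html.escape` fix named in this audit and is an assumption about what a robust link needs.

```python
import html
from urllib.parse import quote

EXCLUDED_DIRS = {".git", ".github", "__pycache__"}   # names cited under M1
DENY_DIR_NAMES = {"private", "drafts", "secrets"}    # only the names the audit lists
LISTING_EXCLUDED = EXCLUDED_DIRS | DENY_DIR_NAMES    # M1: listing mirrors the deny-list


def visible_dirs(dirs):
    """Return the directory names the HTML listing is allowed to show, sorted."""
    return sorted(d for d in dirs if d not in LISTING_EXCLUDED)


def render_row(name: str) -> str:
    """M2: escape the visible label and percent-encode the link target."""
    href = html.escape(quote(name), quote=True)
    label = html.escape(name, quote=True)
    return f'<tr><td><a href="{href}">{label}</a></td></tr>'
```

With this shape, a folder named `secrets` is filtered out of the listing, and a name containing `<` is emitted as `&lt;` in the label instead of opening a tag.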

### Low

| ID | File | Issue | Why it matters / recommendation |
|----|------|-------|---------------------------------|
| L1 | `generate_index.py` | `import time` is unused (dead import). | Remove. Verified no `time.*` usage. |
| L2 | `generate_index.py` `to_header_case` / `to_title_case` | Docstrings say "title case" but the functions only lowercase + insert separators. | Misleading; minor. Left as-is to stay surgical (cosmetic). |
| L3 | `generate_index.py` `format_size` | Falls through to `None` for sizes ≥ 1024 TB. | Impossible at this scale; not worth guarding (would be speculative error handling). Noted only. |
| L4 | `generate_index.py` | `generate_index_html` and `generate_data_index` each walk the tree separately. | Negligible at this size; not worth coupling the two passes. |
| L5 | both repos | `generate_index.py` is ~250 lines duplicated verbatim between `data.aykhan.net` and `media.aykhan.net`. | Real duplication, but the repos deploy independently to separate GitHub Pages domains. Extracting a shared module would add cross-repo coupling/deploy complexity for little gain — **left intentionally.** |
| L6 | `generate_index.py` | No tests. | Addressed: added `test_generate_index.py`. |
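
To make L3 concrete, the following is a hypothetical reconstruction of a size formatter with the fall-through described in the table. The real `format_size` may differ in its unit list and number formatting; only the behavior (no return once the units are exhausted) is taken from this audit.

```python
def format_size(size_bytes: float):
    """Illustrative only: returns None once the value exceeds the last unit."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if size_bytes < 1024:
            return f"{size_bytes:.1f} {unit}"
        size_bytes /= 1024
    # No return statement here, so sizes of 1024 TB or more yield None.
```

A single trailing `return` in a larger unit would close the gap, but as noted in the table that is not considered worth adding at this scale.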

## 3. Recommended fixes (safe implementation order)

1. **L1** — remove `import time` (zero-risk cleanup). _verify:_ script still runs.
2. **H1** — sort `dirs` and `files` before emitting rows. _verify:_ rows appear in lexicographic order; re-running no longer reshuffles.
3. **H2** — format `format_date` in UTC with an explicit `UTC` suffix. _verify:_ output is identical regardless of machine TZ.
4. **M2** — `html.escape` names and derived title/header text. _verify:_ a `<`-containing name renders as `&lt;`.
5. **M1** — extend the listing's excluded-dirs to mirror `DENY_DIR_NAMES`. _verify:_ a `secrets/` folder is not listed.
6. **L6** — add `test_generate_index.py` (stdlib `unittest`, no new deps) covering ordering, UTC formatting, whitelist/deny exclusion, and escaping.
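
A test module along the lines of step 6 might look like the sketch below. It is not the contents of `test_generate_index.py`: in the real file the functions under test would be imported from `generate_index.py`, whereas here small stand-ins (`list_names`, `format_date`, and a reduced `DENY_DIR_NAMES`) are defined inline so the example runs on its own. The date format is also an assumption.

```python
import html
import os
import tempfile
import unittest
from datetime import datetime, timezone

DENY_DIR_NAMES = {"private", "drafts", "secrets"}  # only the names the audit lists


def list_names(root):
    """Stand-in for the listing walk: sorted files, denied dirs pruned."""
    names = []
    for _dirpath, dirs, files in os.walk(root):
        dirs[:] = sorted(d for d in dirs if d not in DENY_DIR_NAMES)
        names.extend(sorted(files))
    return names


def format_date(timestamp):
    """Stand-in for the UTC formatter (format string is assumed)."""
    dt_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return dt_utc.strftime("%Y-%m-%d %H:%M:%S") + " UTC"


class GenerateIndexSketchTests(unittest.TestCase):
    def setUp(self):
        tmp = tempfile.TemporaryDirectory()
        self.addCleanup(tmp.cleanup)
        self.root = tmp.name

    def _touch(self, *parts):
        path = os.path.join(self.root, *parts)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    def test_rows_are_sorted(self):
        for name in ["task7.json", "task1.json", "task4.json", "task2.json"]:
            self._touch(name)
        self.assertEqual(
            list_names(self.root),
            ["task1.json", "task2.json", "task4.json", "task7.json"],
        )

    def test_denied_dirs_are_skipped(self):
        self._touch("secrets", "key.json")
        self._touch("data", "ok.json")
        self.assertEqual(list_names(self.root), ["ok.json"])

    def test_date_is_utc_and_labelled(self):
        self.assertEqual(format_date(0), "1970-01-01 00:00:00 UTC")

    def test_names_are_escaped(self):
        self.assertEqual(html.escape("<x>.json", quote=True), "&lt;x&gt;.json")
```

Saved as a module, this can be run with `python -m unittest`; it uses only the standard library, in line with the "no new deps" constraint.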

**Behavior change to note:** the "Last Modified" column switches from machine-local
time to UTC (now suffixed `UTC`). This is a human-facing display change only: no JSON
API field, schema, URL, or deployment detail changes. It is the H2 fix; together with
the H1 sort it addresses the churn described in the findings.
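
The UTC change could be sketched as below. The function name matches the audit, but the signature (a POSIX timestamp in, a string out) and the exact date format are assumptions, not the script's confirmed behavior.

```python
from datetime import datetime, timezone


def format_date(timestamp: float) -> str:
    """Format a POSIX timestamp in UTC with an explicit zone label."""
    # Passing tz=timezone.utc avoids any dependence on the machine's local zone.
    dt_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return dt_utc.strftime("%Y-%m-%d %H:%M:%S") + " UTC"
```

The sketch only removes the timezone dependence. If the timestamp is taken from file modification times, it is worth checking how those behave in CI, since Git does not record mtimes and a fresh checkout assigns new ones.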

**Not changing** (would exceed scope / violate "minimal, surgical"): cross-repo
deduplication (L5), the public JSON schema, the workflow, or the deployment model.
