SQL Lab — Pratique SQL no navegador

Seven tables. Three core entities, two event streams, one daily roll-up, one experiment plane.

Core entities: members, titles, episodes. Event streams: playback_events, media_workflow_events. Daily roll-up: engagement_daily. Experiment plane: experiment_assignments. Learn the shape once; every challenge will feel familiar.

ERD

Entity-relationship diagram showing the 7 tables in the Netflix pack and their foreign-key relationships.

Jump to a table

members titles episodes playback_events engagement_daily media_workflow_events experiment_assignments

members

30 rows

One row per subscriber. Demographics + plan + tenure bucket + bot flag.

Columns

Name	Type	What it means
`member_id`PK	`int`	Subscriber identifier.
`signup_date`	`date`	First day of subscription. Anchor for cohort and retention math.
`country`	`string`	ISO-style country code (e.g. US, BR, UK).
`plan_type`	`string`	basic / standard / premium.
`subscription_status`	`string`	active / churned / paused. Filter to active for live-base questions.
`tenure_bucket`	`string`	0-3m / 3-6m / 6-12m / 12m+. Pre-computed band for churn-risk scoring.
`is_bot`	`bool`	TRUE for synthetic/bot traffic. Filter out for any user-facing metric.

Common filters

```
WHERE is_bot = FALSE
```
Exclude bot traffic from every member-facing metric. Forgetting this is a senior-interview red flag.
```
WHERE subscription_status = 'active'
```
Use when measuring the live base (churn risk, current engagement). Skip for cohort/retention questions where you want everyone who ever signed up.

Common joins

```
FROM members m JOIN engagement_daily ed USING (member_id)
```
Member-level daily engagement; the spine of retention and streak questions.
```
FROM members m JOIN experiment_assignments ea USING (member_id)
```
Member attributes attached to experiment variant — needed for slicing lift by tenure/country.

Challenges that lean on this table

q14_retention_cohort q15_churn_risk

titles

12 rows

One row per show/film. Genre + runtime + origin + budget band. The dimension table behind every content question.

Columns

Name	Type	What it means
`title_id`PK	`int`	Content identifier.
`title_name`	`string`	Display title (e.g. 'Stranger Things'). Don't GROUP BY this if you have title_id — slightly slower.
`title_type`	`string`	series / movie. Drives whether episode_id is meaningful for joins.
`genre`	`string`	Drama / Comedy / Documentary / ...
`runtime_minutes`	`int`	Total runtime in minutes. Critical for completion-rate and rewatch-cap logic.
`release_date`	`date`	First public release date.
`country_of_origin`	`string`	Where the title was produced.
`budget_band`	`string`	tier_a / tier_b / tier_c — internal budget tier.

Common filters

```
WHERE title_type = 'series'
```
Series-only when you need episode-level granularity (binge detection, episode counts).

Common joins

```
FROM playback_events p JOIN titles t USING (title_id)
```
Attach content metadata (name, runtime, genre) to playback. The most common join in this pack.
```
FROM titles t JOIN episodes e USING (title_id)
```
Resolve season/episode structure when you need episode counts per series.

Challenges that lean on this table

q01_top10_watched q02_completion_rate

episodes

75 rows

One row per episode. Season + episode number + runtime — only series have rows here.

Columns

Name	Type	What it means
`episode_id`PK	`int`	Episode identifier.
`title_id`FK	`int`	FK → titles. The series this episode belongs to.
`season_number`	`int`	1-indexed season.
`episode_number`	`int`	1-indexed episode within its season.
`runtime_minutes`	`int`	Episode runtime. Often shorter than the title's runtime_minutes (which is for movies or overall).

Common joins

```
playback_events p ... AND p.episode_id IS NOT NULL
```
Movies leave episode_id NULL; filter when you need episode granularity (binge detection, series resume position).

Challenges that lean on this table

q12_binge_detection

playback_events

462 rows

Event stream: every impression, detail view, play_start, play_pause, play_complete. The richest table in the pack.

Columns

Name	Type	What it means
`event_id`	`int`	Stream event identifier. NOT unique — duplicates appear from client retries.
`member_id`FK	`int`	FK → members.
`title_id`FK	`int`	FK → titles.
`episode_id`FK	`int (nullable)`	FK → episodes. NULL for movies; populated for series episodes.
`session_id`	`string`	Client-supplied session id. May not align with the canonical 30-min-gap session — derive it yourself when correctness matters (see q11).
`event_ts`	`timestamptz`	Event time. Cast to plain TIMESTAMP for DuckDB-WASM arithmetic (see gotchas below).
`event_name`	`string`	impression / title_detail_view / play_start / play_pause / play_complete. Filter to 'play_start' for true watching.
`watch_seconds`	`int`	Seconds watched in this event. Can exceed runtime when looped — cap at runtime for valued-hours metrics.
`device_type`	`string`	web / mobile / tv / tablet. QoE breakdowns slice on this.
`country`	`string`	Country at event time (denormalized — can differ from members.country during travel).
`startup_time_ms`	`int`	Time-to-first-frame in ms. Source for p95 / p99 startup metrics. Means hide the unhappy 5%.
`rebuffer_seconds`	`int`	Seconds spent rebuffering during this play. Numerator for rebuffer ratio.
`error_code`	`string (nullable)`	Populated on playback errors; NULL on healthy plays.
`ingested_at`	`timestamptz`	When the event landed in the warehouse. ingested_at > event_ts indicates late arrival.
`metadata_valid`	`bool`	Per-row DQ flag. FALSE means the producer flagged this copy as malformed. Group-aware dedup (q16) must propagate FALSE to all retries of the same event_id.
`entry_source`	`string`	homepage_row / recommendation / search / billboard. The discovery surface that drove the impression.

Common filters

```
WHERE event_name = 'play_start'
```
Impressions and detail views aren't watching. Default to play_start for engagement/hours/completion math.
```
WHERE entry_source = 'homepage_row'
```
Scope to a specific discovery surface — the homepage funnel is the canonical question (q05).
```
WHERE event_ts >= (SELECT MAX(event_ts) FROM playback_events)::TIMESTAMP - INTERVAL 30 DAY
```
Anchor 'last 30 days' to the data's max event, not CURRENT_DATE — tests are deterministic that way.

Common joins

```
FROM playback_events p JOIN titles t USING (title_id)
```
Attach title name/runtime for every metric that reports per title.
```
Self-join on (member_id, title_id, session_id) with a time-window predicate
```
The funnel pattern — match impression → next step within an attribution window. Always pre-aggregate to the impression key first to avoid fan-out.

Challenges that lean on this table

q01_top10_watched q02_completion_rate q03_rebuffer_ratio q04_startup_p95 q05_discovery_funnel q06_time_to_first_play q11_sessionize q12_binge_detection q16_dedup_dq

engagement_daily

12,711 rows

One row per (member, calendar day). Pre-aggregated daily roll-up — start here for retention, streaks, churn risk.

Columns

Name	Type	What it means
`member_id`FK	`int`	FK → members. Combined with `date`, forms the natural key.
`date`	`date`	Calendar day in UTC.
`watch_seconds`	`int`	Total seconds watched on this day. Zero on inactive days.
`titles_started`	`int`	Distinct titles with at least one play_start.
`titles_completed`	`int`	Distinct titles whose capped valued seconds crossed the completion threshold.
`active_viewing_flag`	`bool`	TRUE if the day passed Netflix's 'meaningful' bar (typically >2 minutes of valued play). Use this — not watch_seconds > 0 — for retention.
`sessions`	`int`	Count of distinct sessions on this day.
`meaningful_play_count`	`int`	Count of plays that passed the meaningful threshold.

Common filters

```
WHERE active_viewing_flag = TRUE
```
Background open-app days inflate retention. active_viewing_flag is the right binary for 'did they really watch?'.
```
WHERE date BETWEEN signup_date + 24 DAY AND signup_date + 31 DAY
```
Day-30 retention windows — center on the anniversary day with a small buffer to absorb signup-time noise (q14).

Common joins

```
FROM members m LEFT JOIN engagement_daily ed USING (member_id)
```
LEFT JOIN keeps members who have *never* been active — critical so they show up as churn-risk.

Challenges that lean on this table

q13_seven_day_streak q14_retention_cohort q15_churn_risk

media_workflow_events

127 rows

Studio post-production workflow stream. Each row is an event in a workflow stage's lifecycle (started / completed / etc.).

Columns

Name	Type	What it means
`workflow_event_id`PK	`int`	Stream event identifier.
`title_id`FK	`int`	FK → titles. The content this workflow produces an asset for.
`asset_id`	`int`	FK → assets (not in this pack). The specific deliverable (audio mix, subtitle file, image).
`workflow_id`	`int`	Groups events into one workflow instance. Pair started ↔ completed within (workflow_id, workflow_stage).
`workflow_stage`	`string`	ingest / transcode / qc / localize / publish / ...
`event_timestamp`	`timestamptz`	When the event happened. Note: column is `event_timestamp`, not `event_ts` like playback.
`event_type`	`string`	stage_started / stage_completed. Anti-join these for in-flight detection (q10).
`status`	`string`	Status at the moment of the event, not current. A stage_started row's `status` is 'in_progress' even after it has long completed.
`owner_team`	`string`	Internal team responsible for this stage execution.
`vendor_id`	`int`	External vendor handling the stage (some stages are outsourced).
`error_code`	`string (nullable)`	Populated on failure.
`rework_flag`	`bool`	TRUE if this stage entry is rework. Note: q09 derives rework structurally (count starts) instead of trusting this flag.
`sla_due_timestamp`	`timestamptz`	SLA deadline for this stage. Compare against completed_timestamp for miss-rate.
`completed_timestamp`	`timestamptz (nullable)`	Populated only on stage_completed events. NULL on stage_started rows.

Common filters

```
WHERE event_type = 'stage_started'
```
Find when stages began. Combine with an anti-join against stage_completed to identify in-flight (q10).
```
WHERE event_type = 'stage_completed' AND completed_timestamp > sla_due_timestamp
```
Operational miss-rate. Numerator only — pair with COUNT(*) over all completions for the denominator.

Common joins

```
Self-pair on (workflow_id, workflow_stage) using FILTER for started vs completed
```
Use MIN(event_timestamp) FILTER (WHERE event_type='stage_started') and the matching completed FILTER — earliest-paired-with-earliest, safe under rework loops (q07).

Challenges that lean on this table

q07_cycle_time_p50_p90 q08_sla_miss_rate q09_rework_detection q10_aging_tasks

experiment_assignments

59 rows

One row per (experiment, member) assignment. Variant + assignment time + whether/when the member was actually exposed.

Columns

Name	Type	What it means
`experiment_id`	`string`	Experiment name (e.g. 'homepage_ranking_v2'). Filter by this first.
`member_id`FK	`int`	FK → members.
`variant`	`string`	control / treatment (sometimes more arms — verify per experiment).
`assignment_timestamp`	`timestamptz`	When the member was assigned to a variant. Use this as the time-zero for ITT analysis.
`exposed_flag`	`bool`	TRUE if the member was actually shown the treatment surface. ITT = assigned; TOT = exposed (q19).
`exposure_timestamp`	`timestamptz (nullable)`	When exposure happened. NULL for unexposed assignees.

Common filters

```
WHERE experiment_id = 'homepage_ranking_v2'
```
Always scope per-experiment first — analyses across experiments are usually meaningless.
```
WHERE exposed_flag = TRUE -- TOT only; ITT skips this
```
ITT counts everyone assigned. TOT restricts to actually-exposed — biased upward when exposure correlates with engagement (q19).

Common joins

FROM experiment_assignments ea JOIN engagement_daily ed ON ed.member_id = ea.member_id AND ed.date BETWEEN ea.assignment_timestamp::DATE AND ea.assignment_timestamp::DATE + INTERVAL 6 DAY

Per-member 7-day post-assignment window — the canonical lift measurement (q17).

Challenges that lean on this table

q17_ab_lift_7d q18_srm_check q19_itt_vs_tot q20_guardrail_check

Cross-cutting gotchas

Three traps recur across the pack. Skim these before you start any challenge.

1. Late-arriving events → group-preserving dedup
playback_events.event_id is not unique — client retries during ingestion create duplicates. Take the latest by ingested_at, but propagate metadata_valid = FALSE across the group before collapsing — otherwise you lose a DQ flag that an earlier retry recorded. Naïve DISTINCT ON doesn't do this. See q16 →
```
-- The pattern: window-rank by ingested_at, but BOOL_OR the DQ flag first
ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingested_at DESC) AS rn,
BOOL_OR(NOT metadata_valid) OVER (PARTITION BY event_id) AS dq_flag
-- ... then WHERE rn = 1
```
2. Same-timestamp ties → break with event_id
Two events for the same member can share event_ts exactly. Sessionization, ordering, and LAG will produce non-deterministic results unless you add a secondary key. Standard tiebreaker: order by event_ts, event_id. See q11 →
```
-- Always add event_id when ordering by event_ts:
ROW_NUMBER() OVER (PARTITION BY member_id ORDER BY event_ts, event_id)
```
3. TIMESTAMPTZ arithmetic → cast to TIMESTAMP
DuckDB-WASM (which runs the in-browser engine) is missing some of native DuckDB's TIMESTAMPTZ + INTERVAL overloads. When you need to add/subtract an interval or take EPOCH(...) of a difference, cast to plain TIMESTAMP first. This is a portability concession; if you only see the error in the browser engine and the same query works in your local DuckDB, this is why.
```
-- Instead of:   event_ts + INTERVAL 30 MINUTE        (may fail in WASM)
-- Do:           event_ts::TIMESTAMP + INTERVAL 30 MINUTE
```

Seven tables. Three core entities, two event streams, one daily roll-up, one experiment plane.

ERD

Jump to a table

Columns

Common filters

Common joins

Challenges that lean on this table

Columns

Common filters

Common joins

Challenges that lean on this table

Columns

Common joins

Challenges that lean on this table

Columns

Common filters

Common joins

Challenges that lean on this table

Columns

Common filters

Common joins

Challenges that lean on this table

Columns

Common filters

Common joins

Challenges that lean on this table

Columns

Common filters

Common joins

Challenges that lean on this table

Cross-cutting gotchas

1. Late-arriving events → group-preserving dedup

2. Same-timestamp ties → break with event_id

3. TIMESTAMPTZ arithmetic → cast to TIMESTAMP