Seven tables. Three core entities, two event streams, one daily roll-up, one experiment plane.

Core entities: members, titles, episodes. Event streams: playback_events, media_workflow_events. Daily roll-up: engagement_daily. Experiment plane: experiment_assignments. Learn the shape once; every challenge will feel familiar.

ERD

Entity-relationship diagram showing the 7 tables in the Netflix pack and their foreign-key relationships.

Jump to a table

members

30 rows

One row per subscriber. Demographics + plan + tenure bucket + bot flag.

Columns

NameTypeWhat it means
member_idPKintSubscriber identifier.
signup_datedateFirst day of subscription. Anchor for cohort and retention math.
countrystringISO-style country code (e.g. US, BR, UK).
plan_typestringbasic / standard / premium.
subscription_statusstringactive / churned / paused. Filter to active for live-base questions.
tenure_bucketstring0-3m / 3-6m / 6-12m / 12m+. Pre-computed band for churn-risk scoring.
is_botboolTRUE for synthetic/bot traffic. Filter out for any user-facing metric.

Common filters

  • WHERE is_bot = FALSE

    Exclude bot traffic from every member-facing metric. Forgetting this is a senior-interview red flag.

  • WHERE subscription_status = 'active'

    Use when measuring the live base (churn risk, current engagement). Skip for cohort/retention questions where you want everyone who ever signed up.

Common joins

  • FROM members m JOIN engagement_daily ed USING (member_id)

    Member-level daily engagement; the spine of retention and streak questions.

  • FROM members m JOIN experiment_assignments ea USING (member_id)

    Member attributes attached to experiment variant — needed for slicing lift by tenure/country.

Challenges that lean on this table

titles

12 rows

One row per show/film. Genre + runtime + origin + budget band. The dimension table behind every content question.

Columns

NameTypeWhat it means
title_idPKintContent identifier.
title_namestringDisplay title (e.g. 'Stranger Things'). Don't GROUP BY this if you have title_id — slightly slower.
title_typestringseries / movie. Drives whether episode_id is meaningful for joins.
genrestringDrama / Comedy / Documentary / ...
runtime_minutesintTotal runtime in minutes. Critical for completion-rate and rewatch-cap logic.
release_datedateFirst public release date.
country_of_originstringWhere the title was produced.
budget_bandstringtier_a / tier_b / tier_c — internal budget tier.

Common filters

  • WHERE title_type = 'series'

    Series-only when you need episode-level granularity (binge detection, episode counts).

Common joins

  • FROM playback_events p JOIN titles t USING (title_id)

    Attach content metadata (name, runtime, genre) to playback. The most common join in this pack.

  • FROM titles t JOIN episodes e USING (title_id)

    Resolve season/episode structure when you need episode counts per series.

Challenges that lean on this table

episodes

75 rows

One row per episode. Season + episode number + runtime — only series have rows here.

Columns

NameTypeWhat it means
episode_idPKintEpisode identifier.
title_idFKintFK → titles. The series this episode belongs to.
season_numberint1-indexed season.
episode_numberint1-indexed episode within its season.
runtime_minutesintEpisode runtime. Often shorter than the title's runtime_minutes (which is for movies or overall).

Common joins

  • playback_events p ... AND p.episode_id IS NOT NULL

    Movies leave episode_id NULL; filter when you need episode granularity (binge detection, series resume position).

Challenges that lean on this table

playback_events

462 rows

Event stream: every impression, detail view, play_start, play_pause, play_complete. The richest table in the pack.

Columns

NameTypeWhat it means
event_idintStream event identifier. NOT unique — duplicates appear from client retries.
member_idFKintFK → members.
title_idFKintFK → titles.
episode_idFKint (nullable)FK → episodes. NULL for movies; populated for series episodes.
session_idstringClient-supplied session id. May not align with the canonical 30-min-gap session — derive it yourself when correctness matters (see q11).
event_tstimestamptzEvent time. Cast to plain TIMESTAMP for DuckDB-WASM arithmetic (see gotchas below).
event_namestringimpression / title_detail_view / play_start / play_pause / play_complete. Filter to 'play_start' for true watching.
watch_secondsintSeconds watched in this event. Can exceed runtime when looped — cap at runtime for valued-hours metrics.
device_typestringweb / mobile / tv / tablet. QoE breakdowns slice on this.
countrystringCountry at event time (denormalized — can differ from members.country during travel).
startup_time_msintTime-to-first-frame in ms. Source for p95 / p99 startup metrics. Means hide the unhappy 5%.
rebuffer_secondsintSeconds spent rebuffering during this play. Numerator for rebuffer ratio.
error_codestring (nullable)Populated on playback errors; NULL on healthy plays.
ingested_attimestamptzWhen the event landed in the warehouse. ingested_at > event_ts indicates late arrival.
metadata_validboolPer-row DQ flag. FALSE means the producer flagged this copy as malformed. Group-aware dedup (q16) must propagate FALSE to all retries of the same event_id.
entry_sourcestringhomepage_row / recommendation / search / billboard. The discovery surface that drove the impression.

Common filters

  • WHERE event_name = 'play_start'

    Impressions and detail views aren't watching. Default to play_start for engagement/hours/completion math.

  • WHERE entry_source = 'homepage_row'

    Scope to a specific discovery surface — the homepage funnel is the canonical question (q05).

  • WHERE event_ts >= (SELECT MAX(event_ts) FROM playback_events)::TIMESTAMP - INTERVAL 30 DAY

    Anchor 'last 30 days' to the data's max event, not CURRENT_DATE — tests are deterministic that way.

Common joins

  • FROM playback_events p JOIN titles t USING (title_id)

    Attach title name/runtime for every metric that reports per title.

  • Self-join on (member_id, title_id, session_id) with a time-window predicate

    The funnel pattern — match impression → next step within an attribution window. Always pre-aggregate to the impression key first to avoid fan-out.

Challenges that lean on this table

engagement_daily

12,711 rows

One row per (member, calendar day). Pre-aggregated daily roll-up — start here for retention, streaks, churn risk.

Columns

NameTypeWhat it means
member_idFKintFK → members. Combined with `date`, forms the natural key.
datedateCalendar day in UTC.
watch_secondsintTotal seconds watched on this day. Zero on inactive days.
titles_startedintDistinct titles with at least one play_start.
titles_completedintDistinct titles whose capped valued seconds crossed the completion threshold.
active_viewing_flagboolTRUE if the day passed Netflix's 'meaningful' bar (typically >2 minutes of valued play). Use this — not watch_seconds > 0 — for retention.
sessionsintCount of distinct sessions on this day.
meaningful_play_countintCount of plays that passed the meaningful threshold.

Common filters

  • WHERE active_viewing_flag = TRUE

    Background open-app days inflate retention. active_viewing_flag is the right binary for 'did they really watch?'.

  • WHERE date BETWEEN signup_date + 24 DAY AND signup_date + 31 DAY

    Day-30 retention windows — center on the anniversary day with a small buffer to absorb signup-time noise (q14).

Common joins

  • FROM members m LEFT JOIN engagement_daily ed USING (member_id)

    LEFT JOIN keeps members who have *never* been active — critical so they show up as churn-risk.

Challenges that lean on this table

media_workflow_events

127 rows

Studio post-production workflow stream. Each row is an event in a workflow stage's lifecycle (started / completed / etc.).

Columns

NameTypeWhat it means
workflow_event_idPKintStream event identifier.
title_idFKintFK → titles. The content this workflow produces an asset for.
asset_idintFK → assets (not in this pack). The specific deliverable (audio mix, subtitle file, image).
workflow_idintGroups events into one workflow instance. Pair started ↔ completed within (workflow_id, workflow_stage).
workflow_stagestringingest / transcode / qc / localize / publish / ...
event_timestamptimestamptzWhen the event happened. Note: column is `event_timestamp`, not `event_ts` like playback.
event_typestringstage_started / stage_completed. Anti-join these for in-flight detection (q10).
statusstringStatus *at the moment of the event*, not current. A stage_started row's `status` is 'in_progress' even after it has long completed.
owner_teamstringInternal team responsible for this stage execution.
vendor_idintExternal vendor handling the stage (some stages are outsourced).
error_codestring (nullable)Populated on failure.
rework_flagboolTRUE if this stage entry is rework. Note: q09 derives rework structurally (count starts) instead of trusting this flag.
sla_due_timestamptimestamptzSLA deadline for this stage. Compare against completed_timestamp for miss-rate.
completed_timestamptimestamptz (nullable)Populated only on stage_completed events. NULL on stage_started rows.

Common filters

  • WHERE event_type = 'stage_started'

    Find when stages began. Combine with an anti-join against stage_completed to identify in-flight (q10).

  • WHERE event_type = 'stage_completed' AND completed_timestamp > sla_due_timestamp

    Operational miss-rate. Numerator only — pair with COUNT(*) over all completions for the denominator.

Common joins

  • Self-pair on (workflow_id, workflow_stage) using FILTER for started vs completed

    Use MIN(event_timestamp) FILTER (WHERE event_type='stage_started') and the matching completed FILTER — earliest-paired-with-earliest, safe under rework loops (q07).

Challenges that lean on this table

experiment_assignments

59 rows

One row per (experiment, member) assignment. Variant + assignment time + whether/when the member was actually exposed.

Columns

NameTypeWhat it means
experiment_idstringExperiment name (e.g. 'homepage_ranking_v2'). Filter by this first.
member_idFKintFK → members.
variantstringcontrol / treatment (sometimes more arms — verify per experiment).
assignment_timestamptimestamptzWhen the member was assigned to a variant. Use this as the time-zero for ITT analysis.
exposed_flagboolTRUE if the member was actually shown the treatment surface. ITT = assigned; TOT = exposed (q19).
exposure_timestamptimestamptz (nullable)When exposure happened. NULL for unexposed assignees.

Common filters

  • WHERE experiment_id = 'homepage_ranking_v2'

    Always scope per-experiment first — analyses across experiments are usually meaningless.

  • WHERE exposed_flag = TRUE -- TOT only; ITT skips this

    ITT counts everyone assigned. TOT restricts to actually-exposed — biased upward when exposure correlates with engagement (q19).

Common joins

  • FROM experiment_assignments ea JOIN engagement_daily ed ON ed.member_id = ea.member_id AND ed.date BETWEEN ea.assignment_timestamp::DATE AND ea.assignment_timestamp::DATE + INTERVAL 6 DAY

    Per-member 7-day post-assignment window — the canonical lift measurement (q17).

Challenges that lean on this table

Cross-cutting gotchas

Three traps recur across the pack. Skim these before you start any challenge.

  1. 1. Late-arriving events → group-preserving dedup

    playback_events.event_id is not unique — client retries during ingestion create duplicates. Take the latest by ingested_at, but propagate metadata_valid = FALSE across the group before collapsing — otherwise you lose a DQ flag that an earlier retry recorded. Naïve DISTINCT ON doesn't do this. See q16 →

    -- The pattern: window-rank by ingested_at, but BOOL_OR the DQ flag first
    ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingested_at DESC) AS rn,
    BOOL_OR(NOT metadata_valid) OVER (PARTITION BY event_id) AS dq_flag
    -- ... then WHERE rn = 1
  2. 2. Same-timestamp ties → break with event_id

    Two events for the same member can share event_ts exactly. Sessionization, ordering, and LAG will produce non-deterministic results unless you add a secondary key. Standard tiebreaker: order by event_ts, event_id. See q11 →

    -- Always add event_id when ordering by event_ts:
    ROW_NUMBER() OVER (PARTITION BY member_id ORDER BY event_ts, event_id)
  3. 3. TIMESTAMPTZ arithmetic → cast to TIMESTAMP

    DuckDB-WASM (which runs the in-browser engine) is missing some of native DuckDB's TIMESTAMPTZ + INTERVAL overloads. When you need to add/subtract an interval or take EPOCH(...) of a difference, cast to plain TIMESTAMP first. This is a portability concession; if you only see the error in the browser engine and the same query works in your local DuckDB, this is why.

    -- Instead of:   event_ts + INTERVAL 30 MINUTE        (may fail in WASM)
    -- Do:           event_ts::TIMESTAMP + INTERVAL 30 MINUTE