Phase 3

Data Preparation

Turn scattered raw records into a dataset a model can trust.

What this phase actually is

Data preparation is where most of the quiet work happens. Raw records are cleaned, joined, filtered, aggregated, and reshaped into features that match the question.

Good preparation is disciplined. It preserves the meaning of the original data, records assumptions, and avoids using information that would not be known at the moment of prediction.
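One way to make the "known at the moment of prediction" rule concrete is a point-in-time cutoff when building features. The sketch below is illustrative only: the event records, the field names (`user_id`, `date`, `hours`), and the helper `hours_before` are hypothetical and do not come from any particular pipeline.

```python
from datetime import date


def hours_before(events, user_id, prediction_date):
    """Sum usage hours recorded strictly before prediction_date.

    Events on or after the prediction date are excluded, because they
    would not be known when the prediction is made.
    """
    return sum(
        e["hours"]
        for e in events
        if e["user_id"] == user_id and e["date"] < prediction_date
    )


# Hypothetical event log for one user.
events = [
    {"user_id": "A-102", "date": date(2026, 5, 20), "hours": 1.5},
    {"user_id": "A-102", "date": date(2026, 5, 26), "hours": 4.2},
    {"user_id": "A-102", "date": date(2026, 6, 1), "hours": 3.0},
]

# Only the 20 May event counts for a prediction made on 26 May.
feature = hours_before(events, "A-102", date(2026, 5, 26))
```

Passing the cutoff as an explicit argument keeps the rule visible and testable, instead of relying on whatever rows happen to be in the table when the feature is computed.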

The useful output is a reproducible dataset with a clear target, clear features, and enough checks that the team can trust what happens next.
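The checks mentioned above could be as simple as a few assertions run every time the dataset is rebuilt. The following is a minimal sketch, assuming the dataset is a list of dictionaries; the function names `check_dataset` and `fingerprint` and the example column names are hypothetical.

```python
import hashlib
import json


def check_dataset(rows, id_field, target, features):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    ids = [row.get(id_field) for row in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate ids")
    for i, row in enumerate(rows):
        if row.get(target) is None:
            problems.append(f"row {i}: missing target")
        for name in features:
            if name not in row:
                problems.append(f"row {i}: missing feature column {name!r}")
    return problems


def fingerprint(rows):
    """Hash the dataset so two runs can be compared for reproducibility.

    Assumes every value is JSON-serializable. Row order matters;
    key order within a row does not.
    """
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical prepared rows: one feature ("hours") and one target ("churned").
rows = [
    {"id": "A-102", "hours": 4.2, "churned": 0},
    {"id": "C-031", "hours": 0.5, "churned": 1},
]
problems = check_dataset(rows, "id", "churned", ["hours"])
digest = fingerprint(rows)
```

If the same inputs and the same preparation code produce a different fingerprint on a later run, something in the pipeline is not reproducible and is worth investigating before any modeling starts.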

How this looks at Bertelsmann

Try it

Mess-to-Clean Transformer

Messy

ID    | Login      | Hours | Plan
A-102 | 26/05/26   | 4.2   | Premium
A-102 | 2026-05-26 | 4.2   | Premium
B-114 | May 2      | NULL  | basic
C-031 | 2026/04/28 | 0.5   | Family

Clean

ID    | Login      | Hours | Plan
A-102 | 26/05/26   | 4.2   | Premium
A-102 | 2026-05-26 | 4.2   | Premium
B-114 | May 2      |       | basic
C-031 | 2026/04/28 | 0.5   | Family
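A transformation of the Messy rows might look roughly like the sketch below. The cleaning rules are assumptions chosen for illustration, not the behavior of the transformer on this page: dates are normalized to ISO format, "26/05/26" is read as DD/MM/YY, "NULL" becomes a missing value, plan names are capitalized, and exact duplicates are dropped. A login such as "May 2" has no year, so it is flagged as missing for review instead of being guessed.

```python
from datetime import datetime

# Rows copied from the "Messy" table above.
MESSY = [
    {"ID": "A-102", "Login": "26/05/26", "Hours": "4.2", "Plan": "Premium"},
    {"ID": "A-102", "Login": "2026-05-26", "Hours": "4.2", "Plan": "Premium"},
    {"ID": "B-114", "Login": "May 2", "Hours": "NULL", "Plan": "basic"},
    {"ID": "C-031", "Login": "2026/04/28", "Hours": "0.5", "Plan": "Family"},
]

# Assumption: only these formats are accepted; the last one treats
# a two-digit-year date as DD/MM/YY.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%y")


def parse_login(raw):
    """Return an ISO date string, or None when the value cannot be parsed safely."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # e.g. "May 2": no year given, so flag it instead of guessing


def parse_hours(raw):
    """Convert hours to float; treat empty strings and 'NULL' as missing."""
    return None if raw in ("", "NULL") else float(raw)


def clean(rows):
    """Normalize each row, then drop rows that become exact duplicates."""
    seen = set()
    cleaned_rows = []
    for row in rows:
        record = {
            "ID": row["ID"].strip(),
            "Login": parse_login(row["Login"].strip()),
            "Hours": parse_hours(row["Hours"].strip()),
            "Plan": row["Plan"].strip().capitalize(),
        }
        key = tuple(record.values())
        if key in seen:
            continue  # same record already kept once
        seen.add(key)
        cleaned_rows.append(record)
    return cleaned_rows


cleaned = clean(MESSY)
```

Under these assumed rules the two A-102 rows collapse into one, because both date spellings normalize to the same day. The B-114 row is kept with missing login and hours rather than deleted, so an inconvenient case stays visible.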

Pitfalls

  • Cleaning away inconvenient cases that the model will still face later.
  • Leaking future information into the training data.
  • Creating a dataset nobody can reproduce.