Phase 3

Data Preparation

Turn scattered raw records into a dataset a model can trust.

What this phase actually is

Data preparation is where most of the quiet work happens. Raw records are cleaned, joined, filtered, aggregated, and reshaped into features that match the question.

Good preparation is disciplined. It preserves the meaning of the original data, records assumptions, and avoids using information that would not be known at the moment of prediction.
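One way to make "known at the moment of prediction" concrete is to build every feature through an explicit cutoff date. The sketch below is illustrative only: the event log, the `hours_before` helper, and the cutoff value are hypothetical and not part of any particular pipeline.

```python
from datetime import date

# Hypothetical event log: (customer_id, event_date, hours_watched).
EVENTS = [
    ("A-102", date(2026, 5, 20), 1.5),
    ("A-102", date(2026, 5, 26), 4.2),
    ("A-102", date(2026, 6, 3), 2.0),  # happens after the cutoff used below
]


def hours_before(events, customer_id, cutoff):
    """Sum hours using only events dated strictly before the cutoff."""
    return sum(
        hours
        for cid, event_date, hours in events
        if cid == customer_id and event_date < cutoff
    )


# Assumed prediction moment: the model is asked for a score on 1 June 2026.
cutoff = date(2026, 6, 1)
feature = hours_before(EVENTS, "A-102", cutoff)
```

Passing the cutoff as an argument, rather than filtering somewhere upstream, keeps the assumption visible in every feature function and makes it easy to review.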

The useful output is a reproducible dataset with a clear target, clear features, and enough validation checks that the team can trust the modeling built on it.
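What counts as "enough checks" depends on the project, but a small, explicit list is a reasonable starting point. The following is a minimal sketch under assumed names: `check_dataset`, the `churned` target, and the sample rows are hypothetical.

```python
def check_dataset(rows, key, target, features):
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = []

    # Each entity should appear once in the prepared dataset.
    keys = [row[key] for row in rows]
    if len(keys) != len(set(keys)):
        problems.append("duplicate keys")

    # A row without a target cannot be used for supervised training.
    if any(row.get(target) is None for row in rows):
        problems.append("missing target")

    # Report missing feature values instead of silently dropping rows.
    for name in features:
        missing = sum(1 for row in rows if row.get(name) is None)
        if missing:
            problems.append(f"{name}: {missing} missing")

    return problems


# Hypothetical prepared rows, one per customer.
rows = [
    {"id": "A-102", "hours": 4.2, "plan": "Premium", "churned": 0},
    {"id": "B-114", "hours": None, "plan": "Basic", "churned": 1},
    {"id": "C-031", "hours": 0.5, "plan": "Family", "churned": 0},
]
report = check_dataset(rows, key="id", target="churned", features=["hours", "plan"])
```

Here `report` contains a single entry for the one missing `hours` value, which the team can then decide to impute, flag, or accept.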

How this looks at Bertelsmann

Try it

Mess-to-Clean Transformer

Messy

ID	Login	Hours	Plan
A-102	26/05/26	4.2	Premium
A-102	2026-05-26	4.2	Premium
B-114	May 2	NULL	basic
C-031	2026/04/28	0.5	Family

Clean

ID	Login	Hours	Plan
A-102	2026-05-26	4.2	Premium
B-114	2026-05-02		Basic
C-031	2026-04-28	0.5	Family
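The transformation from the messy rows can be sketched in a few lines of Python. This is one possible approach, not the widget's actual logic: the list of date formats, the `ASSUMED_YEAR` used for the year-less "May 2", and the choice to capitalize plan names are all assumptions made for illustration.

```python
from datetime import date, datetime

# Rows copied from the "Messy" table.
MESSY = [
    {"ID": "A-102", "Login": "26/05/26", "Hours": "4.2", "Plan": "Premium"},
    {"ID": "A-102", "Login": "2026-05-26", "Hours": "4.2", "Plan": "Premium"},
    {"ID": "B-114", "Login": "May 2", "Hours": "NULL", "Plan": "basic"},
    {"ID": "C-031", "Login": "2026/04/28", "Hours": "0.5", "Plan": "Family"},
]

ASSUMED_YEAR = 2026  # assumption: "May 2" carries no year in the source data


def parse_login(text):
    """Try each assumed format in turn and return a date."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    # Year-less values such as "May 2": attach the assumed year explicitly.
    # Month names are parsed with the default (English) locale.
    return datetime.strptime(f"{text} {ASSUMED_YEAR}", "%b %d %Y").date()


def clean(rows):
    """Normalize dates, missing values, and plan names, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        hours = None if row["Hours"] in ("", "NULL") else float(row["Hours"])
        record = (
            row["ID"],
            parse_login(row["Login"]),
            hours,
            row["Plan"].strip().capitalize(),
        )
        # Duplicates only become visible after the dates share one format.
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned


CLEAN = clean(MESSY)
```

Note the order of operations: the two A-102 rows look different as raw text and only collapse into one record once both logins are parsed into the same date. The missing `Hours` value for B-114 is kept as `None` rather than dropped or replaced with zero, so the gap stays visible downstream.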

Pitfalls

Cleaning away inconvenient cases that the model will still face later.
Leaking future information into the training data.
Creating a dataset nobody can reproduce.
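For the reproducibility pitfall, one lightweight safeguard is to record a fingerprint of the prepared dataset next to the code that produced it, so a later rebuild can be compared against it. The sketch below assumes rows are JSON-serializable dictionaries; the `fingerprint` helper is a hypothetical name, not an existing tool.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 hex digest that ignores row order and key order."""
    # Serialize each row with sorted keys, then sort the rows themselves,
    # so the same data always yields the same text.
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8"))
    return digest.hexdigest()


# Hypothetical prepared rows.
first_build = [
    {"id": "A-102", "hours": 4.2, "plan": "Premium"},
    {"id": "C-031", "hours": 0.5, "plan": "Family"},
]
# Same data, different row and key order, as a rebuild might produce.
second_build = [
    {"plan": "Family", "id": "C-031", "hours": 0.5},
    {"plan": "Premium", "id": "A-102", "hours": 4.2},
]

same = fingerprint(first_build) == fingerprint(second_build)
```

If two builds from the same raw data and the same code produce different fingerprints, something in the preparation is not deterministic and is worth tracking down before modeling starts.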