Sorry Engineering by Rafal Makara

AI Learning Notes

Dataset Engineering

Rafal Makara
Feb 08, 2025


Dataset engineering refers to designing, collecting, curating, generating, annotating, and optimizing the data needed for training and adapting AI models.

Consider two problem types: classification and open-ended chat responses.

In closed-ended models, such as traditional classification models, dataset engineering is straightforward. For example, labeling an image as “cat” or “not a cat” is a well-defined task with clear ground truth.
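As a rough illustration of what "clear ground truth" buys you, here is a minimal sketch of a labeled dataset for the cat example. The file names, the `ALLOWED_LABELS` set, and both helper functions are assumptions made up for this sketch, not part of any particular library.

```python
# Hypothetical closed-ended dataset: every example has exactly one label
# drawn from a small, fixed set.
ALLOWED_LABELS = {"cat", "not a cat"}


def validate_labels(examples):
    """Return the examples whose label is outside the allowed set."""
    return [ex for ex in examples if ex["label"] not in ALLOWED_LABELS]


def accuracy(predictions, examples):
    """Fraction of predictions that match the ground-truth label."""
    correct = sum(p == ex["label"] for p, ex in zip(predictions, examples))
    return correct / len(examples)


examples = [
    {"image": "img_001.jpg", "label": "cat"},
    {"image": "img_002.jpg", "label": "not a cat"},
    {"image": "img_003.jpg", "label": "cat"},
    {"image": "img_004.jpg", "label": "dog"},  # invalid label, caught below
]

invalid = validate_labels(examples)
score = accuracy(["cat", "cat", "cat"], examples[:3])
```

Because the label set is closed, both data validation and evaluation reduce to simple equality checks.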

However, for open-ended models, such as foundation models, dataset engineering becomes more complex. Because these models (for example, behind a chatbot UI) can generate responses in an almost unlimited number of ways, there is rarely a single correct output to label against, so the work extends beyond simple labeling. Instead, dataset engineering focuses on tasks like deduplication, tokenization, context retrieval, quality control, and removal of sensitive information.
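Three of these tasks (deduplication, a basic quality filter, and removal of sensitive information) can be sketched in a few lines. This is a simplified illustration under stated assumptions: it only catches exact duplicates after normalization, the quality filter is a hypothetical minimum word count, and the only sensitive data it handles is email addresses matched by a deliberately simple regular expression. Real pipelines typically need fuzzy deduplication and much broader PII detection.

```python
import hashlib
import re

# Deliberately simple email pattern; an assumption for this sketch,
# not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(texts):
    """Keep the first occurrence of each text, comparing normalized hashes."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def clean(texts, min_words=3):
    """Deduplicate, drop very short texts, then redact emails."""
    kept = [t for t in deduplicate(texts) if len(t.split()) >= min_words]
    return [redact_emails(t) for t in kept]


raw = [
    "How do I reset my password?",
    "how do I  reset my password?",  # duplicate after normalization
    "ok",                            # too short to be useful
    "Contact me at jane.doe@example.com for details.",
]
cleaned = clean(raw)
```

Hashing the normalized text rather than storing it keeps the `seen` set small when the corpus is large, at the cost of only detecting exact matches.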

How should datasets be prepared so that a specific model can train effectively on them?


Sources:

  • AI Engineering by Chip Huyen (O’Reilly). Copyright 2025 Developer Experience Advisory LLC, 978-1-098-16630-4

  • Additional research

