Is everyone’s data a mess? Recently, I came across a post in the data engineering subreddit that asked this question. The answer is yes, but no. As someone who has seen data infrastructure at FAANGs, enterprises, start-ups, and every other company in between, I can say that all companies need to make some concessions, and those concessions can build up and become messy over a long period of time.
Aug 8, 2023 · edited Aug 8, 2023 · Liked by SeattleDataGuy
That's why we need to advocate for lean data practices: pull only the data you need, use it until you no longer need it, then discard it. By adopting lean data practices, we can minimize data waste and optimize our data operations.
From my experience working in 8+ companies' codebases... yes. Yes yes yes yes yessssss.
And it's not even for a good reason. Most of the time it's just people wanting to make projects for the sake of making projects and they spin up new services or databases.
Excellent points. Sometimes you don't realize how bad the data in your source system is until you try to extract it and load it into a data warehouse. I worked on a team in the early 2000s that built a data warehouse from the source system. We discovered a lot of issues with the source system that we didn't know were there, so the project helped identify and clean up issues in the source system too.
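To make that concrete, here is a rough sketch of the kind of check that tends to surface source-system problems during extraction. It is only an illustration, not what that team actually ran: the `profile_rows` helper, the `customer_id` and `email` columns, and the sample rows are all made up for the example.

```python
from collections import Counter


def profile_rows(rows, key, required):
    """Return simple data-quality counts for a batch of extracted rows.

    rows     -- list of dicts, one per extracted source record
    key      -- column expected to be unique (hypothetical business key)
    required -- columns that should never be empty
    """
    # Count how often each key value appears; more than once means a duplicate.
    key_counts = Counter(row.get(key) for row in rows)
    duplicate_keys = sorted(
        k for k, n in key_counts.items() if k is not None and n > 1
    )

    # Count rows where a required column is missing, None, or an empty string.
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in required
    }

    return {
        "row_count": len(rows),
        "duplicate_keys": duplicate_keys,
        "missing": missing,
    }


# Hypothetical extract from a source system (made-up data).
sample = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": 2, "email": "b@example.com"},
    {"customer_id": 3, "email": None},
]

report = profile_rows(sample, key="customer_id", required=["customer_id", "email"])
print(report)
```

Running something like this before the load step is one way to find duplicate keys and empty required fields early, so they can be fixed in the source system rather than patched in the warehouse.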