Setting Standards For Your Data Team
Against these, the architectural unity of Reims stands in glorious contrast. The joy that stirs the beholder comes as much from the integrity of the design as from any particular excellences. As the guidebook tells, this integrity was achieved by the self-abnegation of eight generations of builders, each of whom sacrificed some of his ideas so that the whole might be of pure design - The Mythical Man Month
Standards.
They can often feel like small nits that slow down PR reviews and add to the bottleneck nature of data teams. But they have a much larger impact in the grand scheme of running a data team. Standardized SQL and coding styles help make logic and code easier to maintain. Not just now, but into the future as new heads of data take over. It’s not just about how you write code but also how teams work and decide what projects to take on. Standards make it easy for new team members to onboard because they can quickly pattern match how the team sets up its models, files, and code.
But before diving into some examples of standards and SQL style guides, let’s go over why standards are important.
Why Standards Are Important
Enhanced Efficiency - Standards can enhance efficiency in the long run. They remove micro-decisions (think Steve Jobs wearing the same thing every day) and make it easier for future developers to walk through code, add new modules, and debug issues.
Easy to Maintain - Consistent coding standards make it easier to maintain legacy code. Let me add to that.
It’s better to be consistently wrong than inconsistent.
Many coding standards that turn out to be incorrect can be walked back through massive codemods. I have seen thousands of files updated in a tenth of the time thanks to consistent naming structures. But if your coding style is inconsistent, it’s far harder to change. Sometimes a coding decision is merely temporary and everyone knows it. For example, at Facebook, we had to add a specific comment during our Python 3 migration to ensure the right Python version was used.
Following that migration, we removed it. Had we been inconsistent in how we implemented this comment, reverting the code once we were done would have been painful.
Standards Set The Tone - Indulge me in a nostalgic flashback. In a previous life, I worked in fine dining, meaning I spent a good part of my time perfectly chopping shallots, chives, garlic, and other fundamental building blocks for any dish. The reason most kitchens start stages and early cooks on this type of task is that it sets the standard.
If you send poorly chopped shallots to the line, they will be returned, and you will be berated (toxic culture be damned). If your shallots are chopped correctly but too wet, they will be sent back and someone will hand you a sharper knife.
These standards set the tone. There are no questions about what is expected, from the most basic ingredient onward. Perfection is the only acceptable result.
In the same way, creating clear standards for quality of work, naming conventions, and processes communicates to everyone what is expected.
Onboarding Is Easier - There have been times as a consultant when I have been called in to migrate custom-built pipeline systems, meaning I need to onboard very quickly and understand exactly what is going on. When coding styles are standardized, this is easy because I can immediately get a sense of how the code is structured. The same goes for new hires. If you have clear patterns in how you name variables and tables, as well as how you structure your folders and scripts, then you remove a lot of ambiguity.
On the flip side, I have come into some situations where it was clear there were multiple generations of code styles all in the same data pipeline system. This increases the complexity of onboarding as every module is slightly different.
Bug Rectification - Along with maintenance, debugging is another common task data professionals need to perform. Debugging is made far easier when SQL logic, tables, and column naming conventions are consistent. Instead of having to dig through 40 different files to find the location of a one-off query, the name of a column or function may provide all the traceability required.
In addition, if queries are structured well with logic broken up into clear CTEs, it’s very easy to read and understand what is going on.
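As a quick illustration (the table and column names here are hypothetical, not from any particular codebase), a query broken into named CTEs lets you trace a bad number back to a single block:

WITH active_customers AS (
    -- The definition of "active" lives in one place
    SELECT customer_id
    FROM customers
    WHERE is_active = TRUE
),

daily_spend AS (
    -- Aggregation logic is isolated in its own step
    SELECT customer_id, order_date, SUM(order_total) AS total_spend
    FROM orders
    GROUP BY customer_id, order_date
)

SELECT
    a.customer_id,
    d.order_date,
    d.total_spend
FROM active_customers AS a
INNER JOIN daily_spend AS d
    ON d.customer_id = a.customer_id

If the spend numbers look off, you debug daily_spend; if the customer count is off, you debug active_customers. No untangling required.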
What Do Standards Look Like?
It should be clear why standards make life easier for everyone who works in data or engineering. Now let’s go over some examples.
Columns
The easiest place to enact standards is in naming conventions. In particular, as data engineers, this happens with tables and columns.
Some really quick and dull examples would be (these don’t have to be your standard, but they do tend to be pretty consistent across a lot of companies):
Booleans - should start with “is_” or “has_” and should always be a “True” or “False” value, not “Y” or “N” or other variants.
Dates - should end in “_date”
Timestamps - should end with “_ts”
Some other general conventions would be to avoid too many abbreviations, especially if they are team-specific abbreviations. That way, other users can quickly parse through your tables.
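Putting these conventions together, a minimal sketch of a table definition might look like this (the table and columns are made up purely for illustration):

CREATE TABLE customer_subscriptions (
    customer_id     BIGINT,
    subscription_id BIGINT,
    is_active       BOOLEAN,    -- boolean prefixed with is_, holds TRUE/FALSE, never 'Y'/'N'
    has_discount    BOOLEAN,    -- boolean prefixed with has_
    start_date      DATE,       -- date column ends in _date
    cancel_date     DATE,
    created_ts      TIMESTAMP,  -- timestamp column ends in _ts
    updated_ts      TIMESTAMP
)

Note that nothing is abbreviated into team-specific shorthand like cust_id or sub_id, so anyone browsing the warehouse can parse it at a glance.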
I will avoid diving too deeply into table standards in terms of modeling and structure. This really will be team dependent: some heads of data like heavily normalized core data sets, others like strict snowflake schemas, and I am going to avoid jumping into a religious-level debate.
Instead, let’s talk about query style.
Query Style
Besides naming conventions, how you write a query is also an opportunity to set standards.
For example, consider CTEs vs. subqueries. Nearly any style guide you look up recommends avoiding subqueries and using CTEs instead.
CTEs tend to provide a clearer breakdown of logic and can provide other benefits depending on the SQL engine you are running on. When you use too many subqueries, the logic can easily get muddled and people will get lost.
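To make that concrete, here is a before-and-after sketch (again with hypothetical table and column names). The logic is identical; only the structure changes:

-- Nested subqueries: the reader has to unwind the logic from the inside out
SELECT customer_id, total_spend
FROM (
    SELECT customer_id, SUM(order_total) AS total_spend
    FROM (
        SELECT customer_id, order_total
        FROM orders
        WHERE order_status = 'complete'
    ) completed_orders
    GROUP BY customer_id
) customer_spend
WHERE total_spend > 100

-- The same logic as CTEs: each step is named and reads top to bottom
WITH completed_orders AS (
    SELECT customer_id, order_total
    FROM orders
    WHERE order_status = 'complete'
),

customer_spend AS (
    SELECT customer_id, SUM(order_total) AS total_spend
    FROM completed_orders
    GROUP BY customer_id
)

SELECT customer_id, total_spend
FROM customer_spend
WHERE total_spend > 100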
Some other points worth noting in terms of query style are:
Avoid implicit joins such as:
SELECT * FROM table1, table2
Avoid ambiguous column names like “id” or “category”
Use LOWER() when comparing string values to avoid case-sensitivity mismatches (a quick sketch covering these points follows below)
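Here is a minimal sketch pulling those points together (the tables and columns are hypothetical):

SELECT
    c.customer_id,       -- prefixed, unambiguous names instead of a bare "id"
    c.customer_name,
    o.order_id
FROM customers AS c
INNER JOIN orders AS o   -- explicit join instead of FROM customers, orders
    ON o.customer_id = c.customer_id
WHERE LOWER(c.customer_country) = 'us'  -- LOWER() guards against 'US' vs 'us'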
If you are looking for a SQL style guide to build your team’s standards on, check out the list below.
The Drawbacks Of Standards
Of course, like anything else, standards can also become a hindrance. Rigid standards can remove the desire to think outside of the box.
Instead of thinking about why you are doing what you are doing, your actions may become overly dogmatic. To add to this, engineers may believe that the standards they learned at their first company are the only way to operate.
Actually, this was really common in kitchens. You’d hire a new cook, and suddenly all you would hear for a week was “This wasn’t how we did it at The French Laundry.”
Similarly, standards can cause a team to have tunnel vision and not put effort into constantly trying to improve and think about problems in new ways.
Overall, I find that standards are mostly there to act as guardrails and should only occasionally be broken when the need arises.
Standards Are Meant To Improve Efficiency Not Slow Work Down
In order to go fast, we often need to move slow. Standards can feel like they slow down our work. They can feel like nits that another developer is imposing upon us. However, in the long run they make maintaining large code bases easier and ensure your team is working on the right projects.
Coding standards provide a lot of benefits, even if it doesn’t feel like it when you’re fixing another nit on a PR review. They really are a crucial part of your data strategy.
I promise, standards will make your life easier.
DROP the Modern Data Stack
It’s time to make sense of today’s data tooling ecosystem. Check out rudderstack.com/dmds to get a guide that will help you build a practical data stack for every phase of your company’s journey to data maturity. The guide includes architectures and tactical advice to help you progress through four stages: Starter, Growth, Machine Learning, and Real-Time. Visit RudderStack.com/dmds today to DROP the modern data stack and USE a practical data engineering framework.
A very special thank you to RudderStack for sponsoring the newsletter this week!
Video Of The Week: Why Are Data Teams Still Struggling To Answer Basic Questions?
“Why do we still struggle to answer basic questions like how many customers are still active or what exactly is the company’s churn?”
This was a question posed to a panel I was sitting on at the Snowflake Summit. Several panelists provided different answers, all coming from very different perspectives. Truth be told, I have mulled over this question for the last few years as I worked on projects that involved reconciling basic numbers like the number of active customers or total sales for a company.
Since I didn’t get to fully answer the question on the panel, I put together a video here.
Join My Data Engineering And Data Science Discord
Recently my YouTube channel went from 1.8k to 34k, and my email newsletter has grown from 2k to well over 10k.
Hopefully we can see even more growth this year. But until then, I have finally put together a Discord server. Currently, this is mostly a soft opening.
I want to see what people end up using this server for. How it is used will, in turn, play a role in what channels, categories, and support are created in the future.
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through some of these articles as well as TechCrunch and company tech blogs, and I wanted to share some of my favorites!
Lyft Data Scientist Interview Question Walkthrough
Nearly 20 million people use Lyft to move around their cities. The company’s main product is a user-friendly mobile application that allows users to seamlessly move from point A to point B.
Behind the user friendliness of the smartphone app, Lyft puts a lot of effort into collecting and analyzing data about its users and drivers. This helps them plan and optimize their resources to provide the most efficient inner city transportation considering the price, time, and level of comfort.
Interviewers at Lyft try to ask the right questions to hire capable data scientists. In this article, we will walk you through one of the common Lyft data scientist interview questions, where candidates have to calculate driver churn rate based on available data.
Adapting to Endure
We believe the current market environment is a Crucible Moment that will provide challenges but also opportunities for all of you. Many legendary companies are forged during challenging environments as competition thins, real businesses get built and the opportunity for innovation is seized by those who see it.
We shared this presentation with Sequoia founders in May, and are now sharing it publicly with the wider startup community. Whether you are a CEO, operator or an aspiring founder pondering what it will take to build a successful company in a rapidly changing business climate, we hope this will prove a useful toolkit. No matter your role, the key to thriving in the next period is confronting reality, and acting decisively to adapt.
A linear programming approach for optimizing features in ML models
Whether it’s iterating on Facebook’s News Feed ranking algorithm or delivering the most relevant ads to users, we are constantly exploring new features to help improve our machine learning (ML) models. Every time we add new features, we create a challenging data engineering problem that requires us to think strategically about the choices we make. More complex features and sophisticated techniques require additional storage space. Even at a company the size of Facebook, capacity isn’t infinite. If left unchecked, accepting all features would quickly overwhelm our capacity and slow down our iteration speed, decreasing the efficiency of running the models.
To better understand the relationship between these features and the capacity of the infrastructure that needs to support them, we can frame the system as a linear programming problem. By doing this, we can maximize a model’s performance, probe the sensitivity of its performance to different infrastructure constraints, and study the relationships between different services. This work was done by data scientists embedded in our engineering team and demonstrates the value of analytics and data science in ML.
End Of Day 47
Thanks for checking out our community. We put out 3-4 newsletters a week discussing data, the modern data stack, tech, and start-ups.
If you want to learn more, then sign up today. It’s free to sign up and keep getting these newsletters.