4 Must-Have Skills Every Data Scientist Should Learn
We wanted to follow up our previous piece about how to grow as a data scientist with some other skills senior data scientists should have. Our hope is to bridge the gap between business managers and technical data scientists by creating clear goals senior data scientists can aim for. The two groups take on very different problems, and both benefit when they are on the same page. This is why the previous post focused so heavily on communication. It seems simple, but the gap between technical and business teams continues to grow as new technologies pile up every year. Thus, we find it important that managers and data scientists have a clear set of expectations.
Business and IT knowledge are both very specialized, and because of that specialization most businesses see a gap between the two. Our role is to help fill it!
We find it beneficial for data scientists who are starting their journey to focus heavily on the technical aspects. This means programming, queries, data cleansing, etc. However, as data scientists grow, they need to focus more on design decisions and communication with management. This multiplies the impact of the more experienced data scientists’ knowledge. Instead of being stuck in the day-to-day of coding, they can make higher-level decisions and help the younger data scientists when they get stuck. More experienced data scientists benefit both themselves and their companies more when they use their experience to make design decisions that simplify complex systems, optimize data flows, and determine which projects are most pertinent.
Being Able To Simplify The Complex
Data scientists have a tendency to want to use every technique and algorithm they know on every problem and in every solution. In turn, this creates complex systems that are difficult to maintain.
Data science does require complex and abstract modeling as well as a plethora of complex technologies (from Hadoop to TensorFlow). With all the complexity that surrounds the field, it is tempting to develop systems and algorithms that are in turn complex. There is the temptation to involve 4 or 5 different technologies and use every hot new algorithm or framework. However, as in most other fields that involve some engineering, reducing complexity is often better for multiple reasons.
If John von Neumann, Erwin Schrödinger and Albert Einstein can help us understand the complexities of their very math and physics driven fields, then we data scientists can’t hide behind complexity.
The role of an engineer is to simplify a task. If you have ever built or seen a Rube Goldberg machine, you will understand the idea of over-engineering a simple task. Some data scientists’ algorithms and data systems look more like a crazy mouse trap held together by duct tape and gum than an elegant but effective solution. Making simpler systems means the systems will be easier to maintain over time, and it gives future data scientists the ability to add and take away modules as needed. The next data scientist taking your position will thank you if you create a simple framework. On the other hand, if you use 3 different languages, 2 data sources, 10 algorithms and leave no documentation, then just know the future engineer is cursing your name under their breath.
Simple algorithms and systems also allow for easier additions and subtractions. Thus, as technology changes and updates are required, or a module needs to be taken out, a poor future data scientist isn’t stuck playing a game of Jenga with your code, wondering, "If I remove this block of code, will everything fall apart?" (Have you heard of technical debt?)
Knowing How To Mesh Data Without Primary Keys
One of the big values strong data experts should provide is tying together data sets that might not inherently have a primary key or an obvious connection. Data can represent a person’s or business’s day-to-day interactions. Finding statistical patterns in this data is what allows data scientists to help decision makers make wise choices. However, the data you would like to mesh together is not always on the same system or at the same granularity.
Those who have worked with data will know it is not always integrated nicely in one database. Finance data is often kept separate from IT Service Management data, and outside data sources might not have the same level of aggregation. This is a problem because finding value in data sometimes requires data from other departments and systems.
Data meshing requires building pieces at the same level of granularity. One way to think of it is joining one large puzzle piece to another large piece that is itself built from lots of smaller puzzle pieces of data.
For example, what if you are provided medical claims, credit card data and neighborhood crime rates and want to figure out how these socio-economic factors affect the patient? Some sets of data might be on a person-by-person level while others might be on a street or city level, with no clear method to connect the data sets. What is the best way to proceed? This becomes a design issue that, one, must be recorded and, two, must be thought out.
Each situation is different, as there are many ways to mesh data. It could be based on region, traits, spending habits, etc. This is why experience is important. An experienced data scientist will have the intuition for how the data can be joined, mostly because they have already tried hundreds of methods that don’t work. Often, the closer you can get both data sets to the person-by-person level, the better. So if region or city happens to be the lowest level of connection (lowest level refers to the granularity of the data, like person level, household level, street level, city level, state level, or many other groupings), then that would be a great place to start.
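As a minimal sketch of meshing at a shared level of granularity, assume hypothetical person-level claims and city-level crime rates. All field names, cities and numbers below are invented for illustration. The finer data set is rolled up to the city level first, and only then joined on the city name:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical person-level claims (finer granularity).
claims = [
    {"person_id": 1, "city": "Springfield", "claim_amount": 1200.0},
    {"person_id": 2, "city": "Springfield", "claim_amount": 800.0},
    {"person_id": 3, "city": "Shelbyville", "claim_amount": 500.0},
]

# Hypothetical city-level data (coarser granularity).
crime_rates = {"Springfield": 4.2, "Shelbyville": 2.9}


def roll_up_to_city(rows):
    """Aggregate person-level rows to one record per city."""
    amounts_by_city = defaultdict(list)
    for row in rows:
        amounts_by_city[row["city"]].append(row["claim_amount"])
    return {
        city: {"n_people": len(amounts), "avg_claim": mean(amounts)}
        for city, amounts in amounts_by_city.items()
    }


def mesh_on_city(city_claims, city_crime):
    """Join the two city-level data sets, keeping cities present in both."""
    return {
        city: {**stats, "crime_rate": city_crime[city]}
        for city, stats in city_claims.items()
        if city in city_crime
    }


meshed = mesh_on_city(roll_up_to_city(claims), crime_rates)
```

Rolling up throws away person-level detail, so the choice of join level is exactly the kind of design decision that should be written down.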
Being Able To Prioritize Projects
As a data scientist, you have to know how to explain the ROI of projects that might not pan out. This comes down to good, direct communication (our team will never stop talking about communication). It means being able to articulate value as well as prioritize long-term vs. short-term goals (again, easier said than done).
Teams will always have more projects and project requests than they can handle. More experienced team members need to take the lead and help their managers decide which projects are actually worth taking on. There is a fine balance between quick projects that might not have the highest ROI but have a good chance of succeeding and long term projects that are more likely to fail but also provide a large ROI.
In this case it is good to have a decision matrix of sorts to help simplify the process.
One of the classic decision matrices for projects is a 2 by 2 matrix with importance on one axis and urgency on the other. This matrix can be found in most college business courses, and it is really simple. That is why it is great!
I have worked at companies with really smart people. Yet every project was treated as a priority. If you haven’t heard the saying, we will say it here.
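A minimal sketch of such a matrix in code might look like the following. The project names and the quadrant labels are our own hypothetical shorthand, not a standard; the point is only that each project lands in exactly one of four cells:

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    important: bool
    urgent: bool


def quadrant(project):
    """Place a project in one cell of the 2 by 2 importance/urgency matrix."""
    if project.important and project.urgent:
        return "do now"
    if project.important:
        return "schedule"
    if project.urgent:
        return "delegate or time-box"
    return "drop or defer"


# Hypothetical backlog, for illustration only.
backlog = [
    Project("Fix broken revenue report", important=True, urgent=True),
    Project("Rebuild churn model", important=True, urgent=False),
    Project("Ad hoc chart for a meeting", important=False, urgent=True),
    Project("Try a new framework", important=False, urgent=False),
]

for p in backlog:
    print(f"{p.name}: {quadrant(p)}")
```

Forcing a yes or no answer on each axis is deliberate: it keeps the conversation with management short and makes it hard to label everything a priority.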
If everything is a priority then nothing is.
Choosing the right projects requires making hard calls. Not everything is a priority.
Many other companies have this problem. This is why it is important for the experienced members of data science teams to be able to clearly articulate which projects really should be done now and which can wait. The simple matrix helps with exactly that.
(Like we said in our last post, being concise is important. Using the matrix to help specify ROI will help.)
When there is concise and straightforward communication, projects continue to move forward and trust is built.
Being Able To Develop Robust And Optimal Systems
Making an algorithm or model that operates in a controlled environment is one thing. Integrating a robust model into a system that is live and deals with massive amounts of data is a whole other thing. Depending on the company, sometimes the data scientist will just have to develop the algorithm itself. Then either a developer or machine learning engineer will be responsible for putting it into production.
However, this is not always the case. Smaller companies and smaller teams might have the data science team put the code into production. This means the algorithm needs to be able to manage the data traffic at a reasonable speed. If your algorithm takes 3 hours to run and needs to be accessed live, it is not going into production. Thus, good system design and optimization are necessary.
As data grows and more and more people interact with a system, it is important that your model keeps up.
Data science is a complex field that requires an understanding of data, statistics, programming, and subject matter. In order to grow, data scientists need to be able to simplify and distill these complexities into algorithms. They need to be able to focus more on making design decisions. This helps them get the most out of their knowledge and experience.
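One simple way to check whether a model can keep up is to time it against a latency budget before it goes anywhere near production. The sketch below is only an illustration: `toy_predict` is a hypothetical stand-in for a real scoring function, and the budget is whatever response time your live system can tolerate.

```python
import time


def median_latency(predict, sample, runs=5):
    """Return the median wall-clock time of predict(sample) over several runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append(time.perf_counter() - start)
    timings.sort()
    # Middle element; exact median when runs is odd.
    return timings[len(timings) // 2]


def meets_budget(predict, sample, budget_seconds, runs=5):
    """True if the median latency fits within the budget."""
    return median_latency(predict, sample, runs) <= budget_seconds


# Hypothetical stand-in for a real model's scoring function.
def toy_predict(features):
    return sum(features) / len(features)


if __name__ == "__main__":
    ok = meets_budget(toy_predict, [1.0, 2.0, 3.0], budget_seconds=0.1)
    print("fits latency budget:", ok)
```

A check like this only measures a single call on one machine; real traffic, data volume and concurrency still need to be tested separately.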
Summary
Senior data experts provide the largest impact for both themselves and their companies when they go beyond their technical abilities. The value they bring to the table is their experience. It can guide younger developers toward better design decisions and help managers decide which projects will have the best ROI. In turn, this magnifies the impact of their involvement on the team.