Realities of Being A Data Engineer - Migrations
Or really any engineer
In 1981, Indiana Jones gave us the classic scene where Indy tries to swap an idol for a bag of sand and, when that fails, ends up being chased by a boulder.
That’s what a migration is, or at least that’s how stressful it can be, and the consequences can be just as dire. No one is going to die, but it could cost your company millions.
So why risk it?
Why We Migrate
First, we need to answer why. Why invest what could end up being millions of dollars to switch from one system to another?
And why pause other work to take on a project that doesn’t directly add to your company’s bottom line? Here are a few major reasons.
Limitations Of Current Systems
By 2015 logging volume had increased to 500 billion events/day (1PB of data ingestion), up from 45 billion events/day in 2011. The existing logging infrastructure (a simple batch pipeline platform built with Chukwa, Hadoop, and Hive) was failing rapidly against the alarmingly increasing subscriber numbers every week. By estimation, we had about six months to evolve towards a streaming-first solution.
One of the major driving factors for a migration is that the current system can’t handle the ever-growing list of needs from end-users, whether that is a server that needs to handle more traffic or an analytics system that can no longer keep up by processing data in a batch pattern. The limitations of a current system can often push the need to migrate.
If you have used Salesforce NGOC, Python 2, or Chart.io, you know that products aren’t around forever. Eventually, either a company decides to stop supporting a solution or a product gets bought out by a larger company that gets rid of it.
All of this means your company now needs to replace the previous functionality. Thus another migration.
Another major driving factor for migrations is cost. For example, in the BI space, I have worked on a few migration projects away from Looker due to costs. This could be driven by the downsizing of a company or the right-sizing of analytics initiatives. When a company first shops for solutions, SaaS vendors offer all sorts of pricing strategies. In turn, what may have seemed like a good deal a year ago may have become untenable now.
Difficulty finding talent can also drive the need to migrate. As technologies fall out of favor, the talent to support them becomes more limited. In turn, companies are often forced to spend a lot of time hiring replacement employees. This can become very costly, and eventually the cost of migrating might become less than the cost of constantly trying to rehire.
Types Of Migrations
As a data engineer, you will find that migrations are inevitable. They are a harsh reality we face, and they go far beyond just migrating from one data warehouse to another.
There are multiple types of migrations. Some involve software, others hardware, and still others a combination of both. The more components and functional pieces that need to be replaced, the riskier the entire operation becomes. In addition, the type of migration can increase or decrease the risk depending on how mission-critical the component is.
Here are a few migration types I have been a part of.
Operations System Migrations - Migrations that involve software that is directly involved in operations. Think CRM, ERP, or internally developed services migrations where failing to migrate means you can’t operate.
Hardware Migrations - Hardware migrations have likely become less common as many companies now rely on the cloud to act as their hardware. In the past, IT departments would need to do a hardware migration any time a server ran out of space or there was a need for new network infrastructure. Now they can often just scale servers up and down as required in the cloud.
Cloud Migrations - Over the past decade, a common migration that companies of all sizes started to take on was the cloud migration: going from on-premise to the cloud, or from one cloud to another. A quick note on lift and shift: with the recent popularization of the cloud, a common approach is to simply lift and shift. We had x servers and firewalls set up like y, and now we will just do the same thing in the cloud. This misses opportunities to take advantage of some cloud-specific features.
Analytics Migrations - By analytics migrations I am referring to analytics solutions and systems like your data warehouse or BI layer. When migrating analytical systems, the risk isn’t as high as with operations migrations (of course, perhaps this is changing with more teams integrating operations with analytical data). If your team fails on an analytics migration, generally the largest issue you will run into is not knowing how many active customers you have. This will impact some decision making, but overall there are workarounds.
Data Migrations - Another common migration is switching databases or schema designs, often for performance reasons. If this is on an operational system, it poses a lot of risk, because if data is migrated incorrectly there can be data loss or corruption that could completely break the application itself.
The first phase of a migration is de-risking it, and to do so as quickly and cheaply as possible. Write a design document and shop it with the teams that you believe will have the hardest time migrating. Iterate. Shop it with teams who have atypical patterns and edge cases. Iterate. Test it against the next six to twelve months of roadmap. Iterate.
Like any project, migrations have risks, and one major goal of any migration team is to de-risk as much as possible. The image above is one view of how you can de-risk. Staged rollouts can de-risk by ensuring that each piece runs successfully prior to migrating the next dependency. However, they also often increase cost by increasing time. But let’s look at some other ways you can de-risk.
Data Loss And Corruption - A large risk many migrations face is data loss or corruption. This can be devastating for operational systems (analytics systems often have some form of recovery). One easy mistake to make when migrating is not correctly setting up IDs to be the same as before. This can occur when a developer leaves an identity column generating new values and re-inserts users in a different order, or with gaps in the original IDs. Now each user gets a new ID. Perhaps this is fine in the migrated system on its own. However, the other systems that are integrated with it may now have mismatched IDs.
Operations Risk - The risk of failing at an analytics migration is usually that you have bad numbers for a while. The risk of failing to properly migrate a component attached to operations is loss of sales and business functionality. It’s one thing to have to manually pull numbers into Excel when migrating analytics components; it’s another thing to no longer be able to integrate systems or automate operational activities that directly touch the bottom line.
Adoption - Even once a company has migrated services, there are long-tail issues such as adoption risk. For example, many times at Facebook a service might be migrated but the older service was left around. Guess what happened? People would continue to use the older service. Managing end-users’ expectations and moving them over 100% should be the eventual goal. If that’s not clearly defined, then users will continue to operate on the older services.
Attrition - There are several ways attrition risks occur. One of the major ones is an employee who plays a key role in the migration leaving mid-migration. This could derail the migration by months as you scramble to replace the individual and then get the new hire up to speed. This happened at one company I worked at, where there was only one DevOps individual and they left after the initial planning phase of the migration. This was devastating, and it took a significant amount of time to recover as the company looked to hire a replacement and get them plugged into the project.
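To make the ID mistake concrete, here is a minimal sketch using Python's built-in sqlite3 module. The users table, the sample rows, and the helper names migrate and mismatched_ids are all hypothetical and exist only for illustration. On SQL Server the equivalent of the "preserve" path would typically involve SET IDENTITY_INSERT, but the idea is the same on any database: insert the original key explicitly instead of letting the target generate a new one.

```python
import sqlite3

# Hypothetical legacy data: note the gap at id 2 (a user deleted long ago).
legacy_users = [(1, "ana"), (3, "bo"), (4, "cy")]


def migrate(rows, preserve_ids):
    """Copy rows into a fresh table and return the resulting (id, name) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    if preserve_ids:
        # Insert the original key explicitly so downstream references stay valid.
        conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", rows)
    else:
        # Let the database generate keys: this is the mistake described above.
        names_only = [(name,) for _, name in rows]
        conn.executemany("INSERT INTO users (name) VALUES (?)", names_only)
    result = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    conn.close()
    return result


def mismatched_ids(old_rows, new_rows):
    """Return the names whose id differs between the old and new systems."""
    old = {name: id_ for id_, name in old_rows}
    new = {name: id_ for id_, name in new_rows}
    return sorted(name for name in old if new.get(name) != old[name])


if __name__ == "__main__":
    print(mismatched_ids(legacy_users, migrate(legacy_users, preserve_ids=False)))
    print(mismatched_ids(legacy_users, migrate(legacy_users, preserve_ids=True)))
```

A check like mismatched_ids is cheap to run after every dry run, and it catches the problem before any integrated system starts looking up the wrong user.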
Enough talking about risks. Let’s start talking about the migration itself. I have seen all forms of migration approaches. Some made a lot of sense. Others made me cringe.
I have seen one team get pulled into a project that eventually turned into a migration. Instead of planning it out, they kept treating it like some form of agile MVP-style project. The problem with this is they never really defined an end-state of work nor did they truly understand everything the migration entailed from an operations standpoint.
They were focused purely on the data migration aspect and ignored the operations and service migration components.
You can hate on waterfall, but step zero, before starting the migration, is to admit you need a migration.
Team Planning And Mobilizing - Before starting, you should put together a clear team with clear responsibilities for your migration project. This team will likely be made up of a PM, business analysts, architects, engineers and implementers, end-users for UATs, and either designated testers or engineers who will double as testers. You will want to make sure they are all on board, and you will likely need some redundancy in case of attrition.
Prioritize The Migration - Migrations face the risk of not completing, leaving half-migrated modules and components that put a company in a far worse position than when it started. Thus, the migration itself needs to be prioritized to avoid any lingering or half-finished projects.
Get A Business Champion On Your Side - The technical teams deciding that a migration is necessary is not enough, because a migration impacts the business. It doesn’t just impact it through downtime. It also reduces the number of engineers who can work on new features, meaning that if the business has some strategy that relies on engineers, it will have to be put on pause. In turn, it is important to have a business champion on your side so that the migration doesn’t get derailed as it moves forward. This is somewhat related to prioritizing the migration, but I wanted to call it out clearly.
Outline The Current And Future State - When starting a migration project it’s important to understand where the current system is and what the future state will look like. Is there functionality that can be removed or added? Are there limitations that your team faced in the previous system that can now be avoided?
End-User Interviews - One of the most valuable resources when migrating is the people who work with the systems themselves. There is what the system looks like in terms of code and infrastructure, and there are the people who use said system. Perhaps they have developed workarounds to deal with some of the limitations of the current state, and maybe you can remove those limitations in the future. You do want to avoid adding too much extra functionality in the future state, because if you let everyone you interview dictate where the future state should be, you will never get there.
Functionality Analysis - One of the mistakes that get made when migrating is not fully understanding the current functionality that is being provided by the old system. It can be tempting to take a very cursory review of the system, and only migrate the top few layers. However, there are always hidden touch points and use cases that will take time to find.
Diagrams - Pictures are worth 1000 words and when migrating having diagrams will help the technical team and architects share what the future state will look like from multiple levels. In addition, these diagrams will be used to compare the progress of the implementation.
Design Reviews - Along with all the diagrams will come reviews where teams will analyze the future state being recommended and ensure that all of the design decisions will continue to support the functional needs.
Avoid Over-designing - Whenever a migration occurs, it can be tempting for every team to add in their perspective of what the new system should do. Yes, you should try to improve the overall workflows and system integrations. But you should probably avoid any gold-plating or additional functionality, as all of this just further increases the risk of the project failing.
Execution And Implementation
Get Buy-In Along The Way - You can’t just wait until everything is migrated to get buy-in from all the end-users. As the migration moves forward, make sure to have a few key end-users involved so that they are constantly buying into the changes being made (or at least understand why they are happening).
Iterate - Even with the best discovery phases there will always be gotchas and a need to iterate on what gets migrated and how. Your migration plan should cover this so that as you execute you can iterate when necessary.
Communicate Status Constantly - Letting others know the status of your migration is important. Simply staying quiet and not giving updates can make some people apathetic to the migration, and they might be harder to move over in the future.
Testing And Validation
Unit And Integration Testing - As with any technical project there will need to be unit and integration testing in order to ensure that whatever work is being done works as expected. What is great is that you do have a current system that likely operates in a way that is expected. This means your team should have a general idea of what it needs to test.
UAT (User Acceptance Testing) - Your team shouldn’t be migrating in a vacuum. Instead, you should involve key end-users and make sure they have defined UATs. During UAT, actual software users test the migrating system to make sure it can handle necessary tasks in real-world scenarios, according to specifications. UAT is one of the final and critical software project procedures that must occur before launching your migrated system.
Regression Testing - It’s not uncommon to need change requests or need to re-implement various bits of logic or modules throughout a migration. When these changes happen your team will want to be able to regression test to ensure all previous tests don’t fail.
Data Validation - As your team migrates, especially for data migrations, you will want to validate that everything has moved correctly, for example by comparing row counts and checksums between the old and new systems.
Staged Vs Big Bang - Finally, once you start getting components ready to migrate, the question becomes how. There are benefits both to staging out migrations and to switching over large parts of the system all at once. I like the quote that states “If you use the big bang approach all you’ll get is a big bang”. In theory, rolling out all at once is faster. In practice, it usually ends up being some form of staged rollout for various reasons. I also tend to prefer staged rollouts because that is where I have the most experience. In several projects, including one where I was leading a migration to a new data model, I created a clear roadmap with a different domain of the data schema being migrated every 2 weeks.
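A minimal sketch of what that validation could look like is below: compare row counts first, then an order-independent checksum of the contents. The function names table_fingerprint and validate_migration are made up for this example, and it assumes rows from both systems arrive as tuples with matching Python types (a Decimal on one side and a float on the other would register as a mismatch). For very large tables you would more likely push the counting and hashing down into the databases themselves.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    Per-row hashes are sorted before being combined, so row order does not matter.
    """
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest() for row in rows
    )
    combined = hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()
    return len(row_hashes), combined


def validate_migration(source_rows, target_rows):
    """Compare source and target rows; return a list of human-readable problems."""
    src_count, src_sum = table_fingerprint(source_rows)
    tgt_count, tgt_sum = table_fingerprint(target_rows)
    problems = []
    if src_count != tgt_count:
        problems.append(f"row count mismatch: {src_count} vs {tgt_count}")
    elif src_sum != tgt_sum:
        problems.append("checksum mismatch: same row count, different contents")
    return problems


if __name__ == "__main__":
    old = [(1, "ana"), (2, "bo")]
    new = [(2, "bo"), (1, "ana")]  # same data, different order
    print(validate_migration(old, new))
```

An empty list means the two sides match. Running a check like this per table, or per domain in a staged rollout, gives you a concrete go/no-go signal before moving on to the next piece.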
Overall Migration Tips
Create How-To Documentation - If other teams will also need to help migrate to a new system then create clear how-to guides for them. Otherwise you risk having inconsistent migration patterns. Remember, always be de-risking.
Create A List Of What Needs To Be Migrated - Similar to the note above, make sure you create a clear list of what needs to be migrated. At Facebook, most teams worked in a very centralized-decentralized way, meaning that one person or team would likely be in charge of the migration and would assign the work out to other teams. Facebook also had a data-mesh-like approach where each team managed its own set of tables. However, a team might end up relying on a different team’s tables, and that other team might decide it needs to migrate to a new pattern. The team making the change would be responsible for letting the teams using its tables know, but the teams using the tables would likely have to migrate their own work.
Automate The Process - Migrations generally involve very tedious work. For example, perhaps you will need to translate your Oracle PL/SQL scripts to MSSQL T-SQL. This will often require the same kind of copying, pasting, and replacing over and over again.
There are solutions that can help automate a good portion of this work. Truth be told, most of these solutions can get around 70-80% of your code migrated. The problem is that this is usually the easier 70-80% of the work. The remaining work will be the more difficult edge cases.
Plan For Dry Runs - Migrations shouldn’t just be a blind swap where one day you’re on system A and the next day you’re on system B. There are far too many issues and failures that can and will arise. Instead, your team should plan several dry runs where you run the entire system successfully prior to the real switchover.
Stress Test - One step that can easily get missed is stress testing your new system. If you are replacing a system that usually gets 10 million views a day, then you should find a way to mimic that. The last thing you want is for all your tests to pass in isolation, only for the system to blow up once it is put into real production.
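One way to keep that list actionable is to record who owns each table and what each table reads from, then walk the dependencies to see which teams need a heads-up when something changes. The sketch below is only an illustration: the table names, team names, and the teams_to_notify helper are all hypothetical, and a real inventory would more likely come from a data catalog or query logs than a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical inventory: table -> owning team, and table -> tables it reads from.
owners = {
    "orders": "commerce",
    "order_facts": "analytics",
    "churn_model": "data_science",
}
reads_from = {
    "order_facts": ["orders"],
    "churn_model": ["order_facts"],
}


def teams_to_notify(changed_table, owners, reads_from):
    """Return teams owning tables that depend, directly or indirectly, on changed_table."""
    # Invert the "reads from" edges so we can walk from a source to its consumers.
    consumers = defaultdict(list)
    for table, sources in reads_from.items():
        for source in sources:
            consumers[source].append(table)

    seen, stack = set(), [changed_table]
    while stack:
        for downstream in consumers[stack.pop()]:
            if downstream not in seen:
                seen.add(downstream)
                stack.append(downstream)

    # The owning team already knows about its own change, so leave it out.
    return sorted({owners[t] for t in seen} - {owners[changed_table]})


if __name__ == "__main__":
    print(teams_to_notify("orders", owners, reads_from))
```

Walking indirect dependencies matters here: in this made-up example the data science team never reads orders directly, but it would still be affected by a change to it.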
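To illustrate why the automated portion tends to be the easy part, here is a deliberately small sketch of a rule-based translator. It is not how any particular commercial tool works. The rule lists and the translate function are assumptions made for this example, and the substitutions shown (NVL to ISNULL, SYSDATE to GETDATE(), VARCHAR2 to VARCHAR) are only rough equivalents. Plain regular expressions will also happily rewrite text inside string literals and comments, which is one reason real tools parse the SQL instead.

```python
import re

# A few well-known Oracle -> T-SQL substitutions; a real project needs far more.
SIMPLE_RULES = [
    (re.compile(r"\bNVL\s*\(", re.IGNORECASE), "ISNULL("),
    (re.compile(r"\bSYSDATE\b", re.IGNORECASE), "GETDATE()"),
    (re.compile(r"\bVARCHAR2\b", re.IGNORECASE), "VARCHAR"),
]

# Constructs this sketch does not attempt to convert; they are flagged for a human.
NEEDS_REVIEW = [
    re.compile(r"\bCONNECT\s+BY\b", re.IGNORECASE),  # hierarchical queries
    re.compile(r"\bDECODE\s*\(", re.IGNORECASE),     # usually becomes a CASE expression
    re.compile(r"\(\+\)"),                           # old-style outer join marker
]


def translate(sql):
    """Apply the simple rules and report whether manual review is still needed."""
    needs_review = any(pattern.search(sql) for pattern in NEEDS_REVIEW)
    for pattern, replacement in SIMPLE_RULES:
        sql = pattern.sub(replacement, sql)
    return sql, needs_review


if __name__ == "__main__":
    print(translate("SELECT NVL(name, 'n/a'), SYSDATE FROM users"))
    print(translate("SELECT id FROM emp START WITH mgr IS NULL CONNECT BY PRIOR id = mgr"))
```

The first query converts cleanly. The second is returned unchanged with the review flag set, which is the kind of edge case that ends up in the difficult remainder of the work.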
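As a rough starting point, 10 million views a day works out to about 116 requests per second on average, and real traffic is rarely that evenly spread. The sketch below turns that arithmetic into a tiny load generator using only the Python standard library. The peak_factor default of 3.0, the function names, and the fake_handler stand-in for your system are all assumptions for illustration. A real stress test would usually use a dedicated load-testing tool pointed at a production-like environment.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SECONDS_PER_DAY = 24 * 60 * 60


def target_rate(daily_requests, peak_factor=3.0):
    """Requests per second to simulate, assuming peaks run peak_factor times the average."""
    return daily_requests / SECONDS_PER_DAY * peak_factor


def run_load(handler, total_requests, workers=8):
    """Call handler(i) total_requests times concurrently; return (failures, achieved_rps)."""

    def attempt(i):
        try:
            handler(i)
            return 0
        except Exception:
            return 1

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        failures = sum(pool.map(attempt, range(total_requests)))
    elapsed = time.perf_counter() - start
    achieved_rps = total_requests / elapsed if elapsed > 0 else float("inf")
    return failures, achieved_rps


def fake_handler(i):
    """Hypothetical stand-in for a call to the new system; fails on every 10th request."""
    if i % 10 == 0:
        raise RuntimeError("simulated failure")


if __name__ == "__main__":
    print(f"average rate: {target_rate(10_000_000, peak_factor=1.0):.1f} req/s")
    failures, rps = run_load(fake_handler, total_requests=1000)
    print(f"{failures} failures at roughly {rps:.0f} req/s")
```

Comparing the achieved rate and failure count against the target rate tells you whether the new system has headroom, or whether it only passes when tested in isolation.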
Pay Down Tech Debt - Migrations are a great time to review your current system and see how you can improve it. Your team can set standards because they should have a better big picture view of how the overall system being migrated has been built up over the years. Meaning you can see patterns and clear areas you can improve.
Migrating to new systems is technically difficult. However, the other major challenge companies run into when migrating is getting people to adopt the change.
This is where the term change management comes in.
In recent years, many change management gurus have focused on soft issues, such as culture, leadership, and motivation. Such elements are important for success, but managing these aspects alone isn’t sufficient to implement transformation projects. Soft factors don’t directly influence the outcomes of many change programs. - Harold L. Sirkin, Perry Keenan, and Alan Jackson
Simply announcing that the company will be changing over to a new ERP or data visualization tool is not sufficient.
First, if you only do this once, people will forget or leave and be replaced which will lead to confusion.
Second, not every user will want to migrate. Just like the adoption of new technologies and products in the real world, there tends to be a similar distribution inside companies. There are early adopters, the early majority, laggards, etc.
There are various reasons end-users don’t want to adopt new technology.
They have other priorities and don’t have time
They don’t want to spend time learning new things
They believe the previous way was better
Overall, each of these provides a different challenge when migrating. For example, when a company has limited employees and there is a need to migrate, who has time to do work that feels as if it isn’t drastically moving the needle?
It can feel far more productive to develop new features and functionality on an older technology instead of migrating to a new platform.
In this case, there needs to be buy-in from stakeholders who will align the migration with key company goals. That way, the engineers migrating are rewarded for their work and it's not just viewed as a task that needs to get done, but instead is viewed as an initiative that is driving value that the company has agreed is worth investing in.
Another key person to involve is a senior engineer who can drive the more technical side of adoption. As mentioned above, some engineers believe the previous solution was better or that they don’t have time to learn new technologies.
To mitigate this issue, it’s important to get a champion engineer on the side of the migration who will continue to drive the project forward through these issues.
This engineer may need to spend time 1:1 with some of the holdouts to see if there are any specific concerns. As someone who has been on all ends of what felt like endless migrations and enhancements, I have played both roles: the engineer who had to go team by team to answer questions, as well as the engineer who was constantly busy and didn’t want to have to migrate my pipelines.
Once your systems are ready to migrate the question becomes how. Does your team perform a sudden big bang switchover or a phased rollout? Both will have various issues and risks.
I prefer staged rollouts, but these take longer and are often more expensive. In addition, they increase the risk of not completing, as the lead of said migration could leave, teams that had agreed to migrate could be restructured, or dozens of other internal issues could arise.
But how you roll out your migration should be decided before starting.
There are a lot of ways migrations can go wrong throughout the process and only a few ways they can go right. The more your team de-risks, the more likely you are to succeed.
I would love to hear your tips for de-risking migration projects below or feel free to contact me if you’re looking to start a migration!
Video Of The Week: The Harsh Reality of Being a Data Engineer
Join My Data Engineering And Data Science Discord
Recently my YouTube channel went from 1.8k to 43k and my email newsletter has grown from 2k to well over 20k.
Hopefully we can see even more growth this year. But, until then, I have finally put together a Discord server. Currently, this is mostly a soft opening.
I want to see what people end up using this server for. How it is used will in turn play a role in what channels, categories and support are created in the future.
Articles Worth Reading
There are 20,000 new articles posted on Medium daily and that’s just Medium! I have spent a lot of time sifting through some of these articles as well as TechCrunch and company tech blogs and wanted to share some of my favorites!
Is Code the Best Way to Represent a Data Analysis?
Over the past 20 years, the principles of transparency and openness have not changed. But I have changed in the sense that I’ve found myself wanting more. While transparency has inherent value, I have found that it’s not exactly what people want when they see a data analysis. What they want is an answer to the question, “Is this data analysis any good?”
Looking at code usually does not tell me anything about the quality of the analysis because in the best case scenario, the code matches what was written in the description of the analysis, which is what I would expect. In my experience, most people get value out of the code (and the data) when they can go into the code, make modifications, and run alternate analyses on the data. They may be interested in the sensitivity to a certain parameter or certain assumption. These things cannot be evaluated without doing a new analysis that differs from the original published analysis.
Upgrading Data Warehouse Infrastructure at Airbnb
In this blog, we will introduce our motivations for upgrading our Data Warehouse Infrastructure to Spark 3 and Iceberg. We will briefly describe the current state of Airbnb data warehouse infrastructure and the challenges. We will then share our learnings from upgrading one critical production workload: event data ingestion. Finally, we will share the results and the lessons learned.
End Of Day 54
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, tech, and start-ups.
If you want to learn more, then sign up today. Feel free to sign up for no cost to keep getting these newsletters.