

A Simple, Fundable Program For Breaking Out Of ETL Prison


When it comes to ETL, I’ve come to identify two types of organizations. ETL masters engineer their code to a high standard. It’s live, working code that’s documented and built to evolve. ETL masters are a rare breed. More often, organizations are ETL prisoners to some extent, whether they want to admit it or not. These organizations may have hundreds of thousands or millions of lines of ETL code built up over generations of staff. When new data sources come in, no one wants to touch that code. There is no testing or reuse. As a result, new data often needs a new ETL program written from scratch, which adds to the mess.

So how do you break out of the ETL prison? There are two fundamental strategies. First, you could just suck it up and rewrite the massive hairball of ETL code, transforming it into a shiny, well-engineered system. Wouldn’t that be nice? (Cue record scratch.)

Nope. Not gonna happen. There are any number of things you can, and may want to, spend your budget on. Replacing a million lines of ETL code is not one of them. You might have the justification for it (badly engineered ETL can cause delays in time to market and restrict your ability to use the data you have, for example), but no one wants to spend that much money to essentially stand still. That said, at some point you have to do something about it. Must you remain confined by the limitations of how you have implemented ETL?

The other choice is to use an ETL system like Pentaho, Talend, or Redpoint, a data warehouse automation solution like Attunity, or one of many other relevant solutions such as Lavastorm, Informatica, MuleSoft, or CloverETL. All of them offer different escape routes from ETL with the same fundamental benefit: instead of replacing your million lines of code, you create models of the transformation, often using visual drag-and-drop interfaces that dramatically simplify the process.

[Disclosure: I have done work as a content marketer or IT analyst for Attunity, Pentaho, Talend, Redpoint, Lavastorm and other companies with ETL-related offerings.]

Being able to construct data transformations using visual modeling has a bunch of benefits:

  • You broaden the number of people who can create ETL programs. Now, anyone who can handle the visual interface, not just coders, can do ETL programming. This usually means a huge increase in the number of people who can help, depending on the data and tech literacy of your staff.
  • Once people have learned the system, coding should be faster.
  • The ETL programs created are usually more portable and can be deployed on a variety of platforms easily.
  • You reduce the technical debt created. Maintaining visual models is easier than maintaining code, so future costs for maintenance should be lower.
  • Often, the larger system comes with a workflow management system that allows you to schedule and monitor ETL jobs.

The biggest problem is that the visual modeling language may not fit your needs. There will always be transformations that are tricky or special. There must be a way to escape from the visual environment to code so you can handle these situations.
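
To make this concrete, here’s a minimal sketch in Python of what model-driven ETL looks like, including the escape hatch to code. The model format, step names, and the parse_legacy_date routine are all hypothetical illustrations, not any vendor’s actual format; real tools express the same ideas through their visual interfaces.

```python
from datetime import datetime

# Declarative model of a transformation: each step is data, not code,
# so it can be inspected, documented, and regenerated.
CUSTOMER_MODEL = [
    {"op": "rename",  "from": "cust_nm",  "to": "customer_name"},
    {"op": "default", "field": "country", "value": "US"},
    {"op": "custom",  "func": "parse_legacy_date"},  # escape hatch to code
]

def parse_legacy_date(row):
    """Hand-written logic for a case the model language can't express."""
    row["signup_date"] = datetime.strptime(row["signup_dt"], "%Y%m%d").date()
    return row

def run_model(model, rows):
    """Interpret the declared steps against a stream of rows."""
    for row in rows:
        for step in model:
            if step["op"] == "rename":
                row[step["to"]] = row.pop(step["from"])
            elif step["op"] == "default":
                row.setdefault(step["field"], step["value"])
            elif step["op"] == "custom":
                row = globals()[step["func"]](row)  # drop down to real code
        yield row

source = [{"cust_nm": "Acme Corp", "signup_dt": "20090314"}]
print(list(run_model(CUSTOMER_MODEL, source)))
```

The point is that the routine transformations are declared as data that anyone can read and maintain, while the one genuinely tricky step drops down to hand-written code.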

So far, so good, but we have to cue the record scratch again. The problem is the same. Just as there is no budget for a wholesale migration from a crappy body of code to a better body of code, there is also no budget for moving all your ETL from a mess of hand-crafted spaghetti code to a model-driven version of ETL.

So what is there budget for, and how do you escape?

Attunity’s Strategy for Breaking Free from ETL

Each of the ETL vendors has a different tactical plan for breaking free from ETL. In this article, I’m going to explain an approach based on Attunity’s capabilities because they have assembled, through a variety of acquisitions, a portfolio of products that support an interesting strategy. Note that you can use this strategy without buying Attunity. In many ways, the ideas embedded in enterprise products are often as valuable as the products themselves. I hope you find this strategy inspiring. In future articles, I will cover strategies based on ideas from other vendors or open source projects.

So, what can you get budget for? In my view, it is possible to fund a project with the following dimensions:

  • Goal one: We are going to improve our ETL capabilities so any data that arrives in the future can become useful much faster.
  • Goal two: We are going to identify the most used and fastest changing parts of the data provided by our current ETL and migrate those to the new platform.
  • Goal three: Over time, opportunistically, based on business need, as systems change and are decommissioned, we will gradually clean up our ETL mess and migrate the rest.

No need for a record scratch here. This sort of program can get a budget. But there are other problems to solve. That’s where some of Attunity’s capabilities become interesting.

The first step to replacing ETL code under the program described above is to understand what it is you want to migrate. You need to understand what data is used for what, and which ETL jobs support which systems. Attunity Visibility profiles usage of data across the broader ecosystem. You can find out which data in a database is accessed most often, which data changes most often, and how the structure of the data changes over time. This information helps categorize the ETL jobs in your portfolio. For example, you may have ETL jobs that were created over a decade ago to support a system that will soon be retired. Other jobs may be in containment, meaning that the systems they support are relatively stable and may or may not need new data associated with them. These two categories are unlikely candidates for migration. That leaves two remaining categories: jobs used for important systems that actively provide high value, and jobs for mission-critical systems.
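
Once you have that usage data, the categorization itself is mechanical. Here’s a hedged sketch of the idea in Python; the field names and thresholds are my own illustrative assumptions, not the actual output of Visibility or any other profiling tool.

```python
# Hypothetical sketch: bucket ETL jobs using usage-profiling data.
# Field names and thresholds are illustrative assumptions only.

def categorize_job(job):
    """Place a job in one of the four categories from the text."""
    if job["target_system_retiring"]:
        return "retiring"            # unlikely migration candidate
    if job["reads_per_day"] < 10 and job["schema_changes_per_year"] == 0:
        return "containment"         # stable, leave it alone for now
    if job["mission_critical"]:
        return "mission-critical"    # migrate with the most care
    return "high-value"              # active, good migration candidate

jobs = [
    {"name": "mainframe_extract", "target_system_retiring": True,
     "reads_per_day": 2,    "schema_changes_per_year": 0, "mission_critical": False},
    {"name": "orders_load",       "target_system_retiring": False,
     "reads_per_day": 5000, "schema_changes_per_year": 4, "mission_critical": True},
]

for job in jobs:
    print(job["name"], "->", categorize_job(job))
```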

Once you’ve identified the last two groups of ETL jobs, high-value and mission-critical, you can begin to think about how to replace legacy ETL jobs. That’s where Attunity Compose comes in. In contrast to many other ETL tools, Attunity Compose combines the modeling of the data to be transformed, the data warehouse that will contain it, and the data marts to be used for special purposes in one product. This approach is powerful because it solves the problem of creating the repository for the data, not just the work of transforming it.

The architecture of Attunity Compose takes an explicit side in a long-standing debate in the data warehouse community over how to automate the creation of data warehouses and data marts. The Kimball approach starts from specific business requirements and builds dimensional data marts that align the available data with those requirements. It’s much more difficult to share information using this approach, and it can result in the proliferation of many data marts. The Inmon approach, on the other hand, starts by modeling the environment and then building up a central data repository from which data marts are derived. This results in a more agile environment, enabling you to more easily respond to changing business requirements.

Attunity Compose applies the Inmon approach to data modeling automation. Compose allows both the data being moved into the data warehouse and the data warehouse itself to be modeled. The model represents an understanding of the data, taking that understanding out of people’s heads and out of the code. Compose also proposes mappings and provides an interface to modify those mappings where needed. Once this is done, you have the foundation for automated code generation: you can automatically create a database and the ETL code needed to populate it. Compose generates the physical ETL code on the fly, allowing you to manage multiple projects and very quickly provision them to different platforms or consumers.
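
The payoff of keeping the model explicit is that code generation becomes mechanical. Here’s a minimal sketch of the idea in Python, generating generic DDL and load SQL from one model; the model structure is a simplified assumption for illustration, not Compose’s internal representation.

```python
# Minimal sketch: generate warehouse DDL and load SQL from one model.
# The model structure below is a hypothetical illustration of the idea.

MODEL = {
    "table": "dw_customer",
    "source": "staging_customer",
    "columns": [
        # (warehouse column, type, source column)
        ("customer_id",   "INTEGER",     "cust_id"),
        ("customer_name", "VARCHAR(80)", "cust_nm"),
        ("country",       "CHAR(2)",     "country_cd"),
    ],
}

def generate_ddl(model):
    """Emit the CREATE TABLE statement for the warehouse table."""
    cols = ",\n  ".join(f"{name} {sqltype}" for name, sqltype, _ in model["columns"])
    return f"CREATE TABLE {model['table']} (\n  {cols}\n);"

def generate_load(model):
    """Emit the ETL statement that populates the table from staging."""
    targets = ", ".join(name for name, _, _ in model["columns"])
    sources = ", ".join(src for _, _, src in model["columns"])
    return (f"INSERT INTO {model['table']} ({targets})\n"
            f"SELECT {sources} FROM {model['source']};")

print(generate_ddl(MODEL))
print(generate_load(MODEL))
```

Because both the DDL and the load code come from the same model, a change to the model regenerates both consistently, which is what keeps maintenance costs down.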

Using Attunity Compose, a large global insurance company was able to reduce its ETL coding by 95%. The team had built 45 days into its schedule for ETL coding but was able to complete the work in 2 days using Compose. The last 5% consisted of complex business rules that couldn’t be automatically generated by the visual transformation language. For these rules, Attunity allows you to plug in pre- and post-processing callout routines around every step. And because Compose executes on the data warehouse itself, it uses SQL optimized for the platform you’re running on.
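
To illustrate the callout idea, here’s a hypothetical sketch in Python; the hook names and mechanism are my own illustration, not Attunity’s actual extension API.

```python
# Hypothetical sketch of pre/post-processing callouts wrapped around
# a generated ETL step; the hook mechanism here is illustrative only.

def run_step(step_name, step_fn, rows, pre=None, post=None):
    """Run a generated step, with optional hand-written callouts."""
    if pre:
        rows = pre(rows)       # e.g., apply a complex business rule
    rows = step_fn(rows)       # the generated transformation itself
    if post:
        rows = post(rows)      # e.g., audit or reconciliation logic
    return rows

# A business rule too complex for the visual language:
def exclude_test_accounts(rows):
    return [r for r in rows if not r["account"].startswith("TEST-")]

generated_step = lambda rows: rows  # stand-in for generated code
data = [{"account": "TEST-1"}, {"account": "ACME-7"}]
print(run_step("load_accounts", generated_step, data,
               pre=exclude_test_accounts))
```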

“We look at ETL coding more as potential quicksand than a prison,” said Kevin Petrie, Senior Director, Marketing at Attunity. “Whatever the analogy, we work with a lot of enterprises that need to free themselves from the cost and time commitment of manual ETL coding. We’ve architected our software to automate as many aspects of the process as possible, so IT can execute and iterate analytics projects in a more agile way.”

Compose also supports a variety of more advanced functions, like configuring the type of history logging you want on a specific field. Compose supports type 1 and type 2 fields for auditing and analytics. Type 1 fields are usually primary keys, and they don’t carry any history. Type 2 fields have history or can generate history. For example, if an employee moves, you may want to keep the history of that fact in the data warehouse. Instead of updating the record in place, you generate an insert with the new address in a type 2 field.
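
Here’s a minimal sketch of the employee-address example; the table and column names are hypothetical.

```python
# Hypothetical sketch of type 1 vs. type 2 handling for the
# employee-address example; table and column names are illustrative.

from datetime import date

warehouse = [
    {"emp_id": 42, "address": "12 Oak St",
     "valid_from": date(2010, 1, 1), "current": True},
]

def update_address(rows, emp_id, new_address, today):
    # Type 1 behavior would overwrite the address in place, discarding
    # history. Type 2 behavior, shown here, closes the old row and
    # inserts a new version instead.
    for row in rows:
        if row["emp_id"] == emp_id and row["current"]:
            row["current"] = False               # close the old version
    rows.append({"emp_id": emp_id, "address": new_address,
                 "valid_from": today, "current": True})

update_address(warehouse, 42, "98 Elm Ave", date(2016, 6, 1))
for row in warehouse:
    print(row)
```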

Type 1 and type 2 fields are also valuable for optimizing the physical data warehouse and data marts. The data warehouse design incorporates best practices, so the generated data warehouse will have a different structure from the model. Understanding those relationships helps build the normalized form of the data. Again, this normalized form is generated automatically. Data marts for specific purposes are also modeled, created automatically, and populated with model-driven ETL.

The third leg in Attunity’s architecture is Attunity Replicate, which is used to harvest data from a variety of external sources and deliver it to Compose, or to extract data from a database and deliver it somewhere else in a new form.

By easily building out a data warehouse model and automating the majority of ETL, Attunity Compose enables organizations to reduce the cost of change and break out of ETL prison. Once that’s done, you’ve achieved your goal: you’ve replaced ETL with a new system that allows you to move forward quickly with your high-value and mission-critical systems. Of course, this model-driven approach involves some form of commitment to the technology, but the switching costs from this model are lower than those of switching away from a lot of handwritten code. And, of course, there are other ways to do this. But this approach provides a coherent vision for how you can escape the ETL prison and is worth considering.

The biggest innovation in Attunity’s approach is the idea of combining the modeling of ETL with the modeling of the resulting database. WhereScape is the only other product I’ve seen that implements these techniques together.

But, whatever technology you use, this program for escaping the ETL prison is more likely to be funded and to provide business benefits than any other I’ve run across.


Dan Woods is on a mission to help people find the technology they need to succeed. Users of technology should visit CITO Research, a publication where early adopters find technology that matters. Vendors should visit Evolved Media for advice about how to find the right buyers. See list of Dan's clients on this page.

 
