Building Custom ETL Tools When Necessitated by Massive Data Volumes


Overview

This blog shares some tips and tricks Comp Three employed while developing a custom ETL process for full and delta loads from a multi-billion-row transactional Product Engineering SQL database into a simplified data store suitable for BI tool connectivity, integrations with other downstream systems, or data migrations off the source system.

Problem

The complexity of the source database’s table relationships and the sheer volume of data proved too much for off-the-shelf data wrangling tools like Informatica. The client had already built custom ETL processes that reduced the full load time from several weeks (with the COTS data wrangling tools) to several days (with the custom solution). Comp Three’s task was to apply methodologies and optimizations that would cut the multi-day load down to a few hours and provide a fast, transactional delta load process.

A Few Useful Methodologies:

#1. Mapping Abstraction: For configurability as the source system’s data structures evolve, the mapping from source to target tables was kept in a mapping table, as an encrypted SQL statement complete with joins, etc. In our case, several highly normalized source tables typically feed a single target table. Example:

 INSERT INTO <target_db>.PART
 SELECT <source_db>.<column> AS <target_db>.<column>,
        <source_db>.<column> AS <target_db>.<column>, …
 FROM   <source_db>.PARTMASTER
 JOIN   <source_db>.PARTREV ON …
 JOIN   <source_db>.MFGDIVISION ON …
 etc.

The mapping table is then used to create corresponding views in the database for direct use during the load (e.g., INSERT INTO PART SELECT * FROM PART_VIEW). One great advantage of this technique is that the mapping queries can be tuned for performance, including Oracle hints, the fields used for joins, and the order in which tables are joined. Direct control of the SQL is always the most powerful option.
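For illustration, here is a minimal sketch of how such a mapping table might drive the view creation and load. The names (ETL_MAPPING, MAPPING_SQL, PART_VIEW-style suffixes) are hypothetical, the stored SQL is shown as just the SELECT portion of the mapping, and the encryption/decryption step is omitted:

 -- Hypothetical mapping table; the real schema and the encryption of the
 -- stored SQL are not shown.
 CREATE TABLE ETL_MAPPING (
   TARGET_TABLE VARCHAR2(128) PRIMARY KEY,
   MAPPING_SQL  CLOB NOT NULL   -- the SELECT, complete with joins and hints
 );

 -- Materialize each mapping as a view, then load the target table from it.
 BEGIN
   FOR m IN (SELECT target_table, mapping_sql FROM etl_mapping) LOOP
     EXECUTE IMMEDIATE 'CREATE OR REPLACE VIEW ' || m.target_table ||
                       '_VIEW AS ' || m.mapping_sql;
     EXECUTE IMMEDIATE 'INSERT INTO ' || m.target_table ||
                       ' SELECT * FROM ' || m.target_table || '_VIEW';
   END LOOP;
   COMMIT;
 END;
 /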

#2. Replication of source DB identity fields into the target DB: During delta loads and auditing (see below), it is useful to keep each source row’s identity somewhere the load process can easily use it; in our case, on the corresponding target db row. This makes it easy to determine which source rows are new (not yet in the target db) or have been deleted (still in the target db but no longer in the source db), and to quickly access source rows that have changed.
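For example, if the target Part table carries a hypothetical SRC_PART_ID column holding the source identity, new and deleted rows fall out of two simple anti-join queries (a sketch; all names are illustrative):

 -- New rows: present in the source view but not yet in the target table.
 SELECT v.src_part_id
 FROM   part_view v
 WHERE  NOT EXISTS (SELECT 1 FROM part t WHERE t.src_part_id = v.src_part_id);

 -- Deleted rows: present in the target table but no longer in the source.
 SELECT t.src_part_id
 FROM   part t
 WHERE  NOT EXISTS (SELECT 1 FROM part_view v WHERE v.src_part_id = t.src_part_id);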

#3. Logging and Performance Tuning: When dealing with billions of rows, ETL speed becomes a key objective. As the load transforms and loads table by table, some tables are invariably too slow during initial testing of the load process. Being able to log the actual queries being executed (in a “log mode”) allows a slow query to be pulled out, run directly in a UI tool like SQL Developer or TOAD, and tweaked for optimization. The optimized query then replaces the mapping SQL (referenced in #1, above).
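A rough sketch of what such a “log mode” might look like, assuming a hypothetical ETL_LOG table (the actual implementation is not shown in this post):

 -- Hypothetical log table used in 'log mode'.
 CREATE TABLE ETL_LOG (
   TARGET_TABLE VARCHAR2(128),
   SQL_TEXT     CLOB,
   START_TIME   TIMESTAMP,
   END_TIME     TIMESTAMP
 );

 -- Fragment of the load loop: record each statement and its timing so slow
 -- queries can be copied into SQL Developer / TOAD and tuned.
 DECLARE
   v_sql   CLOB      := 'INSERT INTO part SELECT * FROM part_view';
   v_start TIMESTAMP := SYSTIMESTAMP;
 BEGIN
   EXECUTE IMMEDIATE v_sql;
   INSERT INTO etl_log (target_table, sql_text, start_time, end_time)
   VALUES ('PART', v_sql, v_start, SYSTIMESTAMP);
   COMMIT;
 END;
 /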

#4. CREATE TABLE AS: For replacing the contents of an entire table during the full load process, CREATE TABLE AS proved much quicker, in Oracle, than TRUNCATE TABLE / INSERT INTO TABLE.
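As a sketch, the full-load path for a single table might look like the following (illustrative only; re-creating indexes, constraints, and grants after the rebuild is omitted):

 -- Rebuild the target table in one pass instead of TRUNCATE + INSERT.
 DROP TABLE part PURGE;
 CREATE TABLE part NOLOGGING AS
   SELECT * FROM part_view;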

#5. Detection of changed rows: Delta loads must determine which rows on the source database are new, deleted, or changed relative to the last load of the target database. For new and deleted rows, comparison of identities can be used. Changed rows may require examining timestamps, which, if used, should be indexed.
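For example, assuming the source rows expose an indexed LAST_MODIFIED_DATE and the load history is recorded as in the logging sketch above, changed rows could be found with a query along these lines (illustrative):

 -- Rows modified on the source since the last completed load of PART.
 SELECT v.src_part_id
 FROM   part_view v
 WHERE  v.last_modified_date >
        (SELECT MAX(end_time) FROM etl_log WHERE target_table = 'PART');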

#6. Auditing: Invariably, a user will disagree with the results of some report that is connected to the target database. A simple, UI-based auditing process that compares source rows to target db rows is a great sanity check; when this occurs, it allows the “target db is wrong” explanation to be quickly eliminated. The auditing tool should allow selection of specific target tables and can reuse much of the same logic/code as the delta and full load processes.
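Two queries of the kind such an auditing tool might run for a selected table, sketched with the same hypothetical names used above:

 -- Row-count sanity check for one target table.
 SELECT (SELECT COUNT(*) FROM part_view) AS source_rows,
        (SELECT COUNT(*) FROM part)      AS target_rows
 FROM   dual;

 -- Rows whose mapped column values no longer match between source and target.
 SELECT t.src_part_id
 FROM   part t
 JOIN   part_view v ON v.src_part_id = t.src_part_id
 WHERE  v.part_number <> t.part_number
    OR  v.revision    <> t.revision;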

Conclusion

Using these methodologies, an extremely complex, multi-billion-row transactional database was successfully ETL’ed into a report-ready, simplified BI instance, with an orders-of-magnitude performance improvement over COTS data wrangling tools. And with the mapping abstraction technique, the custom ETL tool is as easily maintained (by SQL professionals) as the COTS product.
