



This is the second in a series of articles, collaboratively written by data and solution architects at Myers-Holum, Inc., PRA Health Sciences, and Google, describing an architectural framework for conversions of data warehouses from Teradata to the Google Cloud Platform. The series will explore common architectural patterns for Teradata data warehouses and outline best-practice guidelines for porting these patterns to the GCP toolset.
David Chu, Knute Holum and Darius Kemeklis, Myers-Holum, Inc.
Michael Trolier, Ph.D., PRA Health Sciences
July 2018
Table of Contents
⇒ Capture versus land
⇒ Capture is a formal contract with the source system
⇒ Nuances of landing data
⇒ Checklist for evaluating your source data capture processes
⇒ ETL is still the most robust and platform portable tool for implementing your source data capture processes
⇒ Implementing the Landing Zone on Google Cloud Platform
⇒ Sample scenarios for Landing Zone implementations
⇒ Wrap-up, what’s next
Recap: In the first article of this series, “Article 1: Understanding Your Current Data Warehouse”, we discussed source data capture as the first architectural layer of most data warehouses. We categorized the three types of source capture processes (pull, push or stream) and the typical Teradata implementation strategies for landing the data into the data warehouse for each type of process.
We recommended conversion rather than re-implementation of your current source data capture processes as a very important aid to facilitate side-by-side validation of your converted data warehouse. We also recommended the creation of a unified landing model, as an architectural component we called the “Landing Zone”. All source data capture activities target the Landing Zone with associated control processes managing the accessibility and lifespan of those slices of captured data.
In this article, we will discuss detailed requirements you should consider in your implementation of the Landing Zone. An understanding of these requirements will facilitate your decisions on whether existing Teradata source data capture processes can truly be converted.
Conversion versus Re-implementation: As you read this article, we are going to differentiate between approaches for converting your data warehouse from Teradata to the Google Cloud Platform versus re-implementing your data warehouse in Google Cloud Platform. Arbitrarily, we are going to assume certain differences between the two approaches. The conversion project will be assumed as justified based on an expense reduction/capital spend avoidance ROI model, and therefore the cost of the conversion must be constrained to fit within this cost reduction model. We would also expect the implementation timeline to be oriented toward a single deliverable, and timed such that the next Teradata infrastructure upgrade can be avoided.
On the other hand, the re-implementation project will be assumed as justified based on a business benefit model, with implementation (capital) costs factored into the internal rate of return for the project. We would also expect the implementation timeline to be phased into multiple deliverables based on the order of business initiatives you want to undertake.
Your existing Teradata implementation first captures data from various sources of record, then lands the data either in some form of staging (flat files, relational tables or XML formats) or loads it directly to your data warehouse. Let’s discuss the capture and land processes separately.
This is important to remember. We doubt your original data warehouse implementers had rights to access any source data of their choosing without regard to operational system impact, data security constraints, or IT infrastructure considerations. They worked out a “contract” with the stakeholders of that source data repository, within which the source capture process operates today.
So in the analysis to determine the extent you can convert your existing source capture processes, it is important to separate the analysis of how well the contract meets the business need, from how well the actual source capture process is implemented. Over time you may have created a variety of methods in which you implemented the actual source capture (BTEQ to TPT to ETL). Clearly, you can define scope within your conversion process to standardize the implementation method without impacting the underlying contract it operates within. Saying this in another way, you may consider the modernization of your source capture processes while striving to deliver the same dataset as before.
It is when you decide the contract no longer meets the business need that you cross from conversion to re-implementation. If the majority of source data contracts no longer meet the business need, then you should be able to quantify those business requirements to justify the additional cost of a re-implementation, especially considering that source data capture changes will make automating side by side testing against the existing Teradata data warehouse much more difficult.
All relational databases, Teradata included, enforce data integrity through the use of pre-defined schemas and data definitions that include type, length, and nullability. Therefore, there is always a data quality component to landing captured source data into a relational database. Specific functionality may be added to your source data capture processes to handle source data that does not meet these minimal requirements. Below are some examples of how you may be handling this:
So in the analysis to determine the extent you can convert your existing source capture processes, it is important to determine which of those processes are depending on BAD files or error tables, either to stop the capture process entirely, or provide a mechanism to analyze and potentially correct and recycle the bad source data. Knowing which source capture processes rely on the data integrity provided by the relational database is needed to best understand which Google Cloud Platform product can supply similar capabilities.
You can use the following checklist to help evaluate the extent that conversion rather than re-implementation of your current source data capture processes will be possible:
The more you have used ETL to implement your existing source data capture processes for loading your Teradata data warehouse, the easier your conversion will be. If you have Teradata specific source data load processes you need to modernize, you should consider re-implementing them in ETL.
It is important to note the above comment applies to the source data capture processes only in a conversion type project. In a future article, we will discuss ETL in comparison to other approaches like Google Cloud Dataproc or Cloud Dataflow.
First, let’s show the logical design for the Landing Zone. This is the portion that reflects the final target (persisted data) of the source data capture process. This is the component that will be implemented in Google Cloud Platform.

Within the above diagram:
Now let’s discuss the key rules for implementing the Landing Zone as the final target in your source data capture processes:
Now that you captured the source data and know how to move it to Google Cloud (ETL is still the most robust and platform portable tool), the next question is where exactly to actually store this data. The Landing Zone can be instantiated in one or more Google Cloud products. Typical requirements that help identify which Google Cloud product to use are:
Google provides a high-level overview of the Google Cloud storage options on this page. Let’s review in more detail various requirements and how they map to Google Cloud Storage and Database technologies as they apply to the Landing Zone.
Unstructured data formats that might be loaded into the Landing Zone are JSON, XML, log/text files, or binary formats (images, videos). Cloud Storage is just a file store, so any type of data can be stored and it likely is the best option for storing truly unstructured data. Cloud Spanner also has capabilities to load binary data into columns. Cloud BigQuery has capabilities to load json files into a column to be later processed by using either regex or XPath expressions. It is important to realize you want the unstructured data to be landed (as raw data), not parsed or interpreted into a different format.
Sometimes source data systems change their data format very frequently. As an example, consider a web application log file where new data elements are frequently added (or removed) as application functionality is changed. In this situation, it is a maintenance burden to have to make a schema change every time the underlying source data format changes. Cloud Storage is just a file store; it is agnostic to the file format being stored. Cloud BigQuery has the capability to adjust table format on the fly based on the new fields appearing in the data extracts. Cloud BigTable can accommodate a virtually unlimited number of columns, does not require predefined schema before writing, and can accommodate changing data formats easily. However, Cloud Spanner and Cloud SQL require implicit DDL execution on existing tables to accommodate new data formats.
Cloud Spanner and Cloud SQL have pre-defined schemas with strong data types. Cloud BigTable uses key/value pairs with no support for data typing. Cloud BigQuery supports either explicit data typing where table schema is predefined up front, or implicit data typing where it relies on the source format for supplying the data type.
Cloud Spanner, Cloud SQL and Cloud BigQuery support all the typical ANSI SQL join types (inner, outer). Cloud BigTable does not support joins, you would have to combine multiple sources using the API or using ETL look-ups. Cloud Storage is just a file store so no joins are supported unless data retrieval is federated through Cloud BigQuery. Note that there are certain performance implications on performing join operations on massive tables that can be improved through schema design.
Cloud Storage has lifecycle configuration capabilities to move data from Nearline to Coldline storage to reduce storage fees, or files can be deleted altogether by a custom process. Cloud BigQuery table partitions, if not accessed after 90 days, will result in reduced storage fees, or can be dropped altogether by a custom process. Data archiving and purging can be easily scripted using command line tools to either delete Cloud Storage folders, or Cloud BigQuery table partitions from a specific date range. Cloud BigTable tables and column families can have TTL (time to live) semantics configured to store values for a specific amount of time only after which those values are automatically purged opportunistically by the background process. Cloud Spanner and Cloud SQL have row delete syntax that can be executed for a subset of rows to be deleted. Remember to favor the technologies that support archive or remove at the most effective Google Cloud pricing model.
You need to consider which Google Cloud Storage technologies are supported by the toolsets you plan to utilize to implement your data load processes, initially as a target and subsequently as a source to transform and load your data into your Cloud BigQuery data warehouse.
In the Teradata conversion scenario, we have already recommended ETL as the most robust approach for implementing source data capture. But as we move on to loading your Cloud BigQuery data warehouse, the good news is that there is a wide variety of both Google and third-party toolsets, and both ETL and ELT style processing can be used. Make sure your desired ETL and/or ELT toolset, such as Informatica, Talend, Snaplogic, Cloud DataFlow, supports your Landing Zone as a source and Cloud BigQuery as a target. In a future article we will discuss ETL/ELT in comparison to other approaches like Cloud Dataproc or Cloud Dataflow.
If you plan on using the Landing Zone as a component of your data lake, consider the ability of the underlying Google Cloud storage product to support your data mining toolsets.
Google DataPrep, Google DataStudio, and Google Datalab provide robust mining capabilities and can connect to all Google Cloud storage products.
Cloud BigQuery can directly access Cloud BigTable and Cloud Storage to allow a federated query of data in the Landing Zone with your data warehouse in Cloud BigQuery. This is a frequent data mining need.
Numerous third party products such as Looker, iCharts, Tableau provide various degrees of connectivity to various Google Cloud storage products and are evolving rapidly through 2017.
In a Teradata conversion scenario, the ability to migrate from a managed MPP to a serverless environment may be a significant factor in your conversion ROI, at least for the data warehouse. If the requirement also applies to the Landing Zone, then consider that Cloud Storage and Cloud BigQuery are completely server-less no-ops environments. There are no compute instances or nodes to manage and they are “hands-off” scalable. Cloud BigTable and Cloud Spanner require you to manage number of virtual nodes (note: these are NOT virtual machines/compute instances). Node scaling can be automated, but still requires an explicit effort to monitor and execute resizing when needed. Cloud SQL has higher infrastructure management needs as you are managing an instance of either MySQL or PostgreSQL, just in the Google Cloud.
Now that we explored various conversion related requirements for the Cloud storage products and reviewed various Google Cloud product capabilities in this regard, let’s look at two specific examples.
In this second article of the series we explored the details of implementing the Landing Zone:
In the next (third) article of the series, we will start to discuss the data warehouse layer in the reference architecture. We will focus on the some of the early data modeling decisions you will need to make, as well as use cases that will require a staging area prior to loading the actual data warehouse.


