



This is the third in a series of articles, collaboratively written by data and solution architects at Myers-Holum, Inc., PRA Health Sciences, and Google, describing an architectural framework for conversions of data warehouses from Teradata to the Google Cloud Platform. The series will explore common architectural patterns for Teradata data warehouses and outline best-practice guidelines for porting these patterns to the GCP toolset.
David Chu, Knute Holum and Darius Kemeklis, Myers-Holum, Inc.
Michael Trolier, Ph.D., PRA Health Sciences
August 2018
Table of Contents
⇒ Capture versus land
⇒ What does a data warehouse look like on Google Cloud Platform?
⇒ Dust-off Your Semantic Model
⇒ Creating the Physical Schema
⇒ Example using our Travel Reservation Model
⇒ Semantic Model Example using our Travel Reservation Star Schema
⇒ Physical Model Example for Reservation_Payment_Fact
⇒ Physical Model Example for Room_Reservation_Fact
⇒ Additional Considerations
⇒ Wrap-up, what’s next
Recap: In the first article of this series, “Article 1: Understanding Your Current Data Warehouse”, we discussed source data capture as the first architectural layer of most data warehouses. We categorized the three types of source capture processes (pull, push or stream) and the typical Teradata implementation strategies for landing the data into the data warehouse for each type of process.
In the second article of this series, “Article 2: Source Data Capture”, we recommended the creation of a unified landing model, as an architectural component we called the “Landing Zone”. All source data capture activities target the Landing Zone with associated control processes managing the accessibility and lifespan of those slices of captured data. We discussed the detailed requirements you should consider in your implementation of the Landing Zone to facilitate your decisions on whether existing Teradata source data capture processes can truly be converted.
Conversion versus Re-implementation: As you read this series of articles, we are going to differentiate between approaches for converting your data warehouse from Teradata to the Google Cloud Platform versus re-implementing your data warehouse in Google Cloud Platform. Arbitrarily, we are going to assume certain differences between the two approaches. The conversion project will be assumed as justified based on an expense reduction/capital spend avoidance ROI model, and therefore the cost of the conversion must be constrained to fit within this cost reduction model. We would also expect the implementation timeline to be oriented toward a single deliverable, and timed such that the next Teradata infrastructure upgrade can be avoided.
On the other hand, the reimplementation project will be assumed as justified based on a business benefit model, with implementation (capital) costs factored into the internal rate of return for the project. We would also expect the implementation timeline to be phased into multiple deliverables based on the order of business initiatives you want to undertake.
To understand the scope of your Teradata conversion effort, you need to understand both the source and target requirements. Only with this knowledge, can you understand the data integration processes in between that will make up the bulk of your conversion effort.
Cloud BigQuery is Google’s recommended technology for implementing your data warehouse. BigQuery is a fully managed, petabyte-scale, low-cost enterprise data warehouse for business intelligence. It is probably one of the principal reasons you are considering a data warehouse conversion.
Migrating your Teradata data warehouse means that you will be instantiating your semantic logical data model into a new physical data model optimized for BigQuery. As you recall in Article 1, we discussed the importance understanding how the Teradata semantic layer has been implemented and the extent to which it is actually being used.
The migration from Teradata will require a bottom up approach, you will be implementing a series of BigQuery tables representing the semantic requirements of a given subject area. Subject areas will be integrated using common dimensional data (such as customer, product, supplier, organization or employee).
The bottom up approach will allow for an iterative conversion process. Subject areas can be implemented using an incremental approach, where you weigh subject areas that provide highest conversion benefit. For example, at our client “Company X”, the customer marketing subject area consumes 60 percent of the existing Teradata storage capacity. In addition, 14 of the top 25 most frequently used queries access this data. Therefore, migrating this subject area off Teradata not only provides BigQuery benefits to the users of customer marketing, but also to the remaining Teradata users utilizing different subject areas which now benefit from the new Teradata capacity now available.
You will need a semantic logical data model that represents the data presentation requirements for a given subject area before you can begin any BigQuery physical design. We recommend using a star schema (following Kimball standards) as the basis for your semantic model. We also recommend the use of a data modeling tool (ER/Studio, CA-ERWin, InfoSphere, etc.) to maintain it. The metadata created and maintained in the data modeling tool will become an important component of your overall data warehouse metadata strategy, which we will cover in a future article.
If you do not have an existing semantic logical data model, creating one in a conversion scenario is not that difficult. Remember you have an existing Teradata implementation you can analyze to create it. Here are some approaches for doing that:
As you create the star schema representing your semantic model, it is important to focus on the following objectives:
The physical schema represents how your end users and associated end user oriented tool sets will see the data warehouse in BigQuery. Implementing this schema requires the same compromise between ease of use (with potentially inefficient access) versus efficient data access (with potential difficulty to use) that we have all made many times before. Our physical design approach seeks to optimize ease of use for as many business use cases as possible but also recognizing that physical schema optimization for query performance and cost may be needed in certain situations.
BigQuery’s ability to support a huge data set is a clear advantage of the Google product and most of our clients have at least future plans to utilize this capability. However, it is important to realize that in a conversion scenario, typical data set sizes coming from Teradata will allow a degree of sub-optimal design while still meeting expected end user SLA’s and cost constraints.
BigQuery physical schema optimization revolves around two key capabilities:
Our design approach seeks to apply a consistent set of physical modeling rules to the star schema logical model, then perform testing using actual data, and then flatten tables further if justified. Here are the physical modeling rules we use:
Beyond physical modeling rules, it is important to consider the data integration requirements that will correspond to the physical schema being created. A single source to single target is the ideal (and easiest to implement) data integration model. You will find the more you create a big flat table, the more data sources you will need at data integration time to load it. And the complexities of ensuring all those source tables are in sync will often require sophisticated data staging approaches that will need to be coded and that will impact your end latency.
Let’s work through the logical and physical modeling process using a subset of our travel industry reservation model. Here is the entity relationship diagram for the model:

A travel reservation is a variant of a standard sales order. In the above model, colors are used to reflect the various levels of granularity within this model.
As we design our semantic layer, we identify the levels of granularity we want to present to our business users and the associated measures we want to incorporate at that grain. We represent these requirements using fact tables at the associated grain. In our example, we are declaring fact tables at the reservation, reservation payment, and reservation item levels. We also identify the supporting dimensions, their dimensional type and whether they are conformed dimensions in this semantic model.
Let’s look at a subset of the star schema model we designed:

In the above model, colors are used to reflect the type of entity. Let’s declare the grain of the two fact tables (shown in light blue):
Now let’s go through our physical modeling rules we discussed in “Creating the Physical Schema” above.
We end up with the following physical BigQuery table:

Using our physical modeling rules we discussed in “Creating the Physical Schema” above.
We end up with the following physical table:

We try to keep conforming dimensions as separate tables as much as possible. They are enterprise dimensions standardized across many subject areas. They are typically enhanced (more attributes) over time as more and more subject areas use them. As they are enhanced, every subject area using them should benefit, which will happen automatically if they remain as separate tables.
With BigQuery’s support for nested and repeating structures, we could have physically modeled this as a single nested table at the reservation grain, with nested structures for reservation item, room booking, booking detail and booking amount. Essentially, you are just implementing your operational model as a single nested table and you will end up with nesting within nesting due to the relational nature of the original data model. For this type of data, we find this nested/repeating structure more difficult to use for the business end user, and less compatible with many legacy BI tools implemented in your enterprise. While we certainly implement BigQuery’s support for nested and repeating structures for other types of data structures (for example unstructured or XML based), in a Teradata conversion scenario, the dominant source data structures will be relational. Therefore, we like the use the star schema fact table based approach.
In this third article of the series, we explored the details of how to design your data warehouse for Google Cloud BigQuery. Starting with a semantic logical model, we discussed logical data modeling techniques using a star schema. We then discussed how to apply a consistent set of physical modeling rules to the star schema in order to create a well performing physical schema for BigQuery. Finally, we used a specific subject area example (travel reservations) to further explain the techniques we use.
In the next (fourth) article of the series, we will start to discuss the data integration approaches that will be necessary to load the actual data warehouse. We will focus introduce the staging area and the typical data integration use cases that require it.


