Learning Objectives
By the end of this lesson you will be able to:
- Describe the components of the end-to-end migration architecture
- Explain the role of the staging area in migration design
- Identify the key design decisions that must be made before ETL build begins
📦 The DHGS Story - Episode 4: Designing the Kelsey Pit migration. Now the Kelsey Pit data gets its ETL design and a System Retirement Plan: what migrates, what’s archived, and the fallback if go-live fails. The “one-way-street” risks flagged back in Analysis shape the transformation rules here - the equipment codes Whit Bissell relied on have to be reconciled before they can land on the ABC platform.
Everything designed in this module gets built and run in Module 06.
The End-to-End Architecture
A PDM migration has three zones:
Legacy Systems → [Staging Area] → Target System
↑ ↑
Extract Load
└──── Transform ────────────────┘
Each zone has distinct characteristics:
Legacy Zone The source systems. The migration team’s access to this zone is constrained - legacy systems are often production systems that cannot be disrupted. Extract windows (the times when data can be safely extracted) must be defined. The legacy systems will continue operating during the migration period.
Staging Area An intermediate environment where extracted legacy data is held, transformed, and validated before loading to the target. The staging area is under the migration team’s control. It is where:
- Data profiling runs against actual extracted data
- DQR resolutions are applied as transformations
- Data lineage records are created
- Validation against target rules occurs before load
- Trial run data is held for review
The staging area is not a target system - it is a working environment. At go-live, its contents are the last verified extract of the legacy data.
Target Zone The new system receiving the migrated data. The migration team’s access here is also constrained - it is often the supplier’s environment during implementation. Load rules and sequencing must be agreed with the target system team.
Design Questions Before Build
Before the ETL build begins, the design must answer:
Extract design:
- What is the extraction window for each legacy system? (When can the system be safely accessed?)
- What is the extraction method? (Direct database access, API, file export, manual extract?)
- What is the data format from each source? (Character sets, date formats, numeric precision)
- How are incremental extracts handled? (Do subsequent releases extract all data or only changes?)
Staging design:
- What technology platform for the staging area?
- How is it secured? (Data classification, access controls)
- How long is extracted data retained?
- How are multiple releases managed? (Overlay, archive, or separate schemas?)
Transform design:
- Which transformations are applied in-flight vs. pre-load?
- How are look-up tables managed and versioned?
- How are parsing errors handled? (Error log, fallout, DQR trigger)
Load design:
- What is the target load sequence? (Dependencies between entity loads)
- What is the method for each entity? (Insert, update, insert-or-update)
- How are constraint violations handled?
- How is the load verified? (Record counts, checksum validation)
The Content Matrix
PDM uses an E2E Content Matrix to document the end-to-end design. This is a structured spreadsheet or table that captures, for each data entity:
- The source LDS
- The extract method and window
- The key transformations
- The target entity
- The load sequence and method
- Dependencies on other entities
The Content Matrix is the primary design specification. It evolves from the mapping documents in GAM and is the basis for the ETL build specification.
Key Design Trade-offs
Performance vs. control: Faster ETL (fewer transformation stages, bulk loads) reduces the project’s ability to inspect and validate intermediate states. PDM favours a staging area specifically to maintain control over the process.
Automation vs. visibility: Fully automated transformations are fast and repeatable, but errors are invisible until a validation step fails. The design should include validation checkpoints that surface errors early.
In-flight correction vs. source fix: Applying DQR resolutions in-flight (during extraction or load) is faster but creates One Way Streets. Fixing in the legacy source is slower but preserves auditability.
Key Takeaways
- The three-zone architecture (legacy → staging → target) is the standard PDM migration model
- The staging area is under the migration team’s control and is where profiling, transformation, and validation occur
- Design questions must be answered before ETL build: extract windows, access methods, staging technology, load sequence
- The Content Matrix documents the end-to-end design per entity
Book Reference
Practical Data Migration by Johny Morris (BCS, The Chartered Institute for IT):
- Chapter 12 - Migration Design and Execution