Learning Objectives
By the end of this lesson you will be able to:
- Describe the main architectural patterns for ETL in a data migration
- Explain the trade-offs between different ETL tool categories
- Identify the architectural factors that are specific to migration vs. integration ETL
Migration vs. Integration ETL
ETL (Extract, Transform, Load) is used in both data integration (ongoing data flows between live systems) and data migration (one-time movement of data from legacy to target). The same tools are often used, but the requirements differ significantly:
| Dimension | Integration ETL | Migration ETL |
|---|---|---|
| Duration | Ongoing, indefinite | Time-limited (one-time or final run) |
| Volume | Regular, manageable batches | Often very large one-time volumes |
| Complexity | Typically simpler transforms | Complex historical data remediation |
| Error handling | Retry, queue, dead-letter | DQR-governed fallout |
| Auditability | Transaction-level | Data lineage tracing |
| Testing | Regression testing | Trial runs against real data |
Migration ETL must handle the complexity of historical data - inconsistencies, gaps, nulls, and formats that have accumulated over years of system use. Integration ETL typically works with clean, current data from a live system.
ETL Architectural Patterns
Batch extract with staging The most common pattern in PDM migrations. Data is extracted in batches from each legacy LDS into the staging area, transformed and validated, then loaded to the target. Each step can be inspected. Errors are caught before they reach the target.
Direct ETL (no staging) Legacy → transform → target in a single pipeline. Faster and simpler for small, clean datasets. Not recommended for large, complex migrations where auditability and validation are essential.
API-based extraction When direct database access is not available, data is extracted through the legacy system’s API. Slower, more constrained, but sometimes the only option for cloud-based legacy systems.
File-based extraction Legacy systems that cannot be directly accessed export data to files (CSV, XML, fixed-width). The migration team receives these files and processes them. The file format and extraction schedule must be agreed with the legacy system team.
Tool Categories
General-purpose ETL tools (e.g. Microsoft SSIS, Informatica, Talend) Designed for integration as well as migration. Graphical design interface. Strong transformation capabilities. Require licensing and specialist skills. Best suited to large, long-running programmes.
Database-native scripting (SQL, stored procedures) Often the fastest approach for simple, well-structured migrations where the staging area and target are on the same database platform. Low licensing cost. High skill dependency.
Purpose-built migration tools Tools designed specifically for migrating from one specific platform to another (e.g. ERP-to-ERP migration tools). Reduce design effort for standard scenarios. Less flexible for complex or non-standard migrations.
Scripting languages (Python, PowerShell) Increasingly used for smaller migrations and for data profiling and validation tasks. Low licensing cost. Flexible. Require careful version control and documentation.
Staging Area Technology
The staging area is typically implemented as:
- A relational database (SQL Server, PostgreSQL, Oracle) - most common; familiar tooling; good for profiling and transformation
- A data warehouse platform (Snowflake, BigQuery) - better for very large volumes; more expensive
- File-based (CSV files in a secure folder) - appropriate only for small, simple migrations
The staging database schema mirrors, at a minimum, the source structures needed for transformation. It should be under version control and have clear access controls.
Continuity Across Releases
Each ETL release in the PDM release cycle re-runs against the latest extracted data. The staging area must be managed across releases:
- Full refresh: Each release extracts and processes all in-scope records from scratch. Simple but expensive for large datasets.
- Incremental: Each release only processes records that have changed since the last extract. More efficient but requires change detection logic.
For most migrations, a full refresh is used for trial runs and the final migration run, with incremental processing only for interim development releases.
Key Takeaways
- Migration ETL differs from integration ETL in complexity, volume, auditability requirements, and error handling
- The standard PDM architecture uses a staging area between legacy and target
- Tool selection depends on scale, budget, team skills, and the complexity of the transformation work
- The staging area must be managed across releases, typically with full refresh for final runs
Book Reference
Practical Data Migration by Johny Morris (BCS, The Chartered Institute for IT):
- Chapter 12 - Migration Design and Execution