ETL Architecture

Checking your access…

Sign in to view this lesson

PDM Academy is open access with no payment required, but you need to be signed in so we can save your progress. Sign-in takes one click from your inbox.

Sign in to continue

Learning Objectives

By the end of this lesson you will be able to:

Describe the main architectural patterns for ETL in a data migration
Explain the trade-offs between different ETL tool categories
Identify the architectural factors that are specific to migration vs. integration ETL

Migration vs. Integration ETL

ETL (Extract, Transform, Load) is used in both data integration (ongoing data flows between live systems) and data migration (one-time movement of data from legacy to target). The same tools are often used, but the requirements differ significantly:

Dimension	Integration ETL	Migration ETL
Duration	Ongoing, indefinite	Time-limited (one-time or final run)
Volume	Regular, manageable batches	Often very large one-time volumes
Complexity	Typically simpler transforms	Complex historical data remediation
Error handling	Retry, queue, dead-letter	DQR-governed fallout
Auditability	Transaction-level	Data lineage tracing
Testing	Regression testing	Trial runs against real data

Migration ETL must handle the complexity of historical data - inconsistencies, gaps, nulls, and formats that have accumulated over years of system use. Integration ETL typically works with clean, current data from a live system.

ETL Architectural Patterns

Batch extract with staging The most common pattern in PDM migrations. Data is extracted in batches from each legacy LDS into the staging area, transformed and validated, then loaded to the target. Each step can be inspected. Errors are caught before they reach the target.

Direct ETL (no staging) Legacy → transform → target in a single pipeline. Faster and simpler for small, clean datasets. Not recommended for large, complex migrations where auditability and validation are essential.

API-based extraction When direct database access is not available, data is extracted through the legacy system’s API. Slower, more constrained, but sometimes the only option for cloud-based legacy systems.

File-based extraction Legacy systems that cannot be directly accessed export data to files (CSV, XML, fixed-width). The migration team receives these files and processes them. The file format and extraction schedule must be agreed with the legacy system team.

Tool Categories

General-purpose ETL tools (e.g. Microsoft SSIS, Informatica, Talend) Designed for integration as well as migration. Graphical design interface. Strong transformation capabilities. Require licensing and specialist skills. Best suited to large, long-running programmes.

Database-native scripting (SQL, stored procedures) Often the fastest approach for simple, well-structured migrations where the staging area and target are on the same database platform. Low licensing cost. High skill dependency.

Purpose-built migration tools Tools designed specifically for migrating from one specific platform to another (e.g. ERP-to-ERP migration tools). Reduce design effort for standard scenarios. Less flexible for complex or non-standard migrations.

Scripting languages (Python, PowerShell) Increasingly used for smaller migrations and for data profiling and validation tasks. Low licensing cost. Flexible. Require careful version control and documentation.

Staging Area Technology

The staging area is typically implemented as:

A relational database (SQL Server, PostgreSQL, Oracle) - most common; familiar tooling; good for profiling and transformation
A data warehouse platform (Snowflake, BigQuery) - better for very large volumes; more expensive
File-based (CSV files in a secure folder) - appropriate only for small, simple migrations

The staging database schema mirrors, at a minimum, the source structures needed for transformation. It should be under version control and have clear access controls.

Continuity Across Releases

Each ETL release in the PDM release cycle re-runs against the latest extracted data. The staging area must be managed across releases:

Full refresh: Each release extracts and processes all in-scope records from scratch. Simple but expensive for large datasets.
Incremental: Each release only processes records that have changed since the last extract. More efficient but requires change detection logic.

For most migrations, a full refresh is used for trial runs and the final migration run, with incremental processing only for interim development releases.

Key Takeaways

Migration ETL differs from integration ETL in complexity, volume, auditability requirements, and error handling
The standard PDM architecture uses a staging area between legacy and target
Tool selection depends on scale, budget, team skills, and the complexity of the transformation work
The staging area must be managed across releases, typically with full refresh for final runs

Book Reference

Practical Data Migration by Johny Morris (BCS, The Chartered Institute for IT):

Chapter 12 - Migration Design and Execution

Was this lesson helpful?