<evan.rosa/>

What is ETL in 2025? Moving Beyond Extract, Transform, Load


ETL isn’t dead—it just changed its wardrobe. In 2025, data pipelines are faster, smarter, and more flexible than ever. But to build them right, you need to understand the difference between ETL and ELT—and when to reach for each.

The Classic ETL Playbook

Let’s start at the beginning: Extract, Transform, Load. Once upon a time, this was the standard play in every data engineer’s playbook:

  1. Extract data from source systems—databases, APIs, logs, you name it.
  2. Transform the data in a staging area—cleaning, joining, shaping.
  3. Load the finished product into a warehouse for querying and dashboards.
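
To make this concrete, here’s a minimal sketch of a classic batch ETL job in Python: pandas for the staging transform, sqlite as a stand-in warehouse. The file, column, and table names are purely illustrative.

```python
# Minimal batch ETL sketch: extract from a CSV export, transform in a
# pandas "staging" step, then load the finished table into a warehouse.
import sqlite3

import pandas as pd

# 1. Extract: pull raw records from a source system (a CSV export here).
raw = pd.read_csv("orders_export.csv")  # hypothetical source file

# 2. Transform: clean and shape the data before it touches the warehouse.
clean = raw.dropna(subset=["order_id"]).copy()
clean["order_date"] = pd.to_datetime(clean["order_date"]).dt.strftime("%Y-%m-%d")
daily = (
    clean.groupby("order_date", as_index=False)["amount"]
         .sum()
         .rename(columns={"amount": "revenue"})
)

# 3. Load: write only the finished product into the warehouse.
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_revenue", conn, if_exists="replace", index=False)
```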

This approach worked well in the days of scheduled batch jobs and tightly controlled schemas. But the modern data stack doesn’t sit still long enough for that old-school rhythm.

ELT: Flipping the Flow

ELT (Extract, Load, Transform) turns the process inside-out. Now we load data before transformation—letting your warehouse handle the heavy lifting. It’s faster, cheaper, and more scalable for today’s cloud-native environments.
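
Here’s the same job flipped to ELT, again as a rough sketch with sqlite standing in for a warehouse like BigQuery or Snowflake: the raw data lands first, and SQL inside the warehouse does the shaping.

```python
# Minimal ELT sketch: load raw data into the warehouse untouched, then
# transform it in place with SQL. Table and column names are illustrative.
import sqlite3

import pandas as pd

raw = pd.read_csv("orders_export.csv")  # hypothetical source file

with sqlite3.connect("warehouse.db") as conn:
    # Load: land the raw, untransformed records in the warehouse as-is.
    raw.to_sql("raw_orders", conn, if_exists="replace", index=False)

    # Transform: let the warehouse engine do the heavy lifting in SQL.
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT date(order_date) AS order_date,
               SUM(amount)      AS revenue
        FROM raw_orders
        WHERE order_id IS NOT NULL
        GROUP BY date(order_date)
        """
    )
```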

| Category | ETL | ELT |
| --- | --- | --- |
| Transformation | Before loading | After loading (in-warehouse or lakehouse) |
| Flexibility | Hard to change post-build | Modular and analyst-friendly |
| Real-Time Support | Mostly batch | Streaming-friendly |
| Best Fit | Compliance-heavy or legacy systems | Cloud-native, modern stacks |
| Cost & Scale | Fixed compute costs | Elastic, pay-as-you-go compute |
| Tool Examples | Informatica, Talend, Airflow, Spark | dbt, SQLMesh, BigQuery, Iceberg, Snowflake |

Why the Shift?

Modern businesses don’t want yesterday’s reports tomorrow—they want insight in motion:

  • Real-time updates
  • Scalable pipelines
  • Change-tolerant schemas
  • Observability and lineage
  • Self-serve tooling for data consumers

Put simply, the old ETL model can’t keep up with today’s speed of business. Enter streaming-first, modular architectures designed to evolve—not break.

Batch vs Streaming: Know When to Use What

| Pipeline Type | Best For | Tooling Stack |
| --- | --- | --- |
| Batch | Nightly loads, BI reporting, warehousing | Airflow, Spark, SQLMesh, dbt, Pandas |
| Streaming | Event tracking, fraud detection, real-time alerts | Kafka, Flink, Spark Structured Streaming, Pulsar |
| Hybrid | Historical + real-time context | Kafka + Flink, Iceberg + SQLMesh, Airflow + dbt |

Streaming gives you speed. Batch gives you depth. Smart data systems use both—and know when to lean on each.
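
For a taste of the streaming side, here’s a minimal consumer sketch using the kafka-python client. The topic name, broker address, and event shape are assumptions for illustration.

```python
# Minimal streaming sketch: react to events as they arrive instead of
# waiting for a nightly batch. Requires `pip install kafka-python`.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user_events",                       # hypothetical topic
    bootstrap_servers="localhost:9092",  # hypothetical broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

for record in consumer:
    # Each record is handled the moment it lands on the topic.
    if record.value.get("type") == "purchase":
        print(f"purchase for {record.value['amount']} at offset {record.offset}")
```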

My Recommended 2025 Data Stack

When I build pipelines today, here’s what’s in my tool belt:

Data Ingestion

  • Kafka – battle-tested pub/sub for high-volume event streams
  • Apache Pulsar – flexible alternative with multi-tenancy and tiered storage
  • Fivetran / Airbyte (open-core) – connectors galore for quick batch loads
  • Custom Python loaders – when you need more control
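
On that last point, a custom loader can be as small as a paginated REST pull written out as newline-delimited JSON. A rough sketch follows; the endpoint and its pagination scheme are hypothetical.

```python
# Custom loader sketch: page through a REST API and write NDJSON that a
# downstream load step can pick up. Endpoint and params are made up.
import json

import requests

def extract(url: str, out_path: str, page_size: int = 500) -> None:
    page = 1
    with open(out_path, "w") as out:
        while True:
            resp = requests.get(
                url, params={"page": page, "per_page": page_size}, timeout=30
            )
            resp.raise_for_status()
            rows = resp.json()
            if not rows:
                break  # an empty page signals the end of the dataset
            for row in rows:
                out.write(json.dumps(row) + "\n")
            page += 1

extract("https://api.example.com/v1/orders", "orders.ndjson")
```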

Data Processing

  • Apache Flink – real-time stream transformations at scale
  • Apache Spark – batch processing powerhouse with rich APIs
  • SQLMesh – modular, version-controlled SQL transformations with CI/CD baked in
  • dbt – excellent for transformation logic inside warehouses
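
On the batch side, a Spark job for the daily-revenue example from earlier stays pleasantly small. Paths and column names here are illustrative, and it assumes a local PySpark install.

```python
# Batch transform sketch in PySpark: same daily-revenue shape as before,
# but distributed. Requires `pip install pyspark`.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_revenue").getOrCreate()

orders = spark.read.csv("orders_export.csv", header=True, inferSchema=True)

daily = (
    orders.where(F.col("order_id").isNotNull())
          .groupBy(F.to_date("order_date").alias("order_date"))
          .agg(F.sum("amount").alias("revenue"))
)

daily.write.mode("overwrite").parquet("warehouse/daily_revenue")
```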

Data Storage

  • Apache Iceberg – versioned, schema-evolving tables for lakes and lakehouses
  • BigQuery – serverless warehousing with blazing performance
  • Snowflake – cloud-native warehouse with secure data sharing
  • PostgreSQL – trusty relational DB for operational storage

Orchestration

  • Apache Airflow – robust DAG-based scheduling and orchestration
  • Dagster – opinionated, type-safe workflows with stronger developer ergonomics
  • Prefect – more dynamic orchestration with less boilerplate
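
For a flavor of DAG-based orchestration, here’s a skeletal Airflow DAG: one extract task feeding one transform task on a daily schedule. The task bodies are placeholders, and the `schedule` argument assumes Airflow 2.4+ (older 2.x releases use `schedule_interval`).

```python
# Skeletal Airflow DAG sketch: two placeholder tasks with an explicit
# dependency, run daily. DAG and task names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull from the source system

def transform():
    ...  # shape the data for the warehouse

with DAG(
    dag_id="daily_orders",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # `schedule_interval` on Airflow < 2.4
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # declare the dependency edge
```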

Monitoring & Observability

  • Great Expectations – test and validate your data with assertions and alerts
  • Datafold – regression testing and column-level lineage for transformations
  • DataHub – open metadata platform for lineage, ownership, and discovery
  • Prometheus + Grafana – metrics and dashboards for pipeline health
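
You don’t need a framework to start validating, either. Here’s a hand-rolled sketch of the kinds of checks Great Expectations or dbt tests express declaratively. This is plain pandas with illustrative column names, not the Great Expectations API.

```python
# Hand-rolled data quality checks: fail the pipeline loudly before bad
# data spreads downstream. Column and table names are illustrative.
import sqlite3

import pandas as pd

def validate(daily: pd.DataFrame) -> list[str]:
    failures = []
    if daily["order_date"].isna().any():
        failures.append("order_date contains nulls")
    if (daily["revenue"] < 0).any():
        failures.append("revenue contains negative values")
    if daily["order_date"].duplicated().any():
        failures.append("order_date is not unique")
    return failures

with sqlite3.connect("warehouse.db") as conn:
    daily = pd.read_sql("SELECT * FROM daily_revenue", conn)

problems = validate(daily)
if problems:
    raise ValueError("bad data, halt the pipeline: " + "; ".join(problems))
```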

Visualization

  • Apache Superset – open-source dashboards with SQL-friendly flexibility
  • Metabase – fast, self-service dashboards for cross-functional teams
  • Looker – semantic layer and governance with clean UX (if budget allows)

Choosing the Right Tool: It’s All About Fit

Think of this stack like a well-stocked toolbox. You don’t use a sledgehammer to tighten a bolt—and you don’t need Flink to parse a daily CSV.

Each tool listed here has its strengths, but I don’t believe in hammering every problem with the same wrench. Instead, I match tools to the problem space:

  • Need low-latency insights from a high-volume event stream? I’ll reach for Kafka + Flink.
  • Need quick batch transforms for a reporting layer? Airflow + SQLMesh does the trick.
  • Need stakeholder-friendly dashboards? Sometimes Metabase is the fastest win.

The point is: great data engineering isn’t about committing to one stack—it’s about knowing your tools and selecting the right one for the job.

Because in the end, a good data engineer doesn’t just build pipelines—they build systems that solve problems efficiently.

Best Practices for Modern ETL & ELT

Whether you’re ETLing, ELTing, or somewhere in between—here’s what I recommend:

  1. Design for failure
    Pipelines break. Make sure yours can recover gracefully. Think retries, timeouts, and dead-letter queues (see the sketch after this list).

  2. Test early and often
    Treat data like code. Use tools like Great Expectations or dbt tests to catch bad data before it spreads.

  3. Make everything modular
    Pipelines are living systems. Structure them in a way that’s easy to extend and refactor.

  4. Track lineage and versions
    Use Iceberg and Nessie to track changes to your data just like Git tracks your code. Debugging becomes a breeze.

  5. Empower end users
    Data should be self-serve. Build pipelines and dashboards that analysts love using—not just ones engineers can tolerate.
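
To ground the first point, here’s a tiny retry-with-backoff helper that parks records that keep failing in a dead-letter file instead of crashing the whole run. Names and policies are illustrative.

```python
# Design-for-failure sketch: retries with exponential backoff, plus a
# dead-letter file for records that fail every attempt.
import json
import time

def process_with_retry(record: dict, handler, max_attempts: int = 3) -> None:
    for attempt in range(1, max_attempts + 1):
        try:
            handler(record)
            return
        except Exception:
            if attempt == max_attempts:
                # Dead-letter queue: park the bad record for later
                # inspection instead of taking down the pipeline.
                with open("dead_letter.jsonl", "a") as dlq:
                    dlq.write(json.dumps(record) + "\n")
                return
            time.sleep(2 ** attempt)  # exponential backoff before retrying
```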

Real-World Impact: From Legacy to Lift-Off

At Digital Turbine, I refactored legacy batch pipelines that were brittle, slow, and expensive—migrating them to Spark on Airflow. The result?

  • Over $100K in annual cloud savings
  • Massive improvements in throughput
  • Faster feedback loops for data consumers

On the side, I’ve built hybrid streaming-first platforms combining Kafka, Flink, Iceberg, and SQLMesh to power systems with millions of daily events—and they’re still humming along.


Let’s Connect

If you’re building modern data systems—or dreaming of pipelines that don’t fall apart when change happens—let’s talk.

LinkedIn | Portfolio