For most of computing history, moving data from one system to another followed a logical sequence: pull it out, clean it up, then load it somewhere useful. That sequence — Extract, Transform, Load — made perfect sense when storage was expensive and compute was finite. Then cloud warehouses arrived and quietly made the "clean it first" assumption obsolete. Understanding why that happened, and which pattern to use in which situation, is one of the most practically useful things a data professional can learn in 2026.
The Classic ETL Model
ETL earned its reputation across four decades of enterprise data warehousing. The pipeline worked like this: a source system (an ERP, a CRM, a point-of-sale database) exports records on a schedule. A staging server — often a dedicated Windows machine running Informatica PowerCenter or Microsoft SSIS — receives that raw data, applies business rules, cleans nulls, standardises date formats, joins lookup tables, and emits a clean, structured dataset. Only then does that transformed data land in the warehouse, typically Oracle or SQL Server.
This transform-first architecture existed for a solid economic reason: warehouse storage and compute were expensive. Writing 50 GB of raw, messy data to a data warehouse when only 8 GB of it was analytically useful was genuinely wasteful. The staging server did the heavy lifting so the warehouse did not have to.
The tooling ecosystem that grew around ETL — Informatica, DataStage, Talend, SSIS — reflected this philosophy. These tools were powerful, heavily visual, and priced accordingly. A mid-sized enterprise might spend ₹15–20 lakh per year on Informatica licenses, plus the cost of dedicated staging infrastructure. For companies that could afford it, this worked well. For everyone else, the cost was prohibitive.
What Cloud Warehouses Changed
Snowflake launched in 2012. Google BigQuery had been available since 2010. Amazon Redshift arrived in 2012. What made these products genuinely different was not just that they lived in the cloud — it was their pricing and compute architecture.
BigQuery separates storage from compute entirely. You pay roughly $0.023 per GB stored per month, and $5 per TB of data scanned — which works out to approximately ₹0.50 per TB scanned at current exchange rates. Snowflake uses virtual warehouses that you spin up, run a transform job on, and suspend — paying only for the seconds of compute you use.
This created a stark economic comparison. Running an Informatica staging server on-premises: somewhere between ₹1.5–2 lakh per month once you factor in licensing, hardware, and the DBA who maintains it. Running a dbt transform job on BigQuery that scans 500 GB of raw data: roughly ₹25. The compute cost to transform inside the warehouse became so negligible that the entire rationale for transforming before loading collapsed.
If transformation is cheap inside the warehouse, why not load everything first — raw, unmodified — and transform it there? That inversion is what ELT is: Extract, Load, Transform. Load the raw data first. Transform it inside the warehouse using SQL that the warehouse executes at scale. The "staging server" becomes the warehouse itself.
ELT in Practice — A Real Walkthrough
Here is what a modern ELT pipeline looks like for a mid-sized e-commerce company, described as a text diagram:
Source Systems Ingestion Layer Cloud Warehouse Transform Layer
────────────────── ─────────────── ─────────────── ───────────────
Shopify Orders ───► raw.shopify_orders
PostgreSQL DB ───► Fivetran or ───────► raw.postgres_users ──► dbt models ──► analytics.fct_orders
Google Ads API ───► Airbyte raw.google_ads analytics.dim_customers
WhatsApp CRM ───► raw.whatsapp_chats analytics.marketing_spend
Fivetran or Airbyte handles the Extract and Load steps — pulling data from APIs and databases and landing it as raw tables in BigQuery or Snowflake. These tables are unmodified. A Shopify order lands exactly as Shopify emits it, null values, duplicate webhook events and all.
dbt then handles the Transform step inside the warehouse. A staging model (stg_orders.sql) renames columns, casts types, and removes duplicates. A mart model (fct_orders.sql) joins orders to customers, applies revenue attribution logic, and produces the clean table that feeds dashboards.
The practical advantage that often gets overlooked: because raw data is always present in the warehouse, you can re-run your transforms any time without re-extracting from the source. If your revenue attribution logic was wrong last quarter, you fix the dbt model and rebuild the entire history in minutes — from data you already have. With ETL, a wrong transform meant the bad data was loaded and the raw source data might be gone.
When ETL Still Wins
ELT is not universally superior. There are specific conditions where transforming before loading remains the right call.
Compliance and PII masking. If you are processing health records under HIPAA, or personal data under GDPR or India's DPDPA 2023, loading raw PII into a cloud warehouse — even briefly — may violate data handling requirements. ETL lets you mask, anonymise, or tokenise sensitive fields on a controlled staging server before any cloud system touches them. Many hospital information systems in Kerala that handle patient data follow this pattern precisely because the regulated data never reaches a third-party cloud service unmasked.
Legacy on-premises systems with strict firewalls. Some enterprise environments — government departments, defence contractors, legacy banking systems — do not permit outbound connections from their data sources to cloud services. An ETL staging server sitting inside the same network can pull data locally, transform it, and push only the clean, approved output to an external system.
Bandwidth-constrained sources. Factory IoT sensors on 2G or 3G connections in rural India generating continuous telemetry cannot reliably stream raw data to a cloud warehouse. Pre-aggregating at the edge — summarising 1,000 sensor readings per minute into one row per hour — and then loading the compressed result is exactly what ETL was designed for. The transform happens close to the source to reduce the data volume before it travels over the network.
The Modern Data Stack in 2026
The term "modern data stack" has become shorthand for a specific combination of tools that covers each step of ELT independently, rather than one monolithic platform doing everything.
Extract and Load (E + L): Fivetran (managed, paid, over 500 connectors), Airbyte (open-source, self-hostable, over 350 connectors). For teams with budget, Fivetran's zero-maintenance approach is worth the cost. For a bootstrapped team comfortable with Docker, Airbyte running on a ₹800/month DigitalOcean droplet gives you virtually the same connectivity for close to nothing.
Storage (L destination): Google BigQuery, Snowflake, Amazon Redshift, or Databricks SQL. BigQuery's free tier covers 10 GB of storage and 1 TB of queries per month — enough for an early-stage startup to run their entire analytics operation at zero cost before needing to spend a rupee.
Transform (T): dbt Core (open-source) or dbt Cloud (managed). This is where SQL models, tests, and documentation live. dbt has become so central to the modern data stack that knowing it is now a listed requirement on most data engineering job descriptions in India.
Reverse ETL — The Third Pattern
Once your warehouse contains clean, trusted data, there is a natural next question: how do you get it back into the operational systems where your team actually works? That is the problem Reverse ETL solves.
Tools like Hightouch and Census connect to your warehouse and push data outward — to your CRM, to Salesforce, to WhatsApp Business API, to marketing automation platforms. A common example: a warehouse model identifies customers who have not purchased in 90 days, and Hightouch automatically syncs that segment to a WhatsApp campaign list every morning.
For Indian sales and marketing teams, this pattern closes the loop between analytics and action. The analyst builds the segment in SQL; the sales team sees it in their WhatsApp dashboard without downloading any spreadsheet. Reverse ETL is what makes data warehouses operationally useful rather than just analytically interesting.
Choosing for Your Use Case
The decision is not ETL vs ELT as a philosophical stance — it is a function of your specific constraints.
| Factor | Lean ETL | Lean ELT |
|---|---|---|
| PII / regulated data | Mask before load | Warehouse-level access controls sufficient |
| Data volume | Under 10 GB/day, on-prem | Hundreds of GB/day, cloud-native |
| Network bandwidth | Constrained (IoT, 2G/3G) | Reliable broadband or cloud-to-cloud |
| Team skills | Python/Java engineers comfortable with ETL tools | SQL-fluent analysts comfortable with dbt |
| Reprocessing history | Painful — requires re-extraction | Trivial — raw data persists in warehouse |
| Compliance environment | HIPAA, DPDPA with strict data residency | General business analytics, GDPR with warehouse encryption |
Most data teams in India building from scratch in 2026 should default to ELT. Cloud warehouse costs have continued to fall. The tooling ecosystem is mature and well-documented. The SQL-first approach means analysts can own more of the pipeline without waiting for engineers. ETL remains the right choice when regulation or infrastructure genuinely demands it — not as a default.
Frequently Asked Questions
Can I do ELT with a traditional on-prem SQL Server?
Technically yes, but without the elastic compute of cloud warehouses, the Transform step becomes a bottleneck fast. When your SQL Server is also serving your transactional application, running heavy dbt-style transforms during business hours will compete for resources. On-prem ELT only makes practical sense with a dedicated high-memory transform server. Most on-prem shops are better off sticking with ETL on a staging server and keeping the warehouse for clean, analytics-ready data. The economic case for ELT depends specifically on cloud compute elasticity.
What does dbt actually do in an ELT pipeline?
dbt runs SQL SELECT statements that reshape raw tables into analytics-ready models — entirely inside the warehouse. It does not move data between systems, does not connect to source databases, and does not orchestrate the Extract or Load steps. Think of it as version-controlled SQL with built-in data quality tests and auto-generated documentation. You write a .sql file, dbt compiles and executes it in BigQuery or Snowflake, and the result materialises as a view or physical table in your warehouse. The ref() function lets models declare dependencies on each other, so dbt knows the correct execution order automatically.
For an Indian startup on a tight budget, which approach should I start with?
ELT with BigQuery free tier plus dbt Core (open-source) plus self-hosted Airbyte. Your total cloud cost is ₹0 until you exceed 1 TB of queries per month — that is BigQuery's free tier. The only real ongoing expense is a ₹800/month DigitalOcean droplet to run Airbyte and its Temporal orchestrator. This stack comfortably handles tens of millions of rows and dozens of data sources before you hit a cost that warrants attention. Many early-stage startups in Bangalore and Kochi run their entire analytics infrastructure on this combination through their Series A.