Bauplan: A Serverless Lakehouse Without Vendor Lock-In
Simplifying data engineering with a Python-native data platform
This post was written in collaboration with the Bauplan team. The final wording and opinions are mine.
Iceberg has been a hot topic in the data community over the past couple of years. You’ve probably heard of it, or maybe even played around with it in a side project.
But how many people are actually using Iceberg in production?
We talk about it a lot, but few seem to know how to truly get started with it, or how to use it well. Many data professionals wonder how Iceberg might benefit them in their data projects.
But what if there were a solution that gives you all the benefits of Iceberg, without having to manage any infrastructure at all?
That’s where Bauplan comes in. It’s a modern data platform that fully embraces Iceberg and rethinks how data pipelines are built.
This article isn’t meant to dive deep into Bauplan’s technical internals (there’s already a great piece on that by another author). Instead, I’ll focus on the main features that make Bauplan unique, and why it might be useful in your data engineering workflows.
What is Bauplan?
Bauplan is a serverless data platform that makes it easy to build data pipelines in Python. You can use it to run transformations, analytics, or machine learning jobs directly on your cloud storage. You don’t need to manage infrastructure or learn complex tools. Bauplan takes care of things like containerization, runtime setup, schema changes, and data versioning so developers can focus on writing Python code.
Have you ever used serverless functions like AWS Lambda, Google Cloud Functions, or Azure Functions? Bauplan is basically serverless functions on steroids: each function acts like a small, stateless job that runs on demand and is automatically scaled and managed by the platform.
What Makes Bauplan Unique?
Let’s break down a few components that make Bauplan unique and useful:
Integration with Apache Iceberg
Data Branching
Serverless Execution
Declarative Pipeline Definition
Integration with Apache Iceberg
Bauplan natively supports Apache Iceberg, a modern open table format designed for large-scale analytics on object storage.
This means you get:
Schema evolution: Safely add/remove fields as your data evolves.
Time travel: Query past versions of your data.
Efficient partitioning and metadata: faster queries and leaner storage.
Because Iceberg works directly on object storage (like S3), Bauplan can orchestrate rich data workflows without needing a traditional data warehouse, and without locking you into a vendor.
Open formats like Iceberg help ensure your data is future-proof and portable.
Data Branching
One of Bauplan's standout features is data branching, inspired by Git workflows. It's Git-like version control for data, and what makes it especially powerful is that branching is zero-copy (thanks to Iceberg).
You can create branches of your data, each representing a different version or environment (e.g. dev, staging, production).
Like Git, you can commit changes to datasets, merge branches, and even revert to a previous state.
This makes it possible to test pipeline changes safely without affecting production data.
Data branching brings software development best practices to data engineering, unlocking better collaboration, reproducibility, and governance.
Because branching is zero-copy, no data is duplicated: a branch is just a set of metadata references to the original data, and changes only materialize as new files if and when you write to the branch. You can experiment, transform, or test pipelines in isolation, all without bloating your storage costs.
It's like version-controlling your datasets the way you version your code.
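In practice, a branch-based workflow with the CLI looks roughly like this (these are the same commands we'll use later in the walkthrough; branch names are placeholders):

bauplan branch create alice.experiment # create a zero-copy branch, no data duplicated
bauplan branch checkout alice.experiment # point your work at the branch
bauplan run # build or change tables in isolation
bauplan branch checkout main # switch back to main
bauplan branch merge alice.experiment # promote the changes once you're happy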
Serverless Execution
Bauplan runs on a Function-as-a-Service (FaaS) model. Every step in your data pipeline is a standalone Python function that runs in its own lightweight container. You don't need to worry about provisioning servers, scaling clusters, or managing runtimes.
When a pipeline runs, Bauplan automatically:
Spins up isolated compute environments on-demand
Optimizes runtime environments for the task (e.g. memory, packages)
Tears down containers after execution
This gives you the flexibility of distributed systems (each function runs independently), with the simplicity of local Python scripts, all without managing infrastructure.
I’ve said this before, but it’s worth repeating: think of Bauplan like AWS Lambda or Google Cloud Functions, but purpose-built for data pipelines.
Declarative Pipeline Definition
Pipelines in Bauplan are defined as functional DAGs (Directed Acyclic Graphs): each node in the graph is a Python function, and the edges represent data dependencies.
This structure enables:
Clear modular workflows: Each function does one thing and feeds into the next.
Automatic optimization: Bauplan can intelligently schedule and parallelize tasks.
Reusability and testability: Functions can be developed and tested independently.
The DAG model gives you a clear, maintainable blueprint of your data pipeline, one that's easy to reason about, extend, and debug.
You describe what needs to happen, and Bauplan handles how it happens.
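To make that concrete, here is a minimal sketch of a two-step DAG. The table names, columns, and exact decorator arguments are illustrative, based on the decorators we'll see later in the walkthrough:

import bauplan

@bauplan.model()
@bauplan.python('3.11')
def clean_orders(data=bauplan.Model('raw_orders')):
    # node 1: receive the raw table as an Arrow table and drop rows with nulls
    return data.drop_null()

@bauplan.model()
@bauplan.python('3.11', pip={'duckdb': '1.0.0'})
def order_totals(data=bauplan.Model('clean_orders')):
    # node 2: depends on clean_orders simply by declaring it as an input
    import duckdb
    return duckdb.sql(
        "select customer_id, sum(order_total) as total_spent from data group by 1"
    ).arrow()

Bauplan infers the edge between the two nodes from the bauplan.Model('clean_orders') input, so there's no separate orchestration file to maintain.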
Getting Started with Bauplan: A Mini Project with Jaffle Shop
Alright, enough talk, let’s get our hands dirty. You’ll get a much clearer sense of Bauplan once you see it in action.
If you’re a data or analytics engineer who works on data transformations, chances are you’ve used dbt at some point. In this project, we’ll use dbt alongside Bauplan to compare the two and give you a better idea of what Bauplan offers, and how it differs from tools like dbt.
The full code is available in this GitHub repo. The source data comes from dbt’s public repo. If you’d like to follow along, you’ll need to request early access to Bauplan.
In this project, we’ll:
Review the dbt project (dbt-duckdb)
Rebuild the same project using Bauplan (DuckDB + PyArrow)
As much as I love Polars (and you probably know I’m a big fan), I chose to stick with SQL and DuckDB for this project. The goal was to showcase how you can build SQL-based pipelines in Bauplan, and how that experience compares to using a leading tool like dbt.
Note: I’m using uv to manage dependencies. If you’re not familiar with it, feel free to ignore the uv run prefix in the CLI commands throughout this project. It’s not essential to follow along.
First things first, let’s do some setup:
# set up the project
git clone <repo-url> # clone the repo linked above
cd intro-to-bauplan # navigate to the project folder
uv venv # or python -m venv .venv without uv
source .venv/bin/activate
uv sync # or pip install -r requirements.txt
dbt
Now we run dbt models:
# dbt setup
cd dbt_bauplan
uv run dbt debug
uv run dbt deps # Install dbt packages
# run dbt
uv run dbt seed
uv run dbt run
Once you’ve successfully run these commands, you’ll see a dev.duckdb file in the directory:
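If you want to sanity-check the build, you can also query that file directly with DuckDB’s Python API (the mart table is customers; the schema may differ depending on your profiles.yml):

import duckdb

con = duckdb.connect("dev.duckdb")  # the database file dbt-duckdb just created
print(con.sql("select * from customers limit 5"))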
If you look at the data lineage, it looks like this:
Ultimately, this project builds models that feed into the customers mart table, which calculates some customer metrics.
Let’s also review the folder/file structure of the dbt project. There’s one SQL file per model, along with YAML files for documentation, data tests, and so on. It’s a very typical structure for a dbt project:
dbt_bauplan/
├── macros/
│ ├── .gitkeep
│ └── cents_to_dollars.sql
├── models/
│ ├── marts/
│ │ ├── customers.sql
│ │ ├── customers.yml
│ │ ├── order_items.sql
│ │ ├── order_items.yml
│ │ ├── orders.sql
│ │ └── orders.yml
│ └── staging/
│ ├── __sources.yml
│ ├── stg_customers.sql
│ ├── stg_customers.yml
│ ├── stg_order_items.sql
│ ├── stg_order_items.yml
│ ├── stg_orders.sql
│ ├── stg_orders.yml
│ ├── stg_products.sql
│ ├── stg_products.yml
│ ├── stg_supplies.sql
│ └── stg_supplies.yml
└── seeds/
    ├── .gitkeep
    ├── raw_customers.csv
    ├── raw_items.csv
    ├── raw_orders.csv
    ├── raw_products.csv
    └── raw_supplies.csv
Now let’s build this in Bauplan.
Bauplan
Project and Model Structure Overview
To get a clearer understanding of the project, let’s first look at the overall data flow and architecture of this Bauplan setup:
You have staging and mart tables, along with utils.py (utility functions) and data_tests.py (data audits). Bauplan provides the compute and manages the Iceberg tables in your S3 bucket.
Let’s compare the folder structure to see how it differs from dbt:
bauplan_pipeline/
├── ingest/ # just a placeholder
│ └── .gitkeep
└── transform/
├── bauplan_project.yml
├── create_tables.py # register S3 files in Bauplan as Iceberg
├── data_tests.py # data tests/audits
├── marts.py # mart tables
├── staging.py # staging/prep tables
└── utils.py # utility functions
Instead of having one SQL file per model like in dbt, I use two main Python files, marts.py and staging.py, to group models by transformation stage.
utils.py works like dbt macros: it holds reusable UDFs that I can import into model files as needed (yep, it’s just a regular Python project!). A sketch of what such a helper might look like follows the note below.
data_tests.py contains tests such as expect_column_no_nulls and expect_column_all_unique from bauplan.standard_expectations. These are similar to dbt’s data tests.
I’ve also included a create_tables.py script for registering the source files in S3 as Iceberg tables in Bauplan. This step can also be done using Bauplan’s CLI [1], but I wanted to show how flexible the Python-first approach can be.
Note: Bauplan doesn’t enforce a strict project structure. You have the flexibility to organize your code like a typical Python project. For example, while many official examples use models.py for defining models, I chose a different approach here to highlight that flexibility.
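As an example of what a utils.py helper can look like, here is a rough Python take on the dbt project’s cents_to_dollars macro (the actual implementation in the repo may differ):

import pyarrow as pa
import pyarrow.compute as pc

def cents_to_dollars(table: pa.Table, column: str) -> pa.Table:
    # convert an integer cents column into a float dollar amount, keeping the column name
    dollars = pc.divide(pc.cast(table[column], pa.float64()), 100.0)
    return table.set_column(table.schema.get_field_index(column), column, dollars)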
Let’s take a look at one of the functions that defines a Bauplan model:
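The snippet below is a sketch that mirrors the shape of that model rather than reproducing it exactly; the decorator arguments and the docstring contents are illustrative:

import bauplan

@bauplan.model()
@bauplan.python('3.11')
def stg_customers(data=bauplan.Model('raw_customers')):
    """
    metadata:
      description: staging model for customers, one row per customer
      owner: analytics
    """
    # the upstream table arrives as an Arrow table, and an Arrow table must be returned;
    # the real model does its renaming/cleanup here before returning
    return data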
A Bauplan model takes tabular data as input and returns tabular data as output. The bauplan decorators let you configure the model’s settings, such as:
Python version
Dependencies/Python packages
Materialization strategy
etc. (link to the docs) [2]
Kind of feels like a serverless function, doesn’t it? You define the Python version and dependencies for each function, just like you would with AWS Lambda or similar platforms.
Also, you might’ve noticed a few other things about this function/model:
It uses the bauplan.Model class to get upstream tables (raw_customers in this specific model).
It returns an Arrow table: a Bauplan model needs to return an Arrow table.
It has a docstring containing model metadata: at the moment, there is no official way to register model metadata the way dbt does, but you can take advantage of Python’s flexibility here.
Executing Bauplan Models
For the best experience, it's important to create a safe development environment where you can freely experiment and test your models. This is where data branching shines. With a simple CLI command like the one below, you can instantly create a new data branch:
uv run bauplan branch create yuki.sandbox
And run a few more CLI commands:
uv run bauplan branch checkout yuki.sandbox # switch to a data branch
uv run bauplan branch # list available data branches
As you can see, these CLI commands are very much like how you use Git!
Now, let’s run Bauplan models. You simply run a CLI command:
uv run bauplan run
You can also pass in parameters:
uv run bauplan run --namespace yuki # run and build models in a specific namespace
uv run bauplan run --dry-run # dry run: bauplan only operates in memory
Once your models have finished running, you’ll be able to query the table:
uv run bauplan query "select * from yuki.customers;"
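If you prefer Python over the CLI, the Bauplan SDK exposes an equivalent query call. The snippet below is a sketch from memory of the SDK docs, so treat the exact method name and arguments as assumptions:

import bauplan

client = bauplan.Client()
# assumed SDK call: run the same query against your branch and get an Arrow table back
result = client.query("select * from yuki.customers", ref="yuki.sandbox")
print(result.to_pandas().head())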
Data Tests in Bauplan
Data tests run when you execute models with bauplan run. If any data test fails, bauplan run halts the execution of the pipeline. This makes it easy to build your pipelines with the Write-Audit-Publish (WAP) pattern.
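For context, a data test in data_tests.py looks roughly like this; it’s a sketch built on the standard expectations mentioned earlier, and the @bauplan.expectation decorator usage is my assumption from Bauplan’s docs:

import bauplan
from bauplan.standard_expectations import expect_column_no_nulls

@bauplan.expectation()
@bauplan.python('3.11')
def test_customer_id_not_null(data=bauplan.Model('stg_customers')):
    # if this returns False, the run halts and nothing is published
    return expect_column_no_nulls(data, 'customer_id')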
Let’s change a data test to something that’ll 100% fail.
From this:
To this:
Now that the data test expects a customer_id column in stg_customers, let’s run the bauplan run command to execute the pipeline.
Bauplan caught the failed data test, and you can see exactly what went wrong. From here, you can start troubleshooting and fixing the issue.
The transactional pipelines in Bauplan are a unique approach that ensures data integrity across your transformations. When you run a pipeline, Bauplan creates a temporary branch, similar to a SQL transaction:
If all expectations pass, the temp branch is merged into your working branch.
If any expectation fails, the temp branch is NOT merged, leaving your branch unchanged.
This mechanism prevents issues like partial executions or corrupted tables, a common pain point with tools like dbt. Everything in Bauplan is treated as a transaction, ensuring atomic, consistent, and safe operations.
Deploying Changes into Production
So far, we’ve been working in a separate development branch. To deploy new models and changes to production, you simply switch to the main branch and run a single CLI command. It’s that straightforward.
uv run bauplan branch checkout main # switch to the main branch
uv run bauplan branch merge <your branch>
I’d suggest running another CLI command, bauplan branch diff main, before merging changes into the main branch. It’s very useful for checking exactly what’s changed between your branch and the main branch. Here’s an example:
Summary
Do you see how easily these CLI commands fit into your workflow? That’s one of Bauplan’s strengths. You can also run them using the Bauplan Python SDK. This flexibility makes it straightforward to integrate Bauplan into your CI/CD pipelines. For more details on available CLI commands, check out Bauplan’s documentation.
In this project, we used DuckDB in Python to create SQL pipelines. While Bauplan does support .sql files similar to dbt, there are currently some limitations, such as the inability to perform joins. For now, the most reliable and flexible way to build SQL pipelines in Bauplan is by using DuckDB within Python.
Bauplan vs. Traditional Data Platforms
Beyond all the features I’ve mentioned, the biggest differentiator of Bauplan is its simplicity. With Bauplan, you bring your own object storage (like S3), write Python functions, and that’s it. There’s no infrastructure to configure or manage; you get access to the power of the cloud instantly.
Because compute and storage are decoupled, and your workflows are just Python (or SQL) code, portability becomes a major advantage. You’re not locked into proprietary runtimes or storage formats, which makes it easier to migrate or evolve your stack down the line.
In today’s landscape, many vendors are actively designing platforms to lock customers in. But for the health of the data ecosystem, I believe the future lies in platforms like Bauplan, tools that are easy to adopt and just as easy to leave. That kind of openness benefits everyone.
Conclusion
Have I convinced you to give Bauplan a try yet? If so, go ahead and request access to their sandbox environment. You’ll be surprised to see how simple it is to build data pipelines using Iceberg in the cloud.
Feel free to reach out to me or the Bauplan team if you have any questions about the product!
[1] You can use CLI commands like these to create source Iceberg tables in Bauplan:
# create the table schema and register in the iceberg catalog
uv run bauplan table create --name raw_items --namespace yuki --search-uri 's3://alpha-hello-bauplan/yuki/raw_items.csv'
# load data into the table
uv run bauplan table import --name raw_items --namespace yuki --search-uri 's3://alpha-hello-bauplan/yuki/raw_items.csv'
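A Python version of the same step (which is what create_tables.py does) might look roughly like this; the client method names are my assumption, so double-check them against Bauplan’s SDK docs:

import bauplan

client = bauplan.Client()

# assumed SDK methods mirroring the CLI: create the Iceberg table, then load the data
client.create_table(
    table='raw_items',
    namespace='yuki',
    search_uri='s3://alpha-hello-bauplan/yuki/raw_items.csv',
)
client.import_data(
    table='raw_items',
    namespace='yuki',
    search_uri='s3://alpha-hello-bauplan/yuki/raw_items.csv',
)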
[2] Data contracts: you can define data contracts in your pipelines by simply declaring the expected input columns and the expected output columns of your models (an example below). You can enforce contracts by running Bauplan with the --strict flag.
@bauplan.model(columns=['column_a', 'column_b', 'column_c'])
@bauplan.python(...)
def my_model(data=bauplan.Model('upstream_table')):
    # hypothetical model name and upstream table, just to complete the snippet
    # only the declared columns are returned; --strict enforces the contract
    return data.select(['column_a', 'column_b', 'column_c'])