TransformPlan¶
A Python library for safe, reproducible data transformations with built-in auditing and validation.
TransformPlan tracks transformation history, validates operations against DataFrame schemas, and generates audit trails for data processing workflows.
Features¶
- Declarative transformations: Build transformation pipelines using method chaining
- Schema validation: Validate operations before execution with dry-run capability
- Audit trails: Generate complete audit protocols with deterministic DataFrame hashing
- Multi-backend support: Works with both Polars (primary) and Pandas DataFrames (see the sketch after this list)
- Serializable pipelines: Save and load transformation plans as JSON
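The sketch below illustrates the multi-backend point: the same plan is applied to a Polars and a Pandas DataFrame. It only uses `col_rename` and `process()` from the Quick Example below; the assumption that `process()` accepts either DataFrame type directly (and returns the matching type) follows from the feature list, not from a documented signature.

```python
import pandas as pd
import polars as pl

from transformplan import TransformPlan

# One plan, two backends: rename the same column on a Polars and a Pandas frame.
plan = TransformPlan().col_rename(column="PatientID", new_name="patient_id")

pl_df = pl.DataFrame({"PatientID": ["a1", "b2"]})
pd_df = pd.DataFrame({"PatientID": ["a1", "b2"]})

pl_result, _ = plan.process(pl_df)  # Polars in, Polars out (assumed)
pd_result, _ = plan.process(pd_df)  # Pandas in, Pandas out (assumed)
```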
Quick Example¶
```python
from transformplan import TransformPlan, Col

# Build readable pipelines with 70+ chainable operations
plan = (
    TransformPlan()
    # Standardize column names
    .col_rename(column="PatientID", new_name="patient_id")
    .col_rename(column="DOB", new_name="date_of_birth")
    .str_strip(column="patient_id")
    # Calculate derived values
    .dt_age_years(column="date_of_birth", new_column="age")
    .math_clamp(column="age", min_value=0, max_value=120)
    # Categorize patient age
    .map_discretize(column="age", bins=[18, 40, 65], labels=["young", "adult", "senior"], new_column="age_group")
    # Filter and clean
    .rows_filter(Col("age") >= 18)
    .rows_drop_nulls(columns=["patient_id", "age"])
    .col_drop(column="date_of_birth")
)
# Execute with schema validation to catch errors before they reach production
df_result, protocol = plan.process(df, validate=True)

# Serialize the pipeline to JSON so transformations can be version controlled
plan.to_json("patient_transform.json")

# Reload and reapply for reproducible results across environments
plan = TransformPlan.from_json("patient_transform.json")
df_result, protocol = plan.process(new_data)
```
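A quick way to see the reproducibility claim in action is to process the same input with the original plan and with the plan reloaded from JSON, then compare the results. This is a hedged sketch: it continues from the Quick Example above (so `plan`, `df`, and the saved JSON file are assumed to exist), relies on Polars' `DataFrame.equals` for the comparison, and assumes the protocol object can be printed to render a report like the one in the next section.

```python
# Continues from the Quick Example above: `plan`, `df`, and the saved JSON file exist.
result_a, protocol_a = plan.process(df)
result_b, protocol_b = TransformPlan.from_json("patient_transform.json").process(df)

# Polars DataFrames compare with .equals(); identical inputs and plans give identical outputs.
assert result_a.equals(result_b)

# Rendering the protocol (assumed to be printable) produces a report like the one below.
print(protocol_a)
```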
Full Audit Trail — Every Step Tracked and Hashed¶
```text
======================================================================
TRANSFORM PROTOCOL
======================================================================
Input:   1000 rows × 5 cols  [a4f8b2c1]
Output:   847 rows × 6 cols  [e7d3f9a2]
Total time: 0.0247s
----------------------------------------------------------------------
 #  Operation        Rows         Cols     Time     Hash
----------------------------------------------------------------------
 0  input            1000         5        -        a4f8b2c1
 1  col_rename       1000         5        0.0012s  b2e4a7f3
 2  col_rename       1000         5        0.0008s  c9d1e5b8
 3  str_strip        1000         5        0.0013s  c9d1e5b8 ○
 4  dt_age_years     1000         6 (+1)   0.0041s  d4f2c8a1
 5  math_clamp       1000         6        0.0015s  e1b7d3f9
 6  map_discretize   1000         7 (+1)   0.0028s  f8a4c2e6
 7  rows_filter       858 (-142)  7        0.0037s  a2e9f4b7
 8  rows_drop_nulls   847 (-11)   7        0.0019s  b5c1d8e3
 9  col_drop          847         6 (-1)   0.0006s  e7d3f9a2
======================================================================
○ = no effect (step 3 did not change the data)
```
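The 8-character hashes above come from TransformPlan's deterministic DataFrame hashing. The library's exact scheme is not shown here; the sketch below only illustrates the general idea (the same data and column order always produce the same short digest) and is not TransformPlan's implementation.

```python
import hashlib

import polars as pl

def short_hash(df: pl.DataFrame) -> str:
    """Illustrative 8-character digest over a DataFrame's contents and column order.

    Not TransformPlan's scheme; just one way to get a deterministic fingerprint.
    """
    payload = df.write_csv().encode("utf-8")  # same data + same schema -> same bytes
    return hashlib.sha256(payload).hexdigest()[:8]

df = pl.DataFrame({"patient_id": ["a1", "b2"], "age": [34, 57]})
print(short_hash(df))  # identical input always yields the identical digest
```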
Available Operations¶
| Category | Description | Examples |
|---|---|---|
| col_ | Column operations | col_rename, col_drop, col_cast, col_add, col_select |
| math_ | Arithmetic operations | math_add, math_multiply, math_clamp, math_round, math_abs |
| rows_ | Row filtering & reshaping | rows_filter, rows_drop_nulls, rows_sort, rows_unique, rows_pivot |
| str_ | String operations | str_lower, str_upper, str_strip, str_replace, str_split |
| dt_ | Datetime operations | dt_year, dt_month, dt_parse, dt_age_years, dt_diff_days |
| map_ | Value mapping | map_values, map_discretize, map_case, map_from_column |
| enc_ | Categorical encoding | enc_onehot, enc_ordinal, enc_label |
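As a rough illustration of how the categories combine, the sketch below chains string cleanup, value mapping, encoding, and row operations into one plan. Only the operation names come from the table; parameter names other than those shown in the Quick Example (`column`, `columns`, `new_name`) are assumptions.

```python
import polars as pl

from transformplan import TransformPlan, Col

df = pl.DataFrame({
    "Department": ["Cardiology ", "radiology", "Cardiology "],
    "status": ["A", "I", "A"],
})

plan = (
    TransformPlan()
    .col_rename(column="Department", new_name="department")
    .str_strip(column="department")
    .str_lower(column="department")
    # `mapping` is an assumed parameter name for map_values
    .map_values(column="status", mapping={"A": "active", "I": "inactive"})
    .rows_filter(Col("status") == "active")
    # `columns` is an assumed parameter name for enc_onehot
    .enc_onehot(columns=["department"])
)

df_out, protocol = plan.process(df, validate=True)
```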
Getting Started¶
- Installation - How to install TransformPlan
- Quickstart - Your first transformation pipeline
API Reference¶
- TransformPlan - Main class for building pipelines
- Filters - Filter expressions for row operations
- Protocol - Audit trail generation
- Chunked Processing - Process large files that exceed RAM
- Validation - Schema validation utilities