Protocol
The Protocol class captures transformation history for auditability and reproducibility.
Overview
When you process data with a TransformPlan, you receive both the transformed DataFrame and a Protocol object. The protocol contains:
- Input/output DataFrame hashes for verification
- Step-by-step operation details
- Shape changes and timing information
- Optional metadata
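import polars as pl
from transformplan import TransformPlan
# Illustrative input (assumes Polars); any DataFrame with "temp" and "price" columns works
df = pl.DataFrame({"temp": [1, 2], "price": [10.0, 20.0]})
# Build the plan, then run it to get the result together with its protocol
plan = TransformPlan().col_drop("temp").math_multiply("price", 1.1)
df_result, protocol = plan.process(df)
# View the protocol
protocol.print()
# Save for audit
protocol.to_json("audit_trail.json")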
Protocol Class
Protocol
Captures the transformation history for auditability.
Initialize an empty Protocol.
input_hash
property
Hash of the input DataFrame.
Returns:
| Type | Description |
|---|---|
| str \| None | Hash string or None if not set. |
output_hash
property
Hash of the final output DataFrame.
Returns:
| Type | Description |
|---|---|
| str \| None | Hash string or None if no steps. |
metadata
property
Protocol metadata.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary of metadata. |
set_input
set_metadata
Set arbitrary metadata on the protocol.
Example
protocol.set_metadata(author="alice", project="analysis-v2")
add_step
add_step(
operation: str,
params: dict[str, Any],
old_shape: tuple[int, int],
new_shape: tuple[int, int],
elapsed: float,
output_hash: str,
) -> None
Record a transformation step in the protocol.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| operation | str | Name of the operation. | required |
| params | dict[str, Any] | Operation parameters. | required |
| old_shape | tuple[int, int] | Shape before operation (rows, cols). | required |
| new_shape | tuple[int, int] | Shape after operation (rows, cols). | required |
| elapsed | float | Time taken in seconds. | required |
| output_hash | str | Hash of the output DataFrame. | required |
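To make the bookkeeping concrete, the sketch below shows a toy recorder built only on the standard library. `MiniProtocol` and its internal record layout are hypothetical and are not the transformplan implementation. Only the `add_step` argument names and the "None if no steps" behaviour of `output_hash` come from the reference above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class MiniProtocol:
    """Hypothetical stand-in for illustration; NOT the real Protocol class."""

    steps: list[dict[str, Any]] = field(default_factory=list)

    def add_step(
        self,
        operation: str,
        params: dict[str, Any],
        old_shape: tuple[int, int],
        new_shape: tuple[int, int],
        elapsed: float,
        output_hash: str,
    ) -> None:
        # Store one record per transformation, in execution order.
        self.steps.append(
            {
                "operation": operation,
                "params": params,
                "old_shape": old_shape,
                "new_shape": new_shape,
                "elapsed": elapsed,
                "output_hash": output_hash,
            }
        )

    @property
    def output_hash(self) -> str | None:
        # The final output hash is the hash recorded by the last step, if any.
        return self.steps[-1]["output_hash"] if self.steps else None


log = MiniProtocol()
empty_hash = log.output_hash  # no steps recorded yet
log.add_step("col_drop", {"column": "temp"}, (1000, 5), (1000, 4), 0.0012, "b2c3d4e5f6a7b8c9")
```

Keeping each step as a flat record is one possible design; it makes later export to a table or to JSON straightforward.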
to_dataframe
Convert protocol to a Polars DataFrame.
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with step information. |
to_csv
Write protocol to CSV file.
Params are serialized as JSON strings to avoid nested data issues.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | File path to write to. | required |
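The idea of storing params as JSON strings can be illustrated with the standard library alone. The sketch below is not the transformplan code, and the column names (`step`, `operation`, `params`) are assumptions made for the example. It shows why a nested dict needs flattening before it fits in a single CSV cell, and that the cell parses back to the original dict.

```python
import csv
import io
import json

# Hypothetical step records; the real column layout may differ.
steps = [
    {"step": 1, "operation": "col_drop", "params": {"column": "temp"}},
    {"step": 2, "operation": "math_multiply", "params": {"column": "price", "value": 1.1}},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["step", "operation", "params"])
writer.writeheader()
for step in steps:
    # Flatten the nested dict into one JSON string so it occupies a single cell.
    writer.writerow({**step, "params": json.dumps(step["params"], sort_keys=True)})

# Reading back: each params cell parses to the original dict.
buffer.seek(0)
restored = [json.loads(row["params"]) for row in csv.DictReader(buffer)]
```

The CSV writer quotes the JSON cell automatically, so commas and double quotes inside the params do not break the row.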
to_dict
Serialize protocol to a dictionary.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary representation of the protocol. |
from_dict
classmethod
from_dict(data: dict[str, Any]) -> Protocol
Deserialize protocol from a dictionary.
Returns:
| Type | Description |
|---|---|
| Protocol | Protocol instance. |
to_json
Serialize protocol to JSON.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path \| None | Optional file path to write to. | None |
| indent | int | JSON indentation level. | 2 |
Returns:
| Type | Description |
|---|---|
| str | JSON string. |
from_json
classmethod
from_json(source: str | Path) -> Protocol
Deserialize protocol from JSON.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | str \| Path | Either a JSON string or a path to a JSON file. | required |
Returns:
| Type | Description |
|---|---|
| Protocol | Protocol instance. |
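A loader that accepts either a JSON string or a file path has to decide which one it was given. The sketch below shows one possible way to do that with the standard library. `load_json_source` is a hypothetical helper, the payload layout is invented for the example, and how transformplan itself tells the two cases apart is not stated here.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any


def load_json_source(source: str | Path) -> dict[str, Any]:
    """Hypothetical helper: accept a JSON string or a path to a JSON file."""
    if isinstance(source, Path):
        return json.loads(source.read_text(encoding="utf-8"))
    # Assumption: a serialized protocol is a JSON object, so it starts with "{".
    if source.lstrip().startswith("{"):
        return json.loads(source)
    # Anything else is treated as a path given as a plain string.
    return json.loads(Path(source).read_text(encoding="utf-8"))


payload = '{"metadata": {"author": "alice"}}'  # invented layout, for illustration

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "audit_trail.json"
    path.write_text(payload, encoding="utf-8")
    from_file = load_json_source(path)
    from_path_string = load_json_source(str(path))

from_string = load_json_source(payload)
```

All three calls yield the same dictionary, which is the behaviour a caller would expect from a `str | Path` parameter.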
summary
Generate a clean, human-readable summary of the protocol.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| show_params | bool | Whether to include operation parameters. | True |
Returns:
| Type | Description |
|---|---|
| str | Formatted string summary. |
print
Print the protocol summary to stdout.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| show_params | bool | Whether to include operation parameters. | True |
frame_hash Function
frame_hash
Compute a deterministic hash of a DataFrame.
The hash is:
- Row-order invariant (sorted row hashes)
- Column-order invariant (columns sorted before hashing)
- Content-sensitive (any value change = different hash)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to hash. | required |
Returns:
| Type | Description |
|---|---|
| str | A 16-character hex string. |
Example Output
The print() method generates a formatted summary:
======================================================================
TRANSFORM PROTOCOL
======================================================================
Input: 1000 rows x 5 cols [a1b2c3d4e5f6g7h8]
Output: 850 rows x 4 cols [h8g7f6e5d4c3b2a1]
Total time: 0.0234s
----------------------------------------------------------------------
# Operation Rows Cols Time Hash
----------------------------------------------------------------------
0 input 1000 5 - a1b2c3d4e5f6g7h8
1 col_drop 1000 4 (-1) 0.0012s b2c3d4e5f6g7h8a1
-> column='temp'
2 math_multiply 1000 4 0.0008s c3d4e5f6g7h8a1b2
-> column='price', value=1.1
3 rows_filter 850 (-150) 4 0.0214s h8g7f6e5d4c3b2a1
-> filter=(age >= 18)
======================================================================
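The header and the bracketed deltas in this summary follow directly from the per-step rows. The short check below recomputes them from values copied out of the example table; the tuple layout is only a convenience for this sketch and is not how the library stores steps.

```python
# (operation, rows, cols, elapsed seconds) copied from the example summary
steps = [
    ("input", 1000, 5, None),
    ("col_drop", 1000, 4, 0.0012),
    ("math_multiply", 1000, 4, 0.0008),
    ("rows_filter", 850, 4, 0.0214),
]

# Total time is the sum of the per-step timings (the input row has none).
total = sum(step[3] for step in steps if step[3] is not None)

# Shape deltas compare each step with the one before it.
deltas = []
for before, after in zip(steps, steps[1:]):
    deltas.append((after[0], after[1] - before[1], after[2] - before[2]))

print(f"Total time: {total:.4f}s")  # Total time: 0.0234s
print(deltas)
```

The deltas come out as -1 column for `col_drop` and -150 rows for `rows_filter`, matching the `(-1)` and `(-150)` annotations in the summary.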
Reproducibility
The frame_hash function computes a deterministic hash that is:
- Row-order invariant: Same rows in different order produce the same hash
- Column-order invariant: Same columns in different order produce the same hash
- Content-sensitive: Any value change produces a different hash
This enables verification that the same pipeline on the same input produces identical results.
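The three properties can be demonstrated with a toy hash over plain Python rows. This is a minimal sketch, not the `frame_hash` implementation: `toy_frame_hash` is a hypothetical name, and the use of SHA-256 and `repr` for canonicalization are assumptions. Only the sorting strategy and the 16-character hex result follow the description above.

```python
from __future__ import annotations

import hashlib


def toy_frame_hash(rows: list[dict[str, object]]) -> str:
    """Hypothetical order-invariant hash over a list of row dicts."""
    row_digests = []
    for row in rows:
        # Sort columns by name so column order does not affect the result.
        canonical = repr(sorted(row.items()))
        row_digests.append(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
    # Sort the per-row digests so row order does not affect the result.
    combined = "".join(sorted(row_digests))
    # Truncate to 16 hex characters, mirroring the documented output length.
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()[:16]


original = [{"id": 1, "price": 10.0}, {"id": 2, "price": 20.0}]
reordered = [{"price": 20.0, "id": 2}, {"price": 10.0, "id": 1}]  # rows and columns shuffled
changed = [{"id": 1, "price": 10.5}, {"id": 2, "price": 20.0}]  # one value edited

same = toy_frame_hash(original) == toy_frame_hash(reordered)
different = toy_frame_hash(original) != toy_frame_hash(changed)
```

Here `same` and `different` are both true: reordering leaves the hash unchanged, while editing a single value changes it.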