Gluestick
Snapshotting functions
snapshot_records
A function for managing data snapshots by updating existing snapshots or creating new ones. Handles both CSV and Parquet formats with support for type coercion.
Installation
Basic Usage
Parameters
stream_data (DataFrame): New data to be included in snapshot
stream (str): Name of the stream
snapshot_dir (str): Directory for storing snapshots
pk (str|list): Primary key(s) for deduplication
just_new (bool): Adds only new records if True
use_csv (bool): Use CSV format instead of Parquet
coerce_types (bool): Force snapshot types to match stream_data
**kwargs: Additional pandas read options
Returns
A pandas DataFrame containing either the complete snapshot or only the new records, depending on just_new
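The merge behaviour described above can be sketched without pandas. This is a minimal illustration, not Gluestick's implementation: `upsert_snapshot` is a hypothetical helper, rows are modelled as plain dicts, and it assumes that when two rows share a primary key the row from the new data wins.

```python
def upsert_snapshot(snapshot, stream_data, pk, just_new=False):
    """Merge new rows into a snapshot, keeping one row per primary key.

    Hypothetical stand-in for the behaviour described above; assumes the
    newest row replaces an older row with the same key.
    """
    keys = [pk] if isinstance(pk, str) else list(pk)
    # Index existing rows by primary key (dicts preserve insertion order).
    merged = {tuple(row[k] for k in keys): row for row in snapshot}
    for row in stream_data:
        merged[tuple(row[k] for k in keys)] = row
    # just_new=True hands back only the incoming rows.
    return list(stream_data) if just_new else list(merged.values())


snapshot = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
stream_data = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cy"}]

full = upsert_snapshot(snapshot, stream_data, pk="id")
new_only = upsert_snapshot(snapshot, stream_data, pk="id", just_new=True)
```

Here `full` holds three rows (id 2 updated, id 3 appended), while `new_only` holds just the two incoming rows.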
read_snapshots
Read existing snapshot data into a dataframe.
Usage
Parameters
stream (str): Name of the stream to read
snapshot_dir (str): Directory containing snapshots
**kwargs: Additional pandas read options
Returns
pandas DataFrame containing snapshot data, or None if no snapshot exists
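Because the function returns None when no snapshot exists, callers need to handle the first run explicitly. The sketch below models that contract with the standard library only. `read_snapshot_rows` and the `<stream>.snapshot.csv` file name are assumptions made for illustration and may not match how Gluestick names or reads its files.

```python
import csv
import os
import tempfile


def read_snapshot_rows(stream, snapshot_dir):
    """Return the snapshot rows for a stream, or None if no snapshot file exists."""
    # Hypothetical file naming, chosen only for this example.
    path = os.path.join(snapshot_dir, f"{stream}.snapshot.csv")
    if not os.path.isfile(path):
        return None
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as snapshot_dir:
    # First run: nothing has been written yet, so the result is None.
    missing = read_snapshot_rows("customers", snapshot_dir)

    # Write a snapshot, then read it back.
    path = os.path.join(snapshot_dir, "customers.snapshot.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerow({"id": 1, "name": "Ada"})
    rows = read_snapshot_rows("customers", snapshot_dir)

# Typical caller pattern: fall back to an empty list on the first run.
existing = missing if missing is not None else []
```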
drop_redundant
Removes duplicate rows based on content hashing, maintaining a snapshot of processed data.
Usage
Parameters
df (DataFrame): DataFrame to check for duplicates
name (str): Name for snapshot hash file
output_dir (str): Directory to save state files
pk (str|list): Primary key(s) for state tracking
updated_flag (bool): Add flag for new/updated rows
use_csv (bool): Use CSV format instead of Parquet
Returns
pandas DataFrame with redundant rows removed
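Content hashing across runs can be illustrated as follows. This is a sketch under stated assumptions rather than the library's code: `drop_redundant_rows` is a hypothetical helper, the state is an in-memory dict instead of a file in output_dir, SHA-256 over a JSON dump stands in for whatever hash Gluestick uses, and the `_updated` column name is assumed.

```python
import hashlib
import json


def drop_redundant_rows(rows, state, pk, updated_flag=False):
    """Return rows whose content changed since the last run and update state.

    state maps a primary-key string to the hash of the last row seen for it.
    """
    keys = [pk] if isinstance(pk, str) else list(pk)
    changed = []
    for row in rows:
        key = json.dumps([row[k] for k in keys])
        payload = json.dumps(row, sort_keys=True, default=str)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        previous = state.get(key)
        if previous == digest:
            continue  # identical to what was processed before: redundant
        out = dict(row)
        if updated_flag:
            # True for a known key with new content, False for a brand-new key.
            out["_updated"] = previous is not None
        changed.append(out)
        state[key] = digest
    return changed


state = {}
first = drop_redundant_rows(
    [{"id": 1, "total": 10}, {"id": 2, "total": 20}], state, "id", updated_flag=True
)
second = drop_redundant_rows(
    [{"id": 1, "total": 10}, {"id": 2, "total": 25}, {"id": 3, "total": 5}],
    state, "id", updated_flag=True,
)
```

On the first call every row passes through. On the second call the unchanged row for id 1 is dropped, id 2 is returned flagged as updated, and id 3 is returned as new.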
Common Patterns
Incremental Processing
etl.py