snapshot_records
A function for managing data snapshots by updating existing snapshots or creating new ones. Handles both CSV and Parquet formats with support for type coercion.Installation
Basic Usage
Parameters
stream_data(DataFrame): New data to be included in snapshotstream(str): Name of the streamsnapshot_dir(str): Directory for storing snapshotspk(str|list): Primary key(s) for deduplicationjust_new(bool): Adds only new records if Trueuse_csv(bool): Use CSV format instead of Parquetcoerce_types(bool): Force snapshot types to match stream_data**kwargs: Additional pandas read options
Returns
pandas DataFrame containing either complete snapshot or just new recordsread_snapshots
Read existing snapshot data into a dataframe.Usage
Parameters
stream(str): Name of the stream to readsnapshot_dir(str): Directory containing snapshots**kwargs: Additional pandas read options
Returns
pandas DataFrame containing snapshot data, or None if no snapshot existsdrop_redundant
Removes duplicate rows based on content hashing, maintaining a snapshot of processed data.Usage
Parameters
df(DataFrame): DataFrame to check for duplicatesname(str): Name for snapshot hash fileoutput_dir(str): Directory to save state filespk(str|list): Primary key(s) for state trackingupdated_flag(bool): Add flag for new/updated rowsuse_csv(bool): Use CSV format instead of Parquet
Returns
pandas DataFrame with redundant rows removedCommon Patterns
Incremental Processing
etl.py