A function for managing data snapshots by updating existing snapshots or creating new ones. Handles both CSV and Parquet formats with support for type coercion.
stream_data
(DataFrame): New data to be included in snapshotstream
(str): Name of the streamsnapshot_dir
(str): Directory for storing snapshotspk
(str|list): Primary key(s) for deduplicationjust_new
(bool): Adds only new records if Trueuse_csv
(bool): Use CSV format instead of Parquetcoerce_types
(bool): Force snapshot types to match stream_data**kwargs
: Additional pandas read optionspandas DataFrame containing either complete snapshot or just new records
Read existing snapshot data into a dataframe.
stream
(str): Name of the stream to readsnapshot_dir
(str): Directory containing snapshots**kwargs
: Additional pandas read optionspandas DataFrame containing snapshot data, or None if no snapshot exists
Removes duplicate rows based on content hashing, maintaining a snapshot of processed data.
df
(DataFrame): DataFrame to check for duplicatesname
(str): Name for snapshot hash fileoutput_dir
(str): Directory to save state filespk
(str|list): Primary key(s) for state trackingupdated_flag
(bool): Add flag for new/updated rowsuse_csv
(bool): Use CSV format instead of Parquetpandas DataFrame with redundant rows removed
A function for managing data snapshots by updating existing snapshots or creating new ones. Handles both CSV and Parquet formats with support for type coercion.
stream_data
(DataFrame): New data to be included in snapshotstream
(str): Name of the streamsnapshot_dir
(str): Directory for storing snapshotspk
(str|list): Primary key(s) for deduplicationjust_new
(bool): Adds only new records if Trueuse_csv
(bool): Use CSV format instead of Parquetcoerce_types
(bool): Force snapshot types to match stream_data**kwargs
: Additional pandas read optionspandas DataFrame containing either complete snapshot or just new records
Read existing snapshot data into a dataframe.
stream
(str): Name of the stream to readsnapshot_dir
(str): Directory containing snapshots**kwargs
: Additional pandas read optionspandas DataFrame containing snapshot data, or None if no snapshot exists
Removes duplicate rows based on content hashing, maintaining a snapshot of processed data.
df
(DataFrame): DataFrame to check for duplicatesname
(str): Name for snapshot hash fileoutput_dir
(str): Directory to save state filespk
(str|list): Primary key(s) for state trackingupdated_flag
(bool): Add flag for new/updated rowsuse_csv
(bool): Use CSV format instead of Parquetpandas DataFrame with redundant rows removed