Gluestick
Data Reading Functions
Reader
A class for reading ETL files from the sync-output directory. This is the recommended way to read ETL files because it provides consistent error handling and metadata access.
Installation
Basic Usage
Key Methods
get(stream, default=None, catalog_types=False, **kwargs)
Reads the selected file into a pandas DataFrame.
get_metadata(stream)
Retrieves metadata from parquet files.
get_pk(stream)
Gets primary key(s) from parquet file metadata.
read_csv_folder
Legacy function for reading multiple CSV files. Consider using the Reader class instead.
Usage
Parameters
- path (str): Directory containing CSV files, or path to a single CSV file
- converters (dict): Entity-specific column converters
- index_cols (dict): Entity-specific index columns
- ignore (list): Files to ignore
Returns
Dictionary of pandas DataFrames, keyed by entity name
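The parameters and return value described above can be pictured with a small stand-in written directly against pandas. This is a minimal sketch, not gluestick's own implementation: the name `read_csv_folder_sketch` is hypothetical, and it assumes that the entity name is the file name without its `.csv` extension and that `ignore` lists entity names.

```python
from pathlib import Path

import pandas as pd


def read_csv_folder_sketch(path, converters=None, index_cols=None, ignore=None):
    """Hypothetical stand-in for the documented behavior (not gluestick code)."""
    converters = converters or {}
    index_cols = index_cols or {}
    ignore = ignore or []

    path = Path(path)
    # Accept either a directory of CSV files or a single CSV file.
    files = sorted(path.glob("*.csv")) if path.is_dir() else [path]

    frames = {}
    for file in files:
        # Assumption: the entity name is the file name without ".csv".
        entity = file.stem
        # Assumption: `ignore` holds entity names rather than full file names.
        if entity in ignore:
            continue
        # Apply the per-entity column converters, if any were given.
        df = pd.read_csv(file, converters=converters.get(entity))
        # Apply the per-entity index column(s), if any were given.
        if entity in index_cols:
            df = df.set_index(index_cols[entity])
        frames[entity] = df
    return frames
```

Keying `converters` and `index_cols` by entity lets each file get its own parsing rules while a single call still covers the whole folder.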
read_parquet_folder
Legacy function for reading multiple parquet files. Consider using the Reader class instead.
Usage
Parameters
- path (str): Directory containing parquet files, or path to a single file
- ignore (list): Files to ignore
Returns
Dictionary of pandas DataFrames, keyed by entity name
get_catalog_schema
Retrieves the DataFrame schema for a stream from the Singer catalog file.
Usage
Returns
Dictionary containing the stream’s schema definition
Notes
- Requires catalog.json in the root directory
- Raises an exception if the stream is not found in the catalog
- Filters the schema to include only type and properties
- Ensures array types have an items dictionary
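The four notes above can be illustrated with a short sketch using only the standard library. It is not gluestick's implementation: the name `catalog_schema_sketch` and its `catalog_path` parameter are hypothetical, the catalog is assumed to follow the usual Singer layout (a top-level `streams` list whose entries carry `stream` and `schema` keys), and filling a missing `items` with an empty dictionary is an assumption about what "ensures" means here.

```python
import json
from pathlib import Path


def catalog_schema_sketch(stream, catalog_path="catalog.json"):
    """Hypothetical stand-in: return a trimmed schema for `stream`."""
    catalog = json.loads(Path(catalog_path).read_text())

    # Find the catalog entry whose "stream" name matches.
    entry = next(
        (s for s in catalog.get("streams", []) if s.get("stream") == stream),
        None,
    )
    if entry is None:
        raise Exception(f"Stream '{stream}' not found in {catalog_path}")

    # Keep only the "type" and "properties" keys of the schema.
    schema = {
        key: value
        for key, value in entry.get("schema", {}).items()
        if key in ("type", "properties")
    }

    # Make sure every array-typed property carries an "items" dictionary.
    for prop in schema.get("properties", {}).values():
        types = prop.get("type", [])
        if isinstance(types, str):
            types = [types]
        if "array" in types and not isinstance(prop.get("items"), dict):
            prop["items"] = {}  # assumption: an empty dict is an acceptable default
    return schema
```

Calling `catalog_schema_sketch("invoices")` from a directory that contains `catalog.json` would return a dictionary with at most the `type` and `properties` keys for that stream.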
Common Patterns
Iterating through multiple streams
etl.py