Reader
A class for reading ETL files from the sync-output directory. It is the recommended way to read ETL files because it provides consistent error handling and metadata access.
Installation
from gluestick.reader import Reader
Basic Usage
# Initialize with default directories
reader = Reader() # Uses ROOT_DIR/sync-output
# Get available streams
print(reader) # Shows list of available streams
Key Methods
get(stream, default=None, catalog_types=False, **kwargs)
Reads the selected file into a pandas DataFrame.
# Basic reading
df = reader.get("users")
# Using catalog types
df = reader.get("users", catalog_types=True)
get_metadata(stream)
Retrieves metadata from parquet files.
metadata = reader.get_metadata("users")
# Returns dict of metadata key-value pairs
get_pk(stream)
Gets primary key(s) from parquet file metadata.
primary_keys = reader.get_pk("users")
# Returns list of primary key column names
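To make the relationship between file metadata and primary keys more concrete, here is a minimal, standard-library-only sketch. It is not gluestick's implementation. It assumes that Parquet key-value metadata arrives as bytes (as Parquet libraries commonly return it) and that the primary keys are stored as a serialized list under a `key_properties` entry; that key name and both helper functions are hypothetical.

```python
import ast


def decode_metadata(raw):
    """Convert bytes keys and values into str, leaving other types untouched."""
    return {
        (k.decode("utf-8") if isinstance(k, bytes) else k):
        (v.decode("utf-8") if isinstance(v, bytes) else v)
        for k, v in raw.items()
    }


def primary_keys_from_metadata(metadata, key="key_properties"):
    """Parse a serialized list of column names; return [] when the entry is absent."""
    value = metadata.get(key)
    if not value:
        return []
    parsed = ast.literal_eval(value)  # parses literals only, never runs code
    return list(parsed) if isinstance(parsed, (list, tuple)) else [parsed]


# Hypothetical raw metadata, standing in for what a Parquet file might carry.
raw = {b"key_properties": b"['id']", b"source": b"example"}
metadata = decode_metadata(raw)
print(metadata)
print(primary_keys_from_metadata(metadata))
```

In real code, `reader.get_metadata("users")` and `reader.get_pk("users")` already do this work; the sketch only illustrates the kind of data they return.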
get_catalog_schema
Retrieves Singer schema from catalog file.
Usage
from gluestick.singer import get_catalog_schema, to_singer
# Get schema for specific stream
schema = get_catalog_schema("users")
# Use schema with Singer export
to_singer(df, "users", OUTPUT_DIR, schema=schema)
Returns
Dictionary containing the stream’s schema definition
Notes
- Requires catalog.json in root directory
- Raises exception if stream not found in catalog
- Filters schema to include only type and properties
- Ensures array types have items dictionary
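The behavior described in the notes above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: it assumes a Singer-style catalog already loaded as a dictionary with a top-level `streams` list, and `extract_stream_schema` is a hypothetical name. It only inspects top-level properties.

```python
import copy


def extract_stream_schema(catalog, stream):
    """Return a trimmed copy of the schema for `stream` from a Singer-style catalog dict."""
    for entry in catalog.get("streams", []):
        if stream in (entry.get("stream"), entry.get("tap_stream_id")):
            # Work on a copy so the caller's catalog is left unchanged.
            schema = copy.deepcopy(entry.get("schema", {}))
            # Keep only the top-level "type" and "properties" keys.
            trimmed = {k: v for k, v in schema.items() if k in ("type", "properties")}
            # Give every array-typed top-level property an "items" dictionary.
            for prop in trimmed.get("properties", {}).values():
                if not isinstance(prop, dict):
                    continue
                types = prop.get("type", [])
                types = [types] if isinstance(types, str) else types
                if "array" in types and not isinstance(prop.get("items"), dict):
                    prop["items"] = {}
            return trimmed
    raise KeyError(f"stream {stream!r} not found in catalog")


# Hypothetical catalog contents, standing in for a parsed catalog.json.
catalog = {
    "streams": [
        {
            "stream": "users",
            "tap_stream_id": "users",
            "schema": {
                "type": "object",
                "additionalProperties": False,
                "properties": {
                    "id": {"type": "integer"},
                    "tags": {"type": ["null", "array"]},
                },
            },
        }
    ]
}

schema = extract_stream_schema(catalog, "users")
print(schema)
```

Requesting a stream that is not in the catalog raises `KeyError` in this sketch; the exception type raised by `get_catalog_schema` itself is not specified here.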
Common Patterns
Iterating through multiple streams
import gluestick as gs
# TENANT_ID and OUTPUT_DIR are assumed to be defined earlier in the script
# iterate through each stream in the input directory
reader = gs.Reader()
for key in eval(str(reader)):
    # define a dataframe and apply transformations
    input_df = reader.get(key, catalog_types=True)
    input_df["tenant"] = TENANT_ID
    # define the primary keys (assuming the data is Parquet)
    key_properties = reader.get_pk(key)
    # write the data out to the output directory
    gs.to_export(input_df, key, OUTPUT_DIR, keys=key_properties)
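The loop above obtains stream names with `eval(str(reader))`. If evaluating arbitrary text is a concern, one possible alternative is `ast.literal_eval`, which parses Python literals without executing code. The sketch below assumes that `str(reader)` yields a Python list literal such as `"['users', 'orders']"`; `parse_stream_list` is a hypothetical helper and the `printed` string is a stand-in for the reader's output.

```python
import ast


def parse_stream_list(text):
    """Parse a printed list of stream names without executing arbitrary code."""
    streams = ast.literal_eval(text)
    if not isinstance(streams, (list, tuple)) or not all(
        isinstance(name, str) for name in streams
    ):
        raise ValueError(f"expected a list of stream names, got: {text!r}")
    return list(streams)


printed = "['users', 'orders']"  # stand-in for str(reader)
for stream in parse_stream_list(printed):
    print(stream)
```

With a real reader, the loop header would become `for key in parse_stream_list(str(reader)):`, and the loop body stays the same.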