Writing a basic script
This guide walks through writing a basic transform script using the gluestick package.
Before you start
This guide assumes you have already launched a JupyterLab workspace, as described in the Transformations Overview.
Environment variables
Several environment variables are available when a transformation script runs during a hotglue job:
| Environment Variable | Description |
| --- | --- |
| TENANT | ID of the tenant (`test-user`) |
| FLOW | ID of the flow (`FZev7QqK-`) |
| ENV_ID | ID of the hotglue environment (`prod.hg.example.com`) |
| JOB_ID | ID of the job (`ZVonkl`) |
| API_KEY | Environment API key (`XXXX`) |
| API_URL | URL of the hotglue API (`https://client-api.hotglue.xyz`) |
| JOB_ROOT | A unique key marking the root directory for the job (`cust_1/flows/FZev7QqK-/jobs/2024/07/4/25/17/ZVonkl`) |
| SYNC_TYPE | Type of sync (`incremental_sync`, `auto_sync`, or `full_sync`) |
| JOB_TYPE | Type of the job, in V2 flows (`write` or `read`) |
| CONNECTOR_ID | ID of the connector, in V2 flows (`salesforce`) |
| TAP | The connector managing the sync process, in both V1 and V2 flows (`api`) |
| TARGET | The connector managing the export, in both V1 and V2 flows (`salesforce`) |
You can also add custom environment variables for use in your scripts in your environment settings. You can then reference any environment variable in your script using `os.environ`:
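Here is a minimal sketch; `MY_CUSTOM_FLAG` is a hypothetical custom variable name used only for illustration:

```python
import os

# Built-in hotglue job variables
tenant_id = os.environ.get("TENANT")
job_id = os.environ.get("JOB_ID")

# A custom variable defined in your environment settings
# (MY_CUSTOM_FLAG is a hypothetical name)
my_flag = os.environ.get("MY_CUSTOM_FLAG", "default-value")
```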
Standard directories
In hotglue, there are three standard directories:
| Directory Path | Description |
| --- | --- |
| sync-output | This directory contains all input data from the data source (typically Parquet or CSV). |
| snapshots | This directory can be used to store (snapshot) any data for the current tenant (typically JSON, Parquet, or CSV). It also stores the `tenant-config.json` used for tenant-specific settings. |
| etl-output | This directory is where you should put all output data, formatted as needed for the target your flow is using (typically a `data.singer` file). |
Accessing JSON Objects
In hotglue jobs, several key JSON objects are available in the root directory for accessing configurations, state, and tenant-specific settings. Here’s an overview of the primary JSON files you might interact with:
- `config.json`: This file stores credentials and flags required by the tap (v1 flows) or connector (v2 flows).
- `state.json`: This file usually tracks the datetime bookmarks, which drive incremental sync behavior.
- `tenant-config.json`: This is a flexible JSON configuration file stored at the tenant level. Unlike `config.json` and `state.json`, this file resides in the `snapshots` folder and is shared across any number of linked integrations.
By default, `config.json` and `state.json` are stored in the `root_dir`. Access them as follows:
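A minimal sketch, assuming the script runs from the job's root directory:

```python
import json

ROOT_DIR = "."  # assumed: the job's root directory is the current working directory

with open(f"{ROOT_DIR}/config.json") as f:
    config = json.load(f)

with open(f"{ROOT_DIR}/state.json") as f:
    state = json.load(f)
```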
The `tenant-config.json`, on the other hand, is accessed from the `snapshots` directory. If you need to read or modify this object, your code will look something like the example below:
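A sketch of reading and updating it, assuming a local `snapshots` folder; `custom_setting` is a hypothetical key used only for illustration:

```python
import json
import os

SNAPSHOT_DIR = "snapshots"
tenant_config_path = os.path.join(SNAPSHOT_DIR, "tenant-config.json")

# Read the existing tenant config (it may not exist yet for a new tenant)
tenant_config = {}
if os.path.exists(tenant_config_path):
    with open(tenant_config_path) as f:
        tenant_config = json.load(f)

# Update a value and write it back
tenant_config["custom_setting"] = "example-value"  # hypothetical key
with open(tenant_config_path, "w") as f:
    json.dump(tenant_config, f, indent=2)
```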
Example: Read the data
To start, we will import pandas and the gluestick Python package, and establish the standard input/output directories above. Note: the `snapshots` directory is optional and may not be needed for your scripts.
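A minimal setup sketch, assuming the standard directories above live under the current working directory:

```python
import gluestick as gs
import pandas as pd

# Standard hotglue directories (assumed relative to the current working directory)
ROOT_DIR = "."
INPUT_DIR = f"{ROOT_DIR}/sync-output"
OUTPUT_DIR = f"{ROOT_DIR}/etl-output"
SNAPSHOT_DIR = f"{ROOT_DIR}/snapshots"  # optional
```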
My `sync-output` folder contains a CSV file called `campaigns`. Learn more about how to get sample data in the Debugging a script section.
Now we can go ahead and read the data in `INPUT_DIR` with gluestick's reader function.
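A sketch of that step; the directory is passed explicitly here, and the exact constructor defaults may vary by gluestick version:

```python
# Read all input files in INPUT_DIR
# (the explicit directory argument is an assumption; gluestick may default to sync-output)
input_data = gs.Reader(INPUT_DIR)
```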
Gluestick automatically parses out the timestamp in the CSV name, so we can access the `campaigns` data as follows:
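A sketch, assuming the reader exposes each input stream by name via a `get` method:

```python
# Fetch the "campaigns" stream as a pandas DataFrame
campaigns = input_data.get("campaigns")
```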
By calling `campaigns.head()` we can preview the data.
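For example, in a JupyterLab notebook:

```python
# Preview the first five rows of the campaigns DataFrame
campaigns.head()
```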
Manipulate the data
Now that you have the data as a pandas DataFrame, you can do any transformations you need to.
For this example, we will select a few of the columns: `id`, `emails_sent`, `create_time`, and `status`, and then rename them.
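A sketch of this transformation; the new column names below are illustrative, not prescribed by the guide:

```python
# Keep only the columns we care about
campaigns = campaigns[["id", "emails_sent", "create_time", "status"]]

# Rename them (the new names here are just examples)
campaigns = campaigns.rename(columns={
    "id": "campaign_id",
    "emails_sent": "emails_sent_count",
    "create_time": "created_at",
    "status": "campaign_status",
})
```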
This generates the final output below. You can see samples of more complex use cases on GitHub and refer to the pandas documentation.
Write the output data
Finally, we can write the output to the standard `OUTPUT_DIR` using the pandas `to_csv` function:
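A minimal sketch, using the `OUTPUT_DIR` defined earlier and `campaigns.csv` as an illustrative file name:

```python
import os

# Write the transformed DataFrame to the standard output directory
campaigns.to_csv(os.path.join(OUTPUT_DIR, "campaigns.csv"), index=False)
```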
This generates the output as a CSV file in the `etl-output` directory.
If you want to write this data to a non-filestore target, such as a database, CRM, or any other standard Singer target, you can use gluestick's `to_singer` function in lieu of `to_csv`:
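A sketch of that call; the stream name and key column list below are assumptions, and the exact arguments may vary by gluestick version:

```python
# Emit the DataFrame as a data.singer file for a Singer target
# ("campaigns" is the stream name; the key column list is an assumption)
gs.to_singer(campaigns, "campaigns", OUTPUT_DIR, keys=["campaign_id"])
```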