Before you start

This guide assumes you have already launched a JupyterLab workspace, as described in the Transformations Overview.

Environment variables

The following environment variables are available when a transformation script runs during a hotglue job:

  • TENANT: ID of the tenant (e.g., test-user)

  • FLOW: ID of the flow (e.g., FZev7QqK-)

  • ENV_ID: ID of the hotglue environment (e.g., prod.hg.example.com)

  • JOB_ID: ID of the job (e.g., ZVonkl)

  • API_KEY: Environment API key (e.g., XXXX)

  • API_URL: URL of the hotglue API (e.g., https://client-api.hotglue.xyz)

  • JOB_ROOT: A unique key marking the root directory for the job (e.g., cust_1/flows/FZev7QqK-/jobs/2024/07/4/25/17/ZVonkl)

  • SYNC_TYPE: Type of sync (incremental_sync, auto_sync, or full_sync)

  • JOB_TYPE: Type of the job, in V2 flows (write or read)

  • CONNECTOR_ID: ID of the connector, in V2 flows (e.g., salesforce)

  • TAP: The connector managing the sync process, in both V1 and V2 flows (e.g., api)

  • TARGET: The connector managing the export, in both V1 and V2 flows (e.g., salesforce)

You can also add custom environment variables for use in your scripts in your environment settings. You can then reference any environment variable in your script using os.environ:

Python
import os

tenant_id = os.environ.get("TENANT")
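
For example, in a V2 flow you might branch on the job type. This is a minimal sketch; JOB_TYPE and CONNECTOR_ID are only set in V2 flows, per the list above, and the branching logic itself is up to your flow:

Python
import os

job_type = os.environ.get("JOB_TYPE")          # "read" or "write" in V2 flows
connector_id = os.environ.get("CONNECTOR_ID")  # e.g. "salesforce"

if job_type == "write":
    # export-side transformation logic would go here
    pass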

Standard directories

In hotglue, there are three standard directories:

  • sync-output: Contains all input data from the data source (typically Parquet or CSV).

  • snapshots: Can be used to store (snapshot) any data for the current tenant, typically as JSON, Parquet, or CSV. It also stores the tenant-config.json used for tenant-specific settings.

  • etl-output: Where you should put all output data, formatted as needed for the target your flow is using (typically a data.singer file).
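
As a quick sanity check at the start of a job, you can list what each standard directory contains. This is a minimal sketch assuming the layout above; ROOT_DIR falls back to the current directory when running locally:

Python
import os

ROOT_DIR = os.environ.get("ROOT_DIR", ".")

# Print the contents of each standard hotglue directory
for name in ("sync-output", "snapshots", "etl-output"):
    path = f"{ROOT_DIR}/{name}"
    if os.path.exists(path):
        print(name, os.listdir(path))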

Accessing JSON Objects

In hotglue jobs, several key JSON objects are available in the root directory for accessing configurations, state, and tenant-specific settings. Here’s an overview of the primary JSON files you might interact with:

  • config.json: This file stores credentials and flags required by the tap (v1 flows) or connector (v2 flows).

  • state.json: This file usually tracks the datetime bookmarks, which drive incremental sync behavior.

  • tenant-config.json: This is a flexible JSON configuration file stored at the tenant level. Unlike config.json and state.json, this file resides in the snapshots folder and is shared across any number of linked integrations.

By default, config.json and state.json are stored in the job's root directory (ROOT_DIR). Access them as follows:

import json
import os

# config.json lives in the job's root directory
ROOT_DIR = os.environ.get("ROOT_DIR", ".")
config_path = f"{ROOT_DIR}/config.json"

config = None
if os.path.exists(config_path):
    with open(config_path) as f:
        config = json.load(f)
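
Reading state.json follows the same pattern. Its exact shape depends on the tap or connector driving the sync, so treat the bookmark keys below as hypothetical examples rather than a guaranteed structure:

import json
import os

ROOT_DIR = os.environ.get("ROOT_DIR", ".")
state_path = f"{ROOT_DIR}/state.json"

state = {}
if os.path.exists(state_path):
    with open(state_path) as f:
        state = json.load(f)

# "bookmarks" and "campaigns" are hypothetical keys; the real layout
# depends on the connector managing the sync
last_synced = state.get("bookmarks", {}).get("campaigns")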

The tenant-config.json, on the other hand, is accessed from the snapshots directory. If you need to read or modify this object, your code will look something like this:

import json
import os

ROOT_DIR = os.environ.get("ROOT_DIR", ".")
SNAPSHOT_DIR = f"{ROOT_DIR}/snapshots"
tenant_config_path = f"{SNAPSHOT_DIR}/tenant-config.json"

tenant_config = None
if os.path.exists(tenant_config_path):
    with open(tenant_config_path) as f:
        tenant_config = json.load(f)
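
If you modify the object, you can persist the change by writing it back to the same path. This is a minimal sketch assuming tenant_config was loaded as above; my_setting is a hypothetical key used only for illustration:

# Update a tenant-level setting and write tenant-config.json back
# to the snapshots folder
if tenant_config is not None:
    tenant_config["my_setting"] = "value"  # hypothetical key
    with open(tenant_config_path, "w") as f:
        json.dump(tenant_config, f, indent=2)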

Example: Read the data

To start, we will import pandas and the gluestick Python package, and establish the standard input/output directories described above. Note: snapshots are optional and may not be needed for your scripts.

etl.ipynb
import gluestick as gs
import pandas as pd
import os

# standard hotglue directories
ROOT_DIR = os.environ.get("ROOT_DIR", ".")
INPUT_DIR = f"{ROOT_DIR}/sync-output"
OUTPUT_DIR = f"{ROOT_DIR}/etl-output"
SNAPSHOT_DIR = f"{ROOT_DIR}/snapshots"

My sync-output folder contains a CSV file called campaigns. Learn more about how to get sample data in the Debugging a script section.

Now we can go ahead and read the data in INPUT_DIR with gluestick’s reader function.

etl.ipynb
# Read input data
input_data = gs.reader()

Gluestick automatically parses out the timestamp in the CSV name, so we can access the campaigns data as follows:

etl.ipynb
# Get the campaigns data
campaigns = input_data.get("campaigns")
campaigns.head()

By calling campaigns.head(), we can preview the data.

Manipulate the data

Now that you have the data as a pandas DataFrame, you can do any transformations you need to.

For this example, we will select a few of the columns: id, emails_sent, create_time, and status, and then rename them.

etl.ipynb
# Select the columns we want
campaigns = campaigns[['id', 'emails_sent', 'create_time', 'status']]

# Rename the columns
campaigns = campaigns.rename(columns={
    'id': "Id",
    'emails_sent': "Sent",
    'create_time': "CreatedAt",
    'status': "Status"
})

campaigns.head()
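
You can chain any other pandas operations at this point. For example, here is a hypothetical refinement that parses the timestamp column; the column name follows the rename above, and this step is purely illustrative, not required for the flow:

etl.ipynb
# Convert CreatedAt from a string to a proper datetime
campaigns["CreatedAt"] = pd.to_datetime(campaigns["CreatedAt"])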

This generates the final output. You can see samples of more complex use cases on GitHub and refer to the pandas documentation.

Write the output data

Finally, we can write the output to the standard OUTPUT_DIR using the pandas to_csv function:

etl.ipynb
# Write the output
campaigns.to_csv(f"{OUTPUT_DIR}/campaigns.csv", index=False)

This generates the output as a CSV file in the etl-output directory.

If you want to write this data to a non-filestore target, such as a database, CRM, or any other standard Singer target, you can use Gluestick’s to_singer function in lieu of to_csv:

etl.ipynb
# Write the output
gs.to_singer(campaigns, "Campaigns", OUTPUT_DIR, keys=["Id"])
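
This writes a data.singer file to the etl-output directory, which the export connector (the TARGET environment variable above) can then pick up during the export step.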