Writing a basic script

This guide walks through writing a basic transform script using the gluestick package

Before you start

This guide assumes you have already launched a JupyterLab workspace, as described in the Transformations Overview.

Environment variables

There are several environment variables that will be available when a transformation script runs in hotglue during a sync job:

Environment VariableDescription
TENANTID of the tenant
TAPID of the tap
FLOWID of the flow
ENV_IDID of the hotglue environment
JOB_IDID of the job

If you wanted to reference this in Python, you would use the os.environ variable to access the environment variables. See the sample below.

import os

tenant_id = os.environ.get("TENANT")
tap = os.environ.get("TAP")
flow_id = os.environ.get("FLOW")
env_id = os.environ.get("ENV_ID")
job_id = os.environ.get("JOB_ID")

Standard directories

In hotglue, there are three standard directories:

Directory PathDescription
sync-outputThis directory contains all input data from the data source (typically CSV).
snapshotsThis directory can be used to store (snapshot) any data for the current tenant (typically JSON or CSV).
etl-outputThis directory is where you should put all output data, formatted as needed for the target your flow is using (typically JSON or CSV).

Read the data

To start, we will import pandas and the gluestick Python package, and establish the standard input/output directories above. Note that you do not have to use the snapshots feature if you do not need to.

import gluestick as gs
import pandas as pd
import os

# Establish standard directories for hotglue
INPUT_DIR = "./sync-output"
OUTPUT_DIR = "./etl-output"

As you can see below, my sync_output folder contains a CSV file called campaigns. Learn more how to get sample data in the Debugging a script section.

30683068

Now we can go ahead and read the data in INPUT_DIR with gluestick's read_csv_folder function.

# Read input data
input_data = gs.read_csv_folder(INPUT_DIR)

gluestick automatically parses out the timestamp in the CSV name, so we can access the campaigns data as follows:

# Get the campaigns data
campaigns = input_data["campaigns"]
campaigns.head()

By calling campaigns.head() we can preview the data

22182218

campaigns data in Jupyter

Manipulate the data

Now that you have the data as a pandas DataFrame, you can do any transformations you need to.

For this example, we will select a few of the columns: id, emails_sent, create_time, and status, and then rename them.

# Select the columns we want
campaigns = campaigns[['id', 'emails_sent', 'create_time', 'status']]

# Rename the columns
campaigns = campaigns.rename(columns={
    'id': "Id",
    'emails_sent': "Sent",
    'create_time': "CreatedAt",
    'status': "Status"
})

campaigns.head()

This generates the final output pictured below. You can see samples of more complex use cases on GitHub and refer to the pandas documentation.

22362236

Final output data

Write the output data

Finally, we can write the output to the standard OUTPUT_DIR using pandas to_csv function:

# Write the output
campaigns.to_csv(f"{OUTPUT_DIR}/campaigns.csv", index=False)

As pictured below, this generates the output as a CSV file in the etl-output directory:

30683068

Output CSV