Writing a basic script

Before you start

This guide assumes you have already launched a JupyterLab workspace, as described in the Transformations Overview.

Environment variables

There are several environment variables that will be available when a transformation script runs in hotglue during a job:

Environment Variable	Description
`TENANT`	ID of the tenant (`test-user`)
`FLOW`	ID of the flow (`FZev7QqK-`)
`ENV_ID`	ID of the hotglue environment (`prod.hg.example.com`)
`JOB_ID`	ID of the job (`ZVonkl`)
`API_KEY`	Environment API key (`XXXX`)
`API_URL`	URL of the hotglue API (`https://api.hotglue.com`)
`JOB_ROOT`	A unique key marking the root directory for the job (`cust_1/flows/FZev7QqK-/jobs/2024/07/4/25/17/ZVonkl`)
`SYNC_TYPE`	Type of sync (`incremental_sync`, `auto_sync`, or `full_sync`)
`JOB_TYPE`	Type of the job, in V2 flows (`write` or `read`)
`CONNECTOR_ID`	ID of the connector, in V2 flows (`salesforce`)
`TAP`	The connector managing the sync process, in both V1 and V2 flows (`api`)
`TARGET`	The connector managing the export, in both V1 and V2 flows (`salesforce`)

You can also add custom env variables for used in your scripts in your environment settings. You can then reference any env variable in your script using os.environ:

Python

import os

tenant_id = os.environ.get("TENANT")

Standard directories

In hotglue, there are three standard directories:

Directory Path	Description
`sync-output`	This directory contains all input data from the data source (typically Parquet or CSV).
`snapshots`	This directory can be used to store (snapshot) any data for the current tenant. It also stores the `tenant-config.json` used for tenant-specific settings. (typically JSON, Parquet, or CSV).
`etl-output`	This directory is where you should put all output data, formatted as needed for the target your flow is using (typically a data.singer file).

The Working Directory

When hotglue executes your ETL script, it provides a structured working directory with all the necessary files and data for your transformation. You should reproduce this directory structure when developing scripts locally.

Directory Structure Overview

ROOT/
├── sync-output/
│   ├── streamone.csv
│   └── streamtwo.csv
├── snapshots/
│   ├── streamone.csv
│   ├── streamtwo.csv
│   └── tenant-config.json
├── etl-output/
│   └── (where output will be written)
├── entry file (either etl.py, etl.ipynb, etl.js, etl.ts)
├── any other files or folders you deploy
├── catalog.json
├── source-config.json
├── target-config.json
├── target-catalog.json (Only available in bidrectional write jobs)
├── source-state.json
└── state.json

📄 Configuration Files

The key JSON configuration files available at runtime by your ETL script are:

catalog.json - Schema catalog for the data
source-config.json - Source system credentials and configuration
target-config.json - Target system credentials and configuration
target-catalog.json - Target schema catalog (V2 write jobs only)
source-state.json - Incremental sync state and bookmarks
state.json - Current job execution state

You can access these files with:

import json
import os

config_path = f"./source-config.json"

if os.path.exists(config_path):
    with open(config_path) as f:
        config = json.load(f)

The tenant-config.json, on the other hand, is accessed from the snapshots directory. If you need to read or modify this object, your code will look something like below:

import json
import os

SNAPSHOT_DIR = f"{ROOT_DIR}/snapshots" 

tenant_config_path = f"{SNAPSHOT_DIR}/tenant-config.json"

tenant_config= None

if os.path.exists(tenant_config_path):
    with open(tenant_config_path) as f:
        tenant_config = json.load(f)

Example: Read the data

To start, we will import pandas and the gluestick Python package, and establish the standard input/output directories above. Note:snapshots are optional, and may not be needed for your scripts.

etl.ipynb

import gluestick as gs
import pandas as pd
import os

 # standard directory for hotglue
ROOT_DIR = os.environ.get("ROOT_DIR", ".")
INPUT_DIR = f"{ROOT_DIR}/sync-output"
OUTPUT_DIR = f"{ROOT_DIR}/etl-output"
SNAPSHOT_DIR = f"{ROOT_DIR}/snapshots"

My sync_output folder contains a CSV file called campaigns. Learn more how to get sample data in the Debugging a script section. 3068

Now we can go ahead and read the data in INPUT_DIR with gluestick’s reader function.

etl.ipynb

# Read input data
input_data = gs.reader()

Gluestick automatically parses out the timestamp in the CSV name, so we can access the campaigns data as follows:

etl.ipynb

# Get the campaigns data
campaigns = input_data.get("campaigns")
campaigns.head()

By calling campaigns.head() we can preview the data 2218

Manipulate the data

Now that you have the data as a pandas DataFrame, you can do any transformations you need to. For this example, we will select a few of the columns: id, emails_sent, create_time, and status, and then rename them.

etl.ipynb

# Select the columns we want
campaigns = campaigns[['id', 'emails_sent', 'create_time', 'status']]

# Rename the columns
campaigns = campaigns.rename(columns={
    'id': "Id",
    'emails_sent': "Sent",
    'create_time': "CreatedAt",
    'status': "Status"
})

campaigns.head()

This generates the final output below. You can see samples of more complex use cases on GitHub and refer to the pandas documentation. 2236

Write the output data

Finally, we can write the output to the standard OUTPUT_DIR using pandas to_csv function:

etl.ipynb

# Write the output
campaigns.to_csv(f"{OUTPUT_DIR}/campaigns.csv", index=False)

This generates the output as a CSV file in the etl-output directory: 3068

If you want to write this data to a non-filestore target, such as a database, CRM, or any other standard Singer target, you can use Gluestick’s to_singer function in lieu of to_csv:

etl.ipynb

# Write the output
gs.to_singer(campaigns, "Campaigns", OUTPUT_DIR, keys=["Id"])

Getting Started

Key Concepts

Managed authentication

Transformation

Custom Connectors

Unified Schema

Environment Settings

CLI

Connectors

Before you start

Environment variables

Standard directories

The Working Directory

Directory Structure Overview

📄 Configuration Files

Example: Read the data

Manipulate the data

Write the output data

Getting Started

Key Concepts

Managed authentication

Transformation

Custom Connectors

Unified Schema

Environment Settings

CLI

Connectors

​Before you start

​Environment variables

​Standard directories

​The Working Directory

​Directory Structure Overview

​📄 Configuration Files

​Example: Read the data

​Manipulate the data

​Write the output data

Before you start

Environment variables

Standard directories

The Working Directory

Directory Structure Overview

📄 Configuration Files

Example: Read the data

Manipulate the data

Write the output data