> ## Documentation Index
> Fetch the complete documentation index at: https://docs.hotglue.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Writing a basic script

> This guide walks through writing a basic python transform script using the gluestick package

## Before you start

This guide assumes you have already launched a JupyterLab workspace, as described in the [Transformations Overview](transformations/python-scripts#start-the-jupyterlab-workspace).

## Environment variables

[](https://docs.hotglue.com/docs/writing-a-basic-script#environment-variables)

There are several environment variables that will be available when a transformation script runs in hotglue during a job:

| Environment Variable | Description                                                                                                |
| -------------------- | ---------------------------------------------------------------------------------------------------------- |
| `TENANT`             | ID of the tenant (`test-user`)                                                                             |
| `FLOW`               | ID of the flow (`FZev7QqK-`)                                                                               |
| `ENV_ID`             | ID of the hotglue environment (`prod.hg.example.com`)                                                      |
| `JOB_ID`             | ID of the job (`ZVonkl`)                                                                                   |
| `API_KEY`            | Environment API key (`XXXX`)                                                                               |
| `API_URL`            | URL of the hotglue API (`https://api.hotglue.com`)                                                         |
| `JOB_ROOT`           | A unique key marking the root directory for the job (`cust_1/flows/FZev7QqK-/jobs/2024/07/4/25/17/ZVonkl`) |
| `SYNC_TYPE`          | Type of sync (`incremental_sync`, `auto_sync`, or `full_sync`)                                             |
| `JOB_TYPE`           | Type of the job, in V2 flows (`write` or `read`)                                                           |
| `CONNECTOR_ID`       | ID of the connector, in V2 flows (`salesforce`)                                                            |
| `TAP`                | The connector managing the sync process, in both V1 and V2 flows (`api`)                                   |
| `TARGET`             | The connector managing the export, in both V1 and V2 flows (`salesforce`)                                  |

You can also add custom env variables for used in your scripts in your [environment settings](https://docs.hotglue.com/environment-settings/environment-variables). You can then reference any env variable in your script using `os.environ`:

```python Python theme={null}
import os

tenant_id = os.environ.get("TENANT")
```

## Standard directories

[](https://docs.hotglue.com/docs/writing-a-basic-script#standard-directories)

In hotglue, there are three **standard** directories:

| Directory Path | Description                                                                                                                                                                                   |
| -------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `sync-output`  | This directory contains all **input** data from the data source (typically Parquet or CSV).                                                                                                   |
| `snapshots`    | This directory can be used to store (snapshot) any data for the current tenant. It also stores the `tenant-config.json` used for tenant-specific settings. (typically JSON, Parquet, or CSV). |
| `etl-output`   | This directory is where you should put all **output** data, formatted as needed for the target your flow is using (typically a data.singer file).                                             |

## The Working Directory

When hotglue executes your ETL script, it provides a structured working directory with all the necessary files and data for your transformation. You should reproduce this directory structure when developing scripts locally.

### Directory Structure Overview

```
ROOT/
├── sync-output/
│   ├── streamone.csv
│   └── streamtwo.csv
├── snapshots/
│   ├── streamone.csv
│   ├── streamtwo.csv
│   └── tenant-config.json
├── etl-output/
│   └── (where output will be written)
├── entry file (either etl.py, etl.ipynb, etl.js, etl.ts)
├── any other files or folders you deploy
├── catalog.json
├── source-config.json
├── target-config.json
├── target-catalog.json (Only available in bidrectional write jobs)
├── source-state.json
└── state.json
```

#### 📄 **Configuration Files**

The key JSON configuration files available at runtime by your ETL script are:

* **catalog.json** - Schema catalog for the data
* **source-config.json** - Source system credentials and configuration
* **target-config.json** - Target system credentials and configuration
* **target-catalog.json** - Target schema catalog (V2 write jobs only)
* **source-state.json** - Incremental sync state and bookmarks
* **state.json** - Current job execution state

You can access these files with:

```python theme={null}
import json
import os

config_path = f"./source-config.json"

if os.path.exists(config_path):
    with open(config_path) as f:
        config = json.load(f)
```

The `tenant-config.json`, on the other hand, is accessed from the snapshots directory. If you need to read or modify this object, your code will look something like below:

```python theme={null}
import json
import os

SNAPSHOT_DIR = f"{ROOT_DIR}/snapshots" 

tenant_config_path = f"{SNAPSHOT_DIR}/tenant-config.json"

tenant_config= None

if os.path.exists(tenant_config_path):
    with open(tenant_config_path) as f:
        tenant_config = json.load(f)
```

## Example: Read the data

To start, we will import pandas and the gluestick Python package, and establish the standard input/output directories above. Note:`snapshots` are optional, and may not be needed for your scripts.

```python etl.ipynb theme={null}
import gluestick as gs
import pandas as pd
import os

 # standard directory for hotglue
ROOT_DIR = os.environ.get("ROOT_DIR", ".")
INPUT_DIR = f"{ROOT_DIR}/sync-output"
OUTPUT_DIR = f"{ROOT_DIR}/etl-output"
SNAPSHOT_DIR = f"{ROOT_DIR}/snapshots"
```

My `sync_output` folder contains a CSV file called `campaigns`. Learn more how to get sample data in the [Debugging a script](https://docs.hotglue.com/docs/debugging-a-script) section.

![3068](https://files.readme.io/a45c8e6-Screen_Shot_2021-12-15_at_3.10.05_PM.png "Screen Shot 2021-12-15 at 3.10.05 PM.png")

Now we can go ahead and read the data in `INPUT_DIR` with gluestick's [reader](https://github.com/hotgluexyz/gluestick/blob/master/gluestick/etl_utils.py) function.

```python etl.ipynb theme={null}
# Read input data
input_data = gs.reader()
```

Gluestick automatically parses out the timestamp in the CSV name, so we can access the `campaigns` data as follows:

```python etl.ipynb theme={null}
# Get the campaigns data
campaigns = input_data.get("campaigns")
campaigns.head()
```

By calling `campaigns.head()` we can preview the data

![2218](https://files.readme.io/496ec12-Screen_Shot_2021-12-15_at_3.14.18_PM.png "Screen Shot 2021-12-15 at 3.14.18 PM.png")

## Manipulate the data

Now that you have the data as a pandas DataFrame, you can do any transformations you need to.

For this example, we will select a few of the columns: `id`, `emails_sent`, `create_time`, and `status`, and then rename them.

```python etl.ipynb theme={null}
# Select the columns we want
campaigns = campaigns[['id', 'emails_sent', 'create_time', 'status']]

# Rename the columns
campaigns = campaigns.rename(columns={
    'id': "Id",
    'emails_sent': "Sent",
    'create_time': "CreatedAt",
    'status': "Status"
})

campaigns.head()
```

This generates the final output below. You can see [samples of more complex use cases on GitHub](https://github.com/hotgluexyz/recipes) and refer to the [pandas documentation](https://pandas.pydata.org/docs/reference/frame.html).

![2236](https://files.readme.io/f4d1b4c-Screen_Shot_2021-12-15_at_3.20.24_PM.png "Screen Shot 2021-12-15 at 3.20.24 PM.png")

## Write the output data

Finally, we can write the output to the standard `OUTPUT_DIR` using pandas `to_csv` function:

```python etl.ipynb theme={null}
# Write the output
campaigns.to_csv(f"{OUTPUT_DIR}/campaigns.csv", index=False)
```

This generates the output as a CSV file in the `etl-output` directory:

![3068](https://files.readme.io/1009aee-Screen_Shot_2021-12-15_at_3.22.05_PM.png "Screen Shot 2021-12-15 at 3.22.05 PM.png")

If you want to write this data to a non-filestore target, such as a database, CRM, or any other standard Singer target, you can use Gluestick's `to_singer` function in lieu of `to_csv`:

```python etl.ipynb theme={null}
# Write the output
gs.to_singer(campaigns, "Campaigns", OUTPUT_DIR, keys=["Id"])
```
