Typically, data engineers spend their days building data pipelines. The goal of a pipeline is to get data from a data source to its intermediary or final destination. This process usually encompasses multiple technologies, both on-premise and in the cloud, and these technologies need to be perfectly aligned to move data from point A to point B. That’s why data pipelines define steps that should be executed in a specific order.
A common step in a data pipeline is getting data from a storage bucket, such as GCP Cloud Storage or Amazon Web Services (AWS) S3, into a data warehouse. These buckets are often used as data sinks, where batch jobs or streams from operational databases or Software-as-a-Service (SaaS) tools place their data in a semi-structured format. However, this is rarely the end goal, because data only becomes useful for business intelligence purposes when it’s queried, for example, in a data warehouse.
In this article, you’ll learn how to get data from AWS S3 to Snowflake, a popular cloud data warehouse solution.
Data Pipelines
Many people in most organizations are eager to access, process, and analyze data. However, most companies centralize their data to guarantee consistency and governance regarding what they track, store, and expose to data consumers. More than ever, data is exposed via specialized analytical databases—such as data lakes and data warehouses—to data scientists and other analytical roles.
Data engineers set up and maintain data pipelines to get data from a source system or data sink to an analytical database, and even back into production systems such as CRMs and advertising platforms. Typically, a data pipeline contains three elements:
- Extraction (E) of data from the source system or data sink
- Transformation (T) of the data into a specific schema or data model
- Loading (L) of the data into its destination
Often, the data technologies that an organization uses define the order in which they execute these steps. In the past, when storing data was expensive, organizations transformed data before loading it; this process was known as extract, transform, load (ETL). However, keeping both raw and transformed data in modern technologies is relatively cheap, which is why—among other reasons—many organizations have adopted an extract, load, transform (ELT) approach.
A very common ELT data pipeline is to get data from a data repository such as cloud storage into a data warehouse. These storage technologies are often used as data sinks, where snapshots from production databases or extracts from SaaS systems are dumped for further processing. When it’s time to analyze that data, a data engineer copies it to a data warehouse and prepares it for consumption.
For example, imagine that your analysts want to analyze the evolution of slow-moving properties of their customers: where they live, where they work, how many children they have, and so on. One way to enable these kinds of analyses is by taking a daily snapshot of the source customer database by unloading it to a storage bucket. The data pipeline then loads all the snapshot data from the storage bucket into the data warehouse. For organizations with millions of customers, the volume of this data could quickly grow to data sets with billions of rows.
One of the most popular storage services (if not the most popular) is Amazon’s Simple Storage Service (S3). On the data warehouse side, Snowflake can be deployed on any of the big three cloud service providers: AWS, Azure, and GCP. Since both S3 and Snowflake are popular tools within the data ecosystem, data engineers spend a lot of time moving data from S3 into Snowflake.
There are various ways to get data (such as CSV, JSON, or Parquet files) into Snowflake. For example, you can bulk load it with Snowflake’s COPY command, automate Snowflake’s access to S3 with an AWS Service Catalog, or use third-party tools such as Meltano or Airbyte.
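For reference, the manual COPY route looks roughly like the sketch below. It assumes an external stage named `my_s3_stage` pointing at your bucket and an existing target table `raw_customers`; both names are hypothetical, and the rest of this article takes the Meltano route instead.
-- Minimal sketch of a manual bulk load; the stage and table names are hypothetical.
COPY INTO raw_customers
  FROM @my_s3_stage/exports/
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1)
  PATTERN = '.*\.csv';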
In the next section, you’ll learn how to extract data from S3, load it into a Snowflake data warehouse, and execute a transformation using Meltano, a DataOps tool that combines the power of various open-source data tools including Singer, Airflow, and dbt.
Implementing an S3 to Snowflake Data Pipeline with Meltano
With Meltano, extracting data from S3 and loading it into Snowflake is fairly straightforward. You only need to point the tap at an S3 bucket and give it an extension and prefix to search for, and your tap is ready. Loading data into Snowflake is slightly more involved because you need to create a `FILE FORMAT` object and set up the proper roles and privileges. The tutorial below walks you through this process.
Before you begin, you’ll need to set up and install these prerequisites (a rough sketch of the commands follows the list):
- Set up AWS authentication via an access key and secret, so Meltano can access S3.
- Install Meltano on the machine from which you want to orchestrate your ELT processes.
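If you’re starting from scratch, the setup could look something like the sketch below; the package manager and the IAM user name are assumptions, so adjust them to your environment.
# Install Meltano; `pip install meltano` inside a virtual environment also works.
pipx install meltano
meltano --version

# Create an access key for an IAM user that can read the bucket (the user name is hypothetical);
# note the returned key ID and secret, since you'll pass them to the tap configuration later.
aws iam create-access-key --user-name your-meltano-user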
Create a Named File Format
First, run the `CREATE FILE FORMAT` command. Behind the scenes, the loader writes data to files and copies them into Snowflake, so you need to provide it with a Snowflake file format. In `CREATE FILE FORMAT`, specify a format type and a name that you will reference later. In the example below, the format is named with the `{your_csv_format_name}` placeholder and its type is set to `CSV`:
CREATE OR REPLACE FILE FORMAT {your_database}.{your_schema}.{your_csv_format_name}
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 0;
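Optionally, you can confirm that the file format was created before moving on, for example with the following statements (using the same placeholders):
-- Optional sanity check: list the file formats in the schema and inspect the new one.
SHOW FILE FORMATS IN SCHEMA {your_database}.{your_schema};
DESCRIBE FILE FORMAT {your_database}.{your_schema}.{your_csv_format_name};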
Configure the Role and User to Access Snowflake
Next, run some SQL commands to create a role that can access and edit the Snowflake objects that hold the data. We encourage you to follow Snowflake’s best practices for managing access control by toggling between USERADMIN and SYSADMIN, but feel free to customize these commands to meet your organization’s needs.
First, create a role that will have the necessary access rights:
USE ROLE USERADMIN;
CREATE OR REPLACE ROLE ppw_target_snowflake;

USE ROLE SYSADMIN;
GRANT USAGE ON DATABASE {your_database} TO ROLE ppw_target_snowflake;
GRANT CREATE SCHEMA ON DATABASE {your_database} TO ROLE ppw_target_snowflake;
GRANT USAGE ON SCHEMA {your_database}.{your_schema} TO ROLE ppw_target_snowflake;
GRANT USAGE ON FILE FORMAT {your_database}.{your_schema}.{your_csv_format_name} TO ROLE ppw_target_snowflake;
GRANT USAGE ON WAREHOUSE {your_warehouse} TO ROLE ppw_target_snowflake;
Then, create a user that Meltano will use to access Snowflake and assign the previously created role to the user:
USE ROLE USERADMIN;
CREATE OR REPLACE USER {your_user}
  PASSWORD = '{your_password}'
  DEFAULT_ROLE = ppw_target_snowflake
  DEFAULT_WAREHOUSE = '{your_warehouse}'
  MUST_CHANGE_PASSWORD = FALSE;
GRANT ROLE ppw_target_snowflake TO USER {your_user};
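Before moving on, it can be worth verifying that the grants landed as intended, for example:
-- Optional sanity check: list the privileges attached to the new role and user.
SHOW GRANTS TO ROLE ppw_target_snowflake;
SHOW GRANTS TO USER {your_user};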
Create and Configure a Meltano Project
Next, initialize a Meltano project using Meltano’s `init` command. This is the foundation for the centralized data platform that you’ll continue to build on:
meltano init s3-to-snowflake-project
cd s3-to-snowflake-project
Next, add the required Meltano plugins with Meltano’s `add` command. The plugins you need are the S3 tap and the Snowflake target. Add the S3 tap using the `--custom` flag, providing the following inputs when prompted:
meltano add --custom extractor tap-s3-csv
# (namespace) [tap_s3_csv]: <hit enter, no value needed>
# (pip_url) [tap-s3-csv]: pipelinewise-tap-s3-csv
# (executable) [pipelinewise-tap-s3-csv]: tap-s3-csv
# (capabilities) [[]]: properties,discover,state
# (settings) [[]]: aws_access_key_id:password,aws_secret_access_key:password
Then, add the Snowflake target:
meltano add loader target-snowflake
If you’re curious about what’s happening behind the scenes, you’ll notice that the plugins are added to your `meltano.yml` file. For each plugin, you should also run the `invoke` command with the `--help` flag, which invokes the plugin and simply prints its help message. If you see the help messages, the plugins have been installed successfully:
meltano --log-level=debug invoke tap-s3-csv --help
meltano --log-level=debug invoke target-snowflake --help
Configure the S3 Tap and Snowflake Target
Now, configure the S3 tap that will extract the data from its source bucket.
To begin, the S3 tap needs the following parameters:

- `start_date`: Cutoff date for files to consider
- `bucket`: The name of the bucket from which you want to extract CSV files
- `aws_access_key_id`: Your AWS access key
- `aws_secret_access_key`: Your AWS secret key
- `tables`: An array of objects that define the files you want to extract from S3

With the Meltano CLI, you can use the following commands to set them:
meltano config tap-s3-csv set start_date 2021-11-02T00:00:00Z
meltano config tap-s3-csv set bucket your-initial-s3-bucket
meltano config tap-s3-csv set aws_access_key_id {your_aws_access_id}
meltano config tap-s3-csv set aws_secret_access_key {your_aws_access_secret}
meltano config tap-s3-csv set tables '[{"table_name": "{your_table_name}", "search_pattern": "\\.csv", "search_prefix": "{your_s3_prefix}/", "delimiter": ",", "key_properties": ["id"]}]'
A couple of things to note about the configurations:
- Any sensitive settings, such as `aws_access_key_id`, will have been written to a `.env` file in your project directory (see the sketch below). This file is automatically excluded when you check your code into a git repository.
- The `search_pattern` within the `tables` array needs to be properly escaped, as shown in the example: the intended value `\.csv` needs to be set to `\\.csv` within the input object.
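For reference, the `.env` entries should look roughly like the sketch below. Meltano derives the variable names from the plugin and setting names; the values are the placeholders used earlier, and the exact quoting may differ.
# .env (written by `meltano config tap-s3-csv set ...`)
TAP_S3_CSV_AWS_ACCESS_KEY_ID='{your_aws_access_id}'
TAP_S3_CSV_AWS_SECRET_ACCESS_KEY='{your_aws_access_secret}'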
If you have any challenges while getting your extractor configured, refer to the troubleshooting section in the docs or join the [Slack community](ADD SLACK LINK!!!), where you can ask questions of other community members.
Next, run the following commands to configure your Snowflake loader:
meltano config target-snowflake set account {your_account}
meltano config target-snowflake set dbname {your_database}
meltano config target-snowflake set user {your_user}
meltano config target-snowflake set password {your_password}
meltano config target-snowflake set warehouse {your_warehouse}
meltano config target-snowflake set file_format {your_schema}.{your_csv_format}
meltano config target-snowflake set default_target_schema {your_schema}
After you’ve completed the configuration steps via the CLI, your `meltano.yml` should be updated to look similar to the following:
…
plugins:
  extractors:
  - name: tap-s3-csv
    namespace: tap_s3_csv
    pip_url: git+https://github.com/transferwise/pipelinewise-tap-s3-csv.git
    executable: tap-s3-csv
    capabilities:
    - catalog
    - discover
    - state
    settings:
    - name: aws_access_key_id
      kind: password
    - name: aws_secret_access_key
      kind: password
  loaders:
  - name: target-snowflake
    variant: transferwise
    pip_url: pipelinewise-target-snowflake
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-s3-csv
        config:
          start_date: '2021-11-02T00:00:00Z'
          bucket: your-initial-s3-bucket
          tables:
          - table_name: my_table
            search_pattern: \.csv
            search_prefix: exports/
            delimiter: ','
            key_properties: [id]
      loaders:
      - name: target-snowflake
        config:
          account: your_account
          dbname: your_database
          user: your_user
          warehouse: your_warehouse
          file_format: your_schema.your_csv_format
          default_target_schema: YOUR_SCHEMA
…
Run the Meltano Data Pipeline
Now that you have everything set up, there is only one more thing to do: run the pipeline with the `run` command. If everything goes well, the data from your S3 bucket will be loaded into Snowflake:
meltano run tap-s3-csv target-snowflake
Behind the scenes, Meltano manages much of the complexity that comes with running the Singer extractors and loaders together. If your extractor supports bookmarked state between runs, Meltano will store and retrieve it for you, so after the initial load, each subsequent execution only extracts the incremental changes.
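A quick way to confirm the load is to query the target table in Snowflake. This is a minimal check using the placeholders from earlier; the loader creates the table in the `default_target_schema` you configured, and it may uppercase the identifiers.
-- Sanity check after the run; placeholders match the earlier configuration.
SELECT COUNT(*) AS row_count
FROM {your_database}.{your_schema}.{your_table_name};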
What’s Next?
Now that you’ve set up a basic data pipeline using Meltano, there are so many other things you can discover.
If you want to explore more, try executing transformations with the dbt plugin after loading the data into Snowflake, schedule the pipeline with the Airflow orchestrator plugin, or chain more extractors and loaders together with Meltano’s `run` command. Once your data is transformed, you can also pull a plugin such as Superset into your project to analyze it and build dashboards.
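As a rough sketch of what those next steps could look like, the commands below add a dbt transformer and the Airflow orchestrator and schedule the pipeline; plugin names and command syntax can vary between Meltano versions, so treat this as a starting point rather than a recipe.
# Sketch of possible next steps; verify the exact syntax against your Meltano version.
meltano add transformer dbt-snowflake        # dbt transformations against Snowflake
meltano add orchestrator airflow             # scheduling and orchestration
meltano schedule add daily-s3-to-snowflake \
  --extractor tap-s3-csv --loader target-snowflake --interval "@daily"

# Chain extraction, loading, and a dbt run in a single invocation:
meltano run tap-s3-csv target-snowflake dbt-snowflake:run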
You can also join Meltano’s growing community to discover new features and discuss the intricacies of ELT pipelines.
Thanks to our guest writer, Roel Peters.