Typically, data engineers spend their days building data pipelines. The goal of a pipeline is to get data from a data source to its intermediary or final destination. This process usually encompasses multiple technologies, both on-premise and in the cloud, and these technologies need to be perfectly aligned to move data from point A to point B. That’s why data pipelines define steps that should be executed in a specific order.
A common step in a data pipeline is getting data from a storage bucket, such as GCP Cloud Storage or Amazon Web Services (AWS) S3, into a data warehouse. These buckets are often used as data sinks, where batch jobs or streams from operational databases or Software-as-a-Service (SaaS) tools place their data in a semi-structured format. However, this is rarely the end goal, because data only becomes useful for business intelligence purposes when it’s queried, for example, in a data warehouse.
In this article, you’ll learn how to get data from AWS S3 to Snowflake, a popular cloud data warehouse solution.
Data Pipelines
Many people in most organizations are eager to access, process, and analyze data. However, most companies centralize their data to guarantee consistency and governance regarding what they track, store, and expose to data consumers. More than ever, data is exposed via specialized analytical databases—such as data lakes and data warehouses—to data scientists and other analytical roles.
Data engineers set up and maintain data pipelines to get data from a source system or data sink to an analytical database, and even back into production systems such as CRMs and advertising platforms. Typically, a data pipeline contains three elements:
- Extraction (E) of data from the source system or data sink
- Transformation (T) of the data into a specific schema or data model
- Loading (L) of the data into its destination
Often, the data technologies that an organization uses define the order in which they execute these steps. In the past, when storing data was expensive, organizations transformed data before loading it; this process was known as extract, transform, load (ETL). However, keeping both raw and transformed data in modern technologies is relatively cheap, which is why—among other reasons—many organizations have adopted an extract, load, transform (ELT) approach.
A very common ELT data pipeline is to get data from a data repository such as cloud storage into a data warehouse. These storage technologies are often used as data sinks, where snapshots from production databases or extracts from SaaS systems are dumped for further processing. When it’s time to analyze that data, a data engineer copies it to a data warehouse and prepares it for consumption.
For example, imagine that your analysts want to analyze the evolution of slow-moving properties of their customers: where they live, where they work, how many children they have, and so on. One way to enable these kinds of analyses is by taking a daily snapshot of the source customer database by unloading it to a storage bucket. The data pipeline then loads all the snapshot data from the storage bucket into the data warehouse. For organizations with millions of customers, the volume of this data could quickly grow to data sets with billions of rows.
One of the most popular storage services (if not the most popular) is Amazon’s Simple Storage Service (S3). On the data warehouse side, Snowflake can be deployed on any of the big three cloud service providers: AWS, Azure, and GCP. Since both S3 and Snowflake are popular tools within the data ecosystem, data engineers spend a lot of time moving data from S3 into Snowflake.
There are various ways to get data (such as CSV, JSON, or Parquet files) into Snowflake. For example, you can bulk load it with Snowflake’s COPY command, automate Snowflake’s access to S3 with an AWS Service Catalog, or use third-party tools such as Meltano or Airbyte.
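For reference, the manual COPY route looks roughly like the sketch below. It assumes an external stage named `my_s3_stage` pointing at your bucket and an existing target table `raw_customers`; both names are hypothetical, and the rest of this article takes the Meltano route instead.
-- Minimal sketch of a manual bulk load; the stage and table names are hypothetical.
COPY INTO raw_customers
  FROM @my_s3_stage/exports/
  FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1)
  PATTERN = '.*\.csv';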
In the next section, you’ll learn how to extract data from S3, load it into a Snowflake data warehouse, and execute a transformation using Meltano, a DataOps tool that combines the power of various open-source data tools including Singer, Airflow, and dbt.
Implementing an S3 to Snowflake Data Pipeline with Meltano
With Meltano, extracting data from S3 and loading it into Snowflake is fairly straightforward. You only need to point the tap at an S3 bucket and give it an extension and prefix to search for, and your tap is ready. Loading data into Snowflake is slightly more involved because you need to create a `FILE FORMAT` object and set up the proper roles and privileges. The tutorial below walks you through this process.
Before you begin, you’ll need to set up and install these prerequisites (a rough sketch of the commands follows the list):
- Set up AWS authentication via an access key and secret, so Meltano can access S3.
- Install Meltano on the machine from which you want to orchestrate your ELT processes.
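If you’re starting from scratch, the setup could look something like the sketch below; the package manager and the IAM user name are assumptions, so adjust them to your environment.
# Install Meltano; `pip install meltano` inside a virtual environment also works.
pipx install meltano
meltano --version

# Create an access key for an IAM user that can read the bucket (the user name is hypothetical);
# note the returned key ID and secret, since you'll pass them to the tap configuration later.
aws iam create-access-key --user-name your-meltano-user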
Create a Named File Format
First, run the `CREATE FILE FORMAT` command. Behind the scenes, the loader writes data to files and copies them into Snowflake, so you need to provide it with a Snowflake file format. In `CREATE FILE FORMAT`, specify a format type and a name that you will reference later. In the example below, the format is named with the `{your_csv_format_name}` placeholder and its type is set to `CSV`:
CREATE OR REPLACE FILE FORMAT {your_database}.{your_schema}.{your_csv_format_name}
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 0;
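Optionally, you can confirm that the file format was created before moving on, for example with the following statements (using the same placeholders):
-- Optional sanity check: list the file formats in the schema and inspect the new one.
SHOW FILE FORMATS IN SCHEMA {your_database}.{your_schema};
DESCRIBE FILE FORMAT {your_database}.{your_schema}.{your_csv_format_name};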
Configure the Role and User to Access Snowflake
Next, run some SQL commands to create a role that can access and edit the Snowflake objects that hold the data. We encourage you to follow Snowflake’s best practices for managing access control by toggling between USERADMIN and SYSADMIN, but feel free to customize these commands to meet your organization’s needs.
First, create a role that will have the necessary access rights:
USE ROLE USERADMIN;
CREATE OR REPLACE ROLE ppw_target_snowflake;

USE ROLE SYSADMIN;
GRANT USAGE ON DATABASE {your_database} TO ROLE ppw_target_snowflake;
GRANT CREATE SCHEMA ON DATABASE {your_database} TO ROLE ppw_target_snowflake;
GRANT USAGE ON SCHEMA {your_database}.{your_schema} TO ROLE ppw_target_snowflake;
GRANT USAGE ON FILE FORMAT {your_database}.{your_schema}.{your_csv_format_name} TO ROLE ppw_target_snowflake;
GRANT USAGE ON WAREHOUSE {your_warehouse} TO ROLE ppw_target_snowflake;
Then, create a user that Meltano will use to access Snowflake and assign the previously created role to the user:
USE ROLE USERADMIN;
CREATE OR REPLACE USER {your_user}
  PASSWORD = '{your_password}'
  DEFAULT_ROLE = ppw_target_snowflake
  DEFAULT_WAREHOUSE = '{your_warehouse}'
  MUST_CHANGE_PASSWORD = FALSE;
GRANT ROLE ppw_target_snowflake TO USER {your_user};
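Before moving on, it can be worth verifying that the grants landed as intended, for example:
-- Optional sanity check: list the privileges attached to the new role and user.
SHOW GRANTS TO ROLE ppw_target_snowflake;
SHOW GRANTS TO USER {your_user};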
Create and Configure a Meltano Project
Next, initialize a Meltano project using Meltano’s `init` command. This is the foundation for the centralized data platform that you’ll continue to build on:
meltano init s3-to-snowflake-project
cd s3-to-snowflake-project
Next, add the required Meltano plugins with Meltano’s `add` command. The plugins you need are the S3 tap and the Snowflake target. Add the S3 tap using the `--custom` flag, providing the following inputs when prompted:
meltano add --custom extractor tap-s3-csv
# (namespace) [tap_s3_csv]: <hit enter, no value needed>
# (pip_url) [tap-s3-csv]: pipelinewise-tap-s3-csv
# (executable) [pipelinewise-tap-s3-csv]: tap-s3-csv
# (capabilities) [[]]: properties,discover,state
# (settings) [[]]: aws_access_key_id:password,aws_secret_access_key:password
Then, add the Snowflake target:
meltano add loader target-snowflake
If you’re curious about what’s happening behind the scenes, you’ll notice that the plugins are added to your `meltano.yml` file. For each plugin, you should also run the `invoke` command with the `--help` flag, which invokes the plugin and simply prints its help message. If you see the help messages, the plugins have been installed successfully:
meltano --log-level=debug invoke tap-s3-csv --help
meltano --log-level=debug invoke target-snowflake --help
Configure the S3 Tap and Snowflake Target
Now, configure the S3 tap that will extract the data from its source bucket.
To begin, the S3 tap needs the following parameters:

- `start_date`: Cutoff date for files to consider
- `bucket`: The name of the bucket from which you want to extract CSV files
- `aws_access_key_id`: Your AWS access key
- `aws_secret_access_key`: Your AWS secret key
- `tables`: An array of objects that define the files you want to extract from S3

With the Meltano CLI, you can use the following commands to set them:
meltano config tap-s3-csv set start_date 2021-11-02T00:00:00Z
meltano config tap-s3-csv set bucket your-initial-s3-bucket
meltano config tap-s3-csv set aws_access_key_id {your_aws_access_id}
meltano config tap-s3-csv set aws_secret_access_key {your_aws_access_secret}
meltano config tap-s3-csv set tables '[{"table_name": "{your_table_name}", "search_pattern": "\\.csv", "search_prefix": "{your_s3_prefix}/", "delimiter": ",", "key_properties": ["id"]}]'
A couple of things to note about the configurations:
- Any sensitive settings, such as `aws_access_key_id`, will have been written to a `.env` file in your project directory (see the sketch below). This file is automatically excluded when you check your code into a git repository.
- The `search_pattern` within the `tables` array needs to be properly escaped, as shown in the example: the intended value `\.csv` needs to be set to `\\.csv` within the input object.
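For reference, the `.env` entries should look roughly like the sketch below. Meltano derives the variable names from the plugin and setting names; the values are the placeholders used earlier, and the exact quoting may differ.
# .env (written by `meltano config tap-s3-csv set ...`)
TAP_S3_CSV_AWS_ACCESS_KEY_ID='{your_aws_access_id}'
TAP_S3_CSV_AWS_SECRET_ACCESS_KEY='{your_aws_access_secret}'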
If you have any challenges while getting your extractor configured, refer to the troubleshooting section in the docs or join the [Slack community](ADD SLACK LINK!!!), where you can ask questions of other community members.
Next, run the following commands to configure your Snowflake loader:
meltano config target-snowflake set account {your_account}
meltano config target-snowflake set dbname {your_database}
meltano config target-snowflake set user {your_user}
meltano config target-snowflake set password {your_password}
meltano config target-snowflake set warehouse {your_warehouse}
meltano config target-snowflake set file_format {your_schema}.{your_csv_format}
meltano config target-snowflake set default_target_schema {your_schema}
After you’ve completed the configuration steps via the CLI, your `meltano.yml` should be updated to look similar to the following:
…
plugins:
  extractors:
  - name: tap-s3-csv
    namespace: tap_s3_csv
    pip_url: git+https://github.com/transferwise/pipelinewise-tap-s3-csv.git
    executable: tap-s3-csv
    capabilities:
    - catalog
    - discover
    - state
    settings:
    - name: aws_access_key_id
      kind: password
    - name: aws_secret_access_key
      kind: password
  loaders:
  - name: target-snowflake
    variant: transferwise
    pip_url: pipelinewise-target-snowflake
environments:
- name: dev
  config:
    plugins:
      extractors:
      - name: tap-s3-csv
        config:
          start_date: '2021-11-02T00:00:00Z'
          bucket: your-initial-s3-bucket
          tables:
          - table_name: my_table
            search_pattern: \.csv
            search_prefix: exports/
            delimiter: ','
            key_properties: [id]
      loaders:
      - name: target-snowflake
        config:
          account: your_account
          dbname: your_database
          user: your_user
          warehouse: your_warehouse
          file_format: your_schema.your_csv_format
          default_target_schema: YOUR_SCHEMA
…
Run the Meltano Data Pipeline
Now that you have everything set up, there is only one more thing to do: run the pipeline with the `run` command. If everything goes well, the data from your S3 bucket will be loaded into Snowflake:
meltano run tap-s3-csv target-snowflake
Behind the scenes, Meltano manages much of the complexity that comes with running the Singer extractors and loaders together. If your extractor supports bookmarked state between runs, Meltano will store and retrieve it for you, so after the initial load, each subsequent execution only extracts the incremental changes.
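A quick way to confirm the load is to query the target table in Snowflake. This is a minimal check using the placeholders from earlier; the loader creates the table in the `default_target_schema` you configured, and it may uppercase the identifiers.
-- Sanity check after the run; placeholders match the earlier configuration.
SELECT COUNT(*) AS row_count
FROM {your_database}.{your_schema}.{your_table_name};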
What’s Next?
Now that you’ve set up a basic data pipeline using Meltano, there are so many other things you can discover.
If you want to explore more, try executing transformations with the dbt plugin after loading the data into Snowflake, schedule the pipeline with the Airflow orchestrator plugin, or chain more extractors and loaders together with Meltano’s `run` command. Once your data is transformed, you can also pull a plugin such as Superset into your project to analyze it and build dashboards.
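As a rough sketch of what those next steps could look like, the commands below add a dbt transformer and the Airflow orchestrator and schedule the pipeline; plugin names and command syntax can vary between Meltano versions, so treat this as a starting point rather than a recipe.
# Sketch of possible next steps; verify the exact syntax against your Meltano version.
meltano add transformer dbt-snowflake        # dbt transformations against Snowflake
meltano add orchestrator airflow             # scheduling and orchestration
meltano schedule add daily-s3-to-snowflake \
  --extractor tap-s3-csv --loader target-snowflake --interval "@daily"

# Chain extraction, loading, and a dbt run in a single invocation:
meltano run tap-s3-csv target-snowflake dbt-snowflake:run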
You can also join Meltano’s growing community to discover new features and discuss the intricacies of ELT pipelines.
Thanks to our guest writer, Roel Peters.