# Getting Started
Welcome! If you're ready to get started with Meltano and run an EL(T) pipeline with a data source and destination of your choosing, you've come to the right place!
Short on time, or just curious what the fuss is about?
To get a sense of the Meltano experience in just a few minutes, follow the examples on the homepage.
They can be copy-pasted right into your terminal and will take you all the way through installation, data integration (EL), data transformation (T), orchestration, and containerization with the `tap-gitlab` extractor and the `target-jsonl` and `target-postgres` loaders.
## Install Meltano

Before you can get started with Meltano and the `meltano` CLI, you'll need to install it onto your system.
To learn more about the different installation methods, refer to the Installation guide.
### Local installation
If you're running Linux or macOS and have Python 3.6, 3.7 or 3.8 installed, we recommend installing Meltano into a dedicated Python virtual environment inside the directory that will hold your Meltano projects.
Create and navigate to a directory to hold your Meltano projects:
```bash
mkdir meltano-projects
cd meltano-projects
```
Create and activate a virtual environment for Meltano inside the `.venv` directory:

```bash
python3 -m venv .venv
source .venv/bin/activate
```
Install the `meltano` package from PyPI:

```bash
pip3 install meltano
```
Optionally, verify that the `meltano` CLI is now available by viewing the version:

```bash
meltano --version
```
If anything's not behaving as expected, refer to the "Local Installation" section of the Installation guide for more details.
### Docker installation
Alternatively, and assuming you already have Docker installed and running, you can use the `meltano/meltano` Docker image, which exposes the `meltano` CLI command as its entrypoint.
Pull or update the latest version of the Meltano Docker image:
```bash
docker pull meltano/meltano:latest
```
By default, this image comes with the oldest version of Python supported by Meltano, currently 3.6. If you'd like to use Python 3.7 or 3.8 instead, add a `-python<X.Y>` suffix to the image tag, e.g. `latest-python3.8`.
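For example, to pull the image variant that ships with Python 3.8:

```bash
docker pull meltano/meltano:latest-python3.8
```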
Optionally, verify that the `meltano` CLI is now available through the Docker image by viewing the version:

```bash
docker run meltano/meltano --version
```
Now, whenever this guide or the documentation asks you to run the `meltano` command, you'll need to run it using `docker run meltano/meltano <args>` as in the example above.
When running a `meltano` subcommand that requires access to your project (which you'll create in the next step), you'll also need to mount the project directory into the container and set it as the container's working directory:

```bash
docker run -v $(pwd):/project -w /project meltano/meltano <args>
```
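If you'd rather not type out the full `docker run` invocation every time, one option (not required by this guide) is a shell alias; a minimal sketch, assuming a Bash-compatible shell:

```bash
# Hypothetical convenience alias; the single quotes defer $(pwd) evaluation
# until the alias is used, so the then-current directory is mounted each time:
alias meltano='docker run --interactive -v $(pwd):/project -w /project meltano/meltano'
```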
If anything's not behaving as expected, refer to the "Installing on Docker" section of the Installation guide for more details.
## Create your Meltano project
Now that you have a way of running the `meltano` CLI, it's time to create a new Meltano project that (among other things) will hold the plugins that implement the various details of your ELT pipelines.
To learn more about Meltano projects, refer to the Projects concept doc.
Navigate to the directory that will hold your Meltano projects, if you didn't already do so earlier:

```bash
mkdir meltano-projects
cd meltano-projects
```
Initialize a new project in a directory of your choosing using `meltano init`:

```bash
meltano init <project directory name>

# For example:
meltano init my-meltano-project

# If you're using Docker, don't forget to mount the current working directory:
docker run -v $(pwd):/projects -w /projects meltano/meltano init my-meltano-project
```
This will create a new directory with, among other things, your `meltano.yml` project file:

```yaml
version: 1
send_anonymous_usage_stats: true
project_id: <random UUID>
```
It doesn't define any plugins or pipeline schedules yet, but note that the `send_anonymous_usage_stats` setting is enabled by default. To disable it, change the value to `false` and optionally remove the `project_id` setting.
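For example, an opted-out project file would look like this (with the `project_id` setting removed, as described above):

```yaml
version: 1
send_anonymous_usage_stats: false
```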
Navigate to the newly created project directory:
```bash
cd <project directory>

# For example:
cd my-meltano-project
```
Optionally, if you'd like to version control your changes, initialize a Git repository and create an initial commit:
```bash
git init
git add --all
git commit -m 'Initial Meltano project'
```
This will allow you to use `git diff` to easily check the impact of the `meltano` commands you'll run below on your project files, most notably your `meltano.yml` project file.
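For example, after any of the `meltano` commands below:

```bash
# Review pending changes to the project file:
git diff meltano.yml
```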
## Add an extractor to pull data from a source
Now that you have your very own Meltano project, it's time to add some plugins to it!
The first plugin you'll want to add is an extractor, which will be responsible for pulling data out of your data source.
To learn more about adding plugins to your project, refer to the Plugin Management guide.
Find out if an extractor for your data source is supported out of the box by checking the Sources list or using `meltano discover`:

```bash
meltano discover extractors
```
Depending on the result, pick your next step:
If an extractor is supported out of the box, add it to your project using `meltano add`:

```bash
meltano add extractor <plugin name>

# For example:
meltano add extractor tap-gitlab

# If you have a preference for a non-default variant, select it using `--variant`:
meltano add extractor tap-gitlab --variant=singer-io

# If you're using Docker, don't forget to mount the project directory:
docker run -v $(pwd):/project -w /project meltano/meltano add extractor tap-gitlab
```
This will add the new plugin to your `meltano.yml` project file:

```yaml
plugins:
  extractors:
    - name: tap-gitlab
      variant: meltano
      pip_url: git+https://gitlab.com/meltano/tap-gitlab.git
```
You can now skip ahead to the verification step at the end of this section.
If an extractor is not yet discoverable, find out if a Singer tap for your data source already exists by checking Singer's index of taps and/or doing a web search for `Singer tap <data source>`, e.g. `Singer tap COVID-19`.
Depending on the result, pick your next step:
If a Singer tap for your data source is available, add it to your project as a custom plugin using `meltano add --custom`:

```bash
meltano add --custom extractor <tap name>

# For example:
meltano add --custom extractor tap-covid-19

# If you're using Docker, don't forget to mount the project directory,
# and ensure that interactive mode is enabled so that Meltano can ask you
# additional questions about the plugin and get your answers over STDIN:
docker run --interactive -v $(pwd):/project -w /project meltano/meltano add --custom extractor tap-covid-19
```
Meltano will now ask you some additional questions to learn more about the plugin.
This will add the new plugin to your `meltano.yml` project file:

```yaml
plugins:
  extractors:
    - name: tap-covid-19
      namespace: tap_covid_19
      pip_url: tap-covid-19
      executable: tap-covid-19
      capabilities:
        - catalog
        - discover
        - state
      settings:
        - name: api_token
        - name: user_agent
        - name: start_date
```
To learn more about adding custom plugins, refer to the Plugin Management guide.
TIP: Once you've got the extractor working in your project, please consider contributing its description to the index of discoverable plugins so that it can be supported out of the box for new users!
If a Singer tap for your data source doesn't exist yet, learn how to build your own tap by following the "Create a Custom Extractor" tutorial or Singer's "Developing a Tap" guide.
Once you've got your new tap project set up, you can add it to your Meltano project as a custom plugin by following the `meltano add --custom` instructions above. When asked to provide a `pip install` argument, you can provide a local directory path or Git repository URL.
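Both forms are ordinary `pip install` arguments; the path and URL below are hypothetical, shown only for illustration:

```bash
# A local directory, installed in editable mode so code changes take effect immediately:
-e /path/to/tap-my-source

# A Git repository URL:
git+https://github.com/example/tap-my-source.git
```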
Optionally, verify that the extractor was installed successfully and that its executable can be invoked using `meltano invoke`:

```bash
meltano invoke <plugin> --help

# For example:
meltano invoke tap-gitlab --help
```
If you see the extractor's help message printed, the plugin was definitely installed successfully, but an error message related to missing configuration or an unimplemented `--help` flag would also confirm that Meltano can invoke the plugin's executable.
## Configure the extractor
Chances are that the extractor you just added to your project will require some amount of configuration before it can start extracting data.
To learn more about managing the configuration of your plugins, refer to the Configuration guide.
What if I already have a config file for this extractor?
If you've used this Singer tap before without Meltano, you may have a config file already.
If you'd like to use the same configuration with Meltano, you can skip this section and copy and paste the JSON config object into your `meltano.yml` project file under the plugin's `config` key:

```yaml
extractors:
  - name: tap-example
    config: {
      "setting": "value",
      "another_setting": true
    }
```

Since YAML is a superset of JSON, the object needs to be indented correctly, but its formatting does not otherwise need to be changed.
Find out what settings your extractor supports using `meltano config <plugin> list`:

```bash
meltano config <plugin> list

# For example:
meltano config tap-gitlab list
```
Assuming the previous command listed at least one setting, set appropriate values using `meltano config <plugin> set`:

```bash
meltano config <plugin> set <setting> <value>

# For example:
meltano config tap-gitlab set projects "meltano/meltano meltano/tap-gitlab"
meltano config tap-gitlab set start_date 2020-10-01T00:00:00Z
meltano config tap-gitlab set private_token my_private_token
```
This will add the non-sensitive configuration to your `meltano.yml` project file:

```yaml
plugins:
  extractors:
    - name: tap-gitlab
      variant: meltano
      config:
        projects: meltano/meltano meltano/tap-gitlab
        start_date: '2020-10-01T00:00:00Z'
```
Sensitive configuration (like `private_token`) will instead be stored in your project's `.env` file so that it will not be checked into version control:

```bash
export TAP_GITLAB_PRIVATE_TOKEN=my_private_token
```
Optionally, verify that the configuration looks like what the Singer tap expects according to its documentation using `meltano config <plugin>`:

```bash
meltano config <plugin>

# For example:
meltano config tap-gitlab
```
This will show the current configuration:
{ "api_url": "https://gitlab.com", "private_token": "my_private_token", "groups": "", "projects": "meltano/meltano meltano/tap-gitlab", "ultimate_license": false, "fetch_merge_request_commits": false, "fetch_pipelines_extended": false, "start_date": "2020-10-01T00:00:00Z" }
## Select entities and attributes to extract
Now that the extractor has been configured, it'll know where and how to find your data, but not yet which specific entities and attributes (tables and columns) you're interested in.
By default, Meltano will instruct extractors to extract all supported entities and attributes, but it's recommended that you select only those you actually need, to improve performance and save on bandwidth and storage.
To learn more about selecting entities and attributes for extraction, refer to the Data Integration (EL) guide.
What if I already have a catalog file for this extractor?
If you've used this Singer tap before without Meltano, you may have generated a catalog file already.
If you'd like Meltano to use it instead of generating a catalog based on the entity selection rules you'll be asked to specify below, you can skip this section and either set the `catalog` extractor extra or use `meltano elt`'s `--catalog` option when running the data integration (EL) pipeline later on in this guide.
Find out whether the extractor supports entity selection, and if so, what entities and attributes are available, using `meltano select --list --all`:

```bash
meltano select --list --all <plugin>

# For example:
meltano select --list --all tap-covid-19
```
If this command fails with an error, this usually means that the Singer tap does not support catalog discovery mode, and will always extract all supported entities and attributes.
Assuming the previous command succeeded, select the desired entities and attributes for extraction using `meltano select`:

```bash
meltano select <plugin> <entity> <attribute>
meltano select <plugin> --exclude <entity> <attribute>

# For example:
meltano select tap-covid-19 eu_daily date
meltano select tap-covid-19 eu_daily country
meltano select tap-covid-19 eu_daily cases
meltano select tap-covid-19 eu_daily deaths

# Include all attributes of an entity
meltano select tap-covid-19 eu_ecdc_daily "*"

# Exclude matching attributes of all entities
meltano select tap-covid-19 --exclude "*" "git_*"
```
As you can see in the example, entity and attribute identifiers can contain wildcards (`*`) to match multiple entities or attributes at once.
This will add the selection rules to your `meltano.yml` project file:

```yaml
plugins:
  extractors:
    - name: tap-covid-19
      select:
        - eu_daily.date
        - eu_daily.country
        - eu_daily.cases
        - eu_daily.deaths
        - eu_ecdc_daily.*
        - '!*.git_*'
```
Optionally, verify that only the intended entities and attributes are now selected using `meltano select --list`:

```bash
meltano select --list <plugin>

# For example:
meltano select --list tap-covid-19
```
## Choose how to replicate each entity
If the data source you'll be pulling data from is a database, like PostgreSQL or MongoDB, your extractor likely requires one final setup step: setting a replication method for each selected entity (table).
Extractors for SaaS APIs typically hard-code the appropriate replication method for each supported entity, so if you're using one, you can skip this section and move on to setting up a loader.
Most database extractors, on the other hand, support two or more of the following replication methods and require you to choose an appropriate option for each table through the `replication-method` stream metadata key:
- `LOG_BASED`: Log-based Incremental Replication

  The extractor uses the database's binary log files to identify what records were inserted, updated, and deleted from the table since the last run (if any), and extracts only these records. This option is not supported by all databases and database extractors.

- `INCREMENTAL`: Key-based Incremental Replication

  The extractor uses the value of a specific column on the table (the Replication Key, e.g. an `updated_at` timestamp or incrementing `id` integer) to identify what records were inserted or updated (but not deleted) since the last run (if any), and extracts only those records.

- `FULL_TABLE`: Full Table Replication

  The extractor extracts all available records in the table on every run.
To learn more about replication methods, refer to the Data Integration (EL) guide.
Find out which replication methods (i.e. options for the `replication-method` stream metadata key) the extractor supports by checking its documentation or the README in its repository.
Set the desired `replication-method` metadata for each selected entity using `meltano config <plugin> set` and the extractor's `metadata` extra:

```bash
meltano config <plugin> set _metadata <entity> replication-method <LOG_BASED|INCREMENTAL|FULL_TABLE>

# For example:
meltano config tap-postgres set _metadata some_entity_id replication-method INCREMENTAL
meltano config tap-postgres set _metadata other_entity replication-method FULL_TABLE

# Set replication-method metadata for all entities
meltano config tap-postgres set _metadata '*' replication-method INCREMENTAL

# Set replication-method metadata for matching entities
meltano config tap-postgres set _metadata '*_full' replication-method FULL_TABLE
```
As you can see in the example, entity identifiers can contain wildcards (`*`) to match multiple entities at once.
If you've set a table's `replication-method` to `INCREMENTAL`, also choose a Replication Key by setting the `replication-key` metadata:

```bash
meltano config <plugin> set _metadata <entity> replication-key <column>

# For example:
meltano config tap-postgres set _metadata some_entity_id replication-key updated_at
meltano config tap-postgres set _metadata some_entity_id replication-key id
```
This will add the metadata rules to your `meltano.yml` project file:

```yaml
plugins:
  extractors:
    - name: tap-postgres
      metadata:
        some_entity_id:
          replication-method: INCREMENTAL
          replication-key: id
        other_entity:
          replication-method: FULL_TABLE
        '*':
          replication-method: INCREMENTAL
        '*_full':
          replication-method: FULL_TABLE
```
Optionally, verify that the stream metadata for each table was set correctly in the extractor's generated catalog file by dumping it using `meltano invoke --dump=catalog <plugin>`:

```bash
meltano invoke --dump=catalog <plugin>

# For example:
meltano invoke --dump=catalog tap-postgres
```
## Add a loader to send data to a destination
Now that your Meltano project has everything it needs to pull data from your source, it's time to tell it where that data should go!
This is where the loader comes in, which will be responsible for loading extracted data into an arbitrary data destination.
To learn more about adding plugins to your project, refer to the Plugin Management guide.
Find out if a loader for your data destination is supported out of the box by checking the Destinations list or using `meltano discover`:

```bash
meltano discover loaders
```
Depending on the result, pick your next step:
If a loader is supported out of the box, add it to your project using `meltano add`:

```bash
meltano add loader <plugin name>

# For example:
meltano add loader target-postgres

# If you have a preference for a non-default variant, select it using `--variant`:
meltano add loader target-postgres --variant=transferwise
```
This will add the new plugin to your `meltano.yml` project file:

```yaml
plugins:
  loaders:
    - name: target-postgres
      variant: datamill-co
      pip_url: singer-target-postgres
```
You can now skip ahead to the verification step at the end of this section.
If a loader is not yet discoverable, find out if a Singer target for your data destination already exists by checking Singer's index of targets and/or doing a web search for `Singer target <data destination>`, e.g. `Singer target BigQuery`.
Depending on the result, pick your next step:
If a Singer target for your data destination is available, add it to your project as a custom plugin using `meltano add --custom`:

```bash
meltano add --custom loader <target name>

# For example:
meltano add --custom loader target-bigquery

# If you're using Docker, don't forget to mount the project directory,
# and ensure that interactive mode is enabled so that Meltano can ask you
# additional questions about the plugin and get your answers over STDIN:
docker run --interactive -v $(pwd):/project -w /project meltano/meltano add --custom loader target-bigquery
```
Meltano will now ask you some additional questions to learn more about the plugin.
This will add the new plugin to your `meltano.yml` project file:

```yaml
plugins:
  loaders:
    - name: target-bigquery
      namespace: target_bigquery
      pip_url: target-bigquery
      executable: target-bigquery
      settings:
        - name: project_id
        - name: dataset_id
        - name: table_id
```
To learn more about adding custom plugins, refer to the Plugin Management guide.
TIP: Once you've got the loader working in your project, please consider contributing its description to the index of discoverable plugins so that it can be supported out of the box for new users!
If a Singer target for your data destination doesn't exist yet, learn how to build your own target by following Singer's "Developing a Target" guide.
Once you've got your new target project set up, you can add it to your Meltano project as a custom plugin by following the `meltano add --custom` instructions above. When asked to provide a `pip install` argument, you can provide a local directory path or Git repository URL.
Optionally, verify that the loader was installed successfully and that its executable can be invoked using `meltano invoke`:

```bash
meltano invoke <plugin> --help

# For example:
meltano invoke target-postgres --help
```
If you see the loader's help message printed, the plugin was definitely installed successfully, but an error message related to missing configuration or an unimplemented `--help` flag would also confirm that Meltano can invoke the plugin's executable.
## Configure the loader
Chances are that the loader you just added to your project will require some amount of configuration before it can start loading data.
To learn more about managing the configuration of your plugins, refer to the Configuration guide.
What if I already have a config file for this loader?
If you've used this Singer target before without Meltano, you may have a config file already.
If you'd like to use the same configuration with Meltano, you can skip this section and copy and paste the JSON config object into your `meltano.yml` project file under the plugin's `config` key:

```yaml
loaders:
  - name: target-example
    config: {
      "setting": "value",
      "another_setting": true
    }
```

Since YAML is a superset of JSON, the object needs to be indented correctly, but its formatting does not otherwise need to be changed.
Find out what settings your loader supports using `meltano config <plugin> list`:

```bash
meltano config <plugin> list

# For example:
meltano config target-postgres list
```
Assuming the previous command listed at least one setting, set appropriate values using `meltano config <plugin> set`:

```bash
meltano config <plugin> set <setting> <value>

# For example:
meltano config target-postgres set postgres_host localhost
meltano config target-postgres set postgres_port 5432
meltano config target-postgres set postgres_username meltano
meltano config target-postgres set postgres_password meltano
meltano config target-postgres set postgres_database warehouse
meltano config target-postgres set postgres_schema public
```
This will add the non-sensitive configuration to your `meltano.yml` project file:

```yaml
plugins:
  loaders:
    - name: target-postgres
      variant: datamill-co
      config:
        postgres_host: localhost
        postgres_port: 5432
        postgres_username: meltano
        postgres_database: warehouse
        postgres_schema: public
```
Sensitive configuration (like `postgres_password`) will instead be stored in your project's `.env` file so that it will not be checked into version control:

```bash
export TARGET_POSTGRES_PASSWORD=meltano
```
Optionally, verify that the configuration looks like what the Singer target expects according to its documentation using `meltano config <plugin>`:

```bash
meltano config <plugin>

# For example:
meltano config target-postgres
```
This will show the current configuration:
{ "postgres_host": "localhost", "postgres_port": 5432, "postgres_username": "meltano", "postgres_password": "meltano", "postgres_database": "warehouse", "postgres_schema": "public" }
## Run a data integration (EL) pipeline
Now that your Meltano project, extractor, and loader are all set up, we've reached the final chapter of this adventure, and it's time to run your first data integration (EL) pipeline!
To learn more about data integration, refer to the Data Integration (EL) guide.
There's just one step here: run your newly added extractor and loader in a pipeline using `meltano elt`:

```bash
meltano elt <extractor> <loader> --job_id=<pipeline name>

# For example:
meltano elt tap-gitlab target-postgres --job_id=gitlab-to-postgres
```
If everything was configured correctly, you should now see your data flow from your source into your destination!
If the command failed, but it's not obvious how to resolve the issue, consider enabling debug mode to get some more insight into what's going on behind the scenes. If that doesn't get you closer to a solution, learn how to get help with your issue.
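For example, one way to enable debug mode is the CLI's log level option (assuming the global `--log-level` option; see the Troubleshooting docs for the exact mechanism in your Meltano version):

```bash
# Re-run the failed pipeline with debug-level logging:
meltano --log-level=debug elt tap-gitlab target-postgres --job_id=gitlab-to-postgres
```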
If you run `meltano elt` another time with the same Job ID, you'll see it automatically pick up where the previous run left off, assuming the extractor supports incremental replication.
What if I already have a state file for this extractor?
If you've used this Singer tap before without Meltano, you may have a state file already.
If you'd like Meltano to use it instead of looking up state based on the Job ID, you can either use `meltano elt`'s `--state` option or set the `state` extractor extra.
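For example, passing an existing state file directly (assuming it's saved as `state.json` in your project directory):

```bash
meltano elt tap-gitlab target-postgres --job_id=gitlab-to-postgres --state state.json
```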
If you'd like to dump the state generated by the most recent run into a file, so that you can explicitly pass it along to the next invocation, you can use `meltano elt`'s `--dump=state` option:

```bash
meltano elt <extractor> <loader> --job_id=<pipeline name> --dump=state > state.json

# For example:
meltano elt tap-gitlab target-postgres --job_id=gitlab-to-postgres --dump=state > state.json
```
## Next steps
Now that you've successfully run your first data integration (EL) pipeline using Meltano, you have a few possible next steps:
- Schedule pipelines to run regularly
- Transform loaded data for analysis
- Containerize your project
- Deploy your pipelines in production
### Schedule pipelines to run regularly
Most pipelines aren't run just once, but over and over again, to make sure additions and changes in the source eventually make their way to the destination.
To help you realize this, Meltano supports scheduled pipelines that can be orchestrated using Apache Airflow.
To learn more about orchestration, refer to the Orchestration guide.
Schedule a new `meltano elt` pipeline to be invoked on an interval using `meltano schedule`:

```bash
meltano schedule <pipeline name> <extractor> <loader> <interval>

# For example:
meltano schedule gitlab-to-postgres tap-gitlab target-postgres @daily
```
The `pipeline name` argument corresponds to the `--job_id` option on `meltano elt`, which identifies related EL(T) runs when storing and looking up incremental replication state. To have scheduled runs pick up where your earlier manual run left off, ensure you use the same pipeline name.
This will add the new schedule to your `meltano.yml` project file:

```yaml
schedules:
  - name: gitlab-to-postgres
    extractor: tap-gitlab
    loader: target-postgres
    transform: skip
    interval: '@daily'
```
Optionally, verify that the schedule was created successfully using `meltano schedule list`:

```bash
meltano schedule list
```
Add the Apache Airflow orchestrator to your project using `meltano add`, which will be responsible for managing the schedule and executing the appropriate `meltano elt` commands:

```bash
meltano add orchestrator airflow
```
This will add the new plugin to your `meltano.yml` project file:

```yaml
plugins:
  orchestrators:
    - name: airflow
      pip_url: apache-airflow==1.10.12
```
It will also automatically add a `meltano elt` DAG generator to your project's `orchestrate/dags` directory, where Airflow will be configured to look for DAGs by default.
Start the Airflow scheduler using `meltano invoke`:

```bash
meltano invoke airflow scheduler

# Add `-D` to run the scheduler in the background:
meltano invoke airflow scheduler -D
```
As long as the scheduler is running, your scheduled pipelines will run at the appropriate times.
Optionally, verify that a DAG was automatically created for each scheduled pipeline by starting the Airflow web interface:
```bash
meltano invoke airflow webserver

# Add `-D` to run the webserver in the background:
meltano invoke airflow webserver -D
```
The web interface and DAG overview will be available at http://localhost:8080.
### Transform loaded data for analysis
Once your raw data has arrived in your data warehouse, its schema will likely need to be transformed to be more appropriate for analysis.
To help you realize this, Meltano supports transformation using `dbt`.
To learn about data transformation, refer to the Data Transformation (T) guide.
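As a quick preview, and assuming the dbt transformer plugin and `meltano elt`'s `--transform` option work as described in that guide, the flow looks roughly like this:

```bash
# Add the dbt transformer to the project:
meltano add transformer dbt

# Run the EL pipeline and then the dbt transformations in one go:
meltano elt tap-gitlab target-postgres --transform=run --job_id=gitlab-to-postgres
```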
### Containerize your project
To learn how to containerize your project, refer to the Containerization guide.
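In short, and assuming the Docker file bundle described in that guide, containerizing comes down to adding a project-specific `Dockerfile` and building an image; a minimal sketch:

```bash
# Add the Docker file bundle (Dockerfile and .dockerignore) to the project:
meltano add files docker

# Build an image for the project; the tag is just an example:
docker build --tag my-meltano-project:dev .
```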
### Deploy your pipelines in production
To learn how to deploy your pipelines in production, refer to the Deployment in Production guide.