Introducing Mario: The Data Source Piping Tool

How we collect and store disparate external data sources consistently

Posted by Sebastian Perez Saaibi and Juan Camilo Bohorquez Escalante on June 9, 2017

The Problem

As a growing technology platform, Pager uses several third party services for its different business units such as Marketing, Business Development, Sales, and Product.

When these tools become part of the company’s daily workflow, it is essential to track the data they generate in order to maintain accurate metrics and to measure the performance of the company.

This is where the real problem starts. A business stakeholder typically wants to answer questions that involve extracting and combining data from more than one of these third party data sources.

The traditional way to solve this was to build a manual pipeline (exporting data from the tool’s GUI) and to perform an ad-hoc analysis. But what if different stakeholders are interested in a recurrent report that involves this data?

Our Solution: Mario

This is where Mario, our data extraction and normalization tool comes in. Mario pipes data from multiple external sources and drains it into a relational database which acts like a single point for internal consumption of normalized data (e.g. querying, reporting and data analytics).

Mario alleviates the problem of gathering third party data by making it easy to store and consume disparate sources information in a relational database.

This has proven to be a key factor in promoting Pager’s culture of data driven decision making.

Mario

How does Mario Work?

Mario was conceived as a procedural worker to be started by a scheduler which simultaneously retrieves, transforms and stores data from multiple sources.

Each source has a specific flow to retrieve data. That could involve making requests to a REST API, scraping a website, or something else.

After retrieving the data, every flow has its own transformation methods to turn incoming data into the most appropriate data structures for storage and consumption. Having said that, procedures such as error tracking and persistence are common across all flows.

In a development environment, data can be persisted to a DB as it would on a production environment or into CSV files. The latter is useful for experiementation. For instance, it allows us to determine the type of each column, analyze descriptive statistics relevant to each type and finally generate meaningful SQL table structures.

The next sections will a high level description of the different flows required to collect data from the various sources.

Mario Flow

Fetching data from an API

The figure on the left shows the most common flow, which follows these steps:

  1. Build an authentication request
  2. Make an authentication request
  3. Collect data from the authentication response
  4. Build a query request with authorization headers
  5. Collect data from the query response
  6. Transform data
  7. Store transformed data

Other Flows

Some other times the sources don’t provide a public API and it becomes necessary to use data from the websites. For that purspose, we have flows that make requests to collect cookies and scrape data from the website to build the authentication request.

After Mario completes the authentication request and stores the cookies, it pushes a request to the respective API.

OAuth Flow

OAuth Flow

A service that implements OAuth will then follow a similar path to a common flow. If a valid refresh token is not present we will have to follow some steps to generate the seed for the automated flow of refreshing the access token.

  1. Start a server that will be used by the the OAuth redirect.
  2. Manualy input credentials and grant access.
  3. Store the access token that comes in the redirect.
  4. Use the access token to get a refresh token and store it so a scheduled job can refresh the access token before it expires.

Then the credentials are refreshed asynchronously with a cron job using the refresh token we manually created.

Error Reporting

Mario is executed by a time-based job scheduler, so something might go wrong while no one is watching.

For this reason, we use logs to keep a record of the trail of events and stream the critical events to Sentry.

Sentry notifies us via email or Slack of any atypical event, helping us promptly recover from any failure. That way we can always be sure we have updated, consistent, and reliable data.

Use Case

Building a User Attribution Funnel

As a Patient Engagement Technology Platform, Pager has mobile and web apps for our patients, medical providers, and command center medical specialists. This ecosystem of applications has a relatively heavy tracking footprint, and raises a fundamental question: how do you build an end-to-end attribution funnel?

Pager uses Mixpanel to track user actions and activities inside the suite of apps. However, people would often want to include top of the funnel information when understanding people using and progressing through the app.

We use Mario to pull data from disparate data sources such as iTunes Connect, the Google Play Store and Mixpanel and provide an aggregate view of the attribution funnel.

Thanks to Mario, we can build detailed attribution funnels for the full ecosystem of our products.

Next Steps

In the past months, some companies have started to offer commercial implementations that resemble what we’ve built with Mario. Being able to build a user footprint that gathers information from several data sources is key to providing a better and more efficient product or service.

We strongly believe in the power of this tool and we think several businesses and developers out there could benefit from it if they haven’t built their own yet. That being said, we are working to open source this tool and are striving to reach the following milestones:

  • Refactor and architecture revamp
  • Add a user interface: currently this is a command-line tool and we don’t have an interface to manage the tool
  • Add a dashboard to customize queries, turn on/off integrations and set credentials
  • Additional external sources