Azure Data Factory – A simple explanation

This article provides a business-friendly view of what Azure Data Factory is, and where it fits in the data-tools landscape.

Businesses today generate more data than ever, and the volume is growing rapidly. They use a variety of tools to run their operations, which means data comes from many different sources. From real-time feeds to large batches delivered at regular intervals, this data can be structured, semi-structured, or unstructured.

Businesses want to tap into this data and convert it into insights that drive the business forward as quickly as possible. To do that, these data sources need to be combined and integrated into common storage so that data analysts and data scientists can dig in and find those insights.

As mentioned previously, data comes from many different sources. So what are these sources? They could be SaaS applications like Google Analytics, Facebook Ads, Salesforce, or Zoho; on-premises databases like Oracle or SQL Server; or other cloud platforms such as AWS (Amazon Web Services) and GCP (Google Cloud Platform).

One of the biggest challenges is that our standard integration tools lack the functionality to integrate and combine these sources into common storage. This is where Microsoft Azure Data Factory (ADF) comes into the picture and eases the process.

As per Microsoft's definition, “ADF is a fully managed, serverless data integration service for ingesting, preparing and transforming all of your data sources”.

Let’s go into more detail:

  • ADF is a resource in the Azure subscription, and Microsoft manages our data factory for us. That means we don’t need to worry about installing the application or operating system, or about the scalability, availability, and security of our data factory. That’s why it is called a fully managed service.
  • ADF is serverless, so our resources can scale to any size without infrastructure management.
  • ADF supports 90+ connectors for ingesting data from various sources, and Microsoft is continually adding new ones. It is easy to connect to other major clouds like AWS and GCP, and to ingest data from on-premises databases. We can quickly build ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) pipelines in a code-free environment.
  • In addition to the above, ADF can orchestrate the execution of Machine Learning models (using Azure ML Studio or Databricks) and the publishing of dashboards using Power BI.
  • Finally, ADF helps us monitor our data pipelines.
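To make the “code-free pipelines” point a little more concrete: behind ADF’s visual designer, every pipeline is stored as a JSON definition. Below is a minimal, hypothetical sketch of a Copy activity that moves a CSV file from Blob storage into an Azure SQL table (the pipeline, activity, and dataset names are made up for this illustration):

```json
{
  "name": "CopySalesDataPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyBlobToSql",
        "type": "Copy",
        "inputs": [
          { "referenceName": "SalesCsvDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "SalesSqlDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}
```

In practice you would rarely author this JSON by hand; the drag-and-drop designer generates it for you, and the same definition can be kept in source control and deployed across environments.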

Things to keep in mind if you are thinking of using ADF

  • ADF is not a data storage solution. It provides compute to process the data, which must then be stored elsewhere, such as in Azure Storage or a database.
  • ADF is not a data migration tool. However, we can use the Azure Database Migration Service to migrate our data from one database to another.
  • Complex data transformations can be challenging in ADF because of its limited support compared to other code-free ETL tools. We can use Databricks or HDInsight for complex transformations, and ADF can orchestrate that workflow. We may see some improvements from Microsoft in this area in the future.
  • ADF is not designed for streaming datasets; it is meant for loading and transforming data periodically. For streaming, we need Azure Event Hubs or other components instead.
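Because ADF is built for periodic (batch) loading rather than streaming, scheduling is central to how it runs. As an illustration, here is a hedged sketch of a schedule trigger definition, in the same JSON format ADF uses internally, that would run a pipeline once a day (the trigger and pipeline names are hypothetical):

```json
{
  "name": "DailyLoadTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T02:00:00Z"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "DailySalesPipeline",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```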

We hope you enjoyed our short introduction to Azure Data Factory. Please feel free to email us at hello@datacubed.nz with any further data questions.

More data knowledge coming soon… for our next blog we’ll be covering how you can use Azure Data Factory to build a modern ETL pipeline that will help power all of those rich insights.

This blog was written by Dhilip S.