Welcome to the first series of our guest-bloggers at Data³, where we showcase quick, detailed and useful guidance on practicalities of data aggregation and integration. Catch-up on technical details of moving data when using Azure Data Factories, Azure Data Lake Storage and Azure Event Hub.
Dhilip, is an experienced data professional in Wellington, New Zealand. He has a driving data passion for Machine Learning, Data Science, Data Analysis, Business Intelligence, Data Visualisation and much more. More than anything, he loves to tell stories with data, especially those that have a big impact. You can also find his work in Medium, Towards Data Science, Towards AI, SAS Community, KD Nuggets and Indianai.
In this blog, we will see how to extract data from an open API using Azure Data Factory and copy the data to Azure Data Lake Storage Gen 2.
- Azure Subscription
- Rest API
- Azure Data Factory Instance
- Azure Data Lake Storage Gen 2 Instance
We used the Department of Health and Mental Hygiene (DOHMH) New York City Restaurant Inspection Results data from NYC open data. DOHMH provide an Open API to their data.
Steps to be followed
- Create a linked service for Rest API and Storage Gen 2
Open an Azure Data Factory Studio
Click the Manage icon and this will redirect to Linked Services
We need to create a new Linked service,
Source Linked service -> search Rest API connector -> select -> continue
Enter the below properties:
- Base URL (Your Rest API Endpoint)
- Select suitable Authentication Type (We selected Anonymous as this API doesn’t have any tokens). You can select the authentication based on your API.
- Test the Connection
- Click create
New linked service “ls_NYCApi_output” has been created
→ Create Storage gen 2 linked service
Source Linked service -> search gen 2 connector -> select -> continue
Enter the below properties,
- Select Azure Subscription
- Select the storage name (You need to create a gen 2 instance before creating this linked service)
- Test Connection
- Click create
New linked service “ls_gen2_NYCApi_output” has been created.
2) Create a source and sink dataset
Our source dataset is the NYC data which comes from Rest API. Sink dataset is nothing but storing the data into the necessary destination. In this example, we are extracting the data from the API and store into the Data Lake storage gen 2.
In order to create a source and sink dataset, we need to click the Author icon from the Azure Data Factory.
Next step, to click the new dataset
New dataset → search Rest API -> select – continue
Enter the below properties
- Select the linked service (we need to select our rest API linked service which we created in the earlier steps “ls_NYCApi_output”
- Click OK
The source dataset has been created.
-> Now, we need to create a sink for storing our data into Gen 2. Again select a new dataset
New dataset → search Gen2 -> select – continue
This will redirect to a new window called “Select format” where we can select the format of our data to be stored in the Gen 2. We would like to store it as a csv file. In simple terms, we are extracting data from Rest API and storing the data into Gen 2 as a CSV file.
By default, data factory supports Binary, Excel, XML, Parquet, ORC, CSV, Json formats. You can select the format based on your requirement. We select CSV and click continue
Enter the below properties
- Select the linked service (we need to select our gen2 linked service which we created in the earlier steps “ls_gen2_NYCApi_output”
- Select the file path -> we need to create a folder in gen 2 before proceeding this step. We created a folder called “nyc” in our gen 2 and selected the same. So our csv file stores into the nyc folder in the Gen 2. Give the name of the file you want to store (saved as “nyc.csv”)
- Select First row as header
- Click OK
Sink dataset has been created
Now in the sink dataset properties, we need to give the file name for our data. We name the file as “nyc.csv”
3) Create a Pipeline
Click new pipeline
Enter the below property
- Name ( pl_ETL_NYCApi)
Click move & transform -> drag copy data into the pipeline
Click the dragged copy data and it shows the below tabs in the down
We are going to work on General, Source, Sink, Mapping tabs in this example.
Enter the name of the copy data (“we gave “Extract_NYCApi”)
Leave the other properties default
- Click the source tab and select the source dataset. Here, we need to select our Rest API Source dataset called “dataset_NYCApi_source” which we created in the above steps.
- Select GET method in the Request method, as we are extracting the data from the API
- You can change the time out but we leave as it is in this example.
- We leave other properties as it is as we use the open API here. If your API has headers, paginations, you can change other properties
- Final step, click the preview data to check the data
We can see the data in the preview data tab
- Click the sink tab and select the sink dataset. Here, we need to select out Gen 2 Sink dataset called “dataset_NYCApi_Sink”
In the mapping tab, click Import Schemas. This imports the exact json structure of our Rest API. It displays the entire attributes of the data. You can select/unselect at any point the attributes that you don’t want to copy. We need to select the data type before completing this step. Just for this example, we selected strings for all the attributes.
4) Running the Pipeline
Click debug to run the copy activity in the pipeline
In order to check whether our pipeline is successful or not, we need to see the status in the output window. Status shows succeeded -> that means our data has successfully copied to gen 2.
Copy activity has finished. Data has extracted from API and successfully copied into gen 2 as a csv file (“nyc.csv)
5) Checking the Gen 2 Storage
Final step -> go to your data lake storage gen 2 and check the CSV file. Our “nyc.csv” file has been created under nyc folder
Click nyc.csv and select edit to see the preview of the file
CSV file has been created successfully and you can directly download the file or you can connect to any BI tools to make a report.
In our next blog, we can see how to export this CSV file into Azure SQL database for simple reports.
This blog was written by Dhillip S.