Azure Data Factory Mapping Data Flows

Shivani Mishra - Easy Azure for Everyone

Azure Data Factory is a fantastic tool for moving data, but until recently it was not great at transforming it. That has changed with the introduction of mapping data flows for Azure Data Factory. Mapping data flows allow us to create data transformations without writing code, using a visual designer where we build transformation flows simply by dragging and dropping steps onto a canvas.



Build ETL the visual way

When building a flow, it typically looks like the following:

·         One or more data sources

·         At least one data set

·         Multiple transformations and data sinks

This is very similar to how SQL Server Integration Services (SSIS) works, so if you have used SSIS before, mapping data flows will feel familiar.

In data flows, the transformations fall into color-coded groups:

·         Blue transformations (multiple inputs and outputs): joins, conditional splits, unions, and lookups

·         Green transformations (schema modifiers): derive columns, create calculated columns, select subsets of columns, aggregate, pivot, window functions, etc.

·         Orange transformations (row modifiers): filter, sort, and alter rows

In Data Factory, each pipeline contains multiple activities, and a copy activity moves data from one place to another. With data flows, we get a new activity called "Data Flow," which nests an entire flow inside the pipeline. The flow has its own sources, datasets, and transformations; it is an end-to-end process within a single block.

Currently, there are six supported data sources for data flows: Azure Blob Storage, Azure SQL Database, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Data Warehouse, and Azure Cosmos DB.

In this demo, we will show you how data flows work and how you can build transformations using mapping data flows.

Let's start by creating a data factory and a storage account. Then, we will upload a movies CSV file to the storage account.




Step 1: Create a Data Factory and Storage Account

1.    Go to the Azure portal and navigate to your resource group.

2.    Create a data factory and a storage account (or select existing ones).

3.    Create two containers: one for input and one for output.

4.    Upload a movies CSV file to the input container.

Step 2: Create Linked Services and Datasets

1.    In the data factory portal, go to the "Author & Monitor" tab.

2.    Create a new connection for the blob storage.

3.    Create a linked service for the blob storage using the connection.

4.    Create a dataset for the input movies CSV file using the linked service.

5.    Create a dataset for the output files using the same linked service.

Step 3: Create a Data Flow

1.    In the data factory portal, go to the "Author" tab.

2.    Create a new data flow and select "Mapping Data Flow."

3.    In the data flow editor, add a source and select the movies dataset as the source.

4.    Enable data flow debugging.

At this point, you can start debugging your data flow and previewing the results of each step.

Compiling the flow and submitting it to the debug cluster is a bigger overhead than reading the data itself. Reading the data is quite fast; it just takes time to send the job for execution. Let's see the results. The preview shows a hundred rows, but we have more; this is just a sample. You can review the data, the columns, and the rows. In total, there are 9,125 rows in our dataset.

Now, let's look at some useful options. The source has Source Options and a Projection tab, where you can map the columns and select their formats. On the Optimize tab, you can also configure the partitioning.
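For reference, behind the visual designer the whole flow is stored as a data flow script, and every step on the canvas corresponds to a line in it. A minimal sketch of the source step, assuming we named it MoviesSource, looks roughly like this:

    source(allowSchemaDrift: true,
        validateSchema: false) ~> MoviesSource

Each transformation we add next becomes another line chained to the previous one with the ~> operator.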

Next, let's do a simple transformation. We want to clean the movie dataset by extracting the year from the title. To do this, we add a derived column step and create a new column called "year" using a data flow expression. The expression takes the last six characters of the title, trims off the parentheses, and converts the result to an integer to get the year.
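As a sketch, assuming the column holding the title is called title and every title ends with the year in parentheses (for example "Toy Story (1995)"), the expression for the year column can look like this:

    toInteger(trim(right(title, 6), '()'))

Here right takes the last six characters, trim strips the parentheses, and toInteger converts what remains into a number.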

We can now refresh the data preview to see the results. The year extraction is working correctly. We can save and finish, and then preview the data: a new column has been added, and the schema has changed since the last step.

We can continue with another derived column step to clean the title column, using an expression that removes the year from the title. After refreshing the preview, we can see that the title has been cleaned.
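One possible expression, again assuming the year always appears as a trailing " (yyyy)" suffix on the title column, simply drops the last six characters and trims the leftover space:

    trim(left(title, length(title) - 6))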

If we are happy with the results, we can save the flow and add a sink step that writes the data to the output container as a CSV file. We configure the sink settings, choose the file name option, and then run the entire flow as a pipeline.
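In the underlying data flow script, the sink step looks roughly like the sketch below; the step names CleanTitle and MoviesSink are just the illustrative names used here, and the dataset attached to the sink decides where the files land:

    CleanTitle sink(allowSchemaDrift: true,
        validateSchema: false) ~> MoviesSink

The sink's Settings tab controls the file name option, and its Optimize tab controls partitioning, which matters for the next step.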

After running the pipeline, we can check the output container in the storage account. Multiple files have been created because data flows run on Spark, which writes one file per partition. If you need a single file, change the file name option on the sink to output a single file instead.

That's it! The pipeline is complete.

  
