Azure Data Factory Mapping Data Flows
Shivani Mishra - Easy Azure for Everyone
Azure Data Factory is a fantastic tool for moving data, but until recently it was not great at transforming it. That has changed with the introduction of Mapping Data Flows for Azure Data Factory. Mapping Data Flows let us build data transformations without writing code: a visual designer lets us create transformation flows by simply dragging and dropping actions.
Build ETL the visual way
When building a flow, it typically looks like the following:
· One or more data sources
· At least one dataset
· Multiple actions and data sinks
This is very similar to how SQL Server Integration Services (SSIS) works, so if you have used SSIS, this will feel familiar.
In data flows, there are multiple types of actions:
· Blue actions: multiple inputs and outputs, conditional splits, unions, lookups, and schema modifiers
· Green actions: derive columns, create calculated columns, select subsets of columns, aggregate, pivot, window functions, etc.
· Orange actions: filter, sort, and alter rows
In Data Factory, each pipeline has multiple actions, where each copy action moves data from one place to another. With data flows, we get a new action called "Data Flow," which contains an entire flow nested within the pipeline. This flow has its own sources, datasets, and transformations. It is an end-to-end process within a single block.
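To make that concrete, here is a rough sketch (as a Python dict) of what the pipeline JSON looks like when it contains a Data Flow activity. The names are placeholders rather than values from this demo, and most properties are omitted.

pipeline_with_data_flow = {
    "name": "MoviesPipeline",  # placeholder pipeline name
    "properties": {
        "activities": [
            {
                "name": "TransformMovies",  # placeholder activity name
                "type": "ExecuteDataFlow",  # the "Data Flow" activity
                "typeProperties": {
                    # The whole flow (sources, transformations, sinks) lives
                    # in the referenced data flow resource.
                    "dataFlow": {
                        "referenceName": "MoviesDataFlow",
                        "type": "DataFlowReference",
                    },
                },
            }
        ],
    },
}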
Currently, there are six supported data sources for data flows: Azure Blob Storage, Azure SQL Database, Azure Data Lake Storage (Gen1 and Gen2), Azure SQL Data Warehouse, and Azure Cosmos DB.
In this demo, we will show you how data flows work and how you can build transformations using Mapping Data Flows.
Let's start by creating a data factory and a storage account. Then we will upload a movies CSV file to the storage account.
Step 1: Create a Data Factory and Storage Account
1. Go to the Azure portal and navigate to your resource group.
2. Select your data factory and storage account.
3. Create two containers: one for input and one for output.
4. Upload a movies CSV file to the input container.
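If you prefer to script steps 3 and 4 instead of clicking through the portal, a minimal Python sketch using the azure-storage-blob package could look like this; the connection string and the "movies.csv" file name are assumptions.

from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient

# Assumed values -- replace with your own storage account and file.
connection_string = "<your-storage-account-connection-string>"
service = BlobServiceClient.from_connection_string(connection_string)

# Step 3: create the input and output containers.
for name in ("input", "output"):
    try:
        service.create_container(name)
    except ResourceExistsError:
        pass  # container already exists

# Step 4: upload the movies CSV to the input container.
with open("movies.csv", "rb") as data:
    service.get_blob_client(container="input", blob="movies.csv").upload_blob(
        data, overwrite=True
    )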
Step 2: Create Linked Services and Datasets
1. In the Data Factory portal, go to the "Author & Monitor" tab.
2. Create a new connection for the blob storage.
3. Create a linked service for the blob storage using the connection.
4. Create a dataset for the input movies CSV file using the linked service.
5. Create a dataset for the output files using the same linked service.
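For reference, the JSON that these portal steps generate behind the scenes looks roughly like the following, shown here as Python dicts. The names, the connection string, and the exact properties are assumptions, and many optional settings are omitted.

blob_linked_service = {
    "name": "BlobStorageLS",  # placeholder linked service name
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"connectionString": "<connection-string>"},
    },
}

movies_dataset = {
    "name": "MoviesCSV",  # placeholder dataset name
    "properties": {
        "type": "DelimitedText",  # the CSV file format
        "linkedServiceName": {
            "referenceName": "BlobStorageLS",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": "movies.csv",
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}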
Step 3: Create a Data Flow
1. In the Data Factory portal, go to the "Author" tab.
2. Create a new data flow and select "Mapping Data Flow."
3. In the data flow editor, add a source and select the movies dataset as the source.
4. Enable data flow debugging.
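Behind the designer, the data flow itself is stored as a resource of type MappingDataFlow. A simplified sketch of that definition, with placeholder names and the generated script omitted, looks roughly like this:

movies_data_flow = {
    "name": "MoviesDataFlow",  # placeholder data flow name
    "properties": {
        "type": "MappingDataFlow",
        "typeProperties": {
            "sources": [
                {
                    "name": "MoviesSource",
                    "dataset": {
                        "referenceName": "MoviesCSV",
                        "type": "DatasetReference",
                    },
                }
            ],
            "sinks": [],            # filled in when we add the output CSV sink
            "transformations": [],  # derived columns, filters, etc. go here
            # "script": "...",      # the generated data flow script, omitted
        },
    },
}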
At this point, you can start debugging your data flow and previewing the results of each step.
Keep in mind that compiling the job and sending it to the debug cluster is a bigger overhead than reading the data itself. Reading the data is quite fast; it just takes time to submit the job for execution. Let's see the results. It shows us a hundred rows, but we have more; this is just a sample. You can review the data, the columns, and the rows. There are a total of 9,125 rows in our dataset.
Now, let's look at some cool options. First, we have source options and projection. You can map the columns and select the format. You can also optimize by selecting the partitioning.
Next, let's do a simple transformation. We want to clean the movie dataset by extracting the year from the title. To do this, we will use the derived (calculated) column step. In this step, we will create a new column called "year" using a data flow expression. The expression takes the last six characters of the title, trims them, and converts the result to an integer to get the value of the year.
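In the data flow expression language, that logic would come out to something like toInteger(trim(right(title, 6), '()')); treat that exact expression as an assumption rather than a quote from the demo. Here is the same logic in plain Python, assuming titles end with the year in parentheses, for example "Toy Story (1995)":

def extract_year(title: str) -> int:
    # Take the last six characters, strip the parentheses, convert to int.
    return int(title[-6:].strip("()"))

print(extract_year("Toy Story (1995)"))  # 1995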
We can now refresh the data preview to see the results. The year extraction seems to be working correctly. We can save and finish, and then preview the data. We can see that a new column has been added and that the schema has changed since the last step.
We can continue with another step to clean the title column. We can use an expression to remove the year from the title. After refreshing, we can see that the title has been cleaned.
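Again as an assumption rather than the exact expression used in the demo, the cleanup could be written as something like trim(left(title, length(title) - 6)), which in plain Python is:

def clean_title(title: str) -> str:
    # Drop the last six characters (the year in parentheses), then trim
    # the trailing space.
    return title[:-6].strip()

print(clean_title("Toy Story (1995)"))  # Toy Story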
If we are happy with the results, we can save it and add a sink step to write the data to an output CSV file. We can configure the settings and choose the file name option. We can then run the entire flow as a pipeline.
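Running the flow as a pipeline can also be triggered from code. A minimal sketch using the azure-mgmt-datafactory package might look like this; the subscription, resource group, factory, and pipeline names are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Assumed values -- replace with your own.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
pipeline_name = "MoviesPipeline"  # placeholder pipeline name

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Kick off the pipeline that wraps the data flow.
run = client.pipelines.create_run(resource_group, factory_name, pipeline_name)
print("Started pipeline run:", run.run_id)

# Later, check on the run status.
status = client.pipeline_runs.get(resource_group, factory_name, run.run_id)
print("Run status:", status.status)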
After running the pipeline, we can check the output in the storage account. We can see that multiple files have been created because data flows run on Spark, which partitions the output. We can change the file name option to output a single file if needed.
That's it! The pipeline is complete.