Azure Data Factory | Introduction to ETL in Azure
Shivani Mishra - Easy Azure for Everyone
Data Factory: A Brief Introduction
In this blog post, we will look at Azure Data Factory, a service that allows you to move data between different sources and destinations. Whether you are working in the cloud or on-premises, data movement is an essential part of any data project. Let's explore how Data Factory works and how you can use it in your own projects.
How does Data Factory work?
To understand how Data Factory works, let's use an analogy. Imagine you want to move a cabinet from a shop to your home. In this scenario, you need certain information:
· A key and address to locate the shop and access the cabinet
· Information about the cabinet, such as disassembly instructions or packaging details
· Delivery details, including what to do with the package once you receive it
· A key and the address of your home to deliver the cabinet
In Data Factory, the shop can be replaced with a blob storage account and the cabinet with a customer CSV file. Here's how the analogy maps to Data Factory concepts (a sketch of how these pieces fit together in code follows the list):
· Linked Service: This defines the connectivity to the blob storage and includes the authorization details.
· Dataset: This defines the structure of the customer CSV file.
· Copy Activity: This is the job in Data Factory that moves the data from the source (blob storage) to the destination (an Azure SQL database).
· Integration Runtime: This is the workforce of Data Factory that carries out all the jobs.
· Pipeline: A pipeline consists of one or more activities (such as a copy activity) and defines the flow of data.
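To make the relationship between these building blocks more concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, dataset, and pipeline names are placeholders I made up for illustration, and the linked services and datasets are assumed to already exist; this shows how the pieces reference each other, not the exact steps we will follow in the portal below.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, PipelineResource,
    DelimitedTextSource, AzureSqlSink,
)

# Placeholder names -- replace with your own subscription, resource group and factory.
subscription_id = "<subscription-id>"
rg_name = "my-rg"
df_name = "my-adf"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# A copy activity reads from a source dataset and writes to a sink dataset;
# "CarsCsv" and "CarsSql" are assumed to be datasets defined elsewhere.
copy_activity = CopyActivity(
    name="CopyCarsFromBlobToSql",
    inputs=[DatasetReference(type="DatasetReference", reference_name="CarsCsv")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="CarsSql")],
    source=DelimitedTextSource(),
    sink=AzureSqlSink(),
)

# A pipeline groups one or more activities; the integration runtime executes them.
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "CopyCarsPipeline", pipeline)

# Trigger a run of the pipeline on demand.
run = adf_client.pipelines.create_run(rg_name, df_name, "CopyCarsPipeline", parameters={})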
Importing a CSV File from Blob Storage to Azure SQL
Now, let's try importing a CSV file from blob storage into Azure SQL using Data Factory. Here are the steps:
1. Create a Data Factory in the Azure portal.
2. Create an Azure SQL database.
3. Create a storage account.
Once everything is set up, you can configure the necessary linked services, datasets, and copy activities in Data Factory to move the data from the blob storage to Azure SQL.
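The Azure portal is the simplest way to create these resources, but they can also be scripted. As a rough sketch (assuming the azure-identity and azure-mgmt-datafactory packages and a pre-existing resource group I am calling "my-rg"), creating the Data Factory itself could look like this; the SQL database and storage account would be created in the portal or with their own SDKs.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"   # placeholder
rg_name = "my-rg"                       # existing resource group (placeholder)
df_name = "my-adf"                      # name for the new Data Factory (placeholder)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) an empty Data Factory in the chosen region.
factory = adf_client.factories.create_or_update(rg_name, df_name, Factory(location="westeurope"))
print(factory.provisioning_state)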
Using Blob Storage to Store Files
I will create a new container called "input" in the blob storage and set its access level to private so that it cannot be accessed anonymously. I will upload a file called "cars.csv" to the "input" container. The file contains data in CSV format, with headers in the first row and values separated by commas. Next, I will prepare an SQL query to create a table for importing the cars data. Once the table is created, I will go to the Data Factory and start working on the pipeline.
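For reference, the same preparation can be scripted. The sketch below assumes the azure-storage-blob and pyodbc packages, placeholder connection strings, and a made-up set of columns for cars.csv (Id, Make, Model, Price); adjust the CREATE TABLE statement to match the actual headers in your file.

from azure.storage.blob import BlobServiceClient
import pyodbc

# Placeholder connection strings -- use your own storage account and SQL database.
storage_conn_str = "<storage-account-connection-string>"
sql_conn_str = "<azure-sql-odbc-connection-string>"

# Create a private container called "input" and upload cars.csv into it.
blob_service = BlobServiceClient.from_connection_string(storage_conn_str)
container = blob_service.create_container("input")   # no anonymous access by default
with open("cars.csv", "rb") as data:
    container.upload_blob(name="cars.csv", data=data)

# Create a target table for the import; the column list here is hypothetical.
create_table_sql = """
CREATE TABLE dbo.Cars (
    Id INT,
    Make NVARCHAR(50),
    Model NVARCHAR(50),
    Price DECIMAL(10, 2)
)
"""
with pyodbc.connect(sql_conn_str) as conn:
    conn.execute(create_table_sql)
    conn.commit()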
Connecting to Blob Storage
In order to connect to the blob storage, I need to create a linked service. I will go to the connections tab and create a new linked service called "input blob" for the blob storage. I will select the AutoResolve integration runtime and choose account key authentication for simplicity. I will select my subscription and test the connection. Once the connection is successful, I will finish creating the linked service.
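The portal UI is what I use here, but an equivalent linked service could also be created with the Python SDK, roughly as sketched below. The names ("InputBlob", "my-rg", "my-adf") and the account key are placeholders standing in for the "input blob" linked service created above; class names may differ slightly between azure-mgmt-datafactory versions.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureBlobStorageLinkedService, SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Account key authentication: the key travels inside the connection string.
storage_conn = SecureString(
    value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
)
blob_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(connection_string=storage_conn)
)
adf_client.linked_services.create_or_update("my-rg", "my-adf", "InputBlob", blob_ls)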
Defining a Dataset
Next, I need to define a dataset for the blob storage. I will go to the datasets tab and create a new dataset called "cars.csv" for the CSV file. I will select the "cars.csv" file from the "input" container and ensure that the checkbox for headers is selected. I will leave the import schema option at its default and continue. A new dataset will be created, with options to preview the data and make adjustments if needed.
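Purely as an illustrative alternative to the UI, a delimited-text dataset pointing at cars.csv could be defined with the SDK along these lines. The dataset name "CarsCsv" and linked service name "InputBlob" are the placeholders used in the earlier sketches, and the model classes shown are the ones I would expect in recent azure-mgmt-datafactory versions.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, DelimitedTextDataset, AzureBlobStorageLocation, LinkedServiceReference,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The dataset points at cars.csv inside the "input" container and treats the
# first row as column headers, matching the options chosen in the UI.
cars_dataset = DatasetResource(
    properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="InputBlob"
        ),
        location=AzureBlobStorageLocation(container="input", file_name="cars.csv"),
        column_delimiter=",",
        first_row_as_header=True,
    )
)
adf_client.datasets.create_or_update("my-rg", "my-adf", "CarsCsv", cars_dataset)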
Connecting to SQL
In order to connect to SQL, I need to create another linked service. I will go to the connections tab and create a new linked service called "output sql" for the SQL database. I will use the same integration runtime and, for the connection string, select my subscription, server, and database. I will choose the username and password authentication type and provide the necessary credentials. Once the test connection is successful, I can proceed.