Move Data from On-Premise SQL Server to Azure Blob Storage Using Azure Data Factory

 


This article focuses on the following points:

  • How to set up the self-hosted integration runtime (SIR) manually
  • Tips for SQL Server security configuration
  • Tips for SIR connection string configurations
  • Tips for adding Azure Blob Storage as Sink

This tutorial will not start from creating an Azure Data Factory (ADF) instance. If you don’t have one yet and wish to start from there, the official Microsoft tutorial linked in the Resources section covers that part.

I assume that you already have your on-premise SQL Server and your ADF instance ready, so let’s begin.

Prepare Data Source

Let’s create a test database with a test employee table in the on-premise SQL Server as the data source. First, open SQL Server Management Studio (SSMS) to create the database and the table.

Right-click “Databases” and select “New Database…”
Input a database name
Right-click the database and select “New Query”
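
If you prefer T-SQL over the SSMS dialogs, the same step is a single statement. I use testdb as the database name here, which matches the database used in the connection examples later in this article:

CREATE DATABASE testdb
GO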

Copy and paste the following SQL script to create the table and insert some test entries. This code is from the official tutorial linked in the Resources section.

CREATE TABLE dbo.emp
(
    ID int IDENTITY(1,1) NOT NULL,
    FirstName varchar(50),
    LastName varchar(50)
)
GO
INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')
INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO
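
To double-check that the table was created and the two test rows were inserted, you can run a quick query:

SELECT ID, FirstName, LastName FROM dbo.emp
GO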

Create SQL Server User

In practice, we should not let the SIR use the sa account or any other login with admin permissions. Instead, we want to create a read-only user dedicated to the target database for security reasons. Let’s create this user.

Right-click “Security” -> “Logins” under the SQL Server and select “New Login”
Create the login called “adfuser” and input a password

Note that we don’t want to check “User must change password at next login” for this login, because it is not meant for any human user; it will only be used by the ADF service.

Then, let’s add this user to the database we’ve just created.

Right-click “Security” -> “Users” under the new database and select “New User”
Input user name and default schema
In the “Membership” tab, enable “db_datareader”
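
If you prefer to script these steps rather than click through the SSMS dialogs, the following T-SQL is a minimal equivalent sketch. The login name, password and database name are just the example values used in this article, so adjust them to your environment:

USE master
GO
-- Server-level login for the ADF self-hosted integration runtime
CREATE LOGIN adfuser WITH PASSWORD = 'YourStrongPasswordHere', CHECK_POLICY = ON
GO
USE testdb
GO
-- Database user mapped to the login, read-only via db_datareader
CREATE USER adfuser FOR LOGIN adfuser WITH DEFAULT_SCHEMA = dbo
GO
ALTER ROLE db_datareader ADD MEMBER adfuser
GO

To verify that the new user can actually read the table, you can impersonate it from the same sysadmin session:

EXECUTE AS USER = 'adfuser'
SELECT TOP (2) FirstName, LastName FROM dbo.emp
REVERT
GO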

After that, a very important step is to check whether “SQL Server Authentication Mode” is enabled on your SQL Server instance. If it is not, the user you’ve just created will never be able to log in.

Right-click the SQL Server Instance and go to “Properties”
Go to the “Security” tab and make sure “SQL Server and Windows Authentication mode” is enabled
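
You can also check the current mode from a query window; the built-in SERVERPROPERTY function reports it (note that changing the authentication mode afterwards requires restarting the SQL Server service):

-- 1 = Windows Authentication only, 0 = mixed mode (SQL Server and Windows Authentication)
SELECT SERVERPROPERTY('IsIntegratedSecurityOnly') AS WindowsAuthOnly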

Create ADF Pipeline

Go to the ADF Author tab, click the “…” button next to Pipelines and select “New pipeline” from the dropdown.

Then, in the pane on the right, give the pipeline a name. Let’s call it OnPremSQLEmpToBlob_Pipeline.

Expand the Move & transform section, drag Copy data onto the canvas on the right, and rename it SQLEmpToBlobEmp_Activity.

After that, go to the Source tab and click New to create a new dataset as the source.

Search for “sql server” to find the SQL Server connector. Select it and click Continue.

Let’s call this dataset OnPremSQLServerEmp, and then we need a new linked service for this dataset.

Let’s name the new linked service OnPremSQLServer_LinkedService, then create an integration runtime for this on-premise source.

Set up the Self-Hosted Integration Runtime Manually

In the previous step, we selected “New” for the integration runtime. Now, select “Self-Hosted” and click Continue.

Input a name for the SIR. I will name it DevVmIntegrationRuntime because I will set it up on my Dev VM.

Here we come to an important step. The official tutorial suggests using the express setup, which is definitely much easier. However, this might not be feasible in some situations: for example, the system security policy may not allow the automatic setup script to run on the machine, and it is quite common to run into exactly that kind of error when trying the express setup.

Therefore, we need to use option 2: manual setup. Download the SIR installer and run it on the machine that you want to host the SIR on.

Just wait a moment for the SIR to be installed, then copy the authentication key from the ADF integration runtime setup panel and paste it into the installed SIR to connect the two.

It should connect automatically, and the name of the machine the SIR is running on will be populated as the node name. You can rename it, but in my case it doesn’t matter.

Click “Finish” and wait for it to register itself, then click “Launch Configuration Manager”. The Configuration Manager should show the node as connected to the cloud service.

We can then test the SQL Server connection in the Diagnostics tab. Note that I’m running the SIR on the same machine as the on-premise SQL Server, so I’m using localhost as the server name. However, this is not recommended in practice due to security concerns. If you are connecting to a SQL Server instance on a separate machine, just replace localhost with that machine’s name and instance name.

Another important tip is that we need to use a double backslash \\ to separate the server name from the instance name in this connection test, e.g. localhost\\SQLEXPRESS.

Then, we come back to ADF, apply and close the SIR configuration panel. Now, we want to test the connection here as well.

Another very important tip that I haven’t seen documented anywhere else: the server name here must not contain a double backslash, or the connection test will fail. Instead, use a single backslash \ between the server name and the instance name, e.g. localhost\SQLEXPRESS. I believe this is because the SIR Diagnostics tool treats the backslash as an escape character, whereas ADF does not: it passes the double backslash through literally, which then no longer matches the actual server and instance name. So, you would probably get the following error:

Connection failed

Cannot connect to SQL Server: ‘localhost\\SQLEXPRESS’, Database: ‘testdb’, User: ‘adfuser’. Instance failure. Activity ID: xxxxxx-xxxxx-xxxxx.

Now we’ve got a valid linked service to our on-premise SQL Server. Let’s choose the table we are going to extract data from, which is dbo.emp, and click OK.

Create Sink Dataset in Azure Blob Storage

Go to the Sink tab on the pipeline configuration panel and click “New” to add a new dataset.

Then, choose Azure Blob Storage and click continue.

Here, I’ll use DelimitedText as an example because it is the most readable.

Then, follow the instructions to create a new linked service. This time we will use the AutoResolveIntegrationRuntime, which is an Azure Integration Runtime (AIR), because Azure Blob Storage is a native Azure service.

In the next step, the created linked service is automatically selected. We need to input the file path of the sink dataset. Here I’ve already created a blob container called test-container, but the folder output and the file emp.csv do not actually exist at the moment. ADF will automatically create them for us.

Tip 1: Choose “First row as header” if you want to have field headers in your CSV file.

Tip 2: We need to select “None” for Import schema; otherwise, you won’t be able to proceed to the next step because of a schema import error:

Schema import failed: The required Blob is missing. ContainerName: https://xxxxx.blob.core.windows.net/test-container, ContainerExist: True, BlobPrefix: emp.csv, BlobCount: 0.\r\n. Activity ID: xxxxx-xxxxx-xxxxx

Execute the Pipeline

Now, we should be ready to run this data movement pipeline. Let’s publish everything, then go to the top toolbar on the pipeline panel and click “Add trigger” -> “Trigger now”.

Verify the Results

Wait until the notification says the pipeline has succeeded, then go to Blob Storage to check the output file. I prefer to use Azure Storage Explorer because I can directly download and open the CSV file, but it is also fine if you prefer the web view in the Azure Portal.

Resources:

Azure Documentation, Copy data from an on-premises SQL Server database to Azure Blob storage: https://docs.microsoft.com/en-us/azure/data-factory/tutorial-hybrid-copy-portal

SQL Server 2017: https://www.microsoft.com/en-in/sql-server/sql-server-downloads

Azure Storage Explorer: https://azure.microsoft.com/en-gb/features/storage-explorer/

This article was originally published by me on Towards Data Science:
https://towardsdatascience.com/move-data-from-on-premise-sql-server-to-azure-blob-storage-using-azure-data-factory-bbf67e4e5fde

