Although there is an official tutorial for copying data from an on-premise SQL Server to Azure Blob Storage (https://docs.microsoft.com/en-us/azure/data-factory/tutorial-hybrid-copy-portal), this article focuses on some details that are not covered in that tutorial. For example:
- How to set up the self-hosted integration runtime (SIR) manually
- Tips for SQL Server security configuration
- Tips for SIR connection string configurations
- Tips for adding Azure Blob Storage as Sink
This tutorial will not start from creating an Azure Data Factory (ADF) instance. If you don’t have one yet and wish to start from there, the official tutorial above covers that.
From here on, I assume you already have your on-premise SQL Server and ADF instance ready. Let’s begin.
Prepare Data Source
Let’s create a test database with a test employee table in the on-prem SQL Server as the data source. First, open SQL Server Management Studio (SSMS) to create the database and the table.
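If you prefer T-SQL to clicking through SSMS for this step, the database itself only needs a one-liner. I’m calling it testdb here, which matches the database name that appears later in this article; use whatever name you like.
-- Create the test database that will act as the data source
-- (the name testdb is just an example)
CREATE DATABASE testdb;
GO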



Copy and paste the following SQL script to create the table and insert some test rows. This script comes from the official tutorial above.
CREATE TABLE dbo.emp
(
    ID int IDENTITY(1,1) NOT NULL,
    FirstName varchar(50),
    LastName varchar(50)
)
GO

INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')
INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO
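To confirm that the test rows landed, you can run a quick check:
-- Should return the two test rows, John Doe and Jane Doe
SELECT ID, FirstName, LastName FROM dbo.emp;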
Create SQL Server User
In practice, we should not let the SIR use the sa account or any user with admin permissions. Instead, we want to create a read-only user dedicated to the target database, for security reasons. Let’s create this user.


Note that we don’t want to check “User must change password at next login” for this user, as it won’t be handed to any human user; it will only be used by the ADF service.
Then, let’s add this user to the database we’ve just created.
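If you’d rather script these two dialogs than click through them, the following T-SQL is a rough equivalent. I’m assuming the login is named adfuser with a placeholder password and the database is testdb (both names match the ones that appear later in this article); adjust them to your setup.
-- Login for ADF to use via SQL Server Authentication
-- (no MUST_CHANGE option, so no forced password change at first login)
CREATE LOGIN adfuser WITH PASSWORD = '<YourStrongPassword>', CHECK_EXPIRATION = OFF;
GO

-- Map the login to a user in the test database and grant read-only access
USE testdb;
GO
CREATE USER adfuser FOR LOGIN adfuser;
ALTER ROLE db_datareader ADD MEMBER adfuser;
GO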



After that, a very important step is to check whether “SQL Server and Windows Authentication mode” (mixed mode) is enabled on your SQL Server instance. If not, the user you’ve just created will never be able to log in.
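If you want to check this from a query window instead of the server properties dialog, SQL Server exposes the current mode as a server property:
-- 1 = Windows Authentication only, 0 = mixed mode
-- (SQL Server and Windows Authentication)
SELECT SERVERPROPERTY('IsIntegratedSecurityOnly') AS WindowsAuthOnly;
If this returns 1, enable mixed mode in SSMS (Server Properties -> Security) and restart the SQL Server service for the change to take effect.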


Create ADF Pipeline
Go to the ADF Author tab, click the “…” button next to Pipelines, and select “New pipeline” from the dropdown.

Then, in the main view on the right pane, give a name to the pipeline. Let’s call it OnPremSQLEmpToBlob_Pipeline.

Expand the Move & transform section, drag the Copy data activity onto the canvas on the right, and rename it SQLEmpToBlobEmp_Activity.

After that, go to the Source tab and click New to create a new dataset as the source.

Search for “sql server” to find the SQL Server connector. Select it and click Continue.

Let’s call this dataset OnPremSQLServerEmp, and then we need a new linked service for this dataset.

Let’s name the new linked service OnPremSQLServer_LinkedService, then create an integration runtime for this on-premise source.

Set up the Self-Hosted Integration Runtime Manually
In the previous step we selected “New” for the integration runtime. Now select “Self-Hosted” and click Continue.

Input a name for the SIR. I’d like to name it DevVmIntegrationRuntime because I will set it up on my Dev VM.

Here, we come to an important step. In the official tutorial, it is suggested to use the express setup, which is definitely much easier. However, this might not be feasible in some situations; for example, the system security policy may not allow the automatic setup script to run on the machine, and it is fairly common to see an error like the one below when you attempt the express setup.

Therefore, we need to use option 2: manual setup. Download the SIR installer and run it on the machine that you want to host the SIR on.

Wait a moment for the SIR to be installed, then copy the key from ADF and paste it into the installed SIR to connect the two.


They should connect to each other automatically, and the name of the machine the SIR is running on will be populated as the node name. You can rename it, but in my case it doesn’t matter.
Click “Finish” and wait for it to register itself, then click “Launch Configuration Manager”. You should see a screen like the following.

We can then test the SQL Server connection in the Diagnostics tab as follows. Note that I’m running the SIR on the same machine as the on-premise SQL Server, so I’m using localhost as the server name. However, this is not recommended in practice due to security concerns. If you are connecting to a SQL Server instance on a separate machine, just replace localhost with that machine’s name and the instance name.
Another important tip: in this connection test, use a double backslash \\ to separate the server name from the instance name (e.g. localhost\\SQLEXPRESS).
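If you are not sure of the exact server and instance name to type here, you can ask SQL Server itself from an SSMS query window:
-- Returns the server name, including the instance name for named
-- instances (e.g. MYMACHINE\SQLEXPRESS)
SELECT @@SERVERNAME;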

Then, we come back to ADF. Apply and close the SIR configuration panel. Now, we want to test the connection here as well. Please see the screenshot below.

Here is another very important tip that I haven’t seen documented anywhere else: the server name here should not use a double backslash, which will cause an error when you test the connection. Instead, use a single backslash \ to separate the server name from the instance name. I believe this is because the SIR diagnostics tool treats the backslash as an escape character, whereas ADF does not; ADF interprets \\ literally as two backslashes, which then doesn’t match the actual server and instance name. With a double backslash, you would probably get the following error:
Connection failed
Cannot connect to SQL Server: ‘localhost\\SQLEXPRESS’, Database: ‘testdb’, User: ‘adfuser’. Instance failure. Activity ID: xxxxxx-xxxxx-xxxxx.
Now we’ve got a valid linked service to our on-premise SQL Server. Let’s choose the table we are going to extract data from, which is dbo.emp, and click OK.
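As a side note (not needed for this tutorial), the Source tab also lets you use a query instead of a whole table, which is handy when you only need a subset of rows or columns. A minimal example against our test table:
-- Example source query: pull only a subset of the rows
SELECT ID, FirstName, LastName
FROM dbo.emp
WHERE FirstName = 'Jane';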

Create Sink Dataset in Azure Blob Storage
Go to the Sink tab on the pipeline configuration panel and click “New” to add a new dataset.

Then, choose Azure Blob Storage and click continue.

Here, I’ll use DelimitedText as an example because it is the most readable.

Then, follow the instructions to create a new linked service. This time we will use the AutoResolveIntegrationRuntime, which is an Azure Integration Runtime (AIR), because Azure Blob Storage is a native Azure service.

In the next step, the created linked service is automatically selected. We need to input the file path of the sink dataset. Here I’ve already created a blob container called test-container, but the folder output and the file emp.csv do not actually exist at the moment. ADF will automatically create them for us.

Tip 1: Choose “First row as header” if you want to have field headers in your CSV file.
Tip 2: Set Import schema to “None”; otherwise, you won’t be able to proceed to the next step because the schema import fails (the target blob doesn’t exist yet):
Schema import failed: The required Blob is missing. ContainerName: https://xxxxx.blob.core.windows.net/test-container, ContainerExist: True, BlobPrefix: emp.csv, BlobCount: 0.\r\n. Activity ID: xxxxx-xxxxx-xxxxx
Execute the Pipeline
Now, we should be ready to run this data movement pipeline. Let’s publish everything, then go to the toolbar at the top of the pipeline panel and click “Add trigger” -> “Trigger now”.

Verify the Results
Wait until the notification says the pipeline succeeded, then go to the Blob Storage to check the output file. I prefer to use Azure Storage Explorer because I can directly download and open the CSV file easily, but it is fine if you prefer the web view in the Azure Portal.

Resources:
Azure Documentation, Copy data from an on-premises SQL Server database to Azure Blob storage: https://docs.microsoft.com/en-us/azure/data-factory/tutorial-hybrid-copy-portal
SQL Server 2017: https://www.microsoft.com/en-in/sql-server/sql-server-downloads
Azure Storage Explorer: https://azure.microsoft.com/en-gb/features/storage-explorer/
This article was originally published on Towards Data Science by myself:
https://towardsdatascience.com/move-data-from-on-premise-sql-server-to-azure-blob-storage-using-azure-data-factory-bbf67e4e5fde