U-SQL User Defined Function in Azure Data Factory - azure

I'm currently using Azure Data Factory for an ETL job and in the end I want to start a U-SQL job. I've created my datalake.usql script and the UDF's the script uses are in the datalake.usql.cs file, the same structure a U-SQL project has in Visual Studio (which is where I developed the U-SQL job and succesfully ran it).
After that, I uploaded them both to Azure Blob Storage and set up the U-SQL step in Azure Data Factory to use the U-SQL script, but it doesn't see the datalake.usql.cs with my UDF's.
How can I do this?

I figured this one out so I'll post the answer here if someone else has this problem in the future.
When you're developing locally, you have a script.usql and a script.usql.cs file. When you run it, Visual Studio does all the heavy lifting, compiles it all in a usable way and your script runs. But when you're trying to run a script from Azure Data Factory, you can't perform that compilation on the fly, like Visual Studio does.
The solution to this problem is to make an Assembly, a .dll file, from your script.usql.cs, upload it to the data storage you're using and register it in the U-SQL Catalog. Then, you can reference this assembly and use it normally, as you would on your local machine.
All the steps needed for this are presented in these short guides:
https://saveenr.gitbooks.io/usql-tutorial/content/usql-catalog/intro.html
https://saveenr.gitbooks.io/usql-tutorial/content/usql-catalog/usql-databases.html
https://saveenr.gitbooks.io/usql-tutorial/content/usql-catalog/assemblies.html
https://www.c-sharpcorner.com/UploadFile/1e050f/creating-and-using-dll-class-library-in-C-Sharp/

Related

Build a pipeline in azure data factory to load Excel files, format content, transform in csv and send to azure sql DB

I'm approaching to Azure environment and watching tutorials/reading documents, but I'm trying to figure out how to setup a flow that enables the process that I will describe hereunder. The starting point are reports in .xlsx format produced monthly by Mktg Dept: the requirements are to bring them in Azure SQL DB so that data can be stored and analysed. Sofar I managed to put those files (previously manually converted in .csv format) in a BLOB storage and build an ADF pipeline that copy each file in a table on the SQL DB.
The problem is that as far as I understood with ADF it's not possible to directly manage xlsx files, and I'm wondering how to set up an automated procedure that enables the conversion from .xlsx to .csv and save them on BLOB storage. I was thinking about adding to the pipeline a python script/Databricks notebook to convert format, but I'm not sure this could be the best solution. Any hint/reference to existing tutorial or resources would be very appreciated
I found a tutorial which uses Logic Apps to do the conversion.
Datanovice indirectly suggested using a Custom activity to run either a C# or Python application to do the conversion for you.
The least expensive solution would be to do the conversion before uploading to blob, like Datanovice said.

Azure - Process Message Files in real time

I am working on Azure platform and use Python 3.x for data integration (ETL) activities using Azure Data Factory v2. I got a requirement to parse the message files in .txt format real time as and when they are downloaded from blob storage to Windows Virtual Machine under the path D:/MessageFiles/.
I wrote a Python script to parse the message files because it's a fixed width file and it parses all the files in the directory and generates the output. Once the files are successfully parsed, it will be moved to archive directory. This script runs well in local disk on ad-hoc mode whenever i need it.
Now, i would like to make this script run continuously in Azure so that it looks for the incoming message files in the directory D:/MessageFiles/ all the time and perform the processing as and when it sees the new files in the path.
Can someone please let me know how to do this? Should i use any stream analytics application to achieve this?
Note: I don't want to use Timer option in Python script. Instead, i am looking for an option in Azure to use Python logic only for File Parsing.

Azure Data Factory v2 - SSIS lift and shift

I'm in the process of evaluating the possibilites of lifting and shifting my ssis packages to ADFv2 but without testing I'm finding it hard to see if all SSIS functionalities are supported.
For example my package unzips files, modifies contents of files (script task) saving new version in different directory, loads modified files to DB and update data etc
What I'm not sure about is unzipping the files (I dont want to transfer unzipped files from on prem) and also modifying files with script task. I believe these would have to be moved outside of SSIS and created as an activity of ADF? And leave only the load of files, updating data etc as my SSIS package? Probably with the files stored in Blob storage?
Or can all this still be done directly in SSIS?
Thanks
What you currently do using SSIS on premises, you could also do using SSIS in ADF. For example, you could install additional (un)zip programs using custom setup and utilize the %TEMP% folder/current working directory (".") of your SSIS IR to modify files, see
https://learn.microsoft.com/en-us/azure/data-factory/how-to-configure-azure-ssis-ir-custom-setup
https://learn.microsoft.com/en-us/sql/integration-services/lift-shift/ssis-azure-files-file-shares?view=sql-server-2017

Continuous Integration of SQL Server in Visual Studio Online

I was just about to do a continuous integration of SQL Server scripts with VSTS. I have two script files in my visual studio 2015 database project.
createStudentTable.sql => simple create table script
Script.PostDeployment1.sql => :r .\createStudentTable.sql (pointing to the above script)
Now after the successful build in visual studio online I suddenly recognized that a .dacpac file is also created - see this screenshot:
Now my database has around 100 tables + view and stored procedures. Now does this .dacpac file contain the entire schema details? If so then it would be an huge overhead in carrying this .dacpac with every build.
Please advise.
Dacpac file only contains the schema model definition of your database and it does not contain any of table data unless you add all of insert statements in the postdeploymentscript.sql
The overhead of dacpac is that it compares the model in dacpac and your target database when the actual deployment happens.
This is a trade-off. If you don't use dacpac then you will end up doing all the database versions and version migrations by yourself manually or using another tool that can make those database change managements with ALTER statements somewhat easier.
BTW the scale of 100 table can be handled well by dacpac.

OnStart vs Startup Script for batch file?

I have a Ruby on Rails application that needs to find a home in an Azure Worker Role.
I currently automate the deployment of the application with a batch file - a file that takes the apache and ruby installers, runs them, and then drops the RoR app in the appropriate directory. After the batch script finishes, Apache is serving to and from the application via port 80.
I'm new to Azure and trying to figure out how to do this.
From my understanding, I have two options here: OnStart with the installation files in Blob Storage, or a startup script. I'm not sure how to do the latter, but I have located the onStart method within the WorkerRole.vb file in the new Azure project I just created.
My question: Is it recommended to use OnStart to deploy the application (using the batch script)? If so, how would I go about integrating the script into the project? And - how do I get started with storing and referencing the files in blob storage?
I know these are super high-level questions. Any input or suggested reading would be super helpful. I have tried to google / search for relevant resources but haven't been able to find much. Thank you for your time!
When you are inside OnStart() function it is better to do role configuration things i.e. IP binding, etc however if you would want to install runtime, download application zip, configured role specific setting, it is best to use Startup task. Please visit my blog Windows Azure: Startup task or OnStart(), which to choose? to learn more about it.
Now in your case it is best to use Startup task. What you can do it as below:
Create your ROR package a zip and place it at Windows Azure Blob Storage
Create a Cmmmand batch file which will do:
2.1 Download the ZIP
2.2 Unzip to Zip content to a specific location
2.3 Update the status back to AZure Blob Storage (Optional)
In your OnStart() function you just need to configure the ROR
The code will look as below if you have TCP Endpoint name "RORWeb80" set to use port 80:
TcpListener RoRPortListener = new TcpListener(RoleEnvironment.CurrentRoleInstance.InstanceEndpoints["RORWeb80"].IPEndpoint);
RoRPortListener.Start();
I have written a sample app for Tomcat/Java based worker role which does exactly the same. So what you can do it just replace the Tomcat ZIP file with ROR ZIP and reuse the code exactly.
As long as you don't need admin-level access (e.g. modifying registry, installing msi's, etc.) you can do your setup from OnStart(), including launching your script. Just include the startup script with your project (don't forget to set Copy Local to true).
Same goes with startup script: you call your cmd file, which then executes the sequence for you. And if you give it elevated permissions, you can run installers, modify registry settings, install custom perf counters, whatever.
In either case: you can keep your apache zip, ruby installers, etc. in blob storage and, at startup, download them to local storage. This saves you from bundling everything within the deployment, which gives you a few advantages (being able to update ruby / apache without redeploy, reduced package size, etc.).
There's a sample app on codeplex that demonstrates the basics of setting up Tomcat via startup script. For one more example, you can look at the scripts installed via Eclipse Windows Azure plugin for Java. These scripts are quite similar. The key is to have some way of downloading files from blob storage and then unzipping them. the codeplex project I referred to points to a sample app that does simple blob downloading. The Eclipse packaging provides similar functionality in a .vbs app. Here's a snippet of one of my scripts from an Eclipse-based project:
SET SERVER_DIR_NAME=apache-tomcat-7.0.25
SET WAR_NAME=myapp.war
rd "\%ROLENAME%"
mklink /D "\%ROLENAME%" "%ROLEROOT%\approot"
cd /d "\%ROLENAME%"
cscript /NoLogo util\unzip.vbs jre7.zip "%CD%"
cscript /NoLogo util\unzip.vbs tomcat7.zip "%CD%"
copy %WAR_NAME% "%SERVER_DIR_NAME%\webapps\%WAR_NAME%"
cd "%SERVER_DIR_NAME%\bin"
set JAVA_HOME=\%ROLENAME%\jre7
set PATH=%PATH%;%JAVA_HOME%\bin
cmd /c startup.bat
The codeplex project has a similar-looking script.
Don't forget: you'll need to set up an Input Endpoint for your role (part of the role properties).
To get blobs into blob storage, there are both free tools (like Clumsy Leaf CloudXplorer and paid tools (such as Cerebrata's Cloud Storage Studio).
To download blobs to local storage, you can either write a few lines of .net code (from OnStart) or just use the utility pointed to in the codeplex project.

Resources