PDI slow loading into Azure databases

I have an Azure VM with Pentaho Data Integration installed, and I'm trying to build an ETL process that loads a dimensional model from the staging area. But whenever I start a transformation, the load speed from PDI into any Azure database is painfully slow.
Is it possible to have PDI working in the cloud with Azure databases? Is there some configuration step needed to achieve a reasonable loading speed?
PS:
The VM and the databases are in the same region
There is a firewall rule to allow port access
Reading speed is just fine
PDI 8.1, using the Table Output step

I've been experiencing the same speed problem, but I can share my workarounds.
First of all: download and install the latest JDBC driver that lets you connect to Azure SQL Database. The documentation links to it here, but the way I do it is to keep it synced from here on GitHub; either way you end up using the latest driver in PDI.
Second workaround: for large files, what I've found most powerful is the BCP utility driven from PowerShell or a Linux shell script. It doesn't matter whether the files are local or in Azure Blob Storage, but you might need credentials for the latter.
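If it helps, here is a minimal sketch of that BCP approach driven from Python rather than PowerShell; the server, database, table, credentials and file names are placeholders, and the bcp utility (from the mssql-tools package) is assumed to be installed on the machine running the load.

```python
# Hypothetical sketch: bulk-load a local CSV into an Azure SQL table with bcp.
# All connection details, table and file names below are placeholders.
import subprocess

def bcp_load(server, database, table, data_file, user, password):
    cmd = [
        "bcp", f"{database}.dbo.{table}", "in", data_file,
        "-S", server,       # e.g. "myserver.database.windows.net"
        "-U", user,
        "-P", password,
        "-c",               # character data mode
        "-t", ",",          # field terminator for a CSV file
        "-b", "10000",      # commit in batches of 10,000 rows
    ]
    subprocess.run(cmd, check=True)

bcp_load("myserver.database.windows.net", "staging", "dim_customer",
         "dim_customer.csv", "etl_user", "secret")
```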
Last but not least: use Azure Data Factory V2 to move and load the files (if you are like me, you keep the work in PDI until it is time to load; an HTTP step will let you trigger the ADF pipeline).
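For reference, triggering an ADF V2 pipeline boils down to one REST call, which is what the HTTP step ends up doing. A rough sketch, assuming you already have an AAD bearer token, with the subscription, resource group, factory and pipeline names as placeholders:

```python
# Sketch of the ADF "create pipeline run" REST call; all names are placeholders
# and obtaining the bearer token (e.g. via a service principal) is left out.
import requests

def trigger_adf_pipeline(token, subscription, resource_group, factory, pipeline):
    url = (
        f"https://management.azure.com/subscriptions/{subscription}"
        f"/resourceGroups/{resource_group}/providers/Microsoft.DataFactory"
        f"/factories/{factory}/pipelines/{pipeline}/createRun"
        "?api-version=2018-06-01"
    )
    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json={})
    resp.raise_for_status()
    return resp.json()["runId"]  # keep this to poll the run status later
```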
Good luck, and let me know if you get it working.

Related

Syncing incremental data with ADF

In Synapse I've set up 3 different pipelines. They all gather data from different sources (SQL, REST and CSV) and sink it into the same SQL database.
Currently they all run during the night, but I already know the question of running them more frequently is coming. I want to prevent my pipelines from running through all the sources when nothing has changed in them.
Therefore I would like to store the last successful sync run of each pipeline (or pipeline activity). Before the next start of each pipeline I want a new pipeline, a 4th one, to check whether something has changed in the sources. If so, it triggers the execution of one, two or all three pipelines.
I still see some complications in doing this, so I'm not fully sure how to approach it. All help and thoughts are welcome; does anyone have experience doing this?
This is (at least in part) the subject of the following Microsoft tutorial:
Incrementally load data from Azure SQL Database to Azure Blob storage using the Azure portal
You're on the correct path - the crux of the issue is creating and persisting "watermarks" for each source from which you can determine if there have been any changes. The approach you use may be different for different source types. In the above tutorial, they create a stored procedure that can store and retrieve a "last run date", and use this to intelligently query tables for only rows modified after this last run date. Of course this requires the cooperation of the data source to take note of when data is inserted or modified.
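To make the watermark idea concrete, here is an illustrative sketch of the pattern outside of ADF. The table and column names (watermark_table, source_table, modified_date) are placeholders; the tutorial implements the same logic with a stored procedure inside the database.

```python
# Illustrative watermark pattern: read the last watermark, pull only newer rows,
# then advance the watermark once the copy has succeeded. Names are placeholders.
import pyodbc

def copy_changed_rows(conn_str, source_name):
    conn = pyodbc.connect(conn_str)
    cur = conn.cursor()
    # 1. read the watermark saved by the previous successful run
    cur.execute("SELECT last_modified FROM watermark_table WHERE source = ?", source_name)
    watermark = cur.fetchone()[0]
    # 2. pull only the rows that changed since then
    cur.execute("SELECT * FROM source_table WHERE modified_date > ?", watermark)
    rows = cur.fetchall()
    # ... copy `rows` to the sink here ...
    # 3. advance the watermark only after the copy succeeded
    cur.execute("SELECT MAX(modified_date) FROM source_table")
    new_watermark = cur.fetchone()[0]
    cur.execute("UPDATE watermark_table SET last_modified = ? WHERE source = ?",
                new_watermark, source_name)
    conn.commit()
```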
If you have a source that cannot be intelligently queried in part (e.g. a CSV file), you still have options, such as using the Get Metadata activity to query the lastModified property of a source file (or even its contentMD5 if using Blob or ADLS Gen2) and compare it to a value saved during your last run (you would have to pick a place to store this, e.g. an operational DB, an Azure Table or a small blob) to determine whether it needs to be reprocessed.
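The same check can also be done in code with the Storage SDK if that is more convenient than the Get Metadata activity. A rough sketch, assuming placeholder connection string, container and blob names, with the previously seen values persisted somewhere of your choosing:

```python
# Sketch of a "has this file changed since the last run?" check. The connection
# string, container and blob names are placeholders; last_seen_* come from
# wherever you chose to persist them (operational DB, Azure Table, small blob).
from azure.storage.blob import BlobClient

def source_changed(conn_str, container, blob_name, last_seen_modified, last_seen_md5):
    blob = BlobClient.from_connection_string(conn_str, container, blob_name)
    props = blob.get_blob_properties()
    current_md5 = props.content_settings.content_md5   # may be None if never set
    if current_md5 is not None and last_seen_md5 is not None:
        return bytes(current_md5) != bytes(last_seen_md5)   # content actually differs
    return props.last_modified > last_seen_modified         # fall back to timestamp
```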
If you want to go crazy, you can look into streaming patterns (which might require dabbling in HDInsight or getting your hands dirty with Azure Event Hubs to trigger ADF) to move from the scheduled trigger to automatic ingestion as new data appears at the sources.

Azure Data Factory - Order of actions inside the Copy Activity

I have a tricky question about the "Copy Activity" in ADF. Assume the following scenario:
Source: an external API or a non-Azure database accessed through a self-hosted integration runtime.
Sink: an Azure SQL Server database.
The "pre-copy Script" field has a command to delete some data from the sink table (why deleting is out of scope of the discussion).
When the pipeline runs, the connection with the source fails (due to a time-out, network issue, authentication, etc.)
The question is: will the pre-copy script run in this case? Or the script only runs after ADF successfully connected the source data store? I couldn't find any reference about it.
I can just try to simulate it and see what happens, but I'm hopping someone can save my time. :)
Thanks in advance!
In my experience with Data Factory, the pre-copy script won't run.
As I understand it, we can consider the Copy Activity as a workflow: connect to source --> get data from source --> connect to sink --> run the pre-copy script --> write data to sink. No matter which step fails, Data Factory stops the run.

Use Microsoft Azure as a computing cluster

My lab just got a sponsorship from Microsoft Azure and I'm exploring how to utilize it. I'm new to industrial-level cloud services and pretty confused by the many terminologies and concepts. In short, here is my scenario:
I want to run the same algorithm on multiple datasets, i.e. data parallelism.
The algorithm is implemented in C++ on Linux (Ubuntu 16.04). I did my best to use static linking, but it still depends on some dynamic libraries; these can easily be installed via apt.
Each dataset is structured, meaning the data (images, other files...) is organized in folders.
The ideal system configuration would be a bunch of identical VMs and a shared file system, so that I could submit my jobs with 'qsub' from a script or something similar. Is there a way to do this on Azure?
I investigated the Batch service, but had trouble installing dependencies after creating the compute nodes. I also had trouble with storage: so far I have only seen examples of using Batch with Blob storage, which is unstructured.
So are there any other services in Azure that can meet my requirements?
I somehow figured it out myself based on this article: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-classic-hpcpack-cluster/. Here is my solution:
Create an HPC Pack cluster with a Windows head node and a set of Linux compute nodes. There are several useful templates in the Marketplace.
From the head node, we can execute commands on the Linux compute nodes, either inside HPC Cluster Manager or using "clusrun" from PowerShell. This makes it easy to install dependencies on the compute nodes via apt-get.
Create a file share inside one of the storage accounts. It can be mounted by all machines in the cluster.
One glitch here is that, for encryption reasons, you cannot mount the file share on Linux machines outside Azure. Two solutions come to mind: (1) mount the file share on the Windows head node and share files from there, either by FTP or SSH; (2) create another Linux VM as a bridge, mount the file share on that VM and use "scp" to communicate with it from outside. Since I'm not familiar with Windows, I adopted the latter solution.
For the executable, I simply uploaded the binary compiled on my local machine. Most dependencies are statically linked, but there are still a few shared objects; I upload these to Azure as well and set LD_LIBRARY_PATH when executing the program on a compute node.
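For illustration, a minimal sketch of that last step; the library directory, executable path and dataset argument are placeholders for whatever you copied onto the node.

```python
# Launch the binary with LD_LIBRARY_PATH pointing at the uploaded .so files.
# All paths here are placeholders.
import os
import subprocess

def run_job(executable, dataset_dir, lib_dir="/shared/libs"):
    env = os.environ.copy()
    env["LD_LIBRARY_PATH"] = lib_dir + ":" + env.get("LD_LIBRARY_PATH", "")
    subprocess.run([executable, dataset_dir], env=env, check=True)

run_job("/shared/bin/algo", "/shared/data/dataset1")
```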
Job submission is done on the Windows head node. To make it more flexible, I wrote a Python script that writes XML job files, which the Job Manager can load to create jobs. Here are some instructions: https://msdn.microsoft.com/en-us/library/hh560266(v=vs.85).aspx
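The generator can be as simple as the sketch below; the element and attribute names are illustrative only and should be checked against the job description schema in the instructions linked above.

```python
# Rough sketch of generating a job-description XML with one task per dataset.
# The Job/Tasks/Task element and attribute names are assumptions to be checked
# against the HPC Pack job schema.
import xml.etree.ElementTree as ET

def write_job_xml(job_name, command_lines, out_path):
    job = ET.Element("Job", Name=job_name)
    tasks = ET.SubElement(job, "Tasks")
    for i, cmd in enumerate(command_lines):
        ET.SubElement(tasks, "Task", Name=f"task{i}", CommandLine=cmd)
    ET.ElementTree(job).write(out_path, encoding="utf-8", xml_declaration=True)

# the same binary run over several datasets
write_job_xml("experiment-batch-1",
              [f"/shared/bin/algo /shared/data/set{i}" for i in range(4)],
              "experiment-batch-1.xml")
```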
I believe there should be a more elegant solution with the Azure Batch service, but so far my small cluster runs pretty well with HPC Pack. I hope this post can help somebody.
Azure Files could provide you with a shared file solution for your Ubuntu boxes; details are here:
https://azure.microsoft.com/en-us/documentation/articles/storage-how-to-use-files-linux/
Again, depending on your requirements, you can create a pseudo folder structure in Blob storage by using containers together with "/" in the naming of your blobs.
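To show what that naming strategy looks like in practice, here is a small sketch with the Storage SDK; the connection string, container name and blob names are placeholders.

```python
# Treat "/" in blob names as folders. Connection string, container and blob
# names below are placeholders.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<connection-string>", "datasets")

# upload with a folder-like name...
with open("img_001.png", "rb") as f:
    container.upload_blob("dataset1/images/img_001.png", f, overwrite=True)

# ...and list one "folder" level at a time
for item in container.walk_blobs(name_starts_with="dataset1/", delimiter="/"):
    print(item.name)  # e.g. "dataset1/images/" (a virtual folder)
```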
To David's point, whilst Batch is generally what people look at for these kinds of workloads, it may not fit your solution. VM Scale Sets (https://azure.microsoft.com/en-us/documentation/articles/virtual-machine-scale-sets-overview/) would allow you to scale your compute capacity either by load or by schedule, depending on your workload's behaviour.

Using ODBC Driver in Azure to connect to external database

I am working at a business in New Zealand. We currently use a remote server (Plexus) to store a large amount of data (some tables > 2 billion rows). We have started down the SharePoint route, and I have created a number of databases and apps in SharePoint that use this data. Currently, I have to run a program in New Zealand that downloads the data to our local server and then pushes it up into an Azure database, which the web apps connect to. I would like to remove this middle step for many reasons, the biggest being that the web connection between NZ and the US tends to result in a lot of timeouts and long pulls, since large data sets have to come across the Pacific.
Ideally, I would like to have my C# code sitting in Azure and have it connect to the remote server directly. That way I could simply send the SQL request to Plexus and have the data go directly into the Azure databases. The major advantage is that everything would then be based in the US, which would make things a lot faster.
The major hurdle is that we need to install an ODBC driver, given to us by the remote server vendor, into Azure so it recognises the calls as genuine. Our systems administrator has said he has looked into it and it seems this can't be done?
I was hoping someone in the Stack Overflow community has encountered a similar issue and resolved it.
Note: please don't think I am asking whether Azure has an ODBC connection, because I know it does. I am not asking if I can connect TO Azure; I am asking if I can connect Azure to another external data source.
In a Worker Role/Cloud Service in Azure you can install the ODBC driver in a startup task using PowerShell's ODBC cmdlets.
More info here: PowerShell Add-OdbcDsn and here: PowerShell startup task in cloud services
One option is to create a virtual machine in the same Azure data center as your database and install your ODBC driver and your C# app.
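Once the driver and a DSN are installed on that VM, the pull-and-push flow is straightforward. A rough sketch for illustration only, using Python/pyodbc instead of the C# stack mentioned in the question; the DSN, server, table and column names are placeholders.

```python
# Illustrative only: pull rows from the remote source through its ODBC DSN and
# push them into Azure SQL, all from inside the Azure VM. "PlexusDSN", the
# server, tables and columns are placeholders.
import pyodbc

src = pyodbc.connect("DSN=PlexusDSN;UID=user;PWD=secret")
dst = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myserver.database.windows.net;Database=mydb;UID=user;PWD=secret"
)

rows = [tuple(r) for r in src.cursor().execute(
    "SELECT id, amount FROM big_table WHERE updated_at > ?", "2016-01-01"
).fetchall()]

cur = dst.cursor()
cur.fast_executemany = True   # batch the inserts instead of one round trip per row
cur.executemany("INSERT INTO big_table_copy (id, amount) VALUES (?, ?)", rows)
dst.commit()
```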

Automated migrations for Azure Blob Storage. Does this exist?

We are using Azure Blob Storage in all our projects. Over the lifetime of a project the naming conventions for files in Azure change: sometimes we would like to rename containers, remove extra folders and do other clean-up operations.
But Azure does not make renaming easy; we have to copy and then delete.
We can also change a naming convention locally during development, but then we need to remember to perform exactly the same operation on production storage when we deploy the new version.
At the same time we use Entity Framework migrations: we update the database and a migration script is created. Then we run "update-database" and the DB is updated. The same thing runs automatically in our deployment scripts: check whether the production DB needs to be updated, and update it if needed.
It would be good if we could have the same migration goodness for Azure Storage: check whether all migration scripts have been applied, and execute the missing ones, keeping a reference to the latest executed script somewhere in the containers.
Does such a thing exist? Or should I have a go and try implementing something myself?
No, such functionality/behaviour does not exist. And remember that EF migrations are supported by and are part of EF itself, not the database! So when you talk about Azure Blob Storage: as a service it does not provide such functionality, the same way SQL Server itself does not.
As to whether such a library or code exists: no, there isn't.
You are raising a very interesting question though!
I personally am not a big fan of "migrations". You can use them in the early stages of the development life cycle, but once you hit GA/production you have to be very careful about what you are doing. Even EF migrations may be fine with small databases, but are you willing to run migrations on a DB whose tables hold millions of rows of production data? The same goes for blobs. 100 or 1,000 blobs might be fine, but how about 2M? Are you really willing to ship code that walks over 2M entities, performs operations on them, and runs as part of your build/deploy process? I would not.
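That said, if you do decide to roll your own despite those caveats, the core of it is small: keep the id of the last applied script somewhere durable and run only the newer ones. A minimal sketch, where the container and blob names and the shape of the scripts collection are assumptions for illustration:

```python
# Hand-rolled "blob migration" runner sketch: the id of the last applied script
# lives in a small marker blob, and only newer scripts are executed. Container,
# blob name and the scripts dict are illustrative assumptions.
from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import BlobClient

MARKER = BlobClient.from_connection_string("<connection-string>",
                                           "migrations", "last-applied.txt")

def applied_version():
    try:
        return int(MARKER.download_blob().readall().decode())
    except ResourceNotFoundError:
        return 0  # nothing applied yet

def run_migrations(scripts):
    """scripts: {version_number: callable doing the copy/rename/delete work}"""
    current = applied_version()
    for version in sorted(v for v in scripts if v > current):
        scripts[version]()                                # do the actual work
        MARKER.upload_blob(str(version), overwrite=True)  # record progress
```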
