Are we able to use SnappyData to update a record in Azure Data Lake, or is Azure Data Lake append only?

I am currently working on Azure Data Lake with SnappyData integration and have a question: can SnappyData update data that is stored in Azure Data Lake Store, or can we only append to it? I searched the forums but couldn't find a proper answer. If anyone knows, please share. Thank you.

Azure Data Lake Store, much like HDFS, is an append-only store. You can append to a file or replace it altogether; there is no way to update an existing file.

I've achieved MERGE-style behaviour in U-SQL by using an Azure Data Lake table as the middle ground between input and output. Check out my blog post with the code showing how I did it with a series of joins:
https://www.purplefrogsystems.com/paul/2016/12/writing-a-u-sql-merge-statement/
This gives you merge behaviour in your output while still only appending to, or replacing, the underlying files.
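To make the constraint concrete, here is a minimal Python sketch using the azure-datalake-store SDK (account name, credentials and paths are placeholders, not from this question). The only write operations available are appending to a file or rewriting it, so an "update" of a record always means rewriting the whole file:

    from azure.datalake.store import core, lib

    creds = lib.auth(tenant_id="<tenant-id>",
                     client_id="<app-id>",
                     client_secret="<app-secret>")
    adl = core.AzureDLFileSystem(creds, store_name="<adls-account>")

    # Appending new rows to the end of an existing file is supported.
    with adl.open("/logs/events.csv", "ab") as f:
        f.write(b"2017-06-01,42,click\n")

    # "Updating" a record means reading the file and rewriting it in full.
    with adl.open("/logs/events.csv", "rb") as f:
        rows = f.read().decode("utf-8").splitlines()
    rows = [r.replace(",42,", ",43,") if r.startswith("2017-06-01") else r
            for r in rows]
    with adl.open("/logs/events.csv", "wb") as f:
        f.write(("\n".join(rows) + "\n").encode("utf-8"))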

Related

Move data from Azure Data Lake to BigQuery

I want to move some data on a daily basis from Azure Data Lake to BigQuery using Azure Data Factory. However, ADF does not support BigQuery as a sink. What would you suggest? Is there any GCP service analogous to ADF that could perform this task?
Thanks!
However, ADF does not support BigQuery as a sink.
Correct: ADF supports Google BigQuery only as a source, so ADF alone cannot meet this requirement.
Is there any GCP service analogous to ADF that could perform this task?
There does not seem to be a ready-made tool; you may need to write code that reads the data from Data Lake and loads it into BigQuery.
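As a sketch of that code-it-yourself option (not from this thread; project, store, table and path names are placeholders), a small scheduled Python job could pull the daily file from Data Lake Store and load it into BigQuery:

    from azure.datalake.store import core, lib, multithread
    from google.cloud import bigquery

    # Download today's extract from Azure Data Lake Store to a local file.
    creds = lib.auth(tenant_id="<tenant-id>",
                     client_id="<app-id>",
                     client_secret="<app-secret>")
    adl = core.AzureDLFileSystem(creds, store_name="<adls-account>")
    multithread.ADLDownloader(adl, rpath="/exports/daily/2020-01-01.csv",
                              lpath="daily.csv", overwrite=True)

    # Load the CSV into BigQuery (GOOGLE_APPLICATION_CREDENTIALS must point
    # at a service-account key with BigQuery access).
    bq = bigquery.Client(project="<gcp-project>")
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    with open("daily.csv", "rb") as f:
        job = bq.load_table_from_file(f, "<dataset>.<table>",
                                      job_config=job_config)
    job.result()  # wait for the load job to finish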

When to use Data Factory (Copy) over a direct pull in Synapse SQL

I am going through some Microsoft documentation and doing hands-on exercises for data engineering.
I have a couple of questions about the following scenario: copy CSV file(s) from Blob Storage into Synapse Analytics staging table(s).
I read that we can pull data directly into Synapse by creating external tables. (https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/load-data-wideworldimportersdw)
If the above is possible, in what cases would we use the Azure Data Factory Copy activity or a data flow instead?
When working with Azure Data Factory, is it a good idea to use PolyBase, given that it would use Blob Storage again for staging in this scenario (i.e. I am copying a file from Blob and then using Blob again for staging)?
I have searched for answers to these questions but haven't found a satisfactory one yet.
If you're just doing a straight load of CSV data into the DW, use Copy. PolyBase is recommended, but not always needed for small files.
If you need to transform that data or perform updates, then use data flows.
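For reference, here is a hedged sketch of the "direct pull" route the linked tutorial describes (not part of the answer above; all server, storage, schema and credential names are placeholders, the ext and stage schemas are assumed to exist, and a private container would also need a database scoped credential). It creates a PolyBase external table over the Blob CSVs and loads a staging table with CTAS, driven from Python via pyodbc:

    import pyodbc

    statements = [
        # Point PolyBase at the Blob container holding the CSV files.
        """CREATE EXTERNAL DATA SOURCE BlobLanding
           WITH (TYPE = HADOOP,
                 LOCATION = 'wasbs://landing@<storageaccount>.blob.core.windows.net')""",
        """CREATE EXTERNAL FILE FORMAT CsvFormat
           WITH (FORMAT_TYPE = DELIMITEDTEXT,
                 FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2))""",
        # External table = schema over the files; no data movement yet.
        """CREATE EXTERNAL TABLE ext.Sales (SaleId INT, Amount DECIMAL(18,2), SoldOn DATE)
           WITH (LOCATION = '/sales/', DATA_SOURCE = BlobLanding, FILE_FORMAT = CsvFormat)""",
        # CTAS performs the actual parallel load into the staging table.
        """CREATE TABLE stage.Sales
           WITH (DISTRIBUTION = ROUND_ROBIN)
           AS SELECT * FROM ext.Sales""",
    ]

    conn = pyodbc.connect(
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=<server>.sql.azuresynapse.net;Database=<sqlpool>;"
        "Uid=<user>;Pwd=<password>", autocommit=True)
    for stmt in statements:
        conn.execute(stmt)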

What is the purpose of having two folders in Azure Data Lake Analytics

I am a newbie to Azure Data Lake.
The attached screenshot shows two folders (Storage Account and Catalog), one for Data Lake Analytics and the other for Data Lake Store.
My question is: what is the purpose of each folder, and why do we use U-SQL for transformations when this can be done in Data Factory?
Please explain the data flow process from the data store to the data lake.
Thank you,
Addy
I have addressed your query on the MSDN thread below:
https://social.msdn.microsoft.com/Forums/en-US/f8405bdb-0c85-4d37-8f2e-0dab983c7f94/what-is-the-purpose-of-having-two-folders-in-azure-datalake-analytics?forum=AzureDataLake
Hope this helps.

Moving a DocumentDB Collection to Azure Data Lake Storage

I was wondering what the best practice is for moving a DocumentDB collection to Azure Data Lake Store.
Should I create a file for each document in the collection, or move the entire DocumentDB collection at once?
Also, I haven't found much information on how I can access DocumentDB using U-SQL.
Input would be appreciated.
You currently cannot use U-SQL to access data in DocumentDB (now called Cosmos DB). There is a feature request here; please feel free to add your vote.
If you move the data over, how you organize it depends on how you want to manage it (delete everything, or only parts?), how it is structured (keep similarly structured data together, either in the same file or the same folder), how you use it (do you always need all of it, or only parts?), and what gives you the best access performance (larger files are normally better, but if they are JSON, also make sure the extraction process can handle them).
You can use Azure Data Factory to connect to DocumentDB and store your data in Data Lake.
After that you can query the data directly from Data Lake using U-SQL.
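If you prefer to script the copy step instead of using Data Factory, a minimal Python sketch along these lines would also work (account names, keys and paths are placeholders, not from this thread). Writing one JSON document per line keeps the exported files easy to extract later:

    import json
    from azure.cosmos import CosmosClient
    from azure.datalake.store import core, lib

    # Read every document from the Cosmos DB (DocumentDB) collection.
    cosmos = CosmosClient("https://<cosmos-account>.documents.azure.com:443/",
                          credential="<cosmos-key>")
    container = (cosmos.get_database_client("<db>")
                       .get_container_client("<collection>"))

    # Write them to Azure Data Lake Store as line-delimited JSON.
    creds = lib.auth(tenant_id="<tenant-id>", client_id="<app-id>",
                     client_secret="<app-secret>")
    adl = core.AzureDLFileSystem(creds, store_name="<adls-account>")

    with adl.open("/raw/documentdb/<collection>/export.json", "wb") as out:
        for doc in container.read_all_items():
            out.write((json.dumps(doc) + "\n").encode("utf-8"))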

How to transfer CSV files from Google Cloud Storage to Azure Data Lake Store

I'd like to have our daily CSV log files transferred from GCS to Azure Data Lake Store, but I can't figure out the easiest way to do it.
Is there a built-in solution for that?
Can I do it with Data Factory?
I'd rather avoid running a VM on a schedule to do this with the APIs. The idea comes from the GCS -> (Dataflow ->) BigQuery solution.
Thanks for any ideas!
Yes, you can move data from Google Cloud Storage to Azure Data Lake Store with Azure Data Factory by developing a custom copy activity. However, in that activity you will still be using the APIs to transfer the data. See the details in this article.
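As a rough illustration of what that copy step could look like (only a sketch, not the article's code; bucket, store, project and path names are placeholders):

    import datetime
    from google.cloud import storage
    from azure.datalake.store import core, lib, multithread

    day = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()

    # Download yesterday's log objects from GCS
    # (uses GOOGLE_APPLICATION_CREDENTIALS for authentication).
    gcs = storage.Client(project="<gcp-project>")
    local_files = []
    for blob in gcs.list_blobs("<log-bucket>", prefix=f"logs/{day}/"):
        local_name = blob.name.replace("/", "_")
        blob.download_to_filename(local_name)
        local_files.append(local_name)

    # Upload them into Data Lake Store under a matching date folder.
    creds = lib.auth(tenant_id="<tenant-id>", client_id="<app-id>",
                     client_secret="<app-secret>")
    adl = core.AzureDLFileSystem(creds, store_name="<adls-account>")
    for name in local_files:
        multithread.ADLUploader(adl, rpath=f"/logs/{day}/{name}",
                                lpath=name, overwrite=True)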
