Data Copy between ADLS Instances - apache-spark

Copying data between various instances of ADLS using DISTCP
Hi All
Hope you are doing well.
We have a use case around using ADLS as different tiers of the ingestion process, and we just require your valuable opinions regarding its feasibility.
INFRASTRUCTURE: There will be two instances of ADLS, named LAND and RAW. The LAND instance will receive files directly from the source, while the RAW instance will receive a file once validations pass in the LAND instance. We also have a Cloudera cluster hosted on Azure with connectivity established to both ADLS instances.
PROCESS: A set of data and control files will land in the LAND ADLS instance. We need to run Spark code on the Cloudera cluster to perform a count validation between the data and control files present in the LAND instance. Once the validation succeeds, we want a DistCp command to copy the data from the LAND instance to the RAW instance. We assume the DistCp utility is already available on the Cloudera cluster.
Can you suggest whether the above approach looks fine?
Primarily, our question is whether the DistCp utility supports data movement between two different ADLS instances.
We also considered other options like AdlCopy, but DistCp appeared better.
NOTE: We haven't considered using Azure Data Factory since it may have certain security challenges, though we know Data Factory is well suited to the above use case.

If your use case requires you to copy data between multiple storage accounts, DistCp is the right way to execute this.
Note that even if you were to encapsulate this solution in Data Factory, the pipeline's copy activity would invoke DistCp.
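For illustration, here is a minimal PySpark sketch of the validation-then-copy flow described above. The account URIs, file layout, and control-file format are placeholders (shown with ABFS/Gen2 paths; Gen1 accounts would use adl:// URIs), and DistCp is launched as an external Hadoop job from the Cloudera node running the script:

    import subprocess
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("land-to-raw-validation").getOrCreate()

    # Placeholder paths for the LAND and RAW ADLS accounts.
    land = "abfss://land@landaccount.dfs.core.windows.net/ingest/batch01"
    raw = "abfss://raw@rawaccount.dfs.core.windows.net/ingest/batch01"

    # Count validation: compare the data file's row count with the count declared
    # in the control file (assumed here to be a one-line text file).
    data_count = spark.read.option("header", True).csv(land + "/data.csv").count()
    control_count = int(spark.read.text(land + "/control.txt").first().value)

    if data_count != control_count:
        raise ValueError("Count mismatch: data=%d, control=%d" % (data_count, control_count))

    # DistCp runs as a MapReduce job on the cluster; both accounts must be
    # reachable with the cluster's configured ADLS credentials.
    subprocess.run(["hadoop", "distcp", "-update", land, raw], check=True)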

Related

How to load a CSV file from on-prem to Azure Data Lake

I have a file placed on an on-prem server. The file gets a new row every 10 seconds, and I need a way to copy only the newly added rows to a data lake, not the whole file every time. I use Azure Synapse, and there is a timestamp column on the rows.
Thank you
There is not a lot of information on your setup, which could be subject to numerous constraints, e.g. is your file accessible from outside your network? If not, you need something to perform outbound activities, e.g. an On-Premises Data Gateway on that machine (or on another machine with network access to that file).
The Azure Synapse integration runtime supports a "self-hosted" mode, which lets a Data Factory/Synapse pipeline reach the on-premises file and run data flows against it (data flows could accommodate the "append" capability you need). Details at https://learn.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime?tabs=data-factory
You might also consider alternative solutions, such as syncing to Azure Storage with an on-premises agent, at which point Azure Synapse could use the online copy directly.
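As a concrete illustration of the timestamp-based "append only the new rows" idea, here is a minimal PySpark sketch. The paths, column name, and watermark mechanism are hypothetical and assume the source file is reachable from wherever Spark runs (e.g. via the self-hosted runtime or a synced copy):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("incremental-csv-load").getOrCreate()

    source_path = "/mnt/onprem_share/events.csv"   # assumed path to the on-prem file
    lake_path = "abfss://raw@lakeaccount.dfs.core.windows.net/events"
    watermark_path = "abfss://raw@lakeaccount.dfs.core.windows.net/_watermarks/events"

    # Read the last watermark (max timestamp already copied); default to epoch on the first run.
    try:
        last_ts = spark.read.parquet(watermark_path).first()["max_ts"]
    except Exception:
        last_ts = "1970-01-01 00:00:00"

    src = spark.read.option("header", True).csv(source_path)
    new_rows = src.where(F.col("event_ts") > F.lit(last_ts))   # event_ts is the timestamp column

    if new_rows.head(1):
        new_rows.write.mode("append").parquet(lake_path)
        new_rows.agg(F.max("event_ts").alias("max_ts")).write.mode("overwrite").parquet(watermark_path)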

Do you have to use Azure Data Factory, or can you just use Databricks as your ETL tool for your multiple sources?

...Or do I need to load the data into a data lake using Data Factory first and then use Databricks for ELT?
Depends.
Databricks can connect to data sources and ingest data. However, Azure Data Factory (ADF) has more connectors than Databricks, so it depends on what you need. If you use ADF, you need to land the data somewhere (e.g. Azure Storage) so that Databricks can pick it up.
Moreover, another main feature of ADF is orchestrating data movement and activities. Databricks does have a Jobs feature to schedule notebooks or JARs, but it is limited to Databricks itself. If you want to orchestrate anything outside of Databricks (e.g. drop a file to SFTP, send an email on completion, terminate a Databricks cluster, etc.), then ADF is the way to go.
Indeed, it depends on the scenario, I think. If you have a wide variety of data sources you need to connect to, then ADF is probably the better option.
If your sources are data files (in any format), you could consider using Databricks for ETL.
I use Databricks as a pure ETL tool (without ADF) by mounting a Blob Storage container from a notebook, reading huge XML data from there into a DataFrame in Databricks, parsing and reshaping the DataFrame, and then writing the data into an Azure SQL database. Fair to say I'm not really using it for the "E" in ETL, as the data has already been extracted from the real source system.
Big advantage is the power you have at your disposal to parse the files.
Best regards.
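For context, a minimal sketch of the XML-to-SQL flow described above, assuming the spark-xml library is installed on the cluster; the mount point, row tag, table name, and JDBC details are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("xml-to-sql").getOrCreate()

    # dbutils.fs.mount(...) would normally expose the Blob container, e.g. at /mnt/landing.
    df = (spark.read
          .format("com.databricks.spark.xml")
          .option("rowTag", "record")           # hypothetical row tag
          .load("/mnt/landing/huge_export.xml"))

    # Reshape/parse as needed, then write to Azure SQL Database over JDBC.
    jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"
    (df.write
       .format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", "dbo.records")
       .option("user", "etl_user")
       .option("password", "<secret>")
       .mode("append")
       .save())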

Backup of Data Lake Store

I am working on a backup strategy for Data Lake Store (DLS). My plan is to create two DLS accounts and copy data between them. I have evaluated several approaches, but none of them satisfies the requirement to preserve the POSIX ACLs (permissions, in DLS parlance). The PowerShell cmdlets require data to be downloaded from the primary DLS onto a VM and re-uploaded onto the secondary DLS. The AdlCopy tool works only on Windows 10, does not preserve permissions, and does not support copying data across regions either (not that this is a hard requirement). Data Factory seemed like the most sensible approach until I realized it also doesn't preserve permissions.
Which leads me to my last option - DistCp. According to the DistCp guide (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html), the tool supports preserving permissions. However, the downside of using DistCp is that the tool must be run from HDInsight. Although it supports both intra- and inter-cluster copying, I would rather not keep an HDInsight cluster running just for backup operations.
Am I missing something? Does anyone have any better suggestions?
Your assessment is comprehensive. Those are indeed the options that are available should you want to copy over permissions. So you will have to choose one of them, sorry. If you truly want a serverless option that would copy over the permissions, Azure Data Factory would have to be it. Could you please create a feedback item here - https://feedback.azure.com/forums/270578-data-factory?
Thanks,
Sachin Sheth
Program Manager, Azure Data Lake.
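For reference, a minimal sketch of the permission-preserving DistCp run the question alludes to, wrapped in Python so it could be scheduled from the HDInsight head node; the account URIs are placeholders, and the exact set of -p preservation flags should be checked against your Hadoop version:

    import subprocess

    src = "adl://primarydls.azuredatalakestore.net/data"
    dst = "adl://backupdls.azuredatalakestore.net/data"

    # -pugpa asks DistCp to preserve user, group, permission bits and ACLs;
    # -update copies only files that changed since the last run.
    subprocess.run(["hadoop", "distcp", "-pugpa", "-update", src, dst], check=True)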

How to efficiently move big data from a data center to Azure Blob Storage for later processing via HDInsight?

I need to set up scheduled tasks whose purpose is to copy/move large amounts of data from an on-premises data center to Windows Azure Blob Storage.
The options I've explored are WebHDFS and Flume (the latter does not seem to be supported by HDInsight currently).
What is the most efficient way to transfer unstructured files from a data center to Windows Azure Blob Storage?
If you are using HDInsight, you don't need to involve HDFS at all. In fact, you don't need your cluster to be running to upload the data. The best way of getting data into HDInsight is to upload it to Azure Blob Storage, using either the standard .NET clients, a third-party tool like Azure Management Studio, or Microsoft's AzCopy.
If you want to stream the data constantly, then you are probably better setting up something like Flume, Kafka or Storm to work against an HDInsight cluster, but that will require a certain amount of customisation on the cluster itself, which means you'll run into problems with reboots, and require a permanent cluster.
You didn't mention how much data you're talking about (you just said large amounts). But assuming it's hundreds of TB or petabytes, Azure has an Import/Export Service which offers disk shipping.
Outside of that, you'd need to use your own code or a tool such as Microsoft's AzCopy to transfer your content to blobs. Remember that you'll be able to perform parallel uploads to shorten transfer time (as long as your data center's bandwidth is large enough for you to see the benefit).
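As a small illustration of a programmatic upload with parallel block transfer, here is a sketch using the azure-storage-blob Python SDK (rather than the .NET client mentioned above); the connection string, container, and file names are placeholders:

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string("<connection-string>")
    container = service.get_container_client("ingest")

    with open("/data/logs/day01.log", "rb") as fh:
        # max_concurrency controls how many blocks are uploaded in parallel.
        container.upload_blob(name="logs/day01.log", data=fh,
                              overwrite=True, max_concurrency=8)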
You could use CloudBerry Drive and Flume to stream data to an HDInsight cluster / Azure Blob Storage:
http://blogs.msdn.com/b/bigdatasupport/archive/2014/03/18/using-apache-flume-with-hdinsight.aspx
No, you cannot use Flume to stream data directly to HDInsight. A post from the Microsoft blog says that
a vast majority of Flume consumers will land their streaming data into HDFS – and HDFS is not the default file system used with HDInsight. Even if it were - we do not expose public-facing Name Node or HDFS endpoints, so the Flume agent would have a terrible time reaching the cluster! So, for these reasons and a few others, the answer is typically "no… it won't work or it's not supported".
Source: http://blogs.msdn.com/b/bigdatasupport/archive/2014/03/18/using-apache-flume-with-hdinsight.aspx?CommentPosted=true#commentmessage
It is also worth mentioning the ExpressRoute option. Microsoft now has a program called ExpressRoute where your data center can be connected straight to Azure with a much faster connection, in cooperation with your ISP. See also http://azure.microsoft.com/en-us/services/expressroute/

Can we use HDInsight Service for ATS?

We have a logging system called Xtrace. We use this system to dump logs, exceptions, traces, etc. into a SQL Azure database. The Ops team then uses this data for debugging and SCOM purposes. Considering the 150 GB limit that SQL Azure has, we are thinking of using the HDInsight (Big Data) Service.
If we dump the data into Azure Table Storage, will the HDInsight Service work against ATS?
Or will it work only against blob storage, which would mean the log records need to be created as files on blob storage?
Last question: considering the scenario I explained above, is it a good candidate for the HDInsight Service?
HDInsight is going to consume content from HDFS, or from blob storage mapped to HDFS via Azure Storage Vault (ASV), which effectively provides an HDFS layer on top of blob storage. The latter is the recommended approach, since you can have a significant amount of content written to blob storage, and this maps nicely into a file system that can be consumed by your HDInsight job later. This would work great for things like logs/traces. Imagine writing hourly logs to separate blobs within a particular container. You'd then have your HDInsight cluster created, attached to the same storage account. It then becomes very straightforward to specify your input directory, which is mapped to files inside your designated storage container, and off you go.
You can also store data in Windows Azure SQL DB (legacy naming: "SQL Azure"), and use a tool called Sqoop to import data straight from SQL DB into HDFS for processing. However, you'll have the 150GB limit you mentioned in your question.
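As an illustration of that Sqoop route, here is a sketch of an import from SQL DB into HDFS, wrapped in Python; the server, database, table, and credentials are placeholders:

    import subprocess

    jdbc = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
            "database=mydb;user=etl_user@myserver;password=<secret>")

    # Pull the logging table into HDFS so an HDInsight job can process it.
    subprocess.run([
        "sqoop", "import",
        "--connect", jdbc,
        "--table", "XtraceLogs",
        "--target-dir", "/data/xtrace",
        "--num-mappers", "4",
    ], check=True)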
There's no built-in mapping from Table Storage to HDFS; you'd need to create some type of converter to read from Table Storage and write to text files for processing (but I think writing directly to text files will be more efficient, skipping the need for doing a bulk read/write in preparation for your HDInsight processing). Of course, if you're doing non-HDInsight queries on your logging data, then it may indeed be beneficial to store initially to Table Storage, then extracting the specific data you need whenever launching your HDInsight jobs.
There's some HDInsight documentation up on the Azure Portal that provides more detail around HDFS + Azure Storage Vault.
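To make the blob-backed input path concrete, here is a small PySpark-style sketch of reading hourly log blobs through the HDFS-compatible layer (asv:// historically, wasb:// in later HDInsight releases); the account, container, and log schema are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("xtrace-logs").getOrCreate()

    # Each hour's log is a separate blob under 2014/03/18/; the wildcard pulls
    # the whole day into one DataFrame.
    logs = spark.read.json("wasb://logs@mystorageaccount.blob.core.windows.net/2014/03/18/*")
    logs.where(logs.level == "ERROR").groupBy("component").count().show()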
The answer above is slightly misleading with regard to the Azure Table Storage part. It is not necessary to first write ATS contents to text files and then process the text files. Instead, a standard Hadoop InputFormat or Hive StorageHandler can be written that reads directly from ATS. There are at least two implementations available at this point in time:
ATS InputFormat and Hive StorageHandler written by an MS employee
ATS Hive StorageHandler written by Simon Ball
