Delta lake and ADLS Gen2 transactions - azure

We are running a Delta lake on ADLS Gen2 with plenty of tables and Spark jobs. The Spark jobs are running in Databricks and we mounted the ADLS containers into DBFS (abfss://delta#<our-adls-account>.dfs.core.windows.net/silver). There's one container for each "tier", so bronze, silver, gold.
This setup has been stable for some months now, but last week, we've seen a sudden increase in transactions within our storage account, particularly in the ListFilesystemDir operations:
We've added some smaller jobs that read and write some data in that time frame, but turning them off did not reduce the amount of transactions back to the old level.
Two questions regarding this:
Is there some sort of documentation that explains which operation on a Delta table causes which kind of ADLS transactions?
Is it possible to find out which container/directory/Spark job/... causes this amount of transactions, without turning off the Spark jobs one by one?

If you go into logs from your data lake (if you have log analytics enabled) you can view the exact timestamp, caller and target of the spike. Take that data and go into your databricks cluster and navigate to Spark UI. In there you should be able to see timestamps and jobs. There you can find what notebook is causing it.

Related

Does Databricks cluster need to be always up for VACUUM operation of Delta Lake?

I am using Azure Databricks with latest runtime for the clusters. I had some confusion regarding VACUUM operation in delta lake. We know we can set a retention duration on the deleted data, however, for actual data to be delete after the retention period is over, do we need to keep the Cluster Up for the entire duration?
In simple words-: Do we need to have Cluster always in running state in order to leverage Delta lake ?
You don't need to always keep a cluster up and running. You can schedule a vacuum job to run daily (or weekly) to clean up stale data older than the threshold. Delta Lake doesn't require an always-on cluster. All the data/metadata are stored in the storage (s3/adls/abfs/hdfs), so no need to keep anything up and running.
Apparently you need a cluster to be up and running always to query for the data available in databricks tables.
If you have configured the external meta store for databricks, then you can use any wrappers like apache hive by pointing it to that external meta store DB and query the data using hive layer without using databricks.

Move delta lake files from one storage to another

I need to move my delta lake files to a new blobstore on a different subscription. Any ideas whats the best way to do this?
Im moving them to an ADLS Gen2 Storage, I think the previous storage was just blob storage. This delta lake is updated on an hourly basis by databricks jobs (but I can pause those if necessary). Size is around 3TB-5TB, I'm initially thinking of pausing all jobs and using azcopy to move the files and point the jobs there afterwards. But I want to check other options that may be better in terms of speed of transfer and cost.
The best way would just to use Azure Data Factory. There you can point to your different locations and move the files really quick.

batch processing in azure

We are planning to do batch processing on a daily basis. We generate 1 GB of CSV files every day and will manually put them into Azure Data Lake Store. I have read the Microsoft Azure documents regarding the batch processing and I have decided to use Spark as to batch processing. My question is that after we transfer the data using RDD/DF what would be the next step? how we can visualize the data? since this process is supposed to be run every day, once the data transformation done using Spark, do we need to push the data to any kind of data store like hive hdfs or cosmos before we could visualize it?
There are several options doing this on Azure. It really depends on your requirements (e.g. number of users, needed visualizations, etc). Examples for doing it:
Running Spark on Azure Databricks, you could use the Notebook capabilities to visualize your data
Use HDInsight with Jupyter or Zeppelin Notebooks
Define Spark tables on Azure Databricks and visualize them with Power BI
Load the data with Azure Data Factory V2 to Azure SQL DB or Azure SQL Data Warehouse and visualize it with Power BI.
For Time-Series-Data you could push the data via Spark to Azure EventHubs (see Example notebook with Eventhubs Sink in the following documentation) and consume it via Azure Time Series Insights. If you have an EventData-Stream this could also replace your batch oriented architecture in the future. Parquet files will be used by Azure Time Series Insights as Long-term Storage (see the following link). For Spark also have a look at Time Series Package which adds some time series capabilities to spark.

Why is Polybase slow for large compressed files that span 1 billion records?

What would cause Polybase performance to degrade when querying larger datasets in order to insert records into Azure Data Warehouse from Blob storage?
For example, a few thousand compressed (.gz) CSV files with headers partitioned by a few hours per day across 6 months worth of data. Querying these files from an external table in SSMS is not exactly optimial and it's extremely slow.
Objectively, I'm loading data into Polybase in order to transfer data into Azure Data Warehouse. Except, it seems with large datasets, Polybase is pretty slow.
What options are available to optimize Polybase here? Wait out the query or load the data after each upload to blob storage incrementally?
In your scenario, Polybase has to connect to the files in the external source, uncompress them, then ensure they fit your external table definition (schema) and then allow the contents to be targeted by the query. When you are processing large amounts of text files in a one-off import fashion, there is nothing to really cache either, since it is dealing with new content every time. In short, your scenario is compute heavy.
Azure Blob Storage will (currently) max out at around 1,250MB/sec, so if your throughput is not near maxing this, then the best way to improve performance is to upgrade your DWU on your SQL data warehouse. In the background, this will spread your workload over a bigger cluster (more servers). SQL Data Warehouse DWU can be scaled either up and down in a matter of minutes.
If you have huge volumes and are maxing the storage, then use multiple storage accounts to spread the load.
Other alternatives include relieving Polybase of the unzip work as part of your upload or staging process. Do this from within Azure where the network bandwidth within a data center is lightning fast.
You could also consider using Azure Data Factory to do the work. See here for supported file formats. GZip is supported. Use the Copy Activity to copy from the Blob storage in to SQL DW.
Also look in to:
CTAS (Create Table as Select), the fastest way to move data from external tables in to internal storage in Azure Data Warehouse.
Creating statistics for your external tables if you are going to query them repeatedly. SQL Data Warehouse does not create statistics automatically like SQL Server and you need to do this yourself.

HDInsight: HBase or Azure Table Storage?

Currently my team is creating a solution that would use HDInsight. We will be getting 5TB of data daily and will need to do some map/reduce jobs on this data. Would there be any performance/cost difference if our data will be stored in Azure Table Storage instead of Azure HBase?
The main differences will be in both functionality and cost.
Azure Table Storage doesn't have a map reduce engine attached to it in itself, though of course you could use the map reduce approach to write your own.
You can use Azure HDInsight to connect Map Reduce to table storage. There are a couple of connectors around, including one written by me which is hive focused and requires some configuration, and may not suit your partition scheme (http://www.simonellistonball.com/technology/hadoop-hive-inputformat-azure-tables/) and a less performance focused, but more complete version from someone at Microsoft (http://blogs.msdn.com/b/mostlytrue/archive/2014/04/04/analyzing-azure-table-storage-data-with-hdinsight.aspx).
The main advantage of Table Storage is that you aren't constantly taking processing cost.
If you use HBase, you will need to run a full cluster all the time, so there is a cost disadvantage, however, you will get some functionality and performance gains, plus you will have something a bit more portable, should you wish to use other hadoop platforms. You would also have access to a much greater range of analytic functionality with the HBase option.
HDInsight (HBase/Hadoop) uses Azure Blob storage not ATS. For your data-storage you will charged only applicable blob storage cost, based on your subscription.
P.S. Don't forget to delete your cluster once job has completed, to avoid charges. Your data will persist in BLOB storage and can be used by next cluster you build.

Resources