How to get usage statistics from Databricks or SQL Databricks? - apache-spark

I am looking for a way to get usage statistics from Databricks (Data Science & Engineering and SQL persona).
For example:
I created a table. I want to know how many times a specific user queried that table.
How many times was a pipeline triggered?
How long did it take to run a DLT pipeline?
Is there any way to get usage statistics?

Yes, there are several ways to get usage statistics from Databricks:
Databricks UI:
The Databricks UI provides information on the usage of tables, notebooks, and jobs. You can view the number of times a table was accessed, the number of times a notebook was run, and the duration of a job run.
Audit Logs:
Databricks maintains audit logs that can be used to track user activity and monitor usage statistics. These logs contain information on the type of activity performed, the user who performed the activity, and the time the activity was performed.
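If audit log delivery to a storage location is configured, the delivered JSON files can be read directly with Spark. A minimal sketch, assuming a placeholder delivery path and the usual audit log fields (serviceName, actionName, userIdentity.email):

```python
# Sketch: read delivered Databricks audit logs (JSON) and summarise activity per user.
# The storage path is a placeholder for wherever audit log delivery is configured.
audit = spark.read.json("abfss://audit-logs@<storage-account>.dfs.core.windows.net/")

per_user_activity = (
    audit
    .selectExpr("userIdentity.email AS user", "serviceName", "actionName")
    .groupBy("user", "serviceName", "actionName")
    .count()
    .orderBy("count", ascending=False)
)
per_user_activity.show(truncate=False)
```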
Databricks REST API:
The Databricks REST API provides programmatic access to Databricks resources, including usage statistics. You can use the API to retrieve information on table usage, job run history, and other usage statistics.
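For example, job run history (including run durations) can be pulled with the Jobs API. A minimal sketch, where the workspace URL, token, and job ID are placeholders:

```python
import requests

# Placeholders: set to your workspace URL, a personal access token, and a real job ID.
HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# List recent runs of a job; each run carries start/end times, so duration can be derived.
resp = requests.get(
    f"{HOST}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"job_id": 123, "limit": 25},
)
resp.raise_for_status()

for run in resp.json().get("runs", []):
    ended = run.get("end_time") or 0
    duration_s = (ended - run["start_time"]) / 1000 if ended else None
    print(run["run_id"], run["state"]["life_cycle_state"], duration_s)
```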
Databricks Delta Metrics:
Databricks Delta provides detailed metrics on the usage of Databricks Delta tables, including information on the number of reads and writes, the size of the tables, and the duration of job runs.
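As a rough sketch of what that looks like in practice, DESCRIBE HISTORY on a Delta table exposes per-commit metrics (the table name below is hypothetical):

```python
# Sketch: inspect the Delta transaction history of a (hypothetical) table.
# operationMetrics contains per-commit details such as numOutputRows and numFiles.
history = spark.sql("DESCRIBE HISTORY my_database.my_table")
history.select("version", "timestamp", "userName", "operation", "operationMetrics").show(truncate=False)
```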
Feel free to refer to this doc: https://docs.databricks.com/administration-guide/account-settings/audit-logs.html

Related

Delta lake and ADLS Gen2 transactions

We are running a Delta lake on ADLS Gen2 with plenty of tables and Spark jobs. The Spark jobs are running in Databricks and we mounted the ADLS containers into DBFS (abfss://delta@<our-adls-account>.dfs.core.windows.net/silver). There's one container for each "tier": bronze, silver, gold.
This setup has been stable for some months now, but last week we saw a sudden increase in transactions within our storage account, particularly in the ListFilesystemDir operations.
We've added some smaller jobs that read and write some data in that time frame, but turning them off did not reduce the amount of transactions back to the old level.
Two questions regarding this:
Is there some sort of documentation that explains which operation on a Delta table causes which kind of ADLS transactions?
Is it possible to find out which container/directory/Spark job/... causes this amount of transactions, without turning off the Spark jobs one by one?
If you go into the logs for your data lake (assuming you have Log Analytics enabled), you can see the exact timestamp, caller, and target of the spike. Take that information into your Databricks cluster and open the Spark UI; there you can match the timestamps to jobs and find out which notebook is causing it.
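If the diagnostic logs go to a Log Analytics workspace, the same lookup can be done programmatically. A minimal sketch using the azure-monitor-query package, assuming the storage diagnostic table is StorageBlobLogs (adjust table and column names to what your workspace actually contains):

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Placeholder: the Log Analytics workspace that receives the ADLS Gen2 diagnostic logs.
WORKSPACE_ID = "<log-analytics-workspace-id>"

# KQL: count ListFilesystemDir calls per caller and URI over the last day.
QUERY = """
StorageBlobLogs
| where OperationName == "ListFilesystemDir"
| summarize calls = count() by CallerIpAddress, UserAgentHeader, Uri = tostring(split(Uri, "?")[0])
| top 20 by calls desc
"""

client = LogsQueryClient(DefaultAzureCredential())
result = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=1))

for row in result.tables[0].rows:
    print(list(row))
```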

Optimizing Azure Analysis Services processing on Synapse Analytics

I have a massive model with billions of records in Synapse Analytics that needs to be loaded into Azure Analysis Services. At present we have the tables partitioned by month, the datatypes and joins are perfect, and we have reduced the tables to only the columns we need. We have scaled Synapse to DWU 5000c and Azure Analysis Services to S9v2 (the maximum setting). Even so, the first full process still takes many hours, with up to 64 partitions running in parallel given the concurrency limits in Synapse.
I am curious, is there a setting that can enable a 'mass bulk import process full' that I am missing at this point?
For example, here: https://www.businessintelligenceinfo.com/business-intelligence/self-service-bi/building-an-azure-analysis-services-model-on-top-of-azure-blob-storage-part-3, he loads the data differently (from storage blobs) and was able to load 1 TB by changing the config to ReadBlobData(Source, "", 1, 50). Is there any similar config for Synapse to Analysis Services?

Does Databricks cluster need to be always up for VACUUM operation of Delta Lake?

I am using Azure Databricks with the latest runtime for the clusters. I have some confusion regarding the VACUUM operation in Delta Lake. We know we can set a retention duration on the deleted data; however, for the data to actually be deleted after the retention period is over, do we need to keep the cluster up for the entire duration?
In simple words: do we need to have a cluster always in a running state in order to leverage Delta Lake?
You don't need to always keep a cluster up and running. You can schedule a vacuum job to run daily (or weekly) to clean up stale data older than the threshold. Delta Lake doesn't require an always-on cluster. All the data/metadata are stored in the storage (s3/adls/abfs/hdfs), so no need to keep anything up and running.
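A minimal sketch of such a scheduled notebook/job (the table path is a placeholder; the default safety check expects a retention of at least 168 hours):

```python
# Sketch of a notebook that a daily Databricks job could run.
# The table location is a placeholder; point it at your own Delta table, or use a table name instead.
table_path = "abfss://silver@<our-adls-account>.dfs.core.windows.net/silver/my_table"

# Remove files no longer referenced by the Delta log and older than 7 days (168 hours).
spark.sql(f"VACUUM delta.`{table_path}` RETAIN 168 HOURS")
```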
You do, however, need a cluster up and running to query the data in Databricks tables.
If you have configured an external metastore for Databricks, then you can use another engine such as Apache Hive by pointing it at that external metastore database and query the data through the Hive layer without using Databricks.

batch processing in azure

We are planning to do batch processing on a daily basis. We generate 1 GB of CSV files every day and will manually put them into Azure Data Lake Store. I have read the Microsoft Azure documents regarding batch processing and have decided to use Spark for it. My question is: after we transform the data using RDD/DataFrame, what is the next step? How can we visualize the data? Since this process is supposed to run every day, once the data transformation is done using Spark, do we need to push the data to a data store like Hive, HDFS, or Cosmos DB before we can visualize it?
There are several options for doing this on Azure. It really depends on your requirements (e.g. number of users, needed visualizations, etc.). Examples (a short PySpark sketch of the first option follows the list):
Running Spark on Azure Databricks, you could use the Notebook capabilities to visualize your data
Use HDInsight with Jupyter or Zeppelin Notebooks
Define Spark tables on Azure Databricks and visualize them with Power BI
Load the data with Azure Data Factory V2 to Azure SQL DB or Azure SQL Data Warehouse and visualize it with Power BI.
For time-series data you could push the data via Spark to Azure Event Hubs (see the example notebook with an Event Hubs sink in the following documentation) and consume it via Azure Time Series Insights. If you have an event data stream, this could also replace your batch-oriented architecture in the future. Parquet files are used by Azure Time Series Insights as long-term storage (see the following link). For Spark, also have a look at the Time Series package, which adds some time-series capabilities to Spark.
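As a minimal sketch of the first option (a Databricks notebook), assuming the daily CSVs land under a placeholder ADLS path and hypothetical column names: read them, transform, and persist the result as a table that a notebook chart or Power BI can pick up:

```python
# Sketch: daily batch job reading CSVs from ADLS, transforming, and saving a table.
# The input path, table name, and column names are placeholders/assumptions.
daily = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("abfss://raw@<storage-account>.dfs.core.windows.net/daily/2024-01-01/")
)

# Example transformation: aggregate per customer.
summary = daily.groupBy("customer_id").sum("consumption_kwh")

# Persist as a table so it can be charted with display() in the notebook
# or consumed by Power BI through the Databricks connector.
summary.write.mode("overwrite").saveAsTable("daily_consumption")
```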

Batch processing with spark and azure

I am working for an energy provider company. Currently, we generate 1 GB of data in the form of flat files per day. We have decided to use Azure Data Lake Store to store our data, on which we want to do batch processing on a daily basis. My question is: what is the best way to transfer the flat files into Azure Data Lake Store? And after the data is pushed into Azure, is it a good idea to process the data with HDInsight Spark (e.g. the DataFrame API or Spark SQL) and finally visualize it with Azure?
For a daily load from a local file system, I would recommend using Azure Data Factory Version 2. You have to install a self-hosted Integration Runtime on premises (more than one for high availability) and consider several security topics (local firewalls, network connectivity, etc.). Detailed documentation can be found here, and there are also some good tutorials available. With Azure Data Factory you can trigger your upload to Azure with a Get Metadata activity and use, e.g., an Azure Databricks Notebook activity for further Spark processing.
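On the Spark side, the ADF Notebook activity can hand the freshly landed file path to the notebook as a base parameter. A minimal sketch of the receiving notebook, where the widget name ("input_path") and output path are assumptions:

```python
# Sketch of the Databricks notebook invoked by an ADF Notebook Activity.
# ADF base parameters arrive as notebook widgets; "input_path" is an assumed parameter name.
input_path = dbutils.widgets.get("input_path")

df = spark.read.option("header", "true").csv(input_path)

# Write the processed data as Parquet for downstream querying/visualization.
df.write.mode("append").parquet("abfss://curated@<storage-account>.dfs.core.windows.net/daily/")
```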
