Write to Kusto ADX with Spark connector - Performance Observations

We have a use case where we receive a large volume of data (80 GB split across 300 files, arriving every 5 minutes) in ADLS Gen2, and we use the Spark connector to write it from ADLS Gen2 to a Kusto table.
During the write stage, we noticed that multiple cores are used to batch the data but only one core is used to write to the Kusto table, i.e., 80 GB is written with a single core while the remaining cores sit idle.
This process takes 20-25 minutes, and we have a tight SLA of 10 minutes.
Cluster: Azure Databricks, 5 nodes, each with 28 GB RAM and 8 CPU cores.
Each file is ~260 MB uncompressed and in Parquet format. I have also seen a best-practices document that says file sizes should be between 100 MB and 1 GB uncompressed.
We use the writeStream API in Databricks to write the data.
What is the ideal approach to write data from ADLS to ADX in a distributed way using the Spark connector?

First, the most efficient flow from ADLS storage to ADX is Event Grid, because writing through the Spark connector means the data is translated to Spark's internal format and then to CSV, which is sent to ADX. From our conversation it was clear that you are using Spark to transform the data before ingestion; in that case the Spark connector is a good choice.
Since version 3.1.0, the connector flow is split by default into three jobs (unless writeMode.Queued is used). The first job translates the data into CSV, writes it to storage, and queues an ingestion for ADX; this is done in a distributed fashion. The second job polls these ingestions until all of them finish successfully, to ensure transactionality; it uses one core because the operation is really cheap (a call to table storage) and there is no need to hold more than one worker for it. The third job seals the transaction (a metadata operation in ADX) and therefore also needs only one core.


Synapse Pipeline : DF-Executor-OutOfMemoryError

I have nested JSON in gzip format as the source. In a Synapse pipeline I am using the data flow activity, with the compression type set to gzip in the source dataset. The pipeline ran fine for small files under 10 MB. When I tried to run it on a larger gzip file of about 89 MB,
the data flow activity failed with the error below:
Error1 {"message":"Job failed due to reason: Cluster ran into out of memory issue during execution,
please retry using an integration runtime with bigger core count and/or memory optimized compute type.
I am requesting your help and guidance.
To resolve Error1, I tried an Azure integration runtime with a bigger core count (128+16 cores) and the memory-optimized compute type, but I still get the same error.
I thought it might be too intensive to read JSON directly from gzip, so I tried a basic Copy data activity to decompress the gzip file first, but it still fails with the same error.
For your scenario, I would recommend reading from several small JSON files instead of pulling all the data from one large JSON file. First split the big JSON file into a few parts in the data flow using round-robin partitioning, and store these files in a folder in blob storage.
Round robin distributes data evenly across partitions. Use it when you don't have good key candidates for a smarter partitioning scheme. The number of physical partitions is configurable.
You need to evaluate the data size or the partition count of the input data, then set a reasonable partition number under "Optimize". For example, suppose the cluster used for the data flow execution has 8 cores with 20 GB of memory per core, and the input data is 1000 GB in 10 partitions. If you run the data flow directly, it will hit the OOM issue because 1000 GB / 10 = 100 GB per partition, which exceeds 20 GB. It is better to set the partition number to 100, so that each partition is 1000 GB / 100 = 10 GB, which is below 20 GB.
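The arithmetic above can be sketched in a few lines of Python. This is only a rough sizing aid under the answer's own simplification that a partition fits if its raw size is below the memory per core; the function names are hypothetical, and real memory use also depends on decompression and deserialization overhead.

```python
import math


def partition_size_gb(input_gb: float, partitions: int) -> float:
    """Average size of one partition, assuming an even (round-robin) split."""
    if input_gb <= 0 or partitions <= 0:
        raise ValueError("input size and partition count must be positive")
    return input_gb / partitions


def min_partitions(input_gb: float, memory_per_core_gb: float) -> int:
    """Smallest partition count that keeps each partition strictly below the per-core memory."""
    if input_gb <= 0 or memory_per_core_gb <= 0:
        raise ValueError("sizes must be positive")
    # floor(input / memory) + 1 guarantees input / n < memory
    return math.floor(input_gb / memory_per_core_gb) + 1


# Numbers from the example: 1000 GB of input, 20 GB of memory per core.
print(partition_size_gb(1000, 10))   # 100.0 GB per partition, too large for 20 GB
print(partition_size_gb(1000, 100))  # 10.0 GB per partition, fits
print(min_partitions(1000, 20))      # 51
```

The bare minimum here is 51 partitions; choosing 100, as suggested above, leaves headroom for processing overhead.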
After the above process, use these partitioned files to perform the data flow operations with a ForEach activity, and finally merge the results into a single file.
Reference: Partition in dataflow.

Where does Spark save retrieved data on Azure Databricks?

I would like to understand the difference between RAM and storage in Azure Databricks.
Suppose I am reading CSV data from Azure Data Lake (ADLS Gen2) as follows:
df = spark.read.csv("path to the csv file").collect()
I am aware that the read method in Spark is a transformation and is not run immediately. However, once I perform an action using the collect() method, I assume the data has actually been read from the data lake by Spark and loaded into RAM or onto disk. First, I would like to know where the data is stored: in RAM or on disk? If the data is stored in RAM, then what is cache used for? And if the data is retrieved and stored on disk, then what does persist do? I am aware that cache stores the data in memory for later use, and that if I have a very large amount of data, I can use persist to store the data on disk.
I would like to know how far Databricks can scale if we have petabytes of data.
How much do the RAM and disk differ in size?
How can I know where the data is stored at any point in time?
What is the underlying operating system running Azure Databricks?
Please note that I am a newbie to Azure Databricks and Spark.
I would like to get some recommendations on best practices when using Spark.
Your help is much appreciated!!
First, I would like to know, where is the data stored.
When you run an action such as collect, the data is collected from the executor nodes to the driver node and stored in RAM (memory).
And, if the data is stored in RAM, then what is cache used for
Spark uses lazy evaluation, which means that it does nothing until you call an action; once you do, it builds a DAG and then executes that DAG.
Let's understand it with an example. Suppose you have three tables: Table A, Table B and Table C. You join these tables and apply some business logic (maps and filters); let's call the resulting DataFrame filtered_data. You then use this DataFrame in, say, 5 different places (other DataFrames) for lookups, joins or other business reasons.
If you don't persist (cache) the filtered_data DataFrame, then every time it is referenced it goes through the joins and the other business logic again. So it is advisable to persist (cache) a DataFrame if you are going to use it in multiple places.
By default, cache stores the data in memory (RAM), but you can set the storage level to disk.
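To make the recomputation argument concrete, here is a toy model in plain Python. It is not the Spark API: the LazyFrame class, its runs counter and the expensive_join_and_filter function are all hypothetical stand-ins, meant only to show that an uncached lazy plan is re-executed on every use while a cached one runs once.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data from scratch
        self._use_cache = False
        self._cached = None
        self.runs = 0             # how many times the expensive plan was executed

    def cache(self):
        # Like Spark's cache(), this only marks the frame; nothing is computed yet.
        self._use_cache = True
        return self

    def collect(self):
        # An "action": triggers the computation unless a cached result exists.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.runs += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def expensive_join_and_filter():
    # Stands in for joining Table A, B and C and applying business logic.
    rows = [(i, i * i) for i in range(10)]
    return [row for row in rows if row[1] % 2 == 0]


uncached = LazyFrame(expensive_join_and_filter)
cached = LazyFrame(expensive_join_and_filter).cache()

# Use filtered_data in 5 different places.
for _ in range(5):
    uncached.collect()
    cached.collect()

print(uncached.runs, cached.runs)  # 5 1
```

In PySpark itself the corresponding calls are df.cache(), or df.persist(StorageLevel.DISK_ONLY) to choose a disk storage level, with StorageLevel imported from pyspark.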
I would like to know, how much can databricks scale if we have petabytes of data?
It's a distributed environment, so what you need to do is add more executors; you may also need to increase the memory and CPU configuration.
how can I know where the data is stored at any point in time?
If you haven't created a table or view, it's stored in memory.
What is the underlying operating system running Azure Databricks?
It uses the Linux operating system,
specifically Linux-4.15.0-1050-azure-x86_64-with-Ubuntu-16.04-xenial.
You can run the following commands to check:
import platform
print(platform.platform())

How to stream 100GB of data in Kafka topic?

One of our Kafka topics holds close to 100 GB of data.
We are running Spark Structured Streaming to get the data into S3.
When the data is up to 10 GB, streaming runs fine and we are able to get the data into S3.
But with 100 GB, it is taking forever to stream the data out of Kafka.
Question: How does Spark Streaming read data from Kafka?
Does it take the entire data from the current offset?
Or does it take it in batches of some size?
Spark will work off consumer groups, just as any other Kafka consumer, but in batches. Therefore it takes as much data as possible (based on various Kafka consumer settings) from the last consumed offsets. In theory, if you have the same number of partitions, with the same commit interval as 10 GB, it should only take 10x longer to do 100 GB. You've not stated how long that currently takes, but to some people 1 minute vs 10 minutes might seem like "forever", sure.
I would recommend you plot the consumer lag over time using the kafka-consumer-groups command line tool combined with something like Burrow or Remora... If you notice an upward trend in the lag, then Spark is not consuming records fast enough.
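As a minimal sketch of what "an upward trend in the lag" means, the snippet below fits a least-squares line through lag samples. The samples are made up for illustration and are assumed to be (seconds, total lag in records) pairs recorded periodically, for example from the LAG column of kafka-consumer-groups; the function names are hypothetical.

```python
def lag_slope(samples):
    """Least-squares slope of consumer lag (records) over time (seconds)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    covariance = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("timestamps must differ")
    return covariance / variance


def is_falling_behind(samples, tolerance=0.0):
    """True when lag grows faster than `tolerance` records per second."""
    return lag_slope(samples) > tolerance


# Hypothetical samples taken one minute apart.
growing = [(0, 1000), (60, 1600), (120, 2200)]
shrinking = [(0, 2200), (60, 1600), (120, 1000)]

print(lag_slope(growing), is_falling_behind(growing))      # 10.0 True
print(lag_slope(shrinking), is_falling_behind(shrinking))  # -10.0 False
```

A positive slope means records arrive faster than Spark consumes them; a negative slope means the job is catching up and will eventually drain the backlog.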
To overcome this, the first option would be to ensure that the Spark executors are evenly consuming all Kafka partitions.
You'll also want to make sure you're not doing major data transforms, beyond simple filters and maps, between consuming and writing the records, as this also introduces lag.
For non-Spark approaches, I would like to point out that the Confluent S3 connector is also batch-y in that it'll only periodically flush to S3, but the consumption itself is still closer to real-time than Spark. I can verify that it's able to write very large S3 files (several GB in size), though, if the heap is large enough and the flush configurations are set to large values.
Secor by Pinterest is another option that requires no manual coding.

Spark based processing of data stored on SSD

We are currently using a Spark 2.1 based application that analyses and processes a huge number of records to generate stats used for report generation. We are using 150 executors with 2 cores and 10 GB of memory each for our Spark jobs, and the data is ~3 TB stored in Parquet format. Processing 12 months of data takes ~15 minutes.
To improve performance, we want to try fully SSD-based nodes to store the data in HDFS. Are there any special configurations or optimisations needed for SSDs? Has any study been done on Spark processing performance with SSD-based HDFS vs HDD-based HDFS?
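SPARK_LOCAL_DIRS is the config that you need to change. It controls where Spark writes its scratch data, such as shuffle files and data spilled to disk, so point it at the SSD-backed directories.
The use case is the K-means algorithm, but it will help.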

Faster reading from blob storage via spark

I currently have a spark cluster set up with 4 worker nodes and 2 head nodes. I have a 1.5 GB CSV file in blob storage that I can access from one of the head nodes. I find that it takes quite a while to load this data and cache it using PySpark. Is there a way to load the data faster?
One thought I had was loading the data, then partitioning the data into k (number of nodes) different segments and saving them back to blob as parquet files. This way, I can load in different parts of the data set in parallel then union... However, I am unsure if all the data is just loaded on the head node, then when computation occurs, it distributes to the other machines. If the latter is true, then the partitioning would be useless.
Help would be much appreciated. Thank you.
Generally, you will want smaller files on blob storage so that data can be transferred from blob storage to compute in parallel, which gives you faster transfer rates. A good rule of thumb is a file size between 64 MB and 256 MB; a good reference is Vida Ha's Data Storage Tips for Optimal Spark Performance.
Your idea of reading the file and then saving it back as Parquet (with the default snappy compression codec) is a good one. Parquet is natively used by Spark and is often faster to query against. The only tweak would be to partition by file size rather than by the number of worker nodes. The data is loaded onto the worker nodes, but partitioning is helpful because more tasks are created to read more files.
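As a rough sketch of partitioning by file size rather than by node count, the helper below estimates how many output files to aim for. The function name and the 128 MB default are assumptions chosen from the middle of the 64 MB - 256 MB rule of thumb; the estimate uses the source CSV size, so the Parquet files will usually come out smaller after compression.

```python
import math


def target_file_count(total_mb: float, target_file_mb: float = 128) -> int:
    """Number of files needed so that each is roughly `target_file_mb` in size."""
    if total_mb <= 0 or target_file_mb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(total_mb / target_file_mb))


csv_size_mb = 1.5 * 1024  # the 1.5 GB CSV file from the question

print(target_file_count(csv_size_mb))       # 12 files of ~128 MB
print(target_file_count(csv_size_mb, 256))  # 6 files of ~256 MB
print(target_file_count(csv_size_mb, 64))   # 24 files of ~64 MB
```

With 4 worker nodes, splitting into only 4 files would give about 384 MB per file, above the suggested range. The resulting count could be passed to the DataFrame's repartition(n) before writing the Parquet output.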
