I am building pipelines in Azure Data Factory using the Mapping Data Flow activity (Azure SQL DB to Synapse). The pipelines complete in debug mode when I enable data sampling on the sources. When I disable sampling and run the debug, the pipeline makes no progress, i.e. none of the transformations complete (yellow dot).
To improve this, should I increase the batch size on the source/sink (and how do I determine a good batch size), or increase the number of partitions (and how do I determine a good number of partitions)?
What size Spark compute cluster have you set in the Azure Integration Runtime under the data flow properties? Start there by creating an Azure IR with enough cores to provide RAM for your process. Then you can adjust the partitions and batch sizes. Much of the learning in this area is shared in this ADF data flow performance guide.
In Azure I have an Event Hub with a partition count of 5 and a Stream Analytics job which persists data from the hub to blob storage as-is, in JSON format. As a result, 5 files are created to store the incoming data.
Is it possible, without changing the hub partition count, to configure the Stream Analytics job so that it saves all the data to a single file?
For reference, the conditions that determine how output files are split are described here.
In your case, the condition that is met is:
If the query is fully partitioned, and a new file is created for each output partition
That's the trick here: if your query is a passthrough (no shuffling across partitions) from the Event Hub (partitioned) to the storage account (matching the incoming partitions by splitting files), then your job is always fully partitioned.
What you can do, if you don't care about performance, is break the partition alignment. For that, you can repartition your input or your query (via a snapshot aggregation).
In my opinion though, you should look into using another tool (ADF, Power BI Dataflow) to process these files downstream. You should see those files as landing files, optimized for throughput. If you remove the partition alignment from your job, you severely limit its ability to scale and absorb spikes in incoming traffic.
After experimenting with the partitioning suggested by this answer, I found out that my goal can be achieved by changing the Stream Analytics job configuration.
There are different compatibility levels for Stream Analytics jobs, and the latest one at the moment (1.2) introduced automatic parallel query execution for input sources with multiple partitions:
Previous levels: Azure Stream Analytics queries required the use of PARTITION BY clause to parallelize query processing across input source partitions.
1.2 level: If query logic can be parallelized across input source partitions, Azure Stream Analytics creates separate query instances and runs computations in parallel.
So when I changed the compatibility level of the job to 1.1, it started writing all the output to a single file in blob storage.
I am new to Azure Databricks. I have two input files and a Python AI model; I clean the input files and apply the AI model to them to get the final probabilities. Reading the files, loading the model, cleaning the data, preprocessing it, and displaying the output with probabilities takes me only a few minutes.
But when I try to write the result to a table or a Parquet file, it takes more than 4-5 hours. I have tried various combinations of repartition/partitionBy/saveAsTable, but none of them is fast enough.
My output Spark DataFrame consists of three columns and 120,000,000 rows. My shared cluster is a 9-node cluster with 56 GB of memory per node.
My doubts are:
1.) Is slow writing expected behavior in Azure Databricks?
2.) Is it true that we can't tune Spark configurations in Azure Databricks, and that it tunes itself based on the available memory?
Performance depends on multiple factors. To investigate further, could you please share the details below:
What is the size of the data?
What is the size of the worker type?
Could you share the code you are running?
I would suggest you go through the articles below, which help to improve performance:
Optimize performance with caching
7 Tips to Debug Apache Spark Code Faster with Databricks
Azure Databricks Performance Notes
I have used Azure Databricks and written data to Azure Storage, and it has been fast.
Also, Databricks is hosted on Azure just as it is on AWS, so all Spark configurations can be set.
As Pradeep asked, what are the data size and the number of partitions? You can get the partition count using df.rdd.getNumPartitions().
Have you tried a repartition before the write? Thanks.
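As a quick illustration, here is a minimal PySpark sketch of checking the partition count and repartitioning before the write; the table name, partition count, and output path are assumptions, and the spark.conf.set call simply shows that Spark settings can be changed on Databricks.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark settings can be adjusted on Databricks, e.g. shuffle parallelism.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Assumed source of the 120M-row result DataFrame.
result_df = spark.table("scored_results")

# How many partitions (and therefore write tasks/output files) will be used.
print(result_df.rdd.getNumPartitions())

# Repartition before writing so each task writes a reasonably sized file.
(result_df
 .repartition(200)                       # assumed target partition count
 .write
 .mode("overwrite")
 .parquet("/mnt/output/probabilities"))  # or .saveAsTable("probabilities")
```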
I'm going to set up monitoring of a Spark application via $SPARK_HOME/conf/metrics.properties.
I have decided to use Graphite.
Is there any way to estimate the database size of Graphite, especially for monitoring a Spark application?
Regardless of what you are monitoring, Graphite has its own configuration for retention and rollup of metrics. It stores one file (called a whisper file) per metric, and you can use this calculator to estimate how much disk space it will take: https://m30m.github.io/whisper-calculator/
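If you prefer to do the arithmetic yourself, the estimate behind that calculator is simple: a whisper file has a small fixed header plus 12 bytes per stored data point, for every archive in the retention schema. A rough sketch (the retention schema and metric count below are assumptions):

```python
# Rough whisper disk-usage estimate: 16-byte file header, 12 bytes of
# archive header per archive, and 12 bytes per stored data point.
def whisper_file_size(archives):
    """archives: list of (seconds_per_point, retention_seconds) tuples."""
    header = 16 + 12 * len(archives)
    points = sum(retention // seconds_per_point
                 for seconds_per_point, retention in archives)
    return header + 12 * points

# Assumed retention schema: 10s for 1 day, 1 min for 30 days, 10 min for 1 year.
schema = [(10, 86400), (60, 30 * 86400), (600, 365 * 86400)]
per_metric = whisper_file_size(schema)

num_metrics = 200  # assumed number of metric series the Spark app emits
print(f"{per_metric} bytes per metric, ~{per_metric * num_metrics / 1e6:.1f} MB total")
```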
One of the selling points of Hadoop is that the data sits with the compute. How does that work with WASB?
When a MapReduce job is processed, the map and reduce tasks are executed on the nodes where the blocks of data reside. This is how data locality is achieved.
But in the case of HDInsight, the data is stored in WASB. So when a MapReduce job is executed, is the data copied from WASB to each of the compute nodes before processing proceeds? If so, wouldn't the single channel for copying data to the compute nodes become a bottleneck?
Can anyone explain to me how data is stored in WASB and how it is handled during processing?
Just like with any Hadoop system, the data is loaded into memory on the individual nodes at compute time (when the job runs). The difference with WASB is that the data is loaded from Azure storage accounts instead of from local disks. Given the way Azure data center backbones are built, the performance is generally the same as with disks locally attached to the VMs.
HDInsight clusters are located in any of Azure's regions. The storage accounts that clusters can read from can only be from the same region to avoid high latency. Azure has done a lot of work on its data centers so that performance is comparable.
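To make this concrete, on the job side the only visible difference is the filesystem URI: a WASB path is addressed just like an HDFS path, and each task still reads its own split in parallel. A minimal PySpark sketch (the storage account, container, and path are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# wasb://<container>@<storage-account>.blob.core.windows.net/<path>  (assumed names)
# The Hadoop WASB driver streams the blocks from the Azure storage account
# instead of from local disks; the splits are still read in parallel by tasks.
df = spark.read.text("wasb://data@mystorageacct.blob.core.windows.net/logs/2019/")
print(df.count())
```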
If you want to learn more, Ashish's quote comes from this article:
https://blogs.msdn.microsoft.com/cindygross/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure/
I am developing a data pipeline that will consume data from Kafka, process it via Spark Streaming, and ingest it into Cassandra.
The pipeline I put into production will definitely evolve after several months. How can I move from the old data pipeline to the new one while maintaining continuous delivery and avoiding any data loss?
Thank you
The exact solution will depend on the specific requirements of your application. In general, Kafka will serve as your buffer: messages going into Kafka are retained according to the topic's retention settings.
In Spark Streaming, you need to track the consumed offsets either automatically through checkpointing, or manually (we do the latter as that provides more recovery options).
Then you can stop the job, deploy a new version, and restart your pipeline from where you previously left off. In this model, messages are processed with at-least-once semantics and zero data loss.
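For reference, here is a minimal Structured Streaming sketch of the "automatic" variant: the consumed offsets live in the checkpoint location, so a redeployed job resumes from where the previous one stopped. The broker address, topic, keyspace, and table names are assumptions, and the DataStax spark-cassandra-connector is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()

# Read from Kafka; Spark tracks the consumed offsets in the checkpoint location.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
          .option("subscribe", "events")                     # assumed topic
          .load()
          .selectExpr("CAST(key AS STRING) AS key",
                      "CAST(value AS STRING) AS value"))

def write_to_cassandra(batch_df, batch_id):
    # Requires the DataStax spark-cassandra-connector on the classpath.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="pipeline", table="events")           # assumed keyspace/table
     .mode("append")
     .save())

query = (events.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/checkpoints/kafka-to-cassandra")
         .start())
query.awaitTermination()
```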