Comparison between sequential and parallel stream - multithreading

Should I keep using the parallel stream here, or should I go with a sequential stream? I have tested it in various environments: on my local machine it was considerably faster, but in the test environment it was slower. doc is a list of approximately 30K strings. Please suggest.
doc.parallelStream()
   .filter(transaction -> StringUtils.isNotBlank(transaction))
   .forEach(transaction -> sendTransactionsToTopic(transaction, isEncryptionEnabled));

Related

Parallel Data Download from Snowflake to Databricks

I have a big table in Snowflake (~10B records) which I want to download into Databricks using the Snowflake connector (spark.read.format("snowflake")). I am trying to fetch in parallel by dividing the table using a date column. To run the pieces concurrently, I am using Databricks' concurrent notebook mechanism. The split code looks something like this:
val notebooks = Seq(
NotebookData("my_snowflake_table", 6000,Map("start_date" -> "2022-05-01", "end_date" -> "2022-08-01")),
NotebookData("my_snowflake_table", 6000,Map("start_date" -> "2022-08-01", "end_date" -> "2022-11-01")),
NotebookData("my_snowflake_table", 6000,Map("start_date" -> "2022-11-01", "end_date" -> "2022-12-01")))
// Run the notebooks in parallel
val res = parallelNotebooks(notebooks)
Await.result(res, 7200 seconds) // this is a blocking call.
res.value
I was expecting horizontal scalability (by means of an autoscaling cluster) with this, however that doesn't seem to be the case. The time taken per split is much higher than when a split is run alone: each split takes around 15 minutes on its own, but run together they don't complete even in an hour.
Is it because all the splits are running on the same driver, so the bandwidth of the driver node becomes the bottleneck and has a compounding effect on the overall performance?
Even more importantly, does this design make sense for how Databricks and Snowflake work together? What would be a faster way to download data from Snowflake into Databricks?
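For context, the read inside each notebook would presumably look something like the sketch below. This is only an illustration, not code from the question: the connection options, the date column name (txn_date) and the Delta landing path are all placeholders, and a SparkSession named spark is assumed (as in a Databricks notebook).

// Sketch of one split's read; option values, date column and output path are assumptions.
val sfOptions = Map(
  "sfURL" -> "<account>.snowflakecomputing.com",
  "sfUser" -> "<user>", "sfPassword" -> "<password>",
  "sfDatabase" -> "<db>", "sfSchema" -> "<schema>", "sfWarehouse" -> "<warehouse>"
)

val startDate = dbutils.widgets.get("start_date")   // e.g. "2022-05-01"
val endDate   = dbutils.widgets.get("end_date")     // e.g. "2022-08-01"

val df = spark.read
  .format("snowflake")
  .options(sfOptions)
  .option("query",
    s"SELECT * FROM my_snowflake_table WHERE txn_date >= '$startDate' AND txn_date < '$endDate'")
  .load()

// Land the split in Databricks storage.
df.write.mode("append").format("delta").save("/mnt/landing/my_snowflake_table")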

How to reduce white space in the task stream?

I have obtained the task stream plot for a Dask distributed computation with different numbers of workers. I observe that as the number of workers increases (from 16 to 32 to 64), the white space in the task stream also increases, which reduces the efficiency of the parallel computation. Even when I increase the workload per worker (that is, more computation per worker), I see the same trend. Can anyone suggest how to reduce the white space?
PS: I need to extend the computation to 1000s of workers, so reducing the number of workers is not an option for me.
[Task stream plots for 16, 32, and 64 workers]
As you mention, white space in the task stream plot means that there is some inefficiency causing workers to not be active all the time.
This can happen for many reasons. I'll list a few below:
* Very short tasks (sub-millisecond)
* Algorithms that are not very parallelizable
* Objects in the task graph that are expensive to serialize
* ...
Looking at your images I don't think that any of these apply to you.
Instead, I see that there are gaps of inactivity followed by bursts of activity. My guess is that this is caused by some code that you are running locally, and that your code looks something like the following:
for i in ...:
    results = dask.compute(...)  # do some dask work
    next_inputs = ...            # do some local work
So you're being blocked by doing some local work. This might be Dask's fault (maybe it takes a long time to build and serialize your graph) or maybe it's the fault of your code (maybe building the inputs for the next computation takes some time).
I recommend profiling your local computations to see what is going on. See https://docs.dask.org/en/latest/phases-of-computation.html

Spark and 100,000 sequential HTTP calls: driver vs workers

I have to make 100,000 sequential HTTP requests with Spark and store the responses in S3. I say sequential because each request returns around 50 KB of data and I have to keep a 1-second delay between requests in order to not exceed API rate limits.
Where should I make the HTTP calls: from the Spark job's code (executed on the driver/master node) or from a dataset transformation (executed on the worker nodes)?
Workarounds
1. Make the HTTP requests from my Spark job (on the driver/master node), create a dataset from each HTTP response (each contains 5,000 JSON items) and save each dataset to S3 with the help of Spark. A dataset doesn't need to be kept once it has been saved.
2. Create a dataset from all 100,000 URLs (moving all further computation to the workers), make the HTTP requests inside map or mapPartitions, and save a single dataset to S3.
The first option
It's simpler and it reflects the nature of my computation - it is sequential because of the 1-second delay. But:
Is it bad to make 100,000 HTTP calls from the driver/master node?
Is it more efficient to create/save one dataset of 100,000 * 5,000 items than to create/save 100,000 small datasets of 5,000 items each?
Each time I create a dataset from an HTTP response, the response will be moved to a worker and then saved to S3, right? That's double shuffling, then...
The second option
Actually it won't benefit from parallel processing, since I have to keep a 1-second interval between requests. The only bonus is moving the computation (even if it isn't too heavy) off the driver. But:
Is it worth moving the computation to the workers?
Is it a good idea to make API calls inside a transformation?
Saving a file <32 MB (or whatever fs.s3a.block.size is) to S3 is ~2 GETs, 1 LIST and a PUT; you get billed a bit by AWS for each of these calls, plus storage costs.
For larger files, there is a POST to initiate the multipart upload after that first block, one POST per 32 MB block, and a final POST of a JSON file to complete the upload. So: slightly more efficient per byte.
Where small S3 object sizes really matter is in the bills from AWS and in follow-up Spark queries: for anything you use in Spark, PySpark, SQL etc., many small files are slower. There's a high cost in listing files in S3, and every task pushed out to a Spark worker has some setup/commit/complete cost.
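To put rough numbers on that, using the figures from the question: 100,000 responses at ~50 KB each is only about 5 GB in total, so writing one object per response means 100,000 tiny files, whereas batching roughly 1,000 responses per output object gives on the order of a hundred ~50 MB files.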
Regarding doing HTTP API calls inside a worker: well, you can do fun things there. If the result isn't replicable then task failures and retries can give bad answers, but for a GET it should be OK. What is hard is throttling the work; I'll leave you to come up with a strategy there.
Here's an example of uploading files to S3 or another object store from workers: first an RDD of the copy src/dest operations is built up, then they are pushed out to the workers. The result of the worker code includes upload duration info, in case anyone ever wanted to try to aggregate the stats (though for that you'd probably need a timestamp for a time-series view).
Given you have to serialize the work to one request per second, 100K requests are going to take over a day. If each request takes <1 second, you may as well run it on a single machine. What's important is to save the work incrementally so that if your job fails partway through you can restart from the last checkpoint. I'd personally focus on that problem: how could you do this operation such that every 15-20 minutes of work is saved, and on a restart you can carry on from there?
Spark does not handle recovery of a failed job, only task failures. Lose the driver and you get to restart your last query. Break things up.
Something which comes to mind could be:
* a first RDD takes the list of queries and some summary info about any existing checkpointed data, and calculates the next 15 minutes of work,
* build up a list of GET calls to delegate to one or more workers, either one URL per row or multiple URLs in a single row (see the sketch after this list),
* run that job and save the results,
* test that recovery works, with a smaller window and by killing things,
* once happy: do the full run.
Maybe also: recognise and react to any throttle events coming off the far end by
1. sleeping in the worker, and
2. returning a count of throttle events in the results, so that the driver can collect aggregate stats and maybe later tune the sleep window for subsequent tasks.
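To make the shape of that concrete, here is a rough sketch of one such window of work. This is my illustration rather than the answer's code: the nextWindowOfUrls helper, the S3 path, and detecting throttling via HTTP 429 are all assumptions, and a SparkSession named spark is assumed.

import java.net.{HttpURLConnection, URL}
import java.time.Instant
import scala.io.Source

case class FetchResult(url: String, body: String, throttled: Long, fetchedAt: String)

// Hypothetical helper: the next ~900 URLs (15 minutes of work at 1 request/second),
// chosen by comparing the full URL list against what is already checkpointed in S3.
def nextWindowOfUrls(): Seq[String] = Seq("https://api.example.com/page/1" /* ... */)

def httpGet(url: String): (String, Boolean) = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  if (conn.getResponseCode == 429) ("", true)               // throttled: no body, report it
  else (Source.fromInputStream(conn.getInputStream).mkString, false)
}

val results = spark.sparkContext
  .parallelize(nextWindowOfUrls(), numSlices = 1)           // one partition => strictly sequential calls
  .map { url =>
    val (body, throttled) = httpGet(url)
    Thread.sleep(1000)                                      // stay under the 1 request/second limit
    FetchResult(url, body, if (throttled) 1L else 0L, Instant.now.toString)
  }

import spark.implicits._
val df = results.toDF()
// Incremental checkpoint: each window appends its results, so a restart only redoes this window.
df.write.mode("append").json("s3a://my-bucket/http-results/")

// Throttle stats the driver can aggregate to tune the sleep window for later runs.
val throttleEvents = df.filter($"throttled" === 1L).count()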

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream, and they need to be computed over windows of varying durations. For example, I might need to compute the average value of stat 'A' over the last 5 minutes while at the same time computing the median of stat 'B' over the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to a different value as required, e.g.:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me (i) seems like a more efficient approach. However, I have a couple of doubts regarding that:
How would streamA and streamB be represented in the underlying data structure?
Would they share data, since they originate from the same KafkaDStream, or would there be duplication of data?
Also, are there more efficient ways to handle such a use case?
Thanks in advance
Your (i) streams look sensible, will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
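For reference, the "pair of heaps" the answer alludes to is the standard running-median structure; a minimal sketch (plain Scala, not tied to Spark) could look like this:

import scala.collection.mutable.PriorityQueue

// lower is a max-heap holding the smaller half of the values,
// upper is a min-heap holding the larger half; their sizes differ by at most one.
class RunningMedian {
  private val lower = PriorityQueue.empty[Double]                            // max-heap (default ordering)
  private val upper = PriorityQueue.empty[Double](Ordering[Double].reverse)  // min-heap

  def add(x: Double): Unit = {
    if (lower.isEmpty || x <= lower.head) lower.enqueue(x) else upper.enqueue(x)
    // Rebalance so the sizes stay within one of each other.
    if (lower.size > upper.size + 1) upper.enqueue(lower.dequeue())
    else if (upper.size > lower.size + 1) lower.enqueue(upper.dequeue())
  }

  def median: Double =
    if (lower.size == upper.size) (lower.head + upper.head) / 2.0
    else if (lower.size > upper.size) lower.head
    else upper.head
}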
One thing you haven't made clear, though, is whether you really need the update component of your aggregation that is implied by the windowing operation. Your streamA maintains the last 5 minutes of data, updated every minute, and streamB maintains the last hour, updated every 10 minutes.
If you don't need that freshness, not requiring it will of course minimize the amount of data in the system. You can have a streamA with a batch interval of 5 minutes and a streamB which is derived from it (with window(Hours(1)), since 60 is a multiple of 5).
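As a minimal sketch of that layout (using a socket source as a stand-in for the Kafka stream, since the Kafka wiring isn't shown in the question; the printed counts are just placeholders for the real stats):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("multi-window-stats").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Minutes(5))        // batch interval = 5 minutes

// Stand-in source; in the real job this would be the Kafka DStream.
val source = ssc.socketTextStream("localhost", 9999)

val streamA = source                                     // stat A: each 5-minute batch as-is
val streamB = streamA.window(Minutes(60))                // stat B: the last hour, recomputed every 5 minutes

streamA.foreachRDD(rdd => println(s"A: ${rdd.count()} records in the last 5 minutes"))
streamB.foreachRDD(rdd => println(s"B: ${rdd.count()} records in the last hour"))

ssc.start()
ssc.awaitTermination()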

Parallel.ForEach in C# when number of iterations is unknown

I have TPL (Task Parallel Library) code for executing a loop in parallel in C#, in a class library project using .NET 4.0. I am new to TPL in C# and have the following questions.
CODE Background:
In the code that appears just after the questions, I am getting all unprocessed batches and then processing each batch one at a time. Each batch can be processed independently since there are no dependencies between batches, but for each batch the sequence of steps is very important when processing it.
My questions are:
Will using Parallel.ForEach be advisable in this scenario, where the number of batches and therefore the number of iterations could be very small or very large, like 10,000 batches? I am afraid that with too many batches, using parallelism might cause more harm than good in this case.
When using Parallel.ForEach, is the sequence of steps in the ProcessBatch method guaranteed to execute in order: step 1, step 2, step 3 and then step 4?
public void ProcessBatches() {
    List<Batch> batches = ABC.Data.GetUnprocessesBatches();
    Parallel.ForEach(batches, batch => {
        ProcessBatch(batch);
    });
}

public void ProcessBatch(Batch batch) {
    // step 1
    ABC.Data.UpdateHistory(batch);
    // step 2
    ABC.Data.AssignNewRegions(batch);
    // step 3
    UpdateStatus(batch);
    // step 4
    RemoveBatchFromQueue(batch);
}
UPDATE 1:
From the accepted answer, the number of iterations is not an issue even when it's large. In fact, according to the article Potential Pitfalls in Data and Task Parallelism, performance improvements from parallelism are likely when there are many iterations, while for fewer iterations a parallel loop is not going to provide any benefit over a sequential/synchronous loop.
So it seems having a large number of iterations in the loop is the best situation for using Parallel.ForEach.
The basic rule of thumb is that parallel loops that have few iterations and fast user delegates are unlikely to speedup much.
Parallel.ForEach will use the appropriate number of threads for the hardware you are running on, so you don't need to worry about too many batches causing harm.
The steps will run in order for each batch. ProcessBatch will get called on different threads for different batches, but for each batch the steps will be executed in the order they are defined in that method.

Resources