I am interested in Dask's local threaded scheduler. This scheduler can load data blocks from a multidimensional array in "parallel" using several threads. My focus is on I/O-bound problems, so I am not considering compute-intensive applications for the moment.
This seems to be confirmed by some speed tests I did on loading and saving data from a random array using Dask's store method: as the block size increases, performance decreases (presumably because smaller chunks allow more parallelism). In this experiment I am working with HDF5 files that have no physical chunks: a single dataset containing all the data from the array.
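For concreteness, the experiment looks roughly like this (a minimal sketch; the file name, array shape, and chunk size are made up, and passing scheduler="threads" explicitly assumes a reasonably recent Dask version):

import h5py
import dask.array as da

# A random Dask array whose block (chunk) size is the variable in the experiment.
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# Store it into a single, unchunked HDF5 dataset using the local threaded scheduler.
with h5py.File("data.h5", "w") as f:
    dset = f.create_dataset("x", shape=x.shape, dtype=x.dtype)  # no HDF5 chunking
    da.store(x, dset, scheduler="threads")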
The problem I have is two fold:
1) How can Dask have parallelism in reading data when reading on HDD is sequential?
2) How can Dask have parallelism in reading when the python GIL should prevent the threads from saving data in memory at the same time?
Thank you for your time.
How can Dask have parallelism in reading data when reading on HDD is sequential?
You're correct that if you are bound by reading from the hard disk then using multiple threads should not have any performance benefit.
However, it may be that there is work to do here other than reading from the hard disk.
Your data may be compressed in HDF format, requiring some CPU work to decompress it
You may be doing something with your data other than just reading it, and these operations can be interleaved with IO tasks
How can Dask have parallelism in reading when the python GIL should prevent the threads from saving data in memory at the same time?
Python's GIL isn't that much of a problem for numeric workloads, which do most of their computation in linked C/Fortran libraries (these typically release the GIL while they work). In general, if you are using NumPy-like libraries on numeric data, the GIL is unlikely to affect you.
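As a quick way to see whether threads are actually helping for a given workload, you can run the same task graph with the threaded and the single-threaded scheduler and compare (a sketch only; the file and dataset names are placeholders):

import h5py
import dask.array as da

with h5py.File("data.h5", "r") as f:
    x = da.from_array(f["x"], chunks=(1000, 1000))

    # Same task graph, two schedulers: if the per-chunk work (HDF5 decompression,
    # NumPy reductions) releases the GIL, the threaded run should be faster.
    x.mean().compute(scheduler="threads")
    x.mean().compute(scheduler="synchronous")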
Related
I want to perform a computation that can be written as a simple Python UDF. This particular computation consumes much more memory while generating intermediate results than is needed to store the inputs and outputs combined.
Here's the rough structure of the computational task:
import pyspark.sql.functions as fx

@fx.udf("double")
def my_computation(reference):
    # Expand the small reference into a large intermediate object,
    # then summarize it back down to a small scalar result.
    large_object = load_large_object(reference)
    result = summarize_large_object(large_object)
    return result

df_input = spark.read.parquet("list_of_references.parquet")
df_result = df_input.withColumn("result", my_computation(fx.col("reference")))
pdf_result = df_result.toPandas()
The idea is that load_large_object takes a small input (reference) and generates a much larger object, whereas summarize_large_object takes that large object and summarizes it down to a much smaller result. So, while the inputs and outputs (reference and result) can be quite small, the intermediate value large_object requires much more memory.
I haven't found any way to reliably run computations like this without severely limiting the amount of parallelism that I can achieve on upstream and downstream computations in the same Spark session.
Naively running a computation like the one above (without any changes to the default Spark configuration) often leads to worker nodes running out of memory. For example, if large_object consumes 2 GB of memory and the relevant Spark stage is running as 8 parallel tasks on 8 cores of a machine with less than 16 GB of RAM, the worker node will run out of memory (or at least start swapping to disk and slow down significantly).
IMO, the best solution would be to temporarily limit the number of parallel tasks that can run simultaneously. The closest configuration parameter that I'm aware of is spark.task.cpus, but this affects all upstream and downstream computations within the same Spark session.
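For reference, spark.task.cpus is an application-wide setting, e.g. something like the following (a sketch; the app name and the value 4 are arbitrary):

from pyspark.sql import SparkSession

# Every task in this session reserves 4 cores, which indirectly limits how many
# memory-hungry UDF tasks run at once, but also slows stages that don't need it.
spark = (
    SparkSession.builder
    .appName("memory-heavy-udf")
    .config("spark.task.cpus", "4")
    .getOrCreate()
)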
Ideally, there would be some way to provide a hint to Spark that effectively says "for this step, make sure to allocate X amount of extra memory per task" (and then Spark wouldn't schedule such a task on any worker node that isn't expected to have that amount of extra memory available). Upstream and downstream jobs/stages could remain unaffected by the constraints imposed by this sort of hint.
Does such a mechanism exist?
Why is Spark faster than Hadoop MapReduce?
As per my understanding, if Spark is faster because of in-memory processing, then Hadoop also loads data into RAM before processing it. Every program is first loaded into RAM before it executes. So how can we say that Spark does in-memory processing, and why don't other big data technologies do the same? Could you please explain?
Spark was created out of all the lessons learned from MapReduce. It's not just a second generation of MapReduce; it was redesigned using similar concepts while really learning from what was missing or done poorly in MapReduce.
MapReduce partitions the data, reads it, does a map, writes to disk, and sends it to a reducer, which writes it to disk, then reads it, reduces it, then writes to disk again. Lots of writing and reading. If you want to do another operation, you start the whole cycle again.
Spark tries to keep data in memory while it does multiple maps/operations. It still transfers data, but only when it has to, and it uses smart logic to figure out how to optimize what you are asking it to do. Keeping data in memory is helpful, but it is not the only thing Spark does.
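As an illustration of that difference (a sketch, assuming an existing SparkContext sc and a placeholder input path), several transformations can be chained and an intermediate result reused from memory without intermediate writes to disk:

# Chain of transformations over one dataset; no intermediate disk writes between them.
rdd = sc.textFile("hdfs:///input/data.txt")
counts = (
    rdd.flatMap(lambda line: line.split())
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b)
)

counts.cache()                                         # keep the result in memory
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])  # reuse it across several actions
total = counts.count()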
We're facing performance issues running a big data task on a single machine. The task is, by design, memory- and compute-intensive, and runs an optimization algorithm (branch and bound) on huge data sets. A single EC2 c5.24xlarge machine (96 vCPU / 192 GiB) now takes over 2 days to complete the task. We have tried to optimize the code (multiple threads, more memory, parallelism, an optimized algorithm), but there seems to be a limit to how much we can achieve, and it doesn't sound like a scalable option as the data set is growing and we're adding more use cases.
We're thinking of splitting this into smaller tasks and having them executed by multiple workers in a Spark cluster. The output of the task will be a single gzipped JSON (2-20 MB in size), and by distributing it I want each worker to build smaller JSONs or RDD chunks that could later be merged on the driver side.
Is this doable? Is there a limit on how much data each worker can send back to the driver? Or is it better to store each worker's output in some storage (e.g. S3) and then merge on the driver side? What are the pros and cons of each approach?
I would suggest not collecting data from the executors to the driver and combining it there. That makes the driver a bottleneck, and it will run into frequent resource-related issues. The best option is to let the executors process the data and produce the JSONs. You can leave the output JSONs as they are, which will help if you need to reprocess them later. If you are worried about the size of these JSONs, or want to combine them to send to another process, you can use a file utility to combine them, or you can use dataframe.coalesce(1) to generate one output file.
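A minimal sketch of that approach (df_result and the S3 paths are placeholders): have the executors write the gzipped JSON parts directly to S3, and only coalesce if a single file is really required.

# Executors write gzipped JSON part files straight to S3; nothing goes through the driver.
df_result.write.mode("overwrite") \
    .option("compression", "gzip") \
    .json("s3://my-bucket/results/")

# Only if one output file is genuinely needed: coalesce(1) funnels the final
# write through a single task, so keep the data small when doing this.
df_result.coalesce(1).write.mode("overwrite") \
    .option("compression", "gzip") \
    .json("s3://my-bucket/results_single/")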
In terms of memory (RAM) efficiency, which is better?
What does Dask do to reduce/compress large data so that it can run with a small amount of RAM?
When running on a single machine with datasets smaller than RAM, pandas/numpy should serve you fine. Dask is a parallel computing library built around task scheduling, which basically means it can read datasets lazily, even on a single computer. For example, a folder of .csv files that together are too big (say 60 GB) to load into memory can be loaded with Dask, so you only materialize the data when you need it, by calling .compute() on the Dask dataframe.
Basically, start with using pandas - if your code starts throwing MemoryErrors, you can use dask instead.
Source:
http://dask.pydata.org/en/latest/why.html
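A minimal sketch of the lazy-loading pattern described above (the glob path and column name are made up):

import dask.dataframe as dd

# Lazily point at a folder of CSVs that would not fit in memory all at once.
df = dd.read_csv("data/*.csv")

# Build the computation lazily; only the small final result is materialized.
mean_value = df["value"].mean().compute()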
We have a requirement where a calculation must be done in near real time (within 100 ms at most) and involves a moderately complex computation that can be parallelized easily. One of the options we are considering, apart from Apache Hadoop YARN, is to use Spark in batch mode. I've read, however, that submitting batch jobs to Spark has a huge overhead. Is there a way we can reduce/eliminate this overhead?
Spark makes the best use of the available resources, i.e. memory and cores. Spark uses the concept of data locality.
If the data and the code that operates on it are together, computation tends to be fast. But if code and data are separated, one must move to the other. Typically it is faster to ship serialized code from place to place than a chunk of data, because code size is much smaller than data size.
If you are low on resources, scheduling and processing time will certainly shoot up. Spark builds its scheduling around this general principle of data locality.
Spark prefers to schedule all tasks at the best locality level, but this is not always possible.
Check https://spark.apache.org/docs/1.2.0/tuning.html#data-locality
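For what it's worth, the locality wait mentioned on that page can be tuned when latency matters (a sketch; the 1s value is illustrative, not a recommendation):

from pyspark.sql import SparkSession

# spark.locality.wait is how long the scheduler waits for a data-local slot
# before falling back to a less local one (the default is 3s).
spark = (
    SparkSession.builder
    .config("spark.locality.wait", "1s")
    .getOrCreate()
)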