Display bytes read in a Spark+Parquet program - apache-spark

I'm trying to optimise both some Spark queries and a Parquet schema, by taking advantage of things like partitions and pushdown. My understanding is that these techniques allow large portions of the Parquet files to be skipped.
Is there a way to display the number of bytes that was read by Spark versus the total size of the Parquet files? And additionally, the number of read operations? (I'm using S3, so I'd like to minimise the number of read operations due to the overhead of the S3 API calls.)

If you are using Apache Spark (and not EMR's private variant), the S3A connector collects a lot of statistics, including things like bytes discarded when closing connections, number of HEAD requests, throttled operations, etc.
But: they aren't really collected in Spark, and because a single instance of the filesystem class (and hence of its statistics) is shared per worker for each S3 bucket, even once you do work out how to collect them they tend to over-estimate the amount of effort. There are opportunities to improve things there, but it would take a lot of work. All you currently get are the per-thread bytes-read and bytes-written statistics, which can actually under-report bytes written if the HTTP requests that upload the data run in a background thread.
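Those per-thread bytes-read/bytes-written figures surface as Spark task metrics, so one way to aggregate them for a whole query is to register a SparkListener. A minimal Scala sketch, assuming you add it to your own SparkContext; the totals carry the same over/under-reporting caveats described above:

```scala
import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sums bytes read/written across all finished tasks of this application.
class IoTotalsListener extends SparkListener {
  val bytesRead = new LongAdder
  val bytesWritten = new LongAdder

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      bytesRead.add(m.inputMetrics.bytesRead)
      bytesWritten.add(m.outputMetrics.bytesWritten)
    }
  }
}

// Usage: register before running the query, read the totals afterwards.
// val listener = new IoTotalsListener
// spark.sparkContext.addSparkListener(listener)
// ... run the Parquet query ...
// println(s"read=${listener.bytesRead.sum()} written=${listener.bytesWritten.sum()}")
```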
You can set org.apache.hadoop.fs.s3a.S3AStorageStatistics to log at debug level, and then the logs of each Spark worker will track those operations as they happen, but it's very noisy. It's primarily useful when trying to debug things or doing low-level performance optimisation of something like the Parquet reader itself.
No idea about EMR, I'm afraid; not my code.

Related

Memory Management Pyspark

1.) I understand that "Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the number of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory, as compared to Hive, which repeatedly reads from and writes to disk. Is that correct?
There is one angle that you need to consider here. You may run into memory problems if the data is not properly distributed. That means you need to distribute your data evenly (if possible) across the tasks, so that you reduce shuffling as much as possible and let each task manage its own data. So if you need to perform a join and the data is distributed randomly, every task (and therefore executor) will have to:
See what data it has
Send data to other executors (and tasks) that need the same keys
Request from the others the data that it needs
All that data exchange may cause network bottlenecks if you have a large dataset, and it also makes every task hold its own data in memory plus whatever has been sent plus temporary objects. All of that can blow up memory.
So to prevent that situation you can:
Load the data already repartitioned. By that I mean, if you are loading from a DB, try Spark's stride (partitioned read) as defined here; refer to the partitionColumn, lowerBound and upperBound attributes. That way you will create a number of partitions on the dataframe that place the data on different tasks based on the criteria you need. If you are going to join two dataframes, try a similar approach on both so that their partitions are similar (if not identical), which will prevent shuffling over the network (see the sketch after this list)
When you define partitions, try to make those values as evenly distributed among tasks as possible
The size of each partition should fit in memory. There can be spill to disk, but that slows down performance
If you don't have a column that makes the data evenly distributed, try to create one with n different values, where n depends on the number of tasks you have
If you are reading from a CSV, that makes it harder to create partitions, but it's still possible. You can either split the CSV into multiple files and create multiple dataframes (performing a union after they are loaded), or you can read that big CSV and apply a repartition on the column you need. That creates shuffling as well, but it is done only once if you cache the already-repartitioned dataframe
Reading from Parquet you may have multiple files, but if they are not evenly distributed (because the previous process that generated them didn't do it well) you may end up with OOM errors. To prevent that situation, you can load and repartition the dataframe too
Another trick, valid for CSV, Parquet, ORC, etc., is to create a Hive table on top of the files and run a query from Spark with a DISTRIBUTE BY clause on the data, so that Hive redistributes it instead of Spark
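A minimal Scala sketch of the two loading patterns mentioned above (JDBC stride read, and CSV read followed by a one-off repartition plus cache); the connection URL, table, column names, S3 path and partition counts are all made-up placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("partitioned-load").getOrCreate()

// Partitioned ("stride") read: Spark issues numPartitions queries, each covering
// a slice of partitionColumn between lowerBound and upperBound.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales") // hypothetical connection
  .option("dbtable", "orders")                           // hypothetical table
  .option("partitionColumn", "order_id")                 // numeric, date or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "200")                        // roughly match your task count
  .load()

// For a big CSV there is no stride option: read it, repartition on the key you need,
// and cache so the shuffle is paid only once.
val events = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/big.csv")                        // hypothetical path
  .repartition(200, col("customer_id"))                  // hypothetical key column
  .cache()
```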
On your question about Hive and Spark, I think you are right up to a point. Depending on the execution engine that Hive uses in your case (map/reduce, Tez, Hive on Spark, LLAP) you can see different behaviours. With map/reduce, as the operations are mostly on disk, the chance of an OOM is much lower than on Spark. Actually, from a memory point of view, map/reduce is not that affected by a skewed data distribution. But (IMHO) your goal should always be to find the best data distribution for the Spark job you are running, and that will prevent the problem
Another consideration is whether you are testing in a dev environment that doesn't have the same data as the prod environment. I suppose the data distribution should be similar even though volumes may differ a lot (I am talking from experience ;)). In that case, the Spark tuning parameters you assign on the spark-submit command may need to be different in prod. So you need to invest some time in finding the best approach on dev and then fine-tune in prod
The huge majority of OOMs in Spark happen on the driver, not the executors. This is usually the result of running .collect or similar actions on a dataset that won't fit in the driver's memory.
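A short Scala sketch of that pattern, with a bounded alternative; bigDF and the output path are hypothetical:

```scala
// Risky: materialises every row in driver memory, which is where most OOMs come from.
val everything = bigDF.collect()

// Safer: keep the data distributed and only pull back a bounded amount,
// or write the full result out from the executors instead.
val sample = bigDF.limit(1000).collect()
bigDF.write.mode("overwrite").parquet("s3a://my-bucket/big-out/") // hypothetical path
```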
Spark does a lot of work under the hood to parallelise the work; when using the structured APIs (in contrast to RDDs) the chances of causing an OOM on an executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that impacts performance and triggers lots of garbage collection, so you do need to address it, but Spark should be able to handle low memory without an explicit exception.
Not really - as above, Spark should be able to recover from memory issues when using the structured APIs, although it may need intervention if you see garbage collection and a performance impact.

Can Spark/EMR read data from s3 multi-threaded

Due to some unfortunate sequences of events, we've ended up with a very fragmented dataset stored on s3. The table metadata is stored on Glue, and data is written with "bucketBy", and stored in parquet format. Thus discovery of the files is not an issue, and the number of spark partitions is equal to the number of buckets, which provides a good level of parallelism.
When we load this dataset on Spark/EMR we end up having each spark partition loading around ~8k files from s3.
Since we've stored the data in a columnar format and our use case only needs a couple of fields, we don't read all the data but only a small portion of what is stored.
Based on CPU utilisation on the worker nodes, I can see that each task (one per partition) is utilising only around 20% of its CPU, which I suspect is due to a single thread per task reading files from S3 sequentially, so lots of IO wait...
Is there a way to encourage Spark tasks on EMR to read data from S3 multi-threaded, so that we can read multiple files at the same time from S3 within a task? That way, we could use the 80% idle CPU to make things a bit faster.
There are two parts to reading S3 data with Spark dataframes:
Discovery (listing the objects on S3)
Reading the S3 objects, including decompressing, etc.
Discovery typically happens on the driver. Some managed Spark environments have optimizations that use cluster resources for faster discovery. This is not typically a problem unless you get beyond 100K objects. Discovery is slower if you have .option("mergeSchema", true), as each file has to be touched to discover its schema.
Reading S3 files is part of executing an action. The parallelism of reading is min(number of partitions, number of available cores). More partitions plus more available cores means faster I/O... in theory. In practice, S3 can be quite slow if you haven't accessed these files regularly enough for S3 to scale up their availability. Therefore, in practice, additional Spark parallelism has diminishing returns. Watch the total network read/write bandwidth per active core and tune your execution for the highest value.
You can discover the number of partitions with df.rdd.partitions.length.
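A quick Scala check of that min(partitions, cores) rule, assuming an existing SparkSession spark and DataFrame df:

```scala
// Number of input partitions Spark created for the dataframe.
val readPartitions = df.rdd.partitions.length

// Rough proxy for the cores the cluster makes available to this application.
val availableCores = spark.sparkContext.defaultParallelism

// Effective read parallelism per the rule above.
println(s"effective read parallelism ~= ${math.min(readPartitions, availableCores)}")
```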
There are additional things you can do if the S3 I/O throughput is low:
Make sure the data on S3 is dispersed when it comes to its prefix (see https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html).
Open an AWS support request and ask for the prefixes holding your data to be scaled up.
Experiment with different node types. We have found storage-optimized nodes to have better effective I/O.
Hope this helps.

Best way to return data back to driver from spark workers

We're facing performance issues running a big data task on a single machine. The task is by design memory and compute intensive and runs an optimization algorithm (branch and bound) on huge data sets. A single EC2 c5.24xlarge machine (96 vCPU/192 GiB) now takes over 2 days to complete the task. We have tried to optimize the code (multiple threads, more memory, parallelism, a better algorithm), but it looks like there's a limit to how much we can achieve, and it doesn't sound like a scalable option as the data set is growing and we're adding more use cases to it.
We are thinking of splitting this into smaller tasks and having them executed by multiple workers in a Spark cluster. The output of the task will be a single gzipped JSON (2-20 MB in size), and by distributing the work I want each worker to build smaller JSONs or RDD chunks which could later be merged on the driver side.
Is this doable? Is there a limit on how much data each worker can send back to the driver? Or is it better to store each worker's output somewhere (S3) and then merge on the driver side? What are the pros and cons of each approach?
I would suggest not collecting data from the executors to the driver and combining it there. It makes the driver a bottleneck and you will run into frequent resource-related issues. The best option is to let the executors process the data and produce the JSONs. You can leave the output JSONs as-is, which will help if you need to reprocess them again. If you are worried about the size of these JSONs or want to combine them to send to another process, you can use a file utility to combine them, or you can use dataframe.coalesce(1) to generate one output file.
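A minimal Scala sketch of that suggestion, assuming resultDF holds the per-worker results; the bucket and paths are made up:

```scala
// Each task writes its own gzipped JSON part file straight to S3;
// nothing is funnelled back through the driver.
resultDF.write
  .mode("overwrite")
  .option("compression", "gzip")
  .json("s3a://my-bucket/job-output/")

// If a single output file is really required, coalesce to one partition first.
// This pushes the final write through a single task, so keep the output small.
resultDF.coalesce(1).write
  .mode("overwrite")
  .option("compression", "gzip")
  .json("s3a://my-bucket/job-output-single/")
```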

Parquet write OutOfMemoryException on spark

I have about 8m rows of data with about 500 columns.
When I try to write it with Spark as a single file using coalesce(1), it fails with an OutOfMemoryException.
I know this is a lot of data on one executor, but as far as I understand the Parquet write process, it only holds the data for one row group in memory before flushing it to disk, and then continues with the next one.
My executor has 16 GB of memory and it cannot be increased any further. The data contains a lot of strings.
So what I am interested in are settings to tweak the process of writing big Parquet files for wide tables.
I know I can enable/disable the dictionary encoding and increase/decrease the block and page sizes.
But what would be a good configuration for my needs?
I don't think that Parquet really contributes to the failure here, and tweaking its configuration probably won't help.
coalesce(1) is a drastic operation that affects all upstream code. As a result, all processing is done on a single node, and by your own account your resources are already very limited.
You didn't provide any information about the rest of the pipeline, but if you want to stay with Spark, your best hope is replacing coalesce with repartition. If OOM occurs in one of the preceding operations it might help.
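A short Scala sketch of the difference, with a hypothetical dataframe and output paths:

```scala
// coalesce(1) collapses the upstream plan into a single task,
// so all preceding work also runs on one node.
df.coalesce(1).write.parquet("s3a://my-bucket/out-coalesce/")

// repartition(1) inserts a shuffle: upstream stages keep their parallelism,
// and only the final write happens in a single task.
df.repartition(1).write.parquet("s3a://my-bucket/out-repartition/")
```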

Kafka, Spark, large CSV file (4 GB)

I am developing an integration channel with Kafka and Spark, which will handle both batch and streaming processing.
For batch processing, I receive huge CSV files (4 GB).
I'm considering two solutions:
Send the whole file to the file system and send a message to Kafka with the file's address; the Spark job will read the file from the FS and process it.
Cut the file into unit messages before Kafka (with Apache NiFi) and send them, so the batch is treated as a stream in the Spark job.
What do you think is the best solution ?
Thanks
If you're writing code to place the file on the file system, you can use that same code to submit the Spark job to the job tracker. The job tracker becomes the task queue and processes your submitted files as Spark jobs.
This would be a more simplistic way of implementing #1 but it has drawbacks. The main drawback being that you have to tune resource allocation to make sure you don't under allocate for cases if your data set is extremely large. If you over allocate resources for the job, then your task queue potentially grows while tasks are waiting for resources. The advantage is that there aren't very many moving parts to maintain and troubleshoot.
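A minimal Scala sketch of what that variation of #1 could look like: a batch job launched once per incoming file, taking the announced file path as an argument. The object name, the processing step and the output location are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Batch job launched once per incoming CSV; args(0) is the file path
// that was announced (e.g. via the Kafka message or the submission code).
object CsvBatchJob {
  def main(args: Array[String]): Unit = {
    val path = args(0)
    val spark = SparkSession.builder().appName("csv-batch").getOrCreate()

    val df = spark.read.option("header", "true").csv(path)

    // Placeholder processing step: write the data back out in a columnar format.
    df.write.mode("overwrite").parquet(path + ".parquet")

    spark.stop()
  }
}
```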
Using nifi to cut a large file down and having spark handle the pieces as a stream would probably make it easier to utilize the cluster resources more effectively. If your cluster is servicing random jobs on top of this data ingestion, this might be the better way to go. The drawbacks here might be that you need to do extra work to process all parts of a single file in one transactional context, you may have to do a few extra things to make sure you aren't going to lose the data delivered by Kafka, etc.
If this is for a batch operation, maybe method 2 would be considered overkill. The setup seems pretty complex for reading a CSV file even if it is a potentially really large file. If you had a problem with the velocity of the CSV file, a number of ever-changing sources for the CSV, or a high error rate then NiFi would make a lot of sense.
It's hard to suggest the best solution. If it were me, I'd start with the variation of #1 to make it work first. Then you can make it work better by introducing more system parts, depending on how well your approach handles anomalies in the input file with an acceptable level of accuracy. You may find that your biggest problem is trying to identify errors in input files during a large-scale ingestion.
