Can Spark/EMR read data from S3 multi-threaded?

Due to an unfortunate sequence of events, we've ended up with a very fragmented dataset stored on S3. The table metadata is stored in Glue, and the data is written with "bucketBy" and stored in Parquet format. Discovery of the files is therefore not an issue, and the number of Spark partitions is equal to the number of buckets, which provides a good level of parallelism.
When we load this dataset on Spark/EMR, each Spark partition ends up loading around 8k files from S3.
Since we've stored the data in a columnar format and our use case only needs a couple of fields, we don't really read all the data, just a very small portion of what is stored.
Based on CPU utilization on the worker nodes, I can see that each task (running per partition) is using only around 20% of its CPU, which I suspect is due to a single thread per task reading files from S3 sequentially, so lots of I/O wait...
Is there a way to encourage Spark tasks on EMR to read data from S3 multi-threaded, so that we can read multiple files at the same time from S3 within a task? That way we could use the 80% idle CPU to make things a bit faster.

There are two parts to reading S3 data with Spark dataframes:
Discovery (listing the objects on S3)
Reading the S3 objects, including decompressing, etc.
Discovery typically happens on the driver. Some managed Spark environments have optimizations that use cluster resources for faster discovery. This is not typically a problem unless you get beyond 100K objects. Discovery is slower if you have .option("mergeSchema", true), as every file has to be touched to discover its schema.
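As a rough PySpark illustration of that cost (the s3://bucket/path/ prefix and the field names are placeholders): merging schemas touches every file, while supplying an explicit schema avoids the schema-inference work at planning time.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.getOrCreate()

    # Schema merging touches every file's footer at planning time - slow with many files.
    df_merged = spark.read.option("mergeSchema", "true").parquet("s3://bucket/path/")

    # Supplying an explicit schema avoids schema inference altogether.
    schema = StructType([
        StructField("id", StringType()),
        StructField("value", DoubleType()),
    ])
    df_fast = spark.read.schema(schema).parquet("s3://bucket/path/")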
Reading S3 files is part of executing an action. The parallelism of reading is min(number of partitions, number of available cores). More partitions + more available cores means faster I/O... in theory. In practice, S3 can be quite slow if you haven't accessed these files regularly enough for S3 to scale up their availability. Therefore, in practice, additional Spark parallelism has diminishing returns. Watch the total network read/write bandwidth per active core and tune your execution for the highest value.
You can discover the number of partitions with df.rdd.partitions.length.
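For a PySpark session, a small sketch of the same check alongside the core count that caps read parallelism (the table name is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical bucketed table registered in the Glue/Hive catalog.
    df = spark.table("my_db.my_bucketed_table")

    # PySpark equivalent of df.rdd.partitions.length
    num_partitions = df.rdd.getNumPartitions()

    # Roughly the total executor cores available on most cluster managers.
    total_cores = spark.sparkContext.defaultParallelism

    # Read parallelism is approximately min(number of partitions, available cores).
    print(f"partitions={num_partitions}, cores~={total_cores}, "
          f"effective read parallelism ~ {min(num_partitions, total_cores)}")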
There are additional things you can do if the S3 I/O throughput is low:
Make sure the data on S3 is dispersed when it comes to its prefix (see https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html).
Open an AWS support request and ask for the prefixes holding your data to be scaled up.
Experiment with different node types. We have found storage-optimized nodes to have better effective I/O.
Hope this helps.

Related

Memory Management Pyspark

1.) I understand that "Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the no. of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory, as compared to Hive, which repeatedly reads from and writes to disk. Is that correct?
There is one angle that you need to consider here. You may run into memory problems if the data is not properly distributed. That means you need to distribute your data as evenly as possible across tasks, so that you reduce shuffling as much as possible and let each task manage its own data. If the data is distributed randomly and you need to perform a join, every task (and therefore every executor) will have to:
See what data it has
Send data to the other executors (and tasks) that need the same keys
Request from the other executors the data that this task needs
All of that data exchange may cause network bottlenecks if you have a large dataset, and it also makes every task hold its own data in memory, plus whatever has been sent, plus temporary objects. All of that will blow up memory.
So to prevent that situation you can:
Load the data already repartitioned. By that I mean, if you are loading from a DB, use Spark's partitioned JDBC reads (see the partitionColumn, lowerBound, upperBound and numPartitions options; there is a sketch after this list). That way you create a number of partitions on the dataframe that place the data on different tasks based on the criteria you need. If you are going to join two dataframes, try a similar approach on both of them so that the partitions are similar (if not the same), which will prevent shuffling over the network.
When you define partitions, try to make the values as evenly distributed among tasks as possible
The size of each partition should fit in memory. Although data can spill to disk, that slows down performance
If you don't have a column that distributes the data evenly, try to create one with roughly n distinct values, where n depends on the number of tasks you have
If you are reading from a CSV, it is harder to create partitions, but it is still possible. You can either split the CSV into multiple files and create multiple dataframes (performing a union after they are loaded), or you can read the big CSV and apply a repartition on the column you need. That creates shuffling as well, but it is done only once if you cache the repartitioned dataframe
Reading from Parquet you may have multiple files, but if they are not evenly distributed (because the previous process that generated them didn't do it well) you may end up with OOM errors. To prevent that situation, you can load and then repartition the dataframe too
Another trick, valid for CSV, Parquet, ORC, etc., is to create a Hive table on top of the files and run a query from Spark with a distribute by clause on the data, so that Hive redistributes the data instead of Spark
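A minimal sketch of the partitioned JDBC read mentioned in the first item, plus a repartition-and-cache of a second dataframe on the join key. The connection URL, tables, column and bounds are placeholders, and the partition count of 200 is only an example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Partitioned JDBC read: Spark issues numPartitions parallel queries, each covering
    # a stride of partitionColumn between lowerBound and upperBound.
    # (partitionColumn must be numeric, date or timestamp, and the JDBC driver
    # must be on the classpath.)
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://db-host:5432/shop")
              .option("dbtable", "orders")
              .option("partitionColumn", "customer_id")
              .option("lowerBound", "1")
              .option("upperBound", "10000000")
              .option("numPartitions", "200")
              .load())

    # Second dataframe: repartition on the join key and cache it, so its shuffle
    # happens once rather than on every action that uses the join.
    customers = (spark.read.parquet("hdfs:///data/customers")
                 .repartition(200, "customer_id")
                 .cache())

    joined = orders.join(customers, "customer_id")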
To your question about Hive and Spark, I think you are right up to a point. Depending on the execution engine Hive uses in your case (map/reduce, Tez, Hive on Spark, LLAP) you can get different behaviours. With map/reduce, as the operations are mostly on disk, the chance of an OOM is much lower than on Spark. Actually, from a memory point of view, map/reduce is not that affected by skewed data distribution. But (IMHO) your goal should always be to find the best data distribution for the Spark job you are running, and that will prevent the problem
Another consideration is whether you are testing in a dev environment that doesn't have the same data as prod. I suppose the data distribution should be similar, although volumes may differ a lot (I am talking from experience ;)). In that case, the Spark tuning parameters you pass on the spark-submit command may need to be different in prod. So you need to invest some time in finding the best approach in dev and then fine-tune in prod
The huge majority of OOMs in Spark happen on the driver, not the executors. This is usually the result of running .collect or similar actions on a dataset that won't fit in the driver's memory.
Spark does a lot of work under the hood to parallelize the work; when using the structured APIs (in contrast to RDDs) the chances of causing an OOM on an executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that impacts performance and triggers lots of garbage collection, so you need to address it, but Spark should be able to handle low memory without an explicit exception.
Not really - as above, Spark should be able to recover from memory issues when using the structured APIs, although it may need intervention if you see heavy garbage collection and a performance impact.
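A small sketch of the driver-side OOM pattern described above, with a couple of hedged alternatives (the paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("hdfs:///data/events")  # placeholder path

    # Risky: pulls the whole dataset into the driver's memory and is the usual
    # cause of driver-side OOM.
    # rows = df.collect()

    # Safer alternatives:
    sample_rows = df.limit(1000).collect()          # bring back only a bounded sample

    df.write.mode("overwrite").parquet("hdfs:///data/events_out")  # keep results distributed

    for row in df.toLocalIterator():                # stream rows to the driver one partition at a time
        pass                                        # process each row without holding them all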

Display bytes read in a Spark+Parquet program

I'm trying to optimise both some Spark queries and a Parquet schema by taking advantage of things like partitioning and predicate pushdown. My understanding is that these techniques allow large portions of the Parquet files to be skipped.
Is there a way to display the number of bytes that was read by Spark versus the total size of the Parquet files? And additionally, the number of read operations? (I'm using S3, so I'd like to minimise the number of read operations due to the overhead of the S3 API calls.)
If you are using Apache Spark (and not EMR's private variant), the S3A connector collects a lot of stats, including things like bytes discarded when closing connections, number of HEAD requests, throttled operations, etc.
But: it's not really collected in Spark, and because a single instance of the filesystem class (and hence of the statistics) is used per worker for each S3 bucket, even once you work out how to collect them they tend to over-estimate the amount of effort. There are opportunities to improve things there, but it would take a lot of work. All you currently get are the per-thread bytes read and bytes written statistics, which can actually under-report bytes written if the HTTP requests that upload the data are done in a background thread.
You can enable org.apache.hadoop.fs.s3a.S3AStorageStatistics to log at debug, and then the logs of each Spark worker will track those operations as they happen, but it's very noisy. It's primarily useful when trying to debug things or doing low-level performance optimisation of something like the Parquet reader itself.
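For reference, a minimal sketch of what that might look like in a log4j 1.x-style log4j.properties shipped to the workers (newer Spark versions use log4j2 and need the equivalent log4j2 syntax); everything else is left at defaults:

    # Log every S3A storage statistics update as it happens (very noisy).
    log4j.logger.org.apache.hadoop.fs.s3a.S3AStorageStatistics=DEBUG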
No idea about EMR, I'm afraid - not my code.

Storing data from sensors into hdfs

I am working on a project that involves using HDFS for storage and Spark for computation.
I need to store data from sensors into HDFS in real time.
For example, I have a weather station where the sensor generates data (temperature, pressure) every 5 seconds. I would like to know how to store this data in HDFS in real time.
Writing a lot of small files directly to HDFS may have some undesirable effects, as it affects master node (NameNode) memory usage and may lead to lower processing speed compared with batch processing.
Each of your sensors will produce roughly 500k files monthly, so unless you have a very limited number of sensors, I would suggest you take a look at message brokers. Apache Kafka (https://kafka.apache.org/) is a well-known one and is already bundled in some Hadoop distros. You can use it to "stage" your data and process it in (mini-)batches, for example.
Finally, if you need to process incoming data in a real-time manner (CEP and so on), I would recommend paying attention to Spark Streaming (https://spark.apache.org/streaming/).
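A minimal sketch that combines the two suggestions, assuming sensors publish to a hypothetical Kafka topic named sensor-readings: Spark Structured Streaming reads from Kafka and writes micro-batches to HDFS as Parquet every few minutes, so HDFS receives a small number of reasonably sized files rather than one tiny file per reading.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    # Requires the spark-sql-kafka connector package on the classpath.
    spark = SparkSession.builder.appName("sensor-ingest").getOrCreate()

    # Stream raw sensor messages from Kafka (broker address and topic name are placeholders).
    readings = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092")
                .option("subscribe", "sensor-readings")
                .load()
                .select(col("key").cast("string"),
                        col("value").cast("string"),
                        col("timestamp")))

    # Write micro-batches to HDFS as Parquet every 5 minutes.
    query = (readings.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/sensors/raw")
             .option("checkpointLocation", "hdfs:///data/sensors/_checkpoints")
             .trigger(processingTime="5 minutes")
             .start())

    query.awaitTermination()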

Does size of part files play a role for Spark SQL performance

I am trying to query HDFS, which has a lot of part files (Avro). Recently we made a change to reduce parallelism, and thus the size of the part files has increased; each part file is in the range of 750 MB to 2 GB (we use Spark Streaming to write data to HDFS in 10-minute intervals, so the size of these files depends on the amount of data we are processing from upstream). The number of part files would be around 500. I was wondering if the size of these part files / the number of part files plays any role in Spark SQL performance?
I can provide more information if required.
HDFS, MapReduce and Spark prefer files that are larger in size, as opposed to many small files. S3 also has issues here. I am not sure if you mean HDFS or S3.
Repartitioning smaller files into a smaller number of larger files will - without getting into all the details - allow Spark or MR to process fewer but bigger blocks of data, thereby improving job speed by decreasing the number of map tasks needed to read them in, and reducing storage waste and NameNode contention.
All in all, this is the small files problem, on which there is much to read, e.g. https://www.infoworld.com/article/3004460/application-development/5-things-we-hate-about-spark.html. Just to be clear, I am a Spark fan.
Generally, fewer, larger files are better.
One issue is whether the file can be split, and how.
Files compressed with .gz cannot be split: you have to read them from start to finish, so at most one worker at a time gets assigned a single file (except near the end of a query, when speculation can trigger a second). Use a splittable compression like Snappy and all is well.
Very small files are inefficient, as startup/commit overhead dominates.
On HDFS, small files put load on the NameNode, so the ops team may be unhappy too.
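As an illustration of the repartitioning advice above, a hedged sketch that compacts a directory of small part files into a target number of larger files (the paths and the target count of 100 are placeholders, chosen so each output file lands in the few-hundred-MB range):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the existing small part files (placeholder path; Avro needs the spark-avro package).
    df = spark.read.format("avro").load("hdfs:///data/events/small-parts")

    # Rewrite into fewer, larger files; 100 output files is only an example.
    (df.repartition(100)
       .write
       .mode("overwrite")
       .format("avro")
       .save("hdfs:///data/events/compacted"))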

Faster reading from blob storage via spark

I currently have a Spark cluster set up with 4 worker nodes and 2 head nodes. I have a 1.5 GB CSV file in blob storage that I can access from one of the head nodes. I find that it takes quite a while to load this data and cache it using PySpark. Is there a way to load the data faster?
One thought I had was loading the data, then partitioning it into k (number of nodes) segments and saving them back to blob storage as Parquet files. That way I could load different parts of the data set in parallel and then union them... However, I am unsure whether all the data is just loaded on the head node and only distributed to the other machines when computation occurs. If the latter is true, then the partitioning would be useless.
Help would be much appreciated. Thank you.
Generally, you will want smaller file sizes on blob storage so that you can transfer data from blob storage to compute in parallel, giving faster transfer rates. A good rule of thumb is a file size between 64 MB and 256 MB; a good reference is Vida Ha's Data Storage Tips for Optimal Spark Performance.
Your idea of reading the file and then saving it back as Parquet (with the default snappy codec compression) is a good one. Parquet is natively supported by Spark and is often faster to query against. The only tweak would be to partition by target file size rather than by the number of worker nodes. The data is loaded onto the worker nodes, but partitioning is helpful because more tasks are created to read more files.
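A minimal sketch of that CSV-to-Parquet conversion under those assumptions; the container path and the choice of 12 output files (1.5 GB at roughly 128 MB each) are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Load the 1.5 GB CSV from blob storage (placeholder wasb:// path).
    csv_df = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("wasb://container@account.blob.core.windows.net/data/input.csv"))

    # Aim for roughly 128 MB per output file: 1.5 GB / 128 MB is about 12 partitions.
    (csv_df.repartition(12)
           .write
           .mode("overwrite")
           .parquet("wasb://container@account.blob.core.windows.net/data/input-parquet"))

    # Later reads hit the Parquet copy in parallel and only pull the needed columns.
    parquet_df = spark.read.parquet(
        "wasb://container@account.blob.core.windows.net/data/input-parquet").cache()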
