Optimized caching via spark - apache-spark

I am working on a solution to provide low latency results using spark. For this, I was planning to cache the data beforehand on which a user wants to query.
I am able to achieve good performance on the queries. One thing I noticed is that the data on cluster (parquet format) explodes when caching. I understand this is due to deserializing and decoding the data. I am just wondering if there is any other options to reduce the memory footprint.
I tried using
sqlContext.cacheTable("table_name") and also
tbl.persist(StorageLevel.MEMORY_AND_DISK_SER)
But nothing is helping reduce the memory footprint

Perhaps you want to try orc ? There have been improvements in orc support recently (more here: https://www.slideshare.net/Hadoop_Summit/orc-improvement-in-apache-spark-23-95295487). I am not an expert, but I heard that orc uses in memory columnar format... This format gives opportunities for doing things like compressing via techniques like run length encoding of repeated values -- which tends to lower memory footprint.

It also explodes when not caching.
cache has nothing do with reducing memory footprint. You do not state RDD or DF, but I presume latter. This RDD Memory footprint in spark gives an idea for RDDs and the improvements for DFs / DSs: https://spoddutur.github.io/spark-notes/deep_dive_into_storage_formats.html.
You cannot reuse the data for different users. What you could consider is Apache Ignite. See https://ignite.apache.org/use-cases/spark/shared-memory-layer.html

Related

Memory Management Pyspark

1.) I understand that "Spark's operators spills data to disk if it does not fit memory allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the no. of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory as compared to Hive, which repeatedly reads, writes into disk. Is that correct?
There is one angle that you need to consider there. You may get memory leaks if the data is not properly distributed. That means that you need to distribute your data evenly (if possible) on the Tasks so that you reduce shuffling as much as possible and make those Tasks to manage their own data. So if you need to perform a join, if data is distributed randomly, every Task (and therefore executor) will have to:
See what data they have
Send data to other executors (and tasks) to provide the same keys they need
Request the data that is needed by that task to the others
All that data exchange may cause network bottlenecks if you have a large dataset and also will make every Task to hold their data in memory plus whatever has been sent and temporary objects. All of those will blow up memory.
So to prevent that situation you can:
Load the data already repartitioned. By that I mean, if you are loading from a DB, try Spark stride as defined here. Please refer to the partitionColumn, lowerBound, upperBound attributes. That way you will create a number of partitions on the dataframe that will set the data on different tasks based on the criteria you need. If you are going to use a join of two dataframes, try similar approach on them so that partitions are similar (for not to say same) and that will prevent shuffling over network.
When you define partitions, try to make those values as evenly distributed among tasks as possible
The size of each partition should fit on memory. Although there could be spill to disk, that would slow down performance
If you don't have a column that make the data evenly distributed, try to create one that would have n number of different values, depending on the n number of tasks that you have
If you are reading from a csv, that would make it harder to create partitions, but still it's possible. You can either split the data (csv) on multiple files and create multiple dataframes (performing a union after they are loaded) or you can read that big csv and apply a repartition on the column you need. That will create shuffling as well, but it will be done once if you cache the dataframe already repartitioned
Reading from parquet it's possible that you may have multiple files but if they are not evenly distributed (because the previous process that generated didn't do it well) you may end up on OOM errors. To prevent that situation, you can load and apply repartition on the dataframe too
Or another trick valid for csv, parquet files, orc, etc. is to create a Hive table on top of that and run a query from Spark running a distribute by clause on the data, so that you can make Hive to redistribute, instead of Spark
To your question about Hive and Spark, I think you are right up to some point. Depending on the execute engine that Hive uses in your case (map/reduce, Tez, Hive on Spark, LLAP) you can have different behaviours. With map/reduce, as they are mostly disk operations, the chance to have a OOM is much lower than on Spark. Actually from Memory point of view, map/reduce is not that affected because of a skewed data distribution. But (IMHO) your goal should be to find always the best data distribution for the Spark job you are running and that will prevent that problem
Another consideration is if you are testing in a dev environment that doesn't have same data as in a prod environment. I suppose the data distribution should be similar although volumes may differ a lot (I am talking from experience ;)). In that case, when you assign Spark tuning parameters on the spark-submit command, they may be different in prod. So you need to invest some time on finding the best approach on dev and fine tune in prod
Huge majority of OOM in Spark are on the driver, not executors. This is usually a result of running .collect or similar actions on a dataset that won't fit in the driver memory.
Spark does a lot of work under the hood to parallelize the work, when using structured APIs (in contrast to RDDs) the chances of causing OOM on executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that will impact performance and cause lots of garbage collection to happen so you need to address it, however spark should be able to handle low memory without explicit exception.
Not really - as above, Spark should be able to recover from memory issues when using structured APIs, however it may need intervention if you see garbage collection and performance impact.

Impala vs Spark performance for ad hoc queries

I'm interested only in query performance reasons and architectural differences behind them. All answers I've seen before were outdated or hadn't provide me with enough context of WHY Impala is better for ad hoc queries.
From 3 considerations below only the 2nd point explain why Impala is faster on bigger datasets. Could you please contribute to the following statements?
Impala doesn't miss time for query pre-initialization, means impalad daemons are always running & ready. In other hand, Spark Job Server provide persistent context for the same purposes.
Impala is in-memory and can spill data on disk, with performance penalty, when data doesn't have enough RAM. The same is true for Spark. The main difference is that Spark is written on Scala and have JVM limitations, so workers bigger than 32 GB aren't recommended (because of GC). In turn, [wrong, see UPD] Impala is implemented on C++, and has high hardware requirements: 128-256+ GBs of RAM recommended. This is very significant, but should benefit Impala only on datasets that requires 32-64+ GBs of RAM.
Impala is integrated with Hadoop infrastructure. AFAIK the main reason to use Impala over another in-memory DWHs is the ability to run over Hadoop data formats without exporting data from Hadoop. Means Impala usually use the same storage/data/partitioning/bucketing as Spark can use, and do not achieve any extra benefit from data structure comparing to Spark. Am I right?
P.S. Is Impala faster than Spark in 2019? Have you seen any performance benchmarks?
UPD:
Questions update:
I. Why Impala recommends 128+ GBs RAM? What is an implementation language of each Impala's component? Docs say that "Impala daemons run on every node in the cluster, and each daemon is capable of acting as the query planner, the query coordinator, and a query execution engine.". If impalad is Java, than what parts are written on C++? Is there smth between impalad & columnar data? Are 256 GBs RAM required for impalad or some other component?
II. Impala loose all in-memory performance benefits when it comes to cluster shuffles (JOINs), right? Does Impala have any mechanics to boost JOIN performance compared to Spark?
III. Impala use Multi-Level Service Tree (smth like Dremel Engine see "Execution model" here) vs Spark's Directed Acyclic Graph. What does actually MLST vs DAG mean in terms of ad hoc query performance? Or it's a better fit for multi-user environment?
First off, I don't think comparison of a general purpose distributed computing framework and distributed DBMS (SQL engine) has much meaning. But if we would still like to compare a single query execution in single-user mode (?!), then the biggest difference IMO would be what you've already mentioned -- Impala query coordinators have everything (table metadata from Hive MetaStore + block locations from NameNode) cached in memory, while Spark will need time to extract this data in order to perform query planning.
Second biggie would probably be shuffle implementation, with Spark writing temp files to disk at stage boundaries against Impala trying to keep everything in-memory. Leading to a radical difference in resilience - while Spark can recover from losing an executor and move on by recomputing missing blocks, Impala will fail the entire query after a single impalad daemon crash.
Less significant performance-wise (since it typically takes much less time compared to everything else) but architecturally important is work distribution mechanism -- compiled whole stage codegens sent to the workers in Spark vs. declarative query fragments communicated to daemons in Impala.
As far as specific query optimization techniques (query vectorization, dynamic partition pruning, cost-based optimization) -- they could be on par today or will be in the near future.

Spark dataset exceeds total ram size

I am recently working in spark and came across few queries which I still couldn't resolve.
Let's say i have a dataset of 100GB and my ram size of the cluster is
16 GB.
Now, I know in case of simply reading the file and saving it in the HDFS will work as Spark will do it for each partition. What will happen when I perform sorting or aggregation transformation on 100GB data? How will it process 100GB in memory since we need entire data in case of sorting?
I have gone through below link but this only tells us what spark do in case of persisting, what I am looking is Spark aggregations or sorting on dataset greater than ram size.
Spark RDD - is partition(s) always in RAM?
Any help is appreciated.
There are 2 things you might want to know.
Once Spark reaches the memory limit, it will start spilling data to
disk. Please check this Spark faq and also there are severals
question from SO talking about the same, for example, this one.
There is an algorihtm called external sort that allows you to sort datasets which do not fit in memory. Essentially, you divide the large dataset by chunks which actually fit in memory, sort each chunk and write each chunk to disk. Finally, merge every sorted chunk in order to get the whole dataset sorted. Spark supports external sorting as you can see here and here is the implementation.
Answering your question, you do not really need that your data fit in memory in order to sort it, as I explained to you before. Now, I would encourage you to think about an algorithm for data aggregation dividing the data by chunks, just like external sort does.
There are multiple things you need to consider. Because you have 16RAM and 100GB data set, it will be good idea to keep persistence in DISK. It maybe difficult as when aggregating if data set has high cardinality. If the cardinality is low you will be better of to do aggregate at each RDD before merging into whole dataset. Also remember to make sure that each partition in RDD is less than memory (default value 0.4*container_size)

Pypark read CSV with 60516 columns taking longer time

The CSV file size is 130 MB but just reading and caching the file takes more than 5 minutes. I have set Inferschema as False and also it is taking much time. I tried with increasing cores, nodes , memory but no use. Any suggestions, please ?
Unfortunately, this is somewhat expected behavior or rather known weak side of Apache Spark. Structured API (Spark SQL / Dataset) scales poorly (depending on the context and version complexity might grow even exponentially) in terms of number of fields used for the query. Luckily this is constant overhead (doesn't depend on the number of rows).
If you work with very wide data, and require low latency, it might be wise to skip Spark SQL, and go back to RDD API.

What happens if the data can't fit in memory with cache() in Spark?

I am new to Spark. I have read at multiple places that using cache() on a RDD will cause it to be stored in memory but I haven't so far found clear guidelines or rules of thumb on "How to determine the max size of data" that one could cram into memory? What happens if the amount of data that I am calling "cache" on, exceeds the memory ? Will it cause my job to fail or will it still complete with a noticeable impact on Cluster performance?
Thanks!
As it is clearly stated in the official documentation with MEMORY_ONLY persistence (equivalent to cache):
If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.
Even if data fits into memory it can be evicted if new data comes in. In practice caching is more a hint than a contract. You cannot depend on caching take place but you don't have to if it succeeds either.
Note:
Please keep in mind that the default StorageLevel for Dataset is MEMORY_AND_DISK, which will:
If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
See also:
(Why) do we need to call cache or persist on a RDD
Why do I have to explicitly tell Spark what to cache?

Resources