Spark Out of memory exception in executors - apache-spark

My job reads 10 Hive tables in Spark and joins them on some keys into a single Dataset:
Dataset = join of all 10 tables
It then applies some business logic on top of that Dataset to produce an output Dataset:
Dataset = business logic applied to the joined Dataset
and stores that output Dataset in another Hive table. This works end to end.
We then split the job in two: the first job reads the 10 Hive tables, applies the joins, and stores the intermediate Dataset in a Hive table.
The second job reads that one Hive table, applies the business logic, and stores the result in the final Hive table. This second job fails with an out-of-memory exception in the executors (exit code 143 in YARN).
The Spark configuration is the same for both processes.
Would splitting the work this way make a difference to Spark's memory usage?
I tried increasing executor memory, but it did not help.

Try increasing both spark.driver.memory and spark.executor.memory.
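As a minimal sketch (not from the original post; the 8g value and the app name are placeholders, not recommendations): spark.executor.memory can be set when the session is built, but spark.driver.memory usually has to be set before the driver JVM starts, e.g. via spark-submit --driver-memory or spark-defaults.conf, otherwise it is ignored.
import org.apache.spark.sql.SparkSession

// Executor memory can be set programmatically before the session/context is created.
// Driver memory should be passed at submit time (e.g. spark-submit --driver-memory 8g),
// since the driver JVM is already running by the time this code executes.
val spark = SparkSession.builder()
  .appName("join-and-write")
  .config("spark.executor.memory", "8g")
  .enableHiveSupport()
  .getOrCreate()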

Related

How does spark saveAsTable work while reading and writing to hive table

I have the following code:
Dataset<Row> rows = sparkSession.sql("<select from hive tables with multiple joins>");
rows.write().saveAsTable("<another external table in hive, written immediately>");
1) In the above case, when saveAsTable() is invoked, will Spark load the whole dataset into memory?
1.1) If yes, how do we handle the scenario where the query returns a huge volume of data that cannot fit into memory?
2) If the server crashes while Spark is executing saveAsTable() to write data to the external Hive table, is there a possibility of partial data being written to the target Hive table?
2.1) If yes, how do we avoid incomplete/partial data being persisted into the target Hive table?
Yes, Spark will place the data in memory, but it processes it in parallel across executors. When writing, it can also use driver memory to buffer data before the write, so try increasing driver memory as well.
So there are a couple of options. If you have spare memory in the cluster, you can increase num-cores, num-executors, and executor-memory along with driver-memory, based on the data size.
If you cannot fit all the data in memory, break it up and process it in a loop programmatically.
Say the source data is partitioned by date and you have 10 days to process: process one day at a time, write it to a staging dataframe, create a date-based partition in the final table, and overwrite only that date's partition on each iteration of the loop.
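A minimal Scala sketch of that loop, assuming a SparkSession named spark, a source table partitioned by a dt column, and a final table with the same partitioning (all table names, column names, and dates below are invented for illustration):
import org.apache.spark.sql.SaveMode

// Overwrite only the partitions present in each write instead of truncating the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val days = Seq("2019-01-01", "2019-01-02") // ... one entry per day to process

for (day <- days) {
  // Read and transform a single day's slice so only that slice is held in memory at a time.
  val staged = spark.table("source_db.joined_table")
    .where(s"dt = '$day'")
    // ... apply the business logic here ...

  // INSERT OVERWRITE into the matching dt partition of the final table.
  staged.write
    .mode(SaveMode.Overwrite)
    .insertInto("target_db.final_table")
}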

Hive and PySpark efficiency - many jobs or one job?

I have a question on the inner workings of Spark.
If I define a dataframe from a Hive table e.g. df1 = spark_session.table('db.table'); is that table read just once?
What I mean is: if I create 4 or 5 new dataframes from df1 and output them all to separate files within a single job, is that more efficient than running each output as its own Spark job?
Is the first approach (one job) more efficient than the second (many jobs)? Does it result in less load on Hive because we read the data once, or is that not how it works?
If I define a dataframe from a Hive table e.g. df1 = spark_session.table('db.table'); is that table read just once?
You need to cache it: df1 = spark_session.table('db.table').cache(). Spark will then read the table once and cache the data when the first action is performed.
If you output df1 to 4 or 5 different files, Spark still reads the data from the Hive table only once, because the data is already cached.
Is the first approach (one job) more efficient than the second (many jobs)? Does it result in less load on Hive because we read the data once, or is that not how it works?
Yes, with the first approach we put less load on Hive because we read the data once.
With the second approach, where we write a separate Spark job for each file, we read the Hive table once in every job.
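As a minimal sketch of the single-job approach (written in Scala to match the other examples on this page; the region column and the output paths are invented for illustration):
val df1 = spark.table("db.table").cache() // the Hive table is scanned once, on the first action

// Each write is an action; the first one populates the cache, the later ones reuse it
// instead of going back to Hive.
df1.filter("region = 'EU'").write.parquet("/tmp/out/eu")
df1.filter("region = 'US'").write.parquet("/tmp/out/us")
df1.groupBy("region").count().write.parquet("/tmp/out/counts")

df1.unpersist() // release the cached data once all outputs are written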

Performance consideration when reading from hive view Vs hive table via DataFrames

We have a view that unions multiple Hive tables. If I use Spark SQL in PySpark to read that view, will there be any performance issue compared with reading directly from the tables?
In Hive we would get a full table scan if we didn't limit the WHERE clause to an exact table partition. Is Spark intelligent enough to read only the table/partitions that contain the data we are looking for, rather than scanning the entire view?
Please advise.
You are talking about partition pruning.
Yes, Spark supports it: Spark automatically skips reading large amounts of data when partition filters are specified.
Partition pruning is possible when the data within a table is split across multiple logical partitions. Each partition corresponds to a particular value of a partition column and is stored as a subdirectory within the table's root directory on HDFS. Where applicable, only the required partitions (subdirectories) of a table are queried, thereby avoiding unnecessary I/O.
After partitioning the data, subsequent queries can skip large amounts of I/O when the partition column is referenced in predicates. For example, the following query automatically locates and loads the files under peoplePartitioned/age=20/ and skips all others:
val peoplePartitioned = spark.read.format("orc").load("peoplePartitioned")
peoplePartitioned.createOrReplaceTempView("peoplePartitioned")
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20")
more detailed info is provided here
You can also see this in the logical plan if you run explain(true) on your query:
spark.sql("SELECT * FROM peoplePartitioned WHERE age = 20").explain(true)
It will show which partitions are read by Spark.
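For completeness, a minimal sketch (reusing the names from the example above, and assuming a people DataFrame with an age column) of how such a partitioned layout is produced in the first place:
// partitionBy creates one subdirectory per value: peoplePartitioned/age=20/, age=21/, ...
// which is exactly the layout that makes the pruning shown above possible.
people.write.format("orc").partitionBy("age").save("peoplePartitioned")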

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing Spark dataframes by adding Ignite on top of them. The following code is how we currently read the dataframe:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark dataframe from Ignite following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
val df = spark.read
  .format(IgniteDataFrameSettings.FORMAT_IGNITE) // Data source
  .option(IgniteDataFrameSettings.OPTION_TABLE, "person") // Table to read.
  .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, CONFIG) // Ignite config.
  .load()
df.createOrReplaceTempView("person")
SQL queries (like select a, b, c from table where x) on the Ignite dataframe work, but the performance is much slower than Spark alone (i.e. without Ignite, querying the Spark DF directly): a query often takes 5 to 30 seconds, and it is commonly 2 or 3 times slower than Spark alone. I noticed that a lot of data (100MB+) is exchanged between the Ignite container and the Spark container for every query. A query with the same "where" clause but a smaller result is processed faster. Overall, the Ignite dataframe support seems to be a simple wrapper on top of Spark, and hence in most cases it is slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark". So I couldn't change any cache configuration in XML (because I need to specify the cache name in XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, it doesn't seem that your test is fair. In the first case you prefetch the Parquet data, cache it locally in Spark, and only then execute the query. In the case of the Ignite DF you don't use caching, so the data is fetched during query execution. Typically you will not be able to cache all of your data, so performance with Parquet will go down significantly once some of the data has to be fetched during execution.
However, with Ignite you can use indexing to improve the performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
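A rough sketch only, not taken from the answer: the index name person_x_idx is invented, x is the column from the question's WHERE clause, and the cache name follows the SQL_PUBLIC_<table> pattern mentioned in the question. The index can be created by running a DDL statement against the Ignite cache:
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery

// Start a node with the same Ignite configuration file the Spark job uses (CONFIG above).
val ignite = Ignition.start(CONFIG)

// Run the CREATE INDEX DDL through the cache Ignite created for the "person" table
// (named SQL_PUBLIC_PERSON by default, as noted in the question).
ignite.cache[Any, Any]("SQL_PUBLIC_PERSON")
  .query(new SqlFieldsQuery("CREATE INDEX IF NOT EXISTS person_x_idx ON person (x)"))
  .getAll()

Ignition.stop(false)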

How do I increase the number of partitions when I read in a hive table in Spark

So, I am trying to read a Hive table in Spark with hiveContext.
The job basically reads data from two tables into two DataFrames, which are subsequently converted to RDDs. I then join them based on a common key.
However, this join is failing due to a MetadataFetchFailedException (What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?).
I want to avoid that by spreading my data over to other nodes.
Currently, even though I have 800 executors, most of the data is being read into 10 nodes, each of which is using > 50% of its memory.
The question is: how do I spread the data over more partitions during the read operation? I do not want to repartition later on.
val tableDF = hiveContext.read.table("tableName")
  .select("colId1", "colId2")
  .rdd
  .flatMap(sqlRow => {
    // build a (colId1, colId2) pair from each row
    Array((sqlRow.get(0), sqlRow.get(1)))
  })
