I have been trying to find out why I am getting strange behavior when running a certain Spark job. The job errors out if I place an action (a .show(1) call) either right after caching the DataFrame or right before writing the DataFrame back to HDFS.
There is a very similar post on SO here:
Spark SQL SaveMode.Overwrite, getting java.io.FileNotFoundException and requiring 'REFRESH TABLE tableName'.
Basically, the other post explains that when you read from the same HDFS directory that you are writing to, and your SaveMode is "overwrite", you will get a java.io.FileNotFoundException.
But here I am finding that just moving where in the program the action is can give very different results - either completing the program or giving this exception.
I was wondering if anyone can explain why Spark is not being consistent here.
val myDF = spark.read.format("csv")
.option("header", "false")
.option("delimiter", "\t")
.schema(schema)
.load(myPath)
// If I cache it here or persist it then do an action after the cache, it will occasionally
// not throw the error. This is when completely restarting the SparkSession so there is no
// risk of another user interfering on the same JVM.
myDF.cache()
myDF.show(1)
// Just an example.
// Many different transformations are then applied...
val secondDF = mergeOtherDFsWithmyDF(myDF, otherDF, thirdDF)
val fourthDF = mergeTwoDFs(thirdDF, StringToCheck, fifthDF)
// Below is the same .show(1) action call as above, but this one ALWAYS results in
// successful completion, while the .show(1) above sometimes results in a
// FileNotFoundException and sometimes completes successfully. The only thing that
// changes between test runs is which of the two is executed: either
// fourthDF.show(1) or myDF.show(1) is left commented out.
fourthDF.show(1)
fourthDF.write
.mode(writeMode)
.option("header", "false")
.option("delimiter", "\t")
.csv(myPath)
Try using count() instead of show(1). I believe the issue is that Spark tries to be smart and does not load the whole DataFrame (since show does not need everything). Running count() forces Spark to load and properly cache all the data, which will hopefully make the inconsistency go away.
Spark only materializes RDDs on demand. Most actions, such as count(), require reading all partitions of the DataFrame, but actions such as take() and first() do not.
In your case, show(1) requires a single partition, so only that partition is materialized and cached. Then, when you do a count(), all partitions need to be materialized and cached, to the extent your available memory allows.
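A minimal sketch of the suggested change, reusing the spark, schema and myPath values from the question:
val myDF = spark.read.format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .schema(schema)
  .load(myPath)

myDF.cache()
// count() touches every partition, so the whole DataFrame ends up in the cache
// (to the extent memory allows), unlike show(1), which may read only one partition.
myDF.count()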
Related
There is a table which I read early in the script, but the run will fail if the underlying table changes in a partition I read, e.g.:
java.io.FileNotFoundException: File does not exist:
hdfs://R2/projects/.../country=AB/date=2021-08-20/part-00005e4-4fa5-aab4-93f02feaf746.c000
Even when I specifically cache the table, and do an action, the script will still fail down the line if the above happens.
df.cache()
df.show(1)
My question is, how is this possible?
If I cache the data on memory/disk, why does it matter if the underlying file is updated or not?
Edit: the code is very long; the main points are:
df = the table read in, whose underlying data is in the above HDFS folder
df.cache() and df.show() immediately after it, since Spark evaluates lazily. With show() I make the caching happen.
Later, when I refer to df: if the underlying data has changed, the script will fail with java.io.FileNotFoundException:
new_df = df.join(other_df, 'id', 'right')
As discussed in the comment section, Spark will automatically evict cached data on an LRU (Least Recently Used) basis whenever it runs low on memory.
In your case, Spark might have evicted the cached table. If there is no cached data, the previous lineage is used to build the DataFrame again, and that throws an error if an underlying file is missing.
You can try increasing the memory or using the DISK_ONLY storage level.
import org.apache.spark.storage.StorageLevel
df.persist(StorageLevel.DISK_ONLY)
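Forcing full materialization right after persisting may also help here, since an action like show(1) only persists the partitions it actually touches; a one-line addition to the snippet above (a sketch, assuming the same df):
df.count() // scans every partition, so the whole table is persisted while the source files still exist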
sessionIdList is of type:
scala> sessionIdList
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
When I try to run the code below:
val x = sc.parallelize(List(1,2,3))
val cartesianComp = x.cartesian(x).map(x => (x))
val kDistanceNeighbourhood = sessionIdList.map(s => {
cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
I receive an exception:
14/05/21 16:20:46 ERROR Executor: Exception in task ID 80
java.lang.NullPointerException
at org.apache.spark.rdd.RDD.filter(RDD.scala:261)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:38)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:36)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
However, if I use:
val l = sc.parallelize(List("1","2"))
val kDistanceNeighbourhood = l.map(s => {
cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
Then no exception is displayed
The difference between the two code snippets is that in the first snippet sessionIdList is of type:
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
and in the second snippet "l" is of type:
scala> l
res13: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at parallelize at <console>:12
Why is this error occurring?
Do I need to convert sessionIdList to ParallelCollectionRDD in order to fix this ?
Spark doesn't support nesting of RDDs (see https://stackoverflow.com/a/14130534/590203 for another occurrence of the same problem), so you can't perform transformations or actions on RDDs inside of other RDD operations.
In the first case, you're seeing a NullPointerException thrown by the worker when it tries to access a SparkContext object that's only present on the driver and not the workers.
In the second case, my hunch is the job was run locally on the driver and worked purely by accident.
It's a reasonable question, and I have heard it asked enough times that I'm going to take a stab at explaining why this is true, because it might help.
Nested RDDs will always throw an exception in production. Nested function calls, as I think you are describing them here, if that means calling an RDD operation inside an RDD operation, will also cause failures, since it is actually the same thing. (RDDs are immutable, so performing an RDD operation such as a "map" is equivalent to creating a new RDD.) The inability to create nested RDDs is a necessary consequence of the way an RDD is defined and the way a Spark application is set up.
An RDD is a distributed collection of objects, split into pieces called partitions, that live on the Spark executors. Spark executors cannot communicate with each other, only with the Spark driver. The RDD operations are all computed in pieces on these partitions. Because the executor environment isn't recursive (i.e. you can't configure a Spark driver to run on an executor with its own sub-executors), neither can an RDD be.
In your program, you have created a distributed collection of partitions of integers. You are then performing a mapping operation. When the Spark driver sees a mapping operation, it sends the instructions to do the mapping to the executors, which perform the transformation on each partition in parallel. But your mapping cannot be done, because on each partition you are trying to call the "whole RDD" to perform another distributed operation. This can't be done, because each partition does not have access to the information in the other partitions; if it did, the computation couldn't run in parallel.
What you can do instead, because the data you need in the map is probably small (since you are doing a filter, and the filter does not require any information about sessionIdList), is to first filter the session ID list, then collect that list to the driver, and then broadcast it to the executors, where you can use it in the map. If the session ID list is too large, you will probably need to do a join instead.
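A rough sketch of that broadcast approach, reusing sc, x and sessionIdList from the question; the final predicate is only a placeholder, since the nested filter in the question doesn't show how the session ids are meant to relate to the cartesian pairs:
val x = sc.parallelize(List(1, 2, 3))
val cartesianComp = x.cartesian(x)

// Collect the (assumed small) session id list to the driver and broadcast it
val sessionIds = sessionIdList.filter(_ != null).collect().toSet
val sessionIdsBc = sc.broadcast(sessionIds)

// Use the broadcast value inside a single RDD operation instead of nesting RDDs
val result = cartesianComp.filter(pair => pair != null && sessionIdsBc.value.nonEmpty)
result.take(1)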
I was trying to improve the performance of an existing Spark DataFrame workload by adding Ignite on top of it. The following code is how we currently read the DataFrame:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark DataFrame from Ignite by following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
val df = spark.read
  .format(IgniteDataFrameSettings.FORMAT_IGNITE) // Data source
  .option(IgniteDataFrameSettings.OPTION_TABLE, "person") // Table to read
  .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, CONFIG) // Ignite config
  .load()
df.createOrReplaceTempView("person")
SQL queries (like select a, b, c from table where x) on the Ignite DataFrame work, but the performance is much slower than Spark alone (i.e. without Ignite, querying the Spark DF directly). An SQL query often takes 5 to 30 seconds, and it is commonly 2 or 3 times slower than Spark alone. I noticed that a lot of data (100MB+) is exchanged between the Ignite container and the Spark container for every query. A query with the same "where" clause but a smaller result is processed faster. Overall, I feel the Ignite DataFrame support is a simple wrapper on top of Spark, and hence in most cases it is slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark", so I couldn't change any cache configuration in the XML (because I need to specify the cache name in the XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, your test doesn't seem fair. In the first case you prefetch the Parquet data, cache it locally in Spark, and only then execute the query. With the Ignite DF you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all of your data, so performance with Parquet will go down significantly once some of the data has to be fetched during execution.
However, with Ignite you can use indexing to improve performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
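One way to issue that DDL (a sketch, not taken from the linked page) is through Ignite's thin JDBC driver; the node address below is assumed, and the table and column names are just the ones from this question:
import java.sql.DriverManager

// Assumes an Ignite node is reachable on localhost and the Ignite JDBC thin driver is on the classpath
val conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1")
val stmt = conn.createStatement()
// Index the column used in the WHERE clause so queries do not scan the whole table
stmt.executeUpdate("CREATE INDEX IF NOT EXISTS person_x_idx ON person (x)")
conn.close()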
My program works like this:
Read in a lot of files as dataframes. Among those files there is a group of about 60 files with 5k rows each, where I create a separate DataFrame for each of them, do some simple processing, and then union them all into one dataframe which is used for further joins (a rough sketch of this step follows the list).
I perform a number of joins and column calculations on a number of dataframes, which finally results in a target dataframe.
I save the target dataframe as a Parquet file.
In the same spark application, I load that Parquet file and do some heavy aggregation followed by multiple self-joins on that dataframe.
I save the second dataframe as another Parquet file.
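A rough sketch of the union step above (the file paths, format, and per-file processing here are placeholders, not the actual job):
import org.apache.spark.sql.functions.lit

// ~60 small input files (hypothetical paths)
val paths = (1 to 60).map(i => s"/data/input/part_$i.csv")

// One DataFrame per file, with some simple per-file processing
val parts = paths.map { p =>
  spark.read.option("header", "true").csv(p)
    .withColumn("source_file", lit(p))
}

// Union them all into the single DataFrame used for the later joins
val combined = parts.reduce(_ union _)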
The problem
If I have just one file instead of 60 in the group of files I mentioned above, everything works with the driver having 8g of memory. With 60 files, the first 3 steps work fine, but the driver runs out of memory when preparing the second file. Things improve only when I increase the driver's memory to 20g.
The Question
Why is that? When calculating the second file I do not use the dataframes used to calculate the first file, so their number and content should not really matter if the size of the first Parquet file remains constant, should it? Do those 60 dataframes get cached somehow and occupy the driver's memory? I don't do any caching myself. I also never collect anything. I don't understand why 8g of memory would not be sufficient for the Spark driver.
import org.apache.spark.storage.StorageLevel

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// You have to configure serialization if you are using MEMORY_AND_DISK_SER
val rdd1 = sc.textFile("some data")
rdd1.persist(StorageLevel.MEMORY_AND_DISK_SER) // marks rdd1 to be persisted
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.persist(StorageLevel.MEMORY_AND_DISK_SER)
rdd3.persist(StorageLevel.MEMORY_AND_DISK_SER)
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
rdd1.unpersist()
rdd2.unpersist()
rdd3.unpersist()
For tuning your code follow this link
Caching and persistence are optimisation techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results, as RDDs, are kept in memory (the default) or in more durable storage such as disk, and/or replicated.
RDDs can be cached using cache operation. They can also be persisted using persist operation.
The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.
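A minimal illustration of that equivalence, assuming an existing SparkContext sc:
import org.apache.spark.storage.StorageLevel

val rdd1 = sc.textFile("some data")
val rdd2 = sc.textFile("some data")

rdd1.cache()                           // shorthand for ...
rdd2.persist(StorageLevel.MEMORY_ONLY) // ... this explicit form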
refer to use of persist and unpersist
Consider a continuous flow of JSON data on a Kafka topic; we want to handle it with Structured Streaming like this:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
I was wondering: if the program runs for a long time, will the df variable become huge - in my case, something like 100 TB over a week? Is there any configuration available to eliminate earlier data in df, or to simply dequeue the earliest rows?
In Spark the execution will not start until an action is triggered.
This concept is called Lazy Evaluation in Apache Spark.
“Transformations are lazy in nature meaning when we call some operation in RDD, it does not execute immediately”
Having said that, the load operation is a transformation, and no data will be read upon executing that line of code.
In order to kick off a streaming job, you need to provide the following four logical components and call start (a sketch follows the list):
The input (Kafka, file, socket, ...)
The trigger (how often the input gets updated)
The result table (which is built by the query after each trigger update)
The output (defines what part of the result will be written)
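A minimal sketch of those four pieces applied to the df from the question; the sink choice, paths and trigger interval below are placeholders:
import org.apache.spark.sql.streaming.Trigger

val query = df
  .selectExpr("CAST(value AS STRING) AS json")   // the raw Kafka value, one row per message
  .writeStream
  .format("parquet")                             // the output sink (placeholder choice)
  .option("path", "/tmp/stream-out")             // placeholder output path
  .option("checkpointLocation", "/tmp/ckpt")     // placeholder checkpoint path
  .trigger(Trigger.ProcessingTime("1 minute"))   // the trigger
  .outputMode("append")                          // which part of the result is written
  .start()                                       // kicks off the streaming job

query.awaitTermination()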
The memory consumption depends on what is done in the query that will be triggered. From the Spark documentation:
"Since Spark is updating the Result Table, it has full control over
updating old aggregates when there is late data, as well as cleaning
up old aggregates to limit the size of intermediate state data. Since
Spark 2.1, we have support for watermarking which allows the user to
specify the threshold of late data, and allows the engine to
accordingly clean up old state."
So you have to determine the amount of data needed to calculate the result table in order to estimate the amount of required memory.
It is possible that an executor will crash with an OOM exception if you do something like mapGroupsWithState, …
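A sketch of the watermarking mentioned in the quoted documentation, assuming the Kafka value has already been parsed into a DataFrame events with an eventTime timestamp column (those names are not from the question):
import org.apache.spark.sql.functions.{col, window}

val counts = events
  .withWatermark("eventTime", "10 minutes")       // state older than the watermark can be dropped
  .groupBy(window(col("eventTime"), "5 minutes")) // 5-minute tumbling windows
  .count()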