Spark is throwing FileNotFoundException while accessing cached table - apache-spark

There is a table which I read early in the script, but the run will fail if the underlying table changes in a partition I read, e.g.:
java.io.FileNotFoundException: File does not exist:
hdfs://R2/projects/.../country=AB/date=2021-08-20/part-00005e4-4fa5-aab4-93f02feaf746.c000
Even when I explicitly cache the table and perform an action, the script will still fail down the line if the above happens.
df.cache()
df.show(1)
My question is, how is this possible?
If I cache the data on memory/disk, why does it matter if the underlying file is updated or not?
Edit: the code is very long; the main points are:
df = read in the table whose underlying data is in the above HDFS folder
df.cache() and df.show() immediately after it, since Spark evaluates lazily; with show() I force the caching to actually happen
Later, when I refer to df, the script fails with java.io.FileNotFoundException if the underlying data has changed:
new_df = df.join(other_df, 'id', 'right')

As discussed in the comment section, Spark will automatically evict cached data on an LRU (Least Recently Used) basis whenever it runs into memory pressure.
In your case Spark might have evicted the cached table. If there is no cached data, the previous lineage is used to recompute the DataFrame, and that recomputation throws an error if the underlying file is missing.
You can try increasing the memory, or use the DISK_ONLY storage level.
df.persist(StorageLevel.DISK_ONLY)
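For example, a minimal sketch in Scala (assuming df is the DataFrame read from that HDFS table; in PySpark the calls are the same apart from the import):

import org.apache.spark.storage.StorageLevel

// Persist to executor-local disk only: partitions are not dropped under
// memory pressure, so later references to df do not fall back to re-reading
// the (possibly changed) HDFS files through the lineage.
df.persist(StorageLevel.DISK_ONLY)
df.count() // an action, to materialize the persisted data eagerly

Note that, as mentioned in the related answer below, even DISK_ONLY data is recomputed from the lineage if an executor is lost.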

Related

Differences between persist(DISK_ONLY) vs manually saving to HDFS and reading back

This answer clearly explains RDD persist() and cache() and the need for it - (Why) do we need to call cache or persist on a RDD
So, I understand that calling someRdd.persist(DISK_ONLY) is lazy, but someRdd.saveAsTextFile("path") is eager.
But other than this (also disregarding the cleanup of text file stored in HDFS manually), is there any other difference (performance or otherwise) between using persist to cache the rdd to disk versus manually writing and reading from disk?
Is there a reason to prefer one over the other?
More Context: I came across code which manually writes to HDFS and reads it back in our production application. I've just started learning Spark and was wondering if this can be replaced with persist(DISK_ONLY). Note that the saved rdd in HDFS is deleted before every new run of the job and this stored data is not used for anything else between the runs.
There are at least these differences:
Writing to HDFS carries replication overhead, while cached data is written locally on the executor (or to a second replica if DISK_ONLY_2 is chosen).
Writing to HDFS is persistent, while cached data might get lost if/when an executor is killed for any reason. And you already mentioned the benefit of writing to HDFS when the entire application goes down.
Caching does not change the partitioning, but reading from HDFS might/will result in different partitioning than the original written DataFrame/RDD. For example, small partitions (files) will be aggregated and large files will be split.
I usually prefer to cache small/medium data sets that are expensive to evaluate, and write larger data sets to HDFS.
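As a rough sketch of the two options being compared (names like expensiveDF and the output path are illustrative assumptions, not code from the question):

import org.apache.spark.storage.StorageLevel

// Option A: cache on executor-local disk. Lazy, keeps the original
// partitioning, but the cached data is lost if the executor dies.
val cachedDF = expensiveDF.persist(StorageLevel.DISK_ONLY)
cachedDF.count() // action that materializes the cache

// Option B: write to HDFS and read back. Eager and durable (replicated),
// but the reloaded DataFrame may end up with different partitioning.
expensiveDF.write.mode("overwrite").parquet("hdfs:///tmp/expensive_df")
val reloadedDF = spark.read.parquet("hdfs:///tmp/expensive_df")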

Does Spark hold DataFrame in memory when loaded from a file?

If I create a Dataframe like so:
val usersDF = spark.read.csv("examples/src/main/resources/users.csv")
Does spark actually load (/copy) the data (from the csv file) into memory, or into the underlying filesystem as a distributed dataset?
I ask because after loading the df, any change in the underlying file's data is not reflected in queries against the dataframe (unless of course the dataframe is freshly loaded again by invoking the above line of code).
I am using interactive queries on Databricks notebooks.
The file does not get loaded into memory until you perform an action on it; until an action in the execution plan actually loads it, queries reflect the current contents of the file.
If an action has already been executed and the file has been modified since, you will see the cached result of that first execution, provided it fits in memory.
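A small illustration of that behaviour, reusing the line from the question (a sketch; the comments describe the assumed sequence of events):

val usersDF = spark.read.csv("examples/src/main/resources/users.csv") // lazy: nothing is read yet

usersDF.count() // first action: the CSV file is actually read here

usersDF.cache() // marks the DataFrame for caching (still lazy)
usersDF.count() // this action materializes the cache

// Queries served from the cache will not reflect later edits to users.csv;
// re-running the spark.read.csv(...) line would pick them up.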

Spark cache function: Caching Job and Caching Stage

I am new to spark and can use some guidance here.
We have some basic code to read in a csv, cache it, and output it to parquet:
1. val df=sparkSession.read.options(options).schema(schema).csv(path)
2. val dfCached = df.withColumn()....orderBy(some Col).cache()
3. dfCached.write.partitionBy(partitioning).parquet(outputPath)
AFAIK, once we invoke the parquet call (an action) the cache command should be executed to save the state of the DF before the action is applied.
In the spark UI I see:
A single-stage job which executes the cache call from #2 above
Then a job which executes the parquet call. This job has 2 stages: one which seems to repeat the caching step, and a second which performs the conversion to parquet.
Why do I have both a caching Job and a caching Stage? I would expect to have only one or the other but it seems like we are caching twice here.
I'm not 100% sure but it seems that the following is happening:
When csv data is loaded it is split among the worker nodes. We call cache() and each node stores the data it received in memory. This is the first cache job.
When we call partitionBy(...), data needs to be regrouped among different executors based on the arguments passed to the function. Since we are caching and data has moved from one executor to another, we need to re-cache the shuffled data. This is confirmed by the second caching stage showing some shuffle write data. Furthermore, the caching stage shows fewer tasks than the initial caching job, possibly because only the shuffled data needs to be re-cached, as opposed to the entire data frame.
The parquet stage is invoked. We can see some shuffle read data which shows the executors reading the newly shuffled data.

Spark driver running out of memory when reading multiple files

My program works like this:
Read in a lot of files as dataframes. Among those files there is a group of about 60 files with 5k rows each, where I create a separate Dataframe for each of them, do some simple processing and then union them all into one dataframe which is used for further joins.
I perform a number of joins and column calculations on a number of dataframes, which finally results in a target dataframe.
I save the target dataframe as a Parquet file.
In the same spark application, I load that Parquet file and do some heavy aggregation followed by multiple self-joins on that dataframe.
I save the second dataframe as another Parquet file.
The problem
If I have just one file instead of 60 in the group of files I mentioned above, everything works with driver having 8g memory. With 60 files, the first 3 steps work fine, but driver runs out of memory when preparing the second file. Things improve only when I increase the driver's memory to 20g.
The Question
Why is that? When calculating the second file I do not use Dataframes used to calculate the first file so their number and content should not really matter if the size of the first Parquet file remains constant, should it? Do those 60 dataframes get cached somehow and occupy driver's memory? I don't do any caching myself. I also never collect anything. I don't understand why 8g of memory would not be sufficient for Spark driver.
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//you have to use serialization configuration if you are using MEMORY_AND_DISK_SER
val rdd1 = sc.textFile("some data")
rdd1.persist(storageLevel.MEMORY_AND_DISK_SER) // marks rdd as persist
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.persist(storageLevel.MEMORY_AND_DISK_SER)
rdd3.persist(storageLevel.MEMORY_AND_DISK_SER)
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")
rdd1.unpersist()
rdd2.unpersist()
rdd3.unpersist()
For tuning your code, refer to the Spark tuning guide.
Caching or persistence are optimisation techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results, as RDDs, are kept in memory (the default) or on more durable storage such as disk, and/or replicated.
RDDs can be cached using cache operation. They can also be persisted using persist operation.
The difference between cache and persist operations is purely syntactic. cache is a synonym of persist or persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.
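For instance (a sketch assuming an existing SparkContext sc):

import org.apache.spark.storage.StorageLevel

// For an RDD these two lines are equivalent: cache() is persist() with the
// default storage level MEMORY_ONLY.
val cached = sc.parallelize(1 to 100).cache()
val persisted = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)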
refer to use of persist and unpersist

Spark JavaPairRDD behaviour when file gets updated

I have one use case where I am joining data between one file and stream data.
For this purpose I read the data in file as JavaPairRDD and cache it.
But the catch is that the file is going to be updated periodically, every 3-4 hours.
Now my doubt is: do I have to read the file again and recreate the JavaPairRDD to reflect the changes in the file, or does Spark take care of this already, i.e. are the RDDs recreated whenever the file gets updated?
RDDs in Spark are designed to be immutable: if the underlying data changes, the values in the RDD will not change unless it is uncached/unpersisted/uncheckpointed. In general Spark assumes that the backing data for an RDD doesn't change, so you would likely be better off creating a new RDD (or treating both as streams).
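So the refresh has to be done explicitly. A rough sketch in Scala (the Java API is analogous; sc, path and parseLine are assumed names, not part of the question):

// Keep a reference to the cached lookup RDD and rebuild it on a schedule
// (e.g. every 3-4 hours) so joins against the stream see the updated file.
var lookupRdd = sc.textFile(path).map(parseLine).cache()

def refreshLookup(): Unit = {
  lookupRdd.unpersist() // drop the stale cached copy
  lookupRdd = sc.textFile(path).map(parseLine).cache() // re-read and re-cache the file
}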
