I'm trying to understand how RDD.unpersist() works but I'm running into some confusing output.
When I force-delete an RDD and then try to show it:
rdd.unpersist(blocking=True)
rdd.show() # why doesn't this line throw an error?
I expect the second line to error, but it doesn't. The RDD prints out as usual.
I saw these two questions: Why doesnt spark unload memory even with unpersist and How to make sure my DataFrame frees its memory?. Both were helpful for understanding how to use unpersist, but neither answers my question.
I'm using a Jupyter notebook and wondered if the notebook might be caching the RDD, so I tested this out in a .py file as well, and the same thing happened.
If the RDD has been deleted, why does it print when show() is called on it?
If it hasn't been deleted, how can I delete it?
You can check whether the RDD is cached or not by using the is_cached attribute.
rdd = sc.parallelize([1, 2, 3])
print(rdd.is_cached)
rdd.cache()
print(rdd.is_cached)
rdd.unpersist()
print(rdd.is_cached)
rdd.count()  # ignore the returned value
print(rdd.is_cached)
False
True
False
False
Without a cache() call, the RDD is not cached, so unpersist() does not work the way you expected. It will not delete the RDD itself, only remove the cached version. To get rid of the variable itself, see How do you delete a variable?
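A minimal sketch of that distinction, assuming a local SparkContext (the variable names are just for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3]).cache()
rdd.count()                   # action: materializes the data and caches it
rdd.unpersist(blocking=True)  # drops only the cached copy

# The RDD object still exists and is simply recomputed from its lineage:
print(rdd.collect())          # [1, 2, 3]

# Removing the variable itself is plain Python, not a Spark operation:
del rdd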
I want to use Pyspark to read in hundreds of csv files and create a single dataframe that is (roughly) the concatenation of all the csvs. Since each csv can fit in memory, but not more than one or two at a time, this seems like a good fit for Pyspark. My strategy is not working, and I think it is because I try to build a Pyspark dataframe inside the kernel function passed to my map call, which results in an error:
from pyspark.sql import SparkSession

# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()

file_path_list = [path1, path2]  # list of string path variables

# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f)

# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
this throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls methods on sc, it is against the rules. Is that right?
I (think I) could use the plain-old python (not pyspark) map function to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all of these into a single pyspark dataframe, but that doesn't seem to leverage pyspark much. Does this seem like a good route?
Can you teach me a good, ideally a "tied-for-best" way to do this?
I don't have a definitive answer, just some comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here.
A Spark cluster is composed of the scheduler and the workers. You use the SparkSession to pass work to the scheduler. It seems workers are not allowed to send work to the scheduler themselves, which does look like it would be an anti-pattern in a lot of use cases.
The design pattern is also odd here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so that read does not happen immediately. I feel like if it worked, it would pass a DAG back, not data.
Regarding your second idea: it doesn't sound good, because you want the loading of files to be lazy. Since you can't use Spark to read on a worker, you'd have to use Pandas/Python, which evaluate immediately, so you would be even more likely to run out of memory.
Speaking of memory, Spark lets you compute over data that is larger than memory, but there are limits to how much larger than the available memory it can be. You will inevitably run into errors if you really don't have enough memory by a considerable margin.
I think you should use the wildcard as shown above.
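For illustration, a minimal sketch of the wildcard / multi-path approach (the glob pattern and paths are placeholders for your actual files, which are assumed to share a schema):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

# Option 1: let Spark expand a glob over the directory that holds the CSVs.
df = spark.read.options(delimiter=",", header=True).csv("path/to/csv_dir/*.csv")

# Option 2: pass the whole list of paths; csv() also accepts a list of strings.
file_path_list = ["path/to/a.csv", "path/to/b.csv"]
df = spark.read.options(delimiter=",", header=True).csv(file_path_list)

df.show(3)
Either way Spark builds a single DataFrame lazily, which is (roughly) the concatenation you wanted.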
There is a table which I read early in the script, but the run will fail if the underlying table changes in a partition I have read, e.g.:
java.io.FileNotFoundException: File does not exist:
hdfs://R2/projects/.../country=AB/date=2021-08-20/part-00005e4-4fa5-aab4-93f02feaf746.c000
Even when I specifically cache the table, and do an action, the script will still fail down the line if the above happens.
df.cache()
df.show(1)
My question is, how is this possible?
If I cache the data on memory/disk, why does it matter if the underlying file is updated or not?
Edit: the code is very long; the main structure is:
df = read in the table whose underlying data is in the above HDFS folder
df.cache() with df.show() immediately after it, since Spark evaluates lazily; the show() is what makes the caching actually happen
Later, when I refer to df, if the underlying data has changed, the script fails with java.io.FileNotFoundException:
new_df = df.join(other_df, 'id', 'right')
As discussed in the comment section, Spark will automatically evict cached data on an LRU (Least Recently Used) basis whenever it runs into memory pressure.
In your case Spark might have evicted the cached table. If there is no cached data, the previous lineage will be used to build the dataframe again, and that will throw an error if the underlying file is missing.
You can try increasing the memory, or use the DISK_ONLY storage level.
df.persist(StorageLevel.DISK_ONLY)
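For completeness, a minimal sketch of that line in context, with the import it needs and an action to actually populate the disk cache (the table read is a placeholder for however the script really reads it):
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.table("my_table")   # placeholder for the actual table read
df.persist(StorageLevel.DISK_ONLY)  # lazy: only records the requested storage level
df.count()                          # action: materializes df and writes the cached blocks to disk
# later uses of df are served from the disk cache instead of re-reading the HDFS files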
I have a lot of data that I'm trying to take out in chunks - let's say 3 chunks - and not have it all cached in memory at once. However, I'd like to save it (action) all at the same time afterwards.
This is the current simplified strategy:
for query in [query1, query2, query3]:
    df = spark.sql(query)
    df.cache()
    df1 = df.filter('a')
    df2 = df.filter('b')
    final_output_1 = final_output_1.join(df1)
    final_output_2 = final_output_2.join(df2)
    df.unpersist()

final_output_1.write.saveAsTable()
final_output_2.write.saveAsTable()
So first question: would unpersist() not work here since there hasn't been an action yet on df?
Second question: how does df.cache() work here when I'm reusing the df variable in the for loop? I know dataframes are immutable, so each iteration effectively creates a new one, but would the unpersist() actually clear that memory?
Caching is used in Spark when you want to reuse a dataframe again and again, for example with mapping tables. Once you cache the df, you need an action operation to physically move the data to memory, since Spark is based on lazy execution.
In your case
df.cache()
will not work as expected, because you are not performing an action after it.
For the cache to work you need to run df.count(), df.show(), or any other action so that the data is moved to memory; otherwise your data won't be moved to memory and you will not get any advantage. For the same reason, the df.unpersist() is also redundant.
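To make that concrete, here is a tiny sketch of the cache-needs-an-action point (query1 and spark are the ones from the question; this is just an illustration, not a fix for the whole loop):
df = spark.sql(query1)   # still lazy, nothing has run yet
df.cache()               # also lazy: only marks df for caching
df.count()               # first action: the data is computed and actually cached
# ... reuse df here to benefit from the cache ...
df.unpersist()           # now there really is cached data to release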
First Question:
No, your df.cache() and df.unpersist() will not work, as no data was cached to begin with, so there is nothing to unpersist.
Second Question:
Yes, you can use the same variable name; if an action is performed, the data will get cached, and after your operations df.unpersist() will unpersist the data in each loop iteration.
So the previous DF has no connection to the next DF in the next iteration.
As you said, they are immutable, and since you are assigning a new query to the same variable in each iteration, it acts as a new DF (not related to the previous DF).
Based on your code, I don't think you need to do caching at all, as you are only performing one operation on each df.
refer to When to cache a DataFrame? and If I cache a Spark Dataframe and then overwrite the reference, will the original data frame still be cached?
I have been trying to find out why I am getting strange behavior when running a certain Spark job. The job will error out if I place an action (a .show(1) call) either right after caching the DataFrame or right before writing the dataframe back to HDFS.
There is a very similar post to SO here:
Spark SQL SaveMode.Overwrite, getting java.io.FileNotFoundException and requiring 'REFRESH TABLE tableName'.
Basically, the other post explains that when you read from the same HDFS directory that you are writing to, and your SaveMode is "overwrite", you will get a java.io.FileNotFoundException.
But here I am finding that just moving where in the program the action is can give very different results - either completing the program or giving this exception.
I was wondering if anyone can explain why Spark is not being consistent here?
val myDF = spark.read.format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .schema(schema)
  .load(myPath)

// If I cache it here or persist it and then do an action after the cache, it will occasionally
// not throw the error. This is when completely restarting the SparkSession so there is no
// risk of another user interfering on the same JVM.
myDF.cache()
myDF.show(1)

// Just an example.
// Many different transformations are then applied...
val secondDF = mergeOtherDFsWithmyDF(myDF, otherDF, thirdDF)
val fourthDF = mergeTwoDFs(thirdDF, StringToCheck, fifthDF)

// Below is the same .show(1) action call as was previously done, only this below
// action ALWAYS results in a successful completion, while the above .show(1) sometimes results
// in FileNotFoundException and sometimes in successful completion. The only
// thing that changes among test runs is which one is executed: either
// fourthDF.show(1) or myDF.show(1) is left commented out.
fourthDF.show(1)

fourthDF.write
  .mode(writeMode)
  .option("header", "false")
  .option("delimiter", "\t")
  .csv(myPath)
Try using count instead of show(1). I believe the issue is due to Spark trying to be smart and not loading the whole dataframe (since show does not need everything). Running count forces Spark to load and properly cache all the data, which will hopefully make the inconsistency go away.
Spark only materializes RDDs on demand, and most actions, such as count(), require reading all partitions of the DF, but actions such as take() and first() do not require all the partitions.
In your case, show(1) requires only a single partition, so only that one partition is materialized and cached. Then, when you do a count(), all partitions need to be materialized and cached, to the extent your available memory allows.
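A rough sketch of that suggestion, written in PySpark for consistency with the rest of this thread (myDF, schema and myPath stand in for the values in the question; the Scala calls are the same):
myDF = (spark.read.format("csv")
        .option("header", "false")
        .option("delimiter", "\t")
        .schema(schema)
        .load(myPath))

myDF.cache()
myDF.count()   # touches every partition, so the whole DataFrame is materialized and cached
# ... transformations and the final write follow ...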
Say I have three RDD transformation functions called on rdd1:
def rdd2 = rdd1.f1
def rdd3 = rdd2.f2
def rdd4 = rdd3.f3
Now I want to cache rdd4, so I call rdd4.cache().
My question:
Will only the result from the action on rdd4 be cached or will every RDD above rdd4 be cached? Say I want to cache both rdd3 and rdd4, do I need to cache them separately?
The whole idea of cache is that Spark does not keep results in memory unless you tell it to. So if you cache the last RDD in the chain, it only keeps the results of that one in memory. So, yes, you do need to cache them separately, but keep in mind that you only need to cache an RDD if you are going to use it more than once. For example:
rdd4.cache()                   // lazy: marks rdd4 for caching, nothing is stored yet
val v1 = rdd4.lookup("key1")   // first action: computes rdd4 (and its parents) and caches rdd4
val v2 = rdd4.lookup("key2")   // served from the cached rdd4; rdd1-rdd3 are not recomputed
If you do not call cache in this case, rdd4 will be recalculated for every call to lookup (or any other function that requires evaluation). You might want to read the paper on RDDs; it is pretty easy to understand and explains the ideas behind certain choices they made about how RDDs work.