Spark 1.6 DataFrame cache not working correctly

My understanding is that if I cache() a dataframe and then trigger an action like df.take(1) or df.count(), Spark should compute the dataframe and save it in memory, and whenever that cached dataframe is used later in the program, the already computed result should be served from the cache.
But that is not how my program is working.
I have a dataframe like the one below, which I cache, and then immediately run a df.count action.
val df = inputDataFrame.select(...).where(...).withColumn("newcol", lit("")).cache()
df.count
When I run the program, I see in the Spark UI that the first line runs for 4 min, and
when it comes to the second line it runs for another 4 min; basically, the first line is computed twice.
Shouldn't the first line be computed and cached when the second line triggers the action?
How do I resolve this behavior? I am stuck, please advise.

My understanding is that if I cache() a dataframe and trigger an action like df.take(1) or df.count(), it should compute the dataframe and save it in memory,
This is not correct. A simple cache and count (take wouldn't fully materialize an RDD either) is a valid method for RDDs, but it is not the case with Datasets, which use much more advanced optimizations. Given the query:
df.select(...).where(...).withColumn("newcol", lit("")).count()
any column that is not needed to evaluate the where clause can be ignored when computing the count.
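One way to see this for yourself is a sketch like the following (assuming a Spark 2+ spark-shell session; the column names a, b, and newcol are made up for illustration):
import org.apache.spark.sql.functions.lit
import spark.implicits._

val df = spark.range(10).selectExpr("id as a", "id * 2 as b")
  .where($"a" > 0)
  .withColumn("newcol", lit(""))

// Dataset.count() executes groupBy().count() under the hood; explain()
// on that query shows how columns not needed for the count can be pruned
df.groupBy().count().explain()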
There is an important discussion about this on the developer list; quoting Sean Owen:
I think the right answer is "don't do that" but if you really had to you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable because the whole partition has to be computed to make it available in practice. Or, go so far as to loop over every element.
Translated to code:
df.foreach(_ => ())
There is also
df.registerTempTable("df")
sqlContext.sql("CACHE TABLE df")
which is eager, but registerTempTable is no longer documented (it was replaced by createOrReplaceTempView in Spark 2 and forward), so this approach should be avoided.

No. If you call cache on a DataFrame, it is not cached at that moment; it is only "marked" for potential future caching. The actual caching only happens when an action follows later. You can also see your cached DataFrame in the Spark UI under "Storage".
Another problem in your code is that count on a DataFrame does not compute the entire DataFrame, because not all columns need to be computed for that. You can use df.rdd.count() to force the entire evaluation (see How to force DataFrame evaluation in Spark).
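A minimal sketch of the difference (assuming an already defined DataFrame df):
df.cache()       // nothing is stored yet: the DataFrame is only marked for caching
df.count()       // triggers a job, but column pruning may leave parts uncomputed
df.rdd.count()   // converting to an RDD forces every row and column to be evaluated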
The remaining question is why your first operation takes so long even though no action is called. I think this is related to the caching logic (e.g. size estimations etc.) being computed when calling cache (see e.g. Why is rdd.map(identity).cache slow when rdd items are big?).

Related

How does pyspark perform union?

If I have code like this:
def my_func(df):
    df1 = trans1(df)
    df2 = trans2(df1)
    df3 = trans3(df1)
    df4 = df2.unionAll(df3)
    return df4
And I run df.collect() on the result of the function, without having persisted anything. How many times will the operations in trans1 be run? Once or twice? Thanks!
Coming to your question: in your function there is no action, so nothing will be computed yet.
But suppose, hypothetically, that you cache the DataFrame:
df.cache().storageLevel
and afterwards perform some count action.
Caching/persistence is lazy when used with the Dataset API, so you have to trigger the caching using a count operator or similar, which in turn submits a Spark job.
In your case, even after the union there is no action, unless you add one such as writing the result to disk.
Yes, only actions (like saving to external storage) can trigger the persistence for future reuse.
You can check the Storage tab in the web UI to confirm it.
Twice
All transformations in Spark are lazily evaluated; they basically build up a lineage of instructions, and once an action is performed, Spark traverses the lineage from bottom to top until it finds materialized data.
Since df1 is not materialized, the same operations will be performed twice, once for each of the two lineages. If df1 is persisted, the first branch performs the full transformation, whereas the second only does the further computation by reusing it, as sketched below. You can see this in the DAG and SQL plan tabs.
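A sketch of the fix, in Scala for brevity (trans1/trans2/trans3 stand in for the asker's transformations; they are not defined here):
val df1 = trans1(df).persist()   // mark df1 for caching so both branches can reuse it
val df2 = trans2(df1)
val df3 = trans3(df1)
val df4 = df2.union(df3)         // unionAll was renamed to union in Spark 2
df4.collect()                    // single action: trans1 now runs only once
df1.unpersist()                  // release the cached blocks when done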
Please note that it is not always necessary to cache the data. Sometimes caching can cost more than recomputation, so you should cache if and only if the computation cost is higher.

Spark dataframe is being re-evaluated after a cache

I am running into some issues using cache on a Spark dataframe. My expectation is that after calling cache on a dataframe, the dataframe is created and cached the first time it is needed, and any further calls to the dataframe should be served from the cache.
Here's my code:
val mydf = spark.sql("read about 400 columns from a hive table").
  withColumn("newcol", someudf($"existingcol")).
  cache()
To test this, I ran mydf.count() twice. I would expect the first count to take some time since the data is being cached, but the second one should be instantaneous?
What I am actually seeing is that both counts take the same time. The first one comes back pretty quickly, which I think tells me that the data was not cached. If I remove the withColumn part of the code and just cache the raw data, the second count is instantaneous.
Am I doing something wrong? How can I load raw data from Hive, add columns, and then cache the dataframe for further use? I am using Spark 2.3.
Any help would be great!
The problem in your case is that mydf.count() does not actually materialize the dataframe (i.e. not all columns are read, and your udf will not be called). That is because count() is highly optimized.
To make sure the entire dataframe is cached into memory, you should repeat your experiment with mydf.rdd.count() or another query (e.g. one using sorting and/or aggregation).
See e.g. this SO question
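A sketch of that experiment, reusing the asker's mydf:
mydf.cache()       // mark for caching; nothing is stored yet
mydf.rdd.count()   // forces every row (including the UDF column) to be computed,
                   // so the full dataframe ends up in the cache
mydf.count()       // a second count can now be served from the cached data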
As you are caching a dataset/dataframe, see the documented default behavior:
def cache(): Dataset.this.type
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
So for your case you can try persist(StorageLevel.MEMORY_ONLY):
def persist(newLevel: StorageLevel): Dataset.this.type
Persist this Dataset with the given storage level.
newLevel One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
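Applied to the asker's code, that would look roughly like this (a sketch; the query text and someudf are the asker's placeholders, not defined here):
import org.apache.spark.storage.StorageLevel
import spark.implicits._   // for the $"colName" syntax

val mydf = spark.sql("select ... from some_hive_table")   // placeholder query, as in the question
  .withColumn("newcol", someudf($"existingcol"))          // someudf is the asker's UDF
  .persist(StorageLevel.MEMORY_ONLY)                      // keep it in memory only, no disk spill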
If it's relevant:
cache/persist is lazily evaluated; to force it you can use the Spark SQL API, which can switch from lazy to eager mode.
CACHE [ LAZY ] TABLE table_identifier
[ OPTIONS ( 'storageLevel' [ = ] value ) ] [ [ AS ] query ]
Unless LAZY is specified, it runs in eager mode; you need to register a temp table prior to this.
Pseudo code would be:
df.createOrReplaceTempView("dummyTbl")
spark.sql("cache table dummyTbl")
More in the documentation reference: https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-cache-table.html

How to Identify the list of available RDDs?

I am using the command below to get the list of available registered temp tables:
sqlContext.sql("show tables").collect().foreach(println)
Is there any similar command to get list of available RDDs?
Here is my requirement (using Scala):
1. Need to create some RDD on the fly
2. Identify list of available RDDs
3. remove/delete/clear the unwanted RDDs and move forward
How to delete an RDD in PySpark for the purpose of releasing resources?
As an additional note, I went through this link, but it doesn't answer all my questions. I also tried the code below, but I don't see any difference before and after unpersist, so I am not sure how to confirm that my RDD has released its memory:
val tempRDD1 = RDD1.reduceByKey((acc,value)=> acc+value)
tempRDD1.collect.foreach(println)
tempRDD1.unpersist()
tempRDD1.collect.foreach(println)
The RDD data is not saved until it is (1) persisted (cached) and (2) an action occurs to force the preceding transformations to run. If either of these does not happen, no data will be stored. Any RDD that appears to be "created" just builds an action plan to produce the data if it is needed later. This model is called lazy evaluation.
In your example, no RDD is ever cached, so no data is ever stored in memory, and the unpersist call has no effect.
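A sketch of what pinning and releasing actually look like (reusing the asker's RDD1; sc is the SparkContext):
val tempRDD1 = RDD1.reduceByKey((acc, value) => acc + value)
tempRDD1.persist()                    // step 1: mark the RDD for caching
tempRDD1.collect().foreach(println)   // step 2: an action materializes and stores it
println(sc.getPersistentRDDs)         // one way to list RDDs currently marked persistent
tempRDD1.unpersist()                  // now there is actually something to release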

DataFrame orderBy followed by limit in Spark

I have a program that generates a DataFrame, on which it runs something like
select(Col1, Col2, ...)
.orderBy(ColX).limit(N)
However, when I collect the data at the end, I find that it causes the driver to OOM if I take a large enough top N.
Another observation is that if I just sort, or just take the top rows, this problem does not happen; it only occurs when sort and limit are combined.
I am wondering why this happens. In particular, what is really going on underneath this combination of two transforms? How does Spark evaluate a query with both sorting and a limit, and what is the corresponding execution plan underneath?
Also, just curious: does Spark handle sort and top differently between DataFrame and RDD?
EDIT:
Sorry, I didn't mean collect;
what I originally meant is that the problem occurs when I call any action to materialize the data, regardless of whether it is collect (or any other action sending data back to the driver) or not (so the problem is definitely not the output size).
While it is not clear why this fails in this particular case, there are multiple issues you may encounter:
When you use limit, it simply puts all the data on a single partition, no matter how big n is. So while it doesn't explicitly collect, it is almost as bad.
On top of that, orderBy requires a full shuffle with range partitioning, which can result in different issues when the data distribution is skewed.
Finally, when you collect, the results can be larger than the amount of memory available on the driver.
If you collect anyway, there is not much you can improve here. At the end of the day driver memory will be the limiting factor, but there are still some possible improvements:
First of all, don't use limit.
Replace collect with toLocalIterator.
Use either orderBy |> rdd |> zipWithIndex |> filter (as sketched below) or, if the exact number of values is not a hard requirement, filter the data directly based on an approximated distribution, as shown in Saving a spark dataframe in multiple parts without repartitioning (in Spark 2.0.0+ there is a handy approxQuantile method).
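A sketch of the zipWithIndex variant (assuming a DataFrame df, a sort column colX, and a top-n value n; all three are placeholders):
import org.apache.spark.sql.functions.col

val n = 100000L                         // hypothetical top-n
val topN = df.orderBy(col("colX"))      // distributed sort with range partitioning
  .rdd
  .zipWithIndex()                       // attach a global position to each row
  .filter { case (_, idx) => idx < n }  // keep only the first n rows
  .map { case (row, _) => row }         // RDD[Row], still distributed across partitions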

Why does df.limit keep changing in Pyspark?

I'm creating a data sample from some dataframe df with
rdd = df.limit(10000).rdd
This operation takes quite some time (why, actually? can it not short-circuit after 10000 rows?), so I assume I have a new RDD now.
However, when I now work on rdd, it contains different rows every time I access it, as if it resamples over again. Caching the RDD helps a bit, but surely that's not safe?
What is the reason behind it?
Update: Here is a reproduction on Spark 1.5.2
from operator import add
from pyspark.sql import Row
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 100)
rdd1 = rdd.toDF().limit(1000).rdd
for _ in range(3):
    print(rdd1.map(lambda row: row.i).reduce(add))
The output is
499500
19955500
49651500
I'm surprised that .rdd doesn't fix the data.
EDIT:
To show that it gets trickier than a simple re-execution issue, here is a single action which produces incorrect results on Spark 2.0.0.2.5.0:
from pyspark.sql import Row
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 200)
rdd1 = rdd.toDF().limit(12345).rdd
rdd2 = rdd1.map(lambda x: (x, x))
rdd2.join(rdd2).count()
# result is 10240 despite doing a self-join
Basically, whenever you use limit, your results might be potentially wrong. I don't mean "just one of many samples", but really incorrect (since in this case the result should always be 12345).
Because Spark is distributed, in general it's not safe to assume deterministic results. Your example is taking the "first" 10,000 rows of a DataFrame. Here, there's ambiguity (and hence non-determinism) in what "first" means. That will depend on the internals of Spark. For example, it could be the first partition that responds to the driver. That partition could change with networking, data locality, etc.
Even once you cache the data, I still wouldn't rely on getting the same data back every time, though I certainly would expect it to be more consistent than reading from disk.
Spark is lazy, so each action you take recalculates the data returned by limit(). If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition (i.e. if your data is stored across 10 Parquet files, the first limit call might pull from file 1, the second from file 7, and so on).
From the Spark docs:
The LIMIT clause is used to constrain the number of rows returned by the SELECT statement. In general, this clause is used in conjunction with ORDER BY to ensure that the results are deterministic.
So you need to sort the rows beforehand if you want the call to .limit() to be deterministic. But there is a catch! If you sort by a column that doesn't have unique values for every row, the so-called "tied" rows (rows with the same sorting key value) will not be deterministically ordered, so .limit() might still be nondeterministic.
You have two options to work around this:
Make sure you include a unique row id in the sorting call.
For example, df.orderBy('someCol', 'rowId').limit(n).
You can define the rowId like so (assuming pyspark.sql.functions is imported as func):
df = df.withColumn('rowId', func.monotonically_increasing_id())
If you only need a deterministic result in a single run, you can simply cache the result of limit, df.limit(n).cache(), so that at least the results from that limit do not change due to consecutive action calls that would otherwise recompute limit and mess up the results.
