I would like to reuse my DataFrame (without falling back to doing this with the "map" function on an RDD/Dataset), which I am marking as broadcastable, but it seems Spark keeps broadcasting it again and again.
Having a table "bank" (test table). I perform the following:
import org.apache.spark.sql.functions.broadcast
val cachedDf = spark.sql("select * from bank").cache
val dfBroadcasted = broadcast(cachedDf)
val dfNormal = spark.sql("select * from bank")
dfNormal.join(dfBroadcasted, List("age"))
.join(dfBroadcasted, List("age")).count
I'm caching beforehand just in case it makes a difference, but it's the same with or without.
If I execute the above code, I see the following SQL plan:
As you can see, my broadcasted DF is broadcast TWICE, with different timings (if I add more actions afterwards, it is broadcast again too).
I care about this, because I actually have a long-running program which has a "big" DataFrame which I can use to filter out HUGE DataFrames, and I would like that "big" DataFrame to be reused.
Is there a way to force reuse? (Not only inside the same action but also between actions, although I could live with reuse inside the same action only.)

Ok, updating the question.
INSIDE the same action, left_semi joins will reuse broadcasts, while normal/left joins won't. I'm not sure whether this is because Spark already knows the columns of that DF won't affect the output at all, so it can reuse the broadcast, or whether it's just an optimization Spark is missing.
My problem seems mostly solved, although it would be great if someone knew how to keep the broadcast across actions.
If I use left_semi (which is the join I'm going to use in my real app), the broadcast is only performed once.
dfNormalxx.join(dfBroadcasted, Seq("age"),"left_semi")
.join(dfBroadcasted, Seq("age"),"left_semi").count
The plan becomes (I also changed the size so it matches my real one, but this made no difference):
Also, the total wall time is much better when using "left_semi" (I set 1 executor so it doesn't get parallelized; I just wanted to check whether the job was really being done twice):
Even though my collect takes 10 seconds, this will speed up table reads + groupBys, which are taking around 6-7 minutes.


Forcing pyspark join to occur sooner

PROBLEM: I have two tables that are vastly different in size. I want to join them on some id using a left outer join. Unfortunately, for some reason, even after caching, the operations I run after the join are executed on all records, even though I only want the ones from the left table. See below:
1. How can I set this up so only the records that match the left table get processed through the costly wrangling steps?
LARGE_TABLE => ~900M records
SMALL_TABLE => 500K records
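from pyspark.sql.functions import col, lag, lit, when
from pyspark.sql.window import Window
# the join key goes in `on`; 'left_outer' is the valid join-type string
combined = SMALL_TABLE.join(LARGE_TABLE, on='id', how='left_outer')
w = Window.partitionBy("id").orderBy(col("date_time"))
data = combined.withColumn('diff_id_flag', when(lag('id').over(w) != col('id'), lit(1)).otherwise(lit(0)))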
Unfortunately, my execution plan shows the expensive transformation operation above is being done on ~900M records. I find this odd since I ran df.count() to force the join to execute eagerly rather than lazily.
Any Ideas?
- note that the expensive transformation in my code flow occurs after the join (at least that is how I interpret it) but my DAG shows the expensive transformation occurring as a part of the join. This is exactly what I want to avoid as the transformation is expensive. I want the join to execute and THEN the result of that join to be run through the expensive transformation.
- Assume the smaller table CANNOT fit into memory.
The best way to do this is to broadcast the tiny dataframe. Caching is good for multiple actions, which doesn't seem to apply to your particular use case.
df.count has no effect on the execution plan at all. It is just an expensive operation executed for no good reason.
Applying the window function in this case requires the same logic as the join. Because you join by id and partitionBy id, both stages require the same hash partitioning and a full data scan of both sides. There is no good reason to separate the two.
In practice the join logic should be applied before the window, serving as a filter for the downstream transformations in the same stage.

Spark 1.6 Dataframe cache not working correctly

My understanding is that if I have a dataframe, cache() it, and trigger an action like df.take(1) or df.count(), it should compute the dataframe and save it in memory, and whenever that cached dataframe is used later in the program, the already computed dataframe is read from the cache.
But that is not how my program is working.
I have a dataframe like the one below, which I am caching, and then I immediately run a df.count action.
val df ="newcol" , "").cache()
When I run the program, in the Spark UI I see that the first line runs for 4 min, and
when it comes to the second line it again runs for 4 min. So basically the first line is computed twice?
Shouldn't the first line be computed and cached when the second line triggers?
How can I resolve this behavior? I am stuck, please advise.
My understanding is that if I have a dataframe if I cache() it and trigger an action like df.take(1) or df.count() it should compute the dataframe and save it in memory,
It is not correct. Simple cache and count (take wouldn't work on RDD either) is a valid method for RDDs but it is not the case with Datasets, which use much more advanced optimizations. With query:"newcol" , "").count()
any column, which is not used in where clause can be ignored.
There is an important discussion on the developer list and quoting Sean Owen
I think the right answer is "don't do that" but if you really had to you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable because the whole partition has to be computed to make it available in practice. Or, go so far as to loop over every element.
Translated to code:
df.foreach(_ => ())
There is
sqlContext.sql("CACHE TABLE df")
which is eager but it is no longer (Spark 2 and forward) documented and should be avoided.
No, if you call cache on a DataFrame it is not cached at that moment; it is only "marked" for potential future caching. The actual caching is only done when an action is run later. You can also see your cached DataFrame in the Spark UI under "Storage".
Another problem in your code is that count on a DataFrame does not compute the entire DataFrame, because not all columns need to be computed for that. You can use df.rdd.count() to force the entire evaluation (see How to force DataFrame evaluation in Spark).
The question is why your first operation takes so long, even if no action is called. I think this is related to the caching logic (e.g. size estimations etc) being computed when calling cache (see eg. Why is slow when rdd items are big?)

DataFrame orderBy followed by limit in Spark

I have a program that generates a DataFrame, on which it will run something like
Select Col1, Col2...
orderBy(ColX) limit(N)
However, when I collect the data at the end, I find that it causes the driver to OOM if I take a large enough top N.
Another observation is that if I do just the sort or just the top, this problem does not happen. So it happens only when there is a sort and a top at the same time.
I am wondering why this could be happening. In particular, what is really going on underneath this combination of two transforms? How does Spark evaluate a query with both sorting and limit, and what is the corresponding execution plan underneath?
Also, just curious: does Spark handle sort and top differently between DataFrame and RDD?
Sorry, I didn't mean collect specifically.
What I originally meant is that this happens when I call any action to materialize the data, regardless of whether it is collect (or any other action sending data back to the driver) or not. So the problem is definitely not the output size.
While it is not clear why this fails in this particular case, there are multiple issues you may encounter:
When you use limit, it simply puts all the data on a single partition, no matter how big n is. So while it doesn't explicitly collect, it is almost as bad.
On top of that, orderBy requires a full shuffle with range partitioning, which can cause further issues when the data distribution is skewed.
Finally, when you collect, the results can be larger than the amount of memory available on the driver.
If you collect anyway, there is not much you can improve here. At the end of the day driver memory will be a limiting factor, but there are still some possible improvements:
First of all don't use limit.
Replace collect with toLocalIterator.
Use either orderBy |> rdd |> zipWithIndex |> filter or, if the exact number of values is not a hard requirement, filter the data directly based on an approximated distribution, as shown in Saving a spark dataframe in multiple parts without repartitioning (in Spark 2.0.0+ there is a handy approxQuantile method).
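The orderBy |> rdd |> zipWithIndex |> filter chain from the last point could look roughly like this in PySpark. This is only a sketch: the helper name top_n_sorted is made up for illustration, and df is assumed to be an ordinary PySpark DataFrame with a sortable column.

```python
def top_n_sorted(df, sort_col, n):
    """Return the first n rows of df ordered by sort_col, as an RDD of Rows.

    Uses only DataFrame.orderBy, RDD.zipWithIndex, RDD.filter and RDD.keys,
    so no limit() is involved. The function name is hypothetical.
    """
    # Pair every row with its position in the globally sorted order.
    indexed = df.orderBy(sort_col).rdd.zipWithIndex()
    # Keep the rows whose position is below n, then drop the index again.
    return indexed.filter(lambda pair: pair[1] < n).keys()
```

Combined with the previous point, the result can then be consumed with top_n_sorted(df, "ColX", N).toLocalIterator() instead of collect(), so that the driver receives one partition at a time.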

Why does df.limit keep changing in Pyspark?

I'm creating a data sample from some dataframe df with
rdd = df.limit(10000).rdd
This operation takes quite some time (why actually? can it not short-cut after 10000 rows?), so I assume I have a new RDD now.
However, when I now work on rdd, it contains different rows every time I access it, as if it resamples over and over again. Caching the RDD helps a bit, but surely that's not safe?
What is the reason behind it?
Update: Here is a reproduction on Spark 1.5.2
from operator import add
from pyspark.sql import Row
rdd=sc.parallelize([Row(i=i) for i in range(1000000)],100)
for _ in range(3):
print( row:row.i).reduce(add))
The output is
I'm surprised that .rdd doesn't fix the data.
To show that it gets more tricky than the re-execution issue, here is a single action which produces incorrect results on Spark
from pyspark.sql import Row
rdd=sc.parallelize([Row(i=i) for i in range(1000000)],200)
rdd1=rdd.toDF().limit(12345).rdd x:(x,x))
# result is 10240 despite doing a self-join
Basically, whenever you use limit your results might be wrong. I don't mean "just one of many samples", but really incorrect (since in this case the result should always be 12345).
Because Spark is distributed, in general it's not safe to assume deterministic results. Your example is taking the "first" 10,000 rows of a DataFrame. Here, there's ambiguity (and hence non-determinism) in what "first" means. That will depend on the internals of Spark. For example, it could be the first partition that responds to the driver. That partition could change with networking, data locality, etc.
Even once you cache the data, I still wouldn't rely on getting the same data back every time, though I certainly would expect it to be more consistent than reading from disk.
Spark is lazy, so each action you take recalculates the data returned by limit(). If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition (i.e. if your data is stored across 10 Parquet files, the first limit call might pull from file 1, the second from file 7, and so on).
From the Spark docs:
The LIMIT clause is used to constrain the number of rows returned by the SELECT statement. In general, this clause is used in conjunction with ORDER BY to ensure that the results are deterministic.
So you need to sort the rows beforehand if you want the call to .limit() to be deterministic. But there is a catch! If you sort by a column that doesn't have unique values for every row, the so-called "tied" rows (rows with the same sorting key value) will not be deterministically ordered, so the .limit() might still be nondeterministic.
You have two options to work around this:
Make sure you include a unique row id in the sorting call.
For example df.orderBy('someCol', 'rowId').limit(n).
You can define the rowId like so:
from pyspark.sql import functions as func
df = df.withColumn('rowId', func.monotonically_increasing_id())
If you only need a deterministic result within a single run, you could simply cache the result of limit, df.limit(n).cache(), so that at least the rows from that limit do not change across consecutive actions that would otherwise recompute the limit and mess up the results.
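The effect of ties can be seen without Spark at all. The following plain-Python sketch (made-up rows, hypothetical helper first_n) mimics orderBy(...).limit(n) on the same rows arriving in two different orders, the way partitions may arrive differently on a recomputation.

```python
import random

def first_n(rows, n, key):
    """Sort rows by key and keep the first n, like orderBy(...).limit(n)."""
    return sorted(rows, key=key)[:n]

def by_col_only(row):
    return row[1]            # non-unique sort key: ties are possible

def by_col_and_id(row):
    return (row[1], row[0])  # unique row id added as a tie-breaker

# Hypothetical rows of (row_id, some_col); some_col takes only three values.
rows = [(row_id, row_id % 3) for row_id in range(12)]

# The same rows in a different arrival order.
reshuffled = rows[:]
random.Random(0).shuffle(reshuffled)

# Sorting on some_col alone: which tied row makes the cut depends on arrival order.
unstable_a = first_n(rows, 5, by_col_only)
unstable_b = first_n(reshuffled, 5, by_col_only)

# With the tie-breaker the result is independent of arrival order.
stable_a = first_n(rows, 5, by_col_and_id)
stable_b = first_n(reshuffled, 5, by_col_and_id)
```

Here stable_a and stable_b are always identical, while unstable_a and unstable_b are only guaranteed to agree on the rows that are not part of a tie at the cut-off.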

reducer concept in Spark

I'm coming from a Hadoop background and have limited knowledge of Spark. Based on what I have learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to the mappers and the driver is (somehow) similar to the reducer. As there is only one driver program, there will be only one reducer. If so, how can simple programs like word count on very big data sets be done in Spark? The driver could simply run out of memory.
The driver is more of a controller of the work, only pulling data back if the operator calls for it. If the operator you're working on returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type then it will indeed pull all of the data back.
Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the points of shuffle by the stage splits, either in the UI or via toDebugString (where each indentation level is a shuffle).
All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.
Last, to equate to your word count example:
sc.textFile(path)  // path: input location, not given in the original
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
In the above, the data loading (textFile), splitting (flatMap), and mapping are done in one stage, as they can all be done independently of the rest of the data. No shuffle is needed until reduceByKey is called, as it needs to combine all of the data to perform the operation... HOWEVER, this operation has to be associative for a reason. Each node performs the operation defined in reduceByKey locally, and only the final data set is merged afterwards. This reduces both memory and network overhead.
NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data does NOT pull back to the driver, it merely moves to nodes that have the same keys so that it can have its final value merged.
Now, if you use an action such as reduce or worse yet, collect, then you will NOT get an RDD back which means the data pulls back to the driver and you will need room for it.
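To make the "combine locally, merge afterwards" idea concrete, here is a plain-Python sketch that does not use Spark. The two-partition input and the function names local_combine and merge are made up for illustration. In real Spark the partial results are shuffled by key to different nodes; the sketch merges them in one place only to keep it short.

```python
from collections import Counter
from functools import reduce

def local_combine(partition):
    """Per partition: split lines into words and pre-aggregate the counts locally."""
    counts = Counter()
    for line in partition:
        for word in line.split(" "):
            counts[word] += 1
    return counts

def merge(left, right):
    """Merge two partial results; valid in any grouping because addition is associative."""
    return left + right

# Hypothetical input, already split into two partitions.
partitions = [
    ["a b a", "c a"],
    ["b b", "a c"],
]

# Only the small per-partition summaries are merged, not the raw words.
partials = [local_combine(p) for p in partitions]
word_counts = dict(reduce(merge, partials, Counter()))
```

Each partition ships at most one (word, count) pair per distinct word instead of one pair per occurrence, which is where the savings in memory and network traffic come from.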
Here is my fuller explanation of reduceByKey if you want more. Or how this breaks down in something like combineByKey
