Spark DataTables: where is partitionBy? - apache-spark

A common Spark processing flow we have is something like this:
Loading:
rdd = sqlContext.parquetFile("mydata/")
rdd = rdd.map(lambda row: (row.id,(some stuff)))
rdd = rdd.filter(....)
rdd = rdd.partitionBy(rdd.getNumPatitions())
Processing by id (this is why we do the partitionBy above!)
rdd.reduceByKey(....)
rdd.join(...)
However, Spark 1.3 changed sqlContext.parquetFile to return DataFrame instead of RDD, and it no longer has the partitionBy, getNumPartitions, and reduceByKey methods.
What do we do now with partitionBy?
We can replace the loading code with something like
rdd = sqlContext.parquetFile("mydata/").rdd
rdd = rdd.map(lambda row: (row.id,(some stuff)))
rdd = rdd.filter(....)
rdd = rdd.partitionBy(rdd.getNumPatitions())
df = rdd.map(lambda ...: Row(...)).toDF(???)
and use groupBy instead of reduceByKey.
Is this the right way?
PS. Yes, I understand that partitionBy is not necessary for groupBy et al. However, without a prior partitionBy, each join, groupBy &c may have to do cross-node operations. I am looking for a way to guarantee that all operations requiring grouping by my key will run local.

It appears that, since version 1.6, repartition(self, numPartitions, *cols) does what I need:
.. versionchanged:: 1.6
Added optional arguments to specify the partitioning columns.
Also made numPartitions optional if partitioning columns are specified.

Since DataFrame provide us an abstraction of Table and Column over RDD, the most convenient way to manipulate DataFrame is to use these abstraction along with the specific table manipulations methods that DataFrame enables us.
On a DataFrame, we could:
transform the table schema with select() \ udf() \ as()
filter rows out by filter() or where()
fire an aggregation through groupBy() and agg()
or other analytic job using sample() \ join() \ union()
persist your result using saveAsTable() \ saveAsParquet() \ insertIntoJDBC()
Please refer to Spark SQL and DataFrame Guide for more details.
Therefore, a common job looks like:
val people = sqlContext.parquetFile("...")
val department = sqlContext.parquetFile("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), "gender")
.agg(avg(people("salary")), max(people("age")))
And for your specific requirements, this could look like:
val t = sqlContext.parquetFile()
t.filter().select().groupBy().agg()

Related

How to join efficiently 2 Spark dataframes partitioned by some column, when that column is one of multiple join keys?

I am currently facing some issues in Spark 3.0.2 to efficiently join 2 Spark dataframes when
The 2 Spark DataFrames are partitioned by some key id;
id is part of the join key, but it is not the only one.
My intuition is telling me that the query optimizer is, in this case, not choosing the optimal path. I will illustrate my issue through a minimal example (note that this particular example does not really require a join, it's just for illustrative purposes).
Let's start from the simple case: the 2 dataframes are partitioned by id, and we join by id only:
from pyspark.sql import SparkSession, Row, Window
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
# Make up some test dataframe
df = spark.createDataFrame([Row(id=i // 10, order=i % 10, value=i) for i in range(10000)])
# Create the left side of the join (repartitioned by id)
df2 = df.repartition(50, 'id')
# Create the right side of the join (also repartitioned by id)
df3 = df2.select('id', F.col('order').alias('order_alias'), F.lit(0).alias('dummy'))
# Perform the join
joined_df = df2.join(df3, on='id')
joined_df.foreach(lambda x: None)
This results in the following efficient plan:
This plan is efficient: it recognizes that the 2 dataframes are already partitioned by the join key and avoids to re-shuffle them. The 2 dataframes are not only repartitioned, but also colocated.
What happens if there is an additional join key? It results in an inefficient plan:
joined_df = df2.join(df3, on=[df2.id==df3.id, df2.order==df3.order_alias])
joined_df.foreach(lambda x: None)
The plan is inefficient since it is repartitioning the 2 dataframes to do the join. This does not make sense to me. Intuitively, we could use the existing partitions: all keys to be joined will be found in the same partition as before, there is just one additional condition to apply! So I thought: perhaps we could phrase the 2nd condition as a filter?
joined_df.foreach(lambda x: None)
joined_df = df2.join(df3, on='id')
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
This however results in the same inefficient plan, since Spark query optimizer will just merge the 2nd filter with the join.
So, I finally thought that maybe I could force Spark to process the join as I want by adding a dummy cache step, by trying the following:
from pyspark import StorageLevel
joined_df = df2.join(df3, on='id')
# Note that this storage level will not cache anything, it's just to suggest to Spark that I need this intermediate result
joined_df.persist(StorageLevel(False, False, False, False))
# Do the filtering after "persisting" the join
joined_df_filtered = joined_df.filter(df2.order==df3.order_alias)
joined_df_filtered.foreach(lambda x: None)
This results in an efficient plan! It is in fact much faster than the previous ones.
The workaround of "persisting" the first join to force Spark to use a more efficient processing plan is "good enough" for my use case, but I still have a few questions:
Am I missing something in my intuition that Spark should actually be reusing partitions when the partition key is part of the join key, instead of re-shuffling?
Is this expected behavior of the query optimizer? Should a ticket be filed for it?
Is there a better way to force the desired processing plan than adding the "persist" step? It seems more like an indirect workaround than a direct solution.

Caching in spark before diverging the flow

I have a basic question regarding working with Spark DataFrame.
Consider the following piece of pseudo code:
val df1 = // Lazy Read from csv and create dataframe
val df2 = // Filter df1 on some condition
val df3 = // Group by on df2 on certain columns
val df4 = // Join df3 with some other df
val subdf1 = // All records from df4 where id < 0
val subdf2 = // All records from df4 where id > 0
* Then some more operations on subdf1 and subdf2 which won't trigger spark evaluation yet*
// Write out subdf1
// Write out subdf2
Suppose I start of with main dataframe df1(which I lazy read from the CSV), do some operations on this dataframe (filter, groupby, join) and then comes a point where I split this datframe based on a condition (for eg, id > 0 and id < 0). Then I further proceed to operate on these sub dataframes(let us name these subdf1, subdf2) and ultimately write out both the sub dataframes.
Notice that the write function is the only command that triggers the spark evaluation and rest of the functions(filter, groupby, join) result in lazy evaluations.
Now when I write out subdf1, I am clear that lazy evaluation kicks in and all the statements are evaluated starting from reading of CSV to create df1.
My question comes when we start writing out subdf2. Does spark understand the divergence in code at df4 and store this dataframe when command for writing out subdf1 was encountered? Or will it again start from the first line of creating df1 and re-evaluate all the intermediary dataframes?
If so, is it a good idea to cache the dataframe df4(Assuming I have sufficient memory)?
I'm using scala spark if that matters.
Any help would be appreciated.
No, Spark cannot infer that from your code. It will start all over again. To confirm this, you can do subdf1.explain() and subdf2.explain() and you should see that both dataframes have query plans that start right from the beginning where df1 was read.
So you're right that you should cache df4 to avoid redoing all the computations starting from df1, if you have enough memory. And of course, remember to unpersist by doing df4.unpersist() at the end if you no longer need df4 for any further computations.

Spark scala partition dataframe for large cross joins

I have two dataframes that need to be cross joined on a 20-node cluster. However because of their size, a simple crossjoin is failing. I am looking to partition the data and perform the crossjoin and am looking for an efficient way to do it.
Simple Algorithm
Manually split file f1 into three and read into dataframes: df1A, df1B, df1C. Manually split file f2 into four and ready into dataframes: df2A, df2B, df2C, df2D. Cross join df1A X df2A, df1A X df2B,..,df1A X df2D,...,df1C X df2D. Save each cross join in a file and manually put together all files. This way Spark can perform each cross join parallely and things should complete fairly quickly.
Question
Is there is more efficient way of accomplishing this by reading both files into two dataframes, then partitioning each dataframe into 3 and 4 "pieces" and for each partition of one dataframe cross join with every partition of the other dataframe?
Data frame can be partitioned ether range or hash .
val df1 = spark.read.csv("file1.txt")
val df2 = spark.read.csv("file2.txt")
val partitionedByRange1 = df1.repartitionByRange(3, $"k")
val partitionedByRange2 = df2.repartitionByRange(4, $"k")
val result =partitionedByRange1.crossJoin(partitionedByRange2);
NOTE : set property spark.sql.crossJoin.enabled=true
You can convert this in to a rdd and then use cartesian operation on that RDD. You should then be able to save that RDD to a file. Hope that helps

How to do stateless aggregations in spark using Structured Streaming 2.3.0 without using flatMapsGroupWithState?

How to do stateless aggregations in spark using Structured Streaming 2.3.0 without using flatMapsGroupWithState or Dstream API? looking for a more declarative way
Example:
select count(*) from some_view
I want the output to just count whatever records are available in each batch but not aggregate from the previous batch
To do stateless aggregations in spark using Structured Streaming 2.3.0 without using flatMapsGroupWithState or Dstream API, you can use following code-
import spark.implicits._
def countValues = (_: String, it: Iterator[(String, String)]) => it.length
val query =
dataStream
.select(lit("a").as("newKey"), col("value"))
.as[(String, String)]
.groupByKey { case(newKey, _) => newKey }
.mapGroups[Int](countValues)
.writeStream
.format("console")
.start()
Here what we are doing is-
We added one column to our datastream - newKey. We did this so that we can do a groupBy over it, using groupByKey. I have used a literal string "a", but you can use anything. Also, you need to select anyone column from the available columns in datastream. I have selected value column for this purpose, you can select anyone.
We created a mapping function - countValues, to count the values aggregated by groupByKey function by writing it.length.
So, in this way, we can count whatever records are available in each batch but not aggregating from the previous batch.
I hope it helps!

First element of each dataframe partition Spark 2.0

I need to retrieve the first element of each dataframe partition.
I know that I need to use mapPartitions but it is not clear for me how to use it.
Note: I am using Spark2.0, the dataframe is sorted.
I believe it should look something like following:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
...
implicit val encoder = RowEncoder(df.schema)
val newDf = df.mapPartitions(iterator => iterator.take(1))
This will take 1 element from each partition in DataFrame. Then you can collect all the data to your driver i.e.:
nedDf.collect()
This will return you an array with a number of elements equal to number of your partitions.
UPD updated in order to support Spark 2.0

Resources