Apache Spark - How to use groupBy/groupByKey to form a (Key, List) pair - apache-spark

I have an org.apache.spark.sql.DataFrame = [id: bigint, name: string] in hand,
and the sample data in it looks like:
(1, "City1")
(2, "City3")
(1, "CityX")
(4, "CityZ")
(2, "CityN")
I am trying to form an output like
(1, ("City1", "CityX"))
(2, ("City3", "CityN"))
(4, ("CityZ"))
I tried the following variants
df.groupByKey.mapValues(_.toList).show(20, false)
df.groupBy("id").show(20, false)
df.rdd.groupByKey.mapValues(_.toList).show(20, false)
df.rdd.groupBy("id").show(20, false)
All of them fail, complaining either that groupBy/groupByKey is ambiguous or that the method is not found. Any help is appreciated.
I tried the solution posted in Spark Group By Key to (Key,List) Pair, but that doesn't work for me; it fails with the following error:
<console>:88: error: overloaded method value groupByKey with alternatives:
[K](func: org.apache.spark.api.java.function.MapFunction[org.apache.spark.sql.Row,K], encoder: org.apache.spark.sql.Encoder[K])org.apache.spark.sql.KeyValueGroupedDataset[K,org.apache.spark.sql.Row] <and>
[K](func: org.apache.spark.sql.Row => K)(implicit evidence$3: org.apache.spark.sql.Encoder[K])org.apache.spark.sql.KeyValueGroupedDataset[K,org.apache.spark.sql.Row]
cannot be applied to ()
Thanks.
Edit:
I did try the following:
val result = df.groupBy("id").agg(collect_list("name"))
which gives
org.apache.spark.sql.DataFrame = [id: bigint, collect_list(name): array<string>]
I am not sure how to work with this collect_list result. I am trying to dump it to a file by doing
result.rdd.coalesce(1).saveAsTextFile("test")
and I see the following
(1, WrappedArray(City1, CityX))
(2, WrappedArray(City3, CityN))
(4, WrappedArray(CityZ))
How do I dump it in the following format?
(1, (City1, CityX))
(2, (City3, CityN))
(4, (CityZ))

If you have an RDD of pairs, then you can use combineByKey(). To do this you have to pass three methods as arguments:
Method 1 takes a String, for example 'City1', as input, adds that String to an empty List and returns that list.
Method 2 takes a String, for example 'CityX', and one of the lists created by the previous method, adds the String to the list and returns the list.
Method 3 takes two lists as input and returns a new list with all the values from the two argument lists.
combineByKey will then return an RDD of (key, List) pairs.
However, in your case you are starting off with a DataFrame, which I do not have much experience with. I imagine that you will need to convert it to an RDD in order to use combineByKey(); a rough sketch of that is shown below.
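A minimal sketch, assuming the df from the question (id: bigint maps to a Scala Long); the three functions passed to combineByKey correspond to the three methods described above:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
// Convert the DataFrame to an RDD of (id, name) pairs.
val pairs: RDD[(Long, String)] = df.rdd.map { case Row(id: Long, name: String) => (id, name) }
val grouped: RDD[(Long, List[String])] = pairs.combineByKey(
  (name: String) => List(name),                      // method 1: start a list from the first value seen for a key
  (acc: List[String], name: String) => name :: acc,  // method 2: add a value to an existing list
  (a: List[String], b: List[String]) => a ::: b      // method 3: merge lists built on different partitions
)
// One possible way to write the text format asked for in the question's edit:
grouped
  .map { case (id, names) => s"($id, (${names.mkString(", ")}))" }
  .coalesce(1)
  .saveAsTextFile("test")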

Related

PySpark Reduce on RDD with only a single element

Is there any way to deal with RDDs that have only a single element (this can sometimes happen for what I am doing)? When that's the case, reduce stops working, as the operation requires 2 inputs.
I am working with key-value pairs such as:
(key1, 10),
(key2, 20),
And I want to aggregate their values, so the result should be:
30
But there are cases where the RDD contains only a single key-value pair, so reduce does not work, for example:
(key1, 10)
This will return nothing.
If you do a .values() before doing reduce, it should work even if there is only 1 element in the RDD:
from operator import add
rdd = sc.parallelize([('key1', 10),])
rdd.values().reduce(add)
# 10

How to find distinct values of multiple columns in Spark

I have an RDD and I want to find distinct values for multiple columns.
Example:
Row(col1=a, col2=b, col3=1), Row(col1=b, col2=2, col3=10), Row(col1=a1, col2=4, col3=10)
I would like to have a map:
col1=[a,b,a1]
col2=[b,2,4]
col3=[1,10]
Can a DataFrame help compute this faster or more simply?
Update:
My solution with RDD was:
def to_uniq_vals(row):
    return [(k, v) for k, v in row.items()]

rdd.flatMap(to_uniq_vals).distinct().collect()
Thanks
I hope I understand your question correctly. You can try the following:
import org.apache.spark.sql.{functions => F}
val df = Seq(("a", 1, 1), ("b", 2, 10), ("a1", 4, 10)).toDF()
df.select(F.collect_set("_1"), F.collect_set("_2"), F.collect_set("_3")).show
Results:
+---------------+---------------+---------------+
|collect_set(_1)|collect_set(_2)|collect_set(_3)|
+---------------+---------------+---------------+
| [a1, b, a]| [1, 2, 4]| [1, 10]|
+---------------+---------------+---------------+
The code above should be more efficient than the proposed column-by-column select distinct for several reasons (a sketch of that column-by-column alternative is shown below for comparison):
Fewer worker-to-driver round trips.
De-duping is done locally on each worker before the inter-worker de-duplication.
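For comparison, a rough sketch of the column-by-column alternative, assuming the same df as above; it runs a separate distinct job and collect per column:
// One distinct + collect per column, shown only for comparison.
val distinctPerColumn: Map[String, Seq[Any]] =
  df.columns.map { c =>
    c -> df.select(c).distinct.collect.map(_.get(0)).toSeq
  }.toMap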
Hope it helps!
You can use dropDuplicates and then select the same columns. It might not be the most efficient way, but it is still a decent one:
df.dropDuplicates("col1","col2", .... "colN").select("col1","col2", .... "colN").toJSON
This works well using Scala.

Check Spark Dataframe row has ANY column meeting a condition and stop when first such column found

The following code can be used to filter rows that contain a value of 1. Imagine there are a lot of columns.
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.{col, when}

val df = sc.parallelize(Seq(
  ("r1", 1, 1),
  ("r2", 6, 4),
  ("r3", 4, 1),
  ("r4", 1, 2)
)).toDF("ID", "a", "b")
val ones = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)
df.withColumn("ones", ones).where($"ones" === 0).show
The downside here is that it should ideally stop when the first such condition is met, i.e. at the first matching column found. OK, we all know that.
But I cannot find an elegant method to achieve this without presumably using a UDF or very specific logic; the map will process all columns.
Can a fold(Left) therefore be used that terminates when the first occurrence is found? Or some other approach? Maybe I am overlooking something.
My first idea was to use logical expressions and hope for short-circuiting, but it seems Spark does not do this:
df
  .withColumn("ones", df.columns.tail.map(x => when(col(x) === 1, true).otherwise(false)).reduceLeft(_ or _))
  .where(!$"ones")
  .show()
But I'm not sure whether Spark supports short-circuiting; I think not (https://issues.apache.org/jira/browse/SPARK-18712).
So alternatively, you can apply a custom function to your rows using the lazy exists on Scala's Seq:
df
  .map { r => (r.getString(0), r.toSeq.tail.exists(c => c.asInstanceOf[Int] == 1)) }
  .toDF("ID", "ones")
  .show()
This approach is similar to a UDF, so I am not sure if that's what you would accept.

How to efficiently select distinct rows on an RDD based on a subset of its columns

Consider a Case Class:
case class Prod(productId: String, date: String, qty: Int, many other attributes ..)
And an
val rdd: RDD[Prod]
containing many instances of that class.
The unique key is intended to be the (productId, date) tuple. However, we do have some duplicates.
Is there any efficient means to remove the duplicates?
The operation
rdd.distinct
would look for entire rows that are duplicated.
A fallback would involve joining the unique (productId, date) combinations back to the entire rows; I am working through exactly how to do this. But even so, it takes several operations. A simpler approach (faster as well?) would be useful if it exists.
I'd use dropDuplicates on Dataset:
val rdd = sc.parallelize(Seq(
  Prod("foo", "2010-01-02", 1), Prod("foo", "2010-01-02", 2)
))
rdd.toDS.dropDuplicates("productId", "date")
but reduceByKey should work as well:
rdd.keyBy(prod => (prod.productId, prod.date)).reduceByKey((x, _) => x).values

Can a DataFrame be converted to Dataset of a case class if a column name contains a space?

I have a Spark DataFrame where a column name contains a space. Is it possible to convert these rows into case classes?
For example, if I do this:
val data = Seq(1, 2, 3).toDF("a number")
case class Record(`a number`: Int)
data.as[Record]
I get this exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`a$u0020number`' given input columns: [a number];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
...
Is there any way to do this?
(Of course I can work around this by renaming the column before converting to a case class. I was hoping to have the case class match the input schema exactly.)
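For reference, a minimal sketch of that rename workaround (the RecordNoSpace case class name is only illustrative, and spark.implicits._ is assumed to be in scope, as in the Spark shell):
// Hypothetical case class without the space, shown only to illustrate the workaround.
case class RecordNoSpace(aNumber: Int)
val workaround = data
  .withColumnRenamed("a number", "aNumber")  // drop the space before converting
  .as[RecordNoSpace]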
Can you try this solution? It worked for me without changing the column name.
import sqlContext.implicits._
case class Record(`a number`: Int)
val data = Seq(1, 2, 3)
val recDF = data.map(x => Record(x)).toDF()
recDF.collect().foreach(println)
[1]
[2]
[3]
I'm using Spark 1.6.0. The only part of your code that doesn't work for me is the part where you're setting up your test data. I have to use a sequence of tuples instead of a sequence of integers:
case class Record(`a number`:Int)
val data = Seq(Tuple1(1),Tuple1(2),Tuple1(3)).toDF("a number")
data.as[Record]
// returns org.apache.spark.sql.Dataset[Record] = [a$u0020number: int]
If you need a DataFrame instead of a Dataset, you can always use another toDF:
data.as[Record].toDF
