GroupByKey and create lists of values pyspark sql dataframe - apache-spark

So I have a spark dataframe that looks like:
a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7
And I want to group by column a, create a list of values from column b, and forget about c. The output dataframe would be:
a | b_list
5 | (2,4)
2 | (4,3)
How would I go about doing this with a pyspark sql dataframe?
Thank you! :)

Here are the steps to get that DataFrame:
>>> from pyspark.sql import functions as F
>>>
>>> d = [{'a': 5, 'b': 2, 'c':1}, {'a': 5, 'b': 4, 'c':3}, {'a': 2, 'b': 4, 'c':2}, {'a': 2, 'b': 3,'c':7}]
>>> df = spark.createDataFrame(d)
>>> df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  5|  2|  1|
|  5|  4|  3|
|  2|  4|  2|
|  2|  3|  7|
+---+---+---+
>>> df1 = df.groupBy('a').agg(F.collect_list("b"))
>>> df1.show()
+---+---------------+
|  a|collect_list(b)|
+---+---------------+
|  5|         [2, 4]|
|  2|         [4, 3]|
+---+---------------+
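If you want the column to be named b_list as in the question, you can add an alias inside agg; the rows are the same as above, only the header changes:
>>> df.groupBy('a').agg(F.collect_list('b').alias('b_list')).show()
+---+------+
|  a|b_list|
+---+------+
|  5|[2, 4]|
|  2|[4, 3]|
+---+------+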

Related

calling pyspark udf with multiple columns

The below UDF doesn't work - am I passing 2 columns in correctly & calling the function in the right way?
Thanks!!
def shield(x, y):
    if x == '':
        shield = y
    else:
        shield = x
    return shield

df3.withColumn("shield", shield(df3.custavp1, df3.custavp1))
I think the problem is how the function is being called: it needs to be registered as a udf rather than called directly on the columns. The correct way is shown below:
>>> ls
[[1, 2, 3, 4], [5, 6, 7, 8]]
>>> from pyspark.sql import Row
>>> R = Row("A1", "A2")
>>> df = sc.parallelize([R(*r) for r in zip(*ls)]).toDF()
>>> df.show
<bound method DataFrame.show of DataFrame[A1: bigint, A2: bigint]>
>>> df.show()
+---+---+
| A1| A2|
+---+---+
|  1|  5|
|  2|  6|
|  3|  7|
|  4|  8|
+---+---+
>>> def foo(x, y):
...     if x % 2 == 0:
...         return x
...     else:
...         return y
...
>>>
>>> from pyspark.sql.functions import udf, col
>>> from pyspark.sql.types import IntegerType
>>>
>>> custom_udf = udf(foo, IntegerType())
>>> df1 = df.withColumn("res", custom_udf(col("A1"), col("A2")))
>>> df1.show()
+---+---+---+
| A1| A2|res|
+---+---+---+
|  1|  5|  5|
|  2|  6|  2|
|  3|  7|  7|
|  4|  8|  4|
+---+---+---+
Let me know if it helps.
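Applied to the shield function from the question, the same pattern would look roughly like this (a sketch: df3 and custavp1 are the names from the question, and StringType is an assumption about the column type):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def shield(x, y):
    # fall back to y when x is an empty string
    return y if x == '' else x

shield_udf = udf(shield, StringType())

# custavp1 is passed twice here, mirroring the call in the question
df3 = df3.withColumn("shield", shield_udf(col("custavp1"), col("custavp1")))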

parse values from list in dataframe column

I have a pyspark dataframe like the input dataframe below. It has a column colA whose values are lists of numbers. I would like to create a new column colC that puts each number from the list in colA on its own row, like the example output dataframe below. Can anyone suggest how to do this?
input dataframe:
colA colB
[1,2] 1
[3,2,4] 2
output dataframe:
colA colB colC
[1,2] 1 1
[1,2] 1 2
[3,2,4] 2 3
[3,2,4] 2 2
[3,2,4] 2 4
This can be done with the explode function:
from pyspark.sql.functions import explode
df.withColumn("colC", explode(df.colA)).show()
Output:
+---------+----+----+
|     colA|colB|colC|
+---------+----+----+
|   [1, 2]|   1|   1|
|   [1, 2]|   1|   2|
|[3, 2, 4]|   2|   3|
|[3, 2, 4]|   2|   2|
|[3, 2, 4]|   2|   4|
+---------+----+----+
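For a self-contained run, the question's input dataframe can be recreated and exploded like this (a sketch assuming a SparkSession named spark):
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# recreate the question's input dataframe
df = spark.createDataFrame([([1, 2], 1), ([3, 2, 4], 2)], ["colA", "colB"])

# explode emits one output row per element of the array in colA
df.withColumn("colC", explode(df.colA)).show()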

How to apply similar RDD reduceByKey on Dataframes [duplicate]

This question is a duplicate of the first question and answer above; the same df.groupBy('a').agg(F.collect_list("b")) approach applies.

How can I check if an instance is in a dataframe in Pyspark?

I have an instance extracted from a dataframe df1 and I want to check whether that instance is in another dataframe df2 in PySpark. Is there a way to do this?
For example:
Instance:
+------+------+------+
| Atr1 | Atr2 | Atr3 |
+------+------+------+
| 'A'  | 2    | 'B'  |
+------+------+------+
Dataframe:
+------+------+------+
| Atr1 | Atr2 | Atr3 |
+------+------+------+
| 'C'  | 1    | 'B'  |
+------+------+------+
| 'D'  | 2    | 'A'  |
+------+------+------+
| 'E'  | 2    | 'C'  |
+------+------+------+
| 'A'  | 2    | 'B'  |
+------+------+------+
In this case, I want to get True because the instance is in the dataframe (4th row).
Thanks.
You can take the intersection of df1 and df2 and check whether the count of df1 equals the count of the intersection, as follows:
>>> df1 = spark.createDataFrame(sc.parallelize([['A', 2, 'B']]), ['Atr1', 'Atr2', 'Atr3'])
>>> df2 = spark.createDataFrame(sc.parallelize([['C',1,'B'],['D',2,'A'],['E',2,'C'],['A',2,'B']]), ['Atr1', 'Atr2', 'Atr3'])
>>> df1.show()
+----+----+----+
|Atr1|Atr2|Atr3|
+----+----+----+
|   A|   2|   B|
+----+----+----+
>>> df2.show()
+----+----+----+
|Atr1|Atr2|Atr3|
+----+----+----+
|   C|   1|   B|
|   D|   2|   A|
|   E|   2|   C|
|   A|   2|   B|
+----+----+----+
>>> df2.intersect(df1).count() == df1.count()
True
>>>
For more information, see the documentation for pyspark.sql.DataFrame.intersect.
Pyspark is not the right language to do this, but still:
First let's create our dataframes:
df1 = spark.createDataFrame(sc.parallelize([['A', 2, 'B']]), ['Atr1', 'Atr2', 'Atr3'])
df2 = spark.createDataFrame(sc.parallelize([['C',1,'B'],['D',2,'A'],['E',2,'C'],['A',2,'B']]), ['Atr1', 'Atr2', 'Atr3'])
You can use:
subtract:
df1.subtract(df2).count() == 0
a join:
df2.join(df1, ['Atr1', 'Atr2', 'Atr3']).count() > 0
a filter:
df2.filter((df2.Atr1 == 'A') & (df2.Atr2 == 2) & (df2.Atr3 == 'B')).count() > 0
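If the instance comes out of df1 rather than being hard-coded, the filter condition can also be built programmatically. A rough sketch (the instance and condition names here are illustrative, not part of the answers above):
from functools import reduce
from pyspark.sql.functions import col, lit

# take the single row of df1 as a dict of column name -> value
instance = df1.first().asDict()

# AND together one equality test per column
condition = reduce(
    lambda acc, kv: acc & (col(kv[0]) == lit(kv[1])),
    instance.items(),
    lit(True),
)

df2.filter(condition).count() > 0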
Hope this helps!

Spark : How to group by distinct values in DataFrame

I have a data in a file in the following format:
1,32
1,33
1,44
2,21
2,56
1,23
The code I am executing is the following:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import spark.implicits._
import sqlContext.implicits._
case class Person(a: Int, b: Int)
val ppl = sc.textFile("newfile.txt").map(_.split(","))
  .map(p => Person(p(0).trim.toInt, p(1).trim.toInt))
  .toDF()
ppl.registerTempTable("people")
val result = ppl.select("a","b").groupBy('a).agg()
result.show
Expected output is:
1 -> 32, 33, 44, 23
2 -> 21, 56
Instead of an aggregation like sum, count, or mean, I want every element kept in the resulting row.
Try the collect_set function inside agg():
import org.apache.spark.sql.functions.{collect_list, collect_set}

val df = sc.parallelize(Seq(
  (1, 3), (1, 6), (1, 5), (2, 1), (2, 4),
  (2, 1))).toDF("a", "b")
df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  1|  6|
|  1|  5|
|  2|  1|
|  2|  4|
|  2|  1|
+---+---+
df.groupBy("a").agg(collect_set("b")).show()
+---+--------------+
|  a|collect_set(b)|
+---+--------------+
|  1|     [3, 6, 5]|
|  2|        [1, 4]|
+---+--------------+
And if you want duplicate entries, you can use collect_list:
df.groupBy("a").agg(collect_list("b")).show()
+---+---------------+
|  a|collect_list(b)|
+---+---------------+
|  1|      [3, 6, 5]|
|  2|      [1, 4, 1]|
+---+---------------+
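For completeness, a PySpark equivalent of the collect_set / collect_list approach might look like this (a sketch using the same sample data):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 3), (1, 6), (1, 5), (2, 1), (2, 4), (2, 1)], ["a", "b"]
)

# distinct values per key
df.groupBy("a").agg(F.collect_set("b").alias("b_set")).show()

# all values per key, duplicates kept
df.groupBy("a").agg(F.collect_list("b").alias("b_list")).show()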
