How to count values in columns for identical elements - apache-spark

I have a dataframe:
+------------+------------+-------------+
| id| column1| column2|
+------------+------------+-------------+
| 1| 1| 5|
| 1| 2| 5|
| 1| 3| 5|
| 2| 1| 15|
| 2| 2| 5|
| 2| 6| 5|
+------------+------------+-------------+
How do I get the maximum value of column1 and the sum of the values in column2 for each id?
The expected result is:
+------------+------------+-------------+
| id| column1| column2|
+------------+------------+-------------+
| 1| 3| 15|
| 2| 6| 25|
+------------+------------+-------------+

Use .groupBy("id") and agg(max("column1"), sum("column2")) for this case:
# sample data
df = spark.createDataFrame(
    [(1, 1, 5), (1, 2, 5), (1, 3, 5), (2, 1, 15), (2, 2, 5), (2, 6, 5)],
    ["id", "column1", "column2"]
)
from pyspark.sql.functions import *

df.groupBy("id") \
    .agg(max("column1").alias("column1"), sum("column2").alias("column2")) \
    .show()
#+---+-------+-------+
#| id|column1|column2|
#+---+-------+-------+
#| 1| 3| 15|
#| 2| 6| 25|
#+---+-------+-------+
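
A small aside that is not part of the original answer: the wildcard import shadows Python's built-in max and sum, so it is often cleaner to keep the functions behind an alias. An equivalent sketch:
import pyspark.sql.functions as F

df.groupBy("id") \
    .agg(F.max("column1").alias("column1"), F.sum("column2").alias("column2")) \
    .show()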

If you are familiar with SQL, below is the SQL version using group by with the max and sum functions:
import spark.implicits._

Seq(
  (1, 1, 5),
  (1, 2, 5),
  (1, 3, 5),
  (2, 1, 15),
  (2, 2, 5),
  (2, 6, 5)
).toDF("id", "col1", "col2").createTempView("mytable")

spark.sql("select id, max(col1), sum(col2) from mytable group by id").show
Result:
+---+---------+---------+
| id|max(col1)|sum(col2)|
+---+---------+---------+
| 1| 3| 15|
| 2| 6| 25|
+---+---------+---------+

All you need is groupBy to group the rows with the same id, plus the aggregate functions max and sum applied through agg.
The functions come from the org.apache.spark.sql.functions package.
import spark.implicits._
import org.apache.spark.sql.functions._

val input = Seq(
  (1, 1, 5),
  (1, 2, 5),
  (1, 3, 5),
  (2, 1, 15),
  (2, 2, 5),
  (2, 6, 5)
).toDF("id", "col1", "col2")

val result = input
  .groupBy("id")
  .agg(max(col("col1")), sum(col("col2")))

result.show()

Related

How to randomize different numbers for subgroup of rows pyspark

I have a PySpark dataframe. I need to assign a random value taken from a list to every row that matches a given condition. I did:
df = df.withColumn('rand_col', f.when(f.col('condition_col') == condition, random.choice(my_list)))
but the effect is that it picks only one random value and assigns it to all matching rows.
How can I randomize separately for each row?
random.choice is evaluated once on the driver, so the same value is reused for every row. Instead, you can:
use rand and floor from pyspark.sql.functions to create a random index column,
create an array column in which the my_list values are repeated on every row,
and index into that array column with the random index.
It would look something like this:
import pyspark.sql.functions as f
my_list = [1, 2, 30]
df = spark.createDataFrame(
    [
        (1, 0),
        (2, 1),
        (3, 1),
        (4, 0),
        (5, 1),
        (6, 1),
        (7, 0),
    ],
    ["id", "condition"]
)
df = df.withColumn('rand_index', f.when(f.col('condition') == 1, f.floor(f.rand() * len(my_list)))) \
    .withColumn('my_list', f.array([f.lit(x) for x in my_list])) \
    .withColumn('rand_value', f.when(f.col('condition') == 1, f.col("my_list")[f.col("rand_index")]))
df.show()
+---+---------+----------+----------+----------+
| id|condition|rand_index| my_list|rand_value|
+---+---------+----------+----------+----------+
| 1| 0| null|[1, 2, 30]| null|
| 2| 1| 0|[1, 2, 30]| 1|
| 3| 1| 2|[1, 2, 30]| 30|
| 4| 0| null|[1, 2, 30]| null|
| 5| 1| 1|[1, 2, 30]| 2|
| 6| 1| 2|[1, 2, 30]| 30|
| 7| 0| null|[1, 2, 30]| null|
+---+---------+----------+----------+----------+
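
A side note that is not part of the original answer: Spark 2.4+ also ships f.shuffle, which permutes an array randomly and independently for each row, so the same idea can be sketched more compactly:
import pyspark.sql.functions as f

my_list = [1, 2, 30]
df = df.withColumn(
    'rand_value',
    f.when(f.col('condition') == 1, f.shuffle(f.array([f.lit(x) for x in my_list]))[0])
)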

How to fill up null values in Spark Dataframe based on other columns' value?

Given this dataframe:
+-----+-----+----+
|num_a|num_b| sum|
+-----+-----+----+
| 1| 1| 2|
| 12| 15| 27|
| 56| 11|null|
| 79| 3| 82|
| 111| 114| 225|
+-----+-----+----+
How would you fill up the null values in the sum column when the value can be computed from the other columns? In this example 56 + 11 would be the value.
I've tried df.fillna with a UDF, but that doesn't seem to work, as it was only getting the column name, not the actual value. I want to compute the value just for the rows with missing values, so creating a new column would not be a viable option.
If your requirement is to use a UDF, then it can be done as:
import pyspark.sql.functions as F
from pyspark.sql.types import LongType

df = spark.createDataFrame(
    [(1, 2, 3),
     (12, 15, 27),
     (56, 11, None),
     (79, 3, 82)],
    ["num_a", "num_b", "sum"]
)

@F.udf(returnType=LongType())
def fill_with_sum(num_a, num_b, sum):
    # keep the existing sum; compute num_a + num_b only when it is missing
    return sum if sum is not None else (num_a + num_b)

df = df.withColumn("sum", fill_with_sum(F.col("num_a"), F.col("num_b"), F.col("sum")))
df.show()
[Out]:
+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
| 1| 2| 3|
| 12| 15| 27|
| 56| 11| 67|
| 79| 3| 82|
+-----+-----+---+
You can use the coalesce function. Check this sample code:
import pyspark.sql.functions as f
df = spark.createDataFrame(
    [(1, 2, 3),
     (12, 15, 27),
     (56, 11, None),
     (79, 3, 82)],
    ["num_a", "num_b", "sum"]
)
df.withColumn("sum", f.coalesce(f.col("sum"), f.col("num_a") + f.col("num_b"))).show()
Output is:
+-----+-----+---+
|num_a|num_b|sum|
+-----+-----+---+
| 1| 2| 3|
| 12| 15| 27|
| 56| 11| 67|
| 79| 3| 82|
+-----+-----+---+

Null values when applying correlation function over a Window?

Given the following DataFrame, I want to apply the corr function over a Window:
val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")
val sampleSet = Seq(
  ("group1", "id1", 1, 1, 6),
  ("group1", "id2", 2, 2, 5),
  ("group1", "id3", 3, 3, 4),
  ("group2", "id4", 4, 4, 3),
  ("group2", "id5", 5, 5, 2),
  ("group2", "id6", 6, 6, 1)
)

val initialSet = sparkSession
  .createDataFrame(sampleSet)
  .toDF(sampleColumns: _*)

initialSet.show()
+------+---+------+------+----------+
| group| id|count1|count2|orderCount|
+------+---+------+------+----------+
|group1|id1| 1| 1| 6|
|group1|id2| 2| 2| 5|
|group1|id3| 3| 3| 4|
|group2|id4| 4| 4| 3|
|group2|id5| 5| 5| 2|
|group2|id6| 6| 6| 1|
+------+---+------+------+----------+
val initialSetWindow = Window
  .partitionBy("group")
  .orderBy("orderCountSum")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val groupedSet = initialSet
  .groupBy("group")
  .agg(
    sum("count1").as("count1Sum"),
    sum("count2").as("count2Sum"),
    sum("orderCount").as("orderCountSum")
  )
  .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))

groupedSet.show()
+------+---------+---------+-------------+----+
| group|count1Sum|count2Sum|orderCountSum| cf|
+------+---------+---------+-------------+----+
|group1| 6| 6| 15|null|
|group2| 15| 15| 6|null|
+------+---------+---------+-------------+----+
When applying the corr function, the resulting values in cf are null for some reason.
The question is: how can I apply corr to the rows within their subgroup (Window)? I would like to obtain the corr value per row and subgroup (group1 and group2).
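
There is no answer for this one in the thread. One likely explanation, not from the original post: after the groupBy each group collapses to a single row, so the window frame that corr sees contains only one point, and the correlation of a single point is undefined, hence null. Computing corr as a grouped aggregate over the raw rows yields one value per group. A minimal PySpark sketch using the same sample data (the question itself is in Scala):
from pyspark.sql import functions as F

initialSet = spark.createDataFrame(
    [("group1", "id1", 1, 1, 6), ("group1", "id2", 2, 2, 5), ("group1", "id3", 3, 3, 4),
     ("group2", "id4", 4, 4, 3), ("group2", "id5", 5, 5, 2), ("group2", "id6", 6, 6, 1)],
    ["group", "id", "count1", "count2", "orderCount"]
)

# corr is also available as a grouped aggregate, so it sees every row of each group
initialSet.groupBy("group").agg(F.corr("count1", "count2").alias("cf")).show()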

How to ensure same partitions for same keys while reading two dataframes in pyspark?

Say I have two dataframes :
data1=[(0, 4, 2),
(0, 3, 3),
(0, 5, 2),
(1, 3, 5),
(1, 4, 5),
(1, 5, 1),
(2, 4, 2),
(2, 3, 2),
(2, 1, 5),
(3, 5, 2),
(3, 1, 5),
(3, 4, 2)]
df1=spark.createDataFrame(data1,schema = 'a int,b int,c int')
data2 = [(0, 2, 1), (0, 2, 5), (1, 4, 5), (1, 5, 3), (2, 2, 2), (2, 1, 2)]
df2=spark.createDataFrame(data2,schema = 'a int,b int,c int')
So the dataframes, stored on disk as CSV, look like this:
df1.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 4| 2|
| 0| 3| 3|
| 0| 5| 2|
| 1| 3| 5|
| 1| 4| 5|
| 1| 5| 1|
| 2| 4| 2|
| 2| 3| 2|
| 2| 1| 5|
| 3| 5| 2|
| 3| 1| 5|
| 3| 4| 2|
+---+---+---+
df2.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 2| 1|
| 0| 2| 5|
| 1| 4| 5|
| 1| 5| 3|
| 2| 2| 2|
| 2| 1| 2|
+---+---+---+
I want to read them in and then join them on a. I can side-step the shuffle if I can get the same keys (a) onto the same partitions. How do I:
Get the same keys onto the same node? (I can call df.repartition() after spark.read.csv, but doesn't that first read the data into whatever partitioning Spark sees fit and then repartition it the way I want? That shuffle is exactly what I want to avoid in the first place.)
Once I succeed in step 1, tell Spark that the keys are now on the same nodes so it does not have to shuffle again?
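
No answer to this one appears in the thread either. A hedged sketch of one common approach, bucketing both datasets on the join key before joining; the table names and bucket count below are illustrative, not from the question:
# Persist both dataframes as bucketed tables on the join key "a".
df1.write.bucketBy(8, "a").sortBy("a").mode("overwrite").saveAsTable("df1_bucketed")
df2.write.bucketBy(8, "a").sortBy("a").mode("overwrite").saveAsTable("df2_bucketed")

# Reading them back preserves the bucketing metadata, so Spark can plan a
# sort-merge join without an Exchange (shuffle) on either side.
b1 = spark.table("df1_bucketed")
b2 = spark.table("df2_bucketed")
b1.join(b2, "a").explain()   # the plan should show no Exchange before the join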

Find common pairs in a column based on overlapping entries in another column grouped on it

I need some help with a query. Say I have a dataframe like this:
+------+------+
|userid|songid|
+------+------+
| 1| a|
| 1| b|
| 1| c|
| 2| a|
| 2| d|
| 3| c|
| 4| e|
| 4| d|
| 5| b|
+------+------+
I want to return a dataframe which has userid pairs that have at least one songid in common. It would look like this for the dataframe above:
+------+--------+
|userid|friendid|
+------+--------+
| 1| 2|
| 1| 3|
| 1| 5|
| 2| 4|
+------+--------+
How can I do this?
One simple way is to use a self join:
from pyspark.sql.functions import array, col, sort_array

data = [(1, 'a'), (1, 'b'), (1, 'c'),
        (2, 'a'), (2, 'd'), (3, 'c'),
        (4, 'e'), (4, 'd'), (5, 'b')]
df = spark.createDataFrame(data, ["userid", "songid"])

# join on equal songid and different userid
join_condition = (col("u1.songid") == col("u2.songid")) & (col("u1.userid") != col("u2.userid"))

df.alias("u1").join(df.alias("u2"), join_condition, "inner") \
    .select(sort_array(array(col("u1.userid"), col("u2.userid"))).alias("pairs")) \
    .distinct() \
    .select(col("pairs").getItem(0).alias("userid"), col("pairs").getItem(1).alias("friendid")) \
    .show()
+------+--------+
|userid|friendid|
+------+--------+
| 1| 3|
| 1| 5|
| 2| 4|
| 1| 2|
+------+--------+
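
For completeness, and not part of the original answer, the same self-join can be sketched in Spark SQL, with least and greatest playing the role of sort_array:
df.createOrReplaceTempView("listens")
spark.sql("""
    SELECT DISTINCT least(u1.userid, u2.userid)    AS userid,
                    greatest(u1.userid, u2.userid) AS friendid
    FROM listens u1
    JOIN listens u2
      ON u1.songid = u2.songid AND u1.userid <> u2.userid
""").show()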
