Null values when applying correlation function over a Window? - apache-spark

Given the following DataFrame, I want to apply the corr function over a Window:
val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")
val sampleSet = Seq(
("group1", "id1", 1, 1, 6),
("group1", "id2", 2, 2, 5),
("group1", "id3", 3, 3, 4),
("group2", "id4", 4, 4, 3),
("group2", "id5", 5, 5, 2),
("group2", "id6", 6, 6, 1)
)
val initialSet = sparkSession
.createDataFrame(sampleSet)
.toDF(sampleColumns: _*)
initialSet.show()
+------+---+------+------+----------+
| group| id|count1|count2|orderCount|
+------+---+------+------+----------+
|group1|id1| 1| 1| 6|
|group1|id2| 2| 2| 5|
|group1|id3| 3| 3| 4|
|group2|id4| 4| 4| 3|
|group2|id5| 5| 5| 2|
|group2|id6| 6| 6| 1|
+------+---+------+------+----------+
val initialSetWindow = Window
.partitionBy("group")
.orderBy("orderCountSum")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val groupedSet = initialSet
.groupBy(
"group"
).agg(
sum("count1").as("count1Sum"),
sum("count2").as("count2Sum"),
sum("orderCount").as("orderCountSum")
)
.withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))
groupedSet.show()
+------+---------+---------+-------------+----+
| group|count1Sum|count2Sum|orderCountSum| cf|
+------+---------+---------+-------------+----+
|group1| 6| 6| 15|null|
|group2| 15| 15| 6|null|
+------+---------+---------+-------------+----+
When applying the corr function, the resulting values in cf are null for some reason.
The question is: how can I apply corr to each row within its subgroup (Window)? I would like to obtain the corr value per row and subgroup (group1 and group2).
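For context, a likely explanation for the nulls: after groupBy("group") each group collapses to a single row, and the correlation of a single (x, y) pair is undefined, so corr returns null. A minimal sketch of one way to get a per-group value (assuming the goal is the correlation of count1 and count2 within each group; perGroupWindow and withCorr are illustrative names) applies corr over a window on the un-aggregated rows instead:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.corr

// No orderBy and no explicit frame, so the window spans the whole partition,
// i.e. corr is computed over every row of the group and repeated on each row.
val perGroupWindow = Window.partitionBy("group")

val withCorr = initialSet
  .withColumn("cf", corr("count1", "count2").over(perGroupWindow))
withCorr.show()
Alternatively, since corr is an ordinary aggregate function, the same per-group value can be obtained with initialSet.groupBy("group").agg(corr("count1", "count2")).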

Related

How to randomize different numbers for subgroup of rows pyspark

I have a pyspark dataframe. I need to randomize values taken from a list for all rows matching a given condition. I did:
df = df.withColumn('rand_col', f.when(f.col('condition_col') == condition, random.choice(my_list)))
but the effect is that it picks only one random value and assigns it to all matching rows.
How can I randomize separately for each row?
You can:
- use rand and floor from pyspark.sql.functions to create a random indexing column to index into your my_list
- create a column in which my_list is repeated as an array on every row
- index into that column using f.col
It would look something like this:
import pyspark.sql.functions as f
my_list = [1, 2, 30]
df = spark.createDataFrame(
[
(1, 0),
(2, 1),
(3, 1),
(4, 0),
(5, 1),
(6, 1),
(7, 0),
],
["id", "condition"]
)
df = df.withColumn('rand_index', f.when(f.col('condition') == 1, f.floor(f.rand() * len(my_list))))\
.withColumn('my_list', f.array([f.lit(x) for x in my_list]))\
.withColumn('rand_value', f.when(f.col('condition') == 1, f.col("my_list")[f.col("rand_index")]))
df.show()
+---+---------+----------+----------+----------+
| id|condition|rand_index| my_list|rand_value|
+---+---------+----------+----------+----------+
| 1| 0| null|[1, 2, 30]| null|
| 2| 1| 0|[1, 2, 30]| 1|
| 3| 1| 2|[1, 2, 30]| 30|
| 4| 0| null|[1, 2, 30]| null|
| 5| 1| 1|[1, 2, 30]| 2|
| 6| 1| 2|[1, 2, 30]| 30|
| 7| 0| null|[1, 2, 30]| null|
+---+---------+----------+----------+----------+

How to ensure same partitions for same keys while reading two dataframes in pyspark?

Say I have two dataframes:
data1=[(0, 4, 2),
(0, 3, 3),
(0, 5, 2),
(1, 3, 5),
(1, 4, 5),
(1, 5, 1),
(2, 4, 2),
(2, 3, 2),
(2, 1, 5),
(3, 5, 2),
(3, 1, 5),
(3, 4, 2)]
df1=spark.createDataFrame(data1,schema = 'a int,b int,c int')
data2 = [(0, 2, 1), (0, 2, 5), (1, 4, 5), (1, 5, 3), (2, 2, 2), (2, 1, 2)]
df2=spark.createDataFrame(data2,schema = 'a int,b int,c int')
So the dataframes on disk in a CSV look like this:
df1.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 4| 2|
| 0| 3| 3|
| 0| 5| 2|
| 1| 3| 5|
| 1| 4| 5|
| 1| 5| 1|
| 2| 4| 2|
| 2| 3| 2|
| 2| 1| 5|
| 3| 5| 2|
| 3| 1| 5|
| 3| 4| 2|
+---+---+---+
df2.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 2| 1|
| 0| 2| 5|
| 1| 4| 5|
| 1| 5| 3|
| 2| 2| 2|
| 2| 1| 2|
+---+---+---+
I want to read them in and then merge them on a. I can side-step the shuffle step if I can get the same keys (a) onto the same partitions. How do I:
1. Get the same keys on the same node? (I can call df.repartition() after spark.read.csv, but doesn't that first read the data into whatever partitioning Spark sees fit and then repartition it as I want? That shuffle is what I want to avoid in the first place.)
2. Once I succeed in step 1, tell Spark that the keys are now on the same node so that it does not have to go through the shuffle? (One common approach is sketched below.)
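One common approach (a Scala sketch; the question is about PySpark, but DataFrameWriter exposes the same bucketBy / sortBy / saveAsTable methods there) is to bucket both tables on the join key with the same number of buckets, so that a later sort-merge join needs no shuffle. The table names and the bucket count below are arbitrary, illustrative choices:
// Write both dataframes bucketed on the join key "a".
// saveAsTable is needed because bucketing metadata lives in the table catalog.
df1.write
  .bucketBy(8, "a")
  .sortBy("a")
  .mode("overwrite")
  .saveAsTable("t1_bucketed")

df2.write
  .bucketBy(8, "a")
  .sortBy("a")
  .mode("overwrite")
  .saveAsTable("t2_bucketed")

// When the bucketed tables are read back, Spark knows their layout and can
// join them without repartitioning either side.
val joined = spark.table("t1_bucketed")
  .join(spark.table("t2_bucketed"), "a")
joined.explain()  // the plan should show no Exchange on the bucketed key
Plain CSV files carry no partitioning metadata, so a shuffle (or a bucketed write like the above) cannot be avoided the first time the data is read; bucketing pays off when the same data is joined repeatedly.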

how to count values in columns for identical elements

I have a dataframe:
+------------+------------+-------------+
| id| column1| column2|
+------------+------------+-------------+
| 1| 1| 5|
| 1| 2| 5|
| 1| 3| 5|
| 2| 1| 15|
| 2| 2| 5|
| 2| 6| 5|
+------------+------------+-------------+
How do I get the maximum value of column1 and the sum of the values in column2 for each id?
To get this result:
+------------+------------+-------------+
| id| column1| column2|
+------------+------------+-------------+
| 1| 3| 15|
| 2| 6| 25|
+------------+------------+-------------+
Use .groupBy and agg(max("column1"), sum("column2")) for this case:
#sample data
df=spark.createDataFrame([(1,1,5),(1,2,5),(1,3,5),(2,1,15),(2,2,5),(2,6,5)],["id","column1","column2"])
from pyspark.sql.functions import *
df.groupBy("id").\
agg(max("column1").alias("column1"),sum("column2").alias("column2")).\
show()
#+---+-------+-------+
#| id|column1|column2|
#+---+-------+-------+
#| 1| 3| 15|
#| 2| 6| 25|
#+---+-------+-------+
If you are familiar with SQL, then below is the SQL version using the group by, max, and sum functions:
import spark.implicits._
import org.apache.spark.sql.functions._
val input = Seq(
(1, 1, 5),
(1, 2, 5),
(1, 3, 5),
(2, 1, 15),
(2, 2, 5),
(2, 6, 5)
).toDF("id", "col1", "col2")
input.createTempView("mytable")
spark.sql("select id,max(col1),sum(col2) from mytable group by id").show
Result:
+---+---------+---------+
| id|max(col1)|sum(col2)|
+---+---------+---------+
| 1| 3| 15|
| 2| 6| 25|
+---+---------+---------+
All you need is groupBy to group the corresponding values of id, and the aggregate functions sum and max inside agg.
The functions come from the org.apache.spark.sql.functions package.
import spark.implicits._
import org.apache.spark.sql.functions._
val input = Seq(
(1, 1, 5),
(1, 2, 5),
(1, 3, 5),
(2, 1, 15),
(2, 2, 5),
(2, 6, 5)
).toDF("id", "col1", "col2")
val result = input
  .groupBy("id")
  .agg(max(col("col1")), sum(col("col2")))
result.show()

Spark dataframe self-joins are producing empty dataframe as a result

Below is my data in a CSV, which I read into a dataframe.
id,pid,pname,ppid
1, 1, 5, -1
2, 1, 7, -1
3, 2, 9, 1
4, 2, 11, 1
5, 3, 5, 1
6, 4, 7, 2
7, 1, 9, 3
I am reading that data into a dataframe data_df and trying to do a self-join on different columns, but the resulting dataframes are empty. I have tried multiple options.
Below is my code. Only the last join, joined4, produces a result.
val joined = data_df.as("first").join(data_df.as("second")).where( col("first.ppid") === col("second.pid"))
joined.show(50, truncate = false)
val joined2 = data_df.as("first").join(data_df.as("second"), col("first.ppid") === col("second.pid"), "inner")
joined2.show(50, truncate = false)
val df1 = data_df.as("df1")
val df2 = data_df.as("df2")
val joined3 = df1.join(df2, $"df1.ppid" === $"df2.id")
joined3.show(50, truncate = false)
val joined4 = data_df.as("df1").join(data_df.as("df2"), Seq("id"))
joined4.show(50, truncate = false)
Below are the outputs of joined, joined2, joined3, and joined4, respectively:
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+-----+----+
|id |pid|pname|ppid|pid|pname|ppid|
+---+---+-----+----+---+-----+----+
| 1 | 1| 5| -1| 1| 5| -1|
| 2 | 1| 7| -1| 1| 7| -1|
| 3 | 2| 9| 1| 2| 9| 1|
| 4 | 2| 11| 1| 2| 11| 1|
| 5 | 3| 5| 1| 3| 5| 1|
| 6 | 4| 7| 2| 4| 7| 2|
| 7 | 1| 9| 3| 1| 9| 3|
+---+---+-----+----+---+-----+----+
Sorry, I later figured out that the spaces in the CSV were causing the issue. If I create a correctly structured CSV of the initial data, the problem disappears.
The correct CSV format is as follows:
id,pid,pname,ppid
1,1,5,-1
2,1,7,-1
3,2,9,1
4,2,11,1
5,3,5,1
6,4,7,2
7,1,9,3
Ideally, I can also use the option to ignore leading whitespace, as shown in the following answer:
val data_df = spark.read
.schema(dataSchema)
.option("mode", "FAILFAST")
.option("header", "true")
.option("ignoreLeadingWhiteSpace", "true")
.csv(dataSourceName)
pySpark (v2.4) DataFrameReader adds leading whitespace to column names

Spark : How to group by distinct values in DataFrame

I have data in a file in the following format:
1,32
1,33
1,44
2,21
2,56
1,23
The code I am executing is the following:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import spark.implicits._
import sqlContext.implicits._
case class Person(a: Int, b: Int)
val ppl = sc.textFile("newfile.txt").map(_.split(","))
.map(p=> Person(p(0).trim.toInt, p(1).trim.toInt))
.toDF()
ppl.registerTempTable("people")
val result = ppl.select("a","b").groupBy('a).agg()
result.show
Expected output (grouping by a and collecting all values of b) is:
1 -> 32, 33, 44, 23
2 -> 21, 56
Instead of aggregating by sum, count, mean, etc., I want every element in the row.
Try the collect_set function inside agg():
val df = sc.parallelize(Seq(
  (1,3), (1,6), (1,5), (2,1), (2,4),
  (2,1))).toDF("a","b")
df.show()
+---+---+
| a| b|
+---+---+
| 1| 3|
| 1| 6|
| 1| 5|
| 2| 1|
| 2| 4|
| 2| 1|
+---+---+
df.groupBy("a").agg(collect_set("b")).show()
+---+--------------+
| a|collect_set(b)|
+---+--------------+
| 1| [3, 6, 5]|
| 2| [1, 4]|
+---+--------------+
And if you want duplicate entries, you can use collect_list:
df.groupBy("a").agg(collect_list("b")).show()
+---+---------------+
| a|collect_list(b)|
+---+---------------+
| 1| [3, 6, 5]|
| 2| [1, 4, 1]|
+---+---------------+
