I have the following DataFrame, over which I want to apply the corr function:
val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")
val sampleSet = Seq(
("group1", "id1", 1, 1, 6),
("group1", "id2", 2, 2, 5),
("group1", "id3", 3, 3, 4),
("group2", "id4", 4, 4, 3),
("group2", "id5", 5, 5, 2),
("group2", "id6", 6, 6, 1)
)
val initialSet = sparkSession
.createDataFrame(sampleSet)
.toDF(sampleColumns: _*)
initialSet.show()
+------+---+------+------+----------+
| group| id|count1|count2|orderCount|
+------+---+------+------+----------+
|group1|id1| 1| 1| 6|
|group1|id2| 2| 2| 5|
|group1|id3| 3| 3| 4|
|group2|id4| 4| 4| 3|
|group2|id5| 5| 5| 2|
|group2|id6| 6| 6| 1|
+------+---+------+------+----------+
val initialSetWindow = Window
.partitionBy("group")
.orderBy("orderCountSum")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val groupedSet = initialSet
.groupBy(
"group"
).agg(
sum("count1").as("count1Sum"),
sum("count2").as("count2Sum"),
sum("orderCount").as("orderCountSum")
)
.withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))
groupedSet.show()
+------+---------+---------+-------------+----+
| group|count1Sum|count2Sum|orderCountSum| cf|
+------+---------+---------+-------------+----+
|group1| 6| 6| 15|null|
|group2| 15| 15| 6|null|
+------+---------+---------+-------------+----+
When applying the corr function, some of the resulting values in cf are null for some reason.
The question is: how can I apply corr to each of the rows within their subgroup (window)? I would like to obtain the corr value per row and subgroup (group1 and group2).
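For what it's worth, corr needs at least two (x, y) pairs to be defined, and after groupBy("group") each window partition contains exactly one row, so the window sees a single point and Spark returns null. A plain-Python sketch of the Pearson formula (a hypothetical helper, not Spark's implementation) shows why a single point, or a zero variance, yields no correlation:

```python
import math

def pearson(xs, ys):
    """Pearson correlation; undefined (None) when either variance is zero,
    which is always the case for a single point."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None  # what Spark surfaces as null
    return cov / math.sqrt(vx * vy)

# one point per window partition (the situation after groupBy): undefined
print(pearson([6], [6]))              # -> None
# the un-aggregated rows of group1: perfectly correlated
print(pearson([1, 2, 3], [1, 2, 3]))  # -> 1.0
```

On the original, un-aggregated rows, corr("count1", "count2").over(Window.partitionBy("group")) would accordingly give 1.0 for every row here, since count1 and count2 are identical within each group.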
I have a pyspark dataframe. I need to randomize values taken from a list for all rows matching a given condition. I did:
df = df.withColumn('rand_col', f.when(f.col('condition_col') == condition, random.choice(my_list)))
but the effect is that it chooses only one value and assigns it to all rows:
How can I randomize separately for each row?
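The reason is that random.choice(my_list) runs once, on the driver, while the column expression is being built; its single result is then baked into the column as a literal. A plain-Python analogy (not pyspark) of the two behaviours:

```python
import random

my_list = [1, 2, 30]

# evaluated once, up front -- every row gets the same baked-in value
constant = random.choice(my_list)
once = [constant for _ in range(6)]

# evaluated per row -- each row can get a different value
# (this is what the rand()-based index below achieves inside Spark)
per_row = [my_list[int(random.random() * len(my_list))] for _ in range(6)]

print(len(set(once)))  # -> 1: a single value repeated
```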
You can:
use rand and floor from pyspark.sql.functions to create a random indexing column to index into your my_list
create a column in which the my_list value is repeated
index into that column using f.col
It would look something like this:
import pyspark.sql.functions as f
my_list = [1, 2, 30]
df = spark.createDataFrame(
[
(1, 0),
(2, 1),
(3, 1),
(4, 0),
(5, 1),
(6, 1),
(7, 0),
],
["id", "condition"]
)
df = df.withColumn('rand_index', f.when(f.col('condition') == 1, f.floor(f.rand() * len(my_list))))\
.withColumn('my_list', f.array([f.lit(x) for x in my_list]))\
.withColumn('rand_value', f.when(f.col('condition') == 1, f.col("my_list")[f.col("rand_index")]))
df.show()
+---+---------+----------+----------+----------+
| id|condition|rand_index| my_list|rand_value|
+---+---------+----------+----------+----------+
| 1| 0| null|[1, 2, 30]| null|
| 2| 1| 0|[1, 2, 30]| 1|
| 3| 1| 2|[1, 2, 30]| 30|
| 4| 0| null|[1, 2, 30]| null|
| 5| 1| 1|[1, 2, 30]| 2|
| 6| 1| 2|[1, 2, 30]| 30|
| 7| 0| null|[1, 2, 30]| null|
+---+---------+----------+----------+----------+
Say I have two dataframes:
data1=[(0, 4, 2),
(0, 3, 3),
(0, 5, 2),
(1, 3, 5),
(1, 4, 5),
(1, 5, 1),
(2, 4, 2),
(2, 3, 2),
(2, 1, 5),
(3, 5, 2),
(3, 1, 5),
(3, 4, 2)]
df1=spark.createDataFrame(data1,schema = 'a int,b int,c int')
data2 = [(0, 2, 1), (0, 2, 5), (1, 4, 5), (1, 5, 3), (2, 2, 2), (2, 1, 2)]
df2=spark.createDataFrame(data2,schema = 'a int,b int,c int')
so the dataframes on disk in a csv are like so:
df1.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
| 0| 4| 2|
| 0| 3| 3|
| 0| 5| 2|
| 1| 3| 5|
| 1| 4| 5|
| 1| 5| 1|
| 2| 4| 2|
| 2| 3| 2|
| 2| 1| 5|
| 3| 5| 2|
| 3| 1| 5|
| 3| 4| 2|
+---+---+---+
df2.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 2| 1|
| 0| 2| 5|
| 1| 4| 5|
| 1| 5| 3|
| 2| 2| 2|
| 2| 1| 2|
+---+---+---+
I want to read them in and then merge them on a. I can side-step the shuffle step if I can get the same keys (a) onto the same partitions. How do I:
Get the same keys on the same node? (I can do df.repartition() after spark.read.csv, but doesn't that first read the data into whatever partitioning Spark sees fit, and only then repartition it the way I want? That shuffle is what I want to avoid in the first place.)
Once I succeed in step 1, tell Spark that the keys are now on the same node, so that it does not have to go through the shuffle?
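A toy plain-Python model (not Spark code) of why co-partitioning helps: if both sides are partitioned with the same partitioner and the same partition count, rows with equal keys are guaranteed to land in partitions with the same index, so the join can proceed partition-by-partition with no cross-partition movement. In Spark itself, repartitioning both frames on the key still costs one shuffle each; bucketed tables written with bucketBy are the usual way to pay that cost once at write time and skip it on later reads.

```python
from collections import defaultdict

def hash_partition(rows, num_partitions):
    """Model of a hash partitioner: partition index = hash(key) % num_partitions."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[0]) % num_partitions].append(row)
    return parts

data1 = [(0, 4, 2), (1, 3, 5), (2, 4, 2), (3, 5, 2)]
data2 = [(0, 2, 1), (1, 4, 5), (2, 2, 2)]
n = 2
p1, p2 = hash_partition(data1, n), hash_partition(data2, n)

# with identical partitioning, a join on `a` only ever pairs rows from
# the same partition index -- no row has to cross to another partition
joined = [(left, right)
          for i in range(n)
          for left in p1[i]
          for right in p2[i]
          if left[0] == right[0]]
print(len(joined))  # -> 3 (keys 0, 1 and 2 each match once)
```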
I have a dataframe:
+------------+------------+-------------+
| id| column1| column2|
+------------+------------+-------------+
| 1| 1| 5|
| 1| 2| 5|
| 1| 3| 5|
| 2| 1| 15|
| 2| 2| 5|
| 2| 6| 5|
+------------+------------+-------------+
How do I get the maximum value of column1, and the sum of the values in column2?
To get this result:
+------------+------------+-------------+
| id| column1| column2|
+------------+------------+-------------+
| 1| 3| 15|
| 2| 6| 25|
+------------+------------+-------------+
Use .groupBy and agg with max(column1) and sum(column2) for this case:
#sample data
df=spark.createDataFrame([(1,1,5),(1,2,5),(1,3,5),(2,1,15),(2,2,5),(2,6,5)],["id","column1","column2"])
from pyspark.sql.functions import *
df.groupBy("id").\
agg(max("column1").alias("column1"),sum("column2").alias("column2")).\
show()
#+---+-------+-------+
#| id|column1|column2|
#+---+-------+-------+
#| 1| 3| 15|
#| 2| 6| 25|
#+---+-------+-------+
If you are familiar with SQL, below is the SQL version using the group by, max and sum functions:
import spark.implicits._
import org.apache.spark.sql.functions._
Seq(
(1, 1, 5),
(1, 2, 5),
(1, 3, 5),
(2, 1, 15),
(2, 2, 5),
(2, 6, 5)
).toDF("id", "col1", "col2").createTempView("mytable")
spark.sql("select id,max(col1),sum(col2) from mytable group by id").show
Result :
+---+---------+---------+
| id|max(col1)|sum(col2)|
+---+---------+---------+
| 1| 3| 15|
| 2| 6| 25|
+---+---------+---------+
All you need is groupBy to group the corresponding values of id, plus the aggregate functions max and sum with agg.
The functions come from the org.apache.spark.sql.functions._ package.
import spark.implicits._
import org.apache.spark.sql.functions._
val input = Seq(
(1, 1, 5),
(1, 2, 5),
(1, 3, 5),
(2, 1, 15),
(2, 2, 5),
(2, 6, 5)
).toDF("id", "col1", "col2")
val result = input
.groupBy("id")
.agg(max(col("col1")),sum(col("col2")))
.show()
Below is my data in csv which I read into dataframe.
id,pid,pname,ppid
1, 1, 5, -1
2, 1, 7, -1
3, 2, 9, 1
4, 2, 11, 1
5, 3, 5, 1
6, 4, 7, 2
7, 1, 9, 3
I am reading that data into a dataframe data_df. I am trying to do a self-join on different columns, but the resulting dataframes are empty. I have tried multiple options.
Below is my code. Only the last one, joined4, produces a result.
val joined = data_df.as("first").join(data_df.as("second")).where( col("first.ppid") === col("second.pid"))
joined.show(50, truncate = false)
val joined2 = data_df.as("first").join(data_df.as("second"), col("first.ppid") === col("second.pid"), "inner")
joined2.show(50, truncate = false)
val df1 = data_df.as("df1")
val df2 = data_df.as("df2")
val joined3 = df1.join(df2, $"df1.ppid" === $"df2.id")
joined3.show(50, truncate = false)
val joined4 = data_df.as("df1").join(data_df.as("df2"), Seq("id"))
joined4.show(50, truncate = false)
Below are the output of joined, joined2, joined3, joined4 respectively :
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+-----+----+
|id |pid|pname|ppid|pid|pname|ppid|
+---+---+-----+----+---+-----+----+
| 1 | 1| 5| -1| 1| 5| -1|
| 2 | 1| 7| -1| 1| 7| -1|
| 3 | 2| 9| 1| 2| 9| 1|
| 4 | 2| 11| 1| 2| 11| 1|
| 5 | 3| 5| 1| 3| 5| 1|
| 6 | 4| 7| 2| 4| 7| 2|
| 7 | 1| 9| 3| 1| 9| 3|
+---+---+-----+----+---+-----+----+
Sorry, I later figured out that the spaces in the csv were causing the issue. If I create a correctly structured csv from the initial data, the problem disappears.
Correct csv format as follows.
id,pid,pname,ppid
1,1,5,-1
2,1,7,-1
3,2,9,1
4,2,11,1
5,3,5,1
6,4,7,2
7,1,9,3
Ideally, I can also use the option to ignore leading whitespace, as shown in the following answer:
val data_df = spark.read
.schema(dataSchema)
.option("mode", "FAILFAST")
.option("header", "true")
.option("ignoreLeadingWhiteSpace", "true")
.csv(dataSourceName)
pySpark (v2.4) DataFrameReader adds leading whitespace to column names
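The failure mode is easy to reproduce in plain Python (hypothetical values standing in for the columns read from the malformed csv): the reader kept the leading spaces, so the join compared strings like " 1" against "1", which never match. Trimming, which is what ignoreLeadingWhiteSpace does at read time, restores the matches:

```python
# ppid values as they came off the malformed csv, with leading spaces
ppid_values = [" -1", " -1", " 1", " 1", " 1", " 2", " 3"]
# pid values from the join's other side, without the spaces
pid_values = ["1", "1", "2", "2", "3", "4", "1"]

# the join condition compares raw strings, so nothing ever matches
matches_raw = [p for p in ppid_values if p in pid_values]

# stripping the whitespace (the ignoreLeadingWhiteSpace effect) fixes it
trimmed = [p.strip() for p in ppid_values]
matches_trimmed = [p for p in trimmed if p in pid_values]
print(len(matches_raw), len(matches_trimmed))  # -> 0 5
```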
I have a data in a file in the following format:
1,32
1,33
1,44
2,21
2,56
1,23
The code I am executing is the following:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(a: Int, b: Int)
val ppl = sc.textFile("newfile.txt").map(_.split(","))
.map(p=> Person(p(0).trim.toInt, p(1).trim.toInt))
.toDF()
ppl.registerTempTable("people")
val result = ppl.select("a","b").groupBy('a).agg()
result.show
Expected output is:
1 -> 32, 33, 44, 23
2 -> 21, 56
Instead of aggregating by sum, count, mean, etc., I want every element of the group in the row.
Try the collect_set function inside agg():
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
  (1,3), (1,6), (1,5), (2,1), (2,4), (2,1)
)).toDF("a","b")
df.show()
+---+---+
| a| b|
+---+---+
| 1| 3|
| 1| 6|
| 1| 5|
| 2| 1|
| 2| 4|
| 2| 1|
+---+---+
df.groupBy("a").agg(collect_set("b")).show()
+---+--------------+
| a|collect_set(b)|
+---+--------------+
| 1| [3, 6, 5]|
| 2| [1, 4]|
+---+--------------+
And if you want duplicate entries, you can use collect_list:
df.groupBy("a").agg(collect_list("b")).show()
+---+---------------+
| a|collect_list(b)|
+---+---------------+
| 1| [3, 6, 5]|
| 2| [1, 4, 1]|
+---+---------------+
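The difference between the two can be modelled in plain Python (an analogy, not Spark's implementation): collect_list keeps every value per group, while collect_set drops duplicates and makes no ordering guarantee:

```python
from collections import defaultdict

rows = [(1, 3), (1, 6), (1, 5), (2, 1), (2, 4), (2, 1)]
groups = defaultdict(list)
for a, b in rows:
    groups[a].append(b)

as_list = dict(groups)                           # collect_list analogue
as_set = {k: set(v) for k, v in groups.items()}  # collect_set analogue
print(as_list[2])  # -> [1, 4, 1]  duplicates kept
print(as_set[2])   # -> {1, 4}     duplicates dropped
```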