How to randomize different numbers for a subgroup of rows in PySpark

I have a PySpark dataframe. I need to assign a random value taken from a list to every row that matches a given condition. I did:
df = df.withColumn('rand_col', f.when(f.col('condition_col') == condition, random.choice(my_list)))
but the effect is that it picks only one random value and assigns it to all matching rows.
How can I randomize separately for each row?

random.choice(my_list) is evaluated only once, on the driver, when the expression is built, so the same literal gets applied to every row. To draw a value per row, you can:
use rand and floor from pyspark.sql.functions to create a random indexing column to index into your my_list
create a column in which the my_list value is repeated
index into that column using f.col
It would look something like this:
import pyspark.sql.functions as f
my_list = [1, 2, 30]
df = spark.createDataFrame(
    [
        (1, 0),
        (2, 1),
        (3, 1),
        (4, 0),
        (5, 1),
        (6, 1),
        (7, 0),
    ],
    ["id", "condition"]
)
df = df.withColumn('rand_index', f.when(f.col('condition') == 1, f.floor(f.rand() * len(my_list))))\
    .withColumn('my_list', f.array([f.lit(x) for x in my_list]))\
    .withColumn('rand_value', f.when(f.col('condition') == 1, f.col("my_list")[f.col("rand_index")]))
df.show()
+---+---------+----------+----------+----------+
| id|condition|rand_index| my_list|rand_value|
+---+---------+----------+----------+----------+
| 1| 0| null|[1, 2, 30]| null|
| 2| 1| 0|[1, 2, 30]| 1|
| 3| 1| 2|[1, 2, 30]| 30|
| 4| 0| null|[1, 2, 30]| null|
| 5| 1| 1|[1, 2, 30]| 2|
| 6| 1| 2|[1, 2, 30]| 30|
| 7| 0| null|[1, 2, 30]| null|
+---+---------+----------+----------+----------+
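A more compact variant (a sketch, assuming Spark 2.4+, where f.shuffle is available for array columns) skips the helper columns: shuffle the literal array per row and take its first element.
import pyspark.sql.functions as f

# f.shuffle is non-deterministic, so each matching row gets its own permutation
df = df.withColumn(
    'rand_value',
    f.when(f.col('condition') == 1,
           f.shuffle(f.array([f.lit(x) for x in my_list]))[0])
)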

Related

Null values when applying correlation function over a Window?

Given the following DataFrame, I want to apply the corr function to it over a Window:
val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")
val sampleSet = Seq(
  ("group1", "id1", 1, 1, 6),
  ("group1", "id2", 2, 2, 5),
  ("group1", "id3", 3, 3, 4),
  ("group2", "id4", 4, 4, 3),
  ("group2", "id5", 5, 5, 2),
  ("group2", "id6", 6, 6, 1)
)
val initialSet = sparkSession
  .createDataFrame(sampleSet)
  .toDF(sampleColumns: _*)

initialSet.show()
+------+---+------+------+----------+
| group| id|count1|count2|orderCount|
+------+---+------+------+----------+
|group1|id1| 1| 1| 6|
|group1|id2| 2| 2| 5|
|group1|id3| 3| 3| 4|
|group2|id4| 4| 4| 3|
|group2|id5| 5| 5| 2|
|group2|id6| 6| 6| 1|
+------+---+------+------+----------+
val initialSetWindow = Window
  .partitionBy("group")
  .orderBy("orderCountSum")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val groupedSet = initialSet
  .groupBy("group")
  .agg(
    sum("count1").as("count1Sum"),
    sum("count2").as("count2Sum"),
    sum("orderCount").as("orderCountSum")
  )
  .withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))

groupedSet.show()
+------+---------+---------+-------------+----+
| group|count1Sum|count2Sum|orderCountSum| cf|
+------+---------+---------+-------------+----+
|group1| 6| 6| 15|null|
|group2| 15| 15| 6|null|
+------+---------+---------+-------------+----+
When I try to apply the corr function, the resulting values in cf are null for some reason.
The question is: how can I apply corr to each of the rows within their subgroup (Window)? I would like to obtain the corr value per row and subgroup (group1 and group2).
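The nulls appear because groupBy("group") collapses each group to a single row, and a Pearson correlation over a window that contains only one point is undefined. Since corr is itself an aggregate, one way around it (a minimal PySpark sketch that rebuilds the sample data above; the original question is in Scala) is to compute it per group directly from the ungrouped rows:
from pyspark.sql import functions as F

initial = spark.createDataFrame(
    [("group1", "id1", 1, 1, 6), ("group1", "id2", 2, 2, 5), ("group1", "id3", 3, 3, 4),
     ("group2", "id4", 4, 4, 3), ("group2", "id5", 5, 5, 2), ("group2", "id6", 6, 6, 1)],
    ["group", "id", "count1", "count2", "orderCount"],
)

# corr aggregates over all rows of each group; count1 == count2 here, so cf is 1.0
initial.groupBy("group").agg(F.corr("count1", "count2").alias("cf")).show()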

How to ensure same partitions for same keys while reading two dataframes in pyspark?

Say I have two dataframes:
data1 = [(0, 4, 2),
         (0, 3, 3),
         (0, 5, 2),
         (1, 3, 5),
         (1, 4, 5),
         (1, 5, 1),
         (2, 4, 2),
         (2, 3, 2),
         (2, 1, 5),
         (3, 5, 2),
         (3, 1, 5),
         (3, 4, 2)]
df1 = spark.createDataFrame(data1, schema='a int,b int,c int')
data2 = [(0, 2, 1), (0, 2, 5), (1, 4, 5), (1, 5, 3), (2, 2, 2), (2, 1, 2)]
df2 = spark.createDataFrame(data2, schema='a int,b int,c int')
so the dataframes, stored on disk as CSV, look like this:
df1.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 4| 2|
| 0| 3| 3|
| 0| 5| 2|
| 1| 3| 5|
| 1| 4| 5|
| 1| 5| 1|
| 2| 4| 2|
| 2| 3| 2|
| 2| 1| 5|
| 3| 5| 2|
| 3| 1| 5|
| 3| 4| 2|
+---+---+---+
df2.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 0| 2| 1|
| 0| 2| 5|
| 1| 4| 5|
| 1| 5| 3|
| 2| 2| 2|
| 2| 1| 2|
+---+---+---+
I want to read them in and then merge them on a. I can side-step the shuffle if I can get the same keys (a) onto the same partitions. How do I:
Get the same keys onto the same node? (I can do df.repartition() after spark.read.csv, but does that not first read the data into whatever partitioning Spark sees fit and then repartition it as I want? That shuffle is what I want to avoid in the first place. A bucketing-based sketch follows below.)
Once I succeed in step 1, tell Spark that the keys are now on the same node and it does not have to go through the shuffle?
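One common way to get this behaviour (a sketch, not a full answer: it assumes writing the data out once as bucketed tables is acceptable, and the table names df1_bucketed/df2_bucketed are made up) is to bucket both datasets by a. Spark records the bucketing in the table metadata, so later joins on a can skip the exchange:
# write both dataframes bucketed (and sorted) by the join key "a";
# the bucket counts must match on both sides for the shuffle to be avoided
df1.write.bucketBy(8, "a").sortBy("a").mode("overwrite").saveAsTable("df1_bucketed")
df2.write.bucketBy(8, "a").sortBy("a").mode("overwrite").saveAsTable("df2_bucketed")

# reading them back as tables, the optimizer knows the key layout
joined = spark.table("df1_bucketed").join(spark.table("df2_bucketed"), "a")
joined.explain()  # should show SortMergeJoin with no Exchange on either side
Plain CSV files cannot carry this layout information, which is why a repartition() after spark.read.csv still reads first and shuffles afterwards.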

How to count values in columns for identical elements

I have a dataframe:
+------------+------------+-------------+
| id| column1| column2|
+------------+------------+-------------+
| 1| 1| 5|
| 1| 2| 5|
| 1| 3| 5|
| 2| 1| 15|
| 2| 2| 5|
| 2| 6| 5|
+------------+------------+-------------+
How to get the maximum value of column 1? And how to get the sum of the values in column 2?
To get this result:
+------------+------------+-------------+
| id| column1| column2|
+------------+------------+-------------+
| 1| 3| 15|
| 2| 6| 25|
+------------+------------+-------------+
Use .groupBy and agg(max(column1), sum(column2)) for this case:
# sample data
df = spark.createDataFrame([(1,1,5),(1,2,5),(1,3,5),(2,1,15),(2,2,5),(2,6,5)], ["id","column1","column2"])

from pyspark.sql.functions import *

df.groupBy("id").\
    agg(max("column1").alias("column1"), sum("column2").alias("column2")).\
    show()
#+---+-------+-------+
#| id|column1|column2|
#+---+-------+-------+
#| 1| 3| 15|
#| 2| 6| 25|
#+---+-------+-------+
If you are familiar with SQL, then below is the SQL version using the group by, max, and sum functions:
import spark.implicits._
import org.apache.spark.sql.functions._
val input = Seq(
  (1, 1, 5),
  (1, 2, 5),
  (1, 3, 5),
  (2, 1, 15),
  (2, 2, 5),
  (2, 6, 5)
).toDF("id", "col1", "col2").createTempView("mytable")

spark.sql("select id, max(col1), sum(col2) from mytable group by id").show
Result:
+---+---------+---------+
| id|max(col1)|sum(col2)|
+---+---------+---------+
| 1| 3| 15|
| 2| 6| 25|
+---+---------+---------+
All you need is groupBy to group the corresponding values of id, and the aggregate functions sum and max with agg.
The functions come from the org.apache.spark.sql.functions._ package.
import spark.implicits._
import org.apache.spark.sql.functions._
val input = Seq(
  (1, 1, 5),
  (1, 2, 5),
  (1, 3, 5),
  (2, 1, 15),
  (2, 2, 5),
  (2, 6, 5)
).toDF("id", "col1", "col2")

val result = input
  .groupBy("id")
  .agg(max(col("col1")), sum(col("col2")))
  .show()

Pyspark: reduceByKey multiple columns but independently

My data consists of multiple columns and it looks something like this:
I would like to group the data for each column separately and count the number of occurrences of each element, which I can achieve by doing this:
df.groupBy("Col-1").count()
df.groupBy("Col-2").count()
df.groupBy("Col-n").count()
However, if there are 1000s of columns, this may be time consuming, so I was trying to find another way to do it.
This is what I have done so far:
def mapFxn1(x):
    vals = [1] * len(x)
    c = tuple(zip(list(x), vals))
    return c

df_map = df.rdd.map(lambda x: mapFxn1(x))
mapFxn1 takes each row and transforms it into a tuple of tuples, so row one would basically look like this: ((10, 1), (2, 1), (x, 1))
I am just wondering how one can use reduceByKey on df_map with lambda x, y: x + y in order to group each of the columns and count the occurrences of elements in each of the columns in a single step.
Thank you in advance
With SQL grouping sets:
df = spark.createDataFrame(
    [(3, 2), (2, 1), (3, 8), (3, 9), (4, 1)]
).toDF("col1", "col2")

df.createOrReplaceTempView("df")

spark.sql("""SELECT col1, col2, COUNT(*)
             FROM df GROUP BY col1, col2 GROUPING SETS(col1, col2)"""
).show()
# +----+----+--------+
# |col1|col2|count(1)|
# +----+----+--------+
# |null| 9| 1|
# | 3|null| 3|
# |null| 1| 2|
# |null| 2| 1|
# | 2|null| 1|
# |null| 8| 1|
# | 4|null| 1|
# +----+----+--------+
With melt:
melt(df, [], df.columns).groupBy("variable", "value").count().show()
# +--------+-----+-----+
# |variable|value|count|
# +--------+-----+-----+
# | col2| 8| 1|
# | col1| 3| 3|
# | col2| 2| 1|
# | col1| 2| 1|
# | col2| 9| 1|
# | col1| 4| 1|
# | col2| 1| 2|
# +--------+-----+-----+
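The melt used above is not a built-in PySpark function; it is the usual unpivot helper. A minimal sketch of one possible implementation (name and signature assumed to match the call above), built from an array of structs plus explode:
from pyspark.sql import functions as F

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # one struct per melted column: (column name, column value);
    # the value columns are assumed to share a common type (here both are longs)
    pairs = F.array(*[
        F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
        for c in value_vars
    ])
    exploded = df.withColumn("_pair", F.explode(pairs))
    return exploded.select(
        *id_vars,
        F.col("_pair." + var_name).alias(var_name),
        F.col("_pair." + value_name).alias(value_name),
    )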
With reduceByKey:
from operator import add

counts = (df.rdd
    .flatMap(lambda x: x.asDict().items())
    .map(lambda x: (x, 1))
    .reduceByKey(add))

for x in counts.toLocalIterator():
    print(x)
#
# (('col1', 2), 1)
# (('col2', 8), 1)
# (('col2', 1), 2)
# (('col2', 9), 1)
# (('col1', 4), 1)
# (('col1', 3), 3)
# (('col2', 2), 1)

Pyspark find friendship pairs from friendship lists

I currently have data describing single-directional friendships, such as below.
The first line means that 1 added 3, 4, and 8 as friends but doesn't know their responses; if 3 added 1 as a friend as well, they become a pair.
ID friendsList
1 [3, 4, 8]
2 [8]
3 [1]
4 [1]
5 [6]
6 [7]
7 [1]
8 [1, 2, 4]
How can I use PySpark and PySpark SQL to generate the friendship pairs in which both sides are bi-directional friends? Sample output (distinct or not doesn't matter):
(1, 4)
(1, 8)
(1, 3)
(2, 8)
(3, 1)
(4, 1)
(8, 1)
(8, 2)
Thanks!
This can be achieved with the explode function and a self join, as shown below.
from pyspark.sql.functions import explode
df = spark.createDataFrame(((1,[3, 4, 8]),(2,[8]),(3,[1]),(4,[1]),(5,[6]),(6,[7]),(7,[1]),(8,[1, 2, 4])),["c1",'c2'])
df.withColumn('c2',explode(df['c2'])).createOrReplaceTempView('table1')
spark.sql("SELECT t0.c1, t0.c2 FROM table1 t0 INNER JOIN table1 t1 ON t0.c1 = t1.c2 AND t0.c2 = t1.c1").show()
+---+---+
| c1| c2|
+---+---+
| 1| 3|
| 8| 1|
| 1| 4|
| 2| 8|
| 4| 1|
| 8| 2|
| 3| 1|
| 1| 8|
+---+---+
Use the following if the DataFrame API is preferred over Spark SQL.
from pyspark.sql.functions import col

df = df.withColumn('c2', explode(df['c2']))
df.alias('df1') \
    .join(df.alias('df2'), (col('df1.c1') == col('df2.c2')) & (col('df2.c1') == col('df1.c2'))) \
    .select(col('df1.c1'), col('df1.c2')) \
    .show()
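The output above lists each pair in both directions. If only one row per pair is wanted (the question says either is fine), a small tweak of the DataFrame version, reusing the exploded df from above, is to keep only the ordering c1 < c2:
# keep each bi-directional pair once by requiring c1 < c2 (a sketch)
df.alias('df1') \
    .join(df.alias('df2'), (col('df1.c1') == col('df2.c2')) & (col('df2.c1') == col('df1.c2'))) \
    .where(col('df1.c1') < col('df1.c2')) \
    .select(col('df1.c1'), col('df1.c2')) \
    .show()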
