Pyspark: reduceByKey multiple columns but independently - apache-spark

My data consists of multiple columns and it looks something like this:
I would like to group the data for each column separately and count number of occurrences of each element, which I can achieve by doing this:
df.groupBy("Col-1").count()
df.groupBy("Col-2").count()
df.groupBy("Col-n").count()
However, if there are 1000s of columns, this may be time consuming, so I was trying to find another way to do it.
This is what I have done so far:
def mapFxn1(x):
    vals = [1] * len(x)
    c = tuple(zip(list(x), vals))
    return c

df_map = df.rdd.map(lambda x: mapFxn1(x))
mapFxn1 takes each row and transforms it into a tuple of tuples: so basically row one would look like this: ((10, 1), (2, 1), (x, 1))
I am just wondering how one can use reduceByKey on df_map with the lambda x, y: x + y in order to achieve the grouping on each of the columns and count the occurrences of elements in each of the columns in a single step.
Thank you in advance

With GROUPING SETS:
df = spark.createDataFrame(
    [(3, 2), (2, 1), (3, 8), (3, 9), (4, 1)]
).toDF("col1", "col2")
df.createOrReplaceTempView("df")

spark.sql("""SELECT col1, col2, COUNT(*)
             FROM df GROUP BY col1, col2 GROUPING SETS(col1, col2)"""
).show()
# +----+----+--------+
# |col1|col2|count(1)|
# +----+----+--------+
# |null|   9|       1|
# |   3|null|       3|
# |null|   1|       2|
# |null|   2|       1|
# |   2|null|       1|
# |null|   8|       1|
# |   4|null|       1|
# +----+----+--------+
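In this output a null can mean either a real null in the data or a grouping-set placeholder. If that is ambiguous for your data, a hedged sketch (same df and temp view as above) using Spark SQL's grouping_id() can tag which column each row was grouped on:
# Sketch: grouping_id() distinguishes grouping-set placeholders from real nulls;
# gid encodes which of (col1, col2) was aggregated away in each output row.
spark.sql("""SELECT col1, col2, COUNT(*) AS cnt, grouping_id() AS gid
             FROM df GROUP BY col1, col2 GROUPING SETS(col1, col2)""").show()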
With melt (melt is not a PySpark built-in; see the helper sketch after the output below):
melt(df, [], df.columns).groupBy("variable", "value").count().show()
# +--------+-----+-----+
# |variable|value|count|
# +--------+-----+-----+
# |    col2|    8|    1|
# |    col1|    3|    3|
# |    col2|    2|    1|
# |    col1|    2|    1|
# |    col2|    9|    1|
# |    col1|    4|    1|
# |    col2|    1|    2|
# +--------+-----+-----+
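A minimal sketch of such a melt helper (an assumption about the helper used above), built on explode over an array of structs:
from pyspark.sql.functions import array, col, explode, lit, struct

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # One (variable, value) struct per melted column;
    # assumes the melted columns share a compatible type.
    vars_and_vals = array(*[
        struct(lit(c).alias(var_name), col(c).alias(value_name)) for c in value_vars
    ])
    tmp = df.withColumn("_vars_and_vals", explode(vars_and_vals))
    cols = id_vars + [col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return tmp.select(*cols)
With that helper in scope, the melt call above returns one (variable, value) row per original cell, which is then grouped and counted.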
With reduceByKey:
from operator import add

counts = (df.rdd
    .flatMap(lambda x: x.asDict().items())
    .map(lambda x: (x, 1))
    .reduceByKey(add))

for x in counts.toLocalIterator():
    print(x)
#
# (('col1', 2), 1)
# (('col2', 8), 1)
# (('col2', 1), 2)
# (('col2', 9), 1)
# (('col1', 4), 1)
# (('col1', 3), 3)
# (('col2', 2), 1)
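If you prefer a DataFrame at the end, the pair RDD can be flattened back (a small sketch, assuming the counts RDD from above):
(counts
    .map(lambda kv: (kv[0][0], kv[0][1], kv[1]))  # ((column, value), count) -> (column, value, count)
    .toDF(["column", "value", "count"])
    .show())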

Related

How to randomize different numbers for subgroup of rows pyspark

I have a pyspark dataframe. I need to randomize values taken from a list for all rows matching a given condition. I did:
df = df.withColumn('rand_col', f.when(f.col('condition_col') == condition, random.choice(my_list)))
but the effect is that it picks only one random value and assigns it to all matching rows.
How can I randomize separately for each row?
You can:
use rand and floor from pyspark.sql.functions to create a random indexing column to index into your my_list
create a column in which the my_list value is repeated
index into that column using f.col
It would look something like this:
import pyspark.sql.functions as f
my_list = [1, 2, 30]
df = spark.createDataFrame(
    [
        (1, 0),
        (2, 1),
        (3, 1),
        (4, 0),
        (5, 1),
        (6, 1),
        (7, 0),
    ],
    ["id", "condition"]
)
df = df.withColumn('rand_index', f.when(f.col('condition') == 1, f.floor(f.rand() * len(my_list))))\
.withColumn('my_list', f.array([f.lit(x) for x in my_list]))\
.withColumn('rand_value', f.when(f.col('condition') == 1, f.col("my_list")[f.col("rand_index")]))
df.show()
+---+---------+----------+----------+----------+
| id|condition|rand_index|   my_list|rand_value|
+---+---------+----------+----------+----------+
|  1|        0|      null|[1, 2, 30]|      null|
|  2|        1|         0|[1, 2, 30]|         1|
|  3|        1|         2|[1, 2, 30]|        30|
|  4|        0|      null|[1, 2, 30]|      null|
|  5|        1|         1|[1, 2, 30]|         2|
|  6|        1|         2|[1, 2, 30]|        30|
|  7|        0|      null|[1, 2, 30]|      null|
+---+---------+----------+----------+----------+
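If the two helper columns are only needed to build rand_value, a small optional cleanup (a sketch on the df above):
# Keep only the original columns plus the randomized value
df = df.drop('rand_index', 'my_list')
df.show()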

Choose the column having more data

Using PySpark, I have to select whichever of two columns has more data (values) in it and keep it in my DataFrame.
For example, we have two columns A and B:
In this example, column B has more values, so I will keep it in my DF for transformations. Similarly, I would take A if A had more values. I think we can do it using if-else conditions, but I'm not able to get the correct logic.
You could first aggregate the columns (count the values in each). This way you will get just one row, which you can extract as a dictionary using .head().asDict(). Then use Python's max(your_dict, key=your_dict.get) to get the dict's key having the max value (i.e. the name of the column which has the maximum number of values). Then just select this column.
Example input:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 7), (2, 4), (3, 7), (None, 8), (None, 4)], ['A', 'B'])
df.show()
# +----+---+
# |   A|  B|
# +----+---+
# |   1|  7|
# |   2|  4|
# |   3|  7|
# |null|  8|
# |null|  4|
# +----+---+
Scalable script using built-in max:
val_cnt = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head().asDict()
df = df.select(max(val_cnt, key=val_cnt.get))
df.show()
# +---+
# |  B|
# +---+
# |  7|
# |  4|
# |  7|
# |  8|
# |  4|
# +---+
Script for just 2 columns (A and B):
head = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head()
df = df.select('B' if head.B > head.A else 'A')
df.show()
# +---+
# |  B|
# +---+
# |  7|
# |  4|
# |  7|
# |  8|
# |  4|
# +---+
Scalable script adjustable to more columns, without built-in max:
val_cnt = df.agg(*[F.count(c).alias(c) for c in {'A', 'B'}]).head().asDict()
key, val = '', -1
for k, v in val_cnt.items():
    if v > val:
        key, val = k, v
df = df.select(key)
df.show()
# +---+
# |  B|
# +---+
# |  7|
# |  4|
# |  7|
# |  8|
# |  4|
# +---+
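The same pattern extends to any number of columns by iterating over df.columns instead of a hard-coded set (a sketch along the lines of the first script):
val_cnt = df.agg(*[F.count(c).alias(c) for c in df.columns]).head().asDict()
df = df.select(max(val_cnt, key=val_cnt.get))  # built-in max picks the column with the most non-null values
df.show()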
Create a data frame with the data
df = spark.createDataFrame(data=[(1,7),(2,4),(3,7),(4,8),(5,0),(6,0),(None,3),(None,5),(None,8),(None,4)],schema = ['A','B'])
Define a condition to check for that
from pyspark.sql.functions import *
import pyspark.sql.functions as fx
condition = fx.when((fx.col('A').isNotNull() & (fx.col('A')>fx.col('B'))),fx.col('A')).otherwise(fx.col('B'))
df_1 = df.withColumn('max_value_among_A_and_B',condition)
Print the dataframe
df_1.show()
The max_value_among_A_and_B column then holds the larger of A and B for each row (falling back to B when A is null).
or
If you want to pick the whole column just based on the count, you can try this:
from pyspark.sql.functions import *
import pyspark.sql.functions as fx
df = spark.createDataFrame(data=[(1,7),(2,4),(3,7),(4,8),(5,0),(6,0),(None,3),(None,5),(None,8),(None,4)],schema = ['A','B'])
# Count only the non-null values in each column
if df.select('A').dropna().count() > df.select('B').dropna().count():
    pickcolumn = 'A'
else:
    pickcolumn = 'B'
df_1 = df.withColumn('NewColumn', col(pickcolumn)).drop('A', 'B')
df_1.show()

Pyspark: Calculate streak of consecutive observations

I have a Spark (2.4.0) data frame with a column that has just two values (either 0 or 1). I need to calculate the streak of consecutive 0s and 1s in this data, resetting the streak to zero if the value changes.
An example:
from pyspark.sql import (SparkSession, Window)
from pyspark.sql.functions import (to_date, row_number, lead, col)
spark = SparkSession.builder.appName('test').getOrCreate()
# Create dataframe
df = spark.createDataFrame([
    ('2018-01-01', 'John', 0, 0),
    ('2018-01-01', 'Paul', 1, 0),
    ('2018-01-08', 'Paul', 3, 1),
    ('2018-01-08', 'Pete', 4, 0),
    ('2018-01-08', 'John', 3, 0),
    ('2018-01-15', 'Mary', 6, 0),
    ('2018-01-15', 'Pete', 6, 0),
    ('2018-01-15', 'John', 6, 1),
    ('2018-01-15', 'Paul', 6, 1),
], ['str_date', 'name', 'value', 'flag'])
df.orderBy('name', 'str_date').show()
## +----------+----+-----+----+
## |  str_date|name|value|flag|
## +----------+----+-----+----+
## |2018-01-01|John|    0|   0|
## |2018-01-08|John|    3|   0|
## |2018-01-15|John|    6|   1|
## |2018-01-15|Mary|    6|   0|
## |2018-01-01|Paul|    1|   0|
## |2018-01-08|Paul|    3|   1|
## |2018-01-15|Paul|    6|   1|
## |2018-01-08|Pete|    4|   0|
## |2018-01-15|Pete|    6|   0|
## +----------+----+-----+----+
With this data, I'd like to calculate the streak of consecutive zeros and ones, ordered by date and "windowed" by name:
# Expected result:
## +----------+----+-----+----+--------+--------+
## |  str_date|name|value|flag|streak_0|streak_1|
## +----------+----+-----+----+--------+--------+
## |2018-01-01|John|    0|   0|       1|       0|
## |2018-01-08|John|    3|   0|       2|       0|
## |2018-01-15|John|    6|   1|       0|       1|
## |2018-01-15|Mary|    6|   0|       1|       0|
## |2018-01-01|Paul|    1|   0|       1|       0|
## |2018-01-08|Paul|    3|   1|       0|       1|
## |2018-01-15|Paul|    6|   1|       0|       2|
## |2018-01-08|Pete|    4|   0|       1|       0|
## |2018-01-15|Pete|    6|   0|       2|       0|
## +----------+----+-----+----+--------+--------+
Of course, I would need the streak to reset itself to zero if the 'flag' changes.
Is there a way of doing this?
This requires a difference-in-row-numbers approach: first group consecutive rows with the same value, then use a ranking approach within each group.
from pyspark.sql import Window
from pyspark.sql import functions as f

# Window definitions (the ordering column in the sample data is str_date)
w1 = Window.partitionBy(df.name).orderBy(df.str_date)
w2 = Window.partitionBy(df.name, df.flag).orderBy(df.str_date)
res = df.withColumn('grp', f.row_number().over(w1) - f.row_number().over(w2))

# Window definition for the streak within each group
w3 = Window.partitionBy(res.name, res.flag, res.grp).orderBy(res.str_date)
streak_res = res.withColumn('streak_0', f.when(res.flag == 1, 0).otherwise(f.row_number().over(w3))) \
                .withColumn('streak_1', f.when(res.flag == 0, 0).otherwise(f.row_number().over(w3)))
streak_res.show()
There is a more intuitive solution that avoids row_number() if you already have a natural ordering column (str_date in this case).
In short, to find the streak of 1's:
take the cumulative sum of the flag,
then multiply it by the flag.
To find the streak of 0's, invert the flag first and then do the same as for the streak of 1's.
First we define a function to calculate cumulative sum:
from pyspark.sql import Window
from pyspark.sql import functions as f

def cum_sum(df, new_col_name, partition_cols, order_col, value_col):
    windowval = (Window.partitionBy(partition_cols).orderBy(order_col)
                 .rowsBetween(Window.unboundedPreceding, 0))
    return df.withColumn(new_col_name, f.sum(value_col).over(windowval))
Note the use of rowsBetween (instead of rangeBetween). This is important to get the correct cumulative sum when there are duplicate values in the order column.
Calculate streak of 1's
df = cum_sum(df,
             new_col_name='1_group',
             partition_cols='name',
             order_col='str_date',
             value_col='flag')
df = df.withColumn('streak_1', f.col('flag') * f.col('1_group'))
Calculate streak of 0's
df = df.withColumn('flag_inverted', 1-f.col('flag'))
df = cum_sum(df,
             new_col_name='0_group',
             partition_cols='name',
             order_col='str_date',
             value_col='flag_inverted')
df = df.withColumn('streak_0', f.col('flag_inverted')*f.col('0_group'))
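To end up with just the original columns plus the two streak columns, a small optional cleanup (a sketch, assuming the df built above):
(df.select('str_date', 'name', 'value', 'flag', 'streak_0', 'streak_1')
   .orderBy('name', 'str_date')
   .show())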

Spark: Replace missing values with values from another column

Suppose you have a Spark dataframe containing some null values, and you would like to replace the values of one column with the values from another if present. In Python/Pandas you can use the fillna() function to do this quite nicely:
df = spark.createDataFrame([('a', 'b', 'c'),(None,'e', 'f'),(None,None,'i')], ['c1','c2','c3'])
DF = df.toPandas()
DF['c1'].fillna(DF['c2']).fillna(DF['c3'])
How can this be done using Pyspark?
You need to use the coalesce function:
from pyspark.sql.functions import coalesce, lit

cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()
# +----+----+
# |   a|   b|
# +----+----+
# |null|null|
# |   1|null|
# |null|   2|
# +----+----+
cDf.select(coalesce(cDf["a"], cDf["b"])).show()
# +--------------+
# |coalesce(a, b)|
# +--------------+
# |          null|
# |             1|
# |             2|
# +--------------+
cDf.select('*', coalesce(cDf["a"], lit(0.0))).show()
# +----+----+----------------+
# |   a|   b|coalesce(a, 0.0)|
# +----+----+----------------+
# |null|null|             0.0|
# |   1|null|             1.0|
# |null|   2|             0.0|
# +----+----+----------------+
You can also apply coalesce on multiple columns:
cDf.select(coalesce(cDf["a"], cDf["b"], lit(0))).show()
# ...
This example is taken from the pyspark.sql API documentation.
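Applied to the asker's original df, the pandas chain translates roughly to (a sketch):
from pyspark.sql.functions import coalesce, col

df = spark.createDataFrame([('a', 'b', 'c'), (None, 'e', 'f'), (None, None, 'i')], ['c1', 'c2', 'c3'])
df.withColumn('c1', coalesce(col('c1'), col('c2'), col('c3'))).show()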

How to get the min of each row in PySpark DataFrame [duplicate]

I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n) and my task is to take, for each row, the max value across these columns.
For example:
Input: PySpark DataFrame containing :
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Output:
col_4 = max(col1, col_2, col_3) = [3,2,5]
There is something similar in pandas as explained in this question.
Is there any way of doing this in PySpark, or should I convert my PySpark df to a Pandas df and then perform the operations?
You can reduce using SQL expressions over a list of columns:
from pyspark.sql.functions import max as max_, col, when
from functools import reduce

def row_max(*cols):
    return reduce(
        lambda x, y: when(x > y, x).otherwise(y),
        [col(c) if isinstance(c, str) else c for c in cols]
    )

df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
      .toDF(["a", "b", "c"]))

df.select(row_max("a", "b", "c").alias("max"))
Spark 1.5+ also provides least and greatest:
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
If you want to keep the name of the max you can use structs:
from pyspark.sql.functions import struct, lit
def row_max_with_name(*cols):
    cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))
maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
And finally you can use the above to select the "top" column:
from pyspark.sql.functions import max

((_, c), ) = (maxs
    .groupBy(col("maxs")["col"].alias("col"))
    .count()
    .agg(max(struct(col("count"), col("col"))))
    .first())

df.select(c)
We can use greatest
Creating DataFrame
df = spark.createDataFrame(
    [[1, 2, 3], [2, 1, 2], [3, 4, 5]],
    ['col_1', 'col_2', 'col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
|    1|    2|    3|
|    2|    1|    2|
|    3|    4|    5|
+-----+-----+-----+
Solution
from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))
#Only if you need col
#from pyspark.sql.functions import col
#df2 = df.withColumn('max', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()
+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
|    1|    2|    3|          3|
|    2|    1|    2|          2|
|    3|    4|    5|          5|
+-----+-----+-----+-----------+
You can also use the pyspark built-in least:
from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))
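Note that c1/c2/c3 are generic names; for the df with columns col_1/col_2/col_3 shown above, the same call would read (a sketch):
from pyspark.sql.functions import least, col

df = df.withColumn('min_by_rows', least(col('col_1'), col('col_2'), col('col_3')))
df.show()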
Another simple way of doing it. Let us say that the below df is your dataframe
df = sc.parallelize([(10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10|  1|
|200|  2| 20|
|  3| 30|300|
|400| 40|  4|
+---+---+---+
You can process the above df as below to get the desired results
from pyspark.sql.functions import lit, min

df.select(lit('c1').alias('cn1'), min(df.c1).alias('c1'),
          lit('c2').alias('cn2'), min(df.c2).alias('c2'),
          lit('c3').alias('cn3'), min(df.c3).alias('c3'))\
  .rdd.flatMap(lambda r: [(r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])\
  .toDF(['Column', 'Min']).show()
+------+---+
|Column|Min|
+------+---+
|    c1|  3|
|    c2|  2|
|    c3|  1|
+------+---+
Scala solution:
val df = sc.parallelize(Seq((10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4))).toDF("c1", "c2", "c3")
df.rdd.map(row => List[String](row(0).toString, row(1).toString, row(2).toString))
  .map(x => (x(0), x(1), x(2), x.min)).toDF("c1", "c2", "c3", "min").show
+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10|  1|  1|
|200|  2| 20|  2|
|  3| 30|300|  3|
|400| 40|  4|  4|
+---+---+---+---+
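A rough PySpark equivalent of the same RDD-based approach (a sketch, using Python's built-in min per row):
from pyspark.sql import Row

df = sc.parallelize([(10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
(df.rdd
   .map(lambda r: Row(c1=r.c1, c2=r.c2, c3=r.c3, min=min(r.c1, r.c2, r.c3)))
   .toDF()
   .show())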
