Calculate per row and add new column in DataFrame PySpark - better solution? - apache-spark

I work with DataFrames in PySpark.
I have the following task: for each row, count in how many columns the value is > 2. For u1 it is 0, for u2 it is 2, and so on:
user  a  b  c  d  times
u1    1  0  1  0  0
u2    0  1  4  3  2
u3    2  1  7  0  1
My solution is below. It works, but I'm not sure it is the best way, and I haven't tried it on really big data yet. I don't like converting to an RDD and back to a DataFrame. Is there anything better? Initially I thought about calculating per column with a UDF, but I didn't find a way to accumulate and sum all the results per row:
def calculate_times(row):
    # count how many non-string values in the row are greater than 2
    times = 0
    for index, item in enumerate(row):
        if not isinstance(item, basestring):
            if item > 2:
                times = times + 1
    return times

def add_column(pair):
    # pair is (original Row, computed count); merge them into a single dict
    return dict(pair[0].asDict().items() + [("is_outlier", pair[1])])

def calculate_times_for_all(df):
    rdd_with_times = df.rdd.map(lambda row: calculate_times(row))
    rdd_final = df.rdd.zip(rdd_with_times).map(add_column)
    df_final = sqlContext.createDataFrame(rdd_final)
    return df_final
For this solution I used this topic:
How do you add a numpy.array as a new column to a pyspark.SQL DataFrame?
Thanks!

It is just a simple one-liner. Example data:
df = sc.parallelize([
    ("u1", 1, 0, 1, 0), ("u2", 0, 1, 4, 3), ("u3", 2, 1, 7, 0)
]).toDF(["user", "a", "b", "c", "d"])
withColumn:
df.withColumn("times", sum((df[c] > 2).cast("int") for c in df.columns[1:]))
and the result:
+----+---+---+---+---+-----+
|user| a| b| c| d|times|
+----+---+---+---+---+-----+
| u1| 1| 0| 1| 0| 0|
| u2| 0| 1| 4| 3| 2|
| u3| 2| 1| 7| 0| 1|
+----+---+---+---+---+-----+
Note:
If the columns are nullable you should correct for that, for example using coalesce:
from pyspark.sql.functions import coalesce
sum(coalesce((df[c] > 2).cast("int"), 0) for c in df.columns[1:])
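A minimal sketch of how that expression plugs back into withColumn, assuming the same session as above; the None values below are made up just to illustrate nullable columns:
df_nullable = sc.parallelize([
    ("u1", 1, 0, 1, 0), ("u2", 0, None, 4, 3), ("u3", 2, 1, 7, None)
]).toDF(["user", "a", "b", "c", "d"])

# a null comparison yields null, so coalesce(..., 0) counts it as "not > 2"
df_nullable.withColumn(
    "times",
    sum(coalesce((df_nullable[c] > 2).cast("int"), 0) for c in df_nullable.columns[1:])
).show()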

Related

Counting total rows, rows with null value, rows with zero values, and their ratios on PySpark

I have a table structure like this:
unique_id | group | value_1 | value_2 | value_3
abc_xxx   | 1     | 200     | null    | 100
def_xxx   | 1     | 0       | 3       | 40
ghi_xxx   | 2     | 300     | 1       | 2
that I need to extract the following information from:
Total number of rows per group
Number of rows per group that contain null values
Number of rows per group that contain zero values
I can do the first one with a simple groupBy and count:
df.groupBy('group').count()
I'm not so sure how to approach the next two, which I need in order to compute the null and zero rates from the total rows per group.
from pyspark.sql.functions import col, count, isnull, lit, sum, when

data = [
    ('abc_xxx', 1, 200, None, 100),
    ('def_xxx', 1, 0, 3, 40),
    ('ghi_xxx', 2, 300, 1, 2),
]
df = spark.createDataFrame(data, ['unique_id', 'group', 'value_1', 'value_2', 'value_3'])

# new edit: flag rows that contain any null / any zero, then aggregate per group
df = df\
    .withColumn('contains_null', when(isnull(col('value_1')) | isnull(col('value_2')) | isnull(col('value_3')), lit(1)).otherwise(lit(0)))\
    .withColumn('contains_zero', when((col('value_1')==0) | (col('value_2')==0) | (col('value_3')==0), lit(1)).otherwise(lit(0)))

df.groupBy('group')\
    .agg(count('unique_id').alias('total_rows'), sum('contains_null').alias('null_value_rows'), sum('contains_zero').alias('zero_value_rows')).show()
+-----+----------+---------------+---------------+
|group|total_rows|null_value_rows|zero_value_rows|
+-----+----------+---------------+---------------+
| 1| 2| 1| 1|
| 2| 1| 0| 0|
+-----+----------+---------------+---------------+
# total_count = (count('value_1') + count('value_2') + count('value_3'))
# null_count = (sum(when(isnull(col('value_1')), lit(1)).otherwise(lit(0)) + when(isnull(col('value_2')), lit(1)).otherwise(lit(0)) + when(isnull(col('value_3')), lit(1)).otherwise(lit(0))))
# zero_count = (sum(when(col('value_1')==0, lit(1)).otherwise(lit(0)) + when(col('value_2')==0, lit(1)).otherwise(lit(0)) + when(col('value_3')==0, lit(1)).otherwise(lit(0))))
# df.groupBy('group')\
# .agg(total_count.alias('total_numbers'), null_count.alias('null_values'), zero_count.alias('zero_values')).show()
#+-----+-------------+-----------+-----------+
#|group|total_numbers|null_values|zero_values|
#+-----+-------------+-----------+-----------+
#| 1| 5| 1| 1|
#| 2| 3| 0| 0|
#+-----+-------------+-----------+-----------+
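Since the question also asks for the null and zero rates, one possible follow-up (a sketch reusing the df with the contains_null / contains_zero flags from above) is to divide the flag sums by the row count in the same aggregation:
from pyspark.sql.functions import round

df.groupBy('group')\
    .agg(count('unique_id').alias('total_rows'),
         sum('contains_null').alias('null_value_rows'),
         sum('contains_zero').alias('zero_value_rows'))\
    .withColumn('null_rate', round(col('null_value_rows') / col('total_rows'), 2))\
    .withColumn('zero_rate', round(col('zero_value_rows') / col('total_rows'), 2))\
    .show()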

How do I use lag function to get my desired df from my source df?

I have a source dataframe that looks like this -
Id | Offset | a    | b    | c    | d    | e    | f
p  | 1      | 1    | 2    | null | null | null | null
p  | 2      | null | null | 3    | 4    | null | null
q  | 1      | 1    | 2    | null | null | null | null
q  | 2      | null | null | 3    | 4    | null | null
q  | 3      | null | null | null | null | 5    | 6
You can think of the columns (a-f) to be some features that describe some object (named Id), and these features get updated over time (the offsets). Not all of these features will be updated at the same time. This data is essentially my first df. From this df though, I need to get something like the second df, that essentially describes my objects with all the feature data available at that point of time.
I need my output df to be like this -
Id | Offset | a | b | c | d | e | f
p  | 1      | 1 | 2 | n | n | n | n
p  | 2      | 1 | 2 | 3 | 4 | n | n
q  | 1      | 1 | 2 | n | n | n | n
q  | 2      | 1 | 2 | 3 | 4 | n | n
q  | 3      | 1 | 2 | 3 | 4 | 5 | 6
How can I achieve this with the lag function (or something else) in PySpark?
Based on the input, the expected output, and your explanation, I assume you want every row to be filled with the most recent non-null value seen so far for each feature.
To do this, you can apply the last aggregate, ignoring nulls, over a window spanning the current row and all previous rows.
from pyspark.sql import functions as F
from pyspark.sql import Window as W

data = [("p", 1, 1, 2, None, None, None, None,),
        ("p", 2, None, None, 3, 4, None, None,),
        ("q", 1, 1, 2, None, None, None, None,),
        ("q", 2, None, None, 3, 4, None, None,),
        ("q", 3, None, None, None, None, 5, 6,), ]
df = spark.createDataFrame(data, ("Id", "Offset", "a", "b", "c", "d", "e", "f",))

# window over all rows up to and including the current one, per Id, ordered by Offset
window_spec = W.partitionBy("Id").orderBy(F.asc("Offset")).rowsBetween(W.unboundedPreceding, W.currentRow)

features_to_transform = ["a", "b", "c", "d", "e", "f"]
# carry the last non-null value of each feature forward
transformations = [F.last(feature, ignorenulls=True).over(window_spec).alias(feature)
                   for feature in features_to_transform]

df.select("Id", "Offset", *transformations).show()
Output
+---+------+---+---+----+----+----+----+
| Id|Offset| a| b| c| d| e| f|
+---+------+---+---+----+----+----+----+
| p| 1| 1| 2|null|null|null|null|
| p| 2| 1| 2| 3| 4|null|null|
| q| 1| 1| 2|null|null|null|null|
| q| 2| 1| 2| 3| 4|null|null|
| q| 3| 1| 2| 3| 4| 5| 6|
+---+------+---+---+----+----+----+----+
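As an aside on the lag function the question mentions: lag only reads the value at a single fixed offset behind the current row, so it cannot carry a value forward across several later rows. A minimal sketch for comparison (same df, imports and column names as above; a_lag is just an illustrative alias):
df.select(
    "Id", "Offset",
    F.lag("a", 1).over(W.partitionBy("Id").orderBy("Offset")).alias("a_lag"),
).show()
# a_lag only holds the previous row's value of "a" (null on the first row of each Id),
# which is why last(..., ignorenulls=True) is used for the fill-forward above.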

How to replace negative values with 0 in pyspark dataframe

I want to replace all negative values with 0 and all nan values with 0 in a PySpark dataframe with integer columns. I tried
df[df < 0] = 0
But I get an error.
You can do this with a combination of reduce and when -
to_convert - contains the set of columns you wish to convert to 0
Data Preparation
input_str = """
|-1|100
|10|-10
|200|-300
|-500|300
""".split("|")
input_values = list(map(lambda x: int(x.strip()), input_str[1:]))
input_list = [(x, y) for x, y in zip(input_values[0::2], input_values[1::2])]
sparkDF = sql.createDataFrame(input_list, ["a", "b"])
sparkDF.show()
+----+----+
| a| b|
+----+----+
| -1| 100|
| 10| -10|
| 200|-300|
|-500| 300|
+----+----+
Reduce and When
to_convert = set(['a'])
sparkDF = reduce(
lambda df, x: df.withColumn(x, F.when(F.col(x) < 0, 0).otherwise(F.col(x))),
to_convert,
sparkDF,
)
sparkDF.show()
+---+----+
| a| b|
+---+----+
| 0| 100|
| 10| -10|
|200|-300|
| 0| 300|
+---+----+
You can replace null values with 0 (or any value of your choice) across all columns with the df.fillna(0) method. However, to replace negative values across columns, I don't think there is any direct approach, except using case when on each column as below.
from pyspark.sql import functions as F
df.withColumn(
"col1",
F.when(df["col1"] < 0, 0).when(F.col("col1").isNull(), 0).otherwise(F.col("col1")),
)
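To apply the same logic to every column at once (covering both the negative and the null case from the question), a possible sketch, assuming all columns of df are numeric:
df_clean = df.select(
    *[
        F.when(F.col(c) < 0, 0)          # negatives -> 0
         .when(F.col(c).isNull(), 0)     # nulls -> 0 (F.isnan could be added for float NaN)
         .otherwise(F.col(c))
         .alias(c)
        for c in df.columns
    ]
)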

Label encoding of date and create new attribute based on quarter using pyspark

I have a date and want to create new attributes based on the date. For example:
date (mm/dd/yyyy)  quarter1  quarter2  quarter3  quarter4
2/3/2020           1         0         0         0
11/11/2020         0         0         0         1
You can cast the date column to a date type and use the quarter function, then apply a when + otherwise condition to create the columns:
from pyspark.sql import functions as F
qtrs = ['quarter1','quarter2','quarter3','quarter4']
df = df.select("*", F.concat(F.lit("quarter"),
                             F.quarter(F.to_date("date", 'M/d/yyyy'))).alias("quarters"))\
    .select("*", *[F.when(F.col("quarters") == col, 1).otherwise(0).alias(col) for col in qtrs])\
    .drop("quarters")
df.show()
+----------+--------+--------+--------+--------+
| date|quarter1|quarter2|quarter3|quarter4|
+----------+--------+--------+--------+--------+
| 2/3/2020| 1| 0| 0| 0|
|11/11/2020| 0| 0| 0| 1|
+----------+--------+--------+--------+--------+
Per OP's request, adding approach with withColumn:
df = (df.withColumn("quarters",F.concat(F.lit("quarter"),
F.quarter(F.to_date("date",'M/d/yyyy'))))
.withColumn("quarter1",F.when(F.col("quarters")=='quarter1',1).otherwise(0))
.withColumn("quarter2",F.when(F.col("quarters")=='quarter2',1).otherwise(0))
.withColumn("quarter3",F.when(F.col("quarters")=='quarter3',1).otherwise(0))
.withColumn("quarter4",F.when(F.col("quarters")=='quarter4',1).otherwise(0))
.drop("quarters")
)
df.show()
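For what it's worth, the helper quarters column could also be skipped entirely by comparing the quarter number directly; a small sketch of that variant (same F import and date format, starting again from the original single-column df, result named df_alt for illustration):
df_alt = df.select(
    "date",
    *[F.when(F.quarter(F.to_date("date", 'M/d/yyyy')) == i, 1).otherwise(0).alias("quarter" + str(i))
      for i in range(1, 5)]
)
df_alt.show()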

Mapping key and list of values to key value using pyspark

I have a dataset which consists of two columns, C1 and C2. The columns are associated in a many-to-many relation.
What I would like to do is find, for each C2 value, the C1 value which has the most associations with C2 values overall.
For example:
C1 | C2
1 | 2
1 | 5
1 | 9
2 | 9
2 | 8
We can see here that 1 is matched to 3 values of C2 while 2 is matched to 2, so I would like as output:
Out1 |Out2| matches
2 | 1 | 3
5 | 1 | 3
9 | 1 | 3 (1 wins because 3>2)
8 | 2 | 2
What I have done so far is:
dataset = sc.textFile("...").\
    map(lambda line: (line.split(",")[0], list(line.split(",")[1]))).\
    reduceByKey(lambda x, y: x + y)
What this does is gather, for each C1 value, all the C2 matches; the length of this list is our desired matches column. What I would like now is to somehow use each value in this list as a new key and have a mapping like:
(Key ,Value_list[value1,value2,...]) -->(value1 , key ),(value2 , key)...
How could this be done using spark? Any advice would be really helpful.
Thanks in advance!
The dataframe API is perhaps easier for this kind of task. You can group by C1, get the count, then group by C2, and get the value of C1 that corresponds to the highest number of matches.
import pyspark.sql.functions as F
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df2 = (df.groupBy('C1')
         .count()
         .join(df, 'C1')
         .groupBy(F.col('C2').alias('Out1'))
         .agg(
             F.max(
                 F.struct(F.col('count').alias('matches'), F.col('C1').alias('Out2'))
             ).alias('c')
         )
         .select('Out1', 'c.Out2', 'c.matches')
         .orderBy('Out1')
       )
df2.show()
+----+----+-------+
|Out1|Out2|matches|
+----+----+-------+
| 2| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 9| 1| 3|
+----+----+-------+
We can get the desired result easily using the DataFrame API.
from pyspark.sql import *
import pyspark.sql.functions as fun
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[*]").getOrCreate()

# preparing sample dataframe
data = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]
schema = ["c1", "c2"]
df = spark.createDataFrame(data, schema)

# matches = number of C2 values associated with each C1
output = df.withColumn("matches", fun.count("c1").over(Window.partitionBy("c1"))) \
    .groupby(fun.col('C2').alias('out1')) \
    .agg(fun.first(fun.col("c1")).alias("out2"), fun.max("matches").alias("matches"))

output.show()
# output
+----+----+-------+
|Out1|out2|matches|
+----+----+-------+
| 9| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 2| 1| 3|
+----+----+-------+
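For completeness, the (Key, Value_list) -> (value, key) re-mapping described in the question can also be expressed at the RDD level; a rough sketch, assuming dataset is the (C1, list-of-C2) RDD produced by the question's reduceByKey:
counts = dataset.mapValues(len)                   # (C1, matches)
best = (dataset.join(counts)                      # (C1, ([C2, ...], matches))
        .flatMap(lambda kv: [(c2, (kv[0], kv[1][1])) for c2 in kv[1][0]])  # (C2, (C1, matches))
        .reduceByKey(lambda a, b: a if a[1] >= b[1] else b))               # keep the C1 with most matches
best.collect()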
