Spark dataframe not able to replace NULL values - apache-spark

The code below works fine, but if any one of the five columns SAL1, SAL2, SAL3, SAL4, SAL5 is NULL, the corresponding TOTAL_SALARY comes out as NULL.
It looks like I need some null-handling condition or a Spark UDF; could you please help with that?
input:
NO NAME ADDR SAL1 SAL2 SAL3 SAL4 SAL5
1 ABC IND 100 200 300 null 400
2 XYZ USA 200 333 209 232 444
The second record's sum comes out fine, but for the first record the output is NULL because SAL4 is null.
from pyspark.shell import spark
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
sc = spark.sparkContext
df = spark.read.option("header","true").option("delimiter", ",").csv("C:\\TEST.txt")
df.createOrReplaceTempView("table1")
df1 = spark.sql( "select * from table1" )
df2 = df1.groupBy('NO', 'NAME', 'ADDR').agg(F.sum(df1.SAL1 + df1.SAL2 + df1.SAL3 + df1.SAL4 + df1.SAL5).alias("TOTAL_SALARY"))
df2.show()
Thanks in advance

Just add na.fill(0) to your code. This replaces the NULL values with 0, and you should then be able to perform the operation.
So your last line should look like:
df2 = df1.na.fill(0).groupBy('NO', 'NAME', 'ADDR').agg(F.sum(df1.SAL1 + df1.SAL2 + df1.SAL3 + df1.SAL4 + df1.SAL5).alias("TOTAL_SALARY"))
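If the salary columns come straight from the CSV they are read as strings, so a variant that casts them first, fills only those columns, and refers to them by name may be safer. A sketch along those lines (column names are from the question; the cast and the reduce are my additions):
from functools import reduce
from pyspark.sql import functions as F

sal_cols = ["SAL1", "SAL2", "SAL3", "SAL4", "SAL5"]

# Cast to numeric first (csv without inferSchema reads everything as string;
# a literal "null" text also becomes a real NULL after the cast), then fill with 0.
df_filled = df1.select(
    "NO", "NAME", "ADDR",
    *[F.col(c).cast("double").alias(c) for c in sal_cols]
).na.fill(0, subset=sal_cols)

# Sum the per-row total of the five columns, grouped as in the question.
total = reduce(lambda a, b: a + b, [F.col(c) for c in sal_cols])
df2 = df_filled.groupBy("NO", "NAME", "ADDR").agg(F.sum(total).alias("TOTAL_SALARY"))
df2.show()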
It also seems that the sum function should be able to handle Null values correctly. I just tested the following code:
from pyspark.sql.functions import col, sum

df_new = spark.createDataFrame([
    (1, 4), (2, None), (3, None), (4, None),
    (5, 5), (6, None), (7, None), (1, 4), (2, 8), (3, 9), (4, 1), (1, 2), (2, 1), (3, 3), (4, 7),
], ("customer_id", "balance"))

df_new.groupBy("customer_id").agg(sum(col("balance"))).show()
df_new.na.fill(0).groupBy("customer_id").agg(sum(col("balance"))).show()
Output:
+-----------+------------+
|customer_id|sum(balance)|
+-----------+------------+
| 7| null|
| 6| null|
| 5| 5|
| 1| 10|
| 3| 12|
| 2| 9|
| 4| 8|
+-----------+------------+
+-----------+------------+
|customer_id|sum(balance)|
+-----------+------------+
| 7| 0|
| 6| 0|
| 5| 5|
| 1| 10|
| 3| 12|
| 2| 9|
| 4| 8|
+-----------+------------+
Version 1 only contains a NULL sum when all values in a group are NULL.
Version 2 returns 0 instead, since all NULL values are replaced with 0s.

Basically, the code below checks each of the 5 SAL fields and, if it is NULL, replaces it with 0; otherwise it keeps the original value.
from pyspark.sql.functions import when, lit

df1 = df.withColumn("SAL1", when(df.SAL1.isNull(), lit(0)).otherwise(df.SAL1)) \
        .withColumn("SAL2", when(df.SAL2.isNull(), lit(0)).otherwise(df.SAL2)) \
        .withColumn("SAL3", when(df.SAL3.isNull(), lit(0)).otherwise(df.SAL3)) \
        .withColumn("SAL4", when(df.SAL4.isNull(), lit(0)).otherwise(df.SAL4)) \
        .withColumn("SAL5", when(df.SAL5.isNull(), lit(0)).otherwise(df.SAL5))

Related

Mapping key and list of values to key value using pyspark

I have a dataset which consists of two columns, C1 and C2. The columns are associated in a many-to-many relation.
What I would like to do is find, for each C2 value, the C1 value which has the most associations with C2 values overall.
For example:
C1 | C2
1 | 2
1 | 5
1 | 9
2 | 9
2 | 8
We can see here that 1 is matched to 3 values of C2 while 2 is matched to 2, so I would like as output:
Out1 |Out2| matches
2 | 1 | 3
5 | 1 | 3
9 | 1 | 3 (1 wins because 3>2)
8 | 2 | 2
What I have done so far is:
dataset = sc.textFile("...") \
    .map(lambda line: (line.split(",")[0], list(line.split(",")[1]))) \
    .reduceByKey(lambda x, y: x + y)
What this does is, for each C1 value, gather all the C2 matches; the count of this list is our desired matches column. What I would like now is to somehow use each value in this list as a new key and have a mapping like:
(Key, Value_list[value1, value2, ...]) --> (value1, Key), (value2, Key), ...
How could this be done using spark? Any advice would be really helpful.
Thanks in advance!
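The (key, value_list) flip described above is exactly what flatMap does on an RDD. A rough, untested sketch along those lines (it keeps the split/reduceByKey step from the question, only wrapping the second field in a one-element list so that concatenation builds a list of values):
pairs = sc.textFile("...") \
    .map(lambda line: (line.split(",")[0], [line.split(",")[1]])) \
    .reduceByKey(lambda x, y: x + y)

# (c1, [c2, c2, ...]) -> (c2, (c1, matches)) for every c2 in the list
flipped = pairs.flatMap(lambda kv: [(v, (kv[0], len(kv[1]))) for v in kv[1]])

# per c2 value, keep the c1 with the most matches overall
best = flipped.reduceByKey(lambda a, b: a if a[1] >= b[1] else b)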
The dataframe API is perhaps easier for this kind of task. You can group by C1, get the count, then group by C2, and get the value of C1 that corresponds to the highest number of matches.
import pyspark.sql.functions as F
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df2 = (df.groupBy('C1')
         .count()
         .join(df, 'C1')
         .groupBy(F.col('C2').alias('Out1'))
         .agg(
             F.max(
                 F.struct(F.col('count').alias('matches'), F.col('C1').alias('Out2'))
             ).alias('c')
         )
         .select('Out1', 'c.Out2', 'c.matches')
         .orderBy('Out1')
      )
df2.show()
+----+----+-------+
|Out1|Out2|matches|
+----+----+-------+
| 2| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 9| 1| 3|
+----+----+-------+
We can get the desired result easily using the DataFrame API.
from pyspark.sql import *
import pyspark.sql.functions as fun
from pyspark.sql.window import Window
spark = SparkSession.builder.master("local[*]").getOrCreate()
# preparing sample dataframe
data = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]
schema = ["c1", "c2"]
df = spark.createDataFrame(data, schema)
output = df.withColumn("matches", fun.count("c1").over(Window.partitionBy("c1"))) \
    .groupby(fun.col('c2').alias('out1')) \
    .agg(fun.first(fun.col("c1")).alias("out2"), fun.max("matches").alias("matches"))
output.show()
# output
+----+----+-------+
|out1|out2|matches|
+----+----+-------+
| 9| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 2| 1| 3|
+----+----+-------+

Spark dataframe self-joins are producing empty dataframe as a result

Below is my data in csv which I read into dataframe.
id,pid,pname,ppid
1, 1, 5, -1
2, 1, 7, -1
3, 2, 9, 1
4, 2, 11, 1
5, 3, 5, 1
6, 4, 7, 2
7, 1, 9, 3
I am reading that data into a dataframe data_df and trying to do a self-join on different columns, but the resulting dataframes are empty. I have tried multiple options.
Below is my code. Only the last join, joined4, produces a result.
val joined = data_df.as("first").join(data_df.as("second")).where( col("first.ppid") === col("second.pid"))
joined.show(50, truncate = false)
val joined2 = data_df.as("first").join(data_df.as("second"), col("first.ppid") === col("second.pid"), "inner")
joined2.show(50, truncate = false)
val df1 = data_df.as("df1")
val df2 = data_df.as("df2")
val joined3 = df1.join(df2, $"df1.ppid" === $"df2.id")
joined3.show(50, truncate = false)
val joined4 = data_df.as("df1").join(data_df.as("df2"), Seq("id"))
joined4.show(50, truncate = false)
Below are the output of joined, joined2, joined3, joined4 respectively :
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+-----+----+
|id |pid|pname|ppid|pid|pname|ppid|
+---+---+-----+----+---+-----+----+
| 1 | 1| 5| -1| 1| 5| -1|
| 2 | 1| 7| -1| 1| 7| -1|
| 3 | 2| 9| 1| 2| 9| 1|
| 4 | 2| 11| 1| 2| 11| 1|
| 5 | 3| 5| 1| 3| 5| 1|
| 6 | 4| 7| 2| 4| 7| 2|
| 7 | 1| 9| 3| 1| 9| 3|
+---+---+-----+----+---+-----+----+
Sorry, I later figured out that the spaces in the CSV were causing the issue. If I create a correctly structured CSV of the initial data, the problem disappears.
The correct CSV format is as follows.
id,pid,pname,ppid
1,1,5,-1
2,1,7,-1
3,2,9,1
4,2,11,1
5,3,5,1
6,4,7,2
7,1,9,3
Ideally, I can also use the option to ignore leading whitespace, as shown in the following answer:
val data_df = spark.read
  .schema(dataSchema)
  .option("mode", "FAILFAST")
  .option("header", "true")
  .option("ignoreLeadingWhiteSpace", "true")
  .csv(dataSourceName)
pySpark (v2.4) DataFrameReader adds leading whitespace to column names
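If regenerating the file isn't an option, another route is to strip the stray whitespace after reading. A PySpark sketch of that idea, under the assumption that everything came in as plain strings (no schema, no ignoreLeadingWhiteSpace), with the integer cast as my own addition:
from pyspark.sql import functions as F

# Trim spaces out of both the column names and the string values,
# then cast back to the intended integer types before joining.
clean_df = data_df.toDF(*[c.strip() for c in data_df.columns])
clean_df = clean_df.select([F.trim(F.col(c)).cast("int").alias(c) for c in clean_df.columns])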

How to filter out rows based on previous consecutive rows?

I have a requirement where a dataframe is sorted by col1 (a timestamp) and I need to filter on col2.
Any row whose col2 value is less than the col2 value of a previous row needs to be filtered out; the result should have monotonically increasing col2 values.
Note that this is not just about pairs of adjacent rows.
For example, say the col2 values of 4 rows are 4, 2, 3, 5. The result should be 4, 5, as both the 2nd and 3rd rows are less than 4 (the first row's value).
val input = Seq(
(1,4), (2,2), (3,3), (4,5), (5, 1), (6, 9), (7, 6)
).toDF("timestamp", "value")
scala> input.show
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 2| 2|
| 3| 3|
| 4| 5|
| 5| 1|
| 6| 9|
| 7| 6|
+---------+-----+
val expected = Seq((1,4), (4,5), (6, 9)).toDF("timestamp", "value")
scala> expected.show
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 4| 5|
| 6| 9|
+---------+-----+
Please note that:
rows 2 and 3 are filtered out because their values are less than the value in row 1, i.e. 4
row 5 is filtered out because its value (1) is less than the value in row 4, i.e. 5, and row 7 because its value (6) is less than the value in row 6, i.e. 9
Generically speaking, is there a way to filter rows based on comparison of value of one row with value in the previous rows?
I think what you're after is called running maximum (after running total). That always leads me to use windowed aggregation.
// I made the input a bit more tricky
val input = Seq(
(1,4), (2,2), (3,3), (4,5), (5, 1), (6, 9), (7, 6)
).toDF("timestamp", "value")
scala> input.show
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 2| 2|
| 3| 3|
| 4| 5|
| 5| 1|
| 6| 9|
| 7| 6|
+---------+-----+
I'm aiming at the following expected result. Correct me if I'm wrong.
val expected = Seq((1,4), (4,5), (6, 9)).toDF("timestamp", "value")
scala> expected.show
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 4| 5|
| 6| 9|
+---------+-----+
The trick to use for "running" problems is to use rangeBetween when defining a window specification.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

val ts = Window
  .orderBy("timestamp")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
With the window spec you filter out what you want to get rid of from the result and you're done.
val result = input
  .withColumn("running_max", max("value") over ts)
  .where($"running_max" === $"value")
  .select("timestamp", "value")
scala> result.show
18/05/29 22:09:18 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 4| 5|
| 6| 9|
+---------+-----+
As you can see, it's not very efficient, since it only uses a single partition, which leads to poor, effectively single-threaded execution, not much different from running the computation on a single machine.
I think we could partition the input, calculate the running maximum within each partition, then combine the partial results and run the running-maximum calculation again over them. Just a thought; I have not tried it out myself.
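For what it's worth, a rough PySpark sketch of that untried idea could look like the following. Here input_df stands for the input DataFrame from the question, the bucket width is an arbitrary assumption, and the second window still runs on a single partition, but only over one row per bucket:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Bucket rows by timestamp range so the expensive window can be partitioned.
bucketed = input_df.withColumn("bucket", (F.col("timestamp") / 1000).cast("long"))

# Pass 1: running max inside each bucket (parallel across buckets).
within = (Window.partitionBy("bucket")
                .orderBy("timestamp")
                .rangeBetween(Window.unboundedPreceding, Window.currentRow))
local = bucketed.withColumn("local_max", F.max("value").over(within))

# Pass 2: the max carried over from all earlier buckets (one row per bucket).
per_bucket = bucketed.groupBy("bucket").agg(F.max("value").alias("bucket_max"))
earlier = Window.orderBy("bucket").rowsBetween(Window.unboundedPreceding, -1)
carry = per_bucket.withColumn("carry_max", F.max("bucket_max").over(earlier))

# Combine: the true running max is the larger of the local and carried maxima.
result = (local.join(carry.select("bucket", "carry_max"), "bucket", "left")
               .withColumn("running_max",
                           F.greatest(F.col("local_max"),
                                      F.coalesce(F.col("carry_max"), F.col("local_max"))))
               .where(F.col("running_max") == F.col("value"))
               .select("timestamp", "value"))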
Checking equality with the running maximum should do the trick:
val input = Seq((1,4), (2,2), (3,3), (4,5), (5, 1), (6, 9), (7, 6)).toDF("timestamp", "value")
input.show()
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 2| 2|
| 3| 3|
| 4| 5|
| 5| 1|
| 6| 9|
| 7| 6|
+---------+-----+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

input
  .withColumn("max", max($"value").over(Window.orderBy($"timestamp")))
  .where($"value" === $"max").drop($"max")
  .show()
+---------+-----+
|timestamp|value|
+---------+-----+
| 1| 4|
| 4| 5|
| 6| 9|
+---------+-----+

PySpark : change column names of a df based on relations defined in another df

I have two Spark dataframes loaded from CSV, of the form:
mapping_fields (the df with mapped names):
new_name old_name
A aa
B bb
C cc
and
aa bb cc dd
1 2 3 43
12 21 4 37
to be transformed into:
A B C D
1 2 3
12 21 4
As dd didn't have any mapping in the original table, the D column should contain all null values.
How can I do this without converting the mapping_df into a dictionary and checking individually for mapped names? (this would mean I have to collect the mapping_fields and check, which kind of contradicts my use-case of distributedly handling all the datasets)
Thanks!
With melt borrowed from here you could:
from pyspark.sql import functions as f

mapping_fields = spark.createDataFrame(
    [("A", "aa"), ("B", "bb"), ("C", "cc")],
    ("new_name", "old_name"))

df = spark.createDataFrame(
    [(1, 2, 3, 43), (12, 21, 4, 37)],
    ("aa", "bb", "cc", "dd"))

(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"], "left_outer")
    .withColumn("value", f.when(f.col("new_name").isNotNull(), f.col("value")))
    .withColumn("new_name", f.coalesce("new_name", f.upper(f.col("old_name"))))
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
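Note that melt is not a built-in; the helper borrowed above is usually defined along these lines (a sketch of the common pattern, not necessarily the exact version linked):
from pyspark.sql import functions as f

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # Pair every value column name with its value, then explode to long format.
    vars_and_vals = f.array(*(
        f.struct(f.lit(c).alias(var_name), f.col(c).alias(value_name))
        for c in value_vars))
    tmp = df.withColumn("_vars_and_vals", f.explode(vars_and_vals))
    cols = id_vars + [
        f.col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return tmp.select(*cols)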
That said, nothing in your description really justifies this approach. Because the number of columns is fairly limited, I'd rather:
mapping = dict(
    mapping_fields
    .filter(f.col("old_name").isin(df.columns))
    .select("old_name", "new_name").collect())

df.select([
    (f.lit(None).cast(t) if c not in mapping else f.col(c)).alias(mapping.get(c, c.upper()))
    for (c, t) in df.dtypes]).show()
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore no-matches:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"])
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
or
df.select([
    f.col(c).alias(mapping.get(c))
    for (c, t) in df.dtypes if c in mapping])
I tried with a simple for loop; hope this helps too.
from pyspark.sql import functions as F
l1 = [('A','aa'),('B','bb'),('C','cc')]
l2 = [(1,2,3,43),(12,21,4,37)]
df1 = spark.createDataFrame(l1,['new_name','old_name'])
df2 = spark.createDataFrame(l2,['aa','bb','cc','dd'])
>>> df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
| A| aa|
| B| bb|
| C| cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
| 1| 2| 3| 43|
| 12| 21| 4| 37|
+---+---+---+---+
when you need the missing column with null values,
>>> cols = df2.columns
>>> for i in cols:
...     val = df1.where(df1['old_name'] == i).first()
...     if val is not None:
...         df2 = df2.withColumnRenamed(i, val['new_name'])
...     else:
...         df2 = df2.withColumn(i, F.lit(None))
>>> df2.show()
+---+---+---+----+
| A| B| C| dd|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
When we need only the mapped columns, change the else part:
else:
    df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 12| 21| 4|
+---+---+---+
This will transform the original df2 dataframe though.
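If you want to keep df2 as it was loaded, a small variation is to accumulate into a new variable instead (a sketch reusing the loop above; DataFrames themselves are immutable, so only the variable binding changes):
# Work on a separate reference so the original df2 binding is left untouched.
renamed = df2
for i in cols:
    val = df1.where(df1['old_name'] == i).first()
    if val is not None:
        renamed = renamed.withColumnRenamed(i, val['new_name'])
    else:
        renamed = renamed.withColumn(i, F.lit(None))
renamed.show()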

PySpark: modify column values when another column value satisfies a condition

I have a PySpark Dataframe with two columns:
+---+----+
| Id|Rank|
+---+----+
| a| 5|
| b| 7|
| c| 8|
| d| 1|
+---+----+
For each row, I'm looking to replace Id column with "other" if Rank column is larger than 5.
If I use pseudocode to explain:
for row in df:
    if row.Rank > 5:
        replace(row.Id, "other")
The result should look like this:
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Any clue how to achieve this? Thanks!!!
To create this Dataframe:
df = spark.createDataFrame([('a', 5), ('b', 7), ('c', 8), ('d', 1)], ['Id', 'Rank'])
You can use when and otherwise like -
from pyspark.sql.functions import *
df \
    .withColumn('Id_New', when(df.Rank <= 5, df.Id).otherwise('other')) \
    .drop(df.Id) \
    .select(col('Id_New').alias('Id'), col('Rank')) \
    .show()
This gives the following output:
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Starting with @Pushkr's solution, couldn't you just use the following?
from pyspark.sql.functions import *
df.withColumn('Id',when(df.Rank <= 5,df.Id).otherwise('other')).show()
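The same replacement can also be expressed as a SQL CASE inside expr, if that reads better to you; a small sketch:
from pyspark.sql import functions as F

# Replace Id with "other" whenever Rank is greater than 5, keeping it otherwise.
df.withColumn("Id", F.expr("CASE WHEN Rank > 5 THEN 'other' ELSE Id END")).show()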
