Update a column with a where clause in PySpark

How do I update a column in a PySpark DataFrame with a where clause?
This is similar to the SQL operation:
UPDATE table1 SET alpha1 = x WHERE alpha2 < 6;
where alpha1 and alpha2 are columns of table1.
For example, I have a DataFrame table1 with the values below:
table1

alpha1  alpha2
3       7
4       5
5       4
6       8
DataFrame table1 after the update:
alpha1  alpha2
3       7
x       5
x       4
6       8
How can I do this with a PySpark DataFrame?

You are looking for the when function:
from pyspark.sql import functions as F

df = spark.createDataFrame([("3", 7), ("4", 5), ("5", 4), ("6", 8)], ["alpha1", "alpha2"])
df.show()
# +------+------+
# |alpha1|alpha2|
# +------+------+
# |     3|     7|
# |     4|     5|
# |     5|     4|
# |     6|     8|
# +------+------+

df2 = df.withColumn("alpha1", F.when(df["alpha2"] < 6, "x").otherwise(df["alpha1"]))
df2.show()
# +------+------+
# |alpha1|alpha2|
# +------+------+
# |     3|     7|
# |     x|     5|
# |     x|     4|
# |     6|     8|
# +------+------+
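If you prefer SQL-style syntax, the same conditional update can also be written with expr; a minimal sketch, assuming the same df as above:

from pyspark.sql import functions as F

# Equivalent CASE WHEN expression; rows not matching the condition keep their original alpha1
df3 = df.withColumn("alpha1", F.expr("CASE WHEN alpha2 < 6 THEN 'x' ELSE alpha1 END"))
df3.show()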

Related

PySpark: concat two Spark DataFrames sideways without a join, efficiently

Hi, I have a sparse DataFrame that was loaded with the mergeSchema option.

DF
name  A1    A2    B1    B2    ...  partitioned_name
A     1     1     null  null       partition_a
B     2     2     null  null       partition_a
A     null  null  3     4          partition_b
B     null  null  3     4          partition_b

to

DF
name  A1  A2  B1  B2  ...
A     1   1   3   4
B     2   2   3   4

Any ideas for doing this without a join, for efficiency (and without RDDs, because the data is huge)? I was thinking of something like pandas concat(axis=1), since all the tables are sorted.
If that pattern repeats and you don't mind hardcoding the column names:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        ('A', '1', '1', 'null', 'null', 'partition_a'),
        ('B', '2', '2', 'null', 'null', 'partition_a'),
        ('A', 'null', 'null', '3', '4', 'partition_b'),
        ('B', 'null', 'null', '3', '4', 'partition_b')
    ],
    ['name', 'A1', 'A2', 'B1', 'B2', 'partitioned_name']
)\
    .withColumn('A1', F.col('A1').cast('integer'))\
    .withColumn('A2', F.col('A2').cast('integer'))\
    .withColumn('B1', F.col('B1').cast('integer'))\
    .withColumn('B2', F.col('B2').cast('integer'))

df.show()
# +----+----+----+----+----+----------------+
# |name|  A1|  A2|  B1|  B2|partitioned_name|
# +----+----+----+----+----+----------------+
# |   A|   1|   1|null|null|     partition_a|
# |   B|   2|   2|null|null|     partition_a|
# |   A|null|null|   3|   4|     partition_b|
# |   B|null|null|   3|   4|     partition_b|
# +----+----+----+----+----+----------------+

# the value columns (used in the dynamic variant below)
cols_to_agg = [col for col in df.columns if col not in ["name", "partitioned_name"]]

df\
    .groupby('name')\
    .agg(F.sum('A1').alias('A1'),
         F.sum('A2').alias('A2'),
         F.sum('B1').alias('B1'),
         F.sum('B2').alias('B2'))\
    .show()
# +----+---+---+---+---+
# |name| A1| A2| B1| B2|
# +----+---+---+---+---+
# |   A|  1|  1|  3|  4|
# |   B|  2|  2|  3|  4|
# +----+---+---+---+---+
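If the column list is long or changes, the same aggregation can be built dynamically from cols_to_agg instead of hardcoding each column; a minimal sketch, assuming each (name, column) pair has at most one non-null value, so F.first with ignorenulls=True (or F.sum) picks it up:

import pyspark.sql.functions as F

# Build one aggregation per value column; result should match the hardcoded version above
df.groupby('name')\
    .agg(*[F.first(c, ignorenulls=True).alias(c) for c in cols_to_agg])\
    .show()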

Spark DataFrame not able to replace NULL values

The code below works fine, but if any one of the five columns SAL1, SAL2, SAL3, SAL4, SAL5 is NULL, the corresponding TOTAL_SALARY comes out as NULL.
It looks like some null handling or a Spark UDF is needed; could you please help with that?
Input:

NO  NAME  ADDR  SAL1  SAL2  SAL3  SAL4  SAL5
1   ABC   IND   100   200   300   null  400
2   XYZ   USA   200   333   209   232   444

The second record's sum comes out fine, but in the first record, because of the null in SAL4, the output also comes out as null.
from pyspark.shell import spark
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
sc = spark.sparkContext
df = spark.read.option("header","true").option("delimiter", ",").csv("C:\\TEST.txt")
df.createOrReplaceTempView("table1")
df1 = spark.sql( "select * from table1" )
df2 = df1.groupBy('NO', 'NAME', 'ADDR').agg(F.sum(df1.SAL1 + df1.SAL2 + df1.SAL3 + df1.SAL4 + df1.SAL5).alias("TOTAL_SALARY"))
df2.show()
Thanks in advance
Just put a na.fill(0) in your code. This would replace the NULL values with 0 and you should be able to perform the operation.
So your last line should look like:
df2 = df1.na.fill(0).groupBy('NO', 'NAME', 'ADDR').agg(F.sum(df1.SAL1 + df1.SAL2 + df1.SAL3 + df1.SAL4 + df1.SAL5).alias("TOTAL_SALARY"))
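One caveat: if the CSV is read without schema inference, the SAL columns will be strings, and na.fill(0) with an integer only fills numeric columns. A minimal sketch of casting first (column names taken from the question):

from pyspark.sql import functions as F

sal_cols = ["SAL1", "SAL2", "SAL3", "SAL4", "SAL5"]

# Cast the salary columns to int so na.fill(0) and the addition behave numerically
df1_num = df1.select(
    "NO", "NAME", "ADDR",
    *[F.col(c).cast("int").alias(c) for c in sal_cols])

df2 = (df1_num.na.fill(0)
       .groupBy("NO", "NAME", "ADDR")
       .agg(F.sum(F.col("SAL1") + F.col("SAL2") + F.col("SAL3")
                  + F.col("SAL4") + F.col("SAL5")).alias("TOTAL_SALARY")))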
It also seems that the sum function should be able to handle Null values correctly. I just tested the following code:
from pyspark.sql.functions import col, sum

df_new = spark.createDataFrame([
    (1, 4), (2, None), (3, None), (4, None),
    (5, 5), (6, None), (7, None), (1, 4), (2, 8), (3, 9), (4, 1), (1, 2), (2, 1), (3, 3), (4, 7),
], ("customer_id", "balance"))

df_new.groupBy("customer_id").agg(sum(col("balance"))).show()
df_new.na.fill(0).groupBy("customer_id").agg(sum(col("balance"))).show()
Output:
+-----------+------------+
|customer_id|sum(balance)|
+-----------+------------+
|          7|        null|
|          6|        null|
|          5|           5|
|          1|          10|
|          3|          12|
|          2|           9|
|          4|           8|
+-----------+------------+

+-----------+------------+
|customer_id|sum(balance)|
+-----------+------------+
|          7|           0|
|          6|           0|
|          5|           5|
|          1|          10|
|          3|          12|
|          2|           9|
|          4|           8|
+-----------+------------+
Version 1 only contains NULL values where all values in the sum are NULL.
Version 2 returns 0 instead, since all NULL values are replaced with 0s.
Basically, the code below checks all five SAL fields and, if a value is null, replaces it with 0; otherwise it keeps the original value.
df1 = df.withColumn("SAL1", when(df.SAL1.isNull(), lit(0)).otherwise(df.SAL1))\
.withColumn("SAL2", when(df.SAL2.isNull(), lit(0)).otherwise(df.SAL2))\
.withColumn("SAL3", when(df.SAL3.isNull(), lit(0)).otherwise(df.SAL3))\
.withColumn("SAL4", when(df.SAL4.isNull(), lit(0)).otherwise(df.SAL4))\
.withColumn("SAL5", when(df.SAL5.isNull(), lit(0)).otherwise(df.SAL5))\

How to convert a PySpark pipeline RDD (tuple inside tuple) into a DataFrame?

I have a PySpark pipeline RDD like below:
(1, ([1,2,3,4], [5,3,4,5]))
(2, ([1,2,4,5], [4,5,6,7]))
I want to generate a DataFrame like below:

Id  sid  cid
1   1    5
1   2    3
1   3    4
1   4    5
2   1    4
2   2    5
2   4    6
2   5    7
Please help me on this.
If you have an RDD like this one,
rdd = sc.parallelize([
    (1, ([1, 2, 3, 4], [5, 3, 4, 5])),
    (2, ([1, 2, 4, 5], [4, 5, 6, 7]))
])
I would just use RDDs:
rdd.flatMap(lambda rec:
    ((rec[0], sid, cid) for sid, cid in zip(rec[1][0], rec[1][1]))
).toDF(["id", "sid", "cid"]).show()
# +---+---+---+
# | id|sid|cid|
# +---+---+---+
# |  1|  1|  5|
# |  1|  2|  3|
# |  1|  3|  4|
# |  1|  4|  5|
# |  2|  1|  4|
# |  2|  2|  5|
# |  2|  4|  6|
# |  2|  5|  7|
# +---+---+---+
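If the same data is available as a DataFrame with two array columns, an alternative without dropping to RDDs is to zip and explode the arrays; a minimal sketch, assuming Spark 2.4+ for arrays_zip (the zipped struct fields take the input column names):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, [1, 2, 3, 4], [5, 3, 4, 5]),
     (2, [1, 2, 4, 5], [4, 5, 6, 7])],
    ["id", "sids", "cids"])

# arrays_zip pairs the two arrays element-wise; explode turns each pair into a row
(df.select("id", F.explode(F.arrays_zip("sids", "cids")).alias("z"))
   .select("id", F.col("z.sids").alias("sid"), F.col("z.cids").alias("cid"))
   .show())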

PySpark: change column names of a DataFrame based on relations defined in another DataFrame

I have two Spark DataFrames loaded from CSV, of the form:
mapping_fields (the df with the mapped names):

new_name  old_name
A         aa
B         bb
C         cc

and

aa  bb  cc  dd
1   2   3   43
12  21  4   37

to be transformed into:

A   B   C   D
1   2   3
12  21  4

Since dd didn't have any mapping in the original table, the D column should contain all null values.
How can I do this without converting the mapping df into a dictionary and checking individually for mapped names? (That would mean I have to collect mapping_fields and check, which kind of contradicts my use case of handling all the datasets distributedly.)
Thanks!
With melt borrowed from here you could:
from pyspark.sql import functions as f

mapping_fields = spark.createDataFrame(
    [("A", "aa"), ("B", "bb"), ("C", "cc")],
    ("new_name", "old_name"))

df = spark.createDataFrame(
    [(1, 2, 3, 43), (12, 21, 4, 37)],
    ("aa", "bb", "cc", "dd"))

(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"], "left_outer")
    .withColumn("value", f.when(f.col("new_name").isNotNull(), f.col("value")))
    .withColumn("new_name", f.coalesce("new_name", f.upper(f.col("old_name"))))
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
+---+---+---+----+
|  A|  B|  C|  DD|
+---+---+---+----+
|  1|  2|  3|null|
| 12| 21|  4|null|
+---+---+---+----+
but in your description nothing justifies this. Because the number of columns is fairly limited, I'd rather:
mapping = dict(
    mapping_fields
    .filter(f.col("old_name").isin(df.columns))
    .select("old_name", "new_name").collect())

df.select([
    (f.lit(None).cast(t) if c not in mapping else f.col(c)).alias(mapping.get(c, c.upper()))
    for (c, t) in df.dtypes]).show()
+---+---+---+----+
|  A|  B|  C|  DD|
+---+---+---+----+
|  1|  2|  3|null|
| 12| 21|  4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore non-matches:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
      id_vars=["id"], value_vars=df.columns, var_name="old_name")
    .join(mapping_fields, ["old_name"])
    .groupBy("id")
    .pivot("new_name")
    .agg(f.first("value"))
    .drop("id")
    .show())
or
df.select([
    f.col(c).alias(mapping.get(c))
    for (c, t) in df.dtypes if c in mapping])
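For reference, melt is not built into PySpark; a minimal sketch of such a helper (an assumed implementation along the usual lines, not necessarily the exact one referenced above) could be:

from pyspark.sql import DataFrame
from pyspark.sql import functions as f

def melt(df: DataFrame, id_vars, value_vars,
         var_name="variable", value_name="value") -> DataFrame:
    """Unpivot value_vars into (var_name, value_name) rows, keeping id_vars."""
    # One struct <variable, value> per melted column, then exploded into rows
    vars_and_vals = f.array(*[
        f.struct(f.lit(c).alias(var_name), f.col(c).alias(value_name))
        for c in value_vars])
    tmp = df.withColumn("_vars_and_vals", f.explode(vars_and_vals))
    cols = list(id_vars) + [
        f.col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return tmp.select(*cols)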
I tried with a simple for loop; hope this helps too.
from pyspark.sql import functions as F

l1 = [('A', 'aa'), ('B', 'bb'), ('C', 'cc')]
l2 = [(1, 2, 3, 43), (12, 21, 4, 37)]
df1 = spark.createDataFrame(l1, ['new_name', 'old_name'])
df2 = spark.createDataFrame(l2, ['aa', 'bb', 'cc', 'dd'])
>>> df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
|       A|      aa|
|       B|      bb|
|       C|      cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
|  1|  2|  3| 43|
| 12| 21|  4| 37|
+---+---+---+---+
When you need the missing column with null values:
>>> cols = df2.columns
>>> for i in cols:
...     val = df1.where(df1['old_name'] == i).first()
...     if val is not None:
...         df2 = df2.withColumnRenamed(i, val['new_name'])
...     else:
...         df2 = df2.withColumn(i, F.lit(None))
>>> df2.show()
+---+---+---+----+
|  A|  B|  C|  dd|
+---+---+---+----+
|  1|  2|  3|null|
| 12| 21|  4|null|
+---+---+---+----+
When we need only the mapped columns, change the else part:
...     else:
...         df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  2|  3|
| 12| 21|  4|
+---+---+---+
Note that this overwrites the original df2 DataFrame, though.

Explode array data into rows in Spark [duplicate]

This question already has answers here:
Dividing complex rows of dataframe to simple rows in Pyspark
I have a dataset in the following way:

FieldA  FieldB  ArrayField
1       A       {1,2,3}
2       B       {3,5}
I would like to explode the data on ArrayField, so the output will look the following way:

FieldA  FieldB  ExplodedField
1       A       1
1       A       2
1       A       3
2       B       3
2       B       5

I mean I want to generate an output line for each item in the array in ArrayField, while keeping the values of the other fields.
How would you implement it in Spark?
Note that the input dataset is very large.
The explode function should get that done.
PySpark version:
>>> df = spark.createDataFrame([(1, "A", [1, 2, 3]), (2, "B", [3, 5])], ["col1", "col2", "col3"])
>>> from pyspark.sql.functions import explode
>>> df.withColumn("col3", explode(df.col3)).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+
Scala version:
scala> val df = Seq((1, "A", Seq(1, 2, 3)), (2, "B", Seq(3, 5))).toDF("col1", "col2", "col3")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]

scala> df.withColumn("col3", explode($"col3")).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   A|   1|
|   1|   A|   2|
|   1|   A|   3|
|   2|   B|   3|
|   2|   B|   5|
+----+----+----+
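As a side note, if some rows may have an empty or null array and you still want to keep them, explode_outer does that (available in PySpark since roughly Spark 2.3); a sketch using the same df as in the PySpark example above:

from pyspark.sql.functions import explode_outer

# Rows whose array is empty or null come through with col3 = null
df.withColumn("col3", explode_outer(df.col3)).show()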
You can use the explode function.
Below is a simple example for your case:
import org.apache.spark.sql.functions._
import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
  (1, "A", List(1, 2, 3)),
  (2, "B", List(3, 5))
)).toDF("FieldA", "FieldB", "FieldC")

data.withColumn("ExplodedField", explode($"FieldC")).drop("FieldC")
Hope this helps!
explode does exactly what you want. Docs:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode
Also, here is an example from a different question using it:
https://stackoverflow.com/a/44418598/1461187
