I need to replace null values present in a column in a Spark dataframe. Below is the code I tried:
df=df.na.fill(0,Seq('c_amount')).show()
But it throws an error: NameError: name 'Seq' is not defined
Below is my table
+------------+--------+
|c_account_id|c_amount|
+------------+--------+
|           1|    null|
|           2|     123|
|           3|    null|
+------------+--------+
Expected output
+------------+--------+
|c_account_id|c_amount|
+------------+--------+
|           1|       0|
|           2|     123|
|           3|       0|
+------------+--------+
You need to use it like this (Seq is Scala syntax; in PySpark, subset takes a plain list of column names):
df = df.fillna("<BLANK>", subset=['col_name'])
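For the column in the question, that would be the following (note that show() returns None, so call it separately rather than assigning its result back to df):

df = df.fillna(0, subset=['c_amount'])
df.show()

An equivalent form using the na interface is df.na.fill({'c_amount': 0}), which takes a dict mapping column names to fill values.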
I have a PySpark dataframe-
df = spark.createDataFrame([
    ("u1", 0),
    ("u2", 0),
    ("u3", 1),
    ("u4", 2),
    ("u5", 3),
    ("u6", 2)],
    ['user_id', 'medals'])
df.show()
Output-
+-------+------+
|user_id|medals|
+-------+------+
|     u1|     0|
|     u2|     0|
|     u3|     1|
|     u4|     2|
|     u5|     3|
|     u6|     2|
+-------+------+
I want to get the distribution of the medals column for all the users. So if there are n unique values in the medals column, I want n columns in the output dataframe, each holding the number of users who received that many medals.
The output for the data given above should look like-
+--------+--------+--------+--------+
|medals_0|medals_1|medals_2|medals_3|
+--------+--------+--------+--------+
|       2|       1|       2|       1|
+--------+--------+--------+--------+
How do I achieve this?
It's a simple pivot:
df.groupBy().pivot("medals").count().show()
+---+---+---+---+
|  0|  1|  2|  3|
+---+---+---+---+
|  2|  1|  2|  1|
+---+---+---+---+
If you want the cosmetic touch of adding the word medals to the column names, you can do this:
medals_df = df.groupBy().pivot("medals").count()
for col in medals_df.columns:
    medals_df = medals_df.withColumnRenamed(col, "medals_{}".format(col))
medals_df.show()
+--------+--------+--------+--------+
|medals_0|medals_1|medals_2|medals_3|
+--------+--------+--------+--------+
|       2|       1|       2|       1|
+--------+--------+--------+--------+
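A compact alternative for the rename is toDF, which replaces all column names at once (a sketch using the same dataframe as above):

medals_df = df.groupBy().pivot("medals").count()
# toDF takes the complete list of new column names positionally
medals_df = medals_df.toDF(*["medals_{}".format(c) for c in medals_df.columns])
medals_df.show()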
I have a column Quantity in a dataframe. I want to add a new column to this dataframe with each record having min("Quantity"). I am trying to use lit() in pyspark, something like below:
df.withColumn("min_quant", lit(min(col("Quantity")))).show().
It results in the error below:
grouping expressions sequence is empty, and `InvoiceNo` is not an aggregate function.
Wrap (min(`Quantity`) AS `min_quant`) in windowing function(s) or wrap
This works:
df.withColumn("min_quant", lit(2)).show()
But, in place of 2 here, I want min(Quantity). Am I missing something?
Please try using a window function, since min() is an aggregate function and needs either a grouping or a window.
from pyspark.sql.window import Window
import pyspark.sql.functions as F

windowSpec = Window.orderBy("InvoiceNo")
df.withColumn("min_quant", F.min("Quantity").over(windowSpec)).show()
Sample Result:
+---------+----+--------+---------+
|InvoiceNo|name|Quantity|min_quant|
+---------+----+--------+---------+
|        1| ABC|      19|        1|
|        1| ABC|       1|        1|
|        1| ABC|       8|        1|
|        1| ABC|     389|        1|
|        1| ABC|     196|        1|
|        2| CBD|      10|        1|
|        2| CBD|     946|        1|
|        3| XYZ|       3|        1|
+---------+----+--------+---------+
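Note that Window.orderBy without an explicit frame gives a running minimum up to the current row. If you want the overall minimum of Quantity on every row instead, one sketch is to aggregate first and cross join the one-row result back:

import pyspark.sql.functions as F

# single-row dataframe holding the global minimum
min_df = df.agg(F.min("Quantity").alias("min_quant"))
# attach it to every row
df.crossJoin(min_df).show()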
I'm using pyspark to generate a dataframe where I need to update the 'amt' column with the previous row's 'amt' value, but only when amt = 0.
For example, below is my dataframe
+---+-----+
| id|  amt|
+---+-----+
|  1|    5|
|  2|    0|
|  3|    0|
|  4|    6|
|  5|    0|
|  6|    3|
+---+-----+
Now, I want the following DF to be created: whenever amt = 0, the modi_amt column should contain the previous row's non-zero value; otherwise no change.
+---+-----+----------+
| id|  amt|  modi_amt|
+---+-----+----------+
|  1|    5|         5|
|  2|    0|         5|
|  3|    0|         5|
|  4|    6|         6|
|  5|    0|         6|
|  6|    3|         3|
+---+-----+----------+
I'm able to get the previous row's value, but I need help with the rows where multiple consecutive 0 amt values appear (for example, id = 2, 3).
The code I'm using:
from pyspark.sql import functions as F
from pyspark.sql.functions import when
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
DF = DF.withColumn("modi_amt", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt)).drop('prev_amt')
I'm getting the DF below:
+---+-----+----------+
| id|  amt|  modi_amt|
+---+-----+----------+
|  1|    5|         5|
|  2|    0|         5|
|  3|    0|         0|
|  4|    6|         6|
|  5|    0|         6|
|  6|    3|         3|
+---+-----+----------+
Basically, id 3 should also have modi_amt = 5.
I've used the approach below to get the output and it's working fine:
from pyspark.sql import functions as F
from pyspark.sql.functions import when, lit, last
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")
# this will hold the previous row's value
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
# this replaces a 0 amt with the previous row's value, but not for consecutive rows with 0 amt
DF = DF.withColumn("amt_adjusted", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt))
# set null where both amt and amt_adjusted are 0 (the consecutive-zero rows)
DF = DF.withColumn('zeroNonZero', when((DF.amt == 0) & (DF.amt_adjusted == 0), lit(None)).otherwise(DF.amt_adjusted))
# replace all null values with the previous non-zero amt value
DF = DF.withColumn('modi_amt', last("zeroNonZero", ignorenulls=True).over(Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)))
Is there a better approach?
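One possible simplification (a sketch, not from the original thread): turn the zeros into nulls inline, then a single last() with ignorenulls=True carries the most recent non-zero value forward, replacing the three intermediate columns with one expression:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# running frame from the first row up to the current row
w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
# treat 0 as missing (when without otherwise yields null), then carry the last non-null value forward
DF = DF.withColumn(
    "modi_amt",
    F.last(F.when(F.col("amt") != 0, F.col("amt")), ignorenulls=True).over(w)
)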
I have a data frame like below in pyspark.
+---+-------------+----+
| id|       device| val|
+---+-------------+----+
|  3|      mac pro|   1|
|  1|       iphone|   2|
|  1|android phone|   2|
|  1|   windows pc|   2|
|  1|   spy camera|   2|
|  2|   spy camera|   3|
|  2|       iphone|   3|
|  3|   spy camera|   1|
|  3|         cctv|   1|
+---+-------------+----+
I want to populate some columns based on the lists below:
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
I have done it like below:
import pyspark.sql.functions as F

df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
I got the desired result.
Now I want to make a change to the code: I want each pivoted count to be divided by the val value for that id.
I tried something like below but didn't get the correct result:
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat') / df.val).show()
How can I get what I want?
Edit:
Expected result
+---+----+------+--------+
| id|  pc|phones|security|
+---+----+------+--------+
|  1| 0.5|     1|     0.5|
|  3|   1|  null|       2|
|  2|null|  0.33|    0.33|
+---+----+------+--------+
The aggregation needs an aggregate function; a plain column reference is not accepted there.
Since the val column contains the same value within each group of the id column, you can use the inbuilt first function:
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat') / F.first(df.val)).show()
which should give you
+---+----+------------------+------------------+
| id|  pc|            phones|          security|
+---+----+------------------+------------------+
|  3| 1.0|              null|               2.0|
|  1| 0.5|               1.0|               0.5|
|  2|null|0.3333333333333333|0.3333333333333333|
+---+----+------------------+------------------+
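If you want the two-decimal form shown in the expected output, one option (an assumption about the desired formatting, not part of the original answer) is to wrap the aggregation in F.round:

# same 'cat' mapping as above; only the aggregation changes
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.round(F.count('cat') / F.first(df.val), 2)).show()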
I have a data frame in pyspark like below.
df.show()
+---+-------------+
| id|       device|
+---+-------------+
|  3|      mac pro|
|  1|       iphone|
|  1|android phone|
|  1|   windows pc|
|  1|   spy camera|
|  2|   spy camera|
|  2|       iphone|
|  3|   spy camera|
|  3|         cctv|
+---+-------------+
phone_list = ['iphone', 'android phone', 'nokia']
pc_list = ['windows pc', 'mac pro']
security_list = ['spy camera', 'cctv']
from pyspark.sql.functions import col
phones_df = df.filter(col('device').isin(phone_list)).groupBy("id").count().selectExpr("id as id", "count as phones")
phones_df.show()
+---+------+
| id|phones|
+---+------+
|  1|     2|
|  2|     1|
+---+------+
pc_df = df.filter(col('device').isin(pc_list)).groupBy("id").count().selectExpr("id as id", "count as pc")
pc_df.show()
+---+---+
| id| pc|
+---+---+
|  1|  1|
|  3|  1|
+---+---+
security_df = df.filter(col('device').isin(security_list)).groupBy("id").count().selectExpr("id as id", "count as security")
security_df.show()
+---+--------+
| id|security|
+---+--------+
|  1|       1|
|  2|       1|
|  3|       2|
+---+--------+
Then I want to do a full outer join on all three data frames. I have done it like below:
import pyspark.sql.functions as f

full_df = phones_df.join(pc_df, phones_df.id == pc_df.id, 'full_outer').select(f.coalesce(phones_df.id, pc_df.id).alias('id'), phones_df.phones, pc_df.pc)
final_df = full_df.join(security_df, full_df.id == security_df.id, 'full_outer').select(f.coalesce(full_df.id, security_df.id).alias('id'), full_df.phones, full_df.pc, security_df.security)
final_df.show()
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
|  1|     2|   1|       1|
|  2|     1|null|       1|
|  3|  null|   1|       2|
+---+------+----+--------+
I am able to get what I want but would like to simplify my code.
1) I want to create phones_df, pc_df, and security_df in a better way, because I am repeating the same code to create each of them and want to reduce that duplication.
2) I want to simplify the join statements into one statement.
How can I do this? Could anyone explain?
Here is one way: use when/otherwise to map the device column to categories, then pivot it to the desired output:
import pyspark.sql.functions as F

df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
+---+----+------+--------+
| id|  pc|phones|security|
+---+----+------+--------+
|  1|   1|     2|       1|
|  3|   1|  null|       2|
|  2|null|     1|       1|
+---+----+------+--------+
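If you'd rather keep the three separate count dataframes and just remove the duplication, here is a sketch (assuming the same names as in the question) that builds the per-category frames in a loop and folds the full outer joins with functools.reduce; joining on the column name keeps a single id column, so the coalesce calls go away:

from functools import reduce
from pyspark.sql.functions import col

categories = {'phones': phone_list, 'pc': pc_list, 'security': security_list}
# one count-per-id frame per category, with the count column renamed to the category name
dfs = [df.filter(col('device').isin(lst)).groupBy('id').count().withColumnRenamed('count', name)
       for name, lst in categories.items()]
# fold the full outer joins into one pass
final_df = reduce(lambda a, b: a.join(b, on='id', how='full_outer'), dfs)
final_df.show()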