Replace a column value with NULL in PySpark

How to replace incorrect column values (values with characters like * or #) with null?

Test dataset:
df = spark.createDataFrame(
[(10, '2021-08-16 00:54:43+01', 0.15, 'SMS'),
(11, '2021-08-16 00:04:29+01', 0.15, '*'),
(12, '2021-08-16 00:39:05+01', 0.15, '***')],
['_c0', 'Timestamp', 'Amount','Channel']
)
df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15 |SMS |
# |11 |2021-08-16 00:04:29+01|0.15 |* |
# |12 |2021-08-16 00:39:05+01|0.15 |*** |
# +---+----------------------+------+-------+
Script:
from pyspark.sql import functions as F
df = df.withColumn('Channel', F.when(~F.col('Channel').rlike(r'[\*#]+'), F.col('Channel')))
df.show(truncate=False)
# +---+----------------------+------+-------+
# |_c0|Timestamp |Amount|Channel|
# +---+----------------------+------+-------+
# |10 |2021-08-16 00:54:43+01|0.15 |SMS |
# |11 |2021-08-16 00:04:29+01|0.15 |null |
# |12 |2021-08-16 00:39:05+01|0.15 |null |
# +---+----------------------+------+-------+

So you have multiple options (a short sketch of the last two follows below):
The first option is to use the when function to condition the replacement for each value you want to replace; this is what the script above does.
The second option is to use the DataFrame replace method to map the bad values directly to null.
The third option is to use regexp_replace to strip the offending characters and then turn the resulting empty strings into null.
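A minimal sketch of the second and third options (df2 and df3 are just illustrative names; passing None as the replacement value assumes a reasonably recent Spark version):
from pyspark.sql import functions as F

# Option 2: exact-match replacement of the known bad values with null
df2 = df.replace(['*', '***'], None, subset=['Channel'])

# Option 3: strip the unwanted characters, then convert the now-empty strings to null
# (regexp_replace itself cannot return null directly)
df3 = (df
       .withColumn('Channel', F.regexp_replace('Channel', r'[*#]', ''))
       .withColumn('Channel', F.when(F.col('Channel') != '', F.col('Channel'))))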

Related

Join two big tables and get the most recent value

I want to join two Spark dataframes that have millions of rows. Assume 'id' is the common column of both dataframes, and both also have a 'date' column. However, the dates in the two tables may not match. If a record in the first table does not have a matching date in the second table, the most recent earlier observation of the second table's 'value' column should be used. Therefore, I cannot join on 'id' and 'date'. I have created sample dataframes below. What is the optimal way to perform this given that the data size is huge?
import pandas as pd
a = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3, 1,2,3], 'date': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-08', '2020-01-08', '2020-01-08', '2020-01-21', '2020-01-21', '2020-01-21', '2020-01-31', '2020-01-31', '2020-01-31']})
a = spark.createDataFrame(a)
b = pd.DataFrame({'id':[1,2,3,1,2,1,3,1,2], 'date': ['2019-12-25', '2019-12-25', '2019-12-25', '2020-01-08', '2020-01-08', '2020-01-21', '2020-01-21', '2020-01-31', '2020-01-31'], 'value': [0.1,0.2,0.3,1,2,10,30,0.1,0.2]})
b = spark.createDataFrame(b)
required_result = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3, 1,2,3], 'date': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-08', '2020-01-08', '2020-01-08', '2020-01-21', '2020-01-21', '2020-01-21', '2020-01-31', '2020-01-31', '2020-01-31'],
'value': [0.1,0.2,0.3, 1,2,0.3,10, 2,30,0.1,0.2, 30]})
You could join on id and keep dates from the second dataframe which are equal to or lower than the first dataframe's dates.
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

data1_sdf, data2_sdf = a, b  # the question's dataframes

data1_sdf.join(data2_sdf.withColumnRenamed('date', 'date_b'),
               [data1_sdf.id == data2_sdf.id,
                data1_sdf.date >= func.col('date_b')],
               'left'). \
    drop(data2_sdf.id). \
    withColumn('dates_diff', func.datediff('date_b', 'date')). \
    withColumn('max_dtdiff',
               func.max('dates_diff').over(wd.partitionBy('id', 'date'))). \
    filter(func.col('max_dtdiff') == func.col('dates_diff')). \
    drop('dates_diff', 'max_dtdiff'). \
    orderBy('id', 'date'). \
    show()
# +---+----------+----------+-----+
# | id| date| date_b|value|
# +---+----------+----------+-----+
# | 1|2020-01-01|2019-12-25| 0.1|
# | 1|2020-01-08|2020-01-08| 1.0|
# | 1|2020-01-21|2020-01-21| 10.0|
# | 1|2020-01-31|2020-01-31| 0.1|
# | 2|2020-01-01|2019-12-25| 0.2|
# | 2|2020-01-08|2020-01-08| 2.0|
# | 2|2020-01-21|2020-01-08| 2.0|
# | 2|2020-01-31|2020-01-31| 0.2|
# | 3|2020-01-01|2019-12-25| 0.3|
# | 3|2020-01-08|2019-12-25| 0.3|
# | 3|2020-01-21|2020-01-21| 30.0|
# | 3|2020-01-31|2020-01-21| 30.0|
# +---+----------+----------+-----+
It seems that you can join just on id, as this key looks well distributed. You could aggregate df b a bit, join both dfs, then filter and extract the value with the max date.
from pyspark.sql import functions as F

# collect b's (date, value) pairs per id; date and value are stored as strings inside the array
b = b.groupBy('id').agg(F.collect_list(F.array('date', 'value')).alias('dv'))

df = a.join(b, 'id', 'left')
df = df.select(
    a['*'],
    # keep only pairs whose date is <= a.date, then array_max picks the pair with the
    # latest date (lexicographic max works for 'yyyy-MM-dd' strings), and [1] is its value
    F.array_max(F.filter('dv', lambda x: x[0] <= F.col('date')))[1].alias('value')
)
df.show()
# +---+----------+-----+
# | id| date|value|
# +---+----------+-----+
# | 1|2020-01-01| 0.1|
# | 1|2020-01-08| 1.0|
# | 3|2020-01-01| 0.3|
# | 3|2020-01-08| 0.3|
# | 2|2020-01-01| 0.2|
# | 2|2020-01-08| 2.0|
# | 1|2020-01-21| 10.0|
# | 1|2020-01-31| 0.1|
# | 3|2020-01-21| 30.0|
# | 3|2020-01-31| 30.0|
# | 2|2020-01-21| 2.0|
# | 2|2020-01-31| 0.2|
# +---+----------+-----+
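As a side note, if the aggregated b (one row per id after the groupBy) is small enough to fit on the executors, a broadcast hint should avoid shuffling the large dataframe a; a minimal sketch under that assumption:
from pyspark.sql import functions as F

# Broadcast the pre-aggregated lookup so the big dataframe `a` is not shuffled
df = a.join(F.broadcast(b), 'id', 'left')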

Conditionally get previous row value

I have the following dataset
from pyspark.sql.functions import to_timestamp

columns = ['id','trandatetime','code','zip']
data = [('1','2020-02-06T17:33:21.000+0000', '0','35763'),('1','2020-02-06T17:39:55.000+0000', '0','35763'), ('1','2020-02-07T06:06:42.000+0000', '0','35741'), ('1','2020-02-07T06:28:17.000+0000', '4','94043'),('1','2020-02-07T07:12:13.000+0000','0','35802'), ('1','2020-02-07T08:23:29.000+0000', '0','30738')]
df = spark.createDataFrame(data).toDF(*columns)
df = df.withColumn("trandatetime", to_timestamp("trandatetime"))
+---+--------------------+----+-----+
| id| trandatetime|code| zip|
+---+--------------------+----+-----+
| 1|2020-02-06T17:33:...| 0|35763|
| 1|2020-02-06T17:39:...| 0|35763|
| 1|2020-02-07T06:06:...| 0|35741|
| 1|2020-02-07T06:28:...| 4|94043|
| 1|2020-02-07T07:12:...| 0|35802|
| 1|2020-02-07T08:23:...| 0|30738|
+---+--------------------+----+-----+
I am trying to get the previous row's zip when code = 0, within a time period.
This is my attempt, but you can see that the row where code is 4 is getting a value when it should be null. The row after the 4 is null, but that one should have a value in it.
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn("timestamp", unix_timestamp("trandatetime"))
w = Window.partitionBy('id').orderBy('timestamp').rangeBetween(-60*60*24, -1)
df = df.withColumn("Card_Present_Last_Zip", F.last(F.when(col("code") == '0', col("zip"))).over(w))
+---+--------------------+----+-----+----------+---------------------+
| id| trandatetime|code| zip| timestamp|Card_Present_Last_Zip|
+---+--------------------+----+-----+----------+---------------------+
| 1|2020-02-06T17:33:...| 0|35763|1581010401| null|
| 1|2020-02-06T17:39:...| 0|35763|1581010795| 35763|
| 1|2020-02-07T06:06:...| 0|35741|1581055602| 35763|
| 1|2020-02-07T06:28:...| 4|94043|1581056897| 35741|
| 1|2020-02-07T07:12:...| 0|35802|1581059533| null|
| 1|2020-02-07T08:23:...| 0|30738|1581063809| 35802|
+---+--------------------+----+-----+----------+---------------------+
Put the last function (with ignorenulls set to True) into another when clause, so the window result is only applied to rows with code = '0':
w = Window.partitionBy('id').orderBy('timestamp').rangeBetween(-60*60*24,-1)
df = (df
.withColumn("timestamp", F.unix_timestamp("trandatetime"))
.withColumn("Card_Present_Last_Zip", F.when(F.col("code") == '0', F.last(F.when(F.col("code") == '0', F.col("zip")), ignorenulls=True).over(w)))
)
df.show()
# +---+-------------------+----+-----+----------+---------------------+
# | id| trandatetime|code| zip| timestamp|Card_Present_Last_Zip|
# +---+-------------------+----+-----+----------+---------------------+
# | 1|2020-02-06 17:33:21| 0|35763|1581010401| null|
# | 1|2020-02-06 17:39:55| 0|35763|1581010795| 35763|
# | 1|2020-02-07 06:06:42| 0|35741|1581055602| 35763|
# | 1|2020-02-07 06:28:17| 4|94043|1581056897| null|
# | 1|2020-02-07 07:12:13| 0|35802|1581059533| 35741|
# | 1|2020-02-07 08:23:29| 0|30738|1581063809| 35802|
# +---+-------------------+----+-----+----------+---------------------+
You can use the window function lag():
window_spec = Window.partitionBy('id').orderBy('timestamp')
# prev_zip is simply the previous row's zip, regardless of that row's code
df.withColumn('prev_zip', lag('zip').over(window_spec)). \
    withColumn('Card_Present_Last_Zip', when(col('code') == 0, col('prev_zip')).otherwise(None)).show()

Removing NULL, NaN, empty space from PySpark DataFrame

I have a dataframe in PySpark which contains empty space, Null, and NaN.
I want to remove rows which have any of those. I tried the commands below, but nothing seems to work.
myDF.na.drop().show()
myDF.na.drop(how='any').show()
Below is the dataframe:
+---+----------+----------+-----+-----+
|age| category| date|empId| name|
+---+----------+----------+-----+-----+
| 25|electronic|17-01-2018| 101| abc|
| 24| sports|16-01-2018| 102| def|
| 23|electronic|17-01-2018| 103| hhh|
| 23|electronic|16-01-2018| 104| yyy|
| 29| men|12-01-2018| 105| ajay|
| 31| kids|17-01-2018| 106|vijay|
| | Men| nan| 107|Sumit|
+---+----------+----------+-----+-----+
What am I missing? What is the best way to tackle NULL, NaN, or empty spaces so that there is no problem in the actual calculation?
NaN (not a number) has a different meaning than NULL, and an empty string is just a normal value (it can be converted to NULL automatically by the csv reader), so na.drop won't match these.
You can convert all of them to null and then drop:
from pyspark.sql.functions import col, isnan, when, trim
df = spark.createDataFrame([
("", 1, 2.0), ("foo", None, 3.0), ("bar", 1, float("NaN")),
("good", 42, 42.0)])
def to_null(c):
    return when(~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == "")), col(c))
df.select([to_null(c).alias(c) for c in df.columns]).na.drop().show()
# +----+---+----+
# | _1| _2| _3|
# +----+---+----+
# |good| 42|42.0|
# +----+---+----+
Maybe in your case it is not important, but this code (modified answer of Alper t. Turker) can handle the different datatypes accordingly. The dataTypes can vary according to your DataFrame, of course. (tested on Spark version: 2.4)
from pyspark.sql.functions import col, isnan, when, trim
# Find out the dataType and act accordingly
def to_null_bool(c, dt):
    if dt == "double":
        return ~(c.isNull() | isnan(c))
    elif dt == "string":
        return ~c.isNull() & (trim(c) != "")
    else:
        return ~c.isNull()

# Only keep values that are not null, NaN, or empty strings
def to_null(c, dt):
    c = col(c)
    return when(to_null_bool(c, dt), c)

df.select([to_null(c, dt[1]).alias(c) for c, dt in zip(df.columns, df.dtypes)]).na.drop(how="any").show()

How to set a new list value based on a condition in a dataframe in PySpark?

I have a DataFrame like below.
+---+------------------------------------------+
|id |features |
+---+------------------------------------------+
|1 |[6.629056, 0.26771536, 0.79063195,0.8923] |
|2 |[1.4850719, 0.66458416, -2.1034079] |
|3 |[3.0975454, 1.571849, 1.9053307] |
|4 |[2.526619, -0.33559006, -1.4565022] |
|5 |[-0.9286196, -0.57326394, 4.481531] |
|6 |[3.594114, 1.3512149, 1.6967168] |
+---+------------------------------------------+
I want to set some of the features values based on a condition, i.e. where id=1, id=2 or id=6.
Where id=1, the current features value is [6.629056, 0.26771536, 0.79063195, 0.8923], but I want to set it to [0,0,0,0].
Where id=2, the current features value is [1.4850719, 0.66458416, -2.1034079], but I want to set it to [0,0,0].
My final output will be:
+---+------------------------------------+
|id |features                            |
+---+------------------------------------+
|1  |[0, 0, 0, 0]                        |
|2  |[0,0,0]                             |
|3  |[3.0975454, 1.571849, 1.9053307]    |
|4  |[2.526619, -0.33559006, -1.4565022] |
|5  |[-0.9286196, -0.57326394, 4.481531] |
|6  |[0,0,0]                             |
+---+------------------------------------+
Shaido's answer is fine if you have a limited set of ids for which you also know the length of the corresponding features array.
If that's not the case, it is cleaner to use a UDF, and the ids that you want to convert can be kept in a separate Seq:
In Scala
import org.apache.spark.sql.functions.udf
import scala.collection.mutable.WrappedArray

val arr = Seq(1, 2, 6)
val fillArray = udf { (id: Int, array: WrappedArray[Double]) =>
  if (arr.contains(id)) Seq.fill[Double](array.length)(0.0)
  else array
}
df.withColumn("new_features", fillArray($"id", $"features")).show(false)
In Python
from pyspark.sql import functions as f
from pyspark.sql.types import *
arr = [1, 2, 6]

def fillArray(id, features):
    if id in arr:
        return [0.0] * len(features)
    else:
        return features

fill_array_udf = f.udf(fillArray, ArrayType(DoubleType()))
df.withColumn("new_features", fill_array_udf(f.col("id"), f.col("features"))).show()
Output
+---+------------------------------------------+-----------------------------------+
|id |features |new_features |
+---+------------------------------------------+-----------------------------------+
|1 |[6.629056, 0.26771536, 0.79063195, 0.8923]|[0.0, 0.0, 0.0, 0.0] |
|2 |[1.4850719, 0.66458416, -2.1034079] |[0.0, 0.0, 0.0] |
|3 |[3.0975454, 1.571849, 1.9053307] |[3.0975454, 1.571849, 1.9053307] |
|4 |[2.526619, -0.33559006, -1.4565022] |[2.526619, -0.33559006, -1.4565022]|
|5 |[-0.9286196, -0.57326394, 4.481531] |[-0.9286196, -0.57326394, 4.481531]|
|6 |[3.594114, 1.3512149, 1.6967168] |[0.0, 0.0, 0.0] |
+---+------------------------------------------+-----------------------------------+
Use when and otherwise if you have a small set of ids to change:
df.withColumn("features",
when(df.id === 1, array(lit(0), lit(0), lit(0), lit(0)))
.when(df.id === 2 | df.id === 6, array(lit(0), lit(0), lit(0)))
.otherwise(df.features)))
It should be faster than a UDF, but if there are many ids to change it quickly becomes a lot of code. In that case, use a UDF as in philantrovert's answer.
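On Spark 3.1+ you could also avoid both the UDF and the hard-coded array lengths with the higher-order function transform, which preserves each row's array length; a minimal sketch (ids_to_zero is just an illustrative name):
from pyspark.sql import functions as F

ids_to_zero = [1, 2, 6]  # ids whose features should be zeroed
df = df.withColumn(
    "new_features",
    F.when(F.col("id").isin(ids_to_zero),
           # replace every element with 0.0 while keeping the original length
           F.transform("features", lambda x: F.lit(0.0)))
     .otherwise(F.col("features")))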

Convert Python dictionary to Spark DataFrame

I have a Python dictionary:
dic = {
(u'aaa',u'bbb',u'ccc'):((0.3, 1.2, 1.3, 1.5), 1.4, 1),
(u'kkk',u'ggg',u'ccc',u'sss'):((0.6, 1.2, 1.7, 1.5), 1.4, 2)
}
I'd like to convert this dictionary to a Spark DataFrame with these columns:
['key', 'val_1', 'val_2', 'val_3', 'val_4', 'val_5', 'val_6']
Example row (1):
key                  | val_1 | val_2 | val_3 | val_4 | val_5 | val_6
u'aaa',u'bbb',u'ccc' | 0.3   | 1.2   | 1.3   | 1.5   | 1.4   | 1
Thank you in advance
Extract the items, convert each key to a list, and combine everything into a single tuple:
df = sc.parallelize([
    (list(k), ) + v[0] + v[1:]
    for k, v in dic.items()
]).toDF(['key', 'val_1', 'val_2', 'val_3', 'val_4', 'val_5', 'val_6'])
df.show()
## +--------------------+-----+-----+-----+-----+-----+-----+
## | key|val_1|val_2|val_3|val_4|val_5|val_6|
## +--------------------+-----+-----+-----+-----+-----+-----+
## | [aaa, bbb, ccc]| 0.3| 1.2| 1.3| 1.5| 1.4| 1|
## |[kkk, ggg, ccc, sss]| 0.6| 1.2| 1.7| 1.5| 1.4| 2|
## +--------------------+-----+-----+-----+-----+-----+-----+
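For what it's worth, the same reshaping can also go through spark.createDataFrame directly instead of sc.parallelize; a minimal sketch using tuple unpacking (assuming the dic from the question):
rows = [(list(k), *v[0], *v[1:]) for k, v in dic.items()]
df = spark.createDataFrame(
    rows, ['key', 'val_1', 'val_2', 'val_3', 'val_4', 'val_5', 'val_6'])
df.show()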
