How to detect null column in pyspark - apache-spark

I have a DataFrame that contains some null values. Some columns are entirely null.
>>> df.show()
+---+---+---+----+
| A| B| C| D|
+---+---+---+----+
|1.0|4.0|7.0|null|
|2.0|5.0|7.0|null|
|3.0|6.0|5.0|null|
+---+---+---+----+
In my case, I want to return a list of the names of the columns that contain only null values. My idea was to detect the constant columns (since the whole column contains the same null value).
This is how I did it:
nullColumns = [c for c, const in df.select([(min(c) == max(c)).alias(c) for c in df.columns]).first().asDict().items() if const]
But this does not treat all-null columns as constant; it only works with non-null values.
How should I do it then?

Extend the condition to
from pyspark.sql.functions import min, max
((min(c).isNull() & max(c).isNull()) | (min(c) == max(c))).alias(c)
or use eqNullSafe (PySpark 2.3 and later):
(min(c).eqNullSafe(max(c))).alias(c)

One way would be to do it implicitly: select each column, count its NULL values, and then compare this with the total number of rows. With your data, this would be:
spark.version
# u'2.2.0'
from pyspark.sql.functions import col
nullColumns = []
numRows = df.count()
for k in df.columns:
    nullRows = df.where(col(k).isNull()).count()
    if nullRows == numRows:  # i.e. if ALL values are NULL
        nullColumns.append(k)
nullColumns
# ['D']
But there is a simpler way: it turns out that the function countDistinct, when applied to a column with all NULL values, returns zero (0):
from pyspark.sql.functions import countDistinct
df.agg(countDistinct(df.D).alias('distinct')).collect()
# [Row(distinct=0)]
So the for loop now can be:
nullColumns = []
for k in df.columns:
    if df.agg(countDistinct(df[k])).collect()[0][0] == 0:
        nullColumns.append(k)
nullColumns
# ['D']
UPDATE (after comments): It seems possible to avoid collect in the second solution; since df.agg returns a dataframe with only one row, replacing collect with take(1) will safely do the job:
nullColumns = []
for k in df.columns:
    if df.agg(countDistinct(df[k])).take(1)[0][0] == 0:
        nullColumns.append(k)
nullColumns
# ['D']

How about this? To guarantee that a column contains only nulls, two properties must be satisfied:
(1) The min value is equal to the max value
(2) The min or max is null
Or, equivalently:
(1) The min AND max are both equal to None
Note that if property (2) is not checked, a column with values [null, 1, null, 1] would be incorrectly reported, since its min and max are both 1.
import pyspark.sql.functions as F
def get_null_column_names(df):
    column_names = []
    for col_name in df.columns:
        min_ = df.select(F.min(col_name)).first()[0]
        max_ = df.select(F.max(col_name)).first()[0]
        if min_ is None and max_ is None:
            column_names.append(col_name)
    return column_names
Here's an example in practice:
>>> rows = [(None, 18, None, None),
...         (1, None, None, None),
...         (1, 9, 4.0, None),
...         (None, 0, 0., None)]
>>> schema = "a: int, b: int, c: float, d:int"
>>> df = spark.createDataFrame(data=rows, schema=schema)
>>> df.show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
|null| 18|null|null|
| 1|null|null|null|
| 1| 9| 4.0|null|
|null| 0| 0.0|null|
+----+----+----+----+
>>> get_null_column_names(df)
['d']

Related

transform function in pyspark

I was reading the PySpark API reference for DataFrame, and the code snippet below for the transform function confuses me. I can't figure out why * is placed before sorted in the sort_columns_asc function defined below.
from pyspark.sql.functions import col
df = spark.createDataFrame([(1, 1.0), (2, 2.0)], ["int", "float"])
def cast_all_to_int(input_df):
    return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
def sort_columns_asc(input_df):
    return input_df.select(*sorted(input_df.columns))
df.transform(cast_all_to_int).transform(sort_columns_asc).show()
+-----+---+
|float|int|
+-----+---+
| 1| 1|
| 2| 2|
+-----+---+
Please help me clarify the confusion.
The * operator unpacks a list (or any other iterable) into separate positional arguments, which removes one level of nesting.
# 1D Array
collection1 = [1,2,3,4]
print(*collection1)
1 2 3 4
# 2D Array
collection2 = [[1,2,3,4]]
print(*collection2)
[1, 2, 3, 4]
In your example you are unpacking the sorted column names from
example = ["int", "float"]
to
print(*sorted(example))
float int
Check out this for further information.
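The effect on a function call can be seen with a plain Python function. In this sketch, select is a hypothetical stand-in that only collects its positional arguments; it is not the real DataFrame.select.

```python
def select(*cols):
    """Hypothetical stand-in for DataFrame.select: returns the arguments it received."""
    return cols


columns = ["int", "float"]

packed = select(sorted(columns))     # one argument: the whole list
unpacked = select(*sorted(columns))  # two arguments: "float" and "int"

print(packed)    # (['float', 'int'],)
print(unpacked)  # ('float', 'int')
```

DataFrame.select accepts both a single list and separate column arguments, which is why cast_all_to_int in the question works without the * while sort_columns_asc uses it.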

PySpark Compare Empty Map Literal

I want to drop rows in a PySpark DataFrame where a certain column contains an empty map. How do I do this? I can't seem to declare a typed empty MapType against which to compare my column. I have seen that in Scala you can use typedLit, but there seems to be no equivalent in PySpark. I have also tried using lit(...) and casting to a struct<string,int>, but I have found no acceptable argument for lit() (I tried None, which returns null, and {}, which raises an error).
I'm sure this is trivial but I haven't seen any docs on this!
Here is a solution using PySpark's built-in size function:
from pyspark.sql.functions import col, size
df = spark.createDataFrame(
[(1, {1:'A'} ),
(2, {2:'B'} ),
(3, {3:'C'} ),
(4, {}),
(5, None)]
).toDF("id", "map")
df.printSchema()
# root
# |-- id: long (nullable = true)
# |-- map: map (nullable = true)
# | |-- key: long
# | |-- value: string (valueContainsNull = true)
df.withColumn("is_empty", size(col("map")) <= 0).show()
# +---+--------+--------+
# | id| map|is_empty|
# +---+--------+--------+
# | 1|[1 -> A]| false|
# | 2|[2 -> B]| false|
# | 3|[3 -> C]| false|
# | 4| []| true|
# | 5| null| true|
# +---+--------+--------+
Note that the condition is size <= 0 because for a null input the function returns -1 if the spark.sql.legacy.sizeOfNull setting is true; otherwise it returns null. Here you can find more details.
Generic solution: comparing Map column and literal Map
For a more generic solution we can use the built-in function size in combination with a UDF that appends the string key + value of each item to a sorted list (thank you @jxc for pointing out the problem with the previous version). The assumption here is that two maps are equal when:
they have the same size
the string representation of key + value is identical between the items of the maps
The literal map is created from an arbitrary Python dictionary by combining its keys and values via map_from_arrays:
from pyspark.sql.functions import udf, lit, size, when, map_from_arrays, array
df = spark.createDataFrame([
[1, {}],
[2, {1:'A', 2:'B', 3:'C'}],
[3, {1:'A', 2:'B'}]
]).toDF("key", "map")
# Named py_dict to avoid shadowing the built-in dict type
py_dict = {1: 'A', 2: 'B'}
map_keys_ = array([lit(k) for k in py_dict.keys()])
map_values_ = array([lit(v) for v in py_dict.values()])
tmp_map = map_from_arrays(map_keys_, map_values_)
to_strlist_udf = udf(lambda d: sorted([str(k) + str(d[k]) for k in d.keys()]))
def map_equals(m1, m2):
    return when(
        (size(m1) == size(m2)) &
        (to_strlist_udf(m1) == to_strlist_udf(m2)), True
    ).otherwise(False)
df = df.withColumn("equals", map_equals(df["map"], tmp_map))
df.show(10, False)
# +---+------------------------+------+
# |key|map |equals|
# +---+------------------------+------+
# |1 |[] |false |
# |2 |[1 -> A, 2 -> B, 3 -> C]|false |
# |3 |[1 -> A, 2 -> B] |true |
# +---+------------------------+------+
Note: As you can see the pyspark == operator works pretty well for array comparison as well.

How to remove words that have less than three letters in PySpark?

I have a 'text' column in which arrays of tokens are stored. How to filter all these arrays so that the tokens are at least three letters long?
from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ['id', 'text']
vals = [
(1, ['I', 'am', 'good']),
(2, ['You', 'are', 'ok']),
]
df = spark.createDataFrame(vals, columns)
df.show()
# Had tried this but have TypeError: Column is not iterable
# df_clean = df.select('id', regexp_replace('text', [len(word) >= 3 for word
# in col('text')], ''))
# df_clean.show()
I expect to see:
id | text
1 | [good]
2 | [You, are]
This does it. You can decide whether or not to exclude rows; I added an extra column and filtered on it, but the choice is yours:
from pyspark.sql import functions as f
columns = ['id', 'text']
vals = [
(1, ['I', 'am', 'good']),
(2, ['You', 'are', 'ok']),
(3, ['ok'])
]
df = spark.createDataFrame(vals, columns)
#df.show()
df2 = df.withColumn("text_left_over", f.expr("filter(text, x -> not(length(x) < 3))"))
df2.show()
# This is the actual piece of logic you are looking for.
df3 = df.withColumn("text_left_over", f.expr("filter(text, x -> not(length(x) < 3))")).where(f.size(f.col("text_left_over")) > 0).drop("text")
df3.show()
returns:
+---+--------------+--------------+
| id| text|text_left_over|
+---+--------------+--------------+
| 1| [I, am, good]| [good]|
| 2|[You, are, ok]| [You, are]|
| 3| [ok]| []|
+---+--------------+--------------+
+---+--------------+
| id|text_left_over|
+---+--------------+
| 1| [good]|
| 2| [You, are]|
+---+--------------+
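This is the solution (df_stemmed is my DataFrame, and 'words' is its token column):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType
filter_length_udf = udf(lambda row: [x for x in row if len(x) >= 3], ArrayType(StringType()))
df_final_words = df_stemmed.withColumn('words_filtered', filter_length_udf(col('words')))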

Removing NULL , NAN, empty space from PySpark DataFrame

I have a dataframe in PySpark which contains empty space, Null, and Nan.
I want to remove rows which have any of those. I tried below commands, but, nothing seems to work.
myDF.na.drop().show()
myDF.na.drop(how='any').show()
Below is the dataframe:
+---+----------+----------+-----+-----+
|age| category| date|empId| name|
+---+----------+----------+-----+-----+
| 25|electronic|17-01-2018| 101| abc|
| 24| sports|16-01-2018| 102| def|
| 23|electronic|17-01-2018| 103| hhh|
| 23|electronic|16-01-2018| 104| yyy|
| 29| men|12-01-2018| 105| ajay|
| 31| kids|17-01-2018| 106|vijay|
| | Men| nan| 107|Sumit|
+---+----------+----------+-----+-----+
What am I missing? What is the best way to tackle NULL, Nan or empty spaces so that there is no problem in the actual calculation?
NaN (not a number) has a different meaning than NULL, and an empty string is just a normal value (the CSV reader can convert it to NULL automatically), so na.drop won't match these.
You can convert them all to null and then drop:
from pyspark.sql.functions import col, isnan, when, trim
df = spark.createDataFrame([
("", 1, 2.0), ("foo", None, 3.0), ("bar", 1, float("NaN")),
("good", 42, 42.0)])
def to_null(c):
    return when(~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == "")), col(c))
df.select([to_null(c).alias(c) for c in df.columns]).na.drop().show()
# +----+---+----+
# | _1| _2| _3|
# +----+---+----+
# |good| 42|42.0|
# +----+---+----+
Maybe it is not important in your case, but this code (a modified version of Alper t. Turker's answer) can handle different data types accordingly. The data types will of course vary with your DataFrame. (Tested on Spark 2.4.)
from pyspark.sql.functions import col, isnan, when, trim
# Find out dataType and act accordingly
def to_null_bool(c, dt):
    # Returns a condition that is True when the value should be KEPT
    if dt == "double":
        return ~(c.isNull() | isnan(c))
    elif dt == "string":
        return ~c.isNull() & (trim(c) != "")
    else:
        return ~c.isNull()
# Keep a value only if it is not null, NaN, or an empty string; otherwise it becomes null
def to_null(c, dt):
    c = col(c)
    return when(to_null_bool(c, dt), c)
df.select([to_null(c, dt[1]).alias(c) for c, dt in zip(df.columns, df.dtypes)]).na.drop(how="any").show()

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np
data = [
(1, 1, None),
(1, 2, float(5)),
(1, 3, np.nan),
(1, 4, None),
(1, 5, float(10)),
(1, 6, float("nan")),
(1, 6, float("nan")),
]
df = spark.createDataFrame(data, ("session", "timestamp1", "id2"))
Expected output:
A dataframe with the count of NaN/null values for each column.
Note:
The previous questions I found on Stack Overflow only check for null, not NaN.
That's why I have created a new question.
I know I can use the isnull() function in Spark to find the number of null values in a column, but how do I find NaN values in a Spark dataframe?
You can use the method shown here and replace isNull with isnan:
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 3|
+-------+----------+---+
or
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
| 0| 0| 5|
+-------+----------+---+
To count the null values in each column of a PySpark dataframe:
Dict_Null = {col:df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null
# The output is a dict where each key is a column name and each value is the number of nulls in that column
{'#': 0,
'Name': 0,
'Type 1': 0,
'Type 2': 386,
'Total': 0,
'HP': 0,
'Attack': 0,
'Defense': 0,
'Sp_Atk': 0,
'Sp_Def': 0,
'Speed': 0,
'Generation': 0,
'Legendary': 0}
To make sure it does not fail for string, date and timestamp columns:
import pyspark.sql.functions as F
def count_missings(spark_df, sort=True):
    """
    Counts number of nulls and nans in each column
    """
    df = spark_df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c) for (c,c_type) in spark_df.dtypes if c_type not in ('timestamp', 'string', 'date')]).toPandas()
    if len(df) == 0:
        print("There are no any missing values!")
        return None
    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count",ascending=False)
    return df
If you want to see the columns sorted by the number of NaNs and nulls in descending order:
count_missings(spark_df)
# | Col_A | 10 |
# | Col_C | 2 |
# | Col_B | 1 |
If you don't want ordering and see them as a single row:
count_missings(spark_df, False)
# | Col_A | Col_B | Col_C |
# | 10 | 1 | 2 |
An alternative to the already provided ways is to simply filter on the column like so
import pyspark.sql.functions as F
df = df.where(F.col('columnNameHere').isNull())
This has the added benefit that you don't have to add another column to do the filtering and it's quick on larger data sets.
Here is my one-liner.
Here 'c' is the name of the column:
import pyspark.sql.functions as F
df.select('c').withColumn('isNull_c',F.col('c').isNull()).where('isNull_c = True').count()
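I prefer this solution, which subtracts the non-null count of each column from the total row count:
from pyspark.sql.functions import count
df = spark.table(selected_table).filter(condition)
counter = df.count()
df = df.select([(counter - count(c)).alias(c) for c in df.columns])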
Use the following code to count the null values in every column using PySpark.
import pandas as pd
from pyspark.sql.functions import count, when, isnull
def check_nulls(dataframe):
    '''
    Check null values and return the null values in pandas Dataframe
    INPUT: Spark Dataframe
    OUTPUT: Null values
    '''
    # Create pandas dataframe
    nulls_check = pd.DataFrame(dataframe.select([count(when(isnull(c), c)).alias(c) for c in dataframe.columns]).collect(),
                               columns = dataframe.columns).transpose()
    nulls_check.columns = ['Null Values']
    return nulls_check
#Check null values
null_df = check_nulls(raw_df)
null_df
from pyspark.sql import DataFrame
import pyspark.sql.functions as fn
# compatible with fn.isnan. Sourced from
# https://github.com/apache/spark/blob/13fd272cd3/python/pyspark/sql/functions.py#L4818-L4836
NUMERIC_DTYPES = (
    'decimal',
    'double',
    'float',
    'int',
    'bigint',
    'smallint',
    'tinyint',
)
def count_nulls(df: DataFrame) -> DataFrame:
    isnan_compat_cols = {c for (c, t) in df.dtypes if any(t.startswith(num_dtype) for num_dtype in NUMERIC_DTYPES)}
    return df.select(
        [fn.count(fn.when(fn.isnan(c) | fn.isnull(c), c)).alias(c) for c in isnan_compat_cols]
        + [fn.count(fn.when(fn.isnull(c), c)).alias(c) for c in set(df.columns) - isnan_compat_cols]
    )
Builds off of gench and user8183279's answers, but checks via only isnull for columns where isnan is not possible, rather than just ignoring them.
The source code of pyspark.sql.functions seemed to have the only documentation I could really find enumerating these names — if others know of some public docs I'd be delighted.
If you are writing Spark SQL, the following will also find the null values, which you can then count.
spark.sql('select * from table where isNULL(column_value)')
Yet another alternative (improved upon Vamsi Krishna's solutions above):
from pyspark.sql.functions import isnan, isnull
def check_for_null_or_nan(df):
    null_or_nan = lambda x: isnan(x) | isnull(x)
    func = lambda x: df.filter(null_or_nan(x)).count()
    print(*[f'{i} has {func(i)} nans/nulls' for i in df.columns if func(i)!=0],sep='\n')
check_for_null_or_nan(df)
id2 has 5 nans/nulls
Here is a readable solution because code is for people as much as computers ;-)
df.selectExpr('sum(int(isnull(<col_name>) or isnan(<col_name>))) as null_or_nan_count')

Resources