How to drop rows with nulls in one column pyspark - apache-spark

I have a DataFrame and I would like to drop all rows with a NULL value in one of the columns (a string column). I can easily get the count of such rows:
df.filter(df.col_X.isNull()).count()
I have tried dropping them using the following command. It executes, but the count afterwards is still positive:
df.filter(df.col_X.isNull()).drop()
I have tried different variations, but they return an 'object is not callable' error.

Use either na.drop with a subset:
df.na.drop(subset=["col_X"])
or isNotNull()
df.filter(df.col_X.isNotNull())

DataFrames are immutable, so applying a filter that keeps only non-null values creates a new DataFrame without the records that have null values. Just assign the result back:
df = df.filter(df.col_X.isNotNull())
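For example, a minimal end-to-end sketch with a toy DataFrame (assuming an active SparkSession named spark and a string column col_X):
df = spark.createDataFrame([("a",), (None,), ("b",)], ["col_X"])
df.filter(df.col_X.isNull()).count()    # 1
df = df.filter(df.col_X.isNotNull())    # reassign the filtered result
df.filter(df.col_X.isNull()).count()    # 0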

If you want to drop any row in which any value is null, use
df.na.drop()  # same as df.na.drop("any"); "any" is the default
To drop a row only if all of its values are null, use
df.na.drop("all")
To restrict the check to a list of columns, pass a subset:
df.na.drop("all", subset=["col1", "col2", "col3"])

Another variation is:
from pyspark.sql.functions import col
df = df.where(col("columnName").isNotNull())

Sometimes you may also want to filter out empty strings in addition to nulls:
df = df.filter((df.col_X.isNotNull()) & (df.col_X != ""))

You can use the expr() function, which accepts SQL-like query syntax:
from pyspark.sql.functions import expr
filteredDF = rawDF.filter(expr("col_X is not null")).filter("col_Y is not null")
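Note that filter also accepts the SQL string directly, without wrapping it in expr(), so both conditions can be combined into one call:
filteredDF = rawDF.filter("col_X is not null AND col_Y is not null")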

Related

split row value by separator and create new columns

I have a PySpark dataset with a column “channels” that looks like this:
channels
name1,name2,name3,name4
happy1,happy2
entity1,entity2,entity3,entity4,entity5
I want to create 5 new columns, i.e. “channel1”, “channel2”, “channel3”, “channel4”, “channel5”.
Then, I want to split the contents of the “channels” column using the comma separator. After splitting values from each row, I want to put each separated value in a different column.
For example for the first row, the columns should look like this:
channel1 channel2 channel3 channel4 channel5
name1 name2 name3 name4 ~
When an element is not found, I want to use ~ as the column value. For example, the first row has only 4 values instead of 5, so the channel5 column gets ~.
I only want to use ~, not None or NULL.
How can I achieve this result in pyspark?
I tried this:
df = df.withColumn("channels_split", split(df["channels"], ","))
df = df.withColumn("channel1", coalesce(df["channels_split"][0], "~"))
df = df.withColumn("channel2", coalesce(df["channels_split"][1], "~"))
df = df.withColumn("channel3", coalesce(df["channels_split"][2], "~"))
df = df.withColumn("channel4", coalesce(df["channels_split"][3], "~"))
df = df.withColumn("channel5", coalesce(df["channels_split"][4], "~"))
df = df.drop("channels_split")
but it gives me this error:
`~` is missing
You're referencing the column `~`, but it is missing from the schema. Please check your code.
Note that I am using pyspark within Foundry
coalesce expects columns as arguments and you are providing a string; I think you should use lit("~").
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.coalesce.html
This is more of a spark problem than a Foundry problem though.
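A minimal sketch of that fix, keeping the structure of the code in the question:
from pyspark.sql.functions import split, coalesce, lit

df = df.withColumn("channels_split", split(df["channels"], ","))
df = df.withColumn("channel1", coalesce(df["channels_split"][0], lit("~")))
df = df.withColumn("channel2", coalesce(df["channels_split"][1], lit("~")))
# ... and likewise for channel3, channel4 and channel5
df = df.drop("channels_split")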
What @M_S said is correct: as the error message states, you need a column, so you should use the lit function.
But be careful: if spark.sql.ansi.enabled is set to True, this code will throw an ArrayIndexOutOfBoundsException when there are fewer than 5 items in the array.
Another way to do this is to make sure the array has at least 5 elements on each row by appending some ~ values and taking the first 5 items, or by adding a temporary column with the length of the array and using a when condition.
By the way, you don't need to repeat df = every time:
from pyspark.sql.functions import split, array_repeat, lit, concat, col

df = (
    df
    .withColumn("channels_split", split(col("channels"), ","))
    # pad with five "~" values so every row has at least 5 elements
    .withColumn("default_values", array_repeat(lit("~"), 5))
    .withColumn("channels_split", concat(col("channels_split"), col("default_values")))
    .withColumn("channel1", col("channels_split")[0])
    .withColumn("channel2", col("channels_split")[1])
    .withColumn("channel3", col("channels_split")[2])
    .withColumn("channel4", col("channels_split")[3])
    .withColumn("channel5", col("channels_split")[4])
    .drop("channels_split", "default_values")
)

Get only rows of dataframe where a subset of columns exist in another dataframe

I want to get all rows of a dataframe (df2) where the city column value and postcode column value also exist in another dataframe (df1).
Importantly, I want to match on the combination of both columns, not on each column individually.
My approach was this:
# 1. Get all combinations
df_combinations = np.array(df1.select("Ort", "Postleitzahl").dropDuplicates().collect())
sc.broadcast(df_combinations)

# 2. Define the udf
def combination_in_vx(ort, plz):
    for arr_el in df_combinations:
        if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
            return True
    return False

combination_in_vx = udf(combination_in_vx, BooleanType())

# 3. Apply the udf and filter
df_tmp = df_2.withColumn("Combination_Exists", combination_in_vx('city', 'postcode'))
df_result = df_tmp.filter(df_tmp.Combination_Exists)
Although this should theoretically work, it takes forever!
Does anybody know of a better solution here? Thank you very much!
You can do a left semi join on the two columns. This keeps the rows of df2 whose combination of values in the two specified columns also exists in df1:
import pyspark.sql.functions as F
df_result = df2.join(df1, ["Ort", "Postleitzahl"], 'left_semi')
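Note that this assumes both DataFrames use the same names for the join columns. If df2's columns are named differently (e.g. city and postcode, as in the UDF call above), rename them first, for example:
df_result = (
    df2.withColumnRenamed("city", "Ort")
       .withColumnRenamed("postcode", "Postleitzahl")
       .join(df1, ["Ort", "Postleitzahl"], "left_semi")
)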

How to select an aggregated column from a dataframe

df = df.groupby(F.upper(F.col('count'))).agg({'totalvis.age':'avg'}).show() creates an avg(totalvis.age AS age) column.
I want to use another aggregate function to select the max of the newly created column, but there is an issue with the column name not being resolvable.
You can use the syntax below to assign an alias to the aggregated column:
import pyspark.sql.functions as F
df2 = df.groupby(F.upper(F.col('county'))).agg(F.avg('totalvisitor.age').alias('age_avg'))
Then you can select the maximum as df2.select(F.max('age_avg')).
P.S. Note that in the code you provided in the question, you have overwritten df with None by calling
df = df.(...).show()
because df.show() returns None.

How to filter column on values in list in pyspark?

I have a dataframe rawdata on which I have to apply a filter condition on column X with the values CB, CI and CR. So I used the code below:
df = dfRawData.filter(col("X").between("CB","CI","CR"))
But I am getting the following error:
between() takes exactly 3 arguments (4 given)
Please let me know how I can resolve this issue.
The between function checks whether a value lies between two values; its inputs are a lower bound and an upper bound. It cannot be used to check whether a column value is in a list. To do that, use isin:
import pyspark.sql.functions as f
df = dfRawData.where(f.col("X").isin(["CB", "CI", "CR"]))
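isin also accepts the values as separate arguments, so f.col("X").isin("CB", "CI", "CR") works as well.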

Difference between na().drop() and filter(col.isNotNull) (Apache Spark)

Is there any difference in semantics between df.na().drop() and df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull() && !df.col("onlyColumnInOneColumnDataFrame").isNaN()), where df is an Apache Spark DataFrame?
Or should I consider it a bug if the first one does NOT afterwards return null (not a String "null", but simply a null value) in the column onlyColumnInOneColumnDataFrame while the second one does?
EDIT: added !isNaN() as well. onlyColumnInOneColumnDataFrame is the only column in the given DataFrame. Let's say its type is Integer.
With df.na.drop() you drop the rows containing any null or NaN values.
With df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull()) you drop those rows which have null only in the column onlyColumnInOneColumnDataFrame.
If you want to achieve the same thing with na.drop, that would be df.na.drop(subset=["onlyColumnInOneColumnDataFrame"]).
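A minimal PySpark sketch of the difference (a one-column DataFrame of doubles, assuming an active SparkSession named spark):
df = spark.createDataFrame([(1.0,), (None,), (float("nan"),)], ["onlyColumnInOneColumnDataFrame"])
df.na.drop().count()                                                   # 1 -- both the null and the NaN row are dropped
df.filter(df["onlyColumnInOneColumnDataFrame"].isNotNull()).count()    # 2 -- the NaN row is kept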
In one case, I had to select records whose value was NaN, null, or >= 0. I could do so using only the coalesce function and none of the three functions above:
rdd.filter("coalesce(index_column, 1000) >= 0")
