How to remove NULL from a struct field in pyspark? - apache-spark

I have a DataFrame which contains one struct field. I want to remove the values which are null from the struct field.
from pyspark.sql.functions import struct
temp_df_struct = Df.withColumn("VIN_COUNTRY_CD", struct('BXSR_VEHICLE_1_VIN_COUNTRY_CD', 'BXSR_VEHICLE_2_VIN_COUNTRY_CD', 'BXSR_VEHICLE_3_VIN_COUNTRY_CD', 'BXSR_VEHICLE_4_VIN_COUNTRY_CD', 'BXSR_VEHICLE_5_VIN_COUNTRY_CD'))
Some of these columns contain NULLs. Is there any way to remove the nulls from the struct field?

You should always provide a small reproducible example, but here's my guess as to what you want.
Example data
data = [("1", "10", "20", None, "30", "40"), ("2", None, "15", "25", "35", None)]
names_of_cols = [
    "id",
    "BXSR_VEHICLE_1_VIN_COUNTRY_CD",
    "BXSR_VEHICLE_2_VIN_COUNTRY_CD",
    "BXSR_VEHICLE_3_VIN_COUNTRY_CD",
    "BXSR_VEHICLE_4_VIN_COUNTRY_CD",
    "BXSR_VEHICLE_5_VIN_COUNTRY_CD",
]
df = spark.createDataFrame(data, names_of_cols)
df.show(truncate=False)
# +---+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
# | id|BXSR_VEHICLE_1_VIN_COUNTRY_CD|BXSR_VEHICLE_2_VIN_COUNTRY_CD|BXSR_VEHICLE_3_VIN_COUNTRY_CD|BXSR_VEHICLE_4_VIN_COUNTRY_CD|BXSR_VEHICLE_5_VIN_COUNTRY_CD|
# +---+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
# | 1| 10| 20| null| 30| 40|
# | 2| null| 15| 25| 35| null|
# +---+-----------------------------+-----------------------------+-----------------------------+-----------------------------+-----------------------------+
Reproducing what you have
You want to collect values from multiple columns into an array, such as
import re
from pyspark.sql.functions import col, array
collect_cols = [c for c in df.columns if re.match('BXSR_VEHICLE_\\d_VIN_COUNTRY_CD', c)]
collect_cols
# ['BXSR_VEHICLE_1_VIN_COUNTRY_CD', 'BXSR_VEHICLE_2_VIN_COUNTRY_CD', 'BXSR_VEHICLE_3_VIN_COUNTRY_CD', 'BXSR_VEHICLE_4_VIN_COUNTRY_CD', 'BXSR_VEHICLE_5_VIN_COUNTRY_CD']
(
    df.
    withColumn(
        "VIN_COUNTRY_CD",
        array(*collect_cols)
    ).
    select('id', 'VIN_COUNTRY_CD').
    show(truncate=False)
)
# +---+-----------------+
# |id |VIN_COUNTRY_CD |
# +---+-----------------+
# |1 |[10, 20,, 30, 40]|
# |2 |[, 15, 25, 35,] |
# +---+-----------------+
Solution
And then remove NULLs from the array
from pyspark.sql.functions import array, col, lit, array_except
(
    df.
    withColumn(
        "VIN_COUNTRY_CD",
        array(*collect_cols)
    ).
    withColumn(
        'VIN_COUNTRY_CD',
        array_except(
            col('VIN_COUNTRY_CD'),
            array(lit(None).cast('string'))
        )
    ).
    select('id', 'VIN_COUNTRY_CD').
    show(truncate=False)
)
# +---+----------------+
# |id |VIN_COUNTRY_CD |
# +---+----------------+
# |1 |[10, 20, 30, 40]|
# |2 |[15, 25, 35] |
# +---+----------------+
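As an aside, if you are on Spark 3.1+ you could also drop the nulls with the filter higher-order function instead of array_except; a minimal sketch, reusing the collect_cols list from above:
from pyspark.sql.functions import array, col, filter as array_filter

(
    df.
    withColumn("VIN_COUNTRY_CD", array(*collect_cols)).
    withColumn("VIN_COUNTRY_CD",
               array_filter(col("VIN_COUNTRY_CD"), lambda x: x.isNotNull())).
    select('id', 'VIN_COUNTRY_CD').
    show(truncate=False)
)
This gives the same result here; one difference to be aware of is that array_except also removes duplicate values, while filter keeps them.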

Related

Dynamically select the columns in a Spark dataframe

I have data like in the dataframe below. As you can see, there are columns "2019" and "2019_p", "2020" and "2020_p", "2021" and "2021_p".
I want to build the final columns dynamically: if "2019" is null, take the value of "2019_p"; if "2020" is null, take the value of "2020_p"; and the same applies to "2021", etc.
I want to select the columns dynamically without hardcoding the column names.
How do I achieve this?
I need output like this:
You can simplify ZygD's approach to just use a list comprehension with coalesce (without regex).
from pyspark.sql import functions as func

# the following list can also be created from the source dataframe:
# [k for k in data_sdf.columns if k.startswith('20') and not k.endswith('_p')]
year_cols = ['2019', '2020', '2021']

data_sdf. \
    select('id', 'type',
           *[func.coalesce(c, c + '_p').alias(c) for c in year_cols]). \
    show()
# +---+----+----+----+----+
# | id|type|2019|2020|2021|
# +---+----+----+----+----+
# | 1| A| 50| 65| 40|
# | 1| B| 25| 75| 75|
# +---+----+----+----+----+
where the list comprehension would yield the following
[func.coalesce(c, c+'_p').alias(c) for c in year_cols]
# [Column<'coalesce(2019, 2019_p) AS `2019`'>,
# Column<'coalesce(2020, 2020_p) AS `2020`'>,
# Column<'coalesce(2021, 2021_p) AS `2021`'>]
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'A', 50, None, 40, None, 65, None),
     (1, 'B', None, 75, None, 25, None, 75)],
    ['Id', 'Type', '2019', '2020', '2021', '2019_p', '2020_p', '2021_p'])
One way could be this - using df.colRegex:
cols = list({c[:4] for c in df.columns if c not in ['Id', 'Type']})
df = df.select(
    'Id', 'Type',
    *[F.coalesce(*df.select(df.colRegex(f'`^{c}.*`')).columns).alias(c) for c in cols]
)
df.show()
# +---+----+----+----+----+
# | Id|Type|2020|2019|2021|
# +---+----+----+----+----+
# | 1| A| 65| 50| 40|
# | 1| B| 75| 25| 75|
# +---+----+----+----+----+
Also possible using startswith:
cols = list({c[:4] for c in df.columns if c not in ['Id', 'Type']})
df = df.select(
    'Id', 'Type',
    *[F.coalesce(*[x for x in df.columns if x.startswith(c)]).alias(c) for c in cols]
)
If you need a one-liner, create a dictionary of the paired columns and use each key/value pair in coalesce (this assumes coalesce has been imported from pyspark.sql.functions):
df.select('Id', 'Type', *[coalesce(k, v).alias(k) for k, v in dict(zip(df.select(df.colRegex("`\\d{4}`")).columns, df.select(df.colRegex("`.*\\_\\D$`")).columns)).items()]).show()
+---+----+----+----+----+
| Id|Type|2019|2020|2021|
+---+----+----+----+----+
| 1| A| 50| 65| 40|
| 1| B| 25| 75| 75|
+---+----+----+----+----+
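For reference, the zipped dictionary inside that one-liner, evaluated against the example input above, should map each year column to its _p counterpart:
dict(zip(df.select(df.colRegex("`\\d{4}`")).columns,
         df.select(df.colRegex("`.*\\_\\D$`")).columns))
# {'2019': '2019_p', '2020': '2020_p', '2021': '2021_p'}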

Spark DataFrame: get row wise sorted column names based on column values

For every row in the dataframe below, I want to find the column names (as an array, tuple, or something else) ordered by descending column values. So, for the dataframe
+---+---+---+---+---+
| ID|key| a| b| c|
+---+---+---+---+---+
| 0| 1| 5| 2| 1|
| 1| 1| 3| 4| 5|
+---+---+---+---+---+
I want to find
+---+---+---+---+---+------------------+
| ID|key| a| b| c|descending_columns|
+---+---+---+---+---+------------------+
| 0| 1| 5| 2| 1| [a,b,c]|
| 1| 1| 3| 4| 5| [c,b,a]|
+---+---+---+---+---+------------------+
Ideally and in general, I want to be able to iterate through pre-specified columns and apply a function based on those column entries. This could look like:
import pyspark.sql.functions as f

name_cols = ["a", "b", "c"]
values_ls = []
for col in name_cols:
    # ...schema specification...
    values_ls.append(f.col(col))  # ...get column value...
df1 = df.withColumn("descending_columns", values_ls)
The question is rather simple, but seems to be quite challenging to implement efficiently in pyspark.
I am using pyspark version 2.3.3.
For Spark versions < 2.4, you can achieve this without a udf using sort_array and struct.
First get a list of the columns to sort
cols_to_sort = df.columns[2:]
print(cols_to_sort)
#['a', 'b', 'c']
Now build a struct with two elements - a "value" and a "key". The "key" is the column name and the "value" is the column value. If you ensure that the "value" comes first in the struct, you can use sort_array to sort this array of structs in the manner you want.
After the array is sorted, you just need to iterate over it and extract the "key" part, which contains the column names.
from pyspark.sql.functions import array, col, lit, sort_array, struct
df.withColumn(
    "descending_columns",
    array(
        *[
            sort_array(
                array(
                    *[
                        struct([col(c).alias("value"), lit(c).alias("key")])
                        for c in cols_to_sort
                    ]
                ),
                asc=False
            )[i]["key"]
            for i in range(len(cols_to_sort))
        ]
    )
).show(truncate=False)
#+---+---+---+---+---+------------------+
#|ID |key|a |b |c |descending_columns|
#+---+---+---+---+---+------------------+
#|0 |1 |5 |2 |1 |[a, b, c] |
#|1 |1 |3 |4 |5 |[c, b, a] |
#+---+---+---+---+---+------------------+
Even though this looks complicated, it should offer better performance than the udf solution.
Update: To sort by the original column order in the case of a tie in the value, you could insert another value in the struct which contains the index. Since the sort is descending, we use the negative of the index.
For example, if your input dataframe were the following:
df.show()
#+---+---+---+---+---+
#| ID|key| a| b| c|
#+---+---+---+---+---+
#| 0| 1| 5| 2| 1|
#| 1| 1| 3| 4| 5|
#| 2| 1| 4| 4| 5|
#+---+---+---+---+---+
The last row above has a tie in value between a and b. We want a to sort before b in this case.
df.withColumn(
    "descending_columns",
    array(
        *[
            sort_array(
                array(
                    *[
                        struct(
                            [
                                col(c).alias("value"),
                                lit(-j).alias("index"),
                                lit(c).alias("key")
                            ]
                        )
                        for j, c in enumerate(cols_to_sort)
                    ]
                ),
                asc=False
            )[i]["key"]
            for i in range(len(cols_to_sort))
        ]
    )
).show(truncate=False)
#+---+---+---+---+---+------------------+
#|ID |key|a |b |c |descending_columns|
#+---+---+---+---+---+------------------+
#|0 |1 |5 |2 |1 |[a, b, c] |
#|1 |1 |3 |4 |5 |[c, b, a] |
#|2 |1 |4 |4 |5 |[c, a, b] |
#+---+---+---+---+---+------------------+
You could insert the columns into a single struct and process that in a udf.
from pyspark.sql import functions as F
from pyspark.sql import types as T

name_cols = ['a', 'b', 'c']

def ordered_columns(row):
    return [x for _, x in sorted(zip(row.asDict().values(), name_cols), reverse=True)]

udf_ordered_columns = F.udf(ordered_columns, T.ArrayType(T.StringType()))

df1 = (
    df
    .withColumn(
        'row',
        F.struct(*name_cols)
    )
    .withColumn(
        'descending_columns',
        udf_ordered_columns('row')
    )
)
Something like this should work; if the above doesn't, let me know.
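For completeness, dropping the helper column and showing df1 for the example dataframe from the question should reproduce the same result as the non-udf approach above:
df1.drop('row').show(truncate=False)
# +---+---+---+---+---+------------------+
# |ID |key|a  |b  |c  |descending_columns|
# +---+---+---+---+---+------------------+
# |0  |1  |5  |2  |1  |[a, b, c]         |
# |1  |1  |3  |4  |5  |[c, b, a]         |
# +---+---+---+---+---+------------------+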

SPARK SQL: how to expand a nested array[..] field to a flat table

I'm using Spark 2.2.0 and 1.6.1. One of my tasks has the following table:
|ID|DEVICE     |HASH|
---------------------
|12|2,3,0,2,6,4|adf7|
where:
ID - long
DEVICE - string
HASH - string
I need to expand the field 'DEVICE' into 6 columns, such as:
|ID|D1|D2|D3|D4|D5|D6|HASH|
---------------------------
|12|2 |3 |0 |2 |6 |4 |adf7|
Thanks for your help.
Get the maximum length:
import org.apache.spark.sql.functions.{size, max}
import org.apache.spark.sql.Row
val df = Seq(("12", Seq(2, 3, 0, 2, 6, 4), "adf7")).toDF("id", "device", "hash")
val Row(n: Int) = df.select(max(size($"device"))).first
If you know the number beforehand just skip this and go straight to the second part.
Once you define n, just select:
df.select(
$"id" +: (0 until n).map(i => $"device"(i).alias(s"d$i")) :+ $"hash": _*
).show
// +---+---+---+---+---+---+---+----+
// | id| d0| d1| d2| d3| d4| d5|hash|
// +---+---+---+---+---+---+---+----+
// | 12| 2| 3| 0| 2| 6| 4|adf7|
// +---+---+---+---+---+---+---+----+
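For PySpark users, a rough equivalent might look like the sketch below, assuming DEVICE arrives as the comma-separated string shown in the question, so it is split into an array first:
from pyspark.sql.functions import col, size, split
from pyspark.sql.functions import max as max_

df = spark.createDataFrame([("12", "2,3,0,2,6,4", "adf7")], ["id", "device", "hash"])
device = split(col("device"), ",")

# maximum number of elements across all rows
n = df.select(max_(size(device))).first()[0]

df.select("id", *[device.getItem(i).alias(f"d{i}") for i in range(n)], "hash").show()
# +---+---+---+---+---+---+---+----+
# | id| d0| d1| d2| d3| d4| d5|hash|
# +---+---+---+---+---+---+---+----+
# | 12|  2|  3|  0|  2|  6|  4|adf7|
# +---+---+---+---+---+---+---+----+
Note the d-columns here are strings (the output of split); cast them if integers are needed.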

How to convert empty arrays to nulls?

I have the dataframe below and I need to convert the empty arrays to null.
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]| [50, 55]|
|1111| []| []|
|1112| [45, 46]| [50, 50]|
|1113| []| []|
+----+---------+-----------+
I have tried the code below, which is not working:
df.na.fill("null").show()
The expected output should be:
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]| [50, 55]|
|1111| NUll| NUll|
|1112| [45, 46]| [50, 50]|
|1113| NUll| NUll|
+----+---------+-----------+
For your given dataframe, you can simply do the following
from pyspark.sql import functions as F
df.withColumn("count(AS)", F.when((F.size(F.col("count(AS)")) == 0), F.lit(None)).otherwise(F.col("count(AS)"))) \
.withColumn("count(asdr)", F.when((F.size(F.col("count(asdr)")) == 0), F.lit(None)).otherwise(F.col("count(asdr)"))).show()
You should have the output dataframe as:
+----+---------+-----------+
| id|count(AS)|count(asdr)|
+----+---------+-----------+
|1110| [12, 45]| [50, 55]|
|1111| null| null|
|1112| [45, 46]| [50, 50]|
|1113| null| null|
+----+---------+-----------+
Updated
In case you have more than two array columns and you want to apply the above logic dynamically, you can use the following logic
from pyspark.sql import functions as F

for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], F.when((F.size(F.col(c[0])) == 0), F.lit(None)).otherwise(F.col(c[0])))
df.show()
Here,
df.dtypes gives you a list of tuples with column name and datatype. For the dataframe in the question it would be
[('id', 'bigint'), ('count(AS)', 'array<bigint>'), ('count(asdr)', 'array<bigint>')]
withColumn is applied only to array columns ("array" in c[1]). F.size(F.col(c[0])) == 0 is the condition for the when function, which checks the size of the array: if the condition is true, i.e. the array is empty, None is populated; otherwise the original value is kept. The loop applies this to all the array columns.
I don't think that's possible with na.fill, but this should work for you. The code converts all empty ArrayType columns to null and keeps the other columns as they are:
import spark.implicits._
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.functions._
val df = Seq(
  (110, Seq.empty[Int]),
  (111, Seq(1, 2, 3))
).toDF("id", "arr")
// get names of array-type columns
val arrColsNames = df.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)
// map all empty arrays to nulls
val emptyArraysAsNulls = arrColsNames.map(n => when(size(col(n))>0,col(n)).as(n))
// non-array-type columns, keep them as they are
val keepCols = df.columns.filterNot(arrColsNames.contains).map(col)
df
  .select((keepCols ++ emptyArraysAsNulls): _*)
  .show()
+---+---------+
| id| arr|
+---+---------+
|110| null|
|111|[1, 2, 3]|
+---+---------+
You need to check for the size of the array type column. Like:
df.show()
+----+---+
| id|arr|
+----+---+
|1110| []|
+----+---+
df.withColumn("arr", when(size(col("arr")) == 0 , lit(None)).otherwise(col("arr") ) ).show()
+----+----+
| id| arr|
+----+----+
|1110|null|
+----+----+
There is no easy solution like df.na.fill here. One way would be to loop over all relevant columns and replace values where appropriate. Example using foldLeft in Scala:
val columns = df.schema.filter(_.dataType.typeName == "array").map(_.name)
val df2 = columns.foldLeft(df)((acc, colname) => acc.withColumn(colname,
  when(size(col(colname)) === 0, null).otherwise(col(colname))))
First, all columns of array type are extracted and then iterated through. Since the size function is only defined for columns of array type, this is a necessary step (and it avoids looping over all columns).
Using the dataframe:
+----+--------+-----+
| id| col1| col2|
+----+--------+-----+
|1110|[12, 11]| []|
|1111| []| [11]|
|1112| [123]|[321]|
+----+--------+-----+
The result is as follows:
+----+--------+-----+
| id| col1| col2|
+----+--------+-----+
|1110|[12, 11]| null|
|1111| null| [11]|
|1112| [123]|[321]|
+----+--------+-----+
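A PySpark analogue of this foldLeft pattern is a functools.reduce over the array columns; a minimal sketch, assuming df is a dataframe with one or more array-typed columns as in the question:
from functools import reduce
from pyspark.sql import functions as F

# columns whose type is array<...>
array_cols = [name for name, dtype in df.dtypes if dtype.startswith("array")]

# fold the withColumn call over every array column, starting from df
df2 = reduce(
    lambda acc, c: acc.withColumn(
        c, F.when(F.size(F.col(c)) == 0, F.lit(None)).otherwise(F.col(c))),
    array_cols,
    df,
)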
You can do it with selectExpr:
df_filled = df.selectExpr(
    "id",
    "if(size(column1)<=0, null, column1)",
    "if(size(column2)<=0, null, column2)",
    ...
)
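If there are many array columns, the selectExpr strings can also be built dynamically; a sketch, assuming the array columns are detected via df.dtypes and that names such as count(AS) need backticks in SQL expressions:
# split columns into array-typed and everything else
array_cols = [name for name, dtype in df.dtypes if dtype.startswith("array")]
other_cols = [f"`{name}`" for name, dtype in df.dtypes if not dtype.startswith("array")]

df_filled = df.selectExpr(
    *other_cols,
    *[f"if(size(`{c}`) <= 0, null, `{c}`) as `{c}`" for c in array_cols]
)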
df.withColumn("arr", when(size(col("arr")) == 0, lit(None)).otherwise(col("arr") ) ).show()
Please keep in mind, it's also not work in pyspark.
Taking Ramesh Maharajan's solution above as a reference, I found another way to solve this using UDFs. Hope this helps if you have multiple rules for your dataframe.
df
+-----+----+----+----+
|store|   1|   2|   3|
+-----+----+----+----+
|  103|[90]|  []|  []|
|  104|  []|[67]|[90]|
|  101|[34]|  []|  []|
|  102|[35]|  []|  []|
+-----+----+----+----+
Use the code below; it imports pyspark.sql.functions as psf.
This code works in PySpark:
import pyspark.sql.functions as psf
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def udf1(x):
    # return None (not the string "null") so Spark stores a real null
    return None if x == [] else x

udf2 = udf(udf1, ArrayType(IntegerType()))

for c in df.dtypes:
    if "array" in c[1]:
        df = df.withColumn(c[0], udf2(psf.col(c[0])))
df.show()
output
+-----+----+----+----+
|store|   1|   2|   3|
+-----+----+----+----+
|  103|[90]|null|null|
|  104|null|[67]|[90]|
|  101|[34]|null|null|
|  102|[35]|null|null|
+-----+----+----+----+

Spark: Replace missing values with values from another column

Suppose you have a Spark dataframe containing some null values, and you would like to replace the values of one column with the values from another if present. In Python/Pandas you can use the fillna() function to do this quite nicely:
df = spark.createDataFrame([('a', 'b', 'c'),(None,'e', 'f'),(None,None,'i')], ['c1','c2','c3'])
DF = df.toPandas()
DF['c1'].fillna(DF['c2']).fillna(DF['c3'])
How can this be done using PySpark?
You need to use the coalesce function:
from pyspark.sql.functions import coalesce, lit

cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()
# +----+----+
# | a| b|
# +----+----+
# |null|null|
# | 1|null|
# |null| 2|
# +----+----+
cDf.select(coalesce(cDf["a"], cDf["b"])).show()
# +--------------+
# |coalesce(a, b)|
# +--------------+
# | null|
# | 1|
# | 2|
# +--------------+
cDf.select('*', coalesce(cDf["a"], lit(0.0))).show()
# +----+----+----------------+
# | a| b|coalesce(a, 0.0)|
# +----+----+----------------+
# |null|null| 0.0|
# | 1|null| 1.0|
# |null| 2| 0.0|
# +----+----+----------------+
You can also apply coalesce on multiple columns:
cDf.select(coalesce(cDf["a"], cDf["b"], lit(0))).show()
# ...
This example is taken from the pyspark.sql API documentation.
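Applied to the dataframe from the question, a minimal sketch (reusing the c1/c2/c3 columns defined there) should produce something like:
from pyspark.sql.functions import coalesce

df = spark.createDataFrame(
    [('a', 'b', 'c'), (None, 'e', 'f'), (None, None, 'i')],
    ['c1', 'c2', 'c3'])

df.withColumn('c1', coalesce('c1', 'c2', 'c3')).show()
# +---+----+---+
# | c1|  c2| c3|
# +---+----+---+
# |  a|   b|  c|
# |  e|   e|  f|
# |  i|null|  i|
# +---+----+---+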

Resources