PySpark - Retain null values when using collect_list - nested

According to the accepted answer to "pyspark collect_set or collect_list with groupby", when you do a collect_list on a column, the null values in that column are removed. I have checked and this is true.
But in my case, I need to keep the null values. How can I achieve this?
I did not find any information on a variant of collect_list that does this.
Background context to explain why I want nulls:
I have a dataframe df as below:
cId | eId | amount | city
1 | 2 | 20.0 | Paris
1 | 2 | 30.0 | Seoul
1 | 3 | 10.0 | Phoenix
1 | 3 | 5.0 | null
I want to write this to an Elasticsearch index with the following mapping:
"mappings": {
"doc": {
"properties": {
"eId": { "type": "keyword" },
"cId": { "type": "keyword" },
"transactions": {
"type": "nested",
"properties": {
"amount": { "type": "keyword" },
"city": { "type": "keyword" }
}
}
}
}
}
In order to conform to the nested mapping above, I transformed my df so that for each combination of eId and cId, I have an array of transactions like this:
df_nested = df.groupBy('eId','cId').agg(collect_list(struct('amount','city')).alias("transactions"))
df_nested.printSchema()
root
|-- cId: integer (nullable = true)
|-- eId: integer (nullable = true)
|-- transactions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- amount: float (nullable = true)
| | |-- city: string (nullable = true)
When I save df_nested as a JSON file, these are the records that I get:
{"cId":1,"eId":2,"transactions":[{"amount":20.0,"city":"Paris"},{"amount":30.0,"city":"Seoul"}]}
{"cId":1,"eId":3,"transactions":[{"amount":10.0,"city":"Phoenix"},{"amount":5.0}]}
As you can see, when cId=1 and eId=3, the array element with amount=5.0 does not have the city attribute because it was null in my original data (df). The nulls are being removed when I use the collect_list function.
However, when I try writing df_nested to Elasticsearch with the above index, it fails because of a schema mismatch. This is the reason I want to retain my nulls after applying the collect_list function.

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.functions import create_map, collect_list, lit, col
app_name = "CollList"
conf = SparkConf().setAppName(app_name)
spark = SparkSession.builder.appName(app_name).config(conf=conf).enableHiveSupport().getOrCreate()
df = spark.createDataFrame([[1, 2, 20.0, "Paris"], [1, 2, 30.0, "Seoul"],
[1, 3, 10.0, "Phoenix"], [1, 3, 5.0, None]],
["cId", "eId", "amount", "city"])
print("Actual data")
df.show(10,False)
```
Actual data
+---+---+------+-------+
|cId|eId|amount|city |
+---+---+------+-------+
|1 |2 |20.0 |Paris |
|1 |2 |30.0 |Seoul |
|1 |3 |10.0 |Phoenix|
|1 |3 |5.0 |null |
+---+---+------+-------+
```
# collect_list over to_json(struct(...)): to_json leaves null fields out of each JSON string
df1 = df.groupBy(f.col('city'))\
.agg(f.collect_list(f.to_json(f.struct([f.col(x).alias(x) for x in (c for c in df.columns if c != 'cId' and c != 'eId' )])))).alias('newcol')
print("Collect List Data - Missing Null Columns in the list")
df1.show(10, False)
```
Collect List Data - Missing Null Columns in the list
+-------+-------------------------------------------------------------------------------------------------------------------+
|city |collect_list(structstojson(named_struct(NamePlaceholder(), amount AS `amount`, NamePlaceholder(), city AS `city`)))|
+-------+-------------------------------------------------------------------------------------------------------------------+
|Phoenix|[{"amount":10.0,"city":"Phoenix"}] |
|null |[{"amount":5.0}] |
|Paris |[{"amount":20.0,"city":"Paris"}] |
|Seoul |[{"amount":30.0,"city":"Seoul"}] |
+-------+-------------------------------------------------------------------------------------------------------------------+
```
# Build the alternating key/value arguments for create_map: lit(name), col(name), ...
my_list = []
for x in (c for c in df.columns if c not in ('cId', 'eId')):
    my_list.append(lit(x))
    my_list.append(col(x))
grp_by = ["eId", "cId"]
# to_json on a map keeps null values; note that amount comes out as a string in the output
df_nested = df.withColumn("transactions", create_map(my_list))\
    .groupBy(grp_by)\
    .agg(collect_list(f.to_json("transactions")).alias("transactions"))
print("collect list after create_map")
df_nested.show(10, False)
```
collect list after create_map
+---+---+--------------------------------------------------------------------+
|eId|cId|transactions |
+---+---+--------------------------------------------------------------------+
|2 |1 |[{"amount":"20.0","city":"Paris"}, {"amount":"30.0","city":"Seoul"}]|
|3 |1 |[{"amount":"10.0","city":"Phoenix"}, {"amount":"5.0","city":null}] |
+---+---+--------------------------------------------------------------------+
```
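If the records are exported as JSON lines before indexing, the missing keys can also be padded back in outside Spark. The following is only a sketch in plain Python: `pad_transactions` and `FIELDS` are hypothetical names introduced here, and the sample line is a record shaped like the one in the question.

```python
import json

FIELDS = ("amount", "city")  # nested fields from the mapping in the question (assumed complete)


def pad_transactions(record, fields=FIELDS):
    """Return a copy of the record where every transaction has all fields, with None for gaps."""
    padded = dict(record)
    padded["transactions"] = [
        {name: tx.get(name) for name in fields} for tx in record["transactions"]
    ]
    return padded


line = '{"cId":1,"eId":3,"transactions":[{"amount":10.0,"city":"Phoenix"},{"amount":5.0}]}'
record = pad_transactions(json.loads(line))
print(json.dumps(record))
```

Every transaction then carries an explicit null for city instead of omitting the key, which is the shape the nested mapping expects.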

Related

Spark Dataframe manipulation

Input Dataframe:
caseid | indicator
1 | STP
1 | non-STP
2 | STP
3 | STP
3 | non-STP
Output Dataframe:
caseid | indicator
1 | non-STP
2 | STP
3 | non-STP
Hello all, I would be really grateful if someone could help me with the above dataframe. In the output dataframe, for cases that have a non-STP row I only want to keep the non-STP row, whereas cases that only have STP rows should be kept as they are.
Thanks in advance.
You could try with groupby and then check if values contain non-STP.
Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
data = [
{"caseid": "1", "indicator": "STP"},
{"caseid": "1", "indicator": "non-STP"},
{"caseid": "2", "indicator": "STP"},
{"caseid": "3", "indicator": "STP"},
{"caseid": "3", "indicator": "non-STP"},
]
df = spark.createDataFrame(data)
df = (
df.groupBy("caseid")
.agg(F.concat_ws(",", F.collect_list(F.col("indicator"))).alias("indicator"))
.orderBy("caseid")
)
df = df.withColumn(
"indicator",
F.when(F.col("indicator").contains("non-STP"), F.lit("non-STP")).otherwise(
F.lit("STP")
),
)
Result:
root
|-- caseid: string (nullable = true)
|-- indicator: string (nullable = false)
+------+---------+
|caseid|indicator|
+------+---------+
|1 |non-STP |
|2 |STP |
|3 |non-STP |
+------+---------+
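The rule behind the groupBy answer can also be stated without Spark, which makes it easy to check on a few rows. This is a minimal plain-Python sketch: `resolve_indicators` is a hypothetical helper name, and it assumes STP and non-STP are the only indicator values, as in the question.

```python
from collections import defaultdict


def resolve_indicators(rows):
    """Collapse (caseid, indicator) pairs to one indicator per caseid.

    A case becomes "non-STP" if any of its rows is "non-STP"; otherwise it stays "STP".
    """
    grouped = defaultdict(list)
    for caseid, indicator in rows:
        grouped[caseid].append(indicator)
    return {
        caseid: "non-STP" if "non-STP" in indicators else "STP"
        for caseid, indicators in grouped.items()
    }


rows = [("1", "STP"), ("1", "non-STP"), ("2", "STP"), ("3", "STP"), ("3", "non-STP")]
result = resolve_indicators(rows)
print(result)
```

The sketch tests membership in a list rather than searching a joined string, so it does not depend on "STP" being a substring of "non-STP". In Spark, a comparable membership test could be written with array_contains on the collected list.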

Modify nested property inside Struct column with PySpark

I want to modify/filter on a property inside a struct.
Let's say I have a dataframe with the following column :
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [1, 2, 3]} |
#+------------------------------------------+
Schema:
struct<a:string, b:array<int>>
I want to filter out values in the 'b' property when the value inside the array == 1.
The desired result is the following:
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [2, 3]} |
#+------------------------------------------+
Is it possible to do this without extracting the property, filtering the values, and rebuilding the struct?
Update:
For Spark 3.1+, withField can be used to update the struct column without recreating the whole struct. In your case, you can update the field b using the filter function to filter the array values like this:
import pyspark.sql.functions as F
df1 = df.withColumn(
'arrayCol',
F.col('arrayCol').withField('b', F.filter(F.col("arrayCol.b"), lambda x: x != 1))
)
df1.show()
#+--------------------+
#| arrayCol|
#+--------------------+
#|{some_value, [2, 3]}|
#+--------------------+
For older versions, Spark doesn’t support adding/updating fields in nested structures. To update a struct column, you'll need to create a new struct using the existing fields and the updated ones:
import pyspark.sql.functions as F
df1 = df.withColumn(
"arrayCol",
F.struct(
F.col("arrayCol.a").alias("a"),
F.expr("filter(arrayCol.b, x -> x != 1)").alias("b")
)
)
One way would be to define a UDF:
Example:
import ast
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, MapType
def remove_value(col):
    # "b" holds the array as a string: parse it, drop the 1s, and store it back as a string
    col["b"] = str([x for x in ast.literal_eval(col["b"]) if x != 1])
    return col
if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [
            {
                "arrayCol": {
                    "a": "some_value",
                    "b": "[1, 2, 3]",
                },
            },
        ]
    )
    remove_value_udf = spark.udf.register(
        "remove_value_udf", remove_value, MapType(StringType(), StringType())
    )
    df = df.withColumn(
        "result",
        remove_value_udf(F.col("arrayCol")),
    )
Result:
root
|-- arrayCol: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- result: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+---------------------------------+------------------------------+
|arrayCol |result |
+---------------------------------+------------------------------+
|{a -> some_value, b -> [1, 2, 3]}|{a -> some_value, b -> [2, 3]}|
+---------------------------------+------------------------------+

Can I create a dataframe from another dataframe's rows

Can I create a dataframe that uses the rows below as the columns of the new dataframe, using PySpark?
+------------+
|         col|
+------------+
|created_meta|
|  updated_at|
|updated_meta|
|        meta|
|        Year|
|  First Name|
|      County|
|         Sex|
|       Count|
+------------+
Two ways.
Using pivot:
df1 = df.groupBy().pivot('col').agg(F.lit(None)).limit(0)
df1.show()
+-----+------+---------+---+----+------------+----+----------+------------+
|Count|County|FirstName|Sex|Year|created_meta|meta|updated_at|updated_meta|
+-----+------+---------+---+----+------------+----+----------+------------+
+-----+------+---------+---+----+------------+----+----------+------------+
Creating it from scratch:
df2 = df.select([F.lit(r[0]) for r in df.collect()]).limit(0)
df2.show()
+------------+----------+------------+----+----+---------+------+---+-----+
|created_meta|updated_at|updated_meta|meta|Year|FirstName|County|Sex|Count|
+------------+----------+------------+----+----+---------+------+---+-----+
+------------+----------+------------+----+----+---------+------+---+-----+
// sorry in Scala + Spark
import spark.implicits._
import org.apache.spark.sql.functions._
val lst = List("created_meta",
"updated_at",
"updated_meta",
"meta",
"Year",
"First Name",
"County",
"Sex",
"Count")
val source = lst.toDF("col")
source.show(false)
// +------------+
// |col |
// +------------+
// |created_meta|
// |updated_at |
// |updated_meta|
// |meta |
// |Year |
// |First Name |
// |County |
// |Sex |
// |Count |
// +------------+
val l = source.select('col).as[String].collect.toList
val df1 = l.foldLeft(source)((acc, col) => {
acc.withColumn(col, lit(""))
})
val df2 = df1.drop("col")
df2.printSchema()
// root
// |-- created_meta: string (nullable = false)
// |-- updated_at: string (nullable = false)
// |-- updated_meta: string (nullable = false)
// |-- meta: string (nullable = false)
// |-- Year: string (nullable = false)
// |-- First Name: string (nullable = false)
// |-- County: string (nullable = false)
// |-- Sex: string (nullable = false)
// |-- Count: string (nullable = false)
df2.show(1, false)
// +------------+----------+------------+----+----+----------+------+---+-----+
// |created_meta|updated_at|updated_meta|meta|Year|First Name|County|Sex|Count|
// +------------+----------+------------+----+----+----------+------+---+-----+
// | | | | | | | | | |
// +------------+----------+------------+----+----+----------+------+---+-----+

How to perform calculation in spark dataframe that select from its own dataframe using pyspark

I have a PySpark dataframe whose schema looks like this:
root
|-- id: string (nullable = true)
|-- long: float (nullable = true)
|-- lat: float (nullable = true)
|-- geohash: string (nullable = true)
|-- neighbors: array (nullable = true)
| |-- element: string (containsNull = true)
The data look like this :
+---+---------+----------+---------+--------------------+
| id| lat| long|geohash_8| neighbors|
+---+---------+----------+---------+--------------------+
| 0|-6.361755| 106.79653| qqggy1yu|[qqggy1ys, qqggy1...|
| 1|-6.358584|106.793945| qqggy4ky|[qqggy4kw, qqggy4...|
| 2|-6.362967|106.798775| qqggy38m|[qqggy38j, qqggy3...|
| 3|-6.358316| 106.79832| qqggy680|[qqggy4xb, qqggy6...|
| 4| -6.36016| 106.7981| qqggy60j|[qqggy4pv, qqggy6...|
| 5|-6.357476| 106.79842| qqggy68j|[qqggy4xv, qqggy6...|
| 6|-6.360814| 106.79435| qqggy4j3|[qqggy4j1, qqggy4...|
| 7|-6.358231|106.794365| qqggy4t2|[qqggy4t0, qqggy4...|
| 8|-6.357654| 106.79736| qqggy4x7|[qqggy4x5, qqggy4...|
| 9|-6.358781|106.794624| qqggy4mm|[qqggy4mj, qqggy4...|
| 10|-6.357654| 106.79443| qqggy4t7|[qqggy4t5, qqggy4...|
| 11|-6.357079| 106.79443| qqggy4tr|[qqggy4tp, qqggy4...|
| 12|-6.359929| 106.79698| qqggy4pn|[qqggy4ny, qqggy4...|
| 13|-6.358111| 106.79633| qqggy4w9|[qqggy4w3, qqggy4...|
| 14|-6.359685| 106.79607| qqggy4q8|[qqggy4q2, qqggy4...|
| 15|-6.357945|106.794945| qqggy4td|[qqggy4t6, qqggy4...|
| 16|-6.360725|106.795456| qqggy4n4|[qqggy4jf, qqggy4...|
| 17|-6.363701| 106.79653| qqggy1wb|[qqggy1w8, qqggy1...|
| 18| -6.36329|106.794586| qqggy1t7|[qqggy1t5, qqggy1...|
| 19|-6.363304| 106.79429| qqggy1t5|[qqggy1sg, qqggy1...|
+---+---------+----------+---------+--------------------+
For each id, I want to take its lat/long, look up the lat/long of all of its neighbors, and calculate the distance to each one. Every id should end up with a list of distances in meters to all of its neighbors.
I tried an iterative approach that loops over every row, selects from the dataframe, and computes the haversine distance, but the performance is awful. I am stuck on how to do this in a functional way in Spark. Can anyone help with suggestions or references?
Updated to address the desire for combinations
If you want to do all the combinations, the steps are basically: associate each neighbor ID with its lat/long, group them together into a single row for each combination set, then compute the distance for all the combinations. Here is example code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import Row
import itertools
schema = StructType([
StructField("id", StringType()),
StructField("lat", FloatType()),
StructField("long", FloatType()),
StructField("geohash_8", StringType()),
StructField("neighbors", ArrayType(StringType()))
])
data = [
("0", 10.0, 11.0, "A", ["B", "C", "D"]),
("1", 12.0, 13.0, "B", ["D"]),
("2", 14.0, 15.0, "C", []),
("3", 16.0, 17.0, "D", [])
]
input_df = spark.createDataFrame(sc.parallelize(data), schema)
# Explode to get a row for each comparison pair
df = input_df.withColumn('neighbor', explode('neighbors')).drop('neighbors')
# Join to get the lat/lon of the neighbor
neighbor_map = input_df.selectExpr('geohash_8 as nid', 'lat as nlat', 'long as nlong')
df = df.join(neighbor_map , col('neighbor') == col('nid'), 'inner').drop('nid')
# Add in rows for the root (geohash_8) records before grouping
root_rows = input_df.selectExpr("id", "lat", "long", "geohash_8", "geohash_8 as neighbor", "lat as nlat", "long as nlong")
df = df.unionAll(root_rows)
# Group by to roll the rows back up but now associating the lat/lon w/ the neighbors
df = df.selectExpr("id", "lat", "long", "geohash_8", "struct(neighbor, nlat, nlong) as neighbors")
df = df.groupBy("id", "lat", "long", "geohash_8").agg(collect_set("neighbors").alias("neighbors"))
# You now have all the data you need in one field, so you can write a python udf to do the combinations
def compute_distance(left_lat, left_lon, right_lat, right_lon):
    # Placeholder: substitute a real distance calculation (e.g. haversine) here
    return 10.0
def combinations(neighbors):
    # Every unordered pair of points in the group, with the distance between them
    result = []
    for left, right in itertools.combinations(neighbors, 2):
        dist = compute_distance(left['nlat'], left['nlong'], right['nlat'], right['nlong'])
        result.append(Row(left=left['neighbor'], right=right['neighbor'], dist=dist))
    return result
udf_schema = ArrayType(StructType([
StructField("left", StringType()),
StructField("right", StringType()),
StructField("dist", FloatType())
]))
combinations_udf = udf(combinations, udf_schema)
# Finally, apply the UDF
df = df.withColumn('neighbors', combinations_udf(col('neighbors')))
df.printSchema()
df.show()
Which produces this:
root
|-- id: string (nullable = true)
|-- lat: float (nullable = true)
|-- long: float (nullable = true)
|-- geohash_8: string (nullable = true)
|-- neighbors: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- left: string (nullable = true)
| | |-- right: string (nullable = true)
| | |-- dist: float (nullable = true)
+---+----+----+---------+------------------------------------------------------------------------------------+
|id |lat |long|geohash_8|neighbors |
+---+----+----+---------+------------------------------------------------------------------------------------+
|0 |10.0|11.0|A |[[D, C, 10.0], [D, A, 10.0], [D, B, 10.0], [C, A, 10.0], [C, B, 10.0], [A, B, 10.0]]|
|2 |14.0|15.0|C |[] |
|1 |12.0|13.0|B |[[D, B, 10.0]] |
|3 |16.0|17.0|D |[] |
+---+----+----+---------+------------------------------------------------------------------------------------+
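The compute_distance in the answer is a stub that always returns 10.0. Since the question asks for haversine distances in meters, here is a minimal sketch of what could go in its place. The Earth radius constant is an assumption (a commonly used mean value), and the sample points reuse coordinates from the answer's toy data as plain tuples rather than Spark rows.

```python
import itertools
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in meters


def compute_distance(left_lat, left_lon, right_lat, right_lon):
    """Great-circle (haversine) distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(left_lat), math.radians(right_lat)
    dphi = phi2 - phi1
    dlambda = math.radians(right_lon - left_lon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


# Same pairing logic as the UDF, on plain tuples of (name, lat, long)
points = [("A", 10.0, 11.0), ("B", 12.0, 13.0), ("D", 16.0, 17.0)]
distances = {
    (left[0], right[0]): compute_distance(left[1], left[2], right[1], right[2])
    for left, right in itertools.combinations(points, 2)
}
for pair, meters in distances.items():
    print(pair, round(meters))
```

The function takes the same four arguments as the stub, so it should drop into the UDF unchanged. Its result is a Python float, which matches the FloatType declared for dist.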

Select columns that satisfy a condition

I'm running the following notebook in zeppelin:
%spark.pyspark
l = [('user1', 33, 1.0, 'chess'), ('user2', 34, 2.0, 'tenis'), ('user3', None, None, ''), ('user4', None, 4.0, ' '), ('user5', None, 5.0, 'ski')]
df = spark.createDataFrame(l, ['name', 'age', 'ratio', 'hobby'])
df.show()
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- ratio: double (nullable = true)
|-- hobby: string (nullable = true)
+-----+----+-----+-----+
| name| age|ratio|hobby|
+-----+----+-----+-----+
|user1| 33| 1.0|chess|
|user2| 34| 2.0|tenis|
|user3|null| null| |
|user4|null| 4.0| |
|user5|null| 5.0| ski|
+-----+----+-----+-----+
agg_df = df.select(*[(1.0 - (count(c) / count('*'))).alias(c) for c in df.columns])
agg_df.show()
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- ratio: double (nullable = true)
|-- hobby: string (nullable = true)
+----+---+-------------------+-----+
|name|age| ratio|hobby|
+----+---+-------------------+-----+
| 0.0|0.6|0.19999999999999996| 0.0|
+----+---+-------------------+-----+
Now, I want to select from agg_df only the columns whose value is < 0.35. In this case it should return ['name', 'ratio', 'hobby']
I can't figure out how to do it. Any hint?
You mean values < 0.35? This should do it:
>>> [ key for (key,value) in agg_df.collect()[0].asDict().items() if value < 0.35 ]
['hobby', 'ratio', 'name']
To replace blank strings with null, use the following udf:
from pyspark.sql.functions import udf
process = udf(lambda x: None if not x else (x if x.strip() else None))
df.withColumn('hobby', process(df.hobby)).show()
+-----+----+-----+-----+
| name| age|ratio|hobby|
+-----+----+-----+-----+
|user1| 33| 1.0|chess|
|user2| 34| 2.0|tenis|
|user3|null| null| null|
|user4|null| 4.0| null|
|user5|null| 5.0| ski|
+-----+----+-----+-----+
Here is my attempt at the function I was looking for, based on rogue-one's suggestions. I am not sure it is the fastest or most optimized:
from pyspark.sql.functions import udf, count
from functools import reduce
def filter_columns(df, threshold=0.35):
    # udf that turns empty or whitespace-only strings into None
    process = udf(lambda x: None if not x else (x if x.strip() else None))
    # names of the string columns
    string_cols = [c for c, t in df.dtypes if t == 'string']
    # apply the udf to every string column
    new_df = reduce(lambda acc, x: acc.withColumn(x, process(x)), string_cols, df)
    # fraction of null values per column: 1 - non-null count / row count
    agg_df = new_df.select(*[(1.0 - (count(c) / count('*'))).alias(c) for c in new_df.columns])
    # keep only the columns whose null fraction is below the threshold
    cols_match_threshold = [key for (key, value) in agg_df.collect()[0].asDict().items() if value < threshold]
    return new_df.select(cols_match_threshold)
filter_columns(df, 0.35).show()
+-----+-----+
|ratio| name|
+-----+-----+
| 1.0|user1|
| 2.0|user2|
| null|user3|
| 4.0|user4|
| 5.0|user5|
+-----+-----+
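The two pieces of logic inside filter_columns are ordinary Python, so they can be checked without a Spark session. In this sketch `blank_to_none` and `columns_below` are hypothetical names for the udf body and the threshold filter; the hobby values come from the question, and the other ratios are the ones shown in agg_df, rounded.

```python
def blank_to_none(value):
    """Return None for None, empty, or whitespace-only strings; otherwise the value unchanged."""
    if value is None or not value.strip():
        return None
    return value


def columns_below(ratios, threshold=0.35):
    """Names of the columns whose null ratio is strictly below the threshold."""
    return [name for name, ratio in ratios.items() if ratio < threshold]


hobbies = ["chess", "tenis", "", " ", "ski"]
cleaned = [blank_to_none(h) for h in hobbies]
null_ratio = cleaned.count(None) / len(cleaned)
ratios = {"name": 0.0, "age": 0.6, "ratio": 0.2, "hobby": null_ratio}
print(cleaned, columns_below(ratios))
```

Once the blank hobbies count as null, the hobby column's ratio rises above the threshold, which is why the final output keeps only name and ratio.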

Resources