We are using a PySpark function on a data frame that throws an error. The error is most likely caused by a faulty row in the data frame.
The schema of the data frame looks like this:
root
|-- geo_name: string (nullable = true)
|-- geo_latitude: double (nullable = true)
|-- geo_longitude: double (nullable = true)
|-- geo_bst: integer (nullable = true)
|-- geo_bvw: integer (nullable = true)
|-- geometry_type: string (nullable = true)
|-- geometry_polygon: string (nullable = true)
|-- geometry_multipolygon: string (nullable = true)
|-- polygon: geometry (nullable = false)
I have converted the CSV column "geometry_polygon" to the geometry-typed column "polygon" like this:
station_groups_gdf.createOrReplaceTempView("station_gdf")
spatial_station_groups_gdf = spark_sedona.sql("SELECT *, ST_PolygonFromText(station_gdf.geometry_polygon, ',') AS polygon FROM station_gdf")
Example input data looks like this:
-RECORD 0-------------------------------------
geo_name | Neckarkanal
geo_latitude | 49.486697
geo_longitude | 8.504944
geo_bst | 0
geo_bvw | 0
geometry_type | Polygon
geometry_polygon | 8.4937, 49.4892, ...
geometry_multipolygon | null
polygon | POLYGON ((8.4937 ...
The error occurs when simply calling:
df.show()
The error:
java.lang.IllegalArgumentException: Points of LinearRing do not form a closed linestring
To pinpoint these rows, we would like to iterate through the data frame and apply a function to delete the invalid values. Something like this:
dataframe.where(dataframe.polygon == valid).show()
dataframe.filter(dataframe.polygon == valid).show()
Do you know the best way to iterate row by row and delete invalid values without ever materializing the PySpark data frame in its entirety (which triggers the error message and aborts the job)?
Since you already have a dataframe, a pandas_udf check should work very well. The function itself may not look very nice, but it works. In the example below, the row with "geo_name" = X holds an invalid polygon, and in the output the polygon for this row is not created.
Input:
df = spark_sedona.createDataFrame(
    [('A', '-74, 40, -73, 39, -75, 38, -74, 40'),
     ('X', '-11'),
     ('Y', None),
     ('B', '-33, 50, -30, 38, -40, 27, -33, 50')],
    ['geo_name', 'geometry_polygon']
)
Script:
from pyspark.sql import functions as F
import pandas as pd
from shapely.geometry import Polygon
@F.pandas_udf('string')
def nullify_invalid_polygon(ser: pd.Series) -> pd.Series:
    def nullify(s):
        try:
            # Pair the comma-separated numbers into (x, y) tuples
            p_shell = list(zip(*[iter(map(float, s.split(',')))]*2))
            return s if Polygon(p_shell).is_valid and p_shell != [] else None
        except (ValueError, AttributeError):
            # Unparseable or missing values -> nullify the polygon string
            return None
    return ser.map(nullify)

df = df.withColumn('geometry_polygon', nullify_invalid_polygon('geometry_polygon'))
df.createOrReplaceTempView("station_gdf")
df = spark_sedona.sql("SELECT *, CASE WHEN isnull(geometry_polygon) THEN null ELSE ST_PolygonFromText(geometry_polygon, ',') END AS polygon FROM station_gdf")
Result:
df.printSchema()
# root
# |-- geo_name: string (nullable = true)
# |-- geometry_polygon: string (nullable = true)
# |-- polygon: geometry (nullable = true)
df.show(truncate=0)
# +--------+----------------------------------+------------------------------------------+
# |geo_name|geometry_polygon |polygon |
# +--------+----------------------------------+------------------------------------------+
# |A |-74, 40, -73, 39, -75, 38, -74, 40|POLYGON ((-74 40, -73 39, -75 38, -74 40))|
# |X |null |null |
# |Y |null |null |
# |B |-33, 50, -30, 38, -40, 27, -33, 50|POLYGON ((-33 50, -30 38, -40 27, -33 50))|
# +--------+----------------------------------+------------------------------------------+
The idea is to apply Polygon.is_valid. But since in a few cases shapely throws an error instead of returning False, the check is wrapped in try...except.
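If you would rather drop the faulty rows entirely instead of carrying null polygons, a short follow-up filter on the nullified column works (a minimal sketch, assuming the df produced by the script above):
# Keep only the rows whose geometry_polygon survived the validity check
clean_df = df.filter(F.col('geometry_polygon').isNotNull())
clean_df.show(truncate=0)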
Related
I'm trying to write some code to un-nest JSON into DataFrames using PySpark (3.0.1) in Python 3.9.1.
I have some dummy data with a schema as follows:
data.printSchema()
root
|-- recordID: string (nullable = true)
|-- customerDetails: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- dob: string (nullable = true)
|-- familyMembers: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- relationship: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- contactNumbers: struct (nullable = true)
| | | |-- work: string (nullable = true)
| | | |-- home: string (nullable = true)
| | |-- addressDetails: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- addressType: string (nullable = true)
| | | | |-- address: string (nullable = true)
When I select fields from familyMembers I get the following results as expected:
data.select('familyMembers.contactNumbers.work').show(truncate=False)
+------------------------------------------------+
|work |
+------------------------------------------------+
|[(07) 4612 3880, (03) 5855 2377, (07) 4979 1871]|
|[(07) 4612 3880, (03) 5855 2377] |
+------------------------------------------------+
data.select('familyMembers.name').show(truncate=False)
+------------------------------------+
|name |
+------------------------------------+
|[Jane Smith, Bob Smith, Simon Smith]|
|[Jackie Sacamano, Simon Sacamano] |
+------------------------------------+
Yet when I try to select fields from the addressDetails ArrayType (beneath familyMembers) I get an error:
>>> data.select('familyMembers.addressDetails.address').show(truncate=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1421, in select
jdf = self._jdf.select(self._jcols(*cols))
File "/usr/local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/usr/local/lib/python3.9/site-packages/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: cannot resolve '`familyMembers`.`addressDetails`['address']' due to data type mismatch: argument 2 requires integral type, however, ''address'' is of string type.;;
'Project [familyMembers#71.addressDetails[address] AS address#277]
+- LogicalRDD [recordID#69, customerDetails#70, familyMembers#71], false
I'm confused. Both familyMembers and addressDetails are ArrayTypes, yet selecting from one works but not the other. Is there an explanation for this, or something I've missed? Is it because one is nested within the other?
Code to reproduce (with just 1 record):
from pyspark.sql.types import StructType
from pyspark.sql import SparkSession, DataFrame
import json
rawdata = [{"recordID":"abc-123","customerDetails":{"name":"John Smith","dob":"1980-04-23"},"familyMembers":[{"relationship":"mother","name":"Jane Smith","contactNumbers":{"work":"(07) 4612 3880","home":"(08) 8271 1577"},"addressDetails":[{"addressType":"residential","address":"29 Commonwealth St, Clifton, QLD 4361 "},{"addressType":"work","address":"20 A Yeo Ave, Highgate, SA 5063 "}]},{"relationship":"father","name":"Bob Smith","contactNumbers":{"work":"(03) 5855 2377","home":"(03) 9773 2483"},"addressDetails":[{"addressType":"residential","address":"1735 Fenaughty Rd, Kyabram South, VIC 3620"},{"addressType":"work","address":"12 Haldane St, Bonbeach, VIC 3196 "}]},{"relationship":"brother","name":"Simon Smith","contactNumbers":{"work":"(07) 4979 1871","home":"(08) 9862 6017"},"addressDetails":[{"addressType":"residential","address":"6 Darren St, Sun Valley, QLD 4680"},{"addressType":"work","address":"Arthur River, WA 6315"}]}]},]
strschema = '{"fields":[{"metadata":{},"name":"recordID","nullable":true,"type":"string"},{"metadata":{},"name":"customerDetails","nullable":true,"type":{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"dob","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"familyMembers","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"relationship","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"contactNumbers","nullable":true,"type":{"fields":[{"metadata":{},"name":"work","nullable":true,"type":"string"},{"metadata":{},"name":"home","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"addressDetails","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"addressType","nullable":true,"type":"string"},{"metadata":{},"name":"address","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}}],"type":"struct"},"type":"array"}}],"type":"struct"}'
spark = SparkSession.builder.appName("json-un-nester").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
schema = StructType.fromJson(json.loads(strschema))
datardd = sc.parallelize(rawdata)
data = spark.createDataFrame(datardd, schema=schema)
data.show()
data.select('familyMembers.name').show(truncate=False)
data.select('familyMembers.addressDetails.address').show(truncate=False)
To understand this, you can print the schema of the selected column:
data.select('familyMembers.addressDetails').printSchema()
#root
# |-- familyMembers.addressDetails: array (nullable = true)
# | |-- element: array (containsNull = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- addressType: string (nullable = true)
# | | | |-- address: string (nullable = true)
Here you have an array of arrays of structs, which is different from the initial schema. So you can't directly access address from the root, but you can select the first element of the nested array and then access the struct field address:
data.selectExpr("familyMembers.addressDetails[0].address").show(truncate=False)
#+--------------------------------------------------------------------------+
#|familyMembers.addressDetails AS addressDetails#29[0].address |
#+--------------------------------------------------------------------------+
#|[29 Commonwealth St, Clifton, QLD 4361 , 20 A Yeo Ave, Highgate, SA 5063 ]|
#+--------------------------------------------------------------------------+
Or:
data.select(F.col('familyMembers.addressDetails').getItem(0).getItem("address"))
In addition to the answer that @blackbishop provided, you can also use a combination of select and expr to get the same output:
data.select(expr('familyMembers.addressDetails[0].address'))
You can also use explode to get all the addresses if you want:
data.select(explode('familyMembers.addressDetails')).select("col.address")
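For completeness, a minimal runnable sketch with the required imports, assuming the data dataframe built in the question:
from pyspark.sql.functions import expr, explode

# First address of each family member, one array per record
data.select(expr('familyMembers.addressDetails[0].address')).show(truncate=False)

# One row per family member, each holding that member's array of addresses
data.select(explode('familyMembers.addressDetails')).select('col.address').show(truncate=False)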
I am trying to show the maximum value of a column while grouping rows by a date column.
So I tried this code:
maxVal = dfSelect.select('*')\
.groupBy('DATE')\
.agg(max('CLOSE'))
But output looks like that:
+----------+----------+
| DATE|max(CLOSE)|
+----------+----------+
|1987-05-08| 43.51|
|1987-05-29| 39.061|
+----------+----------+
I want to have output like below:
+------+---+----------+------+------+------+------+------+---+----------+
|TICKER|PER| DATE| TIME| OPEN| HIGH| LOW| CLOSE|VOL|max(CLOSE)|
+------+---+----------+------+------+------+------+------+---+----------+
| CDG| D|1987-01-02|000000|50.666|51.441|49.896|50.666| 0| 50.666|
| ABC| D|1987-01-05|000000|51.441| 52.02|51.441|51.441| 0| 51.441|
+------+---+----------+------+------+------+------+------+---+----------+
So my question is: how do I change the code to get output with all columns plus the aggregated 'CLOSE' column?
The schema of my data looks like below:
root
|-- TICKER: string (nullable = true)
|-- PER: string (nullable = true)
|-- DATE: date (nullable = true)
|-- TIME: string (nullable = true)
|-- OPEN: float (nullable = true)
|-- HIGH: float (nullable = true)
|-- LOW: float (nullable = true)
|-- CLOSE: float (nullable = true)
|-- VOL: integer (nullable = true)
|-- OPENINT: string (nullable = true)
If you want the same aggregation for all the columns in the original dataframe, then you can do something like:
import pyspark.sql.functions as F
expr = [F.max(coln).alias(coln) for coln in df.columns if 'date' not in coln]  # df is your dataframe
df_res = df.groupby('date').agg(*expr)
If you want multiple aggregations, then you can do something like:
sub_col1 = [...]  # define your first list of columns
sub_col2 = [...]  # define your second list of columns
expr1 = [F.max(coln).alias(coln) for coln in sub_col1 if 'date' not in coln]
expr2 = [F.first(coln).alias(coln) for coln in sub_col2 if 'date' not in coln]
expr=expr1+expr2
df_res = df.groupby('date').agg(*expr)
If you want only one of the columns aggregated and added to your original dataframe, then you can do a self join after aggregating:
df_agg = df.groupby('date').agg(F.max('close').alias('close_agg')).withColumn("dummy", F.lit("dummy"))  # the dummy column is a workaround for Spark self-join issues
df_join = df.join(df_agg,on='date',how='left')
Or you can use a window function:
from pyspark.sql import Window
w= Window.partitionBy('date')
df_res = df.withColumn("max_close",F.max('close').over(w))
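Putting the window approach together for the schema in the question (a minimal sketch, assuming the dfSelect dataframe with the columns shown above):
from pyspark.sql import functions as F, Window

w = Window.partitionBy('DATE')
# Adds the per-date maximum of CLOSE as a new column while keeping every original column
result = dfSelect.withColumn('max_close', F.max('CLOSE').over(w))
result.show()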
I have a CSV file which, when read into a Spark dataframe, prints the schema below:
|-- list_values: string (nullable = true)
the values in the column list_values are something like:
[[[167, 109, 80, ...]]]
Is it possible to convert this to array type instead of string?
I tried splitting it and using code available online for similar problems:
df_1 = df.select('list_values', split(col("list_values"), ",\s*").alias("list_values"))
but if I run the above code, the resulting array drops values from the original array. The output of the above code is:
[, 109, 80, 69, 5...
which is different from the original array (the 167 is missing):
[[[167, 109, 80, ...]]]
Since I am new to Spark, I don't know much about how this is done (in plain Python I could have used ast.literal_eval, but Spark has no provision for this).
So, to repeat the question:
How can I convert/cast an array stored as a string to an actual array, i.e. a '[]' to [] conversion?
Suppose your DataFrame was the following:
df.show()
#+----+------------------+
#|col1| col2|
#+----+------------------+
#| a|[[[167, 109, 80]]]|
#+----+------------------+
df.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
You could use pyspark.sql.functions.regexp_replace to remove the leading and trailing square brackets. Once that's done, you can split the resulting string on ", ":
from pyspark.sql.functions import split, regexp_replace
df2 = df.withColumn(
"col3",
split(regexp_replace("col2", r"(^\[\[\[)|(\]\]\]$)", ""), ", ")
)
df2.show()
#+----+------------------+--------------+
#|col1| col2| col3|
#+----+------------------+--------------+
#| a|[[[167, 109, 80]]]|[167, 109, 80]|
#+----+------------------+--------------+
df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# | |-- element: string (containsNull = true)
If you wanted the column as an array of integers, you could use cast:
from pyspark.sql.functions import col
df2 = df2.withColumn("col3", col("col3").cast("array<int>"))
df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# | |-- element: integer (containsNull = true)
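If you want to preserve the full nesting instead of stripping the brackets, an alternative is to parse the string as JSON (a sketch, assuming Spark 2.4+, where from_json accepts array schemas):
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import ArrayType, IntegerType

# The string is valid JSON, so it can be parsed into the full nested structure
nested_schema = ArrayType(ArrayType(ArrayType(IntegerType())))
df4 = df.withColumn("col4", from_json(col("col2"), nested_schema))
df4.printSchema()  # col4: array<array<array<int>>>, mirroring the original nesting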
I have a PySpark schema which looks like this:
root
|-- id: string (nullable = true)
|-- long: float (nullable = true)
|-- lat: float (nullable = true)
|-- geohash: string (nullable = true)
|-- neighbors: array (nullable = true)
| |-- element: string (containsNull = true)
The data looks like this:
+---+---------+----------+---------+--------------------+
| id| lat| long|geohash_8| neighbors|
+---+---------+----------+---------+--------------------+
| 0|-6.361755| 106.79653| qqggy1yu|[qqggy1ys, qqggy1...|
| 1|-6.358584|106.793945| qqggy4ky|[qqggy4kw, qqggy4...|
| 2|-6.362967|106.798775| qqggy38m|[qqggy38j, qqggy3...|
| 3|-6.358316| 106.79832| qqggy680|[qqggy4xb, qqggy6...|
| 4| -6.36016| 106.7981| qqggy60j|[qqggy4pv, qqggy6...|
| 5|-6.357476| 106.79842| qqggy68j|[qqggy4xv, qqggy6...|
| 6|-6.360814| 106.79435| qqggy4j3|[qqggy4j1, qqggy4...|
| 7|-6.358231|106.794365| qqggy4t2|[qqggy4t0, qqggy4...|
| 8|-6.357654| 106.79736| qqggy4x7|[qqggy4x5, qqggy4...|
| 9|-6.358781|106.794624| qqggy4mm|[qqggy4mj, qqggy4...|
| 10|-6.357654| 106.79443| qqggy4t7|[qqggy4t5, qqggy4...|
| 11|-6.357079| 106.79443| qqggy4tr|[qqggy4tp, qqggy4...|
| 12|-6.359929| 106.79698| qqggy4pn|[qqggy4ny, qqggy4...|
| 13|-6.358111| 106.79633| qqggy4w9|[qqggy4w3, qqggy4...|
| 14|-6.359685| 106.79607| qqggy4q8|[qqggy4q2, qqggy4...|
| 15|-6.357945|106.794945| qqggy4td|[qqggy4t6, qqggy4...|
| 16|-6.360725|106.795456| qqggy4n4|[qqggy4jf, qqggy4...|
| 17|-6.363701| 106.79653| qqggy1wb|[qqggy1w8, qqggy1...|
| 18| -6.36329|106.794586| qqggy1t7|[qqggy1t5, qqggy1...|
| 19|-6.363304| 106.79429| qqggy1t5|[qqggy1sg, qqggy1...|
+---+---------+----------+---------+--------------------+
I want to calculate, for each id, the distance from its lat/long to the lat/long of each of its neighbors, so that every id ends up with a list of distances in meters to all of its neighbors.
I tried an iterative approach that loops over every row, selects a dataframe, and then computes the haversine distance, but the performance is awful. I am stuck on how to do this in a functional way in Spark. Can anyone help with suggestions or references?
Updated to address the desire for combinations.
If you want to do all the combinations, the steps are basically: associate each neighbor ID with its lat/long, group them together into a single row for each combination set, then compute the distance over all the combinations. Here is example code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import Row
import itertools

schema = StructType([
    StructField("id", StringType()),
    StructField("lat", FloatType()),
    StructField("long", FloatType()),
    StructField("geohash_8", StringType()),
    StructField("neighbors", ArrayType(StringType()))
])

data = [
    ("0", 10.0, 11.0, "A", ["B", "C", "D"]),
    ("1", 12.0, 13.0, "B", ["D"]),
    ("2", 14.0, 15.0, "C", []),
    ("3", 16.0, 17.0, "D", [])
]

input_df = spark.createDataFrame(sc.parallelize(data), schema)

# Explode to get a row for each comparison pair
df = input_df.withColumn('neighbor', explode('neighbors')).drop('neighbors')

# Join to get the lat/lon of the neighbor
neighbor_map = input_df.selectExpr('geohash_8 as nid', 'lat as nlat', 'long as nlong')
df = df.join(neighbor_map, col('neighbor') == col('nid'), 'inner').drop('nid')

# Add in rows for the root (geohash_8) records before grouping
root_rows = input_df.selectExpr("id", "lat", "long", "geohash_8", "geohash_8 as neighbor", "lat as nlat", "long as nlong")
df = df.unionAll(root_rows)

# Group by to roll the rows back up but now associating the lat/lon w/ the neighbors
df = df.selectExpr("id", "lat", "long", "geohash_8", "struct(neighbor, nlat, nlong) as neighbors")
df = df.groupBy("id", "lat", "long", "geohash_8").agg(collect_set("neighbors").alias("neighbors"))

# You now have all the data you need in one field, so you can write a python udf to do the combinations
def compute_distance(left_lat, left_lon, right_lat, right_lon):
    return 10.0  # stub: replace with a real distance calculation

def combinations(neighbors):
    result = []
    for left, right in itertools.combinations(neighbors, 2):
        dist = compute_distance(left['nlat'], left['nlong'], right['nlat'], right['nlong'])
        result.append(Row(left=left['neighbor'], right=right['neighbor'], dist=dist))
    return result

udf_schema = ArrayType(StructType([
    StructField("left", StringType()),
    StructField("right", StringType()),
    StructField("dist", FloatType())
]))
combinations_udf = udf(combinations, udf_schema)

# Finally, apply the UDF
df = df.withColumn('neighbors', combinations_udf(col('neighbors')))
df.printSchema()
df.show()
Which produces this:
root
|-- id: string (nullable = true)
|-- lat: float (nullable = true)
|-- long: float (nullable = true)
|-- geohash_8: string (nullable = true)
|-- neighbors: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- neighbor: string (nullable = true)
| | |-- nlat: float (nullable = true)
| | |-- nlong: float (nullable = true)
+---+----+----+---------+------------------------------------------------------------------------------------+
|id |lat |long|geohash_8|neighbors |
+---+----+----+---------+------------------------------------------------------------------------------------+
|0 |10.0|11.0|A |[[D, C, 10.0], [D, A, 10.0], [D, B, 10.0], [C, A, 10.0], [C, B, 10.0], [A, B, 10.0]]|
|2 |14.0|15.0|C |[] |
|1 |12.0|13.0|B |[[D, B, 10.0]] |
|3 |16.0|17.0|D |[] |
+---+----+----+---------+------------------------------------------------------------------------------------+
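The compute_distance stub above always returns 10.0, which is why every dist in the output is 10.0. A minimal haversine sketch that could be dropped in instead (assuming coordinates in degrees and a result in meters) looks like this:
import math

def compute_distance(left_lat, left_lon, right_lat, right_lon):
    # Great-circle (haversine) distance in meters between two points given in degrees
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(left_lat), math.radians(right_lat)
    dphi = math.radians(right_lat - left_lat)
    dlmb = math.radians(right_lon - left_lon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))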
Consider the below example
>>> l = [("US","City1",125),("US","City2",123),("Europe","CityX",23),("Europe","CityY",17)]
>>> print l
[('US', 'City1', 125), ('US', 'City2', 123), ('Europe', 'CityX', 23), ('Europe', 'CityY', 17)]
>>> sc = SparkContext(appName="N")
>>> sqlsc = SQLContext(sc)
>>> df = sqlsc.createDataFrame(l)
>>> df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: long (nullable = true)
>>> df.registerTempTable("t1")
>>> rdf=sqlsc.sql("Select _1,sum(_3) from t1 group by _1").show()
+------+---+
| _1|_c1|
+------+---+
| US|248|
|Europe| 40|
+------+---+
>>> rdf.printSchema()
root
|-- _1: string (nullable = true)
|-- _c1: long (nullable = true)
>>> rdf.registerTempTable("t2")
>>> sqlsc.sql("Select * from t2 where _c1 > 200").show()
+---+---+
| _1|_c1|
+---+---+
| US|248|
+---+---+
So basically, I am trying to find all the _3 values (which could be the population subscribed to some service) that are above a threshold in each country. In the approach above, an additional dataframe (rdf) is created.
Now, how do I eliminate the rdf dataframe and embed the complete query within the df dataframe itself?
I tried, but PySpark throws an error:
>>> sqlsc.sql("Select _1,sum(_3) from t1 group by _1").show()
+------+---+
| _1|_c1|
+------+---+
| US|248|
|Europe| 40|
+------+---+
>>> sqlsc.sql("Select _1,sum(_3) from t1 group by _1 where _c1 > 200").show()
Traceback (most recent call last):
File "/ghostcache/kimanjun/spark-1.6.0/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.sql.
: java.lang.RuntimeException: [1.39] failure: ``union'' expected but `where' found
Here is a solution without any temp tables:
# Alias the import so it doesn't conflict with Python's built-in sum
from pyspark.sql.functions import sum as _sum
gDf = df.groupBy(df._1).agg(_sum(df._3).alias('sum'))
gDf.filter(gDf.sum > 200).show()
With this solution we group and aggregate with a sum. To avoid issues with the name sum, it is better to alias the aggregate and do the filter on a separate object.
I recommend this link to see some useful approaches that are much more powerful than using direct SQL on the dataframe.
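If you prefer to stay in SQL, the same filter can be expressed with HAVING instead of WHERE (a sketch against the t1 temp table registered in the question):
sqlsc.sql("SELECT _1, SUM(_3) AS total FROM t1 GROUP BY _1 HAVING SUM(_3) > 200").show()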