Hi, I am kind of new to Spark and I am not sure how to approach this.
I have two tables (shown much smaller here for easier explanation):
A: weather data
B: travel data
I need to join these tables by finding the closest weather station to the point where the trip started, on the same date, and do the same for the point where the trip ended. At the end I want one row per trip, containing the weather data from the closest station both at the time the trip started and at the time it finished.
I have done something similar with GeoPandas and a UDF, but that was much easier because I was looking for an intersection, like this:
def find_state_gps(lat, long):
    df = gdf_states.apply(lambda x: x["NAME"] if x["geometry"].intersects(Point(long, lat)) else None, axis=1)
    idx = df.first_valid_index()
    value = df.loc[idx] if idx is not None else "Not in USA territory"
    return value

state_gps = udf(find_state_gps, StringType())
I am not sure how to handle the logic this time.
I also tried the following query, with no luck:
query = "SELECT STATION,\
NAME,\
LATITUDE,\
LONGITUDE,\
AWND,\
p.id_trip,\
p.Latitude,\
p.Longitude,\
p.startDate,\
Abs(p.latitude-LATITUDE)**2 + Abs(p.Longitude-LONGITUDE)**2\
AS dd\
FROM df2\
CROSS JOIN (\
SELECT id AS id_trip,\
station_id,\
Latitude,\
Longitude,\
startDate\
FROM df1\
) AS p ON 1=1\
ORDER BY dd"
and got the following error:
ParseException:
mismatched input '2' expecting {, ';'}(line 1, pos 189)
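(As a side note, the ParseException comes from **, which Spark SQL does not support as an operator; squaring has to go through POWER (or pow). A minimal corrected sketch of the query, assuming df1 and df2 are registered as temporary views and keeping the same column names, might look like this:

query = """
    SELECT df2.STATION,
           df2.NAME,
           df2.LATITUDE,
           df2.LONGITUDE,
           df2.AWND,
           p.id_trip,
           p.Latitude,
           p.Longitude,
           p.startDate,
           POWER(df2.LATITUDE - p.Latitude, 2) + POWER(df2.LONGITUDE - p.Longitude, 2) AS dd
    FROM df2
    CROSS JOIN (
        SELECT id AS id_trip, station_id, Latitude, Longitude, startDate
        FROM df1
    ) AS p
    ORDER BY dd
"""
closest = spark.sql(query)

The outer columns are qualified with df2 because the unqualified names would be ambiguous against the subquery's columns.)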
At the end I want something like this, without repeated trips:
id | started_date | finish_date | finished | weather_station_start | weather_station_end | more columns about weather for starting and ending trip locations
---|--------------|-------------|----------|-----------------------|---------------------|------
1  | bim          | baz         | bim      | baz                   | bim                 | bim
2  | bim          | baz         | bim      | baz                   | bim                 | bim
I really appreciate your help guys.
I changed your sample data a bit because all stations have the same coordinates:
travel_data = spark.createDataFrame(
    [
        ('0', '2013-06-01', '00:00:01', '-73.98915076', '40.7423543', '40.74317449', '-74.00366443', '2013-06-01'),
        ('1', '2013-06-01', '00:00:08', '-73.98915076', '40.7423543', '40.74317449', '-74.00366443', '2013-06-01'),
        ('2', '2013-06-01', '00:00:44', '-73.99595065', '40.69512845', '40.69512845', '-73.99595065', '2013-06-01'),
        ('3', '2013-06-01', '00:01:04', '-73.98758561', '40.73524276', '40.6917823', '-73.9737299', '2013-06-01'),
        ('4', '2013-06-01', '00:01:22', '-74.01677685', '40.70569254', '40.68926942', '-73.98912867', '2013-06-01'),
    ],
    ['id', 'startDate', 'startTime', 'Longitude', 'Latitude', 'end station latitude', 'end station longitude', 'stopdate']
)

weather_data = spark.createDataFrame(
    [
        ('USINYWC0003', 'WHITE PLAINS 3.1 NNW 3, NY US', '41.0639', '-73.7722', '71', '2013-06-01', '', '', '', '', ''),
        ('USINYWC0002', 'WHITE PLAINS 3.1 NNW 2, NY US', '41.0638', '-73.7723', '71', '2013-06-02', '', '', '', '', ''),
        ('USINYWC0001', 'WHITE PLAINS 3.1 NNW 1, NY US', '41.0635', '-73.7724', '71', '2013-06-03', '', '', '', '', ''),
    ],
    ['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE', 'AWND', 'AWND ATTRIBUTES', 'DAPR', 'DAPR ATTRIBUTES', 'DASE']
)
+---+----------+---------+------------+-----------+--------------------+---------------------+----------+
| id| startDate|startTime| Longitude| Latitude|end station latitude|end station longitude| stopdate|
+---+----------+---------+------------+-----------+--------------------+---------------------+----------+
| 0|2013-06-01| 00:00:01|-73.98915076| 40.7423543| 40.74317449| -74.00366443|2013-06-01|
| 1|2013-06-01| 00:00:08|-73.98915076| 40.7423543| 40.74317449| -74.00366443|2013-06-01|
| 2|2013-06-01| 00:00:44|-73.99595065|40.69512845| 40.69512845| -73.99595065|2013-06-01|
| 3|2013-06-01| 00:01:04|-73.98758561|40.73524276| 40.6917823| -73.9737299|2013-06-01|
| 4|2013-06-01| 00:01:22|-74.01677685|40.70569254| 40.68926942| -73.98912867|2013-06-01|
+---+----------+---------+------------+-----------+--------------------+---------------------+----------+
+-----------+--------------------+--------+---------+---------+----------+----+---------------+----+---------------+----+
| STATION| NAME|LATITUDE|LONGITUDE|ELEVATION| DATE|AWND|AWND ATTRIBUTES|DAPR|DAPR ATTRIBUTES|DASE|
+-----------+--------------------+--------+---------+---------+----------+----+---------------+----+---------------+----+
|USINYWC0003|WHITE PLAINS 3.1 ...| 41.0639| -73.7722| 71|2013-06-01| | | | | |
|USINYWC0002|WHITE PLAINS 3.1 ...| 41.0638| -73.7723| 71|2013-06-02| | | | | |
|USINYWC0001|WHITE PLAINS 3.1 ...| 41.0635| -73.7724| 71|2013-06-03| | | | | |
+-----------+--------------------+--------+---------+---------+----------+----+---------------+----+---------------+----+
Then cross-join the two dataframes to calculate the haversine distance between the start/end points and every station. A cross join is not the best solution, but depending on the size of your data it might be the easiest way.
from pyspark.sql.types import *
from pyspark.sql.functions import col, radians, asin, sin, sqrt, cos, max, min
from pyspark.sql import Window as W
# 2 * 3963 * 5280 = 2 * Earth radius in miles * feet per mile, so the distances come out in feet
join_df = travel_data \
    .crossJoin(weather_data.select('NAME', col('LATITUDE').alias('st_LAT'), col('LONGITUDE').alias('st_LON'), 'AWND')) \
    .withColumn("dlon_start", radians(col("st_LON")) - radians(col("Longitude"))) \
    .withColumn("dlat_start", radians(col("st_LAT")) - radians(col("Latitude"))) \
    .withColumn("haversine_dist_start", asin(sqrt(
        sin(col("dlat_start") / 2) ** 2 + cos(radians(col("Latitude")))
        * cos(radians(col("st_LAT"))) * sin(col("dlon_start") / 2) ** 2
    )) * 2 * 3963 * 5280) \
    .withColumn("dlon_end", radians(col("st_LON")) - radians(col("end station longitude"))) \
    .withColumn("dlat_end", radians(col("st_LAT")) - radians(col("end station latitude"))) \
    .withColumn("haversine_dist_end", asin(sqrt(
        sin(col("dlat_end") / 2) ** 2 + cos(radians(col("end station latitude")))  # use the end-point latitude here
        * cos(radians(col("st_LAT"))) * sin(col("dlon_end") / 2) ** 2
    )) * 2 * 3963 * 5280) \
    .drop('dlon_start', 'dlat_start', 'dlon_end', 'dlat_end')
Finally, use window functions to pick the closest station to the start point (result1) and the closest station to the end point (result2):
W = W.partitionBy("id")

result1 = join_df \
    .withColumn("min_dist_start", min('haversine_dist_start').over(W)) \
    .filter(col("min_dist_start") == col('haversine_dist_start')) \
    .select('id',
            col('startDate').alias('started_date'),
            col('stopdate').alias('finish_date'),
            col('NAME').alias('weather_station_start'),
            col('Latitude').alias('Latitude_start'),
            col('Longitude').alias('Longitude_start'))

result2 = join_df \
    .withColumn("min_dist_end", min('haversine_dist_end').over(W)) \
    .filter(col("min_dist_end") == col('haversine_dist_end')) \
    .select('id', col('NAME').alias('weather_station_end'))

final = result1.join(result2, 'id', 'left')
final.show()
Not sure which columns you want in the output, but I hope this gives you some insights.
output:
+---+------------+-----------+-----------------------------+--------------+---------------+-----------------------------+
|id |started_date|finish_date|weather_station_start |Latitude_start|Longitude_start|weather_station_end |
+---+------------+-----------+-----------------------------+--------------+---------------+-----------------------------+
|0 |2013-06-01 |2013-06-01 |WHITE PLAINS 3.1 NNW 1, NY US|40.7423543 |-73.98915076 |WHITE PLAINS 3.1 NNW 1, NY US|
|1 |2013-06-01 |2013-06-01 |WHITE PLAINS 3.1 NNW 1, NY US|40.7423543 |-73.98915076 |WHITE PLAINS 3.1 NNW 1, NY US|
|2 |2013-06-01 |2013-06-01 |WHITE PLAINS 3.1 NNW 1, NY US|40.69512845 |-73.99595065 |WHITE PLAINS 3.1 NNW 1, NY US|
|3 |2013-06-01 |2013-06-01 |WHITE PLAINS 3.1 NNW 1, NY US|40.73524276 |-73.98758561 |WHITE PLAINS 3.1 NNW 1, NY US|
|4 |2013-06-01 |2013-06-01 |WHITE PLAINS 3.1 NNW 1, NY US|40.70569254 |-74.01677685 |WHITE PLAINS 3.1 NNW 1, NY US|
+---+------------+-----------+-----------------------------+--------------+---------------+-----------------------------+
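If you also want the actual weather measurements (for example AWND) for the start and end stations, just carry them through the two selects above. A rough sketch reusing the join_df and W defined earlier, with AWND as the assumed measurement of interest:

result1 = join_df \
    .withColumn("min_dist_start", min('haversine_dist_start').over(W)) \
    .filter(col("min_dist_start") == col('haversine_dist_start')) \
    .select('id',
            col('startDate').alias('started_date'),
            col('stopdate').alias('finish_date'),
            col('NAME').alias('weather_station_start'),
            col('AWND').alias('AWND_start'))

result2 = join_df \
    .withColumn("min_dist_end", min('haversine_dist_end').over(W)) \
    .filter(col("min_dist_end") == col('haversine_dist_end')) \
    .select('id',
            col('NAME').alias('weather_station_end'),
            col('AWND').alias('AWND_end'))

final = result1.join(result2, 'id', 'left')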
Related
I have a DataFrame like this:
df = spark.createDataFrame([
    [["Apple"], ['iPhone EE', 'iPhone 11', 'iPhone 11 Pro']],
    [["Acer"], ['Iconia Talk S', 'liquid Z6 Plus']],
    [["Casio"], ['Casio G\'zOne Brigade']],
    [["Alcatel"], []],
    [["HTC", "Honor"], ["Play 4", "Play 7"]]
]).toDF("brand", "type")
And a CSV like this:
Apple;iPhone EE
Apple;iPhone 11 Pro
Apple;iPhone XS
Acer;liquid Z6 Plus
Acer;Acer Predator 8
Casio;Casio G'zOne Ravine
Alcatel;3L
HTC;Play 4
Honor;Play 7
I need to create a new boolean column match.
If the combination of brand and type matches one of the rows from the CSV it should be True, otherwise False.
Expected output:
Brand      | Type                                  | Match
-----------+---------------------------------------+------
Apple      | [iPhone EE, iPhone 11, iPhone 11 Pro] | True
Acer       | [Iconia Talk S, liquid Z6 Plus]       | True
Casio      | [Casio G'zOne Brigade]                | False
Alcatel    | []                                    | False
HTC, Honor | [Play 4, Play 7]                      | True
Update
brand is also of type array<string>
The CSV file is just a start; it can be converted to a DataFrame or a dictionary (whatever fits best).
How can I best accomplish this?
You can try size + array_intersect to set up this flag.
from pyspark.sql.functions import collect_set, size, array_intersect, broadcast, expr, flatten, collect_list, array_join
df_list = spark.read.csv("/path/to/csv_list", sep=';').toDF('brand_name','type')
df1 = df_list.groupby('brand_name').agg(collect_set('type').alias('types'))
df_new = df.join(broadcast(df1), expr("array_contains(brand, brand_name)"), "left") \
    .groupby('brand', 'Type') \
    .agg(flatten(collect_list('types')).alias('types')) \
    .select(array_join('brand', ', ').alias('brand'), 'Type',
            (size(array_intersect('type', 'types')) > 0).alias("Match"))
df_new.show(5,0)
+----------+-------------------------------------+-----+
|brand |Type |Match|
+----------+-------------------------------------+-----+
|Alcatel |[] |false|
|HTC, Honor|[Play 4, Play 7] |true |
|Casio |[Casio G'zOne Brigade] |false|
|Acer |[Iconia Talk S, liquid Z6 Plus] |true |
|Apple |[iPhone EE, iPhone 11, iPhone 11 Pro]|true |
+----------+-------------------------------------+-----+
Method-2: using Map (map<string,array<string>>):
from pyspark.sql.functions import (arrays_overlap, array, lit, col, create_map,
                                   monotonically_increasing_id, first, explode,
                                   array_join, collect_set, expr)

dict1 = df1.rdd.collectAsMap()
map1 = create_map([t for k, v in dict1.items() for t in [lit(k), array(*map(lit, v))]])
#Column<b"map(Casio, array(Casio G'zOne Ravine), Alcatel, array(3L), Acer, array(Acer Predator 8, liquid Z6 Plus), HTC, array(Play 4), Honor, array(Play 7), Apple, array(iPhone EE, iPhone 11 Pro, iPhone XS))">

df_new = df.withColumn('id', monotonically_increasing_id()) \
    .withColumn('brand', explode('brand')) \
    .withColumn('Match', arrays_overlap('type', map1[col('brand')])) \
    .groupby('id') \
    .agg(
        array_join(collect_set('brand'), ', ').alias('brand'),
        first('Type').alias('Type'),
        expr("sum(int(Match)) > 0 as Match")
    )
df_new.show(5,0)
+---+----------+-------------------------------------+-----+
|id |brand |Type |Match|
+---+----------+-------------------------------------+-----+
|0 |Apple |[iPhone EE, iPhone 11, iPhone 11 Pro]|true |
|1 |Acer |[Iconia Talk S, liquid Z6 Plus] |true |
|3 |Alcatel |[] |false|
|2 |Casio |[Casio G'zOne Brigade] |false|
|4 |HTC, Honor|[Play 4, Play 7] |true |
+---+----------+-------------------------------------+-----+
This might be useful.
>>> import pyspark.sql.functions as F
>>> df = spark.createDataFrame([
... ["Apple",['iPhone EE','iPhone 11', 'iPhone 11 Pro']],
... ["Acer",['Iconia Talk S','liquid Z6 Plus']],
... ["Casio",['Casio G\'zOne Brigade']],
... ["Alcatel",[]]
... ]).toDF("brand","type")
>>> df.show(df.count(), False)
+-------+-------------------------------------+
|brand |type |
+-------+-------------------------------------+
|Apple |[iPhone EE, iPhone 11, iPhone 11 Pro]|
|Acer |[Iconia Talk S, liquid Z6 Plus] |
|Casio |[Casio G'zOne Brigade] |
|Alcatel|[] |
+-------+-------------------------------------+
>>> file_df = sqlcontext.read.csv('/home/chai/brand.csv', header='true')
>>> file_df.show(file_df.count(), False)
+-------+-------------------+
|brand |types |
+-------+-------------------+
|Apple |iPhone EE |
|Apple |iPhone 11 Pro |
|Apple |iPhone XS |
|Acer |liquid Z6 Plus |
|Acer |Acer Predator 8 |
|Casio |Casio G'zOne Ravine|
|Alcatel|3L |
+-------+-------------------+
>>> file_df = file_df.groupBy('brand').agg(F.collect_list('types').alias('new'))
>>> file_df.show(file_df.count(), False)
+-------+-------------------------------------+
|brand |new |
+-------+-------------------------------------+
|Casio |[Casio G'zOne Ravine] |
|Alcatel|[3L] |
|Acer |[liquid Z6 Plus, Acer Predator 8] |
|Apple |[iPhone EE, iPhone 11 Pro, iPhone XS]|
+-------+-------------------------------------+
>>> def test(row_dict):
...     new_dict = dict()
...     # the flag should be True as soon as any of the row's types appears in the CSV list
...     new_dict['flag'] = 'False'
...     for i in row_dict.get('type'):
...         if i in row_dict.get('new'):
...             new_dict['flag'] = 'True'
...             break
...     new_dict['brand'] = row_dict.get('brand')
...     new_dict['type'] = row_dict.get('type')
...     new_dict['new'] = row_dict.get('new')
...     return new_dict
...
>>> def row_to_dict(row):
... return row.asDict(recursive=True)
>>> all = df.join(file_df, 'brand', 'left')   # join so each row carries both 'type' and 'new'
>>> rdd = all.rdd.map(row_to_dict)
>>> rdd1 = rdd.map(test)
>>> final_df = sqlcontext.createDataFrame(rdd1)
>>> final_df.show(final_df.count(), False)
+-------+-----+-------------------------------------+-------------------------------------+
|brand |flag |new |type |
+-------+-----+-------------------------------------+-------------------------------------+
|Apple |True |[iPhone EE, iPhone 11 Pro, iPhone XS]|[iPhone EE, iPhone 11, iPhone 11 Pro]|
|Acer |True |[liquid Z6 Plus, Acer Predator 8] |[Iconia Talk S, liquid Z6 Plus] |
|Casio |False|[Casio G'zOne Ravine] |[Casio G'zOne Brigade] |
|Alcatel|False|[3L] |[] |
+-------+-----+-------------------------------------+-------------------------------------+
Starting from the following spark data frame:
from io import StringIO
import pandas as pd
from pyspark.sql.functions import col
pd_df = pd.read_csv(StringIO("""device_id,read_date,id,count
device_A,2017-08-05,4041,3
device_A,2017-08-06,4041,3
device_A,2017-08-07,4041,4
device_A,2017-08-08,4041,3
device_A,2017-08-09,4041,3
device_A,2017-08-10,4041,1
device_A,2017-08-10,4045,2
device_A,2017-08-11,4045,3
device_A,2017-08-12,4045,3
device_A,2017-08-13,4045,3"""),infer_datetime_format=True, parse_dates=['read_date'])
df = spark.createDataFrame(pd_df).withColumn('read_date', col('read_date').cast('date'))
df.show()
Output:
+--------------+----------+----+-----+
|device_id | read_date| id|count|
+--------------+----------+----+-----+
| device_A|2017-08-05|4041| 3|
| device_A|2017-08-06|4041| 3|
| device_A|2017-08-07|4041| 4|
| device_A|2017-08-08|4041| 3|
| device_A|2017-08-09|4041| 3|
| device_A|2017-08-10|4041| 1|
| device_A|2017-08-10|4045| 2|
| device_A|2017-08-11|4045| 3|
| device_A|2017-08-12|4045| 3|
| device_A|2017-08-13|4045| 3|
+--------------+----------+----+-----+
I would like to find the most frequent id for each (device_id, read_date) combination, over a 3 day rolling window. For each group of rows selected by the time window, I need to find the most frequent id by summing up the counts per id, then return the top id.
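For example, on 2017-08-11 the 3-day window covers 2017-08-09 through 2017-08-11, where id 4041 sums to 3 + 1 = 4 and id 4045 sums to 2 + 3 = 5, so 4045 should be returned for that date.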
Expected Output:
+--------------+----------+----+
|device_id | read_date| id|
+--------------+----------+----+
| device_A|2017-08-05|4041|
| device_A|2017-08-06|4041|
| device_A|2017-08-07|4041|
| device_A|2017-08-08|4041|
| device_A|2017-08-09|4041|
| device_A|2017-08-10|4041|
| device_A|2017-08-11|4045|
| device_A|2017-08-12|4045|
| device_A|2017-08-13|4045|
+--------------+----------+----+
I am starting to think this is only possible using a custom aggregation function. Since Spark 2.3 is not out yet, I would have to write this in Scala or use collect_list. Am I missing something?
Add window:
from pyspark.sql.functions import window, sum as sum_, date_add
df_w = df.withColumn(
"read_date", window("read_date", "3 days", "1 day")["start"].cast("date")
)
# Then handle the counts
df_w = df_w.groupBy('device_id', 'read_date', 'id').agg(sum_('count').alias('count'))
Then use one of the solutions from Find maximum row per group in Spark DataFrame, for example:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col
rolling_window = 3
top_df = (
df_w
.withColumn(
"rn",
row_number().over(
Window.partitionBy("device_id", "read_date")
.orderBy(col("count").desc())
)
)
.where(col("rn") == 1)
.orderBy("read_date")
.drop("rn")
)
# results are calculated on the start of the time window - adjust read_date as needed
final_df = top_df.withColumn('read_date', date_add('read_date', rolling_window - 1))
final_df.show()
# +---------+----------+----+-----+
# |device_id| read_date| id|count|
# +---------+----------+----+-----+
# | device_A|2017-08-05|4041| 3|
# | device_A|2017-08-06|4041| 6|
# | device_A|2017-08-07|4041| 10|
# | device_A|2017-08-08|4041| 10|
# | device_A|2017-08-09|4041| 10|
# | device_A|2017-08-10|4041| 7|
# | device_A|2017-08-11|4045| 5|
# | device_A|2017-08-12|4045| 8|
# | device_A|2017-08-13|4045| 9|
# | device_A|2017-08-14|4045| 6|
# | device_A|2017-08-15|4045| 3|
# +---------+----------+----+-----+
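Note that the last two rows (2017-08-14 and 2017-08-15) come from windows that start inside the data range but extend past the last available date, so their sums only cover part of a 3-day span; filter on the input's date range if you only want dates that actually appear in the data.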
I managed to find a very inefficient solution. Hopefully someone can spot improvements that avoid the Python UDF and the call to collect_list.
from collections import Counter

from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, first, udf
from pyspark.sql.types import IntegerType

def top_id(ids, counts):
    # sum the counts per id over the window and return the most frequent id
    c = Counter()
    for cnid, count in zip(ids, counts):
        c[cnid] += count
    return c.most_common(1)[0][0]
rolling_window = 3
days = lambda i: i * 86400
# Define a rolling calculation window based on time
window = (
    Window()
    .partitionBy("device_id")
    .orderBy(col("read_date").cast("timestamp").cast("long"))
    .rangeBetween(-days(rolling_window - 1), 0)
)

# Use window and collect_list to store data matching the window definition on each row
df_collected = df.select(
    'device_id', 'read_date',
    collect_list(col('id')).over(window).alias('ids'),
    collect_list(col('count')).over(window).alias('counts')
)

# Get rid of duplicate rows where necessary
df_grouped = df_collected.groupBy('device_id', 'read_date').agg(
    first('ids').alias('ids'),
    first('counts').alias('counts'),
)
# Register and apply udf to return the most frequently seen id
top_id_udf = udf(top_id, IntegerType())
df_mapped = df_grouped.withColumn('top_id', top_id_udf(col('ids'), col('counts')))
df_mapped.show(truncate=False)
returns:
+---------+----------+------------------------+------------+------+
|device_id|read_date |ids |counts |top_id|
+---------+----------+------------------------+------------+------+
|device_A |2017-08-05|[4041] |[3] |4041 |
|device_A |2017-08-06|[4041, 4041] |[3, 3] |4041 |
|device_A |2017-08-07|[4041, 4041, 4041] |[3, 3, 4] |4041 |
|device_A |2017-08-08|[4041, 4041, 4041] |[3, 4, 3] |4041 |
|device_A |2017-08-09|[4041, 4041, 4041] |[4, 3, 3] |4041 |
|device_A |2017-08-10|[4041, 4041, 4041, 4045]|[3, 3, 1, 2]|4041 |
|device_A |2017-08-11|[4041, 4041, 4045, 4045]|[3, 1, 2, 3]|4045 |
|device_A |2017-08-12|[4041, 4045, 4045, 4045]|[1, 2, 3, 3]|4045 |
|device_A |2017-08-13|[4045, 4045, 4045] |[3, 3, 3] |4045 |
+---------+----------+------------------------+------------+------+
I have a DataFrame in PySpark which contains empty strings, null, and NaN.
I want to remove rows which have any of those. I tried the commands below, but nothing seems to work.
myDF.na.drop().show()
myDF.na.drop(how='any').show()
Below is the dataframe:
+---+----------+----------+-----+-----+
|age| category| date|empId| name|
+---+----------+----------+-----+-----+
| 25|electronic|17-01-2018| 101| abc|
| 24| sports|16-01-2018| 102| def|
| 23|electronic|17-01-2018| 103| hhh|
| 23|electronic|16-01-2018| 104| yyy|
| 29| men|12-01-2018| 105| ajay|
| 31| kids|17-01-2018| 106|vijay|
| | Men| nan| 107|Sumit|
+---+----------+----------+-----+-----+
What am I missing? What is the best way to tackle NULL, NaN, or empty strings so that there is no problem in the actual calculation?
NaN (not a number) has a different meaning than NULL, and an empty string is just a normal value (it can be converted to NULL automatically by the CSV reader), so na.drop won't match these.
You can convert all of them to null and then drop:
from pyspark.sql.functions import col, isnan, when, trim
df = spark.createDataFrame([
("", 1, 2.0), ("foo", None, 3.0), ("bar", 1, float("NaN")),
("good", 42, 42.0)])
def to_null(c):
return when(~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == "")), col(c))
df.select([to_null(c).alias(c) for c in df.columns]).na.drop().show()
# +----+---+----+
# | _1| _2| _3|
# +----+---+----+
# |good| 42|42.0|
# +----+---+----+
Maybe in your case it is not important, but this code (a modified version of Alper t. Turker's answer) can handle the different data types accordingly. The data types can vary depending on your DataFrame, of course. (Tested on Spark 2.4.)
from pyspark.sql.functions import col, isnan, when, trim

# Find out the dataType and act accordingly
def to_null_bool(c, dt):
    if dt == "double":
        return ~(c.isNull() | isnan(c))
    elif dt == "string":
        return ~c.isNull() & (trim(c) != "")
    else:
        return ~c.isNull()

# Keep the value only when it is not null/NaN/empty, i.e. convert "empty" values to null
def to_null(c, dt):
    c = col(c)
    return when(to_null_bool(c, dt), c)

df.select([to_null(c, dt[1]).alias(c) for c, dt in zip(df.columns, df.dtypes)]).na.drop(how="any").show()
I have a DataFrame like below.
+---+------------------------------------------+
|id |features |
+---+------------------------------------------+
|1 |[6.629056, 0.26771536, 0.79063195,0.8923] |
|2 |[1.4850719, 0.66458416, -2.1034079] |
|3 |[3.0975454, 1.571849, 1.9053307] |
|4 |[2.526619, -0.33559006, -1.4565022] |
|5 |[-0.9286196, -0.57326394, 4.481531] |
|6 |[3.594114, 1.3512149, 1.6967168] |
+---+------------------------------------------+
I want to set some of the features values based on a where condition, i.e. where id=1, id=2, or id=6:
Where id=1, the current features value is [6.629056, 0.26771536, 0.79063195, 0.8923], but I want to set it to [0,0,0,0].
Where id=2, the current features value is [1.4850719, 0.66458416, -2.1034079], but I want to set it to [0,0,0].
My final output will be:
+---+------------------------------------+
|id |features                            |
+---+------------------------------------+
|1  |[0, 0, 0, 0]                        |
|2  |[0, 0, 0]                           |
|3  |[3.0975454, 1.571849, 1.9053307]    |
|4  |[2.526619, -0.33559006, -1.4565022] |
|5  |[-0.9286196, -0.57326394, 4.481531] |
|6  |[0, 0, 0]                           |
+---+------------------------------------+
Shaido's answer is fine if you have a limited set of ids for which you also know the length of the corresponding features.
If that's not the case, it is cleaner to use a UDF, and the ids you want to convert can be loaded into a separate Seq:
In Scala
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

val arr = Seq(1, 2, 6)

val fillArray = udf { (id: Int, array: WrappedArray[Double]) =>
  if (arr.contains(id)) Seq.fill[Double](array.length)(0.0)
  else array
}

df.withColumn("new_features", fillArray($"id", $"features")).show(false)
In Python
from pyspark.sql import functions as f
from pyspark.sql.types import *
arr = [1,2,6]
def fillArray(id, features):
if(id in arr): return [0.0] * len(features)
else : return features
fill_array_udf = f.udf(fillArray, ArrayType( DoubleType() ) )
df.withColumn("new_features" , fill_array_udf( f.col("id"), f.col("features") ) ).show()
Output
+---+------------------------------------------+-----------------------------------+
|id |features |new_features |
+---+------------------------------------------+-----------------------------------+
|1 |[6.629056, 0.26771536, 0.79063195, 0.8923]|[0.0, 0.0, 0.0, 0.0] |
|2 |[1.4850719, 0.66458416, -2.1034079] |[0.0, 0.0, 0.0] |
|3 |[3.0975454, 1.571849, 1.9053307] |[3.0975454, 1.571849, 1.9053307] |
|4 |[2.526619, -0.33559006, -1.4565022] |[2.526619, -0.33559006, -1.4565022]|
|5 |[-0.9286196, -0.57326394, 4.481531] |[-0.9286196, -0.57326394, 4.481531]|
|6 |[3.594114, 1.3512149, 1.6967168] |[0.0, 0.0, 0.0] |
+---+------------------------------------------+-----------------------------------+
Use when and otherwise if you have a small set of ids to change:
from pyspark.sql.functions import when, array, lit

df.withColumn("features",
    when(df.id == 1, array(lit(0), lit(0), lit(0), lit(0)))
    .when((df.id == 2) | (df.id == 6), array(lit(0), lit(0), lit(0)))
    .otherwise(df.features))
It should be faster than a UDF, but if there are many ids to change it quickly becomes a lot of code. In that case, use a UDF as in philantrovert's answer.
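If you are on Spark 2.4+ and the set of ids is data-driven, a middle-ground sketch (not from either answer; the id list here is just an assumed example) is the transform higher-order function, which zeroes out the array without hard-coding its length or falling back to a Python UDF:

from pyspark.sql import functions as F

ids_to_reset = [1, 2, 6]  # assumed list of ids whose features should be zeroed

df.withColumn(
    "new_features",
    F.when(
        F.col("id").isin(ids_to_reset),
        F.expr("transform(features, x -> CAST(0.0 AS double))")
    ).otherwise(F.col("features"))
).show(truncate=False)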
I have a Python dictionary:
dic = {
(u'aaa',u'bbb',u'ccc'):((0.3, 1.2, 1.3, 1.5), 1.4, 1),
(u'kkk',u'ggg',u'ccc',u'sss'):((0.6, 1.2, 1.7, 1.5), 1.4, 2)
}
I'd like to convert this dictionary to a Spark DataFrame with columns:
['key', 'val_1', 'val_2', 'val_3', 'val_4', 'val_5', 'val_6']
Example row (1):
key                  | val_1 | val_2 | val_3 | val_4 | val_5 | val_6
u'aaa',u'bbb',u'ccc' | 0.3   | 1.2   | 1.3   | 1.5   | 1.4   | 1
Thank you in advance
Extract items, cast key to list and combine everything into a single tuple:
df = sc.parallelize([
    (list(k),) + v[0] + v[1:]
    for k, v in dic.items()
]).toDF(['key', 'val_1', 'val_2', 'val_3', 'val_4', 'val_5', 'val_6'])
df.show()
## +--------------------+-----+-----+-----+-----+-----+-----+
## | key|val_1|val_2|val_3|val_4|val_5|val_6|
## +--------------------+-----+-----+-----+-----+-----+-----+
## | [aaa, bbb, ccc]| 0.3| 1.2| 1.3| 1.5| 1.4| 1|
## |[kkk, ggg, ccc, sss]| 0.6| 1.2| 1.7| 1.5| 1.4| 2|
## +--------------------+-----+-----+-----+-----+-----+-----+
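Equivalently, if you would rather skip the RDD API, the same rows can be passed straight to spark.createDataFrame (a small sketch assuming an active SparkSession named spark):

rows = [(list(k),) + v[0] + v[1:] for k, v in dic.items()]
df = spark.createDataFrame(rows, ['key', 'val_1', 'val_2', 'val_3', 'val_4', 'val_5', 'val_6'])
df.show()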