v=: ((1 2);(3 4);(0 5);<(2 1))
d =: (1,0.5,1,0.25);(0.5,1,0.75,0.25);(1,0.75,1,0);(0.75,0.25,0,1)
force=:(v ((0{>"0 v);])#{~ ] i.4) ,"1 0 <"0>(0{d)
force=:(v ((1{>"0 v);])#{~ ] i.4) ,"1 0 <"0>(1{d)
force=:(v ((2{>"0 v);])#{~ ] i.4) ,"1 0 <"0>(2{d)
force=:(v ((3{>"0 v);])#{~ ] i.4) ,"1 0 <"0>(3{d)
force=:(v ((4{>"0 v);])#{~ ] i.4) ,"1 0 <"0>(4{d)
force=:(v ((y{>"0 v);])#{~ ] i.4) ,"1 0 <"0>(y{d)
Given v and d,
the 1st column of force gives us the (n+1)th vector from v,
the 2nd column of force gives us each vector from v,
the 3rd column of force gives us a constant between the 2 vectors.
That is, (1 2);(1 2) will have 1 in the 3rd column of force, but (1 2);(3 4) might not.
I want to write a monadic function which gives us
force=:(v ((1{>"0 v);])#{~ ] i.4) ,"1 0 <"0>(1{d)
if we type force_constant 1
or force=:(v ((2{>"0 v);])#{~ ] i.4) ,"1 0 <"0>(2{d)
if we type force_constant 2
Could someone help?
You wrote most of it yourself already. Just take your final version of force, the one you wrote using y, and wrap it in an explicit definition:
v =: ((1 2);(3 4);(0 5);<(2 1))
d =: (1,0.5,1,0.25);(0.5,1,0.75,0.25);(1,0.75,1,0);(0.75,0.25,0,1)
force_constant =: monad def '(v ((y{>"0 v);])#{~ ] i.4) ,"1 0 <"0>(y{d)'
force_constant 1
+---+---+----+
|3 4|1 2|0.5 |
+---+---+----+
|3 4|3 4|1 |
+---+---+----+
|3 4|0 5|0.75|
+---+---+----+
|3 4|2 1|0.25|
+---+---+----+
force_constant 2
+---+---+----+
|0 5|1 2|1 |
+---+---+----+
|0 5|3 4|0.75|
+---+---+----+
|0 5|0 5|1 |
+---+---+----+
|0 5|2 1|0 |
+---+---+----+
Now, this formulation depends on the nouns v and d being globally defined. You might consider changing that so force_constant or related verbs take these arrays as inputs. The simplest method would be to change the monads to dyads, and let v and d come in as the left argument, x¹.
¹But we can keep it simple for now. If you want more feedback on your code, feel free to post it over on http://codereview.stackexchange.com/.
I need to apply 15 regular expressions to a Spark DataFrame.
I will add a version with a small df and 3 regexps here:
from pyspark.sql import Row
from pyspark.sql.functions import col, lit, when

df = spark.createDataFrame([
    Row(a=1, val1="aaa_wwwwwww"),
    Row(a=2, val1="bwq_323"),
    Row(a=3, val1="haha_kdjk_ska")
])
reg_exps = [
    {"reg_val": "^aaa_[a-z]{5,12}$", "replace_with": "a"},
    {"reg_val": "^bwq_[0-9]{2,4}$", "replace_with": "b"},
    {"reg_val": "^haha_[0-9a-z_]{5,12}$", "replace_with": "c"},
]
for reg_exp in reg_exps:
    df = df.withColumn(
        "val1",
        when(
            col("val1").rlike(reg_exp["reg_val"]),
            lit(reg_exp["replace_with"])
        ).otherwise(col("val1"))
    )
df.show(truncate=False)
It should return the following dataframe:
+---+----+
|a |val1|
+---+----+
|1 |a |
|2 |b |
|3 |c |
+---+----+
The code works as expected, but it's really slow. Is there any way of speeding it up?
Attempt 1
From what I can see, you can use just one regexp_extract, without a loop.
For a, b, c:
df = df.withColumn("val1", regexp_extract("val1", r"^([a-c])_[\da-z]{5,12}$", 1))
For any letter that is in that position:
df = df.withColumn("val1", regexp_extract("val1", r"^([a-z])_[\da-z]{5,12}$", 1))
Attempt 2
Since you said that in your real case you cannot merge your regexes, there is still one thing you can simplify. Instead of several .withColumn calls, you can do just one. You would need to combine your .when() conditions into one chain: F.when().when().when()....otherwise(). This can be done using reduce. In this form, I think, values which already got a regex match would not go through several additional regex checks.
from pyspark.sql import functions as F
from functools import reduce
whens = reduce(
    lambda acc, x: acc.when(F.col("val1").rlike(x["reg_val"]), x["replace_with"]),
    reg_exps,
    F  # start from the functions module: the first step calls F.when(...), later steps chain .when(...) on the resulting Column
).otherwise(F.col("val1"))
df = df.withColumn("val1", whens)
Hi, I am kind of new to Spark and I am not sure how to approach this.
I have 2 tables (way smaller for easier explanation):
A: weather data
B: travel data
I need to join these tables by finding the closest station, on the same date, for when the trip started, and do the same for when the trip ended. At the end I should have all the weather data from the closest station for the time the trip started and for when it finished, with just one row per trip.
I have done something similar with GeoPandas and a UDF, but it was way easier because I was looking for an intersection, like this:
def find_state_gps(lat, long):
    df = gdf_states.apply(lambda x: x["NAME"] if x["geometry"].intersects(Point(long, lat)) else None, axis=1)
    idx = df.first_valid_index()
    value = df.loc[idx] if idx is not None else "Not in USA territory"
    return value

state_gps = udf(find_state_gps, StringType())
I am not sure how to handle the logic this time.
I also tried doing this query, with no luck.
query = "SELECT STATION,\
NAME,\
LATITUDE,\
LONGITUDE,\
AWND,\
p.id_trip,\
p.Latitude,\
p.Longitude,\
p.startDate,\
Abs(p.latitude-LATITUDE)**2 + Abs(p.Longitude-LONGITUDE)**2\
AS dd\
FROM df2\
CROSS JOIN (\
SELECT id AS id_trip,\
station_id,\
Latitude,\
Longitude,\
startDate\
FROM df1\
) AS p ON 1=1\
ORDER BY dd"
and got the following error:
ParseException:
mismatched input '2' expecting {, ';'}(line 1, pos 189)
At the end I want something like this, without repeated trips:
id | started_date | finish_date | finished | weather_station_start | weather_station_end | more columns about weather for starting and ending trip locations
1  | bim          | baz         | bim      | baz                   | bim                 | bim
2  | bim          | baz         | bim      | baz                   | bim                 | bim
I really appreciate your help guys.
I changed your sample data a bit because all stations have the same coordinates:
travel_data = spark.createDataFrame(
[
('0','2013-06-01','00:00:01','-73.98915076','40.7423543','40.74317449','-74.00366443','2013-06-01')
,('1','2013-06-01','00:00:08','-73.98915076','40.7423543','40.74317449','-74.00366443','2013-06-01')
,('2','2013-06-01','00:00:44','-73.99595065','40.69512845','40.69512845','-73.99595065','2013-06-01')
,('3','2013-06-01','00:01:04','-73.98758561','40.73524276','40.6917823','-73.9737299','2013-06-01')
,('4','2013-06-01','00:01:22','-74.01677685','40.70569254','40.68926942','-73.98912867','2013-06-01')
], ['id','startDate','startTime','Longitude','Latitude','end station latitude','end station longitude','stopdate']
)
weather_data = spark.createDataFrame(
[
('USINYWC0003','WHITE PLAINS 3.1 NNW 3, NY US','41.0639','-73.7722','71','2013-06-01','','','','','')
,('USINYWC0002','WHITE PLAINS 3.1 NNW 2, NY US','41.0638','-73.7723','71','2013-06-02','','','','','')
,('USINYWC0001','WHITE PLAINS 3.1 NNW 1, NY US','41.0635','-73.7724','71','2013-06-03','','','','','')
], ['STATION','NAME','LATITUDE','LONGITUDE','ELEVATION','DATE','AWND','AWND ATTRIBUTES','DAPR','DAPR ATTRIBUTES','DASE']
)
+---+----------+---------+------------+-----------+--------------------+---------------------+----------+
| id| startDate|startTime| Longitude| Latitude|end station latitude|end station longitude| stopdate|
+---+----------+---------+------------+-----------+--------------------+---------------------+----------+
| 0|2013-06-01| 00:00:01|-73.98915076| 40.7423543| 40.74317449| -74.00366443|2013-06-01|
| 1|2013-06-01| 00:00:08|-73.98915076| 40.7423543| 40.74317449| -74.00366443|2013-06-01|
| 2|2013-06-01| 00:00:44|-73.99595065|40.69512845| 40.69512845| -73.99595065|2013-06-01|
| 3|2013-06-01| 00:01:04|-73.98758561|40.73524276| 40.6917823| -73.9737299|2013-06-01|
| 4|2013-06-01| 00:01:22|-74.01677685|40.70569254| 40.68926942| -73.98912867|2013-06-01|
+---+----------+---------+------------+-----------+--------------------+---------------------+----------+
+-----------+--------------------+--------+---------+---------+----------+----+---------------+----+---------------+----+
| STATION| NAME|LATITUDE|LONGITUDE|ELEVATION| DATE|AWND|AWND ATTRIBUTES|DAPR|DAPR ATTRIBUTES|DASE|
+-----------+--------------------+--------+---------+---------+----------+----+---------------+----+---------------+----+
|USINYWC0003|WHITE PLAINS 3.1 ...| 41.0639| -73.7722| 71|2013-06-01| | | | | |
|USINYWC0002|WHITE PLAINS 3.1 ...| 41.0638| -73.7723| 71|2013-06-02| | | | | |
|USINYWC0001|WHITE PLAINS 3.1 ...| 41.0635| -73.7724| 71|2013-06-03| | | | | |
+-----------+--------------------+--------+---------+---------+----------+----+---------------+----+---------------+----+
Then, crossjoin the two dataframes in order to calculate the haversine distance between the start/end points and all the stations. A crossjoin is not the best solution, but depending on the size of your data it might be the easiest way.
from pyspark.sql.types import *
from pyspark.sql.functions import col, radians, asin, sin, sqrt, cos, max, min
from pyspark.sql import Window as W
join_df = travel_data\
.crossJoin(weather_data.select('NAME',col('LATITUDE').alias('st_LAT'),col('LONGITUDE').alias('st_LON'), 'AWND')) \
.withColumn("dlon_start", radians(col("st_LON")) - radians(col("Longitude"))) \
.withColumn("dlat_start", radians(col("st_LAT")) - radians(col("Latitude"))) \
.withColumn("haversine_dist_start", asin(sqrt(
sin(col("dlat_start") / 2) ** 2 + cos(radians(col("Latitude")))
* cos(radians(col("st_LAT"))) * sin(col("dlon_start") / 2) ** 2
)
) * 2 * 3963 * 5280)\
.withColumn("dlon_end", radians(col("st_LON")) - radians(col("end station longitude"))) \
.withColumn("dlat_end", radians(col("st_LAT")) - radians(col("end station latitude"))) \
.withColumn("haversine_dist_end", asin(sqrt(
sin(col("dlat_end") / 2) ** 2 + cos(radians(col("Latitude")))
* cos(radians(col("st_LAT"))) * sin(col("dlon_end") / 2) ** 2
)
) * 2 * 3963 * 5280)\
.drop('dlon_start','dlat_start','dlon_end','dlat_end')
Finally, use window functions to pick the closest station to the start point (result1) and the closest station to the end point (result2):
W = W.partitionBy("id")
result1 = join_df\
.withColumn("min_dist_start", min('haversine_dist_start').over(W))\
.filter(col("min_dist_start") == col('haversine_dist_start'))\
.select('id',col('startDate').alias('started_date'),col('stopdate').alias('finish_date'),col('NAME').alias('weather_station_start'),col('Latitude').alias('Latitude_start'),col('Longitude').alias('Longitude_start'))
result2 = join_df\
.withColumn("min_dist_end", min('haversine_dist_end').over(W))\
.filter(col("min_dist_end") == col('haversine_dist_end'))\
.select('id', col('NAME').alias('weather_station_end'))
final = result1.join(result2, 'id', 'left')
final.show()
Not sure which columns you want in the output, but I hope this gives you some insights.
output:
+---+------------+-----------+-----------------------------+--------------+---------------+-----------------------------+
|id |started_date|finish_date|weather_station_start |Latitude_start|Longitude_start|weather_station_end |
+---+------------+-----------+-----------------------------+--------------+---------------+-----------------------------+
|0 |2013-06-01 |2013-06-01 |WHITE PLAINS 3.1 NNW 1, NY US|40.7423543 |-73.98915076 |WHITE PLAINS 3.1 NNW 1, NY US|
|1 |2013-06-01 |2013-06-01 |WHITE PLAINS 3.1 NNW 1, NY US|40.7423543 |-73.98915076 |WHITE PLAINS 3.1 NNW 1, NY US|
|2 |2013-06-01 |2013-06-01 |WHITE PLAINS 3.1 NNW 1, NY US|40.69512845 |-73.99595065 |WHITE PLAINS 3.1 NNW 1, NY US|
|3 |2013-06-01 |2013-06-01 |WHITE PLAINS 3.1 NNW 1, NY US|40.73524276 |-73.98758561 |WHITE PLAINS 3.1 NNW 1, NY US|
|4 |2013-06-01 |2013-06-01 |WHITE PLAINS 3.1 NNW 1, NY US|40.70569254 |-74.01677685 |WHITE PLAINS 3.1 NNW 1, NY US|
+---+------------+-----------+-----------------------------+--------------+---------------+-----------------------------+
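One thing the sketch above does not enforce is the question's requirement that the weather record come from the same date as the trip. A hedged variant (assuming weather_data.DATE should simply equal the trip's startDate) is to replace the plain crossJoin with a join on the date and then compute the haversine columns exactly as before:
from pyspark.sql.functions import col

# Hedged variant: only pair each trip with weather rows recorded on the trip's start date.
join_df = travel_data.join(
    weather_data.select(
        'NAME', 'DATE', 'AWND',
        col('LATITUDE').alias('st_LAT'),
        col('LONGITUDE').alias('st_LON'),
    ),
    on=col('startDate') == col('DATE'),
    how='inner',
)
# ...then add the haversine_dist_start / haversine_dist_end columns as in the snippet above.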
Let A=[[3,2],[1,-3]] and B=[[3],[-10]],
and solve the equation AX=B using torch.solve:
X, LU = torch.solve(B,A)
Then I got X=[[-1],[3]] and LU=[[3,2],[0.333,-3.666]].
According to the definition of LU decomposition, LU should be the same as A; however, they aren't the same.
Can anyone explain this?
Thank you
The representation you got is a compact way of representing the lower triangular matrix L and the upper triangular matrix U. You can use torch.tril and torch.triu to get these matrices explicitly:
L = torch.tril(LU, -1) + torch.eye(LU.shape[-1])
U = torch.triu(LU)
verify:
In [*]: L
Out[*]:
tensor([[1.0000, 0.0000],
[0.3333, 1.0000]])
In [*]: U
Out[*]:
tensor([[ 3.0000, 2.0000],
[ 0.0000, -3.6667]])
And the product is indeed equal to A:
In [*]: torch.dist(L @ U, A)
Out[*]: tensor(0.)
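Putting it all together, a minimal end-to-end sketch (assuming an older PyTorch release where torch.solve is still available; newer releases deprecate it in favour of torch.linalg.solve, which returns only the solution):
import torch

A = torch.tensor([[3., 2.], [1., -3.]])
B = torch.tensor([[3.], [-10.]])

X, LU = torch.solve(B, A)    # X solves AX = B; LU holds the factorization in compact form

# Unpack the compact form: strict lower triangle plus a unit diagonal gives L,
# the upper triangle (including the diagonal) gives U.
L = torch.tril(LU, -1) + torch.eye(LU.shape[-1])
U = torch.triu(LU)

print(X)                     # tensor([[-1.], [ 3.]])
print(torch.dist(L @ U, A))  # tensor(0.) -- L @ U reproduces A (no row pivoting was needed here)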
I have a DataFrame like this:
df = spark.createDataFrame([
    [["Apple"], ['iPhone EE', 'iPhone 11', 'iPhone 11 Pro']],
    [["Acer"], ['Iconia Talk S', 'liquid Z6 Plus']],
    [["Casio"], ['Casio G\'zOne Brigade']],
    [["Alcatel"], []],
    [["HTC", "Honor"], ["Play 4", "Play 7"]]
]).toDF("brand", "type")
And a CSV like this:
Apple;iPhone EE
Apple;iPhone 11 Pro
Apple;iPhone XS
Acer;liquid Z6 Plus
Acer;Acer Predator 8
Casio;Casio G'zOne Ravine
Alcatel;3L
HTC;Play 4
Honor;Play 7
I need to create a new boolean column match.
If the combination of brand and type matches one of the rows from the CSV it's True otherwise False.
Expected output:
Brand      | Type                                  | Match
-------------------------------------------------------------
Apple      | [iPhone EE, iPhone 11, iPhone 11 Pro] | True
Acer       | [Iconia Talk S, liquid Z6 Plus]       | True
Casio      | [Casio G'zOne Brigade]                | False
Alcatel    | []                                    | False
HTC, Honor | [Play 4, Play 7]                      | True
Update
brand is also of type array<string>
The CSV file is just a start. It can be converted to a DataFrame or a dictionary (or whatever fits best).
How can I best accomplish this?
You can try size + array_intersect to set up this flag.
from pyspark.sql.functions import collect_set, size, array_intersect, broadcast, expr, flatten, collect_list, array_join
df_list = spark.read.csv("/path/to/csv_list", sep=';').toDF('brand_name','type')
df1 = df_list.groupby('brand_name').agg(collect_set('type').alias('types'))
df_new = df.join(broadcast(df1), expr("array_contains(brand, brand_name)"), "left") \
.groupby('brand', 'Type') \
.agg(flatten(collect_list('types')).alias('types')) \
.select(array_join('brand', ', ').alias('brand'), 'Type', (size(array_intersect('type', 'types'))>0).alias("Match"))
df_new.show(5,0)
+----------+-------------------------------------+-----+
|brand |Type |Match|
+----------+-------------------------------------+-----+
|Alcatel |[] |false|
|HTC, Honor|[Play 4, Play 7] |true |
|Casio |[Casio G'zOne Brigade] |false|
|Acer |[Iconia Talk S, liquid Z6 Plus] |true |
|Apple |[iPhone EE, iPhone 11, iPhone 11 Pro]|true |
+----------+-------------------------------------+-----+
Method-2: using Map (map<string,array<string>>):
from pyspark.sql.functions import arrays_overlap, array, lit, col, create_map, monotonically_increasing_id, first, explode, array_join
dict1 = df1.rdd.collectAsMap()
map1 = create_map([ t for k,v in dict1.items() for t in [lit(k), array(*map(lit,v))] ])
#Column<b"map(Casio, array(Casio G'zOne Ravine), Alcatel, array(3L), Acer, array(Acer Predator 8, liquid Z6 Plus), HTC, array(Play 4), Honor, array(Play 7), Apple, array(iPhone EE, iPhone 11 Pro, iPhone XS))">
df_new = df.withColumn('id', monotonically_increasing_id()) \
.withColumn('brand', explode('brand')) \
.withColumn('Match', arrays_overlap('type', map1[col('brand')])) \
.groupby('id') \
.agg(
array_join(collect_set('brand'),', ').alias('brand'),
first('Type').alias('Type'),
expr("sum(int(Match)) > 0 as Match")
)
df_new.show(5,0)
+---+----------+-------------------------------------+-----+
|id |brand |Type |Match|
+---+----------+-------------------------------------+-----+
|0 |Apple |[iPhone EE, iPhone 11, iPhone 11 Pro]|true |
|1 |Acer |[Iconia Talk S, liquid Z6 Plus] |true |
|3 |Alcatel |[] |false|
|2 |Casio |[Casio G'zOne Brigade] |false|
|4 |HTC, Honor|[Play 4, Play 7] |true |
+---+----------+-------------------------------------+-----+
This might be useful.
>>> import pyspark.sql.functions as F
>>> df = spark.createDataFrame([
... ["Apple",['iPhone EE','iPhone 11', 'iPhone 11 Pro']],
... ["Acer",['Iconia Talk S','liquid Z6 Plus']],
... ["Casio",['Casio G\'zOne Brigade']],
... ["Alcatel",[]]
... ]).toDF("brand","type")
>>> df.show(df.count(), False)
+-------+-------------------------------------+
|brand |type |
+-------+-------------------------------------+
|Apple |[iPhone EE, iPhone 11, iPhone 11 Pro]|
|Acer |[Iconia Talk S, liquid Z6 Plus] |
|Casio |[Casio G'zOne Brigade] |
|Alcatel|[] |
+-------+-------------------------------------+
>>> file_df = sqlcontext.read.csv('/home/chai/brand.csv', header='true')
>>> file_df.show(file_df.count(), False)
+-------+-------------------+
|brand |types |
+-------+-------------------+
|Apple |iPhone EE |
|Apple |iPhone 11 Pro |
|Apple |iPhone XS |
|Acer |liquid Z6 Plus |
|Acer |Acer Predator 8 |
|Casio |Casio G'zOne Ravine|
|Alcatel|3L |
+-------+-------------------+
>>> file_df = file_df.groupBy('brand').agg(F.collect_list('types').alias('new'))
>>> file_df.show(file_df.count(), False)
+-------+-------------------------------------+
|brand |new |
+-------+-------------------------------------+
|Casio |[Casio G'zOne Ravine] |
|Alcatel|[3L] |
|Acer |[liquid Z6 Plus, Acer Predator 8] |
|Apple |[iPhone EE, iPhone 11 Pro, iPhone XS]|
+-------+-------------------------------------+
>>> def test(row_dict):
...     new_dict = dict()
...     for i in row_dict.get('type'):
...         if i in row_dict.get('new'):
...             new_dict['flag'] = 'True'
...         else:
...             new_dict['flag'] = 'False'
...     if len(row_dict.get('type')) == 0 and len(row_dict.get('new')) > 0:
...         new_dict['flag'] = 'False'
...     new_dict['brand'] = row_dict.get('brand')
...     new_dict['type'] = row_dict.get('type')
...     new_dict['new'] = row_dict.get('new')
...     return new_dict
...
>>> def row_to_dict(row):
...     return row.asDict(recursive=True)
>>> all = df.join(file_df, 'brand')  # assumed: the joined dataframe `all` is not shown in the original answer
>>> rdd = all.rdd.map(row_to_dict)
>>> rdd1 = rdd.map(test)
>>> final_df = sqlcontext.createDataFrame(rdd1)
>>> final_df.show(final_df.count(), False)
+-------+-----+-------------------------------------+-------------------------------------+
|brand |flag |new |type |
+-------+-----+-------------------------------------+-------------------------------------+
|Apple |True |[iPhone EE, iPhone 11 Pro, iPhone XS]|[iPhone EE, iPhone 11, iPhone 11 Pro]|
|Acer |True |[liquid Z6 Plus, Acer Predator 8] |[Iconia Talk S, liquid Z6 Plus] |
|Casio |False|[Casio G'zOne Ravine] |[Casio G'zOne Brigade] |
|Alcatel|False|[3L] |[] |
+-------+-----+-------------------------------------+-------------------------------------+
I have the following Spark dataframe:
id weekly_sale
1 40000
2 120000
3 135000
4 211000
5 215000
6 331000
7 337000
I need to see in which of the following intervals items in weekly_sale column fall:
under 100000
between 100000 and 200000
between 200000 and 300000
more than 300000
so my desired output would be like:
id weekly_sale label
1 40000 under 100000
2 120000 between 100000 and 200000
3 135000 between 100000 and 200000
4 211000 between 200000 and 300000
5 215000 between 200000 and 300000
6 331000 more than 300000
7 337000 more than 300000
Any PySpark, Spark SQL or Hive context implementation will help me.
Assuming ranges and labels are defined as follows:
splits = [float("-inf"), 100000.0, 200000.0, 300000.0, float("inf")]
labels = [
"under 100000", "between 100000 and 200000",
"between 200000 and 300000", "more than 300000"]
df = sc.parallelize([
(1, 40000.0), (2, 120000.0), (3, 135000.0),
(4, 211000.0), (5, 215000.0), (6, 331000.0),
(7, 337000.0)
]).toDF(["id", "weekly_sale"])
one possible approach is to use Bucketizer:
from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import array, col, lit
bucketizer = Bucketizer(
splits=splits, inputCol="weekly_sale", outputCol="split"
)
with_split = bucketizer.transform(df)
and attach labels later:
label_array = array(*(lit(label) for label in labels))
with_split.withColumn(
"label", label_array.getItem(col("split").cast("integer"))
).show(10, False)
## +---+-----------+-----+-------------------------+
## |id |weekly_sale|split|label |
## +---+-----------+-----+-------------------------+
## |1 |40000.0 |0.0 |under 100000 |
## |2 |120000.0 |1.0 |between 100000 and 200000|
## |3 |135000.0 |1.0 |between 100000 and 200000|
## |4 |211000.0 |2.0 |between 200000 and 300000|
## |5 |215000.0 |2.0 |between 200000 and 300000|
## |6 |331000.0 |3.0 |more than 300000 |
## |7 |337000.0 |3.0 |more than 300000 |
## +---+-----------+-----+-------------------------+
There are of course different ways you can achieve the same goal. For example you can create a lookup table:
from toolz import sliding_window
from pyspark.sql.functions import broadcast
mapping = [
(lower, upper, label) for ((lower, upper), label)
in zip(sliding_window(2, splits), labels)
]
lookup_df = sc.parallelize(mapping).toDF(["lower", "upper", "label"])
df.join(
broadcast(lookup_df),
(col("weekly_sale") >= col("lower")) & (col("weekly_sale") < col("upper"))
).drop("lower").drop("upper")
or generate a lookup expression:
from functools import reduce
from pyspark.sql.functions import when
def in_range(c):
def in_range_(acc, x):
lower, upper, label = x
return when(
(c >= lit(lower)) & (c < lit(upper)), lit(label)
).otherwise(acc)
return in_range_
label = reduce(in_range(col("weekly_sale")), mapping, lit(None))
df.withColumn("label", label)
The least efficient approach is a UDF.
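For illustration, here is a minimal sketch of that UDF approach (reusing the splits, labels and df defined above); every row is shipped to a Python worker for evaluation, which is why it is the slowest option:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def assign_label(sale):
    # Walk the (lower, upper) ranges derived from `splits` and return the matching label.
    for (lower, upper), label in zip(zip(splits, splits[1:]), labels):
        if lower <= sale < upper:
            return label
    return None

label_udf = udf(assign_label, StringType())
df.withColumn("label", label_udf(col("weekly_sale"))).show(10, False)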