I'd like to create a new column with values taken from a previous row. The rows are tagged by "ID" and are ordered by "Depth" in the original table. This is the custom expression that I think should do it, but I'm getting unexpected results where some rows aren't picking up a value from the previous row.
Max([Value]) over Intersect([ID],Previous([Depth]))
Here's an example table with my expected results and the actual result given by the expression above.
ID       Depth   Value  Expected  Actual
Object1      0    0.02
Object1   3033   68.87      0.02
Object1   3349   70.82     68.87
Object1   3538   70.65     70.82
Object1   3791   70.38     70.65   70.65
Object1   4044   69.31     70.38   70.38
Object1   4297   71.13     69.31   69.31
Object1   4549    70.9     71.13
Object1   4802   70.59      70.9
Object1   5055   71.56     70.59
Object1   5307   71.34     71.56   71.56
Object1   5560   71.32     71.34
Object1   6381    71.5     71.32
Object1   6444   71.62      71.5    71.5
Object1   6544   71.86     71.62   71.62
Object2      0    0.02
Object2    962  267.58      0.02    0.02
Object2   1024  276.67    267.58  267.58
Object2   1213  273.11    276.67  276.67
Object2   1529  275.56    273.11  273.11
Object2   1593  275.96    275.56  275.56
Object2   1656  275.15    275.96  275.96
Object2   2854  278.35    275.15  275.15
Object2   3107  276.45    278.35
Object2   3359  270.83    276.45
Object2   4370  272.89    270.83
Object2   4623  271.93    272.89
Object2   4877  269.93    271.93
Object2   5504  270.51    269.93
Object2   5538  270.38    270.51  270.51
Object2   5541  270.37    270.38  270.38
Object2   5688   269.8    270.37
It seems to me that the issue might have something to do with the ordering of the Depth column. As a workaround, I can get the expected result if I first calculate a rank column (call it "RankColumn"):
Rank([Depth],[ID])
Then I compute the OVER expression using the calculated rank:
Max([Value]) over Intersect([ID],Previous([RankColumn]))
First question: why didn't Spotfire recognize the order in the original Depth column?
Second question: if conducting the intermediate rank calculation is a necessary step, is there a more elegant way to write the expression (e.g. do it in one expression rather than creating an intermediate column)?
I'm trying to join the GHCN weather dataset and another dataset:
Weather Dataset: (called "weather" in the code)
station_id   date      PRCP  SNOW  SNWD  TMAX  TMIN  Latitude  Longitude  Elevation  State  date_final
CA001010235  19730707  null  null  0.0   0.0   null  48.4      -123.4833  17.0       BC     1973-07-07
CA001010235  19780337  14    8     0.0   0.0   null  48.4      -123.4833  17.0       BC     1978-03-30
CA001010595  19690607  null  null  0.0   0.0   null  48.5833   -123.5167  17.0       BC     1969-06-07
Species Dataset: (called "ebird" in the code; "ebird_id" is unique for each row)
speciesCode | comName | sciName | locId | locName | ObsDt | howMany | lat | lng | obsValid | obsReview | locationPrivate | subId | ebird_id
nswowl | Northern Saw-whet Owl | Aegolius acadicus | L787133 | Kumdis Slough | 2017-03-20 23:15 | 1 | 53.7392187 | -132.1612358 | TRUE | FALSE | TRUE | S35611913 | eff-178121-fff
wilsni1 | Wilson's Snipe | Gallinago delicata | L1166559 | Hornby Island--Ford Cove | 2017-03-20 21:44 | 1 | 49.4973435 | -124.6768427 | TRUE | FALSE | FALSE | S35323282 | abc-1920192-fff
cacgoo1 | Cackling Goose | Branta hutchinsii | L833055 | Central Saanich--ȾIKEL (Maber Flats) | 2017-03-20 19:24 | 5 | 48.5724686 | -123.4305167 | TRUE | FALSE | FALSE | S35322116 | yhj-9102910-fff
Expected result: I need to join these tables by finding the closest weather station, for each row in the species dataset, on the same date. In this example, ebird_id "eff-178121-fff" is closest to weather station "CA001010235" and the distance is around 20 km.
speciesCode | comName | sciName | locId | locName | ObsDt | howMany | lat | lng | obsValid | obsReview | locationPrivate | subId | ebird_id | station_id | date | PRCP | SNOW | SNWD | TMAX | TMIN | Latitude | Longitude | Elevation | State | date_final | distance(kms)
nswowl | Northern Saw-whet Owl | Aegolius acadicus | L787133 | Kumdis Slough | 2017-03-20 23:15 | 1 | 53.7392187 | -132.1612358 | TRUE | FALSE | TRUE | S35611913 | eff-178121-fff | CA001010235 | 20170320 | null | null | 0.0 | 0.0 | null | 48.4 | -123.4833 | 17.0 | BC | 2017-03-20 | 20
wilsni1 | Wilson's Snipe | Gallinago delicata | L1166559 | Hornby Island--Ford Cove | 2017-03-20 21:44 | 1 | 49.4973435 | -124.6768427 | TRUE | FALSE | FALSE | S35323282 | abc-1920192-fff | CA001010595 | 20170320 | null | null | 0.0 | 0.0 | null | 48.5833 | -123.5167 | 17.0 | BC | 2017-03-20 |
What I have tried so far: I referred to this link and it works for a sample of the datasets, but when I ran the code below on the entire weather and species datasets, the cross join worked while the partitionBy/window-function line took too long. I also tried replacing the partitionBy and window function with PySpark SQL queries in case that would be faster, but it still takes too long. Is there any optimized way to do this?
from pyspark.sql import Window
from pyspark.sql.functions import asin, col, cos, radians, sin, sqrt
from pyspark.sql.functions import min as min_  # avoid shadowing the builtin min

# Cross join every observation with every station, then compute the
# haversine distance in kilometres.
join_df = ebird.crossJoin(weather) \
    .withColumn("dist_longit", radians(weather["Longitude"]) - radians(ebird["lng"])) \
    .withColumn("dist_latit", radians(weather["Latitude"]) - radians(ebird["lat"]))
join_df = join_df.withColumn("haversine_distance_kms", asin(sqrt(
        sin(join_df["dist_latit"] / 2) ** 2 + cos(radians(join_df["lat"]))
        * cos(radians(join_df["Latitude"])) * sin(join_df["dist_longit"] / 2) ** 2
    )) * 2 * 6371).drop("dist_longit", "dist_latit")

# Keep only the row with the minimum distance per observation.
W = Window.partitionBy("ebird_id")
result = join_df.withColumn("min_dist", min_(join_df["haversine_distance_kms"]).over(W)) \
    .filter(col("min_dist") == col("haversine_distance_kms"))
result.show(1)
Edit:
Size of the datasets:
print(weather.count()) #output: 8211812
print(ebird.count()) #output: 1564574
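One common optimization, sketched below under stated assumptions rather than as a drop-in answer: avoid the full 1.5M x 8.2M cross join by joining on the date first, so each observation only meets that day's station reports, then take the per-observation minimum with an aggregate-and-join instead of a window over the huge joined table. Column names (ObsDt, date_final, ebird_id) are taken from the tables above; the date parsing is an assumption about their exact formats.

from pyspark.sql import functions as F

# Derive a plain date on each side (assumes ObsDt like '2017-03-20 23:15'
# and date_final like '1973-07-07').
ebird_d = ebird.withColumn("obs_date", F.to_date(F.col("ObsDt").substr(1, 10)))
weather_d = weather.withColumn("w_date", F.to_date("date_final"))

# Equi-join on the date: far fewer pairs than a cross join.
join_df = ebird_d.join(weather_d, ebird_d["obs_date"] == weather_d["w_date"])

# Haversine distance in kilometres.
join_df = join_df.withColumn(
    "haversine_distance_kms",
    2 * 6371 * F.asin(F.sqrt(
        F.sin((F.radians("Latitude") - F.radians("lat")) / 2) ** 2
        + F.cos(F.radians("lat")) * F.cos(F.radians("Latitude"))
        * F.sin((F.radians("Longitude") - F.radians("lng")) / 2) ** 2)))

# Aggregate the per-observation minimum, then join back on it; this often
# shuffles less than a window function over the full joined table.
min_dist = join_df.groupBy("ebird_id").agg(
    F.min("haversine_distance_kms").alias("min_dist"))
result = join_df.join(min_dist, "ebird_id") \
    .filter(F.col("haversine_distance_kms") == F.col("min_dist"))

Note this assumes every observation date has at least one station report on that same day; dates with no match drop out of the inner join.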
I have panel data and I am trying to run cronbach.alpha; however, the result is NA. Does anybody know why?
data = P_df
cronbach.alpha(p_df,na.rm = TRUE)
Cronbach's alpha for the 'p_df' data-set
Items: 169
Sample units: 1284
alpha: NA
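For reference, a sketch of the standard alpha formula in Python (I'm not certain it matches ltm's internals): if row-wise removal of missing values leaves too few complete cases, or the variances degenerate, the ratio is undefined and the result is NA. With 169 items over 1284 sample units, it's worth checking how many complete rows survive.

import pandas as pd

def cronbach_alpha(df: pd.DataFrame) -> float:
    """Cronbach's alpha on complete cases (listwise deletion)."""
    complete = df.dropna()                        # drop rows with any missing item
    k = complete.shape[1]                         # number of items
    item_vars = complete.var(axis=0, ddof=1)      # per-item variances
    total_var = complete.sum(axis=1).var(ddof=1)  # variance of the sum score
    # If `complete` is (nearly) empty, the variances are NaN and so is alpha.
    return k / (k - 1) * (1 - item_vars.sum() / total_var)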
The data comes in 3 columns after orderbook = pd.DataFrame(orderbook_data):
timestamp bids asks
UNIX timestamp [bidprice, bidvolume] [askprice, askvolume]
Each list has 100 values, and the timestamp is the same for all of them.
The problem is that I don't know how to access/index the values inside each row's [price, volume] list in each column.
I know that by running ---> bids = orderbook["bids"]
I get the list of 100 lists ---> [bidprice, bidvolume]
I'm looking to avoid a loop... there has to be a way to just plot the data.
I hope someone can understand my problem. I just want to plot price on x and volume on y. The goal is to make it live.
As you didn't present your input file, I prepared it on my own:
timestamp;bids
1579082401;[123.12, 300]
1579082461;[135.40, 220]
1579082736;[130.76, 20]
1579082801;[123.12, 180]
To read it I used:
orderbook = pd.read_csv('Input.csv', sep=';')
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')
Its content is:
timestamp bids
0 2020-01-15 10:00:01 [123.12, 300]
1 2020-01-15 10:01:01  [135.40, 220]
2 2020-01-15 10:05:36 [130.76, 20]
3 2020-01-15 10:06:41 [123.12, 180]
Now:
timestamp has been converted to the native pandas datetime type,
but bids is of object type (actually, a string),
and, as I suppose, the same will be true when read from your input file.
And now the main task. The first step is to extract both numbers from bids, convert them to float and int, and save them in respective columns:
# Pull price and volume out of the string with named capture groups,
# then append them as new columns.
orderbook = orderbook.join(orderbook.bids.str.extract(
    r'\[(?P<bidprice>\d+\.\d+), (?P<bidvolume>\d+)]'))
orderbook.bidprice = orderbook.bidprice.astype(float)
orderbook.bidvolume = orderbook.bidvolume.astype(int)
Now orderbook contains:
timestamp bids bidprice bidvolume
0 2020-01-15 10:00:01 [123.12, 300] 123.12 300
1 2020-01-15 10:01:01 [135.40, 220] 135.40 220
2 2020-01-15 10:05:36 [130.76, 20] 130.76 20
3 2020-01-15 10:06:41 [123.12, 180] 123.12 180
and you can generate e.g. a scatter plot, calling:
orderbook.plot.scatter('bidprice', 'bidvolume');
or another plotting function.
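And since the goal is to make it live: a minimal sketch using matplotlib's interactive mode, where fetch_orderbook() is a hypothetical stand-in for however you refresh the data:

import matplotlib.pyplot as plt

plt.ion()                      # interactive mode: draw without blocking
fig, ax = plt.subplots()

while True:
    ob = fetch_orderbook()     # hypothetical: returns a frame like `orderbook`
    ax.clear()
    ax.scatter(ob.bidprice, ob.bidvolume)
    ax.set_xlabel('bidprice')
    ax.set_ylabel('bidvolume')
    plt.pause(1.0)             # redraw, then wait a second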
Another possibility
Or maybe your orderbook_data is a dictionary? Something like:
orderbook_data = {
'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
'bids': [[123.12, 300], [135.40, 220], [130.76, 20], [123.12, 180]] }
In this case, when you create a DataFrame from it, the column types
are initially:
timestamp - int64,
bids - also object, but this time each cell contains a plain
Python list.
Then you can also convert timestamp column to datetime just like
above.
But to split bids (a column of lists) into 2 separate columns,
you should run:
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(
    orderbook.bids.tolist(), index=orderbook.index)
Then you have 2 new columns with the respective components of the
source column and you can create your graphics just like above.
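Putting the dictionary variant together, a minimal end-to-end sketch (using the sample values assumed above):

import pandas as pd

orderbook_data = {
    'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
    'bids': [[123.12, 300], [135.40, 220], [130.76, 20], [123.12, 180]]}

orderbook = pd.DataFrame(orderbook_data)
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')

# Split the column of lists into two numeric columns, keeping the index aligned.
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(
    orderbook.bids.tolist(), index=orderbook.index)

orderbook.plot.scatter('bidprice', 'bidvolume')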
I have a dataframe that contains missing data. I'm interested in exploring interpolation as a possible alternative to removing columns with missing data.
Below is a subset of the dataset. 'a_out' is outdoor temperature while 'b_in' etc. are temperatures from rooms in the same house.
a_out b_in c_in d_in e_in f_in
... ... ... ... ... ... ...
03/01/2016 6.51 17.71 15.15 14.04 15.27 16.32
04/01/2016 5.94 17.49 14.34 14.71
05/01/2016 6.74 17.57 14.80 15.18
06/01/2016 5.86 17.49 14.68 18.43 15.57
07/01/2016 5.18 17.18 14.02 14.88
08/01/2016 2.84 16.80 13.15 14.51 14.48
... ... ... ... ... ... ...
Might there be a way to interpolate the missing data, but with some weighting based on intact data in other columns? Perhaps 'cubic' interpolation could do the trick?
Thanks!
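For what it's worth, pandas' own interpolate fills each column independently from its own neighbours; true cross-column weighting is closer to multivariate imputation (e.g. scikit-learn's IterativeImputer) than to interpolation. A minimal per-column sketch, assuming the dates are parsed into a DatetimeIndex (values abridged from the subset above):

import pandas as pd

df = pd.DataFrame(
    {"a_out": [6.51, 5.94, 6.74, 5.86],
     "b_in": [17.71, None, 17.57, None]},
    index=pd.to_datetime(["2016-01-03", "2016-01-04",
                          "2016-01-05", "2016-01-06"]))

# Per-column interpolation; 'time' weights gaps between valid values by the
# spacing of the timestamps. 'cubic' (via SciPy) can be swapped in similarly.
filled = df.interpolate(method="time")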
How do I pairwise-iterate over columns to find similarities?
Every element from every column of one DataFrame needs to be compared with every element from every column of another DataFrame.
E.g.:
df1 has two fields: Name & Age
Name, Age
"Ajay Malhotra", 28
"Sujata Krishanan", 27
"Madhav Shankar", 33
df2 has three fields: UserId, EmpId & Email
"UserID", "EmpID", "Email"
--------------------------------------
"Ajay.Malhotra", 100, "a.malt#nothing.com"
"Madhav.Shankar", 101, "m.shankar"
"Sujata.Kris", 1001, "Kris.Suja#nothing.com"
Some method gives a match value; it can be hard-coded to 0.73 as an example:
def chekIfSame(leftString: String, rightString: String): Double = {
// Some Logic ..Gives a MatchValue
0.73
}
How do I take each column from df1 and each column from df2 and pass them to chekIfSame?
The output could be a Cartesian product like this:
Name , UserId, MatchValue
--------------------------------------
"Sujata Krishanan", Sujata.Kris, 0.85
"Ajay Malhotra", Ajay.Malhotra , 0.98
"Ajay Malhotra", Sujata.Kris , 0.07
Nested forEach loops over two DataFrames
We won't be able to nest-loop the two DataFrames, but we can join them and pass the columns to a function:
val joined = leftDf.join(rightDf)  // no join condition: a Cartesian product
val joinedWithScore = joined.withColumn("similarScore",
  checkSimilarity(joined(ltColName), joined(rtColName)))
For this, we need a UDF version of chekIfSame (called checkSimilarity below) defined before the above operation:
import org.apache.spark.sql.functions.udf

val checkSimilarity = udf((left: String, right: String) => {
  // Logic, or a hard-coded 0.73
  0.73
})
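For comparison only, a sketch of the same join-plus-UDF pattern in PySpark; df1, df2 and the column names follow the example above, and the scorer is a hypothetical stand-in:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

# Hypothetical scorer: the real similarity logic would go inside the lambda.
check_if_same = udf(lambda left, right: 0.73, DoubleType())

joined = df1.crossJoin(df2)  # Cartesian product, as in the Scala version
scored = joined.select(
    col("Name"), col("UserID"),
    check_if_same(col("Name"), col("UserID")).alias("MatchValue"))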