HANA linestring to segments - geospatial

How can I divide a linestring into line segments using SAP HANA DB?
for example:
'LINESTRING(0 0, 2 2, 0 2, 0 5)'
will become:
'LINESTRING(0 0, 2 2) LINESTRING(2 2, 0 2) LINESTRING(0 2, 0 5)'

You can extract individual points with the function ST_PointN().
To call it for points 1, 2, 3, ..., N, I use a temp table called TMP that stores counter values:
CREATE TABLE TMP (NUM INT);
INSERT INTO TMP SELECT TOP 100 ROW_NUMBER() OVER () FROM PUBLIC.M_TABLES;
(This assumes PUBLIC.M_TABLES has at least 100 rows; any sufficiently large table works as a row source.)
Then I apply ST_PointN() to every point, using the counter table and keeping only counter values up to ST_NumPoints():
select NUM, LS.ST_PointN(NUM).ST_AsWKT() as CUR_POINT
from (
    select NEW ST_LineString('LINESTRING(0 0, 2 2, 0 2, 0 5)') as LS, NUM
    from TMP
) nested1
where NUM <= LS.ST_NumPoints()
This returns
NUM | CUR_POINT
  1 | POINT (0 0)
  2 | POINT (2 2)
  3 | POINT (0 2)
  4 | POINT (0 5)
You can easily concatenate those into an ST_MultiPoint geometry with an aggregation using ST_UnionAggr():
select ST_UnionAggr(CUR_POINT).ST_AsWKT()
from (
    select NUM, LS.ST_PointN(NUM) as CUR_POINT
    from (
        select NEW ST_LineString('LINESTRING(0 0, 2 2, 0 2, 0 5)') as LS, NUM
        from TMP
    ) nested1
    where NUM <= LS.ST_NumPoints()
) nested2
This returns
MULTIPOINT ((0 0),(2 2),(0 2),(0 5))
Note: we could do a loop instead of using a counter table.
Now for your exact question: we will build the 3 segments as ST_LineString values, using a window function to combine the current point with the next one. There are multiple ways to write such a query; here's one:
select NEW ST_LineString('LineString (' || START_POINT.ST_X() || ' ' || START_POINT.ST_Y() || ','
                         || END_POINT.ST_X() || ' ' || END_POINT.ST_Y() || ')').ST_AsWKT() as LS
from (
    select CUR_POINT as START_POINT,
           NEW ST_Point(FIRST_VALUE(CUR_POINT) OVER (order by NUM asc rows between 1 following and 1 following)) as END_POINT
    from (
        select NUM, LS.ST_PointN(NUM) as CUR_POINT, LS.ST_NumPoints() as NB_POINTS
        from (
            select NEW ST_LineString('LINESTRING(0 0, 2 2, 0 2, 0 5)') as LS, NUM
            from TMP
        ) nested1
        where NUM <= LS.ST_NumPoints()
    ) nested2
) nested3
where END_POINT is not null
where END_POINT is not null
Tada:
LS
LINESTRING (0 0,2 2)
LINESTRING (2 2,0 2)
LINESTRING (0 2,0 5)

Related

selective averaging of dataframe entries depending on labels

I have a dataframe
   ID  KD   DT
0   4   2  5.6
1   4   5  8.7
4   4   8  1.9
5   4   9  1.7
6   4   1  8.8
3   4   3  7.2
9   4   4  3.1
I also have an array of labels, the same size as the total number of unique KD values:
L = [0, 0, 0, 1, 1, 1, 1], which indicates that KD == 1 is associated with label 0, KD == 2 with label 0, ..., and KD == 9 with label 1 (L follows the sorted order of KD).
Now I have two lists, l1 = [1, 2, 5, 9] and l2 = [3, 4, 8]. I want to set the value of DT for the KD values in l2 to the average of the DT values of the l1 entries that have the same label.
In the example, KD == 3 has the same label (0) as KD == 1 and KD == 2 in l1, so we set its DT = (8.8 + 5.6)/2 = 7.2.
I am currently doing this with a for loop, iterating over l2, finding the l1 entries that have the same label, and averaging. Is there a way to do this efficiently, getting rid of the for loop?
My output can be a dictionary of the form
d = {3:7.2, 4: 5.2, 8: 5.2}
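For reference, here is a minimal setup reproducing the example above (a sketch; the question only shows the printed table):
import pandas as pd

# Example data from the question; the index is non-contiguous on purpose.
df = pd.DataFrame(
    {'ID': [4, 4, 4, 4, 4, 4, 4],
     'KD': [2, 5, 8, 9, 1, 3, 4],
     'DT': [5.6, 8.7, 1.9, 1.7, 8.8, 7.2, 3.1]},
    index=[0, 1, 4, 5, 6, 3, 9])

L = [0, 0, 0, 1, 1, 1, 1]   # labels, aligned with the sorted unique KD values
l1 = [1, 2, 5, 9]           # source KD values (their DTs are averaged)
l2 = [3, 4, 8]              # target KD values (receive the averages)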
IIUC: first set_index on the KD column; then select 'DT' and use where to replace values that are not isin(l1) with NaN; then groupby.transform('mean') over a mapping of each KD to its group number in L; finally loc only the KD values that are isin(l2) and use to_dict to get your expected output:
df_ = df.set_index('KD')
print(df_['DT'].where(df_.index.isin(l1))                                    # keep l1 rows, others -> NaN
               .groupby(df_.index.map(pd.Series(L, df_.index.sort_values())))  # map each KD to its label
               .transform('mean')                                            # label-wise mean of the l1 values
               .loc[df_.index.isin(l2)]                                      # keep only the l2 targets
               .to_dict())
{8: 5.199999999999999, 3: 7.2, 4: 5.199999999999999}
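(The trailing decimals like 5.199999999999999 are ordinary floating-point artifacts of the mean; apply round() to the values if you need exactly 5.2.)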

How to select rows with only positive or negative values in Pandas

I have the following DF:
df = pd.DataFrame({
    'x': [1, 2, 3, -1, -2, -3],
    'y': [-1, 3, 2, -4, 3, -2],
    'z': [1, 1, 5, 2, 1, -1]
})
or
x y z
0 1 -1 1
1 2 3 1
2 3 2 5
3 -1 -4 2
4 -2 3 1
5 -3 -2 -1
The goal is to find the rows whose elements all have the same sign (either all negative or all positive).
In this example, it means selecting rows 1, 2, and 5.
I would appreciate any help.
I am aware of this question: Pandas - Compare positive/negative values
but it doesn't address the case where the values are negative.
Thank you :)
You can use OR (|) to combine the two conditions:
print(df[(df[df.columns] >= 0).all(axis=1) | (df[df.columns] <= 0).all(axis=1)])
Prints:
x y z
1 2 3 1
2 3 2 5
5 -3 -2 -1
Or simpler:
print(df[(df >= 0).all(axis=1) | (df <= 0).all(axis=1)])
EDIT: As @Erfan stated in the comments, if you want strictly negative (excluding 0), use <:
print(df[(df >= 0).all(axis=1) | (df < 0).all(axis=1)])
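Another way to express the same check (my sketch, not part of the original answer) is to count the distinct signs per row with numpy. Note that this treats 0 as its own sign, so rows mixing zeros with positives are excluded, unlike the >= 0 version:
import numpy as np

# Keep only the rows whose elements all share one sign value (-1, 0, or 1).
print(df[np.sign(df).nunique(axis=1) == 1])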

How to make each bin of data as column of dataframe

I have a dataframe with one column, and I want to divide that column into bins and get the count of each bin as a column of a dataframe, e.g., how many points fall into the bin from 0 to 0.5, and add this to the dataframe.
I used this code for binning, but I am not sure how to insert the count columns into the df.
df=pd.DataFrame({'max':[0.2,0.3,1,1.5,2.5,0.2]})
print(df)
max
0 0.2
1 0.3
2 1.0
3 1.5
4 2.5
5 0.2
bins = [0, 0.5, 1, 1.5, 2, 2.5]
x=pd.cut(df['max'], bins)
desired output
print(df)
0_0.5_count 0.5_1_count
0 3 1
First add the labels parameter to cut, then count with Series.value_counts, convert to a one-row DataFrame with Series.to_frame, and transpose with DataFrame.T:
bins = [0, 0.5, 1, 1.5, 2, 2.5]
labels = ['{}_{}_count'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
x = (pd.cut(df['max'], bins, labels=labels)
       .value_counts()
       .sort_index()
       .to_frame(0)
       .T)
print (x)
0_0.5_count 0.5_1_count 1_1.5_count 1.5_2_count 2_2.5_count
0 3 1 1 0 1
Details:
print (pd.cut(df['max'], bins, labels=labels))
0 0_0.5_count
1 0_0.5_count
2 0.5_1_count
3 1_1.5_count
4 2_2.5_count
5 0_0.5_count
Name: max, dtype: category
Categories (5, object): [0_0.5_count < 0.5_1_count < 1_1.5_count < 1.5_2_count < 2_2.5_count]
print (pd.cut(df['max'], bins, labels=labels).value_counts())
0_0.5_count 3
2_2.5_count 1
1_1.5_count 1
0.5_1_count 1
1.5_2_count 0
Name: max, dtype: int64
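(Series.value_counts sorts by count in descending order by default, which is why sort_index() is applied before reshaping.)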
Alternative solution with GroupBy.size:
bins = [0, 0.5, 1, 1.5, 2, 2.5]
labels = ['{}_{}_count'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
x = df.groupby(pd.cut(df['max'], bins, labels=labels)).size().rename_axis(None).to_frame().T
print (x)
0_0.5_count 0.5_1_count 1_1.5_count 1.5_2_count 2_2.5_count
0 3 1 1 0 1

Expand rows by column value in Presto

I have data which has id, number like:
id, number
1, 5
2, 3
And I would like data to be:
id, number
1, 0
1, 1
1, 2
1, 3
1, 4
1, 5
2, 0
2, 1
2, 2
2, 3
select t.id,
       s.n
from mytable t
cross join unnest(sequence(0, t.number)) s (n);
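Here sequence(0, t.number) builds an array [0, 1, ..., number] for each input row, and CROSS JOIN UNNEST expands that array into one output row per element, exposed as column n of the alias s.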

How can I select rows into N groups per value of a certain column?

I have a table in the form
Span Available Time
A 0 0
B 1 0
C 1 0
A 1 1
B 0 1
C 1 1
... ... ...
A 1 N
B 0 N
C 1 N
I want to group this into groups of X Times per Span. So it would look like:
Span Available Time
A 1 0
A 0 1
... ... ...
A 1 X
B 1 0
B 1 1
... ... ...
B 0 X
C 0 0
C 1 1
... ... ...
C 0 X
A 1 X+1
A 0 X+2
... ... ...
A 1 2X
B 1 X+1
B 1 X+2
... ... ...
B 0 2X
... ... ...
... ... ...
A 0 N-X
A 1 N-X+1
... ... ...
A 0 N
B 1 N-X
B 0 N-X+1
... ... ...
B 1 N
C 0 N-X
C 1 N-X+1
... ... ...
C 1 N
Where X is a factor of N.
How can I group the data in this way using SQL or Spark's DataFrame API?
Also, how can I aggregate that table by X rows per span to get, for example, the percentage availability for the span from time 0 to X, X to 2X, etc.?
edit:
For context, each group of X rows represents a day, and the whole data set represents a week. So I want to aggregate the availability per day, per span.
edit:
Also, I know what X is. So I want to be able to say something like GROUP BY Span LIMIT X ORDER BY Time
edit:
As a final attempt to describe this better: I want the first X rows of the first span, then the first X of the next span, then the first X of the last span, followed by the next X of the first span, the next X of the second span, and so on, through to the last rows of each span.
Under the assumption that your Time column contains a timestamp, your input data would look something like this example RDD:
val rdd = sc.parallelize(List(
  ("A", 0, "2015-01-02 09:00:00"),
  ("A", 1, "2015-01-02 10:00:00"),
  ("A", 1, "2015-01-02 11:00:00"),
  ("B", 0, "2015-01-02 09:00:00"),
  ("B", 0, "2015-01-02 10:00:00"),
  ("B", 1, "2015-01-02 11:00:00"),
  ("A", 1, "2015-01-03 09:00:00"),
  ("A", 1, "2015-01-03 10:00:00"),
  ("A", 1, "2015-01-03 11:00:00"),
  ("B", 0, "2015-01-03 09:00:00"),
  ("B", 0, "2015-01-03 10:00:00"),
  ("B", 0, "2015-01-03 11:00:00")))
you could achieve your grouping and aggregation like this:
rdd.map { case (span, availability, timestamp) =>
      ((span, getDate(timestamp)), (List((availability, timestamp)), availability, 1)) }
   .reduceByKey((v1, v2) => (v1._1 ++ v2._1, v1._2 + v2._2, v1._3 + v2._3))
   .mapValues(v => (v._1, v._2.toDouble / v._3))
(Where getDate() is some function that will return the date from a timestamp.)
This produces records keyed by (span, date), with values containing the list of (availability, timestamp) pairs and the availability percentage. For my example RDD, the result (showing span, list, and percentage) looks like this:
(B,List((0,2015-01-02 09:00:00), (0,2015-01-02 10:00:00), (1,2015-01-02 11:00:00)),0.3333333333333333)
(A,List((0,2015-01-02 09:00:00), (1,2015-01-02 10:00:00), (1,2015-01-02 11:00:00)),0.6666666666666666)
(A,List((1,2015-01-03 09:00:00), (1,2015-01-03 10:00:00), (1,2015-01-03 11:00:00)),1.0)
(B,List((0,2015-01-03 09:00:00), (0,2015-01-03 10:00:00), (0,2015-01-03 11:00:00)),0.0)
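Since the question also asks about Spark's DataFrame API, here is a sketch of the same per-day aggregation in PySpark (my addition, not part of the original answer; the column names are assumed):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input mirroring part of the example RDD above.
df = spark.createDataFrame(
    [("A", 0, "2015-01-02 09:00:00"),
     ("A", 1, "2015-01-02 10:00:00"),
     ("B", 1, "2015-01-02 11:00:00")],
    ["span", "available", "time"])

# Group by span and calendar day, then average the availability flag per group.
result = (df.withColumn("day", F.to_date(F.to_timestamp("time")))
            .groupBy("span", "day")
            .agg(F.avg("available").alias("availability_pct")))
result.show()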
