Amazon Athena (Presto) SELECT statement to create (n^2 + n)/2 (𝑛th triangular number)

I'm using Athena and trying to find a way to create a select statement that will return a sequence in the below format:
Number
1
2
2
3
3
3
4
4
4
4
And so on, up to 200.
Is it even possible?

Combine sequence() with UNNEST:
SELECT n FROM UNNEST(sequence(1, 5)) t(n)
CROSS JOIN UNNEST(sequence(1, n)) x(y);
presto:default> SELECT n
-> FROM UNNEST(sequence(1, 5)) t(n)
-> CROSS JOIN UNNEST(sequence(1, n)) x(y);
n
---
1
2
2
3
3
3
4
4
4
4
5
5
5
5
5
(15 rows)
(tested in Presto 326 but will work in Athena too)

Run:
SELECT numbers FROM (
SELECT * FROM (
VALUES flatten(
transform(
sequence(1, 4),
x -> repeat(x, cast(x AS INT))
)
)
) AS x (a) CROSS JOIN UNNEST(a) AS t (numbers)
);
It returns:
numbers
---------
1
2
2
3
3
3
4
4
4
4
(10 rows)
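
As a cross-check on the expected row count, here is a minimal Python sketch (not Athena SQL; the function name repeated_sequence is made up for illustration). It builds the same sequence and confirms that it has (n^2 + n)/2 entries, which is 20100 rows for n = 200.

```python
def repeated_sequence(n):
    """Return [1, 2, 2, 3, 3, 3, ...] where each k in 1..n appears k times."""
    return [k for k in range(1, n + 1) for _ in range(k)]


seq = repeated_sequence(200)
print(seq[:10])  # [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
print(len(seq))  # 20100, the 200th triangular number
```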

Related

Adding custom function to handle additional column/ data while resampleBarFeed in PyAlgoTrade

I am resampling custom columns with PyAlgoTrade.
Say I have an extra column called EX in addition to the default columns (open, close, high, low, etc.). It is initialized and passed around in the bar.BasicBar() function as {'extraCol1' : 1.09 }.
By default, EX is 0 in the resampled data. How and where can this default behaviour of PyAlgoTrade be changed or overridden?
Example:
Say if we have three 1-minute OHLCV EX bars as follows (toy example)
Cols: [O H L C V EX]
1-min bar#1: [1 1 1 1 1 1]
1-min bar#2: [2 2 2 2 2 2]
1-min bar#3: [3 3 3 3 3 3]
While resampling it for 3-minute, we get a bar like this:
Cols: [O H L C V EX]
3-min bar#1: [1 3 1 3 3 *0*]
Suppose we want custom handling of the EX column, say:
resampled EX = 3rd minute - 1st minute = 3 - 1 = 2 (instead of the 0)
How does one achieve that?

How to select multiple rows at a time in pandas?

I have a DataFrame with an unknown number of rows, and I want to select 5 rows at a time.
For instance, if df has 11 rows, it should be sliced 3 times (5 + 5 + 1); if it has only 4 rows, it should be sliced once.
How can I write the code using pandas?

Use groupby with a little arithmetic. This should be clean.
chunks = [g for _, g in df.groupby(df.index // 5)]
Depending on how you want your output structured, you may change g to g.values.tolist() (if you want a list instead).
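
Note that df.index // 5 relies on the default 0..n-1 integer index. The sketch below is a variation, not part of the original answer: it groups by row position with np.arange(len(df)) // 5, so it also works when the index holds labels. The example frame and its letter index are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is not 0..n-1, so df.index // 5 is not usable.
df = pd.DataFrame({"A": range(11)}, index=list("abcdefghijk"))

# Group by row position instead of by index label.
chunks = [g for _, g in df.groupby(np.arange(len(df)) // 5)]

print([len(c) for c in chunks])  # [5, 5, 1]
```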

numpy.split
np.split(df, np.arange(5, len(df), 5))
Demo
df = pd.DataFrame(dict(A=range(11)))
print(*np.split(df, np.arange(5, len(df), 5)), sep='\n\n')
A
0 0
1 1
2 2
3 3
4 4
A
5 5
6 6
7 7
8 8
9 9
A
10 10

Loop over the start positions in steps of 5 and slice the DataFrame by position:
for start in range(0, len(df), 5):
    data = df.iloc[start:start + 5]

Why does Excel average give a different result?

Here's the table:
Shouldn't the two results be mathematically the same (the average of the per-column averages and the average of the per-row averages)?

The missing cells mean that your cells aren't all weighted evenly.
For example, row 11 has only two cells 82.67 and 90. So for your row average for row 11 they are weighted much more heavily than in your column averages where they are 1/13 and 1/14 of a column instead of 1/2 of a row.
Try filling up all the empty cells with 0 and the averages should match.
Taking a more extreme version of Ruslan Karaev's example:
5 5 5 | 5
1     | 1
0     | 0
------
2 5 5
Average of row averages = (5 + 1 + 0) / 3 = 2
Average of column averages = (2 + 5 + 5) / 3 = 4
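
The same arithmetic can be reproduced outside Excel. This is a minimal Python sketch of the 3x3 example above, assuming blanks are simply skipped when averaging (which is how Excel's AVERAGE treats empty cells); the helper names are made up for illustration.

```python
import math

# None marks an empty cell, as in the 3x3 example above.
grid = [
    [5, 5, 5],
    [1, None, None],
    [0, None, None],
]


def average(cells):
    """Average the non-empty cells only."""
    values = [c for c in cells if c is not None]
    return sum(values) / len(values)


def average_of_averages(table):
    """Return (average of row averages, average of column averages)."""
    rows = average([average(r) for r in table])
    cols = average([average(c) for c in zip(*table)])
    return rows, cols


print(average_of_averages(grid))  # (2.0, 4.0)

# Replacing blanks with 0 gives every cell the same weight, so both agree.
filled = [[0 if c is None else c for c in row] for row in grid]
rows, cols = average_of_averages(filled)
print(math.isclose(rows, cols))  # True
```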

Yes, for example, the following two expressions:
((a + b) / 2 + (X + Y) / 2) / 2        ((a + X) / 2 + (b + Y) / 2) / 2
are indeed mathematically equivalent, both coming out to be (a + b + X + Y) / 4.
However, unless you have enough precision to store every intermediate value exactly, you may find that rounding errors accumulate differently depending on the order of operations.
You can see this sort of effect in a much simpler example if you assume a 3-digit precision and divide one by three, then multiply the result of that by three again:
1 / 3 -> 0.333, 0.333 x 3 -> 0.999
Contrast that with doing the operations in the opposite order:
1 x 3 -> 3, 3 / 3 -> 1
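
The three-digit example can be replayed with Python's decimal module, which lets you set the working precision explicitly. This is only an illustration of the ordering effect; Excel itself uses binary floating point, not three-digit decimals.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 3  # keep three significant digits, as in the example above

    third = Decimal(1) / Decimal(3)
    divide_then_multiply = third * 3
    multiply_then_divide = Decimal(1) * 3 / Decimal(3)

print(third)                 # 0.333
print(divide_then_multiply)  # 0.999
print(multiply_then_divide)  # 1
```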

Pandas pivot table group summary

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'group':['s','s','s','p','p','p'],
'section':['a','b','b','a','a','b']
})
group section
0 s a
1 s b
2 s b
3 p a
4 p a
5 p b
I'd like a count of the number of sections per group, plus the maximum and minimum number of rows per section for each group. Like this:
group  section count  max  min
s      2              2    1
p      2              2    1

IIUC you can use:
import pandas as pd
import numpy as np
df = pd.DataFrame({'group':['s','s','s','s','p','p','p','p','p'],
'section':['b','b','b','a','a','b','a','a','b']
})
print (df)
group section
0 s b
1 s b
2 s b
3 s a
4 p a
5 p b
6 p a
7 p a
8 p b
print (df.groupby(['group', 'section']).size() )
group section
p a 3
b 2
s a 1
b 3
dtype: int64
print (df.groupby(['group', 'section']).size().groupby(level=1).agg([len, min, max]) )
len min max
section
a 2 1 3
b 2 2 3
Or maybe you can change len to nunique:
print (df.groupby(['group', 'section']).size().groupby(level=1).agg(['nunique', min, max]) )
nunique min max
section
a 2 1 3
b 2 2 3
Or, if you need to group by the first level of the MultiIndex:
print (df.groupby(['group', 'section']).size().groupby(level=0).agg([len, min, max]) )
len min max
group
p 2 2 3
s 2 1 3
print (df.groupby(['group', 'section']).size().groupby(level=0).agg(['nunique', min, max]) )
nunique min max
group
p 2 2 3
s 2 1 3

You can achieve this by grouping on 'group', generating the value_counts, and then grouping again:
In [91]:
df.groupby('group')['section'].apply(pd.Series.value_counts).groupby(level=1).agg(['nunique','max','min'])
Out[91]:
nunique max min
a 2 2 1
b 2 2 1
To get close to the desired result you can do this:
In [102]:
df.groupby('group')['section'].apply(pd.Series.value_counts).reset_index().drop('level_1', axis=1).groupby('group',as_index=False).agg(['nunique','max','min'])
Out[102]:
section
nunique max min
group
p 2 2 1
s 2 2 1

J: Applying two arguments to a monadic verb produces strange results

I was wondering what would happen if I apply two arguments to this verb: 3&*.
If the left one is 1 all works as if it was only one argument:
1 (3&*) 3
9
1 (3&*) 4
12
1 (3&*) 5
15
If I change it I discover why that worked:
2 (3&*) 5
45
3 (3&*) 5
135
10 (3&*) 5
295245
It seems that the left argument is interpreted as a repetition like ^:. So the last one is equal to 3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * 5 (10 3's), or:
5 * 3^10
295245
Can you explain this weird behavior? I was expecting something like domain error (which is ubiquitous), and that is thrown if I try to use fndisplay:
require 'j/addons/general/misc/fndisplay.ijs'
defverbs 'f'
defnouns 'x y n'
x (n&*) y
|domain error
| x (n&*)y

This is documented:
x m&v y ↔ m&v^:x y
x u&n y ↔ u&n^:x y
See & (Bond) in the J Dictionary.
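
To make the identity concrete, here is a small Python stand-in (not J; the function name bond_power is made up for illustration). It models x (m&*) y as "apply multiply-by-m to y, x times" and reproduces the results from the question.

```python
def bond_power(m, times, y):
    """Model J's  x (m&*) y : apply 'multiply by m' to y, `times` times."""
    for _ in range(times):
        y = m * y
    return y


print(bond_power(3, 1, 5))   # 15, same as 1 (3&*) 5
print(bond_power(3, 2, 5))   # 45, same as 2 (3&*) 5
print(bond_power(3, 10, 5))  # 295245, same as 5 * 3^10
```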
