I have the following query:
let fooTable = datatable(TIMESTAMP: datetime, list_id: int, dim_count: int) [
datetime("2022-01-17T08:00:00Z"), -1, 120,
datetime("2022-01-17T08:00:00Z"), 1, 50,
datetime("2022-01-17T08:00:00Z"), 2, 30,
datetime("2022-01-17T08:00:00Z"), 8, 30,
datetime("2022-01-17T08:00:00Z"), 2001, 30,
datetime("2022-01-17T08:00:00Z"), 4, 30,
];
fooTable
| order by TIMESTAMP desc, dim_count desc
| evaluate pivot(list_id, take_any(dim_count), TIMESTAMP)
This produces the following results:
TIMESTAMP             1    -1    2    2001  4    8
2022-01-17T08:00:00Z  50   120   30   30    30   30
This produces almost what I need: it groups by the TIMESTAMP value and creates a column for each list_id value, with dim_count as the cell value.
But I expected a different order of the columns (as in the input):
TIMESTAMP             -1    1    2    8    2001  4
2022-01-17T08:00:00Z  120   50   30   30   30    30
How can I order the columns in such a way? (The number and values of the columns are dynamic.)
Or, how can I control the order of the returned columns?
In reality I have more data (with more time buckets), and I'd like the columns returned in descending order of each column's total dim_count.
So the desired column order in the output is given by:
fooTable
| summarize sum(dim_count) by list_id
| order by sum_dim_count desc
| project list_id
Which produces
-1
1
2
8
2001
4
And this is how I'd like the order of the columns (like in my expected output).
If the desired order is based on the row number, you can prefix each column name with it and use the project-reorder operator; the client will then need to parse the column names and strip the prefix (a sketch of that follows the output below). For example:
let fooTable = datatable(TIMESTAMP: datetime, list_id: int, dim_count: int) [
datetime("2022-01-17T08:00:00Z"), -1, 120,
datetime("2022-01-17T08:00:00Z"), 1, 50,
datetime("2022-01-17T08:00:00Z"), 2, 30,
datetime("2022-01-17T08:00:00Z"), 8, 30,
datetime("2022-01-17T08:00:00Z"), 2001, 30,
datetime("2022-01-17T08:00:00Z"), 4, 30,
];
fooTable
| serialize
| extend list_id = strcat(row_number(), "_", list_id)
| order by TIMESTAMP desc, dim_count desc
| evaluate pivot(list_id, take_any(dim_count), TIMESTAMP)
| project-reorder TIMESTAMP, * asc
TIMESTAMP                    1_-1  2_1  3_2  4_8  5_2001  6_4
2022-01-17 08:00:00.0000000  120   50   30   30   30      30
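For instance, a minimal client-side sketch in Python (hypothetical code, not from the answer) that strips the synthetic "N_" prefix while preserving the order:
def strip_order_prefix(columns):
    # Strip the "N_" row-number prefix added before pivoting,
    # e.g. "5_2001" -> "2001"; leave unprefixed names untouched.
    cleaned = []
    for name in columns:
        prefix, _, rest = name.partition("_")
        cleaned.append(rest if prefix.isdigit() and rest else name)
    return cleaned

print(strip_order_prefix(["TIMESTAMP", "1_-1", "2_1", "3_2", "4_8", "5_2001", "6_4"]))
# ['TIMESTAMP', '-1', '1', '2', '8', '2001', '4']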
I have the following table:
PersonID  CW_MilesRun  PW_MilesRun  CM_MilesRun  PM_MilesRun
1         15           25           35           45
2         10           20           30           40
3         5            10           15           20
...
I need to split this table into a vertical table with an id for each field (e.g. CW_MilesRun = 1, CM_MilesRun = 2, etc.), so that my table looks similar to this:
PersonID  TimeID  Description  C_MilesRun  P_MilesRun
1         1       Week         15          25
1         2       Month        35          45
2         1       Week         10          20
2         2       Month        30          40
3         1       Week         5           10
3         2       Month        15          20
In postgres, I would use something similar to:
SELECT
PersonID
, unnest(array[1,2]) AS TimeID
, unnest(array['Week','Month']) AS "Description"
, unnest(array["CW_MilesRun","CM_MilesRun"]) C_MilesRun
, unnest(array["PW_MilesRun","PM_MilesRun"]) P_MilesRun
FROM myTableHere
;
However, I cannot get a similar function in snowflake to work. Any ideas?
You can use FLATTEN() with LATERAL to get the result you want, although the query is quite different.
with tbl as (
  select $1 PersonID, $2 CW_MilesRun, $3 PW_MilesRun, $4 CM_MilesRun, $5 PM_MilesRun
  from values (1, 15, 25, 35, 45), (2, 10, 20, 30, 40), (3, 5, 10, 15, 20)
)
select
  PersonID,
  t.value[0] TimeID,
  t.value[1] Description,
  iff(t.index=0, CW_MilesRun, CM_MilesRun) C_MilesRun,
  iff(t.index=0, PW_MilesRun, PM_MilesRun) P_MilesRun  -- index 0 is the "Week" element, so pick the weekly columns there
from tbl, lateral flatten(parse_json('[[1, "Week"],[2, "Month"]]')) t;
PERSONID  TIMEID  DESCRIPTION  C_MILESRUN  P_MILESRUN
1         1       "Week"       15          25
1         2       "Month"      35          45
2         1       "Week"       10          20
2         2       "Month"      30          40
3         1       "Week"       5           10
3         2       "Month"      15          20
P.S. Use t.* to see what's available after flattening (perhaps that is obvious).
You could alternatively use UNPIVOT and NATURAL JOIN.
The above answer is great; I just like thinking about alternative ways of doing things. You never know when one might suit your needs, plus it exposes you to a couple of cool new functions.
with cte as (
select
1 PersonID,
15 CW_MilesRun,
25 PW_MilesRun,
35 CM_MilesRun,
45 PM_MilesRun
union
select
2 PersonID,
10 CW_MilesRun,
20 PW_MilesRun,
30 CM_MilesRun,
40 PM_MilesRun
union
select
3 PersonID,
5 CW_MilesRun,
10 PW_MilesRun,
15 CM_MilesRun,
20 PM_MilesRun
)
select * from
(select
PersonID,
CW_MilesRun weekly,
CM_MilesRun monthly
from
cte
) unpivot (C_MilesRun for description in (weekly, monthly))
natural join
(select * from
(select
PersonID,
PW_MilesRun weekly,
PM_MilesRun monthly
from
cte
) unpivot (P_MilesRun for description in (weekly, monthly))) f
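If you want to sanity-check the expected shape outside Snowflake, here is a rough pandas sketch of the same unpivot (not part of either answer above; it only reuses the question's sample data and column names):
import pandas as pd

df = pd.DataFrame({
    "PersonID": [1, 2, 3],
    "CW_MilesRun": [15, 10, 5],
    "PW_MilesRun": [25, 20, 10],
    "CM_MilesRun": [35, 30, 15],
    "PM_MilesRun": [45, 40, 20],
})

# Build one block per (TimeID, Description) pair, then stack them.
parts = []
for time_id, (desc, c_col, p_col) in enumerate(
        [("Week", "CW_MilesRun", "PW_MilesRun"),
         ("Month", "CM_MilesRun", "PM_MilesRun")], start=1):
    parts.append(pd.DataFrame({
        "PersonID": df["PersonID"],
        "TimeID": time_id,
        "Description": desc,
        "C_MilesRun": df[c_col],
        "P_MilesRun": df[p_col],
    }))

result = pd.concat(parts).sort_values(["PersonID", "TimeID"], ignore_index=True)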
Consider the following small dataframe:
import pandas as pd
value1 = [15, 20, 50, 70]
value2 = [15, 80, 45, 30]
base = [175, 150, 200, 125]
df = pd.DataFrame({"val1": value1, "val2": value2, "base": base})
df
val1 val2 base
0 15 15 175
1 20 80 150
2 50 45 200
3 70 30 125
In reality there are many more rows and many more val* columns...
I would like to express the figures given in the val* columns as a percentage of their corresponding base (in the same row). For example, 70 (the last value of val1) should become (70/125)*100, which is 56, and 30 (the last value of val2) should become (30/125)*100, which is 24; and so on for every figure.
I am sure the solution lies in a correct use of assign or apply with a lambda, but I can't figure out how to do it...
We can filter the val-like columns, divide them by the base column along axis=0, and then multiply by 100 to calculate the percentages:
df.filter(like='val').div(df['base'], axis=0).mul(100).add_suffix('%')
val1% val2%
0 8.571429 8.571429
1 13.333333 53.333333
2 25.000000 22.500000
3 56.000000 24.000000
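If you want to keep these percentages next to the original columns rather than as a separate frame, one option (a small sketch building on the answer above) is to join the result back:
import pandas as pd

df = pd.DataFrame({"val1": [15, 20, 50, 70],
                   "val2": [15, 80, 45, 30],
                   "base": [175, 150, 200, 125]})

# Compute the percentage columns and attach them to the original frame.
pct = df.filter(like='val').div(df['base'], axis=0).mul(100).add_suffix('%')
df = df.join(pct)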
I have a dataframe as shown. Using python, I want to get the sum of 'Value' for each 'Id' group upto the first occurrence of 'Stage' 12.
df = pd.DataFrame({'Id':[1,1,1,2,2,2,2],
'Date': ['2020-04-23', '2020-04-25', '2020-04-28', '2020-04-20', '2020-05-01', '2020-05-05', '2020-05-12'],
'Stage': [11, 12, 15, 11, 14, 12, 12],
'Value': [5, 4, 6, 12, 2, 8, 3]})
Id  Date        Stage  Value
1   2020-04-23  11     5
1   2020-04-25  12     4
1   2020-04-28  15     6
2   2020-04-20  11     12
2   2020-05-01  14     2
2   2020-05-05  12     8
2   2020-05-12  12     3
My desired output:
Id Value
1 9
2 22
Would be very thankful if someone could help.
Let us try using groupby with transform('idxmax') to filter the dataframe, then do another round of groupby:
idx = df['Stage'].eq(12).groupby(df['Id']).transform('idxmax')
output = df[df.index <= idx].groupby('Id')['Value'].sum().reset_index()
Detail
The transform with idxmax returns, for every row of a group, the index of the group's first match with 12; we then filter the df to the rows whose index is at most that value, i.e. the data up to and including the first 12.
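If you prefer an explicit mask over index comparisons, an equivalent sketch (using the question's sample data) counts the 12s seen so far in each group:
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 1, 2, 2, 2, 2],
                   'Date': ['2020-04-23', '2020-04-25', '2020-04-28',
                            '2020-04-20', '2020-05-01', '2020-05-05', '2020-05-12'],
                   'Stage': [11, 12, 15, 11, 14, 12, 12],
                   'Value': [5, 4, 6, 12, 2, 8, 3]})

# Number of 12s seen so far within each Id, including the current row.
seen = df['Stage'].eq(12).groupby(df['Id']).cumsum()

# Keep rows before the first 12 (seen == 0) plus the first 12 itself.
mask = seen.eq(0) | (seen.eq(1) & df['Stage'].eq(12))
output = df[mask].groupby('Id')['Value'].sum().reset_index()
print(output)
#    Id  Value
# 0   1      9
# 1   2     22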
I have a partitioned table with about 2 billion rows in hive like:
id, num, num_partition
1, 1253742321.53124121, 12
4, 1253742323.53124121, 12
2, 1353742324.53124121, 13
3, 1253742325.53124121, 12
And I want to have a table like:
id, rank, rank_partition
89, 1, 0
...
1, 1253742321, 12
7, 1253742322, 12
4, 1253742323, 12
8, 1253742324, 12
3, 1253742325, 12
...
2, 1353742324, 13
...
I have tried to do this:
df = spark.sql("select *, rank/10000000 from (select id,row_number() over(order by num asc) rank from table)t1")
It was very slow, since a global order by uses only one reducer,
and I've tried to do this:
df = spark.sql("select *, rank/10000000 from (select id,row_number() over(distribute by num_partition order by num_partition asc,num asc) rank from table)t1")
But in the result, num_partition wasn't sorted.
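For what it's worth, a common workaround (a sketch, not an answer from the thread; it assumes every num in a smaller num_partition is smaller than every num in a larger one, as the sample data suggests) is to rank inside each partition in parallel and then add the cumulative size of the preceding partitions:
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
df = spark.table("table")  # the 2-billion-row source table from the question

# 1) Rank locally inside each num_partition; this window runs in parallel
#    because it is partitioned.
w_local = Window.partitionBy("num_partition").orderBy("num")
local = df.withColumn("local_rank", F.row_number().over(w_local))

# 2) The per-partition row counts form a tiny table, so a global window
#    over them is cheap: the offset is the number of rows in all
#    preceding partitions.
counts = df.groupBy("num_partition").count()
w_prev = Window.orderBy("num_partition").rowsBetween(Window.unboundedPreceding, -1)
offsets = counts.withColumn("offset",
                            F.coalesce(F.sum("count").over(w_prev), F.lit(0)))

# 3) Global rank = local rank + offset; rank_partition can then be derived
#    as in the question (rank/10000000).
ranked = (local.join(offsets, "num_partition")
               .withColumn("rank", F.col("local_rank") + F.col("offset")))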
I have a very large dataframe with 1,000 columns. The first few columns occur only once, denoting a customer. The next few columns represent multiple encounters with the customer, with an underscore and the encounter number. Every additional encounter adds a new column, so there is NOT a fixed number of columns; it will grow with time.
Sample dataframe header structure excerpt:
id dob gender pro_1 pro_10 pro_11 pro_2 ... pro_9 pre_1 pre_10 ...
I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:
id dob gender pro_1 pre_1 que_1 fre_1 gen_1 pro_2 pre_2 que_2 fre_2 ...
(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)
Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!
EDIT:
Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:
id dob gender pro_1 pro_2 pro_3 ... pre_1 pre_2 pre_3 ...
EDIT 2.0:
Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.
Here is one way you can try:
# column names copied from your example
example_cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
# id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10
#0 0 1 2 3 4 5 6 7 8 9
# number of columns excluded from sorting
N = 3
# get a list of columns from the dataframe
cols = df.columns.tolist()
# split each name, create a tuple of (column_name, prefix, number), sort on the 2nd and 3rd items of the tuple, then take the first item.
# adjust "key = lambda x: x[2]" to group cols by numbers only
cols_new = cols[:N] + [ a[0] for a in sorted([ (c, p, int(n)) for c in cols[N:] for p,n in [c.split('_')]], key = lambda x: (x[1], x[2])) ]
# get the new dataframe based on the cols_new
df_new = df[cols_new]
# id dob gender pre_1 pre_10 pro_1 pro_2 pro_9 pro_10 pro_11
#0 0 1 2 8 9 3 6 7 4 5
Luckily there is a one-liner in Python that can fix this:
df = df.reindex(sorted(df.columns), axis=1)
For Example lets say you had this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': [2, 4, 8, 0],
'ID': [2, 0, 0, 0],
'Prod3': [10, 2, 1, 8],
'Prod1': [2, 4, 8, 0],
'Prod_1': [2, 4, 8, 0],
'Pre7': [2, 0, 0, 0],
'Pre2': [10, 2, 1, 8],
'Pre_2': [10, 2, 1, 8],
'Pre_9': [10, 2, 1, 8]}
)
print(df)
Output:
Name ID Prod3 Prod1 Prod_1 Pre7 Pre2 Pre_2 Pre_9
0 2 2 10 2 2 2 10 10 10
1 4 0 2 4 4 0 2 2 2
2 8 0 1 8 8 0 1 1 1
3 0 0 8 0 0 0 8 8 8
Then use
df = df.reindex(sorted(df.columns), axis=1)
The dataframe will then look like:
ID Name Pre2 Pre7 Pre_2 Pre_9 Prod1 Prod3 Prod_1
0 2 2 10 2 10 10 2 10 2
1 0 4 2 0 2 2 4 2 4
2 0 8 1 0 1 1 8 1 8
3 0 0 8 0 8 8 0 8 0
As you can see, the columns without an underscore come first, followed by an ordering based on the number after the underscore. However, this also sorts the column names themselves alphabetically, so names that come first in the alphabet appear first.
You need to split your columns on '_' and then convert to int:
import numpy as np
import pandas as pd

c = ['A_1','A_10','A_2','A_3','B_1','B_10','B_2','B_3']
df = pd.DataFrame(np.random.randint(0,100,(2,8)), columns = c)
df.reindex(sorted(df.columns, key = lambda x: int(x.split('_')[1])), axis=1)
Output:
A_1 B_1 A_2 B_2 A_3 B_3 A_10 B_10
0 68 11 59 69 37 68 76 17
1 19 37 52 54 23 93 85 3
For the other case, you need human (natural) sorting:
import re
def atoi(text):
return int(text) if text.isdigit() else text
def natural_keys(text):
'''
alist.sort(key=natural_keys) sorts in human order
http://nedbatchelder.com/blog/200712/human_sorting.html
(See Toothy's implementation in the comments)
'''
return [ atoi(c) for c in re.split(r'(\d+)', text) ]
df.reindex(sorted(df.columns, key = lambda x:natural_keys(x)), axis=1)
Output:
A_1 A_2 A_3 A_10 B_1 B_2 B_3 B_10
0 68 59 37 76 11 69 68 17
1 19 52 23 85 37 54 93 3
Try this.
To re-order the columns based on the number after the column name:
cols_fixed = list(df.columns[:3])  # change the slice based on your df
cols_variable = list(df.columns[3:])  # change the slice based on your df
cols_variable = sorted(cols_variable, key=lambda x: int(x.split('_')[1]))  # sort by the number after '_'
cols_new = cols_fixed + cols_variable
new_df = df[cols_new]
To re-arrange column names based on the string part AND number part of the column names:
cols_fixed = list(df.columns[:3])  # change the slice based on your df
cols_variable = list(df.columns[3:])  # change the slice based on your df
cols_variable = sorted(cols_variable)
cols_new = cols_fixed + cols_variable
new_df = df[cols_new]