Column values in a row as a string

I have the following table:
id  count  hour  age  range
---------------------------
 0      5    10   61  10-200
 1      6    20   61  10-200
 2      7    15   61  10-200
 5      9     5   61  201-300
 7     10    25   61  201-300
 0      5    10   62  10-20
 1      6    20   62  10-20
 2      7    15   62  10-20
 5      9     5   62  21-30
 1      8     6   62  21-30
 7     10    25   62  21-30
10     15    30   62  31-40
I need to select the distinct values of the range column. I tried the following query:
Select distinct range as interval from table_name where age = 62;
Its result is a single column, as follows:
interval
----------
10-20
21-30
31-40
How can I get the result as follows?
10-20, 21-30, 31-40
EDITED:
I am now trying the following query:
select sys_connect_by_path(range, ',') interval
  from (select distinct nvl(range, '0') range,
               row_number() over (order by range) rn
          from table_name
         where age = 62)
 where connect_by_isleaf = 1
connect by rn = PRIOR rn + 1
start with rn = 1;
which gives me this output:
Interval
----------------------------------------
, 10-20,10-20,10-20,21-30,21-30, 31-40
Please help me get my desired output.

If you are on 11.2 rather than just 11.1, you can use the LISTAGG aggregate function:
SELECT listagg( interval, ',' )
         WITHIN GROUP( ORDER BY interval )
  FROM (SELECT DISTINCT range AS interval
          FROM table_name
         WHERE age = 62)
If you are using an earlier version of Oracle, you could use one of the other Oracle string aggregation techniques on Tim Hall's page. Prior to 11.2, my personal preference would be to create a user-defined aggregate function so that you can then
SELECT string_agg( interval )
  FROM (SELECT DISTINCT range AS interval
          FROM table_name
         WHERE age = 62)
If you don't want to create a function, however, you can use the ROW_NUMBER and SYS_CONNECT_BY_PATH approach, though that tends to be a bit harder to follow:
with x as (
  SELECT DISTINCT range AS interval
    FROM table_name
   WHERE age = 62 )
select ltrim( max( sys_connect_by_path(interval, ','))
                keep (dense_rank last order by curr),
              ',') range
  from (select interval,
               row_number() over (order by interval) as curr,
               row_number() over (order by interval) - 1 as prev
          from x)
connect by prev = PRIOR curr
start with curr = 1


Unnest from Table in Snowflake

I have the following table:
PersonID  CW_MilesRun  PW_MilesRun  CM_MilesRun  PM_MilesRun
1         15           25           35           45
2         10           20           30           40
3         5            10           15           20
...
I need to split this table into a vertical table with an id for each field (i.e. CW_MilesRun = 1, CM_MilesRun = 2, etc.) so that my table looks similar to this:
PersonID  TimeID  Description  C_MilesRun  P_MilesRun
1         1       Week         15          25
1         2       Month        35          45
2         1       Week         10          20
2         2       Month        30          40
3         1       Week         5           10
3         2       Month        15          20
In Postgres, I would use something similar to:
SELECT
  PersonID
  , unnest(array[1,2]) AS TimeID
  , unnest(array['Week','Month']) AS "Description"
  , unnest(array["CW_MilesRun","CM_MilesRun"]) C_MilesRun
  , unnest(array["PW_MilesRun","PM_MilesRun"]) P_MilesRun
FROM myTableHere
;
However, I cannot get a similar function in snowflake to work. Any ideas?
You can use FLATTEN() with LATERAL to get the result you want, although the query is quite different.
with tbl as (
  select $1 PersonID, $2 CW_MilesRun, $3 PW_MilesRun, $4 CM_MilesRun, $5 PM_MilesRun
  from values (1, 15, 25, 35, 45), (2, 10, 20, 30, 40), (3, 5, 10, 15, 20)
)
select
  PersonID,
  t.value[0] TimeID,
  t.value[1] Description,
  iff(t.index = 0, CW_MilesRun, CM_MilesRun) C_MilesRun,
  iff(t.index = 0, PW_MilesRun, PM_MilesRun) P_MilesRun
from tbl, lateral flatten(parse_json('[[1, "Week"],[2, "Month"]]')) t;
PERSONID  TIMEID  DESCRIPTION  C_MILESRUN  P_MILESRUN
1         1       "Week"       15          25
1         2       "Month"      35          45
2         1       "Week"       10          20
2         2       "Month"      30          40
3         1       "Week"       5           10
3         2       "Month"      15          20
P.S. Use t.* to see what's available after flattening (perhaps that is obvious.)
You could alternatively use UNPIVOT and NATURAL JOIN.
The answer above is great; I just like thinking about alternative ways of doing things. You never know when an alternative might suit your needs, and it exposes you to a couple of new functions.
with cte as (
  select 1 PersonID, 15 CW_MilesRun, 25 PW_MilesRun, 35 CM_MilesRun, 45 PM_MilesRun
  union
  select 2 PersonID, 10 CW_MilesRun, 20 PW_MilesRun, 30 CM_MilesRun, 40 PM_MilesRun
  union
  select 3 PersonID, 5 CW_MilesRun, 10 PW_MilesRun, 15 CM_MilesRun, 20 PM_MilesRun
)
select *
from (select PersonID,
             CW_MilesRun weekly,
             CM_MilesRun monthly
      from cte
     ) unpivot (C_MilesRun for description in (weekly, monthly))
natural join
     (select *
      from (select PersonID,
                   PW_MilesRun weekly,
                   PM_MilesRun monthly
            from cte
           ) unpivot (P_MilesRun for description in (weekly, monthly))
     ) f

Calculate the number of occurrences of words in a column and find the second, third most common

I have a formula that finds the most frequently occurring text, and it works well.
=INDEX(Rng,MATCH(MAX(COUNTIF(Rng,Rng)),COUNTIF(Rng,Rng),0))
How can I tweak it to find the second highest and third highest?
2nd:
=LARGE(A2:A; 2)
3rd:
=LARGE(A2:A; 3)
update 1:
use query:
=QUERY(A:A,
"select A,count(A) where A is not null group by A label count(A)''")
to get only 2nd or 3rd you can use index like:
=INDEX(QUERY(A:A,
"select A,count(A) where A is not null group by A label count(A)''"), 2)
update 2:
=INDEX(QUERY({'Data Entry Errors'!I:I},
"select Col1,count(Col1) where Col1 is not null group by Col1 order by count(Col1) desc limit 3 label count(Col1)''"),,1)
In Google Sheets, to get the number of occurrences of each word in the column A2:A, use this:
=query(A2:A, "select A, count(A) where A is not null group by A order by count(A) desc label count(A) '' ", 0)
To get just the second and third result and the number of their occurrences, use this:
=query(A2:A, "select A, count(A) where A is not null group by A order by count(A) desc limit 2 offset 1 label count(A) '' ", 0)
To get just the names that are the second and third by the number of their occurrences, use this:
=query( query(A2:A, "select A, count(A) where A is not null group by A order by count(A) desc limit 2 offset 1 label count(A) '' ", 0), "select Col1", 0 )
For Excel 365
Say we have data in column A from A2 through A66 like:
20
11
27
18
3
31
2
30
8
1
18
32
3
5
4
6
4
1
22
11
2
46
33
34
25
53
37
9
20
2
12
4
5
4
23
39
19
4
28
22
5
16
24
7
6
10
13
31
56
23
1
16
27
39
1
6
11
6
20
11
24
12
9
29
12
and we want a frequency table listing the most frequent value, the second most frequent value, the third, etc.
The simplest approach is to construct a Pivot Table, but if you need a formula approach, then in B2 enter:
=UNIQUE(A2:A66)
in C2 enter:
=COUNTIF(A$2:A$66,B2)
and copy it down alongside the spill range of column B.
We now sort cols B:C by C. In D2 enter:
=SORTBY(B2:C35,C2:C35,-1)

Reshape a pandas DataFrame using combination of row values in two columns

I have data for multiple customers in a data frame, as below:
Customer_id  event_type  month  mins_spent
1            live        CM     10
1            live        CM1    10
1            catchup     CM2    20
1            live        CM2    30
2            live        CM     45
2            live        CM1    30
2            catchup     CM2    20
2            live        CM2    20
I need the result data frame to have one row per customer, with columns that combine the values of month and event_type, and mins_spent as the values. The result data frame would be as below:
Customer_id  CM_live  CM_catchup  CM1_live  CM1_catchup  CM2_live  CM2_catchup
1            10       0           10        0            30        20
2            45       0           30        0            20        20
Is there an efficient way to do this, instead of iterating over the input data frame and creating the new data frame?
You can use pivot_table:
import numpy as np

# pivot your data frame
p = df.pivot_table(values='mins_spent', index='Customer_id',
                   columns=['month', 'event_type'], aggfunc=np.sum)

# flatten the multi-indexed columns with a list comprehension
p.columns = ['_'.join(col) for col in p.columns]
             CM_live  CM1_live  CM2_catchup  CM2_live
Customer_id
1                 10        10           20        30
2                 45        30           20        20
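Note that pivot_table only creates columns for month/event_type combinations that actually occur in the data, so the all-zero CM_catchup and CM1_catchup columns from the desired output are missing above. A minimal sketch of one way to force every combination to appear, using fill_value plus a reindex over the full column product (the sample frame is rebuilt from the question's data; this is an assumption-laden illustration, not the answerer's code):

import pandas as pd

df = pd.DataFrame({
    'Customer_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'event_type': ['live', 'live', 'catchup', 'live',
                   'live', 'live', 'catchup', 'live'],
    'month': ['CM', 'CM1', 'CM2', 'CM2', 'CM', 'CM1', 'CM2', 'CM2'],
    'mins_spent': [10, 10, 20, 30, 45, 30, 20, 20],
})

p = df.pivot_table(values='mins_spent', index='Customer_id',
                   columns=['month', 'event_type'],
                   aggfunc='sum', fill_value=0)

# reindex against every month/event_type combination so that columns
# absent from the data (CM_catchup, CM1_catchup) show up filled with 0
full_cols = pd.MultiIndex.from_product(
    [df['month'].unique(), df['event_type'].unique()])
p = p.reindex(columns=full_cols, fill_value=0)
p.columns = ['_'.join(col) for col in p.columns]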
You can create a new column (key) by concatenating columns month and event_type, and then use pivot() to reshape your data.
(df.assign(key=lambda d: d['month'] + '_' + d['event_type'])
   .pivot(index='Customer_id',
          columns='key',
          values='mins_spent'))
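One caveat: pivot() leaves NaN for a customer that lacks one of the existing keys, and a key that never occurs anywhere in the data (such as CM_catchup here) produces no column at all. A short follow-up sketch for the NaN case, under the same df as above:

result = (df.assign(key=lambda d: d['month'] + '_' + d['event_type'])
            .pivot(index='Customer_id', columns='key', values='mins_spent')
            .fillna(0)       # fill per-customer gaps with 0
            .astype(int))    # restore integer dtype after fillna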

Find a row matching multiple column criteria

I have a dataframe with 2M rows, in the below format:
ID  Number
1   30
1   40
1   60
2   10
2   30
3   60
I need to select the IDs that have both 30 and 40 present in Number (in this case, the output should be 1).
I know we can create a new DF having only the numbers 30 and 40, and then group by to see which IDs have a count of more than 1. But is there a way to do both in the groupby statement?
My code:
a = df[(df['Number'] == 30) | (df['Number'] == 40)]
b = a.groupby('ID')['Number'].nunique().to_frame(name='tt').reset_index()
b[b['tt'] > 1]
Use groupby filter and issubset:
s = {30, 40}
df.groupby('ID').filter(lambda x: s.issubset(set(x.Number)))

Out[158]:
   ID  Number
0   1      30
1   1      40
2   1      60
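If you only need the qualifying IDs rather than all of their rows, one hedged alternative sketch is to aggregate the same subset test per group instead of filtering:

s = {30, 40}
# boolean Series indexed by ID: True where the group contains both values
has_both = df.groupby('ID')['Number'].apply(lambda x: s.issubset(set(x)))
has_both[has_both].index.tolist()  # -> [1]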
I find the fact that the describe() method of Groupby objects returns a dataframe to be extremely helpful.
Output temp1 = a.groupby("ID").describe() and temp2 = a.groupby("ID").describe()["Number"] in a Jupyter notebook to see what they look like; then the following code (which follows on from yours) should make sense.
summary = a.groupby("ID").describe()["Number"]
summary.loc[summary["count"] > 1].index
I would create a df for each condition and then inner join them:
# keep only the ID column from the rows matching each condition
df1 = df[df.Number == 30][['ID']]
df2 = df[df.Number == 40][['ID']]
# an inner merge on ID keeps the IDs present in both
df3 = df1.merge(df2, how='inner', on='ID')
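A brief usage note on this sketch: if an ID has several 30-rows or 40-rows, the inner merge yields one row per pair, so take the unique values for the final answer:

df3['ID'].unique()  # -> array([1]) for the sample data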

How to get the total count for a certain date from a dataframe having a datetime column

I am new to pandas DataFrames.
From MySQL I have bound the following dataset to a DataFrame. How do I get the total count for a particular date in Jupyter? Also, how do I set up a Datepicker widget in Jupyter so that selecting a date range in the calendar shows the total count for the selected dates?
To be more specific:
1) Get the total count for today's date (by inputting only a date) from the RegistrationDate column
2) Get the total count for the last 7 days (by inputting only a date) from the RegistrationDate column
3) Get the total count by selecting a date range from the Datepicker widget, using the RegistrationDate column
    No  RegistrationDate
0    7  2019-07-23 12:23:25
1    9  2019-07-23 03:23:25
2   11  2019-07-23 08:10:10
3   13  2019-07-22 09:23:25
4   15  2019-07-22 04:01:02
5   17  2019-07-21 12:23:25
6   19  2019-07-20 12:23:25
7   21  2019-07-19 12:23:25
8   67  2019-06-04 12:23:25
9   68  2019-06-05 12:23:25
10  69  2019-06-06 12:23:25
First, index by date.
Set the index to 'RegistrationDate' using
df.set_index('RegistrationDate', inplace=True)
Objective 1
Get user input for the date using
today = input('Enter a date: ')  # e.g. 2019-07-22 04:01:02
count1 = df.loc[today]
which will return the matching row (here, the one whose No is 15). To count all rows for a whole day, pass just the date part and take the length: len(df.loc['2019-07-22']).
Objective 3
Ensure that df['RegistrationDate'] is a datetime Series (if you set it as the index earlier, reset it first with df.reset_index()):
df['RegistrationDate'] = pd.to_datetime(df['RegistrationDate'])
Get user input for the start and end dates:
start_date = input("start date:\t")
end_date = input("end date:\t")
Create a Boolean mask, ensuring that the input dates are datetime.datetime objects, datetime strings, or pd.Timestamps:
mask = (df['RegistrationDate'] > start_date) & (df['RegistrationDate'] <= end_date)
Assign this to a temp_df and sum the No column:
temp_df = df.loc[mask]
total_in_range = temp_df['No'].sum()
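Objective 2 is not covered above; here is a minimal sketch, assuming df['RegistrationDate'] has already been converted with pd.to_datetime and remains a column:

import pandas as pd

today = pd.Timestamp.today().normalize()
week_ago = today - pd.Timedelta(days=7)
# rows whose RegistrationDate falls within the last 7 days, up to now
last_7_days = df[(df['RegistrationDate'] >= week_ago)
                 & (df['RegistrationDate'] <= pd.Timestamp.now())]
total_last_7_days = len(last_7_days)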
