I have a database like the following:
I would like to obtain a pandas dataframe filtered down to the two rows per date with the highest population. The output should look like this:
I know that pandas offers a method called nlargest:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html
but I don't think it is usable for this use case. Is there any workaround?
Thanks so much in advance!
I have mimicked your dataframe below and provided a way forward to get the desired result; I hope that will be helpful.
Your Dataframe:
>>> df
Date country population
0 2019-12-31 A 100
1 2019-12-31 B 10
2 2019-12-31 C 1000
3 2020-01-01 A 200
4 2020-01-01 B 20
5 2020-01-01 C 3500
6 2020-01-01 D 12
7 2020-02-01 D 2000
8 2020-02-01 E 54
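For reference, a minimal sketch that reconstructs this mimicked frame (assuming Date stays a plain string column, as the display suggests):
import pandas as pd

df = pd.DataFrame({
    'Date': ['2019-12-31', '2019-12-31', '2019-12-31',
             '2020-01-01', '2020-01-01', '2020-01-01', '2020-01-01',
             '2020-02-01', '2020-02-01'],
    'country': ['A', 'B', 'C', 'A', 'B', 'C', 'D', 'D', 'E'],
    'population': [100, 10, 1000, 200, 20, 3500, 12, 2000, 54],
})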
Your Desired Solution:
You can use the nlargest method along with set_index and groupby.
This is what you will get:
>>> df.set_index('country').groupby('Date')['population'].nlargest(2)
Date country
2019-12-31 C 1000
A 100
2020-01-01 C 3500
A 200
2020-02-01 D 2000
E 54
Name: population, dtype: int64
Now, to get the DataFrame back into its original shape, reset the index, which will give you the following:
>>> df.set_index('country').groupby('Date')['population'].nlargest(2).reset_index()
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
Another approach:
With groupby and apply, use reset_index with drop=True and the level parameter:
>>> df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=[0,1], drop=True)
# df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=['Date',1], drop=True)
Date country population
0 2019-12-31 C 1000
1 2019-12-31 A 100
2 2020-01-01 C 3500
3 2020-01-01 A 200
4 2020-02-01 D 2000
5 2020-02-01 E 54
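A third option, if you would rather avoid apply: sort by population and keep the first two rows per date with groupby().head(). A sketch on the same mimicked data:
out = (df.sort_values('population', ascending=False)
         .groupby('Date')
         .head(2)
         .sort_values('Date', kind='mergesort')   # stable sort keeps the per-date population order
         .reset_index(drop=True))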
Related
I have a situation in which I need to calculate the difference between multiple columns and store them in a separate column under separate headers. My dataset looks like below:
cat_1 cat_2 cat_3 cat_4 date_1 date_2 date_3 date_4
a b b c 2020-01-01 2020-01-01 2020-01-25 2020-01-10
b c d 2019-01-11 2020-01-01 2020-01-15 2020-01-10
a b d 2018-11-01 2019-01-01 2020-01-15 2020-01-10
a b c d 2015-01-01 2016-01-29 2018-01-25 2019-01-10
.. and so on
The order follows a->b->c->d, and the reverse is not possible.
I want to store the following combinations in new columns, as a number of days. There will be 4 combinations in total. Essentially, I want to calculate the difference between the two dates of each combination and store it in days.
An example of the desired output for the first rows of my dataset:
days_a-b days_a-c days_a-d days_b-c days_b-d days_c-d
0 9
355 364 -5
How to solve this one?
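One possible sketch: map each category letter present in a row to its date, then take the pairwise differences in days with itertools.combinations. This assumes cat_i lines up with date_i (the rows with missing categories leave that alignment ambiguous), so treat it as a starting point rather than a definitive answer:
import itertools
import pandas as pd

def pair_day_diffs(row):
    # map each category letter present in this row to its date
    # (assumption: cat_i aligns with date_i; a duplicated letter keeps its last date)
    dates = {}
    for i in range(1, 5):
        cat = row.get(f'cat_{i}')
        if pd.notna(cat):
            dates[cat] = pd.to_datetime(row[f'date_{i}'])
    # difference in days for every pair that is present in the row
    out = {}
    for x, y in itertools.combinations('abcd', 2):
        if x in dates and y in dates:
            out[f'days_{x}-{y}'] = (dates[y] - dates[x]).days
    return pd.Series(out)

result = df.join(df.apply(pair_day_diffs, axis=1))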
I have time series data, given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
I have high-dimensional data, and I have just added a simplified version with two columns, {price, amount}. I am trying to transform it relative to the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative changes of each product based on the time index. If a previous date does not exist for a given product, I am adding "NaN".
Can you please tell me if there is any function to do this?
Group by product and use .diff():
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
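One caveat: .diff() works on row order, so if the frame is not already sorted by date within each product, sort first. A sketch, assuming the columns shown above:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])   # in case date is stored as strings
df = df.sort_values(['product', 'date'])
df[['price', 'amount']] = df.groupby('product')[['price', 'amount']].diff()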
My dataframe "dataframe_time" looks like:
INSERTED_UTC
0 2018-05-29
1 2018-05-22
2 2018-02-10
3 2018-04-30
4 2018-03-02
5 2018-11-26
6 2018-03-07
7 2018-05-12
8 2019-02-03
9 2018-08-03
10 2018-04-27
print(type(dataframe_time['INSERTED_UTC'].iloc[1]))
<class 'datetime.date'>
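For reference, a sketch that reconstructs this frame; note that the values are datetime.date objects rather than datetime64, which is what trips up the grouper below:
import datetime
import pandas as pd

dataframe_time = pd.DataFrame({'INSERTED_UTC': [
    datetime.date(2018, 5, 29), datetime.date(2018, 5, 22),
    datetime.date(2018, 2, 10), datetime.date(2018, 4, 30),
    datetime.date(2018, 3, 2), datetime.date(2018, 11, 26),
    datetime.date(2018, 3, 7), datetime.date(2018, 5, 12),
    datetime.date(2019, 2, 3), datetime.date(2018, 8, 3),
    datetime.date(2018, 4, 27),
]})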
I am trying to group the dates together and find the count of their occurrences quarterly. Desired output:
Quarter Count
2018-03-31 3
2018-06-30 5
2018-09-30 1
2018-12-31 1
2019-03-31 1
2019-06-30 0
I am running the following command to group them together:
dataframe_time['INSERTED_UTC'].groupby(pd.Grouper(freq='Q'))
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
First, the dates are converted to datetimes, and then DataFrame.resample is used with the on parameter to select the datetime column:
dataframe_time.INSERTED_UTC = pd.to_datetime(dataframe_time.INSERTED_UTC)
df = dataframe_time.resample('Q', on='INSERTED_UTC').size().reset_index(name='Count')
Or your solution can be changed to:
df = (dataframe_time.groupby(pd.Grouper(freq='Q', key='INSERTED_UTC'))
.size()
.reset_index(name='Count'))
print (df)
INSERTED_UTC Count
0 2018-03-31 3
1 2018-06-30 5
2 2018-09-30 1
3 2018-12-31 1
4 2019-03-31 1
You can convert the dates to quarters by to_period('Q') and group by those:
df.INSERTED_UTC = pd.to_datetime(df.INSERTED_UTC)
df.groupby(df.INSERTED_UTC.dt.to_period('Q')).size()
You can also use value_counts, with sort_index so the result is ordered by quarter rather than by count:
df.INSERTED_UTC.dt.to_period('Q').value_counts().sort_index()
Output:
INSERTED_UTC
2018Q1 3
2018Q2 5
2018Q3 1
2018Q4 1
2019Q1 1
Freq: Q-DEC, dtype: int64
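To match the Quarter/Count labels from the desired output, a small rename on top of either approach should do it (a sketch):
out = (df.groupby(df.INSERTED_UTC.dt.to_period('Q'))
         .size()
         .reset_index(name='Count')
         .rename(columns={'INSERTED_UTC': 'Quarter'}))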
I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe into a column of another dataframe, which has a datetime index. That dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to data-2 depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
To assign the result back:
df['data-3'] = df['data-2'] * df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
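Another option, assuming df_lookup can be squeezed down to a single Series: map each row's hour through it. Index.map accepts a Series as the mapper, so a sketch of that route:
multiplier = df.index.hour.map(df_lookup.squeeze())   # hour -> lookup value
df['data-3'] = df['data-2'] * multiplier.to_numpy()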
I want to sort and group a pandas dataframe column alphabetically.
a b c
0 sales 2 NaN
1 purchase 130 230.0
2 purchase 10 20.0
3 sales 122 245.0
4 purchase 103 320.0
I want to sort column "a" so that it is in alphabetical order and grouped as well, i.e. the output is as follows:
a b c
1 purchase 130 230.0
2 10 20.0
4 103 320.0
0 sales 2 NaN
3 122 245.0
How can I do this?
I think you should use the sort_values method of pandas:
result = dataframe.sort_values('a')
It will sort your dataframe by column a, and equal values end up grouped together as a natural consequence of the sorting. See ya!
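If you also want the blanked-out, grouped display from the desired output, moving column a into the index after sorting produces it, since pandas sparsifies repeated index labels in the display (a sketch):
out = (dataframe.sort_values('a', kind='mergesort')   # stable sort keeps the original row order within groups
                .set_index('a', append=True)
                .swaplevel())
print(out)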