How do I create a new column in pandas which is the sum of another column based on a condition? - python-3.x

I am trying to get the result column to be the sum of the value column for all rows in the data frame where the country is equal to the country in that row, and the date is on or before the date in that row.
Date Country Value Result
01/01/2019 France 10 10
03/01/2019 England 9 9
03/01/2019 Germany 7 7
22/01/2019 Italy 2 2
07/02/2019 Germany 10 17
17/02/2019 England 6 15
25/02/2019 England 5 20
07/03/2019 France 3 13
17/03/2019 England 3 23
27/03/2019 Germany 3 20
15/04/2019 France 6 19
04/05/2019 England 3 26
07/05/2019 Germany 5 25
21/05/2019 Italy 5 7
05/06/2019 Germany 8 33
21/06/2019 England 3 29
24/06/2019 England 7 36
14/07/2019 France 1 20
16/07/2019 England 5 41
30/07/2019 Germany 6 39
18/08/2019 France 6 26
04/09/2019 England 3 44
08/09/2019 Germany 9 48
15/09/2019 Italy 7 14
05/10/2019 Germany 2 50
I have tried the below code but it sums up the entire column
df['result'] = df.loc[(df['Country'] == df['Country']) & (df['Date'] >= df['Date']), 'Value'].sum()

As your dates are already sorted, you can use a grouped cumulative sum:
df['Result'] = df.groupby('Country').Value.cumsum()
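If the rows are not guaranteed to be in date order, one option is to sort by date before taking the cumulative sum; pandas aligns the result back to the original rows by index. This is a minimal sketch on a small, shuffled subset of the question's data (the subset and the `%d/%m/%Y` format are assumptions made for illustration):

```python
import pandas as pd

# Small, deliberately unsorted subset of the question's data
df = pd.DataFrame({
    "Date": ["07/02/2019", "01/01/2019", "03/01/2019", "03/01/2019", "17/02/2019"],
    "Country": ["Germany", "France", "England", "Germany", "England"],
    "Value": [10, 10, 9, 7, 6],
})

# Parse day-first date strings so sorting is chronological, not alphabetical
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Sort by date, take a running total per country, and let index alignment
# put each total back on its original row
df["Result"] = (
    df.sort_values("Date", kind="stable")
      .groupby("Country")["Value"]
      .cumsum()
)

print(df)
```

Note that a running total treats two rows with the same country and the same date as separate steps, so if such duplicates can occur, the earlier of the two rows will not include the later one's value.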

Related

create a python function as store_id & date as input and output as the previous date's sales

I have this store_df DataFrame:
store_id date sales
0 1 2023-1-2 11
1 2 2023-1-3 22
2 3 2023-1-4 33
3 1 2023-1-5 44
4 2 2023-1-6 55
5 3 2023-1-7 66
6 1 2023-1-8 77
7 2 2023-1-9 88
8 3 2023-1-10 99
I was not able to solve this in the interview.
This was the exact question asked:
Create a dataset with 3 columns: store_id, date, sales.
Create 3 store_id values.
Each store_id has 3 consecutive dates.
Sales are recorded for 9 rows.
We are considering the same 9 dates across all stores.
Sales can be any random number.
Write a function that fetches the previous day's sales as output once we give store_id & date as input.
The question can be handled in multiple ways.
If you want to just get the previous row per group, assuming that the values are consecutive and sorted by increasing dates, use a groupby.shift:
store_df['prev_day_sales'] = store_df.groupby('store_id')['sales'].shift()
Output:
store_id date sales prev_day_sales
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 55.0
8 3 2023-01-04 99 66.0
If you really want to get the previous day's value (not the previous available day), use a merge:
store_df['date'] = pd.to_datetime(store_df['date'])
store_df.merge(store_df.assign(date=lambda d: d['date'].add(pd.Timedelta('1D'))),
on=['store_id', 'date'], suffixes=(None, '_prev_day'), how='left'
)
Note. This makes it easy to handle other deltas, like business days (replace pd.Timedelta('1D') with pd.offsets.BusinessDay(1)).
Example (with a different input):
store_id date sales sales_prev_day
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 NaN # there is no data for 2023-01-04
8 3 2023-01-04 99 66.0
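Since the interview question asks for a function taking store_id and date, the lookup can also be wrapped directly. The following is only a sketch: the function name `get_prev_day_sales` and the choice to return `None` when there is no row for the previous calendar day are assumptions, and the data mirrors the example output above.

```python
import pandas as pd

# Same data as the example output above (store 2 has no row for 2023-01-04)
store_df = pd.DataFrame({
    "store_id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "date": pd.to_datetime([
        "2023-01-02", "2023-01-02", "2023-01-02",
        "2023-01-03", "2023-01-03", "2023-01-03",
        "2023-01-04", "2023-01-05", "2023-01-04",
    ]),
    "sales": [11, 22, 33, 44, 55, 66, 77, 88, 99],
})


def get_prev_day_sales(df, store_id, date):
    """Return the sales of `store_id` on the calendar day before `date`.

    Hypothetical helper: returns None when no matching row exists.
    """
    prev_day = pd.Timestamp(date) - pd.Timedelta(days=1)
    match = df.loc[
        (df["store_id"] == store_id) & (df["date"] == prev_day), "sales"
    ]
    return match.iloc[0] if not match.empty else None


print(get_prev_day_sales(store_df, 1, "2023-01-04"))
print(get_prev_day_sales(store_df, 2, "2023-01-05"))
```

The first call finds store 1's sales on 2023-01-03; the second finds nothing because store 2 has no row for 2023-01-04.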

How can I use "groupby()" for gathering country names?

I have a pandas DataFrame with three columns: the name of the country, the year, and a value. The years run from 1960 to 2020 for each country.
The data looks like this:
Country Name  Year  value
USA           1960  12
Italy         1960  8
Spain         1960  5
Italy         1961  35
USA           1961  50
I would like to gather the rows with the same country name together. How can I do it? I could not succeed using groupby(), because groupby() always requires an aggregation function like sum().
Country Name  Year  value
USA           1960  12
USA           1961  50
Italy         1960  8
Italy         1961  35
Spain         1960  5
Spain         1960  5
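Gathering rows like this is a sort rather than a groupby. A plain `df.sort_values(["Country Name", "Year"])` would group the countries alphabetically; the sketch below instead keeps the countries in order of first appearance (USA, Italy, Spain), which is what the desired output seems to show. Using an ordered categorical for this is one possible approach, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    "Country Name": ["USA", "Italy", "Spain", "Italy", "USA"],
    "Year": [1960, 1960, 1960, 1961, 1961],
    "value": [12, 8, 5, 35, 50],
})

# Make the country column an ordered categorical whose order is the order
# of first appearance, so sorting keeps USA, Italy, Spain in that sequence
df["Country Name"] = pd.Categorical(
    df["Country Name"],
    categories=df["Country Name"].unique(),
    ordered=True,
)

# Sort by country first, then by year within each country
gathered = df.sort_values(["Country Name", "Year"]).reset_index(drop=True)
print(gathered)
```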

How to fill missing values relative to a value from another column

I'd like to fill missing values with conditions relative to the country:
For example, I'd like to replace China's missing values with the mean of Age, and USA's with the median of Age. For now, I don't want to touch the EU's missing values.
How could I do this?
Below is the dataframe:
import pandas as pd
data = [['USA', ], ['EU', 15], ['China', 35],
['USA', 45], ['EU', 30], ['China', ],
['USA', 28], ['EU', 26], ['China', 78],
['USA', 65], ['EU', 53], ['China', 66],
['USA', 32], ['EU', ], ['China', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Country', 'Age'])
df.head(10)
Country Age
0 USA NaN
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China NaN
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU NaN
Thank you
I'm not sure this is the best way to do it, but here is one way:
age_series = df['Age'].copy()
df.loc[(df['Country'] == 'China') & (df['Age'].isnull()), 'Age'] = age_series.mean()
df.loc[(df['Country'] == 'USA') & (df['Age'].isnull()), 'Age'] = age_series.median()
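Note that I copied the Age column beforehand so that the median is taken from the original Age series, not from the series after China's missing value has been filled with the mean. Both statistics here are computed over the whole Age column, not per country. This is the final result: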
Country Age
0 USA 33.500000
1 EU 15.000000
2 China 35.000000
3 USA 45.000000
4 EU 30.000000
5 China 40.583333
6 USA 28.000000
7 EU 26.000000
8 China 78.000000
9 USA 65.000000
10 EU 53.000000
11 China 66.000000
12 USA 32.000000
13 EU NaN
14 China 14.000000
Maybe you can try this (it needs numpy):
import numpy as np
df['Age']=(np.where((df['Country'] == 'China') & (df['Age'].isnull()),df['Age'].mean()
,np.where((df['Country'] == 'USA') & (df['Age'].isnull()),df['Age'].median(),df['Age']))).round()
Output
Country Age
0 USA 34.0
1 EU 15.0
2 China 35.0
3 USA 45.0
4 EU 30.0
5 China 41.0
6 USA 28.0
7 EU 26.0
8 China 78.0
9 USA 65.0
10 EU 53.0
11 China 66.0
12 USA 32.0
13 EU NaN
14 China 14.0
IIUC, we can create a function to handle this, as it's not easily automated (although I may be wrong).
The idea is to pass in the country name and the fill type (i.e. mean or median); you can extend the function to add other aggregation types.
It modifies your DataFrame in place and returns it, so you can also assign the result back.
def missing_values(dataframe, country, fill_type):
    """
    Takes 3 arguments: dataframe, country and fill_type.
    fill_type is the method used to fill `NA` values: mean, median, etc.
    """
    # Per-country statistics, e.g. {'China': {'mean': ..., 'median': ...}}
    fill_dict = dataframe.loc[dataframe['Country'] == country]\
        .groupby("Country")["Age"].agg(
        ["mean", "median"]).to_dict(orient='index')
    # Fill only the rows of the requested country
    dataframe.loc[dataframe['Country'] == country, 'Age'] \
        = dataframe['Age'].fillna(fill_dict[country][fill_type])
    return dataframe
print(missing_values(df,'China','mean'))
Country Age
0 USA NaN
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
print(missing_values(df,'USA','median'))
Country Age
0 USA 38.50
1 EU 15.00
2 China 35.00
3 USA 45.00
4 EU 30.00
5 China 48.25
6 USA 28.00
7 EU 26.00
8 China 78.00
9 USA 65.00
10 EU 53.00
11 China 66.00
12 USA 32.00
13 EU NaN
14 China 14.00
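The same per-country fill can also be driven by a small mapping from country to statistic, which leaves any country not listed (here the EU) untouched. This is a sketch only; the `fill_methods` dictionary is a hypothetical name introduced for illustration, and each statistic is computed from that country's own non-missing ages.

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({
    "Country": ["USA", "EU", "China"] * 5,
    "Age": [nan, 15, 35, 45, 30, nan, 28, 26, 78, 65, 53, 66, 32, nan, 14],
})

# Hypothetical mapping: which statistic to use for which country.
# Countries that are not listed keep their missing values.
fill_methods = {"China": "mean", "USA": "median"}

for country, method in fill_methods.items():
    mask = df["Country"] == country
    # Statistic computed from this country's own non-missing ages
    fill_value = df.loc[mask, "Age"].agg(method)
    df.loc[mask, "Age"] = df.loc[mask, "Age"].fillna(fill_value)

print(df)
```

Because each statistic is taken only from the rows of the country being filled, the order of the loop does not affect the result.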

how to calculate average total score by country

I have the data below in Excel:
Country Question 1 Question 2 Question 3 Total
Australia 10 10 10 30
Australia 7 10 7 24
Hong Kong 7 7 7 21
Japan 5 5 0 10
Australia 10 7 5 22
Hong Kong 7 7 7 21
Hong Kong 7 7 7 21
Australia 7 10 10 27
Australia 7 10 7 24
I need to find out the average total score by country. How should I calculate that? Or maybe someone can just tell me the formula for it. Sorry, I am not good at Excel and math :\
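You can use the AVERAGEIF function in Excel.
In the cell below all the totals do:
=AVERAGEIF(range, criteria, [average_range])
Here range is the Country column, criteria is the country to match, and average_range is the Total column. For example, assuming the table occupies A1:E10 with the headers in row 1, =AVERAGEIF(A2:A10,"Australia",E2:E10) would average Australia's totals.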
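For readers who have the sheet loaded in pandas rather than Excel, the same per-country average can be cross-checked with a groupby. This is only a sketch that retypes the Country and Total columns from the question by hand:

```python
import pandas as pd

scores = pd.DataFrame({
    "Country": ["Australia", "Australia", "Hong Kong", "Japan", "Australia",
                "Hong Kong", "Hong Kong", "Australia", "Australia"],
    "Total": [30, 24, 21, 10, 22, 21, 21, 27, 24],
})

# Average of the Total column per country (the AVERAGEIF equivalent)
average_total = scores.groupby("Country")["Total"].mean()
print(average_total)
```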

EXCEL - CountIF per category

I have this:
1 A B C
2 Country Value Valid
3 Sweden 10 0
4 Sweden 5 1
5 Sweden 1 1
6 Norway 5 1
7 Norway 5 1
8 Germany 12 1
9 Germany 2 1
10 Germany 3 1
11 Germany 1 0
I want to fill in B15 to D17 in table below with number of valid values (a 1 in column C) per country and value range:
A B C D
13 Value count
14 0 to 3 4 to 7 above 7
15 Sweden 1 1 0
16 Norway 0 2 0
17 Germany 3 0 1
I have tried IF combined with COUNTIF but I can't figure it out.
What would the formula be for cell B15?
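The formula you are looking for is this:
=COUNTIFS($A$3:$A$11,$A15,$C$3:$C$11,1,$B$3:$B$11,"<4")
All ranges must cover the same rows (3 to 11), and the country to match is read from column A of the result row.
For the middle column, replace the last criterion with $B$3:$B$11,">3",$B$3:$B$11,"<8" so that it counts only values from 4 to 7; for the last column, use $B$3:$B$11,">7".
Note: Germany will be 2 in the first column, not 3, because Valid in the last row is 0.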
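As a cross-check of the expected counts outside Excel, the same table can be reproduced in pandas. This is a sketch that retypes the question's data by hand; the bucket labels are taken from the question's header row.

```python
import pandas as pd

data = pd.DataFrame({
    "Country": ["Sweden"] * 3 + ["Norway"] * 2 + ["Germany"] * 4,
    "Value": [10, 5, 1, 5, 5, 12, 2, 3, 1],
    "Valid": [0, 1, 1, 1, 1, 1, 1, 1, 0],
})

# Keep only valid rows (the Valid = 1 criterion)
valid = data[data["Valid"] == 1]

# Bucket the values: up to 3, 4 to 7, above 7 (right edges are inclusive)
buckets = pd.cut(
    valid["Value"],
    bins=[float("-inf"), 3, 7, float("inf")],
    labels=["0 to 3", "4 to 7", "above 7"],
)

# Count valid rows per country and bucket
counts = pd.crosstab(valid["Country"], buckets)
print(counts)
```

The Germany row comes out as 2, 0, 1, which agrees with the note above.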
