I have a dataframe:
import pandas as pd
d = {
'Country': ["Austria", "Austria", "Belgium", "USA", "USA", "USA", "USA"],
'Number2020': [15, None, 18, 20, 22, None, 30],
'Number2021': [20, 25, 18, None, None, None, 32],
}
df = pd.DataFrame(data=d)
df
Country Number2020 Number2021
0 Austria 15.0 20.0
1 Austria NaN 25.0
2 Belgium 18.0 18.0
3 USA 20.0 NaN
4 USA 22.0 NaN
5 USA NaN NaN
6 USA 30.0 32.0
and I want to count the NaN values for each country, e.g.
Country Count_nans
Austria 1
USA 4
I have filtered the dataframe to keep only the rows with NaNs.
df_nan = df[df.Number2021.isna() | df.Number2020.isna()]
Country Number2020 Number2021
1 Austria NaN 25.0
3 USA 20.0 NaN
4 USA 22.0 NaN
5 USA NaN NaN
So it looks like a groupby operation? I have tried this.
nasum2021 = df_nan['Number2021'].isna().sum()
df_nan['countNames2021'] = df_nan.groupby(['Number2021'])['Number2021'].transform('count').fillna(nasum2021)
df_nan
It gives me 1 NaN for Austria but 3 for the USA, while it should be 4, so that is not right.
In my real dataframe, I have some 10 years and around 30 countries. Thank you!
Solution for processing all columns except Country: first convert it to the index, test for missing values, aggregate the sum per country, and finally sum across the columns:
s = df.set_index('Country').isna().groupby('Country').sum().sum(axis=1)
print (s)
Country
Austria 1
Belgium 0
USA 4
dtype: int64
If you need to remove 0 values, add boolean indexing:
s = s[s.ne(0)]
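If you want the exact two-column layout from the question (Country, Count_nans), a minimal follow-up sketch (out is a hypothetical name):
out = s[s.ne(0)].rename('Count_nans').reset_index()
print (out)
   Country  Count_nans
0  Austria           1
1      USA           4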
You could use:
df.filter(like='Number').isna().sum(axis=1).groupby(df['Country']).sum()
output:
Country
Austria 1
Belgium 0
USA 4
dtype: int64
or, filtering the rows with NaN first, to count only the rows that contain at least one NaN per country:
df[df.filter(like='Number').isna().any(axis=1)].groupby('Country')['Country'].count()
output:
Country
Austria 1
USA 3
Name: Country, dtype: int64
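Note that this second variant counts rows that contain at least one NaN (hence 3 for USA), not the total number of NaNs (4). To get the NaN totals restricted to countries that actually have missing values, a small sketch combining both ideas (counts is a hypothetical name):
counts = df.filter(like='Number').isna().sum(axis=1).groupby(df['Country']).sum()
counts[counts.gt(0)]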
You could use pandas.DataFrame.agg along with pandas.DataFrame.isna:
>>> df.groupby('Country').agg(lambda x: x.isna().sum()).sum(axis=1)
Country
Austria 1
Belgium 0
USA 4
dtype: int64
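For the real data (some 10 years and around 30 countries), the year columns can be selected by pattern so that Country stays out of the aggregation; a minimal sketch, assuming the columns all follow the NumberYYYY naming:
year_cols = df.filter(regex=r'^Number\d{4}$').columns
df[year_cols].isna().groupby(df['Country']).sum().sum(axis=1)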
Use:
df.groupby('Country').apply(lambda x: x.isna().sum().sum())
Output:
Country
Austria    1
Belgium    0
USA        4
dtype: int64
I have a data frame where I need a totals row at the end: the count of non-null values for the string columns, the sum of one column, and the mean of another column.
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'Name': ['John', 'Tom', 'Tom', 'Ole', 'Ole', 'Tom'],
                    'To_Count': ['Yes', 'Yes', 'Yes', 'No', np.nan, np.nan],
                    'To_Count1': ['Yes', 'Yes', 'Yes', 'No', np.nan, np.nan],
                    'To_Sum': [100, 200, 300, 500, 600, 400],
                    'To_Avg': [100, 200, 300, 500, 600, 400],
                    })
This is the code I use to get this result:
df2.loc["Totals",'To_Count':'To_Count1'] = df2.loc[:,'To_Count':'To_Count1'].count(axis=0)
df2.loc["Totals",'To_Sum'] = df2.loc[:,'To_Sum'].sum(axis=0)
df2.loc["Totals",'To_Avg'] = df2.loc[:,'To_Avg'].mean(axis=0)
However, if I accidentally run this code again, the values get duplicated.
Is there a better way to get this result?
Expected result:
Use DataFrame.agg with dictionary:
df2.loc["Totals"] = df2.agg({'To_Sum': 'sum',
'To_Avg': 'mean',
'To_Count': 'count',
'To_Count1':'count'})
print (df2)
Name To_Count To_Count1 To_Sum To_Avg
0 John Yes Yes 100.0 100.0
1 Tom Yes Yes 200.0 200.0
2 Tom Yes Yes 300.0 300.0
3 Ole No No 500.0 500.0
4 Ole NaN NaN 600.0 600.0
5 Tom NaN NaN 400.0 400.0
Totals NaN 4.0 4.0 2100.0 350.0
A more dynamic solution, if there are many columns between To_Count and To_Count1:
d = dict.fromkeys(df2.loc[:, 'To_Count':'To_Count1'].columns, 'count')
print (d)
df2.loc["Totals"] = df2.agg({**{'To_Sum': 'sum', 'To_Avg': 'mean'}, **d})
print (df2)
print (df2)
Name To_Count To_Count1 To_Sum To_Avg
0 John Yes Yes 100.0 100.0
1 Tom Yes Yes 200.0 200.0
2 Tom Yes Yes 300.0 300.0
3 Ole No No 500.0 500.0
4 Ole NaN NaN 600.0 600.0
5 Tom NaN NaN 400.0 400.0
Totals NaN 4.0 4.0 2100.0 350.0
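To make an accidental re-run safe (the duplication concern in the question), one option is to drop any existing Totals row before aggregating again; a minimal sketch:
df2 = df2.drop(index='Totals', errors='ignore')
df2.loc["Totals"] = df2.agg({**{'To_Sum': 'sum', 'To_Avg': 'mean'}, **d})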
Given two dataframes as follow:
df1:
id address price
0 1 8563 Parker Ave. Lexington, NC 27292 3
1 2 242 Bellevue Lane Appleton, WI 54911 3
2 3 771 Greenview Rd. Greenfield, IN 46140 5
3 4 93 Hawthorne Street Lakeland, FL 33801 6
4 5 8952 Green Hill Street Gettysburg, PA 17325 3
5 6 7331 S. Sherwood Dr. New Castle, PA 16101 4
df2:
state street quantity
0 PA S. Sherwood 12
1 IN Hawthorne Street 3
2 NC Parker Ave. 7
Let's say that if both state and street from df2 are contained in address from df1, then merge df2 into df1.
How could I do that in Pandas? Thanks.
The expected result df:
id address ... street quantity
0 1 8563 Parker Ave. Lexington, NC 27292 ... Parker Ave. 7.00
1 2 242 Bellevue Lane Appleton, WI 54911 ... NaN NaN
2 3 771 Greenview Rd. Greenfield, IN 46140 ... NaN NaN
3 4 93 Hawthorne Street Lakeland, FL 33801 ... NaN NaN
4 5 8952 Green Hill Street Gettysburg, PA 17325 ... NaN NaN
5 6 7331 S. Sherwood Dr. New Castle, PA 16101 ... S. Sherwood 12.00
[6 rows x 6 columns]
My testing code:
df2['addr'] = df2['state'].astype(str) + df2['street'].astype(str)
pat = '|'.join(r'\b{}\b'.format(x) for x in df2['addr'])
df1['addr'] = df1['address'].str.extract('(' + pat + ')', expand=False)
df = df1.merge(df2, on='addr', how='left')
Output:
id address ... street_y quantity_y
0 1 8563 Parker Ave. Lexington, NC 27292 ... NaN nan
1 2 242 Bellevue Lane Appleton, WI 54911 ... NaN nan
2 3 771 Greenview Rd. Greenfield, IN 46140 ... NaN nan
3 4 93 Hawthorne Street Lakeland, FL 33801 ... NaN nan
4 5 8952 Green Hill Street Gettysburg, PA 17325 ... NaN nan
5 6 7331 S. Sherwood Dr. New Castle, PA 16101 ... NaN nan
[6 rows x 10 columns]
TRY:
import numpy as np

# build alternation patterns from the known states and streets
pat_state = f"({'|'.join(df2['state'])})"
pat_street = f"({'|'.join(df2['street'])})"

# extract the street and state pieces from the address
df1['street'] = df1['address'].str.extract(pat=pat_street)
df1['state'] = df1['address'].str.extract(pat=pat_state)

# keep the extracted values only when both pieces were found
df1.loc[df1['state'].isna(), 'street'] = np.nan
df1.loc[df1['street'].isna(), 'state'] = np.nan

df1 = df1.merge(df2, on=['state', 'street'], how='left')
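One design note: the street values contain regex metacharacters (the '.' in 'Parker Ave.'), so it is safer to escape them before building the alternation patterns; a small sketch using re.escape:
import re

pat_street = f"({'|'.join(map(re.escape, df2['street']))})"
pat_state = f"({'|'.join(map(re.escape, df2['state']))})"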
k="|".join(df2['street'].to_list())
df1=df1.assign(temp=df1['address'].str.findall(k).str.join(', '), temp1=df1['address'].str.split(",").str[-1])
dfnew=pd.merge(df1,df2, how='left', left_on=['temp','temp1'], right_on=['street',"state"])
I have the following dataframe:
data = pd.DataFrame({
'ID': [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6],
'Date_Time': ['2010-01-01 12:01:00', '2010-01-01 01:27:33',
'2010-04-02 12:01:00', '2010-04-01 07:24:00', '2011-01-01 12:01:00',
'2011-01-01 01:27:33', '2013-01-01 12:01:00', '2014-01-01 12:01:00',
'2014-01-01 01:27:33', '2015-01-01 01:27:33', '2016-01-01 01:27:33',
'2011-01-01 01:28:00'],
'order': [2, 4, 5, 6, 7, 8, 9, 2, 3, 5, 6, 8],
'sort': [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0]})
And I would like to get the following columns:
1- sum_order_total_1, which sums the values of the order column per ID for the rows where sort is 1, and returns NaN for the rows where sort is 0
2- sum_order_total_0, which sums the values of the order column per ID for the rows where sort is 0, and returns NaN for the rows where sort is 1
3- count_order_date_1, which sums the values of the order column per ID and Date_Time for the rows where sort is 1, and returns NaN for the rows where sort is 0
4- count_order_date_0, which sums the values of the order column per ID and Date_Time for the rows where sort is 0, and returns NaN for the rows where sort is 1
The expected results should look like the attached photo here:
The problem with groupby (and pd.pivot_table) is that they only do half of the job. They give you the numbers, but not in the format that you want. To finalize the format you can use apply.
For the total counts I used:
# Retrieve your data, similar to the groupby query you provided.
import numpy as np

df = data  # the question's DataFrame is named `data`
data_total = pd.pivot_table(df, values='order', index=['ID'], columns=['sort'], aggfunc=np.sum)
data_total.reset_index(inplace=True)
Which results in the table:
sort ID 0 1
0 1 6.0 11.0
1 2 15.0 NaN
2 3 NaN 9.0
3 4 3.0 2.0
4 5 5.0 NaN
5 6 8.0 6.0
Now, using this as a lookup ('ID' plus 0 or 1 for the sort), we can write a small function that fills in the right value:
def filter_count(data, row, sort_value):
""" Select the count that belongs to the correct ID and sort combination. """
if row['sort'] == sort_value:
return data[data['ID'] == row['ID']][sort_value].values[0]
return np.NaN
# Applying the above function for both sort values 0 and 1.
df['total_0'] = df.apply(lambda row: filter_count(data_total, row, 0), axis=1, result_type='expand')
df['total_1'] = df.apply(lambda row: filter_count(data_total, row, 1), axis=1, result_type='expand')
This leads to:
ID Date_Time order sort total_1 total_0
0 1 2010-01-01 12:01:00 2 1 11.0 NaN
1 1 2010-01-01 01:27:33 4 1 11.0 NaN
2 1 2010-04-02 12:01:00 5 1 11.0 NaN
3 1 2010-04-01 07:24:00 6 0 NaN 6.0
4 2 2011-01-01 12:01:00 7 0 NaN 15.0
5 2 2011-01-01 01:27:33 8 0 NaN 15.0
6 3 2013-01-01 12:01:00 9 1 9.0 NaN
7 4 2014-01-01 12:01:00 2 1 2.0 NaN
8 4 2014-01-01 01:27:33 3 0 NaN 3.0
9 5 2015-01-01 01:27:33 5 0 NaN 5.0
10 6 2016-01-01 01:27:33 6 1 6.0 NaN
11 6 2011-01-01 01:28:00 8 0 NaN 8.0
Now we can apply the same logic to the date, except that the date also contains hours, minutes and seconds, which can be stripped out using:
# Since we are interested in per-day values, we remove the hour/minute/second part
df['order_day'] = pd.to_datetime(df['Date_Time']).dt.strftime('%Y/%m/%d')
Now applying the same trick as above, we create a new pivot table, based on the 'ID' and 'order_day':
data_date = pd.pivot_table(df, values='order', index=['ID', 'order_day'], columns=['sort'], aggfunc=np.sum)
data_date.reset_index(inplace=True)
Which is:
sort ID order_day 0 1
0 1 2010/01/01 NaN 6.0
1 1 2010/04/01 6.0 NaN
2 1 2010/04/02 NaN 5.0
3 2 2011/01/01 15.0 NaN
4 3 2013/01/01 NaN 9.0
5 4 2014/01/01 3.0 2.0
6 5 2015/01/01 5.0 NaN
7 6 2011/01/01 8.0 NaN
Writing a second function to fill in the correct value based on 'ID' and 'date':
def filter_date(data, row, sort_value):
if row['sort'] == sort_value:
return data[(data['ID'] == row['ID']) & (data['order_day'] == row['order_day'])][sort_value].values[0]
return np.NaN
# Applying the above function for both sort values 0 and 1.
df['date_1'] = df.apply(lambda row: filter_date(data_date, row, 1), axis=1, result_type='expand')
df['date_0'] = df.apply(lambda row: filter_date(data_date, row, 0), axis=1, result_type='expand')
Now we only have to drop the temporary column 'order_day':
df.drop(labels=['order_day'], axis=1, inplace=True)
And the final answer becomes:
ID Date_Time order sort total_1 total_0 date_0 date_1
0 1 2010-01-01 12:01:00 2 1 11.0 NaN NaN 6.0
1 1 2010-01-01 01:27:33 4 1 11.0 NaN NaN 6.0
2 1 2010-04-02 12:01:00 5 1 11.0 NaN NaN 5.0
3 1 2010-04-01 07:24:00 6 0 NaN 6.0 6.0 NaN
4 2 2011-01-01 12:01:00 7 0 NaN 15.0 15.0 NaN
5 2 2011-01-01 01:27:33 8 0 NaN 15.0 15.0 NaN
6 3 2013-01-01 12:01:00 9 1 9.0 NaN NaN 9.0
7 4 2014-01-01 12:01:00 2 1 2.0 NaN NaN 2.0
8 4 2014-01-01 01:27:33 3 0 NaN 3.0 3.0 NaN
9 5 2015-01-01 01:27:33 5 0 NaN 5.0 5.0 NaN
10 6 2016-01-01 01:27:33 6 1 6.0 NaN NaN 6.0
11 6 2011-01-01 01:28:00 8 0 NaN 8.0 8.0 NaN
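For what it is worth, a more concise sketch that should reproduce the same four columns without the helper functions (assuming the column names used above):
# per-ID totals, split by sort
totals = df.groupby(['ID', 'sort'])['order'].transform('sum')
df['total_1'] = totals.where(df['sort'].eq(1))
df['total_0'] = totals.where(df['sort'].eq(0))

# per-ID, per-day totals, split by sort
day = pd.to_datetime(df['Date_Time']).dt.normalize()
date_totals = df.groupby(['ID', day, 'sort'])['order'].transform('sum')
df['date_1'] = date_totals.where(df['sort'].eq(1))
df['date_0'] = date_totals.where(df['sort'].eq(0))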
Suppose that I have a dataframe:
>>> import pandas as pd
>>> import numpy as np
>>> rand = np.random.RandomState(42)
>>> data_points = 10
>>> dates = pd.date_range('2020-01-01', periods=data_points, freq='D')
>>> state_city = [('USA', 'Washington'), ('France', 'Paris'), ('Germany', 'Berlin')]
>>>
>>> df = pd.DataFrame()
>>> for _ in range(data_points):
... state, city = state_city[rand.choice(len(state_city))]
... df_row = pd.DataFrame(
... {
... 'time' : rand.choice(dates),
... 'state': state,
... 'city': city,
... 'val1': rand.randint(0, data_points),
... 'val2': rand.randint(0, data_points)
... }, index=[0]
... )
...
... df = pd.concat([df, df_row], ignore_index=True)
...
>>> df = df.sort_values(['time', 'state', 'city']).reset_index(drop=True)
>>> df.loc[rand.randint(0, data_points, size=rand.randint(1, 3)), ['state']] = pd.NA
>>> df.loc[rand.randint(0, data_points, size=rand.randint(1, 3)), ['city']] = pd.NA
>>> df.val1 = df.val1.where(df.val1 < 5, pd.NA)
>>> df.val2 = df.val2.where(df.val2 < 5, pd.NA)
>>>
>>> df
time state city val1 val2
0 2020-01-03 USA Washington 4 2
1 2020-01-04 France <NA> <NA> 1
2 2020-01-04 Germany Berlin <NA> 4
3 2020-01-05 Germany Berlin <NA> <NA>
4 2020-01-06 France Paris 1 4
5 2020-01-06 Germany Berlin 4 1
6 2020-01-08 Germany Berlin 4 3
7 2020-01-10 Germany Berlin 2 <NA>
8 2020-01-10 <NA> Washington <NA> <NA>
9 2020-01-10 <NA> Washington 2 <NA>
>>>
As you can see, there are some missing values. I would like to impute the state/city values as much as I can. In order to do so, I will generate a dataframe that can help:
>>> known_state_city = df[['state', 'city']].dropna().drop_duplicates()
>>> known_state_city
state city
0 USA Washington
2 Germany Berlin
4 France Paris
OK, now we have all state/city combinations.
How do I use the known_state_city dataframe to fill in an empty state when the city is known?
I can find empty states that have populated city with:
>>> df.loc[df.state.isna() & df.city.notna(), 'city']
8 Washington
9 Washington
Name: city, dtype: object
But how do I replace Washington with the state from known_state_city, without breaking the index values (8 and 9), so that I can update df.state?
And if I don't have all combinations in known_state_city, how do I update state in df with what I do have?
We can do fillna with map twice:
# fill empty state
df['state'] = df['state'].fillna(df['city'].map(known_state_city.set_index('city')['state']))
# fill empty city
df['city'] = df['city'].fillna(df['state'].map(known_state_city.set_index('state')['city']))
Output:
time state city val1 val2
0 2020-01-03 USA Washington 4.0 2.0
1 2020-01-04 France Paris NaN 1.0
2 2020-01-04 Germany Berlin NaN 4.0
3 2020-01-05 Germany Berlin NaN NaN
4 2020-01-06 France Paris 1.0 4.0
5 2020-01-06 Germany Berlin 4.0 1.0
6 2020-01-08 Germany Berlin 4.0 3.0
7 2020-01-10 Germany Berlin 2.0 NaN
8 2020-01-10 USA Washington NaN NaN
9 2020-01-10 USA Washington 2.0 NaN
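Regarding the second question (an incomplete known_state_city): the map-based fill is safe with a partial lookup table, because Series.map returns NaN for keys it does not know and fillna leaves those positions untouched. A quick sketch to see which rows remain unresolved afterwards (unresolved is a hypothetical name):
unresolved = df[df['state'].isna() | df['city'].isna()]
print(unresolved)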
I want to merge the content of rows only where some specific conditions are met.
Here is the test dataframe I am working on:
Date Desc Debit Credit Bal
0 04-08-2019 abcdef 45654 NaN 345.0
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.06
4 06-08-2019 xyz NaN 350.0 695.06
in which I want to join the rows where Date is NaN onto the previous row.
Output required:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654 NaN 345.06
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.0
4 06-08-2019 xyz NaN 350.0 695.06
Could anybody help me out with this? I have tried the following:
for j in [x for x in range(lst[0], lst[-1]+1) if x not in lst]:
print (test.loc[j-1:j, ].apply(lambda x: ''.join(str(x)), axis=1))
But could not get the expected result.
You can use:
d = df["Date"].ffill()
df.update(df.groupby(d).transform('sum'))
print(df)
Output:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 0.0 351.0
1 NaN abcdefjklmn 45654.0 0.0 351.0
2 05-08-2019 abd 45.0 0.0 345.0
3 06-08-2019 xyz 0.0 345.0 54645.0
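Note that transform('sum') aggregates every column within each date group, so the numeric columns are added too (that is why Bal shows 351.0 above). If only Desc should be concatenated, while each surviving row keeps its own numbers, one possible sketch (out is a hypothetical name):
d = df['Date'].ffill()
out = (df.assign(Desc=df.groupby(d)['Desc'].transform(''.join))
         .dropna(subset=['Date'])
         .reset_index(drop=True))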
idx = test.loc[test["Date"].isna()].index
test.loc[idx-1, "Desc"] = test.loc[idx-1]["Desc"].str.cat(test.loc[idx]["Desc"])
test.loc[idx-1, "Bal"] = (test.loc[idx-1]["Bal"].astype(str)
.str.cat(test.loc[idx]["Bal"].astype(str)))
## I tried to add two values but it didn't work as expected, giving 351.0
# test.loc[idx-1, "Bal"] = test.loc[idx-1]["Bal"].values + test.loc[idx]["Bal"].values
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 NaN 345.06.0
1 NaN jklmn NaN NaN 6
2 05-08-2019 abd 45.0 NaN 345
3 06-08-2019 xyz NaN 345.0 54645