Drop groups by multiple columns if a specific value does not exist in another column in Pandas - python-3.x

How can I drop a whole (city, district) group if the date value 2018/11/1 does not exist for it in the following dataframe:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
3 b d 2018/9/1 3
4 b d 2018/10/1 7
The expected result will look like this:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Thank you!

Create a helper column with DataFrame.assign, compare the dates, and test whether at least one value per group is True with GroupBy.transform('any'); then filter by boolean indexing:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city', 'district'])['new']
          .transform('any'))
df = df[mask]
print(df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
If an error is raised because of missing values in the mask, one possible idea is to replace missing values in the columns used for grouping:
mask = (df.assign(new=df['date'].eq('2018/11/1'),
                  city=df['city'].fillna(-1),
                  district=df['district'].fillna(-1))
          .groupby(['city', 'district'])['new']
          .transform('any'))
df = df[mask]
print(df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Another idea is to add possibly missing index values with reindex and also replace missing values with False:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city', 'district'])['new']
          .transform('any'))
df = df[mask.reindex(df.index, fill_value=False).fillna(False)]
print(df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5

There's a special GroupBy.filter() method for this. Assuming date is already datetime:
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())
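If date is still stored as strings, a minimal sketch converting it first (assuming the yyyy/m/d format shown in the sample data):
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())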

Related

pandas remove records conditionally based on record count of groups

I have a dataframe like this:
import pandas as pd
import numpy as np
raw_data = {'Country': ['UK'] * 21,
            'Product': ['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
            'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
            'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]}
df2 = pd.DataFrame(raw_data, columns=['Country','Product','Week','val'])
print(df2)
and a mapping dataframe:
mapping = pd.DataFrame({'Product':['A','C'],'Product1':['B','D']}, columns = ['Product','Product1'])
I want to compare products as per the mapping: product A's data should match with product B's data. The logic is: product A has 4 records, so product B should also keep 4 records, and those records should come from the weeks before and after product A's last week number, including that last week. So for A's last week 4, that means one week before (week 3), week 4 itself, and two weeks after (weeks 5 and 6).
Similarly, product C has 3 records, so product D should also keep 3 records, taken from the weeks around product C's last week number. Product C's last week is 3, so product D keeps weeks 2, 3 and 4.
The wanted dataframe will be like below; I want to remove the other (highlighted) records.
Define the following function, which selects rows from df for the products in the current row of mapping:
def selRows(row, df):
    # rows of the "source" product and their count
    rows_1 = df[df.Product == row.Product]
    nr_1 = rows_1.index.size
    # last week number of the source product
    lastWk_1 = rows_1.Week.iat[-1]
    # rows of the mapped product, starting one week before that last week,
    # limited to the same number of rows
    rows_2 = df[df.Product.eq(row.Product1) & df.Week.ge(lastWk_1 - 1)].iloc[:nr_1]
    return pd.concat([rows_1, rows_2])
Then call it the following way:
result = pd.concat([selRows(row, grp)
                    for _, grp in df2.groupby(['Country'])
                    for _, row in mapping.iterrows()])
The list comprehension above creates a list of DataFrames, the results of calling selRows on:
each group of rows from df2, for consecutive countries (the outer loop),
each row from mapping (the inner loop).
Then concat concatenates all of them into a single DataFrame.
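If you need the rows back in their original order, one possible follow-up is to sort by the preserved index labels:
result = result.sort_index()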
Solution: first create a mapped column from the mapping DataFrame, then build dictionaries with the group length and the last (maximal) Week value per Country and Product group:
df2['mapp'] = df2['Product'].map(mapping.set_index('Product1')['Product'])
df1 = df2.groupby(['Country','Product'])['Week'].agg(['max','size'])
# subtract 1 for the value one week before the last
dprev = df1['max'].sub(1).to_dict()
dlen = df1['size'].to_dict()
print(dlen)
{('UK', 'A'): 4, ('UK', 'B'): 8, ('UK', 'C'): 3, ('UK', 'D'): 6}
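For reference, dprev built from the same sample data would be:
print(dprev)
{('UK', 'A'): 3, ('UK', 'B'): 7, ('UK', 'C'): 2, ('UK', 'D'): 5}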
Then map the tuples of group keys through the dictionary with Series.map and filter out rows with smaller Week values, then limit each group to the required length from the second dictionary with DataFrame.head:
df3 = (df2[df2[['Country','mapp']].apply(tuple, axis=1).map(dprev) <= df2['Week']]
          .groupby(['Country','mapp'])
          .apply(lambda x: x.head(dlen.get(x.name))))
print(df3)
                Country Product  Week  val mapp
Country mapp
UK      A    6       UK       B     3    7    A
             7       UK       B     4    8    A
             8       UK       B     5    9    A
             9       UK       B     6   10    A
        C    16      UK       D     2    6    C
             17      UK       D     3    7    C
             18      UK       D     4    8    C
Then keep the original rows whose Product is not in mapping['Product1'], append the new df3 and sort:
df = (df2[~df2['Product'].isin(mapping['Product1'])]
        .append(df3, ignore_index=True)
        .sort_values(['Country','Product'])
        .drop('mapp', axis=1))
print(df)
Country Product Week val
0 UK A 1 5
1 UK A 2 4
2 UK A 3 3
3 UK A 4 1
7 UK B 3 7
8 UK B 4 8
9 UK B 5 9
10 UK B 6 10
4 UK C 1 5
5 UK C 2 5
6 UK C 3 5
11 UK D 2 6
12 UK D 3 7
13 UK D 4 8
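Note that DataFrame.append was removed in pandas 2.0; a sketch of the same step with pd.concat instead:
df = (pd.concat([df2[~df2['Product'].isin(mapping['Product1'])], df3],
                ignore_index=True)
        .sort_values(['Country','Product'])
        .drop('mapp', axis=1))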

If a specific value/string occurs in the entire dataframe I want to sum its index values

I have a dataframe in which I need to find a specific image name and sum its index value every time it is found. So my dataframe looks like:
c                   1                 2                 3                 4
g
0     180731-1-61.jpg   180731-1-61.jpg   180731-1-61.jpg   180731-1-61.jpg
1    1209270004-2.jpg   180609-2-31.jpg  1209270004-2.jpg  1209270004-2.jpg
2    1209270004-1.jpg   180414-2-38.jpg   180707-1-31.jpg  1209050002-1.jpg
3    1708260004-1.jpg  1209270004-2.jpg   180609-2-31.jpg  1209270004-1.jpg
4    1108220001-5.jpg  1209270004-1.jpg  1108220001-5.jpg  1108220001-2.jpg
I need to find 1209270004-2.jpg in the entire dataframe. It is found three times at index 1 and once at index 3, so I want to add the index values, which should give 1+3+1+1 = 6.
I tried the code:
img_fname = '1209270004-2.jpg'
df2 = df1[df1.eq(img_fname).any(1)]
sum = int(np.sum(df2.index.values))
print(sum)
I am getting a sum of 4, i.e. 1+3 = 4, but it should be 6.
There is a second requirement: if the string occurs in only some of the columns, e.g. 180707-1-31.jpg occurs only in column 3, then the sum should be 45+45+3+45 = 138. That is, if the string is not present in a column, take the value 45 instead of the index value.
You can multiply the boolean mask by the index values and then sum per column:
img_fname = '1209270004-1.jpg'
s = df1.eq(img_fname).mul(df1.index.to_series(), axis=0).sum()
print(s)
1 2
2 4
3 0
4 3
dtype: int64
out = np.where(s == 0, 45, s).sum()
print(out)
54
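For the original question (the expected total of 6 for 1209270004-2.jpg), the same per-column sums can simply be summed once more:
img_fname = '1209270004-2.jpg'
total = int(df1.eq(img_fname).mul(df1.index.to_series(), axis=0).sum().sum())
print(total)
6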
If the dataset does not have many columns, this can also work for your original question:
df1 = pd.DataFrame({"A": ["aa","ab","cd","ab","aa"], "B": ["ab","ab","ab","aa","ab"]})
s = 0
for i in df1.columns:
    s = s + sum(df1.index[df1.loc[:, i] == "ab"].tolist())
Input:
A B
0 aa ab
1 ab ab
2 cd ab
3 ab aa
4 aa ab
Output: 11
Based on the second requirement (take 45 for a column where the string does not occur), a small sketch extending this loop approach:
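img_fname = 'ab'
s = 0
for i in df1.columns:
    idx = df1.index[df1.loc[:, i] == img_fname].tolist()
    # per the question: a column without any occurrence contributes 45
    s = s + (sum(idx) if idx else 45)
print(s)
# with this sample both columns contain 'ab', so s is still 11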

Fill Null values in Data-Frame with Column names

I have a dataframe with 55 columns and 2 million rows, with a mix of categorical and numeric fields. There are null/NA values in the dataset. I want to fill the null values with the column names.
The data-set I have is:
A B C D .....
1 na na 3 .....
na 3 4 na .....
........................
The output the I am trying to get is:
A B C D .....
1 B C 3 .....
A 3 4 D .....
........................
I am trying to use:
df.fillna(method='ffill')
Is there another way?
Python:3.6.5
Use DataFrame.fillna with columns converted to Series by Index.to_series:
df = df.fillna(df.columns.to_series())
print(df)
A B C D
0 1 B C 3
1 A 3 4 D
EDIT: If there are categorical columns in the DataFrame, select these columns and append the not-yet-existing values with cat.add_categories (filling a categorical column with a value outside its categories raises an error):
for c in df.select_dtypes('category'):
    df[c] = df[c].cat.add_categories(c)
df = df.fillna(df.columns.to_series())
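A minimal sketch of why the extra step is needed, using a hypothetical single categorical column:
df = pd.DataFrame({'A': pd.Categorical(['x', None, 'y'])})
# filling with 'A' directly would fail, because 'A' is not among the categories
df['A'] = df['A'].cat.add_categories('A')
df = df.fillna(df.columns.to_series())
print(df)
   A
0  x
1  A
2  y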

Pandas Pivot Table Conditional Counting

I have a simple dataframe:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0]})
df
id value
0 a 0
1 a 15
2 a 20
3 b 30
4 b 0
And I want a pivot table with the number of values greater than zero.
I tried this:
raw = pd.pivot_table(df, index='id',values='value',aggfunc=lambda x:len(x>0))
But it returned this (len(x>0) counts all rows, because the comparison only builds a boolean mask and does not change the length):
value
id
a 3
b 2
What I need:
value
id
a 2
b 1
I read lots of solutions with groupby and filter. Is it possible to achieve this only with the pivot_table command? If not, what is the best approach?
Thanks in advance.
UPDATE
Just to make it clearer why I am avoiding the filter solution: in my real and more complex df, I have other columns, like this:
df = pd.DataFrame({'id': ['a','a','a','b','b'],'value':[0,15,20,30,0],'other':[2,3,4,5,6]})
df
id other value
0 a 2 0
1 a 3 15
2 a 4 20
3 b 5 30
4 b 6 0
I need to sum the column 'other' over all of the rows, but when I filter first I get this:
df=df[df['value']>0]
raw = pd.pivot_table(df, index='id',values=['value','other'],aggfunc={'value':len,'other':sum})
other value
id
a 7 2
b 5 1
Instead of:
other value
id
a 9 2
b 11 1
Use sum to count the True values created by the condition x>0:
raw = pd.pivot_table(df, index='id', values='value', aggfunc=lambda x: (x > 0).sum())
print(raw)
value
id
a 2
b 1
As #Wen mentioned, another solution is:
df = df[df['value'] > 0]
raw = pd.pivot_table(df, index='id',values='value',aggfunc=len)
You can filter the dataframe before pivoting:
pd.pivot_table(df.loc[df['value']>0], index='id',values='value',aggfunc='count')
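For the UPDATE case (summing 'other' over all rows while counting only the positive values), one possible sketch that stays within pivot_table is a per-column aggfunc dict; the column names follow the example data:
raw = pd.pivot_table(df, index='id', values=['other', 'value'],
                     aggfunc={'other': 'sum', 'value': lambda x: (x > 0).sum()})
print(raw)
    other  value
id
a       9      2
b      11      1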

Get column names from pandas DataFrame in format dtype:object

I have a question similar to the one in the linked post. Instead of returning the column names in a list, I want the column names in the format dtype: object.
For example,
A
B
C
D
Name: x, dtype: object
I am using an Excel file in xlsx format.
Link: Get list from pandas DataFrame column headers
I think you first need read_excel to create df, and then the Series constructor or Index.to_series to build a Series from the column names:
df = pd.DataFrame({'A': [1,2,3],
                   'B': [4,5,6],
                   'C': [7,8,9],
                   'D': [1,3,5]})
print(df)
A B C D
0 1 4 7 1
1 2 5 8 3
2 3 6 9 5
s = pd.Series(df.columns.values, name='x')
print(s)
0 A
1 B
2 C
3 D
Name: x, dtype: object
s1 = df.columns.to_series().rename('x')
print(s1)
A A
B B
C C
D D
Name: x, dtype: object
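Since the data comes from an xlsx file, a minimal sketch of the full flow (the file name and sheet are placeholders):
df = pd.read_excel('data.xlsx', sheet_name=0)
s = pd.Series(df.columns.values, name='x')
print(s)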
