Remove exact rows and frequency of rows of a DataFrame where certain column values match column values of another DataFrame in Python 3

Consider the following two DataFrames created using pandas in Python 3:
import pandas as pd
a1 = pd.DataFrame({'NO': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'],
                   'A': [1, 2, 3, 4, 5, 2, 4, 2],
                   'B': ['a', 'b', 'c', 'd', 'e', 'b', 'd', 'b']})
a2 = pd.DataFrame({'NO': ['d9', 'd10', 'd11', 'd12'],
                   'A': [1, 2, 3, 2],
                   'B': ['a', 'b', 'c', 'b']})
I would like to remove the rows of a1 that also appear in a2, matching on the values of columns 'A' and 'B' (ignoring the 'NO' column) and removing each match only as many times as it occurs in a2, so that the result should be:
A B NO
4 d d4
5 e d5
4 d d7
2 b d8
Is there any built-in function in pandas or any other library in python 3 to get this result?
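One possible approach, offered as a sketch rather than a dedicated built-in: number the repeated (A, B) pairs in each frame with groupby().cumcount(), then do an anti-join with merge(..., indicator=True). Each matching row of a1 is then removed only as many times as it occurs in a2. The helper column name occ is an assumption introduced purely for illustration.

```python
import pandas as pd

a1 = pd.DataFrame({'NO': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'],
                   'A': [1, 2, 3, 4, 5, 2, 4, 2],
                   'B': ['a', 'b', 'c', 'd', 'e', 'b', 'd', 'b']})
a2 = pd.DataFrame({'NO': ['d9', 'd10', 'd11', 'd12'],
                   'A': [1, 2, 3, 2],
                   'B': ['a', 'b', 'c', 'b']})

keys = ['A', 'B']

# Number each repeated (A, B) pair within its own frame: 0, 1, 2, ...
# 'occ' is a hypothetical helper column, not part of the original data.
a1_n = a1.assign(occ=a1.groupby(keys).cumcount())
a2_n = a2.assign(occ=a2.groupby(keys).cumcount())

# Left merge on the keys plus the occurrence number; '_merge' marks
# rows of a1 that found no partner in a2.
merged = a1_n.merge(a2_n[keys + ['occ']], on=keys + ['occ'],
                    how='left', indicator=True)

# Keep only the unmatched rows and the original columns of a1.
result = merged.loc[merged['_merge'] == 'left_only', list(a1.columns)]
print(result)
```

Because the occurrence number is part of the join key, the third (2, 'b') row of a1 survives: a2 only contains that pair twice.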

Related

Update values of non NaN positions in a dataframe column

I want to update the values of non-NaN entries in a dataframe column
import pandas as pd
from pprint import pprint
import numpy as np
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
}
df = pd.DataFrame(d)
The data for updating the value column is in a list
new_value = [10, 15, 1, 18]
I could get the non-NaN entries in column value
df["value"].notnull()
I'm not sure how to assign the new values.
Suggestions will be really helpful.
df.loc[df["value"].notna(), 'value'] = new_value
With df["value"].notna() you select the rows where value is not NaN, and then you specify the column to assign to (value in this case). It is important that the number of rows selected by the condition matches the number of values in new_value.
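As a minimal sketch of the assignment described above, the snippet below uses only the value column from the question and adds an explicit length check before assigning. The variable name mask is an assumption for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan]})
new_value = [10, 15, 1, 18]

# Boolean mask that is True where 'value' holds a real number.
mask = df['value'].notna()

# Guard against a length mismatch before assigning.
if mask.sum() != len(new_value):
    raise ValueError("new_value must have one item per non-NaN row")

# Overwrite only the non-NaN positions, in row order.
df.loc[mask, 'value'] = new_value
print(df)
```

The NaN rows are left untouched, and the new values are written in the order in which the non-NaN rows appear.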
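Alternatively, you can first identify the row positions that hold NaN values and assign to those. Note that this fills the NaN entries rather than the non-NaN ones, so new_value needs one item per NaN.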
import pandas as pd
from pprint import pprint
import numpy as np
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
}
df = pd.DataFrame(d)
print(df)
r, _ = np.where(df.isna())
new_value = [10, 15, 18] # There are only 3 nans
df.loc[r,'value'] = new_value
print(df)
Output:
   t  input type  value
0  0      2    A    0.1
1  1      2    A    0.2
2  2      2    A   10.0
3  0      2    B   15.0
4  2      2    B    2.0
5  0      2    B    3.0
6  1      4    A   18.0

How to keep only the max row per group in a groupby table?

I made a summarized table like the one below using the pandas groupby function:
   I       II
A  apple   3
   banana  4
B  dog     1
   cat     2
C  seoul   9
   tokyo   5
I want to keep only the row that has the max value in column II within each category.
For example, in category A I want to keep only the banana row because it has the max value in column II.
The result table I want to get is shown below.
   I       II
A  banana  4
B  cat     2
C  seoul   9
Thanks.
Dataframe used by me:
df=pd.DataFrame({'II': {('A', 'apple'): 3,
('A', 'banana'): 4,
('B', 'dog'): 1,
('B', 'cat'): 2,
('C', 'seoul'): 9,
('C', 'tokyo'): 5}})
Try via sort_values(), reset_index() and drop_duplicates():
out=(df.sort_values('II',ascending=False)
.reset_index()
.drop_duplicates('level_0')
.set_index('level_0')
.rename_axis(index=None)
.rename(columns={'level_1':'I'}))
OR
out=(df.reset_index()
.sort_values('II',ascending=False)
.groupby('level_0')
.first()
.rename(columns={'level_1':'I'})
.rename_axis(index=None))
output of out (for the first variant; the groupby variant returns the same rows sorted by group: A, B, C):
I II
C seoul 9
A banana 4
B cat 2
Not sure if this is the most elegant solution, but this should work if you prefer to go through a groupby object.
import pandas as pd
# Creating the dummy DataFrame
d = {
'Letter': ['A', 'A', 'B', 'B', 'C', 'C'],
'Word': ['apple', 'banana', 'dog', 'cat', 'seoul', 'tokyo'],
'II': [3, 4, 1, 2, 9, 5]
}
df = pd.DataFrame(data=d)
# Max of "II" per letter, keeping "Letter" as a regular column
df_max = df.groupby('Letter', as_index=False)['II'].max()
# Merge on both keys so that an equal "II" value in another group cannot match
df_max = df_max.merge(df, how='left', on=['Letter', 'II'])  # brings the "Word" column back
You could then reorder the columns if you need them to be in a specific order.
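Another option, sketched here under the assumption that the data is in the flat Letter / Word / II layout used above, is to ask each group for the row label of its maximum with idxmax() and select those rows directly. The variable name best is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Letter': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Word': ['apple', 'banana', 'dog', 'cat', 'seoul', 'tokyo'],
    'II': [3, 4, 1, 2, 9, 5],
})

# idxmax() returns, per group, the index label of the row with the largest 'II'.
best_rows = df.groupby('Letter')['II'].idxmax()

# Select those rows; all original columns come along.
best = df.loc[best_rows]
print(best)
```

If a group has ties, idxmax() keeps only the first row that reaches the maximum.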

Conversion of dataframe to required dictionary format

I am trying to convert the below data frame to a dictionary
Dataframe:
import pandas as pd
df = pd.DataFrame({'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c':[4,3,5,5,5,3], 'd':[3,4,5,5,7,8]})
print(df)
Sample Dataframe:
a b c d
0 A 1 4 3
1 A 2 3 4
2 B 5 5 5
3 B 5 5 5
4 B 4 5 7
5 C 6 3 8
I need this data frame in the dictionary format shown below:
[{"a":"A","data_values":[{"b":1,"c":4,"d":3},{"b":2,"c":3,"d":4}]},
{"a":"B","data_values":[{"b":5,"c":5,"d":5},{"b":5,"c":5,"d":5},
{"b":4,"c":5,"d":7}]},{"a":"C","data_values":[{"b":6,"c":3,"d":8}]}]
Use DataFrame.groupby with a custom lambda function that converts each group to a list of dictionaries via DataFrame.to_dict:
L = (df.set_index('a')
.groupby('a')
.apply(lambda x: x.to_dict('records'))
.reset_index(name='data_values')
.to_dict('records')
)
print (L)
[{'a': 'A', 'data_values': [{'b': 1, 'c': 4, 'd': 3},
{'b': 2, 'c': 3, 'd': 4}]},
{'a': 'B', 'data_values': [{'b': 5, 'c': 5, 'd': 5},
{'b': 5, 'c': 5, 'd': 5},
{'b': 4, 'c': 5, 'd': 7}]},
{'a': 'C', 'data_values': [{'b': 6, 'c': 3, 'd': 8}]}]
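The same structure can also be built without apply, by iterating over the groups in a list comprehension. This is only a sketch of an alternative; the variable name records is an assumption.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [4, 3, 5, 5, 5, 3],
                   'd': [3, 4, 5, 5, 7, 8]})

# Each iteration yields the group key and the sub-frame for that key.
# Dropping 'a' leaves only the b/c/d columns inside 'data_values'.
records = [
    {'a': key, 'data_values': group.drop(columns='a').to_dict('records')}
    for key, group in df.groupby('a')
]
print(records)
```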

How to create new rows for entries that do not exist, in pandas

I have the following dataframe
import pandas as pd
foo = pd.DataFrame({'cat': ['a', 'a', 'a', 'b'], 'br': [1,2,2,3], 'ch': ['A', 'A', 'B', 'C'],
'value': [10,20,30,40]})
For every (cat, br) combination, I want to add each ch that is missing, with value 0.
My final dataframe should look like this:
foo_final = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'a', 'a', 'a', 'b', 'b'],
'br': [1,2,2,3, 1, 1, 2, 3, 3],
'ch': ['A', 'A', 'B','C','B', 'C', 'C', 'A', 'B'],
'value': [10,20,30,40, 0,0, 0,0,0]})
Use DataFrame.set_index to build a MultiIndex, and then DataFrame.unstack with DataFrame.stack:
foo = foo.set_index(['cat','br','ch']).unstack(fill_value=0).stack().reset_index()
print (foo)
cat br ch value
0 a 1 A 10
1 a 1 B 0
2 a 1 C 0
3 a 2 A 20
4 a 2 B 30
5 a 2 C 0
6 b 3 A 0
7 b 3 B 0
8 b 3 C 40
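To see why this works, it can help to look at the intermediate wide table. The sketch below splits the one-liner into two steps; the names wide and filled are assumptions for illustration.

```python
import pandas as pd

foo = pd.DataFrame({'cat': ['a', 'a', 'a', 'b'],
                    'br': [1, 2, 2, 3],
                    'ch': ['A', 'A', 'B', 'C'],
                    'value': [10, 20, 30, 40]})

# One row per existing (cat, br) pair, one column per ch value.
# Combinations that were absent are filled with 0 instead of NaN.
wide = foo.set_index(['cat', 'br', 'ch']).unstack(fill_value=0)
print(wide)

# Stacking moves the ch columns back into rows, now including the zeros.
filled = wide.stack().reset_index()
print(filled)
```

Only (cat, br) pairs that already exist in foo get completed; pairs such as ('a', 3) are not created.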

Pandas Get All Values from Multiindex levels

Given the following pivot table:
df=pd.DataFrame({'A':['a','a','a','a','a','b','b','b','b'],
'B':['x','y','z','x','y','z','x','y','z'],
'C':['a','b','a','b','a','b','a','b','a'],
'D':[7,5,3,4,1,6,5,3,1]})
table = pd.pivot_table(df, index=['A', 'B','C'],aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I'd like to access each value of 'C' (or level 2) as a list to use for plotting.
I'd like to do the same for 'A' and 'B' (levels 0 and 1) in such a way that it preserves spacing so that I can use those lists as well. I'm ultimately trying to use them to create something like this via plotting:
Here's the question from which this one stemmed.
Thanks in advance!
You can use get_level_values to get the index values at a specific level from a multi-index:
In [127]:
table.index.get_level_values('C')
Out[127]:
Index(['a', 'b', 'a', 'b', 'a', 'a', 'b', 'a', 'b'], dtype='object', name='C')
In [128]:
table.index.get_level_values('B')
Out[128]:
Index(['x', 'x', 'y', 'y', 'z', 'x', 'y', 'z', 'z'], dtype='object', name='B')
In [129]:
table.index.get_level_values('A')
Out[129]:
Index(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], dtype='object', name='A')
get_level_values accepts either an int for the level position or a level name.
Note that for the higher levels, the values are repeated to match the index length at the lowest level; the printed table simply hides these repeats.
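As a small sketch of how the levels could be turned into plain lists for plotting, the snippet below pulls levels by position and, for level 0, blanks out consecutive repeats to mimic the spacing of the printed table. The helper blank_repeats is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'B': ['x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z'],
                   'C': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a'],
                   'D': [7, 5, 3, 4, 1, 6, 5, 3, 1]})
table = pd.pivot_table(df, index=['A', 'B', 'C'], aggfunc='sum')

# Levels can be addressed by position (0, 1, 2) as well as by name.
a_labels = table.index.get_level_values(0).tolist()
c_labels = table.index.get_level_values(2).tolist()


def blank_repeats(values):
    """Hypothetical helper: replace consecutive repeats with ''."""
    out, prev = [], object()
    for v in values:
        out.append('' if v == prev else v)
        prev = v
    return out


# Level 0 labels with the same spacing as the printed table.
a_spaced = blank_repeats(a_labels)
print(c_labels)
print(a_spaced)
```

For inner levels such as B, a repeat should also be kept whenever an outer level changes, so this simple helper is only suitable for the outermost level as written.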
