How to insert the last row of each group after each row - python-3.x

I need to insert a copy of the last row of each id group (the row with the highest num) after each of the other rows in that group.
import pandas as pd
data = [{'id': 110, 'val1': 'A', 'num': 0},
{'id': 110, 'val1': 'B', 'num': 1},
{'id': 110, 'val1': 'C', 'num': 2},
{'id': 220, 'val1': 'E', 'num': 0},
{'id': 220, 'val1': 'F', 'num': 1},
{'id': 220, 'val1': 'G', 'num': 2},
{'id': 220, 'val1': 'X', 'num': 3},
{'id': 300, 'val1': 'H', 'num': 0},
{'id': 300, 'val1': 'I', 'num': 1}]
df = pd.DataFrame(data)
df
My dataframe:
What I'm looking for:

Here is one way: merge each row with the last row of its group, then reshape with wide_to_long. The drop_duplicates call assumes the data frame is already ordered within each id; if it is not, call sort_values first.
s=df.merge(df.drop_duplicates('id',keep='last'),on='id').query('val1_x!=val1_y').reset_index()
newdf=pd.wide_to_long(s,['val1','num'],i=['index','id'],j='drop',suffix='\\w+').\
reset_index('id').reset_index(drop=True)
newdf
id val1 num
0 110 A 0
1 110 C 2
2 110 B 1
3 110 C 2
4 220 E 0
5 220 X 3
6 220 F 1
7 220 X 3
8 220 G 2
9 220 X 3
10 300 H 0
11 300 I 1
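For comparison, the same result can be sketched without wide_to_long by interleaving two frames. This is only an illustrative alternative, not the approach from the answer above; the names body, last_rows, fill and result are made up for the example, and it assumes that "last row" means the row with the highest num in each id.

```python
import pandas as pd

df = pd.DataFrame({
    'id':   [110, 110, 110, 220, 220, 220, 220, 300, 300],
    'val1': ['A', 'B', 'C', 'E', 'F', 'G', 'X', 'H', 'I'],
    'num':  [0, 1, 2, 0, 1, 2, 3, 0, 1],
})

# Make sure the last row of each id really is the one with the highest num.
df = df.sort_values(['id', 'num']).reset_index(drop=True)

# True for the last row of every id group.
is_last = ~df.duplicated('id', keep='last')
body = df[~is_last]        # every row except the group's last one
last_rows = df[is_last]    # exactly one row per id

# One copy of the group's last row for every row in body, in body's order.
fill = body[['id']].merge(last_rows, on='id', how='left')
fill.index = body.index    # give each copy the label of the row it follows

# A stable sort keeps each body row ahead of its copy with the same label.
result = (pd.concat([body, fill])
            .sort_index(kind='mergesort')
            .reset_index(drop=True))
print(result)
```

The stable sort is what interleaves the rows: each copy shares an index label with the row it should follow, and because body comes first in the concat, ties are resolved in the right order.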

Related

Update values of non NaN positions in a dataframe column

I want to update the values of non-NaN entries in a dataframe column
import pandas as pd
from pprint import pprint
import numpy as np
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
}
df = pd.DataFrame(d)
The data for updating the value column is in a list
new_value = [10, 15, 1, 18]
I could get the non-NaN entries in column value
df["value"].notnull()
I'm not sure how to assign the new values.
Suggestions will be really helpful.
df.loc[df["value"].notna(), 'value'] = new_value
df["value"].notna() selects the rows where value is not NaN, and 'value' names the column to assign to. The number of rows selected by the condition must match the number of values in new_value.
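As a minimal end-to-end sketch of that assignment, using the data from the question: the explicit length check is an addition for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    't': [0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
    'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
})
new_value = [10, 15, 1, 18]

# Boolean mask of the rows whose 'value' is not NaN.
mask = df['value'].notna()

# Fail early with a clear message if the lengths do not line up.
if mask.sum() != len(new_value):
    raise ValueError(f'{mask.sum()} non-NaN rows but {len(new_value)} new values')

# Values are assigned in row order; the NaN rows are left untouched.
df.loc[mask, 'value'] = new_value
print(df)
```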
You can first identify the index positions that hold NaN values. Note that this variant fills the NaN entries rather than the non-NaN ones.
import pandas as pd
from pprint import pprint
import numpy as np
d = {
't': [0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A'],
'value': [0.1, 0.2, np.nan, np.nan, 2, 3, np.nan],
}
df = pd.DataFrame(d)
print(df)
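# Row positions of the NaN entries in 'value' (with the default RangeIndex, positions equal labels)
r = np.flatnonzero(df['value'].isna())
new_value = [10, 15, 18] # There are only 3 NaNs
df.loc[r, 'value'] = new_value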
print(df)
Output:
t input type value
0 0 2 A 0.1
1 1 2 A 0.2
2 2 2 A 10.0
3 0 2 B 15.0
4 2 2 B 2.0
5 0 2 B 3.0
6 1 4 A 18.0

How to group by column in a dataframe and create pivot tables in a loop

I have the following table df.
ID  CATEG  LEVEL  COLS    VALUE  COMMENT
1   A      3      Apple   388    comment1
1   A      3      Orange  204    comment1
1   A      2      Orange  322    comment1
1   A      1      Orange  716    comment1
1   A      1      Apple   282    comment1
1   A      2      Apple   555    comment1
1   A             Berry   289    comment1
2   A             Car     316    comment1
1   B             Berry   297    comment1
1   B      3      Apple   756    comment1
1   B      2      Apple   460    comment1
1   B      3      Orange  497    comment1
1   B      2      Orange  831    comment1
1   B      1      Orange  225    comment1
1   B      1      Apple   395    comment1
2   B             Car     486    comment1
1   C      2      Orange  320    comment1
1   C      1      Orange  208    comment1
1   C      1      Apple   464    comment1
1   C      2      Apple   613    comment1
1   C      3      Apple   369    comment1
1   C             Berry   474    comment1
2   C             Car     888    comment1
1   C      3      Orange  345    comment1
2   B             Car     664    comment2
I want to create the following view as a DataFrame and write it to Excel, one sheet per ID group. Example for ID 1 (my sample has only one comment for this ID, so the sheet name follows the pattern ID_COMMENT, i.e. 1_comment1):
     Berry  Apple          Orange
            3    2    1    3    2    1
A    289    388  555  282  204  322  716
B    297    756  460  395  497  831  225
C    474    369  613  464  345  320  208
If LEVEL is None/NaN, I should be able to split the df on COLS and COMMENT alone, using "ID_NULL_COMMENT" as the sheet name, like this:
2_NULL_comment1 sheet:
CATEG  Car
A      316
B      486
C      888
2_NULL_comment2 sheet:
CATEG  Car
B      664
What I tried:
from pandas import ExcelWriter
writer = ExcelWriter('Values.xlsx')
distinct_id_df= np.unique(df[['ID']], axis=0)
for ID in distinct_id_df.iloc[:,0] :
sample_df = pd.DataFrame()
for df in sample_df:
for i in(distinct_id_df):
distinct_id_df = df.groupby['ID'].pivot_table('VALUE', ['LEVEL','CATEEG'],'COLS')
sample_df = sample_df.append(df)
print(sample_df.shape, '===>', datetime.now())
sample_df.to_excel(writer,'{}''{}'.format(id).format(comments),index= False)
writer.save()
This is clearly not correct. I'm unable to do the pivot correctly, and I'm also stuck on how to loop so that each result is written to a different sheet.
Use:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2], 'CATEG': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'B'], 'LEVEL': [3.0, 3.0, 2.0, 1.0, 1.0, 2.0, np.nan, np.nan, np.nan, 3.0, 2.0, 3.0, 2.0, 1.0, 1.0, np.nan, 2.0, 1.0, 1.0, 2.0, 3.0, np.nan, np.nan, 3.0, np.nan], 'COLS': ['Apple', 'Orange', 'Orange', 'Orange', 'Apple', 'Apple', 'Berry', 'Car', 'Berry', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Apple', 'Car', 'Orange', 'Orange', 'Apple', 'Apple', 'Apple', 'Berry', 'Car', 'Orange', 'Car'], 'VALUE': [388, 204, 322, 716, 282, 555, 289, 316, 297, 756, 460, 497, 831, 225, 395, 486, 320, 208, 464, 613, 369, 474, 888, 345, 664], 'COMMENT': ['comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment2']})
# check missing values
mask = df['LEVEL'].isna()
# split DataFrames for different processing
df1 = df[~mask]
df2 = df[mask]
# pivoting with different columns parameters
df1 = df1.pivot_table(index=['ID','COMMENT','CATEG'],
                      columns=['COLS','LEVEL'],
                      values='VALUE')
# print (df1)
df2 = df2.pivot_table(index=['ID','COMMENT','CATEG'], columns='COLS', values='VALUE')
# print (df2)
with pd.ExcelWriter('Values.xlsx') as writer:
    # group by the first 2 index levels, ID and COMMENT
    for (ids, comments), sample_df in df1.groupby(['ID','COMMENT']):
        # remove the first 2 index levels, then drop columns that hold only NaNs
        out = sample_df.reset_index(level=[0, 1], drop=True).dropna(how='all', axis=1)
        # new sheet name built with an f-string
        name = f'{ids}_{comments}'
        # write to file
        out.to_excel(writer, sheet_name=name)
    for (ids, comments), sample_df in df2.groupby(['ID','COMMENT']):
        out = sample_df.reset_index(level=[0, 1], drop=True).dropna(how='all', axis=1)
        name = f'{ids}_NULL_{comments}'
        out.to_excel(writer, sheet_name=name)
Another solution without repeating code:
mask = df['LEVEL'].isna()
dfs = {'no_null': df[~mask], 'null': df[mask]}
with pd.ExcelWriter('Values.xlsx') as writer:
    for k, v in dfs.items():
        if k == 'no_null':
            add = ''
            cols = ['COLS','LEVEL']
        else:
            add = 'NULL_'
            cols = 'COLS'
        pivoted = v.pivot_table(index=['ID','COMMENT','CATEG'], columns=cols, values='VALUE')
        for (ids, comments), sample_df in pivoted.groupby(['ID','COMMENT']):
            # keep only CATEG as the index and drop columns that hold only NaNs
            out = sample_df.reset_index(level=[0, 1], drop=True).dropna(how='all', axis=1)
            name = f'{ids}_{add}{comments}'
            out.to_excel(writer, sheet_name=name)
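To check the sheet names and pivots before touching Excel, the same split-pivot-group logic can be collected into a dictionary first. The sketch below is illustrative only: build_sheets is a hypothetical helper name, and the six-row frame is a reduced subset of the question's data chosen to keep the example short.

```python
import numpy as np
import pandas as pd

# Reduced sample: a subset of the question's rows, for illustration only.
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2],
    'CATEG': ['A', 'A', 'B', 'B', 'A', 'B'],
    'LEVEL': [1, 2, 1, np.nan, np.nan, np.nan],
    'COLS': ['Apple', 'Apple', 'Apple', 'Berry', 'Car', 'Car'],
    'VALUE': [282, 555, 395, 297, 316, 664],
    'COMMENT': ['comment1'] * 5 + ['comment2'],
})


def build_sheets(frame):
    """Return {sheet_name: pivoted DataFrame} (hypothetical helper)."""
    sheets = {}
    mask = frame['LEVEL'].isna()
    parts = [
        (frame[~mask], ['COLS', 'LEVEL'], ''),   # rows that have a LEVEL
        (frame[mask], 'COLS', 'NULL_'),          # rows without a LEVEL
    ]
    for part, cols, tag in parts:
        if part.empty:
            continue
        wide = part.pivot_table(index=['ID', 'COMMENT', 'CATEG'],
                                columns=cols, values='VALUE')
        for (id_, comment), grp in wide.groupby(level=['ID', 'COMMENT']):
            # Keep only CATEG as the index and drop all-NaN columns.
            sheet = grp.droplevel(['ID', 'COMMENT']).dropna(how='all', axis=1)
            sheets[f'{id_}_{tag}{comment}'] = sheet
    return sheets


sheets = build_sheets(df)
for name, sheet in sheets.items():
    print(name)
    print(sheet, end='\n\n')
```

Each entry can then be written with sheet.to_excel(writer, sheet_name=name) inside a pd.ExcelWriter block, as in the answers above.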

Conversion of dataframe to required dictionary format

I am trying to convert the below data frame to a dictionary
Dataframe:
import pandas as pd
df = pd.DataFrame({'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c':[4,3,5,5,5,3], 'd':[3,4,5,5,7,8]})
print(df)
Sample Dataframe:
a b c d
0 A 1 4 3
1 A 2 3 4
2 B 5 5 5
3 B 5 5 5
4 B 4 5 7
5 C 6 3 8
I need this data frame in the following format, a list of dictionaries:
[{"a":"A","data_values":[{"b":1,"c":4,"d":3},{"b":2,"c":3,"d":4}]},
{"a":"B","data_values":[{"b":5,"c":5,"d":5},{"b":5,"c":5,"d":5},
{"b":4,"c":5,"d":7}]},{"a":"C","data_values":[{"b":6,"c":3,"d":8}]}]
Use DataFrame.groupby with a custom lambda function that converts each group to a list of dictionaries with DataFrame.to_dict:
L = (df.set_index('a')
.groupby('a')
.apply(lambda x: x.to_dict('records'))
.reset_index(name='data_values')
.to_dict('records')
)
print (L)
[{'a': 'A', 'data_values': [{'b': 1, 'c': 4, 'd': 3},
{'b': 2, 'c': 3, 'd': 4}]},
{'a': 'B', 'data_values': [{'b': 5, 'c': 5, 'd': 5},
{'b': 5, 'c': 5, 'd': 5},
{'b': 4, 'c': 5, 'd': 7}]},
{'a': 'C', 'data_values': [{'b': 6, 'c': 3, 'd': 8}]}]
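An equivalent result can also be sketched with a plain list comprehension over the groups, which some readers find easier to follow than groupby.apply. This is an illustrative alternative rather than the method from the answer above; the variable name records is an assumption.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6],
                   'c': [4, 3, 5, 5, 5, 3],
                   'd': [3, 4, 5, 5, 7, 8]})

# One dictionary per group: the key column plus the remaining columns as records.
records = [
    {'a': key, 'data_values': group.drop(columns='a').to_dict('records')}
    for key, group in df.groupby('a')
]
print(records)
```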

How to create new rows for entries that do not exist, in pandas

I have the following dataframe
import pandas as pd
foo = pd.DataFrame({'cat': ['a', 'a', 'a', 'b'], 'br': [1,2,2,3], 'ch': ['A', 'A', 'B', 'C'],
'value': [10,20,30,40]})
For every existing (cat, br) pair, I want to add each ch that is missing, with value 0.
My final dataframe should look like this:
foo_final = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'a', 'a', 'a', 'b', 'b'],
'br': [1,2,2,3, 1, 1, 2, 3, 3],
'ch': ['A', 'A', 'B','C','B', 'C', 'C', 'A', 'B'],
'value': [10,20,30,40, 0,0, 0,0,0]})
Use DataFrame.set_index to build a MultiIndex, then DataFrame.unstack followed by DataFrame.stack:
foo = foo.set_index(['cat','br','ch']).unstack(fill_value=0).stack().reset_index()
print (foo)
cat br ch value
0 a 1 A 10
1 a 1 B 0
2 a 1 C 0
3 a 2 A 20
4 a 2 B 30
5 a 2 C 0
6 b 3 A 0
7 b 3 B 0
8 b 3 C 40
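The same completion can be sketched by building the grid of combinations explicitly and left-joining the original values onto it. This is an alternative shown for illustration, not the answer's method; it assumes pandas 1.2 or later for how='cross', and the names pairs, chs, grid and result are made up for the example.

```python
import pandas as pd

foo = pd.DataFrame({'cat': ['a', 'a', 'a', 'b'],
                    'br': [1, 2, 2, 3],
                    'ch': ['A', 'A', 'B', 'C'],
                    'value': [10, 20, 30, 40]})

# Existing (cat, br) pairs and all distinct ch values.
pairs = foo[['cat', 'br']].drop_duplicates()
chs = foo[['ch']].drop_duplicates()

# Every pair combined with every ch.
grid = pairs.merge(chs, how='cross')

# Attach the known values; combinations that did not exist get 0.
result = (grid.merge(foo, how='left', on=['cat', 'br', 'ch'])
              .fillna({'value': 0})
              .astype({'value': int})
              .sort_values(['cat', 'br', 'ch'])
              .reset_index(drop=True))
print(result)
```

Unlike a reindex over the full product of cat, br and ch, this only completes pairs that already occur, so no row such as (a, 3) is invented.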

How to turn multiple columns into a list of key-value pairs in Python

I have the data frame below and want to build, for each city, a list of key-value pairs from its columns. How can I do this in Python?
df =
city  code  qty1  type
hyd   1     10    a
hyd   2     12    b
ban   2     15    c
ban   4     25    d
pune  1     10    e
pune  3     12    f
I want to create a new data frame like this:
df1 =
city  list
hyd   [{"1":"10","type":"a"},{"2":"12","type":"b"}]
ban   [{"2":"15","type":"c"},{"4":"25","type":"d"}]
pune  [{"1":"10","type":"e"},{"3":"12","type":"f"}]
defaultdict
from collections import defaultdict
d = defaultdict(list)
for t in df.itertuples():
    d[t.city].append({t.code: t.qty1, 'type': t.type})
pd.Series(d).rename_axis('city').to_frame('list')
list
city
ban [{2: 15, 'type': 'c'}, {4: 25, 'type': 'd'}]
hyd [{1: 10, 'type': 'a'}, {2: 12, 'type': 'b'}]
pune [{1: 10, 'type': 'e'}, {3: 12, 'type': 'f'}]
groupby
pd.Series([
{c: q, 'type': t}
for c, q, t in zip(df.code, df.qty1, df.type)
]).groupby(df.city).apply(list).to_frame('list')
list
city
ban [{2: 15, 'type': 'c'}, {4: 25, 'type': 'd'}]
hyd [{1: 10, 'type': 'a'}, {2: 12, 'type': 'b'}]
pune [{1: 10, 'type': 'e'}, {3: 12, 'type': 'f'}]
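Both outputs above keep code and qty1 as integers, while the requested frame shows them as strings and lists the cities in their original order. A minimal sketch of that variant, assuming string keys and values are really what is wanted (the names pairs and df1 are chosen for the example):

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['hyd', 'hyd', 'ban', 'ban', 'pune', 'pune'],
    'code': [1, 2, 2, 4, 1, 3],
    'qty1': [10, 12, 15, 25, 10, 12],
    'type': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# One dictionary per row, with code and qty1 converted to strings.
pairs = pd.Series(
    [{str(c): str(q), 'type': t}
     for c, q, t in zip(df['code'], df['qty1'], df['type'])],
    index=df.index,
)

# sort=False keeps the cities in order of first appearance.
df1 = (pairs.groupby(df['city'], sort=False)
            .apply(list)
            .rename('list')
            .reset_index())
print(df1)
```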
