Python Pandas DF Pivot and Groupby - python-3.x

I need to iterate through my dataframe rows and pivot the single column bounding_box_y into 8 columns each time the value in text_y column changes.
original data frame
desired data frame
Can anyone help with some code that does NOT hardcode values into the code? The entire dataframe is over 6000 rows. I need to pivot the one column into 8 each time the value in another column changes.
Thanks!

Please try to include your data as callable code, so others can easily copy/paste and experiment. In your case you can get it with df.head(16).to_dict('list'). I used the following
import pandas as pd
df = pd.DataFrame({
    'boundingBox_y': [183, 120, 305, 120, 305, 161, 182, 161, 318, 120, 381, 120, 382, 162, 318, 161],
    'text_y': (['FORM'] * 8) + (['ABC'] * 8),
    'confidence': ([0.987] * 8) + ([0.976] * 8)
})
Then you can pivot your dataframe but you need to add a new column to hold the pivoted column names.
# rename the current values column
df.rename({'boundingBox_y': 'value'}, axis=1, inplace=True)
# create a column that contains the columns headers and can be pivoted
df['boundingBox_y'] = df.groupby(['confidence', 'text_y']).cumcount()
# pivot your df
df = df.pivot(index=['confidence', 'text_y'],
              columns='boundingBox_y', values='value')
Output
boundingBox_y 0 1 2 3 4 5 6 7
confidence text_y
0.976 ABC 318 120 381 120 382 162 318 161
0.987 FORM 183 120 305 120 305 161 182 161
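The question asks for a new row each time text_y changes, whereas grouping by value alone would merge two separate runs that happen to share the same text and confidence. The following is a minimal sketch of one way to handle that, assuming each run has the same number of rows; the names block, pos and wide are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'boundingBox_y': [183, 120, 305, 120, 305, 161, 182, 161,
                      318, 120, 381, 120, 382, 162, 318, 161],
    'text_y': (['FORM'] * 8) + (['ABC'] * 8),
    'confidence': ([0.987] * 8) + ([0.976] * 8),
})

# block id: increases by one whenever text_y differs from the previous row
block = (df['text_y'] != df['text_y'].shift()).cumsum()
# position of each row inside its block (0..7 for the sample data)
pos = df.groupby(block).cumcount()

# move the positions into columns, one output row per block
wide = (
    df.assign(block=block, pos=pos)
      .set_index(['block', 'text_y', 'confidence', 'pos'])['boundingBox_y']
      .unstack('pos')
      .reset_index()
)
print(wide)
```

Because the block id comes from row order rather than from the values, the original order of the runs is preserved and nothing is hardcoded.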

Related

adding reversed columns to dataframe [duplicate]

Trying to add a reversed column to a data frame, but it just adds in normal order. For me, it looks like it is just following the index of the dataframe. Is it possible to reorder the index?
df_reversed = df['Buy'].iloc[::-1]
Data["newColumn"] = df_reversed
Image of the output
Image of df_reversed
This is how I want the output to be
A slight modification of @Chicodelarose's answer: you can reverse just the underlying values, which drops the index, and get the result you want as follows:
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
df["calories_reversed"] = df["calories"].values[::-1]
print(df)
Output will be:
calories duration
0 420 50
1 380 40
2 390 45
calories duration calories_reversed
0 420 50 390
1 380 40 380
2 390 45 420
You need to call reset_index before assigning the values to the new column so that they are added to the data frame in reverse order:
Example:
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
df["calories_reversed"] = df["calories"][::-1].reset_index(drop=True)
print(df)
Output:
calories duration
0 420 50
1 380 40
2 390 45
calories duration calories_reversed
0 420 50 390
1 380 40 380
2 390 45 420
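The reason the question's attempt "adds in normal order" is that assigning a Series to a column aligns on index labels rather than on position. A small sketch of the difference, using illustrative column names aligned and by_position:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390]})

# Reversed values, but the index labels travel with them: 2, 1, 0
rev = df["calories"].iloc[::-1]

# Assigning a Series realigns on the index, so the original order comes back
df["aligned"] = rev

# Assigning a plain array has no index to align on, so it is positional
df["by_position"] = rev.to_numpy()

print(df)
```

Both answers above work for the same reason: .values removes the index, and reset_index(drop=True) replaces it with 0, 1, 2 so that alignment matches position.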

In pandas dataframe, how to make one column act on all the others?

Consider the small following dataframe:
import pandas as pd
value1 = [15, 20, 50, 70]
value2 = [15, 80, 45, 30]
base = [175, 150, 200, 125]
df = pd.DataFrame({"val1": value1, "val2": value2, "base": base})
df
val1 val2 base
0 15 15 175
1 20 80 150
2 50 45 200
3 70 30 125
In reality, there are many more rows and many more val*** columns.
I would like to express the figures in the val*** columns as a percentage of the corresponding base in the same row. For example, 70 (last in val1) should become (70/125)*100, which is 56, and 30 (last in val2) should become (30/125)*100, which is 24, and so on for every figure.
I am sure the solution lies in a correct use of assign or apply and lambda, but I can't find how to do it ...
We can select the val-like columns with filter, divide them by the base column along axis=0, and multiply by 100 to get the percentages:
df.filter(like='val').div(df['base'], axis=0).mul(100).add_suffix('%')
val1% val2%
0 8.571429 8.571429
1 13.333333 53.333333
2 25.000000 22.500000
3 56.000000 24.000000
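If the percentages should replace the original val columns instead of producing a separate frame, one possible variation is to assign the result back. This is a sketch under that assumption; val_cols is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "val1": [15, 20, 50, 70],
    "val2": [15, 80, 45, 30],
    "base": [175, 150, 200, 125],
})

# Labels of every column whose name contains "val"
val_cols = df.filter(like="val").columns

# Row-wise division by base (axis=0 aligns base with the row index), then scale
df[val_cols] = df[val_cols].div(df["base"], axis=0).mul(100)

print(df)
```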

Getting columns by list of substring values

I have the dataframe shown below. The real data is large, and I want to create different dataframes based on substrings of the column names.
df
ID ex_srr123 ex2_srr124 ex3_srr125 ex4_srr1234 ex23_srr5323
san 12 43 0 34 0
mat 53 0 34 76 656
jon 82 223 23 32 21
jack 0 12 2 0 0
I have lists of column-name substrings:
coln1=['srr123', 'srr124']
coln2=['srr1234','srr5323']
I wanted
df2=
ID ex_srr123 ex2_srr12
san 12 43
mat 53 0
jon 82 223
jack 0 12
I tried
df2=df[coln1]
but I didn't get what I wanted. How can I get the desired output?
Statically
df2 = df.filter(regex="srr123$|srr124$").copy()
Dynamically
coln1 = ['srr123', 'srr124']
df2 = df.filter(regex=f"{coln1[0]}$|{coln1[1]}$").copy()
The $ signifies the end of the string, so that the column ex4_srr1234 isn't also included in your result.
Look into the filter method. Note that without a $ anchor the pattern below also matches ex4_srr1234:
df.filter(regex="srr123|srr124").copy()
I am making a few assumptions:
'ID' is a column and not the index.
The third column in df2 should read 'ex2_srr124' instead of 'ex2_srr12'.
You do not want to include columns of 'df' in 'df2' if the substring does not match everything after the underscore (since 'srr123' is a substring of 'ex4_srr1234' but you did not include it in 'df2').
# set the provided data frames
df = pd.DataFrame([['san', 12, 43, 0, 34, 0],
['mat', 53, 0, 34, 76, 656],
['jon', 82, 223, 23, 32, 21],
['jack', 0, 12, 2, 0, 0]],
columns = ['ID', 'ex_srr123', 'ex2_srr124', 'ex3_srr125', 'ex4_srr1234', 'ex23_srr5323'])
# set the list of column-substrings
coln1=['srr123', 'srr124']
coln2=['srr1234','srr5323']
I suggest solving this as follows:
# create df2 and add the ID column
df2 = pd.DataFrame()
df2['ID'] = df['ID']
# iterate over each substring in a list of column-substrings
for substring in coln1:
    # iterate over each column name in the df columns
    for column_name in df.columns.values:
        # check if column name ends with substring
        if substring == column_name[-len(substring):]:
            # assign the new column to df2
            df2[column_name] = df[column_name]
This yields the desired dataframe df2:
ID ex_srr123 ex2_srr124
0 san 12 43
1 mat 53 0
2 jon 82 223
3 jack 0 12
df.filter(regex = '|'.join(['ID'] + [col+ '$' for col in coln1])).copy()
ID ex_srr123 ex2_srr124
0 san 12 43
1 mat 53 0
2 jon 82 223
3 jack 0 12
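The suffix-matching idea from the answers above can also be wrapped in a small reusable function that takes any list of substrings. This is only a sketch: select_by_suffix and its keep parameter are hypothetical names, and it assumes ID is a regular column.

```python
import pandas as pd

def select_by_suffix(df, suffixes, keep=("ID",)):
    """Return the keep columns plus every column whose name ends with a suffix."""
    # str.endswith accepts a tuple, so no regex escaping is needed
    matched = [c for c in df.columns if c.endswith(tuple(suffixes))]
    return df[list(keep) + matched].copy()

df = pd.DataFrame(
    [['san', 12, 43, 0, 34, 0],
     ['mat', 53, 0, 34, 76, 656],
     ['jon', 82, 223, 23, 32, 21],
     ['jack', 0, 12, 2, 0, 0]],
    columns=['ID', 'ex_srr123', 'ex2_srr124', 'ex3_srr125',
             'ex4_srr1234', 'ex23_srr5323'],
)

coln1 = ['srr123', 'srr124']
coln2 = ['srr1234', 'srr5323']

df2 = select_by_suffix(df, coln1)
df3 = select_by_suffix(df, coln2)
print(df2)
print(df3)
```

Matching on the end of the name keeps ex4_srr1234 out of df2 even though srr123 is a substring of it.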

How to groupby one column of DataFrame while it appends corresponding rows in another column and multiplies to the number of itself in its column?

Imagine we have a two-column DataFrame in which col1 holds unique numbers while col2 holds repeated numbers, like below.
I want it to look like this:
Try:
# Setup
df = pd.DataFrame({'col1':{0:89,1:53,2:97,3:106,4:115,5:56,6:55,7:105,8:71,9:70,10:110},'col2':{0:205,1:205,2:205,3:203,4:203,5:203,6:202,7:201,8:200,9:200,10:198}})
df_new = df.groupby('col2', sort=False)['col1'].apply(list).reset_index()
df_new['col2'] = df_new['col1'].str.len().astype(str) + '*' + df_new.pop('col2').astype(str)
print(df_new)
[out]
col1 col2
0 [89, 53, 97] 3*205
1 [106, 115, 56] 3*203
2 [55] 1*202
3 [105] 1*201
4 [71, 70] 2*200
5 [110] 1*198

Converting a List of Pandas Series to a single Pandas DataFrame

I am using statsmodels.api on my data set. I have a list of pandas Series. Each Series holds key-value pairs: the keys are the column names and the values are the data. All of the Series in the list have the same keys, so the column names repeat across the list. I want to combine all of the values into a single DataFrame whose column names are the Series keys, so that I can export it as a CSV. How can I use the keys as the column names and have the values fill the rows?
Each series in the list returns something like this:
index 0 of the list: <class 'pandas.core.series.Series'>
height 23
weight 10
size 45
amount 9
index 1 of the list: <class 'pandas.core.series.Series'>
height 11
weight 99
size 25
amount 410
index 2 of the list: <class 'pandas.core.series.Series'>
height 3
weight 0
size 115
amount 92
I would like to be able to read a dataframe such that these values are saved as the following:
DataFrame:
height weight size amount
23 10 45 9
11 99 25 410
3 0 115 92
pd.DataFrame(data=your_list_of_series)
When creating a new DataFrame, pandas will accept a list of series for the data argument. The indices of your series will become the column names of the DataFrame.
Not the most efficient way, but this does the trick:
import pandas as pd
series_list =[ pd.Series({ 'height': 23,
'weight': 10,
'size': 45,
'amount': 9
}),
pd.Series({ 'height': 11,
'weight': 99,
'size': 25,
'amount': 410
}),
pd.Series({ 'height': 3,
'weight': 0,
'size': 115,
'amount': 92
})
]
pd.DataFrame( [series.to_dict() for series in series_list] )
Did you try just calling pd.DataFrame() on the list of series? That should just work.
import pandas as pd
series_list = [
pd.Series({
'height': 23,
'weight': 10,
'size': 45,
'amount': 9
}),
pd.Series({
'height': 11,
'weight': 99,
'size': 25,
'amount': 410
}),
pd.Series({
'height': 3,
'weight': 0,
'size': 115,
'amount': 92
})
]
df = pd.DataFrame(series_list)
print(df)
df.to_csv('path/to/save/foo.csv')
Output:
height weight size amount
0 23 10 45 9
1 11 99 25 410
2 3 0 115 92
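Since the stated goal is a CSV with only the four value columns, it may help to drop the row index when exporting. The sketch below writes to an in-memory buffer instead of a real path so it can be run anywhere; buffer and csv_text are illustrative names.

```python
import io
import pandas as pd

series_list = [
    pd.Series({'height': 23, 'weight': 10, 'size': 45, 'amount': 9}),
    pd.Series({'height': 11, 'weight': 99, 'size': 25, 'amount': 410}),
    pd.Series({'height': 3, 'weight': 0, 'size': 115, 'amount': 92}),
]

# Each Series becomes one row; the shared Series keys become the columns
df = pd.DataFrame(series_list)

# index=False leaves the 0, 1, 2 row labels out of the CSV
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path instead of the buffer writes the same content to disk.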
