How to perform an operation on each column in Pandas [duplicate] - python-3.x

This question already has answers here:
pandas convert columns to percentages of the totals
(4 answers)
Closed 1 year ago.
I have this dataset and I want to find what percentage each cell represents within its year, i.e. divide each value by the sum of that year's values: (value / sum(1960)) * 100. How can I get this value for each column and each row?

If I'm understanding correctly, you want the equivalent of 5 / sum([1, 2, 3, 4, 5]) * 100. If that's the case, then you could do the following:
subset_cols = df.columns  # or a list of specific columns
perc_df = df[subset_cols] / df[subset_cols].sum(axis=0) * 100  # divide each column by its total
Using axis=0 will apply the function to each column, whereas axis=1 will apply to each row.
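For instance, on a small made-up frame (column names and values invented for illustration):
import pandas as pd

df = pd.DataFrame({'1960': [1, 2, 3, 4, 5], '1961': [10, 20, 30, 40, 0]})
perc_df = df / df.sum(axis=0) * 100
# the last cell of '1960' becomes 5 / 15 * 100 = 33.33...; each column now sums to 100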

To convert column values into percentages, this is the simplest way:
df['1960_percentages'] = 100 * df['1960'] / df['1960'].sum()
(Bracket indexing is required here: attribute access like df.1960 is a syntax error, because the column name starts with a digit.)
Repeat similarly for other columns.
Note: This creates a new column in your dataframe, keeping the original data intact. If you would just like to replace the values in place, do the following:
df['1960'] /= df['1960'].sum() / 100
Edit: To do the same for multiple columns at once:
cols = df.columns  # or a list of specific columns to apply this over
df[cols] /= (df[cols].sum(axis=0) / 100)
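A quick sketch of the in-place version, again on a tiny invented frame:
import pandas as pd

df = pd.DataFrame({'1960': [1, 4], '1970': [2, 8]})
cols = df.columns
df[cols] /= df[cols].sum(axis=0) / 100
# both columns become [20.0, 80.0], so each column now sums to 100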

Related

Filter Dataframe by comparing one column to list of other columns

I have a dataframe with numerous float columns. I want to filter the dataframe, leaving only the values that are between the High and Low columns of the same dataframe.
I know how to do this when the conditions are one column compared to another column. But there are 102 columns, so I cannot write a condition for each column. And all my research just illustrates how to compare two columns and not one column against all others (or I am not typing the right search terms).
I tried df = df[(df['High'] <= df[DFColRBs]) & (df['Low'] >= df[DFColRBs])].copy(), but it erases everything.
And I tried booleanselection = df[df[DFColRBs].between(df['High'], df['Low'])]
and I tried: df= df[(df[DFColRBs].ge(df['Low'])) & (df[DFColRBs].le(df['Low']))].copy()
and I tried:
BoolMatrix = (df[DFColRBs].ge(DF_copy['Low'], axis=0)) & (df[DFColRBs].le(DF_copy['Low'], axis=0))
df= df[BoolMatrix].copy()
But it erases everything in the dataframe, even the 3 columns that are not included in the list.
I appreciate the guidance.
Example Dataframe:
High Low Close _1m_21 _1m_34 _1m_55 _1m_89 _1m_144 _1m_233 _5m_21 _5m_34 _5m_55
0 1.23491 1.23456 1.23456 1.23401 1.23397 1.23391 1.2339 1.2337 1.2335 1.23392 1.23363 1.23343
1 1.23492 1.23472 1.23472 1.23422 1.23409 1.234 1.23392 1.23375 1.23353 1.23396 1.23366 1.23347
2 1.23495 1.23479 1.23488 1.23454 1.23422 1.23428 1.23416 1.23404 1.23372 1.23415 1.234 1.23367
3 1.23494 1.23472 1.23473 1.23457 1.23425 1.23428 1.23417 1.23405 1.23373 1.23415 1.234 1.23367
Based on what you've said in the comments, it's best to split the df into the pieces you want to operate on and the ones you don't, then use matrix operations.
tmp_df = DF_copy.iloc[:, 3:].copy()
# or tmp_df = DF_copy[DFColRBs].copy()
# mask by comparing test columns with the high and low columns
m = tmp_df.le(DF_copy['High'], axis=0) & tmp_df.ge(DF_copy['Low'], axis=0)
# combine the masked df with the original cols
DF_copy2 = pd.concat([DF_copy.iloc[:, :3], tmp_df.where(m)], axis=1)
# or replace DF_copy.iloc[:, :3] with DF_copy.drop(columns=DFColRBs)
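Here is a self-contained sketch of the same idea, with a single test column and values loosely based on the example frame above (the second _1m_21 value is invented so that one cell survives the mask):
import pandas as pd

DF_copy = pd.DataFrame({'High':   [1.23491, 1.23492],
                        'Low':    [1.23456, 1.23472],
                        'Close':  [1.23456, 1.23472],
                        '_1m_21': [1.23401, 1.23480]})
tmp_df = DF_copy.iloc[:, 3:].copy()
m = tmp_df.le(DF_copy['High'], axis=0) & tmp_df.ge(DF_copy['Low'], axis=0)
DF_copy2 = pd.concat([DF_copy.iloc[:, :3], tmp_df.where(m)], axis=1)
# row 0: 1.23401 is below Low (1.23456), so it becomes NaN; row 1 keeps 1.23480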

How to get number of columns in a DataFrame row that are above threshold

I have a simple python 3.8 DataFrame with 8 columns (simply labeled 0, 1, 2, etc.) with approx. 3500 rows. I want a subset of this DataFrame where there are at least 2 columns in each row that are above 1. I would prefer not to have to check each column individually, but be able to check all columns. I know I can use the .any(1) to check all the columns, but I need there to be at least 2 columns that meet the threshold, not just one. Any help would be appreciated. Sample code below:
import pandas as pd
df = pd.DataFrame({0: [1, 1, 1, 1, 100],
                   1: [1, 3, 1, 1, 1],
                   2: [1, 3, 1, 1, 4],
                   3: [1, 1, 1, 1, 1],
                   4: [3, 4, 1, 1, 5],
                   5: [1, 1, 1, 1, 1]})
The easiest way I can think of to sort/filter later would be to create another column at the end, df[9], that houses the count:
df[9] = df.apply(lambda x: x.count() if x > 2, axis=1)
This code doesn't work, but I feel like it's close?
df[(df>1).sum(axis=1)>=2]
Explanation:
(df>1).sum(axis=1) gives the number of columns in that row that is greater than 1.
then with >=2 we keep the rows where at least 2 columns meet the condition, using the counts from the previous bullet.
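Run against the sample frame from the question, this keeps exactly the rows with two or more values above 1:
import pandas as pd

df = pd.DataFrame({0: [1, 1, 1, 1, 100],
                   1: [1, 3, 1, 1, 1],
                   2: [1, 3, 1, 1, 4],
                   3: [1, 1, 1, 1, 1],
                   4: [3, 4, 1, 1, 5],
                   5: [1, 1, 1, 1, 1]})
print(df[(df > 1).sum(axis=1) >= 2])
# keeps row 1 (values 3, 3, 4) and row 4 (values 100, 4, 5);
# every other row has at most one value above 1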
The value of x in the lambda is a Series, which can be indexed like this.
df[9] = df.apply(lambda x: x[x > 2].count(), axis=1)
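With that count column in place, the subset from the question is then one more filter (keeping the > 2 threshold from the attempted code):
df[df[9] >= 2]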

Calculate percentage of grouped values

I have a Pandas dataframe that looks like this:
I calculated the number of Win, Lost and Draw for each year, and now it looks like:
My goal is to calculate the percentage of each score group by year, to end up with something like:
But I'm stuck here.
I looked in this thread but was not able to apply it on my df.
Any thoughts?
Here is quite a simple method I wrote for this task:
create a dataframe of the total score within each year:
total_score = df.groupby('year')['score'].sum().reset_index(name = 'total_score_each_year')
merge the original and the new dataframe into a single dataframe:
df = df.merge(total_score, on = 'year')
calculate the percents:
df['percentages'] = 100 * (df['score'] / df['total_score_each_year'])
That's it, I hope it helps :)
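A minimal worked example of the three steps end to end (the year/score values here are invented):
import pandas as pd

df = pd.DataFrame({'year': [2020, 2020, 2021, 2021],
                   'result': ['Win', 'Lost', 'Win', 'Draw'],
                   'score': [30, 10, 5, 15]})
total_score = df.groupby('year')['score'].sum().reset_index(name='total_score_each_year')
df = df.merge(total_score, on='year')
df['percentages'] = 100 * (df['score'] / df['total_score_each_year'])
# 2020 rows become 75.0 and 25.0; 2021 rows become 25.0 and 75.0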
You could try using df.iat[row, column]. It would look something like this:
percentages = []
for i in range(0, len(df), 3):  # rows come in groups of three (draws, losses, wins) per year
    draws = df.iat[i, 2]
    losses = df.iat[i + 1, 2]
    wins = df.iat[i + 2, 2]
    nbr_of_games = draws + losses + wins
    percentages.append(draws * 100 / nbr_of_games)   # percentage of draws
    percentages.append(losses * 100 / nbr_of_games)  # percentage of losses
    percentages.append(wins * 100 / nbr_of_games)    # percentage of wins
df["percentage"] = percentages
This may not be the fastest way to do it, but I hope it helps!
Similar to @panter's answer, but in only one line and without creating any additional DataFrame:
df['percentage'] = df.merge(df.groupby('year').score.sum(), on='year', how='left').apply(
lambda x: x.score_x * 100 / x.score_y, axis=1
)
In detail:
df.groupby('year').score.sum() creates a DataFrame with the sum of the score per year.
df.merge creates a DataFrame equal to the original df, but with the column score renamed to score_x and an additional column score_y that holds the sum of all the scores for that row's year; how='left' keeps every row of the left DataFrame, i.e., df.
.apply computes the corresponding percentage for each row, using score_x and score_y (mind the axis=1 option, which applies the lambda row by row).
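As an aside, not part of either answer above: the same result is often written with groupby(...).transform('sum'), which broadcasts each year's total back onto its rows and avoids both the merge and the row-wise apply:
df['percentage'] = df['score'] * 100 / df.groupby('year')['score'].transform('sum')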

How does the map() function work in Python?

I want to apply the numpy average function to a pandas dataframe object. Since I want to apply this function to the row-wise elements of the dataframe, I used the map function. The code is as follows:
df = pd.DataFrame(np.random.rand(5,3),columns = ['Col1','Col2','Col3'])
df_averge_row = df.apply(np.average(weights=[[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5]]),axis=0)
Unfortunately, it is not working. Any suggestion would be helpful.
Since you have 3 columns in each row and are applying the function row-wise (not column-wise) per your question, the weights argument can only have 3 elements (one per column in a given row, say [1, 2, 3]):
df = pd.DataFrame(np.random.rand(5,3),columns = ['Col1','Col2','Col3'])
weights = [1, 2, 3]
df_averge_row = df.apply(lambda x: np.average(x, weights=weights),axis=1)
df_averge_row
out:
0 0.618617
1 0.757778
2 0.551463
3 0.497654
4 0.755083
dtype: float64
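If the 5-element weight rows from the question were instead meant as one weight per dataframe row, the column-wise counterpart is just as short (a sketch, reusing the same kind of frame):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=['Col1', 'Col2', 'Col3'])
row_weights = [1, 2, 3, 4, 5]  # one weight per row
col_averages = np.average(df.to_numpy(), weights=row_weights, axis=0)  # one value per column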

Exporting a list as a new column in a pandas dataframe as part of a nested for loop

I am inputting multiple spreadsheets with multiple columns of data. For each spreadsheet, the maximum value of each column is found. Then, for each element in the column, the element is divided by the maximum value of that column. The output should be a value (between 0 and 1) for each element in the column in ascending order. This is appended to a list which should be added to the source spreadsheet as a column.
Currently, the nested loops perform correctly apart from the final step, as far as I understand. Each column is added to the spreadsheet, EXCEPT that the values are those of the final column of the source spreadsheet rather than values related to each individual column.
I have tried changing the indents to associate levels of the code with different parts (as I think this is the problem) and tried moving the appended column along in the dataframe, to no avail.
for i in distlist:
    #listname = i[4:] + '_norm'
    df2 = pd.read_excel(i, header=0, index_col=None, skip_blank_lines=True)
    df3 = df2.dropna(axis=0, how='any')
    cols = []
    for column in df3:
        cols.append(column)
    for x in cols:
        listname = x + ' norm'
        maxval = df3[x].max()
        print(maxval)
        mylist = []
        for j in df3[x]:
            findNL = (j/maxval)
            mylist.append(findNL)
        df3[listname] = mylist
    saveloc = 'E:/test/'
    filename = i[:-18] + '_Normalised.xlsx'
    df3.to_excel(saveloc+filename, index=False)
New columns are added to the output dataframe with bespoke headings relating to the field headers in the source spreadsheet and renamed according to (listname). The data in each one of these new columns is identical and relates to the final column in the spreadsheet. To me, it seems to be overwriting the values each time (as if looping through the entire spreadsheet, not outputting for each column), and adding it to the spreadsheet.
Any help would be much appreciated. I think it's something simple, but I haven't managed to work out what...
If I understand you correctly, you are overcomplicating things. You don't need a for loop for this. You can simplify your code:
# Make an example dataframe, since none was provided
df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [5, 6, 7, 8]})
print(df)
col1 col2
0 1 5
1 2 6
2 3 7
3 4 8
Now we can use DataFrame.apply with add_suffix to give the new columns a _norm suffix, and after that concat the columns into one final dataframe:
df_conc = pd.concat([df, df.apply(lambda x: x/x.max()).add_suffix('_norm')],axis=1)
print(df_conc)
col1 col2 col1_norm col2_norm
0 1 5 0.25 0.625
1 2 6 0.50 0.750
2 3 7 0.75 0.875
3 4 8 1.00 1.000
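Applied back to the original per-file loop, the whole job collapses to a few lines. This is only a sketch: distlist, the ' norm' suffix, and the save paths are taken from the question, and it assumes every remaining column is numeric:
import pandas as pd

saveloc = 'E:/test/'
for i in distlist:
    df3 = pd.read_excel(i, header=0, index_col=None).dropna(axis=0, how='any')
    df_conc = pd.concat([df3, (df3 / df3.max()).add_suffix(' norm')], axis=1)
    df_conc.to_excel(saveloc + i[:-18] + '_Normalised.xlsx', index=False)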
Many thanks. I think I was just overcomplicating it. Incidentally, I think my code may do the same job, but because there is so little difference in the values, it wasn't noticeable.
Thanks for your help @Erfan
