Calculates the difference of a DataFrame element compared with another element in the DataFrame columns - python-3.x

I have a pivot_table, e.g.
WEEK   W1  W2  ...  Wn
col_1
A       1   2  ...   n
B       1   2  ...   n
C       1   2  ...   n
...
I wonder if I can get the difference of Wn and Wn-1 at once?
WEEK   W1  W2  ...  Wn
col_1
A       0   1  ...   1
B       0   1  ...   1
C       0   1  ...   1
...
I found pandas.DataFrame.diff() but don't know how to use it correctly. Thanks for any suggestions!

The function diff calculates the difference between each row and the previous one and returns a DataFrame of the same size as the one it is applied to.
This leaves your first row NaN because there is no row before the first row. If you want to calculate the difference between columns instead, use df.diff(axis=1); this returns a DataFrame whose first column is NaN. If you want the difference between the last two columns only, apply diff to just those columns:
df.iloc[:, -2:].diff(axis=1)
This selects the last two columns of your DataFrame and returns a new frame with two columns, one NaN and the other holding the difference Wn - Wn-1.
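A minimal, self-contained sketch of both variants, using a small made-up pivot table with weeks W1-W3 (the data and column names are illustrative, not taken from the question):
import pandas as pd

# Illustrative pivot table: rows are categories, columns are weeks
pt = pd.DataFrame(
    {'W1': [1, 1, 1], 'W2': [2, 2, 2], 'W3': [4, 5, 6]},
    index=pd.Index(['A', 'B', 'C'], name='col_1'),
)

# Difference between every week and the previous one (the first column becomes NaN)
week_diffs = pt.diff(axis=1)

# Difference between the last two weeks only (Wn - Wn-1)
last_diff = pt.iloc[:, -2:].diff(axis=1).iloc[:, -1]

print(week_diffs)
print(last_diff)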

Related

Compare two classes with range of Marks

I have a dataframe with two classes (A or B) and marks and I want to present the mark ranges per class.
Dataframe:
Class  Mark  Department
A      74.0  1
A      73.0  2
B      72.0  1
A      75.0  1
B      64.0  2
What I want to achieve:
Class  Mark Range
A      73.0-75.0
B      64.0-72.0
and I was thinking of using the min max (creating a new field for the range). But as a start, I tried to just group it:
df['count'] = 1
result = df.pivot_table('count', index='Mark', columns='Class', aggfunc='sum').fillna(0)
which is complex and I abandoned this quickly.
I then kept only two columns in my dataframe (Mark and Class) and used the following:
df[['Mark','Class']].values
And now I just have to create the Mark Range column. I was wondering whether there is a simpler way to pivot the data and get the range (the min-max of column A grouped by column B) without these intermediate steps.
We can use GroupBy.apply to get the max and min per group and represent them as a string with f-strings:
df = (
    df.groupby('Class')['Mark']
      .apply(lambda x: f'{x.min()}-{x.max()}')
      .reset_index(name='Mark Range')
)
  Class Mark Range
0     A  73.0-75.0
1     B  64.0-72.0
Simple but ugly:
temp = df.groupby('Class')['Mark'].agg(['min', 'max'])
temp['range'] = temp['min'].map(str) + '-' + temp['max'].map(str)
Result of doing temp[['range']]:
range
Class
A 73.0-75.0
B 64.0-72.0
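If you are on a recent pandas version, named aggregation is another way to keep the intermediate columns readable before joining them; a sketch of the same idea (the column names lo and hi are just illustrative):
import pandas as pd

df = pd.DataFrame({
    'Class': ['A', 'A', 'B', 'A', 'B'],
    'Mark': [74.0, 73.0, 72.0, 75.0, 64.0],
    'Department': [1, 2, 1, 1, 2],
})

# Named aggregation: keep min and max as real columns, then join them as text
temp = df.groupby('Class')['Mark'].agg(lo='min', hi='max')
temp['Mark Range'] = temp['lo'].astype(str) + '-' + temp['hi'].astype(str)
print(temp[['Mark Range']].reset_index())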
If you are interested in using pivot_table:
df_new = (df.pivot_table('Mark', 'Class', aggfunc=lambda x: f'{x.min()}-{x.max()}')
            .add_suffix(' Range').reset_index())
Out[1543]:
Class Mark Range
0 A 73.0-75.0
1 B 64.0-72.0
As per your comment: to add Department, just use the list ['Class', 'Department'] for the index, as follows:
df_new = (df.pivot_table('Mark', ['Class', 'Department'],
                         aggfunc=lambda x: f'{x.min()}-{x.max()}')
            .add_suffix(' Range').reset_index())
Out[259]:
Class Department Mark Range
0 A 1 74.0-75.0
1 A 2 73.0-73.0
2 B 1 72.0-72.0
3 B 2 64.0-64.0

Turn columns into rows, copying and keeping the same index in front of each of the previous columns' values

This is the inverse of "Transpose columns to rows keeping first 3 columns the same".
Turn:
id  col1  col2  col3
1   A     B
2   X     Y     Z
into:
id
1 A
1 B
2 X
2 Y
2 Z
I'm trying to unpivot, but from the solution I cited, it seems I need to use .unstack()?
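No answer is recorded here, but one way to get this shape is melt rather than unstack; a sketch assuming the blank cell in the example is NaN (column names follow the example above):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'col1': ['A', 'X'],
    'col2': ['B', 'Y'],
    'col3': [None, 'Z'],  # the blank cell in the example
})

# Melt the value columns into rows, drop the blanks, and keep id next to each value
long_df = (df.melt(id_vars='id', value_name='value')
             .dropna(subset=['value'])
             .sort_values('id')[['id', 'value']])
print(long_df)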

I want to count the occurrence of duplicate values in a column in a dataframe and update the count in a new column in python

Example: Let's say I have a df
Id
A
B
C
A
A
B
It should look like:
Id  count
A   1
B   1
C   1
A   2
A   3
B   2
Note: I've tried the for-loop and while-loop approaches; they work for small datasets but take a lot of time for large ones.
count = 0
for i in df['Id']:
    for j in df['Id']:
        if i == j:
            count += 1
You can groupby with cumcount, like this:
df['counts'] = df.groupby('Id', sort=False).cumcount() + 1
df
Id counts
0 A 1
1 B 1
2 C 1
3 A 2
4 A 3
5 B 2
Alternatively, if you only need the total number of occurrences of each value (rather than the running count above):
dups_values = df.pivot_table(index=['Id'], aggfunc='size')
print(dups_values)
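A self-contained sketch contrasting the two answers, using the example data from the question (cumcount gives the running count, value_counts the totals):
import pandas as pd

df = pd.DataFrame({'Id': ['A', 'B', 'C', 'A', 'A', 'B']})

# Running count: 1 for the first occurrence of a value, 2 for the second, and so on
df['count'] = df.groupby('Id', sort=False).cumcount() + 1

# Total count per value (one row per distinct Id)
totals = df['Id'].value_counts()

print(df)
print(totals)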

Take the mean of n numbers in a DataFrame column and "drag" formula down similar to Excel

I'm trying to take the mean of n numbers in a pandas DataFrame column and "drag" the formula down each row to get the respective mean.
Let's say there are 6 rows of data with "Numbers" in column A and "Averages" in column B. I want to take the average of A1:A2, then "drag" that formula down to get the average of A2:A3, A3:A4, etc.
list = [55,6,77,75,9,127,13]
finallist = pd.DataFrame(list)
finallist.columns = ['Numbers']
Below gives me the average of rows 0:2 in the Numbers column. So calling out the rows with .iloc[0:2] works, but when I try to shift down a row it doesn't work:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2])
Below I'm trying to take the average of the first two rows, then shift down by 1 as you move down the rows, but I get a value of NaN:
finallist['Average'] = statistics.mean(finallist['Numbers'].iloc[0:2].shift(1))
I expected the .iloc[0:2].shift(1) to shift the mean function down 1 row but still apply to 2 total rows, but I got a value of NaN.
What's happening in your shift(1) approach is that you're actually shifting the values in your data "down" by one row, so this code:
df['Numbers'].iloc[0:2].shift(1)
Produces the output:
0 NaN
1 55.0
Then you take the average of these two, which evaluates to NaN, and then you assign that single value to every element of the Averages Series here:
df['Averages'] = statistics.mean(df['Numbers'].iloc[0:2].shift(1))
You can instead use rolling() combined with mean() to get a sliding average across the entire data frame like this:
import pandas as pd
values = [55,6,77,75,9,127,13]
df = pd.DataFrame(values)
df.columns = ['Numbers']
df['Averages'] = df['Numbers'].rolling(2, min_periods=1).mean()
This produces the following output:
Numbers Averages
0 55 55.0
1 6 30.5
2 77 41.5
3 75 76.0
4 9 42.0
5 127 68.0
6 13 70.0
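Note that rolling(2) pairs each row with the row above it, whereas the A1:A2, A2:A3 pattern in the question pairs each row with the row below it. A sketch of that forward-looking variant, assuming the same df as above (the last row has no next value, so it ends up NaN):
# Average of the current row and the next one, like =AVERAGE(A1:A2) dragged down
df['Averages'] = (df['Numbers'] + df['Numbers'].shift(-1)) / 2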

Renaming columns in dataframe w.r.t another specific column

BACKGROUND: Large excel mapping file with about 100 columns and 200 rows converted to .csv. Then stored as dataframe. General format of df as below.
Starts with a named column (e.g. Sales) and following two columns need to be renamed. This pattern needs to be repeated for all columns in excel file.
Essentially: Link the subsequent 2 columns to the "parent" one preceding them.
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 x x
2 x x
3 x x
APPROACH FOR SOLUTION: I assume it would be possible to begin with an index (e.g. index of Sales column 1 = x) and then rename the following two columns as (x+1) and (x+2).
Then take in the text for the next named column (e.g. Validation) and so on.
I know the rename() function for dataframes.
BUT, I am not sure how to apply it iteratively to change the column titles.
EXPECTED OUTPUT: Unnamed 2 & 3 changed to Sales_Commented and Sales_No_Comment, respectively.
Similarly Unnamed 5 & 6 change to Validation_Commented and Validation_No_Comment.
Again, repeated for all 100 columns of file.
EDIT: Due to the large number of columns in the file, creating a manual list to store column names is not a viable solution. I have already seen this elsewhere on SO. Also, the number of columns and departments (Sales, Validation) changes between the excel files that hold the mapping, so a dynamic solution is required.
Sales Sales_Commented Sales_No_Comment Validation Validation_Commented Validation_No_Comment
0 Commented No comment Commented No comment
1 x x
2 x
3 x x x
As a Python novice, I considered a possible approach using the limited knowledge I have, but I am not sure what it would look like as workable code.
I would appreciate all help and guidance.
1. Make a list with the column names that you want.
2. Make a dict with the old column names as the keys and the new column names as the values.
3. Use df.rename(columns=your_dictionary).
import numpy as np
import pandas as pd
df = pd.read_excel("name of the excel file",sheet_name = "name of sheet")
print(df.head())
Output>>>
Sales Unnamed : 2 Unnamed : 3 Validation Unnamed : 5 Unnamed : 6 Unnamed :7
0 NaN Commented No comment NaN Comment No comment Extra
1 1.0 2 1 1.0 1 1 1
2 3.0 1 1 1.0 1 1 1
3 4.0 3 4 5.0 5 6 6
4 5.0 1 1 1.0 21 3 6
# build new names: prefix each "Unnamed" column with the preceding named column
# and the label found in the first data row
new_column_names = []
counter = 0
for col_name in df.columns:
    if col_name[:7].strip() == "Unnamed":
        new_column_names.append(base_name + "_" + df.iloc[0, counter].replace(" ", "_"))
    else:
        base_name = col_name
        new_column_names.append(base_name)
    counter += 1
# map the old column names to the new ones
dictionary = dict(zip(df.columns.tolist(), new_column_names))
# rename columns
df = df.rename(columns=dictionary)
# drop the first row (it only held the Commented / No comment labels)
df = df.iloc[1:].reset_index(drop=True)
print(df.head())
Output>>
Sales Sales_Commented Sales_No_comment Validation Validation_Comment Validation_No_comment Validation_Extra
0 1.0 2 1 1.0 1 1 1
1 3.0 1 1 1.0 1 1 1
2 4.0 3 4 5.0 5 6 6
3 5.0 1 1 1.0 21 3 6
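A self-contained sketch of the same renaming logic, using a small made-up frame instead of the Excel file so it can be run directly (the values and labels are illustrative):
import pandas as pd

# Toy frame mimicking the mapping-file layout: each named column is followed by unnamed ones
df = pd.DataFrame({
    'Sales': [None, 1.0, 3.0],
    'Unnamed: 2': ['Commented', 2, 1],
    'Unnamed: 3': ['No comment', 1, 1],
    'Validation': [None, 1.0, 1.0],
    'Unnamed: 5': ['Commented', 1, 1],
    'Unnamed: 6': ['No comment', 1, 1],
})

new_names = []
base_name = None
for pos, col_name in enumerate(df.columns):
    if col_name.startswith('Unnamed'):
        # prefix with the last named ("parent") column and the label in the first row
        new_names.append(f"{base_name}_{str(df.iloc[0, pos]).replace(' ', '_')}")
    else:
        base_name = col_name
        new_names.append(col_name)

df = df.rename(columns=dict(zip(df.columns, new_names))).iloc[1:].reset_index(drop=True)
print(df.columns.tolist())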
