Take a column from a Dataframe and normalize all of the other columns against it? - python-3.x

I've got a Dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(np.arange(0, 9), (3, 3)))
print(df)
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
I'd like to normalize two of the columns against a reference column. For example, if I chose df[0] as my reference column, then df[1] and df[2] would also have a mean of 3 and a standard deviation of 3.
What's the best way to do this?

You can shift and scale the values in each column by the mean and standard deviation of the reference column ref: standardize each column (subtract its mean, divide by its standard deviation), then rescale by the reference's statistics:
ref = 0
means = df.mean()
stds = df.std()
(df - means) / stds * stds[ref] + means[ref]
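As a quick sanity check on the example frame (a minimal sketch; normalized is just an illustrative name), every column of the result should now share df[0]'s moments:
normalized = (df - means) / stds * stds[ref] + means[ref]
print(normalized.mean())  # 3.0 for every column
print(normalized.std())   # 3.0 for every column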

Related

python3.7 & pandas - use column value in row as lookup value to return different column value

I've got a tricky situation - tricky for me since I'm really new to python. I've got a dataframe in pandas and I need to logic my way through building a new column that will be used later in a data match from a different source. Basically, the picture shows what I can't figure out.
For any of the LOW labels I need to retrieve their MID_LEVEL label and copy it to a new column. The DESIRED OUTPUT column is what I need to create.
You can see that the LABEL_PATH is formatted in a way that I can use the first 9 characters as a "lookup" to find the corresponding LABEL, but I can't figure out how to achieve that. As an example, for any row whose LABEL_PATH starts with "0.02.0004" the desired output needs to be "MID_LEVEL1".
This dataset has around 25k rows, so wanted to avoid row iteration as well.
Any help would be greatly appreciated!
Choosing a similar example to yours:
df = pd.DataFrame({"a":["1","1.1","1.1.1","1.1.2","2"],"b":range(5)})
df["c"] = np.nan
mask = df.a.apply(lambda x: len(x.split(".")) < 3)
df.loc[mask,"c"] = df.b[mask]
df.c.fillna(method="ffill", inplace=True)
Most of the magic takes place in the line where mask is defined, but it's not that difficult: if the value in a splits into fewer than 3 parts (i.e., has at most one dot), mark it as True, otherwise False.
Use that mask to copy over the values, and then forward-fill the unspecified values with valid values from above.
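For reference, the frame then ends up as follows (c carries each shallow row's b value down to its children):
       a  b    c
0      1  0  0.0
1    1.1  1  1.0
2  1.1.1  2  1.0
3  1.1.2  3  1.0
4      2  4  4.0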
I am using this data for comparison:
test_dict = {
    "label_path": [1, 2, 3, 4, 5, 6],
    "label": ["low1", "low2", "mid1", "mid2", "high1", "high2"],
    "desired_output": ["mid1", "mid2", "mid1", "mid2", "high1", "high2"],
}
df = pd.DataFrame(test_dict)
Which gives:
   label_path  label desired_output
0           1   low1           mid1
1           2   low2           mid2
2           3   mid1           mid1
3           4   mid2           mid2
4           5  high1          high1
5           6  high2          high2
With a bit of logic and a merge:
desired_label_df = df.drop_duplicates("desired_output", keep="last")
desired_label_df = desired_label_df[["label_path", "desired_output"]]
desired_label_df.columns = ["desired_label_path", "desired_output"]
df = df.merge(desired_label_df, on="desired_output", how="left")
Gives us:
   label_path  label desired_output  desired_label_path
0           1   low1           mid1                   3
1           2   low2           mid2                   4
2           3   mid1           mid1                   3
3           4   mid2           mid2                   4
4           5  high1          high1                   5
5           6  high2          high2                   6
Edit: if you want to create the desired_output column, just do the following:
df["desired_output"] = df["label"].apply(lambda x: x.replace("low", "mid"))

is it possible to manually assign the value of the dummy variable?

I have a data set for automotive sales and I want to convert the feature 'aspiration', which contains two unique values 'std' & 'turbo', to dummy variables using pd.get_dummies, with the code below:
dummy_variable_2 = pd.get_dummies(df['aspiration'])
It automatically assigns 0 to 'std' & 1 to 'turbo'.
I would like 'std' to be 1 & 'turbo' to be 0.
pd.get_dummies returns a dataframe that contains one column for each unique value in the input; in each column, only the rows holding the corresponding unique value are set to one.
In your case, the dataframe contains two columns, one for turbo and one for std (prefixed with the original column name when you pass the whole frame). If you want the column where the values of std are set to one, do the following:
df = pd.DataFrame({"aspiration":["std", "turbo", "std", "std", "std", "turbo"]})
dummies = pd.get_dummies(df)
std= dummies["aspiration_std"]
In this example, the variable dummies looks like:
   aspiration_std  aspiration_turbo
0               1                 0
1               0                 1
2               1                 0
3               1                 0
4               1                 0
5               0                 1
and std looks like:
0    1
1    0
2    1
3    1
4    1
5    0
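If the goal is simply a single column with hand-picked codes rather than one column per category, Series.map does exactly that (a minimal sketch; aspiration_code is just an illustrative name):
# map each category to the code you want: std -> 1, turbo -> 0
df["aspiration_code"] = df["aspiration"].map({"std": 1, "turbo": 0})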

Compare two classes with range of Marks

I have a dataframe with two classes (A or B) and marks and I want to present the mark ranges per class.
Dataframe:
Class  Mark  Department
A      74.0  1
A      73.0  2
B      72.0  1
A      75.0  1
B      64.0  2
What I want to achieve:
Class  Mark Range
A      73.0-75.0
B      64.0-72.0
and I was thinking of using min/max (creating a new field for the range). But as a start, I tried to just group it:
df['count'] = 1
result = df.pivot_table('count', index='Mark', columns='Class', aggfunc='sum').fillna(0)
which got complex, so I quickly abandoned it.
I then kept only two columns in my dataframe (Mark and Class) and used the following:
df[['Mark','Class']].values
And now I just have to create the Mark range column. I was wondering whether there is a simpler way, without all these steps, to pivot the data and get the range (min/max of one column grouped by another).
We can use GroupBy.apply to get the min and max per group and represent them as a string with an f-string:
df = (
    df.groupby('Class')['Mark']
      .apply(lambda x: f'{x.min()}-{x.max()}')
      .reset_index(name='Mark Range')
)
  Class Mark Range
0     A  73.0-75.0
1     B  64.0-72.0
Simple but ugly:
temp = df.groupby('Class')['Mark'].agg(['min', 'max'])
temp['range'] = temp['min'].map(str) + '-' + temp['max'].map(str)
Result of temp[['range']]:
           range
Class
A      73.0-75.0
B      64.0-72.0
If you are interested in using pivot_table:
df_new = (df.pivot_table('Mark', 'Class', aggfunc=lambda x: f'{x.min()}-{x.max()}')
            .add_suffix(' Range').reset_index())
Out[1543]:
  Class Mark Range
0     A  73.0-75.0
1     B  64.0-72.0
As per your comment, to add Department just use the list ['Class', 'Department'] for the index, as follows:
df_new = (df.pivot_table('Mark', ['Class', 'Department'],
                         aggfunc=lambda x: f'{x.min()}-{x.max()}')
            .add_suffix(' Range').reset_index())
Out[259]:
  Class  Department Mark Range
0     A           1  74.0-75.0
1     A           2  73.0-73.0
2     B           1  72.0-72.0
3     B           2  64.0-64.0
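If numeric bounds are more useful downstream than a formatted string, named aggregation (available since pandas 0.25) yields them directly; a minimal sketch on the same frame, where mark_min and mark_max are just illustrative names:
# one numeric column per bound; format as a string only at the end if needed
rng = df.groupby('Class')['Mark'].agg(mark_min='min', mark_max='max').reset_index()
rng['Mark Range'] = rng['mark_min'].astype(str) + '-' + rng['mark_max'].astype(str)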

Renaming columns in dataframe w.r.t another specific column

BACKGROUND: Large excel mapping file with about 100 columns and 200 rows, converted to .csv and then stored as a dataframe. General format of the df is below.
It starts with a named column (e.g. Sales) and the following two columns need to be renamed. This pattern repeats for all columns in the excel file.
Essentially: link the two subsequent columns to the "parent" column preceding them.
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 x x
2 x x
3 x x
APPROACH FOR SOLUTION: I assume it would be possible to begin with the index of a named column (e.g. the index x of the Sales column) and then rename the following two columns as (x+1) and (x+2).
Then take in the text for the next named column (e.g. Validation) and so on.
I know the rename() function for dataframes,
BUT I am not sure how to apply it iteratively to change the column titles.
EXPECTED OUTPUT: Unnamed 2 & 3 changed to Sales_Commented and Sales_No_Comment, respectively.
Similarly Unnamed 5 & 6 change to Validation_Commented and Validation_No_Comment.
Again, repeated for all 100 columns of file.
EDIT: Due to the large number of cols in the file, creating a manual list to store column names is not a viable solution. I have already seen this elsewhere on SO. Also, the amount of columns and departments (Sales, Validation) changes in different excel files with the mapping. So a dynamic solution is required.
Sales Sales_Commented Sales_No_Comment Validation Validation_Commented Validation_No_Comment
0 Commented No comment Commented No comment
1 x x
2 x
3 x x x
As a python novice, I considered a possible approach using the limited knowledge I have, but I am not sure what this would look like as workable code.
I would appreciate all help and guidance.
1. Make a list with the column names that you want.
2. Make it a dict with the old column names as the keys and the new column names as the values.
3. Use df.rename(columns=your_dictionary).
import numpy as np
import pandas as pd
df = pd.read_excel("name of the excel file", sheet_name="name of sheet")
print(df.head())
Output>>>
   Sales  Unnamed: 2  Unnamed: 3  Validation  Unnamed: 5  Unnamed: 6  Unnamed: 7
0    NaN   Commented  No comment         NaN     Comment  No comment       Extra
1    1.0           2           1         1.0           1           1           1
2    3.0           1           1         1.0           1           1           1
3    4.0           3           4         5.0           5           6           6
4    5.0           1           1         1.0          21           3           6
# get new names based on the row-0 values under each named column
new_column_names = []
counter = 0
for col_name in df.columns:
    if col_name[:7].strip() == "Unnamed":
        # prefix the parent column's name to the row-0 text, e.g. Sales_Commented
        new_column_names.append(base_name + "_" + df.iloc[0, counter].replace(" ", "_"))
    else:
        base_name = col_name
        new_column_names.append(base_name)
    counter += 1
# convert to a dict of old name -> new name pairs
dictionary = dict(zip(df.columns.tolist(), new_column_names))
# rename columns
df = df.rename(columns=dictionary)
# drop the first row (it only held the Commented/No comment labels)
df = df.iloc[1:].reset_index(drop=True)
print(df.head())
Output>>>
   Sales  Sales_Commented  Sales_No_comment  Validation  Validation_Comment  Validation_No_comment  Validation_Extra
0    1.0                2                 1         1.0                   1                      1                 1
1    3.0                1                 1         1.0                   1                      1                 1
2    4.0                3                 4         5.0                   5                      6                 6
3    5.0                1                 1         1.0                  21                      3                 6
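A more vectorized sketch of the same idea, assuming the frame is as freshly read (row 0 still holds the Commented/No comment labels); Series.where and ffill are standard pandas:
# forward-fill the parent names across the Unnamed columns
cols = pd.Series(df.columns)
parents = cols.where(~cols.str.startswith("Unnamed")).ffill()
# suffix each Unnamed column's parent with the text from row 0
df.columns = [
    old if not old.startswith("Unnamed")
    else f"{parent}_{str(val).replace(' ', '_')}"
    for old, parent, val in zip(cols, parents, df.iloc[0])
]
df = df.iloc[1:].reset_index(drop=True)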

How to remove duplicates rows by same values in different order in dataframe by pandas

How to remove the duplicates in the df? df only has 1 column. In this case "60,25" and "25,60" are a pair of duplicated rows. The output should be the new df: for each pair of duplicated rows, keep the row in the format "A,B" where A < B and remove the one where A > B. In this case, "25,60" and "80,123" should be kept. A unique row should stay whatever it is.
IIUC, you can use get_dummies with duplicated: str.get_dummies(sep=',') encodes each row as a 0/1 membership vector over the values it contains, so rows with the same values in any order become identical and duplicated can flag them.
df[~df.A.str.get_dummies(sep=',').duplicated()]
Out[956]:
       A
0    A,C
1    A,B
4  X,Y,Z
Data input:
df
Out[957]:
       A
0    A,C
1    A,B
2    C,A
3    B,A
4  X,Y,Z
5  Z,Y,X
Update: the OP changed the question to a totally different one. The same dummies can be stitched back into strings with a dot product against the column names (each suffixed with a comma); .str[:-1] trims the trailing comma:
newdf = df.A.str.get_dummies(sep=',')
newdf[~newdf.duplicated()].dot(newdf.columns + ',').str[:-1]
Out[976]:
0     25,60
1    123,37
dtype: object
I'd do a combination of things.
Use pandas.Series.str.split to split by commas
Use apply(frozenset) to get a hashable set such that I can use duplicated
Use pandas.Series.duplicated with keep='last'
df[~df.A.str.split(',').apply(frozenset).duplicated(keep='last')]
        A
1  123,17
3  80,123
4   25,60
5   25,42
Addressing comments
df.A.apply(
    lambda x: tuple(sorted(map(int, x.split(','))))
).drop_duplicates().apply(
    lambda x: ','.join(map(str, x))
)

0     25,60
1    17,123
2    80,123
5     25,42
Name: A, dtype: object
Setup
df = pd.DataFrame(dict(
    A='60,25 123,17 123,80 80,123 25,60 25,42'.split()
))
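The stated rule can also be encoded literally, keeping the A < B spelling when a pair exists and leaving lone rows untouched; a sketch against the same setup:
# order-insensitive identity of each row
key = df.A.map(lambda s: tuple(sorted(map(int, s.split(',')))))
# True when the row is already written smallest-first (A < B)
is_sorted = df.A.map(lambda s: s == ','.join(sorted(s.split(','), key=int)))
# keep lone rows as-is; among duplicates keep only the A < B spelling
print(df[~key.duplicated(keep=False) | is_sorted])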
