Removing repetitive/duplicate occurrences in Excel using Python

I am trying to remove the repetitive/duplicate names in the NAME column. I just want to keep the 1st occurrence of each repeated name, using a Python script.
This is my input Excel file:
And I need the output like this:

This isn't removing duplicates per se; you're just blanking out duplicate keys in one column. I would handle this as follows:
Create a mask that returns a True/False boolean for whether each row is equal to the row above.
Assuming your dataframe is called df:
mask = df['NAME'].ne(df['NAME'].shift())
df.loc[~mask,'NAME'] = ''
Explanation:
What we are doing above is the following:
first we select a single column, or in pandas terminology a Series, then we apply .ne (not equal to), which in effect is !=.
Let's see this in action.
import pandas as pd
import numpy as np
# create data for dataframe
names = ['Rekha', 'Rekha','Jaya','Jaya','Sushma','Nita','Nita','Nita']
defaults = ['','','c-default','','','c-default','','']
classes = ['fourth','third','fourth','fifth','fourth','third','fifth','fourth']
Now, let's create a dataframe similar to yours:
df = pd.DataFrame({'NAME': names,
                   'DEFAULT': defaults,
                   'CLASS': classes,
                   'AGE': [np.random.randint(1, 5) for _ in names],
                   'GROUP': [np.random.randint(1, 5) for _ in names]})  # being lazy with your AGE and GROUP variables
So, if we did df['NAME'].ne('Omar'), which is the same as df['NAME'] != 'Omar', we would get:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
So, with that out of the way, we want to see whether the name in row 1 (remember Python is 0-indexed, so row 1 is actually the 2nd physical row) is equal to the name in the row above.
We do this by calling .shift.
What this basically does is shift the rows down by a defined number of positions; let's call this number n.
If we call df['NAME'].shift(1):
0 NaN
1 Rekha
2 Rekha
3 Jaya
4 Jaya
5 Sushma
6 Nita
7 Nita
We can see here that Rekha has moved down one row.
So, putting that all together:
df['NAME'].ne(df['NAME'].shift())
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
We assign this to a variable called mask; you could call it whatever you want.
We then use .loc, which lets you access your dataframe by labels or by a boolean array, in this instance an array.
However, we only want the rows where the boolean is False, so we use ~, which inverts the logic of our array:
    NAME DEFAULT   CLASS  AGE  GROUP
1  Rekha           third    1      4
3   Jaya           fifth    1      1
6   Nita           fifth    1      2
7   Nita          fourth    1      4
All we need to do now is change these NAME values to blanks, as per your initial requirement, and we are left with:
     NAME    DEFAULT   CLASS  AGE  GROUP
0   Rekha             fourth    2      2
1                      third    1      4
2    Jaya  c-default  fourth    3      3
3                      fifth    1      1
4  Sushma             fourth    3      1
5    Nita  c-default   third    4      2
6                      fifth    1      2
7                     fourth    1      4
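If you're starting from the Excel file itself rather than an already-loaded dataframe, a minimal end-to-end sketch could look like this (the file names here are placeholders, not from your post, and pandas will need an Excel engine such as openpyxl installed):
import pandas as pd

# 'input.xlsx' and 'output.xlsx' are placeholder file names
df = pd.read_excel('input.xlsx')

# blank out every NAME that repeats the row above, keeping the first occurrence
mask = df['NAME'].ne(df['NAME'].shift())
df.loc[~mask, 'NAME'] = ''

df.to_excel('output.xlsx', index=False)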
Hope that helps!

Related

How can I find the highest value between rows every time they meet a certain condition?

I have been struggling with a problem with my data frame, built in pandas, which currently looks like this:
MyDataFrame:
Index Status Value
0 A 10
1 A 8
2 A 5
3 B 9
4 B 5
5 A 1
6 B 2
7 A 3
8 A 5
9 A 1
The desired output would be:
Index Status Value
0 A 10
1 B 9
2 A 1
3 B 2
4 A 5
So far I have tried to use range and while conditions to filter; however, if I put a conditional like:
for i in range(len(MyDataFrame)):
    if Status[i] == "A":
        print(Value[i])
    if Status[i] == "B":
        break
(The code above is just an example of what I have been trying in order to reach my goal; I also tried .iloc and range with while, but maybe in the wrong way.)
The desired output isn't printed.
One thing that complicates this filtering process is that MyDataFrame changes every time I run the script, since it is built from a different base of data each time.
I believe that I'm missing something simple, but it has been almost a week and I can't figure it out.
Thanks in advance for all your answers and support.
Let us try using shift with cumsum to create the groupby key; then it is groupby + agg:
out = df.groupby(df.Status.ne(df.Status.shift()).cumsum()).agg({'Status':'first','Value':'max'})
Out[14]:
Status Value
Status
1 A 10
2 B 9
3 A 1
4 B 2
5 A 5
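To see why this works, here is what the grouping key by itself evaluates to on the example data (a small reproducible sketch, rebuilding the frame from the question):
import pandas as pd

df = pd.DataFrame({'Status': list('AAABBABAAA'),
                   'Value': [10, 8, 5, 9, 5, 1, 2, 3, 5, 1]})

# consecutive runs of the same Status share a label, so each run becomes one group
key = df.Status.ne(df.Status.shift()).cumsum()
print(key.tolist())   # [1, 1, 1, 2, 2, 3, 4, 5, 5, 5]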
Very close to @BEN_YO's answer:
grp = (df['Status'] != df['Status'].shift()).cumsum()
df.loc[df.groupby(grp)['Value'].idxmax()]
Output:
Status Value
Index
0 A 10
3 B 9
5 A 1
6 B 2
8 A 5
Create groups using shift and inequality with cumsum, then group by that key, find the index of the max 'Value' with idxmax, and filter the dataframe using loc.

Looping over rows in pandas

My data frame has IDs in its first column, as follows:
ID
A123
A234
A456
A123
A234
Now I need to create a new column, Indicator, which will put a 1 next to each ID that is repeated.
Desired Output:
ID Indicator
A123 1
A234 1
A456 0
A123 1
A234 1
This is a pretty simple operation in pandas once you get the hang of it, so you may want to invest some time in a tutorial. What you need to do is call the convenient duplicated() method of the ID column, which is an instance of pandas.core.series.Series. So:
import pandas as pd
df = pd.DataFrame(["A123", "A234", "A456", "A123", "A234"], columns=["ID"])
df.ID.duplicated()
0 False
1 False
2 False
3 True
4 True
Name: ID, dtype: bool
It returns a Series of boolean values. You can take that new Series and call its apply method, which returns a new Series built from the return values of the function you pass in. So, to turn each boolean into 0 or 1, all you need to do is apply int:
df.ID.duplicated().apply(int)  # or df["ID"].duplicated().apply(int)
0 0
1 0
2 0
3 1
4 1
Name: ID, dtype: int64
There are lots of other convenient methods on Series. If you need to do something more complicated, you can apply() a custom function, e.g.
def custom_function(value):
    return str(int(value))

df.ID.duplicated().apply(custom_function)
0 0
1 0
2 0
3 1
4 1
Name: ID, dtype: object
You can also use the DataFrame's own apply() to call functions across all rows or columns, specified using the axis argument.
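Note that duplicated() with its default arguments only flags the second and later occurrences, while the desired output in the question flags every occurrence of a repeated ID; passing keep=False gives that behavior. Combined with .astype(int), which is a vectorized alternative to apply(int), a short sketch:
import pandas as pd

df = pd.DataFrame({"ID": ["A123", "A234", "A456", "A123", "A234"]})

# keep=False flags every occurrence of a repeated ID, matching the desired output;
# .astype(int) converts the boolean mask to 0/1 without a per-element apply
df["Indicator"] = df.ID.duplicated(keep=False).astype(int)
print(df)
#      ID  Indicator
# 0  A123          1
# 1  A234          1
# 2  A456          0
# 3  A123          1
# 4  A234          1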

Renaming columns in dataframe w.r.t another specific column

BACKGROUND: A large Excel mapping file with about 100 columns and 200 rows, converted to .csv and then stored as a dataframe. The general format of the df is as below.
It starts with a named column (e.g. Sales), and the following two columns need to be renamed. This pattern needs to be repeated for all columns in the Excel file.
Essentially: Link the subsequent 2 columns to the "parent" one preceding them.
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 x x
2 x x
3 x x
APPROACH FOR SOLUTION: I assume it would be possible to begin with an index (e.g. index of Sales column 1 = x) and then rename the following two columns as (x+1) and (x+2).
Then take in the text for the next named column (e.g. Validation) and so on.
I know the rename() function for dataframes.
BUT, I am not sure how to apply it iteratively to change the column titles.
EXPECTED OUTPUT: Unnamed 2 & 3 changed to Sales_Commented and Sales_No_Comment, respectively.
Similarly Unnamed 5 & 6 change to Validation_Commented and Validation_No_Comment.
Again, repeated for all 100 columns of file.
EDIT: Due to the large number of cols in the file, creating a manual list to store column names is not a viable solution. I have already seen this elsewhere on SO. Also, the amount of columns and departments (Sales, Validation) changes in different excel files with the mapping. So a dynamic solution is required.
Sales Sales_Commented Sales_No_Comment Validation Validation_Commented Validation_No_Comment
0 Commented No comment Commented No comment
1 x x
2 x
3 x x x
As a python novice, I considered a possible approach for the solution using the limited knowledge I have, but not sure what this would look like as a workable code.
I would appreciate all help and guidance.
1. Make a list with the column names that you want.
2. Make a dict with the old column names as the keys and the new column names as the values.
3. Use df.rename(columns=your_dictionary); a minimal sketch of this step follows.
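For step 3 on its own, the rename call looks like this. The mapping below is a hypothetical hand-written example using the names from your expected output; the loop further down builds it automatically:
# hypothetical hand-written mapping; the loop below builds it automatically
rename_map = {"Unnamed: 2": "Sales_Commented",
              "Unnamed: 3": "Sales_No_Comment"}
df = df.rename(columns=rename_map)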
import numpy as np
import pandas as pd
df = pd.read_excel("name of the excel file",sheet_name = "name of sheet")
print(df.head())
Output>>>
Sales Unnamed : 2 Unnamed : 3 Validation Unnamed : 5 Unnamed : 6 Unnamed :7
0 NaN Commented No comment NaN Comment No comment Extra
1 1.0 2 1 1.0 1 1 1
2 3.0 1 1 1.0 1 1 1
3 4.0 3 4 5.0 5 6 6
4 5.0 1 1 1.0 21 3 6
# build new names: unnamed columns inherit the preceding named column plus their first-row label
new_column_names = []
counter = 0
for col_name in df.columns:
    if col_name[:7].strip() == "Unnamed":
        new_column_names.append(base_name + "_" + df.iloc[0, counter].replace(" ", "_"))
    else:
        base_name = col_name
        new_column_names.append(base_name)
    counter += 1
# convert to dict key pair
dictionary = dict(zip(df.columns.tolist(),new_column_names))
# rename columns
df = df.rename(columns=dictionary)
# drop the first row (the one holding the Commented / No comment labels)
df = df.iloc[1:].reset_index(drop=True)
print(df.head())
Output>>
Sales Sales_Commented Sales_No_comment Validation Validation_Comment Validation_No_comment Validation_Extra
0 1.0 2 1 1.0 1 1 1
1 3.0 1 1 1.0 1 1 1
2 4.0 3 4 5.0 5 6 6
3 5.0 1 1 1.0 21 3 6

How can I input values from a list or dataframe into each cell in an existing Excel file?

So basically, I want to update a worksheet with new data, overwriting existing cells in Excel. Both files have the same column names (I do not want to create a new workbook nor add a new column).
Here I am retrieving the data that I want:
import pandas as pd
df1 = pd.read_csv(...)
print(df1)
Output (I just copied and pasted the first few rows; there are about 500 rows total):
Index Type Stage CDID Period Index Value
0 812008000 6 2 JTV9 201706 121.570
1 812008000 6 2 JTV9 201707 121.913
2 812008000 6 2 JTV9 201708 121.686
3 812008000 6 2 JTV9 201709 119.809
4 812008000 6 2 JTV9 201710 119.841
5 812128000 6 1 K2VA 201706 122.030
The existing Excel file has the same columns (and row total) as df1, but I just want the 'Index' column repopulated with the new values. Let's just say it looks like this (i.e. I want the Index values from df1 above to go into the corresponding column):
Index Type Stage CDID Period Index Value
0 512901100 6 2 JTV9 201706 121.570
1 412602034 6 2 JTV9 201707 121.913
2 612307802 6 2 JTV9 201708 121.686
3 112808360 6 2 JTV9 201709 119.809
4 912233066 6 2 JTV9 201710 119.841
5 312128003 6 1 K2VA 201706 122.030
Here I am retrieving the Excel file, and attempting to overwrite it:
from win32com.client import Dispatch
import os

xl = Dispatch("Excel.Application")
xl.Visible = True
wbs_path = ('folder path')

for wbname in os.listdir(wbs_path):
    if not wbname.endswith("file name.xlsx"):
        continue
    wb = xl.Workbooks.Open(wbs_path + '\\' + wbname)
    sh = wb.Worksheets("sheet name")
    sh.Range("A1:A456").Value = df1[["Index"]]
    wb.Save()
    wb.Close()

xl.Quit()
But this doesn't do anything.
If I type in strings, such as:
sh.Range("A1:A456").Value = 'o', 'x', 'c'
this repeats o in cells A1 through A456 (it updates the spreadsheet), but ignores x and c. I have tried converting df1 into a list and a numpy array, but this doesn't work.
Does anyone know a solution or alternative workaround?
If the index of the dataframe is the same, you can update columns by using update(). It could work like this:
df1.update(df2['Index'].to_frame())
Note: the to_frame() is probably not needed.
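For illustration, here is a tiny self-contained example of how update() aligns on the row index and overwrites the matching column (toy data loosely based on the values in the question, not your actual files):
import pandas as pd

df1 = pd.DataFrame({'Index': [512901100, 412602034], 'CDID': ['JTV9', 'JTV9']})
df2 = pd.DataFrame({'Index': [812008000, 812008000]})

# values from df2 overwrite df1 where the row index and column name match
df1.update(df2['Index'].to_frame())
print(df1)  # the Index column now holds 812008000 in both rows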
EDIT:
Since you are trying to update an Excel file and not a dataframe, my answer is probably not enough.
For this part I would suggest loading the file into a dataframe, updating the data, and saving it:
df1 = pd.read_excel('file.xlsx', sheet_name='sheet_name')
# do the update
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
df1.to_excel(writer, sheet_name='sheet_name')
writer.save()
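If you would rather stay with the win32com approach, one likely reason the assignment silently does nothing (and why 'o', 'x', 'c' only repeats 'o') is that a COM Range expects one sequence per row when you write a vertical range, and it cannot take a DataFrame directly. A sketch, reusing the sh and df1 objects from your code above (this is an assumption about your layout, not tested against your file):
# one single-element tuple per row; int() avoids numpy scalar types,
# which the COM layer may refuse to convert
values = [(int(v),) for v in df1['Index']]
sh.Range("A1:A{}".format(len(values))).Value = values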

Using pandas style to give colors to some rows with a specific condition

This is the pandas output in Excel format:
Id comments number
1 so bad 1
1 so far 2
2 always 3
2 very good 4
3 very bad 5
3 very nice 6
3 so far 7
4 very far 8
4 very close 9
4 busy 10
I want to use pandas to give a color (for example, gray) to rows whose value in the Id column is even. For example, rows 3 and 4 have even Id numbers, but rows 5, 6 and 7 have odd Id numbers. Is there any way to do this with pandas?
As explained in the documentation (http://pandas.pydata.org/pandas-docs/stable/style.html), what you basically want to do is write a style function and apply it to the Styler object:
def _color_if_even(s):
    return ['background-color: grey' if val % 2 == 0 else '' for val in s]
and call it on the Styler object, i.e.,
df.style.apply(_color_if_even, subset=['Id'])
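Note that with subset=['Id'] only the Id cells get shaded. If you want the whole row colored, a row-wise variant (a sketch using axis=1, where the function returns one style per column) could be:
def _shade_row_if_even(row):
    # return one style string per column so the whole row gets shaded
    style = 'background-color: grey' if row['Id'] % 2 == 0 else ''
    return [style] * len(row)

df.style.apply(_shade_row_if_even, axis=1)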
