Groupby column value and keep row based on another column value - python-3.x

I have a DataFrame, df, that has a range of values such as:
ID Code
01 AB001
02 AB002
02 BC123
02 CD576
03 AB444
03 CD332
04 BC434
04 CD894
I want to remove all duplicates in the ID column and keep the row that has a certain value in Code. Suppose that if a Code starting with BC is available for an ID, I want to keep that row; otherwise, I want to take the first row with that ID. My desired output would look like:
ID Code
01 AB001
02 BC123
03 AB444
04 BC434
I want to do something like:
# 'x' is the sub-DataFrame of rows sharing one unique ID
def keep_row(x):
    # determine if 'BC' is even an available Code
    bc_rows = x[x['Code'].str.startswith('BC')]
    if not bc_rows.empty:
        # return the (first) row that has a Code that starts with BC
        return bc_rows.iloc[[0]]
    else:
        # return the first row with the unique ID if there is no Code that begins with BC
        return x.iloc[[0]]
df.groupby('ID', group_keys=False).apply(keep_row)
I'd appreciate any help - thanks.

You can sort your dataframe by ID and boolean value (False when Code starts with "BC"), then .groupby() and take first item:
df["tmp"] = ~df.Code.str.startswith("BC")
df = df.sort_values(by=["ID", "tmp"])
print(df.groupby("ID", as_index=False)["Code"].first())
Prints:
ID Code
0 1 AB001
1 2 BC123
2 3 AB444
3 4 BC434
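A variation on the same idea, as a minimal sketch rather than a definitive implementation: it rebuilds the question's data, sorts on a temporary boolean column only, and then drops duplicate IDs. The column name _not_bc and the variable result are illustrative assumptions, and the IDs are assumed to be stored as strings.

```python
import pandas as pd

# Sample data from the question (IDs assumed to be strings)
df = pd.DataFrame({
    "ID": ["01", "02", "02", "02", "03", "03", "04", "04"],
    "Code": ["AB001", "AB002", "BC123", "CD576", "AB444", "CD332", "BC434", "CD894"],
})

result = (
    # False sorts before True, so rows whose Code starts with "BC" come first
    df.assign(_not_bc=~df["Code"].str.startswith("BC"))
      # a stable sort keeps the original row order among ties
      .sort_values("_not_bc", kind="stable")
      # keep the first remaining row per ID
      .drop_duplicates(subset="ID")
      .sort_values("ID")
      .drop(columns="_not_bc")
      .reset_index(drop=True)
)
print(result)
```

Unlike the groupby version, this keeps any other columns of the kept row and leaves the original df unchanged.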

Related

Rank groups without duplicates [duplicate]

I am trying to get a unique rank value (e.g. {1, 2, 3, 4}) within each subgroup in my data. SUMPRODUCT will produce ties, e.g. {1, 1, 3, 4}, so I am trying to add a COUNTIFS to the end to adjust the duplicate ranks away.
subgroup
col B col M rank
LMN 01 1
XYZ 02
XYZ 02
ABC 03
ABC 01
XYZ 01
LMN 02 3
ABC 01
LMN 03 4
LMN 03 4 'should be 5
ABC 02
XYZ 02
LMN 01 1 'should be 2
So far, I've come up with this.
=SUMPRODUCT(($B$2:$B$38705=B2)*(M2>$M$2:$M$38705))+countifs(B2:B38705=B2,M2:M38705=M2)
What have I done wrong here?
The good news is that you can throw away the SUMPRODUCT function and replace it with a pair of COUNTIFS functions. The COUNTIFS can use full column references without detriment and is vastly more efficient than the SUMPRODUCT even with the SUMPRODUCT cell ranges limited to the extents of the data.
In N2 as a standard function,
=COUNTIFS(B:B, B2,M:M, "<"&M2)+COUNTIFS(B$2:B2, B2, M$2:M2, M2)
Fill down as necessary.
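For readers who want to check the logic of the two COUNTIFS terms outside Excel, here is a rough Python sketch. It is not part of the original answer: the function name countifs_rank is hypothetical, and the col M values are assumed to be numbers. The first term counts rows in the same subgroup with a smaller value; the second counts rows with the same subgroup and value from the top down to the current row.

```python
import pandas as pd

# Columns B and M from the question's sample (col M assumed numeric)
groups = ["LMN", "XYZ", "XYZ", "ABC", "ABC", "XYZ", "LMN",
          "ABC", "LMN", "LMN", "ABC", "XYZ", "LMN"]
values = [1, 2, 2, 3, 1, 1, 2, 1, 3, 3, 2, 2, 1]

def countifs_rank(groups, values):
    """Mirror COUNTIFS(B:B,B2,M:M,"<"&M2) + COUNTIFS(B$2:B2,B2,M$2:M2,M2)."""
    ranks = []
    for i, (g, v) in enumerate(zip(groups, values)):
        # rows anywhere in the same subgroup with a strictly smaller value
        lower = sum(1 for g2, v2 in zip(groups, values) if g2 == g and v2 < v)
        # rows from the top down to the current row with the same subgroup and value
        same_so_far = sum(1 for g2, v2 in zip(groups[:i + 1], values[:i + 1])
                          if g2 == g and v2 == v)
        ranks.append(lower + same_so_far)
    return ranks

ranks = countifs_rank(groups, values)

# pandas equivalent: method="first" breaks ties by order of appearance
df = pd.DataFrame({"B": groups, "M": values})
df["rank"] = df.groupby("B")["M"].rank(method="first").astype(int)
print(df)
```

For the LMN rows this gives 1, 3, 4, 5, 2, which matches the "should be" notes in the question.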
      
Solution based on the OP's approach
Since your post invited alternatives, I got interested in a solution that keeps your original approach via the SUMPRODUCT function.
IMO this shows how that approach can be made to work:
Applied method
Get
a) the number of rows in the current subgroup with a value lower than or equal to the current value
MINUS
b) the number of rows in the current subgroup with the identical value, counting from the current row downward
PLUS
c) the increment of 1
Formula example, e.g. in cell N5:
=SUMPRODUCT(($B$2:$B$38705=$B5)*($M$2:$M$38705<=$M5))-COUNTIFS($B5:$B$38705,$B5,$M5:$M$38705,$M5)+1
P.S.
Of course, I agree with you in preferring the solution posted above, too :+)

How to filter a column in Excel

Let's say there are two columns. 1st column is ID, 2nd column is purchase.
I want to filter the 1st (ID) column for distinct values only, but if an ID has several distinct values in purchase, I want to keep one row for each of them.
For example:
ID Purchase
01 food
01 food
01 water
01 electricity
02 water
02 candy
02 water
02 juice
Using Excel's Advanced Sort & Filter tool under Data, I can obtain all distinct values of ID, so that only the following is returned:
01 food
02 water
But I would like to return all distinct IDs with their distinct purchases, such as:
01 food
01 water
01 electricity
02 water
02 candy
02 juice
Any help or guidance is appreciated!
Select both columns when applying the Advanced Filter for Unique values.
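The same idea (uniqueness judged on both columns together) can be sketched in pandas, for anyone working with this data outside Excel. This is only an analogy to the Advanced Filter, not part of the answer above; the variable names are illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": ["01", "01", "01", "01", "02", "02", "02", "02"],
    "Purchase": ["food", "food", "water", "electricity",
                 "water", "candy", "water", "juice"],
})

# A row is a duplicate only if BOTH columns match an earlier row
unique_pairs = df.drop_duplicates(subset=["ID", "Purchase"]).reset_index(drop=True)
print(unique_pairs)
```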

Using a function to replace cell values in a column

I have a fairly large DataFrame (22000 x 29). I want to clean up one particular column for data aggregation. A number of cell values can be replaced by a single value. I would like to write a function to accomplish this using the replace function. How do I pass the column name to the function?
I tried passing the column name as a variable to the function.
Of course, I could do this variable by variable, but that would be tedious
#replace in df from list
def replaceCell(mylist,myval,mycol,mydf):
    for i in range(len(mylist)):
        mydf.mycol.replace(to_replace=mylist[i],value=myval,inplace=True)
    return mydf
replaceCell((c1,c2,c3,c4,c5,c6,c7),c0,'SCity',cimsBid)
cimsBid is the DataFrame, and SCity is the column in which I want values to be changed.
Error message:
AttributeError: 'DataFrame' object has no attribute 'mycol'
Try accessing your column as:
mydf[mycol]
In this command:
mydf.mycol.replace(to_replace=mylist[i],value=myval,inplace=True)
pandas attribute access (the . operator) does not work with a variable holding the column name; it looks for a column literally named mycol. You need to use the indexing operator [] instead:
mydf[mycol].replace(to_replace=mylist[i],value=myval,inplace=True)
There are a few more caveats to attribute access, as the pandas documentation warns:
Warning
You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
Hi, try this function; hopefully it will work:
def replace_values(replace_dict,mycol,mydf):
    mydf = mydf.replace({mycol: replace_dict})
    return mydf
Pass the replacement values as a dictionary.
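A short usage sketch of the dictionary approach, under the assumption that the DataFrame has a column named SCity as in the question; the sample values below are made up for illustration. The nested dictionary restricts the replacement to the named column.

```python
import pandas as pd

def replace_values(replace_dict, mycol, mydf):
    # {column: {old: new}} limits the replacement to that one column
    return mydf.replace({mycol: replace_dict})

# Made-up sample data; 'A' also appears in another column to show it is untouched
df = pd.DataFrame({"SCity": ["A", "D", "B"], "Other": ["A", "A", "A"]})
out = replace_values({"A": "c0", "B": "c0"}, "SCity", df)
print(out)
```

The function returns a new DataFrame, so the result has to be assigned to a name; the original df is not modified.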
Address the column as a string.
You should pass the whole list of values you want to replace (to_replace) and a list of new values (value). (Don't use tuples.)
If you want to replace all values with the same new value, it might be best to do it in a single call:
def replaceCell(mylist,myval,mycol,mydf):
    # assign the result back instead of relying on inplace=True on a column selection
    mydf[mycol] = mydf[mycol].replace(to_replace=mylist,value=myval)
    return mydf
import pandas as pd
# example dataframe
df = pd.DataFrame( {'SCity':['A','D','D', 'B','C','A','B','D'] ,
                    'value':[23, 42,76,34,87,1,52,94]})
# replace the 'SCity' column with a new value
mylist = list(df['SCity'])
myval = ['c0']*len(mylist)
df = replaceCell(mylist,myval,'SCity',df)
# the output
df
SCity value
0 c0 23
1 c0 42
2 c0 76
3 c0 34
4 c0 87
5 c0 1
6 c0 52
7 c0 94
This returns the df with the replaced values.
If you intend to change only a few values, you can do this in a loop.
def replaceCell2(mylist,myval,mycol,mydf):
    for i in range(len(mylist)):
        # assign the result back instead of relying on inplace=True on a column selection
        mydf[mycol] = mydf[mycol].replace(to_replace=mylist[i],value=myval)
    return mydf
# example dataframe
df = pd.DataFrame( {'SCity':['A','D','D', 'B','C','A','B','D'] ,
'value':[23, 42,76,34,87,1,52,94]})
# Only entries with value 'A' or 'B' will be replaced by 'c0'
mylist = ['A','B']
myval = 'c0'
df = replaceCell2(mylist,myval,'SCity',df)
# the output
df
SCity value
0 c0 23
1 D 42
2 D 76
3 c0 34
4 C 87
5 c0 1
6 c0 52
7 D 94
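As an alternative to looping, the same replacement can be sketched with a boolean mask. This is not from the answers above: the helper name replace_cells is hypothetical, and it works on a copy so that the input DataFrame is left unchanged.

```python
import pandas as pd

def replace_cells(values, new_value, column, frame):
    # Hypothetical helper: set `column` to `new_value` wherever it holds one of `values`
    out = frame.copy()
    out.loc[out[column].isin(values), column] = new_value
    return out

df = pd.DataFrame({'SCity': ['A', 'D', 'D', 'B', 'C', 'A', 'B', 'D'],
                   'value': [23, 42, 76, 34, 87, 1, 52, 94]})
result = replace_cells(['A', 'B'], 'c0', 'SCity', df)
print(result)
```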

Why is this giving me a float value in the second column and creating two rows?

My code opens an Excel workbook and copies the first two columns from the first 4 sheets of the workbook.
This code produces 5 different columns, whereas I want only 2 columns, as shown in the desired output.
Another thing I don't understand is why the 3rd and 4th columns contain floats. Why is it turning those values into floats?
I tried to write a better question because my other questions have not been well received, so any feedback on that helps too.
import csv
import xlrd
workbook = xlrd.open_workbook('input.xlsx')
data = []
for i in range (0,5):
    sheet = workbook.sheet_by_index(i)
    data.append([sheet.cell_value(row, 0) for row in range(sheet.nrows)])
    data.append([sheet.cell_value(row, 1) for row in range(sheet.nrows)])
transposed = zip(*data)
with open('file.txt','w') as fou:
    writer = csv.writer(fou, delimiter='\t')
    for row in transposed:
        writer.writerow(row)
Output:
F800 00 F8C8 32.0
F804 01 F8CC 33.0
F808 02 F8D0 34.0
F80C 03 F8D4 35.0
F810 04 F8D8 36.0
F814 05 F8DC 37.0
F818 06 F8E0 38.0
Desired Output:
F800 00
F804 01
F808 02
F80C 03
F810 04
F814 05
F818 06
F81C 07
F8C8 32
F8CC 33
F8D0 34
F8D4 35
F8D8 36
F8DC 37
F8E0 38
Try this:
data = [[], []]
for i in range (0,5):
    sheet = workbook.sheet_by_index(i)
    data[0].extend([sheet.cell_value(row, 0) for row in range(sheet.nrows)])
    data[1].extend([sheet.cell_value(row, 1) for row in range(sheet.nrows)])
transposed = zip(*data)
with open('file.txt','w') as fou:
    writer = csv.writer(fou, delimiter='\t')
    for row in transposed:
        writer.writerow(row)
data was initialized to contain 2 empty lists. The lists are extended in the for loop. This should get you the correct number of output columns.
For the float value problem, Excel stores numeric data as floats (no ints). So before exporting you may need to use Excel's formatting features to get rid of the decimals or convert the numbers to text.
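If changing the workbook is not an option, the conversion could also be done on the Python side before writing. The following is only a sketch: tidy_cell is a hypothetical helper, and it assumes that any float with no fractional part should be written as an integer.

```python
def tidy_cell(value):
    """Return whole-number floats as ints; leave every other value untouched."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value

# Example row shaped like the output in the question
row = ("F8C8", 32.0)
cleaned = [tidy_cell(v) for v in row]
print(cleaned)  # ['F8C8', 32]
```

In the loop that writes the file, each row could be passed through the helper first, e.g. writer.writerow([tidy_cell(v) for v in row]).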
Hope that helps.
