How to remove duplicate values using pandas and keep any one [duplicate] - python-3.x

This question already has answers here:
Drop all duplicate rows across multiple columns in Python Pandas
(8 answers)
Closed 2 years ago.
I have a data-frame which looks like:
A B C D E
a aa 1 2 3
b aa 4 5 6
c cc 7 8 9
d cc 11 10 3
e dd 71 81 91
Rows (1,2) and rows (3,4) have duplicate values in column B. I want to keep only one row from each pair.
The Final output should be:
A B C D E
a aa 1 2 3
c cc 7 8 9
e dd 71 81 91
How can I use pandas to accomplish this?

DataFrame.drop_duplicates(subset="B", keep='first')
keep: controls which of the duplicated rows is kept.
It accepts three values, and the default is 'first'.
If 'first', the first occurrence is kept and the later rows with the same value are dropped as duplicates.
If 'last', the last occurrence is kept and the earlier rows with the same value are dropped as duplicates.
If False, every row whose value occurs more than once is dropped, so none of them is kept.
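As a minimal sketch of the three keep options, the snippet below rebuilds the frame from the question (the variable names first, last and none_kept are only illustrative) and applies each option to column B:

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    "A": ["a", "b", "c", "d", "e"],
    "B": ["aa", "aa", "cc", "cc", "dd"],
    "C": [1, 4, 7, 11, 71],
    "D": [2, 5, 8, 10, 81],
    "E": [3, 6, 9, 3, 91],
})

# Keep the first row of each group of equal B values.
first = df.drop_duplicates(subset="B", keep="first")

# Keep the last row of each group instead.
last = df.drop_duplicates(subset="B", keep="last")

# Drop every row whose B value appears more than once.
none_kept = df.drop_duplicates(subset="B", keep=False)

print(first)
print(last)
print(none_kept)
```

With keep="first" the result matches the output requested in the question. The original index labels are preserved; chaining .reset_index(drop=True) renumbers them if that is preferred.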

Try drop_duplicates
df = df.drop_duplicates('B')
A B C D E
0 a aa 1 2 3
2 c cc 7 8 9
4 e dd 71 81 91

In the general case, we need to drop duplicates across multiple columns. In that case, use the following:
df.drop_duplicates(subset=['A', 'C'], keep='first')
We list the column names in the subset argument and use the keep argument to say which occurrence to keep:
first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.
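A small sketch of the multi-column case follows. The data here is made up for illustration (it is not the frame from the question), chosen so that two rows share the same (A, C) pair:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share the same (A, C) pair.
df = pd.DataFrame({
    "A": ["x", "x", "x", "y"],
    "C": [1, 1, 2, 1],
    "D": [10, 20, 30, 40],
})

# A row counts as a duplicate only if BOTH A and C match an earlier row.
deduped = df.drop_duplicates(subset=["A", "C"], keep="first")

print(deduped)
```

Row 1 is dropped because its (A, C) pair repeats row 0. Rows 2 and 3 stay because each differs from row 0 in at least one of the two columns.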

Related

Percentage of values when one column has values and other column is null

This may be a duplicate of another question, but I have not been able to solve the problem.
I have transaction data with 100 features and 2.3 million rows. For every combination of two columns, I want to find the percentage of rows where one column has a value and the other is null.
Example:
A B C D
1 NA 2 3
2 4 5 6
NA 5 6 7
8 2 NA NA
9 8 7 6
So output should be:
When A has values B has Null 1/4=0.25 times
When A has values C has Null 1/4=0.25 times
Similarly for every other combination of columns and create a dataframe for it.
I tried combination of columns function in Python but it's not giving the desired result.
itertools.combinations(daf.columns, n)
You can write two nested for loops that iterate over the columns and compare each pair.
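One way to sketch that pairwise comparison is shown below. It assumes ordered pairs are wanted (A given B is different from B given A), so it uses itertools.permutations rather than combinations. The output column names has_value, is_null and fraction are hypothetical:

```python
import itertools

import numpy as np
import pandas as pd

# The example frame from the question, with NA as NaN.
df = pd.DataFrame({
    "A": [1, 2, np.nan, 8, 9],
    "B": [np.nan, 4, 5, 2, 8],
    "C": [2, 5, 6, np.nan, 7],
    "D": [3, 6, 7, np.nan, 6],
})

present = df.notna()  # boolean mask, computed once for all columns

rows = []
for has_value, is_null in itertools.permutations(df.columns, 2):
    n_present = present[has_value].sum()
    # Rows where the first column has a value and the second is null.
    n_both = (present[has_value] & ~present[is_null]).sum()
    fraction = n_both / n_present if n_present else np.nan
    rows.append({"has_value": has_value, "is_null": is_null, "fraction": fraction})

result = pd.DataFrame(rows)
print(result)
```

For the sample data this gives 0.25 for the pairs (A, B) and (A, C), matching the figures in the question. With 100 columns the loop runs 9,900 times over boolean masks, which should remain manageable, though a matrix formulation may be faster.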

Subtract a subset of columns from a key column in Pandas Pivot

I have a pivot table with multiple columns of data in a time series:
A B C D
11/1/2018 1 5 5 7
11/2/2018 2 6 6 8
11/3/2018 3 7 7 9
The values in the data columns are not important for this example. I would like to subtract the value in the "key" column (column A in this case) from a subset of columns: B & C in this case. I would then like to drop any columns not in the subset or the key column. Result would be:
A B C
11/1/2018 1 4 4
11/2/2018 2 4 4
11/3/2018 3 4 4
I have subtracted columns in the past via code like this:
df['dif'] = df['B'] -df['A']
But this adds a new "dif" column, whereas I would like to replace column B with the B-A values. Also, instead of passing the instructions one at a time (B-A, C-A), I would like to pass a list, something like "if column in list, subtract key column, else drop column."
Thanks
pandas.DataFrame.sub with axis=0
When subtracting a Series from a DataFrame Pandas will align the columns of the DataFrame with the index of the Series by default. This is what happens when you use the - operator. However, when you use the pandas.DataFrame.sub method, you can override that default and specify that the DataFrame should align its index with the index of the Series.
def f(d, key, subset):
    return d[[key]].join(d[subset].sub(d[key], axis=0))
f(df, 'A', ['B', 'C'])
A B C
11/1/2018 1 4 4
11/2/2018 2 4 4
11/3/2018 3 4 4
You can use apply to subtract A from the subset columns that you choose, and then join the result with A again.
df['A'].to_frame().join(df[['B','C']].apply(lambda x: x - df['A']))
A B C
11/1/2018 1 4 4
11/2/2018 2 4 4
11/3/2018 3 4 4
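For reference, here is a self-contained sketch that rebuilds the example frame and runs both approaches side by side. The function name subtract_key is only illustrative, and the dates are kept as plain strings in the index for simplicity:

```python
import pandas as pd

# The pivot from the question; dates are plain string labels here.
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [5, 6, 7], "C": [5, 6, 7], "D": [7, 8, 9]},
    index=["11/1/2018", "11/2/2018", "11/3/2018"],
)

def subtract_key(d, key, subset):
    """Return the key column plus each subset column minus the key column."""
    # axis=0 aligns the key Series with the frame's row index, not its columns.
    return d[[key]].join(d[subset].sub(d[key], axis=0))

via_sub = subtract_key(df, "A", ["B", "C"])

# The apply-based variant subtracts column by column.
via_apply = df["A"].to_frame().join(df[["B", "C"]].apply(lambda col: col - df["A"]))

print(via_sub)
print(via_apply)
```

Both variants drop column D, because only the key column and the subset columns are selected before joining.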

vlookup with multiple resulting values

I have a table array that looks like this:
A B
1 2
1 3
1 9
2 3
2 4
2 11
2 23
2 56
3 7
4 13
My VLOOKUP formula checks for 1 in column A and then returns the corresponding B value. Is there any way I can get all the values for 1? Currently it returns only the last corresponding number for 1, i.e. 9 in column B.
You can Pivot that data, or you can try INDEX and MATCH together, perhaps even an IF command with it.

Identify Rows with Same Values in 2 Different Columns

I have a data set of roughly 405,000 rows and 23 columns. I need the records where the value in column "D" is the same as the value in column "H" for that row.
So for
A B C D E F G H
13 8 21 ok 3 S - of
51 7 22 no 3 A k no
24 3 23 by 3 S * we
24 4 24 we 3 S ! ok
24 9 25 by 3 S # we
75 2 26 ok 3 S 9 ok
etc...
I'd get back the 2nd row, the 6th row, etc...
A B C D E F G H
51 7 22 no 3 A k no
75 2 26 ok 3 S 9 ok
Based on other posts like: Formula to find matching row value based on cells in multiple columns I tried using a Pivot Table, but it complains I can't put either of my two columns in the "Columns" area because there is too much data. With both columns in the "Rows" area, I get a relationship of D to H, but I can't then find a way to filter on only those where D = H.
I've also looked into countifs(), vlookup, and index / match functions, but I can't figure this out. Help please.
I would do a simple "IF()" formula in a new column.
For your example add a new column I and use the following formula in the first data row (I2):
=IF(D2=H2,"Yes","No")
Fill down to the end of the data.
Then using Excel filters or countif you can check the number of "Yes" vs "No" in your data.

Average over a column if text in another column matches

I want to compute the average over one column if the text in another column matches a certain text.
eg:
A B C
aa 6 =AVERAGEIF(B1:B6;EXACT(A1:A6;"aa"))
bb 15
aa 8
bb 17
cc 1
aa 5
But the value in C gets 1. Why? How can I do what I want?
I would suggest using the AVERAGEIFS() function instead of AVERAGEIF(). See below:
=AVERAGEIFS(B1:B6,A1:A6,"aa")
This will yield a result of 6.3333.
Cheers.
