Finding duplicates and creating a column that flags them in pandas

| Col1 | Col2 | Col3 |
|------|------|------|
| m | n | o |
| m | q | e |
| a | b | r |
Let's say I have a pandas DataFrame as shown above. Notice that the Col1 values are the same for rows 0 and 1. Is there a way to find all the duplicate entries in the DataFrame based on Col1 only?
Additionally, I would also like to add another column, say is_duplicate, which would be True for all the duplicate rows of my DataFrame and False otherwise.
Note: I want to find the duplicates based only on the value in Col1. The other columns may or may not be duplicated; they shouldn't be taken into consideration.

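.duplicated() has exactly that functionality:
df['is_duplicate'] = df.duplicated('Col1')
Note that with the default keep='first', the first occurrence of each value is marked False; pass keep=False to mark every occurrence as True.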

I found it:
df["is_duplicate"] = df.Col1.duplicated(keep=False)
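As a minimal sketch of how the two calls above differ (the sample frame is rebuilt from the table in the question, and the name first_only is only illustrative):

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Col1": ["m", "m", "a"],
    "Col2": ["n", "q", "b"],
    "Col3": ["o", "e", "r"],
})

# Default keep="first": only the second and later occurrences are flagged.
first_only = df.duplicated("Col1")

# keep=False: every row whose Col1 value appears more than once is flagged.
df["is_duplicate"] = df["Col1"].duplicated(keep=False)

print(first_only.tolist())
print(df)
```

Only the keep=False variant marks both rows 0 and 1, which is what the question asks for.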

Related

Using OpenOffice Calc or Excel to find variables which occur in multiple columns?

I need a way to check whether a variable occurs in all of my columns at least once.
Here is an example with 3 columns, where "Occu in Col1-3" checks whether the variable listed in Col3 exists in every other column. Only for "a" should the formula return "true".
How can I achieve this? I tried using "countif", but unfortunately this only works for a maximum of 2 columns.
+------+------+------+----------------+
| Col1 | Col2 | Col3 | Occu in Col1-3 |
+------+------+------+----------------+
| a | a | f | false |
| b | c | a | true |
| c | d | e | false |
+------+------+------+----------------+
Any help is upvoted, thanks
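Multiply in another COUNTIF for each column. The product is zero unless the value occurs in every column, so comparing it with 0 (append >0) yields TRUE or FALSE: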
=COUNTIF(A$1:A$3,"=" & C1) * COUNTIF(B$1:B$3,"=" & C1)
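For readers working outside a spreadsheet, the same check can be sketched in pandas. This is an analogue under stated assumptions, not a replacement for the formula: the frame is rebuilt from the table in the question, and the column name occurs_in_all is made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Col1": ["a", "b", "c"],
    "Col2": ["a", "c", "d"],
    "Col3": ["f", "a", "e"],
})

# For each value in Col3, require at least one occurrence in Col1 AND in Col2.
# Col3 trivially contains its own values, so it needs no separate check.
df["occurs_in_all"] = df["Col3"].isin(df["Col1"]) & df["Col3"].isin(df["Col2"])

print(df)
```

Combining the per-column tests with & plays the same role as multiplying the COUNTIF results: one missing column makes the whole result false.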

Better way to refresh imported columns?

I have a table in Spotfire with a couple of columns imported from another table as a lookup. As an example, Col2 is used as the match key for importing ImportedCol:
+------+------+-------------+
| Col1 | Col2 | ImportedCol |
+------+------+-------------+
| 1 | A | Val1 |
| 2 | B | Val2 |
| 3 | A | Val1 |
| 4 | C | Val3 |
| 5 | B | Val2 |
| 6 | A | Val1 |
| 7 | D | Val4 |
+------+------+-------------+
However, the data in Col2 is subject to change. When it changes, I need ImportedCol to change with it, but Spotfire seems to keep the old imported data. For now I have been deleting the imported column and re-adding it to refresh the link. Is there a way to import the data dynamically as the document loads, or whenever the information links are refreshed?
I have found that this happens sometimes, although I'm not sure how to explain why. My workaround is to create "virtual" data tables based on your existing ones.
Consider your linked table as A and your embedded table as B. Start from a default state, that is, before importing any columns.
Add a new data table. The source for this table should be "From Current Analysis", using A. We will call this one C; it becomes your main data table, and C will update when any changes are made to A or B.
I found the issue.
Turns out that pivoting on data in the same table creates a circular reference which overrides the embed/link setting on that table. My workaround was to make the pivot as its own information link, then have the table join the original link and the new pivot one.

SUM of multiple VLOOKUP

It seems like a simple problem, but I have not managed to solve it. I have the following tables:
Values
| Key | Value |
|-----|-------|
| A | 1 |
| B | 2 |
| C | 3 |
Results
| Foo | Bar |
|-----|-----|
| A | B |
| C | B |
| A | A |
| B | C |
| ... | ... |
What I am looking for is a final row in the Results table that looks up each key in the Values table, takes its value, and sums those values per column (i.e. Foo and Bar). The final result would be:
| Foo | Bar |
|-----|-----|
| A | B |
| C | B |
| A | A |
| B | C |
|-----|-----|
| 7 | 8 |
I have been trying different VLOOKUP, INDEX and MATCH functions, but I still cannot get it to work. Any ideas?
I assume you want a solution without extra columns. In that case you need array formulas (a.k.a. CSE or Ctrl+Shift+Enter formulas).
The combination {=SUM(VLOOKUP(...))} doesn't work, but {=SUM(SUMIF(...))} does:
In A12 enter =SUM(SUMIF($A$1:$A$3;A7:A10;$B$1:$B$3)) and confirm with Ctrl+Shift+Enter. You can then copy this to B12.
The drawback is that you will need to change the array formula every time you add values to the list A7:B10 (unless you make the range sufficiently large to begin with). This speaks in favour of extra =VLOOKUP() columns, as suggested by CustomX.
I'm not sure of other solutions, but you could solve this by using an extra 2 columns, E and F for example.
Enter this in column E: =VLOOKUP(C2;$A$1:$B$3;2;0)
Enter this in column F: =VLOOKUP(D2;$A$1:$B$3;2;0)
Pull the formulas down, then at the bottom of columns C and D add a SUM over columns E and F respectively.
Extra: These are the columns I used for your examples.
Key = column A
Value = column B
Foo = column C
Bar = column D
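The lookup-then-sum logic behind both answers can also be sketched in pandas. This is an illustration under stated assumptions, not part of either spreadsheet solution: the data is rebuilt from the tables in the question, the "..." row is left out, and the names values, results and totals are only illustrative.

```python
import pandas as pd

# Lookup table rebuilt from the "Values" table in the question.
values = pd.Series({"A": 1, "B": 2, "C": 3})

# The four visible rows of the "Results" table (the "..." row is left out).
results = pd.DataFrame({
    "Foo": ["A", "C", "A", "B"],
    "Bar": ["B", "B", "A", "C"],
})

# Replace each key by its looked-up value, then sum each column.
totals = results.apply(lambda col: col.map(values)).sum()

print(totals)
```

For the four rows shown, this gives 7 for Foo and 8 for Bar, matching the expected final row in the question.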

Display all matching values in one comma separated cell

I have two columns of data in an Excel 2010 spreadsheet. In Column A is a category, and in Column B is a value. There will be multiple values in Column B for each unique category in Column A.
What I want to achieve in a separate sheet is to display all of the values for each unique category in one comma-separated (or semicolon-separated, etc.) cell.
For example, if my first sheet looks like this:
----------------------
| Category | Value |
----------------------
| Cat1 | Val A |
| Cat1 | Val B |
| Cat1 | Val C |
| Cat2 | Val D |
| Cat3 | Val E |
| Cat3 | Val F |
| Cat3 | Val G |
| Cat3 | Val H |
----------------------
I'd want to display the following in another sheet:
---------------------------------------
| Category | Value |
---------------------------------------
| Cat1 | Val A,Val B,Val C |
| Cat2 | Val D |
| Cat3 | Val E,Val F,Val G, Val H |
---------------------------------------
Can this be achieved with a formula? Vlookup will only find the first matching value, of course. I've Googled it, but the individual search terms involved in the query are so generic I'm getting swamped with inappropriate results.
Please try (in a copy on another sheet):
Insert a column on the left with =IF(B2<>B3,"","x") in A2 (assuming Category is in B1). In D2 put =IF(B1=B2,D1&", "&C2,C2) and copy both formulae down to suit. Copy and Paste Special Values over the top. Filter on ColumnA for x and delete selected rows. Unfilter and delete ColumnA.
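If the data can be taken outside Excel, the same grouping can be sketched in pandas. This is an alternative under stated assumptions, not the formula approach described above: the frame is rebuilt from the example sheet, and the name combined is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the first sheet in the question.
df = pd.DataFrame({
    "Category": ["Cat1", "Cat1", "Cat1", "Cat2", "Cat3", "Cat3", "Cat3", "Cat3"],
    "Value": ["Val A", "Val B", "Val C", "Val D", "Val E", "Val F", "Val G", "Val H"],
})

# Join all values of each category into one comma-separated string.
# sort=False keeps the categories in the order they first appear.
combined = (
    df.groupby("Category", sort=False)["Value"]
      .agg(",".join)
      .reset_index()
)

print(combined)
```

Changing the separator only requires a different string in front of .join, for example "; ".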

Pivot table custom aggregation function

I have a table like this in an Excel spreadsheet:
Col1 | Col2 | Col3
-----------+-----------+-----------
A | X | 1
A | Y | 2
B | X | 3
B | Y | 4
B | Z | 5
I want to use the aggregation feature of the pivot table.
Using the typical SUM of VALUES aggregation, provided by Excel, I get:
Col1 | Col3
-----------+-----------
A | 3
B | 12
But I want to use a different aggregation function. I want to use something that does:
Square root of ( Sum of (Square(x)))
So that in the end I get the table:
Col1 | Col3
-----------+-----------
A | SQRT(5) <= Sqrt(1*1 + 2*2)
B | SQRT(50) <= Sqrt(3*3 + 4*4 + 5*5)
Is there ANY way (VBA, C++, assembly, whatever it takes) I can specify my own functions in the aggregation list?
NOTE: I KNOW HOW TO DO THIS IN A SHEET, DON'T BOTHER ANSWERING IF IT'S NOT ABOUT WRITING CUSTOM AGGREGATION FUNCTIONS FOR PIVOTTABLE
It's not possible to write a custom aggregation function for a standard pivot table. But you can probably do what you want using MDX... maybe an MDX expert would like to comment?
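This does not add a custom function to Excel's own pivot table, but for comparison, a pivot outside Excel can accept an arbitrary aggregation. A minimal sketch using pandas' pivot_table, assuming the data is rebuilt from the question's table; the helper name root_sum_squares is made up for illustration:

```python
import math

import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Col1": ["A", "A", "B", "B", "B"],
    "Col2": ["X", "Y", "X", "Y", "Z"],
    "Col3": [1, 2, 3, 4, 5],
})


def root_sum_squares(x):
    """Return the square root of the sum of squares of a group's values."""
    return math.sqrt((x ** 2).sum())


# pivot_table accepts any callable as aggfunc, applied once per group.
pivot = df.pivot_table(index="Col1", values="Col3", aggfunc=root_sum_squares)

print(pivot)
```

For group A this yields sqrt(5) and for group B sqrt(50), the values requested in the question.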
