How to check the duplicate pair in excel? - excel

I tried to find the pairs in multiple columns in excel.
abc def 1 <-duplicate 1
ael fjw 1
dlf qwr 1
cvz god 1 <-duplicate 2
abc def -1 <-duplicate 1
slf erw -1
def abc -1 <-duplicate 1
god cvz -1 <-dupllicate 2
cnv odf -1
After that, I should eliminate the pairs that have the value -1.
I tried excel duplicate values pairs in multiple column post, but it showed an unexpected result.
If it is hard to run in Excel, it is okay to suggest the code in python or R.
In particular, I checked the post Removing duplicate interaction pairs in python sets which is a similar problem in python.
But this example is corresponding to the numerical value.
Also, if there are any problems with my question, please correct them.

Assuming your first row of data is in A1:C1, this formula in D1:
=IF(AND(SUM(COUNTIFS(A$1:A1,INDEX(A1:B1,{1;2}),B$1:B1,INDEX(A1:B1,{2;1})))>1,C1=-1),"Delete","")
and copied down.
If your version of Excel does not use the semicolon as row- or column-separator within array constants then the parts
{1;2}
and
{2;1}
will require amendment.

Related

How to replace text in column by the value contained in the columns named in this text

In pyspark, I'm trying to replace multiple text values in a column by the value that are present in the columns which names are present in the calc column (formula).
So to be clear, here is an example :
Input:
|param_1|param_2|calc
|-------|-------|--------
|Cell 1 |Cell 2 |param_1-param_2
|Cell 3 |Cell 4 |param_2/param_1
Output needed:
|param_1|param_2|calc
|-------|-------|--------
|Cell 1 |Cell 2 |Cell 1-Cell 2
|Cell 3 |Cell 4 |Cell 4/Cell 3
In the column calc, the default value is a formula. It can be something as much as simple as the ones provided above or it can be something like "2*(param_8-param_4)/param_2-(param_3/param_7)".
What I'm looking for is something to substitute all the param_x by the values in the related columns regarding the names.
I've tried a lot of things but nothing works at all and most of the time when I use replace or regex_replace with a column for the replacement value, the error the column is not iterable occurs.
Moreover, the columns param_1, param_2, ..., param_x are generated dynamically and the calc column values can some of these columns but not necessary all of them.
Could you help me on the subject with a dynamic solution ?
Thank you so much.
Best regards
Update: Turned out I misunderstood the requirement. This would work:
for exp in ["regexp_replace(calc, '"+col+"', "+col+")" for col in df.schema.names]:
df=df.withColumn("calc", F.expr(exp))
Yet Another Update: To Handle Null Values add coalesce:
for exp in ["coalesce(regexp_replace(calc, '"+col+"', "+col+"), calc)" for col in df.schema.names]:
df=df.withColumn("calc", F.expr(exp))
Input/Output:
------- Keeping the below section for a while just for reference -------
You can't directly do that - as you won't be able to use column value directly unless you collect in a python object (which is obviously not recommended).
This would work with the same:
df = spark.createDataFrame([["1","2", "param_1 - param_2"],["3","4", "2*param_1 + param_2"]]).toDF("param_1", "param_2", "calc");
df.show()
df=df.withColumn("row_num", F.row_number().over(Window.orderBy(F.lit("dummy"))))
as_dict = {row.asDict()["row_num"]:row.asDict()["calc"] for row in df.select("row_num", "calc").collect()}
expression = f"""CASE {' '.join([f"WHEN row_num ='{k}' THEN ({v})" for k,v in as_dict.items()])} \
ELSE NULL END""";
df.withColumn("Result", F.expr(expression)).show();
Input/Output:

Iterate in column for specific value and insert 1 if found or 0 if not found in new column python

I have a DataFrame as shown in the attached image. My columns of interest are fgr and fgr1. As you can see, they both contain values corresponding to years.
I want to iterate in the the two columns and for any value present, I want 1 if the value is present or else 0.
For example, in fgr the first value is 2028. So, the first row in column 2028 will have a value 1 and all other columns have value 0.
I tried using lookup but I did not succeed. So, any pointers will be really helpful.
Example dataframe
Data:
Data file in Excel
This fill do you job. You can use for loops aswell but I think this approach will be faster.
df["Matched"] = df["fgr"].isin(df["fgr1"])*1
Basically you check if values from one are in anoter column and if they are, you get True or False. You then multiply by 1 to get 1 and 0 instead of True or False.
From this answer
Not the most efficient, but should work for your case(time consuming if large dataset)
s = df.reset_index().melt(['index','fgr','fgr1'])
s['value'] = s.variable.eq(s.fgr.str[:4]).astype(int)
s['value2'] = s.variable.eq(s.fgr1.str[:4]).astype(int)
s['final'] = np.where(s['value']+s['value2'] > 0,1,0)
yourdf = s.pivot_table(index=['index','fgr','fgr1'],columns = 'variable',values='final',aggfunc='first').reset_index(level=[1,2])
yourdf

Excel sum based on matrix condition and multiple criteria

Following from the example here I'm trying to add additional conditions to a sum formula. I've represented an example below:
The output that I'm looking for for example for Jan 2017 is
2017
1
UP A 1
UP B 6
UP C 6
DOWN A 1
DOWN B 8
DOWN C 7
I tried with the following formula:
=MMULT(--($B$17:$C$17="X"),MATCH(1,($A23=$C$2:$C$14)*(C$21=$A$2:$A$14)*(C$22=$B$2:$B$14)*($E$2:$E$14=$D$2:$D$14),0))
but I get a N/A value.
Does anyone know it if is possible to do it?
In your first example the number of rows in array1 and number of columns in array2 were equal, five. Here you have two columns and 13 rows. That they are unequal here is part (all) of the reason why you are having an issue.
Also your match function is returning a Boolean not an array
I have a way to do this using matrix condition and multiple criteria but had to change problem up a bit, see photo for example:
{=MMULT(--(D18:P18="x"),E$2:E$14*(--(A$2:A$14=$C$21)*--(B$2:B$14=$C$22)*--(C$2:C$14=A24)))"
https://i.stack.imgur.com/FEvgR.png
You can create a formula to fill the second matrix with X's see below
=IF(OR(INDIRECT("D"&VALUE(D20))=$A$18,INDIRECT("D"&VALUE(D20))=$B$18),"X","")
https://i.stack.imgur.com/4rS4L.png
That being said I don't think this is particularly efficient as you are treating the one of the matrixes as a all 1's so you basically just adding an extra criteria / Boolean with added complexity....that being said u asked for this specifically and I believe that I have delivered that LOL
Just add two SUMIFS together.
=SUMIFS($E$2:$E$14, $A$2:$A$14, C$21, $B$2:$B$14, C$22, $C$2:$C$14, $A23, $D$2:$D$14, IF(INDEX($B$17:$C$19, MATCH($B23, $A$17:$A$19, 0), 1)="x", $B$16))+
SUMIFS($E$2:$E$14, $A$2:$A$14, C$21, $B$2:$B$14, C$22, $C$2:$C$14, $A23, $D$2:$D$14, IF(INDEX($B$17:$C$19, MATCH($B23, $A$17:$A$19, 0), 2)="x", $C$16))

Ranking when there are duplicates

How can I return the ranking of each value in a row, even in the case of duplicates? Please see my example below.
While many questions have been answered regarding the handling of duplicate values in a ranking, I have come short in achieving a method that works for all of my cases.
EDIT: The previous picture above was a bad example that did not address my problem. Here is a new picture of the behavior.
In certain cases it skips to 7 when the rank should only be 1:6. In other cases it seems to work, and then not work in similar cases. Data is:
2.61879723030607 2.3428 2.61879723030607 2.4571 2.7324 2.1790
2.97203355745108 2.5355 2.97203355745108 2.6721 3.0561 2.4136
2.4895 2.2781 2.6218 2.4369 2.6898 2.1361
2.32650000000000 2.2124 2.3453 2.32650000000000 2.3938 2.0283
2.34132608128450 2.1331 2.34132608128450 2.2800 2.5758 2.0446
2.58668483692925 2.1476 2.58668483692925 2.3019 2.5124 2.0135
2.2555 2.0884 2.3368 2.0980 2.3928 1.9787
2.32878217762168 2.1080 2.32878217762168 2.1250 2.5360 1.9807
2.50891263421977 2.2480 2.50891263421977 2.4239 2.9070 2.2638
2.97755287506272 2.4457 2.97755287506272 2.6830 3.0566 2.3987
3.0850 2.5380 5.3880 2.8304 3.1579 2.5030
3.0120 2.3815 3.0639 2.6762 3.0831 2.4253
2.49235468138485 2.1436 2.49235468138485 2.3159 2.5542 1.9991
2.13109025589563 2.1060 2.13109025589563 2.1555 2.3225 1.9787
2.24900295032614 2.0332 2.24900295032614 2.1780 2.5084 2.0043
2.4010 2.0438 2.5857 2.2126 2.4511 2.0329
EDIT2: Implementing RANK instead of RANK.EQ showing no difference:
I think you've got an error in your setup. My understanding is each row is meant to be a separate independent case, however your formula for calculating rank has fixed row and column references, when it should have only fixed column references. Right now, the rank for every value is being found based on the first row in your data. Instead of:
=RANK.EQ(B4,$B$4:$G$4,1)
It should be:
=RANK.EQ(B4,$B4:$G4,1)
This then alters your results in the 2nd and 3rd blocks and you should get the desired result in the 3rd block.
With the formula below in Cell B2:B4 you can filter the unique numbers in Column A.
Please note that this is an array formula so once you enter it you have to mark it and press CTRL + ALT + DEL. Hope this solves your problem. More details regarding this formula you can also find here https://exceljet.net/formula/extract-unique-items-from-a-list
Column A Column B
1
1 1 = {=INDEX($A$1:$A$5000,MATCH(0,COUNTIF($B$1:B1,$A$1:$A$5000),0))}
1 2 = {=INDEX($A$1:$A$5000,MATCH(0,COUNTIF($B$1:B2,$A$1:$A$5000),0))}
1 6 = {=INDEX($A$1:$A$5000,MATCH(0,COUNTIF($B$1:B3,$A$1:$A$5000),0))}
1
1
1
1
1
1
1
2
1
6
6
6
6
6
6
6
6
6
6
6
6
6
Try RANK instead of RANK.EQ as below. Though I am not sure whether this will work as I am testing on Excel 07.
Enter the following formula in Cell H1
=RANK(A1,$A1:$F1,1)+COUNTIF($A1:A1,A1)-1
Copy/Drag the formula down and across (to right) as required. See image for reference.
As per Microsoft Documentation on RANK.EQ function here
RANK.EQ gives duplicate numbers the same rank. However, the presence of duplicate numbers affects the ranks of subsequent numbers. For example, in a list of integers sorted in ascending order, if the number 10 appears twice and has a rank of 5, then 11 would have a rank of 7 (no number would have a rank of 6)

Trouble of finding matching values from a different workbook excel

I have this excel files, this is what my data looks like in the first workbook, which could have 2000 + entries and in a general format.
A
1 5001987
2 1458285
3 2506588
4 4745089
5 2540486
.
.
My other excel file looks like this, but also in a general, but the data within it is generated by something else which results of its output like this.
A
1 ['2506588']
2 ['2540181']
3 ['2553486']
4 ['2540181']
5 ['2540389']
6 ['2553384']
On a specific column somewhere, i have written this function:
=IF(VLOOKUP([outputbarcode.xlsx]Sheet1!$B$4,B2:B1992,2,TRUE),"Y","N")
I simply want it to look if excefile 2 cell A1 value exist in excelfile 1, print Y, if not, N.
Running the function above returns #N/A
Is there something wrong with my function?
On excel file 2, try:
=IFERROR(IF(INDEX(MATCH(VALUE(MID(A1,3,7)), Sheet1!A:A, 0),)>0, "Y"), "N")
Sheet1 is excel file 1 here. I prefer index & match to vlookup. You can search why.
I suggest that you do an edit/replace and remove those odd characters permanently. Then you won't need the mid() function but the rest of #Sangbok lee answer will be fine and that may help with future operations.

Resources