How to replace text in column by the value contained in the columns named in this text - apache-spark

In PySpark, I'm trying to replace the column names that appear in the calc column (a formula) with the values held in those columns.
To be clear, here is an example:
Input:
|param_1|param_2|calc
|-------|-------|--------
|Cell 1 |Cell 2 |param_1-param_2
|Cell 3 |Cell 4 |param_2/param_1
Output needed:
|param_1|param_2|calc
|-------|-------|--------
|Cell 1 |Cell 2 |Cell 1-Cell 2
|Cell 3 |Cell 4 |Cell 4/Cell 3
In the calc column, the default value is a formula. It can be as simple as the ones above, or something like "2*(param_8-param_4)/param_2-(param_3/param_7)".
What I'm looking for is a way to substitute every param_x with the value in the column of the same name.
I've tried a lot of things, but nothing works. Most of the time, when I use replace or regexp_replace with a column as the replacement value, I get the error "Column is not iterable".
Moreover, the columns param_1, param_2, ..., param_x are generated dynamically, and a calc value can reference some of these columns but not necessarily all of them.
Could you help me with a dynamic solution?
Thank you so much.
Best regards

Update: It turned out I misunderstood the requirement. This would work:
from pyspark.sql import functions as F
for exp in ["regexp_replace(calc, '"+col+"', "+col+")" for col in df.schema.names]:
    df = df.withColumn("calc", F.expr(exp))
Yet another update: to handle null values, add coalesce:
for exp in ["coalesce(regexp_replace(calc, '"+col+"', "+col+"), calc)" for col in df.schema.names]:
    df = df.withColumn("calc", F.expr(exp))
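One thing to watch with the loop above: the pattern 'param_1' also matches the first part of 'param_10', so formulas with ten or more parameters could be substituted incorrectly. Here is a minimal sketch of the substitution logic in plain Python (not Spark), matching whole tokens only. The function name substitute_params and the third sample row are made up for illustration. In Spark, adding word boundaries to the regexp_replace pattern might achieve the same thing, but I have not tested that here.

```python
import re

def substitute_params(row: dict) -> str:
    """Replace every param_<n> token in row['calc'] with that column's value."""
    def lookup(match):
        name = match.group(0)
        value = row.get(name)
        # Leave the token untouched when the column is missing or null
        return name if value is None else str(value)
    # \b ensures param_1 does not match inside param_10
    return re.sub(r"\bparam_\d+\b", lookup, row["calc"])

rows = [
    {"param_1": "Cell 1", "param_2": "Cell 2", "calc": "param_1-param_2"},
    {"param_1": "Cell 3", "param_2": "Cell 4", "calc": "param_2/param_1"},
    # Hypothetical row showing the param_1 / param_10 overlap
    {"param_1": "a", "param_10": "b", "calc": "param_10*param_1"},
]
results = [substitute_params(r) for r in rows]
print(results)
```

The replacement is done through a function rather than a string, so the cell values are inserted literally even if they contain regex-special characters.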
------- Keeping the below section for a while just for reference -------
You can't do that directly, because a column's value can't be used as an expression unless you collect it into a Python object (which is not recommended for large data).
The following works around it by collecting the formulas:
from pyspark.sql import functions as F
from pyspark.sql import Window

df = spark.createDataFrame([["1", "2", "param_1 - param_2"], ["3", "4", "2*param_1 + param_2"]]).toDF("param_1", "param_2", "calc")
df.show()
# Number the rows so that each formula can be matched back to its own row
df = df.withColumn("row_num", F.row_number().over(Window.orderBy(F.lit("dummy"))))
# Collect the formulas to the driver, keyed by row number
as_dict = {row["row_num"]: row["calc"] for row in df.select("row_num", "calc").collect()}
# Build one CASE expression that evaluates each row's own formula
whens = " ".join(f"WHEN row_num = {k} THEN ({v})" for k, v in as_dict.items())
expression = f"CASE {whens} ELSE NULL END"
df.withColumn("Result", F.expr(expression)).show()

Related

How to check the duplicate pair in excel?

I am trying to find duplicate pairs across multiple columns in Excel.
abc def 1 <-duplicate 1
ael fjw 1
dlf qwr 1
cvz god 1 <-duplicate 2
abc def -1 <-duplicate 1
slf erw -1
def abc -1 <-duplicate 1
god cvz -1 <-duplicate 2
cnv odf -1
After that, I need to remove the duplicate pairs that have the value -1.
I tried the approach from the post "excel duplicate values pairs in multiple column", but it gave an unexpected result.
If this is hard to do in Excel, a suggestion in Python or R is fine.
In particular, I looked at the post "Removing duplicate interaction pairs in python sets", which covers a similar problem in Python.
But that example deals with numerical values.
Also, if there are any problems with my question, please correct them.
Assuming your first row of data is in A1:C1, this formula in D1:
=IF(AND(SUM(COUNTIFS(A$1:A1,INDEX(A1:B1,{1;2}),B$1:B1,INDEX(A1:B1,{2;1})))>1,C1=-1),"Delete","")
and copied down.
If your version of Excel does not use the semicolon as row- or column-separator within array constants then the parts
{1;2}
and
{2;1}
will require amendment.
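Since the question allows Python, here is a rough pandas sketch of the same idea. It assumes the three columns are named a, b and value (names I made up), treats a pair as duplicated when the same two strings occur in another row in either order, and drops the duplicated rows whose value is -1.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["abc", "ael", "dlf", "cvz", "abc", "slf", "def", "god", "cnv"],
    "b": ["def", "fjw", "qwr", "god", "def", "erw", "abc", "cvz", "odf"],
    "value": [1, 1, 1, 1, -1, -1, -1, -1, -1],
})

# Order-independent key: ("abc", "def") and ("def", "abc") give the same tuple
pair_key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(df["a"], df["b"])],
    index=df.index,
)

# True for every row whose pair occurs more than once
is_duplicate = pair_key.duplicated(keep=False)

# Keep everything except duplicated pairs carrying the value -1
result = df[~(is_duplicate & (df["value"] == -1))].reset_index(drop=True)
print(result)
```

Unlike the formula, which only looks at the rows above the current one, this flags a pair wherever its duplicate sits in the table.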

Iterate in column for specific value and insert 1 if found or 0 if not found in new column python

I have a DataFrame as shown in the attached image. My columns of interest are fgr and fgr1. As you can see, they both contain values corresponding to years.
I want to iterate over the two columns and, for each year column, set 1 if that year is present in the row or else 0.
For example, in fgr the first value is 2028. So, the first row in column 2028 will have a value 1 and all other columns have value 0.
I tried using lookup but I did not succeed. So, any pointers will be really helpful.
Example dataframe
Data:
Data file in Excel
This will do the job. You can use for loops as well, but I think this approach will be faster.
df["Matched"] = df["fgr"].isin(df["fgr1"])*1
Basically, you check whether the values from one column are in the other column, which gives True or False. You then multiply by 1 to get 1 and 0 instead of True and False.
From this answer
Not the most efficient, but it should work for your case (time-consuming on a large dataset):
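import numpy as np

# Melt the year columns into rows, keeping fgr and fgr1 as identifiers
s = df.reset_index().melt(['index','fgr','fgr1'])
# Flag rows where the year column name equals the first 4 characters of fgr / fgr1
s['value'] = s.variable.eq(s.fgr.str[:4]).astype(int)
s['value2'] = s.variable.eq(s.fgr1.str[:4]).astype(int)
s['final'] = np.where(s['value']+s['value2'] > 0,1,0)
# Pivot back to one column per year
yourdf = s.pivot_table(index=['index','fgr','fgr1'],columns = 'variable',values='final',aggfunc='first').reset_index(level=[1,2])
yourdf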
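For comparison, here is a shorter sketch that builds the indicator columns directly. The real data was only given as an image, so the DataFrame below is made up: it assumes fgr and fgr1 hold plain year strings and that one indicator column per year is wanted.

```python
import pandas as pd

# Hypothetical sample; the real frame was only shown as an image
df = pd.DataFrame({
    "fgr": ["2028", "2030", "2029"],
    "fgr1": ["2029", "2030", "2031"],
})

# Every year that occurs in either column becomes an indicator column
years = sorted(set(df["fgr"]) | set(df["fgr1"]))
for year in years:
    # 1 if the year appears in fgr or fgr1 on that row, else 0
    df[year] = ((df["fgr"] == year) | (df["fgr1"] == year)).astype(int)

print(df)
```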

Comparing two cells and show both of them when it is not equal

I have a table in EXCEL that looks like this:
ATTRIBUTE_IN_1|TYPE_IN_1|ATTRIBUTE_IN_2 |TYPE_IN_2 |
________________________________________________________
advance |String | accounts |decimal(20,10) |
education |String | Job_category |String |
date of birth |date | advance |decimal(20,10) |
and etc.
I want to compare TYPE_IN_1 and TYPE_IN_2 (2nd and 4th columns) for equal values in ATTRIBUTE_IN_1 and ATTRIBUTE_IN_2 (1st and 3rd columns).
In the result I want to see values, when ATTRIBUTE_IN_1 and ATTRIBUTE_IN_2 have equal value, but TYPE_IN_1 and TYPE_IN_2 (2nd and 4th columns) have unequal values.
For that example in result I want to see:
ATTRIBUTE_IN_1|TYPE_IN_1|ATTRIBUTE_IN_2 |TYPE_IN_2 |
_______________________________________________________
advance |String | advance |decimal(20,10) |
What formula can I use in MS Excel to solve this problem?
Excel doesn't really lend itself to this type of question. You can cobble something together, perhaps along these lines (for attribute 1):
=IFERROR(INDEX($A$2:$A$10,INT(SMALL(IF(($A$2:$A$10=TRANSPOSE($C$2:$C$10))*($B$2:$B$10<>TRANSPOSE($D$2:$D$10)),
ROWS($A$2:$A$10)*(ROW($A$2:$A$10)-ROW($A$2))+TRANSPOSE(ROW($A$2:$A$10)-ROW($A$2))),ROW(1:1))/ROWS($A$2:$A$10))+1),"")
This sets up a 2D array where the rows are the attribute 1's and the columns are the attribute 2's. The array elements where there is a conflict are set to 1. The rest of the formula is to get the corresponding values out in row/column order.
For type 1:
=IFERROR(INDEX($B$2:$B$10,INT(SMALL(IF(($A$2:$A$10=TRANSPOSE($C$2:$C$10))*($B$2:$B$10<>TRANSPOSE($D$2:$D$10)),
ROWS($A$2:$A$10)*(ROW($A$2:$A$10)-ROW($A$2))+TRANSPOSE(ROW($A$2:$A$10)-ROW($A$2))),ROW(1:1))/ROWS($A$2:$A$10))+1),"")
For type 2:
=IFERROR(INDEX($D$2:$D$10,MOD(SMALL(IF(($A$2:$A$10=TRANSPOSE($C$2:$C$10))*($B$2:$B$10<>TRANSPOSE($D$2:$D$10)),
ROWS($A$2:$A$10)*(ROW($A$2:$A$10)-ROW($A$2))+TRANSPOSE(ROW($A$2:$A$10)-ROW($A$2))),ROW(1:1)),ROWS($A$2:$A$10))+1),"")
It's interesting to compare this with the SQL for doing the same thing e.g. in SQLite:
.mode csv
.import book2.csv test1
select a.attribute_in_1,a.type_in_1,b.type_in_2
from
test1 a inner join test1 b
on a.attribute_in_1=b.attribute_in_2 and a.type_in_1<>b.type_in_2;
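The same self-join can be sketched in pandas, which may be handy if the sheet is loaded into Python anyway. The DataFrame below just retypes the three sample rows from the question, and the variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "ATTRIBUTE_IN_1": ["advance", "education", "date of birth"],
    "TYPE_IN_1": ["String", "String", "date"],
    "ATTRIBUTE_IN_2": ["accounts", "Job_category", "advance"],
    "TYPE_IN_2": ["decimal(20,10)", "String", "decimal(20,10)"],
})

# Treat the two column pairs as separate tables and join them on the attribute name
left = df[["ATTRIBUTE_IN_1", "TYPE_IN_1"]]
right = df[["ATTRIBUTE_IN_2", "TYPE_IN_2"]]
joined = left.merge(right, left_on="ATTRIBUTE_IN_1", right_on="ATTRIBUTE_IN_2")

# Keep only the attributes whose types disagree
conflicts = joined[joined["TYPE_IN_1"] != joined["TYPE_IN_2"]].reset_index(drop=True)
print(conflicts)
```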

Rearrange excel table cells - reordering

I don't know Excel very well and I am trying to take something like this (with a lot of entries):
Field ......Value ....... ID
A .......... blabla1 .......1
B ...........blabla2 .......1
C ...........blabla3 .......1
D ...........blabla4 .......1
A ...........blabla5 .......2
B ...........blabla6 .......2
C ...........blabla7 .......2
D ...........blabla8 .......2
and turn into something more readable like this:
ID -----A -------------B ---------------- C ---------------- D
1 ------blabla1 -----blabla2 -------- blabla3 --------blabla4
2 ------blabla5----- blabla6 -------- blabla7-------- blabla8
Does anyone know a good way to do that? Thank you
(sorry about the bad formatting)
The exact delimiter between each word is key if the text is not already split into separate cells.
Assuming there are numerous words in place of '.....', with each word separated by a single space (a different delimiter would be required if the blablas represented sentences containing one or more spaces), you could achieve the desired table representation as follows
(several functions in this solution require an Office 365 compatible version of Excel;
the lookup in step 3 does not require Office 365, but the IDs and Fields may then need to be entered manually, or VB could be used):
Starting position (after removing blank rows):
Field Value ID
A blabla1 1
B blabla2 1
C blabla3 1
D blabla4 1
A blabla5 2
B blabla6 2
C blabla7 2
D blabla8 2
1) Split cells according to delimiter (skip this step if not relevant)
=TRANSPOSE(FILTERXML("<x><y>"&SUBSTITUTE(F3," ","</y><y>")&"</y></x>","//y"))
(replace the " " inside the substitute function with a different delimiter if required/desired)
2) Obtain unique IDs (rows) and Fields (columns)
=UNIQUE(K4:K11)
=TRANSPOSE(UNIQUE(I4:I11))
3) Index lookup for table content
=INDEX(J4:J11,MATCH(M4#&N3#,K4:K11&I4:I11,0),0)
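If Excel is not a hard requirement, the same reshaping can be sketched in pandas. This assumes the data is already split into Field, Value and ID columns and that each (ID, Field) combination occurs only once. The DataFrame below just rebuilds the sample from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Field": ["A", "B", "C", "D"] * 2,
    "Value": [f"blabla{i}" for i in range(1, 9)],
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
})

# One row per ID, one column per Field, cell content taken from Value
wide = df.pivot(index="ID", columns="Field", values="Value")
print(wide)
```

pivot raises an error if an (ID, Field) combination appears more than once; in that case pivot_table with an explicit aggfunc would be the usual alternative.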

Compare multiple columns, pull out only cells that appear in every column

I have 10 or so columns in my worksheet. Each column contains about 200 names, and there is no other data on the sheet.
What I'd like to do is create a new column that only contains the names that are common between the columns. So essentially, compare each cell in each column to all the other cells in all the other columns, and return only the common cells.
For example:
Column1 : name_A, name_C, name_F
Column2: name_C, name_B, name_D
Column3: name_C, name_Z, name_X
So in this example, the new column would only contain name_C, because it's the only value common to all three columns.
Is there any way to do this? My knowledge of Excel is quite poor, and I can't find anything similar to my problem online so I would appreciate any help.
Thanks for reading,
N
Putting everything on a single spreadsheet and creating a pivot table is probably more efficient than the algorithm you have in mind.
Here is my mock-up. I added extra names to demonstrate better.
D(formula) is the easiest version. It lists only values that appear in all columns, but they appear on the same rows as the corresponding name in column A, with blanks in between, and unsorted (giving D(result)).
If you would like all the names to appear at the top, as shown here in column E, you can either sort your table (you will have to re-sort if the columns change) or use my solution below:
Get the MoreFunc add-in for Excel (here is the last working download link I found, and here is a good installation walk-through video).
Once that is done, select cells E1:E8, click the formula bar and type the following: =UNIQUEVALUES(IF(COUNTIF(A1:C8,A1:A8)=3,A1:A8,""))
Accept the formula by pressing Ctrl-Shift-Enter (this creates an array formula, and curly braces will appear around your formula).
    A       B       C       D(formula)                           D(result)  E(result - sorted)
-------------------------------------------------------------------------------------------------------
1 | name_A  name_C  name_C  =IF(COUNTIF($A$1:$C$8,A1)=3,A1,"")              name_m
2 | name_C  name_B  name_Z  =IF(COUNTIF($A$1:$C$8,A2)=3,A2,"")   name_C     name_C
3 | name_F  name_D  name_X  =IF(COUNTIF($A$1:$C$8,A3)=3,A3,"")
4 | name_t  name_o  name_g  =IF(COUNTIF($A$1:$C$8,A4)=3,A4,"")
5 | name_y  name_p  name_h  =IF(COUNTIF($A$1:$C$8,A5)=3,A5,"")
6 | name_u  name_k  name_7  =IF(COUNTIF($A$1:$C$8,A6)=3,A6,"")
7 | name_i  name_5  name_9  =IF(COUNTIF($A$1:$C$8,A7)=3,A7,"")
8 | name_m  name_m  name_m  =IF(COUNTIF($A$1:$C$8,A8)=3,A8,"")   name_m
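Outside Excel, the same "common to every column" check is a set intersection. Here is a small Python sketch using the three example columns from the question. The dictionary layout and variable names are just for illustration. Unlike the COUNTIF = 3 test, it does not depend on a name appearing only once within a column.

```python
columns = {
    "Column1": ["name_A", "name_C", "name_F"],
    "Column2": ["name_C", "name_B", "name_D"],
    "Column3": ["name_C", "name_Z", "name_X"],
}

# Names present in every column
common = set.intersection(*(set(names) for names in columns.values()))

# Report them in the order they appear in the first column
ordered = [name for name in columns["Column1"] if name in common]
print(ordered)
```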

Resources