Sqlite join columns on mapping of values - python-3.x

I want to be able to join two tables, where there is a mapping between the column values, rather than their values matching.
So rather than:
Table A      Table B
m   f        m   f
a1  1        b1  1
a2  2        b2  3
a3  3        b3  5
SELECT A.m, A.f, B.f, B.m
FROM A
INNER JOIN B ON B.f = A.f
giving:
|m| A.f B.f |m|
a1 1 1 b1
a3 3 3 b2
Given the mapping (1->a), (2->b), (3->c) and the tables:
Table A      Table B
m   f        m   f
a1  1        b1  a
a2  2        b2  b
a3  3        b3  c
to give when joined on f:
|m| A.f B.f |m|
a1 1 a b1
a3 3 c b2
The question below ("Mapping column values") seems to be trying something similar, but they seem to want to change the column values. I just want the mapping to be part of the query; I don't want to change the column values themselves. Besides, it is in R and I'm working in Python.

One solution is to create a temporary table of mappings AB:
CREATE TEMP TABLE AB (a TEXT, b TEXT, PRIMARY KEY(a, b));
Then insert the mappings (SQL string literals take single quotes):
INSERT INTO temp.AB VALUES (1, 'a'), (2, 'b'), (3, 'c');
or use executemany with parameters.
Then select using the intermediary table:
SELECT A.m AS Am, A.f AS Af, B.f AS Bf, B.m AS Bm
FROM A
LEFT JOIN temp.AB ON A.f=AB.a
LEFT JOIN B ON B.f=AB.b;
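A minimal sketch of the temporary-table approach using Python's built-in sqlite3 module. The column types of A and B, the INTEGER type for AB.a, and the use of INNER JOIN (to mirror the question's original query rather than the LEFT JOIN above) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Sample tables from the question (column types are an assumption).
cur.execute("CREATE TABLE A (m TEXT, f INTEGER)")
cur.execute("CREATE TABLE B (m TEXT, f TEXT)")
cur.executemany("INSERT INTO A VALUES (?, ?)", [("a1", 1), ("a2", 2), ("a3", 3)])
cur.executemany("INSERT INTO B VALUES (?, ?)", [("b1", "a"), ("b2", "b"), ("b3", "c")])

# Temporary mapping table, filled with executemany and bound parameters.
cur.execute("CREATE TEMP TABLE AB (a INTEGER, b TEXT, PRIMARY KEY (a, b))")
cur.executemany("INSERT INTO temp.AB VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# Join A to B through the mapping table; neither A.f nor B.f is modified.
rows = cur.execute(
    """
    SELECT A.m AS Am, A.f AS Af, B.f AS Bf, B.m AS Bm
    FROM A
    INNER JOIN temp.AB ON A.f = AB.a
    INNER JOIN B ON B.f = AB.b
    ORDER BY A.f
    """
).fetchall()

print(rows)
conn.close()
```

The temporary table only lives for the duration of the connection, so the mapping stays part of the query session and the stored data is left untouched.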
If you don't want to create an intermediary table, another solution would be building the query yourself.
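# c is assumed to be an open sqlite3 cursor
mappings = ((1, 'a'), (3, 'c'))
sql = ('SELECT A.m AS Am, A.f AS Af, B.f AS Bf, B.m AS Bm FROM A, B WHERE '
       + ' OR '.join(['(A.f=? AND B.f=?)'] * len(mappings)))
# Flatten the pairs into a single parameter list: [1, 'a', 3, 'c']
c.execute(sql, [i for m in mappings for i in m])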
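For completeness, here is a self-contained sketch of the query-building approach. The helper name join_on_mapping, the table setup, and the early return for an empty mapping (which would otherwise produce an invalid WHERE clause) are illustrative assumptions, not part of the original answer.

```python
import sqlite3


def join_on_mapping(conn, mappings):
    """Join A and B on (A.f, B.f) pairs given as a sequence of 2-tuples."""
    if not mappings:
        # An empty mapping would leave a dangling WHERE, so return no rows.
        return []
    # One "(A.f = ? AND B.f = ?)" clause per pair, combined with OR.
    where = " OR ".join(["(A.f = ? AND B.f = ?)"] * len(mappings))
    sql = (
        "SELECT A.m AS Am, A.f AS Af, B.f AS Bf, B.m AS Bm "
        "FROM A, B WHERE " + where + " ORDER BY A.f"
    )
    # Flatten the pairs so the values line up with the placeholders.
    params = [value for pair in mappings for value in pair]
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (m TEXT, f INTEGER)")
conn.execute("CREATE TABLE B (m TEXT, f TEXT)")
conn.executemany("INSERT INTO A VALUES (?, ?)", [("a1", 1), ("a2", 2), ("a3", 3)])
conn.executemany("INSERT INTO B VALUES (?, ?)", [("b1", "a"), ("b2", "b"), ("b3", "c")])

result = join_on_mapping(conn, ((1, "a"), (3, "c")))
print(result)
conn.close()
```

Only the placeholders are generated dynamically; the mapping values themselves are still passed as bound parameters, so no value is ever interpolated into the SQL text.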

Related

How to subtract a value present in one row from another row within the same table?

I'm trying to subtract two rows of different columns. Example table:
C1  C2  C3
A1  2
A2      3
B1      4
So essentially, I want A2-A1 from C3 and C2 columns respectively. My approach was to somehow get values in C2New column and then subtract.
C1  C2  C2New  C3  C4
A1  2   2
A2      2      3   1
B1             4
If you are using explorer, here is how you can create the table:
let X = datatable( c1:string , c2:int , c3:int )
[ 'a1',2,3,
'a2', 0,3,
'b1', 0,4
];
X
| project c1, c2, c3
I have tried different joins, selfjoins, lookups and toscalar etc., expecting it would populate a value in empty cells and I would then create a new column or scalar with the difference in values. I'm totally new to coding and querying. Your help is appreciated.
KQL script:
let X = datatable( c1:string , c2:int , c3:int )
[ 'a1',2,3,
'a2', 0,3,
'b1', 0,4
];
X
| project c1, c2, c3
| serialize
| extend prevC2 = prev(c2,1)
| extend c4 = c3 - prevC2
Apply the serialize operator to the table, then use the prev() function to get the value from the previous row.
Then subtract the previous row's c2 value from the current row's c3 value.
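The same row-by-row logic can be sketched in plain Python to make the prev() step explicit. This is only an illustration of the idea, not KQL: the list of dicts stands in for the serialized table, and None plays the role of the empty value that the first row gets because it has no predecessor.

```python
rows = [
    {"c1": "a1", "c2": 2, "c3": 3},
    {"c1": "a2", "c2": 0, "c3": 3},
    {"c1": "b1", "c2": 0, "c3": 4},
]

result = []
prev_c2 = None  # the first row has no previous row
for row in rows:
    # c4 = c3 - prev(c2); undefined for the first row
    c4 = None if prev_c2 is None else row["c3"] - prev_c2
    result.append({**row, "prevC2": prev_c2, "c4": c4})
    prev_c2 = row["c2"]  # remember this row's c2 for the next iteration

for row in result:
    print(row)
```

The row order matters here, which is why the KQL version needs serialize before prev() can be used.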
Updated Script
As per David דודו Markovitz's comment, I updated the script.
let X = datatable( c1:string , c2:int , c3:int )
[ 'a1',2,3,
'a2', 0,3,
'b1', 0,4
];
X
| serialize c4 = c3 - prev(c2)
Output data
c1  c2  c3  prevc2  c4
a1  2   3
a2  0   3   2       1
b1  0   4   0       4

Pandas Filter rows by comparing columns A1 with A2

CHR  SNP         BP         A1  A2  OR       P
8    rs62513865  101592213  T   C   1.00652  0.8086
8    rs79643588  106973048  A   T   1.01786  0.4606
I have this example table, and I want to filter rows by comparing column A1 with A2.
If any of these four combinations occurs, delete the row:
A1  A2
A   T
T   A
C   G
G   C
(e.g. row 2 in the first table).
How can I do that using Python pandas?
Here is one way to do it.
Concatenate the two columns in each of the two DataFrames (df holds the data, df2 the A1/A2 combinations to exclude), turn the combinations from df2 into a list, and keep only the rows of df whose combination is not in that list:
df[~(df['A1'] + df['A2']).str.strip().isin((df2['A1'] + df2['A2']).tolist())]
CHR SNP BP A1 A2 OR P
0 8 rs62513865 101592213 T C 1.00652 0.8086
keeping
Assuming df1 and df2, you can simply merge to keep the common values:
out = df1.merge(df2)
output:
CHR SNP BP A1 A2 OR P
0 8 rs79643588 106973048 A T 1.01786 0.4606
dropping
For removing the rows, perform a negative merge:
out = (df1.merge(df2, how='outer', indicator=True)
.loc[lambda d: d.pop('_merge').eq('left_only')]
)
Or merge and get the remaining indices to drop (requires unique indices):
out = df1.drop(df1.reset_index().merge(df2)['index'])
output:
CHR SNP BP A1 A2 OR P
0 8.0 rs62513865 101592213.0 T C 1.00652 0.8086
alternative approach
As it seems you have nucleotides and want to drop the rows where A1 and A2 form an A/T or G/C pair (in either order), you could translate each nucleotide in A1 to its complement and keep the rows where the result is not identical to A2:
m = df1['A1'].map({'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}).ne(df1['A2'])
out = df1[m]
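A self-contained sketch of the complement-mapping idea, run against the two sample rows from the question. It assumes that all four directions (A/T, T/A, C/G, G/C) should be dropped, as listed in the question; the names COMPLEMENT and is_pair are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "CHR": [8, 8],
    "SNP": ["rs62513865", "rs79643588"],
    "BP": [101592213, 106973048],
    "A1": ["T", "A"],
    "A2": ["C", "T"],
    "OR": [1.00652, 1.01786],
    "P": [0.8086, 0.4606],
})

# Complement of each nucleotide; values outside this dict map to NaN.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# True where A2 is the complement of A1, i.e. the row should be dropped.
is_pair = df1["A1"].map(COMPLEMENT).eq(df1["A2"])
out = df1[~is_pair]

print(out)
```

Because the check is a single vectorized comparison, no second DataFrame of forbidden combinations is needed.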

elegant way to iterate & compare in Spark DataFrame

I have a Spark DataFrame with 2 columns: C1:Seq[Any] and C2:Double. I want to:
1. Sort by length of C1.
2. For each element c1 in C1, compare with every other element in C1 that is longer than c1.
2.1 If c1 is contained in another element c1x, then compare c2 with c2x.
2.2 If c2 > c2x, then filter out (c1x, c2x).
Is there an elegant way to achieve this?
Sample Input:
C1 C2
ab 1.0
abc 0.5
Expected output:
C1 C2
ab 1.0
Contain = subset. e.g. ab is contained in abc.
I have a Spark DataFrame with 2 columns: C1:Seq[Any] and C2:Double
val rdd = sc.parallelize(List(("ab", 1.0), ("abc", 0.5)))
Sort by length of C1.
val rddSorted = rdd.sortBy(_._1.length).collect().distinct
For each element c1 in C1, compare with every other element in C1 that is longer than c1.
2.1 If c1 is contained in another element cx, then compare c2 with c2x.
2.2 If c2 > c2x, then filter out (c1x, c2x).
val result = for (
  (x, y) <- rddSorted;
  (a, b) <- rddSorted.dropWhile { case (c, d) => c == x && d == y };
  if (a.contains(x) && a.length > x.length && y > b)
) yield (x, y)
That's all. You should get what you are looking for.
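The rule stated in the question can also be sketched in plain Python, which makes it easier to check on small inputs before porting it to Spark. This is a quadratic, single-machine illustration rather than a Spark implementation. It assumes "contained" means the subset relation on elements (as the question says) and that a row is compared against every shorter row, including rows that are themselves filtered out; the function name filter_dominated is hypothetical.

```python
def filter_dominated(rows):
    """Drop (c1x, c2x) when a shorter c1 is a subset of c1x and has c2 > c2x."""
    ordered = sorted(rows, key=lambda r: len(r[0]))  # step 1: sort by length
    kept = []
    for c1x, c2x in ordered:
        dominated = any(
            len(c1) < len(c1x)          # only shorter elements can dominate
            and set(c1) <= set(c1x)     # 2.1: c1 is contained in c1x
            and c2 > c2x                # 2.2: and has the larger score
            for c1, c2 in ordered
        )
        if not dominated:
            kept.append((c1x, c2x))
    return kept


sample = [("ab", 1.0), ("abc", 0.5)]
print(filter_dominated(sample))
```

Unlike yielding only the shorter element of each matching pair, this keeps rows that are not involved in any containment relation at all.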

Nested SEARCH and IF grid

Column A lists the categories for individual products. Each product will have between 1 to 14 categories. The goal is to split all categories into separate cells (Columns B through O), to become easier to sort.
I've created formulas for columns B through O, to SEARCH for hyphens "-" and separate each category into its own column. Here's what the output should look like (except for the bottom row):
A B C D E F G ... O
Categories Cat 1 Cat 2 Cat 3 Cat 4 Cat 5 Cat 6 ... Cat 14
UX UX
CFT-WET CFT WET
WEM-US-CFT WEM US CFT
NC-US-CFT NC US CFT
TP-OB-SB-WEB TP OB SB WEB
DB-B-FC DB B FC
P-TP-SB-CP-DT P TP SB CP DT
DP-S-OB-WB-SB-FC DP S OB WB SB FC
P-TP-SB-CP-WEB-WS-S-TP-OB-C-CT-G-FC-MCB
I initially built these formulas:
Col B: =if(iserror(search("-",$A5)),$A5,search("-",$A5))
Col C: =if(iserror(search("-",$A5,sum(search("-",$A5)+1))),"",mid($A5,sum(search("-",$A5),1),sum(search("-",$A5,sum(search("-",$A5)+1)),-search("-",$A5),-1)))
Col D: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5)+1)+1),-search("-",$A5,search("-",$A5)+1),-1)))
Col E: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5,search("-",$A5)+1)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1),-search("-",$A5,search("-",$A5,search("-",$A5)+1)+1),-1)))
Col F: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1),-search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1),-1)))
Col G: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1),-search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1),-1)))
Col H: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1),-search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1),-1)))
Col I: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1),-search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1),-1)))
Col J: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1),-search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1),-1)))
Col K: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1),-search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1),-1)))
Col L: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1),-search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1),-1)))
Col M: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1),-search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1),-1)))
Col N: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1),-search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1),-1)))
Col O: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)),"",mid($A5,sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1),-search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1)+1),-1)))
...but these fail to recognize the final category for each row. Col B's formula has no problems, so I began to edit the formulas in Columns C to O:
Col C: =if(iserror(search("-",$A5,sum(search("-",$A5)+1))),if($B5<>"",right($A5,sum(len($A5),-search("-",$A5))),""),mid($A5,sum(search("-",$A5),1),sum(search("-",$A5,sum(search("-",$A5)+1)),-search("-",$A5),-1)))
Col D: =if(iserror(search("-",$A5,search("-",$A5,search("-",$A5)+1)+1)),if($C5<>"",if(iserror(search("-",$A5,search("-",$A5)+1)),"",right($A5,len(sum($A5,-search("-",$A5,search("-",$A5)+1))))),""),mid($A5,sum(search("-",$A5,search("-",$A5)+1),1),sum(search("-",$A5,search("-",$A5,search("-",$A5)+1)+1),-search("-",$A5,search("-",$A5)+1),-1)))
It appears this solution will work for all situations except when Column A has only one category. When that happens, a #VALUE! error will populate all columns to the right of Col B. How do I solve this?
For now, I'm stuck with this:
A B C D E F G ... O
Categories Cat 1 Cat 2 Cat 3 Cat 4 Cat 5 Cat 6 ... Cat 14
UX UX #VALUE! #VALUE! #VALUE! #VALUE! #VALUE! ... #VALUE!
CFT-WET CFT WET
WEM-US-CFT WEM US CFT
NC-US-CFT NC US CFT
TP-OB-SB-WEB TP OB SB WEB
DB-B-FC DB B FC
P-TP-SB-CP-DT P TP SB CP DT
DP-S-OB-WB-SB-FC DP S OB WB SB FC
P-TP-SB-CP-WEB-WS-S-TP-OB-C-CT-G-FC-MCB
Here's what I would use
I'm assuming row 1 of your spreadsheet contains "Cat 1", "Cat 2" etc. This formula relies on there being a space between "Cat" and the category number. I'm also assuming column A contains the concatenated string.
Note this requires that no category will ever contain the character ":". If there is a risk of this, you can change the ":" in the formula below to any other character (the example below goes in cell B2):
=IF(LEN($A2)-LEN(SUBSTITUTE($A2,"-",""))+1>=VALUE(MID(B$1,SEARCH(" ",B$1)+1,100)), MID("-"&$A2&"-",SEARCH(":",SUBSTITUTE("-"&$A2&"-","-",":",VALUE(MID(B$1,SEARCH(" ",B$1)+1,100))))+1,SEARCH(":",SUBSTITUTE("-"&$A2&"-","-",":",VALUE(MID(B$1,SEARCH(" ",B$1)+1,100)+1)))-SEARCH(":",SUBSTITUTE("-"&$A2&"-","-",":",VALUE(MID(B$1,SEARCH(" ",B$1)+1,100))))-1),"")
A brief explanation: the last optional argument of SUBSTITUTE is instance_num, which means you can change the nth instance of "-" to ":" and then search for that. This keeps the formula fairly simple. I've also added a "-" before and after the string in the formula to avoid having to use any IF statements (there will always be a preceding and a trailing "-").
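The logic of that formula can be mirrored in a short Python sketch, which may help when checking the expected output for a given row. It is an illustration of the same steps (pad with separators, locate the nth and the following separator, take what lies between), not a replacement for the spreadsheet formula; the function name nth_category is hypothetical.

```python
def nth_category(value, n, sep="-"):
    """Return the nth (1-based) category of a sep-delimited string, or ""."""
    # Same count as LEN(A2) - LEN(SUBSTITUTE(A2, "-", "")) + 1
    count = value.count(sep) + 1
    if n > count:
        return ""
    # Pad so every category has a separator on both sides.
    padded = sep + value + sep
    # Walk to the nth separator, like SUBSTITUTE(..., instance_num = n).
    start = -1
    for _ in range(n):
        start = padded.index(sep, start + 1)
    # The (n+1)th separator closes the category.
    end = padded.index(sep, start + 1)
    return padded[start + 1:end]


row = "TP-OB-SB-WEB"
cells = [nth_category(row, n) for n in range(1, 15)]  # Cat 1 .. Cat 14
print(cells)
```

A string with a single category, such as "UX", yields that category in the first position and empty strings afterwards, which is the case that produced #VALUE! in the question.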

How to sum constants if the values of a row contain a specific value in Excel?

I have the following row in excel:
12 4 12p 12a 12b
I need to sum these elements using their values from the legend.
12 = 12;
4 = 4;
12p = 12,5;
12a = 12,2;
12b = 12,3;
For example
=12 + 4 + 12,5 + 12,2 + 12,3
Any ideas?
If you have all the elements within one cell as a single string of text, the optimal approach would be to start by using text-to-column to split them up. So you'll have 12 in A, 4 in B, 12p in C, 12a in D, 12b in E. If that's not an option, I can show you string manipulations that can be an alternative.
You'll need to turn your "legend" into a look-up table, (perhaps on sheet2?), with column A having: p, a, b, etc.. and column B having the relative values.
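Once that's done, place this formula on sheet1, in F column. It splits off the trailing letter before adding the looked-up value, because adding a text cell such as 12p directly would return #VALUE!:
=IFERROR(VALUE(A2),VALUE(LEFT(A2,LEN(A2)-1))+VLOOKUP(RIGHT(A2),Sheet2!$A:$B,2,FALSE))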
Then drag it to the right 5 times, and it will have the values of the elements "translated".
You can sum the translated range easily.
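The translate-then-sum idea can be sketched in Python as follows. The suffix offsets (p = 0.5, a = 0.2, b = 0.3) are inferred from the legend in the question under the assumption that each letter adds a fixed amount regardless of the number it follows; the names SUFFIX_VALUES and translate are illustrative.

```python
# Lookup table for the trailing letter (assumed to be a constant offset).
SUFFIX_VALUES = {"p": 0.5, "a": 0.2, "b": 0.3}


def translate(cell):
    """Convert an entry such as "12p" to its numeric value from the legend."""
    text = str(cell).strip()
    if text and text[-1] in SUFFIX_VALUES:
        # Numeric part plus the looked-up offset for the suffix.
        return float(text[:-1]) + SUFFIX_VALUES[text[-1]]
    # No known suffix: the entry is a plain number.
    return float(text)


row = ["12", "4", "12p", "12a", "12b"]
total = sum(translate(cell) for cell in row)
print(round(total, 2))
```

Rounding the printed total avoids floating-point noise from adding the decimal offsets.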
