elegant way to iterate & compare in Spark DataFrame - apache-spark

I have a Spark DataFrame with 2 columns: C1:Seq[Any] and C2:Double. I want to:
1. Sort by the length of C1.
2. For each element c1 in C1, compare it with every other element in C1 that is longer than c1.
2.1 If c1 is contained in another element cx, then compare c2 with cx's value c2x.
2.2 If c2 > c2x, then filter out (cx, c2x).
Is there an elegant way to achieve this?
Sample Input:
C1 C2
ab 1.0
abc 0.5
Expected output:
C1 C2
ab 1.0
Contain = subset. e.g. ab is contained in abc.

I have a Spark DataFrame with 2 columns: C1:Seq[Any] and C2:Double
val rdd = sc.parallelize(List(("ab", 1.0), ("abc", 0.5)))
Sort by length of C1.
val rddSorted = rdd.sortBy(_._1.length).collect().distinct
For each element c1 in C1, compare with every other element in C1 that is longer than c1.
2.1 If c1 is contained in another element cx, then compare c2 with cx's value c2x.
2.2 If c2 > c2x, then filter out (cx, c2x).
val result = for (
  (x, y) <- rddSorted;                                               // each (c1, c2) pair, shortest first
  (a, b) <- rddSorted.dropWhile { case (c, d) => c == x && d == y }; // the other pairs; the guards below do the filtering
  if a.contains(x) && a.length > x.length && y > b                   // x sits inside a longer a whose score b is smaller
) yield (x, y)
That's all. You should get what you are looking for.
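For comparison, here is a hedged PySpark sketch of the same contained-in filter done as a DataFrame self join instead of a collected loop; it assumes C1 holds strings (as in the sample) and is not part of the original answer:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("ab", 1.0), ("abc", 0.5)], ["C1", "C2"])

a, b = df.alias("a"), df.alias("b")

# Longer rows (b) that contain a shorter row (a) with a higher score are the rows to drop.
dominated = (a.join(b, (F.length("b.C1") > F.length("a.C1"))
                       & F.col("b.C1").contains(F.col("a.C1")))
               .where(F.col("a.C2") > F.col("b.C2"))
               .select(F.col("b.C1").alias("C1"), F.col("b.C2").alias("C2")))

# Keep everything that is not dominated; for the sample this leaves (ab, 1.0).
result = df.join(dominated, ["C1", "C2"], "left_anti")
result.show()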

Related

how to subtract a value present in one row from another row within the same table?

I'm trying to subtract two rows of different columns. Example table:
C1    C2    C3
A1    2
A2          3
B1          4
So essentially, I want A2 - A1 from the C3 and C2 columns respectively. My approach was to somehow get the values into a C2New column and then subtract:
C1    C2    C2New    C3    C4
A1    2     2
A2          2        3     1
B1                   4
If you are using explorer, here is how you can create the table:
let X = datatable(c1:string, c2:int, c3:int)
[
    'a1', 2, 3,
    'a2', 0, 3,
    'b1', 0, 4
];
X
| project c1, c2, c3
I have tried different joins, self-joins, lookups, toscalar, etc., expecting they would populate a value in the empty cells so that I could then create a new column or scalar with the difference in values. I'm totally new to coding and querying. Your help is appreciated.
KQL script:
let X = datatable(c1:string, c2:int, c3:int)
[
    'a1', 2, 3,
    'a2', 0, 3,
    'b1', 0, 4
];
X
| project c1, c2, c3
| serialize
| extend prevC2 = prev(c2, 1)
| extend c4 = c3 - prevC2
Use the serialize operator on the table and then use the prev() function to get the previous row's value. Then subtract the previous row's c2 value from the current row's c3 value.
Updated Script
As per David דודו Markovitz's comment, I updated the script.
let X = datatable(c1:string, c2:int, c3:int)
[
    'a1', 2, 3,
    'a2', 0, 3,
    'b1', 0, 4
];
X
| serialize c4 = c3 - prev(c2)
Output data:
c1    c2    c3    prevc2    c4
a1    2     3
a2    0     3     2         1
b1    0     4     0         4
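For readers more comfortable with pandas, here is a rough, hedged analogue of the serialize/prev pattern above (an illustration only, not part of the KQL answer); shift() plays the role of prev():
import pandas as pd

X = pd.DataFrame({"c1": ["a1", "a2", "b1"],
                  "c2": [2, 0, 0],
                  "c3": [3, 3, 4]})

X["prevc2"] = X["c2"].shift(1)   # previous row's c2, like prev(c2)
X["c4"] = X["c3"] - X["prevc2"]  # c3 minus the previous row's c2
print(X)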

Pandas Filter rows by comparing columns A1 with A2

CHR    SNP           BP           A1    A2    OR         P
8      rs62513865    101592213    T     C     1.00652    0.8086
8      rs79643588    106973048    A     T     1.01786    0.4606
I have this example table, and I want to filter rows by comparing column A1 with A2.
If any of these four conditions holds, delete the line:
A1    A2
A     T
T     A
C     G
G     C
(e.g. line 2 in the first table).
How can I do that using Python pandas?
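For reference, here is a hedged sketch of how the two frames assumed by the answers below could be built (df/df1 is the data table, df2 holds the four A1/A2 pairs to drop); the exact construction is an assumption, not part of the question:
import pandas as pd

df1 = pd.DataFrame({
    "CHR": [8, 8],
    "SNP": ["rs62513865", "rs79643588"],
    "BP": [101592213, 106973048],
    "A1": ["T", "A"],
    "A2": ["C", "T"],
    "OR": [1.00652, 1.01786],
    "P": [0.8086, 0.4606],
})
df = df1  # the first answer calls this frame df
df2 = pd.DataFrame({"A1": ["A", "T", "C", "G"],
                    "A2": ["T", "A", "G", "C"]})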
here is one way to do it
Combine the two columns in each of the two DataFrames, turn the second combination into a list, and check whether the first combination appears in it:
df[~(df['A1'] + df['A2']).str.strip()
   .isin((df2['A1'] + df2['A2']).tolist())]
CHR SNP BP A1 A2 OR P
0 8 rs62513865 101592213 T C 1.00652 0.8086
keeping
Assuming df1 and df2, you can simply merge to keep the common values:
out = df1.merge(df2)
output:
CHR SNP BP A1 A2 OR P
0 8 rs79643588 106973048 A T 1.01786 0.4606
dropping
For removing the rows, perform a negative merge:
out = (df1.merge(df2, how='outer', indicator=True)
          .loc[lambda d: d.pop('_merge').eq('left_only')]
      )
Or merge and get the remaining indices to drop (requires unique indices):
out = df1.drop(df1.reset_index().merge(df2)['index'])
output:
CHR SNP BP A1 A2 OR P
0 8.0 rs62513865 101592213.0 T C 1.00652 0.8086
alternative approach
As it seems you have nucleotides and want to drop the rows where A1 and A2 form a complementary A/T or C/G pair, you could translate each base in A1 to its complement and keep only the rows where the result differs from A2:
m = df1['A1'].map({'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}).fillna(df1['A1']).ne(df1['A2'])
out = df1[m]

how to filter a dataframe based on another dataframe?

I have two DataFrames in pyspark:
d1: (x, y, value) and d2: (k, v, value). The entries in d1 are unique (you can consider column x alone as a key, and y alone as a key).
x y value
a b 0.2
c d 0.4
e f 0.8
d2 has the following format:
k v value
a c 0.7
k k 0.3
j h 0.8
e p 0.1
a b 0.1
I need to filter d2 according to the co-occurrence in d1, i.e. (a, c, 0.7) and (e, p, 0.1) should be deleted, as a can occur only with b, and similarly for e.
I tried to select from d1 the x and y columns.
sourceList = df1.select("x").collect()
sourceList = [row.x for row in sourceList]
sourceList_b = sc.broadcast(sourceList)
then
check_id_isin = sf.udf(lambda x: x in sourceList , BooleanType())
d2 = d2.where(~d2.k.isin(sourceList_b.value))
For small datasets it works well, but for large ones the collect causes an exception. I want to know if there is a better logic to compute this step.
One way could be to join d1 to d2, then fill the missing values in column y from column v using coalesce, and then filter out the rows where y and v differ, such as:
import pyspark.sql.functions as F

(d2.join(d1.select('x', 'y').withColumnRenamed('x', 'k'),  # rename x to k for an easier join
         on=['k'], how='left')                             # left join to keep only d2 rows
   .withColumn('y', F.coalesce('y', 'v'))                  # fill the values missing in y with the ones from v
   .filter(F.col('v') == F.col('y'))                       # keep only rows where the value in v equals y
   .drop('y')                                              # drop the column y, no longer necessary
   .show())
and you get:
+---+---+-----+
| k| v|value|
+---+---+-----+
| k| k| 0.3|
| j| h| 0.8|
+---+---+-----+
It should also keep any row of d2 whose (k, v) pair appears as an (x, y) pair in d1.
So you have two problems here:
Logic for joining these two tables:
This can be done by performing an inner join on two columns instead of one. This is the code for that:
# Create an expression wherein you do an inner join on two cols
joinExpr = (d1.x == d2.k) & (d1.y == d2.v)
joinDF = d1.join(d2, joinExpr)
The second problem is speed. There are multiple ways of fixing it. Here are my top two:
a. If one of the dataframes is significantly smaller (usually under 2 GB) than the other dataframe, then you can use the broadcast join. It essentially copies the smaller dataframe to all the workers so that there is no need to shuffle while joining. Here is an example:
from pyspark.sql.functions import broadcast
joinExpr = (d1.x == d2.k) & (d1.y == d2.v)
joinDF = d1.join(broadcast(d2), joinExpr)
b. Try adding more workers and increasing the memory.
What you probably want to do is think of this in relational terms: join d1 and d2 on d1.x = d2.k AND d1.y = d2.v. An inner join will drop any records from d2 that don't have a corresponding pair in d1. By doing a join, Spark will perform a cluster-wide shuffle of the data, allowing for much greater parallelism and scalability compared to a broadcast exchange, which generally caps out at about ~10 MB of data (the cut-over point Spark uses between a shuffle join and a broadcast join).
Also, as an FYI, WHERE (a, b) IN (...) gets translated into a join in most cases, unless the (...) is a small set of data.
https://github.com/vaquarkhan/vaquarkhan/wiki/Apache-Spark--Shuffle-hash-join-vs--Broadcast-hash-join
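As a minimal, hedged sketch of the join-based filtering described above (assuming DataFrames d1(x, y, value) and d2(k, v, value) already exist; a left_semi join keeps only d2's columns and rows):
d1_keys = d1.select("x", "y")

# Keep only the d2 rows whose (k, v) pair appears as an (x, y) pair in d1.
kept = d2.join(d1_keys,
               (d2.k == d1_keys.x) & (d2.v == d1_keys.y),
               how="left_semi")
kept.show()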

Sqlite join columns on mapping of values

I want to be able to join two tables, where there is a mapping between the column values, rather than their values matching.
So rather than:
Table A:        Table B:
m    f          m    f
a1   1          b1   1
a2   2          b2   3
a3   3          b3   5
SELECT A.m, A.f, B.f, B.m
FROM A
INNER JOIN B ON B.f = A.f
giving:
A.m    A.f    B.f    B.m
a1     1      1      b1
a3     3      3      b2
Given the mapping (1->a)(2->b)(3->c):
Table A:        Table B:
m    f          m    f
a1   1          b1   a
a2   2          b2   b
a3   3          b3   c
to give when joined on f:
A.m    A.f    B.f    B.m
a1     1      a      b1
a3     3      c      b2
The question linked below seems to be trying something similar, but they seem to want to change the column values; I just want the mapping to be part of the query, and I don't want to change the column values themselves. Besides, it is in R and I'm working in Python.
Mapping column values
One solution is to create a temporary table of mappings AB:
CREATE TEMP TABLE AB (a TEXT, b TEXT, PRIMARY KEY(a, b));
Then insert mappings,
INSERT INTO temp.AB VALUES (1, 'a'), (2, 'b'), (3, 'c');
or executemany with params.
Then select using intermediary table.
SELECT A.m AS Am, A.f AS Af, B.f AS Bf, B.m AS Bm
FROM A
LEFT JOIN temp.AB ON A.f=AB.a
LEFT JOIN B ON B.f=AB.b;
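Since the question mentions working from Python, here is a hedged sketch of wiring the temp-table approach up with the standard sqlite3 module (the database path and the assumption that tables A(m, f) and B(m, f) already exist are illustrative):
import sqlite3

con = sqlite3.connect("example.db")  # hypothetical database file
cur = con.cursor()

cur.execute("CREATE TEMP TABLE AB (a TEXT, b TEXT, PRIMARY KEY(a, b))")
cur.executemany("INSERT INTO temp.AB VALUES (?, ?)",
                [(1, 'a'), (2, 'b'), (3, 'c')])

rows = cur.execute("""
    SELECT A.m AS Am, A.f AS Af, B.f AS Bf, B.m AS Bm
    FROM A
    LEFT JOIN temp.AB ON A.f = AB.a
    LEFT JOIN B ON B.f = AB.b
""").fetchall()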
If you don't want to create an intermediary table, another solution would be building the query yourself.
mappings = ((1,'a'), (3,'c'))
sql = 'SELECT A.m AS Am, A.f AS Af, B.f AS Bf, B.m AS Bm FROM A, B WHERE ' \
+ ' OR '.join(['(A.f=? AND B.f=?)'] * len(mappings))
c.execute(sql, [i for m in mappings for i in m])

Converting a python 3 dictionary into a dataframe (keys are tuples)

I have a dictionary that has the following structure:
Dict [(t1,d1)] = x
(x are integers, t1 and d1 are strings)
I want to convert this Dictionary into a dataframe of the following format:
d1 d2 d3 d4
t1 x y z x
t2 etc.
t3
t4
...
The following command
pd.DataFrame([[key, value] for key, value in Dict.items()], columns=["key_col", "val_col"])
gives me
key_col val_col
0 (book, d1) 100
1 (pen, d1) 10
2 (book, d2) 30
3 (pen, d2) 0
How do I make d's my column names and t's my row names?
Pandas automatically treats tuple keys as a MultiIndex. Pass the dictionary to the Series constructor and unstack:
pd.Series(dct).unstack()
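For illustration, a small hedged example using the keys from the question's sample output:
import pandas as pd

dct = {("book", "d1"): 100, ("pen", "d1"): 10,
       ("book", "d2"): 30, ("pen", "d2"): 0}

df = pd.Series(dct).unstack()  # tuple keys become a MultiIndex; unstack moves the d's to columns
print(df)
#        d1  d2
# book  100  30
# pen    10   0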
