Converting a python 3 dictionary into a dataframe (keys are tuples) - python-3.x

I have a dictionary that has the following structure:
Dict[(t1, d1)] = x
(x are integers, t1 and d1 are strings)
I want to convert this Dictionary into a dataframe of the following format:
d1 d2 d3 d4
t1 x y z x
t2 etc.
t3
t4
...
The following command
pd.DataFrame([[key, value] for key, value in Dict.items()], columns=["key_col", "val_col"])
gives me
key_col val_col
0 (book, d1) 100
1 (pen, d1) 10
2 (book, d2) 30
3 (pen, d2) 0
How do I make d's my column names and t's my row names?

Pandas automatically treats tuple keys as a MultiIndex. Pass the dictionary to the Series constructor and unstack:
pd.Series(dct).unstack()
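For instance, a minimal sketch using the keys and values from the question's own output:

import pandas as pd

# keys are (t, d) tuples, as in the question
dct = {("book", "d1"): 100, ("pen", "d1"): 10,
       ("book", "d2"): 30, ("pen", "d2"): 0}

# the tuple keys become a MultiIndex; unstack() pivots the d-level into columns
out = pd.Series(dct).unstack()
print(out)
#        d1  d2
# book  100  30
# pen    10   0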

Related

Pandas Filter rows by comparing columns A1 with A2

CHR  SNP         BP         A1  A2  OR       P
8    rs62513865  101592213  T   C   1.00652  0.8086
8    rs79643588  106973048  A   T   1.01786  0.4606
I have this table example, and I want to filter rows by comparing column A1 with A2.
If any of these four conditions holds, delete the line:
A1  A2
A   T
T   A
C   G
G   C
(e.g. line 2 in the first table).
How can I do that using Python pandas?
Here is one way to do it.
Combine the two columns in each of the two DataFrames, turn the second combination into a list, and check whether each combination from the first appears in it:
df[~(df['A1'] + df['A2']).str.strip()
   .isin((df2['A1'] + df2['A2']).tolist())]
CHR SNP BP A1 A2 OR P
0 8 rs62513865 101592213 T C 1.00652 0.8086
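For reference, a minimal, hypothetical construction of the two tables as DataFrames, which the snippets in these answers assume (the merge-based answer below refers to the first table as df1):

import pandas as pd

# the question's main table
df = pd.DataFrame({
    "CHR": [8, 8],
    "SNP": ["rs62513865", "rs79643588"],
    "BP": [101592213, 106973048],
    "A1": ["T", "A"],
    "A2": ["C", "T"],
    "OR": [1.00652, 1.01786],
    "P": [0.8086, 0.4606],
})
df1 = df  # alias used by the answer below

# the A1/A2 pairs whose rows should be dropped
df2 = pd.DataFrame({"A1": ["A", "T", "C", "G"],
                    "A2": ["T", "A", "G", "C"]})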
keeping
Assuming df1 and df2, you can simply merge to keep the common values:
out = df1.merge(df2)
output:
CHR SNP BP A1 A2 OR P
0 8 rs79643588 106973048 A T 1.01786 0.4606
dropping
For removing the rows, perform a negative merge:
out = (df1.merge(df2, how='outer', indicator=True)
          .loc[lambda d: d.pop('_merge').eq('left_only')]
      )
Or merge and get the remaining indices to drop (requires unique indices):
out = df1.drop(df1.reset_index().merge(df2)['index'])
output:
CHR SNP BP A1 A2 OR P
0 8.0 rs62513865 101592213.0 T C 1.00652 0.8086
alternative approach
As it seems you have nucleotides and want to drop the cases that form an A/T or C/G pair, you could translate A to T and C to G in A1 and check that the result is not identical to the value of A2:
m = df1['A1'].map({'A': 'T', 'C': 'G'}).fillna(df1['A1']).ne(df1['A2'])
out = df1[m]
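For the question's sample table, this again leaves only the rs62513865 row, matching the output of the dropping approach above.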

Pandas dataframe deduplicate rows with column logic

I have a pandas dataframe with about 100 million rows. I am interested in deduplicating it but have some criteria that I haven't been able to find documentation for.
I would like to deduplicate the dataframe, ignoring one column that will differ. If that row is a duplicate, except for that column, I would like to only keep the row that has a specific string, say X.
Sample dataframe:
import pandas as pd
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
Desired output:
>>> df_dedup
A B C
0 1 2 00X
1 1 3 010
So, alternatively stated, row index 2 would be removed because row index 0 has the same information in columns A and B, plus an X in column C.
As this data is fairly large, I hope to avoid iterating over rows, if possible. The ignore_index option of the built-in drop_duplicates() is the closest thing I've found.
If there is no X in column C, then the rows should only be deduplicated when C is identical as well.
In the case where rows have matching A and B but multiple different values of C containing an X, the following would be expected.
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "0X0"],
                        [1, 2, "X00"],
                        [1, 2, "0X0"]])
Output should be:
>>> df_dedup
A B C
0 1 2 0X0
1 1 2 X00
Use DataFrame.duplicated on columns A and B to create a boolean mask m1 for the rows whose (A, B) combination has not appeared before, then use Series.str.contains + Series.duplicated on column C to create a boolean mask m2 for the rows where C contains the string X and has not appeared before. Finally, use these masks to filter the rows in df.
m1 = ~df[['A', 'B']].duplicated()
m2 = df['C'].str.contains('X') & ~df['C'].duplicated()
df = df[m1 | m2]
Result:
#1
A B C
0 1 2 00X
1 1 3 010
#2
A B C
0 1 2 0X0
1 1 2 X00
Does the column "C" always have X as the last character of each value? You could try creating a column D that is 1 if column C has an X and 0 if it does not. Then sort the values using sort_values and finally use drop_duplicates with keep='last':
import pandas as pd
df = pd.DataFrame(columns=["A", "B", "C"],
                  data=[[1, 2, "00X"],
                        [1, 3, "010"],
                        [1, 2, "002"]])
df['D'] = 0
df.loc[df['C'].str[-1] == 'X', 'D'] = 1
df.sort_values(by=['D'], inplace=True)
df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)
This assumes you also want to drop duplicates when there is no X in the 'C' column among the rows that share the same A and B values.
Here is another approach. I left 'count' (a helper column) in for transparency.
# use df as defined above
# count the A,B pairs
df['count'] = df.groupby(['A', 'B']).transform('count').squeeze()
m1 = (df['count'] == 1)
m2 = (df['count'] > 1) & df['C'].str.contains('X') # could be .endswith('X')
print(df.loc[m1 | m2]) # apply masks m1, m2
A B C count
0 1 2 00X 2
1 1 3 010 1

how to filter a dataframe based on another dataframe?

I have two DataFrames in pyspark:
d1: (x, y, value) and d2: (k, v, value). The entries in d1 are unique (you can consider column x alone as a key, and y alone as a key).
x y value
a b 0.2
c d 0.4
e f 0.8
d2 has the following format:
k v value
a c 0.7
k k 0.3
j h 0.8
e p 0.1
a b 0.1
I need to filter d2 according to the co-occurrence in d1, i.e., (a, c, 0.7) and (e, p, 0.1) should be deleted, as a can occur only with b, and similarly for e.
I tried to select from d1 the x and y columns.
sourceList = df1.select("x").collect()
sourceList = [row.x for row in sourceList]
sourceList_b = sc.broadcast(sourceList)
then
check_id_isin = sf.udf(lambda x: x in sourceList , BooleanType())
d2 = d2.where(~d2.k.isin(sourceList_b.value))
For small datasets it works well, but for large ones the collect causes an exception. I want to know if there is better logic to compute this step.
One way could be to join d1 to d2, fill the values missing in column y from column v using coalesce, and then filter out the rows where y and v differ, such as:
import pyspark.sql.functions as F
(d2.join(d1.select('x', 'y').withColumnRenamed('x', 'k'),  # rename x to k for an easier join
         on=['k'], how='left')                             # left join to keep only d2 rows
   .withColumn('y', F.coalesce('y', 'v'))                  # fill the values missing in y with the ones from v
   .filter(F.col('v') == F.col('y'))                       # keep only rows where v equals y
   .drop('y')                                              # drop the column y, no longer necessary
   .show())
and you get:
+---+---+-----+
| k| v|value|
+---+---+-----+
| k| k| 0.3|
| j| h| 0.8|
+---+---+-----+
This also keeps any rows where the (k, v) pair exactly matches an (x, y) pair in d1.
So you have two problems here:
Logic for joining these two tables:
This can be done by performing an inner join on two columns instead of one. This is the code for that:
# Create an expression wherein you do an inner join on two cols
joinExpr = ((d1.x == d2.k) & (d1.y == d2.v))
joinDF = d1.join(d2, joinExpr)
The second problem is speed. There are multiple ways of fixing it. Here are my top two:
a. If one of the dataframes is significantly smaller (usually under 2 GB) than the other dataframe, then you can use the broadcast join. It essentially copies the smaller dataframe to all the workers so that there is no need to shuffle while joining. Here is an example:
from pyspark.sql.functions import broadcast
joinExpr = ((d1.x == d2.k) & (d1.y == d2.v))
joinDF = d1.join(broadcast(d2), joinExpr)
b. Try adding more workers and increasing the memory.
What you probably want to do is think of this in relational terms. Join d1 and d2 on d1.x = d2.k AND d1.y = d2.v. An inner join will drop any records from d2 that don't have a corresponding pair in d1. By using a join, Spark will do a cluster-wide shuffle of the data, allowing for much greater parallelism and scalability compared to a broadcast exchange, which generally caps out at about ~10 MB of data (the cutover point Spark uses between a shuffle join and a broadcast join).
Also, as an FYI, WHERE (a, b) IN (...) gets translated into a join in most cases, unless the (...) is a small set of data.
https://github.com/vaquarkhan/vaquarkhan/wiki/Apache-Spark--Shuffle-hash-join-vs--Broadcast-hash-join
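As a rough sketch of that relational approach, using a hypothetical reconstruction of the question's two DataFrames (a left-semi join is the join form Spark uses for such IN-style filters, and it keeps only d2's columns):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical reconstruction of d1 (x, y, value) and d2 (k, v, value)
d1 = spark.createDataFrame([("a", "b", 0.2), ("c", "d", 0.4), ("e", "f", 0.8)],
                           ["x", "y", "value"])
d2 = spark.createDataFrame([("a", "c", 0.7), ("k", "k", 0.3), ("j", "h", 0.8),
                            ("e", "p", 0.1), ("a", "b", 0.1)],
                           ["k", "v", "value"])

# keep only the d2 rows whose (k, v) pair has a matching (x, y) pair in d1;
# nothing is collected to the driver, and Spark plans the shuffle itself
cond = (d2["k"] == d1["x"]) & (d2["v"] == d1["y"])
d2.join(d1, cond, "left_semi").show()
# here only (a, b, 0.1) survives; rows whose k never appears in d1 are dropped as well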

add numeric prefix to pandas dataframe column names

How would I add a variable numeric prefix to dataframe column names?
If I have a DataFrame df
colA colB
0 A X
1 B Y
2 C Z
How would I rename the columns according to their position in the DataFrame? Something like this:
1_colA 2_colB
0 A X
1 B Y
2 C Z
The actual number of columns is too large to rename manually.
Thanks for the help
Use enumerate for the counter, together with f-strings and a list comprehension:
#python 3.6+
df.columns = [f'{i}_{x}' for i, x in enumerate(df.columns, 1)]
#python below 3.6
#df.columns = ['{}_{}'.format(i, x) for i, x in enumerate(df.columns, 1)]
print (df)
1_colA 2_colB
0 A X
1 B Y
2 C Z

elegant way to iterate & compare in Spark DataFrame

I have a Spark DataFrame with 2 columns: C1:Seq[Any] and C2:Double. I want to
Sort by length of C1.
For each element c1 in C1, compare with every other element in C1 that is longer than c1.
2.1 If c1 is contained in another element cx, then compare c2 with the corresponding c2x.
2.2 If c2 > c2x, then filter out (cx, c2x).
Is there an elegant way to achieve this?
Sample Input:
C1 C2
ab 1.0
abc 0.5
Expected output:
C1 C2
ab 1.0
Contain = subset. e.g. ab is contained in abc.
I have a Spark DataFrame with 2 columns: C1:Seq[Any] and C2:Double
val rdd = sc.parallelize(List(("ab", 1.0), ("abc", 0.5)))
Sort by length of C1.
val rddSorted = rdd.sortBy(_._1.length).collect().distinct
For each element c1 in C1, compare with every other element in C1 that is longer than c1.
2.1 If c1 is contained in another element cx, then compare c2 with the corresponding c2x.
2.2 If c2 > c2x, then filter out (cx, c2x).
val result = for (
    (x, y) <- rddSorted;
    (a, b) <- rddSorted.dropWhile { case (c, d) => c == x && d == y };
    if a.contains(x) && a.length > x.length && y > b
  ) yield (x, y)
That's all. You should get what you are looking for.
