I have a df with two columns, col_1 and col_2. The entries in col_1 are related to the entries in col_2. It is a transitive relationship: if A belongs to B, and B belongs to C and D, then A belongs to B, C and D.
import pandas as pd
col_1 = ["A", "A", "B", "B", "I", "J", "C", "A"]
col_2 = ["B", "H", "C", "D", "J", "L", "E", "Z"]
df = pd.DataFrame({"col_1":col_1, "col_2":col_2})
df.sort_values("col_1", inplace=True)
df
I want to extract the relationship by keeping the first occurring key as "my_key" and all other keys in a "Connected" column.
How can I fetch all keys that are connected to each other, keeping these conditions in mind:
- The keys that are in col_1 should not be in the list of col_2
- Only the related keys should be in front of my_key
Use networkx with connected_components to build a node-to-component dictionary:
import networkx as nx
# Create the graph from the dataframe
g = nx.Graph()
g.add_edges_from(df[['col_1','col_2']].itertuples(index=False))
connected_components = nx.connected_components(g)
# Find the component id of the nodes
node2id = {}
for cid, component in enumerate(connected_components):
    for node in component:
        node2id[node] = cid + 1
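The same mapping can also be built in one pass with a dict comprehension (an equivalent sketch of the loop above, not a different algorithm):
node2id = {node: cid
           for cid, component in enumerate(nx.connected_components(g), start=1)
           for node in component}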
Then keep the first value of each group in column col_1 and map all the other values to lists:
g1 = df['col_1'].map(node2id)
df1 = df.loc[~g1.duplicated(), ['col_1']]
s = pd.Series(list(node2id.keys()), index=list(node2id.values()))
s = s[~s.isin(df1['col_1'])]
d = s.groupby(level=0).agg(list)
df1['Connected'] = g1.map(d)
print (df1)
col_1 Connected
0 A [C, B, E, H, D, Z]
4 I [J, L]
For plotting use:
pos = nx.spring_layout(g, scale=20)
nx.draw(g, pos, node_color='lightblue', node_size=500, with_labels=True)
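If you run this in a plain script rather than a notebook, you will also need matplotlib to render the figure (a small assumption about your environment):
import matplotlib.pyplot as plt
plt.show()  # display the figure produced by nx.draw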
I have a data frame that looks like:
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
pd.DataFrame(data=d)
Expected output:
d={'col1':['a,b','a,c,b'],'col2':['a,b','a,b,a']}
I have tried this:
from itertools import groupby
arr = ['a', 'a', 'b', 'a', 'a', 'c', 'c']
print([x[0] for x in groupby(arr)])
How do I remove the duplicate entries in each row and column of the dataframe?
For example, a,a,b,c should become a,b,c.
From what I understand, you don't want to include values that repeat consecutively. You can try this custom function:
def myfunc(x):
    s = pd.Series(x.split(','))
    res = s[s.ne(s.shift())]
    return ','.join(res.values)
print(df.applymap(myfunc))
col1 col2
0 a,b a,b
1 a,c,b a,b,a
Another function can be created with itertools.groupby, such as:
from itertools import groupby

def myfunc(x):
    # keep only the first element of each consecutive run
    l = [k for k, _ in groupby(x.split(','))]
    return ','.join(l)
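Applied the same way as the first function, it gives the same result:
print(df.applymap(myfunc))
    col1   col2
0    a,b    a,b
1  a,c,b  a,b,a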
You could define a function to help with this, then use .applymap to apply it to all columns (or .apply one column at a time):
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)
def remove_dups(string):
    split = string.split(',')  # split the string into a list
    uniques = set(split)       # remove duplicate list elements
    return ','.join(uniques)   # rejoin the list elements into a string
result = df.applymap(remove_dups)
This returns:
col1 col2
0 a,b a,b
1 a,c,b a,b
Edit: This looks slightly different from your expected output; why do you expect a,b,a for the second row in col2?
Edit2: to preserve the original order, you can replace the set() function with unique_everseen()
from more_itertools import unique_everseen
.
.
.
uniques = unique_everseen(split)
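If you would rather not add the more_itertools dependency, plain dict.fromkeys also drops duplicates while preserving first-seen order (Python 3.7+); a minimal sketch, with the helper name remove_dups_ordered chosen just for illustration:
def remove_dups_ordered(string):
    # dict.fromkeys keeps only the first occurrence of each element, in order
    return ','.join(dict.fromkeys(string.split(',')))

result = df.applymap(remove_dups_ordered)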
I have this problem to solve. It is a continuation of a previous question, How to iterate over pandas df with a def function variable function, and the given answer worked perfectly, but now I have to append all the data into a two-column dataframe (Adduct_name and mass).
This is from the previous question:
My goal: I have to calculate the "Adducts" for a given "Compound"; both represent numbers, but for each "Compound" there are 46 different "Adducts".
Each adduct is calculated as follows:
Adduct 1 = [Exact_mass*M/Charge + Adduct_mass]
where Exact_mass is a number, M and Charge are numbers (1, 2, 3, etc.) according to each type of adduct, and Adduct_mass is a number (positive or negative) according to each adduct.
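As a quick sanity check of the formula with the numbers used further down (for M+3H the tables below give M = 1, Charge = 3, Adduct_mass = 1.007276):
adduct = 596.465179 * 1 / 3 + 1.007276   # -> 199.829002, matching the M+3H column in the output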
My data: two data frames. One has the adduct names, M, Charge, and Adduct_mass. The other one contains the Compound_name and Exact_mass of the compounds I want to iterate over (I just put a small data set here).
Adducts: df_al
import pandas as pd
data = [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989], ["M+H", 1, 1,
1.007276], ["2M+H", 1, 2, 1.007276], ["M-3H", 3, 1, -1.007276]]
df_al = pd.DataFrame(data, columns=["Ion_name", "Charge", "M", "Adduct_mass"])
Compounds: df
import pandas as pd
data1 = [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038], [4,
"C44H56O8", 712.397498], [4, "C24H32O6S", 448.191949], [5, "C20H28O3",
316.203834]]
df = pd.DataFrame(data1, columns=["CdId", "Formula", "exact_mass"])
The solution to this problem was:
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
# Defining the general function
def Adduct(x, i):
    return x * df_M[i] / df_div[i] + df_mass[i]

# Applying the general function for i from 0 to 4
for i in range(5):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x, i))
Output
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
Now those are the right calculations, but I need a file where:
-only 2 columns exist (Name and Mass)
-All the different adducts are appended one after another
Desired output:
Name Mass
a_M+3H 199.82902
a_M+3Na 221.810726
a_M+H 597.472455
a_2M+H 1193.937634
a_M-3H 197.814450
b_M+3H 172.438289
.
.
.
c_M+3H
and so on.
Also, I need to combine the name of the respective compound with the ion form (M+3H, M+H, etc.).
At this point I have no code for that.
I would appreciate any advice, and a better approach from the beginning.
This part is an update of the question above:
Is it possible to obtain an output like this one:
Name Mass RT
a_M+3H 199.82902 1
a_M+3Na 221.810726 1
a_M+H 597.472455 1
a_2M+H 1193.937634 1
a_M-3H 197.814450 1
b_M+3H 172.438289 3
.
.
.
c_M+3H 2
The RT is the same value for all forms of a compound; in this example RT for a = 1, b = 3, c = 2, etc.
Is it possible to incorporate (keep) this column from the data set df (which I update here below)? As you can see, that df has more columns like "Formula" and "RT" which disappear after the calculations.
import pandas as pd
data1 = [[a, "C3H64O7", 596.465179, 1], [b, "C30H42O7", 514.293038, 3], [c,
"C44H56O8", 712.397498, 2], [d, "C24H32O6S", 448.191949, 4], [e, "C20H28O3",
316.203834, 1.5]]
df = pd.DataFrame(data1, columns=["Name", "Formula", "exact_mass", "RT"])
Part three! (Sorry, and thank you.)
This is a trial I did on a small data set (df, shown as an image in the original post) using the code below, with the same df_al as above.
Code:
# Defining variables for calculation
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
df_ID = df["Name"]

# Defining the RT dictionary
RT = dict(zip(df["Name"], df["RT"]))

# Removing the RT column
df = df.drop(columns=["RT"])

# Defining the general function
def Adduct(x, i):
    return x * df_M[i] / df_div[i] + df_mass[i]

# Applying the general function in a range from 0 to 46
for i in range(47):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x, i))
df
Output (shown as an image in the original post)
#Melting
df = pd.melt(df, id_vars=['Name'], var_name = "Adduct", value_name= "Exact_mass", value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
df['RT'] = df.Name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
del df['Name']
del df['Adduct']
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
Output (shown as an image in the original post)
Why NaN?
Here is how I would go about it; pandas.melt comes to the rescue:
import pandas as pd
import numpy as np
from io import StringIO
s = StringIO('''
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
''')
df = pd.read_csv(s, sep="\s+")
df = pd.melt(df, id_vars=['Name'], value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
del df['Name']
del df['variable']
RT = {'a':1, 'b':2, 'c':3, 'd':5, 'e':1.5}
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
Here is the output (shown as an image in the original post).
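Regarding the RT update: if RT is already a column of the wide frame (as in the updated df with Name, Formula, exact_mass, RT), one option is to keep it as an id_vars entry so it survives the melt, instead of mapping it back from a dictionary. A minimal sketch under that assumption (long_df is just an illustrative name):
long_df = pd.melt(df, id_vars=['Name', 'RT'],
                  value_vars=[c for c in df.columns
                              if c not in ('Name', 'Formula', 'exact_mass', 'RT')],
                  var_name='Adduct', value_name='Mass')
long_df['Name'] = long_df['Name'] + '_' + long_df['Adduct']
long_df = long_df.drop(columns='Adduct')[['Name', 'Mass', 'RT']]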
I have two Spark Dataframes.
DataFrame A:
Col_A1 Col_A2
1 ["x", "y", "z"]
2 ["a", "x", "y"]
3 ["a", "b", "c"]
DataFrame B:
Col_B1
"x"
"a"
"y"
I want to check which entries of DataFrame A have, say, "x" of DataFrame B in their Col_A2, and return them as a new dataframe. Then I want to do the same for the rest of the entries of DataFrame B.
Output needs to be something like:
DataFrame A_x:
Col_A1 Col_A2
1 ["x", "y", "z"]
2 ["a", "x", "y"]
DataFrame A_a:
Col_A1 Col_A2
2 ["a", "x", "y"]
3 ["a", "b", "c"]
DataFrame A_y:
Col_A1 Col_A2
1 ["x", "y", "z"]
2 ["a", "x", "y"]
I tried using UDFs and the map function, but didn't really get what I'm looking for.
Thanks in advance.
If your DataFrame B is small and can be collected to a list, and the number of its distinct values is small, you could write a simple UDF for each of its elements [UPDATE: see the end of the post for an easier way]; here is an example for 'x':
spark.version
# u'2.2.0'
from pyspark.sql import Row
from pyspark.sql.functions import udf

df_a = spark.createDataFrame([Row(1, ["x", "y", "z"]),
                              Row(2, ["a", "x", "y"]),
                              Row(3, ["a", "b", "c"])],
                             ["col_A1", "col_A2"])

@udf('boolean')
def x_isin(v):
    if 'x' in v:
        return True
    else:
        return False
temp_x = df_a.withColumn('x_isin', x_isin(df_a.col_A2))
temp_x.show()
# +------+---------+------+
# |col_A1| col_A2|x_isin|
# +------+---------+------+
# | 1|[x, y, z]| true|
# | 2|[a, x, y]| true|
# | 3|[a, b, c]| false|
# +------+---------+------+
df_a_x = temp_x.filter(temp_x.x_isin==True).drop('x_isin')
df_a_x.show()
# +------+---------+
# |col_A1| col_A2|
# +------+---------+
# | 1|[x, y, z]|
# | 2|[a, x, y]|
# +------+---------+
UPDATE (after Marie's comment):
Thanks to Marie for pointing out the array_contains function - now you indeed do not need a UDF to build temp_x:
import pyspark.sql.functions as func
temp_x = df_a.withColumn('x_isin', func.array_contains(df_a.col_A2, 'x'))
temp_x.show() # same result as shown above
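If you want one filtered dataframe per value of DataFrame B (as in A_x, A_a, A_y) and B is small enough to collect, a possible sketch; the name df_b and the result dict are assumptions, not part of the code above:
values = [row.Col_B1 for row in df_b.select('Col_B1').distinct().collect()]
filtered = {v: df_a.filter(func.array_contains(df_a.col_A2, v)) for v in values}
filtered['x'].show()  # same frame as df_a_x above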
I have two dataframes, both indexed with timestamps. I would like to preserve the column order of the first dataframe in the merged result.
For example:
#required packages
import pandas as pd
import numpy as np
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=['_1', '_2'])
print("\nData Frame Three:\n", df3)
The above code generates two data frames, the first with columns C, B, and A; the second with columns B, C, and D. The current output has the columns in the following order: C_1, B_1, A, B_2, C_2, D. What I want from the output of the merge is C_1, C_2, B_1, B_2, A_1, D_2: the order of the columns is preserved from the first data frame, and any column shared with the second data frame is placed next to the corresponding column.
Could there be a setting in merge or can I use sort_index to do this?
EDIT: Maybe a better way to phrase the sorting process would be to call it uncollated, where matching columns are put next to each other, and so on.
Using an OrderedDict, as you suggested.
from collections import OrderedDict
from itertools import chain
c = df3.columns.tolist()
o = OrderedDict()
for x in c:
    o.setdefault(x.split('_')[0], []).append(x)
c = list(chain.from_iterable(o.values()))
df3 = df3[c]
An alternative that involves extracting the prefixes and then calling sorted on the column list:
# https://stackoverflow.com/a/46839182/4909087
c = df3.columns.tolist()
p = [s[0] for s in c]
c = sorted(c, key=lambda x: (p.index(x[0]), x))
df3 = df3[c]
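With the sample frames above, both approaches should end up with the columns interleaved by prefix; a quick check (note that A and D keep no suffix because they are not shared between the two frames):
print(df3.columns.tolist())
# ['C_1', 'C_2', 'B_1', 'B_2', 'A', 'D']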
I have a DF with 100 million rows and 5000+ columns. I am trying to find the corr between colx and the remaining 5000+ columns.
import pyspark.sql.functions as func
from pyspark.sql.functions import mean, corr, broadcast

aggList1 = [mean(col).alias(col + '_m') for col in df.columns]  # exclude keys
df21 = df.groupBy('key1', 'key2', 'key3', 'key4').agg(*aggList1)
df = df.join(broadcast(df21), ['key1', 'key2', 'key3', 'key4'])
df = df.select([func.round((func.col(colmd) - func.col(colmd + '_m')), 8).alias(colmd)
                for colmd in all5Kcolumns])
aggCols = [corr(colx, col).alias(col) for col in colsall5K]
df2 = df.groupBy('key1', 'key2', 'key3').agg(*aggCols)
Right now it is not working because of the Spark 64KB codegen issue (even in Spark 2.2). So I am looping over batches of 300 columns and merging everything at the end. But it is taking more than 30 hours on a cluster with 40 nodes (10 cores each and 100GB per node). Any help to tune this?
Things already tried:
- Repartition DF to 10,000
- Checkpoint in each loop
- Cache in each loop
You can try with a bit of NumPy and RDDs. First a bunch of imports:
from operator import itemgetter
import numpy as np
from pyspark.statcounter import StatCounter
Let's define a few variables:
keys = ["key1", "key2", "key3"] # list of key column names
xs = ["x1", "x2", "x3"] # list of column names to compare
y = "y" # name of the reference column
And some helpers:
def as_pair(keys, y, xs):
    """Given key names, y name, and xs names
    return a tuple of key, array-of-values"""
    key = itemgetter(*keys)
    value = itemgetter(y, *xs)  # Python 3 syntax

    def as_pair_(row):
        return key(row), np.array(value(row))
    return as_pair_

def init(x):
    """Init function for combineByKey
    Initialize new StatCounter and merge first value"""
    return StatCounter().merge(x)

def center(means):
    """Center a row value given a
    dictionary of mean arrays
    """
    def center_(row):
        key, value = row
        return key, value - means[key]
    return center_

def prod(arr):
    return arr[0] * arr[1:]

def corr(stddev_prods):
    """Scale the row to get 1 stddev
    given a dictionary of stddevs
    """
    def corr_(row):
        key, value = row
        return key, value / stddev_prods[key]
    return corr_
and convert DataFrame to RDD of pairs:
pairs = df.rdd.map(as_pair(keys, y, xs))
Next let's compute statistics per group:
stats = (pairs
    .combineByKey(init, StatCounter.merge, StatCounter.mergeStats)
    .collectAsMap())
means = {k: v.mean() for k, v in stats.items()}
Note: With 5000 features and 7000 groups there should be no issue with keeping this structure in memory. With larger datasets you may have to use an RDD and join, but this will be slower.
Center the data:
centered = pairs.map(center(means))
Compute covariance:
covariance = (centered
    .mapValues(prod)
    .combineByKey(init, StatCounter.merge, StatCounter.mergeStats)
    .mapValues(StatCounter.mean))
And finally correlation:
stddev_prods = {k: prod(v.stdev()) for k, v in stats.items()}
correlations = covariance.map(corr(stddev_prods))
Example data:
df = sc.parallelize([
    ("a", "b", "c", 0.5, 0.5, 0.3, 1.0),
    ("a", "b", "c", 0.8, 0.8, 0.9, -2.0),
    ("a", "b", "c", 1.5, 1.5, 2.9, 3.6),
    ("d", "e", "f", -3.0, 4.0, 5.0, -10.0),
    ("d", "e", "f", 15.0, -1.0, -5.0, 10.0),
]).toDF(["key1", "key2", "key3", "y", "x1", "x2", "x3"])
Results with DataFrame:
df.groupBy(*keys).agg(*[corr(y, x) for x in xs]).show()
+----+----+----+-----------+------------------+------------------+
|key1|key2|key3|corr(y, x1)| corr(y, x2)| corr(y, x3)|
+----+----+----+-----------+------------------+------------------+
| d| e| f| -1.0| -1.0| 1.0|
| a| b| c| 1.0|0.9972300220940342|0.6513360726920862|
+----+----+----+-----------+------------------+------------------+
and the method provided above:
correlations.collect()
[(('a', 'b', 'c'), array([ 1. , 0.99723002, 0.65133607])),
(('d', 'e', 'f'), array([-1., -1., 1.]))]
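If you need the result back as a DataFrame, one possible conversion (the output column names are assumptions):
corr_df = correlations.map(
    lambda kv: kv[0] + tuple(float(v) for v in kv[1])
).toDF(keys + ["corr_" + x for x in xs])
corr_df.show()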
This solution, while a bit involved, is quite elastic and can be easily adjusted to handle different data distributions. It should also be possible to get a further boost with JIT compilation.