Search in sublists and match common elements with another sublist (Python 3)

I have been searching for an answer but didn't find anything about my problem.
x=[['100',220, 0.5, 0.25, 0.1],['105',400, 0.12, 0.56, 0.9],['600',340, 0.4, 0.7, 0.45]]
y=[['1','100','105','601'],['2','104','105','600'],['3','100','105','604']]
The result I want is:
z=[['1','100',0.5,0.25,0.1,'105',0.12,0.56,0.9],['2','105',0.12,0.56,0.9,'600',0.4,0.7,0.45],['3','100',0.5, 0.25, 0.1,'105', 0.12, 0.56, 0.9]]
I want to search list y and match it against list x, producing a new list z that combines the matching sublists.
This is just an example; normally lists x and y each contain about 10,000 sublists.
For example, I take ['1','100','105','601'] from y and look up '100', '105' and '601' in list x (e.g. ['100', 220, 0.5, 0.25, 0.1]). For every match I extend the new sublist in z.
Can someone help me?

Answer (edited based on the comments)
You said in the comments:
search the second, third and fourth number in each y. and compare that with the number on place one in list x
and
then i would like to add (from list x) the numbers on place 1,3,4,5
Then try something like this:
x = [
    ['100', 220, 0.5, 0.25, 0.1],
    ['105', 400, 0.12, 0.56, 0.9],
    ['600', 340, 0.4, 0.7, 0.45],
]
y = [
    ['1', '100', '105', '601'],
    ['2', '104', '105', '600'],
    ['3', '100', '105', '604'],
]
z = []
xx = dict((k, v) for k, _, *v in x)
for first, *yy in y:
    zz = [first]
    for n in yy:
        numbers = xx.get(n)
        if numbers:
            zz.append(n)
            zz.extend(numbers)
    z.append(zz)
print(z)
z should now be:
[['1', '100', 0.5, 0.25, 0.1, '105', 0.12, 0.56, 0.9],
['2', '105', 0.12, 0.56, 0.9, '600', 0.4, 0.7, 0.45],
['3', '100', 0.5, 0.25, 0.1, '105', 0.12, 0.56, 0.9]]
First, I convert x into a dictionary for easy lookup.
The iteration pattern used here (extended iterable unpacking) was introduced with PEP 3132 and works like this:
>>> head, *tail = range(5)
>>> head
0
>>> tail
[1, 2, 3, 4]
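Packaged as a function, the answer's approach can be reused on other inputs. This is only a sketch; merge_matches and lookup are names I made up, not part of the original answer:

```python
def merge_matches(x, y):
    # Map each key in x (position 0) to the values at positions 2-4;
    # position 1 is skipped, matching the answer above.
    lookup = {k: rest for k, _, *rest in x}
    z = []
    for first, *keys in y:
        row = [first]
        for k in keys:
            values = lookup.get(k)
            if values:            # only keys found in x contribute
                row.append(k)
                row.extend(values)
        z.append(row)
    return z
```

Called with the x and y from the question, this produces the same z as the script above.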


How to define custom function for scipy's binned_statistic_2d?

The documentation for scipy's binned_statistic_2d function gives an example for a 2D histogram:
from scipy import stats
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
Makes sense, but I'm now trying to implement a custom function. The custom function description is given as:
function : a user-defined function which takes a 1D array of values, and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
I wasn't sure exactly how to implement this, so I thought I'd check my understanding by writing a custom function that reproduces the count option. I tried
def custom_func(values):
    return len(values)
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, custom_func, bins=[binx, biny])
but this generates an error like so:
556 # Make sure `values` match `sample`
557 if(statistic != 'count' and Vlen != Dlen):
558 raise AttributeError('The number of `values` elements must match the '
559 'length of each `sample` dimension.')
561 try:
562 M = len(bins)
AttributeError: The number of `values` elements must match the length of each `sample` dimension.
How is this custom function supposed to be defined?
The reason for this error is that when using a custom statistic function (or any non-'count' statistic), you have to pass an array, or a list of arrays, as the values parameter, with the number of elements matching the number in x. You can't leave it as None as in your example, even though values is irrelevant and never used when counting data points per bin.
So, to match the results, you can just pass the same x object to the values parameter:
def custom_func(values):
    return len(values)
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, x, custom_func, bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))
The result matches that of the count statistic:
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))
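To see what the 'count' statistic is computing, here is a minimal pure-Python sketch of the 2D binning used in this example. This is not scipy's actual implementation, just an illustration of the semantics: bins are half-open on the right, except the last bin, which also includes its right edge.

```python
def count_2d(x, y, binx, biny):
    # counts[i][j] = number of points with x in bin i and y in bin j
    nx, ny = len(binx) - 1, len(biny) - 1
    counts = [[0] * ny for _ in range(nx)]
    for xi, yi in zip(x, y):
        for i in range(nx):
            # half-open bins [edge, next_edge); last bin includes its right edge
            if binx[i] <= xi < binx[i + 1] or (i == nx - 1 and xi == binx[-1]):
                for j in range(ny):
                    if biny[j] <= yi < biny[j + 1] or (j == ny - 1 and yi == biny[-1]):
                        counts[i][j] += 1
    return counts
```

Run on the sample data above, this reproduces the statistic array [[2., 1.], [1., 0.]] returned by binned_statistic_2d.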

Adding column to empty DataFrames via a loop

I have the following code:
for key in temp_dict:
    temp_dict[key][0][0] = temp_dict[key][0][0].insert(0, "Date", None)
where temp_dict is:
{'0.5SingFuel': [[Empty DataFrame
Columns: [Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]], 'Sing180': [[Empty DataFrame
Columns: [Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]], 'Sing380': [[Empty DataFrame
Columns: [Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]]}
What I would like to have is:
{'0.5SingFuel': [[Empty DataFrame
Columns: [Date, Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]], 'Sing180': [[Empty DataFrame
Columns: [Date, Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]], 'Sing380': [[Empty DataFrame
Columns: [Date, Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]]}
My code produces the following error:
ValueError: cannot insert Date, already exists
I would have thought that I was looping from one dict key to the next, but stepping through the debugger it looks like:
1. The code does what it is supposed to.
2. It moves on to the next key, and the previous key's DataFrame becomes empty.
3. The new key already has "Date" in its columns, and the code then tries to add it again, which of course it can't.
This probably makes little sense, hence why I need some help; I am confused. I think I am mis-assigning the variables, but I'm not completely sure how.
One problem is that insert is an in-place operation that returns None, so you must not reassign its result. The second problem is that, as you saw, insert fails if the column already exists, so you need to check the columns first and, if the column is already there, optionally reorder to put it in first position.
import pandas as pd

# dummy dictionary, same structure
d = {0: [[pd.DataFrame(columns=['a', 'b'])]],
     1: [[pd.DataFrame(columns=['a', 'c'])]]}
# name of the column to insert
col = 'c'
for key in d.keys():
    df_ = d[key][0][0]  # easier to define a variable
    if col not in df_.columns:
        df_.insert(0, col, None)
    else:  # reorder and reassign in this case; remove the else if you don't need it
        d[key][0][0] = df_[[col] + df_.columns.difference([col]).tolist()]
print(d)
# {0: [[Empty DataFrame
# Columns: [c, a, b] # c added as column
# Index: []]], 1: [[Empty DataFrame
# Columns: [c, a] # c in first position now
# Index: []]]}

How to convert a list's values to a dict's values

I have a list; its entries are word indices.
lst = [[1, 2, 3],
[4, 5],
[6]]
I also have a dictionary whose values are word vectors (word2vec); each vector has the same dimension (of course).
dic={1:array([0.1, 0.2, 0.3]),
2:array([0.4, 0.5, 0.6]),
3:array([0.7, 0.8, 0.9]),
4:array([1.0, 1.1, 1.2]),
5:array([1.3, 1.4, 1.5]),
6:array([1.6, 1.7, 1.8])}
I want to replace the list's values (word indices) with the paired dict values (word vectors), like this:
lst = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]],
[[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]],
[[1.6, 1.7, 1.8]]]
Can you help me?
One can use the helper function below:
def word2vec(list_param, dict_param):
    for i in range(len(list_param)):
        for j in range(len(list_param[i])):
            list_param[i][j] = dict_param[list_param[i][j]]
The list is updated in place with the required values. Note that the lookup must use list_param[i][j], not list[i][j]; I would also strongly recommend not using built-in names like list or dict as variable names.
Using map because I love it :)
ll = [[1, 2, 3], [4, 5], [6]]
dd = {1: array([0.1, 0.2, 0.3]),
      2: array([0.4, 0.5, 0.6]),
      3: array([0.7, 0.8, 0.9]),
      4: array([1.0, 1.1, 1.2]),
      5: array([1.3, 1.4, 1.5]),
      6: array([1.6, 1.7, 1.8])}
res = []
for item in ll:
    res.append(list(map(lambda x: list(dd[x]), item)))
print(res)
Gives
[
[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]],
[[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]],
[[1.6, 1.7, 1.8]]
]
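For completeness, the same mapping can also be written as one nested list comprehension, one level per nesting level. A sketch, shown here with plain lists standing in for the numpy arrays in the question:

```python
# Plain-list stand-ins for the numpy arrays from the question.
dic = {1: [0.1, 0.2, 0.3], 2: [0.4, 0.5, 0.6], 3: [0.7, 0.8, 0.9],
       4: [1.0, 1.1, 1.2], 5: [1.3, 1.4, 1.5], 6: [1.6, 1.7, 1.8]}
lst = [[1, 2, 3], [4, 5], [6]]
# Replace every word index with its vector; the outer comprehension walks the
# sublists, the inner one does the dict lookup per index.
vectors = [[dic[i] for i in row] for row in lst]
print(vectors)
```

Unlike the in-place helper above, this builds a new list and leaves lst untouched.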

Getting Key Error: 0 while creating Consensus Matrix

I am getting KeyError: 0 on the dict when taking the length:
t = len(Motifs[0])
I reviewed a previous post on this error and tried casting:
t = int(len(Motifs[0]))
def Consensus(Motifs):
    k = len(Motifs[0])
    profile = ProfileWithPseudocounts(Motifs)
    consensus = ""
    for j in range(k):
        maximum = 0
        frequentSymbol = ""
        for symbol in "ACGT":
            if profile[symbol][j] > maximum:
                maximum = profile[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol
    return consensus

def ProfileWithPseudocounts(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    profile = {}
    count = CountWithPseudocounts(Motifs)
    for key, motif_lists in sorted(count.items()):
        profile[key] = motif_lists
        for motif_list, number in enumerate(motif_lists):
            motif_lists[motif_list] = number/(float(t+4))
    return profile

def CountWithPseudocounts(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    count = {}
    for symbol in "ACGT":
        count[symbol] = []
        for j in range(k):
            count[symbol].append(1)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count
Motifs = {'A': [0.4, 0.3, 0.0, 0.1, 0.0, 0.9],
'C': [0.2, 0.3, 0.0, 0.4, 0.0, 0.1],
'G': [0.1, 0.3, 1.0, 0.1, 0.5, 0.0],
'T': [0.3, 0.1, 0.0, 0.4, 0.5, 0.0]}
#print(type(Motifs))
print(Consensus(Motifs))
The KeyError: 0 is raised at
t = len(Motifs)
k = len(Motifs[0])
symbol = Motifs[i][j]
on lines 9, 24, 35 and 44 when the code executes. Traceback:
Traceback (most recent call last):
File "myfile.py", line 47, in <module>
print(Consensus(Motifs))
File "myfile.py", line 2, in Consensus
k = len(Motifs[0])
KeyError: 0
I expect to get the consensus matrix without errors.
You have a dictionary called Motifs with 4 keys:
>>> Motifs.keys()
dict_keys(['A', 'C', 'G', 'T'])
But you are trying to get the value for the key 0, which does not exist (see, for example, Motifs[0] on line 2).
You should use a valid key as, for example, Motifs['A'].
You defined Motifs as a dictionary.
Motifs = {'A': [0.4, 0.3, 0.0, 0.1, 0.0, 0.9],
'C': [0.2, 0.3, 0.0, 0.4, 0.0, 0.1],
'G': [0.1, 0.3, 1.0, 0.1, 0.5, 0.0],
'T': [0.3, 0.1, 0.0, 0.4, 0.5, 0.0]}
Motifs[0] raises KeyError: 0 because the keys are ['T', 'G', 'A', 'C'].
It seems like you wanted to access the length of the first List associated with key A.
You can achieve this by taking len(Motifs['A']).
Note: insertion ordering of elements in a Python dictionary is only a guaranteed language feature starting from Python 3.7.
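The difference between the two access styles can be demonstrated in a few lines (a minimal sketch using a shortened version of the dictionary from the question):

```python
# Shortened version of the Motifs dictionary from the question.
Motifs = {'A': [0.4, 0.3, 0.0], 'C': [0.2, 0.3, 0.0]}

# Integer indexing fails: dicts are looked up by key, not by position.
try:
    Motifs[0]
except KeyError as exc:
    print('KeyError:', exc)  # KeyError: 0

# String keys work, so the length of the first motif list is:
k = len(Motifs['A'])
print(k)  # 3
```

If positional access is really needed, list(Motifs.values())[0] gets the first value, but relying on key order is fragile before Python 3.7.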

How to construct a numpy array with its each element be the minimum value of all possible values?

I want to construct a 1D numpy array a, and I know each a[i] has several possible values; of course, different elements of a can have different numbers of possible values. For each a[i], I want to set it to the minimum of all its possible values.
For example, I have two array:
idx = np.array([0, 1, 0, 2, 3, 3, 3])
val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
The array I want to construct is the following:
a = np.array([0.1, 0.5, 0.6, 0.1])
Does numpy have a function that can do this?
Here's one approach -
import numpy as np

def groupby_minimum(idx, val):
    sidx = idx.argsort()
    sorted_idx = idx[sidx]
    cut_idx = np.r_[0, np.flatnonzero(sorted_idx[1:] != sorted_idx[:-1]) + 1]
    return np.minimum.reduceat(val[sidx], cut_idx)
Sample run -
In [36]: idx = np.array([0, 1, 0, 2, 3, 3, 3])
...: val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
...:
In [37]: groupby_minimum(idx, val)
Out[37]: array([ 0.1, 0.5, 0.6, 0.1])
Here's another using pandas -
import pandas as pd
def pandas_groupby_minimum(idx, val):
    df = pd.DataFrame({'ID': idx, 'val': val})
    return df.groupby('ID')['val'].min().values
Sample run -
In [66]: pandas_groupby_minimum(idx, val)
Out[66]: array([ 0.1, 0.5, 0.6, 0.1])
You can also use binned_statistic:
from scipy.stats import binned_statistic
idx_list = np.append(np.unique(idx), np.max(idx) + 1)
stats = binned_statistic(idx, val, statistic='min', bins=idx_list)
a = stats.statistic
I think that in older scipy versions statistic='min' was not implemented, but you can use statistic=np.min instead. Intervals are half-open in binned_statistic, so this implementation is safe.
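As a dependency-free cross-check, the same group-by-minimum can be sketched in plain Python with a dict, assuming the ids in idx are hashable and the result should be ordered by id (as in the numpy and pandas answers above):

```python
def plain_groupby_minimum(idx, val):
    # Track the smallest value seen so far for each group id.
    mins = {}
    for i, v in zip(idx, val):
        if i not in mins or v < mins[i]:
            mins[i] = v
    # Emit the minima in sorted group-id order.
    return [mins[i] for i in sorted(mins)]
```

This is O(n) over the input and useful for validating the vectorized versions on small cases; for 10,000+ elements the numpy approach will be faster.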
