Getting KeyError: 0 while creating Consensus Matrix - python-3.x

I am getting "KeyError: 0" on the dict when taking the length of its first element (t = len(Motifs[0])). I reviewed the previous post on "KeyError: 0" and I tried casting:
t = int(len(Motifs[0]))
def Consensus(Motifs):
    k = len(Motifs[0])
    profile = ProfileWithPseudocounts(Motifs)
    consensus = ""
    for j in range(k):
        maximum = 0
        frequentSymbol = ""
        for symbol in "ACGT":
            if profile[symbol][j] > maximum:
                maximum = profile[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol
    return consensus

def ProfileWithPseudocounts(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    profile = {}
    count = CountWithPseudocounts(Motifs)
    for key, motif_lists in sorted(count.items()):
        profile[key] = motif_lists
        for motif_list, number in enumerate(motif_lists):
            motif_lists[motif_list] = number/(float(t+4))
    return profile

def CountWithPseudocounts(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    count = {}
    for symbol in "ACGT":
        count[symbol] = []
        for j in range(k):
            count[symbol].append(1)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count

Motifs = {'A': [0.4, 0.3, 0.0, 0.1, 0.0, 0.9],
          'C': [0.2, 0.3, 0.0, 0.4, 0.0, 0.1],
          'G': [0.1, 0.3, 1.0, 0.1, 0.5, 0.0],
          'T': [0.3, 0.1, 0.0, 0.4, 0.5, 0.0]}
#print(type(Motifs))
print(Consensus(Motifs))
"Type Error: 0"
"t = len(Motifs)"
"k = len(Motifs[0])"
"symbol = Motifs[i][j]"
on lines(9, 24, 35, 44) when code executes!!! Traceback:
Traceback (most recent call last):
  File "myfile.py", line 47, in <module>
    print(Consensus(Motifs))
  File "myfile.py", line 2, in Consensus
    k = len(Motifs[0])
KeyError: 0
I expect to get the consensus matrix without errors.

You have a dictionary called Motifs with 4 keys:
>>> Motifs.keys()
dict_keys(['A', 'C', 'G', 'T'])
But you are trying to get the value for the key 0, which does not exist (see, for example, Motifs[0] on line 2).
You should use a valid key, for example Motifs['A'].
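For instance, each list in the dictionary has six entries, so:
>>> len(Motifs['A'])
6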

You defined Motifs as a dictionary.
Motifs = {'A': [0.4, 0.3, 0.0, 0.1, 0.0, 0.9],
          'C': [0.2, 0.3, 0.0, 0.4, 0.0, 0.1],
          'G': [0.1, 0.3, 1.0, 0.1, 0.5, 0.0],
          'T': [0.3, 0.1, 0.0, 0.4, 0.5, 0.0]}
Motifs[0] raises KeyError: 0 because the keys are ['T', 'G', 'A', 'C'].
It seems like you wanted to access the length of the list associated with key 'A'.
You can achieve this by taking len(Motifs['A']).
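More fundamentally, CountWithPseudocounts indexes Motifs by integer position (Motifs[i][j]), so Consensus expects a list of equal-length motif strings rather than a profile dictionary; the dictionary you defined looks like the output of a profile computation, not the input. A minimal sketch with hypothetical motif strings:
Motifs = ["AACGTA",
          "CCCGTT",
          "CACCTT",
          "GGATTA",
          "TTCCGG"]  # hypothetical example motifs

print(Consensus(Motifs))  # prints "CACCTA"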
Note: Dictionaries preserving insertion order is a language feature only from Python 3.7 onward.

Related

How to define custom function for scipy's binned_statistic_2d?

The documentation for scipy's binned_statistic_2d function gives an example for a 2D histogram:
from scipy import stats
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
Makes sense, but I'm now trying to implement a custom function. The custom function description is given as:
function : a user-defined function which takes a 1D array of values, and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
I wasn't sure exactly how to implement this, so I thought I'd check my understanding by writing a custom function that reproduces the count option. I tried
def custom_func(values):
    return len(values)

x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, custom_func, bins=[binx, biny])
but this generates an error like so:
556 # Make sure `values` match `sample`
557 if(statistic != 'count' and Vlen != Dlen):
558     raise AttributeError('The number of `values` elements must match the '
559                          'length of each `sample` dimension.')
561 try:
562     M = len(bins)

AttributeError: The number of `values` elements must match the length of each `sample` dimension.
How is this custom function supposed to be defined?
The reason for this error is that when using a custom statistic function (or any non-count statistic), you have to pass some array or list of arrays to the values parameter (with the number of elements matching the number in x). You can't just leave it as None as in your example, even though it is irrelevant and does not get used when computing counts of data points in each bin.
So, to match the results, you can just pass the same x object to the values parameter:
def custom_func(values):
    return len(values)
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, x, custom_func, bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))
The result matches that of the count statistic:
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))
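As the documentation quoted in the question notes, empty bins are evaluated as function([]), so a custom statistic should cope with an empty array. Here is a small sketch of a guarded mean, with a hypothetical vals array standing in for real data:
import numpy as np
from scipy import stats

def custom_mean(values):
    # Empty bins are passed as an empty array; return NaN for them
    # instead of letting np.mean warn or fail.
    return np.mean(values) if len(values) > 0 else np.nan

x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
vals = [10.0, 20.0, 30.0, 40.0]  # hypothetical values to average per bin

ret = stats.binned_statistic_2d(x, y, vals, custom_mean, bins=[binx, biny])
print(ret.statistic)
# [[20. 20.]
#  [40. nan]]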

Matplotlib Pandas: Subplots of 3 columns and each column is a subplot of 3 rows

I have the following pandas dataframe (built in the code below; the rendered table image is omitted).
My goal is to plot the dataframe in 3 columns, where each column is a 'section'. At the same time, each column's plot is itself a subplot of 3 rows and 1 column, where one row is 'Col1 [%]', the second is 'Col 2', and the last is 'Col 3 [%]'.
With subplots=True I obtain one layout, and with subplots=False another (plot images omitted). What I need are the 3 columns, where each column's plot equals the graph obtained with subplots=True. How can I do that?
Thanks a lot in advance!
My code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# DATA
dfplot = pd.DataFrame(columns=['section', 'description', 'Col1 [%]', 'Col 2', 'Col 3 [%]'])
dfplot['description'] = ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9']
dfplot['section'] = [1, 1, 1, 2, 2, 2, 3, 3, 3]
dfplot['Col1 [%]'] = [82, 89, 86, 100, 100, 99, 16, 16, 16]
dfplot['Col 2'] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
dfplot['Col 3 [%]'] = [99.19, 98.7, 99.36, 99.9, 99.93, 99.5, 97.66, 97.84, 97.66]
dfplot = dfplot.groupby(['section', 'description'], as_index=True).last()

# PLOT -------------
# Set levels to group labels in ax X
cols = list(set(l_columns_values))
dfplot.index.set_levels([cols, l_strains], level=[0, 1])
fig, axes = plt.subplots(nrows=1, ncols=len(cols),
                         sharey=True, sharex=True,
                         figsize=(14 / 2.54, 10 / 2.54)  # width, height
                         )
for i, col in enumerate(list(set(l_contigs))):
    ax = axes[i]  # , j]
    print(ax)
    print("i= {}, col= {}".format(i, col))
    dfplot.loc[col].plot.area(ax=ax,
                              # layout=(3, 1),
                              stacked=True,
                              subplots=True,  ## <--
                              grid=True,
                              table=False,
                              sharex=True,
                              sharey=True,
                              figsize=(20, 7),
                              fontsize=12,
                              # xticks = np.arange(0, len(cols)+1, 1)
                              )
    # ax[i].set_ylim(-1,100)
    ax.set_xlabel(col, weight='bold', fontsize=20)
    ax.set_axisbelow(True)
    for tick in ax.get_xticklabels():
        tick.set_rotation(90)
    # make the ticklines invisible
    ax.tick_params(axis=u'both', which=u'both', length=0)

plt.tight_layout()
# remove spacing in between
fig.subplots_adjust(wspace=0.5)  # space between plots
# legend
plt.legend(loc='upper right')
# Add title
fig.suptitle('My title')
plt.show()
A bit of interpretation: a graph for each column and section.
There was an issue in your code: you were overwriting the ax array with a reference to a single axis. I've used a different variable name, axt:
import pandas as pd
import matplotlib.pyplot as plt

dfplot = pd.DataFrame(columns=['section', 'description', 'Col1 [%]', 'Col 2', 'Col 3 [%]'])
dfplot['description'] = ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9']
dfplot['section'] = [1, 1, 1, 2, 2, 2, 3, 3, 3]
dfplot['Col1 [%]'] = [82, 89, 86, 100, 100, 99, 16, 16, 16]
dfplot['Col 2'] = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
dfplot['Col 3 [%]'] = [99.19, 98.7, 99.36, 99.9, 99.93, 99.5, 97.66, 97.84, 97.66]
# dfplot = dfplot.groupby(['section', 'description'], as_index=True).last()
dfplot = dfplot.set_index(["section", "description"])

fig, ax = plt.subplots(len(dfplot.index.get_level_values(0).unique()), len(dfplot.columns),
                       figsize=[20, 5], sharey=True, sharex=False)
# Add title
fig.suptitle('My title')
for i, v in enumerate(dfplot.index.get_level_values(0).unique()):
    for j, c in enumerate(dfplot.columns):
        axt = ax[j][i]
        dfplot.loc[(v), [c]].plot.area(ax=axt, stacked=True)
        axt.set_xlabel(f"Section {v}", weight='bold', fontsize=20)
        axt.set_axisbelow(True)
        # make the ticklines invisible
        axt.tick_params(axis=u'both', which=u'both', length=0)
        axt.legend(loc='upper right')
        for tick in axt.get_xticklabels():
            tick.set_rotation(90)
Output: (plot image omitted)
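Note the axt = ax[j][i] indexing: row j is a data column and column i is a section, so each figure column stacks the three data columns for one section. This works here because the number of sections and the number of data columns are both 3, so the grid dimensions line up.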

How to efficiently deal with nested data in PySpark?

I ran into a situation where collect_list in Spark is not efficient when the item is already a list.
Basically, I tried to calculate the mean of a nested list (the size of each list is guaranteed to be the same). When the data set grows to, for example, 10 M rows, it may produce out-of-memory errors. Originally, I thought it had something to do with the udf (used to calculate the mean), but actually I found that the aggregation part (collect_list of lists) is the real problem.
What I am doing now is to divide the 10 M rows into multiple blocks (by 'user'), aggregate each block individually, and then union them at the end (a sketch of this appears after the example below). Any better suggestion on efficiently dealing with nested data?
For example, the toy example is like this:
import numpy as np
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, DoubleType

data = [('user1', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.5, 0.4], [0.0, 0.4, 0.3]),
        ('user1', 'place2', ['place1', 'place2', 'place3'], [0.7, 0.0, 0.4], [0.6, 0.0, 0.3]),
        ('user2', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.4, 0.3], [0.0, 0.3, 0.4]),
        ('user2', 'place3', ['place1', 'place2', 'place3'], [0.1, 0.2, 0.0], [0.3, 0.1, 0.0]),
        ('user3', 'place2', ['place1', 'place2', 'place3'], [0.3, 0.0, 0.4], [0.2, 0.0, 0.4]),
        ]
data_df = sparkApp.sparkSession.createDataFrame(data, ['user', 'place', 'places', 'data1', 'data2'])

data_agg = data_df.groupBy('user') \
    .agg(f.collect_list('place').alias('place_list'),
         f.first('places').alias('places'),
         f.collect_list('data1').alias('data1'),
         f.collect_list('data2').alias('data2'),
         )

def average_values(sim_vectors):
    if len(sim_vectors) == 1:
        return sim_vectors[0]
    mat = np.array(sim_vectors)
    mean_vector = np.mean(mat, axis=0)
    return np.round(mean_vector, 3).tolist()

avg_vectors_udf = f.udf(average_values, ArrayType(DoubleType()))
data_agg_ave = data_agg.withColumn('data1', avg_vectors_udf('data1')) \
    .withColumn('data2', avg_vectors_udf('data2'))
The result would be:
+-----+----------------+--------------------+-----------------+----------------+
| user|      place_list|              places|            data1|           data2|
+-----+----------------+--------------------+-----------------+----------------+
|user1|[place1, place2]|[place1, place2, ...|[0.35, 0.25, 0.4]| [0.3, 0.2, 0.3]|
|user3|        [place2]|[place1, place2, ...|  [0.3, 0.0, 0.4]| [0.2, 0.0, 0.4]|
|user2|[place1, place3]|[place1, place2, ...|[0.05, 0.3, 0.15]|[0.15, 0.2, 0.2]|
+-----+----------------+--------------------+-----------------+----------------+
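For reference, here is a minimal sketch of the blocking workaround described above: assign each user to one of several hash-based blocks, aggregate each block separately, and union the results. The block count and the aggregate_block helper are hypothetical choices, not part of the original code:
from functools import reduce
import pyspark.sql.functions as f

N_BLOCKS = 10  # hypothetical; tune to the data size

def aggregate_block(df):
    # Same aggregation as above, applied to a single block of users.
    return df.groupBy('user') \
             .agg(f.collect_list('place').alias('place_list'),
                  f.first('places').alias('places'),
                  f.collect_list('data1').alias('data1'),
                  f.collect_list('data2').alias('data2'))

# Every row of a given user hashes to the same block, so each block
# can be aggregated independently and the results unioned afterwards.
blocked = data_df.withColumn('block', f.abs(f.hash('user')) % N_BLOCKS)
parts = [aggregate_block(blocked.where(f.col('block') == i)) for i in range(N_BLOCKS)]
data_agg_blocked = reduce(lambda a, b: a.unionByName(b), parts)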

How to construct a numpy array with its each element be the minimum value of all possible values?

I want to construct a 1D numpy array a, where each a[i] has several possible values; the number of possible values can differ from element to element. For each a[i], I want to set it to the minimum of all its possible values.
For example, I have two array:
idx = np.array([0, 1, 0, 2, 3, 3, 3])
val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
The array I want to construct is following:
a = np.array([0.1, 0.5, 0.6, 0.1])
So is there a function in numpy that can do this?
Here's one approach -
def groupby_minimum(idx, val):
    sidx = idx.argsort()
    sorted_idx = idx[sidx]
    cut_idx = np.r_[0, np.flatnonzero(sorted_idx[1:] != sorted_idx[:-1]) + 1]
    return np.minimum.reduceat(val[sidx], cut_idx)
Sample run -
In [36]: idx = np.array([0, 1, 0, 2, 3, 3, 3])
...: val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
...:
In [37]: groupby_minimum(idx, val)
Out[37]: array([ 0.1, 0.5, 0.6, 0.1])
Here's another using pandas -
import pandas as pd
def pandas_groupby_minimum(idx, val):
    df = pd.DataFrame({'ID': idx, 'val': val})
    return df.groupby('ID')['val'].min().values
Sample run -
In [66]: pandas_groupby_minimum(idx, val)
Out[66]: array([ 0.1, 0.5, 0.6, 0.1])
You can also use binned_statistic:
from scipy.stats import binned_statistic
idx_list = np.append(np.unique(idx), np.max(idx) + 1)
stats = binned_statistic(idx, val, statistic='min', bins=idx_list)
a = stats.statistic
I think, in older scipy versions, statistic='min' was not implemented, but you can use statistic=np.min instead. Intervals are half open in binned_statistic, so this implementation is safe.
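For completeness, a sketch using np.minimum.at should also work here: it applies an unbuffered elementwise minimum in place, so repeated indices accumulate correctly. This assumes the IDs in idx run from 0 to idx.max(), as in the example:
import numpy as np

idx = np.array([0, 1, 0, 2, 3, 3, 3])
val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])

a = np.full(idx.max() + 1, np.inf)  # start at +inf so every value is smaller
np.minimum.at(a, idx, val)          # in-place per-index minimum
print(a)  # [0.1 0.5 0.6 0.1]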

search in sublists and match common elements with other sublist

I am searching for an answer but I didn't find anything about my problem.
x=[['100',220, 0.5, 0.25, 0.1],['105',400, 0.12, 0.56, 0.9],['600',340, 0.4, 0.7, 0.45]]
y=[['1','100','105','601'],['2','104','105','600'],['3','100','105','604']]
I want as a result:
z=[['1','100',0.5,0.25,0.1,'105',0.12,0.56,0.9],['2','105',0.12,0.56,0.9,'600',0.4,0.7,0.45],['3','100',0.5, 0.25, 0.1,'105', 0.12, 0.56, 0.9]]
I want to search in list y and match list x with list y, so that I get a new list z that contains the common sublists.
This is just an example; normally lists x and y contain 10000 sublists.
From y I take, e.g., ['1','100','105','601'] and search for '100', '105', '601' in list x (for example ['100', 220, 0.5, 0.25, 0.1]). If I find a match, I add it to a new list z.
Can someone help me?
Answer edited following the comments.
You said in the comments:
search the second, third and fourth number in each y. and compare that with the number on place one in list x
and
then i would like to add (from list x) the numbers on place 1,3,4,5
Then try something like this:
x = [
    ['100', 220, 0.5, 0.25, 0.1],
    ['105', 400, 0.12, 0.56, 0.9],
    ['600', 340, 0.4, 0.7, 0.45]
]
y = [
    ['1', '100', '105', '601'],
    ['2', '104', '105', '600'],
    ['3', '100', '105', '604']
]

z = []
xx = dict((k, v) for k, _, *v in x)
for first, *yy in y:
    zz = [first]
    for n in yy:
        numbers = xx.get(n)
        if numbers:
            zz.append(n)
            zz.extend(numbers)
    z.append(zz)
print(z)
z should now be:
[['1', '100', 0.5, 0.25, 0.1, '105', 0.12, 0.56, 0.9],
['2', '105', 0.12, 0.56, 0.9, '600', 0.4, 0.7, 0.45],
['3', '100', 0.5, 0.25, 0.1, '105', 0.12, 0.56, 0.9]]
First, I convert x into a dictionary, for easy lookup.
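With the example data, xx maps each ID string to its last three numbers:
>>> xx
{'100': [0.5, 0.25, 0.1], '105': [0.12, 0.56, 0.9], '600': [0.4, 0.7, 0.45]}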
The iteration pattern used here was introduced with PEP 3132 and works like this:
>>> head, *tail = range(5)
>>> head
0
>>> tail
[1, 2, 3, 4]
