converting sequence of nucleotide into 2D array of integers - python-3.x

I am trying to convert nucleotide to integer using the following mapping:
A -> 0
C -> 1
G -> 2
T -> 3
The sequence of nucleotide is saved in a pandas dataframe and it looks like:
0
0 GGATAATA
1 CGATAACC
I have used the df.apply() method to do the task. Here is the code:
import numpy as np
import pandas as pd
a = ["GGATAATA","CGATAACC"]
d = dict(zip('A C G T'.split(), range(4)))
df = pd.DataFrame(a)
mapping = df[0].apply(lambda s: np.array([d[i] for i in s]))
It returns the following numpy array which is one dimensional:
print(mapping.values)
array([array([2, 2, 0, 3, 0, 0, 3, 0]), array([1, 2, 0, 3, 0, 0, 1, 1])],
dtype=object)
However, the expected output should be two dimensional array:
[[2,2,0,3,0,0,3,0],
[1,2,0,3,0,0,1,1]]

If I understand correctly:
df[0].apply(list).explode().replace(d).groupby(level=0).agg(list).to_list()
Out[579]: [[2, 2, 0, 3, 0, 0, 3, 0], [1, 2, 0, 3, 0, 0, 1, 1]]

Use map:
list(map(lambda x: list(map(lambda c: d[c], list(x))), df[0]))
Output
[[2, 2, 0, 3, 0, 0, 3, 0], [1, 2, 0, 3, 0, 0, 1, 1]]
or
df[0].agg(list).explode().replace(d).groupby(level=0).agg(list).tolist()
I think the first solution is faster:
%%timeit
list(map(lambda x: list(map(lambda c: d[c], list(x))), df[0]))
11.7 µs ± 392 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
df[0].agg(list).explode().replace(d).groupby(level=0).agg(list).tolist()
5.02 ms ± 697 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

using .str.split() and stack with map
seq = {'A' : 0,
'C' : 1,
'G' : 2,
'T' : 3}
df[0].str.split('',expand=True).stack().map(seq).dropna().groupby(level=0).agg(list)
#out:
0 [2.0, 2.0, 0.0, 3.0, 0.0, 0.0, 3.0, 0.0]
1 [1.0, 2.0, 0.0, 3.0, 0.0, 0.0, 1.0, 1.0]
dtype: object

import pandas as pd
a = ["GGATAATA","CGATAACC"]
d = dict(zip('A C G T'.split(), range(4)))
df = pd.DataFrame(a)
# implement mapping
mapping = str.maketrans('ACGT', '0123')
df[0] = df[0].map(lambda x: x.translate(mapping))
# expected output
output = df[0].map(lambda x: [int(i) for i in x]).tolist()
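The answers above return nested lists or a Series of lists, while the question asks for a two-dimensional array. A minimal sketch of one way to get an actual 2-D NumPy array, assuming every sequence has the same length and contains only A, C, G and T (the names rows and arr are illustrative, not from the original posts):

```python
import numpy as np
import pandas as pd

a = ["GGATAATA", "CGATAACC"]
d = {"A": 0, "C": 1, "G": 2, "T": 3}
df = pd.DataFrame(a)

# Map every character to its integer code, one inner list per sequence.
rows = [[d[c] for c in s] for s in df[0]]

# Equal-length inner lists become a regular 2-D integer array.
arr = np.array(rows)

print(arr.shape)
print(arr)
```

If the sequences differ in length, np.array cannot build a rectangular array from them, so padding or filtering would be needed first.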

Related

Counting number of certain values in each column using pandas and collections

I have a txt file including 9 columns and 6 rows. The first 8 columns are either of these values: "1" , "2" and "3". I named these columns from "A" to "H". I named the last column: "class".
The last column is a name : "HIGH". Here is the txt file (data.txt):
1,1,1,1,2,1,1,3,HIGH
1,1,1,2,2,1,1,3,HIGH
1,1,1,1,1,1,1,3,HIGH
1,1,1,2,1,1,1,3,HIGH
1,1,1,3,2,1,1,3,HIGH
1,1,1,2,1,2,1,3,HIGH
I am trying to count the occurrences of each value in each column and print a list with 3 components: the counts of "1", "2" and "3" in that column, in that order. For example, in the first column (A) all values are "1", so I expect A : [6,0,0]. For the 8th column (H), where all values are "3", I expect H : [0,0,6]. The fourth column (D) has two "1", three "2" and one "3", so I expect D : [2,3,1]. I tried to get it done using pandas and collections. Here is what I did:
import pandas as pd
from collections import Counter
df = pd.read_csv('data.txt')
df.columns = ['A','B','C','D','E','F','G','H','class']
X = df.ix[:, 0:8].values
y = df.ix[:, 8].values
deg = ['HIGH']
names = ['A','B','C','D','E','F','G','H']
for j in range(0, 8):
    freqs = Counter(X[y == deg[0], j])
    print(names[j],':',list(freqs.values()))
The output of the above code are empty lists. Here is what it returns:
A : []
B : []
C : []
D : []
E : []
F : []
G : []
H : []
How can I modify the above code to get what I want?
Thanks!
Use pandas.Series.value_counts:
df.loc[:, :"H"].apply(pd.Series.value_counts).fillna(0).to_dict("list")
Output:
{'A': [6.0, 0.0, 0.0],
'B': [6.0, 0.0, 0.0],
'C': [6.0, 0.0, 0.0],
'D': [2, 3, 1],
'E': [3.0, 3.0, 0.0],
'F': [5.0, 1.0, 0.0],
'G': [6.0, 0.0, 0.0],
'H': [0.0, 0.0, 6.0]}
Define the following function:
def cntInts(col):
    vc = col.value_counts()
    return [ vc.get(i, 0) for i in range(1,4) ]
Then apply it and print results:
for k, v in df.loc[:, 'A':'H'].apply(cntInts).items():
    print(f'{k}: {v}')
For your data sample I got:
A: [6, 0, 0]
B: [6, 0, 0]
C: [6, 0, 0]
D: [2, 3, 1]
E: [3, 3, 0]
F: [5, 1, 0]
G: [6, 0, 0]
H: [0, 0, 6]
Or maybe it is enough to call just:
df.loc[:, 'A':'H'].apply(cntInts)
This time the result is a Series, which when printed yields:
A [6, 0, 0]
B [6, 0, 0]
C [6, 0, 0]
D [2, 3, 1]
E [3, 3, 0]
F [5, 1, 0]
G [6, 0, 0]
H [0, 0, 6]
dtype: object
Edit
Following your comments I suppose that there is something wrong with your data.
To trace the actual reason:
Define a string variable:
txt = '''1,1,1,1,2,1,1,3,HIGH
1,1,1,2,2,1,1,3,HIGH
1,1,1,1,1,1,1,3,HIGH
1,1,1,2,1,1,1,3,HIGH
1,1,1,3,2,1,1,3,HIGH
1,1,1,2,1,2,1,3,HIGH'''
Run:
import io
df = pd.read_csv(io.StringIO(txt), names=['A','B','C','D','E','F','G','H','class'])
Run my code on my data. The result should be just as expected.
Then read your input file (also into df) and run my code again.
Probably there is some difference between your data and mine.
Especially look for any extra spaces in your input file,
check also column types (after read_csv).
A solution with collections: select all columns except the last, convert each column's Counter to a Series so that the output is a DataFrame, replace missing values with DataFrame.fillna, convert the values to integers, and finally convert to a dictionary with DataFrame.to_dict:
from collections import Counter
d = (df.iloc[:, :-1].apply(lambda x: pd.Series(Counter(x)))
       .fillna(0)
       .astype(int)
       .to_dict("list"))
print (d)
{'A': [6, 0, 0], 'B': [6, 0, 0],
'C': [6, 0, 0], 'D': [2, 3, 1],
'E': [3, 3, 0], 'F': [5, 1, 0],
'G': [6, 0, 0], 'H': [0, 0, 6]}
A pandas-only solution with Series.value_counts (the top-level pandas.value_counts is deprecated in recent pandas versions):
d = (df.iloc[:, :-1].apply(pd.Series.value_counts)
       .fillna(0)
       .astype(int)
       .to_dict("list"))
print (d)
{'A': [6, 0, 0], 'B': [6, 0, 0],
'C': [6, 0, 0], 'D': [2, 3, 1],
'E': [3, 3, 0], 'F': [5, 1, 0],
'G': [6, 0, 0], 'H': [0, 0, 6]}
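The value_counts answers above rely on the union of values found in the data to produce the three slots. As a hedged sketch, reindexing each column's counts against an explicit list of levels makes the order 1, 2, 3 independent of which values happen to occur. The helper name count_levels and its levels parameter are hypothetical; the data is the sample from the question, read with explicit column names so that the first row is not consumed as a header.

```python
import io
import pandas as pd

txt = """1,1,1,1,2,1,1,3,HIGH
1,1,1,2,2,1,1,3,HIGH
1,1,1,1,1,1,1,3,HIGH
1,1,1,2,1,1,1,3,HIGH
1,1,1,3,2,1,1,3,HIGH
1,1,1,2,1,2,1,3,HIGH"""

cols = list("ABCDEFGH") + ["class"]
df = pd.read_csv(io.StringIO(txt), names=cols)

def count_levels(col, levels=(1, 2, 3)):
    # Hypothetical helper: count each value, then force one slot per level,
    # filling levels that never occur with 0.
    return col.value_counts().reindex(list(levels), fill_value=0)

counts = df.loc[:, "A":"H"].apply(count_levels).to_dict("list")

for name, values in counts.items():
    print(name, ":", values)
```

Because the levels are stated explicitly, a value such as "3" that is absent from the whole file would still get a zero slot instead of silently disappearing from every list.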
Working in plain Python, since your end result is a dictionary:
from string import ascii_uppercase
from collections import Counter, defaultdict
from itertools import chain, product
import csv
d = defaultdict(list)
fieldnames = ascii_uppercase[:9]
# test.csv is your file above
with open('test.csv') as csvfile:
    reader = csv.DictReader(csvfile, fieldnames = list(fieldnames))
    reader = Counter(chain.from_iterable(row.items() for row in reader))
    for col, value in product(fieldnames, ("1","2","3")):
        if col != fieldnames[-1]:
            d[col].append(reader.get((col,value), 0))
print(d)
defaultdict(list,
{'A': [6, 0, 0],
'B': [6, 0, 0],
'C': [6, 0, 0],
'D': [2, 3, 1],
'E': [3, 3, 0],
'F': [5, 1, 0],
'G': [6, 0, 0],
'H': [0, 0, 6]})

Slicing an array with arrays

I know this has been answered many times and I went through every SO question on this topic, but none of them seemed to tackle my problem.
This code yields an exception:
TypeError: only integer scalar arrays can be converted to a scalar index
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
sindex = np.array([0, 3, 4])
eindex = np.array([2, 5, 6])
r = a[sindex: eindex]
I have one array of start indexes and another of end indexes, and I simply want to extract whatever lies between each pair. Note that the difference between sindex and eindex is constant, for example 2, so eindex is always sindex + 2.
So the expected result should be:
[1, 2, 4, 5, 5, 6]
Is there a way to do this without a for loop?
For a constant interval difference, we can set up sliding windows and simply index them with the array of starting indices. Thus, we can use broadcasting_app or strided_app from this post -
d = 2 # interval difference
out = broadcasting_app(a, L = d, S = 1)[sindex].ravel()
out = strided_app(a, L = d, S = 1)[sindex].ravel()
Or use scikit-image's built-in view_as_windows -
from skimage.util.shape import view_as_windows
out = view_as_windows(a,d)[sindex].ravel()
To set d, we can use -
d = eindex[0] - sindex[0]
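The helpers broadcasting_app and strided_app come from a linked post that is not reproduced here. As a stand-in, and not the original author's code, the same windowing idea can be sketched with NumPy's own sliding_window_view, assuming NumPy 1.20 or newer:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
sindex = np.array([0, 3, 4])
eindex = np.array([2, 5, 6])

# Constant window length, taken from the first start/end pair.
d = int(eindex[0] - sindex[0])

# Read-only view of shape (len(a) - d + 1, d): row i holds a[i:i + d].
windows = sliding_window_view(a, d)

# Pick the windows that start at sindex and flatten them.
out = windows[sindex].ravel()
print(out)
```

The view itself copies no data; the copy only happens when indexing with sindex, so the cost scales with the number of selected windows rather than with the size of a.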
You can't tell compiled numpy to take multiple slices directly. The alternatives to joining multiple slices involve some sort of advanced indexing.
In [509]: a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
...:
...: sindex = np.array([0, 3, 4])
...: eindex = np.array([2, 5, 6])
The most obvious loop:
In [511]: np.hstack([a[i:j] for i,j in zip(sindex, eindex)])
Out[511]: array([1, 2, 4, 5, 5, 6])
A variation that uses the loop to construct indices first:
In [516]: a[np.hstack([np.arange(i,j) for i,j in zip(sindex, eindex)])]
Out[516]: array([1, 2, 4, 5, 5, 6])
Since the slice size is all the same, we can generate one arange and step that with sindex:
In [521]: a[np.arange(eindex[0]-sindex[0]) + sindex[:,None]]
Out[521]:
array([[1, 2],
[4, 5],
[5, 6]])
and then ravel. This is a more direct expression of @Divakar's broadcasting_app.
With this small example, timings are similar.
In [532]: timeit np.hstack([a[i:j] for i,j in zip(sindex, eindex)])
13.4 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [533]: timeit a[np.hstack([np.arange(i,j) for i,j in zip(sindex, eindex)])]
21.2 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [534]: timeit a[np.arange(eindex[0]-sindex[0])+sindex[:,None]].ravel()
10.1 µs ± 48.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [535]: timeit strided_app(a, L=2, S=1)[sindex].ravel()
21.8 µs ± 207 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
strided_app and view_as_windows use striding tricks to view the array as an array of size d windows, and use sindex to select a subset of them.
In larger cases, relative timings may vary with the size of the slices versus the number of slices.
You can just use sindex. [Image from the original answer not available]

how to have multiple conditions in a list comprehension in which conditions are in an array

E.g., let A = [3,4]
and let Y be an array of multiple values like
Y = [2,3,2,2,2,2,2,3,3,3,3,3]
Then I want to select all those labels of Y where Y is in A.
So I wrote the following code:
Yij = [Y[Y == x] for x in A]
Output:
[array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]), array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4])]
but this produces a list of arrays.
I, on the other hand, want a single flat array.
Any suggestions on how I can make this work?
A list comprehension solution:
>>> A = set([3, 4])
>>> Y = [2,3,2,2,2,2,2,3,3,3,3,3]
>>> Z = [y for y in Y if y in A]
>>> Z
[3, 3, 3, 3, 3, 3]
Here are some timings to show the performance difference between using set lookup and list lookup:
In [21]: A = set(range(0, 1000, 5))
In [22]: B = list(range(0, 1000, 5))
In [23]: C = list(range(0, 1000))
In [24]: %timeit [y for y in C if y in A]
59.6 µs ± 329 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [25]: %timeit [y for y in C if y in B]
2.94 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
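Since the question's code indexes Y with a boolean mask, Y is presumably a NumPy array. Under that assumption, a minimal sketch using np.isin returns one flat array directly, without a list comprehension:

```python
import numpy as np

A = np.array([3, 4])
Y = np.array([2, 3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3])

# Boolean mask that is True wherever the element of Y occurs in A.
mask = np.isin(Y, A)

# Boolean indexing keeps the original order and yields a 1-D array.
Z = Y[mask]
print(Z)
```

Unlike the per-value comprehension in the question, this keeps the elements in their original order in Y rather than grouping them by value.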

numpy apply_along_axis vectorisation

I am trying to implement a function that takes each 1-D slice of a 2-D NumPy array and returns the scalar result of a certain calculation. My current code looks like the following:
img = np.array([
[0, 5, 70, 0, 0, 0 ],
[10, 50, 4, 4, 2, 0 ],
[50, 10, 1, 42, 40, 1 ],
[10, 0, 0, 6, 85, 64],
[0, 0, 0, 1, 2, 90]]
)
def get_y(stride):
    stride_vals = stride[stride > 0]
    pix_thresh = stride_vals.max() - 1.5*stride_vals.std()
    return np.argwhere(stride>pix_thresh).mean()
np.apply_along_axis(get_y, 0, img)
>> array([ 2. , 1. , 0. , 2. , 2.5, 3.5])
It works as expected; however, performance isn't great, as the real dataset has ~2k rows and ~20-50 columns per frame, with frames arriving 60 times a second.
Is there a way to speed-up the process, perhaps by not using np.apply_along_axis function?
Here's one vectorized approach: set the zeros to NaN, which lets us use np.nanmax and np.nanstd to compute the max and std values while ignoring the zeros, like so -
imgn = np.where(img==0, np.nan, img)
mx = np.nanmax(imgn,0) # np.max(img,0) if all are positive numbers
st = np.nanstd(imgn,0)
mask = img > mx - 1.5*st
out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
Runtime test -
In [94]: img = np.random.randint(-100,100,(2000,50))
In [95]: %timeit np.apply_along_axis(get_y, 0, img)
100 loops, best of 3: 4.36 ms per loop
In [96]: %%timeit
...: imgn = np.where(img==0, np.nan, img)
...: mx = np.nanmax(imgn,0)
...: st = np.nanstd(imgn,0)
...: mask = img > mx - 1.5*st
...: out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
1000 loops, best of 3: 1.33 ms per loop
Thus, we are seeing a 3x+ speedup.
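As a quick sanity check, the two versions can be compared on the sample img from the question. This is a sketch under the assumption that pixel values are non-negative, because get_y keeps values greater than 0 while the NaN trick only drops exact zeros. The wrapper name get_y_vectorized is illustrative.

```python
import numpy as np

img = np.array([
    [0, 5, 70, 0, 0, 0],
    [10, 50, 4, 4, 2, 0],
    [50, 10, 1, 42, 40, 1],
    [10, 0, 0, 6, 85, 64],
    [0, 0, 0, 1, 2, 90],
])

def get_y(stride):
    # Per-column version from the question.
    stride_vals = stride[stride > 0]
    pix_thresh = stride_vals.max() - 1.5 * stride_vals.std()
    return np.argwhere(stride > pix_thresh).mean()

def get_y_vectorized(img):
    # Whole-array version from the answer, wrapped in a function.
    imgn = np.where(img == 0, np.nan, img)
    mx = np.nanmax(imgn, 0)
    st = np.nanstd(imgn, 0)
    mask = img > mx - 1.5 * st
    # Mean row index of the True entries in each column.
    return np.arange(mask.shape[0]).dot(mask) / mask.sum(0)

loop_result = np.apply_along_axis(get_y, 0, img)
vec_result = get_y_vectorized(img)
print(np.allclose(loop_result, vec_result))
```

With negative values present, as in the random test array used for the timings, the two functions can legitimately disagree, so the timing comparison measures speed rather than equivalence.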

How to create 2-D array from 3-D Numpy array?

I have a 3 dimensional Numpy array corresponding to an RGB image. I need to create a 2 dimensional Numpy array from it such that if any pixel in the R, G, or B channel is 1, then the corresponding pixel in the 2-D array is 255.
I know how to use something like a list comprehension on a Numpy array, but the result is the same shape as the original array. I need the new shape to be 2-D.
Ok, assuming you want the output pixel to be 0 where it shouldn't be 255 and your input is MxNx3.
RGB = RGB == 1 # you can skip this if your original (RGB) contains only 0's and 1's anyway
out = np.where(np.logical_or.reduce(RGB, axis=-1), 255, 0)
One approach could be with using any() along the third dim and then multiplying by 255, so that the booleans are automatically upscaled to int type, like so -
(img==1).any(axis=2)*255
Sample run -
In [19]: img
Out[19]:
array([[[1, 8, 1],
[2, 4, 7]],
[[4, 0, 6],
[4, 3, 1]]])
In [20]: (img==1).any(axis=2)*255
Out[20]:
array([[255, 0],
[ 0, 255]])
Runtime test -
In [45]: img = np.random.randint(0,5,(1024,1024,3))
# @Paul Panzer's soln
In [46]: %timeit np.where(np.logical_or.reduce(img==1, axis=-1), 255, 0)
10 loops, best of 3: 22.3 ms per loop
# @nanoix9's soln
In [47]: %timeit np.apply_along_axis(lambda a: 255 if 1 in a else 0, 0, img)
10 loops, best of 3: 40.1 ms per loop
# Posted soln here
In [48]: %timeit (img==1).any(axis=2)*255
10 loops, best of 3: 19.1 ms per loop
Additionally, we could convert to np.uint8 and then multiply it with 255 for some further performance boost -
In [49]: %timeit (img==1).any(axis=2).astype(np.uint8)*255
100 loops, best of 3: 18.5 ms per loop
And more, if we work with individual slices along the third dim -
In [68]: %timeit ((img[...,0]==1) | (img[...,1]==1) | (img[...,2]==1))*255
100 loops, best of 3: 7.3 ms per loop
In [69]: %timeit ((img[...,0]==1) | (img[...,1]==1) | (img[...,2]==1)).astype(np.uint8)*255
100 loops, best of 3: 5.96 ms per loop
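The variants timed above are meant to be interchangeable. A small sketch on the sample array from the earlier run checks that they give the same mask (the names via_any, via_reduce and via_slices are illustrative):

```python
import numpy as np

img = np.array([[[1, 8, 1],
                 [2, 4, 7]],
                [[4, 0, 6],
                 [4, 3, 1]]])

# any() along the channel axis, scaled to 255.
via_any = (img == 1).any(axis=2) * 255

# logical_or.reduce along the last axis, mapped through np.where.
via_reduce = np.where(np.logical_or.reduce(img == 1, axis=-1), 255, 0)

# Explicit OR of the three channel slices, stored as uint8.
via_slices = ((img[..., 0] == 1) | (img[..., 1] == 1) | (img[..., 2] == 1)).astype(np.uint8) * 255

print(via_any)
print(np.array_equal(via_any, via_reduce), np.array_equal(via_any, via_slices))
```

The values agree across all three; only the dtype differs, with the uint8 variant being the natural fit if the result is to be used as an 8-bit image mask.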
Use apply_along_axis, e.g.:
In [28]: import numpy as np
In [29]: np.random.seed(10)
In [30]: img = np.random.randint(2, size=12).reshape(3, 2, 2)
In [31]: img
Out[31]:
array([[[1, 1],
[0, 1]],
[[0, 1],
[1, 0]],
[[1, 1],
[0, 1]]])
In [32]: np.apply_along_axis(lambda a: 255 if 1 in a else 0, 0, img)
Out[32]:
array([[255, 255],
[255, 255]])
See the NumPy documentation for details.
