How do I get the microarray data? - python-3.x

Thank you for your help. I want to use the following Python code to read and process data from an Affymetrix microarray data set, in order to study differential gene expression in mononuclear cells in Crohn's disease and ulcerative colitis. The code runs without errors, but when I inspect X I get an empty array (array([], dtype=float64)), which of course is not useful. Here is a link to the raw data set: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1615
I have tried for a long time to figure out why the output is empty, but to no avail. Here is the code:
import gzip
import numpy as np
"""
Read in a SOFT format data file. The following values can be exported:
GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length d
X : A dxn array of gene expression values
"""
#path to the data file
fname = "../data/GDS1615_full.soft.gz"
## Open the data file directly as a gzip file
with gzip.open(fname) as fid:
    SIF = {}
    for line in fid:
        if line.startswith(line, len("!dataset_table_begin")):
            break
        elif line.startswith(line, len("!subject_description")):
            subset_description = line.split("=")[1].strip()
        elif line.startswith(line, len("!subset_sample_id")):
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description
    ## Next line is the column headers (sample id's)
    SID = next(fid).split("\t")
    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith("GSM")]
    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]
    ## Get a list of sample labels
    STP = [SIF[k] for k in SID]
    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers
    GID,X = [],[]
    for line in fid:
        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith("!dataset_table_end"):
            break
        V = line.split("\t")
        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]
        X.append(x)
        GID.append(V[0] + ";" + V[1])
X = np.array(X)
## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]
## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]
At the console, I get such output:
X
Out[94]: array([], dtype=float64)
X.shape
Out[95]: (0,)
Thank you once more for your suggestions.

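This worked perfectly. In Python 3, gzip.open() opens the file in binary mode by default, so every line is a bytes object: the prefixes passed to startswith() have to be bytes literals (b"..."), and fields have to be decoded before they are used as strings.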
import gzip
import numpy as np
"""
Read in a SOFT format data file. The following values can be exported:
GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length n
X : A dxn array of gene expression values
"""
#path to the data file
fname = "../data/GDS1615_full.soft.gz"
## Open the data file directly as a gzip file
with gzip.open(fname) as fid:
    SIF = {}
    for line in fid:
        if line.startswith(b"!dataset_table_begin"):
            break
        elif line.startswith(b"!subset_description"):
            subset_description = line.decode('utf8').split("=")[1].strip()
        elif line.startswith(b"!subset_sample_id"):
            subset_ids = line.decode('utf8').split("=")[1].split(",")
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description
    ## Next line is the column headers (sample id's)
    SID = next(fid).split(b"\t")
    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith(b"GSM")]
    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]
    ## Get a list of sample labels
    STP = [SIF[k.decode('utf8')] for k in SID]
    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers
    GID,X = [],[]
    for line in fid:
        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith(b"!dataset_table_end"):
            break
        V = line.split(b"\t")
        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]
        X.append(x)
        GID.append(V[0].decode() + ";" + V[1].decode())
X = np.array(X)
## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]
## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]
results:
X.shape
Out[4]: (22283, 127)
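An alternative that avoids the b"..." prefixes and the decode() calls is to open the gzip file in text mode ("rt"), so that every line is already a str. The following is only a minimal sketch of that idea: the function name read_soft_table and the tiny demo file are made up for illustration and are not the real GDS1615 data.

```python
import gzip
import os
import tempfile


def read_soft_table(path):
    """Parse subset labels and the data table from a gzipped SOFT file.

    The file is opened in text mode, so lines are str rather than bytes.
    """
    labels = {}
    with gzip.open(path, "rt", encoding="utf8") as fid:
        description = None
        for line in fid:
            if line.startswith("!dataset_table_begin"):
                break
            elif line.startswith("!subset_description"):
                description = line.split("=")[1].strip()
            elif line.startswith("!subset_sample_id"):
                for sample_id in line.split("=")[1].split(","):
                    labels[sample_id.strip()] = description
        # The next line holds the column headers (sample ids)
        header = next(fid).rstrip("\n").split("\t")
        keep = [i for i, name in enumerate(header) if name.startswith("GSM")]
        sample_ids = [header[i] for i in keep]
        gene_ids, rows = [], []
        for line in fid:
            if line.startswith("!dataset_table_end"):
                break
            fields = line.rstrip("\n").split("\t")
            gene_ids.append(fields[0] + ";" + fields[1])
            rows.append([float(fields[i]) for i in keep])
    sample_labels = [labels[s] for s in sample_ids]
    return gene_ids, sample_ids, sample_labels, rows


# Hypothetical miniature SOFT file, written only to exercise the parser
demo = (
    "!subset_description = ulcerative colitis\n"
    "!subset_sample_id = GSM1,GSM2\n"
    "!dataset_table_begin\n"
    "ID_REF\tIDENTIFIER\tGSM1\tGSM2\n"
    "1007_s_at\tDDR1\t1.5\t2.5\n"
    "!dataset_table_end\n"
)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.soft.gz")
with gzip.open(demo_path, "wt", encoding="utf8") as out:
    out.write(demo)

gene_ids, sample_ids, sample_labels, values = read_soft_table(demo_path)
```

The returned rows can be passed to np.array() exactly as in the code above. Whether text mode or bytes mode is preferable is mostly a matter of taste; text mode just keeps the string handling closer to the original Python 2 style script.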

Related

How to color in red values that are different in adjacent columns?

I have the following dataframe, and I want to color in red the values that differ between the two adjacent columns of each feature. So for example for 'max', CRIM raw=88.98 and CRIM winsorized=41.53 should be in red, whereas for AGE they should remain black.
How can I do this? Attached is the CSV file.
,25%,25%,50%,50%,75%,75%,count,count,max,max,mean,mean,min,min,std,std
,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized
CRIM,0.08,0.08,0.26,0.26,3.68,3.68,506.0,506.0,88.98,41.53,3.61,3.38,0.01,0.01,8.6,6.92
ZN,0.0,0.0,0.0,0.0,12.5,12.5,506.0,506.0,100.0,90.0,11.36,11.3,0.0,0.0,23.32,23.11
INDUS,5.19,5.19,9.69,9.69,18.1,18.1,506.0,506.0,27.74,25.65,11.14,11.12,0.46,1.25,6.86,6.81
CHAS,0.0,0.0,0.0,0.0,0.0,0.0,506.0,506.0,1.0,1.0,0.07,0.07,0.0,0.0,0.25,0.25
NOX,0.45,0.45,0.54,0.54,0.62,0.62,506.0,506.0,0.87,0.87,0.55,0.55,0.38,0.4,0.12,0.12
RM,5.89,5.89,6.21,6.21,6.62,6.62,506.0,506.0,8.78,8.34,6.28,6.29,3.56,4.52,0.7,0.68
AGE,45.02,45.02,77.5,77.5,94.07,94.07,506.0,506.0,100.0,100.0,68.57,68.58,2.9,6.6,28.15,28.13
DIS,2.1,2.1,3.21,3.21,5.19,5.19,506.0,506.0,12.13,9.22,3.8,3.78,1.13,1.2,2.11,2.05
RAD,4.0,4.0,5.0,5.0,24.0,24.0,506.0,506.0,24.0,24.0,9.55,9.55,1.0,1.0,8.71,8.71
TAX,279.0,279.0,330.0,330.0,666.0,666.0,506.0,506.0,711.0,666.0,408.24,407.79,187.0,188.0,168.54,167.79
PTRATIO,17.4,17.4,19.05,19.05,20.2,20.2,506.0,506.0,22.0,21.2,18.46,18.45,12.6,13.0,2.16,2.15
B,375.38,375.38,391.44,391.44,396.22,396.22,506.0,506.0,396.9,396.9,356.67,356.72,0.32,6.68,91.29,91.14
LSTAT,6.95,6.95,11.36,11.36,16.96,16.96,506.0,506.0,37.97,34.02,12.65,12.64,1.73,2.88,7.14,7.08
MEDV,17.02,17.02,21.2,21.2,25.0,25.0,506.0,506.0,50.0,50.0,22.53,22.54,5.0,7.0,9.2,9.18
Nothing more, Nothing less :)
def highlight_cols(s):
    # input: s is a pd.Series with an attribute name
    #   s.name --> ('25%', 'raw')
    #              ('25%', 'winsorized')
    #              ...
    #
    # 1) Take the parent level of s.name (first value of the tuple), e.g. '25%'
    # 2) Select the subset from df, given step 1
    #    --> this gives back the sub-frame: 25% - raw | 25% - winsorized
    # 3) Check whether the number of unique values (for each row) is > 1
    #    If so: return red text
    #    If not: return an empty string
    #
    # Output: a list with the desired style for Series s
    return ['color: red' if x else '' for x in df[s.name[0]].nunique(axis=1) > 1]
df.style.apply(highlight_cols)
You can do this comparison between columns using a groupby. Here's an example:
import pandas as pd
import io
s = """,25%,25%,50%,50%,75%,75%,count,count,max,max,mean,mean,min,min,std,std
,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized,raw,winsorized
CRIM,0.08,0.08,0.26,0.26,3.68,3.68,506.0,506.0,88.98,41.53,3.61,3.38,0.01,0.01,8.6,6.92
ZN,0.0,0.0,0.0,0.0,12.5,12.5,506.0,506.0,100.0,90.0,11.36,11.3,0.0,0.0,23.32,23.11
INDUS,5.19,5.19,9.69,9.69,18.1,18.1,506.0,506.0,27.74,25.65,11.14,11.12,0.46,1.25,6.86,6.81
CHAS,0.0,0.0,0.0,0.0,0.0,0.0,506.0,506.0,1.0,1.0,0.07,0.07,0.0,0.0,0.25,0.25
NOX,0.45,0.45,0.54,0.54,0.62,0.62,506.0,506.0,0.87,0.87,0.55,0.55,0.38,0.4,0.12,0.12
RM,5.89,5.89,6.21,6.21,6.62,6.62,506.0,506.0,8.78,8.34,6.28,6.29,3.56,4.52,0.7,0.68
AGE,45.02,45.02,77.5,77.5,94.07,94.07,506.0,506.0,100.0,100.0,68.57,68.58,2.9,6.6,28.15,28.13
DIS,2.1,2.1,3.21,3.21,5.19,5.19,506.0,506.0,12.13,9.22,3.8,3.78,1.13,1.2,2.11,2.05
RAD,4.0,4.0,5.0,5.0,24.0,24.0,506.0,506.0,24.0,24.0,9.55,9.55,1.0,1.0,8.71,8.71
TAX,279.0,279.0,330.0,330.0,666.0,666.0,506.0,506.0,711.0,666.0,408.24,407.79,187.0,188.0,168.54,167.79
PTRATIO,17.4,17.4,19.05,19.05,20.2,20.2,506.0,506.0,22.0,21.2,18.46,18.45,12.6,13.0,2.16,2.15
B,375.38,375.38,391.44,391.44,396.22,396.22,506.0,506.0,396.9,396.9,356.67,356.72,0.32,6.68,91.29,91.14
LSTAT,6.95,6.95,11.36,11.36,16.96,16.96,506.0,506.0,37.97,34.02,12.65,12.64,1.73,2.88,7.14,7.08
MEDV,17.02,17.02,21.2,21.2,25.0,25.0,506.0,506.0,50.0,50.0,22.53,22.54,5.0,7.0,9.2,9.18"""
df = pd.read_csv(io.StringIO(s), header=[0,1])
df = df.set_index(df.columns[0])
df.index.name = ''
def get_styles_inner(col):
    first_level_name = col.columns[0][0]
    # compare raw and winsorized
    match = col[(first_level_name, 'raw')] == col[(first_level_name, 'winsorized')]
    # store the comparison result in both the raw and winsorized columns
    col[(first_level_name, 'raw')] = match
    col[(first_level_name, 'winsorized')] = match
    return col
def get_styles(df):
    # Group on the first level of the column index and pass each
    # group to get_styles_inner.
    match_df = df.groupby(level=0, axis=1).apply(get_styles_inner)
    # Replace True with no style, and False with red
    style_df = match_df.applymap(lambda x: None if x else 'color:red;')
    return style_df
df.style.apply(get_styles, axis=None)
(The lines before get_styles_inner just load your dataset. You can ignore them if you already have the dataset.)
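As a variation on the groupby approach: to my knowledge, groupby(axis=1) and applymap are deprecated in recent pandas releases, so the same style table can also be built by comparing the 'raw' and 'winsorized' cross-sections directly. This is only a sketch on a made-up two-row excerpt of the data; the function name red_where_different is hypothetical.

```python
import pandas as pd

# Small excerpt of the table from the question, built by hand
columns = pd.MultiIndex.from_product([["max", "min"], ["raw", "winsorized"]])
df = pd.DataFrame(
    [[88.98, 41.53, 0.01, 0.01],
     [100.0, 100.0, 2.9, 6.6]],
    index=["CRIM", "AGE"],
    columns=columns,
)


def red_where_different(data):
    """Return a DataFrame of CSS strings with the same shape as data."""
    raw = data.xs("raw", axis=1, level=1)
    winsorized = data.xs("winsorized", axis=1, level=1)
    differs = raw.ne(winsorized)  # one boolean column per statistic
    styles = pd.DataFrame("", index=data.index, columns=data.columns)
    for stat in differs.columns:
        for kind in ("raw", "winsorized"):
            styles.loc[differs[stat], (stat, kind)] = "color: red"
    return styles


styles = red_where_different(df)
# In a notebook, apply it to the whole table at once:
# df.style.apply(red_where_different, axis=None)
```

Passing axis=None makes Styler.apply hand the whole DataFrame to the function, which is why it returns a full table of style strings rather than a list per column.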

Adding new strings line by line from a file to a new one

I have a data output file in the format below from the script I run.
1. xxx %percentage1
2. yyy %percentage1
.
.
.
I am trying to take the percentages only, and append them to the same formatted file line by line (writing a new file once in the process).
1. xxx %percentage1 %percentage2
2. yyy %percentage1 %percentage2
The main idea is every time I run the code with a source data file I want it to add those percentages to the new file line by line.
1. xxx %percentage1 %percentage2 %percentage3 ...
2. yyy %percentage1 %percentage2 %percentage3 ...
This is what I could come up with:
import os
os.chdir("directory")
f = open("data1", "r")
n=3
a = f.readlines()
b = []
for i in range(n):
    b.append(a[i].split(" ")[2])
file_lines = []
with open("data1", 'r') as f:
    for t in range(n):
        for x in f.readlines():
            file_lines.append(''.join([x.strip(), b[t], '\n']))
            print(b[t])
with open("data2", 'w') as f:
    f.writelines(file_lines)
With this code I get the new file, but the appended percentages all come from the first line instead of being different for each line. Also, I can only add one set of percentages, and each run overwrites the file rather than adding more columns to the existing lines.
I hope I explained it properly. I would be glad of any help.
You can use a dict as a structure to load and write your data. This dict can then be pickled to store the data.
EDIT: added missing return statement
EDIT2: Fix return list of get_data
import pickle
import os
output = 'output'
dump = 'dump'
output_dict = {}
if os.path.exists(dump):
    with open(dump, 'rb') as f:
        output_dict = pickle.load(f)
def read_data(lines):
    """ Builds a dict from a list of lines where the keys are
    a tuple(w1, w2) and the values are w3 where w1, w2 and w3
    are the 3 words composing each line.
    """
    d = {}
    for line in lines:
        elts = line.split()
        assert(len(elts)==3)
        d[tuple(elts[:2])] = elts[2]
    return d
def get_data(data):
    """ Recover data from a dict as a list of strings.
    The formatting for each element of the list is the following:
    k[0] k[1] v
    where k and v are the key/values of the data dict.
    """
    lines = []
    for k, v in data.items():
        # No trailing newline here: the caller joins the lines with '\n'
        lines.append(' '.join([*k, v]))
    return lines
def update_data(output_d, new_d):
    """ Update a data dict with new data
    The values are appended if the key already exists.
    Otherwise a new key/value pair is created.
    """
    for k, v in new_d.items():
        if k in output_d:
            output_d[k] = ' '.join([output_d[k], v])
        else:
            output_d[k] = v
for data_file in ('data1', 'data2', 'data3'):
    with open(data_file) as f:
        d1 = read_data(f.readlines())
    update_data(output_dict, d1)
print("Dumping data", output_dict)
with open(dump, 'wb') as f:
    pickle.dump(output_dict, f)
print("Writing data")
with open(output, 'w') as f:
    f.write('\n'.join(get_data(output_dict)))
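If the pickle file is not needed, the existing output file can itself serve as the stored state: read its lines back, append the newest percentage to each, and write the result out again. The following is a rough sketch of just the merging step, working on lists of strings; the function name append_percentages and the sample values are made up for illustration.

```python
def append_percentages(existing_lines, new_lines):
    """Append the last field of each new line to the matching existing line.

    Lines are matched on their first two fields (e.g. "1." and "xxx").
    """
    merged = {}
    for line in existing_lines:
        fields = line.split()
        if len(fields) >= 3:
            merged[tuple(fields[:2])] = fields[2:]
    for line in new_lines:
        fields = line.split()
        if len(fields) >= 3:
            # Create the entry if this key was not in the existing file yet
            merged.setdefault(tuple(fields[:2]), []).append(fields[-1])
    return [" ".join([*key, *values]) for key, values in merged.items()]


existing = ["1. xxx 10%", "2. yyy 20%"]
latest_run = ["1. xxx 30%", "2. yyy 40%"]
result = append_percentages(existing, latest_run)
```

On each run, the existing list would come from the output file (or be empty the first time) and the result would be written back with '\n'.join(result).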

Add filenames to multiple for loops

I have a list of file names, like this.
file_names = ['file1', 'file2']
Also, I have lists of key words I am trying to extract from some files. The key word lists (list_1, list_2) and the text strings that come from file1 and file2 are below,
## list_1 keywords
list_1 = ['hi', 'hello']
## list_2 keywords
list_2 = ['I', 'am']
## Text strings from file_1 and file_2
big_list = ['hi I am so and so how are you', 'hello hope all goes well by the way I can help you']
My function to extract text,
import re
def my_func(text_string, key_words):
    sentences = re.findall(r"([^.]*\.)", text_string)
    for sentence in sentences:
        if all(word in sentence for word in key_words):
            return sentence
Now, I am going through multiple lists with two different for loops (as shown below) and with the function. After the end of each iteration of these for loops, I want to save the file with the file names from the file_names list.
for a,b in zip(list_1,list_2):
    for item in big_list:
        sentence_1 = my_func(item, a.split(' '))
        sentence_2 = my_func(item, b.split(' '))
        ## Here I would like to add the file name i.e (print(filename))
        print(sentence_1)
        print(sentence_2)
I need an output that looks like this,
file1 is:
None
file2 is:
None
You can ignore None in my output for now, as my main focus is to iterate through the file name list and add the names to my output. I would appreciate any help to achieve this.
You can access the index in Python for loops and use this index to find the file to which the string corresponds. With this you can print out the current file.
Here is an example of how you can do it:
for a,b in zip(list_1,list_2):
    # idx is the index here
    for idx, item in enumerate(big_list):
        sentence_1 = my_func(item, a.split(' '))
        sentence_2 = my_func(item, b.split(' '))
        prefix = file_names[idx] + " is: " # Use idx to get the file from the file list
        if sentence_1 is not None:
            print(prefix + sentence_1)
        if sentence_2 is not None:
            print(prefix + sentence_2)
Update:
If you want to print the results after the iteration you can save temporarily the results in a dictionary and then loop through it:
for a,b in zip(list_1,list_2):
    resMap = {}
    # idx is the index here
    for idx, item in enumerate(big_list):
        sentence_1 = my_func(item, a.split(' '))
        sentence_2 = my_func(item, b.split(' '))
        if sentence_1 is not None:
            resMap[file_names[idx]] = sentence_1
        if sentence_2 is not None:
            resMap[file_names[idx]] = sentence_2
    for k in resMap.keys():
        prefix = k + " is: " # k is the file name
        print(prefix + resMap[k])
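Since file_names and big_list are parallel lists, another option is to pair them with zip() instead of looking the name up by index. The sketch below assumes sample texts that actually contain full stops (the strings in the question have none, which is why the regular expression finds no sentence and the function returns None); the helper name first_matching_sentence and the texts are made up for illustration.

```python
import re


def first_matching_sentence(text, key_words):
    """Return the first sentence containing every key word, or None."""
    for sentence in re.findall(r"[^.]*\.", text):
        if all(word in sentence for word in key_words):
            return sentence.strip()
    return None


file_names = ["file1", "file2"]
# Hypothetical texts, with full stops so that sentences can be found
texts = [
    "hi I am so and so. how are you.",
    "hello hope all goes well. by the way I can help you.",
]
keywords = ["hi", "hello"]

results = {}
for name, text in zip(file_names, texts):
    # One result per keyword, stored under the file name
    results[name] = [first_matching_sentence(text, [kw]) for kw in keywords]

for name, sentences in results.items():
    print(name + " is:", sentences)
```

Note that `word in sentence` is a substring test, so a key word such as "am" would also match inside a longer word.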

How to append a string at the beginning to every tuple within a list?

tuple_list = [('1','2'),('2','3'),('2','6')]
string = Point
Desired_List = [Point('1','2'),Point('2','3'),Point('2','6')]
I have tried the following code:
for x in tuple_list:
    x.append("Point")
for x in tuple_list:
    x + 'Point'
How can I prepend a string to every tuple within a list?
Update, for your information: I have 2 columns and hundreds of rows of x and y points in a CSV file:
x y
1 3
2 4
I want to get that as:
Points = [Point(1,3),Point(2,4),......]
If strings are all you need, this is the way to go :
desired_list = []
for x,y in tuple_list:
    desired_list.append(f"Point({x},{y})")
Which produces the following output :
>>> print(desired_list)
['Point(1,2)', 'Point(2,3)', 'Point(2,6)']
As far as named tuples are concerned, you'd do it as follows :
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
tuple_list = [('1','2'),('2','3'),('2','6')]
desired_list = []
for x,y in tuple_list:
    desired_list.append(Point(x, y))
And the corresponding result is :
>>> print(desired_list)
[Point(x='1', y='2'), Point(x='2', y='3'), Point(x='2', y='6')]
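For the CSV case mentioned in the update, the rows can be turned into Point objects while reading the file. This is a minimal sketch that assumes a comma-separated file with a header row named x and y and integer coordinates; an in-memory string stands in for the real file, whose name and exact delimiter are not given in the question.

```python
import csv
import io
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

# Stand-in for the real CSV file (assumed layout: header "x,y", then one point per row)
sample_csv = "x,y\n1,3\n2,4\n"

reader = csv.DictReader(io.StringIO(sample_csv))
points = [Point(int(row["x"]), int(row["y"])) for row in reader]
print(points)
```

With a real file, io.StringIO(sample_csv) would be replaced by an open file handle, and csv.DictReader accepts a delimiter argument if the columns are separated by something other than commas.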

how to update contents of file in python

def update():
    global mylist
    i = j = 0
    mylist[:]= []
    key = input("enter student's tp")
    myf = open("data.txt","r+")
    ml = myf.readlines()
    #print(ml[1])
    for line in ml:
        words = line.split()
        mylist.append(words)
    print(mylist)
    l = len(mylist)
    w = len(words)
    print(w)
    print(l)
    for i in range(l):
        for j in range(w):
            print(mylist[i][j])
            ## if(key == mylist[i][j]):
            ##     print("found at ",i,j)
            ##     del mylist[i][j]
            ##     mylist[i].insert((j+1), "xxx")
Below is the error:
print(mylist[i][j])
IndexError: list index out of range
I am trying to update the contents of a file. I read the file into a list of lines, and each line is then stored as another list of words. So "mylist" is a 2D list, but indexing it gives me this error.
Your w variable is the length of the word list of the last line only. Other lines can be shorter, so mylist[i][j] runs past the end of them.
A better idiom is to iterate over the lists directly with for loops instead of indexing by range.
But there is an even better way.
It appears you want to replace a "tp" (whatever that is) with the string xxx everywhere. A quicker way to do that would be to use regular expressions.
import re
key = input("enter student's tp")
with open('data.txt') as myf:
    myd = myf.read()
# re.escape makes sure the key is matched literally
newd = re.sub(re.escape(key), 'xxx', myd)
with open('newdata.txt', 'w') as newf:
    newf.write(newd)
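For the nested-list approach from the question, iterating over the rows and words directly avoids the IndexError, because no index ever exceeds the length of a row. The following is a small sketch working on a list of lines; the function name replace_word and the sample lines are made up for illustration.

```python
def replace_word(lines, key, replacement="xxx"):
    """Replace every whole word equal to key, line by line."""
    updated = []
    for line in lines:
        # Each line is split on its own, so rows may have different lengths
        words = [replacement if word == key else word for word in line.split()]
        updated.append(" ".join(words))
    return updated


sample_lines = ["tp01 alice 90", "tp02 bob"]
new_lines = replace_word(sample_lines, "tp02")
print(new_lines)
```

Unlike the re.sub version, this only replaces whole words, so a key such as "tp0" would not touch "tp01".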
