data.frame slicing - string

I hope this question is not too simple for this board.
I have created a data.frame df:
CAS Name CID
89 13010-47-4 Lomustine 3950
90 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
91 130636-43-0 Nifekalant 268083
92 130929-57-6 Entacapone 5281081
and a vector vec
[1] 5282380 18471829 45923789 44308022 44266812 24883465 24867475 24867460
I would like to extract the rows of df which contains any number of vec. I tried to solve this problem by this code:
df$GC[(df$CID %in% vec)] = 1
df[df$GC==1,]
But the problem with this solution is, that I only get the rows, which contain only one number in the CID column. Rows which contain several values in CID like line 90 do not appear.
Is there an elegant solution for this problem?
Thanks in advance

Given your comment on EDi's answer (which I like) I thought I'd make a suggestion.
Squeezing comma separated values into a single column of a data frame is awkward and (in my experience) just leads to frustration. I often find it simpler to keep it in a separate data structure, a list:
dat <- read.table(text = " CAS Name CID
13010-47-4 Lomustine 3950
130209-82-4 Latanoprost 5311221,5282380,46705340,3890
130636-43-0 Nifekalant 268083
130929-57-6 Entacapone 5281081",sep = "",header = TRUE)
cid <- sapply(dat$CID,strsplit,",",USE.NAMES = FALSE)
In this form, things are often easier to work with:
ID <- c(5282380, 18471829, 45923789, 44308022, 44266812, 24883465, 24867475, 24867460, 3950)
dat[sapply(cid,function(x) {any(x %in% as.character(ID))}),]
CAS Name CID
1 13010-47-4 Lomustine 3950
2 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
You can always use rownames in dat and the names of the list to keep each item straight, if you're worried about orderings changing.
(Also note that my anonymous function is assuming that ID will be found eventually by R's scoping rules; you can alter the function to pass in ID explicitly if you like.)

One way is to use grep():
> txt <- " CAS Name CID
+ 13010-47-4 Lomustine 3950
+ 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
+ 130636-43-0 Nifekalant 268083
+ 130929-57-6 Entacapone 5281081
+ "
> con <- textConnection(txt)
> df <- read.table(con, header = TRUE)
> close(con)
> ID <- c(5282380, 18471829, 45923789, 44308022, 44266812, 24883465, 24867475, 24867460, 3950)
> grep(paste("\\b", ID, "\\b", sep="", collapse = "|"), dat$CID)
[1] 1 2

Related

running for loop until arbitrary index (python 3.x)

So I have these strings that I split by spaces (' ') and I just rolled them into a single list I called 'keyLabelRun'
so it looks like this:
keyLabelRun[0-12]:
0 OS=Dengue
1 virus
2 3
3 PE=4
4 SV=1
5 Split=0
6
7 OS=Bacillus
8 subtilis
9 XF-1
10 GN=opuBA
11 PE=4
12 SV=1
I only want the elements that include and are after "OS=", anything else, whether it be "SV=" or "PE=" etc. I want to skip over those elements until I get to the next "OS="
The number of elements to the next "OS=" is arbitrary so that's where I'm having the problem.
This is what I'm currently trying:
OSarr = []
for i in range(len(keyLabelrun)):
if keyLabelrun[i].count('OS='):
OSarr.append(keyLabelrun[i])
if keyLabelrun[i+1].count('=') != 1:
continue
But the elements where "OS=" is not included is what is tripping me up I think.
Also at the end I'm going to join them all back together in their own elements but I feel like I will be able to handle that after this.
In my attempt, I am trying to append all elements I'm looking for in order to an new list 'OSarr'
If anyone can lend a hand, it would be much appreciated.
Thank you.
These list of strings came from a dataset that is a text file in the form:
>tr|W0FSK4|W0FSK4_9FLAV Genome polyprotein (Fragment) OS=Dengue virus 3 PE=4 SV=1 Split=0
MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIPPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINKRKKTSLCLMMILPAALAFHLTSRDGEPRMIVGKNERGKSLLFKTASGINMCTLIAMDLGEMCDDTVTYKCPHITEVEPEDIDCWCNLTSTWVTYGTCNQAGEHRRDKRSVALAPHVGMGLDTRTQTWMSAEGAWRQVEKVETWALRHPGFTILALFLAHYIGTSLTQKVVIFILLMLVTPSMTMRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLCIEGKITNITTDSRCPTQGEATLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGLECSPRTGLDFNEMILLTMKNKAWMVHRQWFFDLPLPWTSGATTETPTWNRKELLVTFKNAHAKKQEVVVLGSQEGAMHTALTGATEIQNSGGTSIFAGHLKCRLKMDKLELKGMSYAMCTNTFVLKKEVSETQHGTILIKVEYKGEDVPCKIPFSTEDGQGKAHNGRLITANPVVTKKEEPVNIEAEPPFGESNIVIGIGDNALKINWYKKGSSIGKMFEATARGARRMAILGDTAWDFGSVGGVLNSLGKMVHQIFGSAYTALFSGVSWVMKIGIGVLLTWIGLNSKNTSMSFSCIAIGIITLYLGAVVQADMGCVINWKGKELKCGSGIFVTNEVHTWTEQYKFQADSPKRLATAIAGAWENGVCGIRSTTRMENLLWKQIANELNYILWENNIKLTVVVGDIIGVLEQGKRTLTPQPMELKYSWKTWGKAKIVTAETQNSSFIIDGPNTPECPSVSRAWNVWEVEDYGFGVFTTNIWLKLREVYTQLCDHRLMSAAVKDERAVHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTLWSNGVLESDMIIPKSLAGPISQHNHRPGYHTQTAGPWHLGKLELDFNYCEGTTVVITENCGTRGPSLRTTTVSGKLIHEWCCRSCTLPPLRYMGEDGCWYGMEIRPISEKEENMVKSLVSAGSGKVDNFTMGVLCLAILFEEVMRGKFGKKHMIAGVFFTFVLLLSGQITWRDMAHTLIMIGSNASDRMGMGVTYLALIATFKIQPFLALGFFLRKLTSRENLLLGVGLAMATTLQLPEDIEQMANGIALGLMALKLITQFETYQLWTALISLTCSNTIFTLTVAWRTATLILAGVSLLPVCQSSSMRKTDWLPMAVAAMGVPPLPLFIFGLKDTLKRRSWPLNEGVMAVGLVSILASSLLRNDVPMAGPLVAGGLLIACYVITGTSADLTVEKAADITWEEEAEQTGVSHNLMITVDDDGTMRIKDDETENILTVLLKTALLIVSGIFPYSIPATLLVWHTWQKQTQRSGVLWDVPSPPETQKAELEEGVYRIKQQGIFGKTQVGVGVQKEGVFHTMWHVTRGAVLTYNGKRLEPNWASVKKDLISYGGGWRLSAQWQKGEEVQVIAVEPGKNPKNFQTMPGTFQTTTGEIGAIALDFKPGTSGSPIINREGKVVGLYGNGVVTKNGGYVSGIAQTNAEPDGPTPELEEEMFKKRNLTIMDLHPGSGKTRKYLPAIVREAIKRRLRTLILAPTRVVAAEMEEALKGLPIRYQTTATKSEHTGREIVDLMCHATFTMRLLSPVRVPNYNLIIMDEAHFTDPASIAARGYISTRVGMGEAAAIFMTATPPGTADAFPQSNAPIQDEERDIPERSWNSGNEWITDFAGKTVWFVPSIKAGNDIANCLRKNGKKVIQLSRKTFDTEYQKTKLNDWDFVV
>tr|M4KW32|M4KW32_BACIU Choline ABC transporter (ATP-binding protein) OS=Bacillus subtilis XF-1 GN=opuBA PE=4 SV=1 Split=0
MLTLENVSKTYKGGKKAVNNVNLKIAKGEFICFIGPSGCGKTTTMKMINRLIEPSAGKIFIDGENIMDQDPVELRRKIGYVIQQIGLFPHMTIQQNISLVPKLLKWPEQQRKERARELLKLVDMGPEYVDRYPHELSGGQQQRIGVLRALAAEPPLILMDEPFGALDPITRDSLQEEFKKLQKTLHKTIVFVTHDMDEAIKLADRIVILKAGEIVQVGTPDDILRNPADEFVEEFIGKERLIQSSSPDVERVDQIMNTQPVTITADKTLSEAIQLMRQERVDSLLVVDDEHVLQGYVDVEIIDQCRKKANLIGEVLHEDIYTVLGGTLLRDTVRKILKRGVKYVPVVDEDRRLIGIVTRASLVDIVYDSLWGEEKQLAALS
>sp|Q8AWH3|SX17A_XENTR Transcription factor Sox-17-alpha OS=Xenopus tropicalis GN=sox17a PE=2 SV=1 Split=0
MSSPDGGYASDDQNQGKCSVPIMMTGLGQCQWAEPMNSLGEGKLKSDAGSANSRGKAEARIRRPMNAFMVWAKDERKRLAQQNPDLHNAELSKMLGKSWKALTLAEKRPFVEEAERLRVQHMQDHPNYKYRPRRRKQVKRMKRADTGFMHMAEPPESAVLGTDGRMCLESFSLGYHEQTYPHSQLPQGSHYREPQAMAPHYDGYSLPTPESSPLDLAEADPVFFTSPPQDECQMMPYSYNASYTHQQNSGASMLVRQMPQAEQMGQGSPVQGMMGCQSSPQMYYGQMYLPGSARHHQLPQAGQNSPPPEAQQMGRADHIQQVDMLAEVDRTEFEQYLSYVAKSDLGMHYHGQESVVPTADNGPISSVLSDASTAVYYCNYPSA
I got it! :D
OSarr = []
G = 0
for i in range(len(keyLabelrun)):
OSarr.append(keyLabelrun[G])
G += 1
if keyLabelrun[G].count('='):
while keyLabelrun[G].count('OS=') != 1:
G+=1
Maybe next time everyone, thank you!
Due to the syntax, you have to keep track of which part (OS, PE, etc) you're currently parsing. Here's a function to extract the species name from the FASTA header:
def extract_species(description):
species_parts = []
is_os = False
for word in description.split():
if word[:3] == 'OS=':
is_os = True
species_parts.append(word[3:])
elif '=' in word:
is_os = False
elif is_os:
species_parts.append(word)
return ' '.join(species_parts)
You can call it when processing your input file, e.g.:
from Bio import SeqIO
for record in SeqIO.parse('input.fa', 'fasta'):
species = extract_species(record.description)

What is the simplest way to complete a function on every row of a large table?

so I want to do a fisher exact test (one sided) on every row of a 3000+ row table with a format matching the below example
gene
sample_alt
sample_ref
population_alt
population_ref
One
4
556
770
37000
Two
5
555
771
36999
Three
6
554
772
36998
I would ideally like to make another column of the table equivalent to
[(4+556)!(4+770)!(770+37000)!(556+37000)!]/[4!(556!)770!(37000!)(4+556+770+37000)!]
for the first row of data, and so on and so forth for each row of the table.
I know how to do a fisher test in R for simple 2x2 tables, but I wouldn't know how I would apply the fisher.test() function to each row of a large table. I also can't use an excel formula because the numbers get so big with the factorials that they reach excel's digit limit and result in a #NUM error. What's the best way to simply complete this? Thanks in advance!
Beginning with a tab-delimited text file on desktop (table.txt) with the same format as shown in the stem question
if(!require(psych)){install.packages("psych")}
multiFisher = function(file="Desktop/table.txt", saveit=TRUE,
outfile="Desktop/table.csv", progress=T,
verbose=FALSE, digits=3, ... )
{
require(psych)
Data = read.table(file, skip=1, header=F,
col.names=c("Gene", "MD", "WTD", "MC", "WTC"), ...)
if(verbose){print(str(Data))}
Data$Fisher.p = NA
Data$phi = NA
Data$OR1 = format(0.123, nsmall=3)
Data$OR2 = NA
if(progress){cat("\n")}
for(i in 1:length(Data$Gene)){
Matrix = matrix(c(Data$WTC[i],Data$MC[i],Data$WTD[i],Data$MD[i]), nrow=2)
Fisher = fisher.test(Matrix, alternative = 'greater')
Data$Fisher.p[i] = signif(Fisher$p.value, digits=digits)
Data$phi[i] = phi(Matrix, digits=digits)
OR1 = (Data$WTC[i]*Data$MD[i])/(Data$MC[i]*Data$WTD[i])
OR2 = 1 / OR1
Data$OR1[i] = format(signif(OR1, digits=digits), nsmall=3)
Data$OR2[i] = signif(OR2, digits=digits)
if(progress) {cat(".")}
}
if(progress){cat("\n"); cat("\n")}
if(saveit){write.csv(Data, outfile)}
return(Data)
}
multiFisher()

Numerical integration of a numpy array in incremental time steps

I have two arrays. The first one is time in terms of Age (yrs) and the second one is a parameter that needs to be integrated with respect to time.
age = [5.00000e+08, 5.60322e+08, 6.27922e+08, 7.03678e+08, 7.88572e+08,
8.83709e+08, 9.90324e+08, 1.10980e+09, 1.24369e+09, 1.39374e+09,
1.56188e+09, 1.75032e+09, 1.96148e+09, 2.19813e+09, 2.46332e+09,
2.76050e+09, 3.09354e+09, 3.46676e+09, 3.88501e+09, 4.35371e+09,
4.87897e+09, 5.46759e+09, 6.12722e+09, 6.86644e+09, 7.69484e+09,
8.62318e+09, 9.66352e+09, 1.08294e+10, 1.21359e+10, 1.36000e+10]
sfr = [1.86120543e-02, 1.46680445e-02, 1.07275184e-02, 8.56960274e-03,
6.44041855e-03, 4.93194263e-03, 3.69203448e-05, 2.69813985e-04,
6.17644783e-04, 1.00780427e-02, 1.20645391e-02, 3.05009362e-02,
3.91535011e-02, 5.35479858e-02, 7.36489068e-02, 9.63931263e-02,
1.11108326e-01, 1.47781221e-01, 1.63057763e-01, 2.27429626e-01,
2.20941333e-01, 2.74413180e-01, 2.72010867e-01, 4.32215233e-01,
5.79654549e-01, 7.39362218e-01, 9.41168727e-01, 1.18868347e+00,
1.42839043e+00, 1.91326333e+00]
I want to perform integration of sfr array with respect to age array, but in steps.
For example, the first integration should contain only the first elements of both arrays, the second integration should contain the first 2 elements of both arrays, the third should have first 3 elements of both arrays and so on and so forth. And save the integration result for each step in a single output array.
The exact form of your desired result is not so clear. So, here are 2 posibilities:
age = [5.00000e+08, 5.60322e+08, 6.27922e+08, 7.03678e+08, 7.88572e+08,
8.83709e+08, 9.90324e+08, 1.10980e+09, 1.24369e+09, 1.39374e+09,
1.56188e+09, 1.75032e+09, 1.96148e+09, 2.19813e+09, 2.46332e+09,
2.76050e+09, 3.09354e+09, 3.46676e+09, 3.88501e+09, 4.35371e+09,
4.87897e+09, 5.46759e+09, 6.12722e+09, 6.86644e+09, 7.69484e+09,
8.62318e+09, 9.66352e+09, 1.08294e+10, 1.21359e+10, 1.36000e+10]
sfr = [1.86120543e-02, 1.46680445e-02, 1.07275184e-02, 8.56960274e-03,
6.44041855e-03, 4.93194263e-03, 3.69203448e-05, 2.69813985e-04,
6.17644783e-04, 1.00780427e-02, 1.20645391e-02, 3.05009362e-02,
3.91535011e-02, 5.35479858e-02, 7.36489068e-02, 9.63931263e-02,
1.11108326e-01, 1.47781221e-01, 1.63057763e-01, 2.27429626e-01,
2.20941333e-01, 2.74413180e-01, 2.72010867e-01, 4.32215233e-01,
5.79654549e-01, 7.39362218e-01, 9.41168727e-01, 1.18868347e+00,
1.42839043e+00, 1.91326333e+00]
integr_pairs = [[(a, s) for a, s in zip(age[:i], sfr[:i])] for i in range(1, len(age))]
print(integr_pairs)
# [[(500000000.0, 0.0186120543)], [(500000000.0, 0.0186120543), (560322000.0, 0.0146680445)], ....
integr_list = [[item for t in [(a, s) for a, s in zip(age[:i], sfr[:i])] for item in t ]for i in range(1, len(age))]
print(integr_list)
# [[500000000.0, 0.0186120543], [500000000.0, 0.0186120543, 560322000.0, 0.0146680445],

Why this code doesn't print the single index?

We have the eng_stress and eng_strain arrays taken from excel file
eng_stress = np.array(eng_stress)
eng_strain = np.array(strain_percent / 100)
eng_strain = eng_strain + 1
true_stress = np.multiply(eng_stress, eng_strain)
true_strain = np.log(eng_strain)
print(true_stress[10])
When I try to acces to a certain index, something like the following happens instead of single outcome.
[466.12834181 466.2044319 466.27916323 466.35480041 466.43043758
466.50562183 466.58125901 466.65689618 466.73208043 466.80771761
466.8838077 466.95853903 467.03508204 467.10981338 467.18545055
467.26108772 467.33627198 467.41145623 467.48709341 467.56273058
467.63882067 467.71355201 467.78918918 467.86482635 467.94001061
468.0161007 468.09083203 468.16692212 468.2425593 468.31774355
468.39292781 468.4690179 468.54374923 468.61983932 468.69502358
468.77020783 468.84629792 468.92148218 468.99666643 469.07275652
469.14794078 469.22357795 469.29966804 469.37439938 469.45048947
469.52522081 469.6013109 469.67649515 469.75167941 469.82731658
469.90295375 469.97813801 470.05377518 470.12895943 470.20459661
470.2806867 470.35541803 470.43196104 470.50669238 470.58232955
470.65796672 470.7336039 470.80833523 470.88442532 470.95960958
471.03524675 471.11088392 471.18606818 471.26170535 471.33688961
471.41252678 471.48771103 471.56380112 471.63898538 471.71507547
471.78980681 471.8658969 471.94108115 472.01626541 472.09190258
472.16753975 472.24272401 472.3188141 472.39354543 472.46918261
472.5452727 472.62000403 472.69654704 472.77127838 472.84736847
472.92209981 472.99773698 473.07337415 473.14901132 473.22419558
473.30028567 473.37501701 473.45065418 473.52629135 473.60102269
473.6775657 473.75229703 473.82884004 473.9040243 473.97920855
474.05439281 474.1304829 474.20521423 474.28130432 474.35648858
474.43167283 474.50776292 474.58294718 474.65813143 474.73376861
474.80940578 474.88459003 474.96113304 475.03586438 475.11195447
475.18668581 475.2627759 475.33796015 475.41314441 475.48878158
475.56441875 475.63960301 475.7156931 475.79087735 475.86606161
475.9421517 476.01688303 476.09342604 476.16815738 476.24379455
476.31943172 476.3950689 476.46980023 476.54589032 476.62107458
476.69716467 476.77234892 476.84753318 476.92317035 476.99835461
477.0744447 477.14962895 477.22526612 477.30045038 477.37654047
477.45127181 477.5273619 477.60209323 477.67773041 477.75336758
477.82855183 477.90464192 477.9802791 478.05501043 478.13064761
478.20628478 478.28146903 478.35801204 478.43274338 478.50883347
478.58356481 478.65920198 478.73483915 478.81002341 478.88566058
478.96175067 479.03648201 479.1125721 479.18775635 479.26294061
479.3390307 479.41376203 479.48985212 479.5650363......... 532]
Maybe eng_stress is a 2D array?
Try:
print(eng_stress.shape)
to find out the shape of the arrays you are working with :)
If your array has the shape (X,1) then it might be in the wrong direction and you could do a quick fix by changing your code to:
eng_stress = np.array(eng_stress).T[0]
eng_strain = np.array(strain_percent / 100)
eng_strain = eng_strain + 1
true_stress = np.multiply(eng_stress, eng_strain)
true_strain = np.log(eng_strain)
print(true_stress[10])
Your numpy arrays may be 2-dimensional. That's why it's printing an array rather than a value. To access a single value of column x, try print(true_stress[10][x]).
The other thing you can do is multiply two 1D numpy arrays. In that case, you'll get a single value.

parsing dates from strings

I have a list of strings in python like this
['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
I want to parse only the date and time (for example, 2016-08-05 15:10:00 )from these strings.
So far I used a for loop like the one below but it's very time consuming, is there a better way to do this?
for files in glob.glob("AM_B0_*.flac.h5"):
if files[11]=='_':
year=files[12:16]
month=files[17:19]
day= files[20:22]
hour=files[23:25]
minute=files[25:27]
second=files[27:29]
tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year),int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
else:
year=files[11:15]
month=files[16:18]
day= files[19:21]
hour=files[22:24]
minute=files[24:26]
second=files[26:28]
tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year), int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
Try this (based on the 2nd last '-', no need of if-else case):
filesall = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
def find_second_last(text, pattern):
return text.rfind(pattern, 0, text.rfind(pattern))
for files in filesall:
start = find_second_last(files,'-') - 4 # from yyyy- part
timepart = (files[start:start+17]).replace("T"," ")
#insert 2 ':'s
timepart = timepart[:13] + ':' + timepart[13:15] + ':' +timepart[15:]
# print(timepart)
tindex=pd.date_range(start= timepart, periods=60, freq='10S')
In Place of using file[11] as hard coded go for last or 2nd last index of _ then use your code then you don't have to write 2 times same code. Or use regex to parse the string.

Resources