Genomics Analysis after Blast - linux

This is what my data look like: results obtained from a blastp search.
I have tried this command, but it does not give the output I want.
cat out_uniprot-proteome%3AUP000001415.fasta |grep -P -B5 'Identities = \d+/\d+\s\(([7-9]\d|100)%'
Query= sp|Q835H3|MUTS2_ENTFA Endonuclease MutS2 OS=Enterococcus faecalis
(strain ATCC 700802 / V583) OX=226185 GN=mutS2 PE=3 SV=1
Length=788
Score E
Sequences producing significant alignments: (Bits) Value
sp|O15457|MSH4_HUMAN MutS protein homolog 4 OS=Homo sapiens OX=... 109 8e-24
tr|A8K1E1|A8K1E1_HUMAN cDNA FLJ75589, highly similar to Homo sa... 107 4e-23
sp|P20585|MSH3_HUMAN DNA mismatch repair protein Msh3 OS=Homo s... 107 4e-23
tr|B4DSB9|B4DSB9_HUMAN cDNA FLJ51069, highly similar to DNA mis... 102 1e-21
tr|B4DL39|B4DL39_HUMAN cDNA FLJ57316, highly similar to DNA mis... 102 1e-21
tr|A0A2R8YFH0|A0A2R8YFH0_HUMAN DNA mismatch repair protein OS=H... 101 3e-21
tr|A0A2R8Y6P0|A0A2R8Y6P0_HUMAN DNA mismatch repair protein OS=H... 101 3e-21
tr|B4DN49|B4DN49_HUMAN DNA mismatch repair protein OS=Homo sapi... 101 3e-21
tr|E9PHA6|E9PHA6_HUMAN DNA mismatch repair protein OS=Homo sapi... 101 3e-21
sp|P43246|MSH2_HUMAN DNA mismatch repair protein Msh2 OS=Homo s... 101 3e-21
tr|Q53GS1|Q53GS1_HUMAN DNA mismatch repair protein (Fragment) O... 101 3e-21
tr|A0A2R8YG02|A0A2R8YG02_HUMAN DNA mismatch repair protein OS=H... 101 3e-21
tr|Q53FK0|Q53FK0_HUMAN DNA mismatch repair protein (Fragment) O... 100 6e-21
tr|B4DZX3|B4DZX3_HUMAN cDNA FLJ54211, highly similar to MutS pr... 90.1 5e-18
tr|A0A0G2JJ70|A0A0G2JJ70_HUMAN MSH5-SAPCD1 readthrough (NMD can... 89.7 5e-18
tr|A2ABF0|A2ABF0_HUMAN cDNA FLJ39914 fis, clone SPLEN2018732, h... 89.7 5e-18
tr|Q9UFG2|Q9UFG2_HUMAN Uncharacterized protein DKFZp434C1615 (F... 87.0 6e-18
tr|H0YF11|H0YF11_HUMAN MSH5-SAPCD1 readthrough (NMD candidate) ... 87.0 6e-18
> sp|O15457|MSH4_HUMAN MutS protein homolog 4 OS=Homo sapiens OX=9606
GN=MSH4 PE=1 SV=2
Length=936
Score = 109 bits (273), Expect = 8e-24, Method: Compositional matrix adjust.
Identities = 71/228 (31%), Positives = 118/228 (52%), Gaps = 8/228 (4%)
> tr|Q0QEN7|Q0QEN7_HUMAN ATP synthase subunit beta (Fragment) OS=Homo
sapiens OX=9606 GN=ATP5B PE=2 SV=1
Length=445
Score = 590 bits (1522), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 300/448 (67%), Positives = 357/448 (80%), Gaps = 12/448 (3%)
--
Query 423 SYVPVAETVRGFKEILEGKHDNLPEEAF 450
VP+ ET++GF++IL G++D+LPE+AF
Sbjct 416 KLVPLKETIKGFQQILAGEYDHLPEQAF 443
> tr|H0YH81|H0YH81_HUMAN ATP synthase subunit beta (Fragment) OS=Homo
sapiens OX=9606 GN=ATP5F1B PE=1 SV=1
Length=362
Score = 459 bits (1182), Expect = 1e-158, Method: Compositional matrix adjust.
Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)
--
Query 342 DPLASSSSALAPEIVGEEHYEVATEVQ 368
DPL S+S + P IVG EHY+VA VQ
Sbjct 336 DPLDSTSRIMDPNIVGSEHYDVARGVQ 362
> tr|F8W0P7|F8W0P7_HUMAN ATP synthase subunit beta, mitochondrial
(Fragment) OS=Homo sapiens OX=9606 GN=ATP5F1B PE=1 SV=2
Length=270
Score = 281 bits (720), Expect = 1e-90, Method: Compositional matrix adjust.
Identities = 137/168 (82%), Positives = 151/168 (90%), Gaps = 6/168 (4%)
--
Query 265 LGRMPSAVGYQPTLATEMGQLQERITSTKKGSITSIQAIYVPADDYTD 312
LGR+PSAVGYQPTLAT+MG +QERIT+TKKGSITS+QAIYVPADD TD
Sbjct 223 LGRIPSAVGYQPTLATDMGTMQERITTTKKGSITSVQAIYVPADDLTD 270
The output I want is:
Query= sp|Q835H3|MUTS2_ENTFA Endonuclease MutS2 OS=Enterococcus faecalis
(strain ATCC 700802 / V583) OX=226185 GN=mutS2 PE=3 SV=1
> tr|H0YH81|H0YH81_HUMAN ATP synthase subunit beta (Fragment) OS=Homo
sapiens OX=9606 GN=ATP5F1B PE=1 SV=1
Length=362
Score = 459 bits (1182), Expect = 1e-158, Method: Compositional matrix adjust.
Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)
> tr|F8W0P7|F8W0P7_HUMAN ATP synthase subunit beta, mitochondrial
(Fragment) OS=Homo sapiens OX=9606 GN=ATP5F1B PE=1 SV=2
Length=270
Score = 281 bits (720), Expect = 1e-90, Method: Compositional matrix adjust.
Identities = 137/168 (82%), Positives = 151/168 (90%), Gaps = 6/168 (4%)
I want the Query line together with the respective hits that have Identities of 70% or greater.
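grep -B5 prints a fixed window of lines, so it cannot carry the Query header along or drop the lower-identity hits that fall inside the window. A small parser that tracks the current query and the current hit is one way around this. The sketch below is a minimal illustration under stated assumptions: the function name filter_hits is hypothetical, the input is taken to be the default pairwise text report shown above, and only the first Identities line of each hit is inspected.

```python
import re

# Captures the percentage from e.g. "Identities = 228/327 (70%), ..."
IDENTITY = re.compile(r"Identities = \d+/\d+\s*\((\d+)%\)")

def filter_hits(lines, min_identity=70):
    """Return each Query header followed by its hits at or above min_identity."""
    kept = []
    query, hit = [], []
    in_query = in_hit = query_emitted = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information here
        if line.startswith("Query= "):
            query, in_query, in_hit, query_emitted = [line], True, False, False
        elif in_query:
            if line.startswith("Length="):
                in_query = False  # end of the (possibly wrapped) query header
            else:
                query.append(line)
        elif line.startswith(">"):
            hit, in_hit = [line], True
        elif in_hit:
            hit.append(line)
            found = IDENTITY.search(line)
            if found:
                in_hit = False  # only the first alignment of a hit is inspected
                if int(found.group(1)) >= min_identity:
                    if not query_emitted:
                        kept.extend(query)
                        query_emitted = True
                    kept.extend(hit)
    return kept

sample = """Query= sp|Q835H3|MUTS2_ENTFA Endonuclease MutS2 OS=Enterococcus faecalis
(strain ATCC 700802 / V583) OX=226185 GN=mutS2 PE=3 SV=1
Length=788
> sp|O15457|MSH4_HUMAN MutS protein homolog 4 OS=Homo sapiens OX=9606
GN=MSH4 PE=1 SV=2
Length=936
Score = 109 bits (273), Expect = 8e-24, Method: Compositional matrix adjust.
Identities = 71/228 (31%), Positives = 118/228 (52%), Gaps = 8/228 (4%)
> tr|H0YH81|H0YH81_HUMAN ATP synthase subunit beta (Fragment) OS=Homo
sapiens OX=9606 GN=ATP5F1B PE=1 SV=1
Length=362
Score = 459 bits (1182), Expect = 1e-158, Method: Compositional matrix adjust.
Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)
"""

print("\n".join(filter_hits(sample.splitlines())))
```

For a real report, pass the open file handle instead of sample.splitlines(). If re-running the search is an option, the BLAST+ tabular format (-outfmt 6) reports percent identity as its own column, which is usually easier to filter than the pairwise text.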

Meaning of NER Training values using Spacy

Please explain the meaning of the columns when training Spacy NER model:
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     78.11   26.82   22.88   32.41    0.27
 26     200         82.40   3935.97   94.44   94.44   94.44    0.94
 59     400         50.37   2338.60   94.91   94.91   94.91    0.95
 98     600         66.31   2646.82   92.13   92.13   92.13    0.92
146     800         85.11   3097.20   94.91   94.91   94.91    0.95
205    1000         92.20   3472.80   94.91   94.91   94.91    0.95
271    1200        124.10   3604.98   94.91   94.91   94.91    0.95
I know that ENTS_F ENTS_P and ENTS_R represent the F-score, precision, and recall respectively and the SCORE is the overall model score.
What is the formula for SCORE?
Where can I see the documentation about these columns?
What do the # and E columns stand for?
Please guide me or point me to the relevant docs. I didn't find proper documentation about the columns except here.
# refers to iterations (or batches), and E refers to epochs.
The score is calculated as a weighted average of other metrics, as designated in your config file. This is documented here.
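As a rough illustration of that weighted average, here is a minimal sketch. The helper weighted_score is hypothetical (it is not a spaCy function), and the weights are an assumption of the kind typically found under [training.score_weights] in a config for an NER-only pipeline; check your own config for the actual values. The table appears to display the ENTS_* columns multiplied by 100, while SCORE stays on a 0 to 1 scale.

```python
def weighted_score(metrics, score_weights):
    """Weighted sum of metrics; weights set to None (disabled) are skipped."""
    return sum(
        metrics.get(name, 0.0) * weight
        for name, weight in score_weights.items()
        if weight is not None
    )

# Assumed weights, of the kind found under [training.score_weights] in config.cfg
score_weights = {"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0, "ents_per_type": None}

# Metrics on a 0-1 scale, taken from a table row where ENTS_F/P/R are all 94.91
metrics = {"ents_f": 0.9491, "ents_p": 0.9491, "ents_r": 0.9491}

print(round(weighted_score(metrics, score_weights), 2))
```

With these assumed weights the result rounds to 0.95, which agrees with the SCORE column for the rows where ENTS_F is 94.91.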

How to calculate confidence intervals for crude survival rates?

Let's assume that we have a survfit object as follows.
fit = survfit(Surv(data$time_12m, data$status_12m) ~ data$group)
fit
Call: survfit(formula = Surv(data$time_12m, data$status_12m) ~ data$group)
                    n events median 0.95LCL 0.95UCL
data$group=HF   10000   3534     NA      NA      NA
data$group=IGT     70     20     NA      NA      NA
The fit object does not show CIs. How can I calculate confidence intervals for the survival rates? Which R packages and code should be used?
The printed result of survfit gives confidence intervals by group for the median survival time. I'm guessing the NAs for the median estimates occur because your groups do not have enough events to actually reach a median survival. You should show the output of plot(fit) to see whether my guess is correct.
You might try to plot the KM curves, noting that the plot.survfit function does have a confidence interval option constructed around proportions:
plot(fit, conf.int=0.95, col=1:2)
Please read ?summary.survfit. It belongs to the class of generic summary functions that package authors typically use to deliver parameter estimates and confidence intervals. There you will see that it is not "rates" that summary.survfit summarizes, but rather estimates of survival proportion. These can either be medians (in which case the estimate is on the time scale) or estimates at particular times (in which case the estimates are proportions).
If you actually do want rates, then use a function designed for that sort of model, perhaps ?survreg. Compare what you get from using survreg versus survfit on the supplied dataset ovarian:
> reg.fit <- survreg( Surv(futime, fustat)~rx, data=ovarian)
> summary(reg.fit)
Call:
survreg(formula = Surv(futime, fustat) ~ rx, data = ovarian)
             Value Std. Error     z       p
(Intercept)  6.265      0.778  8.05 8.3e-16
rx           0.559      0.529  1.06    0.29
Log(scale)  -0.121      0.251 -0.48    0.63
Scale= 0.886
Weibull distribution
Loglik(model)= -97.4 Loglik(intercept only)= -98
Chisq= 1.18 on 1 degrees of freedom, p= 0.28
Number of Newton-Raphson Iterations: 5
n= 26
#-------------
> fit <- survfit( Surv(futime, fustat)~rx, data=ovarian)
> summary(fit)
Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
                rx=1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   59     13       1    0.923  0.0739        0.789        1.000
  115     12       1    0.846  0.1001        0.671        1.000
  156     11       1    0.769  0.1169        0.571        1.000
  268     10       1    0.692  0.1280        0.482        0.995
  329      9       1    0.615  0.1349        0.400        0.946
  431      8       1    0.538  0.1383        0.326        0.891
  638      5       1    0.431  0.1467        0.221        0.840
                rx=2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  353     13       1    0.923  0.0739        0.789        1.000
  365     12       1    0.846  0.1001        0.671        1.000
  464      9       1    0.752  0.1256        0.542        1.000
  475      8       1    0.658  0.1407        0.433        1.000
  563      7       1    0.564  0.1488        0.336        0.946
It might have been easier if I had used "exponential" instead of "weibull" as the distribution type. Exponential fits have a single estimated parameter and are more easily back-transformed to give estimates of rates.
Note: I answered an earlier question about survfit, although the request there was for survival times rather than for rates: "Extract survival probabilities in Survfit by groups".
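To make the interval columns of the summary(fit) output concrete, here is a sketch in Python of how such numbers can be reproduced from the n.risk and n.event columns alone. It assumes the Greenwood variance formula and a confidence interval built on the log(survival) scale, which is the default conf.type of survfit; the function name km_with_ci is hypothetical.

```python
import math
from statistics import NormalDist

def km_with_ci(steps, level=0.95):
    """steps: (time, n_risk, n_event) per event time, in time order.

    Returns rows of (time, survival, std_err, lower, upper).
    Assumes n_event < n_risk at every step.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    survival, greenwood = 1.0, 0.0
    rows = []
    for time, n_risk, n_event in steps:
        survival *= 1 - n_event / n_risk                      # Kaplan-Meier product
        greenwood += n_event / (n_risk * (n_risk - n_event))  # Greenwood sum
        se_log = math.sqrt(greenwood)                         # std. error of log(survival)
        lower = survival * math.exp(-z * se_log)
        upper = min(1.0, survival * math.exp(z * se_log))     # capped at 1
        rows.append((time, survival, survival * se_log, lower, upper))
    return rows

# (time, n.risk, n.event) for rx=1, copied from the summary(fit) output
rx1 = [(59, 13, 1), (115, 12, 1), (156, 11, 1), (268, 10, 1),
       (329, 9, 1), (431, 8, 1), (638, 5, 1)]

for time, s, se, lo, hi in km_with_ci(rx1):
    print(f"{time:5d} {s:.3f} {se:.4f} {lo:.3f} {hi:.3f}")
```

The printed columns correspond to survival, std.err, lower 95% CI and upper 95% CI for the rx=1 group. In R itself none of this is needed: summary(fit) already reports the intervals.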

Hi, I'm trying to write code that will extract information from invoices that are saved as text files

import re

name = input("Enter file:")
if len(name) < 1:
    name = "AWM2.txt"
handle = open(name)
lifts = 0
for line in handle:
    line = line.rstrip()
    type = re.findall(r'^\d{1,2}?.\d{1,2}?.\d{1,2}\s*', line)
    if len(type) > 0:
        print(type)
    #type = re.findall('.*\n.*\s([M].[W])', line)
    lift = re.findall(r'^\d{1,2}.\d{1,2}.\d{1,2}\s*\d{1,5}\s*(\d{1,})\s.*\n.*[M].*[l]', line)
    if len(lift) > 0:
        print(line)
        print("Lift", lift)
        lifts = lifts + int(lift[0])
print("Total Lifts", lifts)
What I'm struggling with is using findall on data that is spread over two lines: the number I want to count should only be counted when the following line contains the text 'MMW Commercial', not when it contains 'MDR Commercial'. It all goes fine until my regex contains \n to look at the following line. Any ideas to help?
I'm looking to extract a number from a text file. The text looks like the sample below: in the first entry, the number I'm trying to extract is in bold, and the piece of text that identifies it as the correct number to extract is italicised.
XXX GROUP
Supervalu 393
Ardee Shopping Centre
Ash Walk
Ardee
A92 W56E
Invoice No.: OUT-10618 Invoice Date: 29-02-20
Supervalu 393 (393), Ardee Shopping Centre, Ash Walk, Ardee, A92 W56E
05/02/20 67879 **3** SLBIN 1100 (Customers Own Bin) 127.95
FC 22.00 _MMW Commercial_ 66.00
PPKG 0.1500 MMW Commercial 413 kg 61.95
12/02/20 69770 5 SLBIN 1100 (Customers Own Bin) 110.00
FC 22.00 MMW Commercial 110.00
PPKG 0.1500 MMW Commercial 0 kg 0.00
19/02/20 71619 4 SLBIN 1100 (Customers Own Bin) 128.50
FC 22.00 MMW Commercial 88.00
PPKG 0.1500 MMW Commercial 270 kg 40.50
26/02/20 73458 4 SLBIN 1100 (Customers Own Bin) 134.23
FC 22.00 MMW Commercial 88.00
PPKG 0.1605 MMW Commercial 288 kg 46.23
Bin Services Sub Total: 500.68
03/02/20 67077 2 SLBIN 1100 (Customers Own Bin) 20.00
FC 10.00 MDR Commercial 20.00
10/02/20 69074 3 SLBIN 1100 (Customers Own Bin) 30.00
FC 10.00 MDR Commercial 30.00
17/02/20 70884 3 SLBIN 1100 (Customers Own Bin) 30.00
FC 10.00 MDR Commercial 30.00
24/02/20 72713 2 SLBIN 1100 (Customers Own Bin) 20.00
FC 10.00 MDR Commercial 20.00
Bin Services Sub Total: 100.00
11/02/20 69381 1 SLBIN 1 (Baled) 113.78
P/T 56.89 Packaging Cardboard Baled 2000 kg 113.78
26/02/20 73007 1 SLBIN 1 (Baled) 204.80
P/T 56.89 Packaging Cardboard Baled 3600 kg 204.80
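You are iterating over the lines in your file:
for line in handle:
so each string you search holds a single line, and a pattern containing \n can never match. Instead, read the whole file with handle.read() and call re.findall on that string. findall returns a list, so just iterate over the matches. Two changes to the pattern are needed as well: pass re.MULTILINE so that ^ matches at the start of every line, and match 'MMW Commercial' literally, because [M].*[l] also matches 'MDR Commercial'.
import re

lifts = 0
with open(name) as handle:
    data = handle.read()
# re.MULTILINE makes ^ match at the start of each line, not only at the start of the file
pattern = r'^\d{1,2}.\d{1,2}.\d{1,2}\s*\d{1,5}\s*(\d{1,})\s.*\n.*MMW Commercial'
matches = re.findall(pattern, data, re.MULTILINE)
for lift in matches:
    print("Lift", lift)
    lifts = lifts + int(lift)
for line in data.splitlines():
    pass  # do whatever else you still need line-wise
print("Total Lifts", lifts)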
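An alternative that avoids a multi-line regex altogether is to pair each line with the one that follows it and test the two separately. The sketch below is only an illustration: the function name count_lifts and the marker parameter are hypothetical, and the pattern assumes that a service line starts with a dd/mm/yy date, then a reference number, then the lift count.

```python
import re

# A service line: date, reference number, then the lift count (captured)
SERVICE_LINE = re.compile(r"^\d{1,2}/\d{1,2}/\d{1,2}\s+\d{1,5}\s+(\d+)\s")

def count_lifts(text, marker="MMW Commercial"):
    """Sum the lift counts of service lines whose next line contains marker."""
    lines = text.splitlines()
    total = 0
    # zip pairs every line with its successor: (line 1, line 2), (line 2, line 3), ...
    for current, following in zip(lines, lines[1:]):
        match = SERVICE_LINE.match(current)
        if match and marker in following:
            total += int(match.group(1))
    return total

sample = """05/02/20 67879 3 SLBIN 1100 (Customers Own Bin) 127.95
FC 22.00 MMW Commercial 66.00
PPKG 0.1500 MMW Commercial 413 kg 61.95
12/02/20 69770 5 SLBIN 1100 (Customers Own Bin) 110.00
FC 22.00 MMW Commercial 110.00
03/02/20 67077 2 SLBIN 1100 (Customers Own Bin) 20.00
FC 10.00 MDR Commercial 20.00"""

print("Total Lifts", count_lifts(sample))
```

Because the marker is checked with a plain substring test, counting the 'MDR Commercial' lifts instead only requires passing a different marker.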

Decimal Point Normalization in Python

I am trying to normalize my data and have tried the conventional scaling techniques readily available in sklearn. However, I would like to implement something called decimal scaling.
I read about it in this research paper, and it looks like a technique that can improve the results of a neural network regression. As I understand it, this is what needs to be done:
Suppose the range of attribute X is −4856 to 28. The maximum absolute value of X is 4856.
To normalize by decimal scaling I will need to divide each value by 10000 (c = 4). In this case, −4856 becomes −0.4856 while 28 becomes 0.0028.
So for all values: new value = old value/ 10^c
How can I reproduce this as a function in Python so as to normalize all the features (column by column) in my data set?
Input:
A B C
30 90 75
56 168 140
28 84 70
369 1107 922.5
485 1455 1212.5
4856 14568 12140
40 120 100
56 168 140
45 135 112.5
78 234 195
899 2697 2247.5
Output:
A B C
0.003 0.0009 0.0075
0.0056 0.00168 0.014
0.0028 0.00084 0.007
0.0369 0.01107 0.09225
0.0485 0.01455 0.12125
0.4856 0.14568 1.214
0.004 0.0012 0.01
0.0056 0.00168 0.014
0.0045 0.00135 0.01125
0.0078 0.00234 0.0195
0.0899 0.02697 0.22475
Thank you all for asking questions, which led me to think about my problem more clearly and break it into steps. I have arrived at a solution. Here is what it looks like:
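import math

def Dec_scale(df):
    for x in df:
        p = df[x].abs().max()                # largest absolute value in the column
        q = math.floor(math.log10(p)) + 1    # digits before the decimal point (assumes p > 0)
        df[x] = df[x] / 10**q
    return df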
I hope this solution looks agreeable!
import math
import numpy as np

def decimal_scaling(df):
    df_abs = abs(df)
    max_values = df_abs.max()
    log_num = []
    for i in range(max_values.shape[0]):
        # number of digits before the decimal point of the column maximum
        log_num.append(int(math.log10(max_values.iloc[i])) + 1)
    log_num = np.array(log_num)
    log_num = [pow(10, number) for number in log_num]
    X_full = df / log_num
    return X_full
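For checking either pandas version against the worked example in the question (−4856 becomes −0.4856 and 28 becomes 0.0028), a dependency-free sketch of the same rule can be handy. The function names are hypothetical, and the exponent is found with a loop rather than log10 to sidestep rounding at exact powers of ten.

```python
def decimal_scale(values):
    """Divide values by the smallest power of ten that brings every |v| below 1."""
    largest = max(abs(v) for v in values)
    exponent = 0
    while largest / 10 ** exponent >= 1:
        exponent += 1
    return [v / 10 ** exponent for v in values]

def decimal_scale_columns(table):
    """table: dict mapping a column name to a list of numbers."""
    return {name: decimal_scale(column) for name, column in table.items()}

print(decimal_scale([-4856, 28]))
print(decimal_scale_columns({"A": [30, 4856], "C": [75, 12140]}))
```

Each column gets its own exponent, so a column whose largest absolute value is 12140 is divided by 100000 and every scaled value stays below 1 in absolute value.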

Incorrect Empirical Semivariogram Value

My gstat program for calculating the empirical semivariogram of the Walker Lake data is as follows:
data = read.table("C:/Users/chandan/Desktop/walk470.csv",sep=",",header=TRUE);
attach(data);
coordinates(data)=~x+y;
walk.var1 <- variogram(v ~ x+y, data=data,width=5,cutoff=100);
The result is as follows:
     np      dist    gamma
1   105  3.836866 32312.63
2   459  8.097102 44486.82
3  1088 12.445035 60230.48
4   985 17.874264 76491.36
5  1579 22.227711 75103.67
6  1360 27.742246 83595.83
7  1747 32.291155 91248.20
8  1447 37.724524 97610.65
9  2233 42.356048 85857.03
10 1794 47.537644 93263.63
11 2180 52.295711 98282.98
12 2075 57.601882 91589.39
13 2848 62.314646 91668.70
14 2059 67.627847 95803.45
15 2961 72.310575 91975.76
16 2240 77.648900 95858.87
17 3067 82.379802 88123.56
18 2463 87.641359 87568.94
19 2746 92.334788 97991.56
20 2425 97.754121 93914.31
I have written my own version of the same problem using the classical sample variogram estimator. The number of points (np) and the dist values come out exactly as in the output above, but the gamma values are not the same. Why is that, and what should I do to make them match the gstat output exactly?
Thanks in advance...
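For reference, the classical (Matheron) estimator averages half the squared differences over the N(h) pairs in each distance bin: gamma(h) = sum((z_i - z_j)^2) / (2 N(h)). A minimal Python sketch follows. The function name and the binning convention (bin k covers distances in (k*width, (k+1)*width]) are assumptions, not gstat's implementation. One thing worth checking, offered as a likely explanation rather than a confirmed one: in gstat the formula v ~ x+y requests the variogram of the residuals from a linear trend in x and y, whereas v ~ 1 gives the variogram of the raw values, which is what this estimator computes.

```python
import math
from itertools import combinations

def empirical_semivariogram(points, width, cutoff):
    """points: iterable of (x, y, v) tuples.

    Returns a list of (np, mean_dist, gamma) for every non-empty bin.
    """
    n_bins = math.ceil(cutoff / width)
    counts = [0] * n_bins
    dist_sums = [0.0] * n_bins
    sq_sums = [0.0] * n_bins
    for (x1, y1, v1), (x2, y2, v2) in combinations(points, 2):
        d = math.hypot(x1 - x2, y1 - y2)
        if d == 0 or d > cutoff:
            continue  # skip coincident points and pairs beyond the cutoff
        # Assumed binning: bin k covers (k*width, (k+1)*width]
        k = min(math.ceil(d / width) - 1, n_bins - 1)
        counts[k] += 1
        dist_sums[k] += d
        sq_sums[k] += (v1 - v2) ** 2
    return [
        (counts[k], dist_sums[k] / counts[k], sq_sums[k] / (2 * counts[k]))
        for k in range(n_bins)
        if counts[k] > 0
    ]

# Three collinear points with made-up values, purely for illustration
toy = [(0, 0, 1), (1, 0, 3), (2, 0, 6)]
for row in empirical_semivariogram(toy, width=1.5, cutoff=3):
    print(row)
```

If a hand-written estimator of this kind agrees with gstat on np and dist but not on gamma, comparing against variogram(v ~ 1, ...) is a quick way to test whether the trend term accounts for the difference.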
