I would like to find consecutive numbers in column A and Column B in python, Column A should be ascending but Column B is descending. I am attaching an example file.
Input file
nucleotide
Pos_A
Pos_B
Connection_Pos20_Pos102
20
102
Connection_Pos19_Pos102
19
102
Connection_Pos20_Pos101
20
101
Connection_Pos18_Pos102
18
102
Connection_Pos19_Pos101
19
101
Connection_Pos20_Pos100
20
100
Connection_Pos17_Pos102
17
102
Connection_Pos18_Pos101
18
101
Connection_Pos19_Pos100
19
100
Connection_Pos20_Pos99
20
99
Connection_Pos16_Pos102
16
102
Connection_Pos17_Pos101
17
101
Connection_Pos18_Pos100
18
100
Connection_Pos19_Pos99
19
99
Connection_Pos20_Pos98
20
98
Connection_Pos15_Pos102
15
102
Connection_Pos16_Pos101
16
101
Connection_Pos17_Pos100
17
100
Connection_Pos18_Pos99
18
99
Connection_Pos19_Pos98
19
98
Connection_Pos20_Pos97
20
97
Connection_Pos14_Pos102
14
102
Connection_Pos15_Pos101
15
101
Connection_Pos16_Pos100
16
100
Output:
nucleotide
Pos_A
Pos_B
Consecutive ID
Consecutive Number (Size)
Connection_Pos20_Pos102
20
102
101
1
Connection_Pos19_Pos102
19
102
100
2
Connection_Pos20_Pos101
20
101
100
2
Connection_Pos18_Pos102
18
102
99
3
Connection_Pos19_Pos101
19
101
99
3
Connection_Pos20_Pos100
20
100
99
3
Connection_Pos17_Pos102
17
102
98
4
Connection_Pos18_Pos101
18
101
98
4
Connection_Pos19_Pos100
19
100
98
4
Connection_Pos20_Pos99
20
99
98
4
Connection_Pos16_Pos102
16
102
97
5
Connection_Pos17_Pos101
17
101
97
5
Connection_Pos18_Pos100
18
100
97
5
Connection_Pos19_Pos99
19
99
97
5
Connection_Pos20_Pos98
20
98
97
5
Connection_Pos15_Pos102
15
102
96
6
Connection_Pos16_Pos101
16
101
96
6
Connection_Pos17_Pos100
17
100
96
6
Connection_Pos18_Pos99
18
99
96
6
Connection_Pos19_Pos98
19
98
96
6
Connection_Pos20_Pos97
20
97
96
6
Connection_Pos14_Pos102
14
102
95
7
Connection_Pos15_Pos101
15
101
95
7
Connection_Pos16_Pos100
16
100
95
7
Connection_Pos17_Pos99
17
99
95
7
Connection_Pos18_Pos98
18
98
95
7
Connection_Pos19_Pos97
19
97
95
7
Connection_Pos20_Pos96
20
96
95
7
For Consecutive ID, if Pos_B's shifted difference != 1, then we want to subtract 1, so we mark those indexes as -1 with mul(-1) and cumsum them:
df['ID'] = df.Pos_B.shift().sub(df.Pos_B).ne(1).mul(-1).cumsum() + df.Pos_B[0]
For Consecutive Number, if Pos_A's shifted difference != -1, then we want to add 1, so we mark those indexes as 1 and cumsum again:
df['Number'] = df.Pos_A.shift().sub(df.Pos_A).ne(-1).mul(1).cumsum()
Result:
nucleotide Pos_A Pos_B ID Number
0 Connection_Pos20_Pos102 20 102 101 1
1 Connection_Pos19_Pos102 19 102 100 2
2 Connection_Pos20_Pos101 20 101 100 2
3 Connection_Pos18_Pos102 18 102 99 3
4 Connection_Pos19_Pos101 19 101 99 3
5 Connection_Pos20_Pos100 20 100 99 3
6 Connection_Pos17_Pos102 17 102 98 4
7 Connection_Pos18_Pos101 18 101 98 4
8 Connection_Pos19_Pos100 19 100 98 4
9 Connection_Pos20_Pos99 20 99 98 4
10 Connection_Pos16_Pos102 16 102 97 5
11 Connection_Pos17_Pos101 17 101 97 5
12 Connection_Pos18_Pos100 18 100 97 5
13 Connection_Pos19_Pos99 19 99 97 5
14 Connection_Pos20_Pos98 20 98 97 5
15 Connection_Pos15_Pos102 15 102 96 6
16 Connection_Pos16_Pos101 16 101 96 6
17 Connection_Pos17_Pos100 17 100 96 6
18 Connection_Pos18_Pos99 18 99 96 6
19 Connection_Pos19_Pos98 19 98 96 6
20 Connection_Pos20_Pos97 20 97 96 6
21 Connection_Pos14_Pos102 14 102 95 7
22 Connection_Pos15_Pos101 15 101 95 7
23 Connection_Pos16_Pos100 16 100 95 7
Do it one by one then groupby with ngroup
s1 = df.Pos_A.diff().le(0).cumsum()
s2 = df.Pos_B.diff().ge(0).cumsum()
df['out'] = df.groupby([s1,s2]).ngroup()+1
Out[452]:
0 1
1 2
2 2
3 3
4 3
5 3
6 4
7 4
8 4
9 4
10 5
11 5
12 5
13 5
14 5
15 6
16 6
17 6
18 6
19 6
20 6
21 7
22 7
23 7
24 7
25 7
26 7
27 7
dtype: int64
I got two files.
file 1:
4
14
18
45
53
60
64
102
106
158
162
file2:
28 1 2
54 1 2
90 1 1
103 1 1
155 1 17
191 1 1
235 1 1
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
529 57 3
542 53 1
560 58 6
562 164 25
568 164 5
I want to extract the value from file2 if the second column of file two matches the value in file 1.
So the expected output will be:
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
542 53 1
I saw many of the solution online is using python or Perl, however, I want to use linux command to do this, any idea?
This should do it?
awk 'FNR==NR{a[$0]++};FNR!=NR{if($2 in a){print}}' file1 file2
245 4 1
275 4 1
362 4 1
377 18 1
391 18 1
413 18 2
466 18 2
492 18 2
494 18 41
498 45 1
522 45 1
542 53 1
Explanation:
we hand awk both files (order is important in this case!).
as long as we read the first file (FNR==NR) we store each value in an array a[$1]++
when we reach the second file we just check if values from the second file's second column ($2) are in the array; if yes, we print them.
I want to extract all the strings between gi| and |. The position of the string is consistent in all the lines.
I am trying this one:
cat ERR594382_second_cat.test | sed -n '/gi\|/,/\|/p'
But, it is not working.
Here is a head of my file :
head ERR594382_second_cat.test
ERR594382.28316455_3_6_1 gi|914605561|ref|WP_050599988.1| 22 54 67 99 4.03e-15 77.0 100.000 33 0 0 225971;1306953 Bacteria Erythrobacter citreus;Erythrobacter citreus LAMA 915 ribonuclease D [Erythrobacter citreus]
ERR594382.28316455_65_2_3 gi|914605561|ref|WP_050599988.1| 13 46 11 44 2.15e-17 82.8 100.000 34 0 0 225971;1306953 Bacteria Erythrobacter citreus;Erythrobacter citreus LAMA 915 ribonuclease D [Erythrobacter citreus]
ERR594382.28316459_1_1_2 gi|1270336953|gb|PHR32068.1| 8 53 863 903 6.98e-08 56.6 63.043 46 12 1 2024840 Bacteria Methylophaga sp. phosphohydrolase [Methylophaga sp.]
ERR594382.28316464_2_2_3 gi|705244733|gb|AIW56710.1| 2 33 145 176 5.76e-12 67.8 93.750 32 2 0 340016 Viruses uncultured virus ribonucleotide reductase, partial [uncultured virus]
ERR594382.28316464_53_5_5 gi|1200458341|gb|OUV73944.1| 1 31 557 587 9.54e-11 64.3 80.645 31 6 0 1986721 Bacteria Flavobacteriales bacterium TMED123 hypothetical protein CBC83_04720 [Flavobacteriales bacterium TMED123]
ERR594382.28316465_3_3_2 gi|787065740|dbj|BAR36435.1| 1 46 204 249 5.55e-10 63.2 58.696 46 19 0 1407671 Viruses uncultured Mediterranean phage uvMED hypothetical protein [uncultured Mediterranean phage uvMED]
ERR594382.28316465_67_4_3 gi|787065740|dbj|BAR36435.1| 2 34 224 256 1.31e-07 55.1 66.667 33 11 0 1407671 Viruses uncultured Mediterranean phage uvMED hypothetical protein [uncultured Mediterranean phage uvMED]
ERR594382.28316466_18_6_3 gi|1200295886|gb|OUU17830.1| 1 33 92 124 1.73e-12 70.1 100.000 33 0 0 1986638 Bacteria Alphaproteobacteria bacterium TMED37 hypothetical protein CBB97_21775 [Candidatus Endolissoclinum sp. TMED37]
ERR594382.28316470_37_1_1 gi|787067413|dbj|BAR37857.1| 16 43 60 87 1.94e-09 58.9 96.429 28 1 0 1407671 Viruses uncultured Mediterranean phage uvMED terminase large subunit [uncultured Mediterranean phage uvMED]
ERR594382.28316474_2_5_1 gi|1219813777|gb|ASN63501.1| 1 33 62 94 3.55e-12 64.3 81.818 33 6 0 340016 Viruses uncultured
You could use grep or / pcregrep (in case of using macOS) with this:
pcregrep -o "gi\|\K.+?(?=\|)" file
or with:
grep -oP "gi\|\K.+?(?=\|)" file
The \K can be read as excluding everything to the left before it and return only the right part .+, and then the .+?(?=\|) match any characters until | is found.
The easiest way if only your delimiter is fixed could be with cut:
cut -f2 -d"|" file
I've been trying to figure out how to SUM the top 2 values of an array using SUMPRODUCT but I also want to add a criteria that will only sum the product if it matches a specific string. I thought I could combine SUMPRODUCT and SUMIF but I have been unsuccessful.
Position Age ADP Trend Value
QB 23 241 84.2 21
QB 35 185 -37.5 142
QB 27 300 25 19
QB 26 300 25 19
QB 32 300 25 19
RB 22 98 -2.2 1051
RB 24 69 0.3 1929
RB 24 238 6 25
RB 26 300 25 19
RB 26 300 25 19
WR 22 300 25 19
WR 24 300 25 19
WR 26 232 -17 36
WR 25 300 25 19
WR 28 300 25 19
WR 23 9 -4.2 8591
WR 23 178 21.4 161
WR 23 38 8.5 4679
WR 26 222 102.8 53
WR 23 300 25 19
WR 26 300 25 19
TE 26 117 -18.7 617
TE 36 193 -30.3 119
TE 26 199 -22.5 105
TE 24 300 25 19
What I want is to SUM the top two values under the Value column IF the Position = QB.
How can I accomplish this?
Cheers!
Use this array formula:
=SUM(LARGE(IF(A2:A25="QB",E2:E25,""),1),LARGE(IF(A2:A25="QB",E2:E25,""),2))
Press CTRL+SHIFT+ENTER to evaluate the formula as it is an array formula.
I figured out a way to backcast (ie. predicting the past) with a time series. Now I'm just struggling with the programming in R.
I would like to reverse the time series data so that I can forecast the past. How do I do this?
Say the original time series object looks like this:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2008 116 99 115 101 112 120 120 110 143 136 147 142
2009 117 114 133 134 139 147 147 131 125 143 136 129
I want it to look like this for the 'backcasting':
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2008 129 136 143 125 131 147 147 139 134 133 114 117
2009 142 147 136 143 110 120 120 112 101 115 99 116
Note, I didn't forget to change the years - I am basically mirroring/reversing the data and keeping the years, then going to forecast.
I hope this can be done in R? Or should I export and do it in Excel somehow?
Try this:
tt <- ts(1:24, start = 2008, freq = 12)
tt[] <- rev(tt)
ADDED. This also works and does not modify tt :
replace(tt, TRUE, rev(tt))
You can just coerce the matrix to a vector, reverse it, and make it a matrix again. Here's an example:
mat <- matrix(seq(24),nrow=2,byrow=TRUE)
> mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 1 2 3 4 5 6 7 8 9 10 11 12
[2,] 13 14 15 16 17 18 19 20 21 22 23 24
> matrix( rev(mat), nrow=nrow(mat) )
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 24 23 22 21 20 19 18 17 16 15 14 13
[2,] 12 11 10 9 8 7 6 5 4 3 2 1
I found this post of Hyndman under http://www.r-bloggers.com/backcasting-in-r/ and am basically pasting in his solution, which in my opinion provids a complete answer to you question.
library(forecast)
x <- WWWusage
h <- 20
f <- frequency(x)
# Reverse time
revx <- ts(rev(x), frequency=f)
# Forecast
fc <- forecast(auto.arima(revx), h)
plot(fc)
# Reverse time again
fc$mean <- ts(rev(fc$mean),end=tsp(x)[1] - 1/f, frequency=f)
fc$upper <- fc$upper[h:1,]
fc$lower <- fc$lower[h:1,]
fc$x <- x
# Plot result
plot(fc, xlim=c(tsp(x)[1]-h/f, tsp(x)[2]))