How to find all Common substrings which is 3 chars or longer - string

Are there an efficient algorithm to search and dump all common substrings (which length is 3 or longer) between 2 strings?
Example input:
Length : 0 5 10 15 20 25 30
String 1 : ABC-DEF-GHI-JKL-ABC-ABC-STU-MWX-Y
String 2 : ABC-JKL-MNO-ABC-DEF-PQR-DEF-ZWX-Y
Example output:
In string 1 2
---------------------------
ABC-DEF 0 12
ABC-DE 0 12
BC-DEF 1 13
:
-ABC- 15,19 11
-JKL- 11 3
-DEF- 3 15
-JKL 11 3
JKL- 12 4
-DEF 3 15,23
DEF- 4 16
WX-Y 29 29
ABC- 0,16,20 0,12
-ABC 15,19 11
DEF- 4 16,24
DEF 4 16,24
ABC 0,16,20 0,12
JKL 12 4
WX- 29 29
X-Y 30 30
-AB 15,19 11
BC- 1,17,21 1,13
-DE 3 15,23
EF- 5 17,25
-JK 11 3
KL- 13 5
:
In the example, "-D", "-M" is also a common substring but is not required, because it's length is only 2. (There might be some missing outputs in example because there are so many of them...)

You can find all common substrings using a data structure called a Generalized suffix tree
Libstree contains some example code for finding the longest common substring. That example code can be modified to obtain all common substrings.

Related

What do "!" and "." mean in BASIC?

Trying to translate BASIC code written in the 1990's to Python. I keep coming across two symbols, ! (exclamation mark) and . (period). I can't find any documentation online on what they do.
I have the code running but some of the outputs are not as expected - I am wondering if these might be the issue as I previously thought that the period may just be a typo for a multiplication.
Examples:
|
v
QWLOST = (((TW-TDAO)/(TWRT-TDAOR))^1.25)*((VISR/VIS)^0.25).(PW+PE)*DT
TFAVE = (TTO+TBO)/2!
^
|
In case anyone else in the future needs to know this.
! - defines a single
. - Was just a typo for * (multiplication)
I tried a few things in bwBasic (in Linux, in case that's relevant!).
bwBASIC: list
10: for i = 1 to 20
20: print i, ., . - i
30: next i
40: print ".="; .
This gave me:
bwBASIC: run
1 20 19
2 20 18
3 20 17
4 20 16
5 20 15
6 20 14
7 20 13
8 20 12
9 20 11
10 20 10
11 20 9
12 20 8
13 20 7
14 20 6
15 20 5
16 20 4
17 20 3
18 20 2
19 20 1
20 20 0
.= 20
Which would suggest that . (in bwBasic in any case) is the max number in a for loop.

Efficient Reading of Input File

Currently for a task, I am working with input files which give Matrix related test cases (Matrix Multiplication) i.e., example of an input file ->
N M
1 3 5 ... 6 (M columns)
....
5 4 2 ... 1 (N rows)
I was using simple read() to access them till now, but this is not efficient for large files of size > 10^2.
So I wanted to know is there some way to use processes to do this in parallel.
Also I was thinking of using multiple IO readers based on line, so then each process could read different segments of the file but couldn't find any helpful resources.
Thank you.
PS: Current code is using this:
io:fread(IoDev, "", "~d")
Did you consider to use re module? I did not make a performance test, but it may be efficient. In the following example I do not use the first "M N" line. So I did not put it in the matrix.txt file.
matrix file:
1 2 3 4 5 6 7 8 9
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
I made the conversion in the shell
1> {ok,B} = file:read_file("matrix.txt"). % read the complete file and store it in a binary
{ok,<<"1 2 3 4 5 6 7 8 9\r\n11 12 13 14 15 16 17 18 19\r\n21 22 23 24 25 26 27 28 29\r\n31 32 33 34 35 36 37 38 39">>}
2> {ok,ML} = re:compile("[\r\n]+"). % to split the complete binary in a list a binary, one for each line
{ok,{re_pattern,0,0,0,
<<69,82,67,80,105,0,0,0,0,0,0,0,1,8,0,0,255,255,255,255,
255,255,...>>}}
3> {ok,MN} = re:compile("[ ]+"). % to split the line into binaries one for each integer
{ok,{re_pattern,0,0,0,
<<69,82,67,80,73,0,0,0,0,0,0,0,17,0,0,0,255,255,255,255,
255,255,...>>}}
4> % a function to split a line and convert each chunk into integer
4> F = fun(Line) -> Nums = re:split(Line,MN), [binary_to_integer(N) || N <- Nums] end.
#Fun<erl_eval.7.126501267>
5> Lines = re:split(B,ML). % split the file into lines
[<<"1 2 3 4 5 6 7 8 9">>,<<"11 12 13 14 15 16 17 18 19">>,
<<"21 22 23 24 25 26 27 28 29">>,
<<"31 32 33 34 35 36 37 38 39">>]
6> lists:map(F,Lines). % map the function to each lines
[[1,2,3,4,5,6,7,8,9],
[11,12,13,14,15,16,17,18,19],
[21,22,23,24,25,26,27,28,29],
[31,32,33,34,35,36,37,38,39]]
7>
if you want to check the matrix size, you can replace the last line with:
[[NbRows,NbCols]|Matrix] = lists:map(F,Lines),
case (length(Matrix) == NbRows) andalso
lists:foldl(fun(X,Acc) -> Acc andalso (length(X) == NbCols) end,true,Matrix) of
true -> {ok,Matrix};
_ -> {error_size,Matrix}
end.
is there some way to use processes to do this in parallel.
Of course.
Also I was thinking of using multiple IO readers based on line, so
then each process could read different segments of the file but
couldn't find any helpful resources.
You don't seek to positions in a file by line, rather you seek to byte positions. While a file may look like a bunch of lines, a file is actually just one long sequence of characters. Therefore, you will need to figure out what byte positions you want to seek to in the file.
Check out file:position, file:pread.

How to generate 3 natural number that sum to 60 using awk

I am trying to write awk script that generate 3 natural numbers that sum to 60. I am trying with rand function but I`ve got problem with sum to 60
Here is one way:
awk -v n=60 'BEGIN{srand();a=int(rand()*n);b=int(rand()*(n-a));c=n-a-b;
print a,b,c}'
Idea is:
generate random number a :0=<a<60
generate random number b :0=<b<60-a
c=60-a-b
here, I set a variable n=60, to make it easy if you have other sum.
If we run this one-liner 10 times, we get output:
kent$ awk 'BEGIN{srand();for(i=1;i<=10;i++){a=int(rand()*60);b=int(rand()*(60-a));c=60-a-b;print a,b,c}}'
46 7 7
56 1 3
26 15 19
14 12 34
44 6 10
1 36 23
32 1 27
41 0 19
55 1 4
54 1 5

sort multiple column file

I have a file a.dat as following.
1 0.246102 21 1 0.0408359 0.00357267
2 0.234548 21 2 0.0401056 0.00264361
3 0.295771 21 3 0.0388905 0.00305116
4 0.190543 21 4 0.0371858 0.00427217
5 0.160047 21 5 0.0349674 0.00713894
I want to sort the file according to values in second column. i.e. output should look like
5 0.160047 21 5 0.0349674 0.00713894
4 0.190543 21 4 0.0371858 0.00427217
2 0.234548 21 2 0.0401056 0.00264361
1 0.246102 21 1 0.0408359 0.00357267
3 0.295771 21 3 0.0388905 0.00305116
How can do this with command line?. I read that sort command can be used for this purpose. But I could not figure out how to use sort command for this.
Use sort -k to indicate the column you want to use:
$ sort -k2 file
5 0.160047 21 5 0.0349674 0.00713894
4 0.190543 21 4 0.0371858 0.00427217
2 0.234548 21 2 0.0401056 0.00264361
1 0.246102 21 1 0.0408359 0.00357267
3 0.295771 21 3 0.0388905 0.00305116
This makes it in this case.
For future references, note (as indicated by 1_CR) that you can also indicate the range of columns to be used with sort -k2,2 (just use column 2) or sort -k2,5 (from 2 to 5), etc.
Note that you need to specify the start and end fields for sorting (2 and 2 in this case), and if you need numeric sorting, add n.
sort -k2,2n file.txt

Count Number of occurrence at each line

I have the following file
ENST001 ENST002 4 4 4 88 9 9
ENST004 3 3 3 99 8 8
ENST009 ENST010 ENST006 8 8 8 77 8 8
Basically I want to count how many times ENST* is repeated in each line so the expected results is
2
1
3
Any suggestion please ?
Try this (and see it in action here):
awk '{print gsub("ENST[0-9]+","")}' INPUTFILE

Resources