Get Last 5 Sequential Numbers from Perl Text and 3 Preceding Characters - string

How can someone get the last 5 sequential digits from a Perl string and additionally get the 3 characters that immediately precede that sequence? For example, if the string is "This is just a bunch of ran 00000 Dom text. It has no 11111 meaning.", then I would want to get "11111" and then "no ".

Use a regular expression:
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = 'This is just a bunch of ran 00000 Dom text. It has no 11111 meaning.';
my ($pre, $digits) = $string =~ /.*(...)([0-9]{5})/;
say "<$pre>\t<$digits>";
[0-9] matches any digit
{5} means the previous thing should match five times
() parentheses create a capture group
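. matches any character (except newline)
* means the previous thing should match zero or more times, as much as possible. The .* is therefore greedy: it consumes as much of the string as it can while still leaving three characters and five digits to match, so the captures land on the last five-digit run (11111) rather than on 00000.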
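The effect of the greedy .* can also be observed outside Perl. The following is a minimal sketch using sed -E, a swapped-in tool that is not part of the answer above; last_five is a hypothetical helper name, and a sed that supports -E is assumed. If the input contains no five-digit run, sed prints the line unchanged.

```shell
# last_five is a hypothetical helper name used only for this sketch.
# The leading .* is greedy, so the two groups capture the LAST run of
# five digits and the three characters right before it.
last_five() {
    printf '%s\n' "$1" | sed -E 's/.*(...)([0-9]{5}).*/<\1> <\2>/'
}

last_five 'This is just a bunch of ran 00000 Dom text. It has no 11111 meaning.'
# prints <no > <11111>
```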

Related

How to return only integers from a variable in Shell Script and discard letters and leading zeros?

In my shell script there is a parameter that comes from certain systems and it gives an answer similar to this one: PAR0000008.
And I need to send only the last number of this parameter to another variable, ie VAR=8.
I used the command VAR=$( echo ${PAR} | cut -c 10 ) and it worked perfectly.
The problem is when the PAR parameter returns with a two-digit number, like PAR0000012. I need to discard the leading zeros and send only the number 12 to the variable, but I don't know how to write the logic in the shell to discard all the characters to the left of the number.
Edit Using grep To Handle 0 As Part Of Final Number
Since you are using a POSIX shell, making use of a utility like sed or grep (or cut) makes sense. grep is quite a bit more flexible in parsing the string, since a regex match can handle the job. Say your variable is v=PAR0312012 and you want the result r=312012. You can use a command substitution (e.g. $(...)) to parse the value and assign the result to r, e.g.
v=PAR0312012
r=$(echo "$v" | grep -Eo '[1-9].*$')
echo "$r"
The grep expression is:
-Eo - use extended regex and return only the matching portion of the string (-o is a GNU/BSD extension, not part of POSIX grep),
[1-9].*$ - from the first character in [1-9], return the remainder of the string.
This will work for PAR0000012 or PAR0312012 (with result 312012).
Result
For PAR0312012
312012
Another Solution Using expr
If your variable can have zeros as part of the final number portion, then you must find the index where the first [1-9] character occurs, and then assign the substring beginning at that index to your result variable.
expr offers a set of string-parsing operators that can do this. Note that index and substr are GNU extensions to expr rather than part of POSIX. The needed commands are:
expr index string charlist
and
expr substr string position length
where position is the 1-based index at which the substring starts and length is the number of characters to extract. length just has to be large enough to cover the rest of the string, so you can simply use the total length of your string, e.g.
v=PAR0312012
ndx=$(expr index "$v" "123456789")
r=$(expr substr "$v" "$ndx" 10)
echo $r
Result
312012
This will handle 0 anywhere after the first [1-9].
(Note: expr is an external command that is started anew for every call, so it isn't the fastest way of handling this. If you are only concerned with a few tens of thousands of values, it will work fine; for a billion numbers, another method will likely be needed.)
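For shells where the GNU-only index and substr operators are unavailable, the same result can be sketched with the POSIX expr match operator (:). This is an alternative technique, not the one shown above, and strip_to_number is a hypothetical helper name. It assumes the value does not start with a dash and contains at least one non-zero digit; otherwise expr prints an empty line and returns a non-zero status.

```shell
# strip_to_number is a hypothetical helper name for this sketch.
# [^1-9]* skips everything before the first non-zero digit;
# \(.*\) captures the rest, which expr prints.
strip_to_number() {
    expr "$1" : '[^1-9]*\(.*\)'
}

strip_to_number PAR0312012   # prints 312012
strip_to_number PAR0000012   # prints 12
```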
This can be done easily using parameter expansion, as long as the number contains no zero after its first non-zero digit (PAR0312012 would give 12, and PAR0000010 would give an empty string).
var='PAR0000008'
echo "${var##*0}"
# prints 8
echo "${var##*[^1-9]}"
# prints 8
var="${var##*0}"
echo "$var"
# prints 8
var='PAR0000012'
echo "${var##*0}"
# prints 12
echo "${var##*[^1-9]}"
# prints 12
var="${var##*[^1-9]}"
echo "$var"
# prints 12
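A sketch of a parameter-expansion variant that keeps zeros inside the number: first compute the prefix that precedes the first non-zero digit, then strip exactly that prefix. strip_leading is a hypothetical helper name; for a value without any non-zero digit (e.g. PAR0000000) it prints an empty line.

```shell
# strip_leading is a hypothetical helper name for this sketch.
strip_leading() {
    prefix=${1%%[1-9]*}      # drop the longest suffix starting at the first 1-9, e.g. "PAR0"
    rest=${1#"$prefix"}      # remove that prefix from the original value
    printf '%s\n' "$rest"
}

strip_leading PAR0312012   # prints 312012
strip_leading PAR0000010   # prints 10
```

Only built-in expansions are used, so no external process is started per value.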

Bash String Format Comparison with Wildcards

I am fairly new to bash scripting and was trying to echo only lines that match a specific formatting. I have this code so far:
LINE=1
while read -r CURRENT_LINE
do
if [[ $CURRENT_LINE == ??-?-??? ]]
then
echo "$LINE: $CURRENT_LINE"
fi
((LINE++))
done < "./new-1.txt"
The text file contains number sequences on each line that match the following format: "12-3-456", but also contains sequences that are in different formats as well, such as "123-89203-9420" or "123-456-7890". I can't quite understand why the if statement inside the while loop does not result to True on lines that match the formatting. I've tried using the * as well, but using it gives me incorrect results.
Here are the contents of the text file new-1.txt. I want the script to output "Line 1: 11-1-111", but it doesn't output anything.
11-1-111
222-22-2222
333-33-3333
444-444-4444
555-555-5555
In regex parlance, ? makes the preceding character or group optional, i.e. it may occur at most once, and zero occurrences are also accepted.
However, == inside [[ ]] is not the regex matching operator; it performs shell pattern (glob) matching, in which ? stands for exactly one arbitrary character. The regex operator is =~.
So changing your if clause to the following would do the job. Note that the regex must not be quoted: since bash 3.2, a quoted right-hand side of =~ is matched as a literal string.
[[ $CURRENT_LINE =~ ^[0-9]{2}-[0-9]{1}-[0-9]{3}$ ]]
Here
The ^ anchors the match to the beginning of the string and $ to the end, so the whole line has to fit the pattern.
[0-9] denotes a range, here any digit from zero to nine.
The {n} mandates that the preceding character/selection match exactly n times.
Note: You can also use the more verbose [[:digit:]] instead of [0-9].
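If the script also has to run in shells without [[ ]], the same check can be sketched with case and bracket globs. This is an alternative to the =~ approach, not the answer's method; matches_format is a hypothetical helper name, and the here-document stands in for new-1.txt.

```shell
# matches_format is a hypothetical helper name for this sketch.
# Each [0-9] glob matches exactly one digit, so the pattern is dd-d-ddd.
matches_format() {
    case $1 in
        [0-9][0-9]-[0-9]-[0-9][0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

n=0
while IFS= read -r line; do
    n=$((n + 1))
    if matches_format "$line"; then
        echo "$n: $line"
    fi
done <<'EOF'
11-1-111
222-22-2222
333-33-3333
EOF
# prints 1: 11-1-111
```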

Select sequences in a fasta file with more than 300 aa and "C" occurs at least 4 times

I have a fasta file which contains protein sequences. I'd like to select sequences with more than 300 amino acids and Cysteine (C) amino acid appears more than 4 times.
I've used this command to select sequences with more than 300 aa:
cat 72hDOWN-fasta.fasta | bioawk -c fastx 'length($seq) > 300{ print ">"$name; print $seq }'
Some sequence example:
>jgi|Triasp1|216614|CE216613_3477
MPSLYLTSALGLLSLLPAAQAGWNPNSKDNIVVYWGQDAGSIGQNRLSYYCENAPDVDVI
NISFLVGITDLNLNLANVGNNCTAFAQDPNLLDCPQVAADIVECQQTYGKTIMMSLFGST
YTESGFSSSSTAVSAAQEIWAMFGPVQSGNSTPRPFGNAVIDGFDFDLEDPIENNMEPFA
AELRSLTSAATSKKFYLSAAPQCVYPDASDESFLQGEVAFDWLNIQFYNNGCGTSYYPSG
YNYATWDNWAKTVSANPNTKLLVGTPASVHAVNFANYFPTNDQLAGAISSSKSYDSFAGV
MLWDMAQLFGNPGYLDLIVADLGGASTPPPPASTTLSTVTRSSTASTGPTSPPPSGGSVP
QWGQCGGQGYTGPTQCQSPYTCVVESQWWSSCQ*
I do not know bioawk but I assume it is identical to awk with some initial parsing and constant definitions.
I would proceed as follows. Assuming you want to find the sequences that contain the letter C more than 4 times and have a length of more than 300 (use >= 4 instead if "at least 4 times" is meant), you could do:
bioawk -c fastx '
(length($seq) > 300) && (gsub("C","C",$seq)>4) {
print ">"$name; print $seq
}' 72hDOWN-fasta.fasta
but this assumes that seq is the full character sequence.
The idea behind it is the following. The gsub command performs substitutions in strings and returns the total substitutions it did. Hence, if we substitute all characters "C" with "C" we actually did not change the string, but get the total amount of "C"'s in the string back.
From the POSIX standard IEEE Std 1003.1-2017:
gsub(ere, repl[, in]): Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument, when specified.
sub(ere, repl[, in ]): Substitute the string repl in place of the first instance of the extended regular expression ere in string in and return the number of substitutions. An <ampersand> ( & ) appearing in the string repl shall be replaced by the string from in that matches the ERE. An <ampersand> preceded with a <backslash> shall be interpreted as the literal <ampersand> character. An occurrence of two consecutive <backslash> characters shall be interpreted as just a single literal <backslash> character. Any other occurrence of a <backslash> (for example, preceding any other character) shall be treated as a literal <backslash> character. Note that if repl is a string literal (the lexical token STRING; see Grammar), the handling of the <ampersand> character occurs after any lexical processing, including any lexical <backslash>-escape sequence processing. If in is specified and it is not an lvalue (see Expressions in awk), the behavior is undefined. If in is omitted, awk shall use the current record ($0) in its place.
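The counting idea can be tried with plain awk, without bioawk. In this minimal sketch, count_c is a hypothetical helper name and the input is a short made-up sequence; gsub is called with two arguments, so it works on the current record ($0).

```shell
# count_c is a hypothetical helper name for this sketch.
# Replacing every "C" with "C" leaves the record unchanged, and gsub
# returns the number of replacements, i.e. the number of C's.
count_c() {
    printf '%s\n' "$1" | awk '{ print gsub(/C/, "C") }'
}

count_c ACCGC   # prints 3
```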
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.

How do I include new lines in a string in Perl?

I have a string that looks like this
Acanthocolla_cruciata,#8B5F65Acanthocyrta_haeckeli,#8B5F65Acanthometra_fusca,#8B5F65Acanthopeltis_japonica,#FFB5C5
I am trying to add new lines to get it into list format, like this:
Acanthocolla_cruciata,#8B5F65
Acanthocyrta_haeckeli,#8B5F65
Acanthometra_fusca,#8B5F65
Acanthopeltis_japonica,#FFB5C5
I have a perl script
use strict;
use warnings;
open my $new_tree_fh, '>', 'test_match.txt'
or die qq{Failed to open "update_color.txt" for output: $!\n};
open my $file, '<', $ARGV[0]
or die qq{Failed to open "$ARGV[0]" for input: $!\n};
while ( my $string = <$file> ) {
my $splitmessage = join ("\n", ($string =~ m/(.+)+\,+\#+\w{6}/gs));
print $new_tree_fh $splitmessage, "\n";
}
close $file;
close $new_tree_fh;
The pattern match works, but it won't print the new lines I need to make the list. Can anyone please suggest anything?
I'd do:
use feature 'say';
my $str = 'Acanthocolla_cruciata,#8B5F65Acanthocyrta_haeckeli,#8B5F65Acanthometra_fusca,#8B5F65Acanthopeltis_japonica,#FFB5C5';
# insert a newline at every position preceded by ",#" and six word characters
$str =~ s/(?<=,#\w{6})/\n/g;
say $str;
Output:
Acanthocolla_cruciata,#8B5F65
Acanthocyrta_haeckeli,#8B5F65
Acanthometra_fusca,#8B5F65
Acanthopeltis_japonica,#FFB5C5
OK, I think your problem here is that your regular expression doesn't match properly.
(.+)+
for example - probably doesn't do what you think it does. It's a greedy capture of 1 or more of "anything" which will grab your whole string.
Check it out on regex101.
Try:
#!/usr/bin/perl
use strict;
use warnings;
while ( my $string = <DATA> ) {
my $splitmessage = join( "\n", ( $string =~ m/(\w+,\#+\w{6})/g ) );
print $splitmessage, "\n";
}
__DATA__
Acanthocolla_cruciata,#8B5F65Acanthocyrta_haeckeli,#8B5F65Acanthometra_fusca,#8B5F65Acanthopeltis_japonica,#FFB5C5
Which will print:
Acanthocolla_cruciata,#8B5F65
Acanthocyrta_haeckeli,#8B5F65
Acanthometra_fusca,#8B5F65
Acanthopeltis_japonica,#FFB5C5
Rather than a quickfix solution, let's find the problem in your existing code and hence learn from it. Your problem is in the regular expression, so we'll dissect and fix it.
($string =~ m/(.+)+\,+\#+\w{6}/gs)
First, the two significant mistakes that lead to the bug:
At the beginning, you're doing a .+, followed by matching with , and # and so on. The problem is that .+ is greedy, which means it'll match up to the last , in the input and not the first one. So when you run this, almost the entire line (except for the last plant's color) gets matched by this single .+.
There are a few different ways you can fix this, but the easiest is to restrict what you're matching. Instead of saying .+ ("match anything"), make it [\w\s]+ at the beginning, which means match either "word characters" (letters, digits and the underscore) or whitespace characters (in case a plant name contains a space).
($string =~ m/([\w\s]+)+\,+\#+\w{6}/gs)
That changes the output, but still not to the fully correct version because:
m/some regex/g returns a list of its matches here, and what we want is for it to return the whole match, including both plant name and color. But when there are parentheses anywhere inside the pattern, m// returns only the parts captured by the parentheses (the plant name here), not the whole match. So, remove the parentheses, and it becomes:
($string =~ m/[\w\s]++\,+\#+\w{6}/gs)
This works, but is quite clumsy and bug-prone, so here's some improvement suggestions:
Since your input has no newline characters, the /s at the end is unnecessary.
($string =~ m/[\w\s]++\,+\#+\w{6}/g)
, and # are not special characters in Perl regular expressions (as long as the /x modifier is not used), so they don't need a \ before them.
($string =~ m/[\w\s]++,+#+\w{6}/g)
+ is for when you know only that the character will be present, but don't know how many times it'll be there. Here, since we're only trying to match one , and one # characters, the + after them is unnecessary.
($string =~ m/[\w\s]++,#\w{6}/g)
The ++ after [\w\s] means something quite different from +: it is a possessive quantifier, which matches greedily and never gives characters back by backtracking. That is not needed here, so let's make it a single +
($string =~ m/[\w\s]+,#\w{6}/g)
Optionally, you can change the last \w to match only the hexadecimal characters which will appear in the colour code:
($string =~ m/[\w\s]+,#[0-9A-F]{6}/g)
That's a pretty solid, working regular expression that does what you want.

Perl: Count number of times a word appears in text and print out surrounding words

I want to do two things:
1) count the number of times a given word appears in a text file
2) print out the context of that word
This is the code I am currently using:
my $word_delimiter = qr{
[^[:alnum:][:space:]]*
(?: [[:space:]]+ | -- | , | \. | \t | ^ )
[^[:alnum:]]*
}x;
my $word = "hello";
my $count = 0;
#
# here, a file's contents are loaded into $lines, code not shown
#
$lines =~ s/\R/ /g; # replace all line breaks with blanks (cannot just erase them, because this might connect words that should not be connected)
$lines =~ s/\s+/ /g; # replace all multiple whitespaces (incl. blanks, tabs, newlines) with single blanks
$lines = " ".$lines." "; # add a blank at beginning and end to ensure that first and last word can be found by regex pattern below
while ($lines =~ m/$word_delimiter$word$word_delimiter/g ) {
++$count;
# here, I would like to print the word with some context around it (i.e. a few words before and after it)
}
Three problems:
1) Is my $word_delimiter pattern catching all reasonable characters I can expect to separate words? Of course, I would not want to separate hyphenated words, etc. [Note: I am using UTF-8 throughout but only English and German text; and I understand what reasonably separates a word might be a matter of judgment]
2) When the file to be analyzed contains text like "goodbye hello hello goodbye", the counter is incremented only once, because the regex only matches the first occurrence of " hello ". After all, the second time it could find "hello", it is not preceded by another whitespace. Any ideas on how to catch the second occurrence, too? Should I maybe somehow reset pos()?
3) How to (reasonably efficiently) print out a few words before and after any matched word?
Thanks!
1. Is my $word_delimiter pattern catching all reasonable characters I can expect to separate words?
Word characters are denoted by the character class \w. It matches letters, digits and the underscore, including letters from non-Roman scripts.
\W represents the negated sense (non-word characters).
\b represents a word boundary and has zero-length.
Using these already available character classes should suffice.
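2. Any ideas on how to catch the second occurrence, too?
Use zero-length word boundaries. Because \b consumes no characters, the blank between the two occurrences of "hello" is not used up by the first match, so the second occurrence is found as well.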
while ( $lines =~ /\b$word\b/g ) {
++$count;
}

Resources