Perl string parsing with multiple patterns

I'm trying to parse a text file that contains multiple patterns.
The goal is to keep everything between * * and only the integers between ^ ^; any special characters or other strings found there should be removed.
data.txt
*ABC-13077* ^817266, 55555^
*BCD-13092* ^CL: 816933^
*CDE-13127* ^ ===> Change 767666 submitted^
output.txt
ABC-13077 817266 55555
BCD-13092 816933
CDE-13127 767666
my script
#!/usr/bin/perl
use strict;
use Cwd;
my $var;
open(FH,"changelists.txt")or die("can't open file:$!");
while($var=<FH>)
{
my @vareach=split(/[* \s\^]+/,$var);
for my $each(@vareach)
{
print "$each\n";
}
}

Replace the while loop with the following:
while (<FH>) {
s/\*(.*)\*/$1/;
s/\^(.*)\^/ join ' ', $1 =~ m([0-9]+)g /e;
print;
}
The first substitution removes the asterisks.
The second substitution takes the ^...^ part and, because of the /e modifier, replaces it with the result of evaluating the code in the replacement part. That code matches all the integers; because join imposes list context on the match, the match returns all of them.
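The two substitutions can be tried without a data file. This is a minimal sketch, assuming the input lines look like the samples in the question; the sub name clean_line is hypothetical and not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: clean one line of the sample format.
sub clean_line {
    my ($line) = @_;
    # Drop the asterisks around the first field.
    $line =~ s/\*(.*)\*/$1/;
    # Replace the ^...^ field with its integers, separated by spaces.
    # The inner match is an argument of join, so it runs in list context
    # and, with /g, returns every run of digits it finds.
    $line =~ s/\^(.*)\^/ join ' ', $1 =~ m([0-9]+)g /e;
    return $line;
}

print clean_line($_), "\n" for (
    '*ABC-13077* ^817266, 55555^',
    '*CDE-13127* ^ ===> Change 767666 submitted^',
);
```

Wrapping the logic in a sub makes it easy to check individual lines before running it over the whole file.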

Related

How to edit a multi-line scalar and print the edits

I need to edit a multi-line scalar and print the results, however I am not able to do it neatly.
my $text = "$omething\n nothing\n Everything\n";
What I need to do is check each line and, if it contains a capital letter or special character, print that line and remove it from the original scalar ($text).
In this example it would print two times, first time:
$omething
Second time:
Everything
And remove both of those strings from the $text scalar.
To include a dollar sign in a double quoted string, you need to escape it by a backslash.
You can remove the matching lines in a while loop:
#!/usr/bin/perl
use warnings;
use strict;
my $text = "\$omething\nnothing\nEverything\n";
while ($text =~ s/(.*[[:upper:]\$].*\n)//) {
print $1;
}
print "Remaining: $text";
A period never matches a newline (unless you specify the /s modifier).
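If the removed lines are needed later rather than just printed, the same loop can collect them. This is a minimal sketch, assuming that "special character" means only the dollar sign, as in the answer's character class; the sub name extract_flagged is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every line that contains an uppercase letter
# or a dollar sign, and return the remaining text followed by the removed lines.
sub extract_flagged {
    my ($text) = @_;
    my @removed;
    # Without /s, "." cannot cross a newline, so each match is a single line.
    while ($text =~ s/(.*[[:upper:]\$].*\n)//) {
        push @removed, $1;
    }
    return ($text, @removed);
}

my ($rest, @removed) = extract_flagged("\$omething\nnothing\nEverything\n");
print "Removed: $_" for @removed;
print "Remaining: $rest";
```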

Find and replace words using sed command not working

I have a tab-separated text file: the first column holds the word to be found and the second column holds the word that replaces it. The file contains English and Arabic pairs. Once a word has been found and replaced, it should not be changed again.
For example:
adam a +dam
a b
ال ال+
So for a given text file:
adam played with a ball ال
I expect:
a +dam played with b ball ال+
However, I get:
b +dbm plbyed with b bbll ال+
I am using the following sed command to find and replace:
sed -e 's/^/s%/' -e 's/\t/%/' -e 's/$/%g/' tab_sep_file.txt | sed -f - original_file.txt >replaced.txt
How can I fix this issue?
The basic problem with your approach is that you don't want text produced by an earlier substitution to be replaced again by a later one - you don't want to change the a's in a +dam to b's. This makes sed a pretty poor choice - you can fairly easily make a regular expression that matches all of the things you want to replace, but picking which replacement to use is an issue.
A way using GNU awk:
gawk -F'\t' '
FNR == NR { subs[$1] = $2; next } # populate the array of substitutions
ENDFILE {
if (FILENAME == ARGV[1]) {
# Build a regular expression of things to substitute
subre = "\\<("
first=0
for (s in subs)
subre = sprintf("%s%s%s", subre, first++ ? "|" : "", s)
subre = sprintf("%s)\\>", subre)
}
}
{
# Do the substitution
nwords = patsplit($0, words, subre, between)
printf "%s", between[0]
for (n = 1; n <= nwords; n++)
printf "%s%s", subs[words[n]], between[n]
printf "\n"
}
' tab_sep_file.txt original_file.txt
which outputs
a +dam played with b ball
First it reads the TSV file and builds an array of words to be replaced and text to replace it with (subs). Then after reading that file, it builds a regular expression to match all possible words to be found - \<(a|adam)\> in this case. The \< and \> match only at the beginning and end, respectively, of words, so the a in ball won't match.
Then for the second file with the text you want to process, it uses patsplit() to split each line into an array of matched parts (words) and the bits between matches (between), and iterates over the length of the array, printing out the replacement text for each match. That way it avoids re-matching text that's already been replaced.
And a Perl version that uses a similar approach (taking advantage of Perl's ability to evaluate the replacement text in a s/// substitution):
perl -e '
use strict;
use warnings;
# Set file/standard stream char encodings from locale
use open ":locale";
# Or for explicit UTF-8 text
# use open ":encoding(UTF-8)", ":std";
my %subs;
open my $words, "<", shift or die $!;
while (<$words>) {
chomp;
my ($word, $rep) = split "\t" ,$_, 2;
$subs{$word} = $rep;
}
my $subre = "\\b(?:" . join("|", map { quotemeta } keys %subs) . ")\\b";
while (<<>>) {
print s/$subre/$subs{$&}/egr;
}
' tab_sep_file.txt original_file.txt
(This one will escape regular expression metacharacters in the words to replace, making it more robust)
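The single-pass idea can also be tried without any input files. This is a minimal sketch, assuming the word pairs are already in a hash and using ASCII-only sample data; the sub name replace_once is hypothetical, and sorting the keys longest-first is an extra precaution that the original one-liner does not take.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each whole word that is a key of %$subs,
# in one pass, so that replacement text is never scanned again.
sub replace_once {
    my ($text, $subs) = @_;
    # Longest keys first, so a longer word is tried before its own prefix.
    my $alternatives = join "|",
        map { quotemeta }
        sort { length($b) <=> length($a) } keys %$subs;
    return $text if $alternatives eq "";
    # /g continues after each match; /r returns the modified copy.
    return $text =~ s/\b($alternatives)\b/$subs->{$1}/gr;
}

my %subs = (adam => "a +dam", a => "b");
print replace_once("adam played with a ball", \%subs), "\n";
```

Because s///g resumes scanning after the end of each match, the "a" inside the inserted "a +dam" is never looked at again.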

How to use Regex in Perl

I need some help. I have the output of a command and need to extract only the time, i.e. "10:57:09", from it.
The command is: tail -f /var/log/sms
command output:
Thu 2016/08/04 10:57:09 gammu-smsd[48014]: Read 0 messages
How could I do this in Perl and put the result into a variable?
Thank you
Normally, we'd expect you to show some evidence of trying to solve the problem yourself before giving an answer.
You use the match operator (m/.../) to check if a string matches a regular expression. The m is often omitted so you'll see it written as /.../. By default, it matches against the variable $_ but you can change that by using the binding operator, =~. If a regex includes parentheses ((...)) then whatever is matched by that section of the regex is stored in $1 (and $2, $3, etc for subsequent sets of parentheses). Those "captured" values are also returned by the match operator when it is evaluated in list context.
It's always a good idea to check the return value from the match operator, as you'll almost certainly want to take different actions if the match was unsuccessful.
See perldoc perlop for more details of the match operator and perldoc perlre for more details of Perl's regex support.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
$_ = 'Thu 2016/08/04 10:57:09 gammu-smsd[48014]: Read 0 messages';
if (my ($time) = /(\d\d:\d\d:\d\d)/) {
say "Time is '$time'";
} else {
say 'No time found in string';
}
And to get the data from your external process...
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
open my $tail_fh, 'tail -f /var/log/sms |' or die $!;
while (<$tail_fh>) {
if (my ($time) = /(\d\d:\d\d:\d\d)/) {
say "Time is '$time'";
} else {
say 'No time found in string';
}
}
Perl code:
$txt = "Thu 2016/08/04 10:57:09 gammu-smsd[48014]: Read 0 messages";
if ($txt =~ /(\d{2}:\d{2}:\d{2})/) { # only use $1 after a successful match
print $1; # result of regex
print "\n"; # new line
}
And it prints:
10:57:09
The result goes to a variable called $1, due to the capturing parentheses. Had there been more capturing parentheses, their captured text would have been put into $2, $3, etc.
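To illustrate several capture groups at once, here is a minimal sketch that splits the time into its parts. The sub name split_time and the variable names are hypothetical; the pattern is the one above with each field in its own group.

```perl
use strict;
use warnings;

# Hypothetical helper: return (hour, minute, second) from the first
# HH:MM:SS found in $line, or an empty list if there is none.
sub split_time {
    my ($line) = @_;
    # Three capture groups; in list context the match returns ($1, $2, $3).
    return $line =~ /(\d{2}):(\d{2}):(\d{2})/;
}

my $txt = "Thu 2016/08/04 10:57:09 gammu-smsd[48014]: Read 0 messages";
if (my ($hour, $min, $sec) = split_time($txt)) {
    print "hour=$hour minute=$min second=$sec\n";
} else {
    print "No time found\n";
}
```

Assigning the captures to named variables in list context also avoids reading a stale $1 left over from an earlier successful match.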
EDIT
To read the line from standard input instead, replace the assignment in the above script with:
$txt = <STDIN>;
Now, suppose the script is called myscript.pl, execute tail like so:
tail -f /var/log/sms | perl myscript.pl

Perl: replacing newline characters inside fish-brackets, while leaving other newline characters untouched?

I finally know how to use regular expressions to replace one substring with another every place where it occurs within a string. But what I need to do now is a bit more complicated than that.
A string I must transform will have many instances of the newline character ('\n'). If a newline character is enclosed within fish-tags (between '<' and '>'), I need to replace it with a single space character (' ').
However, if a newline character occurs anywhere else in the string, I need to leave that newline character alone.
There will be several places in the string that are enclosed in fish-tags, and several places that aren't.
Is there a way to do this in Perl?
I honestly don't recommend doing this with regular expressions. Besides the fact that you should never parse HTML with a regular expression, negative matches are a pain to write as regular expressions, and anyone reading the code will have no idea what it does. Doing it manually, on the other hand, is really easy to understand.
This code assumes well-formed HTML in which no tag starts inside the definition of another tag (otherwise you would have to track all the instances and increment/decrement a count appropriately), and it does not handle < or > inside quoted strings, which isn't the most common thing. If you need all of that, I really recommend you use a real HTML parser; there are many of them.
Obviously, if you're not reading this from a filehandle, the loop would go over an array of lines (or over the output of splitting the whole text; since split removes the newline, you would instead append ' ' or "\n" depending on the $inside variable).
use strict;
use warnings;
# Default to being outside a tag
my $inside = 0;
while(my $line = <DATA>) {
# Find the last < and > in the string
my ($open, $close) = map { rindex($line, $_) } qw(< >);
# Update our state accordingly.
if ($open > $close) {
$inside = 1;
} elsif ($open < $close) {
$inside = 0;
}
# If we're inside a tag change the newline (last character in the line) with a space. If you instead want to remove it you can use the built-in chomp.
if ($inside) {
# chomp($line);
substr($line, -1) = ' ';
}
print $line;
}
__DATA__
This is some text
and some more
<enclosed><a
b
c
> <d
e
f
>
<g h i
>
Given:
$ echo "$txt"
Line 1
Line 2
< fish tag line 1
and line 2 >
< line 3 >
< fish tag line 4
and line 5 >
You can do:
$ echo "$txt" | perl -0777 -lpe 's/(<[^\n>]*)\n+([^>]*>)/$1 $2/g'
Line 1
Line 2
< fish tag line 1 and line 2 >
< line 3 >
< fish tag line 4 and line 5 >
I will echo that this only works in limited cases. Please do not get in the general habit of using a regex for HTML.
This solution uses zdim's data (thanks, zdim).
I prefer to use an executable replacement together with the non-destructive option of the tr/// operator.
This solution finds all occurrences of strings enclosed in angle brackets <...> and changes all newlines within each one to single spaces.
Note that it would be simple to allow for quoted substrings containing any characters by writing this instead:
$data =~ s{ ( < (?: "[^"]+" | [^>] )+ > ) }{ $1 =~ tr/\n/ /r }gex;
use strict;
use warnings 'all';
use v5.14; # For /r option
my $data = do {
local $/;
<DATA>;
};
$data =~ s{ ( < [^<>]+ > ) }{ $1 =~ tr/\n/ /r }gex;
print $data;
__DATA__
start < inside tags> no new line
again <inside, with one nl
> out
more <inside, with two NLs
and more text
>
output
start < inside tags> no new line
again <inside, with one nl > out
more <inside, with two NLs and more text >
(X)HTML/XML shouldn't be parsed with regex. But since no description of the problem is given, here is one way to go about it. Hopefully it demonstrates how tricky and involved this can get.
You can match the newline itself, keeping in mind the details of how linefeeds may appear in the text.
use warnings;
use strict;
my $text = do { # read all text into one string
local $/;
<DATA>;
};
1 while $text =~ s/< ([^>]*) \n ([^>]*) >/<$1 $2>/gx;
print $text;
__DATA__
start < inside tags> no new line
again <inside, with one nl
> out
more <inside, with two NLs
and more text
>
This prints
start < inside tags> no new line
again <inside, with one nl > out
more <inside, with two NLs and more text >
The negated character class [^>] matches any character other than >; with the * quantifier it matches zero or more of them, up to a \n. Another such pattern follows the \n, up to the closing >. The /x modifier allows whitespace inside the pattern, for readability. We also need to consider two particular cases.
There may be multiple \n inside <...>, for which the while loop is a clean solution.
There may be multiple <...> with \n, which is what /g is for.
The 1 while ... idiom is another way to write while (...) { }, where the body of the loop is empty so everything happens in the condition, which is repeatedly evaluated until false. In our case the substitution keeps being done in the condition until there is no match, when the loop exits.
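The repeat-until-no-match behaviour can be checked on a short string. This is a minimal sketch using the same substitution; the sub name flatten_tags and the sample input are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every newline inside <...> with a space.
sub flatten_tags {
    my ($text) = @_;
    # Each pass replaces at most one newline per <...>, so the substitution
    # is repeated until it no longer matches anything.
    1 while $text =~ s/< ([^>]*) \n ([^>]*) >/<$1 $2>/gx;
    return $text;
}

print flatten_tags("a <b\nc\nd> e\nf\n");
```

In this input the tag contains two newlines, so the substitution succeeds twice and fails on the third attempt, which ends the loop; the newline outside the tag is left alone.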
Thanks to ysth for bringing up these points and for the 1 while ... solution.
All of this necessary care for various details and edge cases (of which there may be more) hopefully convinces you that it is better to reach for an HTML parsing module suitable for the particular task. For this we'd need to know more about the problem.

Perl script to compress strings replacing repeated chars

I'm writing a Perl script that, given a list of strings as input, similar to this:
AADDDDKPP PrRRRR
wwwwwwwwwwYY SSSSS SSSSGGGGGGGGGGGGGG
ZZZZZFZZQZZZZZZZZZZZZZ
should substitute every character that is repeated more than 2 times consecutively with the number of repetitions followed by the character. So with the input shown above, the script should output these strings:
AA4DKPP Pr4R
10wYY 5S 4S14G
5ZFZZQ13Z
Here's the perl script I've written so far:
foreach my $line(@lines){
@letters=split("",$line);
@alreadyChecked=();
foreach my $letter(@letters){
$count=0;
if (grep {$letter} @alreadyChecked) {
next;
}
push(@alreadyChecked,$letter);
foreach my $index(@letters){
if($letter eq $index){
$count=$count+1;
} else {
@alreadyChecked=0;
last;
}
}
if($count>2){
@chops=split(/$letter+/,$line);
$line=$chops[0].$count.$letter.$chops[1];
}
}
}
I think there's more than one flaw in this code, but can't find where.
my $str ="
AADDDDKPP PrRRRR
wwwwwwwwwwYY SSSSS SSSSGGGGGGGGGGGGGG
ZZZZZFZZQZZZZZZZZZZZZZ
";
$str =~ s/((.)\2{2,})/ length($1) . $2 /ge;
print $str;
output
AA4DKPP Pr4R
10wYY 5S 4S14G
5ZFZZQ13Z
You can solve this with a regex substitution:
perl -plwe's/((.)\2\2+)/length($1) . $2/eg'
This matches a character, captured by the inner parentheses, followed by two or more repetitions of it (\2\2+), so only runs of three or more match. Each run is replaced with the length of the entire match $1 followed by the character itself $2. The script version could be something like:
use strict;
use warnings;
while (<>) {
s/((.)\2\2+)/length($1) . $2/eg;
print;
}
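For checking individual strings, the substitution can be wrapped in a sub. This is a minimal sketch; the name compress_runs is hypothetical, and it assumes the input contains no digits, since digits in the input would be indistinguishable from the inserted counts.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution for a single string.
sub compress_runs {
    my ($s) = @_;
    # (.) captures one character as $2; \2\2+ requires at least two more
    # copies, so $1 is a run of three or more identical characters.
    $s =~ s/((.)\2\2+)/length($1) . $2/eg;
    return $s;
}

print compress_runs($_), "\n" for (
    "AADDDDKPP PrRRRR",
    "ZZZZZFZZQZZZZZZZZZZZZZ",
);
```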
