Patterns in Lua with space - string

How could I use string.gmatch(text, pattern) to do this:
text = "Hello.%23 Awesome7^.."
pattern = --what to put here?
for word in string.gmatch(text, pattern) do
print(word)
end
--Result
>test
Hello.%23
Awesome7^..
>
I have been using "%w+%p", but this results in:
>test
Hello.
%
23
Awesome7^
.
.
Which is not the desired result.
Note: I have not tested this exact string, so the output may vary, but it still does not produce the desired result.

From your example, each word contains no spaces and the words are separated by spaces, so the simplest pattern is "%S+":
text = "Hello.%23 Awesome7^.."
pattern = "%S+"
for word in string.gmatch(text, pattern) do
print(word)
end
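"%s" matches a whitespace character (space, tab, newline, and so on); "%S" matches any character that is not whitespace, so "%S+" matches each maximal run of non-whitespace characters.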
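As a small illustration, the same loop can collect the words into a table. This is only a sketch: split_words is a hypothetical helper name, not part of the Lua standard library, and the extra tab and newline in the sample text are added here to show that any whitespace acts as a separator.

```lua
-- Hypothetical helper: collect every run of non-whitespace characters.
local function split_words(text)
  local words = {}
  for word in string.gmatch(text, "%S+") do
    words[#words + 1] = word
  end
  return words
end

-- Tabs, newlines and repeated spaces all count as separators.
local words = split_words("Hello.%23   Awesome7^..\n\tlast")
for i, word in ipairs(words) do
  print(i, word)
end
```

Punctuation and "%" stay attached to their word because "%S" only excludes whitespace.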

Related

Remove spaces from a string but not new lines in lua

I used string.gsub(str, "%s+") to remove spaces from a string, but I do not want to remove newlines. Example:
str = "string with\nnew line"
string.gsub(str, "%s+")
print(str)
and I'm expecting the output to be like:
stringwith
newline
What pattern should I use to get that result?
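It seems you want to match any whitespace that %s matches, but exclude the newline character.
You can put the complement class %S (which matches any non-whitespace character) inside a negated character set, [^...], together with \n. The set [^%S\n] then matches any character that is neither non-whitespace nor a newline, i.e. any whitespace except a newline: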
local str = "string with\nnew line"
str = string.gsub(str, "[^%S\n]+", "")
print(str)
This prints:
stringwith
newline
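To make the difference between the candidate patterns concrete, here is a minimal sketch that runs three of them over one sample string. The sample string (with an added tab and a double space) and the variable names are assumptions made for illustration. Each gsub call is wrapped in parentheses to keep only the resulting string and drop the replacement count.

```lua
local str = "string with\n\tnew  line"

-- "%s+" removes every whitespace character, including the newline.
local all_removed = (str:gsub("%s+", ""))

-- "[^%S\n]+" removes whitespace but leaves "\n" alone.
local newline_kept = (str:gsub("[^%S\n]+", ""))

-- " +" removes only literal spaces, so the tab and newline both survive.
local spaces_only = (str:gsub(" +", ""))

print(all_removed)
print(newline_kept)
print(spaces_only)
```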
"%s" matches any whitespace character. If you only want to match a space, use " ". If you want to match a specific number of spaces, either write them out explicitly or use string.rep(" ", 5).

perl: print remaining string only if there is no character before the matched value.

The following prints the entire content of the line after "B. "
perl -ne'print if /B[.] (.*)/s' $string > file
How can I match/print the line only if there is no other character before the "B. "? In other words, if something precedes the "B. " (e.g. "TAB. "), the line should be skipped and not printed.
The correct "B." is always on a new line, the only correct line to match appears as follows:
B. some text here
A regex with a leading caret (^) matches only at the start of the string, which here is the start of each line because -n processes the input one line at a time. The pattern /^B[.] (.*)/s should give you the result you're looking for.
Put ^ in front of the B; it anchors the match so that the line must start with "B. ". Your regex then becomes /^B\. (.*)/. The s flag is not needed here, since it only changes whether . matches a newline.

Perl: Count number of times a word appears in text and print out surrounding words

I want to do two things:
1) count the number of times a given word appears in a text file
2) print out the context of that word
This is the code I am currently using:
my $word_delimiter = qr{
[^[:alnum:][:space:]]*
(?: [[:space:]]+ | -- | , | \. | \t | ^ )
[^[:alnum:]]*
}x;
my $word = "hello";
my $count = 0;
#
# here, a file's contents are loaded into $lines, code not shown
#
$lines =~ s/\R/ /g; # replace all line breaks with blanks (cannot just erase them, because this might connect words that should not be connected)
$lines =~ s/\s+/ /g; # replace all multiple whitespaces (incl. blanks, tabs, newlines) with single blanks
$lines = " ".$lines." "; # add a blank at beginning and end to ensure that first and last word can be found by regex pattern below
while ($lines =~ m/$word_delimiter$word$word_delimiter/g ) {
++$count;
# here, I would like to print the word with some context around it (i.e. a few words before and after it)
}
Three problems:
1) Is my $word_delimiter pattern catching all reasonable characters I can expect to separate words? Of course, I would not want to separate hyphenated words, etc. [Note: I am using UTF-8 throughout but only English and German text; and I understand what reasonably separates a word might be a matter of judgment]
2) When the file to be analyzed contains text like "goodbye hello hello goodbye", the counter is incremented only once, because the regex only matches the first occurrence of " hello ". After all, the second time it could find "hello", it is not preceded by another whitespace, since that blank was already consumed by the first match. Any ideas on how to catch the second occurrence, too? Should I maybe somehow reset pos()?
3) How to (reasonably efficiently) print out a few words before and after any matched word?
Thanks!
1. Is my $word_delimiter pattern catching all reasonable characters I can expect to separate words?
Word characters are denoted by the character class \w. It also matches digits, the underscore, and characters from non-roman scripts.
\W represents the negated sense (non-word characters).
\b represents a word boundary and has zero-length.
Using these already available character classes should suffice.
2. Any ideas on how to catch the second occurrence, too?
Use zero-length word boundaries.
while ( $lines =~ /\b$word\b/g ) {
++$count;
}

Extract the required substring from another string -Perl

I want to extract a substring from a line in Perl. Let me explain giving an example:
fhjgfghjk3456mm 735373653736
icasd 666666666666
111111111111
In the above lines, I only want to extract the 12 digit number. I tried using split function:
my @cc = split(/[0-9]{12}/,$line);
print @cc;
But what it does is remove the matched part of the string and store the remainder in @cc. I want the part matching the pattern to be printed. How do I do that?
You can do it with regular expressions:
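#!/usr/bin/perl
use strict;
use warnings;
use feature 'say'; # say is not enabled by default
my $string = 'fhjgfghjk3456mm 735373653736 icasd 666666666666 111111111111';
# /g in a while loop visits every match; $1 holds the captured 12 digits
while ($string =~ m/\b(\d{12})\b/g) {
say $1;
}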
Test the regex here: http://rubular.com/r/Puupx0zR9w
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/\b(\d+)\b/)->explain();
The regular expression:
(?-imsx:\b(\d+)\b)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
The $1 built-in variable holds the text captured by the first pair of parentheses in the most recent successful match. A match on its own only tells you whether the pattern was found, so the best solution here is to put parentheses around the part you want and then print $1.
my $strn = "fhjgfghjk3456mm 735373653736\nicasd\n666666666666 111111111111";
$strn =~ m/([0-9]{12})/;
print $1;
This makes the regex capture just the first run of twelve digits, which we then read back from $1.
#!/bin/perl
my $var = 'fhjgfghjk3456mm 735373653736 icasd 666666666666 111111111111';
if($var =~ m/(\d{12})/) {
print "Twelve digits: $1.";
}
#!/usr/bin/env perl
undef $/;
$text = <DATA>;
@res = $text =~ /\b\d{12}\b/g;
print "@res\n";
__DATA__
fhjgfghjk3456mm 735373653736
icasd 666666666666
111111111111

Lua 'plain' string.gsub

I've hit a small block with string parsing. I have a string like:
footage/down/temp/cars_[100]_upper/cars_[100]_upper.exr
and I'm having difficulty using gsub to delete a portion of the string. Normally I would do this
lineA = "footage/down/temp/cars_[100]_upper/cars_[100]_upper.exr"
lineB = "footage/down/temp/cars_[100]_upper/"
newline = lineA:gsub(lineB, "")
which would normally give me 'cars_[100]_upper.exr'
The problem is that gsub doesn't like the [] or other special characters in the string and unlike string.find gsub doesn't have the option of using the 'plain' flag to cancel pattern searching.
I am not able to manually edit the lines to add escape characters for the special characters, as I'm writing a file comparison script.
Any help to get from lineA to newline using lineB would be most appreciated.
Taking from page 181 of Programming in Lua 2e:
The magic characters are:
( ) . % + - * ? [ ] ^ $
The character '%' works as an escape for these magic characters.
So, we can just come up with a simple function to escape these magic characters, and apply it to your input string (lineB):
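function literalize(str)
-- Prefix every magic character with "%". The outer parentheses keep only
-- the new string and drop gsub's second return value (the match count).
return (str:gsub("[%(%)%.%%%+%-%*%?%[%]%^%$]", function(c) return "%" .. c end))
end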
lineA = "footage/down/temp/cars_[100]_upper/cars_[100]_upper.exr"
lineB = literalize("footage/down/temp/cars_[100]_upper/")
newline = lineA:gsub(lineB, "")
print(newline)
Which of course prints: cars_[100]_upper.exr.
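To see why the unescaped string misbehaves as a pattern, the following sketch compares three searches for the text "[100]". The sample string and variable names are assumptions chosen for illustration.

```lua
local s = "cars_[100]_upper"

-- Unescaped, "[100]" is a character set that matches a single "1" or "0".
local a1, a2 = s:find("[100]")

-- Escaped brackets match the literal text "[100]".
local b1, b2 = s:find("%[100%]")

-- The plain flag (fourth argument) disables pattern matching entirely.
local c1, c2 = s:find("[100]", 1, true)

print(a1, a2)
print(b1, b2)
print(c1, c2)
```

The first search stops at the lone "1" inside the brackets, while the other two report the full bracketed span.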
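Another approach is string.find with its fourth argument set to true, which turns off pattern matching. Note that lineB must be the original, unescaped string here:
local i1, i2 = lineA:find(lineB, 1, true)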
local result = lineA:sub(i2 + 1)
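The find-based approach raises an error when the substring is absent, because i2 is then nil. A minimal sketch of a more defensive variant follows. remove_first_plain is a hypothetical helper name, and returning the input unchanged when nothing is found is an assumed policy rather than something the question specifies.

```lua
-- Hypothetical helper: delete the first literal occurrence of needle from s.
local function remove_first_plain(s, needle)
  local i1, i2 = s:find(needle, 1, true) -- plain search, no pattern magic
  if not i1 then
    return s -- nothing to remove
  end
  return s:sub(1, i1 - 1) .. s:sub(i2 + 1)
end

local lineA = "footage/down/temp/cars_[100]_upper/cars_[100]_upper.exr"
local lineB = "footage/down/temp/cars_[100]_upper/"
print(remove_first_plain(lineA, lineB))
```

Unlike lineA:sub(i2 + 1), this also keeps any text that precedes the match.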
You can also escape punctuation in a text string, str, using:
str:gsub ("%p", "%%%0")
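Building on the punctuation-escaping idea, here is a sketch of a "plain" gsub that replaces every literal occurrence. replace_plain is a hypothetical helper name. It also escapes "%" in the replacement string, since "%" is special there as well.

```lua
-- Hypothetical helper: replace every literal occurrence of needle in s.
local function replace_plain(s, needle, replacement)
  -- Escape all punctuation so the needle is matched literally.
  local pattern = needle:gsub("%p", "%%%0")
  -- "%" is also special in the replacement string, so double it.
  local repl = replacement:gsub("%%", "%%%%")
  -- Parentheses drop the replacement count returned by gsub.
  return (s:gsub(pattern, repl))
end

local lineA = "footage/down/temp/cars_[100]_upper/cars_[100]_upper.exr"
local lineB = "footage/down/temp/cars_[100]_upper/"
print(replace_plain(lineA, lineB, ""))
print(replace_plain("a.b.c", ".", "%"))
```

Escaping every punctuation character is safe because "%" followed by any non-alphanumeric character stands for that character itself in a Lua pattern.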

Resources