Pattern Matching log files

I am getting files like .log and _log in a folder. I am able to pick up the .log files with /*.log$/, but I am unable to find the files that end in _log.
I need a regex pattern that will match both types of file in a specified folder.

Your question is tagged both 'perl' and 'linux'. I'll assume here that you're talking about Perl style regular expressions, as it looks like that's what you are showing in your example snippet.
The *. sequence is a mistake.
Let's focus on what you want to match. You want to match any filename that ends in a dot followed by the literal characters 'log'. You also want to match any filename that ends in an underscore, followed by the literal characters 'log'. You really shouldn't concern yourself with the "anything at all" that can come before the final dot or underscore. So the regexp would probably be better written as this:
/[._]log$/
Notice we don't even bother with the dot-star. It isn't helpful in this situation.
If you want for your pattern to also match files where the literal characters 'log' may optionally be followed by an integer sequence (not mentioned in your question, but discussed in one of your followup comments), you could write it like this:
/[._]log\d*$/
Here the 'star' is helpful; it allows for zero or more digits sandwiched between the 'g' and the end of the string.

I totally agree with DavidO's solution (and upvoted it), but it usually makes more sense, and increases readability, to use glob() to get the list of files from a particular directory:
my $dir = "/path/here";
my @log_files = grep { /[\._]log\d*$/ } glob("$dir/*");
print join "\n", @log_files;
This will catch
foo.log
foo_log
foo.log1
foo_log22
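
For comparison, here is a minimal sketch of the same filter in Python rather than Perl. The pattern is the one from the answers above; the function name pick_logs and the sample list are made up for illustration.

```python
import re

# Same pattern as the Perl example: '.' or '_', then 'log',
# then optional digits, then the end of the string.
LOG_NAME = re.compile(r"[._]log\d*$")

def pick_logs(names):
    """Return the names ending in .log or _log, optionally followed by digits."""
    return [n for n in names if LOG_NAME.search(n)]

sample = ["foo.log", "foo_log", "foo.log1", "foo_log22", "catalog", "foo.log.gz"]
print(pick_logs(sample))
```

Names such as catalog are skipped because the character before 'log' has to be a dot or an underscore.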

Use the regexp /.*[._]log$/.
I'm surprised your first case worked: /*.log$/ isn't a legal regexp (since the * doesn't say what it is supposed to match zero-or-more of). Double-check your current results.

Related

Linux rename s/ - regex for wildcard single character

I have found a simple solution to my actual requirement, but I would still like to understand how to use the regex equivalent of the single character wildcard ? which we use for filtering ... in say ls
I would like to rename a group of files which differ by one character.
FROM
Impossible-S01E01-x264.mkv
Impossible-S01E02-x264.mkv
Impossible-S01E03-x264.mkv
Impossible-S01E04-x264.mkv
Impossible-S01E05-x264.mkv
TO
Impossible-S01E01.mkv
Impossible-S01E02.mkv
Impossible-S01E03.mkv
Impossible-S01E04.mkv
Impossible-S01E05.mkv
As I said above, my simple solution is:
rename s/-x264// *.mkv
That sorts out my needs - all good and well - but I really want to understand my first approach:
To list the files, I can use:
ls Impossible-S01E0?-x264.mkv
So what I was trying for the rename was:
rename s/Impossible-S01E0?-x264.mkv/Impossible-S01E0?.mkv/ *.mkv
I have read up here:
How do regular expressions differ from wildcards used to filter files
And here:
Why does my regular expression work in X but not in Y?
I see this:
. matches any character (or any character except a newline).
I just can't seem to wrap my head around how to use that - hoping someone will explain for my education.
{ edit: missed a backslash \ }

So, regular expressions aren't globs. If you wanted to keep the middle (e.g. catch the season/ep) and replace everything else, you'd need to use capture groups. e.g. s/^.*(S\d+E\d+).*\.(.*?)$/Foo-$1.$2/
This would extract an SxxExx and the file extension, throw everything else away, and compose a new filename.
In a bit more detail it:
Matches everything from the start until an SxxExx (where xx is actually any number of digits)
Captures the contents of SxxExx
Matches everything until the final literal .
Non-greedily matches everything after the ., which it captures.
For your specific case of removing a suffix, this is likely overkill, though.
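
To connect this back to the question: the regex counterpart of the glob ? is a single dot, and a capture group lets you reuse whatever it matched. Below is a minimal sketch using Python's re module instead of the Perl rename tool; the function name strip_suffix is hypothetical, and the pattern is tailored to the file names in the question.

```python
import re

def strip_suffix(name):
    # The '.' inside the group plays the role of the glob '?': exactly one
    # character. The group is captured so it can be put back with \1.
    # The literal dot before 'mkv' is escaped as '\.'.
    return re.sub(r"^(Impossible-S01E0.)-x264\.mkv$", r"\1.mkv", name)

print(strip_suffix("Impossible-S01E03-x264.mkv"))
```

A name that does not fit the pattern is returned unchanged.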

How to rename a folder that contains smart quotes

I have a folder that was created automatically. The user unintentionally provided smart (curly) quotes as part of the name, and the process that sanitizes the inputs did not catch these. As a result, the folder name contains the smart quotes. For example:
this-is-my-folder’s-name-“Bob”
I'm now trying to rename/remove said folder on the command line, and none of the standard tricks for dealing with files/folders with special characters (enclosing in quotes, escaping the characters, trying to rename it by inode, etc.) are working. All result in:
mv: cannot move this-is-my-folder’s-name-“Bob” to this-is-my-folders-name-BOB: No such file or directory
Can anyone provide some advice as to how I can achieve this?

To get the name in a format you can copy-and-paste into your shell:
printf '%q\n' this*
...will print out the filename in a manner the shell will accept as valid input. This might look something like:
$'this-is-my-folder\342\200\231s-name-\342\200\234Bob\342\200\235'
...which you can then use as an argument to mv:
mv $'this-is-my-folder\342\200\231s-name-\342\200\234Bob\342\200\235' this-is-my-folders-name-BOB
Incidentally, if your operating system works the same way mine does (when running the test above), this would explain why using single-character globs such as ? for those characters didn't work: They're actually more than one byte long each!
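
As a quick check of the byte counts, this small Python sketch prints the UTF-8 encoding of the three smart quotes as octal values. It assumes the folder name is stored as UTF-8; the helper name utf8_octal is made up for illustration.

```python
def utf8_octal(ch):
    """Return the UTF-8 bytes of ch as a list of octal strings."""
    return [format(b, "o") for b in ch.encode("utf-8")]

# U+2019 right single quote, U+201C left double quote, U+201D right double quote
for ch in "\u2019\u201c\u201d":
    print(repr(ch), utf8_octal(ch))
```

Each quote comes out as three bytes, which is why a byte-oriented match of a single ? would not cover it.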

You can use shell globbing token ? to match any single character, so matching the smart quotes using ? should do:
mv this-is-my-folder?s-name-?Bob? new_name
Here each smart quote is replaced with ? so that the glob matches the file name.

There are several possibilities.
If an initial substring of the file name ending before the first quote is unique within the directory, then you can use filename completion to help you type an appropriate command. Type "mv" (without the quotes) and the unique initial substring, then press the TAB key to request filename completion. Bash will complete the filename with the correct characters, correctly escaped.
Use a graphical file browser. Then you can select the file to rename by clicking on it. (Details of how to proceed from there depend on the browser.) If you don't have a graphical terminal and can't get one, then you may be able to do the same with a text-mode browser such as Midnight Commander.
A simple glob built with the ? or * wildcard should be able to match the filename.
Use a more complex glob to select the filename, and perhaps others with the same problem. Maybe something like *[^a-zA-Z0-9-]* would do. Use a pattern substitution to assign a new name. Something like this:
for f in *[^a-zA-Z0-9-]*; do
    mv "$f" "${f//[^a-zA-Z0-9-]/}"
done
The substitution replaces every character that is not a decimal digit, an uppercase or lowercase Latin letter, or a hyphen with nothing (i.e. it strips them). Do take care before you use this, though, to make sure you're not going to make more changes than you intend.
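
To preview what such a substitution would do to a name before touching the file system, the same character class can be tried out in Python. This is only a sketch; the function name strip_unusual is hypothetical.

```python
import re

def strip_unusual(name):
    # Remove every character that is not an ASCII letter, a digit, or a hyphen.
    return re.sub(r"[^a-zA-Z0-9-]", "", name)

print(strip_unusual("this-is-my-folder\u2019s-name-\u201cBob\u201d"))
```

Note that dots, spaces and underscores are stripped as well, so names with extensions would lose their dot.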

replacing part of regex matches

I have several functions that start with get_ in my code:
get_num(...) , get_str(...)
I want to change them to get_*_struct(...).
Can I somehow match the get_* regex and then replace according to the pattern so that:
get_num(...) becomes get_num_struct(...),
get_str(...) becomes get_str_struct(...)
Can you also explain some of the logic behind it? The regexes from theory aren't like the ones used in UNIX (or in vi; are those different too?), and I'm always struggling to figure them out.
This has to be done in the vi editor, as it is my main work tool.
Thanks!

To transform get_num(...) to get_num_struct(...), you need to capture the correct text in the input. And, you can't put the parentheses in the regular expression because you may need to match pointers to functions too, as in &get_distance, and uses in comments. However, and this depends partially on the fact that you are using vim and partially on how you need to keep the entire input together, I have checked that this works:
%s/get_\w\+/&_struct/g
On every line, find every expression starting with get_ and continuing with at least one letter, number, or underscore, and replace it with the entire matched string followed by _struct.
Darn it; I shouldn't answer these things on spec. Note that other regex engines might use \& instead of &. This depends on having magic set, which is default in vim.
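
For readers more used to another engine, here is roughly the same substitution sketched in Python, where the whole match is written \g<0> instead of &. The function name add_struct is made up for illustration.

```python
import re

def add_struct(text):
    # \g<0> is Python's spelling of Vim's '&': the entire matched text.
    return re.sub(r"get_\w+", r"\g<0>_struct", text)

print(add_struct("x = get_num(a) + call(&get_distance);"))
```

Like the Vim command, this also renames uses without parentheses, and running it twice would append _struct twice.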

For an alternate way to do it:
%s/get_\(\w*\)(/get_\1_struct(/g
What this does:
\w matches to any "word character"; \w* matches 0 or more word characters.
\(...\) tells vim to remember whatever matches .... So, \(\w*\) means "match any number of word characters, and remember what you matched". You can then access it in the replacement with \1 (or \2 for the second group, etc.).
So, the overall pattern get_\(\w*\)( looks for get_, followed by any number of word chars, followed by (.
The replacement then just does exactly what you want.
(Sorry if that was too verbose - not sure how comfortable you are with vim regex.)
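
The capture-group version translates almost directly to Python, which writes groups as plain ( ) and a literal parenthesis as \(. This is a sketch for comparison only; the function name add_struct_to_calls is hypothetical.

```python
import re

def add_struct_to_calls(text):
    # (\w*) captures the part after 'get_'; \( matches a literal parenthesis,
    # so only names that are directly followed by '(' are rewritten.
    return re.sub(r"get_(\w*)\(", r"get_\1_struct(", text)

print(add_struct_to_calls("get_num(1); p = &get_str;"))
```

Because the pattern requires the opening parenthesis, &get_str is left alone here, unlike with the & approach.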

Search files with multiple "dot" characters

In Linux, how do I use find with regular expressions (or something similar, without writing a script) to search for files with multiple "dots" in the name, IGNORING the extension?
For example, a search through the following files should return only the second file. In this example ".ext" is the extension.
testing1234hellothisisafile.ext
testing.1234.hello.this.is.a.file.ext
The solution should work with one or more dots in the file name (ignoring the extension dot). It should also work for any file, i.e. with any file extension.
Thanks in advance

So if I understand correctly, you want to get the filenames with at least two additional dots in the name. This would do:
$ find -regex ".*\.+[^.]*\.+[^.]*\.+.*"
./testing.1234.hello.this.is.a.file.ext
./testing1234.hellothisisafile.ext
$ find -regex ".*\.+[^.]*\.+[^.]*\.+[^.]*\.+.*"
./testing.1234.hello.this.is.a.file.ext
The key dot-detecting part is \.+ (at least one dot), paired with [^.]*, which matches whatever separates the dots. (Excluding dots there is redundant, since the previous part already covers them; it is a safety measure against greedy matching.) Together they make up the core of the regex: we don't care what comes before or after, just that there are three dots somewhere. Three, because find matches the regex against the whole path, so the dot in the leading ./ of the current directory also counts. If you search from elsewhere, remove one \.+[^.]* group:
$ find delme/ -regex ".*\.+[^.]*\.+[^.]*\.+[^.]*\.+.*"
delme/testing.1234.hello.this.is.a.file.ext
$ find delme/ -regex ".*\.+[^.]*\.+[^.]*\.+.*"
delme/testing.1234.hello.this.is.a.file.ext
In this case the result is the same, since the name contains a lot of dots, but the second regex is the correct one.
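
To see how the leading ./ changes the count, the two patterns can be tried on plain strings. The sketch below uses Python's re module, which is not the regex flavour find uses by default, but this particular pattern should behave the same way in both. The names THREE_DOTS, FOUR_DOTS and whole_path_match are made up for illustration.

```python
import re

THREE_DOTS = re.compile(r".*\.+[^.]*\.+[^.]*\.+.*")
FOUR_DOTS = re.compile(r".*\.+[^.]*\.+[^.]*\.+[^.]*\.+.*")

def whole_path_match(pattern, path):
    # find -regex has to match the entire path, so fullmatch is the
    # closest analogue here.
    return pattern.fullmatch(path) is not None

for path in ["./testing1234.hellothisisafile.ext",
             "delme/testing1234.hellothisisafile.ext",
             "delme/testing.1234.hello.this.is.a.file.ext"]:
    print(path, whole_path_match(THREE_DOTS, path), whole_path_match(FOUR_DOTS, path))
```

The first path matches the three-dot pattern only because of the dot in ./, which is the effect described above.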

Why doesn't Vim's errorformat take regular expressions?

Vim's errorformat (for parsing compile/build errors) uses an arcane format from C for parsing errors.
Trying to set up an errorformat for nant seems almost impossible; I've tried for many hours and can't get it to work. I also see from my searches that a lot of people seem to be having the same problem. A regex to solve this would take minutes to write.
So why does vim still use this format? It's quite possible that the C parser is faster but that hardly seems relevant for something that happens once every few minutes at most. Is there a good reason or is it just an historical artifact?

It's not that Vim uses an arcane format from C. Rather it uses the ideas from scanf, which is a C function. This means that the string that matches the error message is made up of 3 parts:
whitespace
characters
conversion specifications
Whitespace is your tabs and spaces. Characters are the letters, numbers and other normal stuff. Conversion specifications are sequences that start with a '%' (percent) character. In scanf you would typically match an input string against %d or %f to convert to integers or floats. With Vim's error format, you are searching the input string (error message) for files, lines and other compiler specific information.
If you were using scanf to extract an integer from the string "99 bottles of beer", then you would use:
int i;
scanf("%d bottles of beer", &i); // i would be 99, string read from stdin
Now with Vim's error format it gets a bit trickier but it does try to match more complex patterns easily. Things like multiline error messages, file names, changing directory, etc, etc. One of the examples in the help for errorformat is useful:
Error 275
line 42
column 3
' ' expected after '--'
The appropriate error format string has to look like this:
:set efm=%EError\ %n,%Cline\ %l,%Ccolumn\ %c,%Z%m
Here %E tells Vim that it is the start of a multi-line error message. %n is an error number. %C is the continuation of a multi-line message, with %l being the line number, and %c the column number. %Z marks the end of the multiline message and %m matches the error message that would be shown in the status line. You need to escape spaces with backslashes, which adds a bit of extra weirdness.
While it might initially seem easier with a regex, this mini-language is specifically designed to help with matching compiler errors. It has a lot of shortcuts in there. I mean you don't have to think about things like matching multiple lines, multiple digits, matching path names (just use %f).
Another thought: How would you map numbers to mean line numbers, or strings to mean files or error messages if you were to use just a normal regexp? By group position? That might work, but it wouldn't be very flexible. Another way would be named capture groups, but then this syntax looks a lot like a short hand for that anyway. You can actually use regexp wildcards such as .* - in this language it is written %.%#.
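
To make the comparison with named capture groups concrete, here is a rough Python sketch that pulls the same four fields out of the example message. It is not how Vim works internally; the group names nr, lnum, col and text and the function parse_error are chosen purely for illustration.

```python
import re

# Named groups stand in for %n, %l, %c and %m. The layout is the four-line
# example message, without any leading line numbers.
MESSAGE = re.compile(
    r"Error (?P<nr>\d+)\n"
    r"line (?P<lnum>\d+)\n"
    r"column (?P<col>\d+)\n"
    r"(?P<text>.*)"
)

def parse_error(block):
    """Return the captured fields as a dict, or None if the block does not match."""
    m = MESSAGE.match(block)
    return m.groupdict() if m else None

print(parse_error("Error 275\nline 42\ncolumn 3\n' ' expected after '--'"))
```

The regex version has to spell out the newlines and digit runs by hand, which is the bookkeeping that %E, %C and %Z take care of in the errorformat string.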
OK, so it is not perfect. But it's not impossible either and makes sense in its own way. Get stuck in, read the help and stop complaining! :-)

I would recommend writing a post-processing filter for your compiler that uses regular expressions or whatever, and outputs messages in a simple format for which it is easy to write an errorformat. Why learn some new, baroque, single-purpose language unless you have to?

According to :help quickfix,
it is also possible to specify (nearly) any Vim supported regular
expression in format strings.
However, the documentation is confusing and I didn't put much time into verifying how well it works and how useful it is. You would still need to use the scanf-like codes to pull out file names, etc.

They are a pain to work with, but to be clear: you can use regular expressions (mostly).
From the docs:
Pattern matching
The scanf()-like "%*[]" notation is supported for backward-compatibility
with previous versions of Vim. However, it is also possible to specify
(nearly) any Vim supported regular expression in format strings.
Since meta characters of the regular expression language can be part of
ordinary matching strings or file names (and therefore internally have to
be escaped), meta symbols have to be written with leading '%':
%\    The single '\' character. Note that this has to be
      escaped ("%\\") in ":set errorformat=" definitions.
%.    The single '.' character.
%#    The single '*'(!) character.
%^    The single '^' character. Note that this is not
      useful, the pattern already matches start of line.
%$    The single '$' character. Note that this is not
      useful, the pattern already matches end of line.
%[    The single '[' character for a [] character range.
%~    The single '~' character.
When using character classes in expressions (see |/\i| for an overview),
terms containing the "\+" quantifier can be written in the scanf() "%*"
notation. Example: "%\\d%\\+" ("\d\+", "any number") is equivalent to "%*\\d".
Important note: The \(...\) grouping of sub-matches can not be used in format
specifications because it is reserved for internal conversions.

lol try looking at the actual vim source code sometime. It's a nest of C code so old and obscure you'll think you're on an archaeological dig.
As for why vim uses the C parser, there are plenty of good reasons starting with that it's pretty universal. But the real reason is that sometime in the past 20 years someone wrote it to use the C parser and it works. No one changes what works.
If it doesn't work for you the vim community will tell you to write your own. Stupid open source bastards.
