Linux: finding the position of the last '/' in a string only - linux

I have this string:
/sandbox/US_MARKETING/COMMON_DATA/BAU/FILES/2020/08/dnb_mi_081420.gz
Without knowing how many '/' there are in it, I want to be able to read just the file name into a variable.
I want to be able to do a search where I start at the last '/' in the line and find the filename 'dnb_mi_081420.gz'.
I want to basically say "Find the last '/' in the string, then read the substring that comes after it to the end and store it."
So I know it's going to look like this:
filename=substr(<position of the last'/'>,<position of first character in last string>)
So how to find the index position of the last '/' is I guess what I'm looking for.
Does anyone know what that is?
Also I tried using basename and unfortunately I'm doing this through 'hdfs dfs' to get to a hadoop shell. So some of the non-standard Linux commands like basename aren't in that vocabulary. I'm basically going to have to store that whole string in a variable and do operations on that variable value.

In bash, parameter expansion can be used:
${parameter##word}
The word is expanded to produce a pattern and matched according to the rules described below (see Pattern Matching). If the pattern matches the beginning of the expanded value of parameter, then the result of the expansion is the expanded value of parameter with the shortest matching pattern (the ‘#’ case) or the longest matching pattern (the ‘##’ case) deleted
Example:
$ s="/sandbox/US_MARKETING/COMMON_DATA/BAU/FILES/2020/08/dnb_mi_081420.gz" && echo "${s##*/}"
dnb_mi_081420.gz
$
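If the position of the last '/' itself is needed, one possible sketch is to strip the file name with the companion expansion ${s%/*} and take the length of what remains. The variable names dir, filename and last_slash_index are only illustrative, and this assumes the string contains at least one '/':

```shell
s="/sandbox/US_MARKETING/COMMON_DATA/BAU/FILES/2020/08/dnb_mi_081420.gz"

# Shortest match of "/*" removed from the end: everything before the last '/'.
dir="${s%/*}"

# Longest match of "*/" removed from the start: everything after the last '/'.
filename="${s##*/}"

# The length of the directory part equals the zero-based index of the last '/'.
last_slash_index=${#dir}

echo "directory: $dir"
echo "filename:  $filename"
echo "last '/' at index: $last_slash_index"
```

Both expansions are handled by the shell itself, so no external command such as basename is involved.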

You can use the -stat subcommand, which prints information about a file in a specified format. Since you only want the file name, the format is simply "%n":
hdfs dfs -stat "%n" /path/to/file
This may be more expensive than a solution based on raw indices, but should not create a meaningful or noticeable hit to performance.

Related

Linux rename s/ - regex for wildcard single character

I have found a simple solution to my actual requirement, but I would still like to understand how to use the regex equivalent of the single character wildcard ? which we use for filtering ... in say ls
I would like to rename a group of files which differ by one character.
FROM
Impossible-S01E01-x264.mkv
Impossible-S01E02-x264.mkv
Impossible-S01E03-x264.mkv
Impossible-S01E04-x264.mkv
Impossible-S01E05-x264.mkv
TO
Impossible-S01E01.mkv
Impossible-S01E02.mkv
Impossible-S01E03.mkv
Impossible-S01E04.mkv
Impossible-S01E05.mkv
As I said above, my simple solution is:
rename s/-x264// *.mkv
That sorts out my needs - all good and well - but I really want to understand my first approach:
To list the files, I can use:
ls Impossible-S01E0?-x264.mkv
So what I was trying for the rename was:
rename s/Impossible-S01E0?-x264.mkv/Impossible-S01E0?.mkv/ *.mkv
I have read up here:
How do regular expressions differ from wildcards used to filter files
And here:
Why does my regular expression work in X but not in Y?
I see this:
. matches any character (or any character except a newline).
I just can't seem to wrap my head around how to use that - hoping someone will explain for my education.
{ edit: missed a backslash \ }
So, regular expressions aren't globs. If you wanted to keep the middle (e.g. catch the season/ep) and replace everything else, you'd need to use capture groups. e.g. s/^.*(S\d+E\d+).*\.(.*?)$/Foo-$1.$2/
This would extract an SxxExx and the file extension, throw everything else away, and compose a new filename.
In a bit more detail it:
Matches everything from the start until an SxxExx (where xx is actually any number of digits)
Captures the contents of SxxExx
Matches everything until the final literal .
Non-greedily matches everything after the ., which it captures.
For your specific case of removing a suffix, this is likely overkill, though.
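To see the regex counterpart of the glob ? in isolation, here is a small preview sketch. It uses sed rather than rename (an assumption, so that nothing is actually renamed), and the file names are the samples from the question. The "." matches exactly one character, and the parentheses capture the part to keep so that \1 can put it back:

```shell
# Preview only: prints old and new names, renames nothing.
for name in Impossible-S01E01-x264.mkv Impossible-S01E05-x264.mkv; do
  # "." stands for any single character, like "?" in a glob.
  # "\." is a literal dot; "\1" is the text captured by the parentheses.
  new_name="$(printf '%s\n' "$name" | sed -E 's/^(Impossible-S01E0.)-x264\.mkv$/\1.mkv/')"
  printf '%s -> %s\n' "$name" "$new_name"
done
```

The key difference from the attempt in the question is that a wildcard cannot be reused on the replacement side; the matched text has to be captured and referenced.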

Wildcard index followed by a number

I need to rename a lot of files with mmv. I know how to do that but I have a problem with wildcard indexes followed by numbers in the filename.
Basically I need to have an output filename which contains a wildcard followed by numbers.
mmv -n "\*2\\.3_\*" "#11.6#2"
Here, as you can see, I'd like to have an output filename which contains the first wildcard followed by 1.6.
Unfortunately, this way I have #11.6 and the code is interpreted as if I want the 11th wildcard, which of course does not exist.
By reading the documentation you should have been able to find a solution.
Citation from man mmv, see https://ss64.com/bash/mmv.html
To strip any character (e.g. '*', '?', or '#') of its special meaning to mmv, as when the actual replacement name must contain the character '#', precede the special character with a '\' (and enclose the argument in quotes because of the shell). This also works to terminate a wildcard index when it has to be followed by a digit in the filename, e.g. "a#1\1".

What is wrong with this vim regular expression?

I have a list of files with extension .elf like this
file1.elf
file2.elf
file3.elf
I am trying to run them in the shell with a run command, like run file1.elf >file1.log, and get the result in a log file named after the input file with a .log extension.
My list of files is very big. I am trying out a vim regular expression that matches the file name (e.g. file1 in file1.elf) and uses it to create the name of the log file. I am trying it like this:
s/\(\(\<\w\+\)\@<=\.elf\)/\1 >\2\.log/
Here I try to match text which is followed by .elf and keep it in \1. I expect the entire file name to be in it, and I was hoping \2 would contain just the file name minus the extension. But this gives me
run file1 >file1.run, i.e. \1 does not take the full file name; it has somehow missed the .elf extension. I can use \1\.elf to get the proper result, but I was wondering why the expression is not working as I expected.
You use \@<= in your match pattern. This is the positive lookbehind assertion. As per the documentation (:help /\@<=),
Matches with zero width if the preceding atom matches just before what follows
The important part is that it matches with zero width, and this is what you are experiencing: the text covered by the assertion is checked but not consumed, so \1 does not contain the whole file name with its .elf suffix.
Instead, it would be easier to go with a
%s/\v(.*)\.elf$/run \1.elf > \1.log/
Here, I've used \v to turn on very magic (:help magic). With this turned on, you don't need all those backslashes when you use grouping parentheses.
Then there is (.*) to match and store the filename up until
\.elf$, which seems to be each file's suffix.
In the substitution part, after the / I add the literal run followed by \1. \1 will be replaced by the stored filename (without .elf suffix).
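The same substitution can be tried outside Vim as a quick check. This is only a sketch using sed with extended regular expressions (an assumption; the answer itself is about Vim), fed with the three sample names from the question:

```shell
# Apply the equivalent of the very-magic Vim substitution to sample lines.
# (.*) captures the name without the suffix; \1 reuses it twice.
commands="$(printf '%s\n' file1.elf file2.elf file3.elf |
  sed -E 's/^(.*)\.elf$/run \1.elf > \1.log/')"

printf '%s\n' "$commands"
```

Because the capture group stops before \.elf, the suffix has to be written back explicitly in the replacement, exactly as in the Vim command.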
The \@<= seems pointless and unneeded. Removing it gets you the desired behavior.

Pattern Matching log files

I am getting files like .log and _log in a folder. I am able to pick up the .log files with /*.log$/ but unable to find the files which are _log.
I need a regex pattern which will match both types of files in a specified folder.
Your question is tagged both 'perl' and 'linux'. I'll assume here that you're talking about Perl style regular expressions, as it looks like that's what you are showing in your example snippet.
The *. sequence is a mistake.
Let's focus on what you want to match. You want to match any filename that ends in a dot followed by the literal characters 'log'. You also want to match any filename that ends in an underscore, followed by the literal characters 'log'. You really shouldn't concern yourself with the "anything at all" that can come before the final dot or underscore. So the regexp would probably be better written as this:
/[._]log$/
Notice we don't even bother with the dot-star. It isn't helpful in this situation.
If you want for your pattern to also match files where the literal characters 'log' may optionally be followed by an integer sequence (not mentioned in your question, but discussed in one of your followup comments), you could write it like this:
/[._]log\d*$/
Here the 'star' is helpful; it allows for zero or more digits sandwiched between the 'g' and the end of the string.
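For a quick check of that pattern from the command line, a sketch with grep -E may help. Note that [0-9] is used in place of \d here, since \d is a Perl shorthand; the sample names, including the two that should not match, are made up for illustration:

```shell
# Filter a list of sample names with the same pattern.
# foo.txt and catalog are hypothetical non-matching names.
matches="$(printf '%s\n' foo.log foo_log foo.log1 foo_log22 foo.txt catalog |
  grep -E '[._]log[0-9]*$')"

printf '%s\n' "$matches"
```

The name catalog is rejected because the character before "log" must be a dot or an underscore.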
I totally agree (by upvoting) with DavidO's solution, but it usually makes more sense, and increases readability, to use glob() to get a list of files from a particular directory:
my $dir = "/path/here";
my @log_files = grep { /[\._]log\d*$/ } glob("$dir/*");
print join "\n", @log_files;
This will catch
foo.log
foo_log
foo.log1
foo_log22
Use the regexp /.*[._]log$/.
I'm surprised your first case worked -- /*.log$/ isn't legal regexp (since the * doesn't say what it is supposed to match zero-or-more of). Double-check your current results.

Find required files by pattern and the change the pattern on Linux

I need to find all *.xml files on Linux that match a pattern. I need to print each file name on the screen and then change the pattern in the file that was just found.
For instance.
I can start the script with arguments for keyword and for value, i.e
script.sh keyword "another word"
Script should find all files with keyword and do the following changes in the files containing keyword.
<keyword></keyword> should be the same <keyword></keyword>
<keyword>some word</keyword> should be like this <keyword>some word, another word</keyword>
In other words if initially value in keyword node was empty, then I don't need to change it and if it contains some value then I need to extend it with the value I will specify.
What is best way to do this on Linux? Using find, grep, sed?
Performance is also important since there are thousands of files.
Thank you.
It seems a combination of find, grep and sed would do this, and they are pretty fast. Since you'll be doing plain text processing, there might not be a need for XML processing. If you could give an example or rephrase your question, I might be able to provide more help.
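A rough sketch of what such a combination could look like is below. It rests on several assumptions that are not in the question: each element sits on a single line, the keyword and value contain no characters special to sed (such as |, & or \), and the two sample files in a temporary directory are hypothetical stand-ins for the real data:

```shell
keyword="keyword"
value="another word"

# Hypothetical sample files, created only for this demonstration.
dir="$(mktemp -d)"
printf '%s\n' '<keyword></keyword>' > "$dir/empty.xml"
printf '%s\n' '<keyword>some word</keyword>' > "$dir/filled.xml"

# find selects the XML files, grep -l keeps those that contain the keyword tag.
find "$dir" -type f -name '*.xml' -exec grep -l "<$keyword>" {} + |
while IFS= read -r file; do
  # Print the name of each file that was found.
  printf '%s\n' "$file"
  # Append the value only when the element already has content;
  # [^<][^<]* requires at least one character, so empty elements stay untouched.
  sed "s|<$keyword>\([^<][^<]*\)</$keyword>|<$keyword>\1, $value</$keyword>|g" \
    "$file" > "$file.tmp" && mv "$file.tmp" "$file"
done
```

Letting grep -l narrow the list first means sed only rewrites files that actually contain the keyword, which may matter with thousands of files. For XML that spans lines or uses attributes, a real XML tool would be the safer choice.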
