ksh shell script to find first occurrence of _ in string and remove everything until that

I'm new to shell scripting and am using the ksh shell. Could you please help me with this?
My string looks like errorfile101_ApplicationData_2_333.txt. I want to remove everything up to and including the first occurrence of _.
My output should be ApplicationData_2_333.txt

This is an easy one, assuming you can assign your string to a variable, i.e.
str="errorfile101_ApplicationData_2_333.txt"
echo ${str#*_}
output
ApplicationData_2_333.txt
The # operator in ${str#*_} removes the shortest match of the pattern that follows it from the left of the variable's value.
There is also ##, which removes the longest match from the left, which would give you
333.txt
There are similar removal operators that work from the right side of the string: % (shortest match) and %% (longest match).
All versions of ksh (and bash, and other POSIX shells) support these operators (sorry if that is the wrong term).
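As a quick side-by-side, here is a minimal sketch of all four operators applied to the same string. The variable names are only illustrative; the expansions themselves are standard POSIX parameter expansion and should behave the same in ksh, bash, and dash.

```shell
str="errorfile101_ApplicationData_2_333.txt"

shortest_left=${str#*_}    # drop everything up to the first _
longest_left=${str##*_}    # drop everything up to the last _
shortest_right=${str%_*}   # drop everything from the last _ onward
longest_right=${str%%_*}   # drop everything from the first _ onward

printf '%s\n' "$shortest_left" "$longest_left" "$shortest_right" "$longest_right"
```

This should print ApplicationData_2_333.txt, 333.txt, errorfile101_ApplicationData_2 and errorfile101, one per line.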
Versions of ksh93 and greater (bash, zsh and probably others) also support a sed like pattern match/sub value like
echo ${str/*_/xx}
# *_ is the pattern to match
# xx is the replacement
output
xx333.txt
which means that / works like sed matching the longest possible string.
IHTH

You can use the cut command:
echo "errorfile101_ApplicationData_2_333.txt" | cut -d"_" -f2-

Related

Remove text between one string and 1st occurrence of another string

I have found several solutions to remove text between two strings but I guess my case is a little different.
I am trying to convert this:
/nz/kit.7.2.0.7/bin/adm/tools/hostaekresume
To this:
/nz/kit/bin/adm/tools/hostaekresume
Basically remove the version specific information from the filename.
The solutions I have found remove everything from the word kit to the last occurrence of /. I need something to remove from kit to the first occurrence.
The most common solution I have seen is:
sed -e 's/\(kit\).*\(\/\)/\1\2/'
Which produces:
/nz/kit/hostaekresume
How can I only remove up to the first /? I assume this can be done with sed or awk, but I'm open to suggestions.

$ sed 's|\(kit\)[^/]*|\1|' <<< '/nz/kit.7.2.0.7/bin/adm/tools/hostaekresume'
/nz/kit/bin/adm/tools/hostaekresume
This uses a different delimiter (| instead of /) so we don't have to escape the /. Then, for non-greedy matching, it uses [^/]*: any number of characters other than /, which matches everything between kit and the next /.
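To make the behaviour easy to try on other paths, here is a small sketch that wraps the same sed expression in a function. The name strip_kit_version is made up for this example, and the second path is an invented test case showing that only the first "kit" on the line is touched (there is no g flag).

```shell
# strip_kit_version is a hypothetical helper name, not part of any tool.
# It removes whatever follows the first "kit" up to (not including) the next /.
strip_kit_version() {
    printf '%s\n' "$1" | sed 's|\(kit\)[^/]*|\1|'
}

strip_kit_version '/nz/kit.7.2.0.7/bin/adm/tools/hostaekresume'
strip_kit_version '/nz/kit.10.1/bin/kit.old/run'
```

The first call should print /nz/kit/bin/adm/tools/hostaekresume and the second /nz/kit/bin/kit.old/run.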
Alternatively, if you know that what you want to remove consists of dots and digits, and nothing else in the string contains them, you can use parameter expansion:
$ var='/nz/kit.7.2.0.7/bin/adm/tools/hostaekresume'
$ echo "${var//[[:digit:].]}"
/nz/kit/bin/adm/tools/hostaekresume
The syntax is ${parameter/pattern/string}, where pattern in the expanded parameter is replaced by string. If we use // instead of /, all occurrences instead of just the first are replaced.
In our case, parameter is var, the pattern is [[:digit:].] (digits or a dot – this is a glob pattern, not a regular expression, by the way), and we've skipped the /string part, which just removes the pattern (replaces it with nothing).
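The difference between / and // is easiest to see on a short string. The sketch below assumes bash (ksh93 and zsh should behave the same way); the variable names and the third path are invented for illustration. The last line also shows why this approach is only safe when nothing else in the string contains digits or dots.

```shell
# Needs bash, ksh93 or zsh; plain POSIX sh has no ${var/pattern/string}.
var='kit.7.2.0.7'

first_only=${var/[[:digit:].]/}     # removes only the first match (the first dot)
all_matches=${var//[[:digit:].]/}   # removes every digit and dot

# Caveat: digits and dots elsewhere in the path are removed too.
path='/nz/kit.7.2.0.7/bin/v2/run.sh'
over_eager=${path//[[:digit:].]/}

printf '%s\n' "$first_only" "$all_matches" "$over_eager"
```

This should print kit7.2.0.7, then kit, then /nz/kit/bin/v/runsh.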

You need perl for non-greedy regex. sed doesn't do that yet.
Also, use | as a delimiter since / can cause confusion when you have it in your regex.
perl -pe 's|(kit).*?(/.*)|$1$2|'
The ? after the .* makes the match non-greedy, so it stops at the first instance of /.
echo "/nz/kit.7.2.0.7/bin/adm/tools/hostaekresume" | perl -pe 's|(kit).*?(/.*)|$1$2|'
returns
/nz/kit/bin/adm/tools/hostaekresume

echo "/nz/kit.7.2.0.7/bin/adm/tools/hostaekresume" | awk '{sub(/\.7\.2\.0\.7/,"")}1'
/nz/kit/bin/adm/tools/hostaekresume

How can I use sed to get an xml value

How can I use sed to get the SOMETHING in <version.suffix>SOMETHING</version.suffix>?
I tried sed 's#.*>\(.*\)\<version\.suffix\>#\1#', but it fails.

Try this one:
sed 's/<.*>\(.*\)<.*>/\1/'
It should be general enough to get the value of any simple element that sits on a single line.
If you need to eliminate the indentation add \s* at the beginning like this:
sed 's/\s*<.*>\(.*\)<.*>/\1/'
Alternatively if you only want version.suffix's value, you can make the command more specific like this:
sed 's/<version\.suffix>\(.*\)<.*>/\1/'
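If the input is a whole file rather than a single line, a sketch along these lines prints only the value of the requested element and suppresses every other line. The function name xml_value and the sample input are made up for this example. It assumes the element opens and closes on one line and appears once per line; for anything more complex a real XML parser is the safer choice.

```shell
# xml_value is a hypothetical helper: it reads XML on stdin and prints the text
# between <tag> and </tag> for lines that contain the element.
# -n plus the p flag means lines without a match are not printed.
# Note: a dot in the tag name is treated as a regex "any character".
xml_value() {
    sed -n "s|.*<$1>\(.*\)</$1>.*|\1|p"
}

printf '%s\n' '<project>' '  <version.suffix>SOMETHING</version.suffix>' '</project>' |
    xml_value version.suffix
```

With the sample input above this should print SOMETHING.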

You could use the below sed command,
$ echo '<version.suffix>SOMETHING</version.suffix>' | sed 's#^<[^>]*>\(.*\)<\/[^>]*>$#\1#'
SOMETHING
^<[^>]*> Matches the first tag string <version.suffix>.
\(.*\)<\/[^>]*>$ Characters up to the closing tag are captured, and the closing tag itself is matched by <\/[^>]*>.
Finally, all the matched characters are replaced by the contents of capture group 1.
Your regex is almost correct; the only problem is that you forgot the / inside the closing tag (and the \< and \> should be a plain < and >):
$ echo '<version.suffix>SOMETHING</version.suffix>' | sed 's#.*>\(.*\)</version\.suffix>#\1#'
SOMETHING

Many ways possible, e.g:
with sed
echo '<version.suffix>SOMETHING</version.suffix>' | sed 's#<[^>]*>##g'
or grep
echo '<version.suffix>SOMETHING</version.suffix>' | grep -oP '<version.suffix>\KSOMETHING(?=</version.suffix>)'

Assuming the formatting of the question is accurate, when I run the example in the question as-is:
$ echo '<version.suffix>SOMETHING</version.suffix>' | sed 's#.*>\(.*\)\<version\.suffix\>#\1#'
I see the following output:
SOMETHING</>
In case my formatting skills fail me, this output ends with the trailing left angle bracket, a forward slash, and finally the right angle bracket.
So, why this "failure"? Well, on my system (Linux with GNU grep 2.14), the grep(1) man page includes the following snippet:
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the beginning and end of a word.
Other answers suggest good alternatives to extract the value in XML tag syntax; use them.
I just wanted to point out why the RE in the original problem fails on current Linux systems: some escape sequences match no actual characters, but instead match empty word boundaries in tools that support these GNU regular expression extensions. So, in this example, the brackets in the source are matched in unexpected ways:
the \(.*\) has matched SOMETHING</, to be printed by the \1 back-reference
the left-hand side of version.suffix is matched by \<
version.suffix is matched by version\.suffix
the right-hand side of version.suffix is matched by \>
the trailing > character remains in sed's pattern space and is printed.
TL;DR: "\X" does not mean "just match an X" for all X!
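The difference can be seen in isolation with a small sketch. It assumes GNU sed, since \< and \> are GNU extensions and may not be recognised by other sed implementations; the sample strings are invented.

```shell
# Assumes GNU sed: \< and \> match the empty string at word boundaries.
boundary=$(printf '%s\n' 'cat concat' | sed 's/\<cat\>/dog/g')

# Unescaped < and > are ordinary characters in sed's basic regular expressions.
literal=$(printf '%s\n' 'a<b>c' | sed 's/<b>/X/')

printf '%s\n' "$boundary" "$literal"
```

This should print dog concat (only the whole word cat is replaced) and then aXc.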

Perl line runs 30 times quicker with double quotes than with single quotes

We have a task to change some strings in binary files to lowercase (from mixed/upper/whatever). The relevant strings are references to the other files (it's in connection with an upgrade where we are also moving from Windows to linux as a server environment, so the case suddenly matters). We have written a script which uses a perl loop to do this. We have a directory containing around 300 files (total size of the directory is around 150M) so it's some data but not huge amounts.
The following perl code takes about 6 minutes to do the job:
for file_ref in `ls -1F $forms6_convert_dir/ | grep -v "/" | sed 's/\(.*\)\..*/\1/'`
do
(( updated++ ))
write_line "Converting case of string: $file_ref "
perl -i -pe "s{(?i)$file_ref}{$file_ref}g" $forms6_convert_dir/*
done
while the following perl code takes over 3 hours!
for file_ref in `ls -1F $forms6_convert_dir/ | grep -v "/" | sed 's/\(.*\)\..*/\1/'`
do
(( updated++ ))
write_line "Converting case of string: $file_ref "
perl -i -pe 's{(?i)$file_ref}{$file_ref}g' $forms6_convert_dir/*
done
Can anyone explain why? Is it that $file_ref is left as the literal string $file_ref, instead of being substituted with its value, in the single-quoted version? In that case, what is it replacing in this version? What we want is to replace all occurrences of any filename with itself but in lowercase. If we run strings on the files before and after and search for the filenames, then both appear to have made the same changes. However, if we run diff on the files produced by the two loops (diff firstloop/file1 secondloop/file1), it reports that they differ.
This is running from within a bash script on linux.

The shell doesn't do variable substitution for single quoted strings. So, the second one is a different program.

As the other answers said, the shell doesn't substitute variables inside single quotes, so the second version is executing the literal Perl statement s{(?i)$file_ref}{$file_ref}g for every line in every file.
As you said in a comment, if $ is the end-of-line metacharacter, $file_ref could never match anything. $ matches before the newline at end-of-line, so the next character would have to be a newline. Therefore, Perl doesn't interpret $ as the metacharacter; it interprets it as the beginning of a variable interpolation.
In Perl, the variable $file_ref is undef, which is treated as the empty string when interpolated. So you're really executing s{(?i)}{}g, which says to replace the empty string with the empty string, and do that for all occurrences in a case-insensitive manner. Well, there's an empty string between every pair of characters, plus one at the beginning and end of each line. Perl is finding each one and replacing it with the empty string. This is a no-op, but it's an expensive one, hence the 3-hour run time.
You must be mistaken about both versions making the same changes. As I just explained, the single-quoted version is just an expensive no-op; it doesn't make any changes at all to the file contents (it just makes a fresh copy of each file). The files you ran it on must have already been converted to lower case.
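One way to see what Perl actually receives in each case is to let the shell build the two strings and print them. This is only a sketch of the quoting behaviour; the value MyForm is an invented stand-in for a real file name.

```shell
file_ref="MyForm"   # hypothetical value, standing in for a name from the loop

# Single quotes: the shell passes the text through untouched.
single='s{(?i)$file_ref}{$file_ref}g'

# Double quotes: the shell expands $file_ref before Perl ever sees it.
double="s{(?i)$file_ref}{$file_ref}g"

printf '%s\n' "$single" "$double"
```

The first line printed still contains the literal text $file_ref, which Perl then treats as its own (undefined) variable; the second line contains MyForm in both places.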

With double quotes you are using the shell variable, with single quotes Perl is trying to use a variable of that name.
You might wish to consider writing the whole lot in either Perl or Bash to speed things up. Both languages can read files and do pattern matching. In Perl you can change to lower-case using the lc built-in function, and in Bash 4 you can use ${file,,}.
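For the Bash side of that suggestion, a minimal sketch of the lower-casing step might look like this. It assumes Bash 4 or later for ${var,,}; the tr line is a more portable alternative, and the sample value is invented.

```shell
file_ref="MyForm_ABC"   # hypothetical mixed-case name

# Bash 4+ only: convert the whole value to lower case.
lower_bash=${file_ref,,}

# Portable alternative using tr, for shells without ${var,,}.
lower_tr=$(printf '%s' "$file_ref" | tr '[:upper:]' '[:lower:]')

printf '%s\n' "$lower_bash" "$lower_tr"
```

Both lines should print myform_abc.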

Extract file basename without path and extension in bash [duplicate]

Given file names like these:
/the/path/foo.txt
bar.txt
I hope to get:
foo
bar
Why doesn't this work?
#!/bin/bash
fullfile=$1
fname=$(basename $fullfile)
fbname=${fname%.*}
echo $fbname
What's the right way to do it?

You don't have to call the external basename command. Instead, you could use the following commands:
$ s=/the/path/foo.txt
$ echo "${s##*/}"
foo.txt
$ s=${s##*/}
$ echo "${s%.txt}"
foo
$ echo "${s%.*}"
foo
Note that this solution should work in all recent (post 2004) POSIX compliant shells, (e.g. bash, dash, ksh, etc.).
Source: Shell Command Language 2.6.2 Parameter Expansion
More on bash String Manipulations: http://tldp.org/LDP/LG/issue18/bash.html
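Put together as a reusable function, the two expansions might look like the sketch below. The function name strip_path_and_ext is made up for this example, and the third path is an extra test case showing that only the last extension is removed.

```shell
# strip_path_and_ext is a hypothetical helper built only from POSIX expansions.
strip_path_and_ext() {
    name=${1##*/}                 # drop the directory part (longest match of */)
    printf '%s\n' "${name%.*}"    # drop the last extension (shortest match of .*)
}

strip_path_and_ext /the/path/foo.txt
strip_path_and_ext bar.txt
strip_path_and_ext /the/path/archive.tar.gz
```

This should print foo, bar and archive.tar.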

The basename command has two different invocations; in one, you specify just the path, in which case it gives you the last component, while in the other you also give a suffix that it will remove. So, you can simplify your example code by using the second invocation of basename. Also, be careful to correctly quote things:
fbname=$(basename "$1" .txt)
echo "$fbname"

A combination of basename and cut works fine, even in case of double ending like .tar.gz:
fbname=$(basename "$fullfile" | cut -d. -f1)
It would be interesting to know whether this solution needs less processing power than Bash parameter expansion.
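One thing to keep in mind is that cut -d. -f1 keeps only the text before the first dot, so it behaves like %%.* rather than %.*. A small sketch with an invented file name shows the difference:

```shell
f="/the/path/backup.2024.tar.gz"   # hypothetical name with several dots

via_cut=$(basename "$f" | cut -d. -f1)   # everything before the first dot

name=${f##*/}
shortest=${name%.*}    # removes only the last extension
longest=${name%%.*}    # removes everything from the first dot onward

printf '%s\n' "$via_cut" "$shortest" "$longest"
```

This should print backup, backup.2024.tar and backup, so the cut version also drops the 2024 part of the name.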

Here are oneliners:
$(basename "${s%.*}")
$(basename "${s}" ".${s##*.}")
I needed this, the same as asked by bongbang and w4etwetewtwet.

Pure bash, no basename, no variable juggling. Set a string and echo:
p=/the/path/foo.txt
echo "${p//+(*\/|.*)}"
Output:
foo
Note: the bash extglob option must be "on", (Ubuntu sets extglob "on" by default), if it's not, do:
shopt -s extglob
Walking through the ${p//+(*\/|.*)}:
1. ${p -- start with $p.
2. // substitute every instance of the pattern that follows.
3. +( match one or more of the pattern list in parentheses (i.e. until item #7 below).
4. 1st pattern: *\/ matches anything before a literal "/" char.
5. pattern separator | which in this instance acts like a logical OR.
6. 2nd pattern: .* matches anything after a literal "." -- that is, in bash the "." is just a period char, and not a regex dot.
7. ) end pattern list.
8. } end parameter expansion. With a string substitution, there's usually another / there, followed by a replacement string. But since there's no / there, the matched patterns are substituted with nothing; this deletes the matches.
Relevant man bash background:
pattern substitution:
${parameter/pattern/string}
Pattern substitution. The pattern is expanded to produce a pattern just as in pathname expansion. Parameter is expanded and the longest match of pattern against its value is replaced with string. If pattern begins with /, all matches of pattern are replaced with string. Normally only the first match is replaced. If pattern begins with #, it must match at the beginning of the expanded value of parameter. If pattern begins with %, it must match at the end of the expanded value of parameter. If string is null, matches of pattern are deleted and the / following pattern may be omitted. If parameter is @ or *, the substitution operation is applied to each positional parameter in turn, and the expansion is the resultant list. If parameter is an array variable subscripted with @ or *, the substitution operation is applied to each member of the array in turn, and the expansion is the resultant list.
extended pattern matching:
If the extglob shell option is enabled using the shopt builtin, several extended pattern matching operators are recognized. In the following description, a pattern-list is a list of one or more patterns separated by a |. Composite patterns may be formed using one or more of the following sub-patterns:
?(pattern-list)
Matches zero or one occurrence of the given patterns
*(pattern-list)
Matches zero or more occurrences of the given patterns
+(pattern-list)
Matches one or more occurrences of the given patterns
@(pattern-list)
Matches one of the given patterns
!(pattern-list)
Matches anything except one of the given patterns
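To get a feel for these operators outside of the one-liner, here is a small sketch that combines two of them with the prefix and suffix removal expansions. It assumes bash with extglob turned on before the patterns are parsed; the variable names and sample values are invented.

```shell
# extglob must be enabled on an earlier line than any pattern that uses it.
shopt -s extglob

v=aaabbb
f=file.tar.gz

longest=${v##+(a)}          # longest prefix made of one or more "a"
shortest=${v#+(a)}          # shortest such prefix: a single "a"
stem=${f%@(.tar.gz|.tgz)}   # strip exactly one of the listed suffixes

printf '%s\n' "$longest" "$shortest" "$stem"
```

This should print bbb, aabbb and file.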

Here is another (more complex) way of getting either the filename or extension, first use the rev command to invert the file path, cut from the first . and then invert the file path again, like this:
filename=`rev <<< "$1" | cut -d"." -f2- | rev`
fileext=`rev <<< "$1" | cut -d"." -f1 | rev`
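For comparison, the same two values can be obtained without rev or cut by using the suffix and prefix removal expansions. This is only a sketch with an invented path; like the rev version, it splits at the last dot.

```shell
path="/the/path/archive.tar.gz"   # hypothetical input

filename=${path%.*}    # everything before the last dot
fileext=${path##*.}    # everything after the last dot

printf '%s\n' "$filename" "$fileext"
```

This should print /the/path/archive.tar and gz, which is what the rev pipeline produces for the same input.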

If you want to play nice with Windows file paths (under Cygwin) you can also try this:
fname=${fullfile##*[/\\]}
This will account for backslash separators when using Bash on Windows.
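A quick sketch with one Windows-style and one Unix-style path (both invented) shows the bracket expression handling either separator:

```shell
win='C:\Users\me\foo.txt'     # hypothetical Windows-style path
unix='/the/path/foo.txt'

# [/\\] matches either a forward slash or a backslash.
win_name=${win##*[/\\]}
unix_name=${unix##*[/\\]}

printf '%s\n' "$win_name" "$unix_name"
```

Both lines should print foo.txt.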

Just an alternative that I came up with to extract an extension, using the posts in this thread with my own small knowledge base that was more familiar to me.
ext="$(rev <<< "$(cut -f "1" -d "." <<< "$(rev <<< "file.docx")")")"
Note: Please advise on my use of quotes; it worked for me but I might be missing something on their proper use (I probably use too many).

Use the basename command. Its manpage is here: http://unixhelp.ed.ac.uk/CGI/man-cgi?basename

Prepend to regex match

I have a variable in a bash script that I need to modify. The only constant in the line is that it will end in "_(*x)xxxp.mov", where the x's are digits and there can be either 3 or 4 of them. For example, I know how to replace the value, but only if it is a constant:
echo 'whiteout-tlr1_1080p.mov' | sed 's/_[0-9]*[0-9][0-9][0-9]p.mov/_h1080p.mov/g'
How can I carry over the regex match to replacement line?
Edit:
OK, I just learned that grep can print only the match. Would it be better to do something like this?
urltrail=$(echo $# | grep -o [0-9]*[0-9][0-9][0-9]p.mov)
newurl=$(sed 's/$urltrail/h$urltrail/g')
Hmm, tried the above but am getting a hang.

Back Reference
sed 's/_\([0-9]*[0-9][0-9][0-9]\)p.mov/_h\1p.mov/g'
The back-reference \n, where n is a single digit, matches the substring previously matched by the nth parenthesized subexpression of the regular expression.
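Wrapped in a function, the back-reference version can be tried on a few names. The function name add_h_prefix and the second file name are made up for this sketch; the dot is escaped and the pattern is anchored with $ so that it only matches at the end of the name.

```shell
# add_h_prefix is a hypothetical helper: it inserts "h" before the resolution
# digits captured by \( ... \) and puts them back with \1.
add_h_prefix() {
    printf '%s\n' "$1" | sed 's/_\([0-9]*[0-9][0-9][0-9]\)p\.mov$/_h\1p.mov/'
}

add_h_prefix 'whiteout-tlr1_1080p.mov'
add_h_prefix 'clip_720p.mov'
```

This should print whiteout-tlr1_h1080p.mov and clip_h720p.mov.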

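You're not piping the old path into sed, so sed hangs waiting for input. The single quotes also stop the shell from expanding $urltrail, so use double quotes around the sed expression:
newurl=$(echo $# | sed "s/$urltrail/h$urltrail/g")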
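A self-contained sketch of the two-step grep and sed approach, with the input held in a variable, might look like this. The variable url and its value are assumptions for the example, and grep -o is assumed to be available (it is in GNU and BSD grep).

```shell
url='whiteout-tlr1_1080p.mov'   # hypothetical input value

# Step 1: extract the trailing "<digits>p.mov" part.
urltrail=$(printf '%s\n' "$url" | grep -o '[0-9]*[0-9][0-9][0-9]p\.mov')

# Step 2: double quotes let the shell expand $urltrail inside the sed script.
newurl=$(printf '%s\n' "$url" | sed "s/$urltrail/h$urltrail/")

printf '%s\n' "$urltrail" "$newurl"
```

This should print 1080p.mov and whiteout-tlr1_h1080p.mov. The single sed command with a back-reference does the same job in one step.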
