pattern matching in Bash of Linux - linux

Does anybody know how to delete the pattern "#TechCrunch:" in the following str by sed in Linux?
str="0,RT #TechCrunch: The Tyranny Of Government And Our Duty Of Confidentiality As Bloggers."
So the desired output will be:
"0,RT The Tyranny Of Government And Our Duty Of Confidentiality As Bloggers."
I tried many ways but none of them works, e.g.:
echo $str | sed 's/#[a-zA-Z]*\ //'

Using sed (or any other external tool) for a single line that's already in a shell variable is hideously inefficient. Much easier to have the shell do the replacement itself.
#!/bin/bash
# ^- must be /bin/bash, not /bin/sh, for extglobs to be available
shopt -s extglob # put this somewhere early in your script to enable extended globs
str="0,RT #TechCrunch: The Tyranny Of Government And Our Duty Of Confidentiality As Bloggers."
echo "${str//#+([[:alpha:]]): /}"
This uses extglob syntax to provide more powerful pattern matches with built-in shell pattern matching; +(foo) is the extglob equivalent to the regex form (foo)+.
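As a minimal sketch, the same expansion can be wrapped in a function so it is easy to try on several inputs. The name strip_tag is hypothetical (it is not part of the answer above), and the sketch assumes bash with extglob enabled:

```shell
#!/bin/bash
shopt -s extglob   # required for the +( ... ) pattern below

# strip_tag (hypothetical helper): remove every "#Word: " occurrence from $1
strip_tag() {
    local s=$1
    printf '%s\n' "${s//#+([[:alpha:]]): /}"
}

strip_tag "0,RT #TechCrunch: The Tyranny Of Government"
# prints: 0,RT The Tyranny Of Government
```

Because the expansion uses // (replace all), the leading # in the pattern is taken literally rather than as a start-of-string anchor.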

You were close - just missing the :.
perl -pe 's/#\w*:\s//i'
Or in sed:
sed -e 's/#[a-z]*: //i' # the i flag (case-insensitive) is a GNU extension

: is not matched by [a-zA-Z]. Also, there's no need to backslash the space.
sed 's/#[a-zA-Z]*: //'
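For completeness, a small sketch of that sed command applied to the string from the question. The function name strip_tag_sed is an assumption made for illustration; printf is used instead of echo only to keep the input intact:

```shell
# strip_tag_sed (hypothetical helper): delete the first "#Word: " from $1
strip_tag_sed() {
    # [a-zA-Z]* consumes the tag name; the literal ": " must follow it
    printf '%s\n' "$1" | sed 's/#[a-zA-Z]*: //'
}

strip_tag_sed "0,RT #TechCrunch: The Tyranny Of Government And Our Duty Of Confidentiality As Bloggers."
# prints: 0,RT The Tyranny Of Government And Our Duty Of Confidentiality As Bloggers.
```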

Related

Delete _ and - characters using sed

I am trying to convert 2015-06-03_18-05-30 to 20150603180530 using sed.
I have this:
$ var='2015-06-03_18-05-30'
$ echo $var | sed 's/\-\|\_//g'
$ echo $var | sed 's/-|_//g'
None of these are working. Why is the alternation not working?
As long as your script has a #!/bin/bash (or ksh, or zsh) shebang, don't use sed or tr: Your shell can do this built-in without the (comparatively large) overhead of launching any external tool:
var='2015-06-03_18-05-30'
echo "${var//[-_]/}"
That said, if you really want to use sed, the GNU extension -r enables ERE syntax:
$ sed -r -e 's/-|_//g' <<<'2015-06-03_18-05-30'
20150603180530
See http://www.regular-expressions.info/posix.html for a discussion of differences between BRE (default for sed) and ERE. That page notes, in discussing ERE extensions:
Alternation is supported through the usual vertical bar |.
If you want to work on POSIX platforms -- with /bin/sh rather than bash, and no GNU extensions -- then reformulate your regex to use a character class (and, to avoid platform-dependent compatibility issues with echo[1], use printf instead):
printf '%s\n' "$var" | sed 's/[-_]//g'
[1] - See the "APPLICATION USAGE" section of that link, in particular.
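The difference can be seen in a short sketch. In basic regular expressions an unescaped | is an ordinary character, so the alternation only matches a literal "-|_" sequence, while the character class works everywhere. The helper name strip_sep is hypothetical:

```shell
# strip_sep (hypothetical helper): delete every "-" and "_" from $1
strip_sep() {
    printf '%s\n' "$1" | sed 's/[-_]//g'
}

strip_sep '2015-06-03_18-05-30'
# prints: 20150603180530

# In BRE the pattern -|_ is three literal characters, so only "-|_" is removed:
printf '%s\n' 'a-|_b' | sed 's/-|_//g'
# prints: ab
```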
Something like this ought to do.
sed 's/[-_]//g'
This reads as:
s: substitute
/[-_]/: any single character matching - or _
//: with nothing
g: and do that for every match in the line
Sed operates on every line by default, so this covers every instance in the file/string.
I know you asked for a solution using sed, but I offer an alternative in tr:
$ var='2015-06-03_18-05-30'
$ echo $var | tr -d '_-'
20150603180530
tr should be a little faster.
Explained:
tr stands for translate and it can be used to replace certain characters with other ones.
-d option stands for delete and it removes the specified characters instead of replacing them.
'_-' specifies the set of characters to be removed (can also be specified as '\-_' but you need to escape the - there because it's considered another option otherwise).
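A brief sketch of the two ways to pass the set safely. The helper name strip_sep_tr is hypothetical, and the second form assumes a tr that honors the conventional -- end-of-options marker:

```shell
# strip_sep_tr (hypothetical helper): delete every "_" and "-" from $1
strip_sep_tr() {
    # With the hyphen last, the set cannot be mistaken for an option or a range
    printf '%s\n' "$1" | tr -d '_-'
}

strip_sep_tr '2015-06-03_18-05-30'
# prints: 20150603180530

# Alternative: keep the hyphen first, but end option parsing with --
printf '%s\n' '2015-06-03_18-05-30' | tr -d -- '-_'
# prints: 20150603180530
```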
Easy:
sed 's/[-_]//g'
The character class [-_] matches any one of the characters in the set.
sed 's/[^[:digit:]]//g' YourFile
Could you tell me what failed with echo $var | sed 's/\-\|\_//g'? It works here (even though escaping - and _ is not needed, and assuming you use GNU sed, since \| only works in that enhanced version of sed).

Bash to transform string `3.11.0.17.16` into `3.11.0-17-generic`

I'm trying to transform this 3.11.0.17.16 into 3.11.0-17-generic using only bash and unix tools. The 16 in the original string can be anything. I feel like sed is the answer, but I'm not comfortable with its flavor of regex. How would you do this?
Version using awk instead of sed:
echo "3.11.0.17.16" | awk -F. '{printf "%s.%s.%s-%s-generic\n",$1,$2,$3,$4}'
echo "3.11.0.17.16" | sed 's/\.\([0-9][0-9]*\)\.[0-9][0-9]*$/-\1-generic/'
3.11.0-17-generic
This only accepts digits in the final component. If you want to accept arbitrary characters other than . there (you can't allow . or the match will become ambiguous) then write instead
echo "3.11.0.17.gr#wl1x" | sed 's/\.\([0-9][0-9]*\)\.[^.][^.]*$/-\1-generic/'
In a portable sed invocation you are limited to POSIX basic regular expressions, which most importantly means you cannot use +, ?, or |, and ( ) { } are ordinary characters unless \-escaped. Many sed implementations now accept an -E option that brings their regex syntax in line with egrep, but for a long time that option was not part of POSIX, so you cannot rely on it on older systems.
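As a sketch of staying within basic regular expressions, the interval \{1,\} can stand in for + instead of repeating the bracket expression. The function name to_generic is an assumption for illustration:

```shell
# to_generic (hypothetical helper): turn A.B.C.D.E into A.B.C-D-generic
to_generic() {
    # \{1,\} is the BRE spelling of "one or more"; \( \) form the capture group
    printf '%s\n' "$1" | sed 's/\.\([0-9]\{1,\}\)\.[^.]\{1,\}$/-\1-generic/'
}

to_generic 3.11.0.17.16
# prints: 3.11.0-17-generic
to_generic '3.11.0.17.gr#wl1x'
# prints: 3.11.0-17-generic
```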
Substring removal using bash parameter expansion and extended globs
shopt -s extglob
version=3.11.0.17.16
version=${version%.+(!(.))}
printf "%s-%s-generic\n" "${version%.+(!(.))}" "${version##*.}"
3.11.0-17-generic
If you anchor the regex you are trying to match onto the last 3 sets of digits you would get
echo "3.11.0.17.16" | sed 's!\([0-9]*\)\.\([0-9]*\)\.\([0-9]*\)$!\1-\2-generic!'

Specify multiple possible patterns for a single command

Basically there are a few lines which contain a common format, but different wording at the end. The command will work for all of them, but I want to match all possible patterns, thereby needing only 1 line in the script. As an example, I know how to make the script work like so:
/pattern1/ s/asdf/ghjk/g
/pattern2/ s/asdf/ghjk/g
/pattern3/ s/asdf/ghjk/g
Any ideas?
If your patterns are really as similar as in your example, you can use
sed -e '/pattern[1-3]/ s/asdf/ghjk/g'
If the patterns aren't so similar and your sed command supports extended regular expressions, you can use
sed -E -e '/(pattern1|pattern2|pattern3)/ s/asdf/ghjk/g'
#   ^^ use extended regular expressions
# for GNU sed, use -r or escape (, |, and ) with \
If your sed command doesn't support extended regular expressions, you might have to turn to awk or perl:
perl -ple '/(pattern1|pattern2|pattern3)/ && s/asdf/ghjk/g'
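If neither extended regular expressions nor perl is available, one possible sketch stays within basic sed by branching to a shared label, so the substitution is still written only once. The label name doit and the function name multi_sub are made up for this example:

```shell
# multi_sub (hypothetical helper): apply one substitution to lines
# matching any of several patterns, using only basic sed features
multi_sub() {
    sed -e '/pattern1/b doit' \
        -e '/pattern2/b doit' \
        -e '/pattern3/b doit' \
        -e 'b' \
        -e ':doit' \
        -e 's/asdf/ghjk/g'
}

printf '%s\n' 'pattern2 asdf' 'other asdf' | multi_sub
# prints:
# pattern2 ghjk
# other asdf
```

The bare b skips to the end of the script for lines that matched none of the patterns, so only matching lines reach the substitution.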

Sed:Replace a series of dots with one underscore

I want to do some simple string replacement in Bash with sed. I am on Ubuntu 10.10.
Just see the following code, it is self-explanatory:
name="A%20Google.."
echo $name|sed 's/\%20/_/'|sed 's/\.+/_/'
I want to get A_Google_ but I get A_Google..
The sed 's/\.+/_/' part is obviously wrong.
BTW, sed 's/\%20/_/' and sed 's/%20/_/' both work. Which is better?
sed speaks POSIX basic regular expressions, which don't include + as a metacharacter. Portably, rewrite to use *:
sed 's/\.\.*/_/'
or if all you will ever care about is Linux, you can use various GNU-isms:
sed -r 's/\.+/_/' # turn on POSIX EREs (use -E instead of -r on OS X)
sed 's/\.\+/_/' # GNU regexes invert behavior when backslash added/removed
That last example answers your other question: a character which is literal when used as is may take on a special meaning when backslashed, and even though at the moment % doesn't have a special meaning when backslashed, future-proofing means not assuming that \% is safe.
Additional note: you don't need two separate sed commands in the pipeline there.
echo "$name" | sed -e 's/%20/_/' -e 's/\.\.*/_/'
(Also, do you only need to do that once per line, or for all occurrences? You may want the /g modifier.)
The sed command doesn't understand + so you'll have to expand it by hand:
sed 's/\.\.*/_/'
Or tell sed that you want to use extended regexes:
sed -r 's/\.+/_/' # GNU
sed -E 's/\.+/_/' # OSX
Which switch, -r or -E, depends on your sed and it might not even support extended regexes so the portable solution is to use \.\.* in place of \.+. But, since you're on Linux, you should have GNU sed so sed -r should do the trick.
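Putting the portable pieces together, a minimal sketch might look like this. The function name clean_name is hypothetical, and the g flag is an assumption that every occurrence on the line should be replaced:

```shell
# clean_name (hypothetical helper): turn %20 into _ and collapse runs of dots into _
clean_name() {
    # \.\.* is the basic-regex spelling of "one or more dots"
    printf '%s\n' "$1" | sed -e 's/%20/_/g' -e 's/\.\.*/_/g'
}

clean_name 'A%20Google..'
# prints: A_Google_
```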

Using ? with sed

I just want to get the number of a file that may or may not be gzip'd. However, it appears that a regular expression in sed does not support a ?. Here's what I tried:
echo 'file_1.gz'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'
and nothing was returned. Then I added a ? to the string being analyzed:
echo 'file_1.gz?'|sed -n 's/.*_\(.*\)\(\.gz\)?/\1/p'
and got:
1
So, it looks like the ? used in most regex's is not supported in sed, right? Well then, I would just like sed to give a 1 for file_1 and file_1.gz. What's the best way to do that in a bash script if execution time is critical?
The equivalent to x? is \(x\|\) (note that \| is itself a GNU extension in basic regular expressions).
However, many versions of sed support an option to enable "extended regular expressions" which includes ?. In GNU sed the flag is -r. Note that this also changes unescaped parens to do grouping. eg:
echo 'file_1.gz'|sed -n -r 's/.*_(.*)(\.gz)?/\1/p'
Actually, there's another bug in your regex which is that the greedy .* in the parens is going to swallow up the ".gz" if there is one. sed doesn't have a non-greedy equivalent to * as far as I know, but you can use | to work around this. | in sed (and many other regex implementations) will use the leftmost match that works, so you can do something like this:
echo 'file_1.gz'|sed -r 's/(.*_(.*)\.gz)|(.*_(.*))/\2\4/'
This tries to match with .gz, and only tries without it if that doesn't work. Only one of group 2 or 4 will actually exist (since they are on opposite sides of the same |) so we just concatenate them to get the value we want.
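As a portable alternative to the workaround above, here is a sketch that avoids both ? and |: the interval \{0,1\} is the basic-regex way to say "optional", and [^.]* keeps the capture from swallowing the extension. The function name file_num is an assumption for illustration, and the pattern only handles a .gz suffix or none:

```shell
# file_num (hypothetical helper): print what follows the last "_" in $1,
# ignoring an optional trailing ".gz"
file_num() {
    printf '%s\n' "$1" | sed -n 's/.*_\([^.]*\)\(\.gz\)\{0,1\}$/\1/p'
}

file_num file_1.gz
# prints: 1
file_num file_1
# prints: 1
```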
If you're looking for an answer to the specific example given in the question, or why it uses the ? incorrectly (regardless of syntax), see the answer by Laurence Gonsalves.
If you're looking instead for the answer to the general question of why ? doesn't exhibit its special meaning in sed as you might expect:
By default, sed uses the "POSIX basic regular expressions syntax", so the question mark must be escaped as \? to apply its special meaning, otherwise it matches a literal question mark. As an alternative, you can use the -r or --regexp-extended option to use the "extended regular expression syntax", which reverses the meaning of escaped and non-escaped special characters, including ?.
In the words of the GNU sed documentation (view by running 'info sed' on Linux):
The only difference between basic and extended regular expressions is in
the behavior of a few characters: '?', '+', parentheses, and braces
('{}'). While basic regular expressions require these to be escaped if
you want them to behave as special characters, when using extended
regular expressions you must escape them if you want them to match a
literal character.
and the option is explained:
-r
--regexp-extended
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that `egrep' accepts;
they can be clearer because they usually have less backslashes,
but are a GNU extension and hence scripts that use them are not
portable.
Update
Newer versions of GNU sed now say this:
-E
-r
--regexp-extended
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that 'egrep' accepts; they
can be clearer because they usually have fewer backslashes.
Historically this was a GNU extension, but the '-E' extension has
since been added to the POSIX standard
(http://austingroupbugs.net/view.php?id=528), so use '-E' for
portability. GNU sed has accepted '-E' as an undocumented option
for years, and *BSD seds have accepted '-E' for years as well, but
scripts that use '-E' might not port to other older systems.
So, if you need to preserve compatibility with ancient GNU sed, stick with -r. But if you prefer better cross-platform portability on more modern systems (e.g. Linux+Mac support), go with -E (but note that there are still some quirks and differences between GNU sed and BSD sed, so you'll have to make sure your scripts are portable in any case).
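One way to hedge between the two switches is to probe the local sed at run time. This is only a sketch: the function name sed_ere_flag is made up, and it assumes that a sed without -E exits with a non-zero status when given that option:

```shell
# sed_ere_flag (hypothetical helper): print -E if this sed accepts it, else -r
sed_ere_flag() {
    if printf 'aa\n' | sed -E 's/a+/b/' >/dev/null 2>&1; then
        printf '%s\n' -E
    else
        printf '%s\n' -r
    fi
}

flag=$(sed_ere_flag)
printf '%s\n' 'file_12.txt' | sed "$flag" 's/.*_([^.]*)(\.[^.]+)?$/\1/'
# prints: 12
```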
echo 'file_1.gz'|sed -n 's/.*_\(.*\)\?\(\.gz\)/\1/p'
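Works for file_1.gz. You have to put the ? in the right spot, and you have to escape it. Note that this prints nothing for file_1, because \(\.gz\) is still required by the pattern.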
A function that should return a number that follows the '_' in a filename, regardless of file extension:
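realname () {
    local n=${1##*/}     # strip any leading directory
    local rn=${n%.*}     # strip the last extension, if there is one
    # feed the name itself to sed (not a file) and drop everything up to the last _
    sed 's/^.*_//' <<<"${rn:-$n}"
}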
You should use awk which is superior to sed when it comes to field grabbing/parsing:
$ awk -F'[._]' '{print $2}' <<<"file_1"
1
$ awk -F'[._]' '{print $2}' <<<"file_1.gz"
1
Alternatively you can just use Bash's parameter expansion like so:
var=file_1.gz;
temp=${var#*_};
file=${temp%.*}
echo $file
Note: works when var=file_1 as well
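The same expansions can be collected into a function, which avoids starting any external process. This is a sketch under the assumption that the number follows the first underscore in the base name; the function name file_number is hypothetical:

```shell
# file_number (hypothetical helper): print the part between the first "_"
# and the first following "." using only parameter expansion
file_number() {
    local name="${1##*/}"   # drop any directory part
    name="${name#*_}"       # drop everything up to the first "_"
    printf '%s\n' "${name%%.*}"   # drop the first "." and everything after it
}

file_number file_1.gz
# prints: 1
file_number file_123
# prints: 123
```

Using %%.* rather than %.* also handles double extensions such as .tar.gz.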
Part of the solution lies in escaping the question mark or using the -r option.
sed 's/.*_\([^.]*\)\(\.\?[^.]\+\)\?$/\1/'
or
sed -r 's/.*_([^.]*)(\.?[^.]+)?$/\1/'
will work for:
file_1.gz
file_12.txt
file_123
resulting in:
1
12
123
I just realized that I could do something very easy:
echo 'file_1.gz'|sed -n 's/.*_\([0-9]*\).*/\1/p'
Notice the [0-9]* instead of a .*. @Laurence Gonsalves's answer made me realize the greediness of my previous attempt.
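Since execution time was a concern in the question, the same digits-only idea can also be sketched with bash's built-in regex matching, which starts no external process. The function name file_digits is an assumption; unlike the sed command above, it picks the first underscore that is followed by digits rather than the last underscore:

```shell
#!/bin/bash
# file_digits (hypothetical helper): print the digits that follow a "_" in $1
file_digits() {
    # =~ uses extended regular expressions, so + and ( ) need no backslashes
    if [[ $1 =~ _([0-9]+) ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    fi
}

file_digits file_1.gz
# prints: 1
file_digits file_123
# prints: 123
```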

Resources