Add spaces after punctuation marks with sed


I need to capitalize a txt file but I found some problems when I try to add a space after any punctuation mark with sed. For instance: "Hello,World" -> to "Hello, World"
I tried the following:
#!/bin/bash
if [ $# != 1 ]; then
echo "No parameter"
exit
fi
cp $1 $1.bak
ARCH1=/tmp/`basename $1`.$$
sed 's/[A-Z]*/\L&/g' $1 > $ARCH1
sed -i 's/^./\u&/' $ARCH1
sed 's/ */\ /g' $ARCH1 #Here I replace >= 2 spaces for 1
sed 's/, */, /g' $ARCH1
#These 2 lines don't work well
sed 's/. */. /g' $ARCH1
sed 's/; */; /g' $ARCH1
mv $ARCH1 $1
The script doesn't crash, but the output is not the one that I expect.

I believe the reason your script doesn't work is that you forgot to pass -i to sed in several calls, and also that you don't escape . in the regex, so that . matches any character.
I also believe that a simpler way to do what you're trying to do is
sed -i.bak 's/[A-Z]*/\L&/g; s/\([.,;]\) */\1 /g' "$1"
-i.bak edits the file in-place and creates a backup with the .bak extension, and the script is simply
s/[A-Z]*/\L&/g # lower-case everything (I got that from your code)
s/\([.,;]\) */\1 /g # put exactly one space after every period, comma or semicolon
Here
[.,;] is a character set matching period, comma or semicolon,
\(stuff\) captures stuff in a group for later use, and
\1 is a back reference referring to the first such capture.
Note that this is a very simple approach. If your text, for example, contains ellipses (...), it'll waltz right over that and make ... into . . ., and similar caveats apply for ?! and such.
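To make the caveat concrete, here is a minimal sketch of the punctuation step on its own. The function name space_after_punct is hypothetical, and the g flag is included so that every mark on a line is handled.

```shell
# Hypothetical helper: put exactly one space after each '.', ',' or ';'.
space_after_punct() {
    sed 's/\([.,;]\) */\1 /g'
}

printf '%s\n' 'Hello,World;bye' | space_after_punct   # Hello, World; bye
printf '%s\n' 'Wait...what'     | space_after_punct   # Wait. . . what
```

A mark at the very end of a line also gets a trailing space with this pattern, which may or may not matter for your file.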

Using GNU sed:
$ echo "foo;BAR,BaZ.qux" | sed -r 's/[[:punct:]]+/& /g; s/[[:alnum:]]+/\L\u&/g'
Foo; Bar, Baz. Qux
\L lower cases the whole word, then \u upper cases the first character.
See your regex(7) man page for regular expression documentation.

Related

Terminal SED regex fails with dash and slash

I try to convert filenames and remove special chars and whitespaces.
For some reason my sed regex doesn't work if I list both the dash and the slash as characters to keep.
Example:
echo "/path/to/file 20-456 (1).jpg" | sed -e 's/ /_/g' -e 's/[^0-9a-zA-Z\.\_\-\/]//g'
Output:
/path/to/file_20456_1.jpg
So the dash isn't in.
When I try this command:
echo "/path/to/file 20-456 (1).jpg" | sed -e 's/ /_/g' -e 's/[^0-9a-zA-Z\.\_\-]//g'
Output:
pathtofile_20-456_1.jpg
The dash is there, but without the directory slashes I can't move the files.
I wonder why the dash is no longer kept once I add \/ to the regex pattern.
Any suggestions?

With your shown samples and attempts, please try following awk code.
echo "/path/to/file 20-456 (1).jpg" |
awk 'BEGIN{FS=OFS="/"} {gsub(/ /,"_",$NF);gsub(/-|\(|\)/,"",$NF)} 1'
Explanation: echo feeds /path/to/file 20-456 (1).jpg to awk on standard input. In the BEGIN section, FS and OFS are both set to /. The main block then uses gsub to replace every space with _ in the last field ($NF), and uses gsub again to delete every -, ( or ) from the last field. The trailing 1 prints the resulting line.

You may get the result using string manipulation in Bash:
#!/bin/bash
path="/path/to/file 20-456 (1).jpg"
fldr="${path%/*}" # Get the folder
file="${path##*/}" # Get the file name
file="${file// /_}" # Replace spaces with underscores in filename
echo "$fldr/${file//[^[:alnum:]._-]/}" # Get the result
Running this prints /path/to/file_20-456_1.jpg.
Quick notes:
${path%/*} - Removes the smallest chunk up to / from the end of the path
${path##*/} - Removes the largest text chunk from start of path to last / (including it)
${file// /_} replaces all spaces with _ in file
${file//[^[:alnum:]._-]/} removes all chars that are not alphanumeric, ., _ and - from file.
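For a sed-only variant, here is a sketch based on the likely cause of the original failure: inside a POSIX bracket expression a backslash is an ordinary character, so \-\/ can be read as a range from \ to \ and the dash never joins the set. Placing an unescaped - last in the set and using | as the s delimiter (so / needs no escaping) avoids both problems. The function name sanitize_path is hypothetical.

```shell
# Hypothetical helper: spaces become underscores, then everything except
# letters, digits, '.', '_', '/' and '-' is removed.
sanitize_path() {
    # '-' is last in the bracket expression, so it is a literal dash.
    sed -e 's/ /_/g' -e 's|[^0-9a-zA-Z._/-]||g'
}

echo "/path/to/file 20-456 (1).jpg" | sanitize_path   # /path/to/file_20-456_1.jpg
```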

Why are there 2 forward slashes in sed command before g keyword?

For the below code:
sed "s/ //g" filename
Since this is used to remove spaces, why are there two forward slashes in front of the g? It works fine, but what is the reason?

I suggest you read some tutorial about sed first.
Long story short, take this example: sed "s/search_pattern/replace_string/g" filename
s means search and replace (substitute)
search_pattern is the pattern to search for
replace_string is the string that replaces each match
g means apply the action globally, i.e. replace every match on a line instead of only the first
Thus, sed "s/ //g" filename means: find every space in the file and replace it with the empty string
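A small sketch of what the g flag and the empty replacement do, using a made-up input string:

```shell
echo "a b c" | sed "s/ //"     # no g: only the first space is removed -> ab c
echo "a b c" | sed "s/ //g"    # g: every space is removed             -> abc
echo "a b c" | sed "s/ /_/g"   # non-empty replacement                 -> a_b_c
```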

Each slash is a delimiter; there is just nothing between the second and third ones. For example, if you wanted to replace spaces with underscores, you would put an underscore between the second and third slashes:
sed "s/ /_/g" filename
Example run:
$ echo "foo bar" | sed "s/ /_/g"
foo_bar
$ echo "foo bar" | sed "s/ //g"
foobar

Inverted exclamation and question mark in ISO-8859

I need to replace inverted exclamation and inverted question marks in subtitle files so they display correctly on my TV. The files work correctly in ISO-8859, but I can't remove the marks.
The first solution was to use the command 'sed':
sed s/\¿|¡//g "$FILE"
This works for files in UTF-8, but what would be the right solution for files in ISO-8859?
sed 's/\xBF//g', for example, doesn't work.

In this command, your \ is removed by bash before the argument is passed to sed:
sed s/\¿//g "$FILE"
That doesn't matter, because ¿ is not a bash metacharacter and it does not require quoting. However, if you write this:
sed s/\xBF//g "$FILE"
it won't do what you expect; bash will replace \x with x leaving sed with the command s/xBF//g, which is probably not what you wanted to do.
You must either write:
sed 's/\xBF//g'
or
sed s/\\xBF//g
The command posted will not work, though:
sed s/\¿|¡//g "$FILE"
| is a bash metacharacter, and it must therefore be quoted or escaped. Also, sed uses Basic Regular Expressions (BREs) by default, which means that you must write \| to express alternation. That means that you would have to type:
sed 's/¿\|¡//g' "$FILE"
or
sed s/¿\\\|¡//g "$FILE"
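For the ISO-8859 files themselves, one possible approach is to work on raw bytes. This is a sketch that assumes the sed 's/\xBF//g' attempt fails because sed runs in a UTF-8 locale, where a lone 0xBF byte is not a valid character. In ISO-8859-1, ¿ is byte 0xBF (octal 277) and ¡ is byte 0xA1 (octal 241). The example swaps sed for tr, which takes portable octal escapes. The function name strip_inverted_marks is hypothetical.

```shell
# Hypothetical helper: delete the ISO-8859-1 bytes for ¿ (octal 277)
# and ¡ (octal 241) from standard input, treating the data as plain bytes.
strip_inverted_marks() {
    LC_ALL=C tr -d '\277\241'
}

# Build a tiny ISO-8859-1 sample with printf octal escapes, then clean it.
printf '\277Hola\241\n' | strip_inverted_marks   # Hola

# With GNU sed (assumption: its \xHH escape is available), a byte-wise
# equivalent would be:
#   LC_ALL=C sed 's/\xBF//g; s/\xA1//g' "$FILE"
```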

Bash script to remove 'x' amount of characters the end of multiple filenames in a directory?

I have a list of file names in a directory (/path/to/local). I would like to remove a certain number of characters from all of those filenames.
Example filenames:
iso1111_plane001_00321.moc1
iso1111_plane002_00321.moc1
iso2222_plane001_00123.moc1
In every filename I wish to remove the last 5 characters before the file extension.
For example:
iso1111_plane001_.moc1
iso1111_plane002_.moc1
iso2222_plane001_.moc1
I believe this can be done using sed, but I cannot determine the exact coding. Something like...
for filename in /path/to/local/*.moc1; do
mv $filname $(echo $filename | sed -e 's/.....^//');
done
...but that does not work. Sorry if I butchered the sed options, I do not have much experience with it.

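mv "$filename" "$(echo "$filename" | sed -e 's/.....\.moc1$/.moc1/')"
or
echo "${filename%%?????.moc1}.moc1"
%% is a shell parameter-expansion operator that strips the longest suffix matching the pattern; the pattern here has a fixed length, so % would behave the same.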

This sed command will work for all the examples you gave.
sed -e 's/\(.*\)_.*\.moc1/\1_.moc1/'
However, if you just want to specifically "remove 5 characters before the last extension in a filename" this command is what you want:
sed -e 's/\(.*\)[0-9a-zA-Z]\{5\}\.\([^.]*\)/\1.\2/'
You can implement this in your script like so:
for filename in /path/to/local/*.moc1; do
mv "$filename" "$(echo "$filename" | sed -e 's/\(.*\)[0-9a-zA-Z]\{5\}\.\([^.]*\)/\1.\2/')"
done
First Command Explanation
The first sed command works by grabbing all characters up to the last underscore (.* is greedy): \(.*\)_
Then it discards all characters until it finds .moc1: .*\.moc1
Then it replaces the text that it found with everything it grabbed at first inside the parenthesis: /\1
And finally adds the .moc1 extension back on the end and ends the regex: .moc1/
Second Command Explanation
The second sed command works by grabbing all characters at first: \(.*\)
And then it is forced to stop grabbing characters so it can discard five characters, or more specifically, five characters that lie in the ranges 0-9, a-z, and A-Z: [0-9a-zA-Z]\{5\}
Then comes the dot '.' character to mark the last extension : \.
And then it looks for all non-dot characters. This ensures that we are grabbing the last extension: \([^.]*\)
Finally, it replaces all that text with the first and second capture groups, separated by the . character, and ends the regex: /\1.\2/

This might work for you (GNU sed):
sed -r 's/(.*).{5}\./\1./' file
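The same rename can be sketched without sed, using only shell parameter expansion. The helper name strip_before_ext is hypothetical, and the loop only echoes the mv commands so that nothing is renamed until the output looks right.

```shell
# Hypothetical helper: print NAME with the 5 characters before the last
# extension removed. The body runs in a subshell, so its variables stay local.
strip_before_ext() (
    name=$1
    ext=${name##*.}     # text after the last dot, e.g. moc1
    stem=${name%.*}     # text before the last dot
    printf '%s.%s\n' "${stem%?????}" "$ext"
)

strip_before_ext iso1111_plane001_00321.moc1   # iso1111_plane001_.moc1

# Dry run: print the mv commands instead of executing them.
for filename in /path/to/local/*.moc1; do
    [ -e "$filename" ] || continue   # skip when the glob matches nothing
    echo mv -- "$filename" "$(strip_before_ext "$filename")"
done
```

Remove the echo in the loop once the printed commands are the ones you want.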

Environment variable substitution in sed

If I run these commands from a script:
#my.sh
PWD=bla
sed 's/xxx/'$PWD'/'
...
$ ./my.sh
xxx
bla
it is fine.
But, if I run:
#my.sh
sed 's/xxx/'$PWD'/'
...
$ ./my.sh
$ sed: -e expression #1, char 8: Unknown option to `s'
I read in tutorials that to substitute environment variables from shell you need to stop, and 'out quote' the $varname part so that it is not substituted directly, which is what I did, and which works only if the variable is defined immediately before.
How can I get sed to recognize a $var as an environment variable as it is defined in the shell?

Your two examples look identical, which makes problems hard to diagnose. Potential problems:
You may need double quotes, as in sed 's/xxx/'"$PWD"'/'
$PWD may contain a slash, in which case you need to find a character not contained in $PWD to use as a delimiter.
To nail both issues at once, perhaps
sed 's#xxx#'"$PWD"'#'

In addition to Norman Ramsey's answer, I'd like to add that you can double-quote the entire string (which may make the statement more readable and less error prone).
So if you want to search for 'foo' and replace it with the content of $BAR, you can enclose the sed command in double-quotes.
sed 's/foo/$BAR/g'
sed "s/foo/$BAR/g"
In the first, $BAR will not expand correctly while in the second $BAR will expand correctly.
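A quick sketch of the difference, with a made-up value for BAR:

```shell
BAR='world'   # example value, not from the question

echo 'hello foo' | sed 's/foo/$BAR/g'   # single quotes: prints hello $BAR
echo 'hello foo' | sed "s/foo/$BAR/g"   # double quotes: prints hello world
```

With double quotes the shell expands $BAR before sed sees the script, so a value containing the delimiter will still break the command.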

Another easy alternative:
Since $PWD will usually contain a slash /, use | instead of / for the sed statement:
sed -e "s|xxx|$PWD|"

You can use other characters besides "/" in substitution:
sed "s#$1#$2#g" -i FILE

1. bad way: change the delimiter
sed 's/xxx/'"$PWD"'/'
sed 's:xxx:'"$PWD"':'
sed 's#xxx#'"$PWD"'#'
These may not be the final answer:
you cannot know which character will occur in $PWD (/, : or #).
If the delimiter character appears in $PWD, it breaks the expression.
The good way is to escape the special character in $PWD.
2. good way: escape the delimiter
For example:
try to replace URL with the value of $url (which contains : and /)
x.com:80/aa/bb/aa.js
in the string $tmp
URL
A. use / as the delimiter
escape / as \/ in the variable (before using it in the sed expression)
## step 1: try the escape
echo ${url//\//\\/}
x.com:80\/aa\/bb\/aa.js #escaped correctly
echo ${url//\//\/}
x.com:80/aa/bb/aa.js #not escaped
echo "${url//\//\/}"
x.com:80\/aa\/bb\/aa.js #escaped correctly, notice the `"`
## step 2: run sed
echo $tmp | sed "s/URL/${url//\//\\/}/"
x.com:80/aa/bb/aa.js
echo $tmp | sed "s/URL/${url//\//\/}/"
x.com:80/aa/bb/aa.js
OR
B. use : as the delimiter (more readable than /)
escape : as \: in the variable (before using it in the sed expression)
## step 1: try the escape
echo ${url//:/\:}
x.com:80/aa/bb/aa.js #not escaped
echo "${url//:/\:}"
x.com\:80/aa/bb/aa.js #escaped correctly, notice the `"`
## step 2: run sed
echo $tmp | sed "s:URL:${url//:/\:}:g"
x.com:80/aa/bb/aa.js

With your question edit, I see your problem. Let's say the current directory is /home/yourname ... in this case, your command below:
sed 's/xxx/'$PWD'/'
will be expanded to
sed 's/xxx//home/yourname/'
which is not valid. You need to put a \ character in front of each / in your $PWD if you want to do this.
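As a sketch of that escaping step, the snippet below uses a second sed call to put a backslash in front of every / before the value is spliced into the substitution. The variable names dir and escaped are hypothetical, and dir stands in for $PWD so the example does not depend on where it runs.

```shell
dir='/home/yourname'   # hypothetical stand-in for $PWD

# Use | as the delimiter here so that / can be matched without escaping;
# \\/ in the replacement yields a literal backslash followed by a slash.
escaped=$(printf '%s\n' "$dir" | sed 's|/|\\/|g')

echo 'xxx' | sed 's/xxx/'"$escaped"'/'   # /home/yourname
```

This handles only the slash. A value containing & or a backslash would need those escaped as well.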

Actually, the simplest thing (in GNU sed, at least) is to use a different separator for the sed substitution (s) command. So, instead of s/pattern/'$mypath'/ being expanded to s/pattern//my/path/, which will of course confuse the s command, use s!pattern!'$mypath'!, which will be expanded to s!pattern!/my/path!. I’ve used the bang (!) character (or use anything you like) which avoids the usual, but-by-no-means-your-only-choice forward slash as the separator.

Dealing with VARIABLES within sed
[root#gislab00207 ldom]# echo domainname: None > /tmp/1.txt
[root#gislab00207 ldom]# cat /tmp/1.txt
domainname: None
[root#gislab00207 ldom]# echo ${DOMAIN_NAME}
dcsw-79-98vm.us.oracle.com
[root#gislab00207 ldom]# cat /tmp/1.txt | sed -e 's/domainname: None/domainname: ${DOMAIN_NAME}/g'
--- Below is the result -- very funny.
domainname: ${DOMAIN_NAME}
--- You need to close the single quotes before the variable and reopen them after it, like this ...
[root#gislab00207 ldom]# cat /tmp/1.txt | sed -e 's/domainname: None/domainname: '${DOMAIN_NAME}'/g'
--- The right result is below
domainname: dcsw-79-98vm.us.oracle.com

VAR=8675309
echo "abcde:jhdfj$jhbsfiy/.hghi$jh:12345:dgve::" |\
sed 's/:[0-9]*:/:'$VAR':/1'
where VAR contains what you want to replace the field with

I had a similar problem: I had a list and had to build a SQL script based on a template (which contained #INPUT# as the element to replace):
for i in LIST
do
awk -v val="$i" '{ sub(/#INPUT#/, val) } 1' template.sql >> output # the trailing 1 prints every line, changed or not
done

If your replacement string may contain other sed control characters, then a two-step substitution (first escaping the replacement string) may be what you want:
PWD='/a\1&b$_' # these are problematic for sed
PWD_ESC=$(printf '%s\n' "$PWD" | sed -e 's/[\/&]/\\&/g')
echo 'xxx' | sed "s/xxx/$PWD_ESC/" # now this works as expected

For me, replacing some text in a file with the value of an environment variable using sed only works with quotes, as follows:
sed -i 's/original_value/'"$MY_ENVIRONMENT_VARIABLE"'/g' myfile.txt
But when the value of MY_ENVIRONMENT_VARIABLE contains a URL (e.g. https://andreas.gr), the above does not work.
In that case, use a different delimiter:
sed -i "s|original_value|$MY_ENVIRONMENT_VARIABLE|g" myfile.txt
