Removing prepositions from a text file in Linux

What I want to do is remove all prepositions from a text file in CentOS, things like 'on of to the in at ....'. Here is my script:
!/bin/bash
list='i me my myself we our ours ourselves you your yours yourself ..... '
cat Hamlet.txt | for item in $list
do
sed 's/$item//g'
done > newHam.txt
But at the end, when I open newHam.txt, nothing has changed! It's the same as Hamlet.txt. I don't know whether this is a good approach or not. Any suggestions? Any other approach?

Assuming your sed understands \< and \> for word boundaries,
sed 's/\<\(i\|me\|my\|myself\|we\|our\|ours\|ourselves\|you\|your\|yours\|yourself\)\> \?//g' Hamlet.txt >newHam.txt
You want to make sure you include word boundaries; your original attempt would remove e.g. i everywhere, turning "in the input" into "n the nput".
If you already have the words in a string, you can interpolate it in Bash with
sed "s/\\<\\(${list// /\\|}\\)\\> \\?//g" Hamlet.txt >newHam.txt
but the ${variable//pattern/substitution} parameter expansion is not portable to e.g. /bin/sh. Notice also how double quotes instead of single are necessary for the shell to be allowed to perform variable substitutions within the script, and how all literal backslashes need to be escaped with another backslash within double quotes.
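To see what that interpolation builds, here is a quick check with a shortened list (purely illustrative); it should print the same alternation that the hand-written command above uses:
list='i me my myself'
echo "${list// /\\|}"
# i\|me\|my\|myself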
Unfortunately, many details of sed are poorly standardized. Ironically, switching to a tool which isn't standard at all might be the most portable solution.
perl -pe 'BEGIN {
@list = qw(i me my myself we our ours ourselves you your yours yourself .....);
$re = join("|", @list); }
s/\b($re)\b ?//go' Hamlet.txt >newHam.txt
If you want this as a standalone script,
#!/usr/bin/perl
BEGIN {
@list = qw(i me my myself we our ours ourselves you your yours yourself .....);
$re = join("|", @list);
}
while (<>) {
s/\b($re)\b ?//go;
print
}
These words are pronouns, not prepositions.
Finally, take care to fix the shebang of your script; the first line of the script needs to start with exactly the two characters #! because that's what makes it a shebang. You'll also want to avoid the useless cat in the future.
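If you want to keep the overall shape of your script, here is a minimal corrected sketch along the lines above. It assumes a sed that understands \< and \> word boundaries, builds one -e expression per word so the file is read only once, and uses a shortened word list purely for illustration.
#!/bin/bash
list='i me my myself we our ours ourselves you your yours yourself'
args=()
for item in $list; do
    # One anchored substitution per word; the trailing " \?" also eats one following space.
    args+=(-e "s/\\<$item\\> \\?//g")
done
sed "${args[@]}" Hamlet.txt > newHam.txt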

Related

Find a line that starts with a pattern and replace the whole line in Linux using sed

How do I find a line that starts with a given pattern and replace the complete line?
File output:
xyz
abc
/dev/linux-test1/
Code:
output=/dev/sda/windows
sed 's/^/dev/linux*/$output/g' file.txt
I am getting the below error:
sed: -e expression #1, char 9: unknown option to `s'
File Output expected after replacement:
xyz
abc
/dev/sda/windows
Let's take this in small steps.
First we try changing "dev" to "other":
sed 's/dev/other/' file.txt
/other/linux-test1/
(Omitting the other lines.) So far, so good. Now "/dev/" => "/other/":
sed 's//dev///other//' file.txt
sed: 1: "s//dev///other//": bad flag in substitute command: '/'
Ah, it's confused: we're using '/' both as the command delimiter and as literal text. So we use a different delimiter, like '|':
sed 's|/dev/|/other/|' file.txt
/other/linux-test1/
Good. Now we try to replace the whole line:
sed 's|^/dev/linux*|/other/|' file.txt
/other/-test1/
It didn't replace the whole line... Ah, in sed, '*' means the previous character repeated any number of times. So we precede it with '.', which means any character:
sed 's|^/dev/linux.*|/other/|' file.txt
/other/
Now to introduce the variable:
sed 's|^/dev/linux.*|$output|' file.txt
$output
The shell didn't expand the variable, because of the single quotes. We change to double quotes:
sed "s|^/dev/linux.*|$output|" file.txt
/dev/sda/windows
This might work for you (GNU sed):
output="/dev/sda/windows"; sed -i '\#/dev/linux.*/#c'"$output" file
Set the shell variable and change the line addressed by /dev/linux.*/ to it.
N.B. The shell variable needs to be interpolated before sed runs, hence the ; separating the assignment from the sed command; the variable may equally well be set on a line of its own. Also, the delimiter for the sed address must be changed so that it does not interfere with the address itself, hence \#...#, and finally the shell variable should be enclosed in double quotes to allow full interpolation.
I'd recommend not doing it this way. Here's why.
Sed is not a programming language. It's a stream editor with some constructs that look and behave like a language, but it offers very little in the way of arbitrary string manipulation, format control, etc.
Sed only takes data from a file or stdin (also a file). Embedding strings within your sed script is asking for errors -- constructs like s/re/$output/ are destined to fail at some point, almost regardless of what workarounds you build into your sed script. The best solution for making sed commands like this work is to do your input sanitization OUTSIDE of sed.
Which brings me to ... this may be the wrong tool for this job, or might be only one component of the toolset for the job.
The error you're getting is obviously because the sed command you're using is horribly busted. The substitute command is:
s/pattern/replacement/flags
but the command you're running is:
s/^/dev/linux*/$output/g
The pattern you're searching for is ^, the empty string at the beginning of the line. Your replacement pattern is dev, and then you have a bunch of text that might be interpreted as flags. This plainly doesn't work when your search string contains the same character that you're using as the delimiter for the substitute command.
In regular expressions and in sed, you can escape things. While you might get some traction with s/^\/dev\/linux.*/$output/, you'd still run into difficulty if $output contained slashes. If you're feeding this script to sed from bash, you could use ${output//\//\\\/}, but you can't handle those escapes within sed itself. Sed has no variables.
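For illustration, that bash-side escaping looks like this with the question's values (the helper variable escaped is just for the example):
output=/dev/sda/windows
# Turn every / in $output into \/ so it can sit inside an s/.../.../ that keeps / as its delimiter.
escaped=${output//\//\\\/}
sed "s/^\/dev\/linux.*/$escaped/" file.txt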
In a proper programming language, you'd have better separation of variable content and the commands used for the substitution.
output="/dev/sda/windows"
awk -v output="$output" '$1~/\/dev\/linux/ { $0=output } 1' file.txt
Note that I've used $1 here because in your question, your input lines (and output) appear to have a space at the beginning of each line. Awk automatically trims leading and trailing space when assigning field (positional) variables.
Or you could even do this in pure bash, using no external tools:
output="/dev/sda/windows"
while read -r line; do
[[ "$line" =~ ^/dev/linux ]] && line="$output"
printf '%s\n' "$line"
done < file.txt
This one isn't resilient in the face of leading whitespace. Salt to taste.
So .. yes, you can do this with sed. But the way commands get put together in sed makes something like this risky, and despite the available workarounds like switching your substitution command delimiter to another character, you'd almost certainly be better off using other tools.

Select lines between two patterns using variables inside SED command

I'm new to shell scripting. My requirement is to retrieve the lines between two patterns. It works fine if I run it from the terminal without using variables inside the sed command, but the problem arises when I put the commands below into a file and try to execute it.
#!/bin/sh
word="ajp-qdcls2228.us.qdx.com%2F156.30.35.204-8009-34"
upto="2017-01-03 23:00"
fileC=`cat test.log`
output=`echo $fileC | sed -e "n/\$word/$upto/p"`
printf '%s\n' "$output"
If I use the below cmd in the terminal it works fine
sed -n '/ajp-qdcls2228.us.qdx.com%2F156.30.35.204-8009-34/,/2017-01-03 23:00/ p' test.log
Please suggest a workaround.
If we put aside for a moment the fact that you shouldn't cat a file into a variable and then echo it for sed filtering, the reason your command is not working is that you're not quoting the file content variable, fileC, when echoing it. This munges multiple whitespace characters together and turns them into a single space, so you're losing newlines from the file, as well as multiple spaces, tabs, etc.
To fix it, you can write:
fileC=$(cat test.log)
output=$(echo "$fileC" | sed -n "/$word/,/$upto/p")
Note the double-quotes around fileC (and a fixed sed expression, similar to your second example). Without the quotes (try echo $fileC), your fileC is expanded (with the default IFS) into a series of words, each being one argument to echo, and echo will just print those words separated with a single space. Additionally, if the file contains some of the globbing characters (like *), those patterns are also expanded. This is a common bash pitfall.
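A quick way to see that difference (the sample text is made up):
fileC=$(printf 'first line\nsecond   line\n')
echo $fileC      # unquoted: word-splits, prints: first line second line
echo "$fileC"    # quoted: the newline and the run of spaces are preserved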
Much better would be to write it like this:
output=$(sed -n "/$word/,/$upto/p" test.log)
And if your patterns include any sed metacharacters, you should really escape them before using them with sed, like this:
escape() {
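# Escape a string for use as a sed pattern: wrap every character except ^ in [...] and backslash-escape ^.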
sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$1";
}
output=$(sed -n "/$(escape "$word")/,/$(escape "$upto")/ p" test.log)
The correct approach will be something like:
word="ajp-qdcls2228.us.qdx.com%2F156.30.35.204-8009-34"
upto="2017-01-03 23:00"
awk -v beg="$word" -v end="$upto" '$0==beg{f=1} f{print; if ($0==end) exit}' file
but until we see your sample input and output we can't know for sure what it is you need to match on (full lines, partial lines, all text on one line, etc.) or what you want to print (include delimiters, exclude one, exclude both, etc.).

"Substitution replacement not terminated" with variable

Found this error in other questions, but I can't see how the solutions relate to this.
Assume a file test containing:
one
twoX
three
I can correct twoX with:
perl -0777 -i -pe 's/twoX/two/igm' test
I can make a function to do this:
replace_str(){ perl -0777 -i -pe 's/'$2'/'$3'/igm' $1; }
replace_str test twoX two
But this fails when the replacement contains a space (possibly other chars):
replace_str test two 'two frogs'
Substitution replacement not terminated at -e line 1.
The perl line works with the space. Why not when called in a function? I've tried with other quotes and e.g. $(echo two frogs) (with and without quotes).
It's because you end the single-quoted string you pass to Perl in order to get your shell variables expanded. When an expansion contains a space, the shell splits it, so the regex becomes two arguments.
Instead just put the whole regex, including variables, inside double-quotes and the shell should expand the variables properly.
So use "s/$2/$3/igm" instead.

Translate Part of a Line

I have a bunch of files that I am moving from one wiki (Markdown based) to another (Creole based). I have written a couple of sed scripts to do things such as convert link formats and header formats. But the new wiki allows a directory structure, and I would rather use that than the pseudo-directory structure I have now. I have already renamed the files, but I need to convert all of the links from _ delimited to / delimited.
Basic info:
Creole link: [[url]] [[url|name]]
I only want to convert the links that do not contain a . or a /.
I would really appreciate if you explained what the command you give means so that I can learn from it.
Sample
this is a line with a [[Link_to_something]] and [[Something_else|something else]]
this site is cool [[http://example.com/this_page]]
to
this is a line with a [[Link/to/something]] and [[Something/else|something else]]
this site is cool [[http://example.com/this_page]]
What I have tried
y/// only works on the whole line.
s//\u\2 only supports case translations.
I think I'd use Perl. It can be done as a one-liner, thus:
perl -pe 's{\[\[([^/.|]+)(|[^]]+)?\]\]}{$x=$1;$y=$2;$x=~s%_%/%g;"[[$x$y]]"}gex;' <<'EOF'
this is a line with a [[Link_to_something]] and [[Something_else|something else]]
this site is cool [[http://example.com/this_page]]
EOF
The output from that is:
this is a line with a [[Link/to/something]] and [[Something/else|something else]]
this site is cool [[http://example.com/this_page]]
Whether that's good style etc is entirely open to debate.
I'll explain this version of the code, which is isomorphic with the code above:
perl -e 'use strict; use warnings;
while (my $line = <>)
{
$line =~ s{ \[\[ ([^/.|]+) (|[^]]+)? \]\] }
{ my($x, $y) = ($1, $2); $x =~ s%_%/%g; "[[$x$y]]" }gex;
print $line;
} '
The while loop is basically what the -p provides in the first version. I've explicitly named the input variable as $line instead of using the implicit $_ as in the first version. I also had to declare $x and $y because of the use strict; use warnings;.
The substitute command takes the form s{pattern}{replace} because there are slashes in the regexes themselves. The x modifier allows (non-significant) spaces in the two parts, which makes it easier to lay out. The g modifier repeats the substitution as often as the pattern matches. The e modifier says 'treat the right-hand part of the substitution as an expression'.
The matching pattern looks for a pair of open square brackets, then remembers a sequence of characters other than /, . or |, optionally followed by a | and a sequence of characters other than ], finishing at a pair of close square brackets. The two captures are $1 and $2.
The replacement expression saves the values of $1 and $2 in variables $x and $y. It then applies a simpler substitution to $x, changing underscores into slashes. Then the result value is the string of [[$x$y]]. You can't modify $1 or $2 directly in the replacement expression. And the inner s%_%/%g; clobbers $1 and $2, which is why I needed $x and $y.
There might be another way to do it - this is Perl, so TMTOWTDI: there's more than one way to do it. But this does at least work.
This might work for you (GNU awk, which provides the RT variable):
awk -vORS='' -vRS='[[][[][^].]*[]][]]' '{gsub(/_/,"/",RT);print $0 RT}' file
this is a line with a [[Link/to/something]] and [[Something/else|something else]]
this site is cool [[http://example.com/this_page]]
Set the output record separator to null
Set the record separator to [[...]] (where the ... does not contain a ] or a .).
Replace all _'s in what is placed in the record separator variable RT with /'s
Print the concatenated record and record separator. i.e. $0 RT
This is a sed solution:
sed 's/\[\[[^].]*]]/\a\n&\a\n/g' file |
sed '/^\[\[[^]]*\]\]\a/y/_/\//;H;$!d;g;s/\a\n//g;s/.//'
this is a line with a [[Link/to/something]] and [[Something/else|something else]]
this site is cool [[http://example.com/this_page]]
Surround the [[...]]'s by \a\n's. N.B. \a is chosen as a character unlikely to appear in the file.
Translate '_'s to /'s in lines beginning with [[
Remove all occurrences of \a\n's
If you have GNU sed, this will do:
sed '/\[\[[^].]*]]/{s||'\''$(sed "y/_/\\//" <<<"&")'\''|g;s/.*/echo '\''&'\''/e}' file
this is a line with a [[Link/to/something]] and [[Something/else|something else]]
this site is cool [[http://example.com/this_page]]
You can use python to simplify the regex:
$ python3 -c '
> import re
> import sys
> for line in sys.stdin:
>     print(re.sub(r"\[\[(?!http).*?\]\]", lambda m:m.group(0).replace("_", "/"), line), end="")
> ' <input.txt
this is a line with a [[Link/to/something]] and [[Something/else|something else]]
this site is cool [[http://example.com/this_page]]
Note: the $ and > at the beginning of the lines are the command prompt.
You can also do it in vim visually:
/\[\[\(http\)\@!.\{-}\]\]
:%s##\=substitute(submatch(0), '_', '/', 'g')#g

Perl line runs 30 times quicker with single quotes than with double quotes

We have a task to change some strings in binary files to lowercase (from mixed/upper/whatever). The relevant strings are references to other files (it's in connection with an upgrade where we are also moving from Windows to Linux as a server environment, so the case suddenly matters). We have written a script which calls perl in a loop to do this. We have a directory containing around 300 files (the total size of the directory is around 150M), so it's some data, but not a huge amount.
The following perl code takes about 6 minutes to do the job:
for file_ref in `ls -1F $forms6_convert_dir/ | grep -v "/" | sed 's/\(.*\)\..*/\1/'`
do
(( updated++ ))
write_line "Converting case of string: $file_ref "
perl -i -pe "s{(?i)$file_ref}{$file_ref}g" $forms6_convert_dir/*
done
while the following perl code takes over 3 hours!
for file_ref in `ls -1F $forms6_convert_dir/ | grep -v "/" | sed 's/\(.*\)\..*/\1/'`
do
(( updated++ ))
write_line "Converting case of string: $file_ref "
perl -i -pe 's{(?i)$file_ref}{$file_ref}g' $forms6_convert_dir/*
done
Can anyone explain why? Is it that $file_ref is getting left as the literal string $file_ref instead of being substituted with the value in the single-quoted version? If so, what is it replacing in that version? What we want is to replace all occurrences of any filename with itself, but in lowercase. If we run strings on the files before and after and search for the filenames, then both versions appear to have made the same changes. However, if we run diff on the files produced by the two loops (diff firstloop/file1 secondloop/file1), then it reports that they differ.
This is running from within a bash script on linux.
The shell doesn't do variable substitution for single quoted strings. So, the second one is a different program.
As the other answers said, the shell doesn't substitute variables inside single quotes, so the second version is executing the literal Perl statement s{(?i)$file_ref}{$file_ref}g for every line in every file.
As you said in a comment, if $ is the end-of-line metacharacter, $file_ref could never match anything. $ matches before the newline at end-of-line, so the next character would have to be a newline. Therefore, Perl doesn't interpret $ as the metacharacter; it interprets it as the beginning of a variable interpolation.
In Perl, the variable $file_ref is undef, which is treated as the empty string when interpolated. So you're really executing s{(?i)}{}g, which says to replace the empty string with the empty string, and do that for all occurrences in a case-insensitive manner. Well, there's an empty string between every pair of characters, plus one at the beginning and end of each line. Perl is finding each one and replacing it with the empty string. This is a no-op, but it's an expensive one, hence the 3-hour run time.
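You can see what Perl actually compiles in the single-quoted case by letting it interpolate the (undefined) variable into a plain string; this is just an illustration:
perl -e 'print "s{(?i)$file_ref}{$file_ref}g\n"'
# prints: s{(?i)}{}g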
You must be mistaken about both versions making the same changes. As I just explained, the single-quoted version is just an expensive no-op; it doesn't make any changes at all to the file contents (it just makes a fresh copy of each file). The files you ran it on must have already been converted to lower case.
With double quotes you are using the shell variable; with single quotes Perl is trying to use a (nonexistent) Perl variable of that name.
You might wish to consider writing the whole lot in either Perl or Bash to speed things up. Both languages can read files and do pattern matching. In Perl you can change to lower-case using the lc built-in function, and in Bash 4 you can use ${file,,}.
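For reference, the two lower-casing mechanisms mentioned look like this (the sample string is made up):
name="SomeForm.FMB"
echo "${name,,}"                           # Bash 4+: someform.fmb
perl -e 'print lc(shift), "\n"' "$name"    # Perl lc: someform.fmb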
