Strange output from sed - linux

I have some HTML files and want to extract only the lines containing these tags:
head
p
I used sed to extract these parts of the files, as follows:
grep "<head>" myfile.html | sed -e 's%\(head\)\(.*\)\(/head\)%title\2\/title%'
grep "<p>" myfile.html | sed -e 's%\(<p>\)\(.*\)\(</p\)\(>\)%\2\\%'
Everything works, but I get a "\" character at the end of each line. How can I overcome this problem?

In this command, you're telling it to add a backslash by including the double backslash:
sed -e 's%\(<p>\)\(.*\)\(</p\)\(>\)%\2\\%'
Try removing the backslashes:
sed -e 's%\(<p>\)\(.*\)\(</p\)\(>\)%\2%'
Also, you don't need grep:
sed -ne '/<p>/{s%\(<p>\)\(.*\)\(</p\)\(>\)%\2%;p}'

Don't use \ at the end of the replacement string:
grep "<p>" myfile.html | sed -e 's%\(<p>\)\(.*\)\(</p\)\(>\)%\2%'
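The grep-free approach above can be wrapped in a small shell function. This is only a minimal sketch, assuming each <p>...</p> pair sits on a single line; the function name extract_p and the sample input are made up for illustration.

```shell
# Hypothetical helper: print the text between <p> and </p> on each matching line.
# Assumes one <p>...</p> pair per line; lines without the pair are skipped.
extract_p() {
  sed -n 's%.*<p>\(.*\)</p>.*%\1%p' "$@"
}

# Sample input (invented for illustration).
printf '%s\n' '<head>My page</head>' '<p>first paragraph</p>' '<div>skip me</div>' |
  extract_p
```

Because the replacement contains only \1, nothing is appended to the extracted text, so no stray backslash appears.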

Related

Linux merging files

I'm doing an online Linux course but I'm stuck on a question, which you can find below.
You will get three files called a.bf, b.bf and c.bf. Merge the contents of these three files and write it to a new file called abc.bf. Respect the order: abc.bf must contain the contents of a.bf first, followed by those of b.bf, followed by those of c.bf.
Example
Suppose the given files have the following contents:
a.bf contains +++.
b.bf contains [][][][].
c.bf contains <><><>.
The file abc.bf should then have
+++[][][][]<><><>
as its content.
I know how to merge the three files, but when I use cat my output is:
+++
[][][][]
<><><>
When I use paste my output is "+++ 'a lot of spaces' [][][][] 'a lot of spaces' <><><>"
The output I need is +++[][][][]<><><>; I don't want the spaces between the contents. Can someone help me?
What you want to do is delete the newline characters.
With tr:
cat {a,b,c}.bf | tr --delete '\n' > abc.bf
With echo & sed:
echo $(cat {a,b,c}.bf) | sed -E 's/ //g' > abc.bf
With xargs & sed:
cat {a,b,c}.bf | xargs | sed -E 's/ //g' > abc.bf
Note that sed is only used to remove the spaces.
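With cat & sed (GNU sed, for -z):
cat {a,b,c}.bf | sed -z 's/\n//g' > abc.bf
With echo & command substitution:
echo -n "$(cat a.bf)$(cat b.bf)$(cat c.bf)" > abc.bf
Command substitution strips the trailing newlines from each file, and echo -n does not append one of its own.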
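The tr approach can be tried without touching real files. The following is a minimal sketch, assuming POSIX tr and mktemp are available; the function name merge_no_newlines and the scratch directory are hypothetical, and the file contents are the ones from the question.

```shell
# Hypothetical helper: concatenate the given files and drop every newline.
merge_no_newlines() {
  cat -- "$@" | tr -d '\n'
}

# Demo in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)
printf '%s\n' '+++'      > "$tmpdir/a.bf"
printf '%s\n' '[][][][]' > "$tmpdir/b.bf"
printf '%s\n' '<><><>'   > "$tmpdir/c.bf"

merge_no_newlines "$tmpdir/a.bf" "$tmpdir/b.bf" "$tmpdir/c.bf" > "$tmpdir/abc.bf"
cat "$tmpdir/abc.bf"; echo   # echo only adds a newline for readable terminal output
rm -rf "$tmpdir"
```

Passing the files as separate arguments keeps the required order explicit: the first argument's contents come first in the result.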

How do I replace single quotes with another character in sed?

I have a flat file where I have multiple occurrences of strings that contains single quote, e.g. hari's and leader's.
I want to replace all occurrences of the single quote with a space, i.e.
all occurrences of hari's with hari s
all occurrences of leader's with leader s
I tried
sed -e 's/"'"/ /g' myfile.txt
and
sed -e 's/"'"/" "/g' myfile.txt
but they are not giving me the expected result.
Try to keep sed commands as simple as possible.
Otherwise you'll be confused by what you wrote when you read it later.
#!/bin/bash
sed "s/'/ /g" myfile.txt
This will do what you want:
echo "hari's"| sed 's/\x27/ /g'
It will replace single quotes anywhere in your file or text, even those used for quoting. If you only want to replace quotes inside a word (not at a word boundary), you can use the following:
echo "hari's"| sed -re 's/(\<.+)\x27(.+\>)/\1 \2/g'
HTH
Just close the single-quoted string, add an escaped single quote, and reopen the quoting:
sed 's/'\''/ /g' input
also possible with a variable:
quote=\'
sed "s/$quote/ /g" input
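This is based on my own experience.
Note the difference between ' and " around the sed script.
This won't work (no output): the single quote inside the single-quoted script leaves the quoting unbalanced, so the shell keeps waiting for more input:
$ echo "1'2'3'4'5" | sed 's/'/ /g'
>
>
>
but this works:
$ echo "1'2'3'4'5" | sed "s/'/ /g"
1 2 3 4 5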
The -i option makes sed edit the file in place:
sed -i 's/“/"/g' filename.txt
If you want a backup of the original file, you can do:
sed -i.bak 's/“/"/g' filename.txt
I had to replace "0x" string with "32'h" and resolved with:
sed 's/ 0x/ 32\x27h/'
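The double-quoted form is the easiest one to reuse in a script. Here is a minimal sketch; the function name squote_to_space and the sample strings are made up for illustration.

```shell
# Hypothetical helper: replace every single quote with a space.
# The sed script is double-quoted so the ' inside it needs no escaping.
squote_to_space() {
  sed "s/'/ /g" "$@"
}

# Sample input (invented, in the spirit of the question's hari's / leader's).
printf '%s\n' "hari's book" "the leader's plan" | squote_to_space
```

The function reads stdin when called without arguments, or the named files otherwise, because "$@" is passed straight to sed.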

How to do str.strip() for every line in a text file? Unix

I could do the following in Python to strip unwanted whitespace, but can it be done from the terminal by other means, such as sed, grep or something similar?
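with open('textfile.txt', encoding='utf8') as infile, \
     open('textstripped.txt', 'w', encoding='utf8') as outfile:
    for line in infile:
        # strip() removes leading and trailing whitespace, including the newline
        print(line.strip(), file=outfile)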
Using perl on the command line:
perl -lpe 's/^\s+//; s/\s+$//' file.txt > stripped.txt
This solution is based on the GNU sed manual:
sed 'y/\t/ /;s/^ *//;s/ *$//' input > output
http://www.gnu.org/software/sed/manual/sed.html#Centering-lines
Description:
y/\t/ / replaces tabs with spaces
s/^ *// removes leading spaces
s/ *$// removes trailing spaces
$ cat input.txt | sed 's/^[ \t]*//;s/[ \t]*$//' > output.txt
This gets rid of the leading and trailing whitespace.
EDIT: sed -E -i.bk 's/^[ \t]+//; s/[ \t]+$//' input.txt
This does in place file editing, and saves backup to input.txt.bk (and saves a process as some suggested)
sed -E "s/(^[ \t]+|[ \t]+$)//g" < input > output
Or if you have GNU sed:
sed -E "s/^\s+|\s+$//g" < in > out
If you have a Mac, I recommend getting Homebrew and installing gnu-sed.
Then, alias sed=gsed.
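For a version that avoids the GNU-only \s and \t escapes, the POSIX character class [[:space:]] can be used instead. This is a minimal sketch under that assumption; the function name strip_lines and the sample input are hypothetical.

```shell
# Hypothetical helper: trim leading and trailing whitespace from every line.
# [[:space:]] is a POSIX character class, so no GNU-only \s or \t is needed.
strip_lines() {
  sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$@"
}

# Sample input (invented): spaces, tabs, and an already clean line.
printf '  padded with spaces  \n\ttabbed\t\nclean\n' | strip_lines
```

Like str.strip(), this leaves whitespace inside a line untouched and only trims the ends.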

Text formatting - sed, awk, shell

I need some assistance trying to build up a variable using a list of exclusions in a file.
So I have an exclude file I am using for rsync that looks like this:
*.log
*.out
*.csv
logs
shared
tracing
jdk*
8.6_Code
rpsupport
dbarchive
inarchive
comms
PR116PICL
**/lost+found*/
dlxwhsr*
regression
tmp
working
investigation
Investigation
dcsserver_weblogic_
dcswebrdtEAR_weblogic_
I need to build up a string to be used as a variable to feed into egrep -v, so that I can use the same exclusion list for rsync as I do when egrep -v from a find -ls.
So I have created this so far to remove all "*" and "/" - and then when it sees certain special characters it escapes them:
cat exclude-list.supt | while read line
do
echo $line | sed 's/\*//g' | sed 's/\///g' | 's/\([.-+_]\)/\\\1/g'
What I need the output to look like is this, which I then want to export as a variable:
SEXCLUDE_supt="\.log|\.out|\.csv|logs|shared|PR116PICL|tracing|lost\+found|jdk|8\.6\_Code|rpsupport|dbarchive|inarchive|comms|dlxwhsr|regression|tmp|working|investigation|Investigation|dcsserver\_weblogic\_|dcswebrdtEAR\_weblogic\_"
Can anyone help?
A few issues with the following:
cat exclude-list.supt | while read line
do
echo $line | sed 's/\*//g' | sed 's/\///g' | 's/\([.-+_]\)/\\\1/g'
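sed reads its input line by line, so cat | while read line; do echo $line | sed ... is completely redundant. sed can also do multiple substitutions in one invocation, either as a semicolon-separated list or with repeated -e options, so piping to sed three times is two too many (and the third stage of your pipeline is missing the sed command itself). A problem with [.-+_] is that the - sits between . and +, so it's interpreted as the range .-+ rather than as a literal hyphen; because + sorts before . in ASCII, that range is reversed and sed will typically reject it as invalid. When using - inside a character class, put it at the beginning or the end to lose this meaning, like [._+-].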
A much better way:
$ sed -e 's/[*/]//g' -e 's/\([._+-]\)/\\\1/g' file
\.log
\.out
\.csv
logs
shared
tracing
jdk
8\.6\_Code
rpsupport
dbarchive
inarchive
comms
PR116PICL
lost\+found
dlxwhsr
regression
tmp
working
investigation
Investigation
dcsserver\_weblogic\_
dcswebrdtEAR\_weblogic\_
Now we can pipe through tr '\n' '|' to replace the newlines with pipes for the alternation ready for egrep:
$ sed -e 's/[*/]//g' -e 's/\([._+-]\)/\\\1/g' file | tr "\n" "|"
\.log|\.out|\.csv|logs|shared|tracing|jdk|8\.6\_Code|rpsupport|dbarchive|...
$ EXCLUDE=$(sed -e 's/[*/]//g' -e 's/\([._+-]\)/\\\1/g' file | tr "\n" "|")
$ echo $EXCLUDE
\.log|\.out|\.csv|logs|shared|tracing|jdk|8\.6\_Code|rpsupport|dbarchive|...
Note: If your file ends with a newline character you will want to remove the final trailing |, try sed 's/\(.*\)|/\1/'.
This might work for you (GNU sed):
SEXCLUDE_supt=$(sed '1h;1!H;$!d;g;s/[*\/]//g;s/\([._+-]\)/\\\1/g;s/\n/|/g' file)
This should work, but I guess there are better solutions. First store everything in a shell variable:
SEXCLUDE_supt=$( sed -e 's/\*//g' -e 's/\///g' -e 's/\([._+-]\)/\\\1/g' exclude-list.supt)
and then process it again to substitute white space:
SEXCLUDE_supt=$(echo $SEXCLUDE_supt |sed 's/\s/|/g')
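The pieces above can be combined into one function. The sketch below swaps tr for paste -s, which joins lines with a delimiter and leaves no trailing |; that substitution, the function name build_exclude, and the scratch file are assumptions made for illustration, not part of the original answers.

```shell
# Hypothetical helper: turn an rsync exclude file into an egrep alternation.
build_exclude() {
  # 1) drop * and /   2) backslash-escape . _ + -   3) join lines with |
  sed -e 's/[*/]//g' -e 's/[._+-]/\\&/g' "$1" | paste -sd '|' -
}

# Demo with a few lines from the exclude list, written to a scratch file.
list=$(mktemp)
printf '%s\n' '*.log' 'logs' '**/lost+found*/' '8.6_Code' > "$list"
SEXCLUDE_supt=$(build_exclude "$list")
printf '%s\n' "$SEXCLUDE_supt"
rm -f "$list"
```

In the sed replacement, & stands for the matched character, so \\& writes a backslash followed by that character without needing a capture group.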

How to filter data out of tabulated stdout stream in Bash?

Here's what output looks like, basically:
? RESTRequestParamObj.cpp
? plugins/dupfields2/_DupFields.cpp
? plugins/dupfields2/_DupFields.h
I need to get the filenames from the second column and pass them to rm. There's an awk script that goes like awk '{print $2}', but I was wondering if there's another solution.
If you have spaces between the ? and the filename then:
cut -c9-
If they're tabs then:
cut -f2
I placed your output in a file:
$> cat ./text
? RESTRequestParamObj.cpp
? plugins/dupfields2/_DupFields.cpp
? plugins/dupfields2/_DupFields.h
Then edited it with sed:
$> cat ./text | sed -r -e 's/(\?[\ \t]*)(.*)/\2/g'
RESTRequestParamObj.cpp
plugins/dupfields2/_DupFields.cpp
plugins/dupfields2/_DupFields.h
Here sed matches two parts of the line:
? followed by tabs or spaces
the remaining characters up to the end of the line
It then replaces the whole line with the second part.
This might work for you:
echo "? RESTRequestParamObj.cpp" | sed -e 's/^\S\+/rm /' | sh
or using GNU sed
echo "? RESTRequestParamObj.cpp"| sed -r 's/^\S+/rm /e'
bash only solution, assuming your output comes from stdin:
while read -r line; do echo "${line##* }"; done
Use cut or perl instead (both assume tab-separated columns; tab is cut's default delimiter):
cut -f2 | xargs rm -rf
<your output> | perl -ne '@cols = split /\t/; print $cols[1]' | xargs rm -rf
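Before piping anything into rm, it can help to preview the commands. This is a minimal sketch, assuming every wanted line starts with ? followed by spaces or tabs and that paths contain no newlines; the function name unknown_paths is hypothetical, and echo stands in for the real rm.

```shell
# Hypothetical helper: print the path from lines of the form "? <path>".
# Lines that do not start with ? are ignored.
unknown_paths() {
  sed -n 's/^?[[:space:]]*//p' "$@"
}

# Sample status output (from the question); echo rm shows what would be run.
printf '%s\n' '? RESTRequestParamObj.cpp' '? plugins/dupfields2/_DupFields.cpp' |
  unknown_paths |
  while IFS= read -r path; do
    echo rm -- "$path"
  done
```

Dropping the echo runs the deletion for real; reading one path per line this way keeps paths that contain spaces intact.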

Resources