Parsing HTML table in Bash using sed - linux

In bash I am trying to parse the following file:
Input:
</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">
Wanted output:
12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples"
11/07/2011 67100270 "https://stuff.com/findstones"
I got to the point that I have:
# less input.txt | sed -e "s/><tr><td//" -e "s/\///" -e "s/a>//" -e "s/<\/td><\/tr>//g" -e "s/<\/td><td>//g" -e "s/>$//g" -e "s/<a class=\"btn-down\" download href=//g"
<stuff.txt (15.18 KB)12/01/2015Large things158520312"https://resource.com/stones"
<flowers.pdf (83.03 MB)23/03/2011Large flowers872448000"https://resource.com/flosers with stuff"
<apples.pdf (281.16 MB)21/04/2012Large things like apples299009564"https://resource.com/apples"
<stones.pdf (634.99 MB)11/07/2011Large stones from mountains67100270"https://stuff.com/findstones"
Is there an easier way to parse it? I feel that it can be done much more simply, and I am not even halfway through the parsing.

Could you please try the following and let us know if it helps?
awk -F"[><]" '{sub(/.*=/,"",$28);print $15,$23,$28}' Input_file
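To see where numbers like $15, $23 and $28 come from, it can help to dump every field with its index. This is only a sketch on a made-up two-cell line (the sample string is an assumption, not the asker's full row); run the same loop on a real row to read off the columns you need.

```shell
# Hypothetical two-cell sample line, much shorter than a real row.
sample='<td>a</td><td>b</td>'

# Split on < and >, then print each field with its index.
fields=$(printf '%s\n' "$sample" |
  awk -F'[><]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
```

Empty fields show up wherever a > is immediately followed by a <, which is why the useful columns end up with such high numbers.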

I'm sure the best way to solve your problem is to use an HTML parser. Solution for shown sample of file:
sed -r 's/.*(..\/..\/....).*>([0-9]*)<\/.*href=([^>]*)>/\1 \2 \3/I' input.txt

Personally, I'd use perl, but that's not what you asked, so...
A pedantic stepwise approach, so that you can edit bits of the logic when needed.
Assuming the input is a file named x:
</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">
Try this:
sed -E '
s/>$//;
s/href=/>/;
s/(<[^>]+>)+/~/g;
s/~[^~]+~//;
s/~[^~]+~/ /;
s/~/ /;
' x
Output:
12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples"
11/07/2011 67100270 "https://stuff.com/findstones"
Explained:
sed -E '
This uses extended regexes, and opens a script of sed code so that I can list each pattern individually. Each will be executed in order on each line, so it's not super efficient, but it's "readable" as regex code goes, and reasonably maintainable once you understand it, and so easy to edit when something needs tweaking.
s/>$//;
Strip the closing > off the end, to preserve the URL before squashing out all the other tags.
s/href=/>/;
Use href= as a hook to put a > back in, so that all the tags can be squashed out in one pass.
s/(<[^>]+>)+/~/g;
Convert ALL the strings of tags and everything still in them to a simple delimiter each.
s/~[^~]+~//;
Eliminate the leading and second delimiter and the first unneeded field between them.
s/~[^~]+~/ /;
Eliminate the third and fourth delimiters and the unneeded third field between them, replacing them with the space you wanted in the output.
Those two are very similar, and could certainly be combined with minimal shenanigans, but I left them nigh-redundant for easier explication.
s/~/ /;
Convert the remaining delimiter to the other space you wanted between the remaining fields.
' x
Close the script and give it the filename to read.
Obviously, this leaves a LOT of room for improvement, and is in many ways stylistically repulsive, but hopefully it is a simple explanation of tricks you can hack into a maintainably useful solution to your issue.
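As one possible way of combining the last three substitutions, here is a sketch that keeps the first three steps unchanged and then picks out the second and fourth fields with capture groups. The variable names row and combined are just for illustration, and the sample row is copied from the question.

```shell
# Sample row copied from the question.
row='</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">'

# Same first three steps as above, then a single substitution that keeps
# fields 2 and 4 and leaves the quoted URL that follows untouched.
combined=$(printf '%s\n' "$row" | sed -E '
  s/>$//
  s/href=/>/
  s/(<[^>]+>)+/~/g
  s/^~[^~]+~([^~]+)~[^~]+~([^~]+)~/\1 \2 /
')
printf '%s\n' "$combined"
```

For this row it prints 12/01/2015 158520312 "https://resource.com/stones". It is denser than the stepwise version, so it trades some readability for brevity.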
Good luck.

Related

How do I extract specific strings from multiple files and write them to .txt in bash?

I have a lot of files in the folder filesToCheck, some examples given below. I need the output result.txt as also shown below. I can use linux bash with any commands that do not require extra installations.
The try-out below (with help from stackoverflow) has two problems when I execute it. It only looks for one instance of CAKE_FROSTING despite the global flag g and the file result.txt remains empty despite > result.txt.
sed -Enz 's/.*CAKE_FROSTING\(\n?"([^"]*).*/\1\n/gp' filesToCheck/* > result.txt
What do I need to change?
file1.cpp
something
CAKE_FROSTING("is.simply.the.best", "[no][matter][what]") { DO(something(0) == 1); }
file2.h
something else
CAKE_FROSTING(
"is.kinda.neat",
"[i][agree]") something else
something more
file3.cpp
random_text CAKE_FROSTING("Can be nice") "more random text"
CAKE_CREAM("totally.sucks", "[trust][me]")
random_text CAKE_FROSTING("Can be very nice") "indeed"
desiredResult.txt
is.simply.the.best
is.kinda.neat
Can be nice
Can be very nice
currentResult command line output:
is.simply.the.best
is.kinda.neat
Can be nice
Assuming the string CAKE_FROSTING occurs once per line, you can try this sed
$ sed -En ':a;N;s/.*CAKE_FROSTING\(\n?"([^"]*).*/\1/p;ba' filesToCheck/*
is.simply.the.best
is.kinda.neat
Can be nice
Can be very nice

sed not replacing a full sentence

ssh root@$IP sed -i -e 's/listen\t80\default_server;/test/' /etc/nginx/conf.d/default.conf
Is there something I am not doing correctly?
I am doing this to learn how to use sed, but I think the best route for applying a general configuration across multiple servers is to upload the conf file? Any input would be appreciated, thanks!
It appears that you are missing a tab:
listen\t80\tdefault_server
If it were me, I'd replace the tab pattern with a general whitespace pattern to allow a little flexibility:
listen\s\+80\s\+default_server
or
listen[[:space:]]\+80[[:space:]]\+default_server
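A quick way to check the whitespace-tolerant pattern before touching the real config is to feed sed a couple of sample lines. This is a sketch with made-up input (one tab-separated line, one space-separated line); it uses -E so that + can be written without the backslash.

```shell
# Hypothetical config lines: one separated by tabs, one by spaces.
result=$(printf 'listen\t80\tdefault_server;\nlisten   80 default_server;\n' |
  sed -E 's/listen[[:space:]]+80[[:space:]]+default_server;/test/')
printf '%s\n' "$result"
```

Both lines come out as test, so the same expression covers either kind of spacing.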

Trying to 'grep' links from downloaded html pages in bash shell environment without cut, sed, tr commands (only e/grep)

In Linux shell, I am trying to return links to JPG files from the downloaded HTML script file. So far I only got to this point:
grep 'http://[:print:]*.jpg' 'www_page.html'
I don't want to use auxiliary commands like 'tr', 'cut', 'sed' etc...'lynx' is okay!
Using grep alone without massaging the file is doable but not recommended, as many have pointed out in the comments.
If you can loosen your requirements a bit, you can use HTML Tidy to massage the downloaded HTML file so that each HTML element is on its own line, which lets the regular expression be as simple as you wanted, something like this:
$ tidy file.html | grep -o 'http://[[:print:]]*\.jpg'
Note the use of the -o option to grep, which prints only the matching part of the input.
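If several links do end up on one line, [[:print:]]* can run past the closing quote and swallow everything up to the last .jpg. A sketch of a tighter variant, assuming the URLs sit inside double-quoted attributes (the example.com fragment is made up):

```shell
# Hypothetical page fragment with three links on a single line.
html='<img src="http://example.com/a.jpg"><a href="http://example.com/b.png">x</a><img src="http://example.com/c.jpg">'

# [^"]* stops at the closing quote, so each URL is matched separately.
links=$(printf '%s\n' "$html" | grep -o 'http://[^"]*\.jpg')
printf '%s\n' "$links"
```

This prints the two .jpg URLs on separate lines and skips the .png one.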

Linux rename function not being used correctly

I'm trying to use the rename command in a Terminal in Ubuntu to append a string to the beginning of some avi file names as follows.
rename -n 's/(\w)\.avi$/String_to_add__$1\.avi/' *.avi
So I expect the following:
String_to_add_MyMovie.avi
Problem is that when I run the command it appends the string to the end of the file name, so I end up with the following:
MyMovie_String_to_add_.avi
I'm not sure if I have the perlexpr syntax wrong or something else. Any insight is appreciated.
UPDATE:
Thanks for the suggestions, I tried the suggestions from alno and plundra and made the following modification:
rename -n 's/(\w+)\.avi$/String_to_add__$1\.avi/' *.avi
But now the file gets the string inserted in the middle of the name as follows:
My_String_to_add_Movie
My apologies though, I neglected to mention that the titles are preceded by 3 numeric values, so the file name nomenclature is {3 numbers}-My_Movie.avi so for example 001-My_Movie.avi. But I didn't think this would make a difference since I'm assuming \w+ matches alphanumeric characters, might the '-' be the issue?
Haven't tried Christian's approach yet, I want to be able to use the rename command, or at least understand why it's not working before I try a different approach.
I don't think rename -n is standard. You could do this:
for i in *.avi; do mv -- "$i" "String_to_add_$i"; done
You're only matching a single character with \w, you want \w+, so the complete line would be:
rename -n 's/(\w+)\.avi$/String_to_add__$1\.avi/' *.avi
Correct version:
rename -n 's/(\w+)\.avi$/String_to_add__$1\.avi/' *.avi
You simply forgot + after \w, so it tried to match only one character.
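For names like 001-My_Movie.avi, \w+ cannot match the "-", so the match (and therefore the inserted string) starts after it. The sketch below previews both behaviours with sed as a stand-in for rename's Perl expression; [[:alnum:]_] approximates \w, and the variable names are only for illustration.

```shell
name='001-My_Movie.avi'   # file name pattern described in the question's update

# The word-character run cannot include "-", so the match starts after it.
unanchored=$(printf '%s\n' "$name" | sed -E 's/([[:alnum:]_]+)\.avi$/String_to_add_\1.avi/')

# Anchoring at the start prepends no matter what the name contains.
anchored=$(printf '%s\n' "$name" | sed -E 's/^/String_to_add_/')

printf '%s\n%s\n' "$unanchored" "$anchored"
```

The first line is 001-String_to_add_My_Movie.avi, the second is String_to_add_001-My_Movie.avi. With the Perl-based rename, the anchored form would presumably be rename -n 's/^/String_to_add_/' *.avi.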

How can I replace a specific line by line number in a text file?

I have a 2GB text file on my linux box that I'm trying to import into my database.
The problem I'm having is that the script that is processing this rdf file is choking on one line:
mismatched tag at line 25462599, column 2, byte 1455502679:
<link r:resource="http://www.epuron.de/"/>
<link r:resource="http://www.oekoworld.com/"/>
</Topic>
=^
I want to replace the </Topic> with </Line>. I can't do a search/replace on all lines, but I do have the line number, so I'm hoping there's some easy way to just replace that one line with the new text.
Any ideas/suggestions?
sed -i -e '25462599s!</Topic>!</Line>!' yourfile.xml
sed -i '25462599 s|</Topic>|</Line>|' nameoffile.txt
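To convince yourself that the line-number address limits the substitution to one line, you can try it without -i on a tiny sample first. This is a sketch on made-up input in which the same closing tag appears on lines 2 and 4:

```shell
# Hypothetical four-line sample; only line 4 should change.
fixed=$(printf '<Topic>\n</Topic>\n<Topic>\n</Topic>\n' | sed '4s|</Topic>|</Line>|')
printf '%s\n' "$fixed"
```

Line 2 keeps its </Topic> and only line 4 becomes </Line>.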
The tool for editing text files in Unix is called ed (as opposed to sed, which as the name implies is a stream editor).
ed was originally intended as an interactive editor, but it can also easily be scripted. In ed, every command takes an address parameter. The way to address a specific line is just the line number, and the way to change the addressed line(s) is the s command, which takes the same regexp that sed would. So, to change the 42nd line, you would write something like 42s/old/new/.
Here's the entire command:
FILENAME=/path/to/whereever
LINENUMBER=25462599
ed -- "${FILENAME}" <<-HERE
${LINENUMBER}s!</Topic>!</Line>!
w
q
HERE
The advantage of this is that ed is standardized, while the -i flag to sed is a non-standard extension (it is not part of POSIX, and GNU and BSD sed treat its argument differently) that is not available on every system.
Use "head" to get the first 25462598 lines and "tail" to get the remaining lines (starting at 25462600), with the replacement line written in between. For a 2GB file, though, this will likely take a while.
Also, are you sure the problem is just with that line and not somewhere earlier? The error looks like an XML parse error, which might mean the actual problem is somewhere else.
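The head/tail idea can be sketched on a small temporary file like this. The file contents, the line number n and the replacement text are all made up for illustration; for the real 2GB file you would redirect the three commands to a new file instead of capturing them in a variable.

```shell
# Hypothetical four-line file; line 3 is the one to replace.
src=$(mktemp)
printf 'one\ntwo\nthree\nfour\n' > "$src"
n=3

patched=$(
  head -n "$((n - 1))" "$src"     # everything before line n
  printf '%s\n' 'THREE'           # the replacement line
  tail -n "+$((n + 1))" "$src"    # everything after line n
)
printf '%s\n' "$patched"
rm -f "$src"
```

Here the output is one, two, THREE, four, each on its own line.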
My shell script:
#!/bin/bash
awk -v line="$1" -v new_content="$2" '{
    if (NR == line) {
        # Replace the target line with the new text.
        print new_content;
    } else {
        print $0;
    }
}' "$3"
Arguments:
first: the line number you want to change
second: the text you want instead of the original line contents
third: the file name
This script prints its output to stdout, so you need to redirect it to a file. Example:
./script.sh 5 "New fifth line text!" file.txt > new_file.txt
You can improve it, for example, by checking that all your arguments have the expected values.
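One way such checks might look, written as a shell function rather than a standalone script. The function name replace_line, the error messages and the demo file are all assumptions made for this sketch.

```shell
replace_line() {
  # Expect exactly: line number, new text, file name.
  if [ "$#" -ne 3 ]; then
    echo "usage: replace_line LINE_NUMBER NEW_TEXT FILE" >&2
    return 2
  fi
  # Reject an empty value or anything containing a non-digit.
  case $1 in
    ''|*[!0-9]*) echo "line number must contain digits only: $1" >&2; return 2 ;;
  esac
  if [ ! -r "$3" ]; then
    echo "cannot read file: $3" >&2
    return 2
  fi
  awk -v line="$1" -v new_content="$2" 'NR == line { print new_content; next } { print }' "$3"
}

# Hypothetical three-line demo file.
demo=$(mktemp)
printf 'a\nb\nc\n' > "$demo"
replace_line 2 'B' "$demo"
```

Note that awk interprets backslash escapes in values passed with -v, so replacement text containing backslashes may need extra care.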
