How can I extract the text of multiple URLs with lynx/w3m in Linux - linux

I have made a list of 50 odd URLs in one text file (one URL per each line). Now, for each URL I want to extract the text of the web site and save it down. This sounds like a job for a shell script in Linux.
At the moment I am putting things together:
with say sed -n 1p listofurls.txt I could read the first line in my URL file, listofurls.txt
with lynx -dump www.firsturl... I can use the output for piping through various commands to tidy and clean it up. Done, that works.
Before automating, I am struggling to pipe the URLs into lynx: say
sed -n 1p listofurls.txt | lynx -dump -stdin
does not work.
How can I do that say for one URL, and more importantly for each URL I have in listofurls.txt?

You can write a script like this:
vi script.sh
#content of script.sh#
while IFS= read -r line
do
name=$line
wget "$name"
echo "Downloaded content from - $name"
done < "$1"
#end#
chmod +x script.sh
./script.sh listofurls.txt

To pipe one URL to lynx, you can use xargs:
sed -n 1p listofurls.txt | xargs lynx -dump
To process all URLs from the file (render each one with lynx and just print the result), you can do:
while IFS= read -r url; do lynx -dump "$url"; done < listofurls.txt
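To keep the text of each page in its own file instead of printing everything to the terminal, the loop can be wrapped roughly as follows. This is only a sketch: the function name dump_urls, the page-NNN.txt naming scheme, and the optional "dumper" argument are assumptions made for illustration; only lynx -dump comes from the answers above.

```shell
#!/bin/sh
# dump_urls LIST OUTDIR [DUMPER...]   (hypothetical helper, not a standard tool)
# Reads one URL per line from LIST and writes the output of the dumper for
# each URL to OUTDIR/page-001.txt, OUTDIR/page-002.txt, ...
# DUMPER defaults to "lynx -dump" and is called with the URL as last argument.
dump_urls() {
    list=$1
    outdir=$2
    shift 2
    [ "$#" -gt 0 ] || set -- lynx -dump
    mkdir -p "$outdir" || return 1
    n=0
    # "|| [ -n "$url" ]" also handles a last line without a trailing newline
    while IFS= read -r url || [ -n "$url" ]; do
        [ -n "$url" ] || continue            # skip blank lines
        n=$((n + 1))
        out=$(printf '%s/page-%03d.txt' "$outdir" "$n")
        # </dev/null keeps the dumper from consuming the URL list on stdin
        "$@" "$url" < /dev/null > "$out" || echo "failed: $url" >&2
    done < "$list"
}
```

Called as dump_urls listofurls.txt pages it would use lynx; something like dump_urls listofurls.txt pages w3m -dump should work the same way, assuming w3m is installed.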

Related

Using Bash to cURL a website and grep for keywords

I'm trying to write a script that will do a few things in the following order:
cURL websites from a list of urls contained within a "url_list.txt" (new-line delineated) file.
For each website in the list, I want to grep that website looking for keywords contained within a "keywords.txt" (new-line delineated) file.
I want to finish by printing to the terminal in the following format (or something similar):
$URL (that contained match) : $keyword (that made the match)
It needs to be able to run in Ubuntu (GNU grep, etc.)
It does not need to be cURL and grep; as long as the functionality is there.
So far I've got:
#!/bin/bash
keywords=$(cat ./keywords.txt)
urllist=$(cat ./url_list.txt)
for url in $urllist; do
content="$(curl -L -s "$url" | grep -iF "$keywords" /dev/null)"
echo "$content"
done
But for some reason, no matter what I try to tweak or change, it keeps failing to one degree or another.
How can I go about accomplishing this task?
Thanks
Here's how I would do it:
#!/bin/bash
keywords="$(<./keywords.txt)"
while IFS= read -r url; do
curl -L -s "$url" | grep -ioF "$keywords" |
while IFS= read -r keyword; do
echo "$url: $keyword"
done
done < ./url_list.txt
What did I change:
I used $(<./keywords.txt) to read the keywords.txt. This does not rely on an external program (cat in your original script).
I changed the for loop that loops over the url list into a while loop. This guarantees that we use Θ(1) memory (i.e. we don't have to load the entire url list in memory).
I removed /dev/null from grep. Grepping /dev/null alone is meaningless, since it will find nothing there. Instead, I invoke grep with no file arguments so that it filters its stdin (which happens to be the output of curl in this case).
I added the -o flag for grep so that it outputs only the matched keyword.
I removed the subshell where you were capturing the output of curl. Instead I run the command directly and feed its output to a while loop. This is necessary because we might get more than one keyword match per url.
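The matching step can be tried without any network access by feeding text on stdin. The sketch below is a variation, not the script above: the function name report_matches is made up for illustration, and it reads the keyword list with grep's -f option instead of loading it into a variable.

```shell
#!/bin/sh
# report_matches URL KEYWORDS_FILE   (hypothetical helper)
# Reads page content on stdin and prints "URL: keyword" for every match.
report_matches() {
    url=$1
    keywords_file=$2
    # -i ignore case, -o print only the matched text,
    # -F treat patterns as fixed strings, -f read one pattern per line
    grep -i -o -F -f "$keywords_file" |
    while IFS= read -r keyword; do
        printf '%s: %s\n' "$url" "$keyword"
    done
}
```

In the real script the input would come from curl, for example curl -L -s "$url" | report_matches "$url" ./keywords.txt. Note that a keyword occurring several times is reported once per occurrence.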

Script to open latest text file from a directory

I need a shell script to open the latest text file from a given directory. It will then be copied to another directory. How can I achieve this?
I need logic that will search for and return the latest file in a directory (the name of the text file can be anything (not fixed), so I need to find the latest text file).
Here you can do something like this
#!/bin/sh
SOURCE_DIR=/home/juned/Downloads
DEST_DIR=/tmp/
LAST_MODIFIED_FILE=`ls -t ${SOURCE_DIR}| head -1`
echo $LAST_MODIFIED_FILE
#Open file
vim $SOURCE_DIR/$LAST_MODIFIED_FILE
#Copy file
cp $SOURCE_DIR/$LAST_MODIFIED_FILE $DEST_DIR
echo "File copied successfully"
You can specify any application name in which you want to open that file like gedit, kate etc. Here I've used vim.
xdg-open - opens a file or URL in the user's preferred application
Not an expert in bash but you can try this logic:
First, grab the latest file using ls -t: -t sorts by time, and head -1 gets the first file.
F=`ls -t * | head -1`
Then open the file using an editor:
xdg-open $F
gedit $F
...
As suggested by @AJefferiss, you can directly do:
xdg-open $(ls -t * | head -1)
gedit $(ls -t * | head -1)
For editing the latest modified / created,
vim $(ls -t | head -1)
For editing the latest in alphanumerical order,
vim $(ls -1 | tail -1)
In one line (if you are sure that there are only files):
vim `ls -t .|head -1`
It will be opened in vim (or use another text editor).
If there are directories, you should write a script with a loop and test every file (to check that it's not a dir):
if [ -f $FILE ];
Or you can use find, or use a pipe to get the latest file:
ls -lt .|sed -n 2p|grep -v '^d'
The existing answers are helpful, but fall short when it comes to dealing with filenames with embedded spaces or other shell metacharacters.[1]
# Get the most recently modified *.txt file.
# (On *assignment*, names with spaces, ... are not a concern.)
f=$(ls -t *.txt | head -n 1)
# *Use* the variable enclosed in *double-quotes* to ensure that it is passed
# to the target command unmodified.
xdg-open "$f" # could also use "$(ls -t *.txt | head -n 1)" directly
Additionally, some answers use all-uppercase shell variable names, which should be avoided so as to avoid conflicts with environment variables.
[1] Due to use of ls, filenames with embedded newlines won't be handled correctly, but that's rarely a real-world concern.
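If parsing ls output is a concern, one possible alternative is to compare modification times directly with the shell's -nt ("newer than") test. This is a sketch under some assumptions: the function name newest_txt is invented here, and -nt is not required by POSIX, although bash, dash, ksh and zsh support it.

```shell
#!/bin/sh
# newest_txt DIR   (hypothetical helper)
# Prints the most recently modified regular *.txt file in DIR.
# Directories are skipped, and names with spaces or newlines are handled
# because the file list comes from a glob rather than from ls output.
newest_txt() {
    dir=$1
    newest=
    for f in "$dir"/*.txt; do
        [ -f "$f" ] || continue              # skip directories and an unmatched glob
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    # Returns non-zero (and prints nothing) when no file was found
    [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

It could then be used as xdg-open "$(newest_txt /some/dir)", keeping the double quotes for the reason given above.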

Unix lynx in shell script to input data into website and grep result

Pretty new to unix shell scripting here, and I have some other examples to look at but still trying from almost scratch. I'm trying to track deliveries for our company, and I have a script I want to run that will input the tracking number into the website, and then grep the result to a file (delivered/not delivered). I can use the lynx command to get to the website at the command line and see the results, but in the script it just returns the webpage, and doesn't enter the tracking number.
Here's the code I have tried that works up to this point:
#$1 = 1034548607
FNAME=`date +%y%m%d%H%M%S`
echo requiredmcpartno=$1 | lynx -accept_all_cookies -nolist -dump -post_data http://apps.yrcregional.com/shipmentStatus/track.do 2>&1 | tee $FNAME >/home/jschroff/log.lg
DLV=`grep "PRO" $FNAME | cut --delimiter=: --fields=2 | awk '{print $DLV}'`
echo $1 $DLV > log.txt
rm $FNAME
I'm trying to get the results for the tracking number(PRO number as they call it) 1034548607.
Try doing this with curl :
trackNumber=1234
curl -A Mozilla/5.0 -b cookies -c cookies -kLd "proNumber=$trackNumber" http://apps.yrcregional.com/shipmentStatus/track.do
But verify the TOS to know if you are authorized to scrape this web site.
If you want to parse the output, give us a sample HTML output.

Saving multiple URLs at once using Linux Centos

So I have a list of about 1000 URLs in a txt file, one per line. I wish to save the contents of every page to a file. How can I automate this?
You can use wget with the -i option to let it download a list of URLs. Assuming your URLs are stored in a file called urls.txt:
wget -i urls.txt
The problem here might be that the filenames can be the same for multiple websites (e.g. index.html), so that wget will append a number which makes it hard/impossible to connect a file to the original URL just by looking at the filename.
The solution to that would be to use a loop like this:
while read -r line
do
wget "$line" -O <...>
done < urls.txt
You can specify a custom filename with the -O option.
Or you can "build" the file name from the url you are processing.
while read -r line
do
fname=$(echo "$line" | sed -e 's~http[s]*://~~g' -e 's~[^A-Za-z0-9]~-~g')
fname=${fname}.html
wget "$line" -O "$fname"
done < urls.txt

less-style markdown viewer for UNIX systems

I have a Markdown string in JavaScript, and I'd like to display it (with bolding, etc) in a less (or, I suppose, more)-style viewer for the command line.
For example, with a string
"hello\n" +
"_____\n" +
"*world*!"
I would like to have output pop up with scrollable content that looks like
hello
world
Is this possible, and if so how?
Pandoc can convert Markdown to groff man pages.
This (thanks to nenopera's comment):
pandoc -s -f markdown -t man foo.md | man -l -
should do the trick. The -s option tells it to generate proper headers and footers.
There may be other markdown-to-*roff converters out there; Pandoc just happens to be the first one I found.
Another alternative is the markdown command (apt-get install markdown on Debian systems), which converts Markdown to HTML. For example:
markdown README.md | lynx -stdin
(assuming you have the lynx terminal-based web browser).
Or (thanks to Danny's suggestion) you can do something like this:
markdown README.md > README.html && xdg-open README.html
where xdg-open (on some systems) opens the specified file or URL in the preferred application. This will probably open README.html in your preferred GUI web browser (which isn't exactly "less-style", but it might be useful).
I tried to write this in a comment above, but I couldn't format my code block correctly. To write a 'less filter', try, for example, saving the following as ~/.lessfilter:
#!/bin/sh
case "$1" in
*.md)
extension-handler "$1"
pandoc -s -f markdown -t man "$1"|groff -T utf8 -man -
;;
*)
# We don't handle this format.
exit 1
esac
# No further processing by lesspipe necessary
exit 0
Then, you can type less FILENAME.md and it will be formatted like a manpage.
If you are into colors then maybe this is worth checking as well:
terminal_markdown_viewer
It can also be used directly from within other programs or Python modules.
And it has a lot of styles, over 200 for markdown and code, which can be combined.
Disclaimer
It is pretty alpha; there may still be bugs.
I'm the author of it; maybe some people will like it ;-)
A totally different alternative is mad. It is a shell script I've just discovered. It's very easy to install and it does render markdown in a console pretty well.
I wrote a couple functions based on Keith's answer:
mdt() {
markdown "$*" | lynx -stdin
}
mdb() {
local TMPFILE=$(mktemp)
markdown "$*" > $TMPFILE && ( xdg-open $TMPFILE > /dev/null 2>&1 & )
}
If you're using zsh, just place those two functions in ~/.zshrc and then call them from your terminal like
mdt README.md
mdb README.md
"t" is for "terminal", "b" is for browser.
Using OSX I prefer to use this command
brew install pandoc
pandoc -s -f markdown -t man README.md | groff -T utf8 -man | less
Convert the markup, format the document with groff, and pipe it into less.
credit: http://blog.metamatt.com/blog/2013/01/09/previewing-markdown-files-from-the-terminal/
This is an alias that encapsulates a function:
alias mdless='_mdless() { if [ -n "$1" ] ; then if [ -f "$1" ] ; then cat <(echo ".TH $1 7 `date --iso-8601` Dr.Beco Markdown") <(pandoc -t man $1) | groff -K utf8 -t -T utf8 -man 2>/dev/null | less ; fi ; fi ;}; _mdless '
Explanation
alias mdless='...' : creates an alias for mdless
_mdless() {...}; : creates a temporary function to be called afterwards
_mdless : at the end, call it (the function above)
Inside the function:
if [ -n "$1" ] ; then : if the first argument is not null then...
if [ -f "$1" ] ; then : also, if the file exists and is regular then...
cat arg1 arg2 | groff ... : cat sends these two arguments, concatenated, to groff; the arguments being:
arg1: <(echo ".TH $1 7 `date --iso-8601` Dr.Beco Markdown") : something that starts the file and that groff will understand as the header and footer notes. This substitutes for the empty header produced by pandoc's -s option.
arg2: <(pandoc -t man $1) : the file itself, filtered by pandoc, which outputs file $1 in man format
| groff -K utf8 -t -T utf8 -man 2>/dev/null : pipes the resulting concatenated file to groff:
-K utf8 so groff understands the input file encoding
-t so it displays tables in the file correctly
-T utf8 so it outputs in the correct format
-man so it uses the man macro package to output the file in man format
2>/dev/null to ignore errors (after all, it's a raw file being turned into a man page by hand; we don't care about the errors as long as we can see the file in a not-so-ugly format).
| less : finally, shows the file, paginating it with less. (I've tried to avoid this pipe by using groffer instead of groff, but groffer is not as robust as less, and some files hang it or do not show at all. So, let it go through one more pipe, what the heck!)
Add it to your ~/.bash_aliases (or alike)
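The same pipeline can also be written as a plain shell function, which avoids wrapping a function inside an alias. This is a sketch rather than a drop-in replacement: it assumes pandoc, groff and less are installed, uses date +%Y-%m-%d in place of the GNU-only date --iso-8601, and drops the author field from the header line.

```shell
# mdless FILE   (function variant of the alias above; place it in ~/.bashrc or similar)
mdless() {
    # Refuse a missing argument or anything that is not a regular file
    if [ -z "$1" ] || [ ! -f "$1" ]; then
        echo "usage: mdless FILE.md" >&2
        return 1
    fi
    {
        # .TH line: title, section, date, source; groff uses it for header and footer
        printf '.TH "%s" 7 "%s" "Markdown"\n' "$1" "$(date +%Y-%m-%d)"
        pandoc -f markdown -t man "$1"
    } | groff -K utf8 -t -T utf8 -man 2>/dev/null | less
}
```

Unlike the alias, this version prints a usage message instead of silently doing nothing when the file does not exist.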
I personally use this script:
#!/bin/bash
id=$(uuidgen | cut -c -8)
markdown $1 > /tmp/md-$id
google-chrome --app=file:///tmp/md-$id
It renders the markdown into HTML, puts it into a file in /tmp/md-... and opens that in a kiosk Chrome session with no URI bar etc. You just pass the md file as an argument or pipe it into stdin. Requires markdown and Google Chrome. Chromium should also work, but you need to replace the last line with
chromium-browser --app=file:///tmp/md-$id
If you wanna get fancy about it, you can use some css to make it look nice, I edited the script and made it use Bootstrap3 (overkill) from a CDN.
#!/bin/bash
id=$(uuidgen | cut -c -8)
markdown $1 > /tmp/md-$id
sed -i "1i <html><head><style>body{padding:24px;}</style><link rel=\"stylesheet\" type=\"text/css\" href=\"http://maxcdn.bootstrapcdn.com/bootstrap/3.2.0/css/bootstrap.min.css\"></head><body>" /tmp/md-$id
echo "</body>" >> /tmp/md-$id
google-chrome --app=file:///tmp/md-$id > /dev/null 2>&1 &
I'll post my unix page answer here, too:
An IMHO heavily underrated command-line markdown viewer is markdown-cli.
Installation
npm install markdown-cli --global
Usage
markdown-cli <file>
Features
Probably not noticed much, because it lacks any documentation...
But as far as I could figure out from some example markdown files, these are some things that convinced me:
handles ill-formatted files much better (similarly to atom, github, etc.; e.g. when blank lines are missing before lists)
more stable with formatting in headers or lists (bold text in lists breaks sublists in some other viewers)
proper table formatting
syntax highlighting
resolves footnote links to show the link instead of the footnote number (not everyone might want this)
Drawbacks
I have noticed the following issues:
code blocks are flattened (all leading spaces disappear)
two blank lines appear before lists
