Trying to 'grep' links from downloaded html pages in bash shell environment without cut, sed, tr commands (only e/grep) - linux

In Linux shell, I am trying to return links to JPG files from the downloaded HTML script file. So far I only got to this point:
grep 'http://[:print:]*.jpg' 'www_page.html'
I don't want to use auxiliary commands like 'tr', 'cut', 'sed', etc.; 'lynx' is okay!

Using grep alone without massaging the file is doable, but not recommended, as many have pointed out in the comments.
If you can loosen your requirements a bit, you can use HTML Tidy to massage the downloaded HTML file so that each HTML element ends up on its own line; the regular expression can then stay as simple as you wanted, something like this:
$ tidy file.html|grep -o 'http://[[:print:]]*.jpg'
Note the use of the -o option to grep, which prints only the matching part of the input.
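As a small, hedged refinement: the dot before jpg in that pattern is unescaped, so it matches any single character; escaping it restricts the match to a literal ".jpg" (-q just tells tidy to suppress its non-essential messages):
$ tidy -q file.html | grep -o 'http://[[:print:]]*\.jpg'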


"Cat" into multiple files using brace expansion

I am quite new to bash and trying to type some text into multiple files with a single command using brace expansion.
I tried: cat > file_{1..100} to write into 100 files some text that I will type in the terminal. I get the following error:
bash: file_{1..100}: ambiguous redirect
I also tried: cat > "file_{1..100}" but that creates a single file named: file_{1..100}.
I tried: cat > `file_{1..100}` but that gives the error:
file_1: command not found
How can I achieve this using brace expansion? Maybe there are other ways using other utilities and/or pipelines. But I want to know if that is possible using only simple brace expansion or not.
You can't do this with cat alone. It only writes its output to its standard output, and that single file descriptor can only be associated with a single file.
You can however do it with tee file_{1..100}.
You may wish to consider using tee file_{01..100} instead, so that the filenames are zero-padded to all have the same width: file_001, file_002, ... This has the advantage that lexicographic order will agree with numerical order, and so ls, *, etc., will process them in numerical order. Without this, you have the situation that file_2 comes after file_10 in lexicographic order.
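A quick sketch of how the zero-padded form behaves (the text here is only a placeholder):
printf 'some text\n' | tee file_{01..100} > /dev/null   # brace expansion runs first, so tee creates file_001 ... file_100
ls file_*                                               # listed in numeric order, since the names are padded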
The redirection target can only be a single file (or a pipe), not multiple files.
If you want to redirect output to multiple files, use tee:
cat | tee file_{1..100}
Don't forget to check man tee; for example, if you want to append to the files, add the -a option (tee -a file_{1..100}).
This writes the string or text into file{1..4}, though note that in plain bash it triggers the same "ambiguous redirect" error described in the question:
echo "hello you just knew me by kruz" > file{1..4}
Use this to remove them afterwards:
rm file*

Is it possible to display a file's contents and delete that file in the same command?

I'm trying to display the output of an AWS lambda that is being captured in a temporary text file, and I want to remove that file as I display its contents. Right now I'm doing:
... && cat output.json && rm output.json
Is there a clever way to combine those last two commands into one command? My goal is to make the full combined command string as short as possible.
For cases where:
• it is possible to control the name of the temporary text file;
• the file is not used by other code;
it is possible to pass "/dev/stdout" as the name of the output file.
Regarding portability, see the Stack Exchange question "how portable ... /dev/stdout".
POSIX 7 says they are extensions (Base Definitions, Section 2.1.1, Requirements):
The system may provide non-standard extensions. These are features not required by POSIX.1-2008 and may include, but are not limited to:
[...]
• Additional character special files with special properties (for example, /dev/stdin, /dev/stdout, and /dev/stderr)
Using the mandatorily supported /dev/tty instead would force the output to the "current" terminal, making it impossible to pipe the output of the whole command into a different program (or a log file), or to use the program when there is no connected terminal (cron jobs and other automation tools).
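As a hedged sketch of the idea (some_producer and --output are placeholders for whatever currently writes output.json, not real flags of any particular tool): give /dev/stdout as the output file name and the data goes straight into the pipeline, so there is nothing left to cat or rm afterwards.
some_producer --output /dev/stdout | less   # the "output file" is the pipe; nothing lands on disk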
No, you cannot easily remove the lines of a file while displaying them. It would be highly inefficient, as it would require removing characters from the beginning of the file each time you read a line. Current filesystems are pretty good at truncating a file at its end, but not at its beginning.
A simple but extremely slow method would look like this:
while [ -s output.json ]
do
    head -1 output.json
    sed -i 1d output.json
done
While this algorithm is plain and simple, you should know that each time you remove the first line with sed -i 1d it will copy the whole content of the file except the first line into a temporary file, resulting in approximately 0.5*n² lines written in total (where n is the number of lines in your file).
In theory you could avoid this by doing something like this:
while [ -s output.json ]
do
    line=$(head -1 output.json)
    printf -- '%s\n' "$line"
    # collapse the first ${#line}+1 bytes (the line plus its newline) from the start of the file
    fallocate -c -o 0 -l $((${#line}+1)) output.json
done
But this does not account for variable newline characters (namely DOS-formatted newlines) and fallocate does not always work on xfs, among other issues.
Since you are trying to consume a file alongside its creation without leaving a trace of its existence on disk, you are essentially asking for a pipe functionality. In my opinion you should look into how your output.json file is produced and hopefully you can pipe it to a script of your own.
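For instance, a minimal sketch with a named pipe, where producer stands in for whatever currently creates output.json:
mkfifo output.fifo          # a named pipe: data passes through it without being stored on disk
producer > output.fifo &    # the writer blocks until a reader opens the pipe
cat output.fifo             # consume and display the data as it arrives
rm output.fifo              # removes only the pipe; no file contents ever touched the disk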

Parsing HTML table in Bash using sed

In bash I am trying to parse following file:
Input:
</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">
Wanted output:
12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples~withstuff"
11/07/2011 67100270 "https://stuff.com/findstones"
I got to the point that I have:
# less input.txt | sed -e "s/><tr><td//" -e "s/\///" -e "s/a>//" -e "s/<\/td><\/tr>//g" -e "s/<\/td><td>//g" -e "s/>$//g" -e "s/<a class=\"btn-down\" download href=//g"
<stuff.txt (15.18 KB)12/01/2015Large things158520312"https://resource.com/stones"
<flowers.pdf (83.03 MB)23/03/2011Large flowers872448000"https://resource.com/flosers with stuff"
<apples.pdf (281.16 MB)21/04/2012Large things like apples299009564"https://resource.com/apples"
<stones.pdf (634.99 MB)11/07/2011Large stones from mountains67100270"https://stuff.com/findstones"
Is there an easier way to parse it? I feel that it can be done much more simply, and I am not even halfway through the parsing.
Could you please try the following and let us know if it helps you:
awk -F"[><]" '{sub(/.*=/,"",$28);print $15,$23,$28}' Input_file
I'm sure the best way to solve your problem is to use an HTML parser. A solution for the shown sample file:
sed -r 's/.*(..\/..\/....).*>([0-9]*)<\/.*href=([^>]*)>/\1 \2 \3/I' input.txt
Personally, I'd use perl, but that's not what you asked, so...
A pedantic stepwise approach, so that you can edit bits of the logic when needed.
Assuming the input is a file named x:
</a></td></tr><tr><td>stuff.txt (15.18 KB)</td><td>12/01/2015</td><td>Large things</td><td>158520312</td><td><a class="btn-down" download href="https://resource.com/stones">
</a></td></tr><tr><td>flowers.pdf (83.03 MB)</td><td>23/03/2011</td><td>Large flowers</td><td>872448000</td><td><a class="btn-down" download href="https://resource.com/flosers with stuff">
</a></td></tr><tr><td>apples.pdf (281.16 MB)</td><td>21/04/2012</td><td>Large things like apples</td><td>299009564</td><td><a class="btn-down" download href="https://resource.com/apples">
</a></td></tr><tr><td>stones.pdf (634.99 MB)</td><td>11/07/2011</td><td>Large stones from mountains</td><td>67100270</td><td><a class="btn-down" download href="https://stuff.com/findstones">
Try this:
sed -E '
s/>$//;
s/href=/>/;
s/(<[^>]+>)+/~/g;
s/~[^~]+~//;
s/~[^~]+~/ /;
s/~/ /;
' x
Output:
12/01/2015 158520312 "https://resource.com/stones"
23/03/2011 872448000 "https://resource.com/flosers with stuff"
21/04/2012 299009564 "https://resource.com/apples"
11/07/2011 67100270 "https://stuff.com/findstones"
Explained:
sed -E '
This uses extended regexes, and opens a script of sed code so that I can list each pattern individually. Each will be executed in order on each line, so it's not super efficient, but it's "readable" as regex code goes, and reasonably maintainable once you understand it, and so easy to edit when something needs tweaking.
s/>$//;
Strip the closing > off the end, to preserve the URL before squashing out all the other tags.
s/href=/>/;
use the href= as a hook to insert the > back so we can squash out all the tags in one pass.
s/(<[^>]+>)+/~/g;
Convert ALL the strings of tags and everything still in them to a simple delimiter each.
s/~[^~]+~//;
Eliminate the leading and second delimiter and the first unneeded field between them.
s/~[^~]+~/ /;
Eliminate the third and fourth delimiters and the unneeded third field between them, replacing them with the space you wanted in the output.
Those two are very similar, and could certainly be combined with minimal shenanigans, but I left them nigh-redundant for easier explication.
s/~/ /;
Convert the remaining delimiter to the other space you wanted between the remaining fields.
' x
Close the script and give it the filename to read.
Obviously, this leaves a LOT of room for improvement, and is in many ways stylistically repulsive, but hopefully it is a simple explanation of tricks you can hack into a maintainably useful solution to your issue.
Good luck.

Obtaining file names from directory in Bash

I am trying to create a zsh script to test my project. The teacher supplied us with some input files and expected output files. I need to diff the output files from myExecutable with the expected output files.
Question: Does $iF contain a string in the following code or some kind of bash reference to the file?
#!/bin/bash
inputFiles=~/project/tests/input/*
outputFiles=~/project/tests/output
for iF in $inputFiles
do
./myExecutable $iF > $outputFiles/$iF.out
done
Note:
Any tips on fulfilling my objectives would be nice. I am new to shell scripting and I am using the following websites to quickly write the script (since I have to focus on the project development and not waste time on extra stuff):
Grammar for bash language
Beginner guide for bash
As your code is, $iF contains the full path of the file as a string.
N.B.: Don't use for iF in $inputFiles;
use for iF in ~/project/tests/input/* instead. Otherwise your code will fail if a path contains spaces or newlines.
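Putting both points together, a hedged sketch of the corrected loop (basename strips the directory part, so the .out file lands in the output directory instead of being built from the full input path):
outputFiles=~/project/tests/output
for iF in ~/project/tests/input/*
do
    name=$(basename "$iF")                          # file name only, without the path
    ./myExecutable "$iF" > "$outputFiles/$name.out"
done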
If you need to diff the files, you can do another for loop over your output files: grab just the file name with the basename command, then put that all together in a diff and write to a ".diff" file using the ">" operator to redirect standard output.
Then diff each one with the expected file, something like:
expectedOutput=~/<some path here>
diffFiles=~/<some path>
for oF in ~/project/tests/output/* ; do
    file=$(basename "$oF")
    diff "$oF" "${expectedOutput}/${file}" > "${diffFiles}/${file}.diff"
done

Why can't I detect this file?

I have this file in a directory say test.php whose contents are below
< ? php $XZKsyG=’as’;
I want to pick up the file test.php with a search based on its content. So from the directory containing it I do:
grep 'php \$[a-zA-Z]*=.as.;'
However, I get no result... What am I doing wrong?
Thanks
It works for me:
$ cat file
< ? php $XZKsyG=’as’;
$ grep 'php \$[a-zA-Z]*=.as.;' file
< ? php $XZKsyG=’as’;
Are you sure the contents of the file are exactly what you showed us?
Try cat -A file or od -c file to see whether the file really looks the way you think it does.
(Note that you don't need to escape the $ character; it's only a metacharacter at the end of a line. But escaping it should be ok.)
EDIT:
The characters around the as in your file are not ASCII apostrophes; they're Unicode RIGHT SINGLE QUOTATION MARK characters (0x2019). If the file is stored in UTF-8, each of them is represented as a 3-byte sequence. The grep command works for me because my locale settings ("en_US.UTF-8") are such that a UTF-8 character is matched by . in a regexp, even if it has a multi-byte representation. I suspect your locale is such that each of them would only be matched by three dots (one per byte).
Probably the simplest solution is to edit the file to use ASCII apostrophes.
You might also want to play around with your locale settings. Try the grep command with $LANG set to "en_US.UTF-8".
What's the output of the locale command?
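For example, the UTF-8 encoding of U+2019 is the three-byte sequence e2 80 99, which you can confirm and then retry the match under a UTF-8 locale:
$ printf '’' | od -An -tx1
 e2 80 99
$ LANG=en_US.UTF-8 grep 'php \$[a-zA-Z]*=.as.;' test.php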
That works fine for me, though you may want to look into those "funny" single quotes you have around as:
pax$ cat testfile
< ? php $XZKsyG='as';
pax$ grep 'php \$[a-zA-Z]*=.as.;' testfile
< ? php $XZKsyG='as';
Failing that, there are some things you can look at. Some of these may sound silly, but I'm really just covering all the bases.
Are you sure the file contains only what you think it does? Executing od -xcb file will give you a hex dump of it for better checking.
Are you sure you're accessing the right file, in the right directory?
Have you done something silly like aliasing grep to be something else?
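If the file really does contain the string and you want grep to report just the names of the matching files in the directory, the -l option does exactly that:
grep -l 'php \$[a-zA-Z]*=.as.;' ./*   # -l prints each matching file name once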
That's if you're looking for a file containing that string. If instead you're looking for a file named like that, you can use something like:
ls -1 | grep 'php \$[a-zA-Z]*=.as.;'
The ls -1 command gives you one file per line, and piping that through grep will filter out those not matching the pattern.
I suppose I should mention that I'm not really a big fan of file names with spaces in them, but I'm violently opposed to file names made up of PHP scripts :-)
