Bash, wget remove comma from output filename - linux

I am reading a file with URLs line by line and then passing each URL to wget:
FILE=/home/img-url.txt
while read line; do
  url=$line
  wget -N -P /home/img/ $url
done < $FILE
This works, but some files contain a comma in the filename. How can I save the file without the comma?
Example:
http://xy.com/0005.jpg -> saved as 0005.jpg
http://xy.com/0022,22.jpg -> saved as 002222.jpg, not as 0022,22.jpg
I hope you find my question interesting.
UPDATE:
We have some nice solutions, but is there any solution to the timestamping warning?
WARNING: timestamping does nothing in combination with -O. See the manual
for details.

This should work:
url="$line"
filename="${url##*/}"
filename="${filename//,/}"
wget "$url" -O "/home/img/$filename"  # give -O the full output path; -P has no effect when -O is used
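For reference, dropping that into the original loop from the question might look like this (a sketch, reusing the same FILE path and target directory):
#!/bin/bash
FILE=/home/img-url.txt
while read -r url; do
  filename="${url##*/}"      # keep only the part after the last /
  filename="${filename//,/}" # remove all commas
  wget "$url" -O "/home/img/$filename"
done < "$FILE"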
Using both -N and -O will throw a warning message. The wget manual says:
-N (for timestamp-checking) is not supported in
combination with -O: since file is always newly created, it will always
have a very new timestamp.
So when you use the -O option, wget actually creates a new file with a new timestamp, and thus the -N option becomes a no-op (it can't do what it is for). If you want to preserve the timestamping, then a workaround might be this:
url="$line"
wget -N -P /home/img/ "$url"
file="${url##*/}"
newfile="${file//,/}"
[[ $file != $newfile ]] && cp -p /home/img/"$file" /home/img/"$newfile" && rm /home/img/"$file"
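Since mv preserves the modification time as well, the copy-and-remove step could arguably be collapsed into a single rename (a sketch):
[[ $file != "$newfile" ]] && mv /home/img/"$file" /home/img/"$newfile"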

In the body of the loop you need to generate the filename from the URL, without commas and without the leading part of the URL, and tell wget to save under that name.
url=$line
file=`echo $url | sed -e 's|^.*/||' -e 's/,//g'`
wget -N -P /home/image/dema-ktlg/ -O $file $url

In the meantime I wrote this:
url=$line
$file=`echo ${url##*/} | sed 's/,//'`
wget -N -P /home/image/dema-ktlg/ -O $file $url
Seems to work fine, is there any trivial problem with my code?

Related

Multiple process curl command for urls to output to individual files

I am attempting to curl multiple URLs in a bash command. Eventually I will be curling a large number of URLs, so I am using xargs with multiple processes to speed things up.
My file consists of x number of URLs:
https://someurl.com
https://someotherurl.com
My issue comes when attempting to output the results to separate files named after the URLs I curl.
The bash command I have is:
xargs -P 5 -n 1 -I% curl -k -L % -o % < urls.txt
When I run this I get 'Failed to create file https://someotherurl.com'
You cannot create a file with / in the filename. You could do it this way:
#!/bin/bash
while IFS= read -r line
do
echo "LINE: $line"
if [[ "$line" != "" ]]
then
filename="${line#https://}"
echo "FILENAME: $filename"
# YOUR CURL COMMAND HERE, USING $filename
fi
done < url.txt
It ignores empty lines and uses parameter expansion to remove the https:// part of each URL, which gives you a name the file can actually be created with.
Note: if your URLs contain sub-directories, they must be removed as well.
Ex: you want to do https://www.exemple.com/some/sub/dir
The script I suggested here would try to create a file named "www.exemple.com/some/sub/dir". In this case, you could replace the / with _ using tr.
The script would become:
#!/bin/bash
while IFS= read -r line
do
echo "LINE: $line"
if [[ "$line" != "" ]]
then
filename=$(echo "$line" | tr '/' '_')
filename2=${filename#https:__}
echo "FILENAME: $filename2"
# YOUR CURL COMMAND HERE, USING $filename2
fi
done < url.txt
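A variant using only parameter expansion, so no tr process is needed, might look like this (a sketch, assuming the same url.txt input):
#!/bin/bash
while IFS= read -r line
do
  if [[ "$line" != "" ]]
  then
    filename=${line//\//_}          # replace every / with _
    filename2=${filename#https:__}  # drop the leading https:__ part
    echo "FILENAME: $filename2"
    # YOUR CURL COMMAND HERE, USING $filename2
  fi
done < url.txt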
Because your question is ambiguous, I would assume:
You have a file urls.txt that contains URLs separated by LF.
You want to download all URLs by curl and use each URL as its filename.
Unfortunately, that's not possible, because a URL contains characters that are invalid in a filename, such as the slash /. Alternatively, for this case, I would suggest you encode each URL with the URL-safe Base64 alphabet (RFC 3548) before using it as a filename.
After applying this requirement, your script would look like this:
seq 100 | xargs -I# echo 'https://example.com?#' > urls.txt
xargs -P0 -L1 sh -c 'curl -SskL0 -o "$(printf %s "$1" | uuencode -m /dev/stdout | sed "1d;\$d" | tr -- +/ -_)" "$1"' sh < urls.txt
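To map a saved file back to its URL later, the transformation can be reversed (a sketch, assuming GNU coreutils base64; the name variable below is a hypothetical saved filename):
name=aHR0cHM6Ly9leGFtcGxlLmNvbT8x                 # the file created for https://example.com?1
printf %s "$name" | tr -- -_ +/ | base64 -d; echo  # undo the URL-safe substitution, then decode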

BASH: grep doesn't work in shell script but echo shows correct command and it works on command line

I need to write a script that checks some >20k files for some >2k search strings, and it needs to be flexible, so I came up with this script:
#!/bin/bash
# This script checks all files in a given directory against a list of criteria
shopt -s expand_aliases
source ~/.bashrc
TIMESTAMP=$(date "+%Y-%m-%d-%T")
ROOT_DIR=/data
PROJECT_NAME=$1
FILE_DIR=$ROOT_DIR/projects/$1/$2
RESULT_DIR=$ROOT_DIR/projects/$1/check_result
SEARCHTEXT_FILE=$ROOT_DIR/scripts/$3
OIFS="$IFS"
IFS=$'\n'
files=$(find $FILE_DIR -type f -name '*.json')
for file in $files; do
  while read line; do
    grep -H -o $line "$file" >> $RESULT_DIR/check_result_$TIMESTAMP.log
  done < $SEARCHTEXT_FILE
done
IFS="$OIFS"
This script only produces the empty $RESULT_DIR/check_result_$TIMESTAMP.log log file with correct name.
Because the file names sometimes contain spaces, I added the IFS... statements and enclosed $file in double quotes (copied from another post).
The content of the $SEARCHTEXT_FILE is for example:
'Tel alt........'
'City ..........'
If I place an echo before the grep like this
echo grep -H -o $line "$file"
then the output I get is
grep -H -o 'Tel alt........' /data/projects/DNAR/input/report-157538.json
and I can execute this line as is and get the correct result.
I tried to put various combinations of " or ' or ` or () or {} around any part of this grep command but nothing changed.
Somewhere I read about aliases, and the alias set for grep is
alias grep='grep --color=auto'
After many hours of searching on the internet I couldn't find any post that helped me as most of them are covering issues around wrong quotes or inline bash issues.
What am I missing here?
The simple and obvious workaround is to remove all that complexity and simply use the features of the commands you are running anyway.
find "$FILE_DIR" -type f -name '*.json' \
-exec grep -H -o -f "$SEARCHTEXT_FILE" {} + > "$RESULT_DIR/check_result_$TIMESTAMP.log"
Notice also the quoting fixes; see When to wrap quotes around a shell variable. To avoid mishaps, you should also switch to lower case for your private variables (see Correct Bash and shell script variable capitalization).
The shopt -s expand_aliases and source ~/.bashrc lines merely look superfluous, but could contribute to whatever problem you are trying to troubleshoot; they should basically never be part of a script you plan to use in production.
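One more thing to double-check with -f: each line of $SEARCHTEXT_FILE is taken literally as a pattern, so the surrounding single quotes shown in the example file become part of the pattern. If they are not meant to match literal quote characters in the JSON, strip them first (a sketch, writing to a hypothetical patterns.txt):
sed "s/^'//; s/'\$//" "$SEARCHTEXT_FILE" > patterns.txt
find "$FILE_DIR" -type f -name '*.json' \
    -exec grep -H -o -f patterns.txt {} + > "$RESULT_DIR/check_result_$TIMESTAMP.log"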

How to use "curl -Is link | head -n 1" with many links in a file

I want to use curl -Is link | head -n 1 to check if a link is alive, but this works for only one link. Can anyone help me check a large number of links listed in a file? I tried curl -Is Mylink.txt | head -n 1, but it does not achieve my goal. Thanks for any help.
Assuming your lines are well formatted (as curl likely needs them to be anyway), #codeforester's answer (the for loop below) should work for you. As a matter of habit I try to avoid using cat and for loops to parse files, though, so I'd do it a little differently:
while read -r line; do
  printf 'URL %s: %s\n' "$line" "$(curl -Is "$line" | head -n 1)"
done < Mylink.txt
This sounds like a job for xargs! Depending on what your file looks like, that is. I'm assuming there is one link per line. Why are you using -Is? Especially the -s for silent mode, if you want to see errors? You may actually want the -s and -S options together.
xargs < links.txt -I % sh -c 'curl -Is "$1" | head -n1' _ %
edited to reflect corrections in the comments, thanks!
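If you also want the checks to run in parallel, the same pattern extends naturally (a sketch, assuming your xargs supports -P; output lines from concurrent checks may interleave):
xargs -P 4 -I % sh -c 'printf "%s: " "$1"; curl -sSI "$1" | head -n1' _ % < links.txt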
for link in $(cat Mylink.txt)
do
  echo $link: $(curl -Is $link | head -n 1)
done

Bash script to download graphic files from website

I'm trying to write a bash script in Linux (Debian) that will be used for downloading graphic files from a website given by the user at start-up. I'm not sure if my code is correct, but the first problem is that when I try to run my script with a website, e.g. http://www.bbc.com/, an error shows: http://www.bbc.com/ : invalid identifier. I even tried a simple website that has only a few JPG files. My next problem is to find out how to download files from a .txt file where the images' Internet addresses are listed.
#!/bin/bash
# $1 - URL $2 - new catalog name
read $1 $2
url=$1
fold=$2
mkdir -p $fold
if [$# -ne 3];
then
echo "Wrong command"
exit -1
fi
curl $url | grep -o -e "<img src=\".*\"+>" > img_list.txt |wc -l img_list.txt | lin=${% *}
baseurl=$(echo $url | grep -o "https?://[a-z.]*"")
curl -s $url | egrep -o "<img src\=[^>]*>" | sed 's/<img src=\"\([^"]*\).*/\1/.*/\1/g' > url_list.txt
sed -i "s|^/|$baseurl/|" url_list.txt
cd $fold;
What can I do next?
To download every image from the webpage I would use:
mech-dump --absolute --images http://example.com | xargs -n1 curl -O
but this needs the mech-dump command from the WWW::Mechanize package to be installed.
Using the list file:
while read -r url folder
do
mkdir -p "$folder" || exit 1
(cd "$folder" && mech-dump --absolute --images "$url" | xargs -n1 curl -O)
done < list.txt
(assuming that no URL or folder contains a space).
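Here list.txt would hold one URL and one destination folder per line, for example (hypothetical values):
http://www.example.com/gallery pics_gallery
http://www.example.com/news news_images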
an error shows: http://www.bbc.com/ : invalid identifier
Your use of read is wrong; change
read $1 $2
url=$1
fold=$2
to
read url fold
or decide to specify the arguments on the command line and simply drop the read $1 $2 line.
Also, each operand in [ ] must be separated from the brackets; change
if [$# -ne 3];
to
if [ -z "$fold" ]

Wget Output in Recursive mode

I am using wget -r to download 3 .zip files from a specified webpage. Here is what I have so far:
wget -r -nd -l1 -A.zip http://www.website.com/example
Right now, the zip files all begin with abc_*.zip, where * seems to be random. I want the first downloaded file to be called xyz_1.zip, the second xyz_2.zip, and the third xyz_3.zip.
Is this possible with wget?
Many thanks!
I don't think it's possible with wget alone. After downloading you could use some simple shell scripting to rename the files, like:
i=1; for f in abc_*.zip; do mv "$f" "xyz_$i.zip"; i=$(($i+1)); done
Try to get a listing first and then download each file separately.
let n=1
wget -nv -l1 -r --spider http://www.website.com/example 2>&1 | \
egrep -io 'http://.*\.zip'| \
while read url; do
wget -nd -nv -O $(echo $url|sed 's%^.*/\(.*\)_.*$%\1%')_$n.zip "$url"
let n++
done
I don't think there is a way you can do it within a single wget command.
wget does have a -O option which you can use to tell it which file to output to, but it won't work in your case because multiple files will get concatenated together.
You will have to write a script which renames the files from abc_*.zip to xyz_*.zip after wget has completed.
Alternatively, invoke wget for one zip file at a time and use the -O option.
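For example, reusing the --spider listing trick from the answer above, the one-file-at-a-time approach might look like this (a sketch; the xyz_ prefix and example URL come from the question):
n=1
wget -nv -l1 -r --spider http://www.website.com/example 2>&1 \
  | grep -Eio 'http://.*\.zip' \
  | while read -r url; do
      wget -nv -O "xyz_$n.zip" "$url"
      n=$((n+1))
    done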
