Download batch files from a website using Linux

I want to download some files (roughly 1000-2000 zip files) from a website.
I could sit around and add each file one after another, but please give me a program, script, or any other method so I can automate the download.
The website I am talking about has download links of the form
sitename.com/sometetx/date/12345/folder/12345_zip.zip
The date can be taken care of. The main concern is the number 12345 before and after the folder: both change together, e.g.
sitename.com/sometetx/date/23456/folder/23456_zip.zip
sitename.com/sometetx/date/54321/folder/54321_zip.zip
I tried using curl:
sitename.com/sometetx/date/[12345-54321]/folder/[12345-54321]_zip.zip
but it makes far too many combinations of downloads, i.e. it keeps the left 12345 fixed and scans the right range from 12345 to 54321, then increments the left 12345 by one and repeats the scan from [12345-54321].
I also tried wget in a bash loop. Here I have one variable in two places in the URL, and when I use a loop, the right-hand occurrence followed by "_" is ignored by the program.
Please help me; I don't know much about Linux or programming. Thanks.

In order to get your loop variable next to _ not to be ignored by the shell, put it in quotes; without them the shell treats $i_zip as a single variable named i_zip rather than $i followed by _zip. Like this:
$ for ((i=10000; i < 99999; i++)); do \
wget sitename.com/sometetx/date/$i/folder/"$i"_zip.zip; done
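If you prefer curl, here is a minimal sketch of the same loop (the 12345-54321 range and the URL pattern are taken from the question; --fail is an addition so that IDs missing on the server are not saved as error pages):
# download every id in the range, saving each archive under its own name
for i in $(seq 12345 54321); do
    curl --fail -o "${i}_zip.zip" "sitename.com/sometetx/date/${i}/folder/${i}_zip.zip"
done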

Related

How to code for iterating through multiple files in linux?

I have code which I am trying to update from another example. The aim is to run plink using files of: each chromosome, snp ids, and a file containing only 1 ID which is an individual's ID. Running these files in plink ultimately makes a vcf file per individual for a given chromosome.
I have 22 chromosome files, 1 snp file (which is always the same), and 500 individual files. For each individual I am aiming to make a vcf for each chromosome, so I have 22*500 (11000) vcf files as output.
So far I have tried a bash script with this:
ID=$SGE_TASK_ID
indiv=$SGE_TASK_ID
plink --bed chr${ID}.bed --bim chr${ID}.bim --fam chr${ID}.fam --extract snps.txt \
--recode vcf-iid --out output${indiv}chr${ID}vcf --keep-fam individual${indiv}.txt
This runs, but it only runs through 1 individual, giving me 22 chromosome vcf files for that one person, and stops there. How do I make this run for all 500 people; would it be with a for loop? Looking through other questions I haven't been able to find one that matches my question and is in Linux, so any help would be appreciated.
${indiv} would just be a number, so the text file that runs looks like individual1.txt and increases through the 500 individuals (individual1.txt, individual2.txt, individual3.txt)
Assuming that ${indiv} contains no spaces,
for indiv in $(<individuals.data); do
plink [...] individual${indiv}.txt
done
The file individuals.data would name the individuals, separated by spaces or newlines.
If unsure what the Bash shell's $(<...) operator does, try this:
for A in $(<individuals.data); do
echo "[$A]"
done
Note that, as @Kaz has observed, if you wish your script to work also in shells other than Bash, then you might write $(cat ...) rather than $(<...).
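Applied to the plink command from the question, a sketch could look like the following (this assumes the individual files really are named individual1.txt through individual500.txt as described, and that the SGE array task still selects the chromosome):
# one SGE array task per chromosome (1-22); loop over all 500 individuals inside it
ID=$SGE_TASK_ID
for indiv in $(seq 1 500); do
    plink --bed chr${ID}.bed --bim chr${ID}.bim --fam chr${ID}.fam --extract snps.txt \
        --recode vcf-iid --out output${indiv}chr${ID}vcf --keep-fam individual${indiv}.txt
done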

How can I run two bash scripts simultaneously and without repetition of the same action?

I'm trying to write a script that automatically runs a data analysis program. The data analysis takes a file, analyzes it, and puts all the outputs into a folder. The program can be run on two terminals simultaneously (each analyzing a different subject file).
I wrote a script that can do all the inputs automatically. However, I can only run one copy of my script automatically; if I run it in two terminals simultaneously, it will analyze the same subject twice (useless).
Currently, my script looks like:
for name in `ls [file_directory]`
do
[Data analysis commands]
done
If you run this in two terminals, each will start from the top of the directory containing all the data files. This is a problem, so I tried to add checks for duplicates, but they weren't very effective.
I tried a name comparison with the if command (it didn't work because all but one of the output files had a unique name, so it would compare against the first output folder at the top of the directory and say the name was different, even though an output folder further down had the same name). It looked something like this:
for name in `ls <file_directory>`
do
    for output in `ls <output directory>`
    do
        if [ "$name" == "$output" ]
        then
            echo "This file has already been analyzed."
        else
            <Data analysis commands>
        fi
    done
done
I thought this was the right method, but apparently not. I would need to check all the names before any decision was made (rather than one by one, which is what this does).
Then I tried moving completed data files with the mv command (didn't work because "name" in the for statement stored all the file names so it went down the list regardless of what was in the folder at present). I remember reading something about how shell scripts do not do things in "real time" so it makes sense that this didn't work.
My thought was looking for some sort of modification to that if statement so it does all the name checks before I make a decision (how?)
Also, are there any other commands I could be missing that I could try?
One pattern I often use is the split command.
ls <file_directory> > file_list
split -d -l 10 file_list file_list_part
This will create files like file_list_part00 to file_list_partnn.
You can then feed these file names to your script.
for file_part in `ls file_list_part*`
do
    for file_name in `cat $file_part | tr '\n' ' '`
    do
        data_analysis_command $file_name
    done
done
Never use "ls" in a "for" (http://mywiki.wooledge.org/ParsingLs)
I think you should use a fifo (see mkfifo)
As a follow-on from the comments, you can install GNU Parallel with homebrew:
brew install parallel
Then your command becomes:
parallel analyse ::: *.dat
and it will process all your files in parallel using as many CPU cores as you have in your Mac. You can also add in:
parallel --dry-run analyse ::: *.dat
to get it to show you the commands it would run without actually running anything.
You can also add in --eta (Estimated Time of Arrival) for an estimate of when the jobs will be done, and -j 8 if you want to run, say, 8 jobs at a time. Of course, if you specifically want the 2 jobs at a time you asked for, use -j 2.
You can also have GNU Parallel simply distribute jobs and data to any other machines you may have available via ssh access.
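A sketch of that remote distribution (the hostname here is made up, and it assumes analyse writes its result to foo.out next to each foo.dat; --trc transfers the input file, returns the named output, and cleans up on the remote side):
parallel -j 2 -S :,user@remote.example.com --trc {.}.out analyse {} ::: *.dat
Here : means the local machine, so jobs are spread between your own box and the remote host.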

Interactive quiz in Bash (Multiple Q's)

I'm teaching an introductory Linux course and have abandoned the paper-based multiple-choice quizzes and have created interactive quizzes in Bash. My quiz script is functional, but kind of quick-and-dirty, and now I'm in the improvement phase and looking for suggestions.
First off, I'm not looking to automate the grading, which certainly simplifies things.
Currently, I have a different script file for each quiz, and the questions are hard-coded. That's obviously terrible, so I created a .txt file holding the questions, delimited by lines with "question 01" etc. I can loop through and use sed -n "/^quest.*$i\$/,/^quest.*$(($i+1))\$/p", but this prints the delimiter lines. I can pipe through sed "/^q/d" or head -n-1|tail -n+2 to get rid of them, but is there a better way?
Second issue: For questions where the answer is an actual command, I'm printing a [user]$ prompt, but for short-answer, I'm using a >. In my text file, for each question, the last line is the prompt to use. Initially, I was thinking I could store the question in a variable and |tail -1 it to get the prompt, but duh, when you store it it strips newlines. I want the cursor to immediately follow the prompt, so I either need to pass it to read -p or strip the final newline from the output. (Or create some marker in the file to differentiate between the $ and > prompt.) One thought I had was to store each question in a separate file and just cat it to display it, making sure there was no newline at the end. That might be kind of a pain to maintain, but it would solve both problems. Thoughts?
Now to how I'm actually running the quiz. This is a Fedora 20 box, and I tried copying bash and setuid-ing it to me so that it would be able to read the quiz script that the students couldn't normally read, but I couldn't get that to work. After some trial and error, I ended up copying touch and setuid-ing it to me, then using that to create their answer file in a "submit" directory with an ACL so new files have o=w so they can write to their answer file (in the quiz with >> echo) but not read it back or access the directory. The only major loophole I see with this is that they can delete their file by name and start the quiz over with no record of having done so. Since I'm not doing any automatic grading, I'm not terribly concerned with the students being able to read the script file, although if I'm storing the questions separately, I suppose I could make a copy of cat and setuid it to read in files that they can't access.
Also, I realize that Bash is not the best choice for this, and learning the required simple input/output for Python or something better would not take much effort. Perhaps that's my next step.
1) You could use
sed -n "/^quest.*$i\$/,/^quest.*$(($i+1))\$/ { //!p }"
Here // repeats the last attempted pattern, which is the opening pattern in the first line of the range and the closing pattern for the rest.
...by the way, if you really want to do this with sed, you better be damn sure that i is a number, or you'll run into code injection problems.
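As a quick worked example (the file contents here are made up): suppose questions.txt contains
question 01
What does the pwd command print?
[user]$
question 02
Name the root user's home directory.
>
With i=1 the range becomes /^quest.*1$/,/^quest.*2$/, so it spans the first four lines; //!p then suppresses the two delimiter lines and prints only the question body and its prompt line.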
2) You can store multiline command output in a variable without problems. You just have to make sure you quote the variable thereafter to avoid shell expansion on it. For example,
QUESTION=$(sed -n "/^quest.*$i\$/,/^quest.*$(($i+1))\$/ { //!p }" questions.txt)
echo -n "$QUESTION" # <-- the double quotes are important here.
The -n option to echo tells echo to not append a newline at the end, which should take care of your prompt problem.
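For the [user]$ vs. > prompt issue, one sketch (assuming, as described in the question, that the last line of each question block is the prompt to use) is to split the stored question into its body and its final line, and hand the latter to read -p:
QUESTION=$(sed -n "/^quest.*$i\$/,/^quest.*$(($i+1))\$/ { //!p }" questions.txt)
PROMPT=$(printf '%s\n' "$QUESTION" | tail -n 1)    # last line = the prompt
BODY=$(printf '%s\n' "$QUESTION" | head -n -1)     # everything before it
echo "$BODY"
read -rp "$PROMPT " answer
head -n -1 is the GNU coreutils form, which should be fine on Fedora 20.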
3) Yes, well, hackery breeds more hackery. If you want to lock this down, the first order of business would be to not give students a shell on the test machine. You could put your script behind inetd and have the students fill it out with telnet or something, I suppose, but...really, why bash? If it were me, I'd knock something together with a web server and one of the several gazillion php web quiz frameworks. Although I also have to wonder why it's a problem if students can see the questions and the answers they gave. It's not like all students use the same account and can see each other's answers, is it? (is it?) Don't store an answer key on the same machine and you shouldn't have a problem.

Download multiple files, with different final names

OK, what I need is fairly simple.
I want to download LOTS of different files (from a specific server) via cURL, and I want to save each one of them under a specific new filename on disk.
Is there an existing way (parameter, or whatever) to achieve that? How would you go about it?
(If there was an option to input all URL-filename pairs in a text file, one per line, and get cURL to process it, would be ideal)
E.g.
http://www.somedomain.com/some-image-1.png --> new-image-1.png
http://www.somedomain.com/another-image.png --> new-image-2.png
...
OK, I just figured out a simple way to do it myself.
1) Create a text file with pairs of URL (what to download) and filename (how to save it to disk), separated by a comma (,), one pair per line. Save it as input.txt.
2) Use the following simple Bash script:
while read -r line; do
    IFS=',' read -ra PART <<< "$line"
    curl "${PART[0]}" -o "${PART[1]}"
done < input.txt
*Haven't thoroughly tested it yet, but I think it should work.
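If you'd rather not loop in the shell at all, curl itself can read its options from a file via -K/--config. A sketch (this assumes the config file pairs each output with the url next to it, as it does on the command line where the Nth -o goes with the Nth URL; check the curl man page for your version):
# save as urls.conf (hypothetical name)
url = "http://www.somedomain.com/some-image-1.png"
output = "new-image-1.png"
url = "http://www.somedomain.com/another-image.png"
output = "new-image-2.png"
Then run:
curl -K urls.conf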

Splitting text files column-wise

So I have an invoice that I need to make a report out of. It is on average about 250 pages long. I'm trying to create a script that would extract specific values from the invoice and make a report. Here's my problem:
The invoice is in PDF format and spans two columns. I want to use the 'pdftotext' Linux command to convert it into multiple text files (with each txt file representing one PDF page). How do I do that?
I noticed that the 'pdftotext' command separates the left part of the page from the right part of the page with 21 spaces in between. How do I move the right side of the data (identified after reading at least 21 spaces in a row) to the end of the file?
Since the file is large and I only need the last few pages, how do I delete all those text files in a script (not manually) until I read a keyword (let's just say the keyword = Start Invoice)?
I know this is a lot of questions, but I'm confused about what Linux commands can do. Can you guys guide me in the right direction? Thanks
PS: I'm using CentOS 5.2
What about:
pdftotext YOUR.pdf | sed 's/^\([^ ]\+\) \{21\}.*/\1/' > OUTPUT
pdftotext YOUR.pdf | sed 's/.* \{21\}\(.*\)/\1/' >> OUTPUT
But you should check out pdftotext's -raw and -layout options too. And there are more ways to do it...
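For the page-splitting and clean-up parts of the question, a rough sketch (this assumes pdftotext and pdfinfo from xpdf/poppler are available on CentOS 5.2; the keyword and file names are taken from the question):
# number of pages in the PDF
pages=$(pdfinfo YOUR.pdf | awk '/^Pages:/ {print $2}')
# one text file per page, using pdftotext's -f/-l (first/last page) options
for p in $(seq 1 $pages); do
    pdftotext -f $p -l $p YOUR.pdf page_$p.txt
done
# delete page files from the front until one containing the keyword is found
for p in $(seq 1 $pages); do
    if grep -q "Start Invoice" page_$p.txt; then
        break
    fi
    rm page_$p.txt
done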
