How to code for iterating through multiple files in linux? - linux

I have code which I am trying to update from another example. The aim is to run plink on three kinds of input: a file set for each chromosome, a file of SNP IDs, and a file containing a single ID, which identifies one individual. Running these files through plink ultimately produces one vcf file per individual for a given chromosome.
I have 22 chromosome files, 1 snp file (which is always the same), and 500 individual files. For each individual I am aiming to make a vcf for each chromosome, so I should end up with 22*500 (11000) vcf files as output.
At the moment I have tried a bash script containing this:
ID=$SGE_TASK_ID
indiv=$SGE_TASK_ID
plink --bed chr${ID}.bed --bim chr${ID}.bim --fam chr${ID}.fam --extract snps.txt \
--recode vcf-iid --out output${indiv}chr${ID}vcf --keep-fam individual${indiv}.txt
This runs, but it only processes 1 individual, giving me 22 chromosome vcf files for that one person, and stops there. How do I make this run for all 500 people? Would it be with a for loop? Looking through other questions, I haven't been able to find one that matches mine and applies to linux. Any help would be appreciated.
${indiv} would just be a number, so the text files are named individual1.txt, individual2.txt, individual3.txt, and so on through the 500 individuals.

Assuming that ${indiv} contains no spaces,
for indiv in $(<individuals.data); do
plink [...] individual${indiv}.txt
done
The file individuals.data would name the individuals, separated by spaces or newlines.
If unsure what the Bash shell's $(<...) operator does, try this:
for A in $(<individuals.data); do
echo "[$A]"
done
Note that, as @Kaz has observed, if you wish your script to work in shells other than Bash as well, you might write $(cat ...) rather than $(<...).
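For the 22 x 500 case in the question, another option is to skip the loop and let the scheduler enumerate every chromosome/individual pair. The sketch below assumes the array job is submitted with 11000 tasks numbered from 1 (for example with -t 1-11000), which the question does not state. task_to_pair is a made-up helper name, and the plink command is only printed, not run.

```shell
#!/usr/bin/env bash
# Assumption: the array job has 22 * 500 = 11000 tasks, numbered from 1.
n_chr=22

# Map a task number to "chromosome individual" (both 1-based).
task_to_pair() {
    task=$1
    echo "$(( (task - 1) % n_chr + 1 )) $(( (task - 1) / n_chr + 1 ))"
}

# Split the pair into positional parameters: $1 = chromosome, $2 = individual.
set -- $(task_to_pair "${SGE_TASK_ID:-1}")
ID=$1
indiv=$2

# Dry run: print the command; remove the leading echo to actually run plink.
echo plink --bed "chr${ID}.bed" --bim "chr${ID}.bim" --fam "chr${ID}.fam" \
    --extract snps.txt --recode vcf-iid \
    --out "output${indiv}chr${ID}vcf" --keep-fam "individual${indiv}.txt"
```

With this numbering, tasks 1 to 22 cover individual 1, tasks 23 to 44 cover individual 2, and so on.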

Related

How can I run two bash scripts simultaneously and without repetition of the same action?

I'm trying to write a script that automatically runs a data analysis program. The data analysis takes a file, analyzes it, and puts all the outputs into a folder. The program can be run on two terminals simultaneously (each analyzing a different subject file).
I wrote a script that can do all the inputs automatically. However, I can only get my script to run one analysis at a time. If I run my script in two terminals simultaneously, it will analyze the same subject twice (useless).
Currently, my script looks like:
for name in `ls [file_directory]`
do
[Data analysis commands]
done
If you run this on two terminals, it will start from the top of the directory containing all the data files. This is a problem, so I tried to do checks for duplicates but they weren't very effective.
I tried a name comparison with the if command. It didn't work because all the output folders except one had a different name, so it would check the first output folder at the top of the directory and say the name was different even though an output folder further down had the same name. It looked something like this:
for name in `ls <file_directory>`
do
for output in `ls <output directory>`
do
if [ "$name" = "$output" ]
then
echo "This file has already been analyzed."
else
<Data analysis commands>
fi
done
done
I thought this was the right method but apparently not. I would need to check all the names before a decision is made (rather than one by one, which is what this does).
Then I tried moving completed data files with the mv command. That didn't work because "name" in the for statement stored all the file names up front, so it went down the list regardless of what was in the folder at present. I remember reading something about how shell scripts do not do things in "real time", so it makes sense that this didn't work.
My thought is to look for some modification to that if statement so it does all the name checks before a decision is made (how?).
Also, are there any other commands I might be missing that I could try?
One pattern I use often is the split command.
ls <file_directory> > file_list
split -d -l 10 file_list file_list_part
This will create files named file_list_part00 through file_list_partnn, each listing up to 10 file names.
You can then feed these file names to your script.
for file_part in file_list_part*
do
    # Word splitting on the unquoted $(...) assumes names without whitespace.
    for file_name in $(cat "$file_part")
    do
        data_analysis_command "$file_name"
    done
done
Never use "ls" in a "for" (http://mywiki.wooledge.org/ParsingLs)
I think you should use a fifo (see mkfifo)
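Another way to let two terminals share one directory without analysing a subject twice is to have each script claim a file before working on it. This is a different technique from the ones suggested above: mkdir either creates the directory or fails because it already exists, so only one of two competing scripts can claim a given name. In the sketch, process_dir and the lock directory layout are made up for illustration, and echo stands in for the real analysis command. Locks left behind by a killed script would have to be removed by hand.

```shell
#!/usr/bin/env bash
# Claim each input file by creating a lock directory named after it.
# Arguments: input directory, lock directory, a label for this worker.
process_dir() {
    in_dir=$1
    lock_dir=$2
    worker=$3
    mkdir -p "$lock_dir"
    for path in "$in_dir"/*; do
        [ -f "$path" ] || continue
        name=$(basename "$path")
        # mkdir succeeds for exactly one caller per name.
        if mkdir "$lock_dir/$name.lock" 2>/dev/null; then
            echo "$worker analysing $name"   # placeholder for the analysis
        fi
    done
}

# Demo on a throwaway directory with three empty subject files.
demo=$(mktemp -d)
mkdir "$demo/in"
touch "$demo/in/subj1" "$demo/in/subj2" "$demo/in/subj3"
first=$(process_dir "$demo/in" "$demo/locks" A)
second=$(process_dir "$demo/in" "$demo/locks" B)
rm -rf "$demo"
```

In the demo the two calls run one after the other, so worker A claims everything and worker B finds nothing left. Run side by side in two terminals, the same function would split the files between the two workers.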
As a follow-on from the comments, you can install GNU Parallel with homebrew:
brew install parallel
Then your command becomes:
parallel analyse ::: *.dat
and it will process all your files in parallel using as many CPU cores as you have in your Mac. You can also add in:
parallel --dry-run analyse ::: *.dat
to get it to show you the commands it would run without actually running anything.
You can also add --eta (Estimated Time of Arrival) for an estimate of when the jobs will be done, and -j 8 if you want to run, say, 8 jobs at a time. Of course, if you specifically want the 2 jobs at a time you asked for, use -j 2.
You can also have GNU Parallel simply distribute jobs and data to any other machines you may have available via ssh access.
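If installing GNU Parallel is not an option, xargs can do a simpler version of the same thing. This is a swapped-in technique, not part of the answer above. It assumes a find with -print0 and an xargs with -0 and -P (the GNU and BSD versions have both), and it uses echo as a stand-in for the real analysis command. run_two_at_a_time is a made-up name.

```shell
#!/usr/bin/env bash
# Run a placeholder command on every .dat file in a directory, two at a time.
run_two_at_a_time() {
    find "$1" -type f -name '*.dat' -print0 |
        xargs -0 -n 1 -P 2 sh -c 'echo "analysed $(basename "$1")"' _
}

# Demo on a throwaway directory; sort because completion order can vary.
demo=$(mktemp -d)
touch "$demo/a.dat" "$demo/b.dat" "$demo/c.dat"
result=$(run_two_at_a_time "$demo" | sort)
echo "$result"
rm -rf "$demo"
```

Here -n 1 hands one file to each invocation and -P 2 keeps at most two invocations running at once.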

grab 2 numbers from file name then insert into command

I'm a bit new to programming in general and I'm not sure how to go about accomplishing this task in my bash script.
A quick background: when I imported my music library (formerly organized by iTunes) into Banshee, all of the files were duplicated to fit Banshee's numbering style (e.g. "02." instead of "02"). On top of that, iTunes apparently did not save the ID3 tags to the files, so many of them are blank. So now I've got a few thousand tags to fix and duplicate files to get rid of.
To automate the process, I started learning to write bash scripts. I came up with a script (which you can see here) that does four things: removes unnecessary iTunes files, takes input from the user about ID3 tag information and stores it in variables, clears any present tag info from all files, and writes new tags with the info taken from the user, using a program called eyeD3.
Now, here's where I run into my problem. This script is basically blindly writing info to all mp3 files in the dir. This is fine for tags that all the files have in common - like artist, album, total tracks, year, etc. But I can't tag each individual track number with this method. So I'm still editing the track# tags one at a time, manually. And that's something I really don't want to do 2,000+ times.
The file names all look like this:
01. song1.mp3
02. song2.mp3
03. song3.mp3
The command to write a track number to a tag looks like this:
$ eyeD3 -n 1 "01. song1.mp3"
So... I'm not sure how to go about automating this. I need to grab the first two digits of each file name, store them somewhere, then recall each one into a separate eyeD3 command.
You can loop over the files using globbing, and use substring expansion to capture the first two characters of the filename:
for f in *.mp3; do
eyeD3 -n "${f:0:2}" "$f"
done
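The loop above passes the zero-padded prefix (01, 02, ...) to eyeD3. If a plain number is preferred, as in the -n 1 example from the question, the leading zero can be stripped first. A sketch that assumes every file name starts with a two-digit number followed by a dot; track_number is a made-up helper, and the eyeD3 commands are only printed.

```shell
#!/usr/bin/env bash
# track_number is a hypothetical helper: "01. song1.mp3" -> "1".
track_number() {
    prefix=${1%%.*}      # text before the first dot, e.g. "01"
    echo "${prefix#0}"   # drop one leading zero, if present
}

# Dry run: print one eyeD3 command per file; remove echo to tag for real.
for f in *.mp3; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    echo eyeD3 -n "$(track_number "$f")" "$f"
done
```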

awk/sed/grep command to compare the contents of three files

Hi, I am trying to automate some data entry. I am using a TCP server/client to send file names around so that other servers can go into a repository and pull these files. As part of testing this, I am logging the file names that are supposed to be sent and the names that were received, and for each name that was received I am sending a reply back with the file name.
So I have three text files with file names inside of them:
SupposedToSend.txt
Recieved.txt
GotReplyFor.txt
I know that awk could do what I am trying to do, but I am not sure how to set it up. I need to compare the three files for elements that do not exist in all of them, so if an entry is missing from any file, I need to know which entry and from which file.
I could write a program for this, but it would take much longer to write and to run, since these files are getting 5 elements/minute dumped into them.
paste -d '\n' SupposedToSend.txt Recieved.txt GotReplyFor.txt | uniq -c | grep -v '^ *3 '
It's tolerable if you have no errors, deeply suboptimal otherwise. Or if the data in the different files is out of sequence... (In which case you might need to sort them somehow.)
Or you could just run diff3 to compare 3 files...
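Since the question asks about awk, here is a sketch of one way to report which entry is missing from which file, regardless of the order of the lines. It assumes one file name per line. report_missing and the a.dat, b.dat, c.dat entries are made up for the demo.

```shell
#!/usr/bin/env bash
# report_missing is a hypothetical helper name.
# For every name that appears in at least one file, list each file lacking it.
report_missing() {
    awk '
        { seen[$0]; have[FILENAME, $0] }   # remember the name and where it was seen
        END {
            for (name in seen)
                for (i = 1; i < ARGC; i++)
                    if (!((ARGV[i], name) in have))
                        print name " missing from " ARGV[i]
        }
    ' "$@" | sort
}

# Demo with made-up entries in a throwaway directory.
demo=$(mktemp -d)
printf '%s\n' a.dat b.dat c.dat > "$demo/SupposedToSend.txt"
printf '%s\n' a.dat b.dat       > "$demo/Recieved.txt"
printf '%s\n' a.dat             > "$demo/GotReplyFor.txt"
report=$(cd "$demo" && report_missing SupposedToSend.txt Recieved.txt GotReplyFor.txt)
echo "$report"
rm -rf "$demo"
```

The final sort is only there to make the output order stable, because awk does not guarantee the order of a for (name in seen) loop.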

Concatenate PDFs while preserving rank in list

I am currently struggling to concatenate my various PDF files into one file in an automated way while at the same time preserving the order the files are provided in.
The main problem is that I include a rank for each file (they are visualizations of list items), currently ranging from 1 to 100. If I run
pdftk *.pdf cat output all.pdf
the combined PDF pages will not be ordered from 1 to 100 accordingly. My PDFs are named in a similar way to the following example; note that "rank_XXX" determines their rank in the list. However, the fact that the terminal lists 10 and 100 before 2 messes up my sorting. I was thinking that ls -v could somehow be useful for piping the file names into pdftk or a similar tool, but I could not get it working.
rank_1_XYZ_123123A.pdf
rank_1_XYZ_123123B.pdf
rank_2_XYZ_123141A.pdf
rank_2_XYZ_123141B.pdf
rank_3_ABC_394124A.pdf
rank_3_ABC_394124B.pdf
...
rank_10_XYZ_129123A.pdf
rank_10_XYZ_129123B.pdf
...
rank_100_ZZZ_929123A.pdf
rank_100_ZZZ_929123B.pdf
I managed to get at least partially what I want by using
pdftk rank_[1-9]*.pdf cat output all.pdf
Nevertheless, this somehow does not work for numbers larger than 9.
Any help is greatly appreciated.
ls -v seems to do the job:
pdftk `ls -v` cat output all.pdf
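The natural-sort behaviour of -v is specific to GNU ls, and an unfiltered ls also picks up all.pdf once it exists. A sketch of an alternative that sorts on the rank field with standard sort options, assuming the rank_<number>_... naming from the question and no whitespace in the file names. ranked_pdfs is a made-up helper, and the pdftk command is only printed.

```shell
#!/usr/bin/env bash
# List rank_*.pdf in the current directory ordered by the numeric rank.
ranked_pdfs() {
    # Field 2 (split on "_") is the rank; -k 2,2n compares it numerically.
    printf '%s\n' rank_*.pdf | sort -t _ -k 2,2n
}

# Demo on a throwaway directory with a few empty files.
demo=$(mktemp -d)
touch "$demo/rank_10_XYZ_129123A.pdf" "$demo/rank_2_XYZ_123141A.pdf" \
      "$demo/rank_1_XYZ_123123B.pdf" "$demo/rank_1_XYZ_123123A.pdf" \
      "$demo/rank_100_ZZZ_929123A.pdf"
ordered=$(cd "$demo" && echo $(ranked_pdfs))

# Dry run: remove echo (and run inside the PDF directory) to build all.pdf.
echo pdftk $ordered cat output all.pdf
rm -rf "$demo"
```

Files with the same rank fall back to plain name order, so the A file comes before the B file.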

Download batch files from a website using linux

I want to download some files (about 1000-2000 zip files) from a website.
I could sit around and add each file one after another, but please give me a program, script, or other method to automate the download.
The website I am talking about has download links of the form
sitename.com/sometetx/date/12345/folder/12345_zip.zip
The date can be taken care of. The main concern is the number 12345 before and after the folder: both occurrences change together, e.g.
sitename.com/sometetx/date/23456/folder/23456_zip.zip
sitename.com/sometetx/date/54321/folder/54321_zip.zip
I tried using curl with
sitename.com/sometetx/date/[12345-54321]/folder/[12345-54321]_zip.zip
but it makes far too many combinations of downloads, i.e. it keeps the left 12345 fixed and scans the right number from 12345 to 54321, then increments the left 12345 by 1 and repeats the scan over [12345-54321].
I also tried wget in bash.
Here I have one variable in two places; when using a loop, the right 12345 followed by "_" is ignored by the program.
Please help me, I don't know much about linux or programming. Thanks.
In order to keep the loop variable next to _ from being swallowed by the shell (which would otherwise look for a variable named i_zip), put it in quotes, like this:
$ for ((i=10000; i < 99999; i++)); do \
wget sitename.com/sometetx/date/$i/folder/"$i"_zip.zip; done
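Braces are another way to mark where the variable name ends: ${i}_zip.zip expands i and then appends _zip.zip. The sketch below prints wget commands for a small made-up range instead of downloading anything. url_for is a hypothetical helper, and the URL pattern is the placeholder from the question.

```shell
#!/usr/bin/env bash
# Build the download URL for one number; the same value is used in both places.
url_for() {
    i=$1
    echo "sitename.com/sometetx/date/${i}/folder/${i}_zip.zip"
}

# Dry run over three numbers; remove echo to actually call wget.
n=12345
while [ "$n" -le 12347 ]; do
    echo wget "$(url_for "$n")"
    n=$((n + 1))
done
```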
