Output format of bcftools view - zip

I am using bcftools to extract a single-sample VCF from a GVCF file.
bcftools view -f -Oz -s Sample_name -o output_sample.vcf.gz input_file.vcf.gz
Unfortunately, it seems that the output is not BGZF-compressed, despite the -Oz flag requesting exactly that; indexing fails:
bcftools index output_sample.vcf.gz
the file is not BGZF compressed
Would anyone have an idea of why this is the case?

You can also try compressing the VCF with bgzip yourself:
bgzip -c outputfile.vcf > outputfile.vcf.gz
and do not forget to index your file
bcftools index outputfile.vcf.gz
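For completeness, here is a sketch of both routes, using the sample and file names from the question. One possible explanation for the behaviour: if the bare -f was meant to be bcftools' --apply-filters option, it expects a filter list as its argument, so it may have swallowed the -Oz that follows it and left the output as plain (uncompressed) VCF.
# Re-run the extraction so that -Oz is not preceded by a bare -f
bcftools view -s Sample_name -Oz -o output_sample.vcf.gz input_file.vcf.gz
bcftools index output_sample.vcf.gz
# Or pipe plain VCF through bgzip, as suggested above, and index the result
bcftools view -s Sample_name input_file.vcf.gz | bgzip -c > output_sample.vcf.gz
bcftools index output_sample.vcf.gz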

Related

Using wget to download images and saving with specified filename

I'm using wget in the Mac terminal to download images from a file where each image URL is on its own line, and that works perfectly with this command:
cut -f1 -d, images.txt | while read url; do wget ${url} -O $(basename ${url}); done
However, I want to specify the output filename it is saved as instead of using the basename. The filename is given in the next column, separated by either a space or a comma, and I can't quite figure out how to tell wget to use that second column as the -O name.
I'm sure it's a simple change to my command above, but after reading dozens of posts here and on other sites I can't figure it out. Any help would be appreciated.
If you use whitespace as the separator it's very easy:
cat images.txt | while read url name; do wget ${url} -O ${name}; done
Explanation: instead of reading just one variable per line (${url}) as in your example, you read two (${url} and ${name}). The second one is your local filename. I assumed your images.txt file looks something like this:
http://cwsmgmt.corsair.com/newscripts/landing-pages/wallpaper/v3/Wallpaper-v3-2560x1440.jpg test.jpg
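If your images.txt is comma-separated instead, a small variation is to set the field separator for read. This is only a sketch, assuming exactly two columns (URL, then filename) with no embedded commas:
while IFS=, read -r url name; do wget "${url}" -O "${name}"; done < images.txt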

How can I run a command on all files in a directory and mv the ones whose output contains 'Cannot read TIFF header' to a different directory?

I'd like to remove all bad TIFFs from a very large directory. The command-line tool "tiffinfo" makes it easy to identify them:
tiffinfo -D *
This will produce output like this:
00074000/74986.TIF: Cannot read TIFF header.
if the TIFF file is corrupt. When this happens, I'd like to move the file to a different directory: bad_images. I tried using awk on this, but it hasn't worked so far...
Thanks!
Assuming the "Cannot read TIFF header" error comes on standard error, and assuming tiffinfo outputs other data on standard out which you don't want, then:
cd /path/to/tiffs
for file in `tiffinfo -D * 2>&1 >/dev/null | cut -f1 -d:`
do
    echo mv $file /path/to/bad_images
done
Remove the echo to actually move the files, once satisfied that the script will work as expected.
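A variant that keys on the exact 'Cannot read TIFF header' message and copes with spaces in filenames could look like the sketch below; it keeps the echo dry run and assumes bad_images already exists and that the filenames contain no colons:
cd /path/to/tiffs
tiffinfo -D * 2>&1 >/dev/null | grep 'Cannot read TIFF header' | cut -f1 -d: |
while IFS= read -r file; do
    echo mv "$file" /path/to/bad_images/
done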

Open multiple webpages, use wget and merge the output

I have a text file containing a bunch of webpages:
http://rest.kegg.jp/link/pathway/7603
http://rest.kegg.jp/link/pathway/5620
…
My aim is to download all info on these pages to a single text file.
The following works perfectly, but it gives me 3000+ text files; how could I simply merge all the output into one file during the loop?
while read i; do wget $i; done < urls.txt
Thanks a lot
Use the -O file option: within a single wget invocation, everything downloaded is written to the one file you name. In a per-URL loop, however, each wget call truncates that file, so write to standard output and append instead:
while read i; do wget -O - "$i" >> outputFile; done < urls.txt
outputFile will then contain the contents of all the pages.
You can also skip the while loop entirely by giving wget the URL list with -i; a single invocation concatenates everything into the one output file:
wget -O outputFile -i urls.txt

Combine files in one

Currently I am in this directory:
/data/real/test
When I run ls -lt at the command prompt, I get something like this:
REALTIME_235000.dat.gz
REALTIME_234800.dat.gz
REALTIME_234600.dat.gz
REALTIME_234400.dat.gz
REALTIME_234200.dat.gz
How can I consolidate the above five .dat.gz files into one .dat.gz file in Unix without any data loss? I am new to Unix and not sure how to approach this. Can anyone help?
Update:
I am not sure which is the best way: should I unzip each of the five files and then combine them into one, or combine the five .dat.gz files directly into one .dat.gz?
If it's OK to concatenate the files' contents in arbitrary order, then the following command will do the trick:
zcat REALTIME*.dat.gz | gzip > out.dat.gz
Update
This should solve the ordering problem:
zcat $(ls -t REALTIME*.dat.gz) | gzip > out.dat.gz
What do you want to happen when you gunzip the result? If you want the five files to reappear, then you need to use something other than the gzip (.gz) format. You would need to either use tar (.tar.gz) or zip (.zip).
If you want the result of the gunzip to be the concatenation of the gunzip of the original files, then you can simply cat (not zcat or gzcat) the files together. gunzip will then decompress them to a single file.
cat [files in whatever order you like] > combined.gz
Then:
gunzip combined.gz
will produce an output that is the concatenation of the gunzip of the original files.
The suggestion to decompress them all and then recompress them as one stream is completely unnecessary.
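As a concrete sketch with the filenames from the question (assuming the names contain no spaces, and using ls -tr so the oldest file comes first; use ls -t if you want the newest first):
cat $(ls -tr REALTIME*.dat.gz) > combined.dat.gz
gunzip combined.dat.gz    # yields a single combined.dat containing all five files' contents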

combine multiple pdfs in linux using script?

I want to save/download PDFs from website X and then combine all those PDFs into one, so that it is easy for me to view them all at once.
What I did:
1. Get the PDFs from the website:
wget -r -l1 -A.pdf --no-parent http://linktoX
2. Combine the PDFs into one:
gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=Combined_`date +%F`.pdf -dBATCH file1.pdf file2.pdf file3.pdf
My question/problem is that I would like to automate all of this in one script, so that I don't have to do it every day; new PDFs are added to X daily.
So, how can I do step 2 above without giving the full list of all the PDFs? I tried using file*.pdf in step 2, but it combined the PDFs in random order.
The next problem is that the total number of file*.pdf files is not the same every day, sometimes 5 PDFs, sometimes 10... but the nice thing is that they are named in order: file1.pdf, file2.pdf, ...
So, I need some help completing step 2 above, such that all PDFs are combined in order and I don't have to give the name of each PDF explicitly.
Thanks.
UPDATE:
This solved the problem
pdftk `ls -rt kanti*.pdf` cat output Kanti.pdf
I used ls -rt since file1.pdf was downloaded first, then file2.pdf, and so on... just doing ls -t put file20.pdf at the start and file1.pdf last...
I've also used pdftk in the past with good results.
For listing the files in numeric order, you can instruct sort to ignore the first $n - 1 characters of the filename by doing this:
ls | sort -n -k 1.$n
So if you had file*.pdf:
$ ls | sort -n -k 1.5
file1.pdf
file2.pdf
file3.pdf
file4.pdf
file10.pdf
file11.pdf
file20.pdf
file21.pdf
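To feed that numerically sorted list straight into pdftk (a sketch that just combines the commands already shown above; it assumes the file1.pdf, file2.pdf, ... naming and no spaces in the names):
pdftk `ls file*.pdf | sort -n -k 1.5` cat output Combined_`date +%F`.pdf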
I have used pdftk before for such concatenations, as it happens to be readily available on Debian/Ubuntu.
You could do something like:
GSCOMMAND="gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=Combined_`date +%F`.pdf -dBATCH"
FILES=`ls file*.pdf | sort -n -k 1.5`
$GSCOMMAND $FILES
This is assuming the files are named "fileN.pdf" (file1.pdf, file2.pdf, ...). See also the post by alberge.
It will do strange things to files with spaces in their name, so you'll need to add escaping if you need to be able to handle names with spaces.
I'm really curious what other people will come up with, as this seems to me quite a quick-and-dirty solution, but it is getting better thanks to the other answers. :)
EDIT
Used the numerical sort command for FILES as suggested by alberge.
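Putting both steps into one script, a minimal daily version might look like the sketch below. The URL is the placeholder from the question and the kanti*.pdf pattern comes from the update above, so adjust both to your actual site and filenames; -nd simply tells wget not to recreate the site's directory structure locally.
#!/bin/sh
# 1. fetch the PDFs (placeholder URL from the question)
wget -r -l1 -nd -A.pdf --no-parent http://linktoX
# 2. combine them in download order, as in the update above
pdftk `ls -rt kanti*.pdf` cat output Combined_`date +%F`.pdf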
