How to filter a VCF file with a list of CHR or contig IDs?

I need to subset/filter a SNP VCF file by a long list of non-sequential contig IDs, which appear in the CHR column. My VCF file currently contains 13,971 contigs, and I want to retain a specific set of 7,748 contigs along with everything associated with them (all variants, genotype information, etc.).
My contig list looks like:
dDocent_Contig_1
dDocent_Contig_100
dDocent_Contig_10000 etc.
I am considering the following command:
vcftools --vcf TotalRawSNPs.vcf --chr dDocent_Contig_1 --chr dDocent_Contig_100 (etc...) --recode --recode-INFO-all --out FinalRawSNPs
where I list every contig ID individually, each preceded by a --chr flag. Ideally I could feed --chr a text file of contig IDs to keep, but it does not accept one, and listing every contig individually creates a massive command line.
I've seen options for filtering by a list of individuals, but no clear option for filtering by CHR/contig IDs only. Is there a more efficient way to filter my VCF file by CHR/contig?
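One workaround, assuming the contig IDs sit one per line in a plain-text file (the name contig_list.txt below is hypothetical), is to generate the repeated --chr flags from that file instead of typing them by hand:

```shell
# Create a tiny contig list for illustration; in practice this file
# (name assumed here) would already hold the 7,748 IDs, one per line.
printf 'dDocent_Contig_1\ndDocent_Contig_100\ndDocent_Contig_10000\n' > contig_list.txt

# Turn every line into a "--chr <ID>" flag, all on a single line
chr_flags=$(sed 's/^/--chr /' contig_list.txt | tr '\n' ' ')
echo "$chr_flags"

# The real run would then be (untested sketch):
# vcftools --vcf TotalRawSNPs.vcf $chr_flags --recode --recode-INFO-all --out FinalRawSNPs
```

Note that $chr_flags is deliberately left unquoted in the vcftools line so the shell splits it into separate flags; this relies on the contig IDs containing no whitespace.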

Related

Combining phrases from a list of words in Python 3

I'm doing my best to extract information from a lot of PDF files. I have them in a dictionary where each key is a date and the value is a list of occupations.
It looks like this when parsed correctly:
'12/29/2014': [['COUNSELING',
'NURSING',
'NURSING',
'NURSING',
'NURSING',
'NURSING']]
However, occasionally there are multi-word occupations that cannot be reliably understood in single-word form, such as this:
'11/03/2014': [['DENTISTRY',
'OSTEOPATHIC',
'MEDICINE',
'SURGERY',
'SOCIAL',
'SPEECH-LANGUAGE',
'PATHOLOGY']]
Notice that "osteopathic medicine & surgery" and "speech-language pathology" are the full text for two of these entries. This gets hairier when we also have examples of just "osteopathic medicine" or even "medicine."
So my question is this: how should I go about testing combinations of these words to see if they match more complex occupational titles? The word order can be relied on, as I have maintained it from the source.
Thanks!

Linux: rename a batch of files according to a list

I am looking to rename a bunch of files according to the names found in a separate list. Here is the situation:
Files:
file_0001.txt
file_0102.txt
file_ab42.txt
I want to change the names of these files according to a list of corresponding names that looks like:
0001 abc.01
0102 abc.02
ab42 def.01
I want to replace, for each file, the part of the name present in the first column of my list by the part in the second column:
file_0001.txt -> file_abc.01.txt
file_0102.txt -> file_abc.02.txt
file_ab42.txt -> file_def.01.txt
I looked into mv, rename, and similar commands, but I only found ways to batch rename files according to a single pattern in the file name, not ways to match the changes against a list.
Does anyone have an example of a script that I could use to do that?
while read -r a b; do mv "file_$a.txt" "file_$b.txt"; done < listfile

Performing a sort using -k1,1 only

Assume you have an unsorted file with the following content:
identifier,count=Number
identifier, extra information
identifier, extra information
...
I want to sort this file so that, for each identifier, the line with the count comes first, followed by the lines with extra info. I can only use the Unix sort command with the option -k1,1, but I am allowed to slightly modify the lines to achieve this ordering.
As an example, take
a,Count=1
a,giulio
aa,Count=44
aa,tango
aa,information
ee,Count=2
bb,que
f,Count=3
b,Count=23
bax,game
f,ee
c,Count=3
c,roma
b,italy
bax,Count=332
a,atlanta
bb,Count=78
c,Count=3
The output should be
a,Count=1
a,atlanta
a,giulio
aa,Count=44
aa,information
aa,tango
b,Count=23
b,italy
bax,Count=332
bax,game
bb,Count=78
bb,que
c,Count=3
c,roma
ee,Count=2
f,Count=3
f,ee
but I get:
aa,Count=44
aa,information
aa,tango
a,atlanta
a,Count=1
a,giulio
bax,Count=332
bax,game
bb,Count=78
bb,que
b,Count=23
b,italy
c,Count=3
c,Count=3
c,roma
ee,Count=2
f,Count=3
f,ee
I tried adding spaces at the end of the identifier and/or at the beginning of the count field, as well as other characters, but none of these approaches worked.
Any pointers on how to perform this sort?
EDIT:
Consider, for example, the products whose ID starts with 'a': one of them has the info 'atlanta', and it appears before the Count line, but I want Count to appear before any other information. In addition, 'bb' should come after 'b' in the alphabetical ordering of IDs. To make my question clearer: how can I get the IDs sorted alphabetically such that, for a given ID, the line with Count appears before the others, using sort -k1,1? (This is a group project I am working on and I am not free to change the sorting command, but I may change the content; I tried, for example, prefixing all the info fields with '~' so that Count would sort first.)
You need to tell sort that the comma is the field separator:
sort -t, -k1,1
For ASCII (byte-order) sorting, make sure LC_ALL=C is set and that LANG and LANGUAGE are unset.
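The locale matters here because, when two lines have the same -k1,1 key, sort falls back to a whole-line comparison, and in the C locale uppercase bytes sort before lowercase, so the Count= line lands first within each identifier. A self-contained demonstration on a subset of the sample data:

```shell
# A few lines from the question, in scrambled order
printf 'a,giulio\naa,Count=44\na,Count=1\na,atlanta\n' > sample.txt

# -t,   : comma is the field separator
# -k1,1 : sort on the identifier field only; ties fall back to a
#         whole-line byte comparison, where 'C' (0x43) < 'a' (0x61)
LC_ALL=C sort -t, -k1,1 sample.txt
# prints:
# a,Count=1
# a,atlanta
# a,giulio
# aa,Count=44
```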

Trying to output the page counts of a large number of PDFs to a log file

I have about 1,550 .pdf files that I want to find page counts for.
I used the command ls -Q | grep \.pdf > ../lslog.log to output all the file names with the extension .pdf, wrapped in double quotes, into a .log file. I then opened lslog.log in gedit and replaced all the double quotes (") with apostrophes (') so that I can use the files that contain parentheses in the final command.
When I use the command exiftool -"*Count*" (which outputs any metadata tag whose name contains the word "Count") on a single file, for example, exiftool -"*Count*" 'examplePDF(withparantheses).pdf', I get something like "Page Count: 512", or whatever the page count happens to be.
However, when I use it on multiple files, for example: exiftool -"*Count*" 'examplePDF(withparantheses).pdf' 'anotherExamplePDF.pdf' I get
File not found: examplePDF(withparantheses).pdf,
======== anotherExamplePDF.pdf
Page Count : 362
1 image files read
1 files could not be read
So basically, I'm able to read the last file, but not the first one. This pattern continues as I add more files. It's able to find the file itself and page count of the last file, but not the other files.
Do I need to input multiple files differently? I'm using a comma right now to separate files, but even without the comma I get the same result.
Does exiftool take multiple files?
I don't know exactly why you're getting the behaviour that you're getting (though the trailing comma in "File not found: examplePDF(withparantheses).pdf," suggests the comma was treated as part of the filename; exiftool expects its file arguments to be separated by spaces only), but it looks to me like everything you're doing can be collapsed into one line:
exiftool -"*Count*" *.pdf
Append > lslog.log to that command if you want the results in a log file.
My output from a bunch of PDFs I had around looks like this:
======== 86A103EW00.pdf
Page Count : 494
======== DSET3.5_Reportable_Items_Linux.pdf
Page Count : 70
======== DSView 4 v4.1.0.36.pdf
Page Count : 7
======== DSView-Release-Notes-v4.1.0.77 (1).pdf
Page Count : 7
======== DSView-Release-Notes-v4.1.0.77.pdf
Page Count : 7

Matching text files from a list of system numbers

I have ~60K bibliographic records, which can be identified by system number. These records also have full text (individual text files named by system number).
I have lists of system numbers in batches of 5K, and I need to find a way to copy only the text files named in each 5K list.
All text files are stored in a directory (/fulltext) and are named something along these lines:
014776324.txt.
The 5K lists are plain text stored in separate directories (e.g. /5k_list_1, /5k_list_2, ...), where each system number matches a .txt file.
For example: bibliographic record 014776324 matches 014776324.txt.
I am struggling to find a way to copy only the corresponding text files into each 5k_list_* folder.
Any ideas?
Thanks indeed,
Let's assume we invoke the following script this way:
./the-script.sh fulltext 5k_list_1 5k_list_2 [...]
Or more succinctly:
./the-script.sh fulltext 5k_list_*
Then try using this (totally untested) script:
#!/usr/bin/env bash
set -eu                  # exit on errors and on unset variables
src_dir=$1               # first argument: directory to copy files from
shift 1
for list_dir; do         # implicitly loops over the remaining arguments
  # assumes each list directory holds its system numbers in a file
  # named list.txt, one number per line
  while read -r sys_num; do
    cp "$src_dir/$sys_num.txt" "$list_dir/"
  done < "$list_dir/list.txt"
done
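To sanity-check the core copy loop on a toy layout (all directory and file names below are made up for the demo), a minimal self-contained version looks like this:

```shell
# Fake a fulltext directory and one list directory
mkdir -p fulltext 5k_list_1
printf 'some full text\n' > fulltext/014776324.txt
printf '014776324\n' > 5k_list_1/list.txt

# Copy each listed system number's text file into the list directory
while read -r sys_num; do
  cp "fulltext/$sys_num.txt" "5k_list_1/"
done < 5k_list_1/list.txt

ls 5k_list_1   # now contains 014776324.txt alongside list.txt
```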
