Matching text files from a list of system numbers - linux

I have ~60K bibliographic records, which can be identified by system number. These records also hold full text (individual text files named by system number).
I have lists of system numbers in bunches of 5K and I need to find a way to copy only the text files from each 5K list.
All text files are stored in a directory (/fulltext) and are named something along these lines:
014776324.txt.
The 5k lists are plain text stored in separate directories (e.g. /5k_list_1, /5k_list_2, ...), where each system number matches a .txt file.
For example: bibliographic record 014776324 matches to 014776324.txt.
I am struggling to find a way to copy into the 5k_list_* folders only the corresponding text files.
Any idea?
Thanks indeed,

Let's assume we invoke the following script this way:
./the-script.sh fulltext 5k_list_1 5k_list_2 [...]
Or more succinctly:
./the-script.sh fulltext 5k_list_*
Then try using this (totally untested) script:
#!/usr/bin/env bash
set -eu # enable error checking

src_dir=$1 # first argument is where to copy files from
shift 1

for list_dir; do # implicitly iterates over the remaining args
    # each list directory is assumed to hold its list as list.txt,
    # one system number per line
    while read -r sys_num rest; do
        cp "$src_dir/$sys_num.txt" "$list_dir/"
    done < "$list_dir/list.txt"
done
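If each list really is just one system number per line, a per-directory one-liner in the same spirit (equally untested; it assumes the same list.txt name) would be:

sed 's/$/.txt/' 5k_list_1/list.txt | xargs -I{} cp fulltext/{} 5k_list_1/
# sed appends ".txt" to every system number; xargs runs one cp per resulting name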

Related

How to filter VCF file with a list CHR or contig IDs?

I need to subset/filter a SNP vcf file by a long list of non-sequential contig IDs, which appear in the CHR column. My VCF file contains 13,971 contigs currently, and I want to retain a specific set of 7,748 contigs and everything associated with those contigs (all variants and genotype information etc.).
My contig list looks like:
dDocent_Contig_1
dDocent_Contig_100
dDocent_Contig_10000 etc.
I am considering the following script:
vcftools --vcf TotalRawSNPs.vcf --chr dDocent_Contig_1 --chr dDocent_Contig_100 (etc...) --recode --recode-INFO-all --out FinalRawSNPs
where every contig ID is listed individually, each preceded by its own --chr flag. Ideally I could feed --chr a text file of contig IDs to keep, but it does not accept a file. Listing all the contigs individually makes for a massive command line.
I've seen options for filtering by a list of individuals, but no clear option for filtering by CHR/contig IDs only. Is there a more efficient way to filter my VCF file by CHR/contig?
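One workaround (an untested sketch) is to generate the repeated --chr flags from the list itself rather than typing them out. Here contig_list.txt is an assumed name for a file holding one contig ID per line, as shown above; ~7,748 generated flags make for a long command line, though usually still within the shell's limits:

# Prefix each contig ID with "--chr " and splice the result into one vcftools call.
# contig_list.txt is hypothetical: one contig ID per line.
vcftools --vcf TotalRawSNPs.vcf \
    $(sed 's/^/--chr /' contig_list.txt) \
    --recode --recode-INFO-all --out FinalRawSNPs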

Split a file into variable quantities and directories based on csv using linux

I have a spreadsheet with a list of directories and, for each directory, a variable number of 'accounts' that need to be assigned to it (the sheet could easily be converted to CSV), i.e.:
directory               # of accounts needed
/usr/src/Mon-Carlton/   110
/usr/src/Mon-CoalMtn/   50
/usr/src/Mon-Cumming/   90
etc...
I also have a 'master_account_list.csv' file that contains the full list of all accounts available to be distributed to the areas, i.e.:
account_1,password,type
account_2,password,type
account_3,password,type
etc...
I would like to be able to script the splitting of the master_account_list.csv into a separate accounts.csv file for each unique directory with the listed # of accounts needed.
The master_file gets updated with fresh accounts often and there is a need to redistribute again to all the directories. (The resulting accounts.csv file has the same formatting as the master_account_list.)
What is the best way to accomplish this in Linux?
Edit: When the script is complete, it would be ideal if the remainder of unassigned accounts from the master_account_list.csv became the new master_account_list.csv.
Assuming you've converted the accounts spreadsheet to a comma(!) separated csv file, without headers(!), you can use the following awk program:
split_accounts.awk:
# True as long as we are reading the first file, which
# is the converted spreadsheet
NR==FNR {
    # Store the directories and counts in arrays
    dir[NR]=$1
    cnt[NR]=$2
    next
}

# This block runs for every line of master_account_list.csv
{
    # Get the directory and count from the arrays we've
    # created above. 'i' is initialized automatically
    # to 0 on its first use
    d=dir[i+1]
    c=cnt[i+1]

    # Append the current line to the output file
    f=d"account_list.csv"
    print >> f

    # 'n' holds the number of accounts placed into the current
    # directory. Check whether it has reached the desired count
    if(++n==c) {
        # Step to the next directory / count
        i++
        # Reset the number of accounts placed
        n=0
        # Close the previous output file
        close(f)
    }
}
Call it like this:
awk -F, -f split_accounts.awk accounts.csv master_account_list.csv
Note: 0 is not allowed for the count in the current implementation.
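One consequence of this implementation speaks to the question's edit: once the last directory's count has been reached, d becomes empty, so every remaining (unassigned) account falls through to a file named account_list.csv in the current working directory. A sketch of a full redistribution round built on that behaviour (it assumes you run from outside the target directories, and that at least one account is left over):

awk -F, -f split_accounts.awk accounts.csv master_account_list.csv
# The unassigned accounts collected in ./account_list.csv become the new master:
mv account_list.csv master_account_list.csv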

Linux rename batch files according to a list

I am looking to rename a bunch of files according to the names found in a separate list. Here is the situation:
Files:
file_0001.txt
file_0102.txt
file_ab42.txt
I want to change the names of these files according to a list of corresponding names that looks like :
0001 abc.01
0102 abc.02
ab42 def.01
I want to replace, for each file, the part of the name found in the first column of my list with the part in the second column:
file_0001.txt -> file_abc.01.txt
file_0102.txt -> file_abc.02.txt
file_ab42.txt -> file_def.01.txt
I looked into several mv, rename and similar commands, but I only found ways to rename batches of files according to a single pattern in the file name, not to match the changes against a list.
Does anyone have an example of a script that I could use to do that?
while read -r a b; do mv "file_$a.txt" "file_$b.txt"; done < listfile
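A slightly more defensive variant of the same loop (a sketch): skip list entries whose source file doesn't exist:

while read -r a b; do
    [ -e "file_$a.txt" ] && mv "file_$a.txt" "file_$b.txt"
done < listfile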

How to call a large list of paired files to be executed by a program in BASH?

I have a large directory of files (100+) that I'd like to pass through a program via the terminal.
The files are paired and all follow a naming scheme like such:
TS-8_S53_L001_R1_001.fastq
TS-8_S53_L001_R2_001.fastq
RS-9_S54_L001_R1_001.fastq
RS-9_S54_L001_R2_001.fastq
And the program execution looks like:
Seqprogram -i1 Blah_R1_001.fastq -i2 Blah_R2_001.fastq -o Blah_paired.fastq
All of these files are in one directory.
I'd like to be able to run the program on all of the files, with the pairs matched up correctly (the R1 file is passed to -i1, and the R1 and R2 files share the same base name) and with the output file (-o) saved under the base name with some identifier attached ("_paired", etc).
I can envision how I'd do this in Python; however, I am trying to get better with Bash.
I'm familiar with how one might pass multiple files to a single command, e.g. uncompressing all .gz files in a particular directory:
gunzip *.gz
But this program takes two inputs per run, and the inputs must be paired in the proper order, so a simple wildcard isn't sufficient.
Thanks
Use a wildcard to get one file of the pair, and then use parameter substitution to get the other corresponding filenames.
for i1 in *_R1_001.fastq; do
    i2=${i1/R1_001/R2_001}
    paired=${i1/R1_001/paired}
    Seqprogram -i1 "$i1" -i2 "$i2" -o "$paired"
done
The easiest way to do this is to match just one of the three patterned filenames and modify it to derive the other two.
That is to say:
for r1file in *_R1_*.fastq; do
    r2file=${r1file/_R1_/_R2_}
    pairfile=${r1file%_R1_*}_paired.fastq
    Seqprogram -i1 "$r1file" -i2 "$r2file" -o "$pairfile"
done
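One caveat that applies to both loops: if nothing matches the glob, Bash leaves the pattern unexpanded and the loop body runs once with the literal string *_R1_*.fastq. Enabling nullglob beforehand avoids that:

shopt -s nullglob # a non-matching glob now expands to nothing instead of itself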

Filename manipulation in cygwin

I am running cygwin on Windows 7. I am using a signal processing tool and basically performing alignments. I had about 1200 input files. Each file is of the format given below.
input_file_format = "AC_XXXXXX.abc"
The first step required building some kind of index for each input file; this was done with the tool's build-index command, and each file now has 6 indexes associated with it. I therefore have about 1200*6 = 7200 index files. The indexes are of the form given below.
indexes_format = "AC_XXXXXX.abc.1",
"AC_XXXXXX.abc.2",
"AC_XXXXXX.abc.3",
"AC_XXXXXX.abc.4",
"AC_XXXXXX.abc.rev.1",
"AC_XXXXXX.abc.rev.1"
Now, I need to use these indexes to perform the alignment. All 6 indexes of each file are used together, and the final operation is run as follows.
signal-processing-tool ..\path-to-indexes\AC_XXXXXX.abc ..\Query file
where AC_XXXXXX.abc is the base name shared by the indexes of that particular input file. All 6 index files are picked up via AC_XXXXXX.abc*.
My problem is that I need to use only the first 14 characters of the index file names for the final operation.
When I use the code below, the alignment is not executed.
for file in indexes/*; do ./tool $file|cut -b1-14 Project/query_file; done
I'd appreciate help with this!
First of all, keep in mind that $file will always start with "indexes/", so taking the first 14 characters will always include that folder name at the beginning.
To take the first 14 characters of a variable, use ${file:0:14}, where 0 is the starting index and 14 is the length of the desired substring.
Alternatively, if you want to use cut, you need to run it in a command substitution: for file in indexes/*; do ./tool $(echo $file|cut -c 1-14) Project/query_file; done (I changed the cut option to -c to count characters instead of bytes.)
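Putting the two suggestions together: strip the directory part first, take the first 14 characters of the bare filename, then deduplicate so the tool runs once per index set rather than once per index file. A sketch, reusing the question's ./tool and Project/query_file names:

for f in indexes/*; do
    b=$(basename "$f")  # drop the "indexes/" prefix
    echo "${b:0:14}"    # first 14 characters of the bare name
done | sort -u | while read -r prefix; do
    ./tool "indexes/$prefix" Project/query_file
done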
