How to extract names of compounds present in sub-files? - linux

I have a file (uniq-compounds) listing 15,000 folder names. Each folder contains a sub-file, out.pdbqt, whose third line holds the compound name (e.g. Name = 1-tert-butyl-5-oxo-N-[2-(3-pyridinyl)ethyl]-3-pyrrolidinecarboxamide). Out of 50,000 folders in total, I want to extract the compound names for the 15,000 folders listed in the uniq-compounds file (its entries look like ligand_*).
Directory and sub-files:
sidra contains 50,000 folders (ligand_00001 to ligand_50000); each contains a sub-file (out.pdbqt) that holds the compound name (shown below).
Another file (uniq-compound) contains the 15,000 folder names whose compound names I want.
out.pdbqt
MODEL 1
REMARK VINA RESULT: -6.0 0.000 0.000
REMARK Name = 1-tert-butyl-5-oxo-N-[2-(3-pyridinyl)ethyl]-3-pyrrolidinecarboxamide
REMARK 8 active torsions:
REMARK status: ('A' for Active; 'I' for Inactive)
REMARK 1 A between atoms: N_1 and C_7

Assuming uniq-compound.txt contains the folder names, each folder contains an out.pdbqt, and the compound name appears on the 3rd line of out.pdbqt, the script below will work:
#!/bin/bash
while IFS= read -r line; do
    awk 'FNR == 3 {print $4}' "$line/out.pdbqt"
done < uniq-compound.txt
The loop iterates through uniq-compound.txt line by line; for each line (i.e. folder name), it uses awk to print the 4th field of the 3rd line of the out.pdbqt file inside that folder.
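If you would rather collect all 15,000 names into a single file, and guard against names that might contain spaces, a minimal variant of the same loop (untested; all-names.txt is just a hypothetical output name) is:
#!/bin/bash
# Strip the "REMARK Name = " prefix from line 3 so names containing spaces stay intact,
# and append every extracted name to a single output file.
while IFS= read -r dir; do
    awk 'FNR == 3 { sub(/^REMARK Name = /, ""); print }' "$dir/out.pdbqt"
done < uniq-compound.txt > all-names.txt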

Related

Edit values in one column in 4,000,000 row CSV file

I have a CSV file I am trying to edit to add a numeric ID-type column with unique integers from 1 to approximately 4,000,000. Some of the fields already have an ID value, so I was hoping I could sort on those and then fill in the rest starting from the largest value + 1. However, I cannot open this file in Excel because of its size (I can only see the maximum of roughly 1,048,000 rows). Is there an easy way to do this? I am not familiar with coding, so I was hoping there was a way to do it manually, similar to Excel's fill-series feature.
Thanks!
Also, I know there are threads on how to edit a large CSV file, but I was hoping for help with this specific task. Thanks!
To summarise: I want to sort the rows on idnumber and then add unique IDs to the rows that are missing one.
Screenshot of file
One way, using Notepad++ and a plugin named SQL:
Load the CSV in Notepad++
Enter the query SELECT a+1,b,c FROM data
Hit 'start'
When starting with a file like this:
a,b,c
1,2,3
4,5,6
7,8,9
The results after look like this:
SQL Plugin 1.0.1025
Query : select a+1,b,c from data
Sourcefile : abc.csv
Delimiter : ,
Number of hits: 3
===================================================================================
Query result:
2,2,3
5,5,6
8,8,9
Or, in words, the first column is incremented by 1.
2nd solution, using gawk, downloaded from https://www.klabaster.com/freeware.htm#mawk:
D:\TEMP>type abc.csv
a,b,c
1,2,3
4,5,6
7,8,9
D:\TEMP>gawk "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print $1+1,$2,$3 }" abc.csv
a,b,c
2,2,3
5,5,6
8,8,9
(g)awk is a tool which reads a file line by line. Each line is accessible via $0, and its parts via $1,$2,$3,... according to a field separator.
This separator is set in my example (FS=OFS=\",\";) in the BEGIN section, which is executed once, before any input is read. Do not get confused by the \": the script is enclosed in double quotes, and the separator value also needs double quotes, so they have to be escaped as \".
The getline; print $0 takes care of the first line of the CSV, which typically holds the column names.
Then, for every remaining line, print $1+1,$2,$3 increments the first column and prints the second and third columns unchanged.
To extend this second example:
gawk "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print ($1<5?$1+1:$1),$2,$3 }" abc.csv
The ($1<5?$1+1:$1) checks whether the value of $1 is less than 5 ($1<5); if true, it returns $1+1, otherwise $1. In other words, it only adds 1 if the current value is less than 5.
With your data you end up with something like this (untested!):
gawk "BEGIN{ FS=OFS=\",\"; getline; a=42; print $0 }{ if($4+0==0){ a++ }; print ($4+0==0?a:$1),$2,$3 }" input.csv
a=42 sets the initial value for the IDs that need to be filled in (change this to the correct starting value).
The if($4+0==0){ a++ } increments a whenever the fourth column equals 0 (the $4+0 converts empty values like "" to the numeric value 0).
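Since the underlying question is really about filling in missing IDs starting from the largest existing one, here is a minimal two-pass sketch of that idea (untested; it assumes the ID sits in the fourth comma-separated column and that missing IDs are empty fields, so adjust the column number and file names to your data). On Windows, as in the examples above, the single quotes around the program would become double quotes with the inner quotes escaped.
# Pass 1 finds the largest existing ID; pass 2 prints the file again,
# giving every row with an empty column 4 the next free ID (max+1, max+2, ...).
gawk -F, 'NR==FNR { if ($4+0 > max) max = $4+0; next }
          FNR==1  { print; next }
          $4+0==0 { $4 = ++max }
          { print }' OFS=, input.csv input.csv > output.csv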

randomly choose files and add them to the data present in another folder

I have two folders (DATA1 and DATA2) and inside each there are 3 subfolders (folder1, folder2 and folder3), as shown below:
DATA1
folder1/*.txt contain 5 files
folder2/*.txt contain 4 files
folder3/*.txt contain 10 files
DATA2
folder1/*.txt contain 8 files
folder2/*.txt contain 9 files
folder3/*.txt contain 10 files
As depicted above, there is a varying number of files in each subfolder, with different names, and each file contains two columns of data as shown below:
1 -2.4654174805e+01
2 -2.3655626297e+01
3 -2.2654634476e+01
4 -2.1654865265e+01
5 -2.0653873444e+01
6 -1.9654104233e+01
7 -1.8654333115e+01
8 -1.7653341293e+01
9 -1.6654792786e+01
10 -1.5655022621e+01
I just want to add the data folder-wise, choosing the second column of a file at random.
I mean that the second column of a random file from DATA2/folder1/*.txt should be added to the second column of DATA1/folder1/*.txt, similarly DATA2/folder2/*.txt should be added to DATA1/folder2/*.txt, and so on.
Most importantly, the first column of every file must stay untouched; only the second column is manipulated. Finally, I want to save the resulting data.
Can anybody suggest a solution for this?
My directory and data structure is attached here
https://i.fluffy.cc/2RPrcMxVQ0RXsSW1lzf6vfQ30jgJD8qp.html
I want to add the data folder-wise (from DATA2 to DATA1). First enter DATA2/folder1, randomly choose any file and select its second column (each file consists of two columns). Then add the selected second column to the second column of any file present inside DATA1/folder1 and save the result to the OUTPUT folder.
Since there's no code to start from this won't be a ready-to-use answer but rather a few building blocks that may come in handy.
I'll show how to find all files, select a random file, select a random column and extract the value from that column. Duplicating and adapting this for selecting a random file and column to add the value to is left as an exercise to the reader.
#!/bin/bash
IFS=
# a function to generate a random number
prng() {
    # You could use $RANDOM instead but it gives a narrower range.
    echo $(( $(od -An -N4 -t u4 < /dev/urandom) % $1 ))
}
# Find files
readarray -t files < <(find DATA2/folder* -mindepth 1 -maxdepth 1 -name '*.txt')
# Debug print-out of the files array
declare -p files
echo "Found ${#files[@]} files"
# List files one-by-one
for file in "${files[#]}"
do
echo "$file"
done
# Select a random file
fileno=$(prng ${#files[@]})
echo "Selecting file number $fileno"
filename=${files[$fileno]}
echo "which is $filename"
lines=$(wc -l < "$filename")
echo "and it has $lines lines"
# Add 1 since awk numbers its lines from 1 and up
rndline=$(( $(prng $lines) + 1 ))
echo "selecting value in column 2 on line $rndline"
value=$(awk -v rndline=$rndline '{ if(NR==rndline) print $2 }' "$filename")
echo "which is $value"
# now pick a random file and line in the other folder using the same technique
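To go from these building blocks to the folder-wise, column-wise addition the question describes, one possible continuation is sketched below (untested; it assumes every DATA1 file has the same number of lines as the chosen DATA2 file, and that an OUTPUT directory may be created next to DATA1 and DATA2):
# For each folder, pick one random DATA2 file (reusing the prng function above),
# then add its second column to the second column of every DATA1 file in the
# matching folder, writing the results under OUTPUT/ with the same file names.
for folder in folder1 folder2 folder3; do
    readarray -t src < <(find "DATA2/$folder" -maxdepth 1 -name '*.txt')
    rndfile=${src[$(prng ${#src[@]})]}
    mkdir -p "OUTPUT/$folder"
    for target in "DATA1/$folder"/*.txt; do
        # paste puts the two files side by side: $1 $2 come from DATA1, $3 $4 from DATA2
        paste "$target" "$rndfile" |
            awk '{ printf "%s %.10e\n", $1, $2 + $4 }' > "OUTPUT/$folder/$(basename "$target")"
    done
done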

Iterate over two files in linux || column comparison

We have two files, File1 and File2.
File1 columns:
Name Age
abc 12
bcd 14
File2 columns:
Age
12
14
I want to iterate over the second column of File1 and the first column of File2 in a single loop and then check whether they are the same.
Note: the number of rows in both files is the same, and I am using a .sh shell script.
First make a temporary file from file1 that should be the same as file2.
The Name field might contain spaces, so remove everything up to the last space.
When you have done this you can compare the files.
sed 's/.* //' file1 > file1.tmp
diff file1.tmp file2
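If you prefer a single pass without a temporary file, a small awk sketch (untested, assuming both files have the same number of rows as stated) could be:
# Read File1 first and remember its 2nd column per line number,
# then compare it against the 1st column of File2 line by line.
awk 'NR==FNR { age[FNR] = $2; next }
     age[FNR] != $1 { print "Mismatch on line " FNR ": " age[FNR] " vs " $1 }' File1 File2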

Linux - putting lines that contain a string at a specific column in a new file

I want to pull all rows from a text file in linux which contain a specific number (in this case 9913) in a specific column (column 4). This is a tab-delimited file, so I am calling this a column, though I am not sure it is.
In some cases, there is only one number in column 4, but in other lines there are multiple numbers in this column (ex. 9913; 4444; 5555). I would like to get any rows for which the number 9913 appears in the 4th column (whether or not it is the only number or in a list). How do I put all lines which contain the number 9913 in column 4 and put them in their own file?
Here is an example of what I have tried:
cat file.txt | grep 9913 > newFile.txt
result is a mixture of the following:
CDR1as CDR1as ENST00000003100.8 9913 AAA-GGCAGCAAGGGACUAAAA (line that I want)
CDR1as CDR1as ENST00000399139.1 9606 GUCCCCA................(example line I don't want)
I do not get any results when restricting the match to a specific column. As a helper suggested below, the code does not seem to be recognizing the columns, and I get blank files when using awk:
awk '$4 == "9913"' file.txt > newfile.txt
gives me no data in the new file.
Thanks
This is one way of doing it
awk '$4 == "9913" {print $0}' file.txt > newfile.txt
or just
awk '$4 == "9913"' file.txt > newfile.txt

Subset delimited file by subset of one column

I have a text file (call it infile.txt) where the columns have headers and are delimited by semicolons. A subset of it is reproduced below:
SCHCD; SCHNAME
13110208001; GOVT MIDSCHOOL
10110208002; GOVT HIGHSCHOOL
21110208101; MATRIC
21110208102; UPPER SECONDARY
13110208201; SECONDARY
I want a subset of the file where the first two characters of "SCHCD" are "13". So my subset (call it outfile.txt) should look like:
SCHCD; SCHNAME
13110208001; GOVT MIDSCHOOL
13110208201; SECONDARY
With awk:
awk ' NR == 1 || /^13/ ' infile.txt > outfile.txt
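If you would rather anchor the test on the SCHCD field itself instead of the start of the line, an equivalent variant (same semicolon-delimiter assumption) is:
# Keep the header (NR == 1) plus every row whose first ";"-separated field starts with "13".
awk -F';' 'NR == 1 || substr($1, 1, 2) == "13"' infile.txt > outfile.txt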
