Sorting a file and splitting the rows into different files - Linux

I am trying to sort a file that contains different genomic regions, each with its own letter-and-number combination.
I want to sort the whole file by genomic location (columns 1, 2, 3), and whenever these three columns are the same,
extract those rows into a new, separate file.
My input is:
1.txt
chr1 10 20 . . 00000 ACTGBACA
chr1 10 20 . + 11111 AACCCCHQ
chr1 18 40 . . 0 AA12KCCHQ
chr7 22 23 . . 21 KLJMWQKD
chr7 22 23 . . 8 XJKFIRHFBF24
chrX 199 201 . . KK AVJI24
What I am expecting is:
chr1.10-20.txt
chr1 10 20 ACTGBACA
chr1 10 20 AACCCCHQ
chr1.18-40.txt
chr1 18 40 AA12KCCHQ
chr7.22-23.txt
chr7 22 23 KLJMWQKD
chr7 22 23 XJKFIRHFBF24
chrX.199-201.txt
chrX 199 201 AVJI24
I was experimenting with splitting the document with awk, but it does not do what I want:
awk -F, '{print > $1$2$3".txt"}' 1.txt
It uses the whole row as the file name, and inside each file it again writes the whole row, even though I only need columns 1, 2, 3, and 7.
>ls
1.txt
chr1 10 20 . + 11111 AACCCCHQ.txt
chr7 22 23 . . 21 KLJMWQKD.txt
chrX 199 201 . . KK AVJI24.txt
chr1 10 20 . . 00000 ACTGBACA.txt
chr1 18 40 . . 0 AA12KCCHQ.txt
chr7 22 23 . . 8 XJKFIRHFBF24.txt
>cat chr1\ \ \ \ 10\ \ 20\ .\ +\ 11111\ AACCCCHQ.txt
chr1 10 20 . + 11111 AACCCCHQ
I would appreciate it if you could show me how to fix the file names and their contents.

Take a look at this:
#!/bin/sh
INPUT="$1"
while read -r LINE; do
GEN_LOC="$(echo "$LINE" | tr -s ' ' '.' | cut -d '.' -f 1,2,3)"
echo "$LINE" | tr -s ' ' | cut -d ' ' -f 1,2,3,6,7 >> "${GEN_LOC}.txt"
done < "$INPUT"
This script takes an input file in the format you posted and reads it line by line. For each line, it converts the runs of spaces to dots and cuts the result down to fields 1, 2, and 3 to build the file name (stored in the $GEN_LOC variable). It then appends fields 1, 2, 3, 6, and 7 of $LINE to a file named ${GEN_LOC}.txt. If several lines map to the same file name, that's fine - they are simply appended. The script does not take previous runs into account, so if you run it twice it will keep appending to the existing files. Hope this helps!
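If you prefer a single awk command, here is a minimal sketch along the same lines - it assumes the input is whitespace-separated, as in your sample, that only columns 1, 2, 3, and 7 should be kept, and it builds the chr1.10-20.txt style of name you showed:
awk '{
    out = $1 "." $2 "-" $3 ".txt"   # e.g. chr1.10-20.txt
    print $1, $2, $3, $7 >> out     # keep only columns 1, 2, 3 and 7
    close(out)                      # avoid hitting the open-file limit
}' 1.txt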

Related

Print the count of files for each sub folder iteratively

I have the following folder structure:
A/B/C/D/E/00
A/B/C/D/E/01
.
.
A/B/C/D/E/23
Similarly,
M/N/O/P/Q/00
M/N/O/P/Q/01
.
.
M/N/O/P/Q/23
Now, each folder from 00 to 23 has many files inside, which I would like to count.
If I run this simple command:
ls /A/B/C/D/E/00 | wc -l
I can get the count of files in each of these sub directories. I want to automate this or get it iteratively.
Also, the final output I am looking at is a file that should look like this:
C E RESULT OF ls /A/B/C/D/E/00 | wc -l RESULT OF ls /A/B/C/D/E/01 | wc -l
M Q RESULT OF ls /M/N/O/P/Q/00 | wc -l RESULT OF ls /M/N/O/P/Q/01 | wc -l
So, the output should look like this finally
C E 23 23 4 6 7 4 76 98 57 2 67 9 12 34 67 0 2 3 78 98 12 3 57 213
M Q 12 10 2 34 32 1 35 65 87 8 32 2 65 87 98 0 4 12 1 35 34 76 9 67
Please note, the values after the letters are the file counts of the 24 folders 00 through 23.
Using the eval approach I can hardcode and get the exact results, but I wanted it in a way that would show me the data for the previous day. So this is what I did:
d=`date --date="1 days ago" +%Y%m%d`
month=`date +%Y%m`
eval echo YZ $d '"$(ls "/A/B/YZ/$month/$d/"'{20150800..20150823})'| wc -l)"'
This works perfectly because in the given location there are files inside the child directories 20150800, 20150801, ..., 20150823. However, when I try to generalize it like below, it says "no such file or directory":
eval echo YZ $d '"$(ls "/A/B/YZ/$month/$d/"'{"$d"00.."$d"23})'| wc -l)"'
Is there something I am missing in the above line?
A very safe way of counting files:
find . -mindepth 1 -exec printf x \; | wc -c
To avoid counting recursively, add -maxdepth 1 before -exec.
Some other notes:
eval is evil. Don't use it. There is only one place I've ever seen where it's appropriate, and that's when using getopt.
You should not parse the output of ls.
Use $() for command substitutions.
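To assemble the one-line-per-path output you described without parsing ls, a rough sketch along those lines (assuming bash 4 for the zero-padded {00..23} expansion; the labels and paths are taken from your example and would need adjusting):
# label/path pairs taken from the example output (adjust to your real paths)
for entry in "C E:/A/B/C/D/E" "M Q:/M/N/O/P/Q"; do
    label=${entry%%:*}
    parent=${entry#*:}
    printf '%s' "$label"
    for h in {00..23}; do
        # one x per directory entry, then count the x's - no ls parsing involved
        n=$(find "$parent/$h" -mindepth 1 -maxdepth 1 -exec printf x \; | wc -c)
        printf ' %d' "$n"
    done
    echo
done > counts.txt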

Print the count of files in a specific format iteratively in shell script

I have the following folder structure:
A/B/C/D/E/00
A/B/C/D/E/01
.
.
A/B/C/D/E/23
Similarly,
M/N/O/P/Q/00
M/N/O/P/Q/01
.
.
M/N/O/P/Q/23
Now, each folder from 00 to 23 has many files inside, which I would like to count.
If I run this simple command:
ls /A/B/C/D/E/00 | wc -l
I can get the count of files in each of these sub directories. I want to automate this or get it iteratively. Can anyone suggest a way?
Also, the final output I am looking at is a file that should look like this:
C E RESULT OF ls /A/B/C/D/E/00 | wc -l RESULT OF ls /A/B/C/D/E/01 | wc -l
M Q RESULT OF ls /M/N/O/P/Q/00 | wc -l RESULT OF ls /M/N/O/P/Q/01 | wc -l
So, the output should look like this finally
C E 23 23 4 6 7 4 76 98 57 2 67 9 12 34 67 0 2 3 78 98 12 3 57 213
M Q 12 10 2 34 32 1 35 65 87 8 32 2 65 87 98 0 4 12 1 35 34 76 9 67
Please note, the values after the letters are the file counts of the 24 folders 00 through 23.
Using the eval approach I can hardcode and get the exact results, but I wanted it in a way that would show me the data for the previous day. So this is what I did:
d=`date --date="1 days ago" +%Y%m%d`
month=`date +%Y%m`
eval echo YZ $d '"$(ls "/A/B/YZ/$month/$d/"'{20150800..20150823})'| wc -l)"'
This works perfectly because in the given location there are files inside the child directories 20150800, 20150801, ..., 20150823. However, when I try to generalize it like below, it gives me the total count of the folder instead of the count of each sub folder:
eval echo YZ $d '"$(ls "/A/B/YZ/$month/$d/"'{"$d"00.."$d"23})'| wc -l)"'
Something like this (not tested):
for d in [A-Z]/[A-Z]/[A-Z]/[A-Z]/[A-Z]/[0-9][0-9]
do
    [[ -d $d ]] && echo $d : $(ls $d|wc -l)
done
Note that this gives an incorrect count if one of the file names contains a newline character.
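If that matters, the count inside the loop can be taken with find instead of parsing ls - a sketch of the same loop with that substitution:
for d in [A-Z]/[A-Z]/[A-Z]/[A-Z]/[A-Z]/[0-9][0-9]
do
    # print one x per entry and count the x's, so odd file names cannot skew the count
    [[ -d $d ]] && echo "$d : $(find "$d" -mindepth 1 -maxdepth 1 -exec printf x \; | wc -c)"
done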

Faster solution to compare files in bash

file1:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
chr1 16857 17055 NR_024540_4_r_WASH7P_198
and file2:
NR_024540 11
I need to find matches of file2's first column in file1 and print the whole line from file1 plus the second column of file2.
So the output is:
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
My solution is very slow in bash:
#!/bin/bash
while read line; do
c=$(echo $line | awk '{print $1}')
d=$(echo $line | awk '{print $2}')
grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}' >> output
done < file2
I would prefer any FASTER bash or awk solution. The output can be modified, but it needs to keep all the information (the column order can be different).
EDIT:
Right now the fastest solution looks like this one, per @chepner:
#!/bin/bash
while read -r c d; do
grep $c file1 | awk -v line="$d" -v OFS="\t" '{print $1,$2,$3,$4"_"line}'
done < file2 > output
In a single Awk command,
awk 'FNR==NR{map[$1]=$2; next}{ for (i in map) if($0 ~ i){$(NF+1)=map[i]; print; next}}' file2 file1
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
A more readable multi-line version:
FNR==NR {
    # map the values from 'file2' into the hash-map 'map'
    map[$1]=$2
    next
}
# On 'file1' do
{
    # Iterate through the array 'map'
    for (i in map) {
        # If there is a direct regex match on the line with the
        # element from the hash-map, print it and append the
        # hash-mapped value at the end
        if ($0 ~ i) {
            $(NF+1) = map[i]
            print
            next
        }
    }
}
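Note that the for (i in map) loop tries every key as a regex against every line of file1, so with a large file2 the hash-lookup answers below should scale better. If you save the multi-line version to a file (join.awk is just an illustrative name), you can run it the same way as the one-liner:
awk -f join.awk file2 file1 > output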
Another solution uses join and sed, under the assumption that file1 and file2 are sorted on the join fields:
join <(sed -r 's/[^ _]+_[^_]+/& &/' file1) file2 -1 4 -2 1 -o "1.1 1.2 1.3 1.5 2.2" > output
If the output order doesn't matter, using awk:
awk 'FNR==NR{d[$1]=$2; next}
{split($4,v,"_"); key=v[1]"_"v[2]; if(key in d) print $0, d[key]}
' file2 file1
you get,
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
try this -
cat file2
NR_024540 11
NR_024541 12
cat file11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14361 14829 NR_024542_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
chr1 16857 17055 NR_024540_4_r_WASH7P_198
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468
chr1 14969 15038 NR_024540_1_r_WASH7P_69
chr1 15795 15947 NR_024540_2_r_WASH7P_152
chr1 16606 16765 NR_024540_3_r_WASH7P_15
awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
chr1 16857 17055 NR_024540_4_r_WASH7P_198 11
chr1 14361 14829 NR_024540_0_r_DDX11L1,WASH7P_468 11
chr1 14969 15038 NR_024540_1_r_WASH7P_69 11
chr1 15795 15947 NR_024540_2_r_WASH7P_152 11
chr1 16606 16765 NR_024540_3_r_WASH7P_15 11
Performance - (Tested for 55000 records)
time awk 'NR==FNR{a[$1]=$2;next} substr($4,1,9) in a {print $0,a[substr($4,1,9)]}' file2 file1 > output1
real 0m0.16s
user 0m0.14s
sys 0m0.01s
You are starting a lot of external programs unnecessarily. Let read split the incoming line from file2 for you instead of calling awk twice. There is also no need to run grep; awk can do the filtering itself.
while read -r c d; do
    awk -v field="$c" -v line="$d" -v OFS='\t' '$0 ~ field {print $1,$2,$3,$4"_"line}' file1
done < file2 > output
If the searched string is always the same length (length("NR_024540")==9):
awk 'NR==FNR{a[$1]=$2;next} (i=substr($4,1,9)) && (i in a){print $0, a[i]}' file2 file1
Explained:
NR==FNR {                        # process file2
    a[$1]=$2                     # hash record using $1 as the key
    next                         # skip to next record
}
(i=substr($4,1,9)) && (i in a) { # read the first 9 bytes of $4 to i and search in a
    print $0, a[i]               # output if found
}
awk -F '[[:blank:]_]+' '
FNR==NR { a[$2]=$3 ;next }
{ if ( $5 in a ) $0 = $0 " " a[$5] }
7
' file2 file1
Comments:
_ is used as an extra field separator so the names are easier to compare in both files (using only the number part).
7 is just for fun; any non-zero value works -> print the line.
I don't change the fields (no NF+1, ...), so the original format is kept and only the referenced number is appended.
A smaller one-liner (optimized for code size), assuming file1 has no empty lines (that is mandatory). If the separators are only spaces, you can replace [:blank:] with a space character:
awk -F '[[:blank:]_]+' 'NF==3{a[$2]=$3;next}$0=$0" "a[$5]' file2 file1
No awk or sed needed. This assumes file2 is only one line:
n="`cut -f 2 file2`" ; while read x ; do echo "$x $n" ; done < file1

Process multiple files and append them in linux/unix

I have over 100 files, each with at least 5-8 tab-separated columns. I need to extract the first three columns from each file, add a fourth column with some predefined text, and append them all together.
Let's say I have 3 files: file001.txt, file002.txt, file003.txt.
file001.txt:
chr1 1 2 15
chr2 3 4 17
file002.txt:
chr1 1 2 15
chr2 3 4 17
file003.txt:
chr1 1 2 15
chr2 3 4 17
combined_file.txt:
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
For simplicity I kept the file contents the same.
My script is as follows:
#!/bin/bash
for i in {1..3}; do
j=$(printf '%03d' $i)
awk 'BEGIN { OFS="\t"}; {print $1,$2,$3}' file${j}.txt | awk -v k="$j" 'BEGIN {print $0"\t$k”}' | cat >> combined_file.txt
done
But the script is giving the following errors:
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
Can someone help me figure it out?
You don't need two different awk scripts. And you don't use $ to refer to variables in awk; that is used to refer to input fields (i.e. $k means the field whose number is stored in the variable k).
for i in {1..3}; do
    j=$(printf '%03d' $i)
    awk -v k="$j" -v OFS='\t' '{print $1, $2, $3, k}' file$j.txt
done > combined_file.txt
As pointed out in the comments, your problem is that you're using typographic (curly) quote characters as if they were double quotes. Once you fix that, though, you don't need a loop or any of that other complexity; all you need is:
$ awk 'BEGIN{FS=OFS="\t"} {$NF="f"ARGIND} 1' file*
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
The above used GNU awk for ARGIND.
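If you are not on GNU awk, a sketch of the same idea that counts the input files itself (bumping a counter each time FNR resets to 1) instead of relying on ARGIND:
awk 'BEGIN{FS=OFS="\t"} FNR==1{c++} {$NF="f"c} 1' file*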

How to extract one column from multiple files, and paste those columns into one file?

I want to extract the 5th column from multiple files, named in numerical order, and paste those columns in sequence, side by side, into one output file.
The file names look like:
sample_problem1_part1.txt
sample_problem1_part2.txt
sample_problem2_part1.txt
sample_problem2_part2.txt
sample_problem3_part1.txt
sample_problem3_part2.txt
......
Each problem file (1,2,3...) has two parts (part1, part2). Each file has the same number of lines.
The content looks like:
sample_problem1_part1.txt
1 1 20 20 1
1 7 21 21 2
3 1 22 22 3
1 5 23 23 4
6 1 24 24 5
2 9 25 25 6
1 0 26 26 7
sample_problem1_part2.txt
1 1 88 88 8
1 1 89 89 9
2 1 90 90 10
1 3 91 91 11
1 1 92 92 12
7 1 93 93 13
1 5 94 94 14
sample_problem2_part1.txt
1 4 330 30 a
3 4 331 31 b
1 4 332 32 c
2 4 333 33 d
1 4 334 34 e
1 4 335 35 f
9 4 336 36 g
The output should look like this (in the sequence problem1_part1, problem1_part2, problem2_part1, problem2_part2, problem3_part1, problem3_part2, etc.):
1 8 a ...
2 9 b ...
3 10 c ...
4 11 d ...
5 12 e ...
6 13 f ...
7 14 g ...
I was using:
paste sample_problem1_part1.txt sample_problem1_part2.txt > \
sample_problem1_partall.txt
paste sample_problem2_part1.txt sample_problem2_part2.txt > \
sample_problem2_partall.txt
paste sample_problem3_part1.txt sample_problem3_part2.txt > \
sample_problem3_partall.txt
And then:
for i in `find . -name "sample_problem*_partall.txt"`
do
l=`echo $i | sed 's/sample/extracted_col_/'`
`awk '{print $5, $10}' $i > $l`
done
And:
paste extracted_col_problem1_partall.txt \
extracted_col_problem2_partall.txt \
extracted_col_problem3_partall.txt > \
extracted_col_problemall_partall.txt
It works fine with a few files, but it's a crazy method when the number of files is large (over 4000).
Could anyone help me with simpler solutions that are capable of dealing with multiple files, please?
Thanks!
Here's one way using awk and a sorted glob of files:
awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *)
Results:
1 8 a
2 9 b
3 10 c
4 11 d
5 12 e
6 13 f
7 14 g
Explanation:
For each line of input of each input file:
Add column 5 to the array a, keyed by the file's line number (FNR).
(a[FNR] ? a[FNR] FS : "") is a ternary expression used to build up the array's value as a record. It asks whether the file's line number is already a key in the array. If so, take the array's current value followed by the default field separator, then append the fifth column. If the line number is not yet in the array, prepend nothing and just set the value to the fifth column.
At the end of the script:
Use a C-style loop to iterate through the array, printing each of the array's values.
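As a concrete illustration (using the three sample files above), after all input is read the array holds one assembled record per line number, for example:
a[1] = "1 8 a"
a[2] = "2 9 b"
...
a[7] = "7 14 g"
and the END loop then prints a[1] through a[7] in order.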
For only ~4000 files, you should be able to do:
find . -name 'sample_problem*_part*.txt' | xargs paste
If find is giving names in the wrong order, pipe it to sort:
find . -name 'sample_problem*_part*.txt' | sort ... | xargs paste
# print filenames in sorted order
find -name sample\*.txt | sort |
# extract 5-th column from each file and print it on a single line
xargs -n1 -I{} sh -c '{ cut -s -d " " -f 5 $0 | tr "\n" " "; echo; }' {} |
# transpose
python transpose.py ?
where transpose.py:
#!/usr/bin/env python
"""Write lines from stdin as columns to stdout."""
import sys
from itertools import izip_longest
missing_value = sys.argv[1] if len(sys.argv) > 1 else '-'
for row in izip_longest(*[column.split() for column in sys.stdin],
                        fillvalue=missing_value):
    print " ".join(row)
Output
1 8 a
2 9 b
3 10 c
4 11 d
5 ? e
6 ? f
? ? g
Assuming the first and second files have fewer lines than the third one (missing values are replaced by '?').
Try this one. My script assumes that every file has the same number of lines.
# get number of lines
lines=$(wc -l sample_problem1_part1.txt | cut -d' ' -f1)
for ((i=1; i<=$lines; i++)); do
    for file in sample_problem*; do
        # get line number $i and delete everything except the last column
        # and then print it
        # echo -n means that no newline is appended
        echo -n $(sed -n ${i}'s%.*\ %%p' $file)" "
    done
    echo
done
This works. For 4800 files, each 7 lines long, it took 2 minutes 57.865 seconds on an AMD Athlon(tm) X2 Dual Core Processor BE-2400.
PS: The time for my script increases linearly with the number of lines. It would take a very long time to merge files with 1000 lines. You should consider learning awk and using the script from steve. I tested it: for 4800 files, each with 1000 lines, it took only 65 seconds!
You can pass awk output to paste and redirect it to a new file as follows:
paste <(awk '{print $3}' file1) <(awk '{print $3}' file2) <(awk '{print $3}' file3) > file.txt
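Adapted to this question (column 5 and the sample file names; the output file name here is just illustrative), that would be, for example:
paste <(awk '{print $5}' sample_problem1_part1.txt) \
      <(awk '{print $5}' sample_problem1_part2.txt) \
      <(awk '{print $5}' sample_problem2_part1.txt) > all_columns.txt
though with thousands of files one of the single-pass awk answers above is more practical.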
