I have a huge .txt file (15 GB) with almost 30 million lines.
I want to split its lines into different files based on the 4th column, and the number of unique values in the 4th column is around 2 million.
file1.txt
1 10 ABC KK-LK
1 33 23 KK-LK
2 34 32 CK-LK,LK
11 332 2 JK#
11 23 2 JK2
Right now, I can separate these lines into different files in the same folder as follows:
awk '{ print $0 >> $4"_sep.txt" }' file1.txt
And it results in 4 different files:
KK-LK_sep.txt
1 10 ABC KK-LK
1 33 23 KK-LK
and
CK-LK,LK_sep.txt
2 34 32 CK-LK,LK
and
JK#_sep.txt
11 332 2 JK#
and finally,
JK2_sep.txt
11 23 2 JK2
What I want is to avoid putting 2 million files in one folder and instead separate them into 20 different folders. I can make the folders as folder1,2,3....:
mkdir folder{1..20}
With the answers below, I suppose something like the following (broken) code could work:
#!/usr/bin/env bash
shopt -s nullglob
numfiles=(*)
numfiles=${#numfiles[@]}
numdirs=(*/)
numdirs=${#numdirs[@]}
(( numfiles -= numdirs ))
echo $numfiles
var1=$numfiles
awk -v V1=var1 '{
if(V1 <= 100000)
{
awk '{ print $0 >> $4"_sep.txt" }' file1.txt
}
else if(V1 => 100000)
{
cd ../folder(cnt+1)
awk '{ print $0 >> $4"_sep.txt" }' file1.txt
}
}'
But then, how can I make this a loop that stops adding to folder1 once it has 100,000 files in it, and starts adding files to folder2, and so on?
Maybe this is what you want (untested since your question doesn't include an example we can test against):
awk '
!($4 in key2out) {
if ( (++numKeys % 100000) == 1 ) {
dir = "dir" ++numDirs
system("mkdir -p " dir)
}
key2out[$4] = dir "/" $4 "_sep.txt"
}
{ print > key2out[$4] }
' file1.txt
That relies on GNU awk to manage the number of open files internally. With other awks you'd need to change that last line to { print >> key2out[$4]; close(key2out[$4]) } or otherwise limit how many files you have open concurrently to avoid a "too many open files" error. For example, if your $4 values are usually grouped together, then rather than opening and closing the output file on every single write, you could do it only when the $4 value changes:
awk '
$4 != prevKey { close(key2out[prevKey]) }
!($4 in key2out) {
if ( (++numKeys % 100000) == 1 ) {
dir = "dir" ++numDirs
system("mkdir -p " dir)
}
key2out[$4] = dir "/" $4 "_sep.txt"
}
{ print >> key2out[$4]; prevKey=$4 }
' file1.txt
Something like this?
Count the unique keys and increment the bucket once the threshold is passed:
count += !keys[$4]++;
bucket=count/100000;
ibucket=int(bucket);
ibucket=ibucket==bucket?ibucket:ibucket+1;
folder="folder"ibucket
I wrote a function in AWK that prints the difference of two matrices.
For example, f1 contains this matrix:
12 35 68 99
2 6
1
and f2 contains:
10 25 100
2 5 4
2
It will print the result to a file called tmp:
2 10 -32 99
0 1 4
The code is:
function matrix_difference(file1,file2) {
printf "" > "tmp"
for (o=1;o<=NR;o++){
for (x=1;x<=NF;x++){
d=A[file1,o,x]
p=A[file2,o,x]
sum=d-p
printf sum " ">> "tmp"
}
print "" >> "tmp"
}
close("tmp")
}
I tried to write AWK code that takes a number of files, each containing a matrix, and prints "The matrix difference (file names with - between each name) is:" followed by the difference over all the files. If there are four files f1 f2 f3 f4, it should print f1-f2-f3-f4.
I tried to write the code, but it only works on two files. I tried to do a loop, but it doesn't work; it only works on two files, when I write matrix_difference(ARGV[1],ARGV[2]).
#!/usr/bin/awk -f
{
for (i=1;i<=NF;i++)
A[FILENAME,FNR,i]=$i
}
{
if ( FNR ==1)
print "The matrix " FILENAME " is :"
print $0
}
END{
matrix_difference(ARGV[1],ARGV[2])
for ( m=1;m<=NR;m++){
getline x <"tmp"
print x
}
print " The matrix difference A-B-C-D is:"
}
function matrix_difference(file1,file2) {
printf "" > "tmp"
for (o=1;o<=NR;o++){
for (x=1;x<=NF;x++){
d=A[file1,o,x]
p=A[file2,o,x]
sum=d-p
printf sum " ">> "tmp"
}
print "" >> "tmp"
}
close("tmp")
}
Also, when I print the matrix difference, I don't know how to print the file names.
Here is sample code that works with multiple files; after the first one, everything else is subtracted from it. It assumes all matrices have matching dimensions.
$ awk 'NR==FNR {for(i=1;i<=NF;i++) a[NR,i]=$i; next}
{for(i=1;i<=NF;i++) a[FNR,i]-=$i}
END {for(i=1;i<=FNR;i++)
for(j=1;j<=NF;j++)
printf "%d%s",a[i,j],(j==NF?ORS:OFS)}' file1 file2
For the input files
==> file1 <==
1 2
3 4
==> file2 <==
1 0
0 1
script returns
0 2
3 3
Try with more files:
$ awk '...' file1 file2 file2
-1 2
3 2
You can convert it to a function, though I'm not sure it helps:
$ awk 'function sum(sign) {for(i=1;i<=NF;i++) a[FNR,i]+=sign*$i}
NR==FNR {sum(+1); next}
{sum(-1)}
END {for(i=1;i<=FNR;i++)
for(j=1;j<=NF;j++)
printf "%d%s",a[i,j],(j==NF?ORS:OFS)}'
I wrote a script in AWK called exc7
./exc7 file1 file2
In every file there is a matrix
file1:
2 6 7
10 5 4
3 8 4
file2:
-60 10
10 -60
The code that I wrote is:
#!/usr/bin/awk -f
{
for (i=1;i<=NF;i++)
A[NR,i]=$i
}
END{
for (i=1;i<=NR;i++){
sum += A[i,1]
}
for (i=1;i<=NF;i++)
sum2 += A[1,i]
for (i=0;i<=NF;i++)
sum3 += A[NR,i]
for (i=0;i<=NR;i++)
sum4 += A[i,NF]
print sum,sum2,sum3,sum4
if (sum==sum2 && sum==sum3 && sum==sum4)
print "yes"
}
For every file, it should check whether the sum of the first column, the last column, the first line and the last line are all the same. It should print the four sums and say yes if they are equal. Then it should print the largest sum of all the numbers in all the files.
When I try it on one file it is right; for example, when I try it on file1 it prints:
15 15 15 15
yes
But when I try it on two or more files, like file1 file2, the output is:
-35 8 -50 -31
You should use FNR instead of NR, and with gawk you can use ENDFILE instead of END. However, this should work with any awk:
awk 'function sumline(last,rn) {n=split(last,lr);
for(i=1;i<=n;i++) rn+=lr[i];
return rn}
function printresult(c1,r1,rn,cn) {print c1,r1,rn,cn;
print (r1==rn && c1==cn && r1==c1)?"yes":"no"}
FNR==1{if(last)
{rn=sumline(last);
printresult(c1,r1,rn,cn)}
rn=cn=c1=0;
r1=sumline($0)}
{c1+=$1;cn+=$NF;last=$0}
END {rn=sumline(last);
printresult(c1,r1,rn,cn)}' file1 file2
15 15 15 15
yes
-50 -50 -50 -50
yes
Essentially, instead of checking for the end of a file, you check for the start of a file and print out the previous file's results. You need to treat the first file differently, and you still need the END block to handle the last file.
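For reference, here is a rough gawk-only sketch using ENDFILE as mentioned above (untested; it assumes every input file is a well-formed matrix):
awk 'FNR==1 { r1 = 0; for (i=1; i<=NF; i++) r1 += $i }   # sum of the first row
     { c1 += $1; cn += $NF                               # running sums of first and last columns
       rn = 0; for (i=1; i<=NF; i++) rn += $i }          # recomputed per record, ends up as the last row sum
     ENDFILE { print c1, r1, rn, cn
               print (c1==r1 && c1==rn && c1==cn) ? "yes" : "no"
               c1 = cn = 0 }' file1 file2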
UPDATE
Based on the questions you asked, I think it's better for you to keep your script as is and change the way you call it.
for file in file1 file2; do
    echo "$file"
    ./exc7 "$file"
done
you'll be calling the script once for each file, so all the complications will go away.
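The question also mentions printing the largest sum across all the files; if that means the largest of the four per-file sums, a hedged sketch (assuming exc7 prints the four sums on one line, as shown above) is to post-process the loop's output:
for file in file1 file2; do ./exc7 "$file"; done |
awk 'NF == 4 { for (i=1; i<=4; i++) if (!seen || $i > max) { max = $i; seen = 1 } }  # ignore the yes/no lines
     END { print "largest sum:", max }'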
I've spent the day trying to figure this out but didn't succeed. I have two files like this:
File1:
chr id pos
14 ABC-00 123
13 AFC-00 345
5 AFG-99 988
File2:
index id chr
1 ABC-00 14
2 AFC-00 11
3 AFG-99 7
I want to check whether the value of chr in File 1 differs from chr in File 2 for the same id; if it does, I want to print some columns from both files to get an output like the one below.
Expected output file:
ID OLD_chr(File1) NEW_chr(File2)
AFC-00 13 11
AFG-99 5 7
.....
Total number of position changes: 2
There is one caveat though. In File 1, I have to substitute some values in the $1 column before comparing the files, like this:
30 and 32 >> X
31 >> Y
33 >> MT
Because that's how those values are coded in File 2, and only then can I compare the two files. How in the hell can I achieve this?
I've tried to recode File 1:
awk '{
if($1=30 || $1=32) gsub(/30|32/,"X",$1);
if($1=31) gsub(/31/,"Y",$1);
if($1=33) gsub(/33/,"MT",$1);
print $0
}' File 1 > File 1 Recoded
And I was trying to match the columns and print the output with:
awk 'NR==FNR{a[$1]=$1;next} (a[$1] !=$3){print $2, a[$1], $3 }' File 1 File 2 > output file
$ cat tst.awk
BEGIN {
map[30] = map[32] = "X"
map[31] = "Y"
map[33] = "MT"
print "ID", "Old_chr("ARGV[1]")", "NEW_chr("ARGV[2]")"
}
NR==FNR {
a[$2] = ($1 in map ? map[$1] : $1)
next
}
a[$2] != $3 {
print $2, a[$2], $3
cnt++
}
END {
print "Total number of position changes: " cnt+0
}
$ awk -f tst.awk file1 file2
ID Old_chr(file1) NEW_chr(file2)
AFC-00 13 11
AFG-99 5 7
Total number of position changes: 2
Like this:
awk '
BEGIN{ # executed at the BEGINning
print "ID OLD_chr("ARGV[1]") NEW_chr("ARGV[2]")"
}
FNR==NR{ # this code block for File1
if ($1 == 30 || $1 == 32) $1 = "X"
if ($1 == 31) $1 = "Y"
if ($1 == 33) $1 = "MT"
a[$2]=$1
next
}
{ # this for File2
if (a[$2] != $3) {
print $2, a[$2], $3
count++
}
}
END{ # executed at the END
print "Total number of position changes: " count+0
}
' File1 File2
ID OLD_chr(File1) NEW_chr(File2)
AFC-00 13 11
AFG-99 5 7
Total number of position changes: 2
Using a bash script (Ubuntu 16.04), I'm trying to compare 2 lists of ranges: does any number in any of the ranges in file1 coincide with any number in any of the ranges in file2? If so, print the row of the second file. Here each range is given as 2 tab-delimited columns (in file1, row 1 represents the range 1-4, i.e. 1, 2, 3, 4). The real files are quite big.
file1:
1 4
5 7
8 11
12 15
file2:
3 4
8 13
20 24
Desired output:
3 4
8 13
My best attempt has been:
awk 'NR=FNR { x[$1] = $1+0; y[$2] = $2+0; next};
{for (i in x) {if (x[i] > $1+0); then
{for (i in y) {if (y[i] <$2+0); then
{print $1, $2}}}}}' file1 file2 > output.txt
This returns an empty file.
I'm thinking that the script will need to involve range comparisons using if-then conditions and iterate through each line in both files. I've found examples of each concept, but can't figure out how to combine them.
Any help appreciated!
It depends on how big your files are, of course. If they are not big enough to exhaust the memory, you can try this 100% bash solution:
declare -a min=() # array of lower bounds of ranges
declare -a max=() # array of upper bounds of ranges
# read ranges in second file, store them in arrays min and max
while read a b; do
min+=( "$a" );
max+=( "$b" );
done < file2
# read ranges in first file
while read a b; do
# loop over indexes of min (and max) array
for i in "${!min[#]}"; do
if (( max[i] >= a && min[i] <= b )); then # if ranges overlap
echo "${min[i]} ${max[i]}" # print range
unset min[i] max[i] # performance optimization
fi
done
done < file1
This is just a starting point. There are many possible performance / memory footprint improvements. But they strongly depend on the sizes of your files and on the distributions of your ranges.
EDIT 1: improved the range overlap test.
EDIT 2: reused the excellent optimization proposed by RomanPerekhrest (unset already printed ranges from file2). The performance should be better when the probability that ranges overlap is high.
EDIT 3: performance comparison with the awk version proposed by RomanPerekhrest (after fixing the initial small bugs): awk is between 10 and 20 times faster than bash on this problem. If performance is important and you hesitate between awk and bash, prefer:
awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next; }
{ for (i in a)
if ($1 <= b[i] && a[i] <= $2) {
print a[i], b[i]; delete a[i]; delete b[i];
}
}' file2 file1
awk solution:
awk 'NR==FNR{ a[$1]=$2; next }
{ for(i in a)
if (($1>=i+0 && $1<=a[i]) || ($2<=a[i] && $2>=i+0)) {
print i,a[i]; delete a[i];
}
}' file2 file1
The output:
3 4
8 13
awk 'FNR == 1 && NR == 1 { file=1 } FNR == 1 && NR != 1 { file=2 } file ==1 { for (q=1;q<=NF;q++) { nums[$q]=$0} } file == 2 { for ( p=1;p<=NF;p++) { for (i in nums) { if (i == $p) { print $0 } } } }' file1 file2
Breakdown:
FNR == 1 && NR == 1 {
file=1
}
FNR == 1 && NR != 1 {
file=2
}
file == 1 {
for (q=1;q<=NF;q++) {
nums[$q]=$0
}
}
file == 2 {
for ( p=1;p<=NF;p++) {
for (i in nums) {
if (i == $p) {
print $0
}
}
}
}
Basically we set file = 1 when we are processing the first file and file = 2 when we are processing the second file. When we are in the first file, read the line into an array keyed on each field of the line. When we are in the second file, process the array (nums) and check if there is an entry for each field on the line. If there is, print it.
For GNU awk, since I'm controlling the for scanning order to optimize run time:
$ cat program.awk
BEGIN {
PROCINFO["sorted_in"]="@ind_num_desc"
}
NR==FNR { # hash file1 to a
if(($2 in a==0) || $1<a[$2]) # avoid collisions
a[$2]=$1
next
}
{
for(i in a) { # in desc order
# print "DEBUG: For:",$0 ":", a[i], i # remove # for debug
if(i+0>$1) { # next after
if($1<=i+0 && a[i]<=$2) {
print
next
}
}
else
next
}
}
Test data:
$ cat file1
0 3 # testing for completely overlapping ranges
1 4
5 7
8 11
12 15
$ cat file2
1 2 # testing for completely overlapping ranges
3 4
8 13
20 24
Output:
$ awk -f program.awk file1 file2
1 2
3 4
8 13
and
$ awk -f program.awk file2 file1
0 3
1 4
8 11
12 15
If a Perl solution is preferred, then the one-liner below would work:
/tmp> cat marla1.txt
1 4
5 7
8 11
12 15
/tmp> cat marla2.txt
3 4
8 13
20 24
/tmp> perl -lane ' BEGIN { %kv=map{split(/\s+/)} qx(cat marla2.txt) } { foreach(keys %kv) { if($F[0]==$_ or $F[1]==$kv{$_}) { print "$_ $kv{$_}" }} } ' marla1.txt
3 4
8 13
/tmp>
If the ranges are ordered according to their lower bounds, we can use this to make the algorithms more efficient. The idea is to alternately proceed through the ranges in file1 and file2. More precisely, when we have a certain range R in file2, we take further and further ranges in file1 until we know whether these overlap with R. Once we know this, we switch to the next range in file2.
#!/bin/bash
exec 3< "$1" # file whose ranges are checked for overlap with those ...
exec 4< "$2" # ... from this file, and if so, are written to stdout
l4=-1 # lower bound of current range from file 2
u4=-1 # upper bound
# initialized with -1 so the first range is read on the first iteration
echo "Ranges in $1 that intersect any ranges in $2:"
while read l3 u3; do # read next range from file 1
if (( u4 >= l3 )); then
(( l4 <= u3 )) && echo "$l3 $u3"
else # the upper bound from file 2 is below the lower bound from file 1, so ...
while read l4 u4; do # ... we read further ranges from file 2 until ...
if (( u4 >= l3 )); then # ... their upper bound is high enough
(( l4 <= u3 )) && echo "$l3 $u3"
break
fi
done <&4
fi
done <&3
The script can be called with ./script.sh file2 file1
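With the sample file1 and file2 from the question (both already sorted by their lower bounds, as this approach requires), that call should produce something like:
$ ./script.sh file2 file1
Ranges in file2 that intersect any ranges in file1:
3 4
8 13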
I have two data files. One has 1,600 rows and the other has 2 million rows (tab-delimited files). I need to do a vlookup between these two files. Please see the example below for the expected output, and let me know if it's possible. I've tried using awk, but couldn't get the expected result.
File 1(small file)
BC1 10 100
BC2 20 200
BC3 30 300
File 2(large file)
BC1 XYZ
BC2 ABC
BC3 DEF
Expected Output:
BC1 10 100 XYZ
BC2 20 200 ABC
BC3 30 300 DEF
I also tried the join command. It is taking forever to complete. Please help me find a solution. Thanks
Commands for your output:
awk '{print $1}' *file | sort | uniq -d > out.txt
for i in $(cat out.txt)
do
grep "$i" large_file >> temp.txt
done
sort -g -t 1 temp.txt > out1.txt
sort -g -t 1 out.txt > out2.txt
paste out1.txt out2.txt | awk '{print $1 $2 $3 $5}'
Commands for vlookup:
Store the 1st and 2nd columns in file1 and file2 respectively:
cat file1 file2 | sort | uniq -d ### for records which are present in both files
cat file1 file2 | sort | uniq -u ### for records which are unique and not present in bulk file
This awk script will scan each file line by line and try to match the number in the BC column. Once matched, it will print all the columns.
If one of the files does not contain one of the numbers, that number will be skipped in both files and the script will search for the next one. It will loop until one of the files ends.
The script also accepts any number of columns per file and any number of files, as long as the first column is BC plus a number.
This awk script assumes that the files are ordered from the smallest to the largest number in the BC column (like in your example). Otherwise it will not work.
To execute the script, run this command:
awk -f vlookup.awk smallfile bigfile
The vlookup.awk file will have this content:
BEGIN {files=1;lines=0;maxlines=0;filelines[1]=0;
#Number of columns for SoD, PRN, reference file
col_bc=1;
#Initialize variables
bc_now=0;
new_bc=0;
end_of_process=0;
aux="";
text_result="";
}
{
if(FILENAME!=ARGV[1])exit;
no_bc=0;
new_bc=0;
#Save number of columns
NFields[1]=NF;
#Copy reference file data
for(j=0;j<=NF;j++)
{
file[1,j]=$j;
}
#Read lines from file
for(i=2;i<ARGC;i++)
{
ret=getline < ARGV[i];
if(ret==0) exit; #END OF FILE reached
#Copy columns to file variable
for(j=0;j<=NF;j++)
{
file[i,j]=$j;
}
#Save number of columns
NFields[i]=NF;
}
#Check that all files are in the same number
for(i=1;i<ARGC;i++)
{
bc[i]=file[i,col_bc];
bc[i]=sub("BC","",file[i,col_bc]);
if(bc[i]>bc_now) {bc_now=bc[i];new_bc=1;}
}
#One or more files have a new number
if (new_bc==1)
{
for(i=1;i<ARGC;i++)
{
while(bc_now!=file[i,col_bc])
{
#Read next line from file
if(i==1) ret=getline; #File 1 is the reference file
else ret=getline < ARGV[i];
if(ret==0) exit; #END OF FILE reached
#Copy columns to file variable
for(j=0;j<=NF;j++)
{
file[i,j]=$j;
}
#Save number of columns
NFields[i]=NF;
#Check if in current file data has gone to next number
if(file[i,col_bc]>bc_now)
{
no_bc=1;
break;
}
#No more data lines to compare, end of comparison
if(FILENAME!=ARGV[1])
{
exit;
}
}
#If the number is not in a file, the process to realign must be restarted to the next number available (Exit for loop)
if (no_bc==1) {break;}
}
#If the number is not in a file, the process to realign must be restarted to the next number available (Continue while loop)
if (no_bc==1) {next;}
}
#Number is aligned
for(i=1;i<ARGC;i++)
{
for(j=2;j<=NFields[i];j++) {
#Join colums in text_result variable
aux=sprintf("%s %s",text_result,file[i,j]);
text_result=sprintf("%s",aux);
}
}
printf("BC%d%s\n",bc_now,text_result)
#Reset text variables
aux="";
text_result="";
}
I also tried the join command. It is taking forever to complete.
Please help me find a solution.
It's improbable that you'll find a solution (scripted or not) that's faster than the compiled join command. If you can't wait for join to complete, you need more powerful hardware.
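For completeness, here is a minimal awk sketch of the lookup itself (no claim about speed relative to join): hash the small file in memory and stream the large one. It assumes tab-delimited input and that the first column is the key in both files, as in the example, with file1 being the small file:
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { rest[$1] = $2 OFS $3; next }   # small file: remember columns 2 and 3 per key
     $1 in rest { print $1, rest[$1], $2 }      # large file: key, small-file columns, then its own column 2
    ' file1 file2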