Is there an efficient way to separate lines into different files, with awk in this case?

I am trying to separate a file into two different files based on whether a line contains a certain string. If a line contains "ITS", that line and the line right after it should be written to the file ITS.txt; if a line contains "V34", that line and the line right after it should be written to the file V34.txt.
My awk code is
awk '/ITS/{print>"ITS.txt"; getline; print>"ITS.txt"} /V34/{print>"V34.txt"; getline; print>"V34.txt"}' seqs.fna
It works well, but I am wondering whether there is a more efficient way to do this.
seqs.fna (9-10 GB):
>16S.V34.S7.5_1
ACGGGAGGCAGCAGTAGGGAATCTTCC
>PCR.ITS.S8.14_2
CATTTAGAGGAAGTAAAAGTCGTAACA
>PCR.ITS.S7.11_3
CATTTAGAGGAAGTACAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTTTTGAAGGCTACAC
>16S.V34.S8.6_4
ACGGGCGGCAGCAGTAGGGAAT
>16S.V34.S8.13_5
ACGGGCGGCAGCAGTAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAACTCTGT
>16S.V34.S7.14_6
ACGGGGGGCAGCAGTAGGGAATCTTCCACAATGGGTGCAAACCTGATGGAGCAATGCCG
>16S.V34.S8.4_7
ACGGGAGGCAGCAGTAGGGAATCTTCCACAAT
>16S.V34.S8.14_8
CGTAGAGATGTGGAGGAACACCAGTGGCGAAG
>16S.V34.S8.8_9
CTGGGATAACACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTTGTAGTC
>16S.V34.S7.3_10
GGTCTGTAATTGACGCTGAGGTTCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCGGGTAGTC

getline has a few very specific uses, and this would not be one of them; see http://awk.freeshell.org/AllAboutGetline. If you rewrote your script without getline you'd solve the problem yourself, but given the input file you posted, this is all you need:
awk -F'.' '/^>/{out=$2".txt"} {print > out}' seqs.fna
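If you want to keep your original two-pattern logic but still avoid getline, a minimal sketch (assuming, as in your sample, that every record begins with a ">" header line) could be:
awk '/^>/{out = /ITS/ ? "ITS.txt" : (/V34/ ? "V34.txt" : "")} out{print > out}' seqs.fna
Since out is reset on every header line, records matching neither string are simply skipped.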
To learn how to use awk correctly, read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

Related

use awk to left outer join two csv files based on multiple columns while keeping the order of the first file's observations

I have two csv files.
File 1
ID,Name,Gender,Salary,DOB
11,Jim,M,200,90
12,David,M,100,89
12,David,M,300,89
13,Lucy,F,150,86
14,Lily,F,200,85
13,Lucy,F,100,86
File 2
DOB,Name,Children
90,Jim,2
88,Michael,4
88,Lily,1
85,Lily,0
What I want to do is to left outer join File 2 into File 1 based on DOB and Name while keeping the order of File 1 observations.
So the output is expected to be
ID,Name,Gender,Salary,DOB,Children
11,Jim,M,200,90,2
12,David,M,100,89,
12,David,M,300,89,
13,Lucy,F,150,86,
14,Lily,F,200,85,0
13,Lucy,F,100,86,
I learned that we need to sort the data if we use the join command, so I was wondering whether I could use awk to do this instead. But I am new to awk. Can anyone help me? By the way, if the data is very big, can I drop the print command in awk and simply use > *.csv to save into a new csv file? I ask because the solutions to related questions on this website often use {print ...}. Thank you.
awk to the rescue!
$ awk -F, 'NR==FNR{a[$1,$2]=$3; next} {print $0 FS a[$NF,$2]}' file2 file1
ID,Name,Gender,Salary,DOB,Children
11,Jim,M,200,90,2
12,David,M,100,89,
12,David,M,300,89,
13,Lucy,F,150,86,
14,Lily,F,200,85,0
13,Lucy,F,100,86,
join requires sorted input, and you would need extra embellishments to recover the initial ordering. You can redirect the output to a file by appending > outputfile.csv to the command.
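For example, with the files named file2 and file1 as above, redirecting into a hypothetical joined.csv looks like this:
$ awk -F, 'NR==FNR{a[$1,$2]=$3; next} {print $0 FS a[$NF,$2]}' file2 file1 > joined.csv
Reading file2 first builds the lookup array a keyed on DOB and Name; each line of file1 is then printed with the matching Children value appended, or with an empty last field when there is no match.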

Join txt files into a single file and then split them back up again

I have hundreds of txt files, all in a single directory. I would like to be able to do the following:
Join all the files into a single txt file. This command should insert a marker symbol (such as §) together with the file name at each join point.
[I then do some work on the combined file, which consists of making changes. Some of these changes involve using proprietary software that works better with one big file than with lots of little files.]
Use a second command to go through the joined file and split it back into separate files, using the file name that was next to the symbol to name each split file.
Example:
Before joining:
File 1: "Towns.txt"
Béthlem
Cabul
Corinthia
ruined lands
eshcol
Gabbatha
old town
File 2: "Fruits and Nuts.txt"
Apples
Pomegranates
Sycamore
After Joining, but before I make changes
(Single file)
§Towns.txt
Béthlem
Cabul
Corinthia
ruined lands
eshcol
Gabbatha
old town
§Fruits and Nuts.txt
Apples
Pomegranates
Sycamore
After Joining and I make changes
(These changes are made manually in the single file)
§Towns.txt
Bethlehem
Cabul
Corinth
Ruined lands
Eshcol
Gabbatha
The Old Town
§Fruits and Nuts.txt
Apples
Pomegranates
Sycamore
After Splitting:
File 1: "Towns.txt"
Bethlehem
Cabul
Corinth
Ruined lands
Eshcol
Gabbatha
The Old Town
File 2: "Fruits and Nuts.txt"
Apples
Pomegranates
Sycamore
Steps I have tried
Combining files
I reworked the answer in this thread to make an awk command that joins the files together with each file name prefixed by the § symbol.
awk '(FNR==1){print "§" FILENAME} 1' * > ^0join.txt
This seems to work well.
Splitting files
This thread provides a solution for splitting files, which I have reworked to my needs to produce this:
awk -v RS='§' '{ outfile = "output_file_" NR; print > outfile}' ^0join.txt
The only problem is that the output files are named "output_file_1", "output_file_2", etc.
They also keep the file name at the top of each file, which I do not want.
Also, sometimes when I use this command, it just puts everything into a single file called "output_file_1" and does not split anything.
I also found this thread, which had another solution that I reworked:
awk '{print $0 "file" NR}' RS='§' ^0join.txt
However, this didn’t seem to do anything.
Notes
The § can be any other symbol.
I am using Mac OS 10.14.6, so I would like something that would work in the terminal of Mac OS.
Could you please try the following.
For the joining command:
awk 'FNR==1{print "§" FILENAME}; 1' Towns.txt "Fruits and Nuts.txt" > Output_file
For splitting files:
awk '/^§/{close(file);sub(/^§/,"");file=$0;next} {print > (file)}' Output_file
NOTE: As per the OP's comments, if all .txt files need to be passed to the first command, you could put /complete/path/to/txt_files/*.txt after the awk code and remove the individual file names from there (not tested, but it should work).
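Putting both steps together, a round-trip sketch (assuming no original file contains a line beginning with §, and using the hypothetical name joined.out so the *.txt glob cannot pick up the combined file on a re-run):
awk 'FNR==1{print "§" FILENAME}; 1' *.txt > joined.out
# ... edit joined.out by hand ...
awk '/^§/{close(file); sub(/^§/,""); file=$0; next} {print > file}' joined.out
The close(file) call matters when there are hundreds of files, because some awk implementations limit how many output files can be open at once.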

extracting specific lines containing text and numbers from log file using awk

I haven't used my Linux skills in a while and I'm struggling with extracting certain lines out of a csv log file.
The file is structured as:
code,client_id,local_timestamp,operation_code,error_code,etc
I want to extract only those lines of the file with a specific code and a client_id greater than 0.
For example, if I have the lines:
message_received,1,134,20,0,xxx
message_ack,0,135,10,1,xxx
message_received,0,140,20,1,xxx
message_sent,1,150,30,0,xxx
I only want to extract the lines having code message_received and client_id > 0, resulting in just the first line:
message_received,1,134,20,0,xxx
I want to use awk somewhat like:
awk '/message_received,[[:digit:]]>0/' mylog.csv
which I know isn't quite correct... but how do I achieve this in a one-liner?
This is probably what you want:
awk -F, '($1=="message_received") && ($2>0)' mylog.csv
If not, edit your question to clarify.
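For reference, running that on the sample input (assuming it is saved as mylog.csv) prints only the first line:
$ awk -F, '($1=="message_received") && ($2>0)' mylog.csv
message_received,1,134,20,0,xxx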

Comparing two huge files in Unix

The requirement is to compare two huge Unix files and write the differences to a third file based on a unique key (the first field). After trying a few options I got the command below:
awk 'FNR==NR{a[$0];next}!($0 in a)' hosts.csv masterlist.csv>results.csv
Though this gives the differences, if for a given field one file contains NULL (as a word) and the other contains an empty value or spaces, how do I ignore that in the command and compare the other fields?
I would also like to make a generic script or utility with such options; I don't need the code, just a suggestion would be helpful.
You can try this fix in your awk:
awk 'FNR==NR{if ($0 !~ /NULL|^ *$/){a[$0]}next}!($0 in a)' hosts.csv masterlist.csv>results.csv
As #fedorqui suggested in the comments, here's another alternative:
awk 'FNR==NR{if ($0 !~ /NULL/ && NF){a[$0]}next}!($0 in a)' hosts.csv masterlist.csv>results.csv
You could also try comparing them as binary data. If you serialize each file into a compact binary form, you can compare them quite rapidly. If there is a difference, you can then walk through the files and compare them using methods similar to git's; check its source code. Hope this helps.
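If all you want is a fast byte-level answer to whether the two files are identical at all (rather than the per-key differences above), the standard cmp utility already works on binary data; a minimal sketch:
cmp -s hosts.csv masterlist.csv && echo "files are identical" || echo "files differ"
The -s flag suppresses output, so the result is carried entirely in the exit status.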

Split ordered file in Linux

I have a large delimited file (with pipe '|' as the delimiter) which I have managed to sort (using Linux sort) by the first column (numeric), second column (numeric) and fourth column (string ordering, since it is a timestamp value). The file looks like this:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I was wondering if there is an easy way to split this file into multiple text files with an awk, sed, grep or perl one-liner whenever the first or the second column value changes. The final result for the example file should be 3 text files like this:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399

77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735

78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I could do that in Java of course, but I think it would be overkill if it can be done with a script. Also, would it be possible for the created filenames to use those two column values, something like 77_141.txt for the first file, 77_171.txt for the second and 78_115.txt for the third?
awk is very handy for this kind of problem. This can be one approach:
awk -F"|" '{print >> $1"_"$2".txt"}' file
Explanation
-F"|" sets the field separator to |.
{print >> something} appends each line to the file something.
($1"_"$2".txt") builds the output file name from the record itself: $1 is the first field based on the | separator (that is, 77, 78...) and $2 the second (141, 171...), giving names like 77_141.txt.
