How to split a file into two files based on a pattern? - linux

In a file in Linux I have the following
123_test
234_test
abc_rest
cde_rest
and so on
Now I want to get two files in Linux: one which contains only records like below
123_test
234_test
and a 2nd file like below
abc_rest
cde_rest
I want to split the file based on what comes after the _, like _test or _rest.
Edited:
123_test
234_test
abc_rest
cde_rest
456_test
fgh_rest
How can I achieve that in Linux?
Can we use split function for this?

You can use this single awk command for splitting:
awk '{ print > (/_test$/ ? "file1" : "file2") }' file
This awk command will copy all lines ending with _test to file1 and the remaining lines to file2.
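With the edited sample input, the two output files should end up as:
$ cat file1
123_test
234_test
456_test
$ cat file2
abc_rest
cde_rest
fgh_rest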

Related

Split flat file and add delimiter in Linux

I would like to know how to improve some code that I have.
My shell script reads a flat file and splits it into two files, header and detail, based on the first char of each line. For header the first char is 1 and for detail it is 2. The split files do not include the first char.
Header is delimited by "|", and detail is fixed-width, so I add the delimiter to it later.
What I want is to do this in one single awk, to avoid creating a tmp file.
For splitting the file I use one awk command, and for adding the delimiter another awk command.
This is what I have now:
Input=Input.txt
Header=Header.txt
DetailTmp=DetailTmp.txt
Detail=Detail.txt
#First I split in two files and remove first char
awk -v vFileHeader="$Header" -v vFileDetail="$DetailTmp" '/^1/ {f=vFileHeader} /^2/ {f=vFileDetail} {sub(/^./,""); print > f}' $Input
#Then, I add the delimiter to detail
awk '{OFS="|"};{print substr($1,1,10),substr($1,11,5),substr($1,16,2),substr($1,18,14),substr($1,32,4),substr($1,36,18),substr($1,54,1)}' $DetailTmp > $Detail
Any suggestion?
Input.txt file
120190301|0170117174|FRANK|DURAND|USA
2017011717400052082911070900000000000000000000091430200
120190301|0170117204|ERICK|SMITH|USA
2017011720400052082911070900000000000000000000056311910
Header.txt after splitting
20190301|0170117174|FRANK|DURAND|USA
20190301|0170117204|ERICK|SMITH|USA
DetailTmp.txt after splitting
017011717400052082911070900000000000000000000091430200
017011720400052082911070900000000000000000000056311910
017011727100052052911070900000000000000000000008250000
017011718200052082911070900000000000000000000008102500
017011726300052052911070900000000000000000000008250000
Detail.txt desired
0170117174|00052|08|29110709000000|0000|000000000009143020|0
0170117204|00052|08|29110709000000|0000|000000000005631191|0
0170117271|00052|05|29110709000000|0000|000000000000825000|0
0170117182|00052|08|29110709000000|0000|000000000000810250|0
0170117263|00052|05|29110709000000|0000|000000000000825000|0
just combine the scripts
$ awk -v OFS='|' '/^1/{print substr($0,2) > "header"}
/^2/{print substr($0,2,10),substr($0,12,5),... > "detail"}' file
However, you may be better off using FIELDWIDTHS on the detail file in the second pass.
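For reference, here is a fully expanded sketch of that combined command, keeping the file names and field widths from the question; note that the substr offsets are shifted by one compared to the second-pass awk, because the leading record-type character has not been stripped yet.
awk -v OFS='|' '
/^1/ {print substr($0,2) > "Header.txt"}
/^2/ {print substr($0,2,10), substr($0,12,5), substr($0,17,2), substr($0,19,14), substr($0,33,4), substr($0,37,18), substr($0,55,1) > "Detail.txt"}
' Input.txt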

How to Grep the complete sequences containing a specific motif in a fasta file?

How can I grep the complete sequences containing a specific motif in a fasta or txt file with one Linux command and write them into another file? I also want to include the lines beginning with a ">" before these target sequences.
Example: I have a fasta file of 10,000 sequences.
$cat file.fa
>name1
AEDIA
>name2
ALKME
>name3
AAIII
I want to grep sequences containing KME, so I should get:
>name2
ALKME
Below is the approach I am currently using, based on the answers I got. Maybe others will find it helpful. Thanks to Pierre Lindenbaum, Philipp Bayer, cpad0112 and batMan.
First, preprocess the fasta file to get each sequence onto a single line (which is very important):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa > file1.fa
Then get rid of the first empty line:
tail -n +2 file1.fa > file2.fa
Finally, extract the target sequences containing the substring, including their names, and save them to another file:
LC_ALL=C grep -B 1 KME file2.fa > result.txt
Note: KME is used here as an example target substring.
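The same three steps can also be chained into a single pipeline, without the intermediate file1.fa and file2.fa:
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' file.fa \
  | tail -n +2 \
  | LC_ALL=C grep -B 1 KME > result.txt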
If you have multiline fasta files, first linearize with awk, then use another awk to filter the sequences containing the motif. Using grep would be dangerous if a sequence name contains the short motif.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{if(index($2,"KME")!=0) printf("%s\n%s\n",$1,$2);}'
grep -B1 KME file > output_file
-B1 : prints 1 line before the match as well
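For the single-line example file above this prints:
$ grep -B1 KME file.fa
>name2
ALKME
Note that this only works if each sequence is already on one line; for multiline fasta files, linearize first as shown above.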

Linux CSV - Add a colum from a CSV file to another CSV File

I'm struggling to create a CSV file from two other ones.
Here's what I need.
File I want (lots of other lines):
"AB";"A";"B";"C";"D";"E"
Files I have:
File 1:
"A";"B";"C";"D";"E"
File 2:
"AB";"C";"D";"E"
How can I simply add "AB" from File 2 to the 1st position of the 1st one, adding one ";"?
Thanks for your help
You can use awk as below. This assumes that you have only the ; character as the field separator and that it is not used anywhere else in the CSV file.
$ awk -F\; '{print $2}' file.csv
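Building on that, a minimal sketch of prepending File 2's first column to File 1 (assuming the rows of the two files correspond one-to-one; file1.csv, file2.csv and combined.csv are example names):
awk -F';' 'NR==FNR {ab[FNR]=$1; next} {print ab[FNR] ";" $0}' file2.csv file1.csv > combined.csv
The first pass stores File 2's first field per row number, and the second pass prints it in front of each line of File 1.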

Generate record of files which have been removed by grep as a secondary function of primary command

I asked a question here to remove unwanted lines which contained strings which matched a particular pattern:
Remove lines containg string followed by x number of numbers
anubhava provided a good line of code which met my needs perfectly. This code removes any line which contains the string vol followed by a space and three or more consecutive numbers:
grep -Ev '\bvol([[:blank:]]+[[:digit:]]+){2}' file > newfile
The command will be used on a fairly large csv file and will be initiated by crontab. For this reason, I would like to keep a record of the lines this command removes, just so I can go back and check that the correct data is being removed. I guess it will be some sort of log containing the names of the lines that did not make the final cut. How can I add this functionality?
Drop grep and use awk instead:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print >> "deleted"; next} 1' file
The above uses GNU awk for word delimiters (\<) and will append every deleted line to a file named "deleted". Consider adding a timestamp too:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file
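If you also want the surviving lines written to a new file, as the original grep > newfile did, redirect stdout as well; a sketch with the same file names:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file > newfile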

find and replace line strings contained in one file to a second file in shell script

I'm trying to find a solution to the following problem:
I have two files: i.e. file1 and file2.
In file1 there are some lines with key words, and I want to find these lines in file2 by using the key words. Once the key words are found in file2, I would like to update that line with the content of the corresponding line in file1. This operation should be done for every line contained in file1.
Here is an example of what I have in mind, but I don't know exactly how to translate it into a shell script command.
file1:
key1=new_value1
key2=new_value2
key3=new_value3
etc....
file2:
key1=value1
key2=value2
key3=value3
key4=value4
key5=value5
key6=value6
etc....
Result:
key1=new_value1
key2=new_value2
key3=new_value3
key4=value4
key5=value5
key6=value6
etc....
I don't know how I can use 'sed' or something else in a shell script to accomplish this task.
Any help is welcomed.
Thank you
awk would be my first choice
awk -F= -v OFS== '
NR==FNR {new[$1]=$2; next}
$1 in new {$2=new[$1]}
{print}
' file1 file2
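awk cannot edit file2 in place, so to actually apply the change you could redirect to a temporary file and move it back (GNU awk 4.1+ also offers -i inplace); a sketch with the same file names:
awk -F= -v OFS='=' 'NR==FNR {new[$1]=$2; next} $1 in new {$2=new[$1]} {print}' file1 file2 > file2.new && mv file2.new file2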
