How to split file based on first character in Linux shell - linux

I do have a fixedwidth flatfile with the header and detail data. Both of them can be recognized by the first character: 1 for header and 2 for detail.
I want to genrate 2 different files from my fixedwidth file , each file having it's own record set, but without type record written.
File Header.txt having only type 1 records.
File Detail.txt having only type 2 records.
Please Let me know how we can achieve this.
Example flatfile:
120190301,025712,FRANK,DURAND,USA
20257120023.12
20257120000.21
20257120191.45
120190301,025737,ERICK,SMITH,USA
20257370000.29
20257370326.41
120190301,025632,JOSEPH,SILVA,USA
20256320019.57
20256320029.12
20256320129.04
Desired Outputs:
Header.txt
20190301,025712,FRANK,DURAND,USA
20190301,025737,ERICK,SMITH,USA
20190301,025632,JOSEPH,SILVA,USA
Detail.txt
0257120023.12
0257120000.21
0257120191.45
0257370000.29
0257370326.41
0256320019.57
0256320029.12
0256320129.04

This first one is gawk-specific and works because in gawk "If the value [of FS] is the null string (""), then each character in the record becomes a separate field."
$ awk 'BEGIN {FS=""; f[1]="header.txt"; f[2]="detail.txt"}
{i=$1; sub(/^./,""); print > f[i]}' file
$ cat header.txt
20190301,025712,FRANK,DURAND,USA
20190301,025737,ERICK,SMITH,USA
20190301,025632,JOSEPH,SILVA,USA
$ cat detail.txt
0257120023.12
0257120000.21
0257120191.45
0257370000.29
0257370326.41
0256320019.57
0256320029.12
One that should work with any awk:
$ awk '/^1/ {f="header.txt"}
/^2/ {f="detail.txt"}
{sub(/^./,""); print > f}' file

awk '{if(/^1/){ sub(/^./,""); print > "Header.txt" }else{sub(/^./,""); print>"Detail.txt"}}' flatfile
If the first character of a line matches 1, strip the first character and write the line to Header.txt, otherwise strip the first character and write the line to Detail.txt.
Outputs:
cat Header.txt
20190301,025712,FRANK,DURAND,USA
20190301,025737,ERICK,SMITH,USA
20190301,025632,JOSEPH,SILVA,USA
And number two:
cat Detail.txt
0257120023.12
0257120000.21
0257120191.45
0257370000.29
0257370326.41
0256320019.57
0256320029.12
0256320129.04

To reduce IO, it redirect multi files to use "tee" command.
$ tee <All.txt >/dev/null \
>(sed -n '/^1/s/^1//p' >Header.txt) \
>(sed -n '/^2/s/^2//p' >Detail.txt)
$ cat Header.txt
20190301,025712,FRANK,DURAND,USA
20190301,025737,ERICK,SMITH,USA
20190301,025632,JOSEPH,SILVA,USA
$ cat Detail.txt
0257120023.12
0257120000.21
0257120191.45
0257370000.29
0257370326.41
0256320019.57
0256320029.12
0256320129.04

Related

extracting text from one file and creating new file with that text using linux/bash

I have a sequence file that has a repeated pattern that looks like this:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS"
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
and so on.
I want to extract the text between and including each >g## and create a new file titled protein_g##.faa
In the above example it would create a file called "protein_g34.faa" and it would be:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
I was trying to use sed but I am not very experienced using it. My guess was something like this:
$ sed -n '/^>g*/s///p; y/ /\n/' file > "g##"
but I can clearly tell that that is wrong... maybe the right thing is using awk?
Thanks!
Yeah, I would use awk for that. I don't think sed can write to more than one different output stream.
Here's how I would write that:
< input.txt awk '/^\$>/{fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} {print $0 > fname}'
Breaking it down into details:
< input.txt This part reads in the input file.
awk Runs awk.
/^\$>/ On lines which start with the literal string $>, run the piece of code in brackets.
(If previous step matched) {fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} Take the first field in the previous line. Remove the first two characters of that field. Surround that with protein_ .faa. Save it as the variable fname. Print a message about switching files.
This next block has no condition before it. Implicitly, that means that it matches every line.
{print $0 > fname} Take the entire line, and send it to the filename held by fname. If no file is selected, this will cause an error.
Hope that helps!
If awk is an option:
awk '/\|/ {split($1,a,">"); fname="protein_"a[2]".faa"} {print $0 >> fname}' src.dat
awk is better than sed for this problem. You can implement it in sed with
sed -rz 's/(\$>)(g[^ ]*)([^\n]*\n[^\n]*)\n/echo '\''\1\2\3'\'' > protein_\2.faa/ge' file
This solution is nice for showing some sed tricks:
-z for parsing fragments that span several lines
(..) for remembering strings
\$ matching a literal $
[^\n]* matching until end of line
'\'' for a single quote
End single quoted string, escape single quote and start new single quoted string
\2 for recalling the second remembered string
Write a bash command in the replacement string
e execute result of replacement
awk procedure
awk allows records to be extracted between empty (or white space only) lines by setting the record separator to an empty string RS=""
Thus the records intended for each file can be got automatically.
The id to be used in the filename can be extracted from field 1 $1 by splitting the (default white-space-separated) field at the ">" mark, and using element 2 of the split array (named id in this example).
The file is written from awk before closing the file to prevent errors is you have many lines to process.
The awk procedure
The example data was saved in a file named all.seq and the following procedure used to process it:
awk 'BEGIN{RS="";} {split($1,id,">"); fn="protein_"id[2]".faa"; print $0 > fn; close(fn)}' all.seq
tests results
(terminal listings/outputs)
$ ls
all.seq protein_g104.faa protein_g115.faa protein_g34.faa
$ cat protein_g104.faa
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$ cat protein_g115.faa
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
$ cat protein_g34.faa
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS"
Tested using GNU Awk 5.1.0

How to parse a specific column or data without losing content from other columns/rows after parsing?

I have the following output to grep the value in this case "225". This value is actually a variable $pd so it could change depending on users input" It could be integer numbers or an alphanumeric character case-insensitive exact match. Example if value of variable is "225" then a "0225" or "11225" its not a valid output from the file Im reading it.
Input File:
10.20.223.10|2000-H1|1/1/2|DeviceX_4021|LG
10.20.223.10|2000-H1|1/1/3|Undiscoverable|Unkwn
10.20.225.10|2000-H1|1/1/5|DeviceZ_2050|LG
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
10.20.223.10|2000-H1|1/1/8|DeviceY_01225_|Kenmore
10.20.225.10|2000-H1|1/1/8|DeviceY_2250_|Kenmore
Desired Output File:
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
If user input is "lg"; then it should output the line without not ignoring it because the input file has "lg" in uppercase. (This part is already fixed on the script).
Desired Output:
10.20.223.10|2000-H1|1/1/2|DeviceX_4021|LG
10.20.225.10|2000-H1|1/1/5|DeviceZ_2050|LG
$ awk -F'|' -v n='225' '$4 ~ n' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
or if you don't want a partial match (e.g. against 1225) then one way is:
$ awk -F'|' -v n='225' '$4 ~ ("(^|[^0-9])" n "([^0-9]|$)")' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
or:
$ awk -F'|' -v n='225' '$4 ~ ("(^|_)" n "(_|$)")' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
There are other possibilities too. The right solution depends on the requirements you haven't told us about and will pass or fail when using input other then you've shown us yet.
awk
awk -F"|" -v var="[A-Za-z].225_" '$4 ~ var{print}'
sed
sed -n '/[A-Za-z].225./p'
grep
grep '[A-Za-z].225.'
Output
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
Using sed:
sed -n '/^\([^|]*\|\)\{3\}[^|]*225/p' < input
Explanation:
the -n option disables automatic output at the end of each sed cycle
the pattern matches arbitrary contents of the first three (\{3\}) columns of data via the \(parenthesized\) pattern [^|]*\| -- any number of non-delmiter characters followed by the column delimiter
it matches additional input at the beginning of the fourth column, but not spanning columns, with a similar subexpression: [^|]*
then comes the literal text you want to match
the p command after the pattern causes the line to be printed to sed's output in the event that it matches the pattern
There's almost certainly an awk solution too, but in Perl it's this:
$ perl -aF'\|' -ne '$F[3] =~ 225 and print' < input
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
-a: Autosplit the input into array #F
-F'\|: Set the autosplit delimiter to |
-n: Run code for each line in the input file
-e: Here's the code to run
$F[3]: The 4th element of the autosplit array #F
=~: Regex match
and print: Print the input line if the regex matches
Update: You can get the string you're interested in from a command line parameter by assigning it in a BEGIN block.
$ perl -aF'\|' -ne 'BEGIN { $x = shift } $F[3] =~ $x and print' 225 < input

Extract lines containing two patterns

I have a file which contains several lines as follows:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>
I want to extract all lines containing both and including the header lines.
I have tried using grep, but it only extracts the sequence lines but not the header lines.
grep <pattern_1> | grep <pattern_2> input.fasta > output.fasta
How to extract lines containing both the patterns and the headers in Linux? The patterns can be present anywhere in the lines. Not limited to start or end of the lines.
Expected output:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
$ grep -A 1 header[12] file
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
man grep:
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
grep -B 1 pattern_[12]could work also, but you have several pattern_1s in the sample data so... not this time.
You can easily do that with awk like this:
awk '/^>/{h=$0;next}
/<pattern_1>/&&/<pattern_2>/{print h;print}' input.fasta > output.fasta
And here is a sed solution which yields the desired output as well:
sed -n '/^>/{N;/<pattern_1>/{/<pattern_2>/p}}' input.fasta > output.fasta
If it is likely that multiline records exist, you can use this:
awk -v pat1='<pattern_1>' -v pat2='<pattern_2>' '
/^>/ {r=$0;p=0;next}
!p {r=r ORS $0;if(chk()){print r;p=1};next}
p
function chk( tmp){
tmp=gensub(/\n/,"","g",r)
return (tmp~pat1&&tmp~pat2)
}' input.fasta > output.fasta
You might be interested in BioAwk, it is an adapted version of awk which is tuned to process fasta files
bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
'($seq ~ seq1) && ($seq ~ seq2) { print ">"$name; print $seq }' file.fasta
If you want seq1 at the beginning and seq2 at the end, you can change it into:
bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
'($seq ~ "^"seq1) && ($seq ~ seq2"$") { print ">"$name; print $seq }' file.fasta
This is really practical for processing fasta files, as often the sequence is spread over multiple lines. The above code handles this very easily as the variable $seq contains the full sequence.
If you do not want to install BioAwk, you can use the following method to process your FASTA file. It will allow multi-line sequences and does the following:
read a single record at a time (this assumes no > in the header, except the first character)
extract the header from the record and store it in name (not really needed)
merge the full sequence in a single string of characters, removing all newlines and spaces. This ensures that searching for pattern1 or pattern2 will not fail if the pattern is split over multiple lines.
if a match is found, print the record.
The following awk does the requested:
awk -v seq1="pattern1" -v seq2="pattern2" \
'BEGIN{RS=">"; ORS=""; FS="\n"}
{ seq="";for(i=2;i<=NF;++i) seq=seq""$i; gsub(/[^a-zA-Z0-9]/,"",seq) }
(seq ~ seq1 && seq ~ seq2){print ">" $0}' file.fasta
If the record header contains other > characters which are not at the beginning of the line, you have to take a slightly different approach (unless you use GNU awk)
awk -v seq1="pattern1" -v seq2="pattern2" \
'/^>/ && (seq ~ seq1 && seq ~ seq2) {
print name
for(i=0;i<n;i++) print aseq[i]
}
/^>/ { seq=""; delete aseq; n=0; name=$0; next }
{ aseq[n++] = $0; seq=seq""$0; sub(/[^a-zA-Z0-9]*$/,"",seq) }
END { if (seq ~ seq1 && seq ~ seq2) {
print name
for(i=0;i<n;i++) print aseq[i]
}
}' file.fasta
note: we make use of sub here in case unexpected characters are introduced in the fasta file (eg. spaces/tabs or CR (\r))
Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language",
by Al Aho, Brian Kernighan, and Peter Weinberger
(Addison-Wesley, 1988, ISBN 0-201-07981-X)
. I'm not sure if this version is compatible with POSIX.
If you want grep to print lines around the match, use the -B flag for lines before, the -A for lines after, and -C for both before and after the match.
In your case, grep -B 1 seems like it would do the job.
If your input file is exactly as described in your post then you can use:
grep -B1 '^<pattern_1>.*<pattern_2>$' input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
Where -B1 will display on top of the matching lines the line before it. The regex used is based on the hypothesis that your 2 patterns are located in the exact order at the beginning and at the end of the line. If this is not the case: use '.*<pattern_1>.*<pattern_2>.*'. Last but not least, if the order of the 2 patterns are not always respected then you can use: '^.*<pattern_1>.*<pattern_2>.*$\|^.*<pattern_2>.*<pattern_1>.*$'
On the following input file:
cat input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header2b
<pattern_2>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_1>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>
output:
grep -B1 '^.*<pattern_1>.*<pattern_2>.*$\|^.*<pattern_2>.*<pattern_1>.*$' input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header2b
<pattern_2>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_1>

Line Breaks in Unix File [duplicate]

This question already has an answer here:
Fixing broken csv files using awk
(1 answer)
Closed 6 years ago.
I have a file with records delimited by |. There are a few line breaks where a part of the first line moves into the second line. If I calculate the number of | in a particular line, it should be consistent throughout. How do I identify which line has a line break as such and append two lines into one so as the number of '|' in each line is consistent throughout?
The file is something like below:
DeptID|EmpFName|EmpLName|Salary
Engg|Sam|Le
wis|1000
Engg|Smith|Davis|2000
HR|Denis
|Lillie|1500
HR|Danny|Borr
inson|3000
IT|David|Letterman|2000
IT|John|Newman|3000
whereas I want to calculate the number of '|' in each line.
In this case, each line should have 3 '|' each, but due to line breaks, that is not the case,
My final desired output is
DeptID|EmpFName|EmpLName|Salary
Engg|Sam|Lewis|1000
Engg|Smith|Davis|2000
HR|Denis|Lillie|1500
HR|Danny|Borrinson|3000
IT|David|Letterman|2000
IT|John|Newman|3000
One in awk:
$ cat foo.awk
BEGIN { FS=OFS="|" } # set separators
NR==1 { nf=NF } # expect the field count to be correct on header record
NF<nf { # if NF less than on header record
while (NF<nf) { # and while NF < less than on header record
b=$0 # buffer too short record
getline # read next record
$0 = b $0 # catenate buffer and fresh record
}
} 1 # output
Run it:
$ awk -f foo.awk foo
DeptID|EmpFName|EmpLName|Salary
Engg|Sam|Lewis|1000
Engg|Smith|Davis|2000
HR|Denis|Lillie|1500
HR|Danny|Borrinson|3000
IT|David|Letterman|2000
IT|John|Newman|3000
No checks if record grows too long.
Given that at max the split is across two lines as stated by OP in question, sed can be used for an easy solution
$ cat ip.txt
DeptID|EmpFName|EmpLName|Salary
Engg|Sam|Le
wis|1000
Engg|Smith|Davis|2000
HR|Denis
|Lillie|1500
HR|Danny|Borr
inson|3000
IT|David|Letterman|2000
IT|John|Newman|3000
$ sed '/.*|.*|.*|/! {N; s/\n//}' ip.txt
DeptID|EmpFName|EmpLName|Salary
Engg|Sam|Lewis|1000
Engg|Smith|Davis|2000
HR|Denis|Lillie|1500
HR|Danny|Borrinson|3000
IT|David|Letterman|2000
IT|John|Newman|3000
/.*|.*|.*|/! if line doesn't contain three |
{N; s/\n//} get next line and remove first \n
Use grouping and quantifier to specify a number instead
sed '/\(.*|\)\{3\}/! {N; s/\n//}' ip.txt
with extended regex, -E or -r
sed -E '/(.*\|){3}/! {N; s/\n//}' ip.txt

how to replace specific record on a line containg a string with a number from another file using inplace editing sed in linux

I have an input file like following.
R sfst 1000.0000
$ new time step for mass scaled calculation
R dt2ms -4.000E-7
$ friction value for blank
R mue 0.120000
$ blankholder force
R bhf 2.0000E+5
$ simulation time
R endtime 0.150000
i want to change the value on the line containing 'mue'
with following I can read it but cant change it.
awk ' /mue/ { print $3 } ' input.txt
The value is to be taken from another file fric.txt.
fric.txt contains only numbers, one on each line .
fric.txt has data like
0.1234
0.234
0.0234
.
.
Blockquote
It should be noted that ONLY the FIRST instance need to be replaced and the format i.e. white spacing be kept cosntant.
Blockquote
Can anybody guide me doing this using sed or awk?
Try this command:
$ awk '/mue/ && !seen {getline $3 <"fric.txt"; seen=1} 1' input.txt
This might work for you:
sed '/\<mue\>/!d;=;s/.* \([^ ]\+\).*/\1/;R fric.txt' input.txt |
sed 'N;N;s|\n|s/|;s|\n|/|;s|$|/|;q' >temp.sed
sed -i -f temp.sed input.txt
You can do it with a sed in the sed (assuming you like to take line 1 from fric.txt):
sed -ir 's/(.*mue[ \t]+)[0-9.]+(.*)/\1'$(sed -n '1{p;q}' fric.txt)'\2/' input.txt

Resources