Splitting comma set files - linux

I have a file with lines patterned like below:
12345343|559|-2,0,-200000,-20|20161108000000|FL|62,859,1439,1956|0,0,21300,0|S
7778880|123|500,100|20161108000000|AL|21,135|3|S
I'm looking for a way to separate each line into multiple records, pairing up the comma-separated values of the 3rd and 6th fields.
Required output:
12345343|559|-2|20161108000000|FL|62|0,0,21300,0|S
12345343|559|0|20161108000000|FL|859|0,0,21300,0|S
12345343|559|-200000|20161108000000|FL|1439|0,0,21300,0|S
12345343|559|-20|20161108000000|FL|1956|0,0,21300,0|S
7778880|123|500|20161108000000|AL|21|3|S
7778880|123|100|20161108000000|AL|135|3|S

This might work for you (GNU sed):
sed -r 's/^(.*\|.*\|)([^,]*),([^|]*)(\|.*\|.*\|)([^,]*),([^|]*)(.*)/\1\2\4\5\7\n\1\3\4\6\7/;P;D' file
Iteratively split the current line into pieces, building two lines separated by a newline. The first line contains the heads of the 3rd and 6th fields, the second line contains the tails of the 3rd and 6th fields. Print and then delete the first of the two lines, and repeat until the lists in the 3rd and 6th fields are consumed.

You can use this awk:
awk -F'|' -vOFS='|' '{a="";b=split($3,c,",");split($6,d,",");for(e=1;e<=b;e++){if(a)a=a RS;$3=c[e];$6=d[e];a=a$0};print a}' infile
Explanation:
awk -F'|' -vOFS='|' '     # fields are separated by | for input and output
{
  a = ""
  b = split($3, c, ",")   # split field 3 into array c
                          # b is the number of elements
  split($6, d, ",")       # split field 6 into array d
  for (e = 1; e <= b; e++) # for each element of arrays c and d
  {
    if (a)                # if a already holds a line, append RS (\n) first
      a = a RS
    $3 = c[e]
    $6 = d[e]             # substitute fields 3 and 6 with the values from c and d
    a = a $0              # append the complete line to a
  }
  print a                 # at the end of the loop print a
}
' infile
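The same idea may read more easily as a standalone script; this is just a sketch of the answer above (the file name split36.awk is my own), printing each rebuilt record directly instead of accumulating it in a, and it assumes fields 3 and 6 always hold the same number of elements:
# split36.awk - one output record per element of fields 3 and 6
BEGIN { FS = OFS = "|" }
{
  n = split($3, c, ",")   # elements of the 3rd field
  split($6, d, ",")       # elements of the 6th field
  for (i = 1; i <= n; i++) {
    $3 = c[i]
    $6 = d[i]
    print                 # record is rebuilt with OFS="|"
  }
}
Run it as awk -f split36.awk infile.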

Related

Remove n lines after nth occurrence of a pattern and paste content of a file after nth occurrence in unix

I have a file that contains two repeated strings, which are my markers. The marker is "(CO)VARIANCES" and each occurrence is followed by 3 different lines. I want to remove the 3 lines after the first marker and replace them with the content of another file. The structure of the file is like:
cat file1.txt
Something1
Something2
Something3
(CO)VARIANCES
44.572 0.28723E-01 0.0000
0.28723E-01 0.64501E-03 0.0000
0.0000 0.0000 0.0000
Something4
Something5
Something6
(CO)VARIANCES
34.891 0.38642E-01 1.7538
0.38642E-01 0.17122E-02 0.54735E-02
1.7538 0.54735E-02 0.23285
I want to remove the three lines after (CO)VARIANCES and replace them with the content of another file. The command
sed -e '/(CO)VARIANCES/{n;N;N;d}' file1.txt
removes the 3 lines after both markers, and I don't know how to restrict this command to a particular occurrence. I also don't know how to conditionally paste the content of the second file after those markers. Does somebody have an idea?
To perform the replacement on the Nth occurrence, consider the following script (set N to the occurrence you want):
awk -v N=3 '
# Delete lines while the counter is active
DEL > 0 { DEL--; next }
# Find the Nth matching line
$1 == "(CO)VARIANCES" && (--N) == 0 {
    print
    DEL = 3
    # Include the second file here.
    while ((getline v < "file2.txt") > 0) print v
    next
}
# Print all other lines
{ print }
'
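As a usage sketch, if the program above is saved as replace-nth.awk (the name is mine), replacing the block after the 2nd marker in file1.txt would be:
awk -v N=2 -f replace-nth.awk file1.txt > file1.new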
This might work for you (GNU sed):
sed -En '/\(CO\)VARIANCES/{n;:a;s/[^\n]*/&/3;tb;N;ba;:b;p}' file2 |
sed -Ee '/\(CO\)VARIANCES/{n;:a;R /dev/stdin' -e 's/[^\n]*/&/3;tb;N;ba;:b;d}' file1
The solution consists of two sed invocations piped together.
The first sed invocation prepares a stream consisting of the lines from file2 that follow the marker (in this case 3 lines).
The second sed invocation reads in the required number of lines from the first invocation and deletes the same number of lines from file1 following the marker.

sed - Delete lines only if they contain multiple instances of a string

I have a text file that contains numerous lines that have partially duplicated strings. I would like to remove lines where a string match occurs twice, such that I am left only with lines with a single match (or no match at all).
An example output:
g1: sample1_out|g2039.t1.faa sample1_out|g334.t1.faa sample1_out|g5678.t1.faa sample2_out|g361.t1.faa sample3_out|g1380.t1.faa sample4_out|g597.t1.faa
g2: sample1_out|g2134.t1.faa sample2_out|g1940.t1.faa sample2_out|g45.t1.faa sample4_out|g1246.t1.faa sample3_out|g2594.t1.faa
g3: sample1_out|g2198.t1.faa sample5_out|g1035.t1.faa sample3_out|g1504.t1.faa sample5_out|g441.t1.faa
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
In this case I would like to remove lines 1, 2, and 3 because sample1 appears multiple times on line 1, sample2 appears twice on line 2, and sample5 appears twice on line 3. Line 4 would pass because it contains only one instance of each sample.
I am okay with repeating this operation multiple times using different 'match' strings (e.g. sample1_out, sample2_out, etc. in the example above).
Here is one in GNU awk:
$ awk -F"[| ]" '{ # pipe or space is the field reparator
delete a # delete previous hash
for(i=2;i<=NF;i+=2) # iterate every other field, ie right side of space
if($i in a) # if it has been seen already
next # skit this record
else # well, else
a[$i] # hash this entry
print # output if you make it this far
}' file
Output:
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
The following sed command will accomplish what you want.
sed -ne '/.* \(.*\)|.*\1.*/!p' file.txt
With grep:
grep -vE '(sample[0-9]).*\1' file
Inspired by Glenn's answer: use -i with sed to make the changes directly in the file.
sed -r '/(sample[0-9]).*\1/d' txt_file
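The in-place form the answer refers to would look like this (GNU sed; the .bak suffix is my addition and keeps a backup copy):
sed -r -i.bak '/(sample[0-9]).*\1/d' txt_file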

Merge values for same key

Is it possible to use awk to merge values with the same key into one row?
For instance
a,100
b,200
a,131
a,102
b,203
b,301
Can I convert them to a file like this:
a,100,131,102
b,200,203,301
You can use awk like this:
awk -F, '{a[$1] = a[$1] FS $2} END{for (i in a) print i a[i]}' file
a,100,131,102
b,200,203,301
We use -F, to set comma as the field delimiter and use array a to keep the aggregated values.
Reference: Effective AWK Programming
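Note that for (i in a) visits the keys in no particular order. If you want the output sorted by key, one GNU-awk-specific variation (a sketch relying on gawk's PROCINFO["sorted_in"]):
awk -F, '{a[$1] = a[$1] FS $2}
     END{
       PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk only: iterate keys in ascending string order
       for (i in a) print i a[i]
     }' file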
If Perl is an option,
perl -F, -lane '$a{$F[0]} = "$a{$F[0]},$F[1]"; END{for $k (sort keys %a){print "$k$a{$k}"}}' file
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
-e execute the perl code
-F autosplit modifier, in this case splits on ,
@F is the array of words in each line, indexed starting with $F[0]
$F[0] is the first element in @F (the key)
$F[1] is the second element in @F (the value)
%a is a hash which stores a string containing all matches of each key
tl;dr
If you presort the input, it is possible to use sed to join the lines, e.g.:
sort foo | sed -nE ':a; $p; N; s/^([^,]+)([^\n]+)\n\1/\1\2/; ta; P; s/.+\n//; ba'
A bit more explanation
The above one-liner can be saved into a script file. See below for a commented version.
parse.sed
# A goto label
:a
# Always print when on the last line
$p
# Read one more line into pattern space and join the
# two lines if the key fields are identical
N
s/^([^,]+)([^\n]+)\n\1/\1\2/
# Jump to label 'a' and redo the above commands if the
# substitution command was successful
ta
# Assuming sorted input, we have now collected all the
# fields for this key, print it and move on to the next
# key
P
s/.+\n//
ba
The logic here is as follows:
1. Assume sorted input.
2. Look at two consecutive lines. If their key fields match, remove the key from the second line and append the value to the first line.
3. Repeat 2 until key matching fails.
4. Print the collected values and reset to collect values for the next key.
Run it like this:
sort foo | sed -nEf parse.sed
Output:
a,100,102,131
b,200,203,301
With datamash
$ datamash -st, -g1 collapse 2 <ip.txt
a,100,131,102
b,200,203,301
From manual:
-s, --sort
sort the input before grouping; this removes the need to manually pipe the input through 'sort'
-t, --field-separator=X
use X instead of TAB as field delimiter
-g, --group=X[,Y,Z]
group via fields X,[Y,Z]
collapse
comma-separated list of all input values

changing a character of columns of a large file

I have to replace the nth character of every other row of a large file with the corresponding line of another file. For example, below I am changing every 5th character.
file1:
>chr1:101842566-101842576
CCTCAACTCA
>chr1:101937281-101937291
GAATTGGATA
>chr1:101964276-101964286
AAAAAATAGG
>chr1:101972950-101972960
ggctctcatg
>chr1:101999969-101999979
CATCATGACG
file2:
G
A
T
A
C
output:
>chr1:101842566-101842576
CCTCGACTCA
>chr1:101937281-101937291
GAATAGGATA
>chr1:101964276-101964286
AAAATATAGG
>chr1:101972950-101972960
ggctAtcatg
>chr1:101999969-101999979
CATCCTGACG
The number of characters in each (alternate) row can be large, and the number of rows is large too. How can this be done efficiently?
Here is one way with awk:
awk 'NR==FNR{a[NR]=$1;next}!/^>/{$1=substr($1,1,n-1) a[++i] substr($1,n+1)}1' n=5 f2 f1
Explanation:
We iterate over f2 (the first file on the command line) and store it in an array indexed by line number.
Once f2 is loaded into memory, we move on to f1.
We look for lines not starting with >.
When found, we substitute the value from our array using the substr function.
The variable n lets you choose which character (the nth) is replaced.
Every line is then printed by the trailing 1, which is a true condition invoking awk's default print action, so the > header lines pass through unchanged.
This solution assumes the format of the file is as shown above: that is, f1 always has a > header line followed by the line you want to change. Substitutions from the second file are made in the order they are seen.
Demo:
Every 5th character:
$ awk 'NR==FNR{a[NR]=$1;next}!/^>/{$1=substr($1,1,n-1) a[++i] substr($1,n+1)}1' n=5 f2 f1
>chr1:101842566-101842576
CCTCGACTCA
>chr1:101937281-101937291
GAATAGGATA
>chr1:101964276-101964286
AAAATATAGG
>chr1:101972950-101972960
ggctAtcatg
>chr1:101999969-101999979
CATCCTGACG
Every 3rd character:
$ awk 'NR==FNR{a[NR]=$1;next}!/^>/{$1=substr($1,1,n-1) a[++i] substr($1,n+1)}1' n=3 f2 f1
>chr1:101842566-101842576
CCGCAACTCA
>chr1:101937281-101937291
GAATTGGATA
>chr1:101964276-101964286
AATAAATAGG
>chr1:101972950-101972960
ggAtctcatg
>chr1:101999969-101999979
CACCATGACG
This is how I would use perl. First read all of file2 into an array, then iterate over that array, reading two lines at a time from file1: print the first line unmodified, then change the 5th character of the second line:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
#use Data::Printer;

# Read all of file2
my $lines;
open(FILE, $ARGV[1]);
{
    local $/;
    $lines = <FILE>;
}
close(FILE);
my @new_chars = split(/\n/, $lines);

# Read and process file1
open(FILE, $ARGV[0]);
foreach my $new_char (@new_chars) {
    # >chr1:101842566-101842576
    my $line = <FILE>;
    print $line;
    # CCTCAACTCA
    $line = <FILE>;
    $line =~ s/^(....)./$1$new_char/;  # Replace 5th character
    print $line;
}
close(FILE);
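Assuming the script is saved as, say, replace5th.pl (the name is my own), it takes file1 and file2 as arguments and writes the result to standard output:
perl replace5th.pl file1 file2 > output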
You could replace a column in the file inplace using mmap in Python:
#!/usr/bin/env python3
"""Replace inplace a column of a large file.

Usage:
    $ ./replace-inplace file1 file2 5
"""
import sys
from mmap import ACCESS_WRITE, mmap


def main():
    ncolumn = int(sys.argv[3]) - 1  # 1st column is 1
    with open(sys.argv[1], 'r+b') as file1:
        with mmap(file1.fileno(), 0, access=ACCESS_WRITE) as mm:
            with open(sys.argv[2], 'rb') as file2:
                while True:
                    mm.readline()  # ignore every other line
                    pos = mm.tell()  # remember current position
                    if not mm.readline():  # EOF
                        break
                    replacement = file2.readline().strip()[0]
                    mm[pos + ncolumn] = replacement  # replace the column


main()
It assumes that you are replacing a byte with a byte i.e., no content is moved in the file.
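Because the script edits file1 in place, you may prefer to work on a copy; for example (the copy name is my own):
cp file1 file1.edit && ./replace-inplace file1.edit file2 5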
This might work for you (GNU sed, paste and cat):
cat file1 | paste -d\\n\\t\\n - file2 - | sed -r 's/^(.)\t(.{4})./\2\1/' >file3
Embed the data from file2 into file1 and then re-arrange it.

Scalable way of deleting all lines from a file where the line starts with one of many values

Given an input file of variable values (example):
A
B
D
What is a script to remove all lines from another file which start with one of the above values? For example, the file contents:
A
B
C
D
Would end up being:
C
The input file is of the order of 100,000 variable values. The file to be mangled is of the order of several million lines.
awk '
NR==FNR { # IF this is the first file in the arg list THEN
list[$0] # store the contents of the current record as an index of array "list"
next # skip the rest of the script and so move on to the next input record
} # ENDIF
{ # This MUST be the second file in the arg list
for (i in list) # FOR each index "i" in array "list" DO
if (index($0,i) == 1) # IF "i" starts at the 1st char on the current record THEN
next # move on to the next input record
}
1 # Specify a true condition and so invoke the default action of printing the current record.
' file1 file2
An alternative approach to building up an array and then doing a string comparison on each element would be to build up a Regular Expression, e.g.:
...
list = list "|" $0
...
and then doing an RE comparison:
...
if ($0 ~ list)
next
...
but I'm not sure that'd be any faster than the loop and you'd then have to worry about RE metacharacters appearing in file1.
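For reference, here is one way that RE-based sketch could be assembled, anchored at the start of the line (my own completion, untested at the question's scale, and it assumes file1 contains no RE metacharacters):
NR == FNR {                              # first file: build one big alternation of values
    list = (list == "" ? "" : list "|") $0
    next
}
$0 !~ ("^(" list ")")                    # second file: keep lines whose start matches none of the values
Run it as awk -f prefixes.awk file1 file2 (the script name is mine).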
If all of your values in file1 are truly single characters, though, then this approach of creating a character list to use in an RE comparison might work well for you:
awk 'NR==FNR{list = list $0; next} $0 !~ "^[" list "]"' file1 file2
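With the sample files from the question this leaves only the line that starts with none of the values:
$ awk 'NR==FNR{list = list $0; next} $0 !~ "^[" list "]"' file1 file2
C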
You can also achieve this using egrep:
egrep -vf <(sed 's/^/^/' file1) file2
Let's see it in action:
$ cat file1
A
B
$ cat file2
Asomething
B1324
C23sd
D2356A
Atext
CtestA
EtestB
Bsomething
$ egrep -vf <(sed 's/^/^/' file1) file2
C23sd
D2356A
CtestA
EtestB
This would remove lines that start with one of the values in file1.
You can use comm to display the lines that are not common to both files, like this:
comm -3 file1 file2
Will print:
C
Notice that for this to work, both files have to be sorted; if they aren't, you can get around that using
comm -3 <(sort file1) <(sort file2)
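Note that comm -3 also shows lines that appear only in file1 (in a separate column); to keep just the lines unique to file2, suppress columns 1 and 3 instead:
comm -13 <(sort file1) <(sort file2)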
