I have a very large file with the following basic format, with a number of additional fields:
posA,id1,id2,posB,id3,name,(n additional fields)
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25;ENST76;ENST35,ENSP91;ENSP77;ENSP78,515;544;544,ENSG765,Gene2
3,ENST25;ENST76;ENST35,ENSP91;ENSP77;ENSP78,515;544;544,ENSG765,Gene2
4,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
5,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
6,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
The first data row (posA=1) has a single entry in each column and does not need to be modified. Other rows have a variable number of entries in some columns. In the row with posA=2, for example, the first entry for "id1" (ENST25) is paired with the first entry for "id2" (ENSP91) and the first entry for "posB" (515), and so on, while the single-entry columns (e.g., "posA", "id3", "name") apply to all of the paired entries in columns 2-4. Occasionally, fields other than columns 2-4 also contain multiple entries.
I want to split the columns with multiple entries into separate lines, while retaining the data from the other columns, like so:
posA,id1,id2,posB,id3,name,(n additional fields)
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25,ENSP91,515,ENSG765,Gene2
2,ENST76,ENSP77,544,ENSG765,Gene2
2,ENST35,ENSP78,544,ENSG765,Gene2
3,ENST25,ENSP91,515,ENSG765,Gene2
3,ENST76,ENSP77,544,ENSG765,Gene2
3,ENST35,ENSP78,544,ENSG765,Gene2
4,ENST54,ENSP83,1864,ENSG48,Gene3
4,ENST93,ENSP36,722,ENSG48,Gene3
...
What is the best approach for this problem?
Thanks!
Assuming, as in your example, that each compound attribute holds at most two entries, you can accomplish what you intend fairly easily with simple parameter expansion and substring removal, e.g.:
#!/bin/bash
while IFS=, read -r p a1 a2 a3; do
    [[ $a1 =~ ';' ]] && {
        printf "%s,%s,%s,%s\n" "$p" "${a1%;*}" "${a2%;*}" "$a3"
        printf "%s,%s,%s,%s\n" "$p" "${a1#*;}" "${a2#*;}" "$a3"
    } || printf "%s,%s,%s,%s\n" "$p" "$a1" "$a2" "$a3"
done < "$1"
Here [[ $a1 =~ ';' ]] checks for a ';' in $a1. If one is found, the first attribute in $a1 and $a2 is picked off with ${a1%;*} and ${a2%;*}, and the second attribute in each with ${a1#*;} and ${a2#*;}. If no ';' is contained in $a1, the attributes are printed unchanged. IFS=, ensures the fields are word-split on ','.
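For a quick illustration of the two expansions on a sample compound value (a sketch; the value c;d is hypothetical):

$ a1="c;d"
$ echo "${a1%;*}"    # remove shortest suffix matching ';*' -> first entry
c
$ echo "${a1#*;}"    # remove shortest prefix matching '*;' -> second entry
d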
(Note: you should add validation that the filename is valid, etc., to your final script. You can also use echo instead of printf if you like.)
Example Use/Output
$ splitattrib.sh file
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c,e,+
2,d,f,+
The best approach is to break it into three parts.
You have 3 line patterns (counting fields after splitting on both , and ;): one has 6 columns, another has 12, and the last has 9.
6 columns => 1 output line
12 columns => 3 output lines
9 columns => 2 output lines
Your 6-column lines should not be modified, which leaves the 12- and 9-column cases. You can separate them with if, else if, and else, like:
if( column == 6 ){...}
else if( column == 12 ){...}
else {...}
And here is a Perl one-liner solution:
perl -a -F",|;" -lne '$s=scalar #F;if($s==6){print join ",",#F}elsif($s==12){print join",",#F[0,1,4,7,-2,-1];print join",",#F[0,1,5,8,-2,-1];print join",",#F[0,1,6,9,-2,-1];}else{print join",",#F[0,1,3,5,-2,-1];print join",",#F[0,1,4,6,-2,-1]} ' file
and for your input, the output is:
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25,ENSP91,515,ENSG765,Gene2
2,ENST76,ENSP77,544,ENSG765,Gene2
2,ENST35,ENSP78,544,ENSG765,Gene2
3,ENST25,ENSP91,515,ENSG765,Gene2
3,ENST76,ENSP77,544,ENSG765,Gene2
3,ENST35,ENSP78,544,ENSG765,Gene2
4,ENST54,ENSP83,1864,ENSG48,Gene3
4,ENST93,ENSP36,722,ENSG48,Gene3
5,ENST54,ENSP83,1864,ENSG48,Gene3
5,ENST93,ENSP36,722,ENSG48,Gene3
6,ENST54,ENSP83,1864,ENSG48,Gene3
6,ENST93,ENSP36,722,ENSG48,Gene3
Assuming your multiple entries are separated by a semicolon (;), here is an awk version:
BEGIN {
    FS = ","
}
{
    if ($0 ~ /^[0-9]/) {
        # find the last field that contains ';'-separated entries
        end_split_field = 0
        for (f = 2; f <= NF; f++) {
            if ($f ~ /;/) {
                end_split_field = f
            }
        }
        if (end_split_field == 0) {
            print $0
        } else {
            # split fields 2..end_split_field into the b array
            for (f = 2; f <= end_split_field; f++) {
                n = split($f, a, ";")   # split and return the number of entries
                for (i = 1; i <= n; i++) {
                    b[f, i] = a[i]
                }
            }
            # print one output line per entry
            for (i = 1; i <= n; i++) {
                printf "%s,", $1
                for (j = 2; j <= end_split_field; j++) {
                    printf "%s,", b[j, i]
                }
                # remaining fields are printed unchanged; start at
                # end_split_field + 1 so the last split field is not
                # printed a second time
                for (k = end_split_field + 1; k < NF; k++) {
                    printf "%s,", $k
                }
                printf "%s\n", $NF
            }
        }
    } else {
        print $0
    }
}
Save the content above as input.awk. Example input and output:
$ cat input
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c;d,e;f,+
3,g;h;i,j;k;l,-
We get the split output:
$ awk -f input.awk input
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c,e,+
2,d,f,+
3,g,j,-
3,h,k,-
3,i,l,-
I am new to shell scripting. I want to distribute all the data of a file in a table format and redirect the output into another file.
I have below input file File.txt
Fruit_label:1 Fruit_name:Apple
Color:Red
Type: S
No.of seeds:10
Color of seeds :brown
Fruit_label:2 fruit_name:Banana
Color:Yellow
Type:NS
I want it to look like this:
Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds
1 | apple | red | S | 10 | brown
2 | banana| yellow | NS
I want to read all the data line by line from the text file, make a header like fruit_label, fruit_name, color, type, no.of seeds, color of seeds, and then print all the assigned values in rows. The data differs between fruits; for example, banana doesn't have seeds, so I want to keep that row value blank.
Can anyone help me here?
Another approach is "Decorate & Process". What is that? To Decorate is to take the text you have and add another separator to it to make field-splitting easier -- in your case the fields can contain embedded whitespace along with the ':' separator between the field names and values, and the inconsistent whitespace around ':' makes the file a nightmare to process simply.
So instead of worrying about what the separator is, think about "What should the fields be?", add a new separator between them (Decorate), and then Process with awk.
Here sed is used to Decorate your input with '|' as the separator (a second call eliminates the '|' after the last field), and then a simpler awk process split()s each field on ':' to obtain the field name and field value. The field value is simply printed; the field names are stored in an array, and when a duplicate field name is seen it serves as the signal that a new record has begun, e.g.
sed -E 's/([^:]+:[[:blank:]]*[^[:blank:]]+)[[:blank:]]*/\1|/g' file |
sed 's/|$//' |
awk '
BEGIN { FS = "|" }
{
    for (i=1; i<=NF; i++) {
        if (split ($i, parts, /[[:blank:]]*:[[:blank:]]*/)) {
            if (! n || parts[1] in fldnames) {
                printf "%s%s", n ? "\n" : "", parts[2]
                delete fldnames
                n = 1
            }
            else
                printf " | %s", parts[2]
            fldnames[parts[1]]++
        }
    }
}
END { print "" }
'
Example Output
With your input in file you would have:
1 | Apple | Red | S | 10 | brown
2 | Banana | Yellow | NS
You will also see "Decorate-Sort-Undecorate" used to sort data on a new, non-existent column of values: "Decorate" your data with a new last field, sort on that field, and then "Undecorate" to remove the additional field when the sorting is done. This allows sorting by data that may be the sum (or combination) of any two columns, etc.
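As a minimal sketch of Decorate-Sort-Undecorate (a hypothetical example, assuming a comma-separated file with three columns that you want ordered by the sum of columns 2 and 3, so the helper sum lands in field 4):

$ awk 'BEGIN { FS=OFS="," } { print $0, $2+$3 }' file |   # Decorate: append the sum as a new last field
    sort -t, -k4,4n |                                     # Sort: numerically on the helper field
    sed 's/,[^,]*$//'                                     # Undecorate: strip the helper field again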
Here is my solution. It is a New Year's gift; usually you have to demonstrate what you have tried so far and we help you, not do it for you.
Disclaimer: some guru will probably come up with a simpler awk version, but this works.
File script.awk
# Remove space prefix
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }

# Remove space suffix
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }

# Remove both prefix and suffix spaces
function trim(s) { return rtrim(ltrim(s)); }

# Initialise or reset a fruit array
function array_init() {
    for (i = 0; i <= 6; ++i) {
        fruit[i] = ""
    }
}

# Print the content of the fruit
function array_print() {
    # Keep track of whether something was printed, so that no newline
    # is emitted for an empty array.
    printedsomething = 0
    for (i = 0; i <= 6; ++i) {
        # Do not print if the content is empty
        if (fruit[i] != "") {
            printedsomething = 1
            if (i == 1) {
                # The first field must be further split, to remove "Fruit_name"
                # Split on the space
                split(fruit[i], temparr, / /)
                printf "%s", trim(temparr[1])
            }
            else {
                printf " | %s", trim(fruit[i])
            }
        }
    }
    if ( printedsomething == 1 ) {
        print ""
    }
}

BEGIN {
    FS = ":"
    print "Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds"
    array_init()
}

/Fruit_label/ {
    array_print()
    array_init()
    fruit[1] = $2
    fruit[2] = $3
}

/Color:/ {
    fruit[3] = $2
}

/Type/ {
    fruit[4] = $2
}

/No.of seeds/ {
    fruit[5] = $2
}

/Color of seeds/ {
    fruit[6] = $2
}

END { array_print() }
To execute, call awk -f script.awk File.txt
awk processes a file line by line. So the idea is to store fruit information in an array.
Every time the line "Fruit_label:....." is found, print the current fruit and start a new one.
Since each line is read in sequence, you tell awk what to do with each line, based on a pattern.
The patterns are what are enclosed between // characters at the beginning of each section of code.
Difficulty: since the first line contains two pieces of information for each fruit, and I cut the lines on the : character, the Fruit_label field will include "Fruit_name".
That is, the first line is cut like this: $1 = "Fruit_label", $2 = "1 Fruit_name", $3 = "Apple".
This is why the array_print() function is so complicated.
The trim functions are there to remove spaces.
For example, for the Apple, "Type: S" split on the ':' gives " S", which trim() reduces to "S".
If it meets your requirements, please see https://stackoverflow.com/help/someone-answers to accept it.
I have a source.txt file like the one below, containing two columns of data. The columns of source.txt include [ ] (square brackets), as shown:
[hot] [water]
[16] [boots and, juice]
and I have another target.txt file, which contains empty lines plus full stops at the end of each line:
the weather is today (foo) but we still have (bar).
= (
the next bus leaves at (foo) pm, we can't forget to take the (bar).
I want to replace foo in each nth line of target.txt with the respective contents of the first column of source.txt, and likewise replace bar in each nth line of target.txt with the respective contents of the second column of source.txt.
I tried searching other sources to understand how I would do it. I already have a command that I use to "replace each nth occurrence of 'foo' with the numerically respective nth line of a supplied file", but I couldn't adapt it:
awk 'NR==FNR {a[NR]=$0; next} /foo/{gsub("foo", a[++i])} 1' source.txt target.txt > output.txt;
I remember seeing a way to use gsub with a file containing two columns of data, but I don't remember exactly how it differed.
EDIT: the target.txt text sometimes contains symbols such as =, (, and ). I added these symbols to the sample because some answers will not work when they are present in target.txt.
Note: the number of lines in target.txt, and therefore the number of occurrences of foo and bar, can vary; I have only shown a sample. But each line contains exactly one occurrence of foo and one of bar.
With your shown samples, please try the following answer, written and tested in GNU awk.
awk -F'\\[|\\] \\[|\\]' '
FNR==NR{
    foo[FNR]=$2
    bar[FNR]=$3
    next
}
NF{
    gsub(/\<foo\>/,foo[++count])
    gsub(/\<bar\>/,bar[count])
}
1
' source.txt FS=" " target.txt
Explanation: a detailed, annotated version of the above.
awk -F'\\[|\\] \\[|\\]' '           ##Setting field separator as [ OR ] [ OR ] here.
FNR==NR{                            ##Condition FNR==NR is TRUE while source.txt is being read.
    foo[FNR]=$2                     ##Creating foo array with index of FNR and value of 2nd field.
    bar[FNR]=$3                     ##Creating bar array with index of FNR and value of 3rd field.
    next                            ##next skips all further statements from here.
}
NF{                                 ##If the line is NOT empty then do the following.
    gsub(/\<foo\>/,foo[++count])    ##Globally substituting foo with foo array value at index count.
    gsub(/\<bar\>/,bar[count])      ##Globally substituting bar with bar array value at index count.
}
1                                   ##Printing the line here.
' source.txt FS=" " target.txt      ##Mentioning Input_file names here.
EDIT: Adding the following solution also, which will handle n occurrences of [...] in the source file and match them in the target file. Since this is a working solution for the OP (confirmed in comments), adding it here. Fair warning: this will fail when source.txt contains a &.
awk '
FNR==NR{
    while(match($0,/\[[^]]*\]/)){
        arr[++count]=substr($0,RSTART+1,RLENGTH-2)
        $0=substr($0,RSTART+RLENGTH)
    }
    next
}
{
    line=$0
    while(match(line,/\(?[[:space:]]*(\<foo\>|\<bar\>)[[:space:]]*\)?/)){
        val=substr(line,RSTART,RLENGTH)
        sub(val,arr[++count1])
        line=substr(line,RSTART+RLENGTH)
    }
}
1
' source.txt target.txt
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN {
    FS = "[][]"
    tags["foo"]
    tags["bar"]
}
NR==FNR {
    map["foo",NR] = $2
    map["bar",NR] = $4
    next
}
{
    found = 0
    head = ""
    while ( match($0,/\([^)]+\)/) ) {
        tag = substr($0,RSTART+1,RLENGTH-2)
        if ( tag in tags ) {
            if ( !found++ ) {
                lineNr++
            }
            val = map[tag,lineNr]
        }
        else {
            val = substr($0,RSTART,RLENGTH)
        }
        head = head substr($0,1,RSTART-1) val
        $0 = substr($0,RSTART+RLENGTH)
    }
    print head $0
}
$ awk -f tst.awk source.txt target.txt
the weather is today hot but we still have water.
= (
the next bus leaves at 16 pm, we can't forget to take the boots and, juice.
awk '
NR==FNR {   # build lookup
    # delete gumph
    gsub(/(^[[:space:]]*\[)|(\][[:space:]]*$)/, "")
    # split
    split($0, a, /\][[:space:]]+\[/)
    # store
    foo[FNR] = a[1]
    bar[FNR] = a[2]
    next
}
!/[^[:space:]]/ { next }   # ignore blank lines
{   # do replacements
    VFNR++   # FNR - (ignored lines)
    # can use sub if foo/bar only appear once
    gsub(/\<foo\>/, foo[VFNR])
    gsub(/\<bar\>/, bar[VFNR])
    print
}
' source.txt target.txt
Note: \< and \> are not POSIX, but are accepted by some versions of awk (e.g. gawk); POSIX ERE has no word-boundary operator.
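If portability is a concern, one hedged workaround (a sketch, with REPLACEMENT as a placeholder) is to match foo together with one character of context on each side, so it only matches as a whole word, and then replace just the token inside the matched chunk:

awk '{
    out = ""
    # match "foo" preceded/followed by a non-word character or a line edge
    while (match($0, /(^|[^[:alnum:]_])foo([^[:alnum:]_]|$)/)) {
        m = substr($0, RSTART, RLENGTH)
        sub(/foo/, "REPLACEMENT", m)        # replace only the foo inside the chunk
        out = out substr($0, 1, RSTART-1) m
        $0 = substr($0, RSTART+RLENGTH)
    }
    print out $0
}' target.txt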
I'm trying to get specific columns of a csv file (those whose header contains "SOF", in this case). It is a large file and I need to copy these columns to another csv file using shell.
I've tried something like this:
#!/bin/bash
awk ' {
i=1
j=1
while ( NR==1 )
if ( "$i" ~ /SOF/ )
then
array[j] = $i
$j += 1
fi
$i += 1
for ( k in array )
print array[k]
}' fil1.csv > result.csv
In this case I've tried to save the numbers of the columns whose header contains "SOF" in an array, and after that to copy the columns using those numbers.
Preliminary note: contrary to what one may infer from the code included in the OP, the values in the CSV are delimited with a semicolon.
Here is a solution with two separate commands:
the first parses the first line of your CSV file and identifies which fields must be exported. I use awk for this.
the second only prints the fields. I use cut for this (simpler syntax and quicker than awk, especially if your file is large)
The idea is that the first command yields a list of field numbers, separated with ",", suited to be passed as parameter to cut:
# Command #1: identify fields
fields=$(awk -F";" '
{
for (i = 1; i <= NF; i++)
if ($i ~ /SOF/) {
fields = fields sep i
sep = ","
}
print fields
exit
}' fil1.csv
)
# Command #2: export fields
{ [ -n "$fields" ] && cut -d";" -f "$fields" fil1.csv; } > result.csv
try something like this...
$ awk 'BEGIN {FS=OFS=","}
NR==1 {for(i=1;i<=NF;i++) if($i~/SOF/) {col=i; break}}
{print $col}' file
There is no handling for the case where the sought header doesn't exist: col then stays 0, so $col is $0 and the whole line is printed. Note also that the break means only the first matching column is printed.
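If more than one header can contain SOF, a sketch that collects every matching column instead of stopping at the first (the cols array and n are hypothetical names):

$ awk 'BEGIN {FS=OFS=","}
       NR==1 {for (i=1; i<=NF; i++) if ($i ~ /SOF/) cols[++n]=i}
       n     {for (j=1; j<=n; j++) printf "%s%s", $(cols[j]), (j<n ? OFS : ORS)}' file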
One of the useful commands you probably need is "cut"
cut -d , -f 2 input.csv
Here number 2 is the column number you want to cut from your csv file.
Try this one out. It transposes the file (collecting each column into a row), keeps the rows that contain SOF, and transposes back. Note that it splits on whitespace rather than commas, and that for (i in a) does not guarantee the original column order:
awk '{for(i=1;i<=NF;i++)a[i]=a[i]" "$i}END{for (i in a ){ print a[i] } }' filename | grep SOF | awk '{for(i=1;i<=NF;i++)a[i]=a[i]" "$i}END{for (i in a ){ print a[i] } }'
I have two text files:
File-1:
PRKCZ
TNFRSF14
PRDM16
MTHFR
File-2(contains two tab delimited columns):
atherosclerosis GRAB1|PRKCZ|TTN
cardiomyopathy,hypercholesterolemia PRKCZ|MTHFR
Pulmonary arterial hypertension,arrhythmia PRDM16|APOE|GATA4
Now, for each name in File-1, I also want to print the corresponding disease names from File-2 where it matches. So the output would be:
PRKCZ atherosclerosis,cardiomyopathy,hypercholesterolemia
PRDM16 Pulmonary arterial hypertension,arrhythmia
MTHFR cardiomyopathy,hypercholesterolemia
I have tried the code:
$ awk '{k=$1}
NR==FNR{if(NR>1)a[k]=","b"="$1";else{a[k]="";b=$1}next}
k in a{print $0a[k]}' File1 File2
but I did not obtain the desired output. Can anybody help me correct it, please?
You can do this with the following awk script:
script.awk
BEGIN { FS="[\t]" }
NR==FNR { split($2, tmp, "|")
for( ind in tmp ) {
name = tmp[ ind ]
if (name in disease) { disease[ name ] = disease[ name ] "," $1 }
else { disease[ name ] = $1 }
}
next
}
{ if( $1 in disease) print $1, disease[ $1 ] }
Use it like this: awk -f script.awk File-2 File-1 (note that File-2 comes first).
Explanation:
the BEGIN block sets up tab as separator.
the NR == FNR block is executed for the first argument (File-2): it reads the diseases with the names, splits the names and then appends the disease to a dictionary under each of the names
the last block is executed only (due to the next in the previous block) for the second argument (File-1): it outputs the diseases that are stored under the name (taken from $1)
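To make the lookup concrete: after File-2 has been read, the disease array holds (illustrative, built from the sample data):

disease["GRAB1"]  = "atherosclerosis"
disease["PRKCZ"]  = "atherosclerosis,cardiomyopathy,hypercholesterolemia"
disease["TTN"]    = "atherosclerosis"
disease["MTHFR"]  = "cardiomyopathy,hypercholesterolemia"
disease["PRDM16"] = "Pulmonary arterial hypertension,arrhythmia"
disease["APOE"]   = "Pulmonary arterial hypertension,arrhythmia"
disease["GATA4"]  = "Pulmonary arterial hypertension,arrhythmia"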
How can I delete lines which are substrings of other lines in a file, while keeping the longer strings that include them?
I have a file that contains peptide sequences as strings, one sequence string per line. I want to keep only the longest strings and remove all lines which are substrings of other lines in the file.
Input:
GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG
Expected Output:
GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN
The output should keep only the longest strings and remove all lines which are substrings of them. So, in the input above, lines 1, 2, 4 and 5 are substrings of line 3, so line 3 is retained in the output. Similarly, the strings on lines 6, 8, 9 and 10 are all substrings of line 7, so line 7 is retained and written to the output.
Maybe:
input=./input_file
while read -r str
do
    [[ $(grep -c "$str" "$input") == 1 ]] && echo "$str"
done < "$input"
produces:
GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN
It is slow, but simple.
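Since the sequences are fixed strings rather than regular expressions, a slightly safer and faster variant of the same loop uses grep -F (a sketch under the same assumption that each maximal sequence appears exactly once in the file):

input=./input_file
while read -r str
do
    # -F matches fixed strings, not regexes; a count of 1 means the line matches only itself
    [[ $(grep -cF -- "$str" "$input") -eq 1 ]] && printf '%s\n' "$str"
done < "$input"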
This should do what you want:
$ cat tst.awk
{ arr[$0]; strs = strs $0 RS }
END {
    for (str in arr) {
        if ( split(strs,tmp,str) == 2 ) {
            print str
        }
    }
}
$ awk -f tst.awk file
IVPVNYARTTCRRTGGIRFTITGHDYFDN
GSAAQQYWTPANATFYGGSDASGT
It loops through every string in arr and uses it as the separator value for split(). If the string occurs only once, the full file contents are split in half and split() returns 2; but if the string is a substring of some other string, the file contents are split into more segments and split() returns a number higher than 2.
If a string can appear multiple times in the input and you want it printed multiple times in the output (see the question in the comment from @G.Cito below), then you'd modify the above to:
!cnt[$0]++ { strs = strs $0 RS }
END {
    for (str in cnt) {
        if ( split(strs,tmp,str) == 2 ) {
            for (i=1; i<=cnt[str]; i++) {
                print str
            }
        }
    }
}
As a perl "one liner" (this should work for cutting and pasting into a terminal):
perl -E 'chomp(#r=<>);
for $i (0..$#r){
map { $uniq{$_}++ if ( index( $r[$i], $_ ) != -1 ) } #r;
}
for (sort keys %uniq){ say if ( $uniq{$_} == 1 ); }' peptide_seq.txt
We read and chomp the file (peptide_seq.txt) from STDIN (<>) and save it in @r, an array in which each element is a string from one line of the file.
Next we iterate through the array and map the elements of @r into a hash (%uniq), where each key is the content of a line and each value is a number that is incremented whenever that line is found to be a substring of another line. Using index we can check whether a string contains a substring, incrementing the corresponding hash value whenever index() does not return -1 (the value for "not found").
The "master" strings contain all the other strings as substrings of themselves and will only be incremented once (by matching themselves), so we loop again and print the keys of the %uniq hash whose value == 1. This second loop could be a map instead:
map { say if ( $uniq{$_} == 1 ) } sort keys %uniq ;
As a self-contained script, that could be:
#!perl -l
chomp(#r=<DATA>);
for $i (0..$#r) {
map { $uniq{$_}++ if ( index( $r[$i], $_ ) != -1 ) } #r ;
}
map { print if ($uniq{$_} == 1) } sort keys %uniq ;
__DATA__
GSAAQQYW
ATFYGGSDASGT
GSAAQQYWTPANATFYGGSDASGT
GSAAQQYWTPANATF
ATFYGGSDASGT
NYARTTCRRTG
IVPVNYARTTCRRTGGIRFTITGHDYFDN
RFTITGHDYFDN
IVPVNYARTTCRRTG
ARTTCRRTGGIRFTITG
Output:
GSAAQQYWTPANATFYGGSDASGT
IVPVNYARTTCRRTGGIRFTITGHDYFDN
This prints the two longest lines, prefixed with their length and line number:
awk '{ print length(), NR, $0 | "sort -rn" }' sed_longer.txt | head -n 2
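If you want only the two longest lines themselves, without the length and line-number prefix, one sketch is to decorate with the length only, sort, and then cut the helper column back off:

$ awk '{ print length($0), $0 }' sed_longer.txt | sort -rn | head -n 2 | cut -d' ' -f2-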