I have a very large file with the following basic format, with a number of additional fields:
posA,id1,id2,posB,id3,name,(n additional fields)
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25;ENST76;ENST35,ENSP91;ENSP77;ENSP78,515;544;544,ENSG765,Gene2
3,ENST25;ENST76;ENST35,ENSP91;ENSP77;ENSP78,515;544;544,ENSG765,Gene2
4,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
5,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
6,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
Line one (posA=1) has a single entry in each column and does not need to be modified. Other lines have a variable number of multiple entries in some columns: on the posA=2 line, for example, the first entry in "id1" (ENST25) is paired with the first entry in "id2" (ENSP91) and the first entry in "posB" (515), and so on, while the single-entry columns (e.g., "posA", "id3", "name") apply to all of the paired entries in columns 2-4. Occasionally, fields other than columns 2-4 also contain multiple entries.
I want to split the columns with multiple entries into separate lines, while retaining the data from the other columns, like so:
posA,id1,id2,posB,id3,name,(n additional fields)
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25,ENSP91,515,ENSG765,Gene2
2,ENST76,ENSP77,544,ENSG765,Gene2
2,ENST35,ENSP78,544,ENSG765,Gene2
3,ENST25,ENSP91,515,ENSG765,Gene2
3,ENST76,ENSP77,544,ENSG765,Gene2
3,ENST35,ENSP78,544,ENSG765,Gene2
4,ENST54,ENSP83,1864,ENSG48,Gene3
4,ENST93,ENSP36,722,ENSG48,Gene3
...
What is the best approach for this problem?
Thanks!
Taking your example to mean that a field will hold at most two compound attributes, you can accomplish what you intend fairly easily with simple parameter expansion and substring removal, e.g.
#!/bin/bash
while IFS=, read -r p a1 a2 a3; do
    [[ $a1 =~ ';' ]] && {
        printf "%s,%s,%s,%s\n" "$p" "${a1%;*}" "${a2%;*}" "$a3"
        printf "%s,%s,%s,%s\n" "$p" "${a1#*;}" "${a2#*;}" "$a3"
    } || printf "%s,%s,%s,%s\n" "$p" "$a1" "$a2" "$a3"
done < "$1"
Where [[ $a1 =~ ';' ]] checks for a ';' in $a1, and if one is found, picks off the first attribute in $a1 and $a2 with ${a1%;*} and ${a2%;*}. The second attribute in each is then recovered with ${a1#*;} and ${a2#*;}.
If no ';' is contained in $a1, the attributes are printed unchanged. IFS=, ensures the line is word-split on ','.
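If those expansions are new to you, here is how they behave on a sample two-entry field (the variable name is just for illustration):
a1='ENST25;ENST76'
echo "${a1%;*}"    # ENST25 -- shortest suffix matching ';*' removed
echo "${a1#*;}"    # ENST76 -- shortest prefix matching '*;' removed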
(Note: you should add validation that the filename is valid, etc., to your final script. You can also use echo instead of printf if you like.)
Example Use/Output
$ splitattrib.sh file
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c,e,+
2,d,f,+
The best approach is to break it into three parts.
You have 3 line patterns. One has 6 columns, another has 12, and the last has 9.
6 columns => 1 line
12 columns => 3 lines
9 columns => 2 lines
Your 6-column lines should not be modified. That leaves 12 and 9, which you can separate with if, else if and else, like:
if( column == 6 ){...}
else if( column == 12 ){...}
else {...}
And here is a Perl one-liner solution:
perl -a -F',|;' -lne '$s=scalar @F; if($s==6){print join ",",@F} elsif($s==12){print join ",",@F[0,1,4,7,-2,-1]; print join ",",@F[0,2,5,8,-2,-1]; print join ",",@F[0,3,6,9,-2,-1]} else {print join ",",@F[0,1,3,5,-2,-1]; print join ",",@F[0,2,4,6,-2,-1]}' file
and for your input, the output is:
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25,ENSP91,515,ENSG765,Gene2
2,ENST76,ENSP77,544,ENSG765,Gene2
2,ENST35,ENSP78,544,ENSG765,Gene2
3,ENST25,ENSP91,515,ENSG765,Gene2
3,ENST76,ENSP77,544,ENSG765,Gene2
3,ENST35,ENSP78,544,ENSG765,Gene2
4,ENST54,ENSP83,1864,ENSG48,Gene3
4,ENST93,ENSP36,722,ENSG48,Gene3
5,ENST54,ENSP83,1864,ENSG48,Gene3
5,ENST93,ENSP36,722,ENSG48,Gene3
6,ENST54,ENSP83,1864,ENSG48,Gene3
6,ENST93,ENSP36,722,ENSG48,Gene3
Assuming your multiple entries are separated with a semicolon ;, here is an awk version.
BEGIN {
    FS = ","
}
{
    if ($0 ~ /^[0-9]/) {              # data lines start with a digit
        end_split_field = 0
        for (f = 2; f <= NF; f++) {
            if ($f ~ /;/) {
                end_split_field = f   # remember the last multi-entry field
            }
        }
        if (end_split_field == 0) {
            print $0
        } else {
            for (f = 2; f <= end_split_field; f++) {
                n = split($f, a, ";")   # split and return the entry count
                for (i = 1; i <= n; i++) {
                    b[f, i] = a[i]
                }
            }
            # assumes every multi-entry field holds the same number of entries
            for (i = 1; i <= n; i++) {
                printf "%s,", $1
                for (j = 2; j <= end_split_field; j++) {
                    printf "%s,", b[j, i]
                }
                for (k = end_split_field + 1; k < NF; k++) {
                    printf "%s,", $k
                }
                printf "%s\n", $NF
            }
        }
    } else {
        print $0
    }
}
Save the content above as input.awk. Example input and output:
$ cat input
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c;d,e;f,+
3,g;h;i,j;k;l,-
We can get the split output
$ awk -f input.awk input
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c,e,+
2,d,f,+
3,g,j,-
3,h,k,-
3,i,l,-
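It also handles the original six-column data (saved here as file):
$ awk -f input.awk file
posA,id1,id2,posB,id3,name,(n additional fields)
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25,ENSP91,515,ENSG765,Gene2
2,ENST76,ENSP77,544,ENSG765,Gene2
2,ENST35,ENSP78,544,ENSG765,Gene2
...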
I'm wondering whether there is a way to select columns by matching the header.
The data looks like this
ID_1 ID_2 ID_3 ID_6 ID_15
value1 0 2 4 7 6
value2 0 4 4 3 8
value3 2 2 3 7 8
I would like to get only the columns ID_3 & ID_15:
ID_3 ID_15
4 6
4 8
3 8
awk can easily select the columns if I know their order.
However, I have a very large table and only a list of IDs in hand.
Can I still use awk, or is there an easier way on Linux?
The input format isn't well defined, but there are a few simple approaches: awk, perl and sqlite.
(FNR==1) {
    nocol=split(col,ocols,/,/)   # col contains the requested column names
    ncols=split("vals " $0,cols) # header line, naming the anonymous first column
    for (nn=1; nn<=ncols; nn++) colmap[cols[nn]]=nn # map names to field numbers
    OFS="\t" # to align output
    for (nn=1; nn<=nocol; nn++) printf("%s%s",ocols[nn],OFS)
    printf("\n") # output header line
}
(FNR>1) { # read data
    for (nn=1; nn<=nocol; nn++) {
        if (nn>1) printf(OFS) # pad
        if (ocols[nn] in colmap) { printf("%s",$(colmap[ocols[nn]])) }
        else { printf "--" } # named column not in data
    }
    printf("\n") # wrap line
}
$ nawk -f mycols.awk -v col=ID_3,ID_15 data
ID_3 ID_15
4 6
4 8
3 8
Perl, just a variation on the above with some perl idioms to confuse/entertain:
use strict;
use warnings;
our @ocols = split(/,/, $ENV{cols}); # cols contains named columns
our $nocol = scalar(@ocols);
our ($nn, %colmap);
$, = "\t"; # OFS equiv
# while (<>) {...} implicit with perl -an
if ($. == 1) { # FNR equiv
    %colmap = map { $F[$_] => $_+1 } 0..$#F; # create name map hash
    $colmap{vals} = 0; # name anon 1st col
    print @ocols, "\n"; # output header
} else {
    for ($nn = 0; $nn < $nocol; $nn++) {
        print "\t" if ($nn > 0);
        if (exists($colmap{$ocols[$nn]})) { printf("%s", $F[$colmap{$ocols[$nn]}]) }
        else { printf("--") } # named column not in data
    }
    printf("\n")
}
$ cols="ID_3,ID_15" perl -an mycols.pl < data
That uses an environment variable to avoid the effort of parsing the command line. It needs the perl options -an, which set up field-splitting and an input read loop (much like awk does).
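As a minimal illustration of what -a and -n give you (-e is used here because the script is inline):
$ printf 'a b\nc d\n' | perl -ane 'print $F[1], "\n"'
b
d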
And with sqlite (I used v3.11; I believe v3.8 or later is required for a useful .import). This uses an in-memory temporary database (name a file if the data is too large for memory, or if you want a persistent copy of the parsed data), and automatically creates a table based on the first line. The advantages here are that you might not need any scripting at all, and you can perform multiple queries on your data with just one parse overhead.
You can skip this next step if you have a single hard tab delimiting the columns, in which case replace .mode csv with .mode tabs in the sqlite example below.
Otherwise, to convert your data to a suitable CSV-ish format:
nawk -v OFS="," '(FNR==1){$0="vals " $0} {$1=$1; print}' < data > data.csv
This adds a dummy first column "vals" to the first line, then prints each line comma-separated. It does this via a seemingly pointless assignment to $1, which causes $0 to be recomputed with FS (space/tab) replaced by OFS (comma).
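You can see that recompute trick in isolation like so:
$ echo 'ID_1 ID_2 ID_3' | awk -v OFS="," '{ $1 = $1; print }'
ID_1,ID_2,ID_3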
$ sqlite3
sqlite> .mode csv
sqlite> .import data.csv mytable
sqlite> .schema mytable
CREATE TABLE mytable(
"vals" TEXT,
"ID_1" TEXT,
"ID_2" TEXT,
"ID_3" TEXT,
"ID_6" TEXT,
"ID_15" TEXT
);
sqlite> select ID_3,ID_15 from mytable;
ID_3,ID_15
4,6
4,8
3,8
sqlite> .mode column
sqlite> select ID_3,ID_15 from mytable;
ID_3 ID_15
---------- ----------
4 6
4 8
3 8
Use .once or .output to send output to a file (sqlite docs). Use .headers on or .headers off as required.
sqlite is quite happy to create an unnamed column, so you don't strictly have to add a name to the first column of the header line, but you do need to make sure the number of columns is the same for all input lines.
If you get "expected X columns but found Y" errors during the .import then you'll need to clean up the data format a little for this.
$ cat c.awk
NR == 1 {
    for (i=1; i<=NF; ++i) {
        # +1 because data rows have an extra leading label column,
        # so header field i corresponds to data field i+1
        if ($i == "ID_3") col_3 = (i + 1)
        if ($i == "ID_15") col_15 = (i + 1)
    }
    print "ID_3", "ID_15"
}
NR > 1 { print $col_3, $col_15 }
$ awk -f c.awk c.txt
ID_3 ID_15
4 6
4 8
3 8
You could go for something like this:
BEGIN {
    keys["ID_3"]
    keys["ID_15"]
}
NR == 1 {
    for (i = 1; i <= NF; ++i)
        if ($i in keys) cols[++n] = i
}
{
    for (i = 1; i <= n; ++i)
        printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS)
}
Save the script to a file and run it like awk -f script.awk file.
Alternatively, as a "one-liner":
awk 'BEGIN { keys["ID_3"]; keys["ID_15"] }
NR == 1 { for (i = 1; i <= NF; ++i) if ($i in keys) cols[++n] = i }
{ for (i = 1; i <= n; ++i) printf "%s%s", $(cols[i]+(NR>1)), (i < n ? OFS : ORS) }' file
Before the file is processed, keys are set in the keys array, corresponding to the column headings of interest.
On the first line, record all the column numbers that contain one of the keys in the cols array.
Loop through each of the cols and print them out, followed by either the output field separator OFS or the output record separator ORS, depending on whether it's the last one. $(cols[i]+(NR>1)) handles the fact that rows after the first have an extra field at the start, because NR>1 will be true (1) for those lines and false (0) for the first line.
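The boolean-as-number arithmetic is easy to verify on its own:
$ awk 'BEGIN { print 1 + (2 > 1), 1 + (0 > 1) }'
2 1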
Try the script below:
#!/bin/sh
file="$1"; shift
awk -v cols="$*" '
BEGIN{
    nc=split(cols,C)
    OFS=FS="\t"
    getline                 # consume the header line
    split($0,H)
    for(c=1;c<=nc;c++){     # keep the requested column order
        for(h in H){
            # +1: data rows carry an extra leading label column
            if(C[c]==H[h])F[c]=h+1
        }
    }
}
{ l=""; for(c=1;c<=nc;c++){ l=l $F[c] OFS } print l }
' "$file"
On the command line, type:
[sumit.gupta@rpm01 ~]$ test.sh filename ID_3 ID_15
I have multiple .txt files containing text in one alphabet, and I want to transliterate the text into another alphabet. Some characters of alphabet1 map 1:1 to characters of alphabet2 (e.g., a becomes e), whereas others map 1:2 (e.g., x becomes ch).
I would like to do this using a simple script for the Linux shell.
With tr or sed I can convert the 1:1 characters:
sed 'y/abcdefghijklmnopqrstuvwxyz/nopqrstuvwxyzabcdefghijklm/'
a will become n, b will become o, et cetera (ROT13, a form of Caesar cipher).
But how can I deal with 1:2 characters?
Not an answer, just to show a briefer, idiomatic way to populate the table[] array from @konsolebox's answer, as discussed in the related comments:
BEGIN {
    split("a e b", old)
    split("x ch o", new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}
so the mapping of old to new chars is clearly shown: the char in the first split() is mapped to the char(s) below it, and for any other mapping you want, you just need to change the strings in the split() calls rather than 26-ish explicit assignments to table[].
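A quick sanity check of the resulting mapping (same strings as above):
$ awk 'BEGIN { split("a e b", old); split("x ch o", new)
    for (i = 1; i in old; i++) print old[i], "->", new[i] }'
a -> x
e -> ch
b -> o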
You can even create a general script to do mappings and just pass in the old and new strings as variables:
BEGIN {
    split(o, old)
    split(n, new)
    for (i in old)
        table[old[i]] = new[i]
    FS = OFS = ""
}
then in shell anything like this:
old="a e b"
new="x ch o"
awk -v o="$old" -v n="$new" -f script.awk file
and you can protect yourself from your own mistakes populating the strings, e.g.:
BEGIN {
    numOld = split(o, old)
    numNew = split(n, new)
    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
        exit 1
    }
    for (i=1; i <= numOld; i++) {
        if (old[i] in table) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        table[old[i]] = new[i]
    }
}
Wouldn't it be good to know if you wrote that b maps to x and then later mistakenly wrote that b maps to y? The above really is the best way to do this, but it's your call of course.
Here's one complete solution as discussed in the comments below
BEGIN {
    numOld = split("a e b", old)
    numNew = split("x ch o", new)
    if (numOld != numNew) {
        printf "ERROR: #old vals (%d) != #new vals (%d)\n", numOld, numNew | "cat>&2"
        exit 1
    }
    for (i=1; i <= numOld; i++) {
        if (old[i] in map) {
            printf "ERROR: \"%s\" duplicated at position %d in old string\n", old[i], i | "cat>&2"
            exit 1
        }
        if (newvals[new[i]]++) {
            printf "WARNING: \"%s\" duplicated at position %d in new string\n", new[i], i | "cat>&2"
        }
        map[old[i]] = new[i]
    }
    FS = OFS = ""
}
{
    for (i = 1; i <= NF; ++i) {
        if ($i in map) {
            $i = map[$i]
        }
    }
    print
}
I renamed the table array to map just because, IMHO, that better represents the purpose of the array.
Save the above in a file script.awk and run it as awk -f script.awk inputfile.
Using Awk:
#!/usr/bin/awk -f
BEGIN {
    FS = OFS = ""
    table["a"] = "e"
    table["x"] = "ch"
    # and so on...
}
{
    for (i = 1; i <= NF; ++i) {
        if ($i in table) {
            $i = table[$i]
        }
    }
}
1 # always-true pattern: print the (possibly modified) record
Usage:
awk -f script.awk file
Test:
# echo "the quick brown fox jumps over the lazy dog" | awk -f script.awk
the quick brown foch jumps over the lezy dog
This can be done quite concisely using a Perl one-liner:
perl -pe '%h=(a=>"xy",c=>"z"); s/(.)/defined $h{$1} ? $h{$1} : $1/eg'
or equivalently (thanks jaypal):
perl -pe '%h=(a=>"xy",c=>"z"); s|(.)|$h{$1}//=$1|eg'
%h is a hash containing the characters (keys) and their substitutions (values). s is the substitution command (as in sed). The g modifier means that the substitution is global and the e means that the replacement part is evaluated as an expression. It captures each character one by one and substitutes them with the value in the hash if it exists, otherwise keeps the original value. The -p switch means that each line in the input is automatically printed.
Testing it out:
$ perl -pe '%h=(a=>"xy",c=>"z"); s|(.)|$h{$1}//=$1|eg' <<<"abc"
xybz
Using sed.
Write a file transliterate.sed containing:
s/a/e/g
s/x/ch/g
and then run from your command line to get the transliterated output.txt from input.txt:
sed -f transliterate.sed input.txt > output.txt
If you need this more often, consider adding #!/bin/sed -f as the first line and making your file executable with chmod 744 transliterate.sed, as described at the Wikipedia page for sed.
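For example, with the two rules above:
$ echo "axa" | sed -f transliterate.sed
eche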
I have a large txt file ("," as delimiter) with some data and strings:
2014:04:29:00:00:58:GMT: subject=BMRA.BM.T_GRIFW-1.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=3,TS=2014:04:29:01:00:00:GMT,VP=4.0,TS=2014:04:29:01:29:00:GMT,VP=4.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
2014:04:29:00:00:59:GMT: subject=BMRA.BM.T_GRIFW-2.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=2,TS=2014:04:29:01:00:00:GMT,VP=3.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
I would like to find lines that contain 'T_GRIFW', then print field $1 from 'subject' onwards, and from $2 onwards only the times and floats. Furthermore, I want to incorporate an if statement: if $4 == 'NP=3', only fields $5, $6, $9 and $10 are printed after the previous fields, and if $4 == 'NP=2', all following fields are printed (times and floats only).
For instance, the result of the two sample lines will be:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I know this is complex and I have tried my best to be thorough in my description. The basic code I have thus far is:
awk 'BEGIN {FS=","}{OFS=","} /T_GRIFW-1.FPN/ {print $1}' tib_messages.2014-04-29
THANKS A MILLION!
Here's an awk executable file that'll create your desired output:
#!/usr/bin/awk -f
# use a more complicated FS => field numbers counted differently
BEGIN { FS="=|,"; OFS="," }
$2 ~ /T_GRIFW/ && $8=="NP" {
    str="subject=" $2 OFS
    # strip ":GMT" from dates and "}" from everywhere
    gsub( /:GMT|[\}]/, "")
    # append common fields to str with OFS
    for(i=5;i<=13;i+=2) str=str $i OFS
    # print the remaining fields and line separator
    if($9==3) { print str $19, $21 }
    else if($9==2) { print str $15, $17 }
}
Placing that in a file called awko, chmod'ing it executable, and then running awko data yields:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I've placed comments in the script, but here are some things worth spelling out:
Using a more complicated FS means you don't have to reparse the fields to get at the data after each = (see the toy example after this list)
I "cheated" and just hard-coded subject (which now falls at the end of $1) for str
:GMT and } appeared to be the only data that needed to be forcibly removed
With this FS, dates and their values are two fields apart, which still makes them easy to loop over
In either final print call, str already ends in an OFS, so the comma between it and the next field can be skipped
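To see how that FS carves up a record, try a toy line:
$ echo 'a=b,c=d' | awk -F'=|,' '{ for (i = 1; i <= NF; i++) print i, $i }'
1 a
2 b
3 c
4 d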
If I understand your requirements, the following will work:
BEGIN {
    FS = ","
    OFS = ","
}
/T_GRIFW/ {
    split($1, subject, " ")
    result = subject[2] OFS
    delete arr
    counter = 0
    for (i = 2; i <= NF; i++) {
        # for NP=3 records, skip the middle TS/VP pair (fields 7 and 8)
        if ($4 == "NP=3" && (i == 7 || i == 8))
            continue
        n = split($i, field, "=")
        val = field[n]                # the value sits after the last "="
        gsub(/:GMT|[{}]/, "", val)    # strip ":GMT" and curly braces
        arr[++counter] = val
    }
    for (i = 1; i <= counter; i++)
        result = result arr[i] (i < counter ? OFS : "")
    print result
}
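Save this as, say, script.awk (the name is arbitrary) and run it against your file:
$ awk -f script.awk tib_messages.2014-04-29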