I am new to shell scripting. I want to arrange all the data of a file into a table format and redirect the output into another file.
I have below input file File.txt
Fruit_label:1 Fruit_name:Apple
Color:Red
Type: S
No.of seeds:10
Color of seeds :brown
Fruit_label:2 fruit_name:Banana
Color:Yellow
Type:NS
I want it to look like this:
Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds
1 | apple | red | S | 10 | brown
2 | banana| yellow | NS
I want to read all the data line by line from the text file, make a header like Fruit_label, Fruit_name, Color, Type, No.of seeds, Color of seeds, and then print all the assigned values in rows. The data above differs between fruits; for example, banana doesn't have seeds, so I want to keep its row value blank.
Can anyone help me here?
Another approach is a "Decorate & Process" approach. What is "Decorate & Process"? To Decorate is to take the text you have and decorate it with another separator to make field-splitting easier. In your case the fields can contain embedded whitespace along with the ':' separator between the field names and values, and the inconsistent whitespace around ':' makes it a nightmare to process simply.
So instead of worrying about what the separator is, think about "What should the fields be?" and then add a new separator (Decorate) between the fields and then Process with awk.
Here sed is used to Decorate your input with '|' as the separator (a second call eliminates the '|' after the last field), and then a simpler awk process uses split() on ':' to obtain the field name and field value. The field value is simply printed and the field names are stored in an array; when a duplicate field name is found, that serves as the "seen" flag marking the change between records, e.g.
sed -E 's/([^:]+:[[:blank:]]*[^[:blank:]]+)[[:blank:]]*/\1|/g' file |
sed 's/|$//' |
awk '
BEGIN { FS = "|" }
{
for (i=1; i<=NF; i++) {
if (split ($i, parts, /[[:blank:]]*:[[:blank:]]*/)) {
if (! n || parts[1] in fldnames) {
printf "%s %s", n ? "\n" : "", parts[2]
delete fldnames
n = 1
}
else
printf " | %s", parts[2]
fldnames[parts[1]]++
}
}
}
END { print "" }
'
Example Output
With your input in file you would have:
1 | Apple | Red | S | 10 | brown
2 | Banana | Yellow | NS
You will also see a "Decorate-Sort-Undecorate" approach used to sort data on a new, non-existent column of values by "Decorating" your data with a new last field, sorting on that field, and then "Undecorating" to remove the additional field when sorting is done. This allows sorting by data that may be the sum (or combination) of any two columns, etc...
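The Decorate-Sort-Undecorate pattern can be sketched with a tiny pipeline; the data and column positions here are made up purely for illustration. awk appends the sum of two columns as a new last field, sort orders on that field, and sed strips it again:

```shell
printf '%s\n' 'a,3,4' 'b,1,1' 'c,2,6' |
awk -F, -v OFS=, '{ print $0, $2+$3 }' |   # Decorate: append the sum as a new last field
sort -t, -k4,4n |                          # Sort numerically on the decorated field
sed 's/,[^,]*$//'                          # Undecorate: remove the extra field
```

This prints the rows ordered by the sum of columns 2 and 3 (b, then a, then c), with the helper column removed.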
Here is my solution. Consider it a new year gift; usually you have to demonstrate what you have tried so far and we help you, rather than do it for you.
Disclaimer: some guru will probably come up with a simpler awk version, but this works.
File script.awk
# Remove space prefix
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
# Remove space suffix
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
# Remove both suffix and prefix spaces
function trim(s) { return rtrim(ltrim(s)); }
# Initialise or reset a fruit array
function array_init() {
for (i = 0; i <= 6; ++i) {
fruit[i] = ""
}
}
# Print the content of the fruit
function array_print() {
# Track whether anything was printed, so that a newline is
# emitted only for a non-empty array.
printedsomething = 0
for (i = 0; i <= 6; ++i) {
# Do not print if the content is empty
if (fruit[i] != "") {
printedsomething = 1
if (i == 1) {
# The first field must be further split, to remove "Fruit_name"
# Split on the space
split(fruit[i], temparr, / /)
printf "%s", trim(temparr[1])
}
else {
printf " | %s", trim(fruit[i])
}
}
}
if ( printedsomething == 1 ) {
print ""
}
}
BEGIN {
FS = ":"
print "Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds"
array_init()
}
/Fruit_label/ {
array_print()
array_init()
fruit[1] = $2
fruit[2] = $3
}
/Color:/ {
fruit[3] = $2
}
/Type/ {
fruit[4] = $2
}
/No.of seeds/ {
fruit[5] = $2
}
/Color of seeds/ {
fruit[6] = $2
}
END { array_print() }
To execute, call awk -f script.awk File.txt
awk processes a file line by line, so the idea is to store the fruit information in an array.
Every time a line "Fruit_label:....." is found, print the current fruit and start a new one.
Since each line is read in sequence, you tell awk what to do with each line, based on a pattern.
The patterns are enclosed between / / characters at the beginning of each block of code.
Difficulty: since the first line contains two pieces of information for every fruit, and I split the lines on the : character, the Fruit_label field will include "Fruit_name".
I.e. the first line is split like this: $1 = Fruit_label, $2 = 1 Fruit_name, $3 = Apple
This is why the array_print() function is so complicated.
Trim functions are there to remove spaces.
For example, for the Apple, "Type: S" split on the : results in " S", which trim() reduces to S.
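You can see that split in isolation by feeding just the first input line to awk (the bracketed output format is only for illustration):

```shell
echo 'Fruit_label:1 Fruit_name:Apple' |
awk -F: '{ printf "$1=[%s] $2=[%s] $3=[%s]\n", $1, $2, $3 }'
# prints: $1=[Fruit_label] $2=[1 Fruit_name] $3=[Apple]
```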
If it meets your requirements, please see https://stackoverflow.com/help/someone-answers to accept it.
I have a source.txt file like the one below, containing two columns of data. The columns of source.txt are wrapped in [ ] (square brackets), as shown:
[hot] [water]
[16] [boots and, juice]
and I have another target.txt file that contains empty lines plus full stops at the end of each line:
the weather is today (foo) but we still have (bar).
= (
the next bus leaves at (foo) pm, we can't forget to take the (bar).
I want to replace foo in each nth line of target.txt with the respective contents of the first column of source.txt, and replace bar in each nth line of target.txt with the respective contents of the second column of source.txt.
I tried to search other sources to understand how I would do it. I already have a command that I use to "replace each nth occurrence of 'foo' with the numerically respective nth line of a supplied file", but I couldn't adapt it:
awk 'NR==FNR {a[NR]=$0; next} /foo/{gsub("foo", a[++i])} 1' source.txt target.txt > output.txt;
I remember seeing a way to use gsub with two columns of data, but I don't remember exactly what the difference was.
EDIT: the target.txt text sometimes contains symbols such as =, ( and ). I added these symbols because some answers will not work if they appear in the target.txt file.
Note: the number of target.txt lines and therefore the number of occurrences of bar and foo in this file can vary, I just showed a sample. But the number of occurrences of both foo and bar in each row is 1 respectively.
With your shown samples, please try following answer. Written and tested in GNU awk.
awk -F'\\[|\\] \\[|\\]' '
FNR==NR{
foo[FNR]=$2
bar[FNR]=$3
next
}
NF{
gsub(/\<foo\>/,foo[++count])
gsub(/\<bar\>/,bar[count])
}
1
' source.txt FS=" " target.txt
Explanation: Adding detailed explanation for above.
awk -F'\\[|\\] \\[|\\]' ' ##Setting field separator as [ OR ] [ OR ] here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when source.txt will be read.
foo[FNR]=$2 ##Creating foo array with index of FNR and value of 2nd field here.
bar[FNR]=$3 ##Creating bar array with index of FNR and value of 3rd field here.
next ##next will skip all further statements from here.
}
NF{ ##If line is NOT empty then do following.
gsub(/\<foo\>/,foo[++count]) ##Globally substituting foo with array foo value, whose index is count.
gsub(/\<bar\>/,bar[count]) ##Globally substituting bar with array of bar with index of count.
}
1 ##printing line here.
' source.txt FS=" " target.txt ##Mentioning Input_files names here.
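The trailing FS=" " between the file names deserves a note: awk processes command-line variable assignments when it reaches them in the argument list, so the field separator can differ per input file. A minimal sketch with hypothetical files:

```shell
printf 'a:b\n' > /tmp/colon.txt
printf 'c d\n' > /tmp/space.txt
# First file is split on ':', the second on whitespace:
awk -F: 'FNR==NR { print "first:", $1; next } { print "second:", $1 }' \
    /tmp/colon.txt FS=" " /tmp/space.txt
# prints: first: a
#         second: c
```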
EDIT: Adding the following solution also, which handles any number of occurrences of [...] in source and matches them in the target file. Since this is a working solution for the OP (confirmed in comments), adding it here. Also, fair warning: this will fail when source.txt contains a &.
awk '
FNR==NR{
while(match($0,/\[[^]]*\]/)){
arr[++count]=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
}
next
}
{
line=$0
while(match(line,/\(?[[:space:]]*(\<foo\>|\<bar\>)[[:space:]]*\)?/)){
val=substr(line,RSTART,RLENGTH)
sub(val,arr[++count1])
line=substr(line,RSTART+RLENGTH)
}
}
1
' source.txt target.txt
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN {
FS="[][]"
tags["foo"]
tags["bar"]
}
NR==FNR {
map["foo",NR] = $2
map["bar",NR] = $4
next
}
{
found = 0
head = ""
while ( match($0,/\([^)]+\)/) ) {
tag = substr($0,RSTART+1,RLENGTH-2)
if ( tag in tags ) {
if ( !found++ ) {
lineNr++
}
val = map[tag,lineNr]
}
else {
val = substr($0,RSTART,RLENGTH)
}
head = head substr($0,1,RSTART-1) val
$0 = substr($0,RSTART+RLENGTH)
}
print head $0
}
$ awk -f tst.awk source.txt target.txt
the weather is today hot but we still have water.
= (
the next bus leaves at 16 pm, we can't forget to take the boots and, juice.
awk '
NR==FNR { # build lookup
# delete gumph
gsub(/(^[[:space:]]*\[)|(\][[:space:]]*$)/, "")
# split
split($0, a, /\][[:space:]]+\[/)
# store
foo[FNR] = a[1]
bar[FNR] = a[2]
next
}
!/[^[:space:]]/ { next } # ignore blank lines
{ # do replacements
VFNR++ # FNR - (ignored lines)
# can use sub if foo/bar only appear once
gsub(/\<foo\>/, foo[VFNR])
gsub(/\<bar\>/, bar[VFNR])
print
}
' source.txt target.txt
Note: \< and \> are not POSIX but are accepted by some versions of awk (e.g. gawk). I'm not sure POSIX awk regex has a "word boundary".
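If portability is a concern, one workaround avoids word-boundary regexes entirely: compare whole fields instead, since awk's default field splitting already isolates space-separated tokens. This sketch only handles tokens that stand alone, not ones wrapped in punctuation like (foo):

```shell
echo 'foo food bar' |
awk '{ for (i=1; i<=NF; i++) if ($i == "foo") $i = "X"; print }'
# prints: X food bar   -- "food" is untouched, no \< \> needed
```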
I have a very large file with the following basic format, with a number of additional fields:
posA,id1,id2,posB,id3,name,(n additional fields)
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25;ENST76;ENST35,ENSP91;ENSP77;ENSP78,515;544;544,ENSG765,Gene2
3,ENST25;ENST76;ENST35,ENSP91;ENSP77;ENSP78,515;544;544,ENSG765,Gene2
4,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
5,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
6,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
Line one (posA=1) has a single entry in each column and does not need to be modified. For lines with a variable number of entries in some columns, e.g. the line with posA=2, the first entry of "id1" (ENST25) is paired with the first entry of "id2" (ENSP91) and the first entry of "posB" (515), and so on, while the single-entry columns (e.g. "posA", "id3", "name") apply to all of the paired entries in columns 2-4. A few fields besides columns 2-4 also occasionally contain multiple entries.
I want to split the columns with multiple entries into separate lines, while retaining the data from the other columns, like so:
posA,id1,id2,posB,id3,name,(n additional fields)
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25,ENSP91,515,ENSG765,Gene2
2,ENST76,ENSP77,544,ENSG765,Gene2
2,ENST35,ENSP78,544,ENSG765,Gene2
3,ENST25,ENSP91,515,ENSG765,Gene2
3,ENST76,ENSP77,544,ENSG765,Gene2
3,ENST35,ENSP78,544,ENSG765,Gene2
4,ENST54,ENSP83,1864,ENSG48,Gene3
4,ENST93,ENSP36,722,ENSG48,Gene3
...
What is the best approach for this problem?
Thanks!
Taking your example to mean that there will be at most two compound attributes, then using simple parameter expansion with substring removal you can accomplish what you intend fairly easily, e.g.
#!/bin/bash
while IFS=, read -r p a1 a2 a3; do
[[ $a1 =~ ';' ]] && {
printf "%s,%s,%s,%s\n" "$p" "${a1%;*}" "${a2%;*}" "$a3"
printf "%s,%s,%s,%s\n" "$p" "${a1#*;}" "${a2#*;}" "$a3"
} || printf "%s,%s,%s,%s\n" "$p" "$a1" "$a2" "$a3"
done < "$1"
Where [[ $a1 =~ ';' ]] checks for a ';' in $a1 and if found then picks off the first attribute in $a1 and $a2 with ${a1%;*} and ${a2%;*}. Then for the second attribute in each, ${a1#*;} and ${a2#*;} are used.
If no ';' is contained in $a1, the attributes are printed unchanged. IFS=, ensures the parameters are word-split on ','.
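The two expansions can be checked in isolation; the value below is taken from the sample data:

```shell
a1='c;d'
echo "${a1%;*}"   # removes the shortest suffix matching ';*', prints: c
echo "${a1#*;}"   # removes the shortest prefix matching '*;', prints: d
```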
(note: you should add validation that the filename is valid, etc. to your final script. You can also use echo if you like)
Example Use/Output
$ splitattrib.sh file
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c,e,+
2,d,f,+
The best approach is to break it into three parts.
You have 3 line patterns. One has 6 columns, another has 12, and the last has 9.
6 columns => 1 line
12 columns => 3 lines
9 columns => 2 lines
Your 6-column lines should not be modified. That leaves the 12- and 9-column lines, which you can separate with if, else if and else. Like:
if( column == 6 ){...}
else if( column == 12 ){...}
else {...}
And here is a Perl one-liner solution:
perl -a -F",|;" -lne '$s=scalar @F;if($s==6){print join ",",@F}elsif($s==12){print join",",@F[0,1,4,7,-2,-1];print join",",@F[0,2,5,8,-2,-1];print join",",@F[0,3,6,9,-2,-1];}else{print join",",@F[0,1,3,5,-2,-1];print join",",@F[0,2,4,6,-2,-1]} ' file
and for your input, the output is:
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25,ENSP91,515,ENSG765,Gene2
2,ENST76,ENSP77,544,ENSG765,Gene2
2,ENST35,ENSP78,544,ENSG765,Gene2
3,ENST25,ENSP91,515,ENSG765,Gene2
3,ENST76,ENSP77,544,ENSG765,Gene2
3,ENST35,ENSP78,544,ENSG765,Gene2
4,ENST54,ENSP83,1864,ENSG48,Gene3
4,ENST93,ENSP36,722,ENSG48,Gene3
5,ENST54,ENSP83,1864,ENSG48,Gene3
5,ENST93,ENSP36,722,ENSG48,Gene3
6,ENST54,ENSP83,1864,ENSG48,Gene3
6,ENST93,ENSP36,722,ENSG48,Gene3
Assuming your multiple entries are separated with a semicolon ;, here is an awk version that does it.
BEGIN {
FS="[,]"
}
{
if ($0 ~ /^[0-9].*/) {
end_split_field = 0
for (f=2;f<=NF;f++) {
if ($f ~ /.*;.*/) {
end_split_field=f
}
}
if (end_split_field == 0) {
print $0
} else {
for (f=2;f<=end_split_field;f++) {
n = split($f, a, ";") #split and return the number
for (i=1;i<=n;i++) {
b[f, i] = a[i]
}
}
for (i=1;i<=n;i++) {
printf $1","
for (j=2;j<=end_split_field;j++) {
printf b[j, i]","
}
for (k=end_split_field+1;k<NF;k++) {
printf $k","
}
printf $NF"\n"
}
}
} else {
print $0
}
}
Save the content above as input.awk; example input and output:
$ cat input
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c;d,e;f,+
3,g;h;i,j;k;l,-
We can get the split output
$ awk -f input.awk input
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c,e,+
2,d,f,+
3,g,j,-
3,h,k,-
3,i,l,-
I was given a text file with a whole bunch of data sorted in columns. Each of the columns is
separated by commas.
How could I divide one column by another column and print the resulting answer? I am using Perl right now, so it has to be done in Perl. How could I do this?
This is what I have so far:
#!/usr/bin/perl
open (FILE, 'census2008.txt');
while (<FILE>) {
chomp;
($sumlev, $stname,$ctyname,$popestimate2008,$births2008,$deaths2008) = split(",");
}
close (FILE);
exit;
There are several options:
Read the file in line by line, split the columns on ',' and divide the relevant columns (don't forget to handle the divide-by-zero error)
Do the same thing as a one-liner:
$ perl -F/,/ -lane 'print( $F[1] == 0 ? "" : $F[3]/$F[1] )' file.txt
Utilize a ready-to-use CPAN module like Text::CSV
Of course, there are more unorthodox/crazy/unspeakable alternatives à la TMTOWTDI ™, so one could:
Parse out the relevant columns with a regex and divide the matches:
if (/^\d*,(\d+),\d*,(\d+)/) { say $2/$1 if $2 != 0; }
Do it with s///e:
$ perl -ple 's!^\d*,(\d+),\d*,(\d+).*$! $2 == 0 ? "" : $2/$1 !e' file.txt;
Get the shell to do the dirty work via backticks:
sub print_divide { say `cat file.txt | some_command_line_command` }
#!/usr/bin/env perl
# divides column 1 by column 2 of some ','-delimited file,
# read from standard input.
# usage:
# $ cat data.txt | 8458760.pl
while (<STDIN>) {
@values = split(/,/, $_);
print $values[0] / $values[1] . "\n";
}
If you have fixed width columns of data you could use 'unpack' along the lines of:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($sumlev,$stname,$ctyname,$popest,$births,$deaths)
= unpack("A2xA10xA15xA7xA5xA5");
printf "%-15s %4.2f\n", $ctyname, $births/$deaths;
}
__DATA__
10,Main ,My City , 10000, 200, 150
12,Poplar ,Somewhere , 3000, 90, 100
13,Maple ,Your Place , 9123, 100, 90