awk or sed to change column value in a file - linux

I have a csv file with data as follows
16:47:07,3,r-4-VM,230000000.,0.466028518635,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,0.50822578824,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.488406067907,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.467893525702,131072,0,0,0,0,0
I would like to shorten the value in the 5th column.
Desired output
16:47:07,3,r-4-VM,230000000.,0.46,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,0.50,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.48,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.46,131072,0,0,0,0,0
Your help is highly appreciated

awk '{$5=sprintf( "%.2g", $5)} 1' OFS=, FS=, input
This will round and print .47 instead of .46 on the first line, but perhaps that is desirable.
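If truncation rather than rounding is what's wanted, substr does that; a minimal sketch, assuming the field always has the 0.NNN... shape shown in the sample:

```shell
# Truncate (rather than round) the 5th field to 4 characters with substr;
# assumes values shaped like 0.NNN... as in the sample data.
printf '%s\n' '16:47:07,3,r-4-VM,230000000.,0.466028518635,131072,0,0,0,60,0' |
awk 'BEGIN{FS=OFS=","}{$5=substr($5,1,4)}1'
# 16:47:07,3,r-4-VM,230000000.,0.46,131072,0,0,0,60,0
```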

Try with this:
sed 's/\(^.*\)\(0\.[0-9][0-9]\)[0-9]*\(,.*\)/\1\2\3/g' filename
This writes the result to standard output, so
sed 's/\(^.*\)\(0\.[0-9][0-9]\)[0-9]*\(,.*\)/\1\2\3/g' filename > out_filename
will send the desired result to out_filename.

If rounding is not desired, i.e. 0.466028518635 needs to be printed as 0.46, use:
cat <input> | awk -F, '{$5=sprintf( "%.4s", $5)} 1' OFS=,
(This is another example of a useless use of cat.)
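The same %.4s truncation works with awk reading the file directly, no cat needed (here `input` is a placeholder for the real file name):

```shell
# %.4s keeps only the first four characters of the 5th field.
# "input" is a placeholder filename.
awk -F, -v OFS=, '{$5=sprintf("%.4s",$5)} 1' input
```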

If you want it in Perl, here it is:
perl -F, -lane '$F[4]=~s/^(\d+\...).*/$1/g;print join ",",@F' your_file
tested below:
> cat temp
16:47:07,3,r-4-VM,230000000.,0.466028518635,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,10.50822578824,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.488406067907,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.467893525702,131072,0,0,0,0,0
> perl -F, -lane '$F[4]=~s/^(\d+\...).*/$1/g;print join ",",@F' temp
16:47:07,3,r-4-VM,230000000.,0.46,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,10.50,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.48,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.46,131072,0,0,0,0,0

sed -r 's/^(([^,]+,){4}[^,]{4})[^,]*/\1/' file.csv

This might work for you (GNU sed):
sed -r 's/([^,]{,4})[^,]*/\1/5' file
This truncates the 5th run of non-comma characters (i.e. the 5th field) to no more than 4 characters.
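A quick check against the sample data (GNU sed is assumed, for -r and the {,4} interval):

```shell
# The numbered s///5 flag targets the 5th run of non-comma characters.
printf '%s\n' '16:47:14,3,r-4-VM,240000000.,0.488406067907,131072,0,0,32768,0,0' |
sed -r 's/([^,]{,4})[^,]*/\1/5'
# 16:47:14,3,r-4-VM,240000000.,0.48,131072,0,0,32768,0,0
```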

Related

Insert filename as column, separated by a comma

I have 100 files that look like this
>file.csv
gene1,55
gene2,23
gene3,33
I want to insert the filename and make it look like this:
file.csv
gene1,55,file.csv
gene2,23,file.csv
gene3,33,file.csv
Now, I can almost get there using awk
awk '{print $0,FILENAME}' *.csv > concatenated_files.csv
But this prints the filenames with a space, instead of a comma. Is there a way to replace the space with a comma?
Is there a way to replace the space with a comma?
Yes, change the OFS
$ awk -v OFS="," '{print $0,FILENAME}' file.csv
gene1,55,file.csv
gene2,23,file.csv
gene3,33,file.csv
Figured it out, turns out:
for d in *.csv; do (awk '{print FILENAME (NF?",":"") $0}' "$d" > ${d}.all_files.csv); done
Works just fine.
You can also create a new field
awk -vOFS=, '{$++NF=FILENAME}1' file.csv
gene1,55,file.csv
gene2,23,file.csv
gene3,33,file.csv
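Putting it together for many files at once (a sketch; a.csv and b.csv stand in for the hundred real files):

```shell
# Append each source filename as a final column and concatenate the results.
printf 'gene1,55\ngene2,23\n' > a.csv
printf 'gene3,33\n' > b.csv
awk -v OFS=, '{print $0, FILENAME}' a.csv b.csv
# gene1,55,a.csv
# gene2,23,a.csv
# gene3,33,b.csv
```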

Select subdomains using print command

cat a.txt
a.b.c.d.e.google.com
x.y.z.google.com
rev a.txt | awk -F. '{print $2,$3}' | rev
This is showing:
e google
x google
But I want this output
a.b.c.d.e.google
b.c.d.e.google
c.d.e.google
e.google
x.y.z.google
y.z.google
z.google
With your shown samples, please try the following awk code. Written and tested with GNU awk; it should work in any awk.
awk '
BEGIN{
FS=OFS="."
}
{
nf=NF
for(i=1;i<(nf-1);i++){
print
$1=""
sub(/^[[:space:]]*\./,"")
}
}
' Input_file
Here is one more awk solution:
awk -F. '{while (!/^[^.]+\.[^.]+$/) {print; sub(/^[^.]+\./, "")}}' file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using sed
$ sed -En 'p;:a;s/[^.]+\.(.*([^.]+\.){2}[[:alpha:]]+$)/\1/p;ta' input_file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using bash:
IFS=.
while read -ra a; do
for ((i=${#a[@]}; i>2; i--)); do
echo "${a[*]: -i}"
done
done < a.txt
Gives:
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
(I assume the lack of d.e.google.com in your expected output is typo?)
For a shorter and arguably simpler solution, you could use Perl.
To auto-split the line on the dot character into the @F array, and then print the range you want:
perl -F'\.' -lane 'print join(".", @F[0..$#F-1])' a.txt
-F'\.' auto-splits each input line into the @F array. It splits on the given regular expression, so the dot needs to be escaped to be taken literally.
$#F is the index of the last element of the array, so @F[0..$#F-1] is the range of elements from the first one ($F[0]) to the penultimate one. If you wanted to leave out both "google" and "com", you would use @F[0..$#F-2], etc.

filter out unrecognised fields using awk

I have a CSV file where I expect some values such as Y or N. Folks are adding comments or arbitrary entries such as NA? that I want to remove:
Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,
I can use gsub to remove things that I am anticipating such as:
$ cat test.csv | awk '{gsub("NA\\?", ""); gsub("NA \\?",""); gsub("TBD", ""); print}'
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Yet that will break if someone adds a new comment. I am looking for a regex to generalise the match as "not Y".
I tried some negative look arounds but couldn't get it to work on the awk that I have which is GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.1, GNU MP 6.1.2). Thanks in advance!
awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if ($i !~ /^(y|Y|n|N)$/) $i="";print}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Accepting only Y/N (case-insensitive).
awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}'
This seems to do the trick. Loops through the 3rd through the last field, and if the field isn't Y, it's replaced with nothing. Since we're modifying fields we need to set OFS as well.
$ cat file.txt
Create,20055776,Y,,Y,Y,,Y,,NA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,NA ?,,,Y,,,,,,TBD,,,,,,,,,
$ awk 'BEGIN{OFS=FS=","}{for(i=3;i<=NF;i++){if($i!~/^[Y]$/){$i=""}}; print;}' file.txt
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
If you wanted to accept "N" too, /^[YN]$/ would work.
cat test.CSV | awk 'BEGIN{FS=OFS=","}{for (i=3;i<=NF;i++) if($i != "Y") $i=""; print}'
Output:
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
Update: there's no need for a regex if you simply want to test whether a field is exactly "Y".
However, if you do want to use a regex, zzevannn's answer and tink's answer already give good per-field conditions, so I'll give a batch replace by regex instead.
To be exact, and to increase the challenge, I created some boundary conditions:
$ cat test.CSV
Create,20055776,Y,,Y,Y,,Y,,YNA?,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,YN.Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,NANN,,,,,Y,,,NA ?Y,,,Y,,,,,,TYBD,,,,,,,,,
And the batch replace is:
$ awk 'BEGIN{FS=OFS=","}{fst=$1;sub($1 FS,"");print fst,gensub("(,)[^,]*[^Y,]+[^,]*","\\1","g",$0);}' test.CSV
Create,20055776,Y,,Y,Y,,Y,,,,Y,,Y,Y,,Y,,,Y,,Y,,,Y,,,,,,,,
Create,20055777,,,,Y,Y,,Y,,,,Y,,Y,Y,,,,,Y,,Y,,,Y,,,,,,,,
Create,20055779,,Y,,,,,,,,Y,,,,,,Y,,,,,,,,,,,,,,,
"(,)[^,]*[^Y,]+[^,]*" matches anything between two commas that is other than a single Y.
Note that $1 and the comma after it are saved into fst and stripped first, then printed back at the end; this leaves the ID in $2 at the start of the string, where the pattern (which requires a leading comma) cannot clobber it.
sed solution
# POSIX
sed -e ':a' -e 's/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1,/;t a' test.csv
# GNU
sed ':a;s/\(^Create,[0-9]*\(,Y\{0,1\}\)*\),[^Y,][^,]*/\1,/;ta' test.csv
(The trailing comma in the replacement blanks the invalid field instead of deleting it, so the column count is preserved.)
awk on the same concept (avoiding sed's lack of alternation in basic regular expressions); invalid entries are blanked rather than deleted, and $2 is saved and restored because it would otherwise match the junk pattern:
awk -F ',' '{Idx=$2; gsub(/,[[:blank:]]*[^,]*[^YN,][^,]*/, ","); sub(/,/, "," Idx); print}'

How can I show only some words in a line using sed?

I'm trying to use sed to show only the 1st, 2nd, and 8th word in a line.
The problem I have is that the words are random, and the number of spaces between the words is also random... For example:
QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI
Is there a way to get this to output as just the 1st, 2nd, and 8th words:
QST334 FFR67 QD112
Thanks for any advice or hints for the right direction!
Use awk
awk '{print $1,$2,$8}' file
In action:
$ echo "QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI" | awk '{print $1,$2,$8}'
QST334 FFR67 QD112
You do not really need to put " " between two columns as mentioned in another answer. By default, awk uses a single space as the output field separator (OFS), so you just need commas between the desired columns.
So the following is enough:
awk '{print $1,$2,$8}' file
For Example:
echo "QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI" |awk '{print $1,$2,$8}'
QST334 FFR67 QD112
However, if you wish to have some other OFS then you can do as follow:
echo "QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI" |awk -v OFS="," '{print $1,$2,$8}'
QST334,FFR67,QD112
Note that this will put a comma between the output columns.
Another solution is to use the cut command:
cut --delimiter '<delimiter-character>' --fields <field> <file>
Where:
'<delimiter-character>': the delimiter on which the string should be split.
<field>: specifies which column(s) to output; this could be a single column 1, multiple columns 1,3, or a range of them 1-3.
In action:
cut -d ' ' -f 1,2,8 /path/to/file
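Since the question says the runs of spaces are of random length, and cut treats every single space as a field boundary, squeezing repeats with tr -s first keeps the field numbering stable:

```shell
# Squeeze repeated spaces down to one, then pick fields 1, 2 and 8.
echo 'QST334  FFR67   HHYT 87UYU HYHL 9876S NJI QD112 989OPI' |
tr -s ' ' | cut -d ' ' -f 1,2,8
# QST334 FFR67 QD112
```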
This might work for you (GNU sed):
sed 's/\s\+/\n/g;s/.*/echo "&"|sed -n "1p;2p;8p"/e;y/\n/ /' file
Convert spaces to newlines. Evaluate each line as a separate file and print only the required lines i.e. fields. Replace remaining newlines with spaces.

Extracting word after fixed word with awk

I have a file file.txt containing a very long line:
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797807950|mar0101|0|00000106829DAE7F3FAB187550B920530C00|0|0|4000018001000002||962797807950|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|1|||||||||||||0|0|||472|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|252|tid{111211344662580792}pfid{10}gob{1}rid{globitel} afid{}uid1{962797807950}aid1{1}ar1{100}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC RESERVE AMOUNT 10000}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{100}ctr{StaffLine}ftksn{JMT}ftksr{0001}ftktp{PayCall Ticket}||
I want to print only the word after "ctr" in this file, which is "StaffLine",
and I don't know how many characters are in this word.
I've tried:
awk '{comp[substr("ctr",0)]{print}}'
but it didn't work. How can I get hold of that word?
Here's one way using awk:
awk -F "[{}]" '{ for(i=1;i<=NF;i++) if ($i == "ctr") print $(i+1) }' file
Or if your version of grep supports Perl-like regex:
grep -oP "(?<=ctr{)[^}]+" file
Results:
StaffLine
Using sed:
sed 's/.*}ctr{\([^}]*\).*/\1/' input
One way of dealing with it is with sed:
sed -e 's/.*}ctr{//; s/}.*//' file.txt
This deletes everything up to and including the { after the word ctr (avoiding issues with any words which have ctr as a suffix, such as a hypothetical pxctr{Bogus} entry); it then deletes anything from the first remaining } onwards, leaving just StaffLine on the sample data.
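The same two-step sed wraps naturally into a small shell function so any key{value} tag can be pulled out (extract_tag is an illustrative name, not part of the original answer; like the sed above, it assumes the key is preceded by a closing brace, as ctr is):

```shell
# extract_tag KEY FILE: print the value inside KEY{...}.
# Matching "}KEY{" (not just "KEY{") avoids keys that merely end in KEY.
extract_tag() {
  sed -e "s/.*}$1{//" -e 's/}.*//' "$2"
}
```

For the sample line, `extract_tag ctr file.txt` prints StaffLine.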
perl -lne '$_=m/.*ctr{([^}]*)}.*/;print $1' your_file
tested below:
> cat temp
1|34|2012.12.01 00:08:35|12|4|921-*203-0000000000-962797807950|mar0101|0|00000106829DAE7F3FAB187550B920530C00|0|0|4000018001000002||962797807950|||||-1|||||-1||-1|0||||0||||||-1|-1|||-1|0|-1|-1|-1|2012.12.01 00:08:35|1|0||-1|1|||||||||||||0|0|||472|0|12|-2147483648|-2147483648|-2147483648|-2147483648|||||||||||||||||||||||||0|||0||1|6|252|tid{111211344662580792}pfid{10}gob{1}rid{globitel} afid{}uid1{962797807950}aid1{1}ar1{100}uid2{globitel}aid2{-1}pid{1234}pur{!GDRC RESERVE AMOUNT 10000}ratinf{}rec{0}rots{0}tda{}mid{}exd{0}reqa{100}ctr{StaffLine}ftksn{JMT}ftksr{0001}ftktp{PayCall Ticket}||
> perl -lne '$_=m/.*ctr{([^}]*)}.*/;print $1' temp
StaffLine
>
