Insert filename as column, separated by a comma - linux

I have 100 files that look like this:
file.csv
gene1,55
gene2,23
gene3,33
I want to insert the filename and make it look like this:
file.csv
gene1,55,file.csv
gene2,23,file.csv
gene3,33,file.csv
Now, I can almost get there using awk:
awk '{print $0,FILENAME}' *.csv > concatenated_files.csv
But this prints the filenames with a space, instead of a comma. Is there a way to replace the space with a comma?

Is there a way to replace the space with a comma?
Yes, change the OFS (output field separator):
$ awk -v OFS="," '{print $0,FILENAME}' file.csv
gene1,55,file.csv
gene2,23,file.csv
gene3,33,file.csv

Figured it out, turns out:
for d in *.csv; do awk '{print FILENAME (NF?",":"") $0}' "$d" > "${d}.all_files.csv"; done
Works just fine. (Note this prepends the filename as the first field rather than appending it.)

You can also create a new field
awk -vOFS=, '{$++NF=FILENAME}1' file.csv
gene1,55,file.csv
gene2,23,file.csv
gene3,33,file.csv
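For what it's worth, a one-pass sketch of mine that combines the OFS answer above with the concatenation command from the question; writing the result under a non-.csv name (my choice, not from the question) keeps the output file from matching the *.csv glob on a second run:
# Append FILENAME as a comma-separated last column, one pass over all inputs
awk -v OFS=',' '{print $0, FILENAME}' *.csv > concatenated_files.txt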


How can I show only some words in a line using sed?

I'm trying to use sed to show only the 1st, 2nd, and 8th words in a line.
The problem I have is that the words are random, and the number of spaces between the words is also random... For example:
QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI
Is there a way to get this to output as just the 1st, 2nd, and 8th words:
QST334 FFR67 QD112
Thanks for any advice or hints for the right direction!
Use awk
awk '{print $1,$2,$8}' file
In action:
$ echo "QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI" | awk '{print $1,$2,$8}'
QST334 FFR67 QD112
You do not really need to put " " between two columns as mentioned in another answer. By default, awk uses a single space as the output field separator (OFS), so you just need commas between the desired columns.
So the following is enough:
awk '{print $1,$2,$8}' file
For Example:
echo "QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI" |awk '{print $1,$2,$8}'
QST334 FFR67 QD112
However, if you wish to have some other OFS, you can do as follows:
echo "QST334 FFR67 HHYT 87UYU HYHL 9876S NJI QD112 989OPI" |awk -v OFS="," '{print $1,$2,$8}'
QST334,FFR67,QD112
Note that this will put a comma between the output columns.
Another solution is to use the cut command:
cut --delimiter '<delimiter-character>' --fields <field> <file>
Where:
'<delimiter-character>': the delimiter on which the string should be split.
<field>: specifies which column(s) to output; this can be a single column (1), multiple columns (1,3), or a range of them (1-3).
In action:
cut -d ' ' -f 1-3 /path/to/file
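One caveat, since the question says the number of spaces is random: cut treats every single space as a field boundary, so repeated spaces throw the field numbering off. Squeezing the runs first with tr -s (a workaround of mine, not part of the answer above) restores the expected numbering:
$ echo "QST334 FFR67  HHYT 87UYU HYHL  9876S NJI QD112 989OPI" | tr -s ' ' | cut -d ' ' -f 1,2,8
QST334 FFR67 QD112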
This might work for you (GNU sed):
sed 's/\s\+/\n/g;s/.*/echo "&"|sed -n "1p;2p;8p"/e;y/\n/ /' file
Convert spaces to newlines. Evaluate each line as a separate file and print only the required lines i.e. fields. Replace remaining newlines with spaces.

Replace comma between two characters

I have a txt file and I need to replace commas with spaces, but only between quotation marks.
For example:
This,is,example,"need,delete comma",xxxx
And the result should be:
This,is,example,"need delete comma",xxxx
I have this command, but it's wrong:
sed -i '/^"/,/^"/s/.,/ /' output.txt
Try this:
awk 'NR%2-1{gsub(/,/," ")}1' RS=\" ORS=\" input.txt > output.txt
Input:
This,is,example,"need,delete comma",xxxx
Output:
This,is,example,"need delete comma",xxxx
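To see why this works (a small demo of mine, not part of the answer): with the quote as record separator, records alternate between outside-quotes (odd NR) and inside-quotes (even NR), and NR%2-1 is non-zero, i.e. true, only for the even ones:
$ printf 'This,is,"need,delete comma",xxxx' | awk -v RS='"' '{print NR": "$0}'
1: This,is,
2: need,delete comma
3: ,xxxx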
Complex awk solution:
Sample testfile:
This,is,example,"need,delete comma",xxxx
asda
asd.dasd,asd"sdf,dd","sdf,sdfsdf"
"some,text,here" another text there""
The job:
awk -F'"' '$0~/"/ && NF>1{ for(i=1;i<=NF;i++) { if(!(i%2)) gsub(/,/," ",$i) }}1' OFS='"' testfile
The output:
This,is,example,"need delete comma",xxxx
asda
asd.dasd,asd"sdf dd","sdf sdfsdf"
"some text here" another text there""
How about the following awk: it looks for the match between "...", removes all commas inside that match, and then substitutes the new value back in place of the old quoted text.
awk '{match($0,/\".*\"/);val=substr($0,RSTART,RLENGTH);gsub(/,/," ",val);gsub(/\".*\"/,val,$0)} 1' Input_file
EDIT1: After seeing RomanPerekhrest's Input_file, a small change to the above code that prevents it from changing "," anywhere else in a line.
awk '{match($0,/\".*\"/);val=substr($0,RSTART,RLENGTH);gsub(/[^","],/," ",val);gsub(/\".*\"/,val,$0)} 1' Input_file
echo 'This,is,example,"need,delete comma",xxxx' |awk -F\" '{sub(/,/," ",$2); print}' OFS=\"
This,is,example,"need delete comma",xxxx
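Since the question originally reached for sed, here is a hedged sed take of mine (GNU sed with -E): loop with the t command, removing one comma per pass from a quoted span. It assumes well-formed CSV where each opening quote sits at the start of the line or right after a comma:
# :a ... ta loops until no comma is left between a quote pair
sed -E ':a; s/((^|,)"[^",]*),([^"]*")/\1 \3/; ta' file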

Comma separated value within double quote

I have a data file separated by commas, with the data enclosed in double quotes:
$ head file.txt
"HD","Sep 13 2016 1:05AM","0001"
"DT","273093045","192534"
"DT","273097637","192534" ..
I want to get the 3rd column value (0001) to be assigned to my variable.
I tried
FILE_VER=`cat file.txt | awk -F',' '{if ($1 == "HD") print $3}'`
I don't get any value assigned to FILE_VER. Please help me with the correct syntax.
Another awk version:
awk -F'"' '$2 == "HD"{print $6}' file
You were almost there. Simply removing the quotes should be good enough:
foo=$(awk -F, '$1=="\"HD\""{gsub(/"/,"",$3);print $3}' file)
Not sure this is the most optimal way, but it works:
FILE_VER=$(awk -F',' '$1 == "\"HD\"" {gsub("\"","",$3); print $3}' file.txt)
test for HD between quotes
remove quotes before printing result
You can strip the quotes with tr and then split on the comma:
tr -d '"' < filename | awk -F',' '{print $3}'
Maybe there is a solution using only awk, but this works just fine!
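Putting the -F'"' answer above together with the variable assignment from the question (the exit after the first match is my addition, so awk stops reading once it has the value):
FILE_VER=$(awk -F'"' '$2 == "HD"{print $6; exit}' file.txt)
echo "$FILE_VER"    # 0001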

Linux awk with condition

I have a very large file (2.5M records) with 2 columns separated by |.
I would like to keep all records that do not contain the value "-1" in the second column, and write them to a new file.
I tried to use:
grep -v "-1" norm_cats_21_07_assignments.psv > norm_cats_21_07_assignments.psv
but no luck.
For quick and dirty solution, you can simply add | to your grep:
grep -v "|-1" input.psv > output.psv
This assumes that rows to be ignored look like
something|-1
Note that if you ever need to use grep -v "-1", you have to add -- after options, otherwise grep will treat -1 as an option, something like this:
grep -v -- "-1"
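For example (a quick demo of mine):
$ printf 'a|-1\nb|2\n' | grep -v -- '-1'
b|2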
You could do this through awk:
awk -F"|" '$2~/^-1$/{next}1' file > newfile
Example:
$ cat r
foo|-1
foo|bar
$ awk -F"|" '$2~/^-1$/{next}1' r
foo|bar
You can have:
awk -F'|' '$2 != "-1"' file.psv > new_file.psv
Or
awk -F'|' '$2 !~ /-1/' file.psv > new_file.psv
!= compares against the whole column, while !~ only needs to match part of it.
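A quick demo of the difference (mine, with a -12 row added to show the contrast):
$ printf 'a|-1\nb|-12\nc|2\n' | awk -F'|' '$2 != "-1"'
b|-12
c|2
$ printf 'a|-1\nb|-12\nc|2\n' | awk -F'|' '$2 !~ /-1/'
c|2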
Edit: Just noticed that your input file and output file are the same. You can't do that: the output file, being the same file, gets truncated by the shell before awk even starts reading it.
With awk, after making the new filtered file (e.g. new_file.psv), you can save it back with cat new_file.psv > file.psv or mv new_file.psv file.psv.
But if you really do have exactly 2 columns separated by |, with no spaces in between, no surrounding quotes, etc., you can just edit the file in place with sed:
sed -i '/|-1/d' file.psv
Or perhaps something equivalent to awk -F'|' '$2 !~ /-1/':
sed -i '/|.*-1/d' file.psv

awk or sed to change column value in a file

I have a csv file with data as follows
16:47:07,3,r-4-VM,230000000.,0.466028518635,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,0.50822578824,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.488406067907,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.467893525702,131072,0,0,0,0,0
I would like to shorten the value in the 5th column.
Desired output
16:47:07,3,r-4-VM,230000000.,0.46,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,0.50,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.48,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.46,131072,0,0,0,0,0
Your help is highly appreciated
awk '{$5=sprintf( "%.2g", $5)} 1' OFS=, FS=, input
This will round and print .47 instead of .46 on the first line, but perhaps that is desirable.
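You can check the rounding on its own (a one-liner of mine):
$ echo 0.466028518635 | awk '{printf "%.2g\n", $1}'
0.47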
Try with this:
cat filename | sed 's/\(^.*\)\(0\.[0-9][0-9]\)[0-9]*\(,.*\)/\1\2\3/g'
So far the output only goes to standard output, so
cat filename | sed 's/\(^.*\)\(0\.[0-9][0-9]\)[0-9]*\(,.*\)/\1\2\3/g' > out_filename
will send the desired result to out_filename
If rounding is not desired, i.e. 0.466028518635 needs to be printed as 0.46, use:
cat <input> | awk -F, '{$5=sprintf( "%.4s", $5)} 1' OFS=,
(This is another example of a Useless Use of cat.)
If you want it in perl, this is it:
perl -F, -lane '$F[4]=~s/^(\d+\...).*/$1/g;print join ",",@F' your_file
tested below:
> cat temp
16:47:07,3,r-4-VM,230000000.,0.466028518635,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,10.50822578824,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.488406067907,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.467893525702,131072,0,0,0,0,0
> perl -F, -lane '$F[4]=~s/^(\d+\...).*/$1/g;print join ",",@F' temp
16:47:07,3,r-4-VM,230000000.,0.46,131072,0,0,0,60,0
16:47:11,3,r-4-VM,250000000.,10.50,131072,0,0,0,0,0
16:47:14,3,r-4-VM,240000000.,0.48,131072,0,0,32768,0,0
16:47:17,3,r-4-VM,230000000.,0.46,131072,0,0,0,0,0
sed -r 's/^(([^,]+,){4}[^,]{4})[^,]*/\1/' file.csv
This might work for you (GNU sed):
sed -r 's/([^,]{,4})[^,]*/\1/5' file
This truncates the 5th run of non-comma characters to at most 4 characters.
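One more hedged variant of mine, in case the integer part is not always a single digit (as in the perl test data above): truncate, without rounding, to two characters past wherever the decimal point falls. It assumes field 5 always contains a decimal point:
# keep everything up to 2 chars after the "." in field 5
awk -F, -v OFS=, '{$5=substr($5, 1, index($5, ".")+2)} 1' file.csv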
