Removing Fields from file keeping delimiter intact - linux

We have a requirement to remove certain fields from a delimited file using a shell script, but we do not want to lose the delimiters. i.e.
$cat file1
col1,col2,col3,col4,col5
1234,4567,7890,9876,6754
abcd,efgh,ijkl,rtyt,cvbn
Now we need to generate a different file (say file2) out of file1 with the second and fourth columns removed but the delimiters (,) intact, i.e.
$cat file2
col1,,col3,,col5
1234,,7890,,6754
abcd,,ijkl,,cvbn
Please suggest the easiest and most efficient way of achieving this. Also, as the file has around 300 fields/columns, AWK is not working because of its limitation on the number of fields.

awk '{$2 = $4 = ""}1' FS=, OFS=, file1 > file2

awk 'BEGIN { FS = OFS = ","; } { $2 = $4 = ""; print }' file1 > file2
Put simply, this:
sets the input and output field separators to , at the start,
empties fields 2 & 4 for each line, and then prints the line back.
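If you need to blank a longer, arbitrary list of columns (the question mentions around 300 fields), the same idea generalizes. A minimal sketch, with the column list hard-coded for illustration:
awk 'BEGIN {
       FS = OFS = ","
       split("2 4", del, " ")            # columns to blank; extend as needed
       for (i in del) drop[del[i]]
     }
     { for (i = 1; i <= NF; i++) if (i in drop) $i = ""; print }' file1 > file2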

How to print the value in the third column of the line that comes after a line containing a specific string, using AWK, to a different file?

I have an output which contains something like this in the middle.
Stopping criterion = max iterations
Energy initial, next-to-last, final =
-83909.5503696 -86748.8150981 -86748.8512012
What I am trying to do is print the last value (3rd column) of the line after the line which contains the string "Energy" to a different file, and I have to print out these values from 100 different files. Currently I have been trying with this line, which only looks at a single file:
awk -F: '/Energy/ { getline; print $0 }' inputfile > outputfile
but this gives output like:
-83909.5503696 -86748.8150981 -86748.8512012
Update - With the help of a suggestion below I was able to output the value to a file, but as it reads through different files it overwrites the output file and only keeps the value from the last file it read. What I tried was this:
#SBATCH --array=1-100
num=$SLURM_ARRAY_TASK_ID
fold=$(printf '%03d' $num)
cd $main_path/surf_$fold
awk 'f{print $3; f=0} /Energy/{f=1}' inputfile > outputfile
This would not be an appropriate job for getline (see http://awk.freeshell.org/AllAboutGetline), and it's not clear why you're setting FS to : with -F: when your fields are space-separated, as awk assumes by default.
Here's how to do what I think you're trying to do with 1 call to awk:
awk 'f{print $3; f=0} /Energy/{f=1}' "$main_path/surf_"*"/inputfile" > outputfile
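The flag idiom is worth spelling out; here is the same one-liner with comments added:
awk '
  f        { print $3; f=0 }   # the previous line matched: print this line's 3rd field, clear the flag
  /Energy/ { f=1 }             # this line contains "Energy": set the flag for the next line
' "$main_path/surf_"*"/inputfile" > outputfile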

Grouping related rows of data into a single column in Linux

I have a csv file that is generated automatically every day, with output similar to the following example:
"N","3.5",3,"Bob","10/29/17"
"Y","4.5",5,"Bob","10/11/18"
"Y","5",6,"Bob","10/28/18"
"Y","3",1,"Jim",
"N","4",2,"Jim","09/29/17"
"N","2.5",4,"Joe","01/26/18"
I need to transform the text so that it is grouped by person (the fourth column), with all of a person's records on a single row, and the columns repeated using the same sequence: 1,2,3,5. Some cells may be missing data but must remain in the sequence so the columns line up. So the output I need will look like this:
"Bob","N","3.5",3,"10/29/17","Y","4.5",5,"10/11/18","Y","5",6,"10/28/18"
"Jim","Y","3",1,,"N","4",2,"09/29/17"
"Joe","N","2.5",4,"01/26/18"
I am open to using sed, awk, or pretty much any standard Linux command to get this task done. I've been trying to use awk, and though I get close, I can't figure out how to finish it.
Here is the command where I'm close. It lists the header and the names, but no other data:
awk -F"," 'NR==1; NR>1 {a[$4]=a[$4] ? i : ""} END {for (i in a) {print i}}' test2.csv
You need a little more code:
$ awk 'BEGIN {FS=OFS=","}
{k=$4; $4=$5; NF--; a[k]=(k in a?a[k] FS $0:$0)}
END {for(k in a) print k,a[k]}' file
"Bob","N","3.5",3,"10/29/17" ,"Y","4.5",5,"10/11/18" ,"Y","5",6,"10/28/18"
"Jim","Y","3",1, ,"N","4",2,"09/29/17"
"Joe","N","2.5",4,"01/26/18"
Note that the NF-- trick may not work in all awks.
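If your awk doesn't rebuild the record when NF is decremented, a more portable sketch assembles the kept fields explicitly:
awk 'BEGIN {FS=OFS=","}
     {k=$4; rec=$1 OFS $2 OFS $3 OFS $5; a[k]=(k in a ? a[k] OFS rec : rec)}
     END {for (k in a) print k, a[k]}' file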
You could also try the following, which reads the Input_file twice; it prints the output in the same order in which the 4th-column values first appear in the Input_file.
awk '
BEGIN{
  FS=OFS=","
}
FNR==NR{                   # first pass: append each row (minus the name) to its group
  a[$4]=a[$4]?a[$4] OFS $1 OFS $2 OFS $3 OFS $5:$4 OFS $1 OFS $2 OFS $3 OFS $5
  next
}
a[$4]{                     # second pass: print each group the first time its name reappears
  print a[$4]
  delete a[$4]
}
' Input_file Input_file
If there is any chance that any of the CSV values contains a comma, then a "CSV-aware" tool would be advisable to obtain a reliable but straightforward solution.
One approach would be to use one of the many readily available csv2tsv command-line tools. A variety of elegant solutions then becomes possible. For example, one could pipe the CSV into csv2tsv, awk, and tsv2csv.
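A minimal sketch of that pipeline, assuming csv2tsv and tsv2csv commands are on the PATH (several implementations exist) and reusing the grouping logic from above with tab separators:
csv2tsv < input.csv |
awk 'BEGIN{FS=OFS="\t"}
     {k=$4; rec=$1 OFS $2 OFS $3 OFS $5; a[k]=(k in a ? a[k] OFS rec : rec)}
     END{for (k in a) print k, a[k]}' |
tsv2csv > output.csv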
Here is another solution that uses csv2tsv and jq:
csv2tsv < input.csv | jq -Rrn '
  [inputs | split("\t")]
  | group_by(.[3])[]
  | sort_by(.[2])
  | [.[0][3]] + ( map( del(.[3]) ) | add )
  | @csv
'
This produces:
"Bob","N","3.5","3","10/29/17 ","Y","4.5","5","10/11/18 ","Y","5","6","10/28/18 "
"Jim","Y","3","1"," ","N","4","2","09/29/17 "
"Joe","N","2.5","4","01/26/18"
Trimming the excess spaces is left as an exercise :-)
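If those spaces are a nuisance, one way to trim them inside the same jq filter is to rewrite each field before @csv (a sketch, assuming the excess spaces are trailing):
csv2tsv < input.csv | jq -Rrn '
  [inputs | split("\t")]
  | group_by(.[3])[]
  | sort_by(.[2])
  | [.[0][3]] + ( map( del(.[3]) ) | add )
  | map(gsub(" +$"; ""))      # strip trailing spaces from every field
  | @csv
'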

How to merge 2 tables with awk

First of all, sorry for my English; I know there are a lot of topics about AWK already, but it is a very difficult tool for me...
I would like to merge two tables using common columns with awk. The tables differ in the number of rows. I have a first table that I want to modify and a second that serves as a reference table. I would like to compare my column1.F1 with my column1.F2. When it matches, add the column2.F2 to my file1. But I need to keep all the lines in file1.
I give you an example:
File1
Num_id,Name,description1,description2,description3
?,atlanta_1,,,
RO_5,babeni_SW,,,
? ,Bib1,,,
RO_9,BoUba_456,,,
?,Castor,,,
File2
official_Num_id,official_Name
RO_1,America
RO_2,Andre
RO_3,Atlanta
RO_4,Axa
RO_5,Babeni
RO_6,Barba
RO_7,Bib
RO_8,Bilbao
RO_9,Bouba
RO_10,Castor
File3
Num_id,Name,description1,description2,description3,official_Name
?,atlanta_1,,,
RO_5,babeni_SW,,,Babeni
?,Bib1,,,
RO_9,BoUba_456,,,Bouba
?,Castor,,,
I read a lot of solutions on the Internet and it seems that awk could work...
I tried awk 'NR==FNR {h[$1] = $2; next} {print $0,h[$1]}' $File1 $File2 > file3
But my command doesn't work; my File3 looks exactly like File1.
As a second step, I don't know if it's possible to compare my two second columns when the names differ, like atlanta_1 and Atlanta, and add the official_Num_id and the official_Name to my File1.
Any hero over there?
You had it, except for two small things. First, you need to set your field separators to , and, second, reverse the order of your input files on the command line so that the reference file is processed first:
$ awk 'BEGIN {FS=OFS=","} NR==FNR {h[$1] = $2; next} {print $0,h[$1]}' File2 File1
Num_id,Name,description1,description2,description3,
?,atlanta_1,,,,
RO_5,babeni_SW,,,,Babeni
? ,Bib1,,,,
RO_9,BoUba_456,,,,Bouba
?,Castor,,,,
You can also use the join command for this:
join --header --nocheck-order -t, -1 1 -2 1 -a 1 file1 file2
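One caveat: join expects both inputs sorted on the join field, and --nocheck-order merely silences the warning. If the files may be unsorted, a safer sketch sorts them on the fly while keeping the headers in place (bash process substitution):
join --header -t, -a 1 \
  <(head -n1 file1; tail -n+2 file1 | sort -t, -k1,1) \
  <(head -n1 file2; tail -n+2 file2 | sort -t, -k1,1)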
To answer your question whether it's possible to compare the two second columns when the names differ, like atlanta_1 and Atlanta, and add the official_Num_id and the official_Name to File1:
$ awk '
BEGIN { FS=OFS="," }
NR==FNR {                          # file2
  a[tolower($2)]=$0                # hash on lowercase city
  next
}
{                                  # file1
  split($2,b,"[^[:alpha:]]")       # split on non-alphabet
  print $0 (tolower(b[1]) in a?OFS a[tolower(b[1])]:"")
}' file2 file1
Num_id,Name,description1,description2,description3
?,atlanta_1,,,,RO_3,Atlanta
RO_5,babeni_SW,,,,RO_5,Babeni
? ,Bib1,,,,RO_7,Bib
RO_9,BoUba_456,,,,RO_9,Bouba
?,Castor,,,,RO_10,Castor
split splits the Name field on non-alphabetic characters, i.e. the _ in atlanta_1, the 1 in Bib1, etc., so it might fail on cities with dashes and the like; edit the pattern [^[:alpha:]] in split accordingly. The header line doesn't match any of those names, so rethink the header names.
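For instance, to tolerate dashes inside city names, the pattern in the split line above could be widened like this (an illustrative tweak, not a tested rule):
split($2, b, "[^[:alpha:]-]")   # split on anything that is neither a letter nor a dash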

How to Split a Delimited Text file in Linux, based on number of records, which has end-of-record separator in data fields

Problem Statement:
I have a delimited text file offloaded from Teradata which happens to have "\n" (newline characters or EOL markers) inside data fields.
The same EOL marker also appears at the end of each full line, i.e. at the end of each record.
I need to split this file into two or more files (based on a number of records given by me) while retaining the newline characters inside data fields, splitting only at the line breaks at the end of each record.
Example:
1|Alan
Wake|15
2|Nathan
Drake|10
3|Gordon
Freeman|11
Expectation :
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11
What I have tried:
awk 'BEGIN{RS="\n"}NR%2==1{x="SplitF"++i;}{print > x}' inputfile.txt
The code can't distinguish between newlines inside data fields and the actual newlines that end a record. Is there a way this can be achieved?
EDIT: I have changed the problem statement with an example. Please share your thoughts on the new example.
Use the following awk approach:
awk '{ r=(r!="")?r RS $0 : $0; if(NR%4==0){ print r > ("file" ++i ".txt"); r="" } }
END{ if(r) print r > ("file" ++i ".txt") }' inputfile.txt
NR%4==0 - each logical record occupies two physical lines, and we want two records per file, so we split after every 4 physical lines.
Results:
> cat file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
> cat file2.txt
3|Gordon
Freeman|11
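More generally, if each logical record spans k physical lines and you want m records per file, the same idea can be parameterized; a sketch (k, m, and the output prefix are illustrative):
awk -v k=2 -v m=2 '
  { r = (r != "" ? r RS $0 : $0) }                          # accumulate physical lines
  NR % (k*m) == 0 { print r > ("file" ++i ".txt"); r="" }   # flush a full chunk
  END { if (r != "") print r > ("file" ++i ".txt") }        # flush any remainder
' inputfile.txt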
If you are using GNU awk you can do this by setting RS appropriately, e.g.:
parse.awk
BEGIN { RS="[0-9]\\|" }

# Skip the empty first record by checking NF (Note: this will also skip
# any empty records later in the input)
NF {
  # Send the record, prefixed with its key, to the current numbered file
  printf("%s", d $0) > ("file" i ".txt")
}

# When we have found enough records, close the current file and
# prepare i for opening the next one
#
# Note: NR-1 because of the empty first record
(NR-1)%n == 0 {
  close("file" i ".txt")
  i++
}

# Remember the record key in d, again
# because of the empty first record
{ d=RT }
Run it like this:
gawk -f parse.awk n=2 infile
Where n is the number of records to put into each file.
Output:
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11

How to Compare CSV Column using awk?

I receive a CSV like this:
column$1,column$2,column$3
john,P,10
john,P,10
john,A,20
john,T,30
john,T,10
marc,P,10
marc,C,10
marc,C,20
marc,T,30
marc,A,10
I need to sum the values and display the name and results, but column$2 needs to show the sum of the T values separated from the P, A, and C values.
Output should be this:
column$1,column$2,column$3,column$4
john,PCA,40
john,T,40,CORRECT
marc,PCA,50
marc,T,30,INCORRECT
All I could do was extract the columns I need from the original csv:
awk -F "|" '{print $8 "|" $9 "|" $4}' input.csv >> output.csv
Also sort by the correct column:
sort -t "|" -k1 input.csv >> output.csv
And add a new column to the end of the csv:
awk -F, '{NF=2}1' OFS="|" input.csv >> output.csv
I managed to sum and display the sum by column$1 and $2, but I don't know how to group the different values from column$2:
awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}' file > output
Awk is stream oriented. It processes input and outputs what you change. It does not do in-file changes.
You just need to add a corresponding print:
awk '{if($2 == "T") {print "MATCHED"}}'
If you want to output more than "MATCHED", you need to add the other fields to the print,
e.g. '{print $1 "|" $2 "|" $3 "|" " MATCHED"}'
or use print $0, as a comment mentions above.
Assuming that "CORRECT" and "INCORRECT" are determined by comparing the "PCA" value to the "T" value, the following awk script should do the trick:
awk -F, -vOFS=, '$2=="T"{t[$1]+=$3;n[$1]} $2!="T"{s[$1]+=$3;n[$1]} END{ for(i in n){print i,"PCA",s[i]; print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")} }' inputfile
Broken out for easier reading, here's what this looks like:
awk -F, -vOFS=, '
  $2=="T" {          # match all records that are "T"
    t[$1]+=$3        # add the value for this record to an array of totals
    n[$1]            # record this name in our authoritative name list
  }
  $2!="T" {          # match all records that are NOT "T"
    s[$1]+=$3        # add the value for this record to an array of sums
    n[$1]            # record this name too
  }
  END {              # Now that we have collected data, analyse the results
    for (i in n) {   # step through our authoritative list of names
      print i,"PCA",s[i]
      print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")
    }
  }
' inputfile
Note that array order is not guaranteed in awk, so your output may not come out in the same order as your input.
If you want your output to be delimited using vertical bars, change the -vOFS=, to -vOFS='|'.
Then you can sort using:
awk ... | sort
which by default sorts on the entire line (equivalent to -k1).
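For example, to pin down the order and emit the header row from the expected output, you could print the header separately and sort the awk output; a sketch with illustrative file names:
{
  echo 'column$1,column$2,column$3,column$4'
  awk -F, -vOFS=, '$2=="T"{t[$1]+=$3;n[$1]} $2!="T"{s[$1]+=$3;n[$1]}
    END{ for(i in n){print i,"PCA",s[i]; print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")} }
  ' inputfile | sort -t, -k1,1
} > output.csv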
