Removing last column from rows that have three columns using bash - linux

I have a file that contains several lines of data. Some lines contain three columns, but most contain only two. All lines are single-tab separated. For those that contain three columns, the third column is typically redundant and contains the same data as the second so I'd like to remove it.
I imagine awk or cut would be appropriate, but I'm drawing a blank on how to test the row for three columns so my script will only work on those rows. I know awk is a very powerful language with logic and whatnot built into it, I'm just not that strong with it.
I looked at a similar question, but I'm not sure what is going on with the awk answer. Should the -4 be -1 since I only want to remove one column? What about if the row has two columns; will it remove the second even though I don't want to do anything?
I modified it to what I think it would be:
awk -F"\t" -v OFS="\t" '{ for (i=1;i<=NF-4;i++){ print $i }}'
But when I run it (with the file) nothing happens. If I change it to NF-1 or NF-2 I get some output, but it's only a handful of lines and only the first column.
Can anyone clue me into what I should be doing?

If you just want to remove the third column, you could just print the first and the second:
awk -F '\t' '{print $1 "\t" $2}'
And it's similar to cut:
cut -f 1,2
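As a quick check of the two-column case the question worries about, here is a small sketch run on made-up rows. The sample data and the function name first_two are assumptions for illustration only. Rows that already have two columns should pass through unchanged.

```shell
# Hypothetical helper: keep only the first two tab-separated columns.
first_two() {
  cut -f 1,2
}

# Invented sample rows: one with three columns, one with two.
printf 'a\tb\tb\nx\ty\n' | first_two
```

cut uses a tab as its default delimiter, so no -d option is needed here.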

The awk variable NF gives you the number of fields, so an expression like this should work for you:
awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
Running it on an input file like so
a,b,c
x,y
u,v,w
l,m
gives me
$ cat test | awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
a,b
x,y
u,v
l,m
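The same NF test can be adapted to the tab-separated file described in the question. This is a sketch rather than a drop-in solution: the sample rows and the name drop_third are assumptions made for illustration.

```shell
# Hypothetical helper: drop the third column only from rows that have exactly
# three tab-separated fields; every other row is printed unchanged.
drop_third() {
  awk -F '\t' -v OFS='\t' 'NF == 3 { print $1, $2; next } { print }'
}

# Invented sample rows with three, two, and four columns.
printf 'a\tb\tb\nx\ty\nu\tv\tw\tz\n' | drop_third
```

Setting OFS means print $1, $2 joins the fields with a tab, and next stops the second rule from printing the three-column row a second time.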

This might work for you (GNU sed):
sed 's/\t[^\t]*//2g' file
Restricts the file to two columns.

awk 'NF==3{print $1"\t"$2}NF==2{print}' your_file
Tested below:
> cat temp
1 2
3 4 5
6 7
8 9 10
>
> awk 'NF==3{print $1"\t"$2}NF==2{print}' temp
1 2
3 4
6 7
8 9
>
Or, more simply, in awk:
awk 'NF==3{print $1"\t"$2}NF==2' your_file
Or you can also go with perl:
perl -lane 'print "$F[0]\t$F[1]"' your_file

Related

combine consecutive awk calls into one

I'm trying to use awk to get a specific number out of a text file. The number can be identified by consecutively applying three rules:
Get only lines starting with the string Name(s):
In the 6th of those lines, get the 3rd element. Elements are separated by one or more spaces
take 100 minus the number found
I got it working with two piped awk calls:
cat file | awk '/^Name\(s\):/' | awk -F " " 'NR==6 {printf "%2.2f", 100 - $3; exit}'
How can I combine the two awk calls into one?
Untested, as no sample file was provided, but:
$ awk '/^Name\(s\):/ && ++c==6 {printf "%2.2f", 100 - $3; exit}' file
You can put the awk statements into a file (for example, myprogram.awk) and use it like this:
awk -f myprogram.awk
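A sketch of what such a file might contain, using the combined condition from the answer above. The sample input, the temporary directory, and the trailing newline in the printf format are assumptions added for illustration.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# The program file; the name myprogram.awk comes from the answer above.
cat > "$workdir/myprogram.awk" <<'EOF'
# Count lines starting with "Name(s):" and act on the 6th one.
/^Name\(s\):/ && ++c == 6 {
    printf "%2.2f\n", 100 - $3
    exit
}
EOF

# Invented sample input: six matching lines mixed with other lines.
for i in 1 2 3 4 5 6; do
  printf 'Name(s): item%s %s.5\n' "$i" "$i"
  printf 'other line\n'
done > "$workdir/sample.txt"

awk -f "$workdir/myprogram.awk" "$workdir/sample.txt"
```

With this sample, the 6th matching line has 6.5 as its third element, so the program prints 93.50.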

find records longer/shorter than a particular col

this is my file: FILEABC.txt
Name|address|age|country
john|london|12|UK
adam|newyork|39|US|X12|123
jake|madrid|45|ESP
ram|delhi
joh|cal|34|US|788
I wanted to find the header count in the file, so I have this command:
cat FILEABC.txt | awk --field-separator='|' '{print NF}' | sort -n |uniq -c
the result i get for this cmd is
cat FILEABC.txt | awk --field-separator='|' '{print NF}' | sort -n |uniq -c
1 2
3 4
1 5
1 6
My requirement is: how do I find the records that have only 2 fields, 4 fields, and so on in my file?
For example,
if I want to see the records having only 2 columns:
ram|delhi
if I want to see records having more than 4 columns:
adam|newyork|39|US|X12|123
If you only want to print the records that have 2 fields, the following may help:
awk -F"|" 'NF==2' Input_file
If you need the lines that have more than 4 fields, change the condition above to NF>4; for more than 5 fields, use NF>5.
Explanation: -F"|" sets the field separator to a pipe. NF is a built-in awk variable that holds the total number of fields in the current line, so the condition checks whether the number of fields is exactly 2. awk works on condition-action pairs; when a condition is TRUE and no action is given, the default action (print the current line) is applied, which is why no explicit print is written.
In awk, the variable NF gives the total number of fields in the record/row. By default awk splits fields on runs of whitespace (spaces and tabs); if you change FS, NF is calculated from the separator you specify, so what you can do is
awk -v FS='|' 'NF==2' infile
Which is same as
# Usual Syntax : awk 'condition { action }' infile
awk -v FS='|' 'NF==2{ print }' infile
For more than 4 fields,
awk -v FS='|' 'NF > 4' infile
You can also use grep to filter 2-column records:
grep '^[^|]*|[^|]*$' FILEABC.txt
It will output:
ram|delhi
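Building on the NF conditions above, one possible way to list every record that is longer or shorter than the header is to remember the header's field count and compare each later row against it. The function name odd_rows and the output format are assumptions for illustration; the data is the sample from the question.

```shell
# Hypothetical helper: report rows whose field count differs from the header's.
odd_rows() {
  awk -F '|' 'NR == 1 { want = NF; next }
              NF != want { print NR ": " NF " fields: " $0 }' "$@"
}

# Sample data copied from the question.
printf '%s\n' \
  'Name|address|age|country' \
  'john|london|12|UK' \
  'adam|newyork|39|US|X12|123' \
  'jake|madrid|45|ESP' \
  'ram|delhi' \
  'joh|cal|34|US|788' | odd_rows
```

For this input it reports lines 3, 5, and 6, with 6, 2, and 5 fields respectively.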

awk print number of row only in uniq column

I have data set like this:
1 A
1 B
1 C
2 A
2 B
2 C
3 B
3 C
And I have a script which calculates:
Number of occurrences of the search string
Number of rows
awk -v search="A" \
'BEGIN{count=0} $2 == search {count++} END{print count "\n" NR}' input
That works perfectly fine.
I would like to add to my awk one-liner the number of unique values in the first column.
So the output should be separated by \n:
2
8
3
I can do this in separate awk code, but I am not able to integrate it to my original awk code.
awk '{a[$1]++}END{for(i in a){print i}}' input | wc -l
Any idea how to integrate it in one awk solution without piping ?
Looks like you want this:
awk -v search="A" '{a[$1]++}
$2 == search {count++}
END{OFS="\n";print count+0, NR, length(a)}' file
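In case your awk does not support length() on an array, a variant that counts distinct values as it goes could look like the sketch below. The function name summarize and the variable names seen and uniq are assumptions for illustration; the data is the sample from the question.

```shell
# Hypothetical helper: print the match count, the row count, and the number of
# distinct values in column 1, one per line.
summarize() {
  awk -v search="$1" '
    !seen[$1]++ { uniq++ }        # first time this column-1 value appears
    $2 == search { count++ }
    END { print count + 0; print NR; print uniq + 0 }'
}

printf '%s\n' '1 A' '1 B' '1 C' '2 A' '2 B' '2 C' '3 B' '3 C' | summarize A
```

For the sample data this prints 2, 8, and 3 on separate lines, matching the output requested in the question.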

How to cut first n and last n columns?

How can I cut off the first n and the last n columns from a tab delimited file?
I tried this to cut the first n columns, but I have no idea how to combine the first and last n columns:
cut -f 1-10 -d "<CTR>v <TAB>" filename
Cut can take several ranges in -f:
Columns up to 4 and from 7 onwards:
cut -f -4,7-
or for fields 1,2,5,6 and from 10 onwards:
cut -f 1,2,5,6,10-
etc
The first part of your question is easy. As already pointed out, cut accepts omission of either the starting or the ending index of a column range, interpreting this as meaning either “from the start to column n (inclusive)” or “from column n (inclusive) to the end,” respectively:
$ printf 'this:is:a:test' | cut -d: -f-2
this:is
$ printf 'this:is:a:test' | cut -d: -f3-
a:test
It also supports combining ranges. If you want, e.g., the first 3 and the last 2 columns in a row of 7 columns:
$ printf 'foo:bar:baz:qux:quz:quux:quuz' | cut -d: -f-3,6-
foo:bar:baz:quux:quuz
However, the second part of your question can be a bit trickier depending on what kind of input you’re expecting. If by “last n columns” you mean “last n columns (regardless of their indices in the overall row)” (i.e. because you don’t necessarily know how many columns you’re going to find in advance) then sadly this is not possible to accomplish using cut alone. In order to effectively use cut to pull out “the last n columns” in each line, the total number of columns present in each line must be known beforehand, and each line must be consistent in the number of columns it contains.
If you do not know how many “columns” may be present in each line (e.g. because you’re working with input that is not strictly tabular), then you’ll have to use something like awk instead. E.g., to use awk to pull out the last 2 “columns” (awk calls them fields, the number of which can vary per line) from each line of input:
$ printf '/a\n/a/b\n/a/b/c\n/a/b/c/d\n' | awk -F/ '{print $(NF-1) FS $(NF)}'
/a
a/b
b/c
c/d
You can use cut as follows: -d sets the delimiter, -f selects fields, and $'\t' passes a tab as the delimiter:
cut -d$'\t' -f 1-3,7-
To use AWK to cut off the first and last fields:
awk '{$1 = ""; $NF = ""; print}' inputfile
Unfortunately, that leaves the field separators, so
aaa bbb ccc
becomes
[space]bbb[space]
To do this using kurumi's answer which won't leave extra spaces, but in a way that's specific to your requirements:
awk '{delim = ""; for (i=2;i<=NF-1;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
This also fixes a couple of problems in that answer.
To generalize that:
awk -v skipstart=1 -v skipend=1 '{delim = ""; for (i=skipstart+1;i<=NF-skipend;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
Then you can change the number of fields to skip at the beginning or end by changing the variable assignments at the beginning of the command.
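Since the question is about a tab-delimited file, the generalized loop can be wrapped with explicit tab separators. This is a sketch under that assumption; the function name trim_cols and the sample rows are made up for illustration.

```shell
# Hypothetical helper: trim_cols START END drops the first START and the last
# END tab-separated columns from each line read on standard input.
trim_cols() {
  awk -F '\t' -v OFS='\t' -v skipstart="$1" -v skipend="$2" '{
    delim = ""
    for (i = skipstart + 1; i <= NF - skipend; i++) {
      printf "%s%s", delim, $i
      delim = OFS
    }
    printf "\n"
  }'
}

# Drop the first column and the last two columns of each invented row.
printf 'a\tb\tc\td\te\nf\tg\th\ti\tj\n' | trim_cols 1 2
```

Passing the delimiter as a separate printf argument, rather than as part of the format string, keeps the output intact even if OFS ever contains a % character.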
You can use Bash for that:
while read -a cols; do echo ${cols[@]:0:1} ${cols[@]:1,-1}; done < file.txt
You can use awk; for example, to cut off the 1st, 2nd, and last 3 columns:
awk '{for(i=3;i<=NF-3;i++) print $i}' file
If you have a programming language such as Ruby (1.9+):
$ ruby -F"\t" -ane 'puts $F[2..-4].join("\t")' file
Try the following:
echo a#b#c | awk -F"#" '{$1 = ""; $NF = ""; print}' OFS=""
Use
cut -b COLUMN_N_BEGINS-COLUMN_N_UNTIL INPUT.TXT > OUTPUT.TXT
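Note that -b selects byte positions, not delimited fields, so this only suits fixed-width data. For tab-delimited text, -f is the right tool, since tab is its default delimiter.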

Excel and awk disagree about CSV totals

I have a CSV file that I'm totaling up two ways: one using Excel and the other using awk. Here are the totals of my first 8 columns in Excel:
1) 2640502474.00
2) 1272849386284.00
3) 36785.00
4)
5) 107.00
6) 239259.00
7) 0.00
8) 7418570893330.00
And here's my awk output:
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$1} END {printf("%01.2f\n", s)}'
2640502474.00
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$2} END {printf("%01.2f\n", s)}'
1272849386284.00
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$8} END {printf("%01.2f\n", s)}'
7411306364347.00
Notice how 1 and 2 match exactly but 8 is off by many millions. I'm assuming Excel's total is the correct one, so why is awk handling this file differently?
You likely have a comma-formatted number contained in quotes. Excel will properly handle that number as a single field. Your regex for field separation in awk won't: a comma internal to a number is a valid separator according to that regex. It is very hard (and mostly futile) to try to handle the optional quoting and escaping that CSV allows with a regex.
Compare the following to see what is likely going on:
$ echo '"1","10","15","1,000","14"' | awk -F "\"*,\"*" '{print $4}'
1
$ echo '"1","10","15","1,000","14"' | awk -F "\",\"" '{print $4}'
1,000
Note that the second regex above still has a problem with a trailing " in the last field and only works at all if all fields are consistently quoted; it is for illustration purposes only.
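One possible workaround that stays within awk is to split each line on the double quote instead: the even-numbered pieces are then the quoted contents, and commas can be deleted from those before summing. This is a minimal sketch, not a CSV parser. It assumes fields never contain escaped quotes or embedded newlines, and the function name strip_quoted_commas and the sample rows are made up for illustration.

```shell
# Hypothetical helper: delete commas that sit inside double-quoted fields and
# drop the quotes. Assumes no escaped quotes and no newlines inside fields.
strip_quoted_commas() {
  awk -F '"' -v OFS='' '{
    for (i = 2; i <= NF; i += 2)   # even-numbered pieces are inside quotes
      gsub(/,/, "", $i)
    $1 = $1                        # force the record to be rebuilt without quotes
    print
  }'
}

# Sum column 4 of two invented rows after cleaning them.
printf '%s\n' '"1","10","15","1,000","14"' '"2","20","25","2,500","24"' |
  strip_quoted_commas |
  awk -F, '{ s += $4 } END { printf "%.2f\n", s }'
```

After cleaning, the first row becomes 1,10,15,1000,14, so a plain -F, split sees 1000 as a single field and the column total comes out as 3500.00.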
