Using awk to extract data and count

How do I use awk on a file that looks like this:
abcd Z
efdg Z
aqbs F
edf F
aasd A
I want to extract the number of times each letter of the alphabet occurs in the second column, so output should be:
Z 2
F 2
A 1

If you want the output in the same order as the input file, the following may help:
awk 'FNR==NR{A[$2]++;next} A[$2]{print $2,A[$2];delete A[$2]}' Input_file Input_file
If you don't care about the order of $2, the following may help:
awk '{A[$2]++} END{for(i in A){print i,A[i]}}' Input_file
The first solution reads Input_file twice. On the first pass (FNR==NR) it builds an array A indexed by $2 and increments the count for each value. On the second pass it prints each $2 with its count the first time that value is seen, then deletes the entry so it is not printed again.
The second solution builds the same array A in a single pass, then loops over it in the END block and prints each index with its count. The order of a for (i in A) loop is unspecified, which is why the output order is not guaranteed.
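The input order can also be kept in a single pass. This is a minimal sketch, not part of the answer above: it assumes a second array (called order here, a made-up name) that records each value the first time it appears, and the function name count_in_order is hypothetical.

```shell
# count_in_order: count the values in column 2 of stdin and print them
# in first-seen order. Function and array names are illustrative only.
count_in_order() {
  awk '
    !($2 in count) { order[++n] = $2 }   # remember each new value once
    { count[$2]++ }                      # tally every occurrence
    END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
  '
}

printf 'abcd Z\nefdg Z\naqbs F\nedf F\naasd A\n' | count_in_order
```

For the sample input this prints Z 2, F 2 and A 1 on separate lines, without reading the file twice.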

I would use sort | uniq for this purpose as these two utils are designed specifically for this kind of task:
cat <<END |
abcd Z
efdg Z
aqbs F
edf F
aasd A
END
awk '{print $2}' | sort -r | uniq -c | awk '{printf "%s %d\n", $2, $1}'
This produces exactly the desired output:
Z 2
F 2
A 1
Here awk '{print $2}' is used to get the second column from a document with fields separated by one or more whitespace characters. If the columns were known to have a fixed width, or to be separated by exactly one delimiter character, the faster cut utility could be used instead.
sort -r | uniq -c is doing the main algorithmic part of the task - sort the letters in reverse order and count the number of occurrences of each letter.
awk '{printf "%s %d\n", $2, $1}' does some reformatting of the uniq -c output to match the required format exactly.
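As a rough illustration of the cut variant mentioned above, here is a sketch that assumes the two columns are separated by exactly one space (true for the sample data, but an assumption in general). The function name count_with_cut is made up for this example.

```shell
# count_with_cut: same pipeline as above, but column 2 is extracted with cut.
# Assumes exactly one space between the two columns.
count_with_cut() {
  cut -d ' ' -f 2 | sort -r | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

printf 'abcd Z\nefdg Z\naqbs F\nedf F\naasd A\n' | count_with_cut
```

If the separator can be several spaces or a mix of spaces and tabs, the awk '{print $2}' form is the safer choice.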
Update: AWK has powerful array support, so this can be done with awk alone (GNU awk is required here, because asorti is a gawk extension):
cat <<END |
abcd Z
efdg Z
aqbs F
edf F
aasd A
END
awk '{a[$2]++}
END {n=asorti(a,b,"@ind_str_desc");
for (k=1;k<=n;k++) {printf "%s %d\n", b[k], a[b[k]]} }'
We use the array a that is indexed with letters found in the input stream, and on each line the element indexed by the corresponding letter gets incremented.
In the END clause we sort the indices in descending string order into the array b, then print each index together with its count.
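If gawk is not available, a similar result can be had by counting in awk and leaving the ordering to sort. This is only a sketch under that assumption; the function name count_sorted_desc is hypothetical.

```shell
# count_sorted_desc: count column 2 with plain awk, then sort the
# "value count" lines by the value in descending order.
count_sorted_desc() {
  awk '{ count[$2]++ } END { for (k in count) print k, count[k] }' |
    sort -k1,1r
}

printf 'abcd Z\nefdg Z\naqbs F\nedf F\naasd A\n' | count_sorted_desc
```

This avoids asorti entirely, at the cost of one extra process.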

Related

find records longer/shorter than a particular col

this is my file: FILEABC.txt
Name|address|age|country
john|london|12|UK
adam|newyork|39|US|X12|123
jake|madrid|45|ESP
ram|delhi
joh|cal|34|US|788
I wanted to find the field count for each record in the file, so I have this command:
cat FILEABC.txt | awk --field-separator='|' '{print NF}' | sort -n |uniq -c
The result I get from this command is:
1 2
3 4
1 5
1 6
My requirement is: how do I find the records in my file that have only 2 fields, 4 fields, and so on?
For example, if I want to see the records having only 2 columns:
ram|delhi
If I want to see the records having more than 4 columns:
adam|newyork|39|US|X12|123

If you want to print only the records that have 2 fields, the following may help:
awk -F"|" 'NF==2' Input_file
To get the lines with more than 4 fields, change the condition to NF>4; for more than 5 fields, use NF>5, and so on.
Explanation: -F"|" sets the field separator to a pipe. NF is a built-in awk variable that holds the total number of fields in the current line, so NF==2 checks whether the line has exactly 2 fields. awk works on condition { action } pairs; since no action is written here, the default action (printing the current line) runs whenever the condition is true.

In awk, the variable NF gives the total number of fields in the current record (row). By default awk splits fields on runs of whitespace; if you change FS, NF is calculated using that separator instead. So what you can do is:
awk -v FS='|' 'NF==2' infile
Which is the same as
# Usual Syntax : awk 'condition { action }' infile
awk -v FS='|' 'NF==2{ print }' infile
For more than 4 fields,
awk -v FS='|' 'NF > 4' infile
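Building on the NF tests above, one possible extension is to report every record whose field count differs from the header's. This is a sketch, assuming the first line is the header; the function name bad_records and the "line: count: record" output format are made up for illustration.

```shell
# bad_records: print "line number: field count: record" for each record
# whose field count differs from that of the first (header) line.
bad_records() {
  awk -F '|' '
    NR == 1    { want = NF; next }          # header defines the expected count
    NF != want { print NR ": " NF ": " $0 } # report every deviating record
  ' "$@"
}

printf '%s\n' 'Name|address|age|country' 'john|london|12|UK' \
  'adam|newyork|39|US|X12|123' 'jake|madrid|45|ESP' 'ram|delhi' \
  'joh|cal|34|US|788' | bad_records
```

On the sample file this lists lines 3, 5 and 6, which have 6, 2 and 5 fields respectively.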

You can also use grep to filter 2-column records:
grep '^[^|]*|[^|]*$' FILEABC.txt
It will output:
ram|delhi

awk print number of row only in uniq column

I have data set like this:
1 A
1 B
1 C
2 A
2 B
2 C
3 B
3 C
And I have a script which calculates:
Number of occurrences in searching string
Number of rows
awk -v search="A" \
'BEGIN{count=0} $2 == search {count++} END{print count "\n" NR}' input
That works perfectly fine.
I would like to add the number of unique values in the first column to my awk one-liner.
So the output should be separated by \n:
2
8
3
I can do this in separate awk code, but I am not able to integrate it to my original awk code.
awk '{a[$1]++}END{for(i in a){print i}}' input | wc -l
Any idea how to integrate it into one awk solution without piping?

Looks like you want this:
awk -v search="A" '{a[$1]++}
$2 == search {count++}
END{OFS="\n";print count+0, NR, length(a)}' file
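A variant that does not rely on length() accepting an array is to bump a counter the first time each key is seen. This is a sketch, assuming the same input layout; the function name summarize and the variables seen and uniq are hypothetical.

```shell
# summarize SEARCH: print, one per line, the number of rows whose column 2
# equals SEARCH, the total number of rows, and the number of distinct
# values in column 1. Reads from stdin.
summarize() {
  awk -v search="$1" '
    !seen[$1]++  { uniq++ }    # first time this column-1 value appears
    $2 == search { count++ }
    END { print count + 0; print NR; print uniq + 0 }
  '
}

printf '1 A\n1 B\n1 C\n2 A\n2 B\n2 C\n3 B\n3 C\n' | summarize A
```

For the sample data this prints 2, 8 and 3, matching the output requested in the question.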

awk max value of column two for dates in column one

I am trying to print only max values of column two for dates in column one.
My file is:
2014-04-09,135303
2014-04-09,416400
2014-04-15,143684
2014-04-15,156011
2014-04-15,184406
2014-04-16,1123083
2014-04-16,167486
2014-04-16,862196
2014-04-17,963023
2014-04-19,583844
Required Output:
2014-04-09,416400
2014-04-15,184406
2014-04-16,1123083
2014-04-17,963023
2014-04-19,583844
I tried sort, but it is not working:
cat file|sort -k2 -r | sort --unique --stable -k1
Please suggest how this can be done using awk or sort.

kent$ awk -F, '{a[$1]=$2>a[$1]?$2:a[$1]}END{for(x in a)print x "," a[x]}' file
2014-04-15,184406
2014-04-16,1123083
2014-04-17,963023
2014-04-09,416400
2014-04-19,583844
If you want the result ordered by date, pipe the line above to sort:
awk -F, '{a[$1]=$2>a[$1]?$2:a[$1]}END{for(x in a)print x "," a[x]}' file|sort
2014-04-09,416400
2014-04-15,184406
2014-04-16,1123083
2014-04-17,963023
2014-04-19,583844
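Since the question also asks about sort, here is one way the work might be split: let sort order the rows by date and then by value (numerically, descending), and keep only the first row per date. This is a sketch; the function name max_per_date is made up, and the awk filter !seen[$1]++ stands in for the second sort the asker attempted.

```shell
# max_per_date: for comma-separated "date,value" input, print the row with
# the largest value for each date, ordered by date.
max_per_date() {
  sort -t, -k1,1 -k2,2nr "$@" |   # by date, then value descending
    awk -F, '!seen[$1]++'         # keep the first row of each date
}

printf '%s\n' 2014-04-09,135303 2014-04-09,416400 2014-04-15,143684 \
  2014-04-15,156011 2014-04-15,184406 2014-04-16,1123083 \
  2014-04-16,167486 2014-04-16,862196 2014-04-17,963023 \
  2014-04-19,583844 | max_per_date
```

The -k2,2nr key matters: the asker's sort -k2 -r compares the values as strings, so 862196 would sort above 1123083.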

Removing last column from rows that have three columns using bash

I have a file that contains several lines of data. Some lines contain three columns, but most contain only two. All lines are single-tab separated. For those that contain three columns, the third column is typically redundant and contains the same data as the second so I'd like to remove it.
I imagine awk or cut would be appropriate, but I'm drawing a blank on how to test the row for three columns so my script will only work on those rows. I know awk is a very powerful language with logic and whatnot built into it, I'm just not that strong with it.
I looked at a similar question, but I'm not sure what is going on with the awk answer. Should the -4 be -1 since I only want to remove one column? What about if the row has two columns; will it remove the second even though I don't want to do anything?
I modified it to what I think it would be:
awk -F"\t" -v OFS="\t" '{ for (i=1;i<=NF-4;i++){ print $i }}'
But when I run it (with the file) nothing happens. If I change it to NF-1 or NF-2 I get some output, but it is only a handful of lines and only the first column.
Can anyone clue me in on what I should be doing?

If you just want to remove the third column, you could just print the first and the second:
awk -F '\t' '{print $1 "\t" $2}'
And it's similar to cut:
cut -f 1,2

The awk variable NF gives you the number of fields, so an expression like this should work for you.
awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
Running it on an input file like so
a,b,c
x,y
u,v,w
l,m
gives me
$ cat test | awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
a,b
x,y
u,v
l,m

This might work for you (GNU sed):
sed 's/\t[^\t]*//2g' file
Restricts the file to two columns.

awk 'NF==3{print $1"\t"$2}NF==2{print}' your_file
Tested below:
> cat temp
1 2
3 4 5
6 7
8 9 10
>
> awk 'NF==3{print $1"\t"$2}NF==2{print}' temp
1 2
3 4
6 7
8 9
>
Or, more simply, in awk:
awk 'NF==3{print $1"\t"$2}NF==2' your_file
Or you can also go with perl:
perl -lane 'print "$F[0]\t$F[1]"' your_file
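The awk commands above split on any whitespace. Since the question says the file is strictly single-tab separated, a tab-aware version may be safer when fields can contain spaces. This is a sketch under that assumption; the function name drop_third_column is hypothetical.

```shell
# drop_third_column: for tab-separated input, print only the first two
# fields of lines that have exactly three fields; pass other lines through.
drop_third_column() {
  awk -F '\t' -v OFS='\t' 'NF == 3 { print $1, $2; next } { print }' "$@"
}

printf 'a\tb\tb\nc\td\ne f\tg\tg\n' | drop_third_column
```

The third sample line shows the difference: "e f" stays one field because only tabs separate columns.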

How to cut first n and last n columns?

How can I cut off the first n and the last n columns from a tab delimited file?
I tried this to cut the first n columns, but I have no idea how to combine the first and last n columns:
cut -f 1-10 -d "<CTR>v <TAB>" filename

Cut can take several ranges in -f:
Columns up to 4 and from 7 onwards:
cut -f -4,7-
or for fields 1,2,5,6 and from 10 onwards:
cut -f 1,2,5,6,10-
etc
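When every line has the same number of columns, the range for cut can be computed instead of typed by hand. The following is only a sketch: it assumes tab-separated input, takes the column count from the first line, and uses a made-up function name, trim_columns. It keeps the middle columns, that is, it drops the first n and the last n.

```shell
# trim_columns N FILE: drop the first N and last N tab-separated columns.
# Assumes every line has as many columns as the first line, and that
# more than 2*N columns exist (otherwise cut rejects the range).
trim_columns() {
  n=$1
  file=$2
  total=$(awk -F '\t' 'NR == 1 { print NF; exit }' "$file")
  cut -f "$((n + 1))-$((total - n))" "$file"
}

tmp=$(mktemp)
printf '1\t2\t3\t4\t5\t6\n' > "$tmp"
trim_columns 2 "$tmp"    # keeps fields 3 and 4
rm -f "$tmp"
```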

The first part of your question is easy. As already pointed out, cut accepts omission of either the starting or the ending index of a column range, interpreting this as meaning either “from the start to column n (inclusive)” or “from column n (inclusive) to the end,” respectively:
$ printf 'this:is:a:test' | cut -d: -f-2
this:is
$ printf 'this:is:a:test' | cut -d: -f3-
a:test
It also supports combining ranges. If you want, e.g., the first 3 and the last 2 columns in a row of 7 columns:
$ printf 'foo:bar:baz:qux:quz:quux:quuz' | cut -d: -f-3,6-
foo:bar:baz:quux:quuz
However, the second part of your question can be a bit trickier depending on what kind of input you’re expecting. If by “last n columns” you mean “last n columns (regardless of their indices in the overall row)” (i.e. because you don’t necessarily know how many columns you’re going to find in advance) then sadly this is not possible to accomplish using cut alone. In order to effectively use cut to pull out “the last n columns” in each line, the total number of columns present in each line must be known beforehand, and each line must be consistent in the number of columns it contains.
If you do not know how many “columns” may be present in each line (e.g. because you’re working with input that is not strictly tabular), then you’ll have to use something like awk instead. E.g., to use awk to pull out the last 2 “columns” (awk calls them fields, the number of which can vary per line) from each line of input:
$ printf '/a\n/a/b\n/a/b/c\n/a/b/c/d\n' | awk -F/ '{print $(NF-1) FS $(NF)}'
/a
a/b
b/c
c/d
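The same idea can be generalized from the last 2 fields to the last n. This is a minimal sketch, assuming tab-separated input as in the question; the function name last_fields is hypothetical, and lines with fewer than n fields are printed whole.

```shell
# last_fields N: print the last N tab-separated fields of each stdin line
# (the whole line if it has fewer than N fields).
last_fields() {
  awk -F '\t' -v OFS='\t' -v n="$1" '{
    start = (NF > n) ? NF - n + 1 : 1
    line = ""
    for (i = start; i <= NF; i++)
      line = line (i > start ? OFS : "") $i   # join fields with OFS
    print line
  }'
}

printf 'a\tb\tc\td\nx\ty\n' | last_fields 3
```

Because NF is evaluated per line, this works even when the number of fields varies from line to line.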

You can cut using the following.
-d sets the delimiter, -f selects the fields.
$'\t' is used for tab-separated fields:
cut -d$'\t' -f 1-3,7-

To use AWK to cut off the first and last fields:
awk '{$1 = ""; $NF = ""; print}' inputfile
Unfortunately, that leaves the field separators, so
aaa bbb ccc
becomes
[space]bbb[space]
To do this using kurumi's answer which won't leave extra spaces, but in a way that's specific to your requirements:
awk '{delim = ""; for (i=2;i<=NF-1;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
This also fixes a couple of problems in that answer.
To generalize that:
awk -v skipstart=1 -v skipend=1 '{delim = ""; for (i=skipstart+1;i<=NF-skipend;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
Then you can change the number of fields to skip at the beginning or end by changing the variable assignments at the beginning of the command.

You can use Bash for that:
while read -r -a cols; do echo "${cols[@]:0:1}" "${cols[@]: -1}"; done < file.txt

You can use awk; for example, to cut off the 1st, 2nd, and last 3 columns:
awk '{for(i=3;i<=NF-3;i++) printf "%s%s", $i, (i<NF-3 ? OFS : ORS)}' file
If you have a programming language such as Ruby (1.9+):
$ ruby -F"\t" -ane 'puts $F[2..-4].join("\t")' file

Try the following:
echo a#b#c | awk -F"#" '{$1 = ""; $NF = ""; print}' OFS=""

Use
cut -b COLUMN_N_BEGINS-COLUMN_N_UNTIL INPUT.TXT > OUTPUT.TXT
Note that -b selects byte positions rather than delimited fields, so it suits fixed-width text; -f splits on tabs by default.
