Extract values from a fixed-width column - linux

I have text file named file that contains the following:
Australia AU 10
New Zealand NZ 1
...
If I use the following command to extract the country names from the first column:
awk '{print $1}' file
I get the following:
Australia
New
...
Only the first word of each country name is output.
How can I get the entire country name?

Try this:
$ awk '{print substr($0,1,15)}' file
Australia
New Zealand

To complement Raymond Hettinger's helpful POSIX-compliant answer:
It looks like your country-name column is 23 characters wide.
In the simplest case, if you don't need to trim trailing whitespace, you can just use cut:
# Works, but has trailing whitespace.
$ cut -c 1-23 file
Australia
New Zealand
Caveat: GNU cut is not UTF-8 aware, so if the input is UTF-8-encoded and contains non-ASCII characters, the above will not work correctly.
To trim trailing whitespace, you can take advantage of GNU awk's nonstandard FIELDWIDTHS variable:
# Trailing whitespace is trimmed.
$ awk -v FIELDWIDTHS=23 '{ sub(" +$", "", $1); print $1 }' file
Australia
New Zealand
FIELDWIDTHS=23 declares the first field (reflected in $1) to be 23 characters wide.
sub(" +$", "", $1) then removes trailing whitespace from $1 by replacing any nonempty run of spaces (" +") at the end of the field ($1) with the empty string.
However, your Linux distro may come with Mawk rather than GNU Awk; use awk -W version to determine which one it is.
For a POSIX-compliant solution that trims trailing whitespace, extend Raymond's answer:
# Trailing whitespace is trimmed.
$ awk '{ c=substr($0, 1, 23); sub(" +$", "", c); print c}' file
Australia
New Zealand

to get rid of the last two columns
awk 'NF>2 && NF-=2' file
NF>2 is the guard to filter records with more than 2 fields. If your data is consistent you can drop that to simply,
awk 'NF-=2' file

This isn't relevant in the case where your data has spaces, but often it doesn't:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
foo bar baz etc...
In these cases it's really easy to get, say, the IMAGE column using tr to remove multiple spaces:
$ docker ps | tr --squeeze-repeats ' '
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
foo bar baz
Now you can pipe this (without the pesky header row) to cut:
$ docker ps | tr --squeeze-repeats ' ' | tail -n +2 | cut -d ' ' -f 2
foo

Related

How can I use grep and regular expression to display names with just 3 characters

I am new to grep and UNIX. I have a sample of data and want to display all the first names that only contain three characters e.g. Lee_example. but I having some difficulty doing that. I am currently using this code cat file.txt|grep -E "[A-Z][a-z]{2}" but it is displaying all the names that contain at least 3 characters and not only 3 characters
Sample data
name
number
Lee_example
1
Hector_exaple
2
You need to match the _ after the first name.
grep -E "[A-Z][a-z]{2}_"
With awk:
awk -F_ 'length($1)==3{print $1}'
-F_ tells awk to split the input lines by _. length($1) == 3 checks whether the first fields (the name) is 3 characters long and {print $1} prints the name in that case.

Get last n characters of one field and complete second field of a string in Linux

I have 2 lines in a file :
MUMBAI,918889986665,POSTPAID,CRBT123,CRBT,SYSTEM,151004,MONTHLY,160201,160302
MUMBAI,912398456781,POSTPAID,SEGP,SEGP30,SMS,151004,MONTHLY,160201,160302
I wanted to cut field 2 and 4 in above lines. Condition is: from field 2, I need only ten digits.
Desired output:
8889986665,CRBT
2398456781,SEGP30
I am trying below command :
cut -d',' -f2 test.txt | cut -c3-12 && cut -d',' -f4 test.txt
My output:
8889986665
2398456781
CRBT
SEGP30
Kindly help me to achieve desired output.
Solution 2:
Here is the solution which will serve the purpose:
cut -d',' -f2,4 1 | sed 's/.*\([0-9]\{10\}\),\(.*\)/\1,\2/'
8889986665,CRBT123
2398456781,SEGP
cut will give us the second and forth field.
Inside sed, .* to skip the initial characters until the first pattern ahead is encountered.
First pattern is 10 digits followed by a semicolon:
\([0-9]\{10\}\),
Second pattern is rest of the line: \(.*\)
Now we print both the patterns with semicolon in between: \1,\2
Note that the number 10 can replaced by number of characters to be
extracted before the delimiter , [0-9] can be replaced by . if
these characters can be any type of characters.
Solution 1:
Using cut will be easiest for you in this case.
You first need to get desired fields (2,4) filtered from the line and then do more filtering (only 10 characters from field #2)
$ cut -d',' -f2,4 test.txt | cut -c3-
8889986665,CRBT123
2398456781,SEGP
This is job best done using awk:
awk -F, -v n=10 '{print substr($2, length($2)-n+1, n) FS $5}' file
8889986665,CRBT
2398456781,SEGP30
substr command will print last n characters in 2nd column.
sed -r 's/[^,]+,..([^,]+,)([^,]+,)([^,]+),.*/\1\3/' file
8889986665,CRBT123
2398456781,SEGP
cat test.txt | cut -f 2,4 -d ","
assuming your file is test.txt

grep to search data in first column

I have a text file with two columns.
Product Cost
Abc....def 10
Abc.def 20
ajsk,,lll 04
I want to search for product starts from "Abc" and ends with "def" then for those entries I want to add Cost.
I have used :
grep "^Abc|def$" myfile
but it is not working
Use awk. cat myfile | awk '{print $1}' | grep query
If you can use awk, try this:
text.txt
--------
Product Cost
Abc....def 10
Abc.def 20
ajsk,,lll 04
With only awk:
awk '$1 ~ /^Abc.*def$/ { SUM += $2 } END { print SUM } ' test.txt
Result: 30
With grep and awk:
grep "^Abc.*def.*\d*$" test.txt | awk '{SUM += $2} END {print SUM}'
Result: 30
Explanation:
awk reads each line and matches the first column with a regular expression (regex)
The first column has to start with Abc, followed by anything (zero or more times), and ends with def
If such match is found, add 2nd column to SUM variable
After reading all lines print the variable
Grep extracts each line that starts with Abc, followed by anything, followed by def, followed by anything, followed by a number (zero or more times) to end. Those lines are fed/piped to awk. Awk just increments SUM for each line it receives. After reading all lines received, it prints the SUM variable.
Thanks edited. Do you want the command like this?
grep "^Abc.*def *.*$"
If you don't want to use cat, and also show the line numbers:
awk '{print $1}' filename | grep -n keyword
If applicable, you may consider caret ^: grep -E '^foo|^bar' it will match text at the beginning of the string. Column one is always located at the beginning of the string.
Regular expression > POSIX basic and extended
^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

How to cut first n and last n columns?

How can I cut off the first n and the last n columns from a tab delimited file?
I tried this to cut first n column. But I have no idea to combine first and last n column
cut -f 1-10 -d "<CTR>v <TAB>" filename
Cut can take several ranges in -f:
Columns up to 4 and from 7 onwards:
cut -f -4,7-
or for fields 1,2,5,6 and from 10 onwards:
cut -f 1,2,5,6,10-
etc
The first part of your question is easy. As already pointed out, cut accepts omission of either the starting or the ending index of a column range, interpreting this as meaning either “from the start to column n (inclusive)” or “from column n (inclusive) to the end,” respectively:
$ printf 'this:is:a:test' | cut -d: -f-2
this:is
$ printf 'this:is:a:test' | cut -d: -f3-
a:test
It also supports combining ranges. If you want, e.g., the first 3 and the last 2 columns in a row of 7 columns:
$ printf 'foo:bar:baz:qux:quz:quux:quuz' | cut -d: -f-3,6-
foo:bar:baz:quux:quuz
However, the second part of your question can be a bit trickier depending on what kind of input you’re expecting. If by “last n columns” you mean “last n columns (regardless of their indices in the overall row)” (i.e. because you don’t necessarily know how many columns you’re going to find in advance) then sadly this is not possible to accomplish using cut alone. In order to effectively use cut to pull out “the last n columns” in each line, the total number of columns present in each line must be known beforehand, and each line must be consistent in the number of columns it contains.
If you do not know how many “columns” may be present in each line (e.g. because you’re working with input that is not strictly tabular), then you’ll have to use something like awk instead. E.g., to use awk to pull out the last 2 “columns” (awk calls them fields, the number of which can vary per line) from each line of input:
$ printf '/a\n/a/b\n/a/b/c\n/a/b/c/d\n' | awk -F/ '{print $(NF-1) FS $(NF)}'
/a
a/b
b/c
c/d
You can cut using following ,
-d: delimiter ,-f for fields
\t used for tab separated fields
cut -d$'\t' -f 1-3,7-
To use AWK to cut off the first and last fields:
awk '{$1 = ""; $NF = ""; print}' inputfile
Unfortunately, that leaves the field separators, so
aaa bbb ccc
becomes
[space]bbb[space]
To do this using kurumi's answer which won't leave extra spaces, but in a way that's specific to your requirements:
awk '{delim = ""; for (i=2;i<=NF-1;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
This also fixes a couple of problems in that answer.
To generalize that:
awk -v skipstart=1 -v skipend=1 '{delim = ""; for (i=skipstart+1;i<=NF-skipend;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
Then you can change the number of fields to skip at the beginning or end by changing the variable assignments at the beginning of the command.
You can use Bash for that:
while read -a cols; do echo ${cols[#]:0:1} ${cols[#]:1,-1}; done < file.txt
you can use awk, for example, cut off 1st,2nd and last 3 columns
awk '{for(i=3;i<=NF-3;i++} print $i}' file
if you have a programing language such as Ruby (1.9+)
$ ruby -F"\t" -ane 'print $F[2..-3].join("\t")' file
Try the following:
echo a#b#c | awk -F"#" '{$1 = ""; $NF = ""; print}' OFS=""
Use
cut -b COLUMN_N_BEGINS-COLUMN_N_UNTIL INPUT.TXT > OUTPUT.TXT
-f doesn't work if you have "tabs" in the text file.

Extract Lines when Column K is empty with AWK/Perl

I have data that looks like this:
foo 78 xxx
bar yyy
qux 99 zzz
xuq xyz
They are tab delimited.
How can I extract lines where column 2 is empty, yielding
bar yyy
xuq xyz
I tried this but doesn't seem to work:
awk '$2==""' myfile.txt
You need to specifically set the field separator to a TAB character:
> cat qq.in
foo 78 xxx
bar yyy
qux 99 zzz
xuq xyz
> cat qq.in | awk 'BEGIN {FS="\t"} $2=="" {print}'
bar yyy
xuq xyz
The default behaviour for awk is to treat an FS of SPACE (the default) as a special case. From the man page:
In the special case that FS is a single space, fields are separated by runs of spaces and/or tabs and/or newlines. (my italics)
perl -F/\t/ -lane 'print unless $F[1] eq q//' myfile.txt
Command Switches
-F tells Perl what delimiter to autosplit on (tabs in this case)
-a enables autosplit mode, splitting each line on the specified delimiter to populate an array #F
-l automatically appends a newline "\n" at the end of each printed line
-n processes the file line-by-line
-e treats the first quoted argument as code and not a filename
grep -e '^.*\t\t.*$' myfile.txt
Will grep each line consisting of characters-tab-tab-characters (nothing between tabs).

Resources