grep to search data in first column - linux

I have a text file with two columns.
Product Cost
Abc....def 10
Abc.def 20
ajsk,,lll 04
I want to find the products that start with "Abc" and end with "def", and then add up the Cost of those entries.
I have used :
grep "^Abc|def$" myfile
but it is not working

Use awk. cat myfile | awk '{print $1}' | grep query

If you can use awk, try this:
test.txt
--------
Product Cost
Abc....def 10
Abc.def 20
ajsk,,lll 04
With only awk:
awk '$1 ~ /^Abc.*def$/ { SUM += $2 } END { print SUM } ' test.txt
Result: 30
With grep and awk:
grep '^Abc.*def [0-9]*$' test.txt | awk '{SUM += $2} END {print SUM}'
Result: 30
Explanation:
awk reads each line and matches the first column against a regular expression (regex).
The first column has to start with Abc, followed by anything (zero or more characters), and end with def.
If a line matches, its second column is added to the SUM variable.
After reading all lines, awk prints the variable.
grep extracts each line that starts with Abc, followed by anything, followed by def, a space, and zero or more digits up to the end of the line. (Plain grep does not support \d, so [0-9] is used for digits.) Those lines are piped to awk, which adds the second column of each line it receives to SUM. After reading all lines received, it prints the SUM variable.
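It may help to see why the original attempt from the question fails. The sketch below is only an illustration: it recreates the sample data in a temporary file (the variable names are made up here) and counts what each pattern matches. Without -E, grep treats | as a literal character; with -E it means "starts with Abc OR ends with def", and def$ is tested against the end of the whole line, not the first column.

```shell
#!/bin/sh
# Recreate the sample data from the question in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Product Cost
Abc....def 10
Abc.def 20
ajsk,,lll 04
EOF

# Basic regex: "|" is a literal character, so no line matches.
bre_count=$(grep -c "^Abc|def$" "$tmp")

# Extended regex: "|" is alternation, but "def$" is checked against the end
# of the whole line (which ends with the cost), so only "^Abc" ever matches.
ere_count=$(grep -E -c "^Abc|def$" "$tmp")

# Restrict the test to column 1 and sum column 2 of the matching rows.
total=$(awk '$1 ~ /^Abc.*def$/ { sum += $2 } END { print sum + 0 }' "$tmp")

echo "BRE matches: $bre_count, ERE matches: $ere_count, total: $total"
rm -f "$tmp"
```

The awk version is the one that actually ties the pattern to the first column.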

Thanks edited. Do you want the command like this?
grep "^Abc.*def *.*$"

If you don't want to use cat, and also show the line numbers:
awk '{print $1}' filename | grep -n keyword

If applicable, you may consider caret ^: grep -E '^foo|^bar' it will match text at the beginning of the string. Column one is always located at the beginning of the string.
Regular expression > POSIX basic and extended
^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

Related

How can I use grep and regular expression to display names with just 3 characters

I am new to grep and UNIX. I have a sample of data and want to display all the first names that contain exactly three characters, e.g. Lee_example, but I am having some difficulty doing that. I am currently using cat file.txt|grep -E "[A-Z][a-z]{2}", but it displays all the names that contain at least 3 characters, not only those with exactly 3 characters.
Sample data
name
number
Lee_example
1
Hector_exaple
2
You need to match the _ after the first name.
grep -E "[A-Z][a-z]{2}_"
With awk:
awk -F_ 'length($1)==3{print $1}'
-F_ tells awk to split the input lines on _. length($1) == 3 checks whether the first field (the name) is 3 characters long, and {print $1} prints the name in that case.
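To see why the trailing _ matters, here is a minimal sketch using the two names from the sample data. The temporary file and variable names are assumptions for illustration, and the ^ anchor is an addition of this sketch rather than part of the answer above.

```shell
#!/bin/sh
tmp=$(mktemp)
printf '%s\n' 'Lee_example' 'Hector_exaple' > "$tmp"

# Without the underscore, any capital followed by two lowercase letters
# matches, so "Hec" in Hector qualifies as well.
loose=$(grep -E -c '[A-Z][a-z]{2}' "$tmp")

# Anchoring at the line start and requiring "_" right after the three
# letters limits the match to first names of exactly three letters.
strict=$(grep -E '^[A-Z][a-z]{2}_' "$tmp")

# awk alternative: split on "_" and test the length of the first field.
names=$(awk -F_ 'length($1) == 3 { print $1 }' "$tmp")

echo "loose=$loose strict=$strict names=$names"
rm -f "$tmp"
```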

find lines existing in one file and not in another, based on a portion of the line

I have two files A.dat and B.dat.
A.dat
112381550RSAP002839002C00000000020200600000110102020-05-26
112539961RSAP002839002C00000000020200700000140102020-05-26
140823748RSAP002839002C00000000020210200000050102020-05-26
110604754RSAP002839002C00000000020200600000110102020-05-26
B.dat
112381550RSAP002839002C00000000020200600000000102020-05-26
112539961RSAP002839002C00000000020200700000000102020-05-26
119A06559RSAP002839002C00000000020210100000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
118372226RSAP002839002C00000000020200800000000102020-05-26
I want to find the records in B.dat that do not exist in A.dat, comparing only the first 22 characters of each line.
The output should be:
119A06559RSAP002839002C00000000020210100000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
118372226RSAP002839002C00000000020200800000000102020-05-26
Tried using grep like below
grep -Fvxf B.dat A.dat > c.dat
But didn't find a way to compare only that portion of the data.
Could you please try the following.
awk 'FNR==NR{array[substr($0,1,22)];next} !(substr($0,1,22) in array)' A.dat B.dat
Explanation: Adding detailed explanation for above.
awk ' ## Start the awk program.
FNR==NR{ ## True only while reading the first file (A.dat).
array[substr($0,1,22)] ## Create an array entry whose index is the first 22 characters of the current line.
next ## Skip all further statements for this line.
}
!(substr($0,1,22) in array) ## For B.dat: if the first 22 characters of the line are NOT in the array, print the line.
' A.dat B.dat ## Input file names, in this order.
I would use the following method based on awk:
awk '{s=substr($0,1,22)}(FNR==NR){a[s];next}!(s in a)' A.dat B.dat
This ensures that you will always match the first 22 characters.
It essentially does the following: every time a line is read (regardless of the file), it creates a short string s containing the first 22 characters of the line. While processing the first file (FNR==NR), it stores the string in an array a. While processing the second file, it checks whether that string is a member of a and, if not, prints the line.
You could also attempt a grep based solution, but this could lead to false positives, depending on how you like your input:
cut -c1-22 A.dat | grep -vFf - B.dat
This however could match the first 22 characters of the lines of A.dat anywhere in the lines of B.dat (not necessarily the first 22 characters)
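A small sketch of that false-positive risk, using a hypothetical 3-character key instead of 22 so the effect is easy to see. The file contents are invented for illustration, and reading patterns from stdin with -f - is assumed to be supported by your grep (it is in GNU grep).

```shell
#!/bin/sh
a=$(mktemp)
b=$(mktemp)
printf '%s\n' 'AAA111' > "$a"
printf '%s\n' 'AAA222' 'BBBAAA' 'CCC333' > "$b"

# Unanchored fixed-string match: "AAA" also occurs inside BBBAAA, so that
# line is wrongly dropped along with AAA222.
unanchored=$(cut -c1-3 "$a" | grep -vFf - "$b")

# awk compares only the leading characters of each line.
anchored=$(awk 'FNR == NR { seen[substr($0, 1, 3)]; next }
                !(substr($0, 1, 3) in seen)' "$a" "$b")

echo "unanchored:"; echo "$unanchored"
echo "anchored:"; echo "$anchored"
rm -f "$a" "$b"
```

With real data the keys are longer, which makes accidental matches less likely but not impossible.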
You can do this with just grep and colrm as follows (a filename of "-" is understood as stdin and you can use that with "-f"):
colrm 23 < A.dat | grep -F -v -f - B.dat
If you're not 100% sure those 22-character patterns are going to match only at the starts of lines, you need to add a '^' to each line of output from colrm and elide the "-F" flag from grep's flags, like so:
colrm 23 < A.dat | sed -e 's/^/\^/;' | grep -v -f - B.dat
If the order of the output is unimportant, here's a grep-free method using bash, sort, and GNU uniq:
sort {A,A,B}.dat | uniq -uw 22
...or in POSIX shell:
sort A.dat A.dat B.dat | uniq -uw 22
Output of either method:
118372226RSAP002839002C00000000020200800000000102020-05-26
119231672RSAP002839002C00000000020200900000000102020-05-26
119A06559RSAP002839002C00000000020210100000000102020-05-26

Get last n characters of one field and complete second field of a string in Linux

I have 2 lines in a file :
MUMBAI,918889986665,POSTPAID,CRBT123,CRBT,SYSTEM,151004,MONTHLY,160201,160302
MUMBAI,912398456781,POSTPAID,SEGP,SEGP30,SMS,151004,MONTHLY,160201,160302
I wanted to cut field 2 and 4 in above lines. Condition is: from field 2, I need only ten digits.
Desired output:
8889986665,CRBT
2398456781,SEGP30
I am trying below command :
cut -d',' -f2 test.txt | cut -c3-12 && cut -d',' -f4 test.txt
My output:
8889986665
2398456781
CRBT
SEGP30
Kindly help me to achieve desired output.
Solution 2:
Here is the solution which will serve the purpose:
cut -d',' -f2,4 test.txt | sed 's/.*\([0-9]\{10\}\),\(.*\)/\1,\2/'
8889986665,CRBT123
2398456781,SEGP
cut gives us the second and fourth fields.
Inside sed, .* skips the leading characters until the first pattern is encountered.
The first pattern is 10 digits followed by a comma:
\([0-9]\{10\}\),
The second pattern is the rest of the line: \(.*\)
Now we print both patterns with a comma in between: \1,\2
Note that the number 10 can be replaced by the number of characters to be extracted before the delimiter, and [0-9] can be replaced by . if these characters can be of any type.
Solution 1:
Using cut will be easiest for you in this case.
You first need to get desired fields (2,4) filtered from the line and then do more filtering (only 10 characters from field #2)
$ cut -d',' -f2,4 test.txt | cut -c3-
8889986665,CRBT123
2398456781,SEGP
This job is best done using awk:
awk -F, -v n=10 '{print substr($2, length($2)-n+1, n) FS $5}' file
8889986665,CRBT
2398456781,SEGP30
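The substr call extracts the last n characters of the 2nd field. Note that $5 is printed rather than $4, because the desired output (CRBT, SEGP30) corresponds to the fifth field of the sample lines.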
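As a quick check of the awk command, the sketch below runs it on the two sample lines from the question. The temporary file and the variable name out are assumptions for illustration only.

```shell
#!/bin/sh
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
MUMBAI,918889986665,POSTPAID,CRBT123,CRBT,SYSTEM,151004,MONTHLY,160201,160302
MUMBAI,912398456781,POSTPAID,SEGP,SEGP30,SMS,151004,MONTHLY,160201,160302
EOF

# Keep the last n characters of field 2, then the separator, then field 5.
out=$(awk -F, -v n=10 '{ print substr($2, length($2) - n + 1, n) FS $5 }' "$tmp")

echo "$out"
rm -f "$tmp"
```

Because the start position is computed from length($2), this keeps working if the prefix before the ten digits is longer or shorter than two characters.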
sed -r 's/[^,]+,..([^,]+,)([^,]+,)([^,]+),.*/\1\3/' file
8889986665,CRBT123
2398456781,SEGP
cat test.txt | cut -f 2,4 -d ","
assuming your file is test.txt

How to use grep or awk to process a specific column ( with keywords from text file )

I've tried many combinations of grep and awk commands to process text from file.
This is a list of customers of this type:
John,Mills,81,Crescent,New York,NY,john@mills.com,19/02/1954
I am trying to separate these records into two categories, MEN and FEMALES.
I have a list of some 5000 Female Names , all in plain text , all in one file.
How can I "grep" the first column ( since I am only matching first names) but still printing the entire customer record ?
I found it easy to "cut" the first column and grep --file=female.names.txt, but this way it's not going to print the entire record any longer.
I am aware of the awk option but in that case I don't know how to read the female names from file.
awk -F ',' ' { if($1==" ???Filename??? ") print $0} '
Many thanks !
You can do this with Awk:
awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv
This prints the lines of your csv file whose first field (the first name) appears in female.names.txt.
awk -F, 'NR==FNR{a[$0]; next} !($1 in a)' female.names.txt file.csv
This outputs the lines whose first name is not found in female.names.txt.
This assumes the format of your female.names.txt file is something like:
Heather
Irene
Jane
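Since the goal was two categories, both commands can be folded into a single pass. This is only a sketch: the sample records and the output file names female.csv and other.csv are invented for illustration.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf '%s\n' 'Heather' 'Irene' 'Jane' > "$dir/female.names.txt"
printf '%s\n' 'John,Mills,New York' 'Jane,Doe,Boston' > "$dir/file.csv"

# First file: remember every name. Second file: route each record to one of
# two output files depending on whether its first field is a remembered name.
awk -F, -v f="$dir/female.csv" -v o="$dir/other.csv" '
    NR == FNR { names[$0]; next }
    { print > (($1 in names) ? f : o) }
' "$dir/female.names.txt" "$dir/file.csv"

female=$(cat "$dir/female.csv")
other=$(cat "$dir/other.csv")
echo "female: $female"
echo "other: $other"
rm -rf "$dir"
```

The match is exact and case-sensitive, so a names file with trailing spaces or different capitalization would need to be normalized first.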
Try this:
grep --file=<(sed 's/.*/^&,/' female.names.txt) datafile.csv
This changes all the names in the list of female names to the regular expression ^name, so it only matches at the beginning of the line and followed by a comma. Then it uses process substitution to use that as the file to match against the data file.
Another alternative is Perl, which can be useful if you're not super-familiar with awk.
#!/usr/bin/perl -anF,
use strict;
our %names;
BEGIN {
while (<ARGV>) {
chomp;
$names{$_} = 1;
}
}
print if $names{$F[0]};
To run (assume you named this file filter.pl):
perl filter.pl female.names.txt < records.txt
So, I've come up with the following:
Suppose, you have a file having the following lines in a file named test.txt:
abe 123 bdb 532
xyz 593 iau 591
Now you want to find the lines whose first field starts and ends with a vowel. A simple grep for vowels would return both lines, but the following returns only the first line, which is the desired output:
egrep "^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]( |$)" test.txt
Then you want to find the lines whose third field starts and ends with a vowel. Similarly, a simple grep would return both lines, but the following returns only the second line, which is the desired output:
egrep "^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]( |$)" test.txt
The group ([0-z]{1,} ) matches one whole field: {1,} specifies that the preceding character, which ranges from 0 to z in the ASCII table, occurs one or more times, and it is followed by the field separator (a space in this case). Change the value within the second pair of curly braces, {0} or {2}, to the desired field number minus 1. The rest of the regular expression states your criteria; the trailing ( |$) makes sure the final vowel really is the last character of the field.
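The same field-based test can be written more directly with awk, which splits each line on whitespace itself. This is an alternative sketch, not the method described above; it assumes the two-line test.txt shown earlier and anchors both ends of the field so that the first and last letters are the ones tested.

```shell
#!/bin/sh
tmp=$(mktemp)
printf '%s\n' 'abe 123 bdb 532' 'xyz 593 iau 591' > "$tmp"

# Field 1 must start and end with a vowel.
first=$(awk '$1 ~ /^[aeiou].*[aeiou]$/' "$tmp")

# Field 3 must start and end with a vowel.
third=$(awk '$3 ~ /^[aeiou].*[aeiou]$/' "$tmp")

echo "first: $first"
echo "third: $third"
rm -f "$tmp"
```

Changing the field is then just a matter of changing $1 or $3, with no counting of repeated groups.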

How to cut first n and last n columns?

How can I cut off the first n and the last n columns from a tab delimited file?
I tried this to cut the first n columns, but I have no idea how to combine the first and last n columns:
cut -f 1-10 -d "<CTR>v <TAB>" filename
Cut can take several ranges in -f:
Columns up to 4 and from 7 onwards:
cut -f -4,7-
or for fields 1,2,5,6 and from 10 onwards:
cut -f 1,2,5,6,10-
etc
The first part of your question is easy. As already pointed out, cut accepts omission of either the starting or the ending index of a column range, interpreting this as meaning either “from the start to column n (inclusive)” or “from column n (inclusive) to the end,” respectively:
$ printf 'this:is:a:test' | cut -d: -f-2
this:is
$ printf 'this:is:a:test' | cut -d: -f3-
a:test
It also supports combining ranges. If you want, e.g., the first 3 and the last 2 columns in a row of 7 columns:
$ printf 'foo:bar:baz:qux:quz:quux:quuz' | cut -d: -f-3,6-
foo:bar:baz:quux:quuz
However, the second part of your question can be a bit trickier depending on what kind of input you’re expecting. If by “last n columns” you mean “last n columns (regardless of their indices in the overall row)” (i.e. because you don’t necessarily know how many columns you’re going to find in advance) then sadly this is not possible to accomplish using cut alone. In order to effectively use cut to pull out “the last n columns” in each line, the total number of columns present in each line must be known beforehand, and each line must be consistent in the number of columns it contains.
If you do not know how many “columns” may be present in each line (e.g. because you’re working with input that is not strictly tabular), then you’ll have to use something like awk instead. E.g., to use awk to pull out the last 2 “columns” (awk calls them fields, the number of which can vary per line) from each line of input:
$ printf '/a\n/a/b\n/a/b/c\n/a/b/c/d\n' | awk -F/ '{print $(NF-1) FS $(NF)}'
/a
a/b
b/c
c/d
You can use cut as follows:
-d sets the delimiter, -f selects the fields
$'\t' is used for tab-separated fields
cut -d$'\t' -f 1-3,7-
To use AWK to cut off the first and last fields:
awk '{$1 = ""; $NF = ""; print}' inputfile
Unfortunately, that leaves the field separators, so
aaa bbb ccc
becomes
[space]bbb[space]
To do this using kurumi's answer which won't leave extra spaces, but in a way that's specific to your requirements:
awk '{delim = ""; for (i=2;i<=NF-1;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
This also fixes a couple of problems in that answer.
To generalize that:
awk -v skipstart=1 -v skipend=1 '{delim = ""; for (i=skipstart+1;i<=NF-skipend;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
Then you can change the number of fields to skip at the beginning or end by changing the variable assignments at the beginning of the command.
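Since the question is about a tab-delimited file, here is a sketch of the generalized command with tab set as both input and output separator. The sample rows and the choice of skipstart=1 and skipend=2 are made up for illustration.

```shell
#!/bin/sh
tmp=$(mktemp)
printf 'a\tb\tc\td\te\n1\t2\t3\t4\t5\n' > "$tmp"

# Drop the first column and the last two columns of each tab-separated row.
out=$(awk -F'\t' -v OFS='\t' -v skipstart=1 -v skipend=2 '{
    delim = ""
    for (i = skipstart + 1; i <= NF - skipend; i++) {
        printf "%s%s", delim, $i   # separator goes before every field but the first
        delim = OFS
    }
    printf "\n"
}' "$tmp")

echo "$out"
rm -f "$tmp"
```

Rows with too few fields simply produce an empty line, because the loop body never runs.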
You can use Bash for that:
while read -a cols; do echo ${cols[@]:0:1} ${cols[@]:1,-1}; done < file.txt
You can use awk. For example, to cut off the 1st, 2nd, and last 3 columns:
awk '{for(i=3;i<=NF-3;i++) print $i}' file
If you have a programming language such as Ruby (1.9+):
$ ruby -F"\t" -ane 'print $F[2..-3].join("\t")' file
Try the following:
echo a#b#c | awk -F"#" '{$1 = ""; $NF = ""; print}' OFS=""
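Use
cut -b COLUMN_N_BEGINS-COLUMN_N_UNTIL INPUT.TXT > OUTPUT.TXT
-b selects byte positions rather than fields, so it suits fixed-width columns. For tab-delimited files, cut -f already uses tab as its default delimiter.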

Resources