Get string between characters in Linux - linux

I have text file, which has data like this:
asd.www.aaa.com
abc.abc.co
look at me
asd.www.bbb.com
bzc.bzc.co
asd.www.ddd.com
hello world
www.eee.com
xx.yy.z
I want strings which is surrounded by "asd.www.[i want this string].com".
So my output will be like:
aaa
bbb
ddd

try:
grep -Po '^asd\.www\.\K[^.]*(?=\.com)' file
if asd could be in middle of the string, remove the first ^.
there could be other corner cases, like the greedy matching etc. it depends on your source input.

I suggested cut originally but I misread your question. So I'm going to post an alternative with awk instead. You are looking for the third column of your text input where there are a total of four columns.
less file.txt | awk -F '.' '{ if ($4 != "") print $3 }'
It splits your string on . and only prints out column $3 if column $4 is blank. This should yield the following given your example text:
aaa
bbb
ddd

Related

How to use grep to match two strings in the same line

How can I use grep to find two terms / strings in one line?
The output, or an entry in the log file, should only be made if the two terms / strings have been found.
I have made the following attempts:
egrep -n --color '(string1.*string2)' debuglog.log
In this example, everything between the two strings is marked.
But I would like to see only the two found strings marked.
Is that possible?
Maybe you could do this with another tool, I am open for suggestions.
The simplest solution would be to first select only the lines that contain both strings and then grep twice to color the matches, eg:
egrep 'string1.*string2|string2.*string1' |
egrep -n --color=always 'string1' | egrep --color 'string2'
It is important to set color to always, otherwise the grep won't output the color information to the pipe.
Here is single command awk solution that prefixes and suffixes matched strings with color codes:
awk '/string1.*string2/{
gsub(/string1|string2/, "\033[01;31m\033[K&\033[m"); print}' file
I know some people will disagree, but I think the best way is to do it like this :
Lets say this is your input :
$ cat > fruits.txt
apple banana
orange strawberry
coconut watermelon
apple peach
With this code you can get exactly what you need, and the code looks nicer and cleaner :
awk '{ if ( $0 ~ /apple/ && $0 ~ /banana/ )
{
print $0
}
}' fruits.txt
But, as I said before, some people will disagree as its too much typing. ths short way with grep is just concatenate many greps , e.g. :
grep 'apple' fruits.txt | grep 'banana'
Regards!
I am a little confused of what you really want as there was no sample data or expected output, but:
$ cat foo
1
2
12
21
132
13
And the awk that prints the matching parts of the records:
$ awk '
/1/ && /2/ {
while(match($0,/1|2/)) {
b=b substr($0,RSTART,RLENGTH)
$0=substr($0,RSTART+RLENGTH)
}
print b
b=""
}' foo
12
21
12
but fails with printing overlapping matches.

grep string after first occurrence of numbers

How do I get a string after the first occurrence of a number?
For example, I have a file with multiple lines:
34 abcdefg
10 abcd 123
999 abc defg
I want to get the following output:
abcdefg
abcd 123
abc defg
Thank you.
You could use Awk for this, loop through all the columns in each line upto NF (last column in each line) and once matching the first word, print the column next to it. The break statement would exit the for loop after the first iteration.
awk '{ for(i=1;i<=NF;i++) if ($i ~ /[[:digit:]]+/) { print $(i+1); break } }' file
It is not clear what you exactly want, but you can try to express it in sed.
Remove everything until the first digit, the next digits and any spaces.
sed 's/[^0-9]*[0-9]\+ *//'
Imagine the following two input files :
001 ham
03spam
3 spam with 5 eggs
A quick solution with awk would be :
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $1}' <file>
This line substitutes the first string of anything that does not contain a number followed by a number by an empty set (""). This way $0 is redefined and you can reprint the first field or the remainder of the field. This line gives exactly the following output.
ham
spam
spam
If you are interested in the remainder of the line
awk '{sub(/[^0-9]*[0-9]+ */,"",$0); print $0}' <file>
This will have as an output :
ham
spam
spam with 5 eggs
Be aware that an extra " *" is needed in the regular expression to remove all trailing spaces after the number. Without it you would get
awk '{sub(/[^0-9]*[0-9]+/,"",$0); print $0}' <file>
ham
spam
spam with 5 eggs
You can remove digits and whitespaces using sed:
sed -E 's/[0-9 ]+//' file
grep can do the job:
$ grep -o -P '(?<=[0-9] ).*' inputFIle
abcdefg
abcd 123
abc defg
For completeness, here is a solution with perl:
$ perl -lne 'print $1 if /[0-9]+\s*(.*)/' inputFIle
abcdefg
abcd 123
abc defg

Extract values from a fixed-width column

I have text file named file that contains the following:
Australia AU 10
New Zealand NZ 1
...
If I use the following command to extract the country names from the first column:
awk '{print $1}' file
I get the following:
Australia
New
...
Only the first word of each country name is output.
How can I get the entire country name?
Try this:
$ awk '{print substr($0,1,15)}' file
Australia
New Zealand
To complement Raymond Hettinger's helpful POSIX-compliant answer:
It looks like your country-name column is 23 characters wide.
In the simplest case, if you don't need to trim trailing whitespace, you can just use cut:
# Works, but has trailing whitespace.
$ cut -c 1-23 file
Australia
New Zealand
Caveat: GNU cut is not UTF-8 aware, so if the input is UTF-8-encoded and contains non-ASCII characters, the above will not work correctly.
To trim trailing whitespace, you can take advantage of GNU awk's nonstandard FIELDWIDTHS variable:
# Trailing whitespace is trimmed.
$ awk -v FIELDWIDTHS=23 '{ sub(" +$", "", $1); print $1 }' file
Australia
New Zealand
FIELDWIDTHS=23 declares the first field (reflected in $1) to be 23 characters wide.
sub(" +$", "", $1) then removes trailing whitespace from $1 by replacing any nonempty run of spaces (" +") at the end of the field ($1) with the empty string.
However, your Linux distro may come with Mawk rather than GNU Awk; use awk -W version to determine which one it is.
For a POSIX-compliant solution that trims trailing whitespace, extend Raymond's answer:
# Trailing whitespace is trimmed.
$ awk '{ c=substr($0, 1, 23); sub(" +$", "", c); print c}' file
Australia
New Zealand
to get rid of the last two columns
awk 'NF>2 && NF-=2' file
NF>2 is the guard to filter records with more than 2 fields. If your data is consistent you can drop that to simply,
awk 'NF-=2' file
This isn't relevant in the case where your data has spaces, but often it doesn't:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
foo bar baz etc...
In these cases it's really easy to get, say, the IMAGE column using tr to remove multiple spaces:
$ docker ps | tr --squeeze-repeats ' '
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
foo bar baz
Now you can pipe this (without the pesky header row) to cut:
$ docker ps | tr --squeeze-repeats ' ' | tail -n +2 | cut -d ' ' -f 2
foo

How to use grep or awk to process a specific column ( with keywords from text file )

I've tried many combinations of grep and awk commands to process text from file.
This is a list of customers of this type:
John,Mills,81,Crescent,New York,NY,john#mills.com,19/02/1954
I am trying to separate these records into two categories, MEN and FEMALES.
I have a list of some 5000 Female Names , all in plain text , all in one file.
How can I "grep" the first column ( since I am only matching first names) but still printing the entire customer record ?
I found it easy to "cut" the first column and grep --file=female.names.txt, but this way it's not going to print the entire record any longer.
I am aware of the awk option but in that case I don't know how to read the female names from file.
awk -F ',' ' { if($1==" ???Filename??? ") print $0} '
Many thanks !
You can do this with Awk:
awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv
Would print the lines of your csv file that contain first names of any found in your file female.names.txt.
awk -F, 'NR==FNR{a[$0]; next} !($1 in a)' female.names.txt file.csv
Would output lines not found in female.names.txt.
This assumes the format of your female.names.txt file is something like:
Heather
Irene
Jane
Try this:
grep --file=<(sed 's/.*/^&,/' female.names.txt) datafile.csv
This changes all the names in the list of female names to the regular expression ^name, so it only matches at the beginning of the line and followed by a comma. Then it uses process substitution to use that as the file to match against the data file.
Another alternative is Perl, which can be useful if you're not super-familiar with awk.
#!/usr/bin/perl -anF,
use strict;
our %names;
BEGIN {
while (<ARGV>) {
chomp;
$names{$_} = 1;
}
}
print if $names{$F[0]};
To run (assume you named this file filter.pl):
perl filter.pl female.names.txt < records.txt
So, I've come up with the following:
Suppose, you have a file having the following lines in a file named test.txt:
abe 123 bdb 532
xyz 593 iau 591
Now you want to find the lines which include the first field having the first and last letters as vowels. If you did a simple grep you would get both of the lines but the following will give you the first line only which is the desired output:
egrep "^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]" test.txt
Then you want to the find the lines which include the third field having the first and last letters as vowels. Similary, if you did a simple grep you would get both of the lines but the following will give you the second line only which is the desired output:
egrep "^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]" test.txt
The value in the first curly braces {1,} specifies that the preceding character which ranges from 0 to z according to the ASCII table, can occur any number of times. After that, we have the field separator space in this case. Change the value within the second curly braces {0} or {2} to the desired field number-1. Then, use a regular expression to mention your criteria.

How to cut first n and last n columns?

How can I cut off the first n and the last n columns from a tab delimited file?
I tried this to cut first n column. But I have no idea to combine first and last n column
cut -f 1-10 -d "<CTR>v <TAB>" filename
Cut can take several ranges in -f:
Columns up to 4 and from 7 onwards:
cut -f -4,7-
or for fields 1,2,5,6 and from 10 onwards:
cut -f 1,2,5,6,10-
etc
The first part of your question is easy. As already pointed out, cut accepts omission of either the starting or the ending index of a column range, interpreting this as meaning either “from the start to column n (inclusive)” or “from column n (inclusive) to the end,” respectively:
$ printf 'this:is:a:test' | cut -d: -f-2
this:is
$ printf 'this:is:a:test' | cut -d: -f3-
a:test
It also supports combining ranges. If you want, e.g., the first 3 and the last 2 columns in a row of 7 columns:
$ printf 'foo:bar:baz:qux:quz:quux:quuz' | cut -d: -f-3,6-
foo:bar:baz:quux:quuz
However, the second part of your question can be a bit trickier depending on what kind of input you’re expecting. If by “last n columns” you mean “last n columns (regardless of their indices in the overall row)” (i.e. because you don’t necessarily know how many columns you’re going to find in advance) then sadly this is not possible to accomplish using cut alone. In order to effectively use cut to pull out “the last n columns” in each line, the total number of columns present in each line must be known beforehand, and each line must be consistent in the number of columns it contains.
If you do not know how many “columns” may be present in each line (e.g. because you’re working with input that is not strictly tabular), then you’ll have to use something like awk instead. E.g., to use awk to pull out the last 2 “columns” (awk calls them fields, the number of which can vary per line) from each line of input:
$ printf '/a\n/a/b\n/a/b/c\n/a/b/c/d\n' | awk -F/ '{print $(NF-1) FS $(NF)}'
/a
a/b
b/c
c/d
You can cut using following ,
-d: delimiter ,-f for fields
\t used for tab separated fields
cut -d$'\t' -f 1-3,7-
To use AWK to cut off the first and last fields:
awk '{$1 = ""; $NF = ""; print}' inputfile
Unfortunately, that leaves the field separators, so
aaa bbb ccc
becomes
[space]bbb[space]
To do this using kurumi's answer which won't leave extra spaces, but in a way that's specific to your requirements:
awk '{delim = ""; for (i=2;i<=NF-1;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
This also fixes a couple of problems in that answer.
To generalize that:
awk -v skipstart=1 -v skipend=1 '{delim = ""; for (i=skipstart+1;i<=NF-skipend;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
Then you can change the number of fields to skip at the beginning or end by changing the variable assignments at the beginning of the command.
You can use Bash for that:
while read -a cols; do echo ${cols[#]:0:1} ${cols[#]:1,-1}; done < file.txt
you can use awk, for example, cut off 1st,2nd and last 3 columns
awk '{for(i=3;i<=NF-3;i++} print $i}' file
if you have a programing language such as Ruby (1.9+)
$ ruby -F"\t" -ane 'print $F[2..-3].join("\t")' file
Try the following:
echo a#b#c | awk -F"#" '{$1 = ""; $NF = ""; print}' OFS=""
Use
cut -b COLUMN_N_BEGINS-COLUMN_N_UNTIL INPUT.TXT > OUTPUT.TXT
-f doesn't work if you have "tabs" in the text file.

Resources