Retrieve substring with grep

Retrieve substring with grep - linux

I've got a question concerning grep.
I have some address data in an asc file as simple text. The first 30 characters are for the name. If the name is shorter than the 30 characters whitespaces fill it up to ensure its length is 30. At position 31 is a whitespace to separate the name from the next data which is the address. After the address is also a whitespace and some other data. My plan is to retrieve the address, which starts at index 32 and continues to index 50. I mostly got only nothing or the data beginning at the start of the line. I tried several methods such as
grep -iE '^.{30}' '.{8}$' myfile.asc
or
grep –o -P '^.{31,34}' myfile.asc
I can't search for a certain pattern since every set of data is different except the whitespaces which separate the data. Is it possible to retrieve my substring like that without relying on other methods through a pipe? I prefer to use grep since performance is an issue.

Why don't you use cut instead of grep if you're dealing with fixed positions?
cut -c 32-50 myfile.asc

Related

Using sed to obtain pattern range through multiple files in a directory

I was wondering if it was possible to use the sed command to find a range between 2 patterns (in this case, dates) and output these lines in the range to a new file.
Right now, I am just looking at one file and getting lines within my time range of the file FileMoverTransfer.log. However, after a certain time period, these logs are moved to new log files with a suffix such as FileMoverTransfer.log-20180404-xxxxxx.gz. Here is my current code:
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log >> /public/FileMoverRoot/logs/intervalFMT.log
While this doesn't work, as sed isn't able to look through all of the files in the directory starting with FileMoverTransfer.log?
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log* >> /public/FileMoverRoot/logs/intervalFMT.log
Any help would be greatly appreciated. Thanks!

The range operator only operates within a single file, so you can't use it if the start is in one file and the end is in another file.
You can use cat to concatenate all the files, and pipe this to sed:
cat FileMoverTransfer.log* | sed -n "/^$start_date/,/^$end_date/p;/^$end_date/q" >> /public/FileMoverRoot/logs/intervalFMT.log
And instead of quoting and unquoting the sed command, you can use double quotes so that the variables will be expanded inside it. This will also prevent problems if the variables contain whitespace.

awk solution
As the OP confirmed that an awk solution would be acceptable, I post it.
(gunzip -c FileMoverTransfer.log-*.gz; cat FileMoverTransfer.log ) \
|awk -v st="$start_date" -v en="$end_date" '$1>=st&&$1<=en{print;next}$1>en{exit}'\
>/public/FileMoverRoot/logs/intervalFMT.log
This solution is functionally almost identical to Barmar’s sed solution, with the difference that his solution, like the OP’s, will print and quit at the first record matching the end date, while mine will print all lines matching the end date and quit at the first record past the end date, without printing it.
Some remarks:
The OP didn't specify the date format. I suppose it is a format compatible with ordinary string order, otherwise some conversion function should be used.
The files FileMoverTransfer.log-*.gz must be named in such a way that their alphabetical ordering corresponds to the chronological order (which is probably the case.)
I suppose that the dates are separated from the rest of the line by whitespace. If they aren’t, you have to supply the -F option to awk. E.g., if the dates are separated by -, you must write awk -F- ...
awk is much faster than sed in this case, because awk simply looks for the separator (whitespace or whatever was supplied with -F) while sed performs a regexp match.
There is no concept of range in my code, only date comparison. The only place where I suppose that the lines are ordered is when I say $1>en{exit}, that is exit when a line is newer than the end date. If you remove that final pattern and its action, the code will run through the whole input, but you could drop the requirement that the files be ordered.

Using grep cmd to filter by first letter, #, and "."

I have a file (testdata.txt) with many email addresses and random text.
Using the grep command:
I want to make sure they are email addresses and not text, so I want to filter them out so that only lines with "#" are included.
I also want to filter them out so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name.
Eg. john.doe#gmail.com
However, johndoe#gmail.com would be included.
Lastly, I want to get the count of all the email addresses that follow these rules.
So far I've only been able to make sure they are email addresses by doing
grep -c "#" testdata.txt
.
Using the grep cmd I also want to check how many email addresses have a government domain ("gov").
I wanted to do a check that it has a # sign in the line and that it also contains gov. However, I don't get the answer I want when I do any of the following.
grep -c "#\|gov" testdata.txt I get the amount of lines that have a # not # and gov
grep -c "#/|gov" testdata.txt I get 0
grep -c "#|gov" testdata.txt I get 0

Going bottom-up with your questions.
You are using grep in its Basic regular expressions mode. In this mode \| means OR, | means the symbol |, and /| mean the symbols /|.
If you were looking for emails in the .gov domain, you would probably be looking for a sequence starting with # and followed by symbols that are permitted in an Internet domain name and the symbols .gov, or .GOV, or .Gov.
Borrowing from another post on this site you would end up with something like
grep -c "#[A-Za-z0-9][A-Za-z0-9.-]*\.\(gov\|Gov\|GOV\)"
skipping another 5 possible spellings for the top level domain, e.g. GoV.
However I would use the -i switch that means ignore case to simplify the expression
grep -ci "#[a-z0-9][a-z0-9.-]*\.gov"
Now you were not very clear regarding the use of dots separating parts of the name:
I also want to filter them out so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name. Eg. john.doe#gmail.com However, johndoe#gmail.com would be included.
So I will not touch this part.
Finally You could use range expressions to filter the addresses that start with the letters A-M
grep -ci "[a-m][a-z0-9._%+-]*#[a-z0-9][a-z0-9.-]*\.gov"
Please note that this is not an implementation of the Internet Message Format RFC 5322 address specification but only an approximation used mainly for didactic purpose. Never leave not fully compliant implementations in production code.

Grep expression filter out lines of the form [alnum][punct][alnum]

Hi all my first post is for what I thought would be simple ...
I haven't been able to find an example of a similar problem/solution.
I have thousands of text files with thousands of lines of content in the form
<word><space><word><space><number>
Example:
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
In the above example I want to exclude line 3 as it contains internal punctuation
I'm trying to use the GNU grep 2.25 however not having luck
my initial attempt was (however this does not allow the "-" internal to the pattern):
grep -v [:alnum:]*[:punct:]*[:alnum:]* filename
so tried this however
grep -v [:alnum:]*[:space:]*[!]*["]*[#]*[$]*[%]*[&]*[']*[(]*[)]*[*]*[+]*[,]*[.]*[/]*[:]*[;]*[<]*[=]*[>]*[?]*[#]*[[]*[\]*[]]*[^]*[_]*[`]*[{]*[|]*[}]*[~]*[.]*[:space:]*[:alnum:]* filename
however I need to factor in spaces and - as these are acceptable internal to the string.
I had been trying with the :punct" set however now see it contains - so clearly that will not work
I do currently have a stored procedure in TSQL to process these however would prefer to preprocess prior to loading if possible as the routine takes some seconds per file.
Has someone been able to achieve something similar?

On the face of it, you're looking for the 'word space word space number' schema, assuming 'word' is 'one alphanumeric optionally followed by zero or one occurrences of zero or more alphanumeric or punctuation characters and ending with an alphanumeric', and 'space' is 'one or more spaces' and 'number' is 'one or more digits'.
In terms of grep -E (aka egrep):
grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+'
That contains:
[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?
That detects a word with any punctuation surrounded by alphanumerics, and:
[[:space:]]+
[[:digit:]]+
which look for one or more spaces or digits.
Using a mildly extended data file, this produces:
$ cat data
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$ grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+' data
example for 1
useful when 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$
It eliminates the for. However 1 line as required.

Your regex contains a long string of ordered optional elements, but that means it will fail if something happens out of order. For example,
[!]*[?]*
will capture !? but not ?! (and of course, a character class containing a single character is just equivalent to that single character, so you might as well say !*?*).
You can instead use a single character class which contains all of the symbols you want to catch. As soon as you see one next to an alphanumeric character, you are done, so you don't need for the regex to match the entire input line.
grep -v '[[:alnum:]][][!"#$%&'"'"'()*+,./:;<=>?#\^_`{|}~]' filename
Also notice how the expression needs to be in single quotes in order for the shell not to interfere with the many metacharacters here. In order for a single-quoted string to include a literal single quote, I temporarily break out into a double-quoted string; see here for an explanation (I call this "seesaw quoting").
In a character class, if the class needs to include ], it needs to be at the beginning of the enumerated list; for symmetry and idiom, I also moved [ next to it.
Moreover, as pointed out by Jonathan Leffler, a POSIX character class name needs to be inside a character class; so to match one character belonging to the [:alnum:] named set, you say [[:alnum:]]. (This means you can combine sets, so [-[:alnum:].] covers alphanumerics plus dash and period.)
If you need to constrain this to match only on the first field, change the [[:alnum:]] to ^[[:alnum:]]\+.
Not realizing that a*b*c* matches anything is a common newbie error. You want to avoid writing an expression where all elements are optional, because it will match every possible string. Focus on what you want to match (the long list of punctuation characters, in your case) and then maybe add optional bits of context around it if you really need to; but the fewer of these you need, the faster it will run, and the easier it will be to see what it does. As a quick rule of thumb, a*bc* is effectively precisely equivalent to just b -- leading or trailing optional expressions might as well not be specified, because they do not affect what is going to be matched.

Grep filtering of the dictionary

I'm having a hard time getting a grasp of using grep for a class i am in was hoping someone could help guide me in this assignment. The Assignment is as follows.
Using grep print all 5 letter lower case words from the linux dictionary that have a single letter duplicated one time (aabbe or ababe not valid because both a and b are in the word twice). Next to that print the duplicated letter followed buy the non-duplicated letters in alphabetically ascending order.
The Teacher noted that we will need to use several (6) grep statements (piping the results to the next grep) and a sed statement (String Editor) to reformat the final set of words, then pipe them into a read loop where you tear apart the three non-dup letters and sort them.
Sample Output:
aback a bck
abaft a bft
abase a bes
abash a bhs
abask a bks
abate a bet
I haven't figured out how to do more then printing 5 character words,
grep "^.....$" /usr/share/dict/words |

Didn't check it thoroughly, but this might work
tr '[:upper:]' '[:lower:]' | egrep -x '[a-z]{5}' | sed -r 's/^(.*)(.)(.*)\2(.*)$/\2 \1\3\4/' | grep " " | egrep -v "(.).*\1"
But do your way because someone might see it here.

All in one sed
sed -n '
# filter 5 letter word
/[a-zA-Z]\{5\}/ {
# lower letters
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxya/
# filter non single double letter
/\(.\).*\1/ !b
/\(.\).*\(.\).*\1.*\1/ b
/\(.\).*\(.\).*\1.*\2/ b
/\(.\).*\(.\).*\2.*\1/ b
# extract peer and single
s/\(.\)*\(.\)\(.*\)\2\(.*\)/a & \2:\1\3\4/
# sort singles
:sort
s/:\([^a]*\)a\(.*\)$/:\1\2a/
y/abcdefghijklmnopqrstuvwxyz/zabcdefghijklmnopqrstuvwxy/
/^a/ !b sort
# clean and print
s/..//
s/:/ /p
}' YourFile
posix sed so --posix on GNU sed

The first bit, obviously, is to use grep to get it down to just the words that have a single duplication in. I will give you some clues on how to do that.
The key is to use backreferences, which allow you to specify that something that matched a previous expression should appear again. So if you write
grep -E "^(.)...\1...\1$"
then you'll get all the words that have the starting letter reappearing in fifth and ninth positions. The point of the brackets is to allow you to refer later to whatever matched the thing in brackets; you do that with a \1 (to match the thing in the first lot of brackets).
You want to say that there should be a duplicate anywhere in the word, which is slightly more complicated, but not much. You want a character in brackets, then any number of characters, then the repeated character (with no ^ or $ specified).
That will also include ones where there are two or more duplicates, so the next stage is to filter them out. You can do that by a grep -v invocation. Once you've got your list of 5-character words that have at least one duplicate, pipe them through a grep -v call that strips out anything with two (or more) duplicates in. That'll have a (.), and another (.), and a \1, and a \2, and these might appear in several different orders.
You'll also need to strip out anything that has a (.) and a \1 and another \1, since that will have a letter with three occurrences.
That should be enough to get you started, at any rate.

Your next step should be to find the 5-letter words containing a duplicate letter. To do that, you will need to use back-references. Example:
grep "[a-z]*\([a-z]\)[a-z]*\$1[a-z]*"
The $1 picks up the contents of the first parenthesized group and expects to match that group again. In this case, it matches a single letter. See: http://www.thegeekstuff.com/2011/01/advanced-regular-expressions-in-grep-command-with-10-examples--part-ii/ for more description of this capability.
You will next need to filter out those cases that have either a letter repeated 3 times or a word with 2 letters repeated. You will need to use the same sort of back-reference trick, but you can use grep -v to filter the results.
sed can be used for the final display. Grep will merely allow you to construct the correct lines to consider.

Note that the dictionary contains capital letters and also non-letter characters, plus that strange characters used in Southern Europe. say "è".
If you want to distinguish "A" and "a", it's automatic, on the other hand if "A" and "a" are the same letter, in ALL grep invocations you must use the -i option, to instruct grep to ignore case.
Next, you always want to pass the -E option, to avoid the so called backslashitis gravis in the regexp that you want to pass to grep.
Further, if you want to exclude the lines matching a regexp from the output, the correct option is -v.
Eventually, if you want to specify many different regexes to a single grep invocation, this is the way (just an example btw)
grep -E -i -v -e 'regexp_1' -e 'regexp_2' ... -e 'regexp_n'
The preliminaries are after us, let's look forward, use the answer from chiastic-security as a reference to understand the procedings
There are only these possibilities to find a duplicate in a 5 character string
(.)\1
(.).\1
(.)..\1
(.)...\1
grep -E -i -e 'regexp_1' ...
Now you have all the doubles, but this doesn't exclude triples etc that are identified by the following patterns (Edit added a cople of additional matching triples patterns)
(.)\1\1
(.).\1\1
(.)\1.\1
(.)..\1\1
(.).\1.\1
(.)\1\1\1
(.).\1\1\1
(.)\1\1\1\1\
you want to exclude these patterns, so grep -E -i -v -e 'regexp_1' ...
at his point, you have a list of words with at least a couple of the same character, and no triples, etc and you want to drop double doubles, these are the regexes that match double doubles
(.)(.)\1\2
(.)(.)\2\1
(.).(.)\1\2
(.).(.)\2\1
(.)(.).\1\2
(.)(.).\2\1
(.)(.)\1.\2
(.)(.)\2.\1
and you want to exclude the lines with these patterns, so its grep -E -i -v ...
A final hint, to play with my answer copy a few hundred lines of the dictionary in your working directory, head -n 3000 /usr/share/dict/words | tail -n 300 > ./300words so that you can really understand what you're doing, avoiding to be overwhelmed by the volume of the output.
And yes, this is not a complete answer, but it is maybe too much, isn't it?

sed regex not being greedy?

In bash I have a string variable tempvar, which is created thus:
tempvar=`grep -n 'Mesh Tally' ${meshtalfile}`
meshtalfile is a (large) input file which contains some header lines and a number of blocks of data lines, each marked by a beginning line which is searched for in the grep above.
In the case at hand, the variable tempvar contains the following string:
5: Mesh Tally Number 4 977236: Mesh Tally Number 14 1954467: Mesh Tally Number 24 4354479: Mesh Tally Number 34
I now wish to extract the line number relating to a particularly mesh tally number - so I define a variable meshnum1 as equal to 24, and run the following sed command:
echo ${tempvar} | sed -r "s/^.*([0-9][0-9]*):\sMesh\sTally\sNumber\s${meshnum1}.*$/\1/"
This is where things go wrong. I expect the output 1954467, but instead I get 7. Trying with number 34 instead returns 9 instead of 4354479. It seems that sed is returning only the last digit of the number - which surely violates the principle of greedy matching? And oddly, when I move the open parenthesis ( left a couple of characters to include .*, it returns the whole line up to and including the single character it was previously returning. Surely it cannot be greedy in one situation and antigreedy in another? Hopefully I have just done something stupid with the syntax...

The problem is that the .* is being greedy too, which means that it will get all numbers too. Since you force it to get at least one digit in the [0-9][0-9]* part, the .* before it will be greedy enough to leave only one digit for the expression after it.
A solution could be:
echo ${tempvar} | sed -r "s/^.*\s([0-9][0-9]*):\sMesh\sTally\sNumber\s${meshnum1}.*$/\1/"
Where now the \s between the .* and the [0-9][0-9]* explictly forces there to be a space before the digits you want to match.
Hope this helps =)

Are the values in $tempvar supposed to be multiple or a single line? Because if it is a single line, ".*$" should match to the end of line, meaning all the other values too, right?

There's no need for sed, here's one way using GNU grep:
echo "$tempvar" | grep -oP "[0-9]+(?=:\sMesh\sTally\sNumber\s${meshnum1}\b)"

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Retrieve substring with grep - linux

Why don't you use cut instead of grep if you're dealing with fixed positions? cut -c 32-50 myfile.asc

Related

Using sed to obtain pattern range through multiple files in a directory

Using grep cmd to filter by first letter, #, and "."

Grep expression filter out lines of the form [alnum][punct][alnum]

Grep filtering of the dictionary

sed regex not being greedy?

Categories

Resources