How to find all strings matching a pattern in Linux

Can someone suggest a grep command for finding a string in a list of files by providing part of the string?
I have a list of files in a directory which contain email addresses. I want to extract all email addresses ending with a particular domain name. For example, I want to get a list of all emails ending with "@google.com" in a file.
The directory contains N files. The data in each file is a single line separated with commas. I have tried so many options with the grep command and none worked.
Thanks,

You can try something like:
grep -E -o "\b[a-zA-Z0-9._-]+@google\.com\b" *
Basically, include in the [a-zA-Z0-9._-] character class the characters that are acceptable in an email address. Note that inside a bracket expression the - must come first or last, otherwise .-_ is read as a character range from . to _ rather than as three literal characters.
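For a quick check, if a hypothetical emails.txt (an invented file name) contains
alice@google.com,bob@yahoo.com,carol.smith@google.com
then
grep -E -o "\b[a-zA-Z0-9._-]+@google\.com\b" emails.txt
prints
alice@google.com
carol.smith@google.com
with each match on its own line thanks to -o.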

Related

Adding many dictionaries to aspell

I have a TeX document spanning several files that I want to check with aspell.
The command I use is:
cat $f | aspell list --extra-dicts="./names.spl" --mode=tex -l en |sort -u
for every file name f.
Some of the files concern pronunciation and have "words" like aj and oo inside them, which aspell counts as spelling mistakes. I want to filter them out without putting them into the names.spl dictionary (first because they are not names, and second because they shouldn't be ignored in the other files).
The aspell documentation states that the extra-dicts argument can receive a list, but I can't seem to delimit it properly. I tried commas, colons, and plain spaces, to no avail. The values are either treated as one long file path or get entirely separated from the extra-dicts keyword.
I also tried to use the option twice, but the second occurrence just overrides the first.
Am I missing something trivial about how lists are provided as command-line arguments in the terminal?
According to the texinfo manual (info aspell), aspell uses a list option format that is different from other GNU programs, in which the base option name is prefixed with add- or rem- to respectively add or remove items from a list:
4.1.1.3 List options
To add a value to the list, prefix the option name with an 'add-' and
then specify the value to add. For example, to add the URL filter use
'--add-filter url'. To remove a value from a list option, prefix the
option name with a 'rem-' and then specify the value to remove. For
example, to remove the URL filter use '--rem-filter url'. To remove
all items from a list, prefix the option name with a 'clear-' without
specifying any value. For example, to remove all filters use
'--clear-filter'.
Following this pattern for the --extra-dicts option, you would add multiple extra dictionaries as
--add-extra-dicts dict1 --add-extra-dicts dict2
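Applied to your pipeline, that looks something like this (a sketch; pron.spl stands in for whatever your second dictionary is called):
cat $f | aspell list --add-extra-dicts "./names.spl" --add-extra-dicts "./pron.spl" --mode=tex -l en | sort -u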
The documentation for Aspell 0.60.7-20110707 also mentions a (possibly newer) more direct delimited-list format, using a third prefix, lset-:
A list option can also be set directly, in which case it will be
set to a single value. To directly set a list option to multiple
values prefix the option name with a 'lset-' and separate each value
with a ':'. For example, to use the URL and TeX filter use
'--lset-filter url:tex'.
Following this format, your option would become
--lset-extra-dicts dict1:dict2
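or, applied to the original pipeline (pron.spl again being a hypothetical second dictionary):
cat $f | aspell list --lset-extra-dicts "./names.spl:./pron.spl" --mode=tex -l en | sort -u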

Search ill-encoded characters in a file on Linux

I have a lot of huge CSV files, and some of them contain ill-encoded characters: in vi, I see things like "<8f>" or "<8e>", for example.
At first I wanted to search and replace (:%s) all these characters, but that would be a very long process, because I would have to do it every time I handle a file, and I'm never sure whether new bad characters have crept in.
Is it possible to detect such characters, so that I can extract the lines containing ill-encoded characters?
Ideally there would be a simple command taking a file as its argument and creating a file containing only the problematic lines.
I don't know if I'm explaining myself very well...
Thanks in advance!
You could use :g/char/p in vim to print all the matching lines in a given file, or the grep utility:
grep -lr 'char1\|char2\|char3' .
This will output all the files in a directory containing any of the chars you have listed (the -r makes it recursive and the -l lists only the filenames, rather than all the matching lines).
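Since the offending characters here are arbitrary high bytes rather than a fixed list, a hedged alternative (assuming GNU grep and bash, whose $'...' quoting expands \x escapes into raw bytes) is to match any byte outside the ASCII range and collect the offending lines into a file:
LC_ALL=C grep -n $'[\x80-\xff]' file.csv > bad-lines.txt
The -n prefix adds line numbers; drop it if you want the raw lines, or invert with -v to get only the clean lines instead.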

Using the grep cmd to filter by first letter, @, and "."

I have a file (testdata.txt) with many email addresses and random text.
Using the grep command:
I want to make sure they are email addresses and not text, so I want to filter the lines so that only lines with "@" are included.
I also want to filter them so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name.
Eg. john.doe@gmail.com
However, johndoe@gmail.com would be included.
Lastly, I want to get the count of all the email addresses that follow these rules.
So far I've only been able to make sure they are email addresses by doing:
grep -c "@" testdata.txt
Using the grep cmd I also want to check how many email addresses have a government domain ("gov").
I wanted to check that a line has an @ sign and that it also contains gov. However, I don't get the answer I want with any of the following:
grep -c "@\|gov" testdata.txt (I get the number of lines that have @ or gov, not @ and gov)
grep -c "@/|gov" testdata.txt (I get 0)
grep -c "@|gov" testdata.txt (I get 0)
Going bottom-up through your questions.
You are using grep in its basic regular expressions (BRE) mode. In this mode \| means OR, | means the literal symbol |, and /| means the two symbols /|.
If you are looking for emails in the .gov domain, you are probably looking for a sequence starting with @, followed by symbols that are permitted in an Internet domain name, and ending with .gov, .GOV, or .Gov.
Borrowing from another post on this site, you would end up with something like
grep -c "@[A-Za-z0-9][A-Za-z0-9.-]*\.\(gov\|Gov\|GOV\)" testdata.txt
skipping another five possible spellings of the top-level domain, e.g. GoV.
However, I would use the -i (ignore case) switch to simplify the expression:
grep -ci "@[a-z0-9][a-z0-9.-]*\.gov" testdata.txt
Now, you were not very clear regarding the use of dots separating parts of the name:
I also want to filter them out so that only email addresses that start with the letter A-M or a-m are shown and have a period separating the first name and last name. Eg. john.doe@gmail.com However, johndoe@gmail.com would be included.
so I will not touch this part.
Finally, you can use a range expression to filter the addresses that start with the letters A-M. Note the leading \b (word boundary): without it, [a-m] could match in the middle of a name, e.g. the a in zack@....
grep -ci "\b[a-m][a-z0-9._%+-]*@[a-z0-9][a-z0-9.-]*\.gov" testdata.txt
Please note that this is not an implementation of the address specification in the Internet Message Format RFC (RFC 5322), but only an approximation used mainly for didactic purposes. Never leave such partially compliant implementations in production code.
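As a quick sanity check with a hypothetical testdata.txt (the contents are invented):
alice.jones@nasa.gov
zack@nist.gov
some random text
the last command reports 1, counting only the first line: zack starts with z, and the \b stops [a-m] from matching its embedded a.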

Retrieve substring with grep

I've got a question concerning grep.
I have some address data in an .asc file as plain text. The first 30 characters are for the name; if the name is shorter than 30 characters, whitespace pads it out to that length. At position 31 is a whitespace separating the name from the next field, the address. After the address there is also a whitespace and some other data. My plan is to retrieve the address, which starts at index 32 and continues through index 50. I mostly got either nothing or the data from the start of the line. I tried several commands such as
grep -iE '^.{30}' '.{8}$' myfile.asc
or
grep -o -P '^.{31,34}' myfile.asc
I can't search for a specific pattern, since every record is different except for the whitespace that separates the fields. Is it possible to retrieve my substring like that, without relying on other tools through a pipe? I would prefer grep, since performance is an issue.
Why don't you use cut instead of grep if you're dealing with fixed positions?
cut -c 32-50 myfile.asc
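If you do want to stay with grep, GNU grep's -P (PCRE) mode offers \K, which discards everything matched so far from the reported match, so the equivalent of the cut command above is:
grep -oP '^.{31}\K.{19}' myfile.asc
This skips the first 31 characters and prints the next 19 (positions 32 through 50). cut is still likely the faster and more portable choice here.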

Find required files by pattern and then change the pattern on Linux

I need to find all *.xml files on Linux that match a pattern. I need to print each file name to the screen and then change the pattern in the file that was just found.
For instance.
I can start the script with arguments for the keyword and for the value, e.g.
script.sh keyword "another word"
The script should find all files containing keyword and make the following changes in those files:
<keyword></keyword> should stay the same: <keyword></keyword>
<keyword>some word</keyword> should become: <keyword>some word, another word</keyword>
In other words, if the value in the keyword node was initially empty, I don't need to change it; if it contains some value, I need to extend it with the value I specify.
What is the best way to do this on Linux? Using find, grep, sed?
Performance is also important, since there are thousands of files.
Thank you.
It seems a combination of find, grep, and sed would do this, and they are pretty fast. Since you'll be doing plain text processing, there might not be a need for real XML processing, but if you could give an example or rephrase your question, I might be able to provide more help. A rough sketch follows.
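A minimal sketch of such a script under those assumptions (GNU sed for -i; plain-text matching, so it will misbehave if the keyword or the value contains sed metacharacters such as |, &, or backslashes, or if a node spans multiple lines):
#!/bin/sh
# script.sh -- hypothetical sketch, not a tested production script.
# Usage: script.sh keyword "another word"
kw=$1
val=$2
# List every *.xml file containing a non-empty <keyword>...</keyword> node,
# print its name, then append ", $val" to the existing value in place.
find . -name '*.xml' -exec grep -l "<$kw>[^<][^<]*</$kw>" {} + |
while IFS= read -r f; do
  echo "$f"
  sed -i "s|<$kw>\([^<][^<]*\)</$kw>|<$kw>\1, $val</$kw>|g" "$f"
done
Empty nodes never match the grep or the sed pattern ([^<][^<]* requires at least one character), so they are left untouched, as required.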
