Using standard Unix tools, how can I search a text file or command output for a word with maybe 1-2 letters transposed or missing?
For example, my input is:
function addtion(number, increment)
    return number+increment
end
function additoin(number, increment)
    return number+increment
end
I would like to search for addition, have it match addtion and additoin in my input, and report them. Because it's code, checking against a dictionary is out of the question.
Currently, cat file.txt | grep "addition" simply yields nothing.
You can play around with the agrep command. It can perform fuzzy, approximate matches.
The following command worked for me (the -2 allows up to two errors):
agrep -2 addition file
You can't do a fuzzy match with standard grep, but if there are specific misspellings you're interested in, you could construct a regular expression that matches those.
For example:
grep 'add[it]*on'
matches the first misspelling in your example (addtion) as well as the correct spelling, but that's probably not general enough for your purposes.
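If the only variants you care about are the ones from the question, a plain alternation keeps things explicit (a minimal sketch, reusing the file name from the agrep example above):
grep -E 'addition|addtion|additoin' file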
A better approach is likely going to be to use some sort of static analysis tool specific to the language the code is in. It might not give you the right spelling, but should be able to tell you where the function name and calls to the function use different spellings.
Try the spell command. Note: You might need a dictionary (usually aspell-en in your distro's repositories).
As the answer above says, you should definitely try agrep. In addition, there is a newer and much faster alternative, ugrep, for fuzzy search. Use -Z2 to allow up to 2 errors:
ugrep -Z2 addition file.txt
An insertion, deletion, or substitution is one error. A transposition (as in additoin) counts as two errors, i.e. two substitutions. Use option -i for case-insensitive search and -w to match whole words.
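For example, combining those flags to allow two errors while ignoring case and matching only whole words (a sketch built purely from the options described above):
ugrep -Z2 -i -w addition file.txt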
Try this in a Linux terminal:
grep -rnw "text" ./
I once found a built-in command that would take a prefix as an argument and return all words that could complete that word. So for example,
>> COMMAND cali
California
calibrate
calibration
........
Of course, it would list a lot more words, in alphanumerical order. It was really useful, and it optionally took a file other than the default to look in.
I'm not just trying to reproduce this behavior: there are obviously a million ways to use grep, sed, awk, perl, or INSERT TURING-COMPLETE LANGUAGE HERE to get this. I'm looking for the command itself.
Unfortunately, it's hard to google something when you don't remember the name. While it might not have been POSIX standard, it was definitely a very common Linux utility. Does anyone know what this was called?
Found it: it's called look, and it seems to have been around in Unix since V7. (The man page is dated 1993!)
It does a binary search on the file given as the optional second argument (defaulting to /usr/share/dict/words) and prints every line that begins with the given prefix.
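For example, the following should print every word in /usr/share/dict/words that starts with cali (output will look much like the grep listing in the next answer):
$ look cali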
Not really a builtin command, but there is /usr/share/dict/* and grep:
$ grep -i '^Cali' /usr/share/dict/words
Caliban
Calibanism
caliber
calibered
calibogus
calibrate
calibration
calibrator
calibre
Caliburn
...
I ran into this problem with grep and would like to know if it's a bug or not. The reproducible scenario is a file with the contents:
string
string-
and save it as 'file'. The goal is to use grep with --color=always to output 'string' while excluding 'string-'. Without --color, the following works as expected:
$ grep string file | grep -v string-
but using --color outputs both instances:
$ grep --color=always string file | grep -v string-
I experimented with several variations but it seems --color breaks the expected behavior. Is this a bug or am I misunderstanding something? My assumption is that passing --color should have no effect on the outcome.
@Jake Gould's answer provides a great analysis of what actually happens, but let me try to phrase it differently:
--color=always uses ANSI escape codes for coloring.
In other words: --color=always by design ALTERS its output, because it must add the requisite escape sequences to achieve coloring.
Never use --color=always unless you know the consumer of the output expects ANSI escape sequences - typically, human eyeballs looking at a terminal.
If you're not sure how the output will be processed, use --color=auto, which - I believe - causes grep to apply coloring only if its stdout is connected to a terminal.
In a given pipeline, it typically only makes sense to apply --color=auto (or --color=always) to a grep command that is the LAST command in the pipeline.
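For the pipeline from the question, one way to keep both the filtering and the coloring is to make the coloring grep the last command (a sketch; the exclusion runs first, then the final grep highlights on the terminal):
$ grep -v string- file | grep --color=auto string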
When you use --color, grep adds ANSI (I believe?) color codes. So your text, which looks like this:
string
string-
Will actually look like this in terms of pure, unprocessed ASCII text:
^[[01;31m^[[Kstring^[[m^[[K
^[[01;31m^[[Kstring^[[m^[[K-
There is some nice info provided in this question thread, including this great answer.
My assumption is that passing --color should have no effect on the outcome.
Nope. The purpose of grep, as with most Unix/Linux tools, is to provide a basic, simple service & do that well. And that service is to search a plain-text (key here) input file based on a pattern & return the output. The --color option is a small nod to the fact that we are humans & staring at screens with uncolored text all day can drive you nuts. Color coding makes work easier.
So color coding with ANSI is usually considered a final step in a process. It's not the job of grep to assume that if it comes across ANSI codes in its input it should ignore them. Perhaps a case could be made to add a --decolor option to grep, but I doubt that is a feature worth the effort.
grep is a base level plain-text parsing tool. Nothing more & nothing less.
I need to find all *.xml files on Linux that match a pattern. I need to print the file name to the screen and then change the pattern in each file that was found.
For instance, I can start the script with arguments for the keyword and for the value, i.e.
script.sh keyword "another word"
The script should find all files containing keyword and make the following changes in those files:
<keyword></keyword> should stay the same: <keyword></keyword>
<keyword>some word</keyword> should become <keyword>some word, another word</keyword>
In other words, if the value in the keyword node was initially empty, I don't need to change it; if it contains some value, I need to extend it with the value I specify.
What is the best way to do this on Linux? Using find, grep, sed?
Performance is also important since there are thousands of files.
Thank you.
It seems a combination of find, grep and sed would do this, and they are pretty fast since you'll be doing plain text processing, so there might not be a need for real XML processing. But if you could give an example or rephrase your question, I might be able to provide more help.
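A minimal sketch of that combination, assuming GNU sed (for -i), that the tags sit on a single line, and that the keyword and value contain no characters special to sed:
#!/bin/sh
# usage: script.sh keyword "another word"
keyword=$1
value=$2

# list files that contain the keyword tag, print each name, then edit in place;
# only non-empty <keyword>...</keyword> nodes get ", $value" appended
find . -name '*.xml' -exec grep -l "<$keyword>" {} + |
while IFS= read -r file; do
    echo "$file"
    sed -i "s|<$keyword>\([^<][^<]*\)</$keyword>|<$keyword>\1, $value</$keyword>|g" "$file"
done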
When searching code for strings, I constantly run into the problem that I get meaningless, context-less results. For example, if a function call is split across 3 lines, and I search for the name of a parameter, I get the parameter on a line by itself and not the name of the function.
For example, in a file containing
...
someFunctionCall ("test",
MY_CONSTANT,
(some *really) - long / expression);
grepping for MY_CONSTANT would return a line that looked like this:
MY_CONSTANT,
Likewise, in a comment block:
/////////////////////////////////////////
// FIXMESOON, do..while is the wrong choice here, because
// it makes the wrong thing happen
/////////////////////////////////////////
Grepping for FIXMESOON gives the very frustrating answer:
// FIXMESOON, do..while is the wrong choice here, because
When there are thousands of hits, single line results are a little meaningless. What I would like to do is have grep be aware of the start and stop points of source code lines, something as simple as having it consider ";" as the line separator would be a good start.
Bonus points if you can make it return the entire comment block if the hit is inside a comment.
I know you can't do this with grep alone. I am also aware of the option to have grep return a certain number of lines of context. Any suggestions on how to accomplish this under Linux? FYI my preferred languages are C and Perl.
I'm sure I could write something, but I know that somebody must have already done this.
Thanks!
You can use pcregrep with the -M option (multiline matching; pcregrep is grep with Perl-compatible regular expressions). Something like the following, which matches from the end of the previous statement through the semicolon that ends the statement containing the hit:
pcregrep -M '[^;]*thingtosearchfor[^;]*;'
Here's an example using awk.
$ cat file
blah1
blah2
function1 ("test",
MY_CONSTANT,
(some *really) - long / expression);
function2( one , two )
blah3
blah4
$ awk -vRS=")" '/function1/{gsub(".*function1","function1");print $0RT}' file
function1 ("test",
MY_CONSTANT,
(some *really)
The concept behind it: RS is the record separator. By setting it to ")", every record in your file is separated by ")" instead of by a newline. This makes it easy to find your "function1", since you can then "grep" for it within a record. print $0RT prints the record plus the ")" that terminated it (note that RT is a gawk extension). If you don't use awk, the same concept can be applied by "splitting" on ")".
You could write a command line using grep with the options that give you the line number and the filename, then pipe those results into awk to parse the columns, and then use a little script of your own to display the N lines surrounding each hit? :)
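A rough sketch of that idea (purely illustrative; the pattern, the *.c glob, and the 2-line context width are all placeholders, and it assumes file names without spaces):
grep -Hn 'MY_CONSTANT' *.c |
awk -F: '{ start = ($2 > 2) ? $2 - 2 : 1
           printf "echo ==== %s:%d ====; sed -n \047%d,%dp\047 %s\n", $1, $2, start, $2 + 2, $1 }' |
sh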
If this isn't an academic endeavour, you could just use cscope (for C code only, though). If you are willing to drop the requirement to search in comments, ctags should be enough (and it also supports Perl).
I had a situation in which I had an xml file full of the names of zip files in an xml-style format, that is, with angle-bracketed tags around the names of the files, say <stuff>example.zip</stuff>.
I used awk to change all the angle brackets into newlines and then used grep :)
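A minimal sketch of that trick (the file name and tag are hypothetical, and it assumes the .zip names contain no angle brackets themselves):
awk '{ gsub(/[<>]/, "\n"); print }' files.xml | grep '\.zip'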
I need to read through some gigantic log files on a Linux system. There's a lot of clutter in the logs. At the moment I'm doing something like this:
cat logfile.txt | grep -v "IgnoreThis\|IgnoreThat" | less
But it's cumbersome -- every time I want to add another filter, I need to quit less and edit the command line. Some of the filters are relatively complicated and may be multi-line.
I'd like some way to apply filters as I am reading through the log, and a way to save these filters somewhere.
Is there a tool that can do this for me? I can't install new software so hopefully it's something that would already be installed -- e.g., less, vi, something in a Python or Perl lib, etc.
Changing the code that generates the log to generate less is not an option.
Use the &pattern command within less.
From the man page for less:
&pattern
Display only lines which match the pattern; lines which do not
match the pattern are not displayed. If pattern is empty (if
you type & immediately followed by ENTER), any filtering is
turned off, and all lines are displayed. While filtering is in
effect, an ampersand is displayed at the beginning of the
prompt, as a reminder that some lines in the file may be hidden.
Certain characters are special as in the / command:
^N or !
Display only lines which do NOT match the pattern.
^R Don't interpret regular expression metacharacters; that
is, do a simple textual comparison.
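For the log in the question, that means you can skip the grep pipeline entirely (assuming your less supports the & command and was built with a regex library that understands alternation):
less logfile.txt
Then, inside less, type &!IgnoreThis|IgnoreThat and press Enter to hide lines matching either word; typing & followed by Enter alone turns the filtering back off.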
Try the multitail tool - as well as letting you view multiple logs at once, I'm pretty sure it lets you apply regex filters interactively.
Based on ghostdog74's answer and the less manpage, I came up with this:
~/.bashrc:
export LESSOPEN='|~/less-filter.sh %s'
export LESS=-R # to allow ANSI colors
~/less-filter.sh:
#!/bin/sh
case "$1" in
*logfile*.log*) sed -f ~/less-filter.sed "$1"
;;
esac
~/less-filter.sed:
# to filter out lines
/deleteLinesLikeThis/d
# to change text on lines (useful to colorize using ANSI escapes)
s/this/that/
Then:
less logfileFooBar.log.1 -- the filter is applied automatically.
cat logfileFooBar.log.1 | less -- to see the log without filtering
This is adequate for now but I would still like to be able to edit the filters on the fly.
See the man page of less. There are some options you can use to search for words, for example. It has a line-editing mode as well.
There's an application by Casstor Software Solutions called LogFilter (www.casstor.com) that can edit Windows/Mac/Linux text files and can easily perform file filtering. It supports multiple filters as well as regular expressions. I think it might be what you're looking for.