How do I grep for entire, possibly wrapped, lines of code? - linux

When searching code for strings, I constantly run into the problem that I get meaningless, context-less results. For example, if a function call is split across 3 lines, and I search for the name of a parameter, I get the parameter on a line by itself and not the name of the function.
For example, in a file containing
...
someFunctionCall ("test",
MY_CONSTANT,
(some *really) - long / expression);
grepping for MY_CONSTANT would return a line that looked like this:
MY_CONSTANT,
Likewise, in a comment block:
/////////////////////////////////////////
// FIXMESOON, do..while is the wrong choice here, because
// it makes the wrong thing happen
/////////////////////////////////////////
Grepping for FIXMESOON gives the very frustrating answer:
// FIXMESOON, do..while is the wrong choice here, because
When there are thousands of hits, single line results are a little meaningless. What I would like to do is have grep be aware of the start and stop points of source code lines, something as simple as having it consider ";" as the line separator would be a good start.
Bonus points if you can make it return the entire comment block if the hit is inside a comment.
I know you can't do this with grep alone. I am also aware of the option to have grep return a certain number of lines of context. Any suggestions on how to accomplish this under Linux? FYI, my preferred languages are C and Perl.
I'm sure I could write something, but I know that somebody must have already done this.
Thanks!

You can use pcregrep with the -M option (multiline matching; pcregrep is grep with Perl-compatible regular expressions). Something like:
pcregrep -M ";*\R*.*thingtosearchfor*\R*.*;.*"

Here's an example using awk.
$ cat file
blah1
blah2
function1 ("test",
MY_CONSTANT,
(some *really) - long / expression);
function2( one , two )
blah3
blah4
$ awk -vRS=")" '/function1/{gsub(".*function1","function1");print $0RT}' file
function1 ("test",
MY_CONSTANT,
(some *really)
The concept behind this: RS is the record separator. Setting it to ")" makes every record in your file end at ")" instead of at a newline. This makes it easy to find your "function1", since you can then "grep" for it within a whole record. Note that RT, which holds the text that matched RS, is a GNU awk extension. If you don't use awk, the same concept can be applied by splitting the input on ")".
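The same record-separator idea can be applied to the ";" separator the question asks about. The following is only a minimal sketch: the function name grep_statement is made up for illustration, it treats every ";" as a statement end (including ones inside strings or comments), and it searches for a fixed string rather than a regular expression.

```shell
# Hypothetical helper: print every ';'-terminated statement containing a fixed string.
# Usage: grep_statement TERM FILE...
grep_statement() {
    term=$1
    shift
    awk -v RS=';' -v term="$term" '
        index($0, term) {
            sub(/^[ \t\n]+/, "")   # drop whitespace left over from the previous statement
            print $0 ";"           # RS is consumed by awk, so put the semicolon back
        }
    ' "$@"
}
```

For example, `grep_statement MY_CONSTANT file.c` prints the whole call that mentions MY_CONSTANT, wrapped lines included, rather than the single line it sits on.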

You could run grep with the options that print the filename and the line number (-H and -n), pipe those results through xargs into awk to parse the columns, and then use a small script of your own to display the N lines surrounding each hit. :)

If this isn't an academic endeavour you could just use cscope (for C code only though). If you are willing to drop the requirement to search in comments ctags should be enough (and it also supports Perl).

I had a situation in which an XML file was full of zip file names in XML-style markup, that is, with angle brackets around the names of the files, say example.zip<\stuff>
I used awk to change all the angle brackets into newlines and then used grep. :)

Related

concatenate two strings and one variable using bash

I need to generate filename from three parts, two strings, and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I got the below files:
_1.fastq.gze_11
_1.fastq.gze_12
the string after the variable deletes the string before it.
I appreciate any help
Regards
By the way, your idiom for f in `cat files.csv` should be avoided. Refer to: Dangerous Backticks
while read f
do
echo "fastq/${f}_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs says to run this command on as many files as it can fit onto the command line (splitting it up into multiple invocations if the input file is too large to fit all the arguments onto a single command line, subject to the ARG_MAX constant in your kernel).
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo "fastq/${f}_1.fastq.gze"
See this answer for some details about the general concept, as well.
Edit: An additional thought looking at the now-provided output makes me think that this isn't a coding problem at all, but rather a conflict between line-endings and the terminal/console program.
Specifically, if the lines of the CSV file end with a carriage return (ASCII/Unicode 13), as in Windows-style CRLF files, the end of Sample_11 might "rewind" the cursor to the start of the line, so that the text that follows overwrites what was already printed.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < files.csv)
do
echo fastq/${f}_1.fastq.gze
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
Of course, at that point, better is to find whatever produces the file and ask them to put reasonable line-endings into their file...
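If the file cannot be fixed at its source, the while-read loop and the carriage-return stripping can be combined. This is a sketch under the assumption that the input uses CRLF line endings; the function name make_names is made up for illustration.

```shell
# Hypothetical helper: print fastq/<name>_1.fastq.gze for each line of the given file,
# tolerating Windows-style CRLF line endings.
make_names() {
    cr=$(printf '\r')
    # The "|| [ -n ... ]" part also handles a final line without a trailing newline.
    while IFS= read -r f || [ -n "$f" ]; do
        f=${f%"$cr"}                 # strip one trailing carriage return, if present
        if [ -n "$f" ]; then
            printf 'fastq/%s_1.fastq.gze\n' "$f"
        fi
    done < "$1"
}
```

Calling `make_names files.csv` should then print the two expected names even when the file was saved on Windows.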

How to grep text for small mistakes

Using standard Unix tools how can I search in a text file or output for a word with maybe 1-2 letters transposed or missed?
For example my input
function addtion(number, increment)
return number+increment
end
function additoin(number, increment)
return number+increment
end
I would like to search for addition and match addtion and additoin in my input and tell me about it. Because it's code, checking against dictionary is out of the question.
Currently cat file.txt | grep "addition" will simply yield me nothing.
You can play around with the agrep command. It can perform fuzzy, approximate matches.
The following command worked for me:
agrep -2 addition file
You can't do a fuzzy match with standard grep, but if there are specific misspellings you're interested in, you could construct a regular expression that matches those.
For example:
grep 'add[iot]*n' file.txt
matches both of the example misspellings you gave (as well as the correct spelling). But that's probably not general enough for your purposes.
A better approach is likely going to be to use some sort of static analysis tool specific to the language the code is in. It might not give you the right spelling, but should be able to tell you where the function name and calls to the function use different spellings.
Try the spell command. Note: You might need a dictionary (usually aspell-en in your distro's repositories).
As the answer says, you should definitely try agrep. In addition, there is a newer and much faster alternative ugrep for fuzzy search. Use -Z2 to allow up to 2 errors:
ugrep -Z2 addition file.txt
An insertion, deletion, or substitution is one error. A transposition (as in additoin) counts as two errors, i.e. two substitutions. Use option -i for case-insensitive search and -w to match whole words.
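If neither agrep nor ugrep is available, the same error-counting idea can be sketched with plain awk. This is an illustration only, not how either tool is implemented: the function name fuzzy_words is made up, it computes the classic Levenshtein distance word by word (so a transposition costs two edits, as described above), and it will be slow on large inputs.

```shell
# Hypothetical helper: report words in FILE within MAX edits of WORD.
# Usage: fuzzy_words WORD MAX FILE
fuzzy_words() {
    awk -v target="$1" -v max="$2" '
        function min3(a, b, c) { return a < b ? (a < c ? a : c) : (b < c ? b : c) }
        # Levenshtein distance between s and t (insertions, deletions, substitutions).
        function lev(s, t,    i, j, m, n, d, cost) {
            m = length(s); n = length(t)
            for (i = 0; i <= m; i++) d[i, 0] = i
            for (j = 0; j <= n; j++) d[0, j] = j
            for (i = 1; i <= m; i++)
                for (j = 1; j <= n; j++) {
                    cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
                    d[i, j] = min3(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + cost)
                }
            return d[m, n]
        }
        {
            # Split the line into identifier-like words and test each one.
            count = split($0, w, /[^A-Za-z0-9_]+/)
            for (k = 1; k <= count; k++)
                if (w[k] != "" && w[k] != target && lev(w[k], target) <= max)
                    print FILENAME ":" FNR ": " w[k]
        }
    ' "$3"
}
```

With the input from the question, `fuzzy_words addition 2 file.txt` reports addtion and additoin with their line numbers and skips exact matches.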
Try this on linux terminal:
grep -rnw "text" ./

how to use do loop to read several files with similar names in shell script

I have several files named scale1.dat, scale2.dat scale3.dat ... up to scale9.dat.
I want to read these files in do loop one by one and with each file I want to do some manipulation (I want to write the 1st column of each scale*.dat file to scale*.txt).
So my question is, is there a way to read files with similar names. Thanks.
The regular syntax for this is
for file in scale*.dat; do
awk '{print $1}' "$file" >"${file%.dat}.txt"
done
The asterisk * matches any text or no text; if you want to constrain to just single non-zero digits, you could say for file in scale[1-9].dat instead.
In Bash, there is a non-standard brace expansion syntax, scale{1..9}.dat, but this is Bash-only, and so will not work in #!/bin/sh scripts. (Your question has both sh and bash so it's not clear which you require. Your comment that the Bash syntax is not working for you suggests that you may need a POSIX portable solution.) Furthermore, Bash has something called extended globbing, which allows for quite elaborate pattern matching. See also http://mywiki.wooledge.org/glob
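A POSIX-portable version of the loop might look like the sketch below. The function name first_columns is made up for illustration, and the existence check is an addition: when nothing matches, the shell leaves the pattern scale[1-9].dat unexpanded, so the loop would otherwise run once on a file that does not exist.

```shell
# Hypothetical helper: write column 1 of scale1.dat .. scale9.dat to scale1.txt .. scale9.txt.
first_columns() {
    for file in scale[1-9].dat; do
        [ -e "$file" ] || continue          # skip the literal pattern when nothing matched
        awk '{ print $1 }' "$file" > "${file%.dat}.txt"
    done
}
```

Because the bracket expression matches exactly one digit, a file such as scale10.dat is left alone.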
For a simple task like this, you don't really need the shell at all, though.
awk 'FNR==1 { if (f) close (f); f=FILENAME; sub(/\.dat/, ".txt", f); }
{ print $1 >f }' scale[1-9]*.dat
(Okay, maybe that's slightly intimidating for a first-timer. But the basic point is that you will often find that the commands you want to use will happily work on multiple files, and so you don't need shell loops at all in those cases.)
I don't think so. Similar names or not, you will have to iterate through all your files (perhaps with a for loop) and use a nested loop to iterate through lines or words or whatever you plan to read from those files.
Alternatively, you can copy your files into one (say, scale-all.dat) and read that single file.

Using sed to print range when pattern is inside the range?

I have a log file full of queries, and I only want to see the queries that have an error. The log entries look something like:
path to file executing query
QUERY
SIZE: ...
ROWS: ...
MSG: ...
DURATION: ...
I want to print all of this stuff, but only when MSG: contains something of interest (an error message). All I've got right now is the sed -n '/^path to file/,/^DURATION/' and I have no idea where to go from here.
Note: Queries are often multiline, so using grep's -B sadly doesn't work all the time (this is what I've been doing thus far, just being generous with the -B value)
Somehow I'd like to use only sed, but if I absolutely must use something else like awk I guess that's fine.
Thanks!
You haven't said what an error message looks like, so I'll assume it contains the word "ERROR":
sed -n '/^MSG.*ERROR/{H;g;N;p;};/^DURATION/{s/.*//;h;d;};H' < logname
(I wish there were a tidier way to purge the hold space. Anyone?...)
I could suggest a solution with grep. That will work if the structure in the log file is always the same as above (i.e. MSG is in the 5th line, and one line follows):
grep -i -A 1 -B 4 '^MSG:.*error' logfile
That means: If the word error occurs in a MSG line then output the block beginning from 4 lines before MSG till one line after it.
Of course you have to adjust the regexp to recognize an error.
This will not work if the structure of those blocks differs.
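When the blocks do differ in length (multi-line queries, for instance), one option is to buffer each entry and decide whether to print it once the DURATION line arrives. The following awk sketch makes several assumptions: every entry starts with the literal text "path to file" used as a placeholder in the question, every entry ends with a DURATION line, and an error is a MSG: line containing "ERROR". The function name error_entries is made up.

```shell
# Hypothetical helper: print only the log entries whose MSG: line contains ERROR.
# Usage: error_entries LOGFILE...
error_entries() {
    awk '
        /^path to file/ { buf = ""; bad = 0 }      # a new entry starts: reset the buffer
                        { buf = buf $0 "\n" }      # collect every line of the entry
        /^MSG:.*ERROR/  { bad = 1 }                # remember that this entry failed
        /^DURATION/     { if (bad) printf "%s", buf; buf = ""; bad = 0 }
    ' "$@"
}
```

Adjust the three patterns to whatever the real log uses; the buffering logic stays the same.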
Perhaps you can use the cgrep.sed script, as described in the book Unix Power Tools.

bash - remove improper words

I have a file with a bunch of words, many of which don't make much sense, such as 'completemakes' or even numbers mixed with letters/words. What I need is a tool to spell check them: if a word exists in the dictionary, leave it; if not, delete it.
What would be a good way of doing this in bash?
Thanks
You can script Aspell.
I had some fun with getting a single quote character in here, but hey, it should be as hard to read as it was to write, right? (assuming your words are listed in words.txt)
awk 'system("grep -i -q " "'"'"'^"$0"$'"'"'" " /usr/share/dict/words") == 0 {print $0};' words.txt
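A possibly simpler variant of the same idea lets grep do the whole comparison by treating the dictionary as a list of fixed-string patterns. This is a sketch that assumes one word per line in both files; the function name keep_dictionary_words is made up, and /usr/share/dict/words is only the default location on many distributions.

```shell
# Hypothetical helper: keep only the lines of WORDFILE that appear in the dictionary.
# Usage: keep_dictionary_words WORDFILE [DICTIONARY]
keep_dictionary_words() {
    dict=${2:-/usr/share/dict/words}
    # -F: patterns are fixed strings, -x: match whole lines only,
    # -i: ignore case, -f: read the patterns from the dictionary file
    grep -i -x -F -f "$dict" "$1"
}
```

Unlike the awk version, this starts grep once instead of once per word, which should matter for long word lists.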
