Bash: recursive similarities between directories' content - Linux

I am looking for a bash command/script that will do the following:
Given two directory structures with different layouts and file names,
to find all lines in one structure that are the same as a line in some file in the other structure.
E.g. line 56, "int archiveHex = 0x..", in file1.cpp is the same as line 89 of fileArchive.cpp. The line numbers are not required at that stage; the line content is good enough.
The long story: I have two projects, both quite big, and I want to see whether anyone has used GPL code from one of them in their commercial product. The file names and directory structure have been changed, but I see similarities and I am sure something was copied.
I found this two related questions:
How to compare two text files for the same exact text using BASH?
It uses grep, but you have to pass the two files explicitly, so it cannot work recursively.
I also found
https://unix.stackexchange.com/questions/1079/output-the-common-lines-similarities-of-two-text-files-the-opposite-of-diff as a way to use diff for similarities rather than differences.
And for the recursive part I found this question:
https://askubuntu.com/questions/111495/how-to-diff-multiple-files-across-directories
But I don't know how to combine all of them. How would you do this?

This can be done with a bit of shell and Awk script. Read all the lines of the first directory's files into an array, then for each input line, see if it is a defined key in the array. (I'm filtering out whitespace lines to reduce false positives. Maybe add empty comments to the filter, too.) The array key is the contents of the line and the array key's value is a string which identifies the source file name and line number. We conveniently receive these as colon-separated values from grep -nr:
grep -nrv '^[[:space:]]*$' "$srcdir" |
awk -F : 'NR==FNR { a[substr($0, length($1 ":" $2 ":")+1)] = $1 ":" $2; next }
$0 in a { print FILENAME ":" FNR " matches " a[$0] ":" $0; result=1}
END { exit 1-result }' - $(find "$otherdir" -type f)
The Awk script is fundamentally very simple; NR==FNR is a common idiom which matches the first input file (here, standard input, the pipe from grep) which is where we obtain the values for the array a; for subsequent input files, we trigger if the input line is a key in the array. The associative array type of Awk is ideal here.
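For instance, here is a minimal, self-contained illustration of the idiom (file names hypothetical):
printf 'apple\nbanana\n' > a.txt
printf 'banana\ncherry\n' > b.txt
awk 'NR==FNR { seen[$0]=1; next } $0 in seen' a.txt b.txt
The last command prints banana, the only line the two files share: while reading a.txt (where NR==FNR holds) it only records lines, and while reading b.txt it prints each line that is a key in seen.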
This assumes that you have no file names with colons or newlines in them. It also assumes that the find output is small enough to not trigger an "Argument list too long" error, though if it does, that will be somewhat easier to fix.
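If you do hit "Argument list too long", one sketch of a fix (untested; the temp-file path is hypothetical) is to save the grep output to a file, so that find -exec can batch the awk invocations and each invocation re-reads that file first:
grep -nrv '^[[:space:]]*$' "$srcdir" > /tmp/srclines.txt
find "$otherdir" -type f -exec awk -F : '
  NR==FNR { a[substr($0, length($1 ":" $2 ":")+1)] = $1 ":" $2; next }
  $0 in a { print FILENAME ":" FNR " matches " a[$0] ":" $0 }
' /tmp/srclines.txt {} +
(The exit-status logic from the END block is dropped here for brevity.)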


Remove all but one line if they have common text at the end

This is very specific and I couldn't find this case in the other answers.
This is the sample text:
exampleA/file_A.a
exampleB/file_A.a
exampleB/another_dir/file_B.a
exampleB/file_A.a
exampleA/file_C.a
exampleB/file_D.b
exampleB/file_C.a
exampleB/file_B.a
exampleA/another_dir/file_D.b
exampleA/another_dir/file_C.a
exampleB/another_dir/another_one/file_D.b
I want to delete the lines of the duplicated files with a specific extension (.a) that appear in this list of files (a text file), EXCEPT one, so that the text contains only one line per file. There can be more files than file_A.a, file_B.a and file_C.a, so I can't "hardcode" those.
How do I find the lines that contain the same file name at the end? The extension of the files I want to delete is always .a, the names can contain underscores anywhere, and I don't want to delete the files with the extension .b, because they are intended to be the same and are always in different folders. I managed to do this:
grep -o '/[a-zA-Z0-9_]*\.[a]' file.txt | sort | uniq -d
But I don't have the line numbers.
And then, how do I delete those lines BUT one? I have seen the following line in another question:
awk '!seen[$0]++' file.txt
But I can't figure out how to combine these to get the output that I need, which should look like this:
exampleA/file_A.a
exampleB/another_dir/file_B.a
exampleA/file_C.a
exampleB/file_D.b
exampleA/another_dir/file_D.b
exampleB/another_dir/another_one/file_D.b
Edit: I forgot to mention that there are other files in the text with another extension (say .b) and I don't want to touch those. I just want to delete the ones with a specific extension (.a), and maybe another one (.d) if they appear, but that is not strictly necessary. I have edited the sample.
You may use this awk:
awk -F/ 'NF && (!/\.a$/ || !seen[$NF]++)' file
exampleA/file_A.a
exampleB/another_dir/file_B.a
exampleA/file_C.a
exampleB/file_D.b
exampleA/another_dir/file_D.b
exampleB/another_dir/another_one/file_D.b
Here:
-F/ sets / as the input field separator.
NF selects all non-empty lines.
!/\.a$/ || !seen[$NF]++ prints a line if it doesn't end with .a or if its last field (the file name) is seen for the first time.
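Note the difference from keying on the whole line: using $NF treats exampleA/file_C.a and exampleB/file_C.a as duplicates, which the !seen[$0]++ from the question would not. Without the extension test, though, the .b files would be deduplicated as well:
awk -F/ '!seen[$NF]++' file   # would also drop the repeated .b names, hence the extra test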
awk -F/ '$NF!~/\.(a|d)$/|| !seen[$NF]++' file.txt
exampleA/file_A.a
exampleB/another_dir/file_B.a
exampleA/file_C.a
exampleB/file_D.b
exampleA/another_dir/file_D.b
exampleB/another_dir/another_one/file_D.b
should do what you want. Note that your question mentions fileC.a while your sample uses file_C.a.
What we do is tell awk to use / as the field separator and use only the file-name portion, the last field $NF, as the array index.
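If you want to update the text file itself rather than just print the result, a small sketch (the in-place flag assumes GNU awk 4.1 or later):
awk -F/ '$NF!~/\.(a|d)$/ || !seen[$NF]++' file.txt > file.tmp && mv file.tmp file.txt
gawk -i inplace -F/ '$NF!~/\.(a|d)$/ || !seen[$NF]++' file.txt
The first form is portable; the second rewrites file.txt directly.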

Using sed to obtain pattern range through multiple files in a directory

I was wondering if it is possible to use the sed command to find a range between two patterns (in this case, dates) and output the lines in that range to a new file.
Right now, I am just looking at one file and getting lines within my time range of the file FileMoverTransfer.log. However, after a certain time period, these logs are moved to new log files with a suffix such as FileMoverTransfer.log-20180404-xxxxxx.gz. Here is my current code:
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log >> /public/FileMoverRoot/logs/intervalFMT.log
However, the following doesn't work, as sed isn't able to look through all of the files in the directory starting with FileMoverTransfer.log:
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log* >> /public/FileMoverRoot/logs/intervalFMT.log
Any help would be greatly appreciated. Thanks!
The range operator only operates within a single file, so you can't use it if the start is in one file and the end is in another file.
You can use cat to concatenate all the files, and pipe this to sed:
cat FileMoverTransfer.log* | sed -n "/^$start_date/,/^$end_date/p;/^$end_date/q" >> /public/FileMoverRoot/logs/intervalFMT.log
And instead of quoting and unquoting the sed command, you can use double quotes so that the variables will be expanded inside it. This will also prevent problems if the variables contain whitespace.
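To see why the quoting matters, compare what the shell hands over in each case (illustration only; the date value is hypothetical):
start_date='2018-04-04'
echo "/^$start_date/"   # double quotes: prints /^2018-04-04/ (variable expanded)
echo '/^$start_date/'   # single quotes: prints /^$start_date/ (taken literally)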
awk solution
As the OP confirmed that an awk solution would be acceptable, I post it.
(gunzip -c FileMoverTransfer.log-*.gz; cat FileMoverTransfer.log ) \
|awk -v st="$start_date" -v en="$end_date" '$1>=st&&$1<=en{print;next}$1>en{exit}'\
>/public/FileMoverRoot/logs/intervalFMT.log
This solution is functionally almost identical to Barmar’s sed solution, with the difference that his solution, like the OP’s, will print and quit at the first record matching the end date, while mine will print all lines matching the end date and quit at the first record past the end date, without printing it.
Some remarks:
The OP didn't specify the date format. I suppose it is a format compatible with ordinary string order; otherwise some conversion function should be used (see the sketch after these remarks).
The files FileMoverTransfer.log-*.gz must be named in such a way that their alphabetical ordering corresponds to the chronological order (which is probably the case.)
I suppose that the dates are separated from the rest of the line by whitespace. If they aren’t, you have to supply the -F option to awk. E.g., if the dates are separated by -, you must write awk -F- ...
awk is much faster than sed in this case, because awk simply looks for the separator (whitespace or whatever was supplied with -F) while sed performs a regexp match.
There is no concept of range in my code, only date comparison. The only place where I suppose that the lines are ordered is when I say $1>en{exit}, that is exit when a line is newer than the end date. If you remove that final pattern and its action, the code will run through the whole input, but you could drop the requirement that the files be ordered.
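As a sketch for the first remark: suppose the dates looked like 04/Apr/2018 instead of being string-comparable. One hypothetical conversion (assuming GNU awk, with $start_date and $end_date supplied as YYYYMMDD) would be:
gawk -v st="$start_date" -v en="$end_date" '
BEGIN {
    # map month names to numbers, e.g. mon["Apr"] = "04"
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
    for (i in m) mon[m[i]] = sprintf("%02d", i)
}
{
    split($1, d, "/")            # d[1]=day, d[2]=month name, d[3]=year
    key = d[3] mon[d[2]] d[1]    # e.g. 20180404, comparable as a string
    if (key >= st && key <= en) print
    else if (key > en) exit
}' FileMoverTransfer.log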

Prefix search names to output in bash

I have a simple egrep command searching for multiple strings in a text file, which outputs either null or a value. Below are the command and the output.
cat Output.txt|egrep -i "abc|def|efg"|cut -d ':' -f 2
Output is:
xxx
(null)
yyy
Now, I am trying to prefix my search texts to the output, like below:
abc:xxx
def:
efg:yyy
Any help on the code to achieve this, or where to start, would be appreciated.
Since I do not know exactly your input file content (it is not specified properly in the question), I will make some hypotheses in order to answer your question.
Case 1: the patterns you are looking for are always located in the same column
If it is the case, the answer is quite straightforward:
$ cat grep_file.in
abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey
$ egrep -i "abc|def|efg" grep_file.in | cut -d':' -f1,2
abc:xxx
def:
efg:yyy
After the grep, just use cut with the two columns that you are looking for (here, columns 1 and 2).
REMARK:
Do not cat the file and then pipe it to grep: that does the work twice! Your grep command will already read the file, so do not read it twice; it might not matter much on small files, but you will feel the difference on 10 GB files, for example.
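Applied to the command from the question, that means preferring
egrep -i "abc|def|efg" Output.txt | cut -d ':' -f 2
over the original cat Output.txt | egrep -i "abc|def|efg" | cut -d ':' -f 2.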
Case 2: the patterns you are looking for are NOT located in the same column
In this case it is a bit more tricky, but not impossible. There are many ways of doing it; here I will detail the awk way:
$ cat grep_file2.in
abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey
If your input file is in this format, with the pattern possibly located anywhere on the line:
$ awk 'BEGIN{FS=":";ORS=FS}{tmp=0;for(i=1;i<=NF;i++){tmp=match($i,/abc|def|efg/);if(tmp){print $i;break}}if(tmp){printf "%s\n", $2}}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
FS=":";ORS=FS define your input/output field separator at : Then on each line you define a test variable that will become true when you reach your pattern, you loop on all the fields of the line until you reach it if it is the case you print it, break the loop and print the second field + an EOL char.
If you do not meet your pattern you do nothing.
If you prefer the sed way, you can use the following command:
$ sed -n '/abc\|def\|efg/{h;s/.*\(abc\|def\|efg\).*/\1:/;x;s/^[^:]*:\([^:]*\):.*/\1/;H;x;s/\n//p}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
/abc\|def\|efg/{} is used to filter the lines that contain one of the patterns provided; the instructions in the block are then executed. h;s/.*\(abc\|def\|efg\).*/\1:/ saves the line in the hold space and replaces the line with one of the 3 patterns followed by :. x;s/^[^:]*:\([^:]*\):.*/\1/ exchanges the pattern and hold space and extracts the 2nd-column element. Last but not least, H;x;s/\n//p regroups both extracted elements on one line and prints it.
Try this:
$ egrep -io "(abc|def|efg):[^:]*" file
This will print the match and the next token after the delimiter.
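For instance, on the Case 1 sample file above, this should give:
$ egrep -io "(abc|def|efg):[^:]*" grep_file.in
abc:xxx
def:
efg:yyy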
If we can assume that there are only two fields, that abc etc. will always match in the first field, and that getting the last match on a line which contains multiple matches is acceptable, a very simple sed script can work:
sed -n 's/^[^:]*\(abc\|def\|efg\)[^:]*:\([^:]*\)/\1:\2/p' file
If other but similar conditions apply (e.g. there are three fields or more but we don't care about matches in the first two) the required modifications are trivial. If not, you really need to clarify your question.

Copy a section within two keywords into a target file

I have thousands of files in a directory, and each file contains a number of variable definitions starting with the keyword DEFINE and ending with a semicolon (;). I want to copy all occurrences of the data between these keywords (inclusive) into a target file.
Example: Below is the content of the text file:
/* This code is for lookup */
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
END.
Now from the above content I just want to copy the section starting with DEFINE and ending with ; into a target file, i.e. the output should be:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
This needs to be done for thousands of scripts with multiple occurrences. Please help out.
Thanks a lot, the provided code works, but only to a limited extent: when the whole statement is on a single line. The data is not guaranteed to be on one single line; it can be spread over multiple lines like below:
/* This code is for lookup */
DEFINE variable as a1 expr= if branchno > 55
then
extract (n123f1 using brach, code)
else
branchno = null
;
END.
The code follows the above pattern: I need to capture all the data between DEFINE and the semicolon (;). After every DEFINE there will be a terminating semicolon; this is the pattern.
It sounds like you want grep(1):
grep '^DEFINE.*;$' input > output
Try using grep. Let's say you have files with the extension .txt in the present directory:
grep -ho 'DEFINE.*;' *.txt > outfile
Output:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
Short Description
-o will give you only the matching string rather than the whole line, in case the line also contains something else that you want to omit.
-h will suppress file names before the matching results.
Read the man page of grep by typing man grep in your terminal.
EDIT
If you need the ability to search across multiple lines, you can use pcregrep with the -M option:
pcregrep -M 'DEFINE.*?(\n|.)*?;' *.txt > outfile
Works fine on my system. Check man pcregrep for more details
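For instance, on the multi-line sample from the question (assuming it is saved in a .txt file in the current directory), the match should come out as:
DEFINE variable as a1 expr= if branchno > 55
then
extract (n123f1 using brach, code)
else
branchno = null
;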
Reference : SO Question
One can make a simple solution using sed:
sed -n -e '/^DEFINE/{:a p;/;$/!{n;ba}}' your-file
Option -n prevents sed from printing every line; then, each time a line begins with DEFINE, print the line (command p) and enter a loop: until you find a line ending with ;, grab the next line and loop back to the print command. When exiting the loop, do nothing.
It looks a bit dirty, and it seems that some versions of sed have a shorter (and more straightforward) way to achieve this in one line:
sed -n -e '/^DEFINE/,/;$/p' your-file
Indeed, only for those versions of sed are both patterns tested against the same line, so a range can begin and end on a single line; for other versions of sed, like mine under Cygwin, the start and end patterns must match on separate lines to work properly.
One last thing to remember: it does not treat overlapping patterned ranges, i.e. it stops printing at the first encountered end pattern even if multiple start patterns have been matched. Prefer something with awk if that is a feature you are looking for.
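A hedged awk sketch of that approach, which prints every DEFINE ... ; block, however many occur per file (assuming, as in the samples, that DEFINE starts at the beginning of a line):
awk '/^DEFINE/ { flag=1 }                 # start copying at a DEFINE line
     flag                                  # print while the flag is set
     flag && /;[[:space:]]*$/ { flag=0 }   # stop after the line ending with ;
' *.txt > outfile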

Expanding several filenames into one directory in bash

I want to run awk on several files. I have the filenames and a path to the files, but I can't seem to connect the two. Here's what I have tried:
files=(a b c)
directory=/my/dir
awk $my_script "$directory/${files[@]}"
It awks the first file and leaves the rest alone. I'd rather not add the full path to my array (the values are used in several places). I think I want brace expansion, but it doesn't seem to work with arrays. What else could I do?
Using pattern substitution (# means something like ^ in regexps): ${files[@]/#/$directory/}
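Applied to the question's own names, that gives:
files=(a b c)
directory=/my/dir
awk $my_script "${files[@]/#/$directory/}"
The last line expands to awk $my_script /my/dir/a /my/dir/b /my/dir/c.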
for i in /my/dir/[abc]; do
awk $my_script "$i"
done
Or, if you want to actually just pass all of the file names to awk at once:
awk $my_script /my/dir/[abc]
If the file names are not actually single letters:
awk $my_script /my/dir/{file1,file2,file3,...}
