Remove all but one line if they have common text at the end - linux

This is very specific and I couldn't find this case in the other answers.
This is the sample text:
exampleA/file_A.a
exampleB/file_A.a
exampleB/another_dir/file_B.a
exampleB/file_A.a
exampleA/file_C.a
exampleB/file_D.b
exampleB/file_C.a
exampleB/file_B.a
exampleA/another_dir/file_D.b
exampleA/another_dir/file_C.a
exampleB/another_dir/another_one/file_D.b
I want to delete the lines for duplicated files with a specific extension (.a) that can appear in this list of files (a text file), EXCEPT one, so that the text contains only one line per file. There can be more files than fileA.a, fileB.a and fileC.a, so I can't "hardcode" those.
How do I search for the lines that contain the same file at the end? I managed to do this (the extension of the files I want to delete is always .a, and the names can contain underscores anywhere; I don't want to delete the files with the extension .b, because they are intended to be the same and they're always in different folders):
grep -o '/[a-zA-Z0-9_]*\.[a]' file.txt | sort | uniq -d
But I don't have the line numbers.
And then, how do I delete all of those lines BUT one? I saw the following line in another question:
awk '!seen[$0]++' file.txt
But I can't figure out how to combine these to have the output that I need, that should look like this:
exampleA/file_A.a
exampleB/another_dir/file_B.a
exampleA/file_C.a
exampleB/file_D.b
exampleA/another_dir/file_D.b
exampleB/another_dir/another_one/file_D.b
Thanks.
Edit: I forgot to mention that there are other files in the text that have another extension (let's say .b) and I don't want to touch those. I just want to delete the ones with a specific extension (.a), and maybe another one (.d) if they appear, but that is not strictly necessary. I edited the sample.

You may use this awk:
awk -F/ 'NF && (!/\.a$/ || !seen[$NF]++)' file
exampleA/file_A.a
exampleB/another_dir/file_B.a
exampleA/file_C.a
exampleB/file_D.b
exampleA/another_dir/file_D.b
exampleB/another_dir/another_one/file_D.b
Here
-F/ sets / as the input field separator
NF selects all non-empty lines
!/\.a$/ || !seen[$NF]++ prints a line if it doesn't end with .a, or if its last field (the file name) is seen for the first time.
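The question also asked for the line numbers of the duplicates. A minimal sketch, assuming the same .a rule as above: inverting the condition and printing NR lists every .a line whose file name has already been seen. The temporary file and the variable names list and dupes exist only for this illustration.

```shell
# Sample list from the question, written to a temporary file for the demo.
list=$(mktemp)
cat > "$list" <<'EOF'
exampleA/file_A.a
exampleB/file_A.a
exampleB/another_dir/file_B.a
exampleB/file_A.a
exampleA/file_C.a
exampleB/file_D.b
exampleB/file_C.a
exampleB/file_B.a
exampleA/another_dir/file_D.b
exampleA/another_dir/file_C.a
exampleB/another_dir/another_one/file_D.b
EOF

# Print "line number: line" for each .a line whose file name was seen before.
# The && short-circuits, so .b lines never touch the seen[] array.
dupes=$(awk -F/ '/\.a$/ && seen[$NF]++ { print NR ": " $0 }' "$list")
printf '%s\n' "$dupes"
rm -f "$list"
```

These are exactly the lines that the filtering command drops.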

awk -F/ '$NF!~/\.(a|d)$/|| !seen[$NF]++' file.txt
exampleA/file_A.a
exampleB/another_dir/file_B.a
exampleA/file_C.a
exampleB/file_D.b
exampleA/another_dir/file_D.b
exampleB/another_dir/another_one/file_D.b
should do what you want. Note that in your input sample you have file_C.a and fileC.a
What we do is tell awk to use / as the field separator and use only the file-name portion, the last field $NF, as the array index.
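If the set of extensions changes often, the same idea can be wrapped in a small function. This is only a sketch: the function name dedupe_by_basename and its second parameter (an alternation of extensions such as "a|d") are hypothetical, not part of the answers above.

```shell
dedupe_by_basename() {
    # $1 = list file, $2 = extensions to deduplicate, e.g. "a" or "a|d"
    # (hypothetical parameters; the extensions are assumed to contain
    # no regex metacharacters and no backslashes).
    awk -F/ -v ext="$2" '
        BEGIN { re = "\\.(" ext ")$" }     # builds e.g. \.(a|d)$
        $NF !~ re || !seen[$NF]++          # keep other lines and first sightings
    ' "$1"
}

# Demo on a shortened version of the sample list.
list=$(mktemp)
printf '%s\n' exampleA/file_A.a exampleB/file_A.a exampleB/file_D.b \
    exampleA/another_dir/file_D.b > "$list"
kept=$(dedupe_by_basename "$list" "a|d")
printf '%s\n' "$kept"
rm -f "$list"
```

Lines with any other extension, such as the .b files, pass through untouched.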

Related

How to use invert "-v" in grep when I do not have a file but a long string that is just one line?

Suppose I have
echo "The first part. The second part. The third part."
and want to remove The first part and The third part to get:
The second part.
I tried:
echo "The first part. The second part. The third part." | grep -v -e "The first part." -e "The third part."
but the inverting flag appears to work only for files with multiple lines. How can I do it for a single string?
Use sed instead:
echo "The first part. The second part. The third part." \
| sed -e 's/[[:space:]]*The first part\.[[:space:]]*//g' \
-e 's/[[:space:]]*The third part\.[[:space:]]*//g'
grep is a line-based tool: it selects the lines which satisfy a condition. The task you want to implement is removing substrings from a file. This is in the area of substitution, not selection, and the best tool for that task is sed:
sed 's/string_to_get_rid_of//g' file
Of course, it is possible that your file is structured in records and you want to remove all records which contain a particular word; then there is another option. Assume that your file is split into records which are delimited by a unique character (e.g. the full stop, .). Then it is better to use awk. Awk allows you to redefine its record separator from a newline (the default) to anything you want by defining RS and ORS (the latter for the output):
awk 'BEGIN{RS=ORS="."}/string_that_should_not_appear/{next}1' file
Assume you have a file with the content:
foo.bar.baz.qux
quux.quuz.corge
If we want to remove all the records which contain qux, we do:
awk 'BEGIN{RS=ORS="."}/qux/{next}1' file
which returns
foo.bar.baz.quuz.corge.
Notice that the record containing "qux" also contained a newline (it was qux, a newline, then quux, so quux is removed as well) and that an extra ORS is added at the end. Also you might get
foo.bar.baz.quuz.corge
.
This is because, per the POSIX standard, a text file should end with a newline: that newline becomes part of the last record, and the ORS is printed after it.
In case of the OP, it would read:
awk 'BEGIN{RS=ORS="."}/The first part/{next}/The third part/{next}1' file
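Applied to the single-line string from the question, the record-based command leaves the space that followed the removed sentence, plus a stray record for the trailing newline. A minimal sketch that tidies both up; the function name drop_sentences and the extra trimming step are additions for illustration, not part of the answer above.

```shell
drop_sentences() {
    # Reads text on stdin; "." is both the record and the output separator.
    awk 'BEGIN { RS = ORS = "." }
         /The first part/ { next }
         /The third part/ { next }
         {
             sub(/^[ \t\n]+/, "")       # trim whitespace left at the record start
             if ($0 != "") print        # skip the record made of the final newline
         }'
}

result=$(printf '%s\n' "The first part. The second part. The third part." | drop_sentences)
printf '%s\n' "$result"
```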

Using sed to obtain pattern range through multiple files in a directory

I was wondering if it was possible to use the sed command to find a range between 2 patterns (in this case, dates) and output these lines in the range to a new file.
Right now, I am just looking at one file and getting lines within my time range of the file FileMoverTransfer.log. However, after a certain time period, these logs are moved to new log files with a suffix such as FileMoverTransfer.log-20180404-xxxxxx.gz. Here is my current code:
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log >> /public/FileMoverRoot/logs/intervalFMT.log
The following doesn't work, though; it seems sed isn't able to look through all of the files in the directory whose names start with FileMoverTransfer.log:
sed -n '/^'$start_date'/,/^'$end_date'/p;/^'$end_date'/q' FileMoverTransfer.log* >> /public/FileMoverRoot/logs/intervalFMT.log
Any help would be greatly appreciated. Thanks!
The range operator only operates within a single file, so you can't use it if the start is in one file and the end is in another file.
You can use cat to concatenate all the files, and pipe this to sed:
cat FileMoverTransfer.log* | sed -n "/^$start_date/,/^$end_date/p;/^$end_date/q" >> /public/FileMoverRoot/logs/intervalFMT.log
And instead of quoting and unquoting the sed command, you can use double quotes so that the variables will be expanded inside it. This will also prevent problems if the variables contain whitespace.
awk solution
As the OP confirmed that an awk solution would be acceptable, I post it.
(gunzip -c FileMoverTransfer.log-*.gz; cat FileMoverTransfer.log ) \
|awk -v st="$start_date" -v en="$end_date" '$1>=st&&$1<=en{print;next}$1>en{exit}'\
>/public/FileMoverRoot/logs/intervalFMT.log
This solution is functionally almost identical to Barmar’s sed solution, with the difference that his solution, like the OP’s, will print and quit at the first record matching the end date, while mine will print all lines matching the end date and quit at the first record past the end date, without printing it.
Some remarks:
The OP didn't specify the date format. I suppose it is a format compatible with ordinary string order, otherwise some conversion function should be used.
The files FileMoverTransfer.log-*.gz must be named in such a way that their alphabetical ordering corresponds to the chronological order (which is probably the case.)
I suppose that the dates are separated from the rest of the line by whitespace. If they aren’t, you have to supply the -F option to awk. E.g., if the dates are separated by -, you must write awk -F- ...
awk is much faster than sed in this case, because awk simply looks for the separator (whitespace or whatever was supplied with -F) while sed performs a regexp match.
There is no concept of range in my code, only date comparison. The only place where I suppose that the lines are ordered is when I say $1>en{exit}, that is exit when a line is newer than the end date. If you remove that final pattern and its action, the code will run through the whole input, but you could drop the requirement that the files be ordered.
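The date comparison can be tried in isolation on a few sample lines. This sketch assumes ISO-style dates (YYYY-MM-DD) in the first whitespace-separated field, which compare correctly as strings; the function name filter_by_date and the sample log lines are made up for the demonstration.

```shell
filter_by_date() {
    # $1 = start date, $2 = end date; sorted log lines are read from stdin.
    awk -v st="$1" -v en="$2" '
        $1 >= st && $1 <= en { print; next }   # inside the interval
        $1 > en              { exit }          # past the end: stop reading
    '
}

picked=$(printf '%s\n' \
    "2018-04-01 transfer a" \
    "2018-04-03 transfer b" \
    "2018-04-04 transfer c" \
    "2018-04-04 transfer d" \
    "2018-04-06 transfer e" | filter_by_date 2018-04-03 2018-04-04)
printf '%s\n' "$picked"
```

Both lines dated 2018-04-04 are kept, which is the difference from the sed range described above.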

Prefix search names to output in bash

I have a simple egrep command searching for multiple strings in a text file which outputs either null or a value. Below is the command and the output.
cat Output.txt|egrep -i "abc|def|efg"|cut -d ':' -f 2
Output is:-
xxx
(null)
yyy
Now, I am trying to prefix my search texts to the output like below.
abc:xxx
def:
efg:yyy
Any help on the code to achieve this or where to start would be appreciated.
-Abhi
Since I do not know the exact content of your input file (it is not specified in the question), I will make some assumptions in order to answer your question.
Case 1: the patterns you are looking for are always located in the same column
If it is the case, the answer is quite straightforward:
$ cat grep_file.in
abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey
$ egrep -i "abc|def|efg" grep_file.in | cut -d':' -f1,2
abc:xxx
def:
efg:yyy
After the grep just use the cut with the 2 columns that you are looking for (here it is 1 and 2)
REMARK:
Do not cat the file and pipe it into grep, since this does the work twice! grep already reads the file itself, so there is no need to read it twice. It might not matter much on small files, but you will feel the difference on 10GB files, for example.
Case 2: the patterns you are looking for are NOT located in the same column
In this case it is a bit more tricky, but not impossible. There are many ways of doing it; here I will detail the awk way:
$ cat grep_file2.in
abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey
If your input file is in this format; with your pattern that could be located anywhere:
$ awk 'BEGIN{FS=":";ORS=FS}{tmp=0;for(i=1;i<=NF;i++){tmp=match($i,/abc|def|efg/);if(tmp){print $i;break}}if(tmp){printf "%s\n", $2}}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
FS=":";ORS=FS sets both the input field separator and the output record separator to : (so print $i emits the field followed by a colon). On each line, tmp is a test variable that becomes true when the pattern is found: you loop over all the fields of the line until one matches; when it does, you print that field, break out of the loop, and then print the second field followed by a newline.
If you do not meet your pattern you do nothing.
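The one-liner is easier to follow when spread over several lines. A sketch of the same logic, with the ORS trick replaced by an explicit ":" in the print statement; the function name prefix_matches is hypothetical, and the input is the grep_file2.in sample from above.

```shell
prefix_matches() {
    # $1 = colon-separated input file
    awk -F: '
    {
        for (i = 1; i <= NF; i++) {
            if (match($i, /abc|def|efg/)) {
                print $i ":" $2     # matching field, then the second field
                break               # only the first match on a line counts
            }
        }
    }' "$1"
}

input=$(mktemp)
printf '%s\n' "abc:xxx:uvw" "::def:" "efg:yyy:toto" "xyz:lol:hey" > "$input"
prefixed=$(prefix_matches "$input")
printf '%s\n' "$prefixed"
rm -f "$input"
```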
If you prefer the sed way, you can use the following command:
$ sed -n '/abc\|def\|efg/{h;s/.*\(abc\|def\|efg\).*/\1:/;x;s/^[^:]*:\([^:]*\):.*/\1/;H;x;s/\n//p}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
/abc\|def\|efg/{} selects the lines that contain at least one of the patterns provided and executes the instructions in the block on them. h;s/.*\(abc\|def\|efg\).*/\1:/; saves the line in the hold space and replaces the pattern space with the matched pattern followed by a colon. x;s/^[^:]*:\([^:]*\):.*/\1/; exchanges the pattern and hold spaces and extracts the second column. Last but not least, H;x;s/\n//p joins both extracted elements on one line and prints it.
Try this:
$ egrep -io "(abc|def|efg):[^:]*" file
will print the match and the next token after delimiter.
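Recent GNU grep releases print an obsolescence warning for egrep, so grep -E is the safer spelling of the same command. A small sketch on the Case 1 sample data, assuming a grep that supports -o (GNU and BSD grep do); the temporary file and the variable name found are for illustration only.

```shell
input=$(mktemp)
printf '%s\n' "abc:xxx:uvw" "def:::" "efg:yyy:toto" "xyz:lol:hey" > "$input"

# -E: extended regex, -i: ignore case, -o: print only the matched part,
# i.e. the keyword, the colon, and everything up to the next colon.
found=$(grep -Eio '(abc|def|efg):[^:]*' "$input")
printf '%s\n' "$found"
rm -f "$input"
```

Note that this prints whatever field follows the keyword, so it only reproduces the requested output when the keyword sits in the first column.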
If we can assume that there are only two fields, that abc etc will always match in the first field, and that getting the last match on a line which contains multiple matches is acceptable, a very simple sed script could work.
sed -n 's/^[^:]*\(abc\|def\|efg\)[^:]*:\([^:]*\)/\1:\2/p' file
If other but similar conditions apply (e.g. there are three fields or more but we don't care about matches in the first two) the required modifications are trivial. If not, you really need to clarify your question.

Bash recursive similarities between directories content

I am looking for a bash command/script that will do the following:
I have two directory trees with different structures and file names.
I want to find all lines in one tree that are the same as a line in some file in the other tree.
E.g. line 56 "int archiveHex = 0x.." in file1.cpp is the same as line 89 of fileArchive.cpp. Of course the line numbers are not required at this stage; the line content is good enough.
The long story is that I have two projects, both quite big, and I want to see whether anyone used GPL code from one of the projects in their commercial product. The names of files and the directory structure have changed, but I see similarities and I am sure they copied something.
I found these two related questions:
How to compare two text files for the same exact text using BASH?
It uses grep, but you have to pass the two files and it cannot work recursively.
Also I found
https://unix.stackexchange.com/questions/1079/output-the-common-lines-similarities-of-two-text-files-the-opposite-of-diff this as a way to use DIFF but for similarities not differences.
And also I found for the recursive part this question
https://askubuntu.com/questions/111495/how-to-diff-multiple-files-across-directories
But anyway I don't know how to combine all of them. How would you do this?
This can be done with a bit of shell and Awk script. Read all the lines of the first directory's files into an array, then for each input line, see if it is a defined key in the array. (I'm filtering out whitespace lines to reduce false positives. Maybe add empty comments to the filter, too.) The array key is the contents of the line and the array key's value is a string which identifies the source file name and line number. We conveniently receive these as colon-separated values from grep -nr:
grep -nrv '^[[:space:]]*$' "$srcdir" |
awk -F : 'NR==FNR { a[substr($0, length($1 ":" $2 ":")+1)] = $1 ":" $2; next }
$0 in a { print FILENAME ":" FNR " matches " a[$0] ":" $0; result=1}
END { exit 1-result }' - $(find "$otherdir" -type f)
The Awk script is fundamentally very simple; NR==FNR is a common idiom which matches the first input file (here, standard input, the pipe from grep) which is where we obtain the values for the array a; for subsequent input files, we trigger if the input line is a key in the array. The associative array type of Awk is ideal here.
This assumes that you have no file names with colons or newlines in them. It also assumes that the find output is small enough to not trigger an "Argument list too long" error, though if it does, that will be somewhat easier to fix.
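One way around the "Argument list too long" limit is to let find batch the file names itself. Since standard input can only be read once, this sketch first saves the grep output to a temporary index file and passes it as the first argument of every awk batch. The function name common_lines is hypothetical, the exit status handling of the original is left out, and the same no-colons-in-file-names assumption applies.

```shell
common_lines() {
    # $1 = source tree, $2 = tree to check against it
    index=$(mktemp) || return 1
    grep -nrv '^[[:space:]]*$' "$1" > "$index"
    # find runs awk as often as needed, always with the index file first.
    find "$2" -type f -exec awk -F : '
        FILENAME == ARGV[1] { a[substr($0, length($1 ":" $2 ":") + 1)] = $1 ":" $2; next }
        $0 in a { print FILENAME ":" FNR " matches " a[$0] ":" $0 }
    ' "$index" {} +
    rm -f "$index"
}

# Demo with two tiny throwaway trees.
src=$(mktemp -d); other=$(mktemp -d)
mkdir -p "$other/sub"
printf 'int archiveHex = 0x10;\n\nfoo();\n' > "$src/file1.cpp"
printf 'bar();\nint archiveHex = 0x10;\n' > "$other/sub/fileArchive.cpp"
matches=$(common_lines "$src" "$other")
printf '%s\n' "$matches"
rm -rf "$src" "$other"
```

Testing FILENAME against ARGV[1] instead of NR==FNR keeps the script correct even when the index file is empty.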

remove almost-duplicates containing substring of next line

I need a way to remove near-duplicate lines. Let me explain, because I have already tried uniq. In a file, I get these two lines:
ANASI:A=4-63261950;
ANASI:A=4-63261950,ES=541;
The string 4-63261950 is duplicated in both lines, but the line itself is different, only that string is equal in both lines. I just need a way to process the entire file and remove the first line and leave only the one with the ANASI:A=4-63261950,ES=541;. The file will contain several lines with this exact same scenario. Is there a way to do this with sed or something?
awk to the rescue...
Assuming your delimiters and structure stay the same,
sort file | awk -F"[;,]" '!a[$1]++'
This will pick the first one in lexical order (, sorts before ;).
If the file is huge (and memory is a problem):
sort YourFile | awk -F '[;,]' 'Last != $1{print}{Last = $1}'
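Both commands above change the line order because of sort. If the original order matters, a two-pass awk over the same file is one alternative. This is a different technique from the answers here, sketched under the assumption that the short variant is always the key followed by ; and the long variant contains a comma; the function name drop_short_variants is hypothetical.

```shell
drop_short_variants() {
    # $1 = input file, read twice by awk
    awk -F '[;,]' '
        # Pass 1: remember every key that has a long (comma) variant.
        NR == FNR { if (index($0, ",")) has_long[$1] = 1; next }
        # Pass 2: drop comma-less lines whose key also has a long variant.
        !(index($0, ",") == 0 && ($1 in has_long))
    ' "$1" "$1"
}

input=$(mktemp)
printf '%s\n' "ANASI:A=4-63261950;" "ANASI:A=4-63261950,ES=541;" "ANASI:A=4-11111111;" > "$input"
cleaned=$(drop_short_variants "$input")
printf '%s\n' "$cleaned"
rm -f "$input"
```

Unlike the second sort-based command, this keeps one array entry per key with a long variant in memory.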
This might work for you (GNU sed):
sed -r 'N;/^(.*);\n\1,/!P;D' file
This uses a moving window to compare successive pairs of lines to print the required match.

Resources