grep: memory exhausted when comparing two files to find the delta (Linux)

In my project, I am comparing file1 with file2 and writing the difference (the delta between the two files) to output_file. I am using the following command to find the difference:
grep -v -F -f <file1> <file2> > <output_file>
When I compare files around 22 MB in size, I get the following error:
grep: memory exhausted
When I compare smaller files, it works fine. Please let me know if any tweak is needed.

You may add a swap partition on this machine, so that if RAM is exhausted the machine can take space from swap.
Here is a guide on adding a swap partition: http://www.thegeekstuff.com/2010/08/how-to-add-swap-space/

You can also use the diff file1 file2 command, which may be a better option.
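If a whole-line comparison is acceptable (note that grep -F -f matches the patterns as substrings anywhere in a line, so this is not exactly equivalent), a sorted comm pipeline uses far less memory; file1, file2 and output_file are the placeholders from your command, and the <( ) process substitution requires bash:
# lines of file2 that do not appear as whole lines in file1
comm -13 <(sort file1) <(sort file2) > output_file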

Related

Find difference line by line

I have a program which stores some data in two files kept in separate folders: /Path_1/File A and /Path_2/File B.
Now I need to compare those two files line by line for any differences. If a difference is found, I need to capture it and store it in a separate file or print it on the screen.
I tried using comm, diff and join, but none of them has worked so far. I would appreciate any help.
A sample file looks like the following:
124 days
3.10.0-327.13.1.el7.x86_64
/dev/mapper/vg_sda-lv_root ext4
devtmpfs devtmpfs
In the other file, the number of days and the kernel version can differ. I only need to capture those while running a script.
I tried diff -y -W 120 Source/File Destination/File and comm File1 File2.
You can try this:
diff -y --suppress-common-lines /path_1/file_a /path_2/file_b > output
Where --suppress-common-lines does:
"do not output common lines"

How to use sed command to delete lines without backup file?

I have a large file with a size of 130 GB.
# ls -lrth
-rw-------. 1 root root 129G Apr 20 04:25 syslog.log
I need to reduce the file size by deleting the lines that start with "Nov 2", so I ran the following command:
sed -i '/Nov 2/d' syslog.log
So I can't edit the file using the vim editor either.
When I run the sed command, it also creates a backup file, but I don't have much free space left in root. Please suggest an alternative way to delete these lines from the file without using extra space on the server.
It does not create a real backup file. sed is a stream editor: when applied to a file with option -i, it streams that file through the sed process, writes the output to a new (temporary) file, and when everything is done, renames the new file to the original name.
(There are options to create real backup files as well, but you didn't use them, so I won't go into that.)
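You can see this for yourself with a small throw-away file (a quick check; demo.txt is just a scratch name for this illustration). The inode changes, which shows that a new file replaced the original:
printf 'one\ntwo\n' > demo.txt
ls -i demo.txt            # note the inode number
sed -i '/one/d' demo.txt
ls -i demo.txt            # a different inode: sed wrote a new file and renamed it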
In your case you have a very large file and don't want to create any copy, however temporary. For this you need to open the file for reading and writing at the same time; then the filtering process can overwrite the original in place. Afterwards you will have to truncate the file at the point where the new content ends.
To demonstrate how this can be done, we first perform a test case.
Create a test file, containing lots of lines:
seq 0 999999 > x
Now, lets say we want to remove all lines containing the digit 4:
grep -v 4 1<>x <x
This will open the file for reading and writing as STDOUT (1), and for reading as STDIN. The grep command will read all lines and will output only the lines not containing a 4 (option -v).
This effectively overwrites the beginning of the original file.
You will not know in advance how long the new output is, so after it ends, the rest of the original contents of the file will still appear:
…
999991
999992
999993
999995
999996
999997
999998
999999
537824
537825
537826
537827
537828
537829
…
You can use the Unix tool truncate to shorten your file manually afterwards. In a real scenario you will have trouble finding the right spot for this, so it makes sense to count the number of bytes written (using wc):
(Don't forget to recreate the original x for this test.)
(grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c
This will perform the step above and additionally print the number of bytes written to the terminal; in this example the output is 3653658. Now use truncate:
truncate -s 3653658 x
Now you have the result you want.
If you want to do this in a script, i. e. without interaction, you can use this:
length=$( (grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c )
truncate -s "$length" x
I cannot guarantee that this will work for files >2GB or >4GB on your machine; depending on your operating system (32bit?) and the versions of the installed tools you might run into largefile issues. I'd perform tests with large files first (>4GB as this is typically a limit for many things) and then cross your fingers and give it a try :)
Some caveats you have to keep in mind:
Of course, nobody is supposed to append log entries to that log file while the procedure is running.
Also, any abort while the process is running (power failure, a caught signal, etc.) will leave the file in an undefined state. But re-running the command after such a mishap will in most cases produce the correct output; some lines might be duplicated, but no more than a single line should be corrupted.
The output must be smaller than the input, of course; otherwise the writing will overtake the reading, corrupting the whole result so that lines which should be there end up missing (or truncated at the start).

sort runs out of memory

I'm using a pipe that includes sort to merge multiple large text files and remove duplicates.
I don't have root permissions, but the box isn't configured to restrict non-root privileges beyond the defaults of Debian Jessie.
The box has 32GB RAM and 16GB are in use.
Regardless of how I call sort (GNU sort 8.13), it fills up all the remaining RAM and crashes with "out of memory".
It really does fill up all the memory before crashing; I watched the process in top. I tried to explicitly set the maximum memory usage with the -S parameter, ranging from 80% to 10% and from 8G to 500M.
The whole pipe looks similar to:
cat * | tr -cd '[:print:]' |sort {various params tested here} -T /other/tmp/path/ | uniq > ../output.txt
Always the same behavior.
Does anyone know what could cause such an issue?
And of course how to solve it?
I found the issue myself. It's fairly simple.
The tr -cd '[:print:]' removes the line breaks, and sort reads line by line.
So it tries to read all the files as one single line, and the -S parameter can't do its job.
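A minimal fix is to keep the newline in the set of retained characters, since [:print:] does not include it (the sort options from the original pipe can stay as they were):
cat * | tr -cd '[:print:]\n' | sort -T /other/tmp/path/ | uniq > ../output.txt
With the line breaks preserved, sort works line by line again and the -S limit behaves as expected; sort -u could also replace the trailing uniq.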

Understanding sincedb files from Logstash file input

When using the file input with Logstash, a sincedb file is written in order to keep track of the current position of monitored log files. How to understand its contents?
Example of a sincedb file:
286105 0 19 20678374
There are 4 fields (source):
inode
major device number
minor device number
byte offset
Assuming that a hard disk were segmented into thousands of very tiny parts, each with its own number, the inode would be more or less the number of the tiny part where the file begins. So a given inode is unique within each disk, but to cover cases where there are multiple disks on the same server, the major and minor device numbers are also needed to guarantee the uniqueness of the triplet {inode, major device number, minor device number}. More accurate info about inodes can be found on Wikipedia.
That said, I am not so sure that (for example) files mounted through NFS could not collide with local files, since the inode of a file mounted through NFS seems to be the remote one. I don't think the plugin writer bothered about such cases, though, and despite using NFS myself I have never run into any trouble so far. I also suspect the collision probability is very small.
Now, with the triplet formed by the inode and the major and minor device numbers, we have a way of targeting the single log file that is being read by the plugin without error (or at least that was the original intent). The last number, the byte offset, keeps track of how far the input log file has already been read and shipped to Logstash.
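For example, you can cross-check a sincedb entry against the monitored file with GNU stat (the path below is a placeholder; note that %D prints the combined device number in hex, while sincedb stores major and minor separately, so this is only a rough check). If the byte offset equals the file size, Logstash has read the whole file:
# %i = inode, %D = device number (hex), %s = size in bytes
stat -c 'inode=%i device=%D size=%s' /var/log/mylog.log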
On some platforms, such as Solaris or Windows, there have been bugs where Ruby wrongly detected the inode number as 0. This could, for example, lead to issues like Logstash not detecting a file rotation.
This was super helpful. I wanted to map all my SinceDB files to the logstash inputs, so I put together a little bash two-liner to print this mapping.
filesystems=$(grep path /etc/logstash/conf.d/*.conf | awk -F'=>' '{ print $2 }' | xargs -I {} df -P {} 2>/dev/null | grep -v Filesystem | sort | uniq | cut -d' ' -f 1)
for fs in $filesystems; do for f in $(ls -a .sincedb_*); do echo $f; inodes=$(cut -d' ' -f 1 $f); for inode in $inodes; do sudo debugfs -R "ncheck $inode" $fs 2>/dev/null | grep -v Inode | cut -f 2; done; echo; done; done
I just documented the details about mapping SinceDB files to logstash input.

grep but indexable?

I have over 200 MB of source code files that I constantly have to search through (I am part of a very big team). I notice that grep does not create an index, so each lookup requires going through the entire source code database.
Is there a command-line utility similar to grep which has indexing ability?
The solutions below are rather simple. There are a lot of corner cases that they do not cover:
searching for start of line ^
filenames containing \n or : will fail
filenames containing white space will fail (though that can be fixed by using GNU Parallel instead of xargs)
searching for a string that matches the path of another file will be suboptimal
The good part about the solutions is that they are very easy to implement.
Solution 1: one big file
Fact: Seeking is dead slow, reading one big file is often faster.
Given those facts the idea is to simply make an index containing all the files with all their content - each line prepended with the filename and the line number:
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . > .index
Use the index:
grep foo .index
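Each line of .index has the form ./path/file:linenumber:content (that is what grep -Han produces), so you can also filter on the filename prefix; for example, to limit hits to C files (the .c extension is just an example):
grep '^[^:]*\.c:.*foo' .index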
Solution 2: one big compressed file
Fact: Hard drives are slow. Seeking is dead slow. Multi-core CPUs are normal.
So it may be faster to read a compressed file and decompress it on the fly than to read the uncompressed file, especially if you have enough RAM to cache the compressed file but not enough for the uncompressed one.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Use the index:
pbzcat .index | grep foo
Solution 3: use index for finding potential candidates
Generating the index can be time consuming and you might not want to do that for every single change in the dir.
To speed that up only use the index for identifying filenames that might match and do an actual grep through those (hopefully limited number of) files. This will discover files that no longer match, but it will not discover new files that do match.
The sort -u is needed to avoid grepping the same file multiple times.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Use the index:
pbzcat .index | grep foo | sed s/:.*// | sort -u | xargs grep foo
Solution 4: append to the index
Re-creating the full index can be very slow. If most of the dir stays the same, you can simply append to the index with newly changed files. The index will again only be used for locating potential candidates, so if a file no longer matches it will be discovered when grepping through the actual file.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Append to the index:
find . -type f -newer .index -print0 | xargs -0 grep -Han . | pbzip2 >> .index
Use the index:
pbzcat .index | grep foo | sed s/:.*// | sort -u | xargs grep foo
It can be even faster if you use pzstd instead of pbzip2/pbzcat.
Solution 5: use git
git grep can grep through a git repository. But it seems to do a lot of seeks and is 4 times slower on my system than solution 4.
The good part is that the .git index is smaller than the .index.bz2.
Index a dir:
git init
git add .
Append to the index:
git add .
Use the index:
git grep foo
Solution 6: optimize git
Git puts its data into many small files. This results in seeking. But you can ask git to compress the small files into few, bigger files:
git gc --aggressive
This takes a while, but it packs the index very efficiently in few files.
Now you can do:
find .git -type f | xargs cat >/dev/null
git grep foo
git will do a lot of seeking into the index, but by running cat first, you put the whole index into RAM.
Adding to the index is the same as in solution 5, but run git gc now and then to avoid many small files, and git gc --aggressive to save more disk space, when the system is idle.
git will not free disk space if you remove files. So if you remove large amounts of data, remove .git and do git init; git add . again.
There is the https://code.google.com/p/codesearch/ project, which is capable of creating an index and searching it quickly. Regexps are supported and evaluated using the index (actually, only a subset of the regexp is used to filter the file set via the index, and then the real regexp is re-evaluated on the matched files).
The codesearch index is usually 10-20% of the source code size, building it is about as fast as running classic grep two or three times, and searching is almost instantaneous.
The ideas used in the codesearch project come from Google's Code Search site (RIP). E.g. the index contains a map from n-grams (trigrams, i.e. every 3-byte sequence found in your sources) to the files, and the regexp is translated into a trigram query when searching.
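Typical usage is roughly the following (a sketch based on the cindex/csearch commands shipped with codesearch; by default the index is written to ~/.csearchindex, and you should verify the exact flags against your installed version):
cindex ~/src              # build or update the index for a source tree
csearch -n 'foo.*bar'     # regexp search over the indexed files, with line numbers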
PS: There are also ctags and cscope for navigating C/C++ sources. ctags can find declarations/definitions; cscope is more capable, but has problems with C++.
PPS: There are also clang-based tools for the C/C++/ObjC languages: http://blog.wuwon.id.au/2011/10/vim-plugin-for-navigating-c-with.html and clang-complete.
I notice that grep does not create an index, so each lookup requires going through the entire source code database.
Without addressing the indexing part: with Git 2.8 (Q1 2016), git grep will have the ability to run in parallel!
See commit 89f09dd, commit 044b1f3, commit b6b468b (15 Dec 2015) by Victor Leschuk (vleschuk).
(Merged by Junio C Hamano -- gitster -- in commit bdd1cc2, 12 Jan 2016)
grep: add --threads=<num> option and grep.threads configuration
"git grep" can now be configured (or told from the command line) how
many threads to use when searching in the working tree files.
grep.threads:
Number of grep worker threads to use.
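For example, with Git 2.8 or later:
git grep --threads=8 foo       # per invocation
git config grep.threads 8      # or set it once per repository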
ack is a code searching tool that is optimized for programmers, especially programmers dealing with large heterogeneous source code trees: http://beyondgrep.com/
Are some of your searches ones where you only want to look in a certain type of file, like only Java files? Then you can do
ack --java function
ack does not index the source code, but it may not matter, depending on what your search patterns are like. In many cases, searching only certain types of files gives the speedup you need because you're not also searching all those other XML, etc. files.
And if ack doesn't do it for you, here is a list of many tools designed for searching source code: http://beyondgrep.com/more-tools/
We use a tool internally to index very large log files and make efficient searches of them. It has been open-sourced. I don't know how well it scales to large numbers of files, though. It multithreads by default, it searches inside gzipped files, and it caches indexes of previously searched files.
https://github.com/purestorage/4grep
This grep-cache article has a script for caching grep results. The examples were run on Windows with Linux tools installed, so it can easily be used on *nix/Mac with little modification. It's mostly just a Perl script anyway.
Also, the filesystem itself (assuming you're using *nix) often caches recently read data, causing later greps to be faster, since grep is then effectively searching memory instead of disk.
You can manually drop that cache by writing to /proc/sys/vm/drop_caches if you want to compare the speed of an uncached grep against a cached one.
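For example, to clear the page cache before timing an uncached grep (requires root):
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches   # 1 = page cache only, 3 = page cache plus dentries and inodes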
Since you mention various kinds of text files that are not really code, I suggest you have a look at GNU ID utils. For example:
cd /tmp
# create index file named 'ID'
mkid -m /dev/null -d text /var/log/messages.*
# query index
gid -r 'spamd|kernel'
These tools focus on tokens, so queries on strings of tokens are not possible. There is minimal integration in emacs for the gid command.
For the more specific case of indexing source code, I prefer to use GNU global, which I find more flexible. For example:
cd sourcedir
# index source tree
gtags .
# look for a definition
global -x main
# look for a reference
global -xr printf
# look for another kind of symbol
global -xs argc
Global natively supports C/C++ and Java, and with a bit of configuration, can be extended to support many more languages. It also has very good integration with emacs: successive queries are stacked, and updating a source file updates the index efficiently. However I'm not aware that it is able to index plain text (yet).
