How to index text files to improve grep time - linux

I have a large number of text files I need to grep through on a regular basis.
There are ~230,000 files amounting to around 15GB of data.
I've read the following threads:
How to use grep efficiently?
How to use grep with large (millions) number of files to search for string and get result in few minutes
The machine I'll be grepping on is an Intel Core i3 (i.e. dual-core), so I can't parallelize to any great extent. The machine is running Ubuntu and I'd prefer to do everything via the command line.
Instead of running a bog-standard grep each time, is there any way I can either index or tag the contents of the text files to improve searching?

qgrep handles searching a large number of files for text patterns by building an index first. See this article on why and how it works: https://zeux.io/2019/04/20/qgrep-internals
Alternatively, try a modern multi-threaded grep tool such as ugrep or ag (aka The Silver Searcher). Note: the ag issue tracker on GitHub suggests that the most recent ag 2.2.0 may run slower with multiple threads, which I assume will be fixed in a future update.
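For the qgrep route, a minimal sketch of the workflow, assuming the init/update/search commands described in the qgrep documentation (the project name here is just illustrative):
$ qgrep init textfiles ~/textfiles    # define a project covering the files
$ qgrep update textfiles              # build (or refresh) the index
$ qgrep search textfiles "some pattern"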

Have you tried ag as a replacement for grep? It should be in the Ubuntu repositories. I had a problem similar to yours, and ag is really much faster than grep for most regex searches. There are some differences in syntax and features, but that would only matter if you had special grep-specific needs.
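For reference, a hedged example of installing and running it on Ubuntu (the package is usually named silversearcher-ag):
$ sudo apt-get install silversearcher-ag
$ ag "some pattern" /path/to/textfiles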

Related

Merge, sort, maintain line order

This probably sounds contradictory, so let me explain. I have a number of log files written by log4j to different files, with rotation. What I want to do is merge them into fewer files.
How I started to go about doing this:
- use awk to concatenate multi-line entries into single lines, written to a separate file.
- cat the awk output files into one file.
- sort the combined file.
- use awk to split the concatenated lines back apart.
But I see that the sort is putting entries with the same second/ms in a different order than they appeared in their original output file. It may not be a HUGE deal. But, I don't like it. Any ideas for how I go about doing what I want (maintaining their original line order while sorting)? I would rather not write my own program and would like to use native linux utils if possible. But, I am open to the "best" way of doing this (Perl, Python, etc..).
I thought about cat'ing the output files from highest to lowest (log4j rotated files) so that I wouldn't have to sort, but that only solves the problem for files written to the same log file (file1.0.log, file1.1.log, etc.) and doesn't help when I need to merge file2 with file1.
Thank you,
Gregg
What you are talking about is "stable" sorting. There is a -s option on sort that should give you what you want.
Stability in sorting algorithms
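A minimal sketch, assuming the timestamp occupies the first two whitespace-separated fields of each concatenated line (the filenames are illustrative; adjust the -k key to your actual layout):
$ sort -s -k1,2 combined.log > merged.log
If the individual input files are already in order internally, sort -m -s merges them without a full re-sort.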

How can I get a word, varied on a number I provide?

A quick explanation of the problem I'm trying to solve: I have a .bashrc and PS1 I like to use on all systems I log into. I have written a little script to automatically set this up, and hosted it on gist.github, so I can use a one-liner to set everything up. I use terminator to keep many shells open at once. Lately, I've been keeping open ssh in many of the shells, and it's becoming hard to keep track of which shells are my local box and which shells are other servers.
I want to differentiate between shells. However, I don't want to use the hostname or things like that, because my PS1 is already enormously long.
My proposed solution is to hash the output of ifconfig, use it to retrieve a word from somewhere, then stick the first four letters of it in the PS1. As such, the word provider should have the following constraints: the same number should return the same word every time, and the words provided should vary relatively widely.
Anyone have any ideas, or a better solution? Thanks!
Edit: Here's a screenshot of my current PS1 for reference:
Edit 2: Here's a screenshot of my PS1 as of Feb 2018, after splitting the contents onto multiple lines. The "START" lines ensure I can always remember when I ran a command, and how long it took.
Yeah... that's an enormously long $PS1. Anyway.
Hashing the same value with the same algorithm and no salt will result in the same hash result each time.
$ echo -n "123" | md5sum | cut -c 1-8
202cb962
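A hedged sketch of one way to turn that hash into a short word for the prompt, assuming a system word list at /usr/share/dict/words is installed (wamerican or similar package); the variable names are just illustrative:
# hash the interface configuration (swap in `ip addr` if ifconfig isn't available)
hash=$(ifconfig | md5sum | cut -c 1-8)
# map the hash onto a line number in the word list, then pull out that word
words=/usr/share/dict/words
index=$(( 0x$hash % $(wc -l < "$words") + 1 ))
word=$(sed -n "${index}p" "$words")
# prepend the first four letters to the existing prompt
PS1="${word:0:4} $PS1"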

Something like .htaccess in Linux

I have a directory with a lot of files (over 4,000,000). All filenames follow the same pattern:
PREFIX-XXXXXX-YY.ext
where
XXXXXX contains letters and digits
YY contains digits
ext is a file extension (.txt, .jpg)
The directory structure alone is 12 MB, so listing/searching this directory takes a long time. I divided the contents of this directory into subdirectories based on the filename, specifically the first letter of the XXXXXX part of the pattern above,
i.e.
main_directory/A/PREFIX-AXXXXX-YY.ext
main_directory/B/PREFIX-BXXXXX-YY.ext
main_directory/1/PREFIX-1XXXXX-YY.ext
Is there an easy way in Linux to make a rule so that when I type a command such as
test:/home/usr/admin # ls main_directory/PREFIX-AXXXXX-*
I get a list of filenames from the main_directory/A/ directory? This rule MUST apply only to main_directory.
You can't have this at the file-system layer, not without creating links and circling back to your original problem. I can think of two easy ways out.
Take 1: scripting
You could write a short script to rewrite the names for you.
Suppose you had a rewrite script that took PREFIX-AXXXXX-* and output main_directory/A/PREFIX-AXXXXX-*. You could then change your ls line to:
$ ls `rewrite PREFIX-AXXXXX-*`
This can be easily accomplished with sed, awk or any other on-the-fly text transformation tool.
Shell programs are composable for a reason! :)
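A minimal sketch of such a rewrite script under the one-letter-subdirectory layout from the question (the name rewrite and the exact field handling are just illustrative):
#!/bin/sh
# rewrite: map each PREFIX-XXXXXX-YY.ext pattern to its one-letter subdirectory
for pattern in "$@"; do
    # the subdirectory is the first character of the XXXXXX field (second '-'-separated field)
    letter=$(printf '%s' "$pattern" | cut -d- -f2 | cut -c1)
    printf 'main_directory/%s/%s\n' "$letter" "$pattern"
done
Because the command substitution in the ls line above is unquoted, the shell expands the rewritten pattern against main_directory/A/ as usual.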
Take 2: embed a faster file-system
You could do away with the restructuring and rewriting names by using a faster file-system, mounted in your main directory. XFS sounds good for this. It should remove your performance concerns without further ado.
This requires a deeper understanding of what's going on to be effective for day-to-day usage, however.
Edit: Here's an article on how to create virtual user-space file-systems.
Edit 2: actually no, I don't think XFS would cut it. Maybe another file-system, though.

How to edit multi-gigabyte text files? Vim doesn't work =( [closed]

Are there any editors that can edit multi-gigabyte text files, perhaps by only loading small portions into memory at once? It doesn't seem like Vim can handle it =(
Ctrl-C will stop the file load. If the file is small enough, you may have been lucky enough to have loaded all the contents and only killed the post-load steps. Verify that the whole file has been loaded when using this tip.
Vim can handle large files pretty well. I just edited a 3.4GB file, deleting lines, etc. Three things to keep in mind:
Press Ctrl-C: Vim tries to read in the whole file initially, to do things like syntax highlighting, counting the number of lines in the file, etc. Ctrl-C will cancel this enumeration (and the syntax highlighting), and it will only load what's needed to display on your screen.
Readonly: Vim will likely start read-only when the file is too big for it to make a hidden swap-file copy to perform the edits on. I had to :w! to save the file, and that's when it took the most time.
Go to line: Typing :115355 will take you directly to line 115355, which is much faster than paging through those large files. Vim seems to start scanning from the beginning every time it loads a buffer of lines, and holding down Ctrl-F to scan through the file seems to get really slow near the end of it.
Note - If your Vim instance is in readonly because you hit Ctrl-C, it is possible that Vim did not load the entire file into the buffer. If that happens, saving it will only save what is in the buffer, not the entire file. You might quickly check with a G to skip to the end to make sure all the lines in your file are there.
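A hedged one-liner that applies these ideas up front, using vim's documented -R (read-only) and -n (no swap file) options:
$ vim -R -n hugefile.log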
If you are on *nix (and assuming you have to modify only parts of the file, and rarely), you may split the file (using the split command), edit the pieces individually (using awk, sed, or something similar) and concatenate them after you are done.
cat file2 file3 >> file1
It may be plugins that are causing it to choke. (syntax highlighting, folds etc.)
You can run vim without plugins.
vim -u "NONE" hugefile.log
It's minimalist but it will at least give you the vi motions you are used to.
syntax off
is another obvious one. Prune your install down and source what you need. You'll find out what it's capable of and if you need to accomplish a task via other means.
A slight improvement on the answer given by Al pachio with the split + vim solution: you can read the files in with a glob, effectively using file chunks as a buffer, e.g.
$ split -l 5000 myBigFile
xaa
xab
xac
...
$ vim xa*
#edit the files
:wn #skip forward and write
:n! #skip forward and don't save
:wN #skip back and write
:N! #skip back and don't save
You might want to check out this VIM plugin which disables certain vim features in the interest of speed when loading large files.
I've tried to do that, mostly with files around 1 GB when I needed to make some small change to an SQL dump. I'm on Windows, which makes it a major pain. It's seriously difficult.
The obvious question is "why do you need to?" I can tell you from experience having to try this more than once, you probably really want to try to find another way.
So how do you do it? There are a few ways I've done it. Sometimes I can get vim or nano to open the file, and I can use them. That's a real pain, but it works.
When that doesn't work (as in your case) you only have a few options. You can write a little program to make the changes you need (for example, search & replace). You could use a command-line program that may be able to do it (maybe it could be accomplished with sed/awk/grep/etc?)
If those don't work, you can always split the file into chunks (something like split being the obvious choice, but you could use head/tail to get the part you want) and then edit the part(s) that need it, and recombine later.
Trust me though, try to find another way.
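A hedged sketch of the split/edit/recombine route for a single known region, with hypothetical filenames and line numbers:
# carve out lines 1,000,001-1,010,000, edit only that slice, then reassemble
$ head -n 1000000 huge.sql > part1
$ sed -n '1000001,1010000p' huge.sql > part2
$ tail -n +1010001 huge.sql > part3
$ vim part2
$ cat part1 part2 part3 > huge.edited.sql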
I think it is reasonably common for hex editors to handle huge files. On Windows, I use HxD, which claims to handle files up to 8 EB (8 billion gigabytes).
I'm using vim 7.3.3 on Win7 x64 with the LargeFile plugin by Charles Campbell to handle multi-gigabyte plain text files. It works really well.
I hope you come right.
Wow, never managed to get vim to choke, even with a GB or two. I've heard that UltraEdit (on Windows) and BBEdit (on Macs) are better suited to even larger files, but I have no personal experience.
In the past I've opened files of up to 3 GB with this tool: http://csved.sjfrancke.nl/
Personally, I like UltraEdit. Here is their little spiel on large files.
I've used FAR Commander's built-in editor/viewer for super-large log files.
I have used TextPad for large log files; it doesn't have an upper limit.
The only thing I've been able to use for something like that is my favorite Mac hex editor, 0XED. However, that was with files that I considered large at tens of megabytes. I'm not sure how far it will go. I'm pretty sure it only loads parts of the file into memory at once, though.
In the past I've successfully used a split/edit/join approach when files get very large. For this to work you have to know about where the to-be-edited text is, in the original file.

Generate disk usage graphs/charts with CLI only tools in Linux

In this question someone asked for ways to display disk usage in Linux. I'd like to take this one step further down the CLI path... how about a shell script that takes the output from something like a reasonable answer to the previous question and generates a graph/chart from it (output in a png file or something)? This may be a bit too much code to ask for in a regular question, but my guess is that someone already has a one-liner lying around somewhere...
If some ASCII chars are "graphical" enough for you, I can recommend ncdu. It is a very nice interactive CLI tool, which helps me a lot when stepping down into large directories without doing cd bigdir ; du -hs over and over again.
I would recommend munin. It is designed for exactly this sort of thing - graphing CPU usage, memory usage, disc usage and such. It's sort of like MRTG (but MRTG is primarily aimed at graphing routers' traffic; graphing anything but bandwidth with it is very hackish).
Writing Munin plugins is very easy (it was one of the project's goals). They can be written in almost anything (shell script, perl/python/ruby/etc, C, anything that can be executed and produce output). The plugin output format is basically disc1usage.value 1234, and debugging the plugins is very easy (compared to MRTG).
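A hedged sketch of what such a plugin can look like, following the config/value convention described above (the field name and the df invocation are just illustrative):
#!/bin/sh
# Munin calls the plugin with "config" to learn about the graph,
# then with no arguments to collect the current value.
case "$1" in
    config)
        echo 'graph_title Disk usage on /'
        echo 'graph_vlabel bytes used'
        echo 'rootused.label / used'
        exit 0
        ;;
esac
echo "rootused.value $(df -P --block-size=1 / | awk 'NR==2 {print $3}')"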
I've set it up on my laptop to monitor disc-usage, bandwidth usage (by pulling data from my ISP's control panel, it graphs my two download "bins", uploads and newsgroup usage), load average and number of processes. Once I got it installed (currently slightly difficult on OS X, but it's trivial on Linux/FreeBSD), I had written a plugin in a few minutes, and it worked, first time!
I would describe how it's set up, but the munin site will do that far better than I could!
There's an example installation here
Some alternatives are nagios and cacti. You could also write something similar using rrdtool. Munin, MRTG and Cacti are basically all far-nicer-to-use systems based around this graphing tool.
If you want something really, really simple, you could do..
import subprocess
import time

while True:
    # record the "used" column for / (df -P keeps each filesystem on one line)
    disc_usage = subprocess.check_output(
        "df -P / | awk 'NR==2 {print $3}'", shell=True).decode().strip()
    log = open("mylog.txt", "a")  # append, don't overwrite
    log.write(disc_usage + "\n")
    log.close()
    time.sleep(60 * 5)
Then..
f = open("mylog.txt")
lines = f.readlines()
# Convert each line to a float number
lines = [float(cur_line) for cur_line in lines]
# Get the biggest and smallest
biggest = max(lines)
smallest = min(lines)
for cur_line in lines:
base = (cur_line - smallest) + 1 # make lowest value 1
normalised = base / (biggest - smallest) # normalise value between 0 and 1
line_length = int(round(normalised * 28)) # make a graph between 0 and 28 characters wide
print "#" * line_length
That'll make a simple ASCII graph of the disc usage. I really, really don't recommend you use something like this. Why? The log file will get bigger, and bigger, and bigger, and the graph will get progressively slower to generate. RRDTool uses a rolling-database system to store its data, so the file will never get bigger than about 50-100KB, and it's consistently quick to graph as the file is a fixed length.
In short. If you want something to easily graph almost anything, use munin. If you want something smaller and self-contained, write something with RRDTool.
We rolled our own at work using RRDtool (the data storage back end to tools like MRTG). We run a perl script every 5 minutes that takes a du per partition and stuffs it into an RRD database and then uses RRD's graph function to build graphs. It takes a while to figure out how to set up the .rrd files (for instance, I had to re-learn RPN to do some of the calculations I wanted to do) but if you have some data you want to graph over time, RRDtool's a good bet.
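A hedged sketch of that RRDtool round trip (the database layout, step size and field names are just illustrative):
# create a database expecting one "used" sample every 5 minutes, keeping ~2 days of samples
$ rrdtool create diskusage.rrd --step 300 DS:used:GAUGE:600:0:U RRA:AVERAGE:0.5:1:576
# from cron, every 5 minutes: feed in the current usage of / (in KB)
$ rrdtool update diskusage.rrd N:$(df -P / | awk 'NR==2 {print $3}')
# render the last day as a PNG
$ rrdtool graph diskusage.png --start -1d DEF:used=diskusage.rrd:used:AVERAGE LINE1:used#0000ff:"used (KB)"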
I guess there are a couple of options:
For a pure CLI solution, use something like gnuplot. See here for example usage. I haven't used gnuplot since my student days :-)
Not really a pure CLI solution, but download something like JFreeChart and write a simple Java app that reads stdin and creates your chart.
Hope this helps.
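Following up on the gnuplot suggestion, a hedged sketch that turns a one-column log (like mylog.txt above) into a PNG; the terminal type, size and file names are assumptions:
$ gnuplot -e "set terminal png size 800,300; set output 'diskusage.png'; plot 'mylog.txt' with lines title 'disk usage'"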
