Searching through really big files - Linux

I need to search through a terabyte of raw hard-disk data to find a couple of things inside it. I tried using sudo cat /dev/sdc | less, but this fails because less keeps everything it has read in RAM. I only have 8 GB of RAM and 8 GB of swap, so buffering a whole terabyte is not going to work.
I was wondering if I could somehow make less forget what it has read after the 1 GB mark, or maybe use another editor.
I accidentally repartitioned my drive and lost some important files. I tried some utilities but none of them worked so I tried this. I got a few of the files but I can't get the rest because the computer freezes and runs out of RAM.
I learned my lesson, I need to make more frequent backups. Any help is greatly appreciated.

The -B option to less is exactly what you ask for. It allows less to be forgetful. Combine it with -b1048576 to allocate 1 GB of buffer space (the -b unit is kilobytes).
Or do it the interactive way: run less normally, scroll down until the point where it starts to get a little laggy, then just type -B at the less prompt to activate the option (did you know you can set less options interactively?)
Just don't try to scroll backward very far, or you'll be in forgotten-content land, where weird things happen.
(Side note: I've done this kind of recovery before, and it's easier if you can find the filesystem structures (inode blocks etc.) that point to the data, rather than searching for the data in a big dump. Even if some of the inodes are gone, by first recovering everything you can from the surviving inodes you narrow down the range of unknown blocks where the other files might be.)
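As a complement to paging around: if you know a string from one of the lost files, you can stream the whole device through grep to get byte offsets without buffering anything. A sketch (the device path and search string are placeholders):

```shell
# -a treats binary data as text, -b prints the byte offset of each match,
# -o prints only the matched text; grep streams, so memory use stays flat
# no matter how large the input is
sudo grep -a -b -o 'some known text' /dev/sdc

# then start reading a little before a hit at byte $OFF, e.g.:
# sudo dd if=/dev/sdc bs=1M skip=$(( (OFF - 2097152) / 1048576 )) | less -B
```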

Related

How does du estimate file size?

I am downloading a large file with wget, which I ran in the background with wget -bqc. I wanted to see how much of the file was downloaded so I ran
du -sh *
in the directory. (I'd also be interested to know a better way to check wget progress in this case if anyone knows...) I saw that 25 GB had been downloaded, but for several attempts afterwards it showed the same result of 25 GB. I became worried that du had somehow interfered with the download until some time later when du showed a result of 33 GB and subsequently 40 GB.
In searching stackoverflow and online, I didn't find whether it is safe to use du on files being written to but I did see that it is only an estimate that can be somewhat off. However, 7-8 GB seems like a lot, particularly because it is a single file, and not a directory tree, which it seems is what causes errors in the estimate. I'd be interested to know how it makes this estimate for a single file that is being written and why I would see this result.
The operating system guarantees safe concurrent access, so running du on a file that is still being written to is safe.
du does not estimate anything. The kernel knows the size of the file, and when du asks for it, that's what it reports.
If the file is in the range of gigabytes and the reported size has only gigabyte granularity, it should not be a surprise that consecutive invocations show the same size: do you expect wget to fetch a whole extra gigabyte between your checks? Try running du without the -h flag to get a more precise reading in kilobytes.
Also, wget will hold some amount of data in RAM before writing it out, but that should be negligible.
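For example, to watch progress at byte rather than gigabyte granularity (download.iso is a placeholder filename):

```shell
# exact apparent size in bytes, instead of the rounded -h output
du --apparent-size --block-size=1 download.iso
# exact bytes actually allocated on disk
du --block-size=1 download.iso
```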
du doesn't estimate; it sums up. But it has access to some filesystem-internal information which might make its output surprising. The various aspects should be looked up separately, as they are a bit too much to explain here in detail.
Sparse files may make a file look bigger than it is on disk.
Hard links may make a directory tree look bigger than it is on disk.
Block sizes may make a file look smaller than it is on disk.
du will always print the size a directory tree (or several) actually occupies on disk. Due to various factors (the three most common are given above), this can differ from the size of the information stored in these trees.
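The sparse-file effect is easy to demonstrate (a sketch; the file name is arbitrary):

```shell
# create a 1 MiB file that contains only a hole and allocates no data blocks
truncate -s 1M sparse.img
ls -l sparse.img                # apparent size: 1048576 bytes
du --block-size=1 sparse.img    # allocated size: (near) zero
rm sparse.img
```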

Is there a way to show linux buffer cache misses?

I am trying to measure the effects of adding memory to a LAMP server.
How can I find which processes try to read from the Linux buffer cache, but miss and read from disk instead?
SystemTap is one of the best ways to do this, but fair warning: it's difficult to get a great answer. The kernel simply doesn't provide this data directly. You have to infer it from how many times the system requested a read and how many times a disk was read from. Usually they line up fairly well and you can attribute the difference to the VFS cache, but not always. One problem is LVM: LVM is a "block device", but so are the underlying disk(s), so if you're not careful it's easy to double-count the disk reads.
A while back I took a stab at it and wrote this:
https://sourceware.org/systemtap/wiki/WSCacheHitRate
I do not claim that it is perfect, but it works better than nothing, and usually generates reasonable output as long as the environment is fairly "normal". It does attempt to account for LVM in a fairly crude way.
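If SystemTap is too heavy for a first look, the kernel's global counters give a crude, system-wide (not per-process) view of actual disk reads. A sketch, not a real hit/miss breakdown:

```shell
# pgpgin in /proc/vmstat counts KiB paged in from block devices,
# i.e. reads the page cache could not satisfy; diff two samples
a=$(awk '/^pgpgin / {print $2}' /proc/vmstat)
sleep 1
b=$(awk '/^pgpgin / {print $2}' /proc/vmstat)
echo "KiB read from disk in the last second: $((b - a))"
```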

Compare filesystem space usage over time

Is there a good, graphical way to represent disk usage changes in a Linux/Unix filesystem over time?
Let me elaborate: there are several good ways to represent disk usage in a filesystem. I'm not interested in summary statistics such as total space used (as given by du(1)), but in more advanced interactive/visualization tools such as ncdu, gdmap, filelight or baobab, which can give me an idea of where the space is being used.
From a technical perspective, I think the best approach is squarified tree-maps (as available in gdmap), since it makes a better use of the visual space available. The circular approach used by filelight for instance cannot represent huge hierarchies efficiently, and it's dubious how to account for the increasing area of the outer rings in the representation from a human perspective. Looks nice, but that's about it.
Treemaps are perfect to have the current snapshot of disk usage in the filesystem, but I'd like to have something similar to see how disk usage has been evolving over time.
My current solution is very simple: I'm dumping the filesystem usage state using "ncdu -o" over time, and then I compare them side-by-side using two ncdu instances. It's very inefficient, but does the job. I'd like something more visual though.
All the relevant information can be dumped using:
find [dir] -printf "%P\t%s\n"
I did a crappy hack to load this state information in gdmap, so I can use two gdmap instances instead. Still not optimal though, as a treemap will fit the total allocated space into the same rectangle. As such, you cannot really tell if the same area is equivalent to more or less space. If two big directories grow proportionately, they will not change the visualization.
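As an intermediate step before a real visualization, two of those find dumps can at least be diffed numerically. A sketch (old.tsv and new.tsv are two snapshots taken with the find command above; requires bash for the process substitution):

```shell
# join the two snapshots on path and print size deltas, biggest growth first
join -t "$(printf '\t')" <(sort old.tsv) <(sort new.tsv) \
  | awk -F'\t' '$3 != $2 { print $3 - $2 "\t" $1 }' \
  | sort -rn
```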
I need something better than that. Obviously, I cannot plot the cumulative directory sizes in a simple line plot, as I would have too many directories.
I'd like something similar to a treemap, where maybe the color of the square represents size increase/decrease using some colormap. However, since a treemap will show individual files as opposed to directories, it's not obvious on how to color-map a directory in which the allocated space has been growing/shrinking due to new/removed files.
What kind of visualization techniques could be used to see the evolution of allocated space over time, which take the whole underlying tree into account?
To elaborate even more: in a squarified treemap the whole allocated space is proportionally divided by file size, and each directory logically clusters the allocated space within it. As such, we don't "see" directories; we see the proportional space taken by their content.
How could we extend and/or improve the visualization in order to see how the allocated space has moved to a different area of the treemap?
You can use Cacti for this.
You need to install the SNMP daemon on your machine, install Cacti (free software) locally or on any other PC, and monitor your Linux machine.
http://blog.securactive.net/wp-content/uploads/2012/12/cacti_performance_vision1.png
You can monitor network interfaces, space on any partition, and lots of other parameters of your Linux OS.
apt-get install cacti
vim /etc/snmp/snmpd.conf
add this at about line 42:
view systemonly included .1.3.6.1
close the file and restart the snmpd daemon
go to the Cacti config and try to discover your Linux machine

IntelliJ IDEA compilation speedup in Linux

I'm working with IntelliJ IDEA on Linux, and I recently got 16 GB of RAM. Is there any way to speed up compilation of my projects using this memory?
First of all, in order to speed up IntelliJ IDEA itself, you may find this discussion very useful.
The easiest way to speed up compilation is to move the compilation output to a RAM disk.
RAM disk setup
Open fstab
$ sudo gedit /etc/fstab
(instead of gedit you can use vi or whatever you like)
Set up RAM disk mount point
I'm using RAM disks in several places on my system, and one of them is /tmp, so I'll just put my compile output there:
tmpfs /tmp tmpfs defaults 0 0
In this case the RAM disk size defaults to half of your RAM, which is fine; my /tmp usage right now is only 73 MB. But if you're afraid that the RAM disk will grow too big, you can limit its size, e.g.:
tmpfs /tmp tmpfs defaults,size=512M 0 0
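After editing fstab you can activate and verify the mount without rebooting (a sketch, assuming the fstab entry above):

```shell
sudo mount -a    # mount everything newly listed in fstab
df -h /tmp       # should now report a tmpfs filesystem on /tmp
```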
Project setup
In IntelliJ IDEA, open Project Structure (Ctrl+Alt+Shift+S by default), then go to Project - 'Project compiler output' and move it to RAM disk mount point:
/tmp/projectName/out
(I've added a projectName folder in order to find it easily if I need to get there, or if I work with several projects at the same time.)
Then go to Modules, and in each of your modules go to Paths and select 'Inherit project compile output path' or, if you want a custom compile output path, modify 'Output path' and 'Test output path' as you did for the project compiler output above.
That's all, folks!
P.S. A few numbers: time of my current project compilation in different cases (approx):
HDD: 80s
SSD: 30s
SSD+RAM: 20s
P.P.S. If you use an SSD, then besides the compilation speedup you will reduce write operations on your disk, which will also help your SSD live happily ever after ;)
Yes you can. There are several ways to do this. First, you can fine-tune the JVM for the amount of memory you have. Take this https://gist.github.com/zafarella/43bc260c3c0cdc34f109 as an example.
In addition, depending on which Linux distribution you use, there is a way to create a RAM disk and rsync its content back to the HDD. Basically you place all logs and temp files (including indexes) in RAM, and your IDEA will fly.
Use something like profile-sync-daemon to keep the files synced. It is easy to add IDEA as an app. Alternatively, you can use anything-sync-daemon.
You need to change "idea.system.path" and "idea.log.path".
More details on IDEA settings can be found in their docs. The idea is to move whatever changes often into RAM.
More RAM-disk alternatives: https://wiki.debian.org/SSDOptimization#Persistent_RAMDISK
The downside of this approach is that when you run out of RAM, the OS will start paging and everything will slow down.
Hope that helps.
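For the JVM-tuning part mentioned above, the knobs live in the idea64.vmoptions file (reachable via Help | Edit Custom VM Options in recent versions). A sketch for a 16 GB machine; the values are assumptions to tune, not recommendations:

```
-Xms1g
-Xmx4g
-XX:ReservedCodeCacheSize=512m
```

Larger heaps reduce GC pauses in the IDE itself, but memory given to the IDE is memory taken away from the OS file cache and any RAM disk, so don't max it out.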
In addition to the RAM-disk approach, you can speed up compilation by giving the build process more memory (but not too much) and by compiling independent modules in parallel. Both options can be found under Settings | Compiler.

How do I measure net used disk space change due to activity by a given process in Linux?

I'd like to monitor disk space requirements of a running process. Ideally, I want to be able to point to a process and find out the net change in used disk space attributable to it. Is there an easy way of doing this in Linux? (I'm pretty sure it would be feasible, though maybe not very easy, to do this in Solaris with DTrace)
Probably you'll have to ptrace it (or get strace to do it for you and parse the output), and then try to work out what disc space is being used.
This is nontrivial, as your tracing process will need to understand which file operations use disc space - and be free of race conditions. However, you might be able to do an approximation.
Quite a lot of writes don't actually use up disc space, because most Linux filesystems support "holes". I suppose you could count holes as well for accounting purposes.
Another problem is knowing what filesystem operations free up disc space - for example, opening a file for writing may, in some cases, truncate it. This clearly frees up space. Likewise, renaming a file can free up space if it's renamed over an existing file.
Another issue is processes which invoke helper processes to do stuff - for example if myprog does a system("rm -rf somedir").
Also it's somewhat difficult to know when a file has been completely deleted, as it might be deleted from the filesystem but still open by another process.
Happy hacking :)
If you know the PID of the process to monitor, you'll find plenty of information about it in /proc/<PID>.
The file /proc/<PID>/io contains statistics about bytes read and written by the process. Note that it reports I/O volume rather than net space change, but it should be close to what you are looking for.
Moreover, in /proc/<PID>/fd/ you'll find links to all the files opened by your process, so you could monitor them.
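For example, for the current shell (the fields of interest are read_bytes and write_bytes, which count traffic that actually hit the storage layer, as opposed to rchar/wchar, which count all bytes passed through read()/write() syscalls):

```shell
cat /proc/self/io
# rchar / wchar:            bytes passed through read()/write() syscalls
# read_bytes / write_bytes: bytes actually fetched from / sent to storage
# cancelled_write_bytes:    dirty bytes whose writeback was avoided
#                           (e.g. the file was deleted before flushing)
```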
There is also a DTrace port for Linux available:
http://librenix.com/?inode=13584
Ashitosh
