Tracking a program's progress reading through a file? - linux

Is there a way to figure out where in a file a program is reading from? It seems like it might be doable with strace or dtrace?
To clarify the question and give motivation, say I have a 10GB log file and am counting the number of unique lines:
$ cat log.txt | sort | uniq | wc -l
Can I check where in the file cat currently is, effectively giving the progress of the command? Using lsof, I can't seem to get the offset of the last read, which I think is what would do the trick:
$ lsof log.txt
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cat 16021 erik 3r REG 0,22 13416118210 1078133219
Edit: I apologize, the example I gave is too narrow and misses the point. Ideally, for an arbitrary program, I would like to see where in the file reads are occurring (regardless of pipe).

You can do what you want with the progress command. It shows the progress of coreutils tools such as cat, as well as other programs, as they read through their files.
File and offset information is available in Linux in /proc/<PID>/fd and /proc/<PID>/fdinfo.
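For example, the current read offset of every open descriptor is exposed as the pos: field of /proc/<PID>/fdinfo/<FD>. A minimal sketch using the PID 16021 and fd 3 from the lsof output above (your numbers will differ; GNU stat assumed):
$ grep '^pos:' /proc/16021/fdinfo/3   # raw byte offset of the descriptor
$ awk -v size="$(stat -c %s log.txt)" '/^pos:/ {printf "%.1f%%\n", $2 * 100 / size}' /proc/16021/fdinfo/3
The second command divides the offset by the file size to show the percentage read so far.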

Instead of cat:
pv log.txt | sort | uniq | wc -l
Piping with pv:
SIZE=$( ls -l log.txt | awk '{print $5}'); cat log.txt | sort | pv -s $SIZE | uniq | wc -l
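A variant of the same pipeline that avoids parsing ls output, assuming GNU stat is available:
cat log.txt | sort | pv -s "$(stat -c %s log.txt)" | uniq | wc -l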

If the example is truly your use case, then I'd recommend pipe viewer (pv).

Related

cat: pid.txt: No such file or directory

I have a problem with cat. I want to write a script doing the same thing as ps -e. In pid.txt I have the PIDs of running processes.
ls /proc/ | grep -o "[0-9]" | sort -h > pid.txt
Then I want to use $line as part of the path to cmdline for every PID.
cat pid.txt | while read line; do cat /proc/$line/cmdline; done
I tried a for loop too:
for id in 'ls /proc/ | grep -o "[0-9]\+" | sort -h'; do
cat /proc/$id/cmdline;
done
I don't know what I'm doing wrong. Thanks in advance.
I think what you're after is this - there were a few flaws with all of your approaches (or did you really just want to look at processes with single-digit PIDs?):
for pid in $(ls /proc/ | grep -E '^[0-9]+$' | sort -h); do cat /proc/${pid}/cmdline | tr '\x00' '\n'; done
You seem to be in a different current directory when running cat pid.txt... command compared to when you ran your ls... command. Run both your commands on the same terminal window, or use absolute path, like /path/to/pid.txt
Other than that error, you might want to remove -o from your grep command, since with -o the pattern [0-9] prints each matching digit on its own line - for example, PID 423 comes out as the three lines 4, 2 and 3. @Roadowl also pointed that out already.
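As an aside, a shell glob avoids parsing ls output entirely; a minimal sketch of the same loop (order is lexicographic rather than numeric):
for d in /proc/[0-9]*/; do tr '\x00' ' ' < "${d}cmdline"; echo; done
The glob matches only directory names beginning with a digit, which in /proc are exactly the per-process directories.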

How do I grep multiple lines (output from another command) at the same time?

I have a Linux driver running in the background that is able to return the current system data/stats. I view the data by running a console utility (let's call it dump-data) in a console. All data is dumped every time I run dump-data. The output of the utility looks like this:
Output:
- A=reading1
- B=reading2
- C=reading3
- D=reading4
- E=reading5
...
- variableX=readingX
...
The list of readings returned by the utility can be really long. Depending on the scenario, certain readings would be useful while everything else would be useless.
I need a way to grep only the useful readings, whose names might have nothing in common (via a bash script). I.e. sometimes I'll need to collect A, D, E; and other times I'll need C, D, E.
I'm attempting to graph the readings over time to look for trends, so I can't run something like this:
# forgive my pseudocode
Loop
dump-data | grep A
dump-data | grep D
dump-data | grep E
End Loop
to collect A, D, E, as that would actually give me readings from 3 separate calls of dump-data, which would not be accurate.
If you want to save all results of grep in the same file, you can just join all expressions in one:
grep -E 'expr1|expr2|expr3'
But if you want to have the results (for expr1, expr2 and expr3) in separate files, things get more interesting.
You can do this using tee >(command).
For example, here I process the same pipe with three different commands:
$ echo abc | tee >(sed s/a/_a_/ > file1) | tee >(sed s/b/_b_/ > file2) | sed s/c/_c_/ > file3
$ grep "" file[123]
file1:_a_bc
file2:a_b_c
file3:ab_c_
But the command seems to be too complex.
I would rather save the dump-data results to a file and then grep it:
TEMP=$(mktemp /tmp/dump-data-XXXXXXXX)
dump-data > ${TEMP}
grep A ${TEMP}
grep B ${TEMP}
grep C ${TEMP}
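Applied to the hypothetical dump-data utility from the question, the tee trick above could split the wanted readings into separate files from a single run (a sketch, assuming the "- A=reading" output format shown in the question):
dump-data | tee >(grep '^- A=' > A.txt) >(grep '^- D=' > D.txt) | grep '^- E=' > E.txt
tee accepts several process substitutions at once, so a single invocation of dump-data feeds all three greps.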
You can use dump-data | grep -E "A|D|E". Note the -E option of grep. Alternatively you could use egrep without the -E option.
you can simply use:
dump-data | grep -E 'A|D|E'
awk '/MY PATTERN/{print > "matches-"FILENAME;}' myfile{1,3}
This writes every line matching MY PATTERN to a file named matches-<inputfile>, one per input file. Thanks to Guru at Stack Exchange.

What is this Bash (and/or other shell?) construct called?

What is the construct in bash called where you can wrap a command that outputs to stdout, such that the output itself is treated like a stream? In case I'm not describing that so well, maybe an example will do best, and this is what I typically use it for: applying diff to output that does not come from a file, but from other commands, where
cmd
is wrapped as
<(cmd)
By wrapping a command in such a manner, in the example below I determine that there is a difference of one between the two commands that I am running, and then I am able to pinpoint that one precise difference. What is the construct/technique of wrapping a command as <(cmd) called? Thanks
[builder@george v6.5 html]$ git status | egrep modified | awk '{print $3}' | wc -l
51
[builder@george v6.5 html]$ git status | egrep modified | awk '{print $3}' | xargs grep -l 'Ext\.define' | wc -l
50
[builder@george v6.5 html]$ diff <(git status | egrep modified | awk '{print $3}') <(git status | egrep modified | awk '{print $3}' | xargs grep -l 'Ext\.define')
39d38
< javascript/reports/report_initiator.js
ADDENDUM
The revised command using the advice to use git's ls-files should be as follows (untested):
diff <(git ls-files -m) <(git ls-files -m | xargs grep -l 'Ext\.define')
It is called process substitution.
This is called Process Substitution
This is process substitution, as you have been told. I'd just like to point out that this also works in the other direction. Process substitution with >(cmd) allows you to take a command that writes to a file and instead have that output redirected to another command's stdin. It's very useful for inserting something into a pipeline that takes an output filename as an argument. You don't see it as much because pretty much every standard command will write to stdout already, but I have used it often with custom stuff. Here is a contrived example:
$ echo "hello world" | tee >(wc)
hello world
1 2 12
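Another example: a command that insists on writing to a named file, such as sort with its -o option, can be handed a process substitution instead (a sketch; unsorted.txt is a hypothetical input file):
sort -o >(gzip > sorted.gz) unsorted.txt
Here >(gzip > sorted.gz) expands to a /dev/fd/N path connected to gzip's stdin, so the sorted output is compressed without an intermediate file.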

Why is no output shown when using grep twice?

Basically I'm wondering why this doesn't output anything:
tail --follow=name file.txt | grep something | grep something_else
You can assume that it should produce output; I have run another line to confirm:
cat file.txt | grep something | grep something_else
It seems like you can't pipe the output of tail more than once!? Anyone know what the deal is and is there a solution?
EDIT:
To answer the questions so far: the file definitely has contents that should be displayed by the grep. As evidence, if the grep is done like so:
tail --follow=name file.txt | grep something
Output shows up correctly, but if this is used instead:
tail --follow=name file.txt | grep something | grep something
No output is shown.
If at all helpful, I am running Ubuntu 10.04.
You might also run into a problem with grep buffering when inside a pipe.
i.e., you don't see the output from
tail --follow=name file.txt | grep something > output.txt
since grep will buffer its own output.
Use the --line-buffered switch for grep to work around this:
tail --follow=name file.txt | grep --line-buffered something > output.txt
This is useful if you want to get the results of the follow into the output.txt file as rapidly as possible.
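If the buffering command has no flag like --line-buffered, GNU coreutils' stdbuf can impose line buffering from the outside; a sketch (this works for programs that use C stdio's default buffering, as grep does):
tail --follow=name file.txt | stdbuf -oL grep something | grep something_else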
Figured out what was going on here. It turns out that the command was working; it's just that the output took a long time to reach the console (approx 120 seconds in my case). This is because grep's standard output buffer is flushed per block rather than per line when writing to a pipe. So instead of getting every line from the file as it was being written, I would get a giant block every 2 minutes or so.
It should be noted that this works correctly:
tail file.txt | grep something | grep something
It is the following of the file with --follow=name that is problematic.
For my purposes I found a way around it: what I was intending to do was capture the output of the first grep to a file, so the command would be:
tail --follow=name file.txt | grep something > output.txt
A way around this is to use the script command like so:
script -c 'tail --follow=name file.txt | grep something' output.txt
script captures the output of the command and writes it to the file, thus avoiding the second pipe. More precisely, script runs the command attached to a pseudo-terminal, so grep sees a tty rather than a pipe on stdout and falls back to line buffering.
This has effectively worked around the issue for me, and I have explained why the command wasn't working as I expected. Problem solved.
FYI, these other Stack Overflow questions are related:
Trick an application into thinking its stdin is interactive, not a pipe
Force another program's standard output to be unbuffered using Python
You do know that tail starts by default with the last ten lines of the file? My guess is everything the cat version found is well into the past. Try tail -n+1 --follow=name file.txt to start from the beginning of the file.
works for me on Mac without --follow=name
bash-3.2$ tail delme.txt | grep po
position.bin
position.lrn
bash-3.2$ tail delme.txt | grep po | grep lr
position.lrn
grep pattern filename | grep pattern | grep pattern | grep pattern ......

Identify the files opened by a particular process on Linux

I need a script to identify the files opened by a particular process on Linux.
To identify the fds:
$ cd /proc/<PID>/fd; ls | wc -l
I expect to see a list of numbers, which is the list of file descriptor numbers in use by the process. Please show me how to see all the files in use by that process.
Thanks.
The command you probably want to use is lsof. This is a better idea than digging in /proc, since the command is a clearer and more stable way to get system information.
lsof -p pid
However, if you're interested in the /proc approach, you may notice that each file /proc/<pid>/fd/N is a symlink to the file it's associated with. You can read the symlink value with the readlink command. For example, this shows what the shell's stdin is bound to:
$ readlink /proc/self/fd/0
/dev/pts/43
or, to get all files for some process,
ls /proc/<pid>/fd/* | xargs -L 1 readlink
While lsof is nice, you can just do:
ls -l /proc/pidoftheprocess/fd
lsof -p <pid number here> | wc -l
if you don't have lsof, you can do roughly the same using just /proc. For example (the ls lists the process's open file descriptors, and the awk one-liner prints the unique file paths mapped into its memory, skipping anonymous regions such as [heap]):
$ pid=1825
$ ls -1 /proc/$pid/fd/*
$ awk '!/\[/&&$6{_[$6]++}END{for(i in _)print i}' /proc/$pid/maps
You need lsof. To get the PID of the application which opened foo.txt:
lsof | grep foo.txt | awk -F\ '{print $2}'
or what Macmede said to do the opposite (list files opened by a process).
lsof | grep processName
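As a side note, lsof also has a terse mode (-t) that prints only PIDs, which is handy in scripts; a variant of the foo.txt example above:
lsof -t foo.txt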
