Why doesn't "sort file1 > file1" work? - linux

When I try to sort a file and save the sorted output back into the same file, like this:
sort file1 > file1;
the contents of file1 get erased altogether, whereas when I try to do the same with the 'tee' command, like this:
sort file1 | tee file1;
it works fine [ed: "works fine" only for small files with lucky timing; it will lose data on large files or with unhelpful process scheduling], i.e. it overwrites file1 with its own sorted contents and also shows them on standard output.
Can someone explain why the first case is not working?

As other people explained, the problem is that the I/O redirection is done before the sort command is executed, so the file is truncated before sort gets a chance to read it. If you think for a bit, the reason why is obvious - the shell handles the I/O redirection, and must do that before running the command.
The sort command has 'always' (since at least Version 7 UNIX) supported a -o option to make it safe to output to one of the input files:
sort -o file1 file1 file2 file3
The trick with tee depends on timing and luck (and probably a small data file). If you had a megabyte or larger file, I expect it would be clobbered, at least in part, by the tee command. That is, if the file is large enough, the tee command would open the file for output and truncate it before sort finished reading it.
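If you want to see the race for yourself, here is a quick experiment (a sketch; it creates scratch files in the current directory, and the exact outcome depends on timing):
seq 1000000 > big.txt                    # roughly 7 MB of input
sort big.txt | tee big.txt > /dev/null   # tee truncates big.txt while sort is still reading it
wc -l big.txt                            # often far fewer than 1000000 lines survive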

It doesn't work because '>' redirection implies truncation, and to avoid keeping the whole output of sort in memory before redirecting it to the file, bash truncates and redirects output before running sort. Thus, the contents of file1 are truncated before sort has a chance to read them.

It's unwise to depend on either of these commands to work the way you expect.
The way to modify a file in place is to write the modified version to a new file, then rename the new file to the original name:
sort file1 > file1.tmp && mv file1.tmp file1
This avoids the problem of reading the file after it's been partially modified, which is likely to mess up the results. It also makes it possible to deal gracefully with errors; if the file is N bytes long, and you only have N/2 bytes of space available on the file system, you can detect the failure creating the temporary file and not do the rename.
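A slightly more defensive version of the same idea, using mktemp to pick the temporary name (a sketch for a script; creating the temporary file in the same directory keeps the final mv a simple rename on the same file system):
tmp=$(mktemp file1.XXXXXX) || exit 1
if sort file1 > "$tmp"; then
    mv "$tmp" file1      # replace the original only after sort succeeded
else
    rm -f "$tmp"         # sort failed (e.g. out of space); file1 is untouched
fi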
Or you can rename the original file, then read it and write to a new file with the same name:
mv file1 file1.bak && sort file1.bak > file1
Some commands have options to modify files in place (for example, perl and sed both have -i options; note that the syntax of sed's -i option varies between implementations). But these options work by creating temporary files; it's just done internally.
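For example (the s/foo/bar/ substitution is just a placeholder; note the differing backup-suffix syntax):
sed -i.bak 's/foo/bar/g' file1       # GNU sed: edit in place, keep a .bak copy
sed -i '' 's/foo/bar/g' file1        # BSD/macOS sed: the (empty) suffix must be a separate argument
perl -i.bak -pe 's/foo/bar/g' file1  # perl: same idea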

Redirection is handled first. So in the first case, > file1 is processed before sort runs and empties the file.

The first command doesn't work (sort file1 > file1) because, when using the redirection operator (> or >>), the shell creates/truncates the file before the sort command is even invoked; the redirection is set up first.
The second command appears to work (sort file1 | tee file1) because sort reads the lines from the file before writing sorted data to standard output, but see the caveat about timing above.
So with any similar command, you should avoid using the redirection operator when reading from and writing to the same file; use a relevant in-place editor instead (e.g. ex, ed, sed), for example:
ex '+%!sort' -cwq file1
or use other utils such as sponge.
Luckily, sort has the -o parameter, which writes the results to a file (as suggested by @Jonathan), so the solution is straightforward: sort -o file1 file1.

In the first case, bash opens a new, empty file when it sets up the redirection, and then runs sort.
In the second case, tee opens the file after sort has already read the contents.

You can use this method
sort file1 -o file1
This sorts the file and stores the result back in the original file. You can also use this command to remove duplicate lines:
sort -u file1 -o file1
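A quick check (scratch example):
$ printf '%s\n' b a c a b > file1
$ sort -u file1 -o file1
$ cat file1
a
b
c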

Related

AWK very slow when splitting large file based on column value and using close command

I need to split a large log file into smaller ones based on the ID found in the first column. This solution worked wonders and was very fast for months:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"}' ${nome}.all;
Where $nome is a file and directory name.
It was very fast and worked until the log file reached several million lines (a 2+ GB text file); then it started to fail with
"Too many open files"
The solution is indeed very simple, adding the close command:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"; close(dir"/"dir"_"$1".log")}' ${nome}.all;
The problem is that it is now VERY slow; it takes forever to do something that used to be done in seconds, and I need to optimize this.
AWK is not mandatory, I can use an alternative, I just don't know how.
Untested since you didn't provide any sample input/output to test with, but this should do it:
sort -t';' -k1,1 "${nome}.all" |
awk -v dir="$nome" -F\; '$1!=prev{close(out); out=dir"/"dir"_"$1".log"; prev=$1} {print > out}'
Your first script:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"}' ${nome}.all;
had 3 problems:
It wasn't closing the output files as it went, and so it exceeded the open-file limit you ran into.
It had an unparenthesized expression on the right side of output redirection which is undefined behavior per POSIX.
It wasn't quoting the shell variable ${nome} in the file name.
It's worth mentioning that gawk would be able to handle problems 1 and 2 without failing, but it would slow down as the number of open files grew and it had to manage the opens/closes internally.
Your second script:
awk -v dir="$nome" -F\; '{print>dir"/"dir"_"$1".log"; close(dir"/"dir"_"$1".log")}' ${nome}.all;
though it now closed the output file, still had problems 2 and 3 and added 2 new problems:
It was opening and closing the output files once per input line instead of only when the output file name had to change.
It was overwriting the output file for each $1 for every line written to it instead of appending to it.
The above assumes you have multiple lines of input for each $1 and so each output file will have multiple lines. Otherwise the slow down you saw when closing the output files wouldn't have happened.
The above sort could rearrange the order of input lines for each $1. If that's a problem add -s for "stable sort" if you have GNU sort or let us know as it's easy to work around with POSIX tools.
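For comparison, here is a sketch of the per-line close() approach with those problems fixed (append with >>, the output name assigned to a variable, and the shell variable quoted); it avoids losing data but is still much slower than the sort-based version, because it reopens a file for every input line:
awk -v dir="$nome" -F\; '{
    out = dir "/" dir "_" $1 ".log"
    print >> out     # append, so reopening after close() does not clobber earlier lines
    close(out)
}' "${nome}.all"
Note that because of >>, you would want to remove any output files left over from a previous run before rerunning it.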

linux - output changes on terminal when file changes

In an open terminal, how can I see all the new content added to a file whenever a process writes data to it?
I've tried combinations of cat and tee, but with no success.
Use tail with -f
tail -f filename
Taken from the man pages for tail:
-f, --follow[={name|descriptor}]
output appended data as the file grows;
an absent option argument means 'descriptor'
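For example, with a throwaway file just for illustration:
( while true; do date; sleep 1; done ) >> /tmp/demo.log &   # something appending to the file
tail -f /tmp/demo.log                                       # new lines appear as they are written
# (kill %1 afterwards to stop the background writer)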
You cannot do it with cat; you should use tail -f <filename>, or less <filename> and press F to wait for new data.
$ man less
...
F Scroll forward, and keep trying to read when the end of file is reached. Normally this
command would be used when already at the end of the file. It is a way to monitor the
tail of a file which is growing while it is being viewed. (The behavior is similar to
the "tail -f" command.)
...
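You can also start less in follow mode directly, which is equivalent to pressing F right away:
less +F filename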

How do I update a file using commands run against the same file?

As an easy example, consider the following command:
$ sort file.txt
This will output the file's data in sorted order. How do I put that data right back into the same file? I want to update the file with the sorted results.
This is not the solution:
$ sort file.txt > file.txt
... as it will cause the file to come out blank. Is there a way to update this file without creating a temporary file?
Sure, I could do something like this:
sort file.txt > temp.txt; mv temp.txt file.txt
But I would rather keep the results in memory until processing is done, and then write them back to the same file. sort actually has a flag that makes this possible:
sort file.txt -o file.txt
...but I'm looking for a solution that doesn't rely on the binary having a special flag to account for this, as not all are guaranteed to. Is there some kind of linux command that will hold the data until the processing is finished?
For sort, you can use the -o option.
For a more general solution, you can use sponge, from the moreutils package:
sort file.txt | sponge file.txt
As mentioned below, error handling here is tricky. You may end up with an empty file if something goes wrong in the steps before sponge.
This is a duplicate of this question, which discusses the solutions above: How do I execute any command editing its file (argument) "in place" using bash?
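If you really do want to keep the result in memory, as the question asks, a plain shell variable also works for ordinary text without temporary files or extra packages (a sketch; command substitution strips trailing blank lines and cannot hold NUL bytes):
sorted=$(sort file.txt) && printf '%s\n' "$sorted" > file.txt   # write back only if sort succeeded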
You can do it with sed (with its r command), and Process Substitution:
sed -ni r<(sort file) file
In this way, you're telling sed not to print the (original) lines (-n option) and to append the file generated by <(sort file).
The well-known -i option is what makes the edit happen in place.
Example
$ cat file
b
d
c
a
e
$ sed -ni r<(sort file) file
$ cat file
a
b
c
d
e
Try the vim way:
$ ex -s +'%!sort' -cxa file.txt
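A breakdown of what the flags do (as interpreted by vim's ex mode; a sketch, see vim's :help for details):
# -s          silent ("batch") mode, no full-screen UI
# +'%!sort'   after the file is loaded, filter every line (%) through sort(1)
# -cxa        then run :xa, which writes the changed buffer back and exits
ex -s +'%!sort' -cxa file.txt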

Bash Sorting Redirection

What are the differences between sort file1 -o file2 and sort file1 > file2? From what I have tried so far they do the same thing, but perhaps I'm missing something.
The following two commands behave the same as long as file1 and file2 are different files.
sort file1 -o file2 # Output redirection within sort command
sort file1 > file2 # Output redirection via shell
Let's see what happens when the input and output are the same file, i.e. you try to sort in place:
sort file -o file # Works perfectly fine and does in-place sorting
sort file > file # Surprise! Generates empty file. Data is lost :(
In summary, the two redirection methods above are similar but not the same.
Test
$ cat file
2
5
1
4
3
$ sort file -o file
$ cat file
1
2
3
4
5
$ sort file > file
$ cat file
$ ls -s file
0 file
The result is the same, but in the case of -o file2 the resulting file is created by sort directly, while in the other case it is created by bash and filled with the standard output of sort. The xfopen function defined on line 450 of sort.c in coreutils treats both cases (stdout and -o filename) equally.
Redirecting the standard output of sort is more generic, as it can be sent to another program with a | in place of a >, which the -o option makes more difficult to do (but not impossible).
The -o option is handy for in-place sorting, as redirecting to the same file leads to a truncated file: it is created (and truncated) by the shell prior to the invocation of sort.
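To illustrate the composability point (the /dev/stdout form is Linux-specific and shown only as a sketch):
sort file1 | uniq -c | sort -rn > counts.txt      # > composes naturally with a pipeline
sort -o /dev/stdout file1 | uniq -c               # -o needs a detour through /dev/stdout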
There is not much difference: > is standard Unix output redirection, that is to say, 'write the output that you would otherwise display on the terminal to the given file'. The -o option is specific to the sort command; it is another way of saying 'write the output to this given file'.
The > can be used where a tool does not have a dedicated write-to-file argument or option.

grep based on blacklist -- without procedural code?

It's a well-known task, simple to describe:
Given a text file foo.txt, and a blacklist file of exclusion strings, one per line, produce foo_filtered.txt that has only the lines of foo.txt that do not contain any exclusion string.
A common application is filtering compiler warnings from a build log while ignoring warnings in files that are not yours. The file foo.txt is the warnings file (itself filtered from the build log), and the blacklist file excluded_filenames.txt holds the file names to exclude, one per line.
I know how it's done in procedural languages like Perl or AWK, and I've even done it with combinations of Linux commands such as cut, comm, and sort.
But I feel that I should be really close with xargs, and just can't see the last step.
I know that if excluded_filenames.txt has only 1 file name in it, then
grep -v "`cat excluded_filenames.txt`" foo.txt
will do it.
And I know that I can get the filenames one per line with
xargs -L1 -a excluded_filenames.txt
So how do I combine those two into a single solution, without explicit loops in a procedural language?
Looking for the simple and elegant solution.
You should use the -f option, which reads the patterns from a file:
grep -vf excluded_filenames.txt foo.txt
You could also use -F which is more directly the answer to what you asked:
grep -vF "`cat excluded_filenames.txt`" foo.txt
from man grep
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
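The two options also combine, which is usually what you want for a blacklist of file names, since -F stops dots and other regex metacharacters in the names from being treated specially:
grep -vFf excluded_filenames.txt foo.txt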
