Where is the name of the file supplied from in a piped tar command?

Consider the command:
(cd /source/directory && tar cf - . ) | (cd /dest/directory && tar xpvf -)
This is how the command is explained in the documentation it comes from:
(cd /source/directory && tar cf - . ) | (cd /dest/directory && tar xpvf -)
# Move entire file tree from one directory to another
# [courtesy Alan Cox <a.cox#swansea.ac.uk>, with a minor change]
# 1) cd /source/directory
# Source directory, where the files to be moved are.
# 2) &&
# "And-list": if the 'cd' operation successful,
# then execute the next command.
# 3) tar cf - .
# The 'c' option to the 'tar' archiving command creates a new archive,
# the 'f' (file) option, followed by '-', designates the target file
# as stdout, and it does this for the current directory tree ('.').
# 4) |
# Piped to ...
# 5) ( ... )
# a subshell
# 6) cd /dest/directory
# Change to the destination directory.
# 7) &&
# "And-list", as above
# 8) tar xpvf -
# Unarchive ('x'), preserve ownership and file permissions ('p'),
# and send verbose messages to stdout ('v'),
# reading data from stdin ('f' followed by '-').
#
# Note that 'x' is a command, and 'p', 'v', 'f' are options.
#
# Whew!
There are a couple of things I don't understand in the explanation given above:
In the third step, it states that 'f -' designates the target file as standard output, but nothing is being output yet while the archive is created. Where is the name of the file supplied from?
In the eighth step, it states that tar reads data from standard input, but I didn't give any input. Is there any input left in the stream?
The command works fine, but I am a little confused about how it works.

The explanation you quoted from the code is remarkably good. I wish every script I read (or wrote!) was documented so well.
In the 3rd step it states 'f -' designates the target file as stdout, but nothing is being output yet while the archive is created. Where is the name of the file supplied from?
There is no file name. The archive data are written to stdout, the process's standard output stream. If that were not piped into another program then it would be displayed on the screen.
In the 8th step it states it reads data from stdin, but I didn't give any input. Is there any input left in the stream?
The output (to its stdout) of the first tar command is piped into (the stdin of) the second tar command, as mentioned at step 4 of the documentation. You can't give the second tar any input directly, because it is reading its input from the pipe, not the keyboard or any regular file.
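To see this concretely, you can run the producing half on its own; a minimal sketch (the paths are just examples):
cd /source/directory
tar cf - . > /tmp/tree.tar   # 'f -' sends the archive to stdout; '>' captures it in a file
tar cf - . | wc -c           # or just count the bytes the pipe would carry
Both variants show that the archive goes to the standard output stream, never to a named file.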

In the 3rd step it states 'f -' designates the target file as stdout, but nothing is being output yet while the archive is created. Where is the name of the file supplied from?
stdin and stdout are two data streams that the OS creates for every process (every command that is run). A process may or may not use these streams, but the OS creates them anyway.
stdout is where the first process writes its output.
These in-memory data streams need no file name; they are not regular files on disk.
In the 8th step it states it reads data from stdin, but I didn't give any input. Is there any input left in the stream?
There are two processes separated by a pipe |, and you can visualize the data flow as:
(cd /source/directory && tar cf - . ) -> [stdout] -> | -> [stdin] -> (cd /dest/directory && tar xpvf -)
So the process on the left-hand side writes its output to its own stdout. The pipe | is OS-level plumbing that pumps data from the stdout of the previous process into the stdin of the next process. The process on the right-hand side reads data from its stdin.
Also, like the pipe | pattern, the dash - is a common CLI convention: supplied in place of a file name, it tells the command to read from stdin or write to stdout instead of using a file.
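For instance (any directory will do):
echo hello | cat -                 # cat reads from stdin because its file argument is '-'
tar cf - . | gzip > tree.tar.gz    # tar writes the archive to stdout; gzip compresses the stream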

The first parenthesized group changes directory to /source/directory/, generates a tar archive whose content is the current directory ".", and sends it to its standard output.
The second parenthesized group changes directory to /dest/directory/ and extracts there the archive it reads from its standard input.
I.e., you tar the content of "/source/directory" and you untar it in "/dest/directory" without using an intermediate file, just a pipe "|" to make the junction between the two commands.
NB: each set of parentheses creates a subshell, so you have one subprocess executing the tar c and another executing the tar x, running at the same time, the output of the first being fed to the second.
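As an aside, GNU tar can change directory itself with the -C option, so the same copy can be written without the explicit subshell cds (this assumes GNU tar; the subshell form works with any tar):
tar -C /source/directory -cf - . | tar -C /dest/directory -xpvf -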

Node.js delete first N bytes from a file

How do I delete (remove/trim) N bytes from the beginning of a binary file without loading it into memory?
We have fs.ftruncate(fd, len, callback), which cuts bytes off the end of the file (if the file is larger than len).
How do I cut bytes from the beginning, i.e. trim from the beginning, in Node.js without reading the file into memory?
I need something like truncateFromBeginning(fd, len, callback) or removeBytes(fd, 0, N, callback).
If that is not possible, what is the fastest way to do it with file streams?
On most filesystems you can't "cut" a part out of the beginning or the middle of a file; you can only truncate it at the end.
With the above in mind, I imagine we probably have to open the input file stream, seek to just after the Nth byte, and pipe the rest of the bytes to an output file stream.
You're asking for an OS file system operation: the ability to remove some bytes from the beginning of a file in place, without rewriting the file.
You're asking for a file system operation that does not exist, at least in Linux / FreeBSD / MacOS / Windows.
If your program is the only user of the file and it fits in RAM, your best bet is to read the whole thing into RAM, then reopen the file for writing, then write out the part you want to keep.
Or you can create a new file. Let's say your input file is called q. Then you'd create a file called, say, new_q, with a stream attached. You'd pipe the contents you wanted to the new file. Then you'd unlink (delete) the input file q and rename the output file new_q to q.
Careful: this unlink / rename operation will create a short time when no file named q is available. So if some other program tries to open it and doesn't find it, it should try again a few times.
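For what it's worth, the same copy-then-rename idea is a two-liner in the shell; a minimal sketch, assuming the file is called q and N is 1024:
N=1024
tail -c +"$((N + 1))" q > new_q   # copy everything after the first N bytes
mv new_q q                        # rename the result over the original
Renaming over the old name, rather than unlinking first, avoids the window with no file present, because rename(2) replaces the target atomically.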
If you're creating a queueing scheme, you might consider using something other than a file to hold your queue data. This read / rewrite / unlink / rename sequence has lots of ways it can go wrong under heavy load. (Ask me how I know, when you have a couple of hours to spare. ;-) Redis is worth a look.
I decided to solve the problem in bash.
The script truncates the files in a temp folder first, then moves them back to the original folder.
The truncate is done with tail:
tail --bytes="$max_size" "$from_file" > "$to_file"
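tail --bytes="$max_size" keeps the last $max_size bytes of the file, which is exactly the "drop the beginning" behavior needed here. A quick sanity check (GNU tail assumed):
printf '0123456789' | tail --bytes=4   # prints 6789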
The full script:
#!/bin/bash
declare -r store="/my/data/store"
declare -r temp="/my/data/temp"
declare -r max_size=$(( 200000 * 24 ))
or_exit() {
    local exit_status=$?
    local message=$*

    if [ $exit_status -gt 0 ]
    then
        echo "$(date '+%F %T') [$(basename "$0" .sh)] [ERROR] $message" >&2
        exit $exit_status
    fi
}
# Checks if there are any files in 'temp'. It should be empty.
! ls "$temp/"* &> '/dev/null'
or_exit 'Temp folder is not empty'
# Loops over all the files in 'store'
for file_path in "$store/"*
do
    # Trims files bigger than 'max_size' from 'store' into 'temp'
    if [ "$( stat --format=%s "$file_path" )" -gt "$max_size" ]
    then
        # Truncates the file into the temp folder
        tail --bytes="$max_size" "$file_path" > "$temp/$(basename "$file_path")"
        or_exit "Cannot tail: $file_path"
    fi
done
unset -v file_path
# If there are files in 'temp', move all of them back to 'store'
if ls "$temp/"* &> '/dev/null'
then
    # Moves all the truncated files back to the store
    mv "$temp/"* "$store/"
    or_exit 'Cannot move files from temp to store'
fi

How to empty a set of files using a wildcard before adding stuff back to them?

I have a set of .txt files named my_file_1.txt, my_file_2.txt, ..., my_file_n.txt, where n is a finite integer. As my Python code runs (in a directory with path ~/simulation/some_code), it appends data to these files using the following for loop:
for realization in np.arange(1, n + 1):
    # Identify the file path
    some_name = 'output/{}/Info/{}/{}/parameter_{:.3f}/my_file_{}.txt'.format(size, name, status, value, realization)
    # do some stuff
    with open(some_name, "a") as filename:
        print('{}'.format(some_list), file=filename)
        # no explicit close() needed; 'with' closes the file automatically
However, these files are not empty to begin with and need to be emptied. To do so, I run the following line ahead of time (from the home directory ~/, which is two levels up from the directory of the code above) to make sure the files are empty before being appended to:
os.system('> output/{}/Info/{}/{}/parameter_{:.3f}/my_file_*.txt'.format(size, name, status, value))
I expected the * symbol to act as a wildcard and empty all the matching text files, but instead the files keep accumulating data from previous runs. Am I using the * wildcard incorrectly? Is this fixable without changing the paths in my code?
Your understanding of the wildcard is correct. The mistake is in the redirection: a redirection (>) can only target a single file. You can use the program tee to write stdout to multiple files, like this (echo -n echoes nothing):
echo -n | tee *
The pipe | passes the stdout of echo -n to the stdin of tee.
The shell then expands the wildcard to all files in the directory, so the command becomes:
echo -n | tee my_file_1.txt my_file_2.txt ... my_file_n.txt
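To see the difference, compare the two in a directory that already contains the files; a small sketch (the error text is bash's):
> my_file_*.txt                # bash: "ambiguous redirect" -- nothing is emptied
echo -n | tee my_file_*.txt    # truncates every matching file
wc -c my_file_*.txt            # each file now reports 0 bytes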

How to use the sed command to delete lines without a backup file?

I have a large file, 130 GB in size:
# ls -lrth
-rw-------. 1 root root 129G Apr 20 04:25 syslog.log
I need to reduce the file size by deleting the lines which start with "Nov 2", so I ran the following command:
sed -i '/Nov 2/d' syslog.log
The file is too large to edit in the vim editor either.
When I run the sed command, it creates a temporary copy of the file as well, but I don't have that much free space on root. Please suggest an alternative way to delete these lines from the file without needing extra space on the server.
It does not create a real backup file. sed is a stream editor. When applied to a file with option -i, it streams that file through the sed process, writes the output to a new (temporary) file, and when everything is done, renames the new file to the original name.
(There are options to create backup files also, but you didn't give them, so I won't mention that further.)
In your case you have a very large file and you don't want to create any copy, however temporary. For this, you need to open the file for reading and writing at the same time; then your sed process can overwrite the original. Afterwards you will have to truncate the file to the length of what was written.
To demonstrate how this can be done, we first perform a test case.
Create a test file, containing lots of lines:
seq 0 999999 > x
Now, lets say we want to remove all lines containing the digit 4:
grep -v 4 1<>x <x
This will open the file for reading and writing as STDOUT (1), and for reading as STDIN. The grep command will read all lines and will output only the lines not containing a 4 (option -v).
This will effectively overwrite the beginning of the original file.
You will not know in advance how long the output is, so after the new output ends, the remainder of the original contents will still be there:
…
999991
999992
999993
999995
999996
999997
999998
999999
537824
537825
537826
537827
537828
537829
…
You can use the Unix tool truncate to shorten your file manually afterwards. In a real scenario you will have trouble finding the right spot for this, so it makes sense to count the number of bytes written (using wc):
(Don't forget to recreate the original x for this test.)
(grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c
This will perform the step above and additionally print the number of bytes written to the terminal; in this example the output will be 3653658. Now use truncate:
truncate -s 3653658 x
Now you have the result you want.
If you want to do this in a script, i. e. without interaction, you can use this:
length=$( (grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c )
truncate -s "$length" x
I cannot guarantee that this will work for files >2 GB or >4 GB on your machine; depending on your operating system (32-bit?) and the versions of the installed tools, you might run into large-file issues. I'd perform tests with large files first (>4 GB, as this is a typical limit for many things); then cross your fingers and give it a try :)
Some caveats you have to keep in mind:
Of course, nobody is supposed to append log entries to that log file while the procedure is running.
Also, any abort during the running of the process (power failure, signal caught, etc.) will leave the file in an undefined state. But re-running the command again after such a mishap will in most cases produce the correct output; some lines might be doubled, but not more than a single line should be corrupted then.
The output must be smaller than the input, of course, otherwise the writing will overtake the reading, corrupting the whole result so that lines which should be there will be missing (or truncated at the start).
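Applied to the original problem, the same pattern would look like this, as a sketch (untested at 130 GB; make sure nothing appends to the log while it runs):
length=$( (sed '/Nov 2/d' <syslog.log | tee /dev/stderr 1<>syslog.log) |& wc -c )
truncate -s "$length" syslog.log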

How do you format an output string in a bash script for input by another script?

I need to unzip a bunch of student assignment (jar) files so that I can use a script to submit the contents to the Moss (Stanford) plagiarism detection server. I did the same thing in Java, which was trivial, but I'm trying to re-implement it as a bash script.
I am trying to do the following:
1. Get a list of student names (each student has a directory).
2. In each student directory, sub-directories exist numbered from 1 to the latest submission. I need to get the directory with the highest number.
3. Each of those submission directories contains a jar file that I need. I copy each jar into a temp directory with the same name as the student and unzip it.
4. I need that temp directory listing formatted as a string in the form /tempDir/studentName1/*.languageExt /tempDir/studentName2/*.languageExt
The student directory has the basic structure:
Student_Root_Directory:
    Student1
    Student2
Student1
    Sub-Directories: 1 2 3 4 5
    1: student1.jar
    2: student1.jar
    ...
Student2
    Sub-Directories: 1 2 3
    1: student2.jar
    ...
To do the first 3 steps above I did:
#!/bin/bash
# Extract all jar files into a temp directory called /home/moss/tempJarFiles/studentName
# $1 is the command line argument that contains the path to the institution submission dir.
# $2 is the language extension: .c, .cpp, .java, .py
students=`ls $1`
student_dir=$1
languageExt=$2
mossDir="/home/moss"
tempDir="/home/moss/tempJarStorage"
for student in $students
do
    latestSubmissionDir=`ls -t $student_dir/$student | head -1`
    for jarDir in $latestSubmissionDir
    do
        mkdir $tempDir/$student
        cp $student_dir/$student/$jarDir/*.jar $tempDir/$student
        unzip -d $tempDir/$student/ -o -j $tempDir/$student/$student.jar *.$languageExt
        rm $tempDir/$student/$student.jar
    done
done
...which results in a number of student directories being created in a temp directory that contains only the unzipped contents for the student submissions.
I need the ls output of the new temp directories formatted as a string that contains:
/tempDir/studentName1/\*.languageExt /tempDir/studentName2/\*.languageExt
I have tried variations on
find "$tempDir" -iname "*.$languageExt" -printf "%p/*.$languageExt"
with and without -iname, but I either get output that contains extra directory information, such as $tempDir/*.languageExt (when I just need the subdirectories, $tempDir/$studentName/*.languageExt), or output where the path of every source file is listed individually, such as:
$tempDir/$studentName/studentNameA.java
$tempDir/$studentName/studentNameB.java
when I only need
$tempDir/$studentName/*.java
I think this should be really easy and I'm just over thinking it. Any hints for improving the script also appreciated.
Here's a revised version of the script that may work:
#!/bin/bash
# Extract all jar files into a temp directory called /home/moss/tempJarFiles/studentName
# $1 is the command line argument that contains the path to the institution submission dir.
# $2 is the language extension: c, cpp, java, py

students_dir=$1
languageExt=$2

studentPathsT=( "$students_dir"/*/ )

mossDir='/home/moss'
tempDir='/home/moss/tempJarStorage'

for studentPathT in "${studentPathsT[@]}"; do
    student=$(basename "$studentPathT")
    mkdir "$tempDir/$student"

    submissionDirsT=( "$studentPathT"*/ )
    latestSubmissionDirT=${submissionDirsT[${#submissionDirsT[@]} - 1]}

    cp "$latestSubmissionDirT"*.jar "$tempDir/$student/"
    unzip -d "$tempDir/$student/" -o -j "$tempDir/$student/*.jar" "*.$languageExt"
    rm "$tempDir/$student"/*.jar
done
# Note that at this point `"$tempDir"/*/*.$languageExt` would expand
# to all extracted submission files, across all students.
# Finally, output each student's extracted files as an unexpanded glob à la
# /{tempDir}/{studentName1}/*.{languageExt}
for pT in "$tempDir"/*/; do
    echo "$pT*.$languageExt"
    # Note: If there is a chance that your filenames contain
    # embedded newlines (rare in practice), using `echo` won't work properly,
    # as @Charles Duffy points out.
    # If that is a concern, use
    #     printf '%s\0' "$pT*.$languageExt"
    # and process the output with a utility that can handle NUL characters
    # as separators, such as `xargs -0`.
done
It avoids using ls and only uses pathname expansion and array variables so as to properly deal with paths that contain embedded spaces and other shell metacharacters.
The suffix ...T in variable names indicates that a particular path or array of paths is Terminated, i.e., that it ends in a /.
The assumption is that the numbered subdirectories do not go beyond 9, as the implicit lexical sorting of pathname expansion is relied upon; if the numbers go higher, explicit numerical sorting must be applied.
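For example, a sketch of picking the latest submission numerically instead (assumes GNU sort, whose -V option compares embedded numbers numerically):
latestSubmissionDirT=$(printf '%s\n' "$studentPathT"*/ | sort -V | tail -n 1)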
Note that the globs (pathname patterns) passed to unzip are intentionally double-quoted, as they should be interpreted by unzip, not the shell.
Note that, based on your original code, I've assumed that $languageExt does NOT start with . (e.g., cpp rather than .cpp), despite what your comment says.
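As a usage sketch (the script name and course path here are assumptions, not from the original post):
files=$(./collectSubmissions.sh /home/courses/cs101/submissions java)
printf '%s\n' "$files"
# /home/moss/tempJarStorage/studentName1/*.java
# /home/moss/tempJarStorage/studentName2/*.java
The captured string, with its globs still unexpanded, can then be handed to the Moss submission script.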

What does the '-' operator actually do in Linux?

I see the - operator behaving in different ways with different commands.
For example,
cd -
cds to the previous directory, whereas,
vim -
reads from stdin
So I want to know why the - operator behaves in two different ways here. Can someone point me to some detailed documentation of the - operator?
It is not an operator, it is an argument. When you write a program in C or C++ it comes as argv[1] (when it is the first argument) and you can do whatever you like with it.
By convention, many programs use - as a placeholder for stdin where an input file name is normally required, and for stdout where an output file name is expected. But cd does not read a file stream, so why would it need stdin or stdout?
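For example, the two behaviors from the question side by side:
cd /tmp && cd /etc && cd -    # for cd, '-' means "the previous directory" (/tmp here)
echo hello | vim -            # for vim, '-' means "read the buffer from stdin"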
Extra: below is an excerpt from vim's main.c that parses arguments beginning with -; if there is no additional character, it activates stdin input.
else if (argv[0][0] == '-' && !had_minmin)
{
    want_argument = FALSE;
    c = argv[0][argv_idx++];
#ifdef VMS
    ...
#endif
    switch (c)
    {
    case NUL:   /* "vim -"  read from stdin */
                /* "ex -"   silent mode */
        if (exmode_active)
            silent_mode = TRUE;
        else
        {
            if (parmp->edit_type != EDIT_NONE)
                mainerr(ME_TOO_MANY_ARGS, (char_u *)argv[0]);
            parmp->edit_type = EDIT_STDIN;
            read_cmd_fd = 2;   /* read from stderr instead of stdin */
        }
The dash on its own is a simple command argument. Its meaning is command dependent. Its two most usual meanings are 'standard input' or (less often) 'standard output'. The meaning of 'previous directory' is unique to the cd shell built-in (and it only means that in some shells, not all shells).
cat file1 - file2 | troff ...
This means read file1, standard input, and file2 in that sequence and send the output to troff.
An extreme case of using - to mean 'standard input' or 'standard output' comes from (GNU) tar:
generate_file_list ... |
tar -cf - -T - |
( cd /some/where/else; tar -xf - )
The -cf - options in the first tar mean 'create an archive' and 'the output file is standard output'; the -T - option means 'read the list of files and/or directories from standard input'.
The -xf - options in the second tar mean 'extract an archive' and 'the input file is standard input'. In fact, GNU tar has an option -C /some/where/else which means it does the cd itself, so the whole command could be:
generate_file_list ... |
tar -cf - -T - |
tar -xf - -C /some/where/else
The net effect of this is to copy the files named by the generate_file_list command from under the 'current directory' to /some/where/else, preserving the directory structure. (The 'current directory' has to be taken with a small pinch of salt; any absolute file names are given special treatment by GNU tar — it removes the leading slash — and relative names are taken as relative to the current directory.)
It depends on the program it's being used in. It means different things to different programs.
Different programs use different conventions. The man pages show how each program interprets -. Here is man bash:
-      Expands to the current option flags as specified upon invocation,
       by the set builtin command, or those set by the shell itself
       (such as the -i option).
and man vim
- The file to edit is read from stdin. Commands are read from stderr,
which should be a tty.
