How can I delete (remove / trim) N bytes from the beginning of a binary file without loading it into memory?
We have fs.ftruncate(fd, len, callback), which cuts bytes off the end of the file (if the file is larger than len).
How can I cut bytes from the beginning, or trim from the beginning, in Node.js without reading the file into memory?
I need something like truncateFromBeginning(fd, len, callback) or removeBytes(fd, 0, N, callback).
If it is not possible, what is the fastest way to do it with file streams?
On most filesystems you can't "cut" a part out of the beginning or the middle of a file; you can only truncate it at the end.
With that in mind, I imagine we would have to open an input stream, seek to just past the Nth byte, and pipe the remaining bytes to an output file stream.
You're asking for an OS file system operation: the ability to remove some bytes from the beginning of a file in place, without rewriting the file.
You're asking for a file system operation that does not exist, at least in Linux / FreeBSD / MacOS / Windows.
If your program is the only user of the file and it fits in RAM, your best bet is to read the whole thing into RAM, then reopen the file for writing, then write out the part you want to keep.
Or you can create a new file. Let's say your input file is called q. Then you'd create a file called, say, new_q with a stream attached. You'd pipe the contents you want to keep into the new file. Then you'd unlink (delete) the input file q and rename the output file new_q to q.
Careful: this unlink / rename operation will create a short time when no file named q is available. So if some other program tries to open it and doesn't find it, it should try again a few times.
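In shell terms, a minimal sketch of that copy-and-swap (assuming the file is called q and N is the number of bytes to drop; tail -c +K starts output at byte K):
N=1024                            # number of bytes to drop from the front (an assumed value)
tail -c "+$((N + 1))" q > new_q   # copy everything from byte N+1 onward into new_q
rm q                              # unlink the original
mv new_q q                        # rename the new file into place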
If you're creating a queueing scheme, you might consider using some other mechanism to hold your queue data. This file read / rewrite / unlink / rename sequence has lots of ways it can go wrong on you under heavy load. (Ask me how I know that when you have a couple of hours to spare. ;-)) Redis is worth a look.
I decided to solve the problem in bash.
The script truncates the files in a temp folder first, then moves them back to the original folder.
The trimming is done with tail, which keeps only the last "$max_size" bytes of each file (that is, it drops the excess from the beginning):
tail --bytes="$max_size" "$from_file" > "$to_file"
The full script:
#!/bin/bash

declare -r store="/my/data/store"
declare -r temp="/my/data/temp"
declare -r max_size=$(( 200000 * 24 ))

or_exit() {
    local exit_status=$?
    local message=$*

    if [ $exit_status -gt 0 ]
    then
        echo "$(date '+%F %T') [$(basename "$0" .sh)] [ERROR] $message" >&2
        exit $exit_status
    fi
}

# Checks if there are any files in 'temp'. It should be empty.
! ls "$temp/"* &> '/dev/null'
or_exit 'Temp folder is not empty'

# Loops over all the files in 'store'
for file_path in "$store/"*
do
    # Trim files bigger than 'max_size' from 'store' into 'temp'
    if [ "$( stat --format=%s "$file_path" )" -gt "$max_size" ]
    then
        # Truncates the file into the temp folder
        tail --bytes="$max_size" "$file_path" > "$temp/$(basename "$file_path")"
        or_exit "Cannot tail: $file_path"
    fi
done
unset -v file_path

# If there are files in 'temp', move all of them back to 'store'
if ls "$temp/"* &> '/dev/null'
then
    # Moves all the truncated files back to the store
    mv "$temp/"* "$store/"
    or_exit 'Cannot move files from temp to store'
fi
I'm trying to display the output of an AWS lambda that is being captured in a temporary text file, and I want to remove that file as I display its contents. Right now I'm doing:
... && cat output.json && rm output.json
Is there a clever way to combine those last two commands into one command? My goal is to make the full combined command string as short as possible.
This works for cases where:
- it is possible to control the name of the temporary text file, and
- the file is not used by other code.
In that case you can pass "/dev/stdout" as the name of the output file.
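For example (a hedged sketch; my-function is a placeholder, and aws lambda invoke takes the output file name as its last positional argument):
aws lambda invoke --function-name my-function /dev/stdout
The payload then goes straight to the terminal instead of into output.json; note that the CLI's own invoke metadata is also printed on stdout, so the two may interleave.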
Regarding portability, see the Stack Exchange question "how portable ... /dev/stdout".
POSIX 7 says they are extensions.
Base Definitions, Section 2.1.1 Requirements:
The system may provide non-standard extensions. These are features not required by POSIX.1-2008 and may include, but are not limited to:
[...]
• Additional character special files with special properties (for example, /dev/stdin, /dev/stdout, and /dev/stderr)
Using the mandatorily supported /dev/tty would force output to the "current" terminal, making it impossible to pipe the output of the whole command into a different program (or a log file), or to use the program when there is no connected terminal (cron jobs or other automation tools).
No, you cannot easily remove the lines of a file as you display them. It would be highly inefficient, as it would require removing characters from the beginning of the file each time you read a line. Current filesystems are pretty good at truncating a file at its end, but not at its beginning.
A simple but extremely slow method would look like this:
while [ -s output.json ]
do
    head -1 output.json
    sed -i 1d output.json
done
While this algorithm is plain and simple, you should know that each time you remove the first line with sed -i 1d, it copies the whole content of the file except the first line into a temporary file, resulting in approximately 0.5*n² lines written in total (where n is the number of lines in your file).
In theory you could avoid this by doing something like this:
while [ -s output.json ]
do
    line=$(head -1 output.json)
    printf -- '%s\n' "$line"
    fallocate -c -o 0 -l $(( ${#line} + 1 )) output.json
done
But this does not account for variable newline characters (namely DOS-formatted newlines) and fallocate does not always work on xfs, among other issues.
Since you are trying to consume a file alongside its creation without leaving a trace of its existence on disk, you are essentially asking for a pipe functionality. In my opinion you should look into how your output.json file is produced and hopefully you can pipe it to a script of your own.
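A named pipe is one hedged way to get that behaviour without a regular file holding the data (producer_command stands for whatever currently writes output.json):
mkfifo /tmp/output.pipe
producer_command > /tmp/output.pipe &   # whatever currently creates output.json
cat /tmp/output.pipe                    # display the data as it arrives
rm /tmp/output.pipe                     # no file with real content is left on disk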
I have a large file, 130 GB in size.
# ls -lrth
-rw-------. 1 root root 129G Apr 20 04:25 syslog.log
I need to reduce the file size by deleting lines which start with "Nov 2", so I ran the following command:
sed -i '/Nov 2/d' syslog.log
I can't edit the file with the vim editor either.
When I run the sed command, it also creates a backup file, but I don't have much space left in root. Please suggest an alternate solution to delete these lines from the file without using extra space on the server.
It does not create a real backup file. sed is a stream editor. When applied to a file with option -i, it streams that file through the sed process, writes the output to a new (temporary) file, and, when everything is done, renames the new file to the original name.
(There are options to create backup files also, but you didn't give them, so I won't mention that further.)
In your case you have a very large file and don't want to create any copy, however temporary. For this you need to open the file for reading and writing at the same time, so that the filtering process can overwrite the original in place. After that, you will have to truncate the file at the point where the writing ended.
To demonstrate how this can be done, we first perform a test case.
Create a test file, containing lots of lines:
seq 0 999999 > x
Now, lets say we want to remove all lines containing the digit 4:
grep -v 4 1<>x <x
This will open the file for reading and writing as STDOUT (1), and for reading as STDIN. The grep command will read all lines and will output only the lines not containing a 4 (option -v).
This will effectively overwrite the beginning of the original file.
You will not know how long the output is, so after the output the original contents of the file will appear:
…
999991
999992
999993
999995
999996
999997
999998
999999
537824
537825
537826
537827
537828
537829
…
You can use the Unix tool truncate to shorten your file manually afterwards. In a real scenario you will have trouble finding the right spot for this, so it makes sense to count the number of bytes written (using wc):
(Don't forget to recreate the original x for this test.)
(grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c
This will perform the step above and additionally print out the number of bytes written to the terminal; in this example case the output will be 3653658. Now use truncate:
truncate -s 3653658 x
Now you have the result you want.
If you want to do this in a script, i. e. without interaction, you can use this:
length=$( (grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c )
truncate -s "$length" x
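Applied to the question's file, the same idea would look roughly like this (a sketch, not tested on a 130 GB file; the pattern mirrors the question's sed expression /Nov 2/d):
length=$( (grep -v 'Nov 2' <syslog.log | tee /dev/stderr 1<>syslog.log) |& wc -c )
truncate -s "$length" syslog.log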
I cannot guarantee that this will work for files >2GB or >4GB on your machine; depending on your operating system (32bit?) and the versions of the installed tools you might run into largefile issues. I'd perform tests with large files first (>4GB as this is typically a limit for many things) and then cross your fingers and give it a try :)
Some caveats you have to keep in mind:
Of course, nobody is supposed to append log entries to that log file while the procedure is running.
Also, any abort during the running of the process (power failure, signal caught, etc.) will leave the file in an undefined state. But re-running the command again after such a mishap will in most cases produce the correct output; some lines might be doubled, but not more than a single line should be corrupted then.
The output must be smaller than the input, of course, otherwise the writing will overtake the reading, corrupting the whole result so that lines which should be there will be missing (or truncated at the start).
I'm trying to redirect(?) my standard error/output to a text file.
I did my research, but for some reason the online answers are not working for me.
What am I doing wrong?
cd /home/user1/lists/
for dir in $(ls)
do
    (
        echo | $dir > /root/user1/$dir" "log.txt
    ) > /root/Desktop/Logs/Update.log
done
I also tried
2> /root/Desktop/Logs/Update.log
1> /root/Desktop/Logs/Update.log
&> /root/Desktop/Logs/Update.log
None of these work for me :(
Help please!
Try this for the basics:
echo hello >> log.txt 2>&1
Could be read as: echo the word hello, redirecting and appending STDOUT to the file log.txt. STDERR (file descriptor 2) is redirected to wherever STDOUT is being pointed. Note that STDOUT is the default and thus there is no "1" in front of the ">>". Works on the current line only.
To redirect and append all output and error of all commands in a script, put this line near the top. It will be in effect for the length of the script instead of doing it on each line:
exec >>log.txt 2>&1
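For instance, a small sketch (the log path is taken from your question): with the exec line at the top, everything that follows lands in the log.
#!/bin/bash
exec >>/root/Desktop/Logs/Update.log 2>&1
echo "this line is appended to the log"   # stdout goes to the log
ls /no/such/directory                     # this error message is appended too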
If you are trying to obtain a list of the files in /home/user1/lists, you do not need a loop at all:
ls /home/user1/lists/ >Update.log
If you are attempting to run every file in the directory as an executable with a newline as its input, and collect the output from all these programs in Update.log, try this:
for file in /home/user1/lists/*; do
    echo | "$file"
done >Update.log
(Notice how we avoid the useless use of ls and how there is no redirection inside the loop.)
If you want to create an empty file called *.log.txt for each file in the directory, you would do
for file in /home/user1/lists/*; do
    touch "$(basename "$file")"log.txt
done
(Using basename to obtain the file name without the directory part avoids the cd but you could do it the other way around. Generally, we tend to avoid changing the directory in scripts, so that the tool can be run from anywhere and generate output in the current directory.)
If you want to create a file containing a single newline, regardless of whether it already exists or not,
for file in /home/user1/lists/*; do
    echo >"$(basename "$file")"log.txt
done
In your original program, you redirect the echo inside the loop, which means that the redirection after done will not receive any output at all, so the created file will be empty.
These are somewhat wild guesses at what you might actually be trying to accomplish, but should hopefully help nudge you slightly in the right direction. (This should properly be a comment, I suppose, but it's way too long and complex.)
I wrote a script that gets load and mem information for a list of servers by ssh'ing to each server. However, since there are around 20 servers, it's not very efficient to wait for the script to end. That's why I thought it might be interesting to make a crontab that writes the output of the script to a file, so all I need to do is cat this file whenever I need to know load and mem information for the 20 servers. However, when I cat this file during the execution of the crontab it will give me incomplete information. That's because the output of my script is written line by line to the file instead of all at once at termination. I wonder what needs to be done to make this work...
My crontab:
* * * * * (date;~/bin/RUP_ssh) &> ~/bin/RUP.out
My bash script (RUP_ssh):
for comp in `cat ~/bin/servers`; do
    ssh $comp ~/bin/ca
done
Thanks,
niefpaarschoenen
You can buffer the output to a temporary file and then output all at once like this:
outputbuffer=`mktemp`            # Create a new temporary file, usually in /tmp/
trap "rm '$outputbuffer'" EXIT   # Remove the temporary file if we exit early.

for comp in `cat ~/bin/servers`; do
    ssh $comp ~/bin/ca >> "$outputbuffer"   # gather info into the buffer file
done

cat "$outputbuffer"   # print the buffer to stdout
# rm "$outputbuffer"  # delete the temporary file; not necessary when using trap
Assuming there is a string identifying which host the mem/load data came from, you can update your txt file as each result comes in. Assuming each data block is one line long, you could use:
for comp in `cat ~/bin/servers`; do
    output=$( ssh $comp ~/bin/ca )
    # Remove the old mem/load data for $comp from RUP.out.
    # This assumes that the string "$comp" is integrated into the
    # output from ca, and does not appear elsewhere.
    sed -i '/'"$comp"'/d' RUP.out
    echo "$output" >> RUP.out
done
This can be adapted depending on the output of ca. There is lots of help on sed across the net.
I have the following situation:
There is a Windows folder that has been mounted on a Linux machine. There could be multiple folders (set up beforehand) in this Windows mount. I have to do something (preferably a script, to start with) to watch these folders.
These are the steps:
Watch for any incoming file(s). Make sure they are transferred completely.
Move it to another folder.
I do not have any control over the file transfer program on the Windows machine; it is a secure FTP, I believe. So I cannot ask that process to send me a trailer file to signal the completion of a file transfer.
I have written a bash script. I would like to know about any potential pitfalls with this approach, because there is a possibility of multiple copies of this script running for multiple directories like this. At the moment, there could be up to 100 directories that may have to be monitored.
Following is the script. I'm sorry for pasting a very long one here; please take your time to review it and comment on / criticize it. :-)
It takes three parameters: the folder that has to be watched, the folder where the file has to be moved, and a time interval, which is explained below.
Linux servername 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:27:17 EDT 2006 i686 i686 i386 GNU/Linux
#!/bin/bash

log_this()
{
    message="$1"
    now=`date "+%D-%T"`
    echo $$": "$now ": " $message
}

usage()
{
    cat << EOF
Usage: $0 <Directory to be watched> <Directory to transfer> <time interval>
Time interval is the amount of time after which the modification time of a
file will be monitored.
EOF
    exit 1
}

if [ $# -lt 2 ]
then
    usage
fi

WATCH_DIR=$1
APP_DIR=$2

if [ ! -d "$WATCH_DIR" ]
then
    log_this "FATAL: WATCH_DIR, $WATCH_DIR does not exist. Exiting"
    exit 1
fi

if [ ! -d "$APP_DIR" ]
then
    log_this "APP_DIR: $APP_DIR does not exist. Exiting"
    exit 1
fi

# This needs to be set after considering the rate of file transfer.
# Represents the seconds elapsed after the last modification to the file.
# If not supplied as parameter, defaults to 3.
seconds_between_mods=$3
if ! [[ "$seconds_between_mods" =~ ^[0-9]+$ ]]; then
    if [ ${#seconds_between_mods} -eq 0 ]; then
        log_this "No value supplied for elapse time. Defaulting to 3."
        seconds_between_mods=3
    else
        log_this "Invalid value provided for elapse time"
        exit 1
    fi
fi

log_this "Start Monitor."

while true
do
    ls -1 $WATCH_DIR | while read file_name
    do
        log_this "Start Monitoring for $file_name"
        # Refer only to the modification time with reference to the mount folder.
        # If there is a diff in time between servers, we are in trouble.
        token_file=$WATCH_DIR/foo.$$
        current_time=`touch $token_file && stat -c "%Y" $token_file`
        rm -f $token_file 2>/dev/null
        log_this "Current Time: $current_time"
        last_mod_time=`stat -c "%Y" $WATCH_DIR/$file_name`
        elapsed_time=`expr $current_time - $last_mod_time`
        log_this "Elapsed time ==> $elapsed_time"
        if [ $elapsed_time -ge $seconds_between_mods ]
        then
            log_this "Moving $file_name to $APP_DIR"
            # In case there is no space left on the target mount, hide the file
            # in the mount itself and remove the incomplete file from APP_DIR.
            mv $WATCH_DIR/$file_name $APP_DIR
            if [ $? -ne 0 ]
            then
                log_this "FATAL: mv failed!! Hiding $file_name"
                rm $APP_DIR/$file_name
                mv $WATCH_DIR/$file_name $WATCH_DIR/.$file_name
                log_this "Removed $APP_DIR/$file_name. Look for $WATCH_DIR/.$file_name and submit later."
            fi
            log_this "End Monitoring for $file_name"
        else
            log_this "$file_name: Transfer seems to be in progress"
        fi
    done
    log_this "Nothing more to monitor."
    echo
    sleep 5
done
This isn't going to work for any length of time. In production, you will have network problems and other errors which can leave a partial file in the upload directory. I also don't like the idea of a "trailer" file. The usual approach is to upload the file under a temporary name and then rename it after the upload completes.
This way, you just have to list the directory, filter the temporary names out, and if there is anything left, use it.
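A hedged sketch of the consumer side, assuming the uploader can be changed to write files with a temporary suffix such as .part (the suffix is an assumption; WATCH_DIR and APP_DIR are the variables from your script):
for f in "$WATCH_DIR"/*; do
    [ -e "$f" ] || continue          # directory is empty, nothing to do
    case "$f" in
        *.part) continue ;;          # upload still in progress, skip it
    esac
    mv "$f" "$APP_DIR"/              # fully uploaded file, safe to move
done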
If you can't make this change, then ask your boss for written permission to implement something which can lead to arbitrary data corruption. This is for two purposes: 1) to make them understand that this is a real problem and not something you made up, and 2) to protect yourself when it breaks ... because it will, and guess who'll get all the blame?
I believe a much saner approach would be the use of a kernel-level filesystem notification mechanism, such as inotify. The inotify-tools package provides the corresponding command-line tools.
incron is an "inotify cron" system. It consists of a daemon and a table manipulator. You can use it a similar way as the regular cron. The difference is that the inotify cron handles filesystem events rather than time periods.
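For illustration (a hedged sketch; /data/incoming and handle_file are placeholders, and incrontab(5) documents the exact wildcards), an incron table entry pairs a watched path, an event mask, and a command:
# incrontab -e   ($@ expands to the watched directory, $# to the file name)
/data/incoming IN_CLOSE_WRITE,IN_MOVED_TO /usr/local/bin/handle_file $@/$#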
First make sure inotify-tools is installed.
Then use them like this:
logOfChanges="/tmp/changes.log.csv"   # Set your file name here.

# Lock and load
inotifywait -mrcq $DIR > "$logOfChanges" &   # monitor, recursively, output CSV, be quiet.
IN_PID=$!                                    # PID of the backgrounded inotifywait

# Do your stuff here
...

# Kill and analyze
kill $IN_PID

cat "$logOfChanges" | while read entry; do
    # Split your CSV, but beware that file names may contain spaces too.
    # Just look up how to parse CSV with bash. :)
    path=...
    event=...
    ...   # Other stuff like time stamps

    # Depending on the event…
    case "$event" in
        SOME_EVENT) myHandlingCode "$path" ;;
        ...
        *) myDefaultHandlingCode "$path" ;;
    esac
done
Alternatively, using --format instead of -c on inotifywait would be an idea.
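For example (a hedged sketch; %e is the event list and %w%f the full path, and putting the events first keeps file names with spaces intact in $path):
inotifywait -mrq --format '%e %w%f' "$DIR" | while read -r events path; do
    printf '%s on %s\n' "$events" "$path"   # replace with your own handling
done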
Just man inotifywait and man inotifywatch for more info.
To be honest, a Python app set up to run at start-up will do this quickly and efficiently. Python has amazing OS support, and it's rather complete.
Running the script will likely work, but it will be troublesome to maintain and manage. I take it you will run these as frequent cron jobs?
To get you on your feet, here is a small app I wrote which takes a path and looks at the binary output of JPEG files. I never quite finished it, but it will get you started and show you the structure of Python as well as some use of the os module.
I wouldn't spend too much time worrying about my code.
import time, os, sys

# analyze() takes in a path and moves into the output_files folder, to then analyze files
def analyze(path):
    list_outputfiles = os.listdir(path + "/output_files")
    print list_outputfiles
    for i in range(len(list_outputfiles)):
        #print list_outputfiles[i]
        f = open(list_outputfiles[i], 'r')
        f.readlines()

# txtmaker reads the media file and writes its binary contents to a text file.
def txtmaker(c_file):
    print c_file
    os.system("cat" + " " + c_file + ">" + " " + c_file + ".txt")
    os.system("mv *.txt output_files")

# parser() takes in the inputed path, reads and lists all files, creates a directory, then calls txtmaker.
def parser(path):
    os.chdir(path)
    os.mkdir(path + "/output_files", 0777)
    list_files = os.listdir(path)
    for i in range(len(list_files)):
        if os.path.isdir(list_files[i]) == True:
            print (list_files[i], "is a directory")
        else:
            txtmaker(list_files[i])
    analyze(path)

def main():
    path = raw_input("Enter the full path to the media: ")
    parser(path)

if __name__ == '__main__':
    main()