Need suggestion to move a big live file in Linux

Multiple scripts are running on my Linux server and generating huge amounts of data. I realise they will eat all of my 500 GB of storage in the next 2-5 days, but the scripts need about 10 more days to finish, which means they need more space. So I am most likely going to have a space problem and will have to restart the entire process.
Process is like this -
script1.sh content is like below
"calling an api" > /tmp/output1.txt
script2.sh content is like below
"calling an api" > /tmp/output2.txt
Executed like this -
nohup ./script1.sh & ### this creates /tmp/output1.txt
nohup ./script2.sh & ### this creates /tmp/output2.txt
My initial understanding was that if I followed the steps below, it would work:
when scripts are running with nohup in background execute this command -
mv /tmp/output1.txt /tmp/output1.txt_bkp; touch /tmp/output1.txt
Then transfer /tmp/output1.txt_bkp to another server via FTP and remove it afterwards to free up space, while the script keeps writing to /tmp/output1.txt.
But this assumption was wrong, and the script keeps on writing to /tmp/output1.txt_bkp. I think the script is writing based on the inode number (via its open file descriptor), which is why it keeps writing to the old file.
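That is exactly what happens: mv only renames the directory entry, while the writer's open file descriptor still points at the same inode. A tiny demo (the file names here are illustrative):

```shell
#!/bin/bash
# demo: a rename does not redirect a writer's open file descriptor
exec 3> /tmp/fd_demo.txt           # open the file once and hold the descriptor,
                                   # like a long-running redirection does
echo "before rename" >&3
mv /tmp/fd_demo.txt /tmp/fd_demo.txt_bkp
touch /tmp/fd_demo.txt             # recreate the original name, as in the question
echo "after rename" >&3            # still lands in the _bkp file (same inode)
exec 3>&-                          # close the descriptor
cat /tmp/fd_demo.txt_bkp           # both lines are here; /tmp/fd_demo.txt stays empty
```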
Now the question is: how do I avoid the space issue without killing/restarting the scripts?

Essentially what you're trying to do is pull a file out from under a script that's actively writing into it. I'm not sure how nohup would let you do that.
May I suggest a different approach?
Why don't you move a fixed number of lines from your /tmp/output[x].txt to /tmp/output[x].txt_bkp? You can do so without much trouble while your script is running and dumping stuff into /tmp/output[x].txt. That way you can free up space by shrinking your output[x] files.
Try this as a test. Open 2 terminals (or use screen) to your Linux box. Make sure both are in the same directory. Run this command in one of your terminals:
for line in `seq 1 2000000`; do echo $line >> output1.txt; done
And then run this command in the other before the first one finishes:
head -1000 output1.txt > output1.txt_bkp && sed -i '1,+999d' output1.txt
Here is what's going to happen. The first command will start producing a file that looks like this:
1
2
3
...
2000000
The second command will chop off the first 1000 lines of output1.txt and put them into output1.txt_bkp and it will do so WHILE the file is being generated.
Afterwards, look inside output1.txt and output1.txt_bkp, you will see that the former looks like this:
1001
1002
1003
1004
...
2000000
While the latter will have the first 1000 lines. You can do the same exact thing with your logs.
A word of caution: based on your description, your box is under a heavy load from all that dumping, and the head and the sed above are two separate commands, so lines appended between them would be deleted without having been backed up. Note also that sed -i replaces the file with a rewritten copy (a new inode); that is fine in the test, because the loop re-opens the file for every append, but a writer that opens the file once and keeps it open would continue writing to the old, now-deleted inode.
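Applied to the original problem, the chop-and-backup step could be wrapped in a small helper that is run periodically. A minimal sketch (rotate_chunk and the file names are illustrative, and the FTP transfer of the backup file is left out):

```shell
#!/bin/bash
# sketch: move the first N lines of a live log into a backup file that
# can then be transferred elsewhere and deleted to free space
rotate_chunk() {
    local log=$1 bkp=$2 n=$3
    head -n "$n" "$log" > "$bkp" && sed -i "1,${n}d" "$log"
}

# demo run against a sample log
seq 1 5000 > /tmp/demo_output.txt
rotate_chunk /tmp/demo_output.txt /tmp/demo_output.txt_bkp 1000
head -1 /tmp/demo_output.txt      # the log now starts at line 1001
```

Because head and sed are not atomic, a production setup might prefer a tool built for this, such as logrotate with its copytruncate option.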

Related

How to take control of files in Linux before processing starts

I am currently working on a project to automate a manual task in my office. We have a process where we have to re-trigger some of our IDs when they fall into repair. As part of the process, we extract those IDs from an Oracle DB table, put them in a file on our Linux server, and run a command like this:
Example file:
$ cat /task/abc_YYYYMMDD_1.txt
23456
45678
...and so on
cat abc_YYYYMMDD_1.txt | scripttoprocess -args
I am using an existing Java-based tool called 'scripttoprocess'. I can't see what's inside it, as it seems to be encrypted. I simply go to the location where my files are present and then use it like this:
cd /export/incoming/task
for i in abc_YYYYMMDD*.txt; do
    cat "$i" | scripttoprocess -args
    if [ $? -eq 0 ]; then
        mv "$i" /export/incoming/HIST/
    fi
done
scripttoprocess is an existing script; I am just calling it from my own script. My script runs continuously in a loop in the background. It simply searches for abc_YYYYMMDD_1.txt files in the /task directory, and if it detects such a file it starts processing it. But I have noticed that my script starts processing a file well before it is fully written, and sometimes moves the file to HIST without fully processing it.
How can I handle this situation? I want to be fully sure that a file is completely written before I start processing it. Secondly, is there any way to take control of the files, like preparing a control file which contains the list of files present in the /task directory? Then I could cat this control file and pick up file names from inside it. Your guidance will be much appreciated.
I used
iwatch -e close_write -c "/usr/bin/pdflatex -interaction batchmode %f" document.tex
To run a command (Latex to PDF conversion) when a file (document.tex) is closed after writing to it, which you could do as well.
However, there is a caveat: This was only meant to catch manual edits to the file and failure was not critical. Therefore, this ignores the case that immediately after closing, it is opened and written again. Ask yourself if that is good enough for you.
I agree with @TenG: normally you shouldn't move a file until it is fully written. If you know for sure that the file is finished (like a file from yesterday) then you can move it safely; otherwise you can process it, but not move it. You can, for example, process part of it and remember the number of processed rows so that you don't restart from scratch next time.
If you really really want to work with files that are "in progress", sometimes tail -F works for this case, but then your bash script is an ongoing process as well, not a job, and you have to manage it.
You can also check whether a file is currently open (and thus unfinished) using lsof (see https://superuser.com/questions/97844/how-can-i-determine-what-process-has-a-file-open-in-linux).
Change the process that extracts the IDs from the Oracle DB table.
You can use the mv as commented by @TenG, or put something special in the file that shows the work is done:
#!/bin/bash
source file_that_runs_sqlcommands_with_credentials
output=$(your_sql_function "select * from repairjobs")
# Something more for removing them from the table and check the number of deleted records
printf "%s\nFinished\n" "${output}" >> /task/abc_YYYYMMDD_1.txt
or
#!/bin/bash
source file_that_runs_sqlcommands_with_credentials
output=$(your_sql_function "select * from repairjobs union select 'EOF' from dual")
# Something more for removing them from the table and check the number of deleted records
printf "%s\n" "${output}" >> /task/abc_YYYYMMDD_1.txt
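On the reading side, the loop would then only pick up files whose last line is the agreed marker. A minimal sketch assuming the literal sentinel Finished from the first variant above (the directory names and sample files are illustrative, and the real scripttoprocess call is shown only as a comment):

```shell
#!/bin/bash
# sketch: only process files whose last line is the agreed "Finished" marker
mkdir -p /tmp/task_demo /tmp/HIST_demo
printf '23456\n45678\nFinished\n' > /tmp/task_demo/abc_20240101_1.txt  # complete
printf '99999\n' > /tmp/task_demo/abc_20240101_2.txt                   # still in progress

for f in /tmp/task_demo/abc_*.txt; do
    [ "$(tail -1 "$f")" = "Finished" ] || continue   # writer not done yet, skip
    # the real loop would run: cat "$f" | scripttoprocess -args
    mv "$f" /tmp/HIST_demo/
done
ls /tmp/HIST_demo/    # only the complete file was moved
```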

How to store lines in a file with the latest at the top in Linux

I have lines that store the logs of my script in a separate file, and it is working perfectly, but there is an issue: it stores the latest entries at the bottom of the file, which is the default behaviour. What I want is to redirect output to the log file with the latest entries at the top.
Here are the lines which I include at the top of my hooks script; these lines store the output of every command written below them in a separate log file:
#!/bin/bash
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3
exec 1>>/tmp/git_hooks_logs.log 2>&1
#create logs of every command written below these lines
The general consensus is that there is no reliable way to do this in a single operation, and several ways to do it with more than one operation, some of which can be entered on one line if that is important to you.
You might want to look at https://www.cyberciti.biz/faq/bash-prepend-text-lines-to-file/
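One of the usual multi-step approaches from that page, adapted to the log file above: write the new entry first, append the old contents after it, and rename the temp file over the original (the entry text here is illustrative):

```shell
#!/bin/bash
# sketch: prepend a line to a log by rebuilding the file via a temp copy
log=/tmp/git_hooks_logs.log
printf 'old entry 1\nold entry 2\n' > "$log"     # pre-existing contents

printf '%s\n' "newest entry" | cat - "$log" > "${log}.tmp" \
    && mv "${log}.tmp" "$log"                    # swap the rebuilt file into place

head -1 "$log"     # the newest entry is now at the top
```

Note that this rewrites the whole file on every entry, so it is only practical for logs that stay reasonably small.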

Python - read from a remote logfile that is updated frequently

I have a logfile that is written to constantly on a remote networking device (F5 BIG-IP). I have a Linux hopping station from where I can fetch that log file and parse it. I did find a solution that would implement a "tail -f", but I cannot use nice or similar to keep my script running after I log out. What I can do is run a cron job and copy over the file every 5 minutes, say. I can process the file I downloaded, but the next copy will contain a lot of common data, so how do I process only what is new? Any help or suggestions are welcome!
Two possible (non-Python) solutions for your problems. If you want to keep a script running on your machine after logout, check nohup in combination with &, like:
nohup my_program > /dev/null &
On a linux machine you can extract the difference between the two files with
grep -Fxv -f old.txt new.txt > dif.txt
This might be slow if the file is large. The dif.txt file will only contain the new stuff and can be inspected by your program. There also might be a solution involving diff.
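An alternative to diffing the whole files, assuming the remote log is only ever appended to, is to remember how many bytes were processed last time and read only past that point with tail -c. A minimal sketch (the file names are illustrative, and the downloads are simulated with seq):

```shell
#!/bin/bash
# sketch: incremental reads of a growing log using a stored byte offset
log=/tmp/fetched_demo.log
state=/tmp/fetched_demo.offset
rm -f "$state"

seq 1 100 > "$log"                                 # first "download"
offset=$(cat "$state" 2>/dev/null || echo 0)
tail -c +"$((offset + 1))" "$log" > /tmp/new_part  # everything after the old offset
wc -c < "$log" > "$state"                          # remember where we stopped

seq 1 150 > "$log"                                 # next download: 50 new lines
offset=$(cat "$state")
tail -c +"$((offset + 1))" "$log" > /tmp/new_part
head -1 /tmp/new_part                              # first unseen line: 101
```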

Run two shell script in parallel and capture their output

I want to have a shell script which configures several things and then calls two other shell scripts. I want these two scripts to run in parallel, and I want to be able to get and print their live output.
Here is my first script which calls the other two
#!/bin/bash
#CONFIGURE SOME STUFF
$path/instance2_commands.sh
$path/instance1_commands.sh
These two processes try to deploy two different applications, and each of them takes around 5 minutes, so I want to run them in parallel and also see their live output so I know where they are with the deployment tasks. Is this possible?
Running both scripts in parallel can look like this:
#!/bin/bash
#CONFIGURE SOME STUFF
$path/instance2_commands.sh >instance2.out 2>&1 &
$path/instance1_commands.sh >instance1.out 2>&1 &
wait
Notes:
wait pauses until the children, instance1 and instance2, finish
2>&1 on each line redirects error messages to the relevant output file
& at the end of a line causes the main script to continue running after forking, thereby producing a child that is executing that line of the script concurrently with the rest of the main script
each script should send its output to a separate file. Sending both to the same file will be visually messy and impossible to sort out when the instances generate similar output messages.
you may attempt to read the output files while the scripts are running with any reader, e.g. less instance1.out; however, output may be stuck in a buffer and not up to date. To fix that, the programs would have to open stdout in line-buffered or unbuffered mode. It is also up to you to use -f or > to refresh the display.
Example D from an article on Apache Spark and parallel processing on my blog provides a similar shell script for calculating sums of a series for Pi on all cores, given a C program for calculating the sum on one core. This is a bit beyond the scope of the question, but I mention it in case you'd like to see a deeper example.
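The buffering note above can often be addressed from the outside with stdbuf from GNU coreutils, which asks a program to line-buffer its stdout (this only helps programs whose output goes through C stdio). A minimal sketch with two stand-in workers in place of the real deployment scripts:

```shell
#!/bin/bash
# sketch: two parallel workers, each logging to its own file, with
# line-buffered stdout so the files stay current while running
stdbuf -oL sh -c 'for i in 1 2 3; do echo "instance1 step $i"; sleep 0.2; done' \
    > /tmp/instance1.out 2>&1 &
stdbuf -oL sh -c 'for i in 1 2 3; do echo "instance2 step $i"; sleep 0.2; done' \
    > /tmp/instance2.out 2>&1 &
wait    # meanwhile, tail -f /tmp/instance1.out /tmp/instance2.out shows live progress
```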
It is very possible; change your script to look like this:
#!/bin/bash
#CONFIGURE SOME STUFF
$path/instance2_commands.sh >> script.log 2>&1 &
$path/instance1_commands.sh >> script.log 2>&1 &
wait
They will both output to the same file and you can watch that file by running:
tail -f script.log
If you like, you can output to 2 different files instead. Just change each line to output (>>) to a second file name.
This is how I ended up writing it, using Paul's instruction:
source $path/instance2_commands.sh >instance2.out 2>&1 &
source $path/instance1_commands.sh >instance1.out 2>&1 &
tail -q -f instance1.out instance2.out --pid $!
wait
sudo rm instance1.out
sudo rm instance2.out
The logs from my two processes were different enough to tell apart, so I didn't care that they weren't kept separate; that is why I displayed them together in one stream.

Bash script exits with no error

I have a bash script that I'm running from DVD. This script copies multi-volume tar files from DVD to the local machine. Part-way through the copy, the script prompts the user to insert a second DVD, at which point the remaining files are copied. The script exists on the first DVD but not on the second.
This script is simply stopping after the last file is copied, but prior to starting the tar multi-volume extract operation and subsequent processing. There are no errors or messages reported. I've tried running bash with '-x' but there's nothing suspicious - not even an exit statement. Even more unfortunate is the fact that this behavior is inconsistent. Sometimes the script will stop, but other times it will continue with no problems.
I have run strace on the script. Following the conclusion of the copy operations, I see this:
read(255, "\0\0\0\0\0\0\0\0\0\0"..., 5007) = 1302
read(255, "", 5007) = 0
exit_group(0) = ?
I know that bash reads the script file into memory and executes it from there, but is it possible that it's trying to re-read the script file at some point and failing (since it no longer exists)? The tar files are quite large, and it takes approximately 10-15 minutes from the time the script starts to the time the last file is copied (from the second DVD).
I see you have already found a workaround, so I will just try to uncover what's happening:
bash isn't reading the whole script into memory; it's doing buffered reads on it, only as much as necessary each time (presumably for code sharing with terminal input). Before any external command is launched, bash seeks to the exact position in the script and continues to read from there after the command finishes. You can see this if you edit the script file while it's running:
term1$ cat > test.sh
sleep 8
echo DONE
term1$ bash test.sh
While the sleep is executing, change the script from another terminal:
term2$ cat > test.sh
echo HAHA
Observe how bash becomes confused when the sleep is complete:
test.sh: line 2: A: command not found
It remembers that the position in the input file was 8 before the sleep, so it tries to read from there and is confronted with the last A from the overwritten script.
Now to your case. Normally, having a file open from a dvd locks the drive and prohibits disk change. If you nevertheless manage to change the disk, that should definitely involve an umount which should then invalidate the script fd. That's clearly not happening according to your strace output, which is a little strange. In any case, bash won't be able to read the rest of the script.
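A common defensive pattern for scripts whose file may go away mid-run (a DVD swap, self-deletion, an in-place upgrade) is to wrap the whole body in a { } block: the block is one compound command, so bash must parse it completely before executing its first line. A minimal sketch that overwrites the script file mid-run, as in the cat test above:

```shell
#!/bin/bash
# sketch: a script body wrapped in { } survives changes to its own file
cat > /tmp/selfsafe.sh <<'EOF'
#!/bin/bash
{
    echo "phase 1"
    sleep 1          # window in which the script file changes
    echo "phase 2"
    exit             # make sure bash never reads past the block
}
EOF
bash /tmp/selfsafe.sh > /tmp/selfsafe.out &
sleep 0.3
echo 'echo CLOBBERED' > /tmp/selfsafe.sh   # overwrite while it is running
wait
cat /tmp/selfsafe.out                      # phase 1 and phase 2, no confusion
```

This does not help if the open file descriptor itself is invalidated, as with a removed DVD, but it removes the re-read problem for everything inside the block.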