Howto find process that processes a bash pipe - linux

I am running a while loop in bash to periodically delete all "old" files in a directory (keeping only the 100 newest files). This process is run in the background:
((cd /tmp/test && while true; do ls -t | sed -e '1,100d' | xargs -I{} -d '\n' rm -R {}; sleep 1; done)&)
How do I find the process of this bash process in ps?
The objective is that, as part of a larger script, I would like to automatically detect whether this process is already running and, if not, start it.
[edit / clarification] Saving the PID is not a solution, because the script can be executed multiple times. It is supposed to ensure that the machine is prepared for a following process. The intention is that the user / developer can just run it to make sure everything is set up. If some of the conditions are already fulfilled, those steps will be skipped. Most of what needs to be run consists of idempotent commands. This is the only command I am struggling to make idempotent. It is an intermediate hack until we have a proper provisioning system in place.

One way would be to use a lock file in your script, something like:
lock="/tmp/${name}.lock"
if [ -f "${lock}" ];then
echo "already running. pid: $(cat "${lock}")"
exit 1
fi
trap 'rm -f "${lock}" 2>/dev/null; exit' 0 1 2 15
echo $$ > "${lock}"
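One weakness of a plain lock file is that it survives a crash or a kill -9, after which the script refuses to start forever. A minimal sketch of a stale-lock check follows, assuming the lock file holds the PID as above. The helper name lock_holder_alive and the lock path are made up for illustration. Note that kill -0 also fails for processes owned by another user, and that PIDs can be reused, so treat this as a heuristic rather than a guarantee:

```shell
#!/bin/bash
# lock_holder_alive FILE: succeed if FILE names a PID that still exists.
# kill -0 sends no signal; it only checks that the process can be signalled.
lock_holder_alive() {
    local pid
    [ -f "$1" ] || return 1
    pid=$(cat "$1")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

lock="${TMPDIR:-/tmp}/cleaner-demo.lock"   # hypothetical lock path
if lock_holder_alive "$lock"; then
    echo "already running. pid: $(cat "$lock")"
else
    echo $$ > "$lock"                       # missing or stale lock: take it over
fi
```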
Instead of deleting the N oldest files, I'd suggest deleting files older than a certain age, such as N days:
N=5 # delete everything older than N days
find /tmp/test -mindepth 1 -maxdepth 1 -ctime +"${N}" -exec rm -r {} +
This has a lot of benefits, such as handling files with spaces or other odd characters in their names. It is also more predictable: if your process creates, say, 10000 files in /tmp/test, then deleting 100 is going to leave you a mess, or, the other way around, it may delete new files you really wanted to keep. Notice also that if you have new files in an old directory, your new files will be wiped out along with it. -mindepth 1 keeps /tmp/test itself from being deleted, and -maxdepth 1 restricts the match to its direct children. (This is the GNU find spelling used on Linux; BSD find writes the same thing as -ctime +"${N}"d -depth 1.)
If you really want to delete the oldest N files, I'd do it in a script, call it something like thecleaner.py, start it with thecleaner.py &, and search ps for that name.
#!/usr/bin/env python
# -*- coding: UTF8 -*-
import os
import shutil
import time

secs = 5           # number of seconds to pause during each loop
N = 100            # delete this many of the oldest files
top = "/tmp/test"  # remove N oldest files in this directory

while True:
    # os.listdir() returns bare names, so build full paths before stat-ing them
    paths = sorted((os.path.join(top, name) for name in os.listdir(top)),
                   key=os.path.getctime)
    oldest = paths[:N]
    for path in oldest:
        print("removing: {}".format(path))
        if os.path.isfile(path):
            os.remove(path)
        else:
            shutil.rmtree(path)  # delete entire directory tree
    time.sleep(secs)
Otherwise I don't know of any way to reliably find the loop you've given.

ps aux should show most processes along with the command line that started them, so you can grep for what you need. lsof might show some more, but probably too much.
If you need to create something easy to grep, put it in a file, say ~/my.cmd, and run
bash --init-file my.cmd
or just make it executable and run my.cmd directly.
You could even use setsid with these to detach from the terminal, so it always runs in the background. Then just
ps aux | grep "my.cmd"
should identify it: two hits mean it is running (one hit for the grep itself, the other for the actual process). You can pipe the result to wc -l and check that it returns 2.
I suggest you check out crontab though; it seems better suited to what you want overall.
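As an alternative to counting ps | grep hits, pgrep -f matches against the full command line and never reports itself, so there is no extra hit to subtract. A small sketch, assuming pgrep (from procps) is installed. The function name is_running is hypothetical:

```shell
#!/bin/bash
# is_running PATTERN: succeed if any process's full command line matches the
# extended regular expression PATTERN. pgrep excludes its own process.
is_running() {
    pgrep -f "$1" > /dev/null
}

# Example use, assuming the loop lives in an executable file ~/my.cmd:
#   is_running 'my\.cmd' || setsid ~/my.cmd &
```

Be aware that the pattern also matches unrelated processes that merely mention the name on their command line, such as an editor opened on my.cmd.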

Related

How can dmenu show input as soon as there is input from pipe?

TL;DR
Here is the default behavior.
find ~/ -name '*.git' 2>/dev/null | dmenu
# Searches everything in home directory and shows output
Time taken about 1-2 sec
What I want:
find ~/ -name '*.git' 2>/dev/null | less
# Show as soon as it finds result. How to get similar output in dmenu?
As the number of files on my PC increases, this is going to take longer.
Detailed description:
I am piping input into dmenu from a find command that takes about 1-2 seconds. Is it possible for dmenu to show entries as soon as some input arrives in the pipe? That is how pipes normally work. It seems that dmenu waits until all the entries are in the pipe so that the user can search across them, which is reasonable, but can this be avoided? I would like dmenu to start as soon as there is input in the buffer.
I found a workaround that reduces the time compared to find: locate can be used instead. The command becomes
locate -r '/home/'"$USER"'.*\.git$'
-r takes a regular expression as its argument. The expression here matches all git repositories inside /home/$USER. This is a bit faster than using find.
Catch using locate
locate searches a local database, so it only works as expected when that database has been built and is up to date.
To update the database, use sudo updatedb. Whenever you add, move, or delete a file (or a directory in this case), remember to update the database so that locate gives correct results.
Tip
To avoid entering a password every time for updatedb (and other frequently used commands), add them to sudoers by running sudo visudo and adding an entry for the path to the command's binary.
Update
I recently realized there is no need for locate when I can simply maintain my own database and cat all the entries to dmenu. With this I was able to achieve what I needed.
# Make a temp directory
mkdir -p "$HOME/.tmp"
# Search for all git directories and store them in ~/.tmp/gitfiles.
[ -e "$HOME/.tmp/gitfiles" ] || find "$HOME/" -regex '.*/\.git$' -type d 2>/dev/null > "$HOME/.tmp/gitfiles"
# cat this file into dmenu
cat $HOME/.tmp/gitfiles | dmenu
This gives fuzzy finding of directories with dmenu. It is better than using locate: both approaches need the database to be updated, but locate filters for git directories at runtime, so it is a bit slower than reading a prefiltered list.
I can simply create an alias to update this database, analogous to sudo updatedb for locate:
alias gitdbupdate="find $HOME/ -regex '.*/\.git$' -type d 2>/dev/null > $HOME/.tmp/gitfiles"
Note that I am not using /tmp/ as it won't persist across power cycles. Instead I create my own $HOME/.tmp/ directory.
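The cache refresh above could be wrapped in two small functions. This is only a sketch: the names list_git_dirs and refresh_cache are invented here, and it swaps the -regex test for -name .git with -prune so that find does not descend into each repository's .git directory. Writing to a temporary file first means dmenu never reads a half-written list:

```shell
#!/bin/bash
# list_git_dirs ROOT: print every .git directory under ROOT, one per line.
# -prune stops find from descending into a .git directory once it matched.
list_git_dirs() {
    find "$1" -type d -name .git -prune -print 2>/dev/null
}

# refresh_cache ROOT CACHE: rebuild CACHE via a temporary file, then rename it
# into place so readers see either the old list or the complete new one.
refresh_cache() {
    local tmp
    tmp=$(mktemp "$2.XXXXXX") || return 1
    list_git_dirs "$1" > "$tmp"
    mv -- "$tmp" "$2"
}
```

Usage would then look like refresh_cache "$HOME" "$HOME/.tmp/gitfiles" followed by dmenu < "$HOME/.tmp/gitfiles".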

Listing files while working with them - Shell Linux

I have a database server whose basic job is to import some specific files, do some calculations, and provide data in a web interface.
A hardware replacement is planned for the coming weeks, so the database needs to be migrated. But there is one problem: the current database is corrupted and shows some errors in the web interface. This is due to the server freezing while importing/calculating, which is the reason for the replacement.
So I'm not willing to just dump the db and restore it on the new server. It doesn't make sense to keep using the corrupted database, and the old server gets really slow while dumping. I have a backup of all the files to be imported (currently 551) and I'm working on a script to "re-import" all of them and have a nice database again.
The current server takes ~20 minutes to import each new file. Let's say the new server takes 10 minutes per file thanks to its power... That's a long time! And here comes the problem: it receives a new file hourly, so there will be more files by the time it finishes the job.
The restore script starts like this:
for a in $(ls $BACKUP_DIR | grep part_of_filename); do
The question is: will this "ls" pick up new file names as they arrive? File names are timestamp based, so they will be at the end of the list.
Or is this "ls" executed once, with the results stored in a temporary variable?
Thanks.
ls will execute once, at the beginning, and any new files won't show up.
You can rewrite that statement to list the files again at the start of each loop (and, as Trey mentioned, better to use find, not ls):
while all=$(find $BACKUP_DIR/* -type f | grep part_of_filename); do
for a in $all; do
But this has a major problem: it will repeatedly process the same files over and over again.
The script needs to record which files are done. Then it can list the directory again and process any (and only) new files. Here's one way:
touch ~/done.list
cd $BACKUP_DIR
# loop while f=first file not in done list:
#   find                      list the files; more portable and safer than ls in pipes and scripts
#   fgrep -v -f ~/done.list   pass through only files not in the done list
#   head -n1                  pass through only the first one
#   grep .                    control the loop (true iff there is something)
while f=`find * -type f | fgrep -v -f ~/done.list | head -n1 | grep .`; do
    <process file $f>
    echo "$f" >> ~/done.list
done
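One caveat with fgrep -v -f is that it matches substrings, so a finished file named a would also hide a file named ab. A sketch of the same loop with whole-line matching (grep -Fx) follows. It assumes GNU find (for -printf and -maxdepth), and the function names next_pending and process_all are placeholders for illustration:

```shell
#!/bin/bash
# next_pending DIR DONE_LIST: print the first regular file in DIR (sorted by
# name) that is not yet recorded in DONE_LIST; fail when nothing is left.
next_pending() {
    find "$1" -maxdepth 1 -type f -printf '%f\n' | sort |
        grep -Fxv -f "$2" | head -n 1 | grep .
}

# process_all DIR DONE_LIST CMD...: run CMD on each pending file, then record it.
process_all() {
    local dir=$1 done_list=$2 f
    shift 2
    touch "$done_list"
    while f=$(next_pending "$dir" "$done_list"); do
        "$@" "$dir/$f" || return 1   # stop so a failed import is retried later
        echo "$f" >> "$done_list"
    done
}
```

Because the directory is listed again on every iteration, files that arrive while earlier ones are being imported are picked up before the loop ends. Keep the done list outside the directory being scanned.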

run a script with $(cat filename.txt)

So I'm running a script called backup.sh. It creates a backup of a site. I have a file called sites.txt that contains the list of sites I need to back up. I don't want to run the script by hand for every site. So what I'm trying to do is run it like this:
backup.sh $(cat sites.txt)
But it only backs up the first site on the list and then stops. Any suggestions on how I could make it go through the whole list?
To iterate over the lines of a file, use a while loop with the read command.
while IFS= read -r file_name; do
backup.sh "$file_name"
done < sites.txt
The proper fix is to refactor backup.sh so that it meets your expectation to accept a list of sites on its command line. If you are not allowed to change it, you can write a simple small wrapper script.
#!/bin/sh
for site in "$@"; do
    backup.sh "$site"
done
Save this as maybe backup_sites, do a chmod +x, and run it with the list of sites. (I would perhaps recommend xargs -a sites.txt over $(cat sites.txt) but both should work if the contents are one token per line.)
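If you can change backup.sh itself, the refactoring might look roughly like the sketch below. The function backup_site is a stand-in for whatever the real script does for one site, and backup_all is a name invented here. The point is only that the per-site work moves into a function that is called once per argument:

```shell
#!/bin/sh
# Stand-in for the real per-site backup; replace the body with the actual work.
backup_site() {
    echo "backing up $1"
}

# Back up every site given as an argument; report failures but keep going.
backup_all() {
    status=0
    for site in "$@"; do
        backup_site "$site" || { echo "backup failed: $site" >&2; status=1; }
    done
    return "$status"
}

# Only act when arguments were given, so the file can also be sourced.
[ "$#" -eq 0 ] || backup_all "$@"
```

With that in place, backup.sh $(cat sites.txt) processes the whole list, as long as each site is a single whitespace-free token.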
I think this should do, provided that sites.txt has one site per line (not tested):
xargs -L 1 backup.sh < sites.txt
If you are permitted to modify backup.sh, I would enhance it so that it accepts a list of sites, not a single one. Of course, if sites.txt is very, very large, the xargs way would still be the better one (but then without the -L switch).

Bash script for moving and renaming application log files on Linux

I'm relatively new to coding on linux.
I have the below script for moving my ERP log file.
#!/bin/bash
#Andrew O. MBX 2015-09-03
#HansaWorld Script to periodically move the log file
_now=$(date +"%m_%d_%Y")
mv /u/OML_Server_72/hansa.log /u/HansaLogs/hansa_$now.log
The code runs but does not rename the log file to the date it has been moved.
I would also like to check at the end of every day whether the file exceeds 90M, and move it automatically if it does, with a cron job of some kind.
Help Please
After editing this is my new code.
#!/bin/bash
#Andrew O. MBX 2015-09-03
#HansaWorld Script to periodically move the log file
now=$(date +"%m_%d_%Y")
mv /u/OML_Server_72/hansa.log /u/HansaLogs/hansa$now.log
I wish to add code to check if hansa.log file is over 90M then move it. If it is not then leave it as it is.
cd /u; find . -name '*hansa.log*' -size +90000k -exec mv '{}' /u/HansaLogs \;
In addition to the other comments, there are a few other things to consider. tgo's logrotate suggestion is a good one. In Linux, if you are ever stuck on the use of a utility, the man pages (while a bit cryptic at first) provide concise usage information. To see the man pages available for a given utility, use man -k name (some distributions provide this selection capability by default alias), e.g.:
$ man -k logrotate
logrotate (8) - rotates, compresses, and mails system logs
logrotate.conf (5) - rotates, compresses, and mails system logs
Then if you want the logrotate page:
$ man 8 logrotate
or the conf page
$ man 5 logrotate.conf
There are several things you may want to change/consider regarding your script. First, while there is nothing wrong with a variable named now, you may run into confusion with the date command's own use of now. There is no conflict, but it would look strange to write now=$(date -d "now + 24 hours" "+%F %T"). I recommend a name like tstamp (short for timestamp) instead.
For maintainability and readability, you may consider assigning your path components to variables (example below).
Finally, before moving, copying, or deleting anything, it is always a good idea to validate that the target file exists and to provide an error message if something is out of whack. A rewrite could be:
#!/bin/bash
#Andrew O. MBX 2015-09-03
#HansaWorld Script to periodically move the log file
tstamp=$(date +"%m_%d_%Y")
logdir="/u/HansaLogs"
logname="/u/OML_Server_72/hansa.log"
if [ -f "$logname" ]; then
mv "$logname" "$logdir/hansa_${tstamp}.log"
else
printf "error: file not found '%s'.\n" "$logname" >&2
exit 1
fi
Note: the >&2 simply redirects the output of printf to stderr rather than stdout.
As for the find command, there is no need to cd and find .; the find command takes the path as its first argument. Additionally, the -size option has built-in support for megabytes (M). A rewrite here could look like:
find /u -name "*hansa.log*" -size +90M -exec mv '{}' /u/HansaLogs \;
All in all, it looks like you will pick up shell programming without any problem. Just develop good habits early, they will save you a lot of grief later.
Hi guys, thanks for the help. So far I have come up with this code. I am stuck at creating a cron job to run it periodically, say every 22 hours.
#!/bin/bash
#Andrew O. MBX 2015-09-03
#HansaWorld Script to Check if log file exists before moving:
tstamp=$(date +"%m_%d_%Y")
logdir="/u/HansaLogs"
logname="/u/OML_Server_72/hansa.log"
minimumsize=$((90 * 1024 * 1024)) # 90M expressed in bytes, because wc -c counts bytes
actualsize=$(wc -c <"$logname")
if [ "$actualsize" -ge "$minimumsize" ]; then
mv "$logname" "$logdir/hansa_${tstamp}.log"
else
echo size is under $minimumsize bytes
exit 1
fi
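For the cron part, a sketch that combines the existence check from the earlier rewrite with the size check. The function name rotate_if_larger and the install path in the crontab comment are assumptions. cron schedules by clock fields rather than by intervals, so "every 22 hours" has no direct equivalent, and once a day at a fixed time is the simplest substitute:

```shell
#!/bin/bash
# rotate_if_larger FILE DEST_DIR LIMIT_BYTES: move FILE into DEST_DIR with a
# date suffix when it is at least LIMIT_BYTES long; otherwise leave it alone.
rotate_if_larger() {
    local file=$1 dest=$2 limit=$3 size
    [ -f "$file" ] || { printf "error: file not found '%s'.\n" "$file" >&2; return 1; }
    size=$(wc -c < "$file")
    if [ "$size" -ge "$limit" ]; then
        mv -- "$file" "$dest/hansa_$(date +%m_%d_%Y).log"
    fi
}

# Example crontab entry (edit with `crontab -e`), running daily at 23:00.
# The script path is hypothetical:
# 0 23 * * * /usr/local/bin/move_hansa_log.sh
```

The script named in the crontab line would call something like rotate_if_larger /u/OML_Server_72/hansa.log /u/HansaLogs $((90 * 1024 * 1024)).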

Prevent files to be moved by another process in linux

I have a problem with a bash script.
I have two cron tasks, which each take some number of files from the same folder for further processing.
ls -1h targdir/*.json | head -n ${LIMIT} > ${TMP_LIST_FILE}
while read REMOTE_FILE
do
mv $REMOTE_FILE $SCRDRL
done < "${TMP_LIST_FILE}"
rm -f "${TMP_LIST_FILE}"
But when two instances of the script run simultaneously, the same file ends up being moved to $SCRDRL, which is different for each instance.
The question is: how can I prevent the same files from being moved by both scripts?
UPD:
Maybe I was a little unclear...
I have a folder "targdir" where I store json files, and I have two cron tasks which take some files from that directory to process. For example, if targdir contains 25 files, the first cron task should take the first 10 files and move them to /tmp/task1, the second cron task should take the next 10 files and move them to /tmp/task2, etc.
But now the first 10 files are moved to both /tmp/task1 and /tmp/task2.
First and foremost: rename is atomic (as long as source and destination are on the same filesystem; across filesystems mv falls back to copy and delete). It is not possible for a file to be moved twice. One of the moves will fail, because the file is no longer there. If the scripts run in parallel, both list the same 10 files. So instead of the first 10 files going to /tmp/task1 and the next 10 to /tmp/task2, you may get 4 moved to /tmp/task1 and 6 to /tmp/task2, or 5 and 5, or 9 and 1, or any other combination. But each file will only end up in one task.
So nothing is incorrect; each file is still processed only once. But it is inefficient, because you could process 10 files at a time and are only processing about 5. If you want to make sure you always process 10 when enough files are available, you will have to do some synchronization. There are basically two options:
Place a lock around the list+move step. This is most easily done using flock(1) and a lock file. There are two ways to call it:
Call the whole move operation via flock:
flock targdir -c copy-script
This requires that you put the part that must run exclusively into a separate script.
Lock via file descriptor. Before the copying, do
exec 3>targdir/.lock
flock 3
and after it do
flock -u 3
This lets you lock over part of the script only. This does not work in Cygwin (but you probably don't need that).
Move the files one by one until you have enough.
ls -1h targdir/*.json > ${TMP_LIST_FILE}
# ^^^ do NOT limit here
COUNT=0
while read REMOTE_FILE
do
if mv $REMOTE_FILE $SCRDRL 2>/dev/null; then
COUNT=$(($COUNT + 1))
fi
if [ "$COUNT" -ge "$LIMIT" ]; then
break
fi
done < "${TMP_LIST_FILE}"
rm -f "${TMP_LIST_FILE}"
The mv will sometimes fail, in which case you don't count the file and try to move the next one, assuming the mv failed because the file was meanwhile moved by the other script. Each script moves at most $LIMIT files, but which ones it gets may be rather random.
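The file-descriptor variant of the first option can be sketched as follows, assuming flock(1) from util-linux is available. The function name move_batch and the lock file name .lock are placeholders rather than anything from the question's script. The lock is held only while the subshell runs and is released automatically when descriptor 9 is closed:

```shell
#!/bin/bash
# move_batch SRC DEST LIMIT: move up to LIMIT *.json files from SRC to DEST
# while holding an exclusive lock, so concurrent callers never pick the same files.
move_batch() {
    local src=$1 dest=$2 limit=$3 count=0 f
    (
        flock 9                          # blocks until the lock is free
        for f in "$src"/*.json; do
            [ -e "$f" ] || continue      # the glob matched nothing
            [ "$count" -ge "$limit" ] && break
            mv -- "$f" "$dest"/ && count=$((count + 1))
        done
    ) 9>"$src/.lock"
}
```

Two cron jobs would then call, for example, move_batch targdir /tmp/task1 10 and move_batch targdir /tmp/task2 10. Whichever gets the lock first takes its full batch before the other one lists the directory.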
On a side note, if you don't absolutely need variables set inside the while loop to survive it, you can do without a temporary file. Simply:
ls -1h targdir/*.json | while read REMOTE_FILE
do
...
done
You can't propagate variables out of such a loop, because as part of a pipeline it runs in a subshell.
If you do need to set variables and can live with using bash specifically (I usually try to stick to /bin/sh), you can also write
while read REMOTE_FILE
do
...
done < <(ls -1h targdir/*.json)
In this case the loop runs in the current shell, but process substitution, <(...), is a bash extension.
The fact that two cron jobs move the same file to the same path should not matter for you unless you are disturbed by the error you get from one of them (one will succeed and the other will fail).
You can ignore the error by using:
...
mv $REMOTE_FILE $SCRDRL 2>/dev/null
...
Since your script is supposed to move a specific number of files from the list, two instances will at best move twice as many files. If they interfere with each other, the number of moved files may be lower.
In any case, this is probably a bad situation to begin with. If you have any way of preventing two scripts running at the same time, you should do that.
If, however, you have no way of preventing two script instances from running at the same time, you should at least harden the scripts against errors:
mv "$REMOTE_FILE" "$SCRDRL" 2>/dev/null
Otherwise your scripts will produce error output (not a good idea in a cron script).
Further, I hope that your ${TMP_LIST_FILE} is not the same in both instances (you could use $$ in its name to ensure that); otherwise they would overwrite each other's temp file, in the worst case leaving a corrupted file that contains paths you do not want to move.
