Listing files while working with them - Shell / Linux

I have a database server whose basic job is to import some specific files, do some calculations and provide the data in a web interface.
A hardware replacement is planned for the coming weeks, so the database needs to be migrated. But there's one problem: the current database is corrupted and shows errors in the web interface. This is due to the server freezing while importing/calculating, which is why it's being replaced.
So I'm not willing to just dump the db and restore it on the new server. It makes no sense to keep using the corrupted database, and the old server slows to a crawl while dumping. I have a backup of all the files to be imported (currently 551 of them) and I'm working on a script to "re-import" all of them and have a clean database again.
The current server takes ~20 minutes to import each new file. Let's say the new server takes 10 minutes per file thanks to its extra power... It's still a long time! And here comes the problem: the server receives a new file hourly, so there will be more files by the time it finishes the job.
The restore script starts like this:
for a in $(ls $BACKUP_DIR | grep part_of_filename); do
The question is: will this "ls" pick up new file names as they arrive? File names are timestamp based, so they would appear at the end of the list.
Or is the "ls" executed once, with its results going to a temporary variable?
Thanks.

ls will execute once, at the beginning, and any new files won't show up.
You can rewrite that statement to list the files again at the start of each loop (and, as Trey mentioned, it's better to use find than ls):
while all=$(find "$BACKUP_DIR"/* -type f | grep part_of_filename); do
for a in $all; do
But this has a major problem: it will repeatedly process the same files over and over again.
The script needs to record which files are done. Then it can list the directory again and process any (and only) new files. Here's one way:
touch ~/done.list
cd $BACKUP_DIR
# loop while f=first file not in done list:
# find lists the files; more portable and safer than ls in pipes and scripts
# fgrep -v -f ~/done.list passes through only files not in the done list
# head -n1 passes through only the first one
# grep . controls the loop (true iff there is something)
while f=`find * -type f | fgrep -v -f ~/done.list | head -n1 | grep .`; do
<process file $f>
echo "$f" >> ~/done.list
done
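Since new files keep arriving hourly, you may not want the loop to exit once the directory is drained. A variant that waits and re-checks, built on the same done-list idea (a sketch; assumes the cd $BACKUP_DIR and done.list setup above, and the sleep interval is up to you):
while true; do
    f=$(find * -type f | fgrep -v -f ~/done.list | head -n1)
    if [ -n "$f" ]; then
        <process file $f>
        echo "$f" >> ~/done.list
    else
        sleep 60   # nothing new yet; check again in a minute
    fi
done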

Related

Tail command not tailing the newly created files under a directory in linux

I am trying to tail all the log files present under a directory and its sub-directories recursively, using the below command
shopt -s globstar
tail -f -n +2 /app/mylogs/**/* | awk '/^==> / {a=substr($0, 5, length-8); next} {print a":"$0}'
and the output is below:
/app/mylogs/myapplog10062020.log:Hi this is first line
/app/mylogs/myapplog10062020.log:Hi this is second line
which is fine, but the problem is when I add a new log file under the /app/mylogs/ directory after I fire the above tail command: tail will not take that new file into consideration.
Is there a way to get this done?
When you start the tail process, you pass it a (fixed) list of the files which tail is supposed to follow, as you can see from the tail man page. This is different to, say, find, where you can pass a file name pattern in its options. After the process has been started, tail has no way of knowing that you suddenly want it to follow another file too.
If you want a feature like this, you would have to program your own version of tail which, for instance, gets passed a directory to scan, and either periodically checks the directory content for changes or uses a service such as inotify to be informed of directory changes.
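If you don't want to write your own tool, a small wrapper around inotifywait (from inotify-tools) can approximate this; a minimal sketch, assuming the /app/mylogs layout from the question:
# watch for newly created files and start a tail for each one,
# prefixing every line with its file name, as above
inotifywait -m -r -e create --format '%w%f' /app/mylogs |
while read -r f; do
    tail -f -n +2 "$f" | sed "s|^|$f:|" &
done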

How can dmenu show input as soon as there is input from pipe?

TL;DR
Here is the default behavior.
find ~/ -name '*.git' 2>/dev/null | dmenu
# Searches everything in home directory and shows output
Time taken about 1-2 sec
What I want:
find ~/ -name '*.git' 2>/dev/null | less
# Show as soon as it finds result. How to get similar output in dmenu?
As the number of files on my PC increases, this is going to take even longer.
Detailed description:
I am piping input into dmenu from a find command which takes about 1-2 seconds. Is it possible for dmenu to show input as soon as there is some input in the pipe? After all, that's how piping basically works. It seems like dmenu waits until all the entries are in the pipe so that the user can search through them, which looks legitimate, but can this be avoided? I would like dmenu to show up as soon as there is input in the buffer.
I found a workaround here to decrease the time compared to find: instead of find, locate can be used. The command goes like
locate -r '/home/'"$USER"'/.*\.git$'
-r takes a regular expression as input. The argument to -r here filters all git repositories inside /home/$USER. This is a bit faster than using find.
The catch with locate
locate uses a local database for searching, so it will only work as expected once that local database has been built/updated.
To update the database, use sudo updatedb. Whenever you add/move/delete a file (or a directory in this case), remember to update the database so that locate gives proper results.
Tip
To avoid entering a password every time for updatedb (and other frequently used commands), add them to the sudoers file by executing sudo visudo and adding an entry for the path to the command's binary.
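For example, an entry along these lines ("yourusername" is a placeholder, and the binary path may differ on your system, so check it with which updatedb):
yourusername ALL=(ALL) NOPASSWD: /usr/bin/updatedb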
Update
I recently realized: why use locate when I can simply maintain my own database and cat all the entries into dmenu? With this I was able to achieve what I needed.
# Make a temp directory
mkdir -p $HOME/.tmp
# Search for all git directories and store them in ~/.tmp/gitfiles.
[ -e $HOME/.tmp/gitfiles ] || find $HOME/ -regex '.*/\.git$' -type d 2>/dev/null > $HOME/.tmp/gitfiles
# cat this file into dmenu
cat $HOME/.tmp/gitfiles | dmenu
This gives fuzzy finding over the directories with dmenu. It is better than using locate: a local database needs updating either way, but locate filters the git paths at runtime, so it is a bit slower than reading a prebuilt list.
I can simply create an alias to update this database, analogous to sudo updatedb in the locate case:
alias gitdbupdate="find $HOME/ -regex .*/\.git$ -type d 2>/dev/null > $HOME/.tmp/gitfiles"
Note that I am not using /tmp/ as it isn't persistent across power cycles, so I create my own $HOME/.tmp/ directory instead.

Howto find process that processes a bash pipe

I am running a while loop in bash to periodically delete all "old" files in a directory (keeping only the 100 newest files). This process is run in the background:
((cd /tmp/test && while true; do ls -t | sed -e '1,100d' | xargs -I{} -d '\n' rm -R {}; sleep 1; done)&)
How do I find this bash process in ps?
The objective: as part of a larger script, I would like to automatically detect whether this process is already running, and start it if not.
[edit / clarification] Saving the PID is not a solution, because the script can be executed multiple times. It is supposed to ensure that the machine is prepared for a following process. The intention is that the user / developer can just run it to make sure everything is setup. If parts of the conditions are already fulfilled, they will be skipped. Most of what needs to be run are idempotent commands. This is the only command I am struggling to make idempotent. It is an intermediate hack until we have a proper provisioning system in place.
One way would be to do something like this in your script:
lock="/tmp/${name}.lock"
if [ -f "${lock}" ];then
echo "already running. pid: $(cat "${lock}")"
exit 1
fi
trap 'rm "${lock}" 2>/dev/null; exit' 0 1 2 15
echo $$ > "${lock}"
Instead of deleting the N oldest files I'd suggest deleting files older than a certain age, like N days, like so:
N=5 # delete everything older than N days
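# note: -depth 1 and the "d" suffix on -ctime are BSD find syntax;
# with GNU find use -mindepth 1 -maxdepth 1 and a bare -ctime +"${N}"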
find /tmp/test -ctime +"${N}"d -depth 1 -exec rm -r {} +
This has a lot of benefits, like being able to handle files with spaces or other odd chars in their names. It is also more predictable: if your process creates, say, 10000 files in /tmp/test, then deleting only 100 is going to leave you a mess; the other way around, it may delete new files you really wanted to keep. Notice also that if you have new files in an old directory, those new files will be wiped out. -depth 1 keeps /tmp/test itself from being deleted.
If you really want to go with deleting the oldest N files, I'd do it within a script, run it as something like thecleaner.py &, and search ps for that.
#! /usr/bin/env python
# -*- coding: UTF8 -*-
import os
import shutil
import time
secs = 5 # number of seconds to pause during each loop
N = 100 # delete this many of the oldest files
top = "/tmp/test" # remove N oldest files in this directory
while True:
    # build full paths so getctime and the removal calls work from any cwd
    paths = sorted((os.path.join(top, name) for name in os.listdir(top)),
                   key=os.path.getctime)
    oldest = paths[:N]
    for path in oldest:
        print("removing: {}".format(path))
        if os.path.isfile(path):
            os.remove(path)
        else:
            shutil.rmtree(path) # delete entire directory tree
    time.sleep(secs)
Otherwise I don't know of any way to reliably find the loop you've given.
ps aux shows most processes with the command line that actually ran them, so you can grep for what you need. lsof might show some more, probably too much.
If you need to create something easy to grep, put it in a file, say ~/my.cmd, and run
bash --init-file ~/my.cmd
or just make it runnable and run my.cmd.
You could even use setsid with these to detach from the terminal, so it always runs in the background. Then just
ps aux | grep "my.cmd"
should identify it if there are two hits (one for the grep itself, the other for the actual run). You can pipe to | wc -l and check whether it returns 2.
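Alternatively, pgrep -f sidesteps the grep-matches-itself problem; a minimal sketch, reusing the hypothetical thecleaner.py name from above:
# start the cleaner only if it isn't already running
if ! pgrep -f "thecleaner.py" > /dev/null; then
    setsid ./thecleaner.py > /dev/null 2>&1 &
fi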
I suggest you check out crontab though, it seems better suited to what you want over all.
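For instance, a crontab entry like this (a sketch; edit with crontab -e) runs the cleanup every minute with no long-lived loop to track at all:
* * * * * cd /tmp/test && ls -t | sed -e '1,100d' | xargs -d '\n' -I{} rm -R {}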

List files which have a corresponding "ready" file

I have a service "A" which generates compressed files comprising the data it receives in requests. In parallel there is another service "B" which consumes these compressed files.
The trick is that "B" shouldn't consume any of the files unless they are written completely. Service "B" deduces this by looking for a ".ready" file that service "A" creates once compression is done, named exactly like the generated file plus the ".ready" extension. Service "B" uses Apache Camel to do this filtering.
Now I am writing a shell script which needs the same compressed files, so the same filtering needs to be implemented in shell. I need help writing this script. I am aware of the find command, but I'm a novice shell user with very limited knowledge.
Example:
Compressed file: sumit_20171118_1.gz
Corresponding ready file: sumit_20171118_1.gz.ready
Another compressed file: sumit_20171118_2.gz
No ready file is present for this one.
Of the above listed files only the first should be picked up as it has a corresponding ready file.
The most obvious way would be to use a busy loop. But if you are on GNU/Linux you can do better than that (from: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-GNU-Parallel-as-dir-processor)
inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |
parallel -uj1 echo Do stuff to file {}
This way you do not even have to wait for the .ready file: The command will only be run when writing to the file is finished and the file is closed.
If, however, the .ready file is only written much later then you can search for that one:
inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |
grep --line-buffered '\.ready$' |
parallel -uj1 echo Do stuff to file {.}
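If a one-shot listing is enough rather than continuous processing, a plain glob loop also works; a minimal sketch, assuming the files sit in the current directory:
# list each .gz file that has a matching .ready marker
for f in *.gz; do
    [ -e "$f.ready" ] && printf '%s\n' "$f"
done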

Linux server: create symbolic links from filenames

I need to write a shell script to run as a cron task, or preferably on creation of a file in a certain folder.
I have an incoming and an outgoing folder (they will be used to log mail). There will be files created with codes as follows...
bmo-001-012-dfd-11 for outgoing and 012-dfd-003-11 for incoming. I need to extract the project/client code (012-dfd) and then place the file in the matching project folder.
Project folders are located in /projects and follow the format 012-dfd. I need to create symbolic links inside the incoming or outgoing folders of the projects, that leads to the correct file in the general incoming and outgoing folders.
/incoming/012-dfd-003-11.pdf -> /projects/012-dfd/incoming/012-dfd-003-11.pdf
/outgoing/bmo-001-012-dfd-11.pdf -> /projects/012-dfd/outgoing/bmo-001-012-dfd-11.pdf
So my questions
How would I make my script run when a file is added to either incoming or outgoing folder
Additionally, are there any associated disadvantages with running upon file modification compared with running as a cron task every 5 mins
How would I get the filename of recent (since script last run) files
How would I extract the code from the filename
How would I use the code to create a symlink in the desired folder
EDIT: What I ended up doing...
while inotifywait outgoing; do find -L . -type l -delete; ls outgoing | php -R '
if(
preg_match("/^\w{3}-\d{3}-(\d{3}-\w{3})-\d{2}(.+)$/", $argn, $m)
&& $m[1] && (file_exists("projects/$m[1]/outgoing/$argn") != TRUE)
){
`ln -s $(pwd)/outgoing/$argn projects/$m[1]/outgoing/$argn;`;
}
'; done;
This works quite well, cleaning up deleted symlinks as well (with find -L . -type l -delete), but I would prefer to do it without the overhead of calling php. I just don't know bash well enough yet.
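For reference, a bash-only equivalent of the above might look like this (a sketch assuming the same directory layout and the naming pattern from the question; bash's built-in regex matching replaces the php call):
while inotifywait -qq -e create -e moved_to outgoing; do
    find -L . -type l -delete   # prune dangling symlinks
    for f in outgoing/*; do
        name=${f##*/}
        # e.g. bmo-001-012-dfd-11.pdf -> capture 012-dfd
        if [[ $name =~ ^[a-z]{3}-[0-9]{3}-([0-9]{3}-[a-z]{3})-[0-9]{2} ]]; then
            code=${BASH_REMATCH[1]}
            [ -e "projects/$code/outgoing/$name" ] ||
                ln -s "$PWD/outgoing/$name" "projects/$code/outgoing/$name"
        fi
    done
done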
Some near-answers for your task breakdown:
On linux, use inotify, possibly through one of its command-line tools, or script language bindings.
See above
Assuming the project code can be extracted positionally from your examples (meaning not only does the project code follow a strict 7-character format, but so does whatever precedes it in the outgoing file name):
echo `basename /incoming/012-dfd-003-11.pdf` | cut -c 1-7
012-dfd
echo `basename /outgoing/bmo-001-012-dfd-11.pdf` | cut -c 9-15
012-dfd
mkdir -p /projects/$i/incoming/ creates the directory /projects/012-dfd/incoming/ if $i is 012-dfd,
ln -s /incoming/foo /projects/$i/incoming/foo creates a symbolic link at the latter path pointing to the preexisting former file /incoming/foo.
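Putting those together for an incoming file (a sketch; the path in $f is just an example):
f=/incoming/012-dfd-003-11.pdf
code=$(basename "$f" | cut -c 1-7)   # -> 012-dfd
mkdir -p "/projects/$code/incoming"
ln -s "$f" "/projects/$code/incoming/$(basename "$f")"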
How would I make my script run when a file is added to either incoming or outgoing folder
Additionally, are there any associated disadvantages with running upon file modification compared with running as a cron task every 5 mins
If a 5-minute delay isn't an issue, I would go for the cron job (it's easier and, IMHO, more flexible)
How would I get the filename of recent (since script last run) files
If your script runs every 5 minutes, then you can tell that all the files created between now and 5 minutes ago are new, so you can list those files using ls or find.
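For example, with GNU find (a sketch; assumes the folder names from the question):
find /incoming /outgoing -type f -mmin -5   # files modified in the last 5 minutes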
How would I extract the code from the filename
You can use the sed command.
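For example, something along these lines (a sketch matching the naming pattern from the question):
echo "bmo-001-012-dfd-11.pdf" | sed -E 's/^[a-z]{3}-[0-9]{3}-([0-9]{3}-[a-z]{3})-.*$/\1/'
# prints: 012-dfd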
How would I use the code to create a symlink in the desired folder
Once you have the desired file names, you can use the ln -s command to create the symbolic link.
