shell script to move data from other servers' (or nodes') sub-dirs to the current server's (node's) matching sub-dir - linux

I have .parquet files for multiple dates (from 20190927 to 20200131) under a /data/pg/export/schema.table_YYYYMMDD<random alphanumeric string> directory structure on seven different nodes. When the process ran, it created a sub-directory in the schema.table_YYYYMMDD<random alphanumeric string> format (such as schema.table_20190927) inside the /data/pg/export path for each date. However, on the other hosts it appended a random alphanumeric string to the sub-directory name. So, for instance, I have folders and files in the following format:
on node#1 (10.245.122.100)
/data/pg/export/schema.table_20190927 contains:
----1.parquet
----2.parquet
----3.parquet
on node#2 (10.245.122.101)
/data/pg/export/schema.table_20190927S8rW4dQ2 contains:
----4.parquet
----5.parquet
----6.parquet
on node#3 (10.245.122.102)
/data/pg/export/schema.table_20190927P5SJ9aX4 contains:
----7.parquet
----8.parquet
----9.parquet
and so on for other nodes.
How can I bring the files from /data/pg/export/schema.table_20190927S8rW4dQ2 on node#2 (10.245.122.101) and /data/pg/export/schema.table_20190927P5SJ9aX4 on node#3 (10.245.122.102) (and similarly for the other hosts) to /data/pg/export/schema.table_20190927 on node#1 (10.245.122.100), so that the
final output looks like:
***on node#1 (10.245.122.100)***
/data/pg/export/schema.table_20190927 will have:
----1.parquet
----2.parquet
----3.parquet
----4.parquet
----5.parquet
----6.parquet
----7.parquet
----8.parquet
----9.parquet

Welcome to SO. Since it is your first question (well, the first I see), and I liked the challenge, here is a script that will do that. For your next question, you should provide your own code with a specific problem you are having, and not expect a complete script as an answer. See my comment for things to read on using SO.
The bash knowledge required to make this work is:
while loop
date calculation
variable value incrementation (so basic math)
I made some assumptions:
you have a single user on all nodes which can be used to run scp from node1
that user is hopefully set up to use ssh keys to log in, otherwise you will type your password a lot of times! (see the sketch after this list)
you have connected at least once to each node, so they are listed in your known_hosts file
on each node, there is one and only one directory with a specific date in its name.
all files in each directory are copied. You can modify the scp command to get only the .parquet files if you want.
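If the key-based login is not yet in place, a minimal sketch of that setup, run once from node1, could look like this (the user name and the remaining node IPs are assumptions based on the numbering in the question):
ssh-keygen -t ed25519                     # accept the defaults; creates ~/.ssh/id_ed25519
for ip in 10.245.122.101 10.245.122.102 10.245.122.103 10.245.122.104 10.245.122.105 10.245.122.106; do
    ssh-copy-id "YOURUSER@${ip}"          # copies the public key and records the host in known_hosts
done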
Basic ideas in the code
loop on each node, so from 2 to 7
loop on dates, so from 20190927 to 20200131
copy files for each node, each date within the loops
this was tested on Linux Mint (== Ubuntu), so the date command is the GNU version, which allows date calculation the way I did it.
Before use, modify the value of the user variable with your user name.
DISCLAIMER: I did not have multiple systems on which to test the scp command, so that command was written from memory.
The code:
#!/bin/bash
#
# This script runs on node1
# The node1 IP is 10.245.122.100
#
# This script assumes that you want to copy all files under
# /data/pg/export/schema.table_YYYYMMDD<random>
#
###############################################################################

# node1 variables
targetdirprefix="/data/pg/export/schema.table_"
user="YOURUSER"

# Other nodes variables
total_number_nodes=7    # includes node1
ip_prefix=10.245.122
ip_lastdigit_start=99   # node1 == 100, so start at 99

# loop on nodes ---------------------------------------------------------------
# start at node 2, node1 is the target node
nodecount=2
# Stop at maxnode+1, here the last node will be 7
(( whileexit = total_number_nodes + 1 ))

while [[ "$nodecount" -lt "$whileexit" ]]
do
    # build the current node IP
    (( currentnode_lastdigit = ip_lastdigit_start + nodecount ))
    currentnode_ip="${ip_prefix}.${currentnode_lastdigit}"

    # DEBUG
    echo "nodecount=$nodecount, ip=$currentnode_ip"

    # loop on dates ---------------------------------------
    firstdate="20190927"
    lastdate="20200131"
    loopdate="$firstdate"

    while [[ "$loopdate" -le "$lastdate" ]]
    do
        # DEBUG
        echo "loopdate=$loopdate"

        # go into the target directory (create it if required)
        targetdir="${targetdirprefix}${loopdate}"
        if [[ -d "$targetdir" ]]
        then
            cd "$targetdir"
        else
            mkdir -p "$targetdir"
            if [[ "$?" -ne 0 ]]
            then
                echo "ERROR: could not create directory $targetdir, exiting."
                exit 1
            else
                cd "$targetdir"
            fi
        fi

        # copy the date's files into the target dir (i.e. locally, since we did a cd before)
        # the source directory is the same as the targetdir, with extra chars at the end,
        # so match it with a glob and copy the files inside it
        # this script assumes there is only 1 directory with that particular date!
        scp "${user}@${currentnode_ip}:${targetdir}*/*" .
        if [[ "$?" -ne 0 ]]
        then
            echo "WARNING: copy failed from node $nodecount, date $loopdate."
            echo "         The script will continue for other dates and nodes..."
        fi

        loopdate=$(date --date "$loopdate +1 days" +%Y%m%d)
    done

    (( nodecount += 1 ))
done
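As a quick sanity check after a run (not part of the script above, just a sketch), you can count the .parquet files collected per date directory on node1:
for d in /data/pg/export/schema.table_*; do
    printf '%s: %s parquet files\n' "$d" "$(find "$d" -maxdepth 1 -name '*.parquet' | wc -l)"
done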

Related

Bash script deletes files older than N days using lftp - but does not remove recursive directories and files

I have finally got this script working: it logs on to my remote FTP and removes files in a folder that are older than N days. I cannot, however, get it to also remove directories recursively. What can be changed or added to make this script remove files in subfolders, as well as subfolders that are themselves older than N days? I have tried adding the -r option in a few places, but it did not work. I think it needs to be added where the script builds the list of files to be removed. Any help would be greatly appreciated. Thank you in advance!
#!/bin/bash
# Simple script to delete files older than a specific number of days from the FTP.
# This script uses 'lftp', and 'date' with the '-d' option, which is not POSIX compatible.
# FTP credentials and path
FTP_HOST="xxxxxxxxxxxx"
FTP_USER="xxxxxx"
FTP_PASS="xxxxxxxxxxxxxxxxx"
FTP_PATH="/directadmin"
# Full path to lftp executable
LFTP=`which lftp`
# Get the number of days to store from the 1st passed argument, or hardcode it; uncomment one to use
STORE_DAYS=${1:? "Usage ${0##*/} X, where X - count of daily archives to store"}
# STORE_DAYS=7
function removeOlderThanDays() {
# Make some temp files to store intermediate data
LIST=`mktemp`
DELLIST=`mktemp`
# Connect to ftp get file list and store it into temp file
${LFTP} << EOF
open ${FTP_USER}:${FTP_PASS}@${FTP_HOST}
cd ${FTP_PATH}
cache flush
cls -q -1 --date --time-style="+%Y%m%d" > ${LIST}
quit
EOF
# Print obtained list, uncomment for debug
# echo "File list"
# cat ${LIST}
# Delete list header, uncomment for debug
# echo "Delete list"
# Let's find date to compare
STORE_DATE=$(date -d "now - ${STORE_DAYS} days" '+%Y%m%d')
while read LINE; do
if [[ ${STORE_DATE} -ge ${LINE:0:8} && "${LINE}" != *\/ ]]; then
echo "rm -f \"${LINE:9}\"" >> ${DELLIST}
# Print files which are subject to deletion, uncomment for debug
#echo "${LINE:9}"
fi
done < ${LIST}
# More debug strings
# echo "Delete list complete"
# Print notify if list is empty and exit.
if [ ! -f ${DELLIST} ] || [ -z "$(cat ${DELLIST})" ]; then
echo "Delete list doesn't exist or empty, nothing to delete. Exiting"
exit 0;
fi
# Connect to ftp and delete files by previously formed list
${LFTP} << EOF
open ${FTP_USER}:${FTP_PASS}@${FTP_HOST}
cd ${FTP_PATH}
$(cat ${DELLIST})
quit
EOF
}

removeOlderThanDays
I have addressed this sort of thing a few times.
How to connect to a ftp server via bash script?
Provide commands automatically to ftp in bash script
Bash FTP upload - events to log
Better to use scp and/or ssh when you can, especially if you can set up passwordless access with public keys. Otherwise, I recommend a more robust language like Python or Perl that lets you check the return codes of these steps individually and respond accordingly.
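For instance, if ssh access is available, the same retention cleanup could be sketched roughly as follows (user, host and retention period are placeholders; note that find's -delete removes directories only once they are empty):
STORE_DAYS=7   # keep the last 7 days
ssh user@host "find /directadmin -mindepth 1 -mtime +${STORE_DAYS} -delete"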

How can I stop my script to overwrite existing files

I have been learning bash for 6 days and I think I've got some of the basics.
Anyway, for the wallpapers downloaded from Variety I've written two scripts. One of them moves downloaded photos older than 12 days to a folder and renames them all as "Aday 1,2,3..."; the other lets me select some of these, moves them to another folder, and removes the photos I didn't select. The 1st script works just as I intended; my question is about the other.
I think I should write the script down to better explain my problem.
Script:
#!/bin/bash
#Move victors of 'Seçme-Eleme' to 'Kazananlar'
cd /home/eurydice/Bulunur\ Bir\ Şeyler/Dosyamsılar/Seçme-Eleme
echo "Select victors"
read vct
for i in $vct; do
mv -i "Aday $i.png" /home/eurydice/"Bulunur Bir Şeyler"/Dosyamsılar/Kazananlar/"Bahar $RANDOM.png" ;
mv -i "Aday $i.jpg" /home/eurydice/"Bulunur Bir Şeyler"/Dosyamsılar/Kazananlar/"Bahar $RANDOM.jpg" ;
done
#Now let's remove the rest
rm /home/eurydice/Bulunur\ Bir\ Şeyler/Dosyamsılar/Seçme-Eleme/*
In this script I originally intended to define another variable (let's call it "n"), which I did by copying and adapting the counter from the first script. It was something like this:
for i in $vct; do
n=1
mv "Aday $i.png" /home/eurydice/"Bulunur Bir Şeyler"/Dosyamsılar/Kazananlar/"Bahar $n.png" ;
mv "Aday $i.jpg" /home/eurydice/"Bulunur Bir Şeyler"/Dosyamsılar/Kazananlar/"Bahar $n.jpg" ;
n=$((n+1))
done
When I did that for the first time the script worked just as I intended. However, in my 2nd test run the script overwrote the files that already existed. I mean, for example, in the 1st run I had 5 files named "Bahar 1,2,3,4,5" and the 2nd time I chose 3 files to add. I wanted their names to be "Bahar 6,7,8", but instead my script made them the new 1, 2 and 3. I tried many solutions and when I couldn't fix it I just assigned random numbers to them.
Is there a way to make this script work as I intended?
This command finds the biggest file name number amongst the files in the current directory. If no file is found, the biggest number is set to 0.
biggest_number=$(ls -1 | sed -n 's/^[^0-9]*\([0-9]\+\)\(\.[a-zA-Z]\+\)\?$/\1/p' | sort -r -g | head -n 1)
[[ ! -z "$biggest_number" ]] || biggest_number=0
The regex in sed command assumes that there is no digit in filenames before the trailing number intended for increment.
As soon as you have found the biggest number, you can use it to start your loop to prevent overwrites.
n=$((biggest_number+1))
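Putting this together with the loop from the question might look roughly like this (a sketch only: the paths come from the question, and the counter is now seeded from the destination folder so existing "Bahar N" files are never overwritten):
dest="/home/eurydice/Bulunur Bir Şeyler/Dosyamsılar/Kazananlar"
cd "/home/eurydice/Bulunur Bir Şeyler/Dosyamsılar/Seçme-Eleme" || exit 1

# highest "Bahar N" number already present in the destination (0 if none)
biggest_number=$(ls -1 "$dest" | sed -n 's/^[^0-9]*\([0-9]\+\)\(\.[a-zA-Z]\+\)\?$/\1/p' | sort -r -g | head -n 1)
[[ ! -z "$biggest_number" ]] || biggest_number=0

echo "Select victors"
read vct
n=$((biggest_number+1))
for i in $vct; do
    mv "Aday $i.png" "$dest/Bahar $n.png"
    mv "Aday $i.jpg" "$dest/Bahar $n.jpg"
    n=$((n+1))
done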

variable part in a variable path in ksh script

I'm sorry if something similar has already been answered in the past, but I wasn't able to find it. I'm writing a script to perform some housekeeping tasks, and I got stuck in the step below. To put you in the picture, it's a script which reads a config file so that it can be used as a standard protocol in different environments.
The problem is with this code:
# Check if destination folder exist, if not create it.
if [ ! -d ${V_DestFolder} ]; then # Create folder
F_Log "${IF_ROOT} mkdir -p ${V_DestFolder}"
${IF_ROOT} mkdir -p ${V_DestFolder}
continue
fi
# If movement, check write permissions of destination folder.
V_CheckIfMovement=`echo $1|grep #`
if [ $? -eq 0 ]; then # File will be moved.
V_DestFolder=`echo $1|awk -F"#" {'print $2'}`
if [ ! -w ${V_DestFolder} ]; then # Destination folder IS NOT writable.
F_Log "Destination folder ${V_DestFolder} does not have WRITE permissions. Skipping."
continue
fi
fi
Basically, in this step I need to move some files from one path to another.
The script checks if the folder (name read from the config file) exists; if not, it is created. After that it checks whether the folder has write permissions and moves the files.
Here you can see the part of config file which is read in this step:
app/tom*/instances/*/logs|+9|/.*\.gz)$/|move#/app/archive/tom*/logs
I should mention that the files are moved properly when I replace the tom* in the destination with anything not containing a * (such as "test"), as expected.
What I need to know is how I can use a variable for the "tom*" part of the destination. The variable should contain the same tom* name as in the source, which I use as the name of the cell.
This is because I use different tomcat cells referenced as tom7 or tom8 plus 3 letters describing each one, for example tom7dog or tom7cat.
You should give the shell a chance to evaluate the glob:
V_DestFolder=`echo $1|awk -F"#" {'print $2'}`
for p in ${V_DestFolder}; do
    if [ ! -w ${p} ]; then
        F_Log "Destination folder ${p} does not have WRITE permissions. Skipping."
    fi
done

Homebrew: Pulling updates from a repository with BASH and GPG

I have a fleet of linux computers ("nodes" from here on out) who are what I'll call ephemeral members of a network. The nodes are vehicle mounted and frequently move into and out of wifi coverage.
Of course, it's often beneficial for me to push the update of a single script, program or file to all nodes. What I came up with is this:
Generate a key pair to be shared by all nodes
Encrypt the new file version, with a header that contains the installation path, on my workstation (see the sketch after this list). My workstation of course has the public key.
Place the encrypted update in a node-accessible network "staging" folder
When a node finds itself with a good connection, it checks the staging folder.
If there are new files, they're:
copied to the node
decrypted
checked for integrity("Does the file header look good?")
moved to the location prescribed by the header
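For reference, the workstation-side packaging step described in this list might look roughly like this (a sketch only; the recipient key, file name and install path are placeholders, and it assumes public-key encryption as described above):
# prepend the install-path header, then encrypt for the shared node key
{ echo "#LOOKSGOOD:/usr/local/bin/myscript"; cat myscript; } \
    | gpg --encrypt --recipient nodes@example.org --output /remoteDirectory/stage/myscript.sud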
Here's a simple version of my code. Is this a bad idea? Is there a more elegant way to deal with updating unattended nodes on a super flaky connection?
#!/bin/bash
#A method for autonomously retrieving distributed updates

#The latest and greatest files are here:
stageDir="/remoteDirectory/stage"
#Files are initially moved to a quarantine area
qDir="/localDirectory/quarantine"
#If all went well, put a copy of the encrypted file here:
aDir="/localDirectory/pulled"
#generic extension for encrypted files "Secure Up Date"
ext="sud"

for file in "$stageDir"/*."$ext"; do          #For each "sud" file...
    fname=$(basename $file)
    if [ ! -f $aDir/$fname ]; then            #If this file has not already been worked on...
        cp "$file" "$qDir"/"$fname"           #Move it to the quarantine directory
    else
        echo "$fname has already been pulled" #Move along
    fi
done

if [ "$(ls $qDir)" ]; then                    #If there's something to do (i.e. files in the directory)
    for file in "$qDir"/*."$ext"; do
        fname=$(basename $file)
        qPath="$qDir/$fname"
        untrusted="$qPath.untrusted"
        #Decrypt file
        gpg --output "$untrusted" --yes --passphrase "supersecretpassphrase" --decrypt "$qPath" #Say yes to overwriting
        headline=$(head -n 1 $untrusted)      #Get the header (which is the first line of the file)
        #Check to see if this is a valid file
        if [[ $headline == "#LOOKSGOOD:"* ]]; then        #All headers must start with "#LOOKSGOOD:" or something
            #Get install path
            installPath=$(echo $headline | cut -d ':' -f 2) #Get the stuff after the colon
            tail -n +2 $untrusted > $installPath            #Send everything but the header line to the install path
            #Clean up our working files
            rm $untrusted
            mv $qPath "$aDir/$fname"
            #Report what we did
            echo $headline
        else
            #trash the file if it's not a legit file
            echo "$fname is not a legit update...trashing it"
            rm "$qDir/$fname"*
        fi
    done
fi

bash script to watch a folder

I have the following situation:
There is a Windows folder that has been mounted on a Linux machine. There could be multiple folders (set up beforehand)
in this Windows mount. I have to do something (preferably a script, to start with) to watch these folders.
These are the steps:
Watch for any incoming file(s). Make sure they are transferred completely.
Move it to another folder.
I do not have any control over the file transfer program on the Windows machine. It is a secure FTP, I believe,
so I cannot ask that process to send me a trailer file to confirm the completion of the file transfer.
I have written a bash script. I would like to know about any potential pitfalls with this approach, because
there is a possibility of multiple copies of this script running for multiple directories like this.
At the moment, there could be up to 100 directories that may have to be monitored.
Following is the script. I'm sorry for pasting a very long one here. Please take your time to review it and
comment / criticize it. :-)
It takes 3 parameters: the folder that has to be watched, the folder where the files have to be moved,
and a time interval, which is explained below.
I'm sorry, there seems to be a problem with the alignment. Markdown doesn't seem to like it. I tried to organize it properly, but was not able to do so.
Linux servername 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:27:17 EDT 2006 i686 i686 i386
GNU/Linux
#!/bin/bash

log_this()
{
    message="$1"
    now=`date "+%D-%T"`
    echo $$": "$now ": " $message
}

usage()
{
    cat << EOF
Usage: $0 <Directory to be watched> <Directory to transfer> <time interval>
Time interval is the amount of time after which the modification time of a
file will be monitored.
EOF
    exit 1
}

if [ $# -lt 2 ]
then
    usage
fi

WATCH_DIR=$1
APP_DIR=$2

if [ ! -d "$WATCH_DIR" ]
then
    log_this "FATAL: WATCH_DIR, $WATCH_DIR does not exist. Exiting"
    exit 1
fi

if [ ! -d "$APP_DIR" ]
then
    log_this "APP_DIR: $APP_DIR does not exist. Exiting"
    exit 1
fi

# This needs to be set after considering the rate of file transfer.
# Represents the seconds elapsed after the last modification to the file.
# If not supplied as parameter, defaults to 3.
seconds_between_mods=$3
if ! [[ "$seconds_between_mods" =~ ^[0-9]+$ ]]; then
    if [ ${#seconds_between_mods} -eq 0 ]; then
        log_this "No value supplied for elapse time. Defaulting to 3."
        seconds_between_mods=3
    else
        log_this "Invalid value provided for elapse time"
        exit 1
    fi
fi

log_this "Start Monitor."

while true
do
    ls -1 $WATCH_DIR | while read file_name
    do
        log_this "Start Monitoring for $file_name"
        # Refer only the modification with reference to the mount folder.
        # If there is a diff in time between servers, we are in trouble.
        token_file=$WATCH_DIR/foo.$$
        current_time=`touch $token_file && stat -c "%Y" $token_file`
        rm -f $token_file 2>/dev/null
        log_this "Current Time: $current_time"
        last_mod_time=`stat -c "%Y" $WATCH_DIR/$file_name`
        elapsed_time=`expr $current_time - $last_mod_time`
        log_this "Elapsed time ==> $elapsed_time"
        if [ $elapsed_time -ge $seconds_between_mods ]
        then
            log_this "Moving $file_name to $APP_DIR"
            # In case there is no space left on the target mount, hide the file
            # in the mount itself and remove the incomplete file from APP_DIR.
            mv $WATCH_DIR/$file_name $APP_DIR
            if [ $? -ne 0 ]
            then
                log_this "FATAL: mv failed!! Hiding $file_name"
                rm $APP_DIR/$file_name
                mv $WATCH_DIR/$file_name $WATCH_DIR/.$file_name
                log_this "Removed $APP_DIR/$file_name. Look for $WATCH_DIR/.$file_name and submit later."
            fi
            log_this "End Monitoring for $file_name"
        else
            log_this "$file_name: Transfer seems to be in progress"
        fi
    done
    log_this "Nothing more to monitor."
    echo
    sleep 5
done
This isn't going to work for any length of time. In production, you will have network problems and other errors which can leave a partial file in the upload directory. I also don't like the idea of a "trailer" file. The usual approach is to upload the file under a temporary name and then rename it after the upload completes.
This way, you just have to list the directory, filter the temporary names out, and if there is anything left, use it.
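A hedged sketch of that idea, reusing the WATCH_DIR/APP_DIR names from the question (the temporary suffixes are placeholders and must match whatever the uploader uses):
for f in "$WATCH_DIR"/*; do
    case "$f" in
        *.part|*.tmp) continue ;;   # still being uploaded under its temporary name
    esac
    mv "$f" "$APP_DIR"/
done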
If you can't make this change, then ask your boss for written permission to implement something which can lead to arbitrary data corruption. This serves two purposes: 1) to make them understand that this is a real problem and not something you made up, and 2) to protect yourself when it breaks ... because it will, and guess who'll get all the blame?
I believe a much saner approach would be the use of a kernel-level filesystem notification mechanism, such as inotify. Also get the tools here.
incron is an "inotify cron" system. It consists of a daemon and a table manipulator. You can use it in a similar way to regular cron. The difference is that the inotify cron handles filesystem events rather than time periods.
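For illustration, an incrontab entry might look like this (a sketch; the watched path and handler script are placeholders):
# incrontab -e
# <watched path>    <event mask>       <command>   ($@ = watched dir, $# = file name)
/data/incoming      IN_CLOSE_WRITE     /usr/local/bin/handle_upload.sh $@/$#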
First, make sure inotify-tools is installed.
Then use them like this:
logOfChanges="/tmp/changes.log.csv" # Set your file name here.
# Lock and load
inotifywait -mrcq $DIR > "$logOfChanges" & # monitor, recursively, output CSV, be quiet.
IN_PID=$!   # PID of the backgrounded inotifywait
# Do your stuff here
...
# Kill and analyze
kill $IN_PID
cat "$logOfChanges" | while read entry; do
# Split your CSV, but beware that file names may contain spaces too.
# Just look up how to parse CSV with bash. :)
path=...
event=...
... # Other stuff like time stamps
# Depending on the event…
case "$event" in
SOME_EVENT) myHandlingCode path ;;
...
*) myDefaultHandlingCode path ;;
esac
done
Alternatively, using --format instead of -c on inotifywait would be an idea.
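For example (a sketch; the chosen event and format string are illustrative):
inotifywait -mrq -e close_write --format '%w%f %e' "$DIR" > "$logOfChanges" &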
Just man inotifywait and man inotifywatch for more info.
To be honest, a Python app set up to run at start-up will do this quickly and efficiently. Python has amazing OS support and is rather complete.
Running the script will likely work, but it will be troublesome to maintain and manage. I take it you will run these as frequent cron jobs?
To get you off your feet, here is a small app I wrote which takes a path and looks at the binary output of jpeg files. I never quite finished it, but it will get you started and show you the structure of Python as well as some use of os.
I wouldn't spend too much time worrying about my code.
import time, os, sys

#analyze() takes in a path and moves into the output_files folder, to then analyze files
def analyze(path):
    list_outputfiles = os.listdir(path + "/output_files")
    print list_outputfiles
    for i in range(len(list_outputfiles)):
        #print list_outputfiles[i]
        f = open(list_outputfiles[i], 'r')
        f.readlines()

#txtmaker reads the media file and writes its binary contents to a text file.
def txtmaker(c_file):
    print c_file
    os.system("cat" + " " + c_file + ">" + " " + c_file +".txt")
    os.system("mv *.txt output_files")

#parser() takes in the inputted path, reads and lists all files, creates a directory, then calls txtmaker.
def parser(path):
    os.chdir(path)
    os.mkdir(path + "/output_files", 0777)
    list_files = os.listdir(path)
    for i in range(len(list_files)):
        if os.path.isdir(list_files[i]) == True:
            print (list_files[i], "is a directory")
        else:
            txtmaker(list_files[i])
    analyze(path)

def main():
    path = raw_input("Enter the full path to the media: ")
    parser(path)

if __name__ == '__main__':
    main()
