Percentage of completion of script: Name a file with percentage - linux

I have a script that I run on 2,000 servers simultaneously; it creates a temporary working directory on a NAS.
The script builds a list of files. The list could be 1,000 files or 1,000,000 files.
I loop over the list and run some grep commands on each file:
counter=0
num_files=$(wc -l < "$filelist")   # '<' so wc prints only the count, not the file name
cat "$filelist" | while read -r line; do
    do_stuff_here
    counter=$(expr "$counter" + 1)
    percent=$((counter * 100 / num_files))
    ## CREATE a file named "$percent".percent
done
What I am thinking is that I can take the total number of files from the list (wc -l $filelist) and add a counter that I increment by 1 in the loop.
I can then divide $counter by $num_files.
This seems to work, but the problem I have is that I would like to rename the same file on each pass instead of just creating a new one each time. What can I do here?
I do not want this output going to stdout/stderr; I already have enough going to those. I would like to be able to browse to a subdirectory in WinSCP and quickly see how far along each server is.

Try this one:
touch 0.percent
counter=0
num_files=$(wc -l "$filelist")
num_files=${num_files/ */}          # strip the trailing " filename", keeping only the count
cat "$filelist" | while read -r line; do
    do_stuff_here
    # Brace expansion builds "<old>.percent <new>.percent"; the ++counter in the
    # second word bumps the counter, so the same file is renamed in place.
    mv -f {$((counter*100/num_files)),$((++counter*100/num_files))}.percent 2>/dev/null
done
rm -f *.percent
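If the brace-expansion trick is hard to read, the same rename-in-place idea can be spelled out explicitly inside the loop (a sketch using the same variable names as above):

old=$((counter * 100 / num_files))
counter=$((counter + 1))
new=$((counter * 100 / num_files))
if [ "$new" -ne "$old" ]; then
    mv -f "$old.percent" "$new.percent"
fi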

Related

How to view syslog entries since last time I looked

I want to view the entries in Linux /var/log/syslog, but I only want to see the entries since the last time I looked (preferably via a bash script). The solution I thought of was to take a copy of syslog and diff it against the last copy I took, but this seems unclean because syslog can be big and diff adds artifacts to its output. I'm thinking maybe I can somehow use tail directly on syslog, but I don't know how to do that when I don't know how many lines have been added since the last time I tried. Any better thoughts? I would like to be able to redirect the result to a file so I can later interactively grep for specific parts of interest.
Linux has a wc command which can count the number of lines in a file, for example
wc -l /var/log/syslog. The bash script below stores the output of wc -l in a file called ./prevlinecount. Whenever you want just the new lines in a file, it reads the value in ./prevlinecount and subtracts it from a fresh wc -l /var/log/syslog count (newlinecount). Then it tails the last (newlinecount - prevlinecount) lines.
#!/bin/bash
prevlinecount=$(cat ./prevlinecount 2>/dev/null)
if [ -z "$prevlinecount" ]; then
    # First run: remember the current line count and print the whole file
    wc -l "$1" | awk '{ print $1 }' > ./prevlinecount
    tail -n +1 "$1"
else
    newlinecount=$(wc -l "$1" | awk '{print $1}')
    tail -n $(expr "$newlinecount" - "$prevlinecount") "$1"
    echo "$newlinecount" > ./prevlinecount
fi
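For example, assuming the script above is saved as newsyslog.sh (an illustrative name) and made executable, the new entries can be captured to a file and grepped later:

chmod +x newsyslog.sh
./newsyslog.sh /var/log/syslog > new-entries.log
grep -i "error" new-entries.log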
Beware: this is a very rudimentary script which can only keep track of one file. If you would like to extend it to multiple files, look into associative arrays: with the file name as the key and its previous line count as the value, you could track several files at once (a sketch of this follows below).
Beware too that syslog files are typically rotated once they reach a predetermined size (maybe 10 MB), and this script does not account for that rotation.
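A minimal sketch of that associative-array idea, assuming Bash 4+ and a state file named ./prevlinecounts (both the script structure and the file name are illustrative):

#!/bin/bash
# Track previous line counts for several files in one associative array.
declare -A prev
statefile=./prevlinecounts

# Load saved "count filename" pairs, if any.
if [ -f "$statefile" ]; then
    while read -r count name; do
        prev["$name"]=$count
    done < "$statefile"
fi

for f in "$@"; do
    newcount=$(wc -l < "$f")
    oldcount=${prev["$f"]:-0}
    tail -n $((newcount - oldcount)) "$f"   # rotation is still not handled, as noted above
    prev["$f"]=$newcount
done

# Save the updated counts for next time.
: > "$statefile"
for name in "${!prev[@]}"; do
    echo "${prev[$name]} $name" >> "$statefile"
done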

Keep the newest X files and delete the rest - bash

I have this bash script set up as a crontab entry running every hour. I want to keep the latest 1,000 images in a folder, deleting the oldest files. I don't want to delete by mtime because if no new files are being uploaded I want to keep them; it's fine whether an image is 1 day or 50 days old. I just want that when image 1,001 is uploaded (newest), image_1 (oldest) is deleted, cycling through the folder to keep a static count of 1,000 images.
This works. However, because it only runs hourly, there could be 1,200 images by the time it executes, and running the crontab every minute seems like overkill. Can I make it so that once the folder hits 1,001 images it executes automatically? Basically I want the folder to be self-scanning, keeping the newest 1,000 images and deleting the oldest ones.
#!/bin/sh
cd /folder/to/execute && ls -t | sed -e '1,1000d' | xargs -r -d '\n' rm   # -r: do nothing if there are fewer than 1,000 files
keep=10   # set this to how many files you want to keep (1000 for the question above)
discard=$(expr $keep - $(ls | wc -l))
if [ "$discard" -lt 0 ]; then
    # $discard is negative, so "tail -n $discard" selects that many of the oldest entries
    ls -Bt | tail -n "$discard" | tr '\n' '\0' | xargs -0 printf "%b\0" | xargs -0 rm --
fi
This first calculates the number of files to delete, then safely passes them to rm. It uses negative numbers intentionally, since that conveniently works as the argument to tail.
The use of tr and xargs -0 is to ensure that this works even if file names contain spaces. The printf bit is to handle file names containing newlines.
EDIT: added -- to rm args to be safe if any of the files to be deleted start with a hyphen.
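As a worked example of that arithmetic (the counts below are illustrative):

keep=1000
count=1200                        # suppose ls | wc -l currently reports 1,200 files
discard=$(expr $keep - $count)    # discard is now -200
ls -Bt | tail -n "$discard"       # tail -n -200: the last 200 lines of a newest-first
                                  # listing, i.e. the 200 oldest files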
Try the following script. It first checks the count of files in the current directory and then, if the count is greater than 1000, evaluates the difference and lists that many of the oldest files.
#!/bin/bash
count=$(ls -1 | wc -l)
if [ "$count" -gt 1000 ]
then
    difference=$((count - 1000))
    dirnames=$(ls -t | tail -n "$difference")
    arr=($dirnames)
    for i in "${arr[@]}"
    do
        echo "$i"   # only prints the candidates; replace echo with rm to actually delete
    done
fi

How do I copy the beginning of multiple files in Linux?

I want to copy a bunch of files (*.txt) from one directory to another in Ubuntu. I want to reduce them in size, so I am using head to get the first 100 lines of each.
I want the new files to keep their original names but be in the subdirectory small/.
I have tried:
head -n 100 *.txt > small/*.txt
but this creates one file called *.txt.
I have also tried:
head -n 100 *.txt > small/
but this gives an "Is a directory" error.
It's got to be easy, right? But I am pretty bad at Linux.
Any help is much appreciated.
You'll have to create a loop instead:
for file in *.txt; do
head -n 100 "$file" > small/"$file"
done
This loops through all the .txt files, runs head -n 100 on each of them, and writes the output to a file with the same name in the small/ directory.
Try
for f in *.txt; do
head -n 100 "$f" > small/"$f"
done
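If the small/ subdirectory might not exist yet, a slight variation of the same idea (just a sketch) creates it first:

mkdir -p small
for f in *.txt; do
    head -n 100 "$f" > "small/$f"
done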

Bash Script to replicate files

I have 25 files in a directory. I need to amass 25000 files for testing purposes. I thought I could just replicate these files over and over until I get 25000 files. I could manually copy paste 1000 times but that seemed tedious. So I thought I could write a script to do it for me. I tried
cp * .
as a trial, but I got an error saying the source and destination file are the same. If I were to automate it, how would I do it so that each of the 1000 copies gets a unique name?
As discussed in the comments, you can do something like this:
for file in *
do
    filename="${file%.*}"    # everything up to the last dot
    extension="${file##*.}"  # extension (text after the last dot)
    for i in {00001..10000}
    do
        cp "$file" "${filename}${i}.${extension}"
    done
done
The trick for i in {00001..10000} is used to loop from 1 to 10000 with the numbers zero-padded.
${filename}${i}.${extension} is the same as $filename$i.$extension but makes it clearer what is a variable name and what is literal text. This way, you can also write ${filename}_${i}.${extension} to get files like a_00023.txt, etc.
In case your current files match a specific pattern, you can always do for file in a* (if they all follow the a + something format).
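For a hypothetical file a.txt, the parameter expansions above break the name apart like this:

file=a.txt
filename="${file%.*}"      # -> a
extension="${file##*.}"    # -> txt
echo "${filename}_00023.${extension}"   # -> a_00023.txt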
If you want to keep the extension of the files, you can use this. Assuming you want to copy all .txt files:
#!/bin/bash
for f in *.txt
do
    for i in {1..10000}
    do
        cp "$f" "${f%.*}_${i}.${f##*.}"
    done
done
You could try this:
for file in *; do for i in {1..1000}; do cp "$file" "$file-$i"; done; done
It will append a number to the name of each existing file (after the extension, e.g. photo.jpg-1).
The next script
for file in *.*
do
    eval $(sed 's/\(.*\)\.\([^\.]*\)$/base="\1";ext="\2";/' <<< "$file")
    for n in {1..1000}
    do
        echo cp "$file" "$base-$n.$ext"
    done
done
will:
take all files with an extension (*.*)
derive the base name and extension (via sed)
copy the original file 1000 times to base-number.extension
run as a dry run; remove the echo once you are satisfied with the output

How to parallelize my bash script for use with `find` without facing race conditions?

I am trying to execute a command like this:
find ./ -name "*.gz" -print -exec ./extract.sh {} \;
The gz files themselves are small. Currently my extract.sh contains the following:
# Start delimiter
echo "#####" $1 >> Info
zcat $1 > temp
# Series of greps to extract some useful information
grep -o -P "..." temp >> Info
grep -o -P "..." temp >> Info
rm temp
echo "####" >> Info
Obviously, this is not parallelizable because if I run multiple extract.sh instances, they all write to the same file. What is a smart way of doing this?
I have 80K .gz files on a machine with massive horsepower: 32 cores.
Assume (just for simplicity and clarity) that all your file names start with a-z.
So you could use 26 cores in parallel by launching a find sequence like the one above for each letter. Each find needs to write to its own aggregate file:
find ./ -name "a*.gz" -print -exec ./extract.sh a {} \; &
find ./ -name "b*.gz" -print -exec ./extract.sh b {} \; &
..
find ./ -name "z*.gz" -print -exec ./extract.sh z {} \;
(extract.sh needs to use its first parameter to pick a separate "Info" destination file)
When you want one big aggregate file, just join all the per-letter aggregates.
However, I am not convinced this approach gains much performance. In the end all the file content gets serialized anyway.
Hard disk head movement will probably be the limitation, not the unzip (CPU) performance.
But let's try.
A quick check through the findutils source reveals that find starts a child process for each exec. I believe it then moves on, though I may be misreading the source. Because of this you are already parallel, since the OS will handle sharing these out across your cores. And through the magic of virtual memory, the same executables will mostly share the same memory space.
The problem you are going to run into is file locking/data mixing. As each individual child runs, it pipes output into your Info file. These are individual script commands, so their output will mix together like spaghetti.
To solve this problem, all you need to do is take advantage of the shell's ability to create a temporary file (using tempfile or mktemp), have each script dump into its own temp file, then have each script cat the temp file into the Info file. This does not guarantee that the files will appear in order, just that each individual file's contents stay together. Don't forget to delete your temp file after use.
If the temp files are in RAM (see tmpfs), then you will avoid being I/O bound except when writing to your final file, and when running the find search.
Tmpfs is a special file system that uses your ram as "disk space". It will take up to the amount of ram you allow, not use more than it needs from that amount, and swap to disk as needed if it does fill up.
To use:
Create a mount point (I like /mnt/ramdisk or /media/ramdisk)
Edit /etc/fstab as root
Add the line tmpfs /mnt/ramdisk tmpfs size=1G 0 0
Run mount /mnt/ramdisk as root to mount your new ramdisk. It will also be mounted automatically at boot.
See the wikipedia entry on fstab for all the options available.
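As a concrete sketch of those steps (the 1G size and the /mnt/ramdisk path are just the examples from the list above):

sudo mkdir -p /mnt/ramdisk
echo 'tmpfs /mnt/ramdisk tmpfs size=1G 0 0' | sudo tee -a /etc/fstab
sudo mount /mnt/ramdisk
df -h /mnt/ramdisk     # should now report a tmpfs filesystem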
You can use xargs to run your extraction in parallel. --max-procs limits the number of processes run at once (the default is 1):
find ./ -name "*.gz" -print | xargs --max-args 1 --max-procs 32 ./extract.sh
In the ./extract.sh you can use mktemp to write data from each .gz to a temporary file, all of which may be later combined:
# Start delimiter
tmp=$(mktemp -t Info.XXXXXX)
src=$1                        # default: operate on the .gz directly (with zgrep)
echo "#####" "$1" >> "$tmp"
zcat "$1" > "$tmp.unzip"      # ...or unzip once and grep the plain text
src=$tmp.unzip
# Series of greps to extract some useful information
grep -o -P "..." "$src" >> "$tmp"
grep -o -P "..." "$src" >> "$tmp"
rm "$src"
echo "####" >> "$tmp"
If you have massive horsepower you can use zgrep directly, without unzipping first. But it may be faster to zcat first if you run many greps afterwards.
Anyway, later combine everything into a single file:
cat /tmp/Info.* > Info
rm /tmp/Info.*
If you care about the order of the .gz files, pass a second argument to ./extract.sh:
find files/ -name "*.gz" | nl -n rz | sed -e 's/\t/\n/' | xargs --max-args 2 ...
And in ./extract.sh:
tmp=`mktemp -t Info.$1.XXXXXX`
src=$2
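To illustrate what that pipeline feeds to xargs (the .gz paths below are hypothetical): nl -n rz prefixes each path with a zero-padded index and the sed turns the tab into a newline, so xargs --max-args 2 passes the index as $1 and the path as $2:

printf '%s\n' files/a.gz files/b.gz | nl -n rz | sed -e 's/\t/\n/'
# prints:
#   000001
#   files/a.gz
#   000002
#   files/b.gz
# xargs --max-args 2 therefore runs "./extract.sh 000001 files/a.gz" and so on;
# the zero-padded index in each temp-file name keeps "cat /tmp/Info.*" in order.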
I would create a temporary directory. Then create an output file for each grep (based on the name of the file it processed). Files created under /tmp are located on a RAM disk and so will not thrash your hard drive with lots of writes.
You can then either cat it all together at the end, or get each grep to signal another process when it has finished and that process can begin catting files immediately (and removing them when done).
Example:
working_dir="`pwd`"
temp_dir="`mktemp -d`"
cd "$temp_dir"
find "$working_dir" -name "*.gz" | xargs -P 32 -n 1 extract.sh
cat *.output > "$working_dir/Info"
rm -rf "$temp_dir"
extract.sh
filename=$(basename "$1")
output="$filename.output"
extracted="$filename.extracted"
zcat "$1" > "$extracted"
echo "#####" $filename > "$output"
# Series of greps to extract some useful information
grep -o -P "..." "$extracted" >> "$output"
grep -o -P "..." "$extracted" >> "$output"
rm "$extracted"
echo "####" >> "$output"
The multiple grep invocations in extract.sh are probably the main bottleneck here. An obvious optimization is to read each file only once, then print a summary in the order you want. As an added benefit, we can speculate that the report can get written as a single block, but it might not prevent interleaved output completely. Still, here's my attempt.
#!/bin/sh
for f; do
    zcat "$f" |
    perl -ne '
        /(pattern1)/ && push @pat1, $1;
        /(pattern2)/ && push @pat2, $1;
        # ...
        END { print "##### '"$f"'\n";
              print join ("\n", @pat1), "\n";
              print join ("\n", @pat2), "\n";
              # ...
              print "#### '"$f"'\n"; }'
done
Doing this in awk instead of Perl might be slightly more efficient, but since you are using grep -P I figure it's useful to be able to keep the same regex syntax.
The script accepts multiple .gz files as input, so you can use find -exec extract.sh {} \+ or xargs to launch a number of parallel processes. With xargs you can try to find a balance between sequential jobs and parallel jobs by feeding each new process, say, 100 to 500 files in one batch. You save on the number of new processes, but lose in parallelization. Some experimentation should reveal what the balance should be, but this is the point where I would just pull a number out of my hat and see if it's good enough already.
Granted, if your input files are small enough, the multiple grep invocations will run out of the disk cache, and turn out to be faster than the overhead of starting up Perl.
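As a concrete starting point for that experimentation (the batch size of 200 files and the 32 parallel jobs are illustrative values, the latter matching the 32 cores in the question):

find ./ -name "*.gz" -print0 | xargs -0 -n 200 -P 32 ./extract.sh > Info
# -n 200: files handed to each ./extract.sh invocation; -P 32: parallel jobs

As noted above, output from concurrent jobs may still interleave in Info; adjust -n and -P until the balance looks right.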
