Maintaining variables in function - Global variables - linux

I'm trying to run a script in a function and then call it:
filedetails ()
{
# read TOTAL_DU < "/tmp/sizes.out";
disksize=`du -s "$1" | awk '{print $1}'`;
let TOTAL_DU+=$disksize;
echo "$TOTAL_DU";
# echo $TOTAL_DU > "/tmp/sizes.out"
}
I'm using the variable TOTAL_DU as a counter to keep a running total of the du of all the files.
I'm running it using parallel or xargs:
find . -type f | parallel -j 8 filedetails
But the variable TOTAL_DU resets every time and the count is not maintained, which is expected, since a new shell is used for each invocation.
I've also tried using a file to export and then read the counter, but because of parallel some jobs complete faster than others, so the updates are not sequential (as expected) and this is no good....
The question: is there a way to keep the count whilst using parallel or xargs?

Aside from learning purposes, this is not likely to be a good use of parallel, because:
Calling du like that will quite possibly be slower than just invoking du in the normal way. First, the information about file sizes can be extracted from the directory, and so an entire directory can be computed in a single access. Effectively, directories are stored as a special kind of file object, whose data is a vector of directory entities ("dirents"), which contain the name and metadata for each file. What you are doing is using find to print these dirents, then getting du to examine each one (every file, not every directory); almost all of this second scan is redundant work.
Insisting that du examine every file prevents it from avoiding double-counting multiple hard-links to the same file. So you can easily end up inflating the disk usage this way. On the other hand, directories also take up diskspace, and normally du will include this space in its reports. But you're never calling it on any directory, so you will end up understating the total disk usage.
You're invoking a shell and an instance of du for every file. Normally, you would only create a single process for a single du. Process creation is a lot slower than reading a filesize from a directory. At a minimum, you should use parallel -X and rewrite your shell function to invoke du on all the arguments, rather than just $1.
There is no way to share environment variables between sibling shells. So you would have to accumulate the results in a persistent store, such as a temporary file or database table. That's also an expensive operation, but if you adopted the above suggestion, you would only need to do it once for each invocation of du, rather than for every file.
So, ignoring the first two issues, and just looking at the last two, solely for didactic purposes, you could do something like the following:
# Create a temporary file to store results
tmpfile=$(mktemp)
# Function which invokes du and safely appends its summary line
# to the temporary file
collectsizes() {
# Get the name of the temporary file, and remove it from the args
tmpfile=$1
shift
# Call du on all the parameters, and get the last (grand total) line
size=$(du -c -s "$@" | tail -n1)
# lock the temporary file and append the dataline under lock
flock "$tmpfile" bash -c 'echo "$1" >> "$2"' _ "$size" "$tmpfile"
}
export -f collectsizes
# Find all regular files, and feed them to parallel taking care
# to avoid problems if files have whitespace in their names
find -type f -print0 | parallel -0 -j8 -X collectsizes "$tmpfile"
# When all that's done, sum up the values in the temporary file
awk '{s+=$1}END{print s}' "$tmpfile"
# And delete it.
rm "$tmpfile"

Related

Trying to scrub 700 000 data against 15 million data

I am trying to scrub 700 000 records obtained from a single file against 15 million records spread across multiple files.
Example: one file of 700 000 records, call it A; a pool of multiple files holding the 15 million, call it B.
I want pool B to end up containing no records from file A.
Below is the shell script I am using. It works, but it is taking a massive amount of time, more than 8 hours, to do the scrubbing.
IFS=$'\r\n' suppressionArray=($(cat abhinav.csv1))
suppressionCount=${#suppressionArray[@]}
cd /home/abhinav/01-01-2015/
for (( j=0; j<$suppressionCount; j++));
do
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
arrayOffileNameInWhichSuppressionFoundCount=${#arrayOffileNameInWhichSuppressionFound[@]}
if [ $arrayOffileNameInWhichSuppressionFoundCount -gt 0 ];
then
echo -e "${suppressionArray[$j]}" >> /home/abhinav/emailid_Deleted.txt
for (( k=0; k<$arrayOffileNameInWhichSuppressionFoundCount; k++));
do
sed "/^${suppressionArray[$j]}/d" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$k]} > /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" && mv -f /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}".tmp" /home/abhinav/06-07-2015/${arrayOffileNameInWhichSuppressionFound[$i]}
done
fi
done
Another solution that occurred to me is to break the 700k records down into smaller files of 50k each and send them across 5 available servers, with pool A also available on each server.
Each server would then work on 2 of the smaller files.
These two lines are peculiar:
arrayOffileNameInWhichSuppressionFound=`grep "${suppressionArray[$j]}," *.csv| awk -F ':' '{print $1}' > /home/abhinav/fileNameContainer.txt`
IFS=$'\r\n' arrayOffileNameInWhichSuppressionFound=($(cat /home/abhinav/fileNameContainer.txt))
The first assigns an empty string to the mile-long variable name because the standard output is directed to the file. The second then reads that file into the array. ('Tis curious that the name is not arrayOfFileNameInWhichSuppressionFound, but the lower-case f for file is consistent, so I guess it doesn't matter beyond making it harder to read the variable name.)
That could be reduced to:
ArrFileNames=( $(grep -l "${suppressionArray[$j]}," *.csv) )
You shouldn't need to keep futzing with carriage returns in IFS; either set it permanently, or make sure there are no carriage returns before you start.
You're running these loops 7,00,000 times (using the Indian notation). That's a lot. No wonder it is taking hours. You need to group things together.
You should probably simply take the lines from abhinav.csv1 and arrange to convert them into appropriate sed commands, and then split them up and apply them. Along the lines of:
sed 's%.*%/&,/d%' abhinav.csv1 > names.tmp
split -l 500 names.tmp sed-script.
for script in sed-script.*
do
sed -f "$script" -i.bak *.csv
done
This uses the -i option to backup the files. It may be necessary to do redirection explicitly if your sed does not support the -i option:
for file in *.csv
do
sed -f "$script" "$file" > "$file.tmp" &&
mv "$file.tmp" "$file"
done
You should experiment to see how big the scripts can be. I chose 500 in the split command as a moderate compromise. Unless you're on antique HP-UX, that should be safe, but you may be able to increase the size of the script further, which will reduce the number of times you have to edit each file and so speed up the processing. If you can use 5,000 or 50,000, you should do so. Experiment to see what the upper limit is. I'm not sure that you'd find doing all 700,000 lines at once feasible, but it should be fastest if you can do it that way.
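If you want to put numbers on that before committing to a full run, a rough timing harness might look like the following (a sketch only; csvdir here stands in for a throwaway copy of the CSV directory):
# Time one full pass per candidate chunk size against a scratch copy
for n in 500 5000 50000
do
rm -rf scratch && cp -r csvdir scratch
rm -f sed-script.* && split -l "$n" names.tmp sed-script.
time bash -c 'for script in sed-script.*; do sed -f "$script" -i.bak scratch/*.csv; done'
done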

How to make many edits to files, without writing to the harddrive very often, in BASH?

I often need to make many edits to text files. The files are typically 20 MB in size and require ~500,000 individual edits, all which must be made in a very specific order. Here is a simple example of a script I might need to use:
while read -r line
do
...
(20-100 lines of BASH commands preparing $a and $b)
...
sed -i "s/$a/$b/g" ./editfile.txt
...
done < ./readfile.txt
As many other lines of code appear before and after the sed script, it seems the only option for editing the file is sed with the -i option. Many have warned me against using sed -i, as that makes too many writes to the file. Recently, I had to replace two computers, as the hard drives stopped working after running the scripts. I need to find a solution that does not damage my computer's hardware.
Is there some way to send files somewhere else, such as storing the whole file in a BASH variable, or in RAM, where sed, grep, and awk can make the edits without making millions of writes to the hard drive?
Don't use sed -i once per transform. A far better approach -- leaving you with more control -- is to construct a pipeline (if you can't use a single sed with multiple -e arguments to perform multiple operations within a single instance), and redirect to or from disk at only the beginning and end.
This can even be done recursively, if you use a FD other than stdin for reading from your file:
editstep() {
read -u 3 -r          # read from readfile into REPLY
if [[ $REPLY ]]; then # we read something new from readfile
sed ... | editstep # perform the edits, then a recursive call!
else
cat
fi
}
editstep <editfile.txt >editfile.txt.new 3<readfile.txt
Better than that, though, is to consolidate to a single sed instance.
sed_args=( )
while read -r line; do
sed_args+=( -e "s/in/out/" )
done <readfile.txt
sed -i "${sed_args[#]}" editfile.txt
...or, for edit lists too long to pass in on the command line:
sed_args=( )
while read -r line; do
sed_args+=( "s/in/out/" )
done <readfile.txt
sed -i -f <(printf '%s\n' "${sed_args[@]}") editfile.txt
(Please don't read the above as an endorsement of sed -i, which is a non-POSIX extension and has its own set of problems; the POSIX-specified editor intended for in-place rather than streaming operations is ex, not sed).
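For what it's worth, a single in-place pass with ex would look something like the sketch below (assuming a plain in/out substitution as in the examples above; ex -s reads its commands from standard input):
# Substitute on every matching line, then write the file and exit
printf '%s\n' 'g/in/s//out/g' 'x' | ex -s editfile.txt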
Even better? Don't use sed at all, but keep all the operations inline in native bash.
Consider the following:
content=$(<editfile.txt)
while IFS= read -r; do
# put your own logic here to set `in` and `out`
content=${content//$in/$out}
done <readfile.txt
printf '%s\n' "$content" >editfile.new
One important caveat: This approach treats in as a literal string, not a regular expression. Depending on the edits you're actually making, this may actually improve correctness over the original code... but in any event, it's worth being aware of.
Another caveat: Reading the file's contents into a bash string is not necessarily a lossless operation; expect content to be truncated at the first NUL byte (if any exist), and a trailing newline to be added at the end of the file if none existed before.
Simple...
Instead of trying too many threads, you can simply copy all your files and dirs to /dev/shm.
This is a RAM-backed tmpfs filesystem. When you are done editing, copy everything back to the original destination. Do not forget to run sync after you are done :-)
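A minimal sketch of that workflow, assuming the working tree fits in RAM (./work is a hypothetical directory holding editfile.txt and readfile.txt):
cp -r ./work /dev/shm/work          # copy into the RAM-backed tmpfs
cd /dev/shm/work
# ... run the edit loop here against the in-RAM copy ...
cd - >/dev/null
cp -r /dev/shm/work/. ./work/       # copy the results back to disk
sync                                # flush them out
rm -rf /dev/shm/work                # free the RAM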

Searching a particular string pattern out of 10000 files in parallel

Problem Statement:-
I need to search for a particular string pattern in around 10000 files and find the records in those files which contain that pattern. I can use grep here, but it is taking a lot of time.
Below is the command I am using to search a particular string pattern after unzipping the dat.gz file
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'
If I simply count how many files are there after unzipping the above dat.gz file
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | wc -l
I get around 10000 files. I need to search the above string pattern in all these 10000 files and find the records which contain it. My command above works fine, but it is very, very slow.
What is the best approach here? Should we take 100 files at a time and search for the pattern in those 100 files in parallel?
Note:
I am running SunOS
bash-3.00$ uname -a
SunOS lvsaishdc3in0001 5.10 Generic_142901-02 i86pc i386 i86pc
Do NOT run this in parallel!!!! That's going to bounce the disk head all over the place; it will be much slower.
Since you are reading an archive file there's one way to get a substantial performance boost--don't write the results of the decompression out. The ideal answer would be to decompress to a stream in memory, if that's not viable then decompress to a ramdisk.
In any case you do want some parallelism here--one thread should be obtaining the data and then handing it off to another that does the search. That way you will either be waiting on the disk or on the core doing the decompressing; you won't waste any of that time doing the search.
(Note that in case of the ramdisk you will want to aggressively read the files it wrote and then kill them so the ramdisk doesn't fill up.)
For starters, you will need to uncompress the file to disk.
This does work (in bash), but you probably don't want to try to start 10,000 processes all at once. Run it inside the uncompressed directory:
for i in `find . -type f`; do ((grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' $i )&); done
So, we need to have a way to limit the number of spawned processes. This will loop as long as the number of grep processes running on the machine exceeds 10 (including the one doing the counting):
while [ `top -b -n1 | grep -c grep` -gt 10 ]; do echo true; done
I have run this, and it works.... but top takes so long to run that it effectively limits you to one grep per second. Can someone improve upon this, adding one to a count when a new process is started and decrementing by one when a process ends?
for i in `find . -type f`; do ((grep -l 'blah' $i)&); (while [ `top -b -n1 | grep -c grep` -gt 10 ]; do sleep 1; done); done
Any other ideas for how to determine when to sleep and when not to? Sorry for the partial solution, but I hope someone has the other bit you need.
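One possible answer to that, as a rough sketch: rely on the shell's own job control instead of top, since jobs -r lists the still-running background jobs of the current shell.
pattern='b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'   # the literal pattern from the question
for i in `find . -type f`; do
grep "$pattern" "$i" &
# throttle: pause while 10 or more of our background greps are still running
while [ `jobs -r | wc -l` -ge 10 ]; do sleep 1; done
done
wait    # let the final batch finish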
If you are not using regular expressions you can use the -F option of grep or use fgrep. This may provide you with additional performance.
Your gzcat .... | wc -l does not indicate 10000 files, it indicates 10000 lines total for however many files there are.
This is the type of problem that xargs exists for. Assuming your version of gzip came with a script called gzgrep (or maybe just zgrep), you can do this:
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'
That will run one gzgrep command with batches of as many individual files as it can fit on a command line (there are options to xargs to limit how many, or for a number of other things). Unfortunately, gzgrep still has to uncompress each file and pass it off to grep, but there's not really any good way to avoid having to uncompress the whole corpus in order to search through it. Using xargs in this way will however cut down some on the overall number of new processes that need to be spawned.
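If you want explicit control over the batch size, or (cautiously, given the disk-head warning above) some parallelism, the relevant xargs options are -n and -P; a sketch, noting that -n is portable while -P needs a GNU or modern BSD xargs and may be missing from stock Solaris /usr/bin/xargs:
# At most 100 file names per gzgrep invocation, up to 4 invocations at a time
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs -n 100 -P 4 gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'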

How to parallelize my bash script for use with `find` without facing race conditions?

I am trying to execute a command like this:
find ./ -name "*.gz" -print -exec ./extract.sh {} \;
The gz files themselves are small. Currently my extract.sh contains the following:
# Start delimiter
echo "#####" $1 >> Info
zcat $1 > temp
# Series of greps to extract some useful information
grep -o -P "..." temp >> Info
grep -o -P "..." temp >> Info
rm temp
echo "####" >> Info
Obviously, this is not parallelizable because if I run multiple extract.sh instances, they all write to the same file. What is a smart way of doing this?
I have 80K gz files on a machine with massive horse power of 32 cores.
Assume (just for simplicity and clarity) that all your files start with a-z.
You could then use 26 cores in parallel by launching a find sequence like the one above for each letter. Each find needs to generate its own aggregate file:
find ./ -name "a*.gz" -print -exec ./extract.sh a {} \; &
find ./ -name "b*.gz" -print -exec ./extract.sh b {} \; &
..
find ./ -name "z*.gz" -print -exec ./extract.sh z {} \;
(extract.sh needs to use that first parameter to pick a separate "Info" destination file.)
When you want one big aggregate file, just join all the aggregates.
However, I am not convinced this approach gains much performance. In the end all file content will be serialized.
Probably hard disk head movement will be the limit, not the unzip (CPU) performance.
But let's try.
A quick check through the findutils source reveals that find starts a child process for each exec. I believe it then moves on, though I may be misreading the source. Because of this you are already parallel, since the OS will handle sharing these out across your cores. And through the magic of virtual memory, the same executables will mostly share the same memory space.
The problem you are going to run into is file locking/data mixing. As each individual child runs, it pipes info into your info file. These are individual script commands, so they will mix their output together like spaghetti. This does not guarantee that the files will be in order! Just that all of an individual file's contents will stay together.
To solve this problem, all you need to do is take advantage of the shell's ability to create a temporary file (using tempfile), have each script dump to the temp file, then have each script cat the temp file into the info file. Don't forget to delete your temp file after use.
If the tempfiles are in RAM (see tmpfs), then you will avoid being IO bound except when writing to your final file, and when running the find search.
Tmpfs is a special file system that uses your ram as "disk space". It will take up to the amount of ram you allow, not use more than it needs from that amount, and swap to disk as needed if it does fill up.
To use:
Create a mount point ( I like /mnt/ramdisk or /media/ramdisk )
Edit /etc/fstab as root
Add tmpfs /mnt/ramdisk tmpfs size=1G 0 0
Run mount -a as root to mount your new ramdisk. It will also be mounted at boot.
See the wikipedia entry on fstab for all the options available.
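If you only need the ramdisk for the current session and would rather not edit /etc/fstab at all, the same mount can be made by hand as root; a minimal sketch:
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk   # gone again after the next reboot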
You can use xargs to run your search in parallel. --max-procs limits number of processes executed (default is 1):
find ./ -name "*.gz" -print | xargs --max-args 1 --max-procs 32 ./extract.sh
In the ./extract.sh you can use mktemp to write data from each .gz to a temporary file, all of which may be later combined:
# Start delimiter
tmp=`mktemp -t Info.XXXXXX`
src=$1
echo "#####" $1 >> $tmp
zcat $1 > $tmp.unzip
src=$tmp.unzip
# Series of greps to extract some useful information
grep -o -P "..." $src >> $tmp
grep -o -P "..." $src >> $tmp
rm $src
echo "####" >> $tmp
If you have massive horse power you can use zgrep directly, without unzipping first. But it may be faster to zcat first if you have many greps later.
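In that case the intermediate $tmp.unzip file drops out entirely; a sketch of the zgrep variant (assuming your zgrep forwards -o and -P through to GNU grep):
tmp=`mktemp -t Info.XXXXXX`
echo "#####" $1 >> $tmp
# zgrep decompresses on the fly, so no temporary uncompressed copy is needed
zgrep -o -P "..." "$1" >> $tmp
zgrep -o -P "..." "$1" >> $tmp
echo "####" >> $tmp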
Anyway, later combine everything into a single file:
cat /tmp/Info.* > Info
rm /tmp/Info.*
If you care about the order of the .gz files, pass a second argument to ./extract.sh:
find files/ -name "*.gz" | nl -n rz | sed -e 's/\t/\n/' | xargs --max-args 2 ...
And in ./extract.sh:
tmp=`mktemp -t Info.$1.XXXXXX`
src=$2
I would create a temporary directory, then create an output file for each grep (based on the name of the file it processed). Files created under /tmp are commonly located on a RAM disk (tmpfs) and so will not thrash your harddrive with lots of writes.
You can then either cat it all together at the end, or get each grep to signal another process when it has finished and that process can begin catting files immediately (and removing them when done).
Example:
working_dir="`pwd`"
temp_dir="`mktemp -d`"
cd "$temp_dir"
find "$working_dir" -name "*.gz" | xargs -P 32 -n 1 extract.sh
cat *.output > "$working_dir/Info"
rm -rf "$temp_dir"
extract.sh
filename=$(basename $1)
output="$filename.output"
extracted="$filename.extracted"
zcat "$1" > "$extracted"
echo "#####" $filename > "$output"
# Series of greps to extract some useful information
grep -o -P "..." "$extracted" >> "$output"
grep -o -P "..." "$extracted" >> "$output"
rm "$extracted"
echo "####" >> "$output"
The multiple grep invocations in extract.sh are probably the main bottleneck here. An obvious optimization is to read each file only once, then print a summary in the order you want. As an added benefit, we can speculate that the report can get written as a single block, but it might not prevent interleaved output completely. Still, here's my attempt.
#!/bin/sh
for f; do
zcat "$f" |
perl -ne '
/(pattern1)/ && push @pat1, $1;
/(pattern2)/ && push @pat2, $1;
# ...
END { print "##### '"$f"'\n";
print join ("\n", @pat1), "\n";
print join ("\n", @pat2), "\n";
# ...
print "#### '"$f"'\n"; }'
done
Doing this in awk instead of Perl might be slightly more efficient, but since you are using grep -P I figure it's useful to be able to keep the same regex syntax.
The script accepts multiple .gz files as input, so you can use find -exec extract.sh {} \+ or xargs to launch a number of parallel processes. With xargs you can try to find a balance between sequential jobs and parallel jobs by feeding each new process, say, 100 to 500 files in one batch. You save on the number of new processes, but lose in parallelization. Some experimentation should reveal what the balance should be, but this is the point where I would just pull a number out of my hat and see if it's good enough already.
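For example, an xargs invocation in that spirit might look like this (both numbers pulled out of a hat, to be tuned):
# 200 .gz files per extract.sh invocation, up to 8 invocations running at once
find . -name "*.gz" -print0 | xargs -0 -n 200 -P 8 ./extract.sh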
Granted, if your input files are small enough, the multiple grep invocations will be served out of the disk cache, and may turn out to be faster than the overhead of starting up Perl.

Accessing each line using a $ sign in linux

Whenever I execute a linux command that outputs multiple lines, I want to perform some operation on each line of the output. Generally I do:
command something | while read a
do
some operation on $a;
done
This works fine. But my question is: is there some way I can access each line via a predefined symbol (don't know what to call it), something like $? or $! or $_?
Is it possible to do
cat to_be_removed.txt | rm -f $LINE
Is there a predefined $LINE in bash, or is the previous one the shortest way, i.e.
cat to_be_removed.txt | while read line; do rm -f $line; done;
xargs is what you're looking for:
cat to_be_removed.txt | xargs rm -f
Watch out for spaces in your filenames if you use that one, though. Check out the xargs man page for more information.
You might be looking for the xargs command.
It takes control arguments, plus a command and optionally some arguments for the command. It then reads its standard input, normally splitting at white space, and then arranges to repeatedly execute the command with the given arguments and as many 'file names' read from the standard input as will fit on the command line.
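If some of those file names contain spaces, one hedged workaround is to make the delimiter explicit; a sketch (-d is a GNU xargs extension; the -0 form is also supported by BSD xargs):
# One file name per line, spaces preserved
xargs -d '\n' rm -f < to_be_removed.txt
# Or convert the list to NUL-delimited form first
tr '\n' '\0' < to_be_removed.txt | xargs -0 rm -f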
rm -f $(<to_be_removed.txt)
This works because rm can take multiple files as input. It is also much more efficient because you only call rm once, and you don't need to create a pipe to cat or xargs.
On a separate note, rather than using pipes in a while loop, you can avoid a subshell by using process substitution:
while read line; do
some operation on $line;
done < <(command something)
The additional benefit you get by avoiding a subshell is that variables you change inside the loop maintain their altered values outside the loop as well. This is not the case when using the pipe form and it is a common gotcha.
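A quick illustration of that gotcha, using seq as a stand-in for "command something":
count=0
seq 3 | while read line; do ((count++)); done
echo "$count"    # prints 0: the loop ran in a subshell, so the increment is lost
count=0
while read line; do ((count++)); done < <(seq 3)
echo "$count"    # prints 3: same shell, so the value survives the loop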
