how to specify the amount of RAM to use for a script in linux

Let's say I want to edit all the files in a folder, changing every header. I'm using this script:
for thing in $(ls $1); do
    sed -i '1c\SNP A2 A1 beta N P' $thing
done
The problem is that it takes a lot of time. So, I'd like to find a way to dedicate more RAM for this script, in order to do it quickly.
Is it possible?

I'd like to find a way to dedicate more RAM for this script, in order to do it quickly. Is it possible?
No. The tools you use, bash to enumerate your files and sed to edit them, already take the RAM they need to do their work. They couldn't use more RAM even if you had a way to give it to them.
You can run your sed operations in parallel. This uses more cores on your machine and may finish faster. Put an & after the sed command in the loop, something like this:
for thing in $(ls $1); do
    sed -i '1c\SNP A2 A1 beta N P' $thing &
done
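If you go this route, it is also worth putting a wait after the loop so the script does not return before the background sed processes have finished. A sketch of the same loop with that added:
for thing in $(ls "$1"); do
    sed -i '1c\SNP A2 A1 beta N P' "$thing" &
done
wait    # block until every background sed has exited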

With GNU Parallel you can control how many are run in parallel:
doit() {
    sed -i '1c\SNP A2 A1 beta N P' "$1"
}
export -f doit
ls "$1" | parallel -j5 doit
Adjust -j5 until you find the number that gives you the most throughput.
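With the same doit function you can also skip parsing ls and let the shell glob supply the file names. A sketch, assuming the first script argument is the directory holding the files:
parallel -j5 doit ::: "$1"/*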

Related

Finding a line that shows in a file only once

Assume I have a file with 100 lines. There are a lot of lines that repeat themselves in the file, and only one line that does not.
I want to find the line that shows up only once. Is there a command for that, or do I have to build some complicated loop like the one below?
My code so far:
#!/bin/bash
filename="repeat_lines.txt"
var="$(wc -l < "$filename")"
echo "length:" "$var"
# cp ex4.txt ex4_copy.txt
# head -n takes 1-based line numbers, so both loops run from 1 to $var
for ((index=1; index <= var; index++)); do
    one="$(head -n "$index" "$filename" | tail -1)"
    counter=0
    for ((index2=1; index2 <= var; index2++)); do
        two="$(head -n "$index2" "$filename" | tail -1)"
        if [ "$one" == "$two" ]; then
            counter=$((counter+1))
        fi
    done
    echo "$one is in the text $counter times"
done
If I understood your question correctly, then
sort repeat_lines.txt | uniq -u should do the trick.
e.g. for file containing:
a
b
a
c
b
it will output c.
For further reference, see the sort and uniq man pages.
You've got a reasonable answer that uses standard shell tools sort and uniq. That's probably the solution you want to use, if you want something that is portable and doesn't require bash.
But an alternative would be to use functionality built into your bash shell. One method might be to use an associative array, which is a feature of bash 4 and above.
$ cat file.txt
a
b
c
a
b
$ declare -A lines
$ while read -r x; do ((lines[$x]++)); done < file.txt
$ for x in "${!lines[@]}"; do [[ ${lines["$x"]} -gt 1 ]] && unset lines["$x"]; done
$ declare -p lines
declare -A lines='([c]="1" )'
What we're doing here is:
declare -A creates the associative array. This is the bash 4 feature I mentioned.
The while loop reads each line of the file, and increments a counter that uses the content of a line of the file as the key in the associative array.
The for loop steps through the array, deleting any element whose counter is greater than 1.
declare -p prints the details of an array in a predictable, re-usable format. You could alternatively use another for loop to step through the remaining array elements (of which there might be only one) in order to do something with them.
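For example, the follow-up loop mentioned above could be as simple as this (a sketch, continuing the session above):
$ for x in "${!lines[@]}"; do echo "appears once: $x"; done
appears once: c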
Note that this solution, while fine for small files (say, up to a few thousand lines), may not scale well for very large files of, say, millions of lines. Bash isn't the fastest at reading input this way, and one must be cognizant of memory limits when using arrays.
The sort alternative has the benefit that, for extremely large files, sort can spill to temporary files on disk instead of holding everything in memory, at the expense of speed.
If you're dealing with files of only a few hundred lines, then it's hard to predict which solution will be faster. In the end, the form of output may dictate your choice of solution. The sort | uniq pipe generates a list to standard output. The bash solution above generates the same list as keys in an array. Otherwise, they are functionally equivalent.

Fastest way to search for multiple values on a linux machine

I want to search for multiple values (say v1, v2, v3....) in a directory with around 6-10 huge files (~300 MB each). I have tried grep and fgrep, with regular expression search like ('v1 | v2 | v3'). The command seems to be running really slow. I am running something like
grep -e 'v1|v2|v3' .
Is there a way I can make my search faster? Please note that the machine is single core, so parallelization may not be of much help.
I'd suggest
LANG=C egrep 'v1|v2|v3'
It won't get much faster than that.
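If the values are fixed strings rather than regular expressions, a fixed-string search driven by a pattern file is also worth timing. A sketch, where patterns.txt is a hypothetical file holding one value per line and -r makes grep descend into the directory:
LANG=C grep -rF -f patterns.txt .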
You can search with the -R option:
-R - Read all files under each directory, recursively.
grep -i 'test' -R .

How to script bulk nslookups

I have a list of several million domain names and I want to see if they are available or not.
I tried pywhois first but am getting rate limited. As I don't need an authoritative answer, I thought I would just use nslookup. I am having trouble scripting this though.
Basically, what I want to do is: if the domain is registered, echo it. What I'm getting is grep: find”: No such file or directory. I think it's something easy and I've just been looking at this for too long...
#!/bin/bash
START_TIME=$SECONDS
for DOMAIN in `cat ./domains.txt`;
do
    if ! nslookup $DOMAIN | grep -v “can’t find”; then
        echo $DOMAIN
    fi
done
echo ELAPSED_TIME=$(($SECONDS - $START_TIME))
If you have millions to check, you may like to use GNU Parallel to get the job done faster. Like this, if you want to run, say, 32 lookups at a time in parallel:
parallel -j 32 nslookup < domains.txt | grep "^Name"
If you want to fiddle with the output of nslookup, the easiest way is probably to declare a little function called lkup(), tell GNU Parallel about it and then use that, like this
#!/bin/bash
lkup() {
    if ! nslookup "$1" | grep -v "can't find"; then
        echo "$1"
    fi
}
# Make lkup() function visible to GNU parallel
export -f lkup
# Check the domains in parallel
parallel -j 32 lkup < domains.txt
If the order of the lookups is important to you, you can add the -k flag to parallel to keep the order.
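For example (a sketch, reusing the lkup function above):
parallel -k -j 32 lkup < domains.txt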
The error is because you have curly quotes in your script, which are not the proper way to quote command line elements. As a result, they're being treated as part of a filename. Change to:
if ! nslookup $DOMAIN | grep -v "can't find"; then

optimize extracting text from large data set

I have a recurring problem where I need to search log files for all threads that match a pattern, e.g. the following:
(thread a) blah blah
(thread b) blha
(thread a) match blah
(thread c) blah blah
(thread d) blah match
(thread a) xxx
will produce all lines from threads a & d
There are multiple log files (compressed), numbering in the hundreds or thousands. Each file is up to ~20 MB uncompressed.
The way I do this now is to first grep "match" in all the files, cut the (thread x) portion into a sorted, de-duplicated file, then use fgrep with that file of matching threads on the original log set.
I'm already parallelizing the initial grep and the final grep. However, this is still slow.
Is there a way to improve the performance of this workload?
(I thought of Hadoop, but it requires too many resources to set up/implement.)
This is the script for the overall process.
#!/bin/bash
FILE=/tmp/thg.$$
PAT=$1
shift
trap "rm -f $FILE" 3 13
pargrep "$PAT" "$@" | awk '{ print $6 }' | sed 's/(\(.*\)):/\1/' | sort | uniq > $FILE
pargrep --fgrep $FILE "$@"
rm -f $FILE
The parallel grep is a much longer script that manages a queue of up to N grep processes that work on M files. Each grep process produces an intermediate file (in /dev/shm or /tmp - some memory file system), that are then concatenated once the queue drains from tasks.
I had to reboot my workstation today after it ran on a set of ~3000 files for over 24 hours. I guess a dual Xeon with 8 GB of RAM and 16 GB of swap is not up to such a workload :-)
Updated
In order to unzip and grep your files in parallel, try using GNU Parallel, something like this:
parallel -j 8 -k --eta 'gzcat {} | grep something ' ::: *.gz
Original Answer
Mmmmmm... I see you are not getting any answers, so I'll stick my neck out and have a try. I have not tested this as I don't have many spare GB of your data lying around...
Firstly, I would benchmark the scripts and see what is eating the time. In concrete terms, is it the initial grep phase, or the awk|sed|sort|uniq phase?
I would remove the sort|uniq because I suspect that sorting multiple GB will eat your time up. If that is the case, maybe try replacing the sort|uniq with awk like this:
pargrep ... | ... | awk '!seen[$0]++'
which will only print each line the first time it occurs without the need to sort the entire file.
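Applied to the script above, that change might look like this (a sketch that only swaps the sort | uniq step for the awk de-duplication and assumes the variables from the script):
pargrep "$PAT" "$@" | awk '{ print $6 }' | sed 's/(\(.*\)):/\1/' | awk '!seen[$0]++' > $FILE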
Failing that, I think you need to benchmark the times for the various phases and report back. Also, you should consider posting your pargrep code as that may be the bottleneck. Also, have you considered using GNU Parallel for that part?

Searching a particular string pattern out of 10000 files in parallel

Problem Statement:
I need to search for a particular string pattern in around 10000 files and find the records in the files which contain that pattern. I can use grep here, but it is taking a lot of time.
Below is the command I am using to search for the particular string pattern after unzipping the dat.gz files:
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'
If I simply count how many files there are after unzipping the above dat.gz files:
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | wc -l
I get around 10000 files. I need to search for the above string pattern in all these 10000 files and find the records which contain it. My command above works fine, but it is very, very slow.
What is the best approach for this? Should we take 100 files at a time and search for the particular string pattern in those 100 files in parallel?
Note:
I am running SunOS
bash-3.00$ uname -a
SunOS lvsaishdc3in0001 5.10 Generic_142901-02 i86pc i386 i86pc
Do NOT run this in parallel!!!! That's going to bounce the disk head all over the place, and it will be much slower.
Since you are reading an archive file, there's one way to get a substantial performance boost: don't write the results of the decompression out to disk. The ideal answer is to decompress to a stream in memory; if that's not viable, then decompress to a ramdisk.
In any case you do want some parallelism here: one thread should be obtaining the data and then handing it off to another that does the search. That way you will either be waiting on the disk or on the core doing the decompressing, and you won't waste any of that time doing the search.
(Note that in the case of the ramdisk you will want to read the files it wrote promptly and then delete them so the ramdisk doesn't fill up.)
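A minimal sketch of just the ramdisk part of that idea (on Solaris /tmp is typically swap-backed, and /tmp/unpacked.$$ is just an example name):
for f in /data/newfolder/real-time-newdata/*_20120809_0_*.gz; do
    gzcat "$f" > /tmp/unpacked.$$                                    # decompress into memory-backed /tmp
    grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' /tmp/unpacked.$$
    rm -f /tmp/unpacked.$$                                           # delete immediately so the ramdisk doesn't fill up
done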
For starters, you will need to uncompress the file to disk.
This does work (in bash), but you probably don't want to try to start 10,000 processes all at once. Run it inside the uncompressed directory:
for i in `find . -type f`; do ((grep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1' $i )&); done
So, we need to have a way to limit the number of spawned processes. This will loop as long as the number of grep processes running on the machine exceeds 10 (including the one doing the counting):
while [ `top -b -n1 | grep -c grep` -gt 10 ]; do echo true; done
I have run this, and it works.... but top takes so long to run that it effectively limits you to one grep per second. Can someone improve upon this, adding one to a count when a new process is started and decrementing by one when a process ends?
for i in `find . -type f`; do ((grep -l 'blah' $i)&); (while [ `top -b -n1 | grep -c grep` -gt 10 ]; do sleep 1; done); done
Any other ideas for how to determine when to sleep and when not to? Sorry for the partial solution, but I hope someone has the other bit you need.
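One way to avoid parsing top is to let the shell count its own background jobs. A sketch, assuming bash; MAXJOBS is just a name chosen for this example:
MAXJOBS=10
for i in $(find . -type f); do
    grep -l 'blah' "$i" &
    # spin here while MAXJOBS or more of our greps are still running
    while [ "$(jobs -r | wc -l)" -ge "$MAXJOBS" ]; do
        sleep 1
    done
done
wait    # let the last few greps finish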
If you are not using regular expressions you can use the -F option of grep or use fgrep. This may provide you with additional performance.
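For example, the command from the question could become something like this (a sketch):
gzcat /data/newfolder/real-time-newdata/*_20120809_0_*.gz | grep -F 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'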
Your gzcat .... | wc -l does not indicate 10000 files, it indicates 10000 lines total for however many files there are.
This is the type of problem that xargs exists for. Assuming your version of gzip came with a script called gzgrep (or maybe just zgrep), you can do this:
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'
That will run one gzgrep command with batches of as many individual files as it can fit on a command line (there are options to xargs to limit how many, or for a number of other things). Unfortunately, gzgrep still has to uncompress each file and pass it off to grep, but there's not really any good way to avoid having to uncompress the whole corpus in order to search through it. Using xargs in this way will however cut down some on the overall number of new processes that need to be spawned.
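If you want to cap the batch size explicitly, the -n option to xargs does that. A sketch, where 100 files per gzgrep invocation is an arbitrary choice:
find /data/newfolder/real-time-newdata -type f -name "*_20120809_0_*.gz" -print | xargs -n 100 gzgrep 'b295ed051380a47a2f65fb75ff0d7aa7^]3^]-1'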
