Multi-threaded shell script - multithreading

Can anybody help me with writing a multi-threaded shell script?
Basically I have two files: one contains around 10K lines (main_file) and the other contains around 200 lines (sub_file). The 200 lines are strings that occur repeatedly in main_file; I collected those repeated strings into sub_file, and they appear at random positions in main_file. I'm trying to extract the matching lines for each string into its own file using the command below.
a=0
while IFS= read -r line
do
    a=$(($a+1))
    users[$a]=$line
    # search main_file for the string and append the matches to a file named after it
    egrep "${line}" "$main_file" >> "$line"
done < "$sub_file"
Run single-threaded this takes a long time, so I am thinking of using multiple processes to finish the job in as little time as possible.
Please help me out.

The tool you need for that is GNU parallel:
parallel egrep '{}' "$main_file" '>' '{}' < "$sub_file"
You can adjust the number of jobs run at once with the -P option:
parallel -P 4 egrep '{}' "$main_file" '>' '{}' < "$sub_file"
Please see the manual for more info.
By the way, to make sure that you don't process a line twice, you can make the input unique first:
awk '!a[$0]++' "$sub_file" | parallel -P 4 egrep '{}' "$main_file" '>' '{}'
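If GNU parallel is not available, plain bash background jobs can approximate the same thing; a minimal sketch, assuming the same main_file and sub_file variables as above and a deduplicated sub_file:
max_jobs=4
count=0
while IFS= read -r line
do
    # run each search as a background job, appending to a file named after the string
    grep -E -- "$line" "$main_file" >> "$line" &
    count=$((count + 1))
    # throttle: after starting max_jobs searches, wait for the batch to finish
    if (( count % max_jobs == 0 )); then
        wait
    fi
done < "$sub_file"
wait    # wait for any remaining background jobs
This batching is cruder than parallel's scheduling, but it needs nothing beyond bash and grep.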

NOTE: Reposting from one of my previous posts. It is not directly applicable here, but it is similar enough to tweak.
I have a file 1.txt with the below contents.
-----cat 1.txt-----
1234
5678
1256
1234
1247
I have 3 more files in a folder
-----ls -lrt-------
A1.txt
A2.txt
A3.txt
The three files have the same format with different data values (all three files are tab-delimited).
-----cat A1.txt----
A X 1234 B 1234
A X 5678 B 1234
A X 1256 B 1256
-----cat A2.txt----
A Y 8888 B 1234
A Y 9999 B 1256
A X 1234 B 1256
-----cat A3.txt----
A Y 6798 C 1256
My objective is to search A1, A2 and A3 (only the 3rd column of each tab-delimited file) for the values listed in 1.txt,
and the output must be redirected to the file matches.txt as given below.
Code:
/home/A1.txt:A X 1234 B 1234
/home/A1.txt:A X 5678 B 1234
/home/A1.txt:A X 1256 B 1256
/home/A2.txt:A X 1234 B 1256
The following should work.
cat A*.txt | tr -s '\t' '|' > combined.dat
{ while read myline; do
    # field 19 was the key column in my original post; adjust it to the column you need
    recset=`echo $myline | cut -f19 -d '|' | tr -d '\r'`
    var=$(grep "$recset" 1.txt | wc -l)
    if [[ $var -ne 0 ]]; then
        echo "$myline" >> final.dat
    fi
done } < combined.dat
Using AWK
awk 'NR==FNR{a[$0]=1}$3 in a{print FILENAME":"$0}' 1.txt A* > matches.txt
For pipe-delimited input:
awk -F'|' 'NR==FNR{a[$0]=1}$3 in a{print FILENAME":"$0}' 1.txt A* > matches.txt

Related

compare grep output for same string from two files

I have 3 files:
file #1 contains the list of strings to check for
file #2 contains the new prices
file #3 contains the prices that need to be replaced if they change in file #2
Example:
file #1
item1
item2
file #2
item1cost100
item2cost200
file #3
item1cost101
item2cost199
After running the script, the file #3 should be updated
file #3
item1cost100
item2cost200
Files #2 and #3 contain a lot of entries, but only the entries listed in file #1 need to be checked, and if they differ the values from file #2 should be written to file #3.
I only got as far as comparing the two files for one string; I am not sure how to loop through the contents of file #1 or how to write the changes to file #3.
I started working with the sed command and got stuck, not knowing how to expand variables inside it.
Here is what I have:
item="item1"
itemold=$(cat file2 | grep item1)
echo $itemold
itemnew=$(cat file3 | grep item1)
echo $itemnew
echo $item
if [ "$itemold" = "$itemnew" ]; then
echo "MATCH!"
else
echo "NO MATCH!"
fi
#!/bin/bash
cut -d' ' -f1 file1 > a
grep `cat a` file2 > b
grep `cat a` file3 > c
sed s/`cat c`/`cat b`/g file3 > d
cut -d' ' -f2 file1 > a
grep `cat a` file2 > b
grep `cat a` file3 > c
sed s/`cat c`/`cat b`/g d > file3
This takes care of the case you gave. It could be generalized.
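One way to generalize it is to loop over file1 instead of hard-coding each item. A rough sketch under the same assumptions as above (each item matches exactly one line in file2 and file3, and the lines contain no characters special to sed):
#!/bin/bash
# For every item listed in file1, copy the matching line from file2
# over the matching line in file3.
while IFS= read -r item; do
    new=$(grep -- "$item" file2)    # current price line from file2
    old=$(grep -- "$item" file3)    # line to be replaced in file3
    if [ -n "$new" ] && [ -n "$old" ] && [ "$new" != "$old" ]; then
        sed -i "s/$old/$new/" file3    # GNU sed; use sed -i '' on BSD/macOS
    fi
done < file1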

Join 2 files by common column header (without awk/sed)

Basically I want to get all records from file2, but filter out columns whose header doesn't appear in file1
Example:
file1
Name Location
file2
Name Phone_Number Location Email
Jim 032131 xyz xyz@qqq.com
Tim 037903 zzz zzz@qqq.com
Pimp 039141 xxz xxz@qqq.com
Output
Name Location
Jim xyz
Tim zzz
Pimp xxz
Is there a way to do this without awk or sed, but still using coreutils tools? I've tried doing it with join, but couldn't get it working.
ALL_COLUMNS=$(head -n1 file2)
for COLUMN in $(head -n1 file1); do
JOIN_FORMAT+="2.$(( $(echo ${ALL_COLUMNS%%$COLUMN*} | wc -w)+1 )),"
done
join -a2 -o ${JOIN_FORMAT%?} /dev/null file2
Explanation:
ALL_COLUMNS=$(head -n1 file2)
This saves all the column names of file2, which we will search through next.
for COLUMN in $(head -n1 file1); do
JOIN_FORMAT+="2.$(( $(echo ${ALL_COLUMNS%%$COLUMN*} | wc -w)+1 )),"
done
For every column in file1, we look for the position of the column with the same name in file2 and append it to JOIN_FORMAT in the form "2.<column_number>,".
join -a2 -o ${JOIN_FORMAT%?} /dev/null file2
Once the option string is complete (2.1,2.3,), we pass it to join, removing the trailing ,.
join prints the unpairable lines from the second file provided (-a2 -> file2), but only the columns specified in the -o option.
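For the example files above, the loop builds JOIN_FORMAT=2.1,2.3, so the final command expands to:
join -a2 -o 2.1,2.3 /dev/null file2
which selects fields 1 and 3 (Name and Location) of every line in file2.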
Not very efficient, but works for your example:
#!/bin/bash
read -r -a cols < file1
echo "${cols[#]}"
read -r -a header < <(head -n1 file2)
keep=()
for (( i=0; i<${#header[@]}; i++ )) ; do
for c in "${cols[@]}" ; do
if [[ ${header[i]} == "$c" ]] ; then
keep+=($i)
fi
done
done
while read -r -a data ; do
for idx in "${keep[@]}" ; do
printf '%s ' "${data[idx]}"
done
printf '\n'
done < <(tail -n+2 file2)
Tools used: head and tail. They aren't essential, though. And bash, of course.

Take nth column in a text file

I have a text file:
1 Q0 1657 1 19.6117 Exp
1 Q0 1410 2 18.8302 Exp
2 Q0 3078 1 18.6695 Exp
2 Q0 2434 2 14.0508 Exp
2 Q0 3129 3 13.5495 Exp
I want to take the 3rd and 5th field of every line, like this:
1657 19.6117
1410 18.8302
3078 18.6695
2434 14.0508
3129 13.5495
I'm using this code:
nol=$(cat "/path/of/my/text" | wc -l)
x=1
while [ $x -le "$nol" ]
do
line=($(sed -n "$x"p /path/of/my/text))
echo ""${line[2]}" "${line[4]}"" >> out.txt
x=$(( $x + 1 ))
done
It works, but it is very complicated and takes a long time to process long text files.
Is there a simpler way to do this?
iirc :
cat filename.txt | awk '{ print $3, $5 }'
or, as mentioned in the comments :
awk '{ print $3, $5 }' filename.txt
You can use the cut command:
cut -d' ' -f3,5 < datafile.txt
prints
1657 19.6117
1410 18.8302
3078 18.6695
2434 14.0508
3129 13.5495
where
-d' ' means: use a space as the delimiter
-f3,5 means: take and print the 3rd and 5th columns
cut is much faster on large files than a pure shell solution. If your file is delimited with multiple whitespace characters, you can squeeze them into single spaces first, like:
sed 's/[\t ][\t ]*/ /g' < datafile.txt | cut -d' ' -f3,5
where the (GNU) sed will replace any run of tab or space characters with a single space.
For a variant - here is a perl solution too:
perl -lanE 'say "$F[2] $F[4]"' < datafile.txt
For the sake of completeness:
while read -r _ _ one _ two _; do
echo "$one $two"
done < file.txt
Instead of _ an arbitrary variable (such as junk) can be used as well. The point is just to extract the columns.
Demo:
$ while read -r _ _ one _ two _; do echo "$one $two"; done < /tmp/file.txt
1657 19.6117
1410 18.8302
3078 18.6695
2434 14.0508
3129 13.5495
One more simple variant -
$ while read line
do
set $line # assigns words in line to positional parameters
echo "$3 $5"
done < file
If your file contains n lines, then your script has to read the file n times; so if you double the length of the file, you quadruple the amount of work your script does — and almost all of that work is simply thrown away, since all you want to do is loop over the lines in order.
Instead, the best way to loop over the lines of a file is to use a while loop, with the condition-command being the read builtin:
while IFS= read -r line ; do
# $line is a single line of the file, as a single string
: ... commands that use $line ...
done < input_file.txt
In your case, since you want to split the line into an array, you can use the read builtin's support for populating an array variable (the -a option) and write:
while read -r -a line ; do
echo ""${line[1]}" "${line[3]}"" >> out.txt
done < /path/of/my/text
or better yet:
while read -r -a line ; do
echo "${line[1]} ${line[3]}"
done < /path/of/my/text > out.txt
However, for what you're doing you can just use the cut utility:
cut -d' ' -f3,5 < /path/of/my/text > out.txt
(or awk, as Tom van der Woerdt suggests, or perl, or even sed).
If you are using structured data, this has the added benefit of not invoking an extra shell process to run tr and/or cut or something. ...
(Of course, you will want to guard against bad inputs with conditionals and sane alternatives.)
...
while read line ;
do
lineCols=( $line ) ;
echo "${lineCols[0]}"
echo "${lineCols[1]}"
done < $myFQFileToRead ;
...

Linux Shell Script: Writing file and string together to another file

This is my code in a bash script:
for string in $LIST
do
echo "$string" >> file
grep -v "#" file1 >> file
done
"file1" contains only one value, and that value changes iteratively.
The result I get in "file" is:
a
1
b
2
c
3
However, I want this:
a 1
b 2
c 3
Thanks in advance.
for string in $LIST
do
echo "$string" $(grep -v "#" file1)
done > file
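The key change is that the string and the value pulled from file1 are printed by a single command, so they land on the same line, and the redirection on done writes the whole loop's output to file in one go. A printf variant of the same idea (a sketch, assuming file1 still holds a single non-comment value each time around the loop):
for string in $LIST
do
    # print the string and the value from file1 on one line
    printf '%s %s\n' "$string" "$(grep -v '#' file1)"
done > file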

How do I find the count of multiple words in a text file?

I am able to find the number of times a word occurs in a text file; for example, on Linux we can use:
cat filename|grep -c tom
My question is, how do I find the count of multiple words, like "tom" and "joe", in a text file?
Since you have a couple of names, regular expressions are the way to go on this one. At first I thought it was as simple as a grep count on the regular expression joe or tom, but found that this did not account for the scenario where tom and joe are on the same line (or tom and tom, for that matter).
test.txt:
tom is really really cool! joe for the win!
tom is actually lame.
$ grep -c '\<\(tom\|joe\)\>' test.txt
2
As you can see from the test.txt file, 2 is the wrong answer, so we needed to account for names being on the same line.
I then used grep -o to show only the parts of matching lines that match the pattern, which gives one match of tom or joe per output line. I then piped the results into wc -l for the line count.
$ grep -o '\(joe\|tom\)' test.txt|wc -l
3
3...the correct answer! Hope this helps
Ok, so first split the file into words, then sort and uniq:
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c
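To pull only the names of interest out of that frequency list, filter it afterwards, e.g.:
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c | grep -wE 'tom|joe'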
You can use uniq:
sort filename | uniq -c
Use awk:
{for (i=1;i<=NF;i++)
count[$i]++
}
END {
for (i in count)
print count[i], i
}
This will produce a complete word frequency count for the input.
Save the program as w.awk and pipe the output to grep to get the desired words:
awk -f w.awk input | grep -E 'tom|joe'
BTW, you do not need cat in your example; most programs that act as filters can take the filename as a parameter, hence it's better to use
grep -c tom filename
if not, there is a strong possibility that people will start throwing the Useless Use of Cat Award at you ;-)
The sample you gave does not search for the word "tom". It will also count "atom" and "bottom" and many more.
grep searches for regular expressions. A regular expression that matches the word "tom" or "joe" is
\<\(tom\|joe\)\>
You could use a regexp:
cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"
Here is one:
cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c
UPDATE
A shell script solution:
#!/bin/bash
file_name="$2"
string="$1"
if [ $# -ne 2 ]
then
echo "Usage: $0 <pattern to search> <file_name>"
exit 1
fi
if [ ! -f "$file_name" ]
then
echo "file \"$file_name\" does not exist, or is not a regular file"
exit 2
fi
line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0
# line_no_list holds, at index k, the line number and, at index k+1, the
# number of times the string occurs on that line
while read line
do
flag=0
while [[ "$line" == *$string* ]]
do
flag=1
line_no_list[line_no_indx]=$curr_line_indx
line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
total_occurance=$((total_occurance+1))
# replace one occurrence of "$string" with nothing and recheck
line=${line/"$string"/}
done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
if (( flag == 1 ))
then
line_no_indx=$((line_no_indx+2))
fi
curr_line_indx=$((curr_line_indx+1))
done < "$file_name"
echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurence # : Line Number : Nos of Occurance in this line]: "
for ((i=0; i<line_no_indx; i=i+2))
do
echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done
echo
I completely forgot about grep -f:
cat filename | grep -cf names
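Here names holds one pattern per line; as with the earlier grep -c examples, this counts matching lines rather than total occurrences:
$ printf 'tom\njoe\n' > names
$ grep -cf names test.txt
2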
AWK solution:
Assuming the names are in a file called names:
cat filename | awk 'NR==FNR {h[NR] = $1;ct[i] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -
Note that your original grep doesn't search for words. e.g.
$ echo tomorrow | grep -c tom
1
You need grep -w
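With -w the same test no longer matches:
$ echo tomorrow | grep -cw tom
0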
gawk -vRS='[^[:alpha:]]+' '{print}' filename | grep -Ec '^(tom|joe|bob|sue)$'
The gawk program sets the record separator to anything non-alphabetic, so every word will end up on a separate line. Then grep counts lines that match one of the words you want exactly.
We use gawk because the POSIX awk doesn't allow regex record separator.
For brevity, you can replace '{print}' with 1 - either way, it's an Awk program that simply prints out all input records ("is 1 true? it is? then do the default action, which is {print}.")
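With the test.txt shown earlier, the pipeline (here using the shorter 1 form) gives the combined count for both names:
$ gawk -vRS='[^[:alpha:]]+' '1' test.txt | grep -Ec '^(tom|joe)$'
3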
To find all hits in all lines
echo "tom is really really cool! joe for the win!
tom is actually lame." | awk '{i+=gsub(/tom|joe/,"")} END {print i}'
3
This will count "tomtom" as 2 hits.
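If substring hits such as "tomtom" or "atom" should not be counted, a word-boundary variant (using GNU awk's \< and \> operators) is one option; a sketch:
echo "tom is really really cool! joe for the win!
tom is actually lame." | gawk '{i+=gsub(/\<(tom|joe)\>/,"")} END {print i}'
3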
