Best way to identify similar text inside strings?

I have a list of phrases; actually it's an Excel file, but I can extract each single line if needed.
I need to find lines that are very similar to each other. For example, one line can be:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°)
and some lines later I can have the same line again, or this one:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA200 (temp.max60°)
As you can see, these two lines are almost the same: not equal in this case, but about 98% similar.
The main problem is that I have to process about 45k lines, and for this reason I'm searching for a way to do this quickly and, ideally, visually.
The first idea that came to my mind was to compare the 1st line to the 2nd, then the 3rd, and so on until the end, then the 2nd to the 3rd up to the last-but-one, and build a kind of score: for example, the 1st line is 100% similar to line 42, 99% to line 522, ... 21% to line 22142, etc.
But that's only one idea, and maybe not the best one.
Maybe there's already a good program/script/online service out there; I searched but couldn't find one, so in the end I'm asking here.
Does anyone know a good way (if this is possible), a script, or an online service to achieve this?

One thing you can do is write a script which does the following:
Extract the data from the csv file.
Define a regex that can capture the similarity; a Python example can be:
[\w\s]+\([\w]+\)[\w\s]+\([\w°]+\)
Or something similar; refer to the regex documentation.
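For instance, a quick way to try such a pattern from the shell (just a sketch, not this answer's exact method; it assumes GNU grep with PCRE support and a hypothetical phrases.txt, and the character classes are widened slightly so the pattern also matches the spaces and dots inside the sample lines):
# -P enables Perl-compatible regexes, so \w and \s work inside the brackets
grep -P '[\w\s]+\([\w\s]+\)[\w\s.]+\([\w.°]+\)' phrases.txt
Both sample lines above match this pattern, but note that a regex on its own only tells you that two lines share a format, not how similar they are.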

The problem you have is that you are not looking for an exact match, but for something "like" the text.
This is a problem even databases have not really solved; it results in a full table scan.
So we're unlikely to solve it perfectly here.
However, I'd like to propose some alternatives for you to consider:
You could decide to limit the differences to specific character sets.
In the example above, you were ignoring numbers but respecting letters.
If we can assume that this rule will always hold, then we can perform a text replacement on the string:
ANTIBRATING SSPIRING JOINT (type 2) mod. GA160 (temp.max60°) ==> ANTIBRATING SSPIRING JOINT (type _) mod. GA_ (temp.max_°)
Now, we can deal with this problem by performing an exact string comparison. This can be done by hashing. The easiest way is to feed a hashmap/hashset or a database with a hash index on the column where you will store this adjusted text.
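A minimal sketch of that idea in the shell (assuming one phrase per line in a hypothetical phrases.txt): mask digit runs, then group and count the exact duplicates of the masked form:
# mask numbers, then count how many lines collapse onto each masked form
sed -E 's/[0-9]+/_/g' phrases.txt | sort | uniq -c | sort -rn | head
# same idea, but keeping track of which line numbers fall into each group
awk '{key = $0; gsub(/[0-9]+/, "_", key); lines[key] = lines[key] NR " "}
     END {for (k in lines) if (split(lines[k], a, " ") > 1) print k " -> lines " lines[k]}' phrases.txt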
You could decide to trade space for time.
For example, you could feed the strings to a service which builds lots of different kinds of indexes on your data. For example, feed Elasticsearch with your data and then run analytic queries on it.
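For instance (a rough sketch only; the index name phrases and field name line are made up, and this assumes an Elasticsearch instance on localhost that has already been loaded with the 45k lines):
# fuzzy full-text query against a hypothetical "phrases" index
curl -s -X POST 'http://localhost:9200/phrases/_search' \
     -H 'Content-Type: application/json' \
     -d '{"query": {"match": {"line": {"query": "ANTIBRATING SSPIRING JOINT (type 2) mod. GA160", "fuzziness": "AUTO"}}}}'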

Fuzzy searching is the key.
I found several projects and ideas, but the one I used is tre-agrep. I know it's quite old, but in this case it works for me. I created this little script to help me build a list of differences, so I can check it manually against my file:
#!/bin/bash
########## CONFIGURATIONS ##########
original_file=/path/jjj.txt
t_agrep_bin="$(command -v tre-agrep)"
destination_file=/path/destination_file.txt
distance=1
########## CONFIGURATIONS ##########

# total number of lines, used only for the progress message
lines=$(grep "" -c "$original_file")

# start from an empty destination file
if [[ -f "$destination_file" ]]; then
    rm -f "$destination_file"
fi

start=1
while IFS= read -r line; do
    echo "Checking line $start/$lines"
    # every line within $distance edits of the current one, case-insensitively, with line numbers (-n)
    lista=$("$t_agrep_bin" -"$distance" -B --colour -s -n -i "$line" "$original_file")
    # keep only the matching line numbers, space-separated, one group per output row
    echo "$lista" | awk -F ':' '{print $1}' ORS=' ' >> "$destination_file"
    echo >> "$destination_file"
    start=$((start+1))
done < "$original_file"

Related

Is there a bash function for determining the number of variables from a read provided by the end user?

I am currently working on a small command-line tool that someone not very familiar with bash could run from their computer. I have changed the content for confidentiality; however, the functionality has remained the same.
The user would be given a prompt;
the user would then respond with their answer(s).
From this, I would be given two bits of information:
1. their responses, now as individual variables
2. the number of variables that I have now been given: this value is now a variable as well
My current script is as follows:
echo List your favorite car manufacturers
read $car1 $car2 $car3 #end user can list as many as they wish
for n in {1..$numberofmanufacturers} #finding the number of variables/manufacturers is my second question
do
echo car$n
done
I want to allow the user to enter as many car manufacturers as they please (n >= 1); however, I need each manufacturer to be a different variable. I also need to be able to automate the count of the number of manufacturers and have that value be its own variable.
Would I be better off having the end user create a .txt file in which they list their manufacturers (vertically), thus forcing me to use wc -l to determine the number of manufacturers?
I appreciate any help in advance.
As I said in the comment, whenever you want to use multiple dynamically created variables, you should check whether there isn't a better data structure for your use case; in almost all cases there will be. Here is an implementation using bash arrays. It prints out the contents of the input array in three different ways.
echo List your favorite car manufacturers
# read the user's input into an array, split on whitespace
read -a cars
echo Looping over array values
for car in "${cars[@]}"
do
    echo "$car"
done
echo Looping over array indices
for i in "${!cars[@]}"
do
    echo "${cars[$i]}"
done
echo Looping from 0 to length-1
let numcars=${#cars[@]}
for i in $(seq 0 $((numcars-1)))
do
    echo "${cars[$i]}"
done
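For example, if the user types toyota honda ford at the prompt, ${#cars[@]} is 3 and each of the three loops prints those three names, one per line.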

grouping input files with incremental names to divide a computation into smaller ones

I created a script to run a function that takes a specific file and compares it to several others. The files that I have to scan for matches number more than 900 and are incrementally named like this:
A001-abc.txt A001-efg.txt [..] A002-efg.txt A002-hjk.txt [..] A120-xwz.txt (no whitespaces)
The script looks like this:
#!/bin/bash
for f in *.txt; do
bedtools function /mydir/myfile.txt *.txt> output.txt
done
Now, the computation is extremely intensive and can't be completed. How can I group the files according to their incremental names and perform the computation on files A001-abc, A001-def, then go on to A002-abc, A002-def, ..., up to A123-xyz, and so on?
I was thinking that by finding a way to specify the name with an incremental variable and then grouping the files, I could divide the computation into many smaller ones. Is that reasonable and doable?
Many thanks!
The problem might be that you're not using $f. Try using this line in the for-loop instead. Also, use >> so the output appends instead of overwriting the output file.
bedtools function /mydir/myfile.txt "$f" >> output.txt
But if you still want to break it up, you might try a double loop like this:
for n in $(seq --format="%03.f" 1 120); do
for f in "A$n"*; do
bedtools function /mydir/myfile.txt "$f" >> "${n}output.txt"
done
done
The values of $n in the output of the seq command are 001 002 .. 120.
So the inner loop loops through A001* A002* .. A120*.
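If the prefixes are not exactly 001..120, a variation (just a sketch; it assumes every file matches the AXXX-name.txt pattern with no whitespace, as stated in the question) is to derive the prefixes from the file names themselves:
# collect the unique AXXX prefixes that actually exist, then loop over each prefix
for p in $(printf '%s\n' A*-*.txt | cut -d- -f1 | sort -u); do
    for f in "$p"-*.txt; do
        bedtools function /mydir/myfile.txt "$f" >> "${p}output.txt"
    done
done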

bash shell - increment (associative) array value

Just want to share with you something I did not easily find by myself...
I am a newbie in shell scripting and was just wondering how I can increment a value in an associative array.
Let's assume this script:
#!/bin/bash
declare -A b # declare an associative array
a="aaa"
b[$a]=1
echo ${b[@]} # display all the values
echo ${b[$a]} # display the first value (1)
echo ${b[aaa]} # display the first value as well (1)
The solution can be
((b[$a]++))
echo ${b[@]} # display 2
Now that I found it, it seems evident, but I spent some time to get it...
I hope this can save some time to people :)
As described above, the solution can be
((b[$a]++)) # or (('b[$a]'++)) for a more secure way, as pointed out by @gniourf_gniourf
echo ${b[@]} # display 2
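As a small self-contained usage sketch (not from the original post), the same increment idiom is handy for counting occurrences:
#!/bin/bash
declare -A count                      # word -> number of occurrences
for word in apple pear apple apple pear; do
    ((count[$word]++))                # same increment as above
done
for word in "${!count[@]}"; do
    echo "$word: ${count[$word]}"     # prints apple: 3 and pear: 2 (order not guaranteed)
done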

Bash script key/value pair regardless of bash version

I am writing a curl bash script to test web services. I will have file_1, which contains the URL paths:
/path/to/url/1/{dynamic_path}.xml
/path/to/url/2/list.xml?{query_param}
Since the values in between {} are dynamic, I am creating a separate file which will have the values for these params. The input would be key-value pairs, i.e.,
dynamic_path=123
query_param=shipment
By combining the two files, the input should become:
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
This is the background of my problem. Now, my questions:
I am doing it in a bash script, and the approach I am using is to first read the file with the parameters and parse it on '=' into key/value pairs, so the replacement is easy: for each URL I find the substring between {} and use whatever text it contains as the key to fetch the value from the array.
My approach sounds okay (at least to me), BUT I just realized that
declare -A input_map is only supported in bash 4.0 and higher. Now, I am not 100% sure what the target environment for my script will be, since it could run in multiple departments.
Is there anything better you could suggest? Any other approach? Any other design?
P.S.:
This is the first time I am working on a bash script.
Here's a risky way to do it: Assuming the values are in a file named "values"
. values
eval "$( sed 's/^/echo "/; s/{/${/; s/$/"/' file_1 )"
Basically, stick a dollar sign in front of the braces and transform each line into an echo statement.
More effort, with awk:
awk '
NR==FNR {split($0, a, /=/); v[a[1]]=a[2]; next}
(i=index($0, "{")) && (j=index($0,"}")) {
key=substr($0,i+1, j-i-1)
print substr($0, 1, i-1) v[key] substr($0, j+1)
}
' values file_1
There are many ways to do this. You seem to be thinking of putting all the inputs in a hashmap and then iterating over that hashmap. In shell scripting it's more common and practical to process things as a stream, using pipelines.
For example, your inputs could be in a csv file:
123,shipment
345,order
Then you could process this file like this:
while IFS=, read path param; do
sed -e "s/{dynamic_path}/$path/" -e "s/{query_param}/$param/" file_1
done < input.csv
The output will be:
/path/to/url/1/123.xml
/path/to/url/2/list.xml?shipment
/path/to/url/1/345.xml
/path/to/url/2/list.xml?order
But this is just an example; there are many other ways.
You should definitely start by writing a proof of concept and testing it on your deployment server. This example should work in old versions of bash too.
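If bash 4 associative arrays turn out to be unavailable, another portable sketch (params.txt is a made-up name for the parameter file, which the question doesn't name; file_1 is from the question) is to look each key up in the parameter file on demand:
#!/bin/bash
# look a key up in the key=value file; no declare -A needed, so this works in pre-4.0 bash
lookup() {
    sed -n "s/^$1=//p" params.txt
}

while IFS= read -r url; do
    key=${url#*\{}                    # strip everything up to and including the first {
    key=${key%\}*}                    # strip the closing } and whatever follows it
    value=$(lookup "$key")
    echo "${url/\{$key\}/$value}"     # /path/to/url/1/{dynamic_path}.xml -> /path/to/url/1/123.xml
done < file_1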

how to create a txt file with columns being the descending sub-directories in Linux?

My data follow the structure:
../data/study_ID/FF_Number/Exam_Number/date,
where the data dir contains 176 participants' sub-directories. The ID number represents the participant's ID, and each of the following sub-directories represents some experimental number.
I want to create a txt file with one line per participant and the following columns: study_ID, FF_Number, Exam_Number and date.
However, it gets a bit more complicated, as I want to divide the participants into chunks of ~15-20 participants per chunk for the following analysis.
Any suggestions?
Cheers.
Hmm, nobody?
You should redirect the output of the find command, consider the switches -type d and -maxdepth, and probably parse it with sed, replacing "/" with spaces. Piping through the cut and column -t commands, and sort and uniq, may also be useful. Do the names, apart from FF and ID, contain spaces or special characters, e.g. related to the names of participants?
It should be possible to get a TXT with a one-liner and a few pipes.
You should try, and post the first results of your work on this :)
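For what it's worth, a rough sketch of that idea (assuming the ../data/study_ID/FF_Number/Exam_Number/date layout from the question, with no spaces in the names; participants.txt and the chunk size of 15 are placeholders):
# one row per date directory, four tab-separated columns: study_ID  FF_Number  Exam_Number  date
find ../data -mindepth 4 -maxdepth 4 -type d | sed 's|^\.\./data/||' | tr '/' '\t' | sort > participants.txt
# split into chunks of ~15 rows (chunk_aa, chunk_ab, ...); a participant with several
# exams/dates spans several rows, so chunking strictly by participant needs an extra grouping step
split -l 15 participants.txt chunk_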
EDIT: Alright, I created a structure for myself with several thousand directories and subdirectories, numbered by participant, exam number, etc., which looks like this (maybe it's not identical to what you have, but don't worry). Studies are numbered from 5 to 150, FF from 45 to 75, and dates from 2012_01_00 to 2012_01_30, which makes a really huge quantity of directories in total.
/Users/pwadas/bzz/data
/Users/pwadas/bzz/data/study_005
/Users/pwadas/bzz/data/study_005/05_Num
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_00
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_01
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_02
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_03
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_04
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_05
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_06
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_07
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_08
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_09
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_10
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_11
/Users/pwadas/bzz/data/study_005/05_Num/45_Exam/2012_01_12
Now, I want (quote) "a txt file with one line per participant and the following columns: study ID, FF_number, Exam_Number and date."
So I use the following one-liner:
find /Users/pwadas/bzz/data -type d | head -n 5000 |cut -d'/' -f5-7 | uniq |while read line; do echo -n "$line: " && ls -d /Users/pwadas/bzz/$line/*Exam/* | perl -0pe 's/.*2012/2012/g;s/\n/ /g' && echo ; done > out.txt
and here is the output (the first few lines from out.txt). The lines are very long, so I cut them in the output to the first 80-90 characters:
dtpwmbp:data pwadas$ cat out.txt |cut -c1-90
data:
data/study_005:
data/study_005/05_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/06_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/07_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
data/study_005/08_Num: 2012_01_00 2012_01_01 2012_01_02 2012_01_03 2012_01_04 2012_01_05 2
dtpwmbp:data pwadas$
I hope this will help you a little, and that you'll be able to modify it according to your needs and patterns; that seems to be all I can do :) You should analyze the one-liner, especially the cut command and the perl regex part, which removes newlines and the full directory name from the ls output. This is probably far from optimal, but beautifying is not the point here, I guess :)
So, good luck :)
PS. "head" command limits output for N first lines, you'll probably want to skip out
| head .. |
part.
