Is there a way to use only the word result of gensim's most_similar function?

I am trying to use the most_similar function of gensim, but the results come back as a list of items like a = (word, cosine similarity). However, I can't retrieve the word with a[0]. Is there a way to access the word itself? I need to use it as an input.

most_similar() returns a ranked list of results in which each item is a tuple of a word and its similarity value.
After...
sims = model.wv.most_similar(my_word)
top_word = sims[0][0]
...top_word would have the word most-similar to my_word.
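If you need all of the returned words rather than just the top one, you can peel them out of the tuples the same way (continuing from the sims list above):
words = [word for word, similarity in sims]   # just the words, still ranked by similarity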

Related

String matching keywords and key phrases in Python

I am trying to perform a smart dynamic lookup with strings in Python for an NLP-like task. I have a large number of similarly structured sentences, and I would like to parse through each one and tokenize parts of the sentence. For example, I first parse a string such as "bob goes to the grocery store".
I am taking this string in, splitting it into words and my goal is to look up matching words in a keyword list. Let's say I have a list of single keywords such as "store" and a list of keyword phrases such as "grocery store".
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery store', 'computer store', 'coffee shop']
for word in sample.split():
    # do dynamic-length lookups
Now the issue is this: sometimes my sentence might simply be "bob goes to the store" instead of "bob goes to the grocery store".
I want to find the keyword "store" for sure, but if there are descriptive words such as "grocery" or "computer" before the word store, I would like to capture those as well. That is why I have the keyphrases list. Essentially, I want to capture a keyword at the very least, and then, if there are related words that might form a possible "phrase", capture those too.
Maybe an alternative is to have some sort of adjective list instead of a phrase list of multiple words?
How could I go about doing this sort of variable-length lookup, where I look at more than just a single word once one is captured? Or is there an entirely different method I should be considering?
Here is how you can use a nested for loop and a formatted string:
sample = 'bob goes to the grocery store'
keywords = ['store', 'restaurant', 'shop', 'office']
keyphrases = ['grocery', 'computer', 'coffee']
for kw in keywords:
    for kp in keyphrases:
        if f"{kp} {kw}" in sample:
            # Do something
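A slightly fuller sketch of the same idea, which also falls back to the bare keyword when no qualifier is present (the shortened sample sentence and the variable names are just illustrative):
sample = 'bob goes to the store'
keywords = ['store', 'restaurant', 'shop', 'office']
qualifiers = ['grocery', 'computer', 'coffee']

matches = []
for kw in keywords:
    if kw in sample.split():
        # prefer the qualified phrase when one actually appears in the sentence
        phrase = next((f"{q} {kw}" for q in qualifiers if f"{q} {kw}" in sample), None)
        matches.append(phrase if phrase else kw)

print(matches)   # ['store'] here; ['grocery store'] for the original sentence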

FuzzyWuzzy for very similar records in Python

I have a dataset in which I want to find the closest string match. For that purpose I'm using FuzzyWuzzy this way:
sol=process.extract(t,dev2,scorer=fuzz.token_sort_ratio)
where t is the string and dev2 is the list to compare against. My problem is that the data sometimes contains very similar records, and the options provided by FuzzyWuzzy seem to be lacking; I've tested token_sort, token_set, partial_token_sort, partial_token_set, ratio, partial_ratio, and WRatio.
For example, the string Italy - Serie A gives me the following 2 closest matches.
Token_sort_ratio: (92, 'Italy - Serie D');(86, 'Italian - Serie A')
The one I want is obviously the second, but character by character the first is closer, and it is a different league.
This happens with teams as well. If, say, I have the string Buchtholz, I obtain Buchtholz II before I get TSV Buchtholz.
My main idea now is to weight the presence or absence of certain characters more heavily, such as single capital letters at the end of the string, so that a differing or missing letter counts as less close; the same for parentheses and other special characters.
I don't know if there is a way to take this into account, or whether you have a better approach to get the string that really matches.
Similarity matches often require knowledge of the data being analysed. i.e. it is not just a blind single round of matching. I recommend that you pass your results through more steps of matching, starting with inclusive/optimistic approaches (like token_set_ratio) with low cut off scores and working toward more exclusive/pessimistic approaches with higher cut off scores until you have a clear winner. If you know more about the text you're analyzing, you can even modify the strings as you progress.
In a case I worked on, I did similarity matches of goods movement descriptions. In the descriptions the numbers sequences were more important than the text. e.g. when looking for a match for "SLURRY VALVE 250MM RAGMAX 2000" the 250 and 2000 part of the string are important, otherwise I get a "SLURRY VALVE 50MM RAGMAX 2000" as the best match instead of "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON" which is a better result.
I put the similarity match process through two steps:
1. Get a bunch of similar matches using an optimistic matching scorer (token_set_ratio).
2. Get the number sequences of those results and pass them through another round of matching with a stricter scorer (token_sort_ratio).
Doing this gave me the better result in the example I showed above.
Below are some blocks of code that could be of assistance.
Here's a function to get the numbers from a string. (In your case you might use this to exclude numbers from your string instead?)
def get_numbers_from_string(description):
    # keep only digits, dots and dashes; everything else becomes a space
    numbers = ''.join((ch if ch in '0123456789.-' else ' ') for ch in description)
    # collapse the runs of spaces between the surviving number groups
    numbers = ' '.join(numbers.split())
    return numbers
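For example, applied to the slurry valve description above, it keeps just the numeric tokens:
get_numbers_from_string("SLURRY VALVE 250MM RAGMAX 2000")
# '250 2000'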
and here is a portion of the code I used to put the description match through two rounds:
try:
    # get close match from goods move that has material numbers
    df_material = pd.DataFrame(process.extract(description,
                                               corpus_material,
                                               scorer=fuzz.token_set_ratio),
                               columns=['Similar Text', 'Score'])
    if df_material['Score'][df_material['Score'] >= cut_off_accuracy_materials].count() >= 1:
        similar_text = df_material['Similar Text'].iloc[0]
        score = df_material['Score'].iloc[0]
        if nr_description_numbers > 4:
            # if there are multiple matches found, then get best number combination match
            df_material = df_material[df_material['Score'] >= cut_off_accuracy_materials]
            new_corpus = list(df_material['Similar Text'])
            new_corpus = np.vectorize(get_numbers_from_string)(new_corpus)
            df_material['numbers'] = new_corpus
            df_numbers = pd.DataFrame(process.extract(description_numbers,
                                                      new_corpus,
                                                      scorer=fuzz.token_sort_ratio),
                                      columns=['numbers', 'Score'])
            similar_text = df_material['Similar Text'][df_material['numbers'] == df_numbers['numbers'].iloc[0]].iloc[0]
            nr_score = df_numbers['Score'].iloc[0]
hope it helps, and good luck

Replace words marked with offsets

I have a sentence like this:
"My name is Bond. It's a fake name."
and I have to replace some words, given in a list together with the offsets of each word:
name, 29-33; Bond, 11-15; name, 3-7
In addition, each word must be replaced with a specific word:
name -> noun
Bond -> proper
I have to obtain this output:
"My noun is proper. It's a fake noun."
I tried to manage the offsets with a post-offset variable that I update after each replacement, but it does not work because the list is unordered. Note that the find method is not valid either, because names repeat. Is there an algorithm to do this? Any vectorised implementation (string, NumPy, NLTK) that computes it in one step?
Bro, check this one:
string = "My name is Bond. It's a fake name."
words = string.split(" ")   # this will break your string into words
Now traverse the list and set the condition:
for i, word in enumerate(words):
    stripped = word.strip(".")          # ignore trailing punctuation when comparing
    if stripped == "name":
        words[i] = word.replace("name", "noun")
    if stripped == "Bond":
        words[i] = word.replace("Bond", "proper")
Now the list values have been changed; use the join() method to turn the list back into a string:
result = " ".join(words)   # "My noun is proper. It's a fake noun."
For more, please refer to https://www.tutorialspoint.com/python/python_strings.htm, which covers string processing in Python.
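For the offset-based requirement in the original question, here is a minimal sketch; it assumes the offsets are character positions into the original string (end-exclusive, as the example offsets suggest) and applies the replacements from the highest offset down so the earlier offsets stay valid:
text = "My name is Bond. It's a fake name."
replacements = [("name", 29, 33), ("Bond", 11, 15), ("name", 3, 7)]
mapping = {"name": "noun", "Bond": "proper"}

# apply from the highest offset down so earlier offsets are untouched
for word, start, end in sorted(replacements, key=lambda r: r[1], reverse=True):
    text = text[:start] + mapping[word] + text[end:]

print(text)   # My noun is proper. It's a fake noun.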

How do I get all hits from a cts:search() in MarkLogic?

I have a collection containing lots of documents.
When I search the collection, I need to get a list of matches independent of the documents. So if I search for the word "pie", I get back a list of documents, properly sorted by relevance. However, some of these documents contain the word "pie" in more than one place. I would like to get back a list of all matches, unrelated to the document where each match was found. Also, this list of all hits would need to be sorted by relevance (weight), again totally independent of the document (not grouped by document).
The following code searches and returns matches grouped by document...
let $searchfor := "pie"
let $query := cts:and-query((
cts:element-word-query(xs:QName("title"), ($searchfor), (), 16),
cts:element-word-query(xs:QName("para"), ($searchfor), (), 10)
))
let $resultset := cts:search(fn:collection("docs"), $query)[0 to 100]
for $n in $resultset
return cts:score($n)
What I need is $n to be the "match-node", not a "document-node"...
Thanks!
Document relevance is determined by TFIDF. Matches contribute to a document's score but don't have scores relative to each other. cts:search already returns results ordered by document relevance, so you could do this to get match nodes ordered by their ancestor document score:
let $searchfor := "pie"
let $query := cts:and-query((
cts:element-word-query(xs:QName("title"), ($searchfor), (), 16),
cts:element-word-query(xs:QName("para"), ($searchfor), (), 10)
))
return
cts:search(//(title|para),$query)[0 to 100]/cts:highlight(.,$query,element match {$cts:node})//match/*
You need to split the document (fragment it) into smaller documents. Every text node could be a document, with a stored original XPath so that the context is not lost.
I recommend that you look at the Search API (http://community.marklogic.com/pubs/5.0/books/search-dev-guide.pdf and http://community.marklogic.com/pubs/5.0/apidocs/SearchAPI.html). This API will give what you want, providing match nodes as well as the URIs for the actual documents. You should also find it easier to use for the general cases, although there will be edge cases where you will need to revert back to cts:search.
search:search is the specific function you will want to use. It will give you back responses similar to this:
<search:response total="1" start="1" page-length="10" xmlns=""
xmlns:search="http://marklogic.com/appservices/search">
<search:result index="1" uri="/foo.xml"
path="fn:doc("/foo.xml")" score="328"
confidence="0.807121" fitness="0.901397">
<search:snippet>
<search:match path="fn:doc("/foo.xml")/foo">
<search:highlight>hello</search:highlight></search:match>
</search:snippet>
</search:result>
<search:qtext>hello sample-property-constraint:boo</search:qtext>
<search:report id="SEARCH-FLWOR">(cts:search(fn:collection(),
cts:and-query((cts:word-query("hello", ("lang=en"), 1),
cts:properties-query(cts:word-query("boo", ("lang=en"), 1))),
()), ("score-logtfidf"), 1))[1 to 10]
</search:report>
<search:metrics>
<search:query-resolution-time>PT0.647S</search:query-resolution-time>
<search:facet-resolution-time>PT0S</search:facet-resolution-time>
<search:snippet-resolution-time>PT0.002S</search:snippet-resolution-time>
<search:total-time>PT0.651S</search:total-time>
</search:metrics>
</search:response>
Here you can see that every result has one or possibly more match elements defined.
How would you determine the relevance of a word independent of the document? Relevance is a measure of document relevance, not word relevance. I don't know how one would measure word relevance.
You could potentially return all words ordered by document relevance, then words for each document in "document order" which means the order in which they appear in the document. That would be relatively easy to do with search:search where you iterate over all results and extract each matching word. What would you present with each match? Its surrounding snippet?
Keep in mind that what you're asking for would potentially take a long time to execute.

Access list element using get()

I'm trying to use get() to access a list element in R, but am getting an error.
example.list <- list()
example.list$attribute <- c("test")
get("example.list") # Works just fine
get("example.list$attribute") # breaks
## Error in get("example.list$attribute") :
## object 'example.list$attribute' not found
Any tips? I am looping over a vector of strings which identify the list names, and this would be really useful.
Here's the incantation that you are probably looking for:
get("attribute", example.list)
# [1] "test"
Or perhaps, for your situation, this:
get("attribute", eval(as.symbol("example.list")))
# [1] "test"
# Applied to your situation, as I understand it...
example.list2 <- example.list
listNames <- c("example.list", "example.list2")
sapply(listNames, function(X) get("attribute", eval(as.symbol(X))))
# example.list example.list2
# "test" "test"
Why not simply:
example.list <- list(attribute="test")
listName <- "example.list"
get(listName)$attribute
# or, if both the list name and the element name are given as arguments:
elementName <- "attribute"
get(listName)[[elementName]]
If your strings contain more than just object names, e.g. operators like the $ here, you can evaluate them as expressions as follows:
> string <- "example.list$attribute"
> eval(parse(text = string))
[1] "test"
If your strings are all of the type "object$attribute", you could also parse them into object/attribute, so you can still get the object, then extract the attribute with [[:
> parsed <- unlist(strsplit(string, "\\$"))
> get(parsed[1])[[parsed[2]]]
[1] "test"
flodel's answer worked for my application, so I'm gonna post what I built on it, even though this is pretty uninspired. You can access each list element with a for loop, like so:
# ============ List with five elements of non-uniform length ============ #
example.list <-
  list(letters[1:5], letters[6:10], letters[11:15], letters[16:20], letters[21:26])

# ===== for loop that names and concatenates each consecutive element ===== #
derp <- c()
for (i in 1:length(example.list)) {
  derp <- append(derp, eval(parse(text = example.list[i])))
}
derp  # Not a particularly useful application here, but it proves the point.
I'm using code like this for a function that calls certain sets of columns from a data frame by the column names. The user enters a list with elements that each represent different sets of column names (each set is a group of items belonging to one measure), and the big data frame containing all those columns. The for loop applies each consecutive list element as the set of column names for an internal function* applied only to the currently named set of columns of the big data frame. It then populates one column per loop of a matrix with the output for the subset of the big data frame that corresponds to the names in the element of the list corresponding to that loop's number. After the for loop, the function ends by outputting that matrix it produced.
Not sure if you're looking to do something similar with your list elements, but I'm happy I picked up this trick. Thanks to everyone for the ideas!
"Second example" / tangential info regarding application in graded response model factor scoring:
Here's the function I described above, just in case anyone wants to calculate graded response model factor scores* in large batches. Each column of the output matrix corresponds to an element of the list (i.e., a latent trait with ordinal indicator items specified by column name in the list element), and the rows correspond to the rows of the data frame used as input. Each row should presumably contain mutually dependent observations, as from a given individual, to whom the factor scores in the same row of the output matrix belong. Also, I feel I should add that if all the items in a given list element use the exact same Likert scale rating options, the graded response model may be less appropriate for factor scoring than a rating scale model (cf. http://www.rasch.org/rmt/rmt143k.htm).
grmscores <- function(ColumnNameList, DataFrame) {
  require(ltm)  # (Rizopoulos, 2006)
  x <- matrix(NA, nrow = nrow(DataFrame), ncol = length(ColumnNameList))
  for (i in 1:length(ColumnNameList)) {  # flodel's magic featured below!
    x[, i] <- factor.scores(grm(DataFrame[, eval(parse(text = ColumnNameList[i]))]),
                            resp.patterns = DataFrame[, eval(parse(text = ColumnNameList[i]))])$score.dat$z1
  }
  x
}
Reference
*Rizopoulos, D. (2006). ltm: An R package for latent variable modelling and item response theory analyses, Journal of Statistical Software, 17(5), 1-25. URL: http://www.jstatsoft.org/v17/i05/
