How to search for different spellings of a person's full name

I want to search in a Solr database on full names. The documents in the database are from different sources, so the spelling of the name in the documents is not consistent.
The spelling can be firstname lastname or lastname firstname. There can also be one or more first names and one or more last names.
So if a name is:
firstname: ALBERTO JORGE
lastname: ALONSO CALEFACCION
The spellings can be:
ALBERTO JORGE ALONSO CALEFACCION
ALBERTO J. ALONSO CALEFACCION
ALBERTO J ALONSO CALEFACCION
ALBERTO ALONSO CALEFACCION
and
ALONSO CALEFACCION ALBERTO JORGE
ALONSO CALEFACCION ALBERTO J.
ALONSO CALEFACCION ALBERTO J
ALONSO CALEFACCION ALBERTO
I can search on the last names only with "ALONSO CALEFACCION"~0 and get correct results.
But how can I match all the different spellings with a single query?
The search will be created by a program based on user input.
The search is further complicated because Spanish names can contain extra words like "y" and "de", and these words are not required to match (in our case).
So the name in the database could be something like: ALBERTO JORGE ALONSO Y CALEFACCION
Thanks for your help.
I use Solr 3.6

If you store the first name in a firstname field and the last name in a lastname field, you can prepare your query in your programming language. For example, if the user typed two words, you can query firstname:(word1) AND lastname:(word2) OR firstname:(word2) AND lastname:(word1).
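A minimal sketch of that query construction in Python (the build_name_query helper and the firstname/lastname field names are illustrative, not part of Solr itself):

```python
def build_name_query(words):
    """Build a Solr query string that tries both first/last name orders.

    Sketch: handles exactly two user-supplied words and ORs the two
    possible field assignments together.
    """
    if len(words) != 2:
        raise ValueError("this sketch only handles two words")
    w1, w2 = words
    return ("(firstname:(%s) AND lastname:(%s))"
            " OR (firstname:(%s) AND lastname:(%s))") % (w1, w2, w2, w1)

query = build_name_query(["alberto", "alonso"])
```

For more words, the same idea generalizes to trying every split of the input between the two fields.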
You can even make a special type for these fields to find initial and contracted forms:
<fieldType name="AuthorsPrefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="200" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
Another approach is to generate all possible combinations during indexing and search for authors in this combo field:
ALBERTO JORGE ALONSO CALEFACCION
ALBERTO J ALONSO CALEFACCION
ALBERTO ALONSO CALEFACCION
ALONSO CALEFACCION ALBERTO JORGE
ALONSO CALEFACCION ALBERTO J
ALONSO CALEFACCION ALBERTO
You can generate these synonyms automatically by writing your own SearchComponent.
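A sketch of that combination step in Python (the name_variants helper is illustrative; it emits both orders plus initial forms with and without a trailing dot):

```python
def name_variants(firstnames, lastnames):
    """Generate spelling variants of a name for an index-time combo field:
    full first names, an abbreviated second first name (with and without
    a dot), the first name alone, and both name orders."""
    first_forms = [" ".join(firstnames)]
    if len(firstnames) > 1:
        head, second = firstnames[0], firstnames[1]
        first_forms.append("%s %s." % (head, second[0]))  # ALBERTO J.
        first_forms.append("%s %s" % (head, second[0]))   # ALBERTO J
        first_forms.append(head)                          # ALBERTO
    last = " ".join(lastnames)
    variants = []
    for f in first_forms:
        variants.append("%s %s" % (f, last))  # firstname lastname
        variants.append("%s %s" % (last, f))  # lastname firstname
    return variants

forms = name_variants(["ALBERTO", "JORGE"], ["ALONSO", "CALEFACCION"])
```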

Related

Implementing remove function for element tree element

I am trying to remove an element from an XML tree. My attempt is based on the example that I found in the Python documentation:
for country in root.findall('country'):
    # using root.findall() to avoid removal during traversal
    rank = int(country.find('rank').text)
    if rank > 50:
        root.remove(country)
tree.write('output.xml')
But I'm trying to use the remove() function for a string attribute, not an integer one.
for country in root.findall('country'):
    # using root.findall() to avoid removal during traversal
    description = country.find('rank').text
    root.remove(description)
tree.write('SampleData.xml')
But I get the following error:
TypeError: remove() argument must be xml.etree.ElementTree.Element, not str.
I ultimately added another element under country called description which holds a short description of the country and its features:
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
<description>Liechtenstein has a lot of flowers.</description>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N" />
<description>Singapore has a lot of street markets</description>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W" />
<neighbor name="Colombia" direction="E" />
<description>Panama is predominantly Spanish speaking.</description>
</country>
</data>
I'm trying to use the remove() function to delete that description attribute for all instances.
Indeed, in your code you are passing a string as argument, not a node:
description = country.find('rank').text
root.remove(description)
This is not what happens in the correct examples. The one with the integer does this:
rank = int(country.find('rank').text)
if rank > 50:
    root.remove(country)
Note that country is removed (a node), not rank (an int).
It is not clear what you want to do with the description, but make sure to remove a node, not a string. For instance,
description = country.find('rank').text
if description == "delete this":  # just an example condition
    root.remove(country)
Or, if you just want to remove the "rank" node and keep the country node:
ranknode = country.find('rank')
if ranknode.text == "delete this":
    country.remove(ranknode)
Since you actually have a description element (you call it an attribute, but that is confusing), you can target that element instead of rank:
descriptionnode = country.find('description')
if descriptionnode.text == "delete this":
    country.remove(descriptionnode)

duplicate title using jm-chinese-gb7714-2005-numeric.csl in Juris-M

When I format a reference using jm-chinese-gb7714-2005-numeric.csl in Juris-M, the title occurs twice. Does anyone know the reason?
Many thanks.
Example:
[1] MINEKUS M, ALMINGER M, ALVITO P, et al. A standardised static in vitro digestion method suitable for food – an international consensus
A standardised static in vitro digestion method suitable for food – an international consensus[J]. Food & Function, 2014, 5(6) : 1113–1124
The address of the reference https://pubs.rsc.org/en/content/articlelanding/2014/fo/c3fo60702j#!divAbstract
Gist of the csl file:
https://gist.github.com/redleafnew/6f6fa23c3627c67d968eee38e4d2d40a
This bug has been fixed. Replace
<text macro="title" suffix="[J]."/>
<text value=""/>
with
<text value="[J]."/>
The duplicated title will then be removed.

Solr WhitespaceTokenizerFactory makes URL parameters stop working

I created a new field type as seen below:
<fieldType name="text_whitespace" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" rule="unicode" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" rule="unicode" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
I need WhitespaceTokenizerFactory so that special characters are indexed and searchable, and that works now. But I have another question: when I use WhitespaceTokenizerFactory, URL parameters stop working, e.g.:
e.g. http://localhost:8983/solr/Test1/select?defType=dismax&hl.fl=content&hl=on&indent=on&q=%22C#"&qf=content^100&rows=1&wt=json
When I use that parameter in the Solr Web UI, it works and I get the result, but when I use the same parameter in the URL I get no result.
This is my data:
[
  {
    "id": "test1",
    "title": "test1# title C*?#",
    "content": "test1# title C*?#",
    "dynamic_s": 5
  },
  {
    "id": "test2",
    "title": "test2 title C#",
    "content": "test2 title C#",
    "dynamic_s": 10
  },
  {
    "id": "test3",
    "title": "test3 title",
    "content": "test3 title",
    "dynamic_s": 0
  }
]
If I use WhitespaceTokenizerFactory how do I make the parameter work in URL?
This is not related to Solr, but is how HTTP works.
As explained in your original post, this is because # has special meaning in URLs. A # indicates a local anchor (fragment) and is never transmitted to the server; it keeps a local reference to a single point in the page (these days the value after # refers to the id of the element the page should scroll to when being displayed; earlier it referenced an a tag with a name attribute).
To use characters with special meaning in URLs (& would also mean that there's a new parameter coming instead of being interpreted as a value to an argument), you have to escape them. In Javascript you can use encodeURIComponent to do this:
encodeURIComponent("foo#&bar")
-> "foo%23%26bar"
So to send the value foo#&bar as the argument, and not introduce a new parameter or a local anchor hash, the value would be sent as foo%23%26bar instead. Your HTTP server will decode this for you automagically.
?q=field%3Afoo%23%26bar
.. will be interpreted as field:foo#&bar serverside. Since ':' can usually be used safely in URLs, you don't have to escape it - but it doesn't hurt to do it properly. Look up URL escaping in your language of choice if you're going to do this in an application.
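The same escaping in Python, for example: urllib.parse.quote from the standard library plays the role of encodeURIComponent here (safe='' forces every reserved character, including '/', to be encoded):

```python
from urllib.parse import quote

# quote() percent-encodes characters with special meaning in URLs;
# '#' becomes %23 and '&' becomes %26, so they survive the round trip
value = quote("foo#&bar", safe='')

# building the q parameter from the question the same way
url = "http://localhost:8983/solr/Test1/select?q=" + quote('"C#"', safe='')
```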

Removing punctuation except for apostrophes AND intra-word dashes in R

I know how to separately remove punctuation and keep apostrophes:
gsub( "[^[:alnum:]']", " ", db$text )
or how to keep intra-word dashes with the tm package:
removePunctuation(db$text, preserve_intra_word_dashes = TRUE)
but I cannot find a way to do both at the same time. For example if my original sentence is:
"Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"
I would like it to be:
"Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
Of course, there will be extra white spaces, but I can remove them later.
I will be grateful for your help.
Use character classes
gsub("[^[:alnum:]['-]", " ", db$text)
## "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
I like David Arenberg's answer. If you need another way, you could try:
library(qdap)
text <- "Interested in energy/the environment/etc.? Congrats to our new e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the club in a great direction next year! #obama #swag"
gsub("/", " ",strip(text, char.keep=c("-","/"), apostrophe.remove=F,lower.case=F))
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
or
library(gsubfn)
clean(gsubfn("[[:punct:]]", function(x) ifelse(x=="'","'",ifelse(x=="-","-"," ")),text))
#[1] "Interested in energy the environment etc Congrats to our new e-board Ben Nathan Jenny and Adam y'all are sure to lead the club in a great direction next year obama swag"
clean is from qdap; it is used to remove escaped characters and extra spaces.
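The same idea works outside R too; for example, with Python's re module (a sketch: one negated character class keeps alphanumerics, apostrophes, and dashes, and collapses every other run of characters to a single space):

```python
import re

text = ("Interested in energy/the environment/etc.? Congrats to our new "
        "e-board! Ben, Nathan, Jenny, and Adam, y'all are sure to lead the "
        "club in a great direction next year! #obama #swag")

# replace each run of characters that are not alphanumeric,
# apostrophe, or dash with a single space, then trim the ends
cleaned = re.sub(r"[^A-Za-z0-9'-]+", " ", text).strip()
```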

Mongoose.js schema description issue (array vs object)

I need to store some user dictionaries (sets of words, stored as a user property), and each dictionary actually has only one other property: language.
So I can describe that property like that:
dictionaries: [{
    language: 'string',
    words: [.. word entry schema desc..]
}]
and store dictionaries like that:
dictionaries: [
    {language: 'en', words: [.. words of English dictionary..]},
    {language: 'es', words: [.. words of Spanish dictionary..]}
]
But I could also store the dictionaries in a "less nested" way: not an array but an object:
dictionaries: {
    en: [.. words of English dictionary..],
    es: [.. words of Spanish dictionary..]
}
But I don't see a way to describe such an object with a Mongoose schema. So the question is: which option is better (more reasonable in terms of storage and querying), given that I use Mongoose?