I am using Apache Solr for indexing using DataImportHandler.
The document structure is as follows:
id(long), title(text), abstract(text), pubDate(date)
I combined the title and abstract fields for text searching. My problem is that when I query
"title:utility" it gives results as follows:
id, title
6, Financial Deal Insights Energy & Utilities December 2008
11,Residential utility retail strategies in an economic downturn
16,Financial Deal Insights: Energy & Utilities Review of 2008
41,Solar at the heart of utility corporate strategy
I want to match only "utility", but it also returns results for "utilities".
I also tried title:"utility" and title:utility~1, but neither worked.
I read about 'stemming' but I don't have any idea how to use it.
Please help.
Thanks.
This is caused by the PorterStemFilterFactory in your text analysis chain.
<filter class="solr.PorterStemFilterFactory"/>
The stemmer reduces words to their root form, so "utility" matches "utilities" as well.
Check whether you actually need stemming for searching; if not, you can remove it from your filter chain.
Otherwise, look for a less aggressive stemmer that fits your needs.
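For example, here is a sketch of a field type that swaps the Porter stemmer for the milder KStemFilterFactory (the field type name is illustrative; adapt the tokenizer and filters to your existing schema):

```xml
<!-- Illustrative field type: KStem is less aggressive than Porter,
     so "utility" and "utilities" are less likely to be conflated -->
<fieldType name="text_en_light" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Use the Analysis screen in the Solr admin UI to verify how each candidate stemmer treats your terms before committing to one.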
Related
Does someone have a recommendation of tagging tool for NER types in raw text?
The input for the tool should be a library of text files (simple .txt format), there should be a convenient UI for selecting words and setting the tag/annotation that fits the selection, and the output should be a structural representation of the tags (e.g. start index, end index, tag) in JSON format.
Founder of LightTag here.
We provide a super convenient interface to do span annotations such as named entity recognition, classifications and relationships.
You can work as a single labeler or bring in a team, and LightTag will distribute the work between everyone automatically (no more selecting files and remembering what you've labeled already).
You can upload your own suggestions and let labelers use those, or use LightTag's built-in model.
Of course you can annotate at the character level and highlight subwords or multi word phrases.
You can try https://github.com/lasigeBioTM/MER (bash)
see the demo at http://labs.fc.ul.pt/mer/
Online tools:
I guess Dataturks' POS tool should work fine for your use case, you can just upload your data and specify the labels. The UI seems convenient enough.
Here is the link:
https://dataturks.com
It's an online tool, so you can work with multiple people to get the tagging done.
The exact output format you are looking for is not supported, but it can easily be converted: the output looks like word___LABEL word2___LABEL, so a simple two-line script can turn it into start and end indices.
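For instance, here is a minimal sketch of such a conversion (the word___LABEL format is as described above; the function name and the exact output keys are my own choices):

```python
# Convert "word___LABEL word2___LABEL" style output into
# {"start": ..., "end": ..., "tag": ...} records with character
# offsets relative to the original, untagged text.
def to_spans(tagged_line):
    spans = []
    offset = 0
    for token in tagged_line.split(" "):
        word, _, label = token.partition("___")
        spans.append({"start": offset, "end": offset + len(word), "tag": label})
        offset += len(word) + 1  # +1 for the separating space
    return spans

print(to_spans("Dropbox___ORG is___O a___O company___O"))
```

This assumes tokens are space-separated and that whitespace in the source text was exactly one space per gap; adjust if your export differs.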
Offline:
Another tool you can check out is Prodigy; it's downloadable software and does similar things, though you may have to pay for it upfront.
https://prodi.gy
I was comparing both ElasticSearch and Apache Solr for a search solution. Data that will go into the system is not moderated and I don't want anyone to search for something and some sexually explicit content to flash on the very top of the search result. But I don't want to remove them for search results either. I want to demote them, so that they come later in the search results. Can I do this in Solr or ElasticSearch ? Some pointers towards how to achieve this will be helpful.
In Solr you can't give "negative boosts" per se, but you can boost everything that doesn't have the term. This can be done with a boost query:
...&bq=(*:* -erotic)^999
or in solrconfig.xml:
<str name="bq">(*:* -erotic)^999</str>
Where "erotic" is the term to which you wish to give a "negative boost". To add another term, add another bq=....
My question is not about parsing.
I have been looking through the Wikipedia API. I need to search for companies and get a one-sentence summary. It's working well; the only problem I have is when I need to disambiguate. It's hard for my code to know whether "Dropbox (service)" or "Dropbox (band)" is the Dropbox company my user is looking for.
I tried putting the word "company" in the query, expecting it to work like a Google search, but unfortunately it didn't.
So my question is: is there an easy way to disambiguate the results by telling Wikipedia that it is a "company" I want?
If you're looking for companies only, then consider using their full names instead of short forms. In the case of Dropbox, the name of the company is Dropbox, Inc. If you search for Dropbox, Inc in Wikipedia you will be redirected to the page Dropbox (service), which I believe is the page you're looking for.
If you don't have the resources to get the name of the company in the perfect format, then consider using Category:Companies to refine your results further.
When you get to the page, you can mine the extract of the company by using the MediaWiki API as follows:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Dropbox%20(service)
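If you are scripting this, the query URL above can be assembled with the standard library; the parameters are the documented MediaWiki ones, and only the function name is my own:

```python
from urllib.parse import urlencode

# Build the MediaWiki extracts query shown above for an arbitrary title
def extract_url(title):
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": "",      # only the intro section (section0)
        "explaintext": "",  # plain text instead of HTML
        "titles": title,
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

print(extract_url("Dropbox (service)"))
```

urlencode takes care of escaping spaces and parentheses in the title, so you don't have to hand-encode them.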
Note: The extract is called section0 in MediaWiki
I recommend trying Wikidata. Wikidata is a multilingual factual database of everything, and it has a query interface at query.wikidata.org. The query language is called SPARQL. For instance, if you're interested in a list of well-known cats, https://w.wiki/W4W is your query. More details can be found at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service.
import wikipedia

# Print a one-paragraph summary for the given page title
print(wikipedia.summary("COMPANY_NAME"))
Try to filter out the companies by category - there is a list provided at the end of the page:
xx = wikipedia.page("Dropbox")
print(xx.title)
print(xx.categories)
I am trying to get multi-language stemming working with Solr. I have set up language detection with LangDetectLanguageIdentifierUpdateProcessorFactory as per the official Solr guides. The language is recognized and now I have a whole bunch of dynamic fields like:
description_en
description_de
description_fr
...
which are properly stemmed.
The question now is: how do I search across so many fields? Building a long query every time that searches across dozens of possible language fields doesn't seem like a smart option. I have tried using copyField like:
<copyField source="description_*" dest="text"/>
but stemming is lost in the text field when I do that.
The text field is defined as solr.TextField with solr.WhitespaceTokenizerFactory. Maybe I am not setting up the text field properly - how is this supposed to be done?
You have multiple options:
search over all the fields you mentioned. There will always be some overhead: the more fields you use, the (gradually) slower search will be
try to recognise the query language and search over only the necessary fields: for example, the recognised one plus some default. Here you can find a library for this
develop a custom solution with multiple languages in one field, which is possible and can work in production, according to Trey Grainger
The question is a bit old, but maybe that answer will help other people.
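For the first option, the per-field query can be kept manageable with the edismax query parser's qf parameter, so the field list lives in solrconfig.xml rather than in every query (the field names below follow the dynamic-field pattern from the question; extend the list to the languages you actually index):

```xml
<!-- Sketch: search across the per-language fields without
     copying them into a single field (which loses stemming) -->
<str name="defType">edismax</str>
<str name="qf">description_en description_de description_fr</str>
```

Each field keeps its own language-specific analysis chain this way, which is exactly what the copyField approach discards.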
We are using a site definition and it has 3 feature dependencies that we are struggling to identify:
<ActivationDependency FeatureId="7EDD3C9C-8AC6-4ab5-A209-30B5DC422464" />
<ActivationDependency FeatureId="63FDC6AC-DBB4-4247-B46E-A091AEFC866F" />
<ActivationDependency FeatureId="22A9EF51-737B-4ff2-9346-694633FE4416" />
Can anyone identify what these features are, or give me an idea as to how to identify them?
I think they are out-of-the-box MOSS features, but they are not currently installed on the farm.
Thanks for any suggestions
22A9EF51-737B-4ff2-9346-694633FE4416 - Publishing Web Feature
The other two GUIDs are not googleable and don't return any results on MSDN. Are they Microsoft features, or could they be 3rd party?
An alternative to Copernic Desktop Search is a tool called Agent Ransack from Mythicsoft. It allows for really good text search in files (in the FEATURES folder of the 12 hive, in your case) and it is free. Download it here.
I use Copernic Desktop Search and have indexed a copy of the 12 hive. I frequently use it to search for out of the box and custom features by GUID, just as you are.
As Andrew said, 7EDD3C9C-8AC6-4ab5-A209-30B5DC422464 and
63FDC6AC-DBB4-4247-B46E-A091AEFC866F are not standard features as they were not in the 12 hive. But if you download Copernic (or use a similar search tool) and point it at your solution, you should be able to find the feature definitions pretty quickly.