MarkLogic 9.0.8.2
We have business requirements to support American/British words in search queries like
fiber or fibre
color or colour
So will enabling stemming at the database level solve this problem, or do we need to configure more to make it work?
Stemming
https://docs.marklogic.com/guide/search-dev/stemming
Yes, enabling stemming on the database would be the easiest way to achieve what you are looking to do.
Below is some code that you can use to quickly experiment and verify that it will work for you:
xquery version "1.0-ml";
(: enable stemmed searches :)
import module namespace admin = "http://marklogic.com/xdmp/admin" at "/MarkLogic/admin.xqy";
let $config := admin:get-configuration()
return
(: experiment with various settings: off, basic, advanced, decompounding :)
admin:database-set-stemmed-searches($config, xdmp:database("Documents"), "basic")
! admin:save-configuration(.)
;
(: insert two test documents with different spelling for color :)
("color","colour") ! xdmp:document-insert("/"||.||".xml", <doc>{.}</doc>)
;
(: search and see what is returned :)
cts:search(doc(), cts:word-query("colour"))
Related
I am trying to write Python code with Tweepy that will track all tweets from a specific country, starting from a given date, that contain some chosen keywords. I have chosen a lot of keywords, around 24-25.
My keywords are vigilance anticipation interesting ecstacy joy serenity admiration trust acceptance terror fear apprehensive amazement surprize distraction grief sadness pensiveness loathing disgust boredom rage anger annoyance.
For context, my code so far is:
places = api.geo_search(query="Canada", granularity="country")
place_id = places[0].id
public_tweets = tweepy.Cursor(
    api.search,
    q="place:" + place_id + " since:2020-03-01",
    lang="en",
).items(num_tweets)
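One way to fold the keywords into the query would be to join them with Twitter's OR search operator and prepend that group to the place/date filters. This is only a sketch (the place id below is a placeholder, the keyword list is shortened, and I am not certain the standard search API accepts this many OR terms at once):

```python
place_id = "3376992a082d67c7"  # placeholder; use places[0].id from geo_search
keywords = ["joy", "fear", "anger", "grief"]  # shortened list for illustration

# Parentheses keep the keyword group separate from the place/date filters.
keyword_clause = "(" + " OR ".join(keywords) + ")"
query = keyword_clause + " place:" + place_id + " since:2020-03-01"
print(query)
# (joy OR fear OR anger OR grief) place:3376992a082d67c7 since:2020-03-01
```

The resulting string can then be passed as the q argument to tweepy.Cursor(api.search, ...).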
Please help me with this question as soon as possible.
Thank You
I am trying to find out what the predefined stop word lists for Elasticsearch are, but I have found no documented read API for this.
So, I want to find the word lists for these predefined variables (_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_)
I found the English stop word list in the documentation, but I want to check whether it is the one my server really uses, and also check the stop word lists for the other languages.
The stop words used by the English Analyzer are the same as the ones defined in the Standard Analyzer, namely the ones you found in the documentation.
The stop word files for all other languages can be found in the Lucene repository in the analysis/common/src/resources/org/apache/lucene/analysis folder.
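If you pull one of those files from the repository, note the format: one entry per line, with comment text introduced by a marker character (the Snowball-derived lists use |, some others use #). A small sketch to load such a file into a set, assuming that convention:

```python
def load_stopwords(text):
    """Parse a Lucene-style stop word file: strip |/# comments and blank lines."""
    words = set()
    for line in text.splitlines():
        # drop everything after a comment marker, then surrounding whitespace
        for marker in ("|", "#"):
            line = line.split(marker, 1)[0]
        line = line.strip()
        if line:
            words.add(line)
    return words

sample = """\
 | This file was created by ...
a
an  | indefinite article
the
"""
print(sorted(load_stopwords(sample)))  # ['a', 'an', 'the']
```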
I want to get inflectional forms for a word using Wordnet.
E.g. If the word is make, then its inflections are
made, makes, making
I tried all the options of the wn command but I did not get the inflections for a word.
Any idea how to get these?
I am not sure WordNet was intended to inflect words. I just found this little write-up about how WordNet(R) uses the Morphy algorithm to make a morphological determination of the head term associated with an inflected form: https://github.com/jdee/dubsar/wiki/Inflections. I needed some inflection for a Python project of mine a little while ago, and I used https://github.com/pwdyson/inflect.py and https://bitbucket.org/cnu/montylingua3/overview/ (the latter required some hacking; also take a look at the original http://web.media.mit.edu/~hugo/montylingua/)
The Python package LemmInflect provides functions to get all inflections of a word.
Copying their examples here:
> from lemminflect import getInflection, getAllInflections, getAllInflectionsOOV
> getInflection('watch', tag='VBD')
('watched',)
> getAllInflections('watch')
{'NN': ('watch',), 'NNS': ('watches', 'watch'), 'VB': ('watch',), 'VBD': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',), 'VBP': ('watch',)}
> getAllInflections('watch', upos='VERB')
{'VB': ('watch',), 'VBP': ('watch',), 'VBD': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',)}
> getAllInflectionsOOV('xxwatch', upos='NOUN')
{'NN': ('xxwatch',), 'NNS': ('xxwatches',)}
Check out https://lemminflect.readthedocs.io/en/latest/inflections/ for more details.
I'm working on support for TCL (thermal control protocol, an unfortunate name; it's a printer protocol from FutureLogic), but I cannot find any resources about this protocol: what it is, how it works, nothing. On their site I only found this mention: http://www.futurelogic-inc.com/trademarks.aspx
Has anyone worked with it? Does anyone know where I can find the data sheet?
The protocol is documented on their website: http://www.futurelogic-inc.com/support/downloads/
If you are targeting the PSA66ST model, it supports a number of protocols: TCL, which is quite nice for delivering templated tickets, and line printing using the Epson ESC/P protocol.
This is all explained in the protocol document.
Oops, those links are incorrect and only correspond to marketing brochures. You will need to contact FutureLogic for the protocol documents, and probably also sign an NDA. Still, the information above may guide you some more.
From what I can gather, it seems the FutureLogic thermal printers do not support general printing, but only printing using predefined templates stored in the printer's firmware. The basic command structure is a caret ^ followed by a one or two character command code, with arguments delimited using a pipe |, and the command ended with another caret ^. I've been able to reverse-engineer a few commands:
^S^ - Printer status
^Se^ - Extended printer status
^C|x|^ - Clear. Known arguments:
a - all
j - jam
^P|x|y0|...|yn|^ - Print fields y0 through yn using template x.
Data areas are defined in the firmware using a similar command format, command ^D|x|y0|...|yn|^, and templates are defined from data areas using command ^T|z|x0|...|xn|^.
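Based on the framing described above, a small helper to build these commands might look like this (a sketch only; the command structure is reverse-engineered, not official documentation):

```python
def tcl_command(code, *args):
    """Frame a reverse-engineered TCL printer command.

    Commands start with a caret, arguments are pipe-delimited,
    and the command is closed with another caret.
    """
    if args:
        return "^" + code + "|" + "|".join(args) + "|^"
    return "^" + code + "^"

# Examples matching the commands listed above:
status = tcl_command("S")                      # "^S^"  printer status
clear_jam = tcl_command("C", "j")              # "^C|j|^"  clear paper jam
ticket = tcl_command("P", "1", "Hello", "$5")  # "^P|1|Hello|$5|^"  print with template 1
```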
I'm looking for an open-source web search library that does not use a search index file.
Do you know any?
Thanks,
Kenneth
You mean:
search.cgi
#!/bin/sh
# crude CGI "search engine": greps every file under the web root
arg=`echo "$QUERY_STRING" | sed -e 's/^s=//' -e 's/&.*$//'`
cd /var/www/httpd
find . -type f | xargs egrep -l "$arg" | awk 'BEGIN {
    print "Content-type: text/html";
    print "";
    print "<HTML><HEAD><TITLE>Search Result</TITLE></HEAD>";
    print "<BODY><P>Here are your search results, sorry it took so long.</P>";
    print "<UL>";
}
{ print "<LI>" $1 "</LI>"; }
END {
    print "</UL></BODY></HTML>";
}'
Untested...
The original poster clarified in a comment to this reply that what he is looking for is essentially "greplike search but through HTTP", and mentioned that he is looking for something that uses little disk as he's working with an embedded system.
I am not aware of any related projects, but you might want to look at HTML parsers and XQuery implementations in your language of choice. You should be able to handle the "real-life" messiness of HTML with the former, and write a search that's almost as detailed as you might desire with the latter.
I assume that you will be working with a set of URLs that will either be provided or already stored locally, since the idea of actually crawling the whole web, discovering links, and so on, on an embedded device is thoroughly unrealistic.
Although, with a good HTML parser or XQuery implementation, you do have the tools to extract all the links.
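For instance, extracting the links from a page with a standard-library HTML parser (a sketch, assuming Python is available on the device; no index or external dependency needed):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, tolerating messy HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<p>See <a href="/docs.html">the docs</a> and '
               '<a href="http://example.com">this site</a></p>')
print(extractor.links)  # ['/docs.html', 'http://example.com']
```

The same walk-the-tags approach extends to grepping text content on the fly instead of maintaining an index.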
My original answer, which was really a request for clarification:
Not sure what you mean. How do you picture a search working without an index? Crawling the web for every query? Piping through to google? Or are you referring to a specific kind of search index file that you are trying to avoid?
I guess there is none (at least none popular enough for users here to be aware of).
We went ahead and coded our own search system.