Get default stop word list in elastic search - search

I am trying to find out what the predefined stop word list for elastic search are, but i have found no documented read API for this.
So, i want to find the word lists for this predefined variables (_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_)
I found the english stop word list in the documentation, but I want to check if it is the one my server really uses and also check the stop word lists for other languages.

The stop words used by the English Analyzer are the same as the ones defined in the Standard Analyzer, namely the ones you found in the documentation.
The stop word files for all other languages can be found in the Lucene repository in the analysis/common/src/resources/org/apache/lucene/analysis folder.

Related

Dynamic test tag pattern execution in karate [duplicate]

I'm wondering if you can use wildcard characters with tags to get all tagged scenarios/features that match a certain pattern.
For example, I've used 17 unique tags on many scenarios throughout many of my feature files. The pattern is "#jira=CIS-" followed by 4 numbers, like #jira=CIS-1234 and #jira=CIS-5678.
I'm hoping I can use a wildcard character or something that will find all of the matches for me.
I want to be able to exclude them from being run, when I run all of my features/scenarios.
I've tried the follow:
--tags ~#jira
--tags ~#jira*
--tags ~#jira=*
--tags ~#jira=
Unfortunately none have given my the results I wanted. I was only able to exclude them when I used the exact tag, ex. ~#jira=CIS-1234. It's not a good solution to have to add each single one (of the 17 different tags) to the command line. These tags can change frequently, with new ones being added and old ones being removed, plus it would make for one real long command.
Yes. First read this - there is this un-documented expression-language (based on JS) for advanced tag selction based on the #key=val1,val2 form: https://stackoverflow.com/a/67219165/143475
So you should be able to do this:
valuesFor('#jira').isPresent
And even (here s will be a string, on which you can even do JS regex if you know how):
valuesFor('#jira').isEach(s => s.startsWith('CIS-'))
Would be great to get your confirmation and then this thread itself can help others and we can add it to the docs at some point.

Adding multiple keywords with Exiftool, but only if they're not already present

I'm running the following command to add multiple keywords to an image:
exiftool -keywords+="Flowering" -keywords+="In Flower" -keywords+="Primula vulgaris" -overwrite_original "/pictures/Some Folder/P4130073.JPG"
However, I've noticed that if I do this for an image which already contains a particular keyword, then it'll get added a second time.
How can I ensure that keywords are added only if they're already missing, and that if they exist, it'll do a no-op (and ideally leave the file untouched). I've read a few questions on the forum and the docs, but NoDups docs isn't clear to me (I'm an exiftool n00b) and all the answers I've found only process a single keyword addition.
For an added bonus, if the 'exists' check could be case-insensitive, so much the better (e.g., so that if I'm doing keywords+="Flowering" and the image already has the keyword "flowering", nothing will be done.
I also need this to work on Linux, MacOS and Windows (I know the quotes can complicate things!).
See Exiftool FAQ #17
To prevent duplication when adding new items, specific items can be
deleted then added back again in the same command. For example, the
following command adds the keywords "one" and "two", ensuring that
they are not duplicated if they already existed in the keywords of an
image:
exiftool -keywords-=one -keywords+=one -keywords-=two -keywords+=two DIR
The NoDups helper function is used to remove duplicates when they already exist in the file. It isn't used to prevent duplicates from being added in the first place.

Combining phrases from list of words Python3

doing my best to grab information out of a lot of pdf files. Have them in a dictionary format where the key is a given date and the values are a list of occupations.
looks like this when proper:
'12/29/2014': [['COUNSELING',
'NURSING',
'NURSING',
'NURSING',
'NURSING',
'NURSING']]
However, occasionally there are occupations with several words which cannot be reliably understood in single word-form, such as this:
'11/03/2014': [['DENTISTRY',
'OSTEOPATHIC',
'MEDICINE',
'SURGERY',
'SOCIAL',
'SPEECH-LANGUAGE',
'PATHOLOGY']]
Notice that "osteopathic medicine & surgery" and "speech-language pathology" are the full text for two of these entries. This gets hairier when we also have examples of just "osteopathic medicine" or even "medicine."
So my question is this - How should I go about testing combinations of these words to see if they match more complex occupational titles? I can use the same order of the words, as I have maintained that from the source.
Thanks!

How to get inflections for a word using Wordnet

I want to get inflectional forms for a word using Wordnet.
E.g. If the word is make, then its inflections are
made, makes, making
I tried all the options of the wn command but I did not get the inflections for a word.
Any idea how to get these?
I am not sure wordnet was intended to inflect words. Just found this little writeup about how WordNet(R) makes use of the Morphy algorithm to make a morphological determination of the head term associated with an inflected form https://github.com/jdee/dubsar/wiki/Inflections. I needed some inflection for a project of mine (Python) a little ago and I used https://github.com/pwdyson/inflect.py and https://bitbucket.org/cnu/montylingua3/overview/ (required some hacking, also take a look at the original http://web.media.mit.edu/~hugo/montylingua/)
This python package LemmInflect provides functions to get all inflections of a word.
Just copy their examples here:
> from lemminflect import getInflection, getAllInflections, getAllInflectionsOOV
> getInflection('watch', tag='VBD')
('watched',)
> getAllInflections('watch')
{'NN': ('watch',), 'NNS': ('watches', 'watch'), 'VB': ('watch',), 'VBD': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',), 'VBP': ('watch',)}
> getAllInflections('watch', upos='VERB')
{'VB': ('watch',), 'VBP': ('watch',), 'VBD': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',)}
> getAllInflectionsOOV('xxwatch', upos='NOUN')
{'NN': ('xxwatch',), 'NNS': ('xxwatches',)}
Check out https://lemminflect.readthedocs.io/en/latest/inflections/ for more details.

Creating a tag cloud from text:nsp output

I am using Text::NSP which creates n-gram from text files. Is it possible to create tag clouds from an output file of Text-NSP? I have used and liked IBM Word Cloud Generator which only gives a tag cloud output from the frequency of each word within a file. However, I am working with 2-grams and 3-grams. In short, I need a tag cloud generator which will accept an input file with words and their occurrence number. I am running on Debian.
Thanks all.
I started to use R snippets package which is what I was searching for.
The output of text::nsp should be changed with some bash scripts in order to obtain a dataframe acceptable by the R.

Resources