Defining a list of strings using snowball

How can I define a list of strings using Snowball?
I have tried to do it like this:
define patterns (
'{m}{f}{i}{l}' or '{f}{a}{i}{l}' or .......
)
How do I get the length of the list? How do I deal with each pattern?

An example:
groupings ( v v_WXY valid_LI )
stringescapes {}
define v 'aeiouy'
define v_WXY v + 'wxY'
define valid_LI 'cdeghkmnrt'
Combine the strings into groupings. This example is drawn from the English stemmer: http://snowball.tartarus.org/algorithms/english/stemmer.html

Related

Remove map entries based on collection of values - how to do it in a Groovy way?

Is there a Groovy way of dropping elements from a that match values in b?
def a = [1:"aa", 2:"bb", 3:"cc", 4:"dd"]
def b = [ "bb", "dd"]
expected output : [1:"aa", 3:"cc"]
I am currently using 2 nested for loops to solve this. I am wondering if Groovy has a better way of doing it?

For Groovy < 2.5.0
You can use a single Map.findAll() method to do that:
a.findAll { k,v -> !(v in b) }
However, keep in mind that this method does not modify the existing map a; it creates a new one instead. So if you want to change the map stored in a variable, you have to reassign it:
a = a.findAll { k,v -> !(v in b) }
For Groovy >= 2.5.0
Groovy 2.5.0 introduced a new method for Map, removeAll, which takes a predicate closure and removes the matching entries from the map in place:
a.removeAll { k,v -> v in b}

Sort list of maps in Groovy

Hi, I have a list of maps in Groovy like this:
def v=[[val1:'FP'],[val1:'LP'],[val1:'MP'],[val1:'MP'],[val1:'LP'],[val1:'FP']]
I want to sort it in the following order: FP, MP, LP.
I tried doing:
v.sort{x,y->
x.val1 <=> y.val1
}
which prints [[val1:FP], [val1:FP], [val1:LP], [val1:LP], [val1:MP], [val1:MP]]. That is sorted alphabetically, but I need it sorted in the following order:
FP, MP, LP

An alternative: Whenever I am dealing with a fixed, ordered list of strings I immediately think of using enums instead:
enum PValue { FP, MP, LP }
Now we have an ordered set of constants that readily converts to and from string values. So sorting looks as simple as this:
v.sort { x, y -> PValue[x.val1] <=> PValue[y.val1] }
EDIT: Or even simpler:
v.sort { PValue[it.val1] }

As has been said in the comments, you need to define a preferred order and then sort based on that. So with your list of maps:
def v=[[val1:'FP'],[val1:'LP'],[val1:'MP'],[val1:'MP'],[val1:'LP'],[val1:'FP']]
And a preferred order of results:
def preferredOrder = ['FP', 'MP', 'LP']
You can then sort based on each value's index in this preferred order:
v.sort(false) { preferredOrder.indexOf(it.val1) }
Or, if you want unknown elements (i.e. [val1:'ZP']) to go at the end of the sorted list, then you can do:
v.sort(false) { preferredOrder.indexOf(it.val1) + 1 ?: it.val1 }
So if they are not found (index -1), they are compared by their String name instead.
This question is similar to this one btw, which has more options in the answer

How to use the term position parameter in Xapian query constructors

Xapian docs talk about a query constructor that takes a term position parameter, to be used in phrase searches:
Quote:
This constructor actually takes a couple of extra parameters, which
may be used to specify positional and frequency information for terms
in the query:
Xapian::Query(const string & tname_,
Xapian::termcount wqf_ = 1,
Xapian::termpos term_pos_ = 0)
The term_pos represents the position of the term in the query. Again,
this isn't useful for a single term query by itself, but is used for
phrase searching, passage retrieval, and other operations which
require knowledge of the order of terms in the query (such as
returning the set of matching terms in a given document in the same
order as they occur in the query). If such operations are not
required, the default value of 0 may be used.
And in the reference, we have:
Xapian::Query::Query ( const std::string & tname_,
Xapian::termcount wqf_ = 1,
Xapian::termpos pos_ = 0
)
A query consisting of a single term.
And:
typedef unsigned termpos
A term position within a document or query.
So, say I want to build a query for the phrase "foo bar baz". How do I go about it?
Does term_pos_ provide relative position values, i.e. does it define the order of the terms within the document?
(I'm using the Python bindings API here, as I'm more familiar with it.)
q = xapian.Query(xapian.Query.OP_AND, [xapian.Query("foo", wqf, 1),xapian.Query("bar", wqf,2),xapian.Query("baz", wqf,3)] )
And just for the sake of testing, suppose we did:
q = xapian.Query(xapian.Query.OP_AND, [xapian.Query("foo", wqf, 3),xapian.Query("bar", wqf, 4),xapian.Query("baz", wqf, 5)] )
Would this give the same results as the previous example?
And suppose we have:
q = xapian.Query(xapian.Query.OP_AND, [xapian.Query("foo", wqf, 2),xapian.Query("bar", wqf, 4),xapian.Query("baz", wqf, 5)] )
Would this now match documents where "foo" and "bar" are separated by one term, followed by "baz"?
Is that how it works, or does this parameter refer to the absolute positions of the indexed terms?
Edit:
And how is OP_PHRASE related to this? I find some online samples using OP_PHRASE as such:
q = xapian.Query(xapian.Query.OP_PHRASE, term_list)
This makes obvious sense, but then what is the role of the term_pos_ constructor parameter in phrase searches? Is it a more surgical way of doing things?

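// Give each term its position within the phrase (1, 2, 3, ...).
int pos = 1;
std::list<Xapian::Query> subs;
subs.push_back(Xapian::Query("foo", 1, pos++));
subs.push_back(Xapian::Query("bar", 1, pos++));
subs.push_back(Xapian::Query("baz", 1, pos++));
// Combine the positioned terms into a single phrase query.
// querylist is a container of Xapian::Query objects declared elsewhere.
querylist.push_back(Xapian::Query(Xapian::Query::OP_PHRASE, subs.begin(), subs.end()));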

Asterisks in front of array names in Groovy?

I'm a bit new to Groovy, so I'm sure this is one of those extremely obvious things...but it's difficult to search for via Google.
In other languages, asterisks tend to represent pointers. However, in this snippet of Groovy code:
byte[] combineArrays(foo, bar, int start) {
[*foo[0..<start], *bar, *foo[start..<foo.size()]]
}
I can only imagine that that's not the case. I mean, pointers? Groovy?
I'm assuming that this code intends to pass the members of foo and bar as opposed to a multidimensional array. So what exactly do the asterisks mean?
Thanks a lot for your help.

When used like this, the * operator spreads a List or Array into a list of arguments. That didn't help at all, did it? How about an example instead? Say we have this function:
def add(Number a, Number b) {
return a + b
}
And this List:
def args = [1, 2]
We can't do this:
add(args)
because the function expects two numeric arguments, not a single List. But we can do this:
add(*args)
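because the * operator converts the List of 2 elements into 2 arguments. You can use this operator with Lists and Arrays. In your snippet the same operator is used inside a list literal, where it spreads the elements of each range or array into the new list instead of nesting them.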

Visual C++ hash_multimap not finding any results

I need some help understanding how stdext::hash_multimap's lower_bound, upper_bound and equal_range work (at least the VS2005 version of it).
I have the following code (summarized for the question)
#include <hash_map>
using stdext::hash_multimap;
using std::greater;
using stdext::hash_compare;
using std::pair;
using std::cout;
typedef hash_multimap < double, CComBSTR, hash_compare< double, greater<double> > > HMM;
HMM hm1;
HMM :: const_iterator it1, it2;
pair<HMM::const_iterator, HMM::const_iterator> pairHMM;
typedef pair <double, CComBSTR> PairDblStr;
// inserting only two values for sample
hm1.insert ( PairDblStr ( 0.224015748, L"#1-64" ) );
hm1.insert ( PairDblStr ( 0.215354331, L"#1-72" ) );
// Using a double value in between the inserted key values to find one of the elements in the map
it1 = hm1.lower_bound( 0.2175 );
if( it1 == hm1.end() )
{
cout << "lower_bound failed\n";
}
it1 = hm1.upper_bound( 0.2175 );
if( it1 == hm1.end() )
{
cout << "upper_bound failed\n";
}
pairHMM = hm1.equal_range( 0.2175 );
if( ( pairHMM.first == hm1.end() ) && ( pairHMM.second == hm1.end() ) )
{
cout << "equal_range failed\n";
}
As mentioned in the comment I am passing in a value (0.2175) that is in between the two inserted key values (0.224015748, 0.215354331). But the output of the code is:
lower_bound failed
upper_bound failed
equal_range failed
Did I misunderstand how the lower_bound, upper_bound and equal_range can be used in maps? Can we not find a "closest match" key using these methods? If these methods are not suitable, would you have any suggestion on what I could use for my requirement?
Thanks in advance for any help.
Thanks to @billy-oneal and @dauphic for their comments and edits. I have updated the code above to make it compilable and runnable (once you include the correct headers of course).

Can we not find a "closest match" key using these methods?
No. hash_multimap is implemented using a hashtable. Two keys that are very close to each other (0.2153 and 0.2175, for example) will likely map to totally different bins in the hashtable.
A hashtable does not maintain its elements in a sorted order, so you cannot find the closest match to a given key without a linear search.
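To illustrate that point, here is a minimal sketch of what a closest-match lookup on a hash container has to do. It assumes a C++11 compiler and uses std::unordered_multimap as a stand-in for stdext::hash_multimap and std::string as a stand-in for CComBSTR; the names HashMap, nearest_linear and demo_linear are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>

// Hypothetical alias: std::unordered_multimap stands in for stdext::hash_multimap.
using HashMap = std::unordered_multimap<double, std::string>;

// Visits every element and keeps the one whose key is nearest to target.
// Returns m.end() when the container is empty. Cost is O(n).
HashMap::const_iterator nearest_linear(const HashMap& m, double target) {
    return std::min_element(m.begin(), m.end(),
        [target](const HashMap::value_type& a, const HashMap::value_type& b) {
            return std::fabs(a.first - target) < std::fabs(b.first - target);
        });
}

// Usage with the two keys from the question.
void demo_linear() {
    HashMap m;
    m.emplace(0.224015748, "#1-64");
    m.emplace(0.215354331, "#1-72");
    // An exact lookup finds nothing, because 0.2175 is not a stored key.
    assert(m.find(0.2175) == m.end());
    // The linear scan finds the nearest stored key, 0.215354331.
    assert(nearest_linear(m, 0.2175)->second == "#1-72");
}
```

The hash table gives no help here: every element is inspected, so this is only reasonable for small containers.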
The lower_bound, upper_bound, and equal_range functions in hash_multimap have a somewhat odd implementation in the Visual C++ standard library extensions.
Consider the documentation for lower_bound:
The member function determines the first element X in the controlled sequence that hashes to the same bucket as key and has equivalent ordering to key. If no such element exists, it returns hash_map::end; otherwise it returns an iterator that designates X. You use it to locate the beginning of a sequence of elements currently in the controlled sequence that match a specified key.
And the documentation for upper_bound:
The member function determines the last element X in the controlled sequence that hashes to the same bucket as key and has equivalent ordering to key. If no such element exists, or if X is the last element in the controlled sequence, it returns hash_map::end; otherwise it returns an iterator that designates the first element beyond X. You use it to locate the end of a sequence of elements currently in the controlled sequence that match a specified key.
Essentially, these functions allow you to identify the range of elements that have a given key. Their behavior is not the same as the behavior of std::lower_bound or std::map::lower_bound (theirs is the behavior that you were expecting).
For what it's worth, the C++0x unordered associative containers do not provide lower_bound or upper_bound functions. They provide only equal_range, which returns the elements whose key is equal to the given key.
Would you have any suggestion on what I could use for my requirement?
Yes: if you need the behavior of lower_bound and upper_bound, use an ordered associative container like std::multimap.
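As a rough sketch of that suggestion, the following shows one way a closest-match lookup could look with std::multimap. It assumes a C++11 compiler, the default ascending key order (the question used greater<double>), and std::string in place of CComBSTR; the names OrderedMap, closest_key and demo_ordered are made up for this example.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>

// Hypothetical alias: std::string stands in for CComBSTR.
using OrderedMap = std::multimap<double, std::string>;

// Returns an iterator to the element whose key is nearest to target,
// or m.end() if the map is empty.
OrderedMap::const_iterator closest_key(const OrderedMap& m, double target) {
    if (m.empty()) return m.end();
    // First element whose key is >= target (keys are sorted ascending).
    OrderedMap::const_iterator hi = m.lower_bound(target);
    if (hi == m.begin()) return hi;   // target is below every key
    OrderedMap::const_iterator lo = std::prev(hi);
    if (hi == m.end()) return lo;     // target is above every key
    // Pick whichever neighbour is nearer; ties go to the lower key.
    return (target - lo->first <= hi->first - target) ? lo : hi;
}

// Usage with the two keys from the question.
void demo_ordered() {
    OrderedMap m;
    m.insert(OrderedMap::value_type(0.224015748, "#1-64"));
    m.insert(OrderedMap::value_type(0.215354331, "#1-72"));
    // lower_bound now succeeds: it yields the first key >= 0.2175.
    assert(m.lower_bound(0.2175)->second == "#1-64");
    // The nearest key overall is 0.215354331.
    assert(closest_key(m, 0.2175)->second == "#1-72");
}
```

Because the map keeps its keys sorted, lower_bound runs in logarithmic time and only the two neighbours of the target need to be compared.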
