Configure Sphinx to rank exact matches higher with morphology enabled - node.js

I have a Sphinx index for searching users by name.
I'm using soundex morphology to return relevant results when the searcher doesn't know exactly how a name is spelled. Consider the following table:
+----+--------------------+
| id | name               |
+----+--------------------+
|  1 | Maciej Makuszewski |
|  2 | Dane Massey        |
|  3 | Lionel Messi       |
|  4 | Mr. No Matches     |
+----+--------------------+
With soundex enabled, Sphinx returns rows 1, 2, and 3 as relevant results for the query messi. However, I'd like exact matches to be shown first: if a user types messi, they most likely want to see Lionel Messi at the top.
My problem is that I don't know how to do that. I tried setting different rankers, but it made no difference.
I also tried adding
index_exact_words = 1
to the index, but that made no difference either.
I'm using the Sphinx API through the node.js sphinxapi module, if it matters.
What is the common way to solve this?

You want index_exact_words, but you should also add expand_keywords to the index.
With both set, Sphinx automatically searches for the fuzzy form (via morphology) AND the exact word (via index_exact_words). An exact match then matches both forms and ranks higher.
You can do the same manually by searching for, say,
messi | =messi
(which is similar to what expand_keywords does internally)
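The manual variant can be sketched in Node.js as a small helper that rewrites each search term into its fuzzy-or-exact form before the query is sent. This is only a sketch: buildExactBoostQuery is a hypothetical name, the character stripping is a deliberately crude assumption about what user input may contain, and it assumes the query runs in extended match mode, where the | and = operators are interpreted.

```javascript
// Hypothetical helper: expand each term into "(term | =term)" so that a
// document containing the exact word form matches both alternatives.
function buildExactBoostQuery(userInput) {
  return userInput
    .trim()
    .split(/\s+/)
    // Assumption: drop anything that is not a letter, digit, or underscore,
    // so user input cannot inject query operators.
    .map((term) => term.replace(/[^\p{L}\p{N}_]/gu, ""))
    .filter(Boolean)
    .map((term) => `(${term} | =${term})`)
    .join(" ");
}

console.log(buildExactBoostQuery("messi"));
console.log(buildExactBoostQuery("lionel messi"));
```

The resulting string is what you would pass as the query text to the client; index_exact_words still has to be enabled on the index for the = operator to have anything to match against.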


Solr query does not return the expected document in the result set

I have a Solr document whose fields and values are shown below.
The query I am using to fetch this document is "red tape white casual shoes", and its parsed form is:
parsedquery: "+(DisjunctionMaxQuery((keywords_text_en:casual | (brandName_text_en_mv:casual)^3.0 | (name_text_en:casual)^2.0 | (categoryName_text_en_mv:casual)^4.0))
DisjunctionMaxQuery((Synonym(keywords_text_en:boot keywords_text_en:shoe) | (Synonym(brandName_text_en_mv:boot brandName_text_en_mv:shoe))^3.0 | (Synonym(name_text_en:boot name_text_en:shoe))^2.0 | (Synonym(categoryName_text_en_mv:boot categoryName_text_en_mv:shoe))^4.0))
DisjunctionMaxQuery((keywords_text_en:red | (brandName_text_en_mv:red)^3.0 | (name_text_en:red)^2.0 | (categoryName_text_en_mv:red)^4.0)) DisjunctionMaxQuery((keywords_text_en:tape | (brandName_text_en_mv:tape)^3.0 | (name_text_en:tape)^2.0 | (categoryName_text_en_mv:tape)^4.0)) DisjunctionMaxQuery((keywords_text_en:white |
(brandName_text_en_mv:white)^3.0 | (name_text_en:white)^2.0 | (categoryName_text_en_mv:white)^4.0)))~5 DisjunctionMaxQuery(((keywords_text_en:"casual (boot shoe) red tape white"~5)^2.0 | (brandName_text_en_mv:"casual (boot shoe) red tape white"~5)^6.0 | (categoryName_text_en_mv:"casual (boot shoe) red tape white"~5)^8.0 | (name_text_en:"casual (boot shoe) red tape white"~5)^4.0))",
As I understand it, since the word "casual" is present in the 'categoryName_text_en_mv' field and all the other words are present in other query fields, this query should find the document and return it in the response.
However, the number of documents found is 0. Can someone help me understand what I am missing here?
Thanks in advance!
Edit 1
Interestingly, when the query is "red tape white shoes", the expected document appears in the results. It fails only when I add "casual" to the query. An important observation is that all the other words except "casual" are present in a single field. I suspect Solr is failing to match documents across multiple fields.
I would suggest using the Analysis screen. Select the field in the drop-down and enter the value you are searching for on both the query and index sides to see how its analysis pipeline processes it.
https://solr.apache.org/guide/6_6/analysis-screen.html
I just reindexed that one particular document and it solved the issue for me. It looks like the document had not been indexed properly on the Solr side.

Gherkin - real or representative scenarios?

Let's say that my app works with books that have real titles, such as "The Old Man and the Sea", "War and Peace", etc. When creating scenarios, should I use a real title like:
Given I have a book "War and Peace" persisted
When ...
or should I do something like:
Given I have a book "Book1" persisted
When ...
Option 2 is more generic, but it is an artificial example. If I use the first option, the person reading the test needs domain knowledge and will also make assumptions about the scenario as soon as they read the title of the book.
Also, is there a simpler way to create a data table without repeating data (in this case the page number, which I always have to repeat: 1, 1, 2, 2, ...)? Example:
When we receive book with following content:
| Page | Line | Text |
| 1 | 1 | a |
| 1 | 2 | b |
| 2 | 1 | a |
| 2 | 2 | b |
Or is this the standard way to do it:
When we receive a book
And page 1 has content
| Line | Text |
| 1 | a |
| 2 | b |
And page 2 has content
| Line | Text |
| 1 | a |
| 2 | b |
First of all, start with the name of the scenario. The name should be meaningful and should summarize what the test is about.
Once you have the name, the steps should describe a business flow, and that flow should of course use domain language. If I know nothing about healthcare, banking, etc., why would I expect to understand a test about a specific subject in that domain? Scenarios are written for a specific group of people: those who work in that domain.
One of the roles of BDD is to improve communication by helping people at all levels (technical and non-technical, but within the same business domain) understand the specifications and the application better.
Now for your specific issue.
Given I have a book "War and Peace" persisted does not offer much information, since the title of the book says nothing about the test data. Is it a new book that was just added or created? Is it a particular type of book (technical, poetry), or just some book?
What worked for me is giving the data a name that says something about the data used in the test.
If you don't have different types of books, you can use any name; otherwise, a more descriptive name is more useful.
As for the table, it represents a data set, and you need to say what to check and where. Depending on the case, you could group some checks, for example according to whether you can read all the data at once or need to specify the texts and pages individually.
One option would be to hide the data set and say something like:
Given I have a book "War and Peace" persisted
Then the book contains the expected content for "War and Peace"
In the first step, "War and Peace" gets or creates a specific book identified by this title.
In the second step, "War and Peace" identifies the set of data for the expected result; it uses the same name because it is the expected content for that specific data set. This data set can be a list, array, map, etc., depending on the programming language you are using.
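As a rough illustration of that second step, the expected data set could live behind the step definition as a map keyed by book title. Everything here is hypothetical (the names expectedContentByTitle, expectedContentFor and groupByPage, and the page/line/text shape borrowed from the table in the question); it is a sketch of the idea, not a prescribed layout.

```javascript
// Hypothetical fixtures: each book title names one data set of expected rows.
const expectedContentByTitle = new Map([
  ["War and Peace", [
    { page: 1, line: 1, text: "a" },
    { page: 1, line: 2, text: "b" },
    { page: 2, line: 1, text: "a" },
    { page: 2, line: 2, text: "b" },
  ]],
]);

// Look up the expected rows for a title, failing loudly on unknown fixtures.
function expectedContentFor(title) {
  const rows = expectedContentByTitle.get(title);
  if (!rows) {
    throw new Error(`No expected content defined for "${title}"`);
  }
  return rows;
}

// Group flat rows by page so a step can compare one page at a time.
function groupByPage(rows) {
  const pages = new Map();
  for (const { page, line, text } of rows) {
    if (!pages.has(page)) pages.set(page, []);
    pages.get(page).push({ line, text });
  }
  return pages;
}

console.log(groupByPage(expectedContentFor("War and Peace")));
```

Keeping the rows out of the feature file means the scenario stays readable, while the step definition still checks every page and line.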
Don't think too much about the details. Just define the scenario in human-readable language using an outside-in approach, then see if you can refine it, and only after that start the implementation.
Always use a description for the feature and a meaningful title for each scenario.

Complex console.log output in CLI

I'm creating informational output with console.log() in Node.js; however, I would like to create a split screen that outputs different data in each half.
Example:
-------------------------
| value a.1 | value b.1 |
| value a.2 | value b.2 |
| value a.3 | value b.3 |
| value a.4 | value b.4 |
| value a.5 | value b.5 |
|           | value b.6 |
|           | value b.7 |
-------------------------
It could be that value b.x updates very fast and value a.1 very slowly.
What could I use? Maybe something other than console.log()?
UPDATE:
I needed a UI library for the console.
It sounds like you want a UI library for the console. You're in luck. This sort of thing has been around for a while.
You essentially have two choices:
https://github.com/chjj/blessed - A simple graphics library for terminals that lets you do stuff like what you're describing above.
https://github.com/mscdex/node-ncurses - node bindings for ncurses (this is a standard terminal graphics library).
I think blessed has a nicer API, but the choice is yours!
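To make the layout itself concrete, here is a minimal sketch that uses neither library: a plain function that formats two independent lists into the two-column frame from the question. The name renderSplit and the fixed column width are assumptions for illustration only; the libraries above do the equivalent work for you and also handle redrawing when values change.

```javascript
// Render two independent lists side by side as a bordered two-column frame.
// `width` is the full width of one cell, including one space of padding
// on each side.
function renderSplit(left, right, width = 11) {
  const rows = Math.max(left.length, right.length);
  const cell = (value) => {
    const text = value === undefined ? "" : String(value);
    // Truncate long values and pad short ones so the columns stay aligned.
    return " " + text.slice(0, width - 2).padEnd(width - 2) + " ";
  };
  const border = "-".repeat(width * 2 + 3);
  const lines = [border];
  for (let i = 0; i < rows; i++) {
    lines.push("|" + cell(left[i]) + "|" + cell(right[i]) + "|");
  }
  lines.push(border);
  return lines.join("\n");
}

console.log(renderSplit(
  ["value a.1", "value a.2"],
  ["value b.1", "value b.2", "value b.3"]
));
```

Reprinting the whole frame on every change works for slow updates, but it flickers and scrolls when one side updates quickly, which is where a terminal UI library earns its keep.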
For the sake of others looking around for more options:
Along with blessed and node-ncurses, there are Colors, Chalk, and Terminal-Kit as well.
If you do not want to use console.log() at all, Terminal-Kit could be of use.

Identifying and comparing syntactic structure of question sentence

I am getting a question from the user and trying to understand it syntactically.
My goal is to identify the exact question sentence within the text the user entered. For example:
Obama is president of USA, who is his wife?
I am able to apply anaphora resolution, resolve his to Obama, and convert the sentence above to
Obama is president of USA, who is Obama wife?
But how can I syntactically identify the exact question sentence, i.e. Who is obama wife?, within the entire input?
I am trying pylinkgrammar, which gives 54 linkages for the sentence above, for example:
linkparser>
Linkage 54, cost vector = (UNUSED=0 DIS= 8.05 LEN=24)
+------------------------------Xp------------------------------+
+---------------------->WV---------------------->+ |
+-------------------Xx-------------------+-->WV->+---SIs---+ |
+----Wd---+--Ss--+--Oum--+---Mp--+-Js+ +Wq+--Q-+ +Ds**c+ |
| | | | | | | | | | | |
LEFT-WALL Obama[!] is.v president.t of USA.l , who is.v his wife.n ?
What I want to do is define patterns for different question types, such as W5H1 questions, conjunction-based questions, etc.
But I can't find out how to write rules for these patterns. Any suggestions or references would be much appreciated.
You can try to extract different possible sub-questions (hypotheses) from your original text and test for textual entailment between your text and hypotheses. Check out http://hltfbk.github.io/Excitement-Open-Platform/#Recognizing_Textual_Entailment

Is it possible to use 2 different Examples tables in Cucumber/Cuke4Duke

Is it possible to somehow construct a Scenario which uses two different Example tables in different steps? Something like this:
Given I log in
When I view a page
Then I should see <goodText>
Examples:
|goodText|
|abc|
And I should not see <badText>
Examples:
|badText|
|xyz|
The scenario above doesn't work. Also, in reality there would be more rows in each table.
It looks like you're confusing tables with scenario examples. You can mix them, but from your example I'm not sure what you're trying to achieve. Why not just write:
Given I log in
When I view a page
Then I should see "abc"
But I should not see "xyz"
or if you wanted to check for multiple strings:
Given I log in
When I view a page
Then I should see the following text:
| abc |
| def |
But I should not see the following text:
| xyz |
| uvw |
You say that in reality there would be many more rows to the table; but of course a table can also have many columns.
Would this not work for you?
Given I log in
When I view a page
Then I should see <goodText>
But I should not see <badText>
Examples:
| goodText | badText |
| abc      | xyz     |
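To see why one table with several columns is enough, it may help to picture what happens with an Examples table: each row produces one run of the steps, with every <placeholder> replaced by that row's value. The sketch below only mimics that substitution; expandOutline is a hypothetical helper, not part of Cucumber or Cuke4Duke.

```javascript
// Substitute <placeholder> tokens in each step with values from one
// Examples row; unknown placeholders are left untouched.
function expandOutline(steps, examples) {
  return examples.map((row) =>
    steps.map((step) =>
      step.replace(/<([^>]+)>/g, (match, name) =>
        Object.prototype.hasOwnProperty.call(row, name) ? row[name] : match
      )
    )
  );
}

const steps = [
  "Given I log in",
  "When I view a page",
  "Then I should see <goodText>",
  "But I should not see <badText>",
];
const examples = [{ goodText: "abc", badText: "xyz" }];

console.log(expandOutline(steps, examples));
```

Because both placeholders are filled from the same row, adding more rows adds more complete runs of the scenario rather than requiring a second table.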
