Search results order varies each time in Elasticsearch - search

We have 200k records. When I run a search query for the first time with size: 500, I get the results in the order doc-1, doc-2, doc-3. But when I run the same search query a second time, the order changes to doc-2, doc-1, etc. Why does the search result order vary each time we run the same query?
Query : {"explain":true,"size":500,"query":{"query_string":{"query":" ( (NAME:\"BANK AMERICA\")^50 OR (Names.Name:(BANK AMERICA))^30 OR (NAME_PAIR:\"BANK AMERICA\")^30 OR (NORMAL_NAME:(BANK AMERICA) AND CITY:\"\" ) ^40 OR (NORMAL_NAME:(BANK AMERICA))^30 OR (Styles.value:\"BS\")^5 OR (NORMAL_NAME:\"BANK AMERICA\")^5 OR (address.streetName:\"\" AND CITY:\"\")^30 OR (ZIP:\"\")^6 OR (address.streetName:\"\")^6 OR (address.streetNumber:\"\" AND address.streetName:\"\")^15 OR (telephones.telephone:\"\")^50 OR (mailAddresses.postbox:\"\")^6 ) "}},"sort":[{"_score":{"order":"desc"}},{"statusIndicator":{"order":"asc"}}],"aggs":{"NAME":{"filter":{"term":{"NAME":"ATLS"}}}}}
When running the above, the results are:
"hits": {
"total": 106421,
"max_score": null,
"hits": [
{
"_shard": 0,
"_node": "1",
"_index": "allocation_e1",
"_type": "my_type",
"_id": "217600050_826_E1",
"_score": 2.9569159,
"_routing": "E1",
"_source": {
"sample_number": 217600050,
"countryCode": 101,
"state": "E1",
"name": "BANK of AMERICA Plc",
When running the same query once again, the results are:
Query : {"explain":true,"size":500,"query":{"query_string":{"query":" ( (NAME:\"BANK AMERICA\")^50 OR (Names.Name:(BANK AMERICA))^30 OR (NAME_PAIR:\"BANK AMERICA\")^30 OR (NORMAL_NAME:(BANK AMERICA) AND CITY:\"\" ) ^40 OR (NORMAL_NAME:(BANK AMERICA))^30 OR (Styles.value:\"BS\")^5 OR (NORMAL_NAME:\"BANK AMERICA\")^5 OR (address.streetName:\"\" AND CITY:\"\")^30 OR (ZIP:\"\")^6 OR (address.streetName:\"\")^6 OR (address.streetNumber:\"\" AND address.streetName:\"\")^15 OR (telephones.telephone:\"\")^50 OR (mailAddresses.postbox:\"\")^6 ) "}},"sort":[{"_score":{"order":"desc"}},{"statusIndicator":{"order":"asc"}}],"aggs":{"NAME":{"filter":{"term":{"NAME":"ATLS"}}}}}
"hits": {
"total": 106421,
"max_score": null,
"hits": [
{
"_shard": 0,
"_node": "1",
"_index": "allocation_e1",
"_type": "my_type",
"_id": "239958846_826_E1",
"_score": 2.9571724,
"_routing": "E1",
"_source": {
"sample_number": 239958846,
"countryCode": 101,
"state": "E1",
"name": "BANK of AMERICA Plc",
The document order differs between runs of the same query. Why does the document order change when the query is identical?
Please help with this. Thanks in advance.

Sort your queries explicitly, say in descending order based on UID, and you will get the same results every time.
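As a rough sketch of that suggestion, the snippet below appends a sort on a unique field to an existing search body. It assumes the mapping has a unique, sortable field called UID (the name comes from the answer, not from the question's mapping), and the helper name with_tiebreaker is made up for illustration. Keep in mind that a tiebreaker only decides the order of documents whose earlier sort keys are equal.

```python
import copy

def with_tiebreaker(body, field="UID", order="desc"):
    """Return a copy of a search body with a unique-field sort appended.

    `field` is assumed to be a unique, sortable field in the index mapping.
    """
    result = copy.deepcopy(body)
    sort = result.setdefault("sort", [])
    # Sort entries can be plain strings ("_score") or dicts ({"UID": {...}}).
    already_sorted = any(
        s == field or (isinstance(s, dict) and field in s) for s in sort
    )
    if not already_sorted:
        sort.append({field: {"order": order}})
    return result

# Shortened version of the query from the question.
body = {
    "size": 500,
    "query": {"query_string": {"query": 'NAME:"BANK AMERICA"'}},
    "sort": [
        {"_score": {"order": "desc"}},
        {"statusIndicator": {"order": "asc"}},
    ],
}

stable_body = with_tiebreaker(body)
print(stable_body["sort"])
```

The original body is left untouched, so both variants can be sent and their result order compared.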

Related

Loading json data into Cassandra using dsbulk

I feel like the documentation on loading JSON files into Cassandra is really lacking in the dsbulk docs.
Here is part of the JSON file that I'm trying to load:
[
{
"tags": [
"r"
],
"owner": {
"reputation": 23,
"user_id": 12235281,
"user_type": "registered",
"profile_image": "https://www.gravatar.com/avatar/60e28f52215bff12adb9758fc2cf86dd?s=128&d=identicon&r=PG&f=1",
"display_name": "Me28",
"link": "https://stackoverflow.com/users/12235281/me28"
},
"is_answered": false,
"view_count": 3,
"answer_count": 0,
"score": 0,
"last_activity_date": 1589053659,
"creation_date": 1589053659,
"question_id": 61702762,
"link": "https://stackoverflow.com/questions/61702762/merge-dataframes-in-r-with-different-size-and-condition",
"title": "Merge dataframes in R with different size and condition"
},
{
"tags": [
"python",
"location",
"pyautogui"
],
"owner": {
"reputation": 1,
"user_id": 13507535,
"user_type": "registered",
"profile_image": "https://lh3.googleusercontent.com/a-/AOh14GgtdM9KrbH3X5Z33RCtz6xm_TJUSQS_S31deNYUcA=k-s128",
"display_name": "lowhatex",
"link": "https://stackoverflow.com/users/13507535/lowhatex"
},
"is_answered": false,
"view_count": 2,
"answer_count": 0,
"score": 0,
"last_activity_date": 1589053657,
"creation_date": 1589053657,
"question_id": 61702761,
"link": "https://stackoverflow.com/questions/61702761/want-to-get-a-grip-of-this-pyautogui-command",
"title": "Want to get a grip of this pyautogui command"
}
]
This is how I have been trying to load it:
dsbulk load -url ./data_so1.json -k stackoverflow_t -t staging_t -h '182.14.0.1' -header false -u username -p password
This is the closest I get. It pushes the values into Cassandra line by line, like this:
data
-------------------------------------------------------------------------------------------------------------------------------
"title": "'Microsoft.ACE.OLEDB.12.0' provider is not registered on the local machine giving exception on client"
"profile_image": "https://www.gravatar.com/avatar/05085ede54486bdaebefcf8363e081e2?s=128&d=identicon&r=PG&f=1",
"view_count": 422,
"question_id": 61702768,
"user_id": 12235281,
This just takes the lines of the file as they are (including the commas). I've tried the -m option for mapping but didn't really get anywhere with it.
What would be the right way to get these values into their own respective columns?
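One thing worth checking: the command above does not select a connector, and as far as I know dsbulk reads files with its CSV connector unless told otherwise (the JSON connector is selected with -c json), which would explain why every line of the file ends up as a single value. Independently of that, the nested "owner" object has to end up somewhere. The sketch below is a preprocessing step in Python, not a dsbulk feature: it flattens each record and writes one JSON object per line. The flattened names (owner_user_id and so on) and the helper names are hypothetical, and the target table is assumed to have matching columns.

```python
import json

def flatten(record, parent="", sep="_"):
    """Flatten nested objects into one level, e.g. owner.user_id -> owner_user_id."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            # Scalars and lists (such as "tags") are kept unchanged.
            flat[name] = value
    return flat

def to_json_lines(records):
    """Serialize records as one flattened JSON object per line."""
    return "\n".join(json.dumps(flatten(r)) for r in records)

# One record from the file in the question, shortened.
sample = [
    {
        "tags": ["r"],
        "owner": {"reputation": 23, "user_id": 12235281, "display_name": "Me28"},
        "is_answered": False,
        "question_id": 61702762,
        "title": "Merge dataframes in R with different size and condition",
    }
]

print(to_json_lines(sample))
```

For the real file, the records would come from json.load on data_so1.json and the output would be written to a new file that is then passed to -url.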

ArangoDB offset doesn't work in join

I have the following collections: users_categories and users.
Each users_categories document contains a "users" field that holds user keys only, so I do a join:
FOR c IN users_categories
FILTER c._key == '75a65608-7e9b-4e74-be19-76882209e388'
FOR u IN c.users
FOR u2 IN users FILTER u == u2._key
LIMIT 0, 100
RETURN u2
Result:
[
{
"_key": "5b1b68db-9848-4a0a-81b3-775007f16845",
"_id": "users/5b1b68db-9848-4a0a-81b3-775007f16845",
"_rev": "_VXo9gaC---",
"activated": true,
"blocked": false,
"citizenship": "RU",
"city": "Kalinigrad",
"deleted": false,
"email": "trigger.trigg#yandex.ru",
"lastActivityTime": 1501539830209,
"login": "triggerJK",
"name": "Max",
"passportId": "8736e8e4-9390-44e7-9e21-b17e18b1ebd9",
"phone": "89092132022",
"profileName": "Default profile",
"sex": 1,
"surname": "Max"
},
{
"_key": "0965a0d9-fc91-449f-90f8-9086944b1a86",
"_id": "users/0965a0d9-fc91-449f-90f8-9086944b1a86",
"_rev": "_VWjRYHe---",
"activated": true,
"blocked": false,
"citizenship": "AF",
"deleted": false,
"email": "megamozg4#mail.ru",
"lastActivityTime": 1501247531,
"login": "Megamozg4",
"passportId": "20ab7aad-d356-4437-86b2-6dfa9c4467e0",
"phone": "12312334555",
"profileName": "Default profile",
"sex": 1
}
]
If I set LIMIT 1 or LIMIT 0, 1 it returns only the first record, as I want. However, if I set LIMIT 1, N (for any N) it returns an empty array. So the offset doesn't work?
What am I doing wrong?
ArangoDB used: 3.1.10
UPD:
Somehow, LIMIT 1, N skips not only the first record but the first 2.
If I have more than 2 records to show, the offset behaves strangely. I created an issue on GitHub.
Two bugs were reported regarding offsets:
https://github.com/arangodb/arangodb/issues/2928
https://github.com/arangodb/arangodb/issues/2879
And the fixes for LIMIT are included in the versions v3.1.27 and v3.2.1, so please update and test again.
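When testing again after the upgrade, it helps to have the intended behaviour written down. LIMIT offset, count should skip offset rows and then return at most count rows, which can be modelled as a list slice. This is only a Python model of those semantics, not ArangoDB code; the function name is made up, and the two logins are taken from the result above.

```python
def aql_limit(rows, offset, count):
    """Model of AQL `LIMIT offset, count`: skip `offset` rows, return up to `count`."""
    return rows[offset:offset + count]

# Logins of the two users returned by the join in the question.
users = ["triggerJK", "Megamozg4"]

print(aql_limit(users, 0, 1))    # ['triggerJK']
print(aql_limit(users, 1, 100))  # ['Megamozg4']
print(aql_limit(users, 2, 100))  # []
```

So with two joined records, LIMIT 1, N should return the second record, not an empty array.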

Return distinct and sorted query in AQL

So I have two collections: one with cities, each with an array of postal codes as a property, and one with postal codes and their latitude and longitude.
I want to return the cities closest to a coordinate. This is easy enough with a geo index, but the issue I'm having is that the same city is returned multiple times. Sometimes it can be both the 1st and 3rd closest, because the postal code I'm searching in borders another city.
cities example data:
[
{
"_key": "30936019",
"_id": "cities/30936019",
"_rev": "30936019",
"countryCode": "US",
"label": "Colorado Springs, CO",
"name": "Colorado Springs",
"postalCodes": [
"80904",
"80927"
],
"region": "CO"
},
{
"_key": "30983621",
"_id": "cities/30983621",
"_rev": "30983621",
"countryCode": "US",
"label": "Manitou Springs, CO",
"name": "Manitou Springs",
"postalCodes": [
"80829"
],
"region": "CO"
}
]
postalCodes example data:
[
{
"_key": "32132856",
"_id": "postalCodes/32132856",
"_rev": "32132856",
"countryCode": "US",
"location": [
38.9286,
-104.6583
],
"postalCode": "80927"
},
{
"_key": "32147422",
"_id": "postalCodes/32147422",
"_rev": "32147422",
"countryCode": "US",
"location": [
38.8533,
-104.8595
],
"postalCode": "80904"
},
{
"_key": "32172144",
"_id": "postalCodes/32172144",
"_rev": "32172144",
"countryCode": "US",
"location": [
38.855,
-104.9058
],
"postalCode": "80829"
}
]
The following query works but as an ArangoDB newbie I'm wondering if there's a more efficient way to do this:
FOR p IN WITHIN(postalCodes, 38.8609, -104.8734, 30000, 'distance')
FOR c IN cities
FILTER p.postalCode IN c.postalCodes AND c.countryCode == p.countryCode
COLLECT close = c._id AGGREGATE distance = MIN(p.distance)
FOR c2 IN cities
FILTER c2._id == close
SORT distance
RETURN c2
The first FOR in the query will use the geo index and probably return few documents (just the postal codes around the specified location).
The second FOR will look up the city for each postal code found. This may be an issue, depending on whether there is an index on cities.postalCodes and cities.countryCode. If not, the second FOR has to do a full scan of the cities collection each time it is invoked, which is inefficient. It may therefore be worth creating an index on the two attributes like this:
db.cities.ensureIndex({ type: "hash", fields: ["countryCode", "postalCodes[*]"] });
The third FOR can be removed entirely when not COLLECTing by c._id but by c:
FOR p IN WITHIN(postalCodes, 38.8609, -104.8734, 30000, 'distance')
FOR c IN cities
FILTER p.postalCode IN c.postalCodes AND c.countryCode == p.countryCode
COLLECT city = c AGGREGATE distance = MIN(p.distance)
SORT distance
RETURN city
This will shorten the query string, but I don't think it will help efficiency much, because the third FOR uses the primary index to look up the city documents, which is O(1).
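To make the COLLECT ... AGGREGATE distance = MIN(p.distance) step concrete, here is a small Python model of what it computes: group the (city, distance) pairs by city, keep the smallest distance per city, then sort. It is not ArangoDB code. The city ids are taken from the example data, but the distances are invented for illustration and the function name is hypothetical.

```python
def closest_cities(pairs):
    """Group (city_id, distance) pairs by city, keep the minimum distance,
    and return the groups sorted by that distance (like COLLECT + MIN + SORT)."""
    best = {}
    for city_id, distance in pairs:
        if city_id not in best or distance < best[city_id]:
            best[city_id] = distance
    return sorted(best.items(), key=lambda item: item[1])

# One pair per matching postal code; distances in metres are made up.
matches = [
    ("cities/30936019", 1400.0),   # Colorado Springs via 80904
    ("cities/30983621", 2900.0),   # Manitou Springs via 80829
    ("cities/30936019", 20000.0),  # Colorado Springs via 80927
]

print(closest_cities(matches))
```

Each city appears once in the output, ranked by its nearest postal code, which is the deduplication the question asks for.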
In general, when in doubt about a query using indexes, you can use db._explain(queryString) to show which indexes will be used by a query.

Returning the "search term" along with result - Elasticsearch

In the elasticsearch module I have built, is it possible to return the "input search term" in the search results ?
For example :
GET /signals/_search
{
"query": {
"match": {
"focused_content": "stock"
}
}
}
This returns
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.057534903,
"hits": [
{
"_index": "signals",
"_type": "signal",
"_id": "13",
"_score": 0.057534903,
"_source": {
"username": "abc#abc.com",
"tags": [
"News"
],
"content_url": "http://www.wallstreetscope.com/morning-stock-highlights-western-digital-corporation-wdc-fibria-celulose-sa-fbr-ametek-inc-ame-cott-corporation-cot-graftech-international-ltd-gti/25375462/",
"source": null,
"focused_content": "Morning Stock Highlights: Western Digital Corporation (WDC), Fibria Celulose SA (FBR), Ametek Inc. (AME), Cott Corporation (COT), GrafTech International Ltd. (GTI) - WallStreet Scope",
"time_stamp": "2015-08-12"
}
}
]
}
Is it possible to have the input search term "stock" along with each of the results (like an additional JSON Key along with "content_url","source","focused_content","time_stamp") to identify which search term had brought that result ?
Thanks in Advance !
All I can think of would be using the highlighting feature. It brings back an additional key, highlight, containing the parts of each field that matched.
It won't bring back the exact matching terms on their own, though. You'd have to deal with them in your application. You could use the pre/post tags functionality to wrap them in some special markers, so your app can recognize what was a match.
You can use highlights on all fields, like @Evaldas suggested. This will return the result along with the value in the field which matched, surrounded by customisable tags (default is <em>).
GET /signals/_search
{
"highlight": {
"fields": {
"username": {},
"tags": {},
"source": {},
"focused_content": {},
"time_stamp": {}
}
},
"query": {
"match": {
"focused_content": "stock"
}
}
}
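On the application side, pulling the matched terms back out of the response could look roughly like the Python sketch below. It assumes the default <em> tags and the usual response layout in which each hit carries a "highlight" object mapping field names to lists of fragments. The sample hit is abbreviated and its highlight fragment is written by hand for illustration; matched_terms is a made-up helper name.

```python
import re

# Matches text wrapped in the default highlight tags.
EM_TAG = re.compile(r"<em>(.*?)</em>")

def matched_terms(hit):
    """Collect the distinct strings wrapped in <em> tags in a hit's highlight section."""
    terms = []
    for fragments in hit.get("highlight", {}).values():
        for fragment in fragments:
            for term in EM_TAG.findall(fragment):
                if term not in terms:
                    terms.append(term)
    return terms

# Abbreviated hit; the highlight fragment is illustrative, not real output.
hit = {
    "_index": "signals",
    "_id": "13",
    "highlight": {
        "focused_content": [
            "Morning <em>Stock</em> Highlights: Western Digital Corporation (WDC)"
        ]
    },
}

# Attach the matched terms as an extra key next to the hit's own fields.
result = dict(hit, matched_terms=matched_terms(hit))
print(result["matched_terms"])  # ['Stock']
```

The extra key is added in the application, not by Elasticsearch, so the index and mapping stay unchanged.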

How to set counter field in elasticsearch

I need to increment the field 'post_count' by 1 in Elasticsearch.
For example, in my case, when I click a button, post_count needs to be incremented:
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_work",
"_type": "user",
"_id": "d989dd8629f8b6cc59faf8a1aa2328c8",
"_score": 1,
"_source": {
"first_name": "test",
"last_name": "amt",
"post_count":0
}
}
]
Is there a single query that increments post_count on each update?
Try something like this:
POST /test_work/user/d989dd8629f8b6cc59faf8a1aa2328c8/_update
{
"script" : "ctx._source.post_count+=1"
}
You can also control the increment of the counter using a parameter in your script.
If you are using ES 1.3+, you might get an error saying "dynamic scripting disabled". To avoid it, you need to specify the script language, in this case groovy.
POST /test_work/user/d989dd8629f8b6cc59faf8a1aa2328c8/_update
{
"script" : "ctx._source.post_count+=increment",
"params" : {
"increment" : 4
},
"lang" : "groovy"
}
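If the request is built from application code, the body can be generated instead of typed by hand. The Python sketch below only mirrors the request shape shown above (script, params, lang) and does not send anything. The helper name increment_body is hypothetical, and other Elasticsearch versions may expect a different script syntax.

```python
import json

def increment_body(field, amount=1, lang="groovy"):
    """Build an _update request body that adds `amount` to a numeric field.

    Mirrors the request shape used in this answer; it is not tied to a client library.
    """
    if not field.isidentifier():
        # Avoid pasting arbitrary text into the script string.
        raise ValueError(f"unexpected field name: {field!r}")
    return {
        "script": f"ctx._source.{field}+=increment",
        "params": {"increment": amount},
        "lang": lang,
    }

body = increment_body("post_count", 4)
print(json.dumps(body, indent=2))
```

Passing the amount through params keeps the script string the same for every call, whatever the increment is.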
