I have several million docs, that I need to move into a new index, but there is a condition on which docs should flow into the index. Say I have a field named, offsets, that needs to be queried against. The values I need to query for are: [1,7,99,32, ....., 10000432] (very large list) in the offset field..
Does anyone have thoughts on how I can move the specific docs, with those values in the list into a new elasticsearch index.? My first though was reindexing with a query, but there is no pattern for the offsets list..
Would it be a python loop appending each doc to a new index? Looking for any guidance.
Thanks
Are the documents really large, or can you add them into an jsonl file for bulk ingestion?
In what form is the selector list, the one shown as "[1,7,99,32, ....., 10000432]"?
I'd do it in Pandas, but here is an idea in ES parlance.
Whatever you do, do use the _bulk API, or the job will never finish.
You can run a query based upon as file as per
GET my_index/_search?_file="myquery_file"
You can put all the ids into a file, myquery_file, as below:
{
"query": {
"ids" : {
"values" : ["1", "4", "100"]
}
},
"format": "jsonl"
}
and output as jsonl to ingest.
You can do the above for the reindex API.
{
"source": {
"index": "source",
**"query": {
"match": {
"company": "cat"
}
}**
},
"dest": {
"index": "dest",
"routing": "=cat"
}
}
Unfortunately,
I was facing a time crunch, and had to throw in a personalized loop to query a very specific subset of indices..
df = pd.read_csv('C://code//part_1_final.csv')
offsets = df['OFFSET'].tolist()
# Offsets are the "unique" values I need to identify the docs by.. There is no pattern in these values, thus I must go one by one..
missedDocs = []
for i in offsets:
print(i)
try:
client.reindex({
"source": {
"index": "<source_index>,
"query": {
"bool": {
"must": [
{ "match" : {"<index_filed_1>": "1" }},
{ "match" : {"<index_with_that_needs_values_to_match": i }}
]
}
}
},
"dest": {
"index": "<dest_index>"
}
})
except KeyError:
print('error')
#missedDocs.append(query)
print('DOC ERROR')
Related
I'm trying to efficiently query data via Mango (as that seems to be the only option given my requirements Searching for sub-objects with a date range containing the queried date value), but I can't even get a very simple index/query pair to work: although I specify my index manually for the query, I'm told that my index "was not used because it does not contain a valid index for this query. No matching index found, create an index to optimize query time."
(I'm doing all of this via Fauxton on CouchDB v. 3.0.0)
Let's say my documents look like this:
{
"tenant": "TNNT_a",
"$doctype": "JobOpening",
// a bunch of other fields
}
All documents with a $doctype of "JobOpening" are guaranteed to have a tenant property. The searches I wish to perform will only ever be for documents with $doctype of "JobOpening" and a tenant selector will always be provided when querying.
Here's the test index I've configured:
{
"index": {
"fields": [
"tenant",
"$doctype"
],
"partial_filter_selector": {
"\\$doctype": {
"$eq": "JobOpening"
}
}
},
"ddoc": "job-openings-doctype-index",
"type": "json"
}
And here's the query
{
"selector": {
"tenant": "TNNT_a",
"\\$doctype": "JobOpening"
},
"use_index": "job-openings-doctype-index"
}
Why isn't the index being used for the query?
I've tried not using a partial index, and I think the $doctype escaping is done properly in the requisite places, but nothing seems to keep CouchDB from performing a full scan.
The index isn't being used because the $doctype field is not being recognized by the query planner as expected.
Changing the fields declaration from $doctype to \\$doctype in the design document solves the issue.
{
"index": {
"fields": [
"tenant",
"\\$doctype"
],
"partial_filter_selector": {
"\\$doctype": {
"$eq": "JobOpening"
}
}
},
"ddoc": "job-openings-doctype-index",
"type": "json"
}
After that small refactor, the query
{
"selector": {
"tenant": "TNNT_a",
"\\$doctype": "JobOpening"
},
"use_index": "job-openings-doctype-index"
}
Returns the expected result, and produces an "explain" which confirms the job-openings-doctype-index was queried:
{
"dbname": "stack",
"index": {
"ddoc": "_design/job-openings-doctype-index",
"name": "7f5c5cea5acd90f11fffca3e3355b6a03677ad53",
"type": "json",
"def": {
"fields": [
{
"tenant": "asc"
},
{
"\\$doctype": "asc"
}
],
"partial_filter_selector": {
"\\$doctype": {
"$eq": "JobOpening"
}
}
}
},
// etc etc etc
Whether this change is intuitive or not is unclear, however it is consistent - and perhaps reveals leading field names with a "special" character may not be desirable.
Regarding the indexing of the filtered field, as per the documentation regarding partial_filter_selector
Technically, we don’t need to include the filter on the "status" [e.g.
$doctype here] field in the query selector ‐ the partial index
ensures this is always true - but including it makes the intent of the
selector clearer and will make it easier to take advantage of future
improvements to query planning (e.g. automatic selection of partial
indexes).
Despite that, I would not choose to index a field whose value is constant.
I am new to elasticsearch i want to index a JSON file and perform search queries from elasticsearch
How can I index this json and perform queries to get value if i pass parameter as "field3.innerfield" : "someval"
I have tried indexing this file with helpers.bulk and search but it returns all the fields instead of a selected field.
Below is the JSON sample
[
{
"id": "someid",
"metadata": {
"docType": "value",
"otherfield": " ",
morefields
.
.
},
"field1":"value1",
"field2":"value2,
"field3": [
{
"innerfield": "someval",
"innerfield1":[
"kind of a paragraph"
]
}
],
"field4": [
{
"innerfield": "someval",
"innerfield1": "kind of a paragraph"
}
],
},
{ again the format repeats with different id but same fields
},
{
}
]
Your question lacks clarity however what I understood is that you want to fetch values from its key for a nested json. You can do that in the following way as shown below.
Parse it multiple times and make the required changes as per your need.
import json
data = data.apply(lambda x: json.loads(json.loads(x).get("metadata","{}")).get("doctype") if x else None)
I am trying to create a CouchDB Mango Query with an index with the hope that the query runs faster. At the moment I have the following Mango Query which returns what I am looking for but it's slow. Therefore, I assume, I need to create an index to make it faster. I need help figuring out how to create that index.
selector: {
categoryIds: {
$in: categoryIds,
},
},
sort: [{ publicationDate: 'desc' }],
You can assume that my documents are let say news articles from different categories. Therefore in each document I have a field that contains one or more categories that the news article belongs to. For that I have an array of categoryIds for each document. My query needs to be optimized for queries like "Give me all news that have categoryId1 in their array of categoryIds sorted by publicationDate". What I don't know how to do is 1. How to define an index 2. What that index should be 3. How to use that index in "use_index" field of the Mango Query. Any help is appreciated.
Update after "Alexis Côté" answer:
If I define the index like this:
{
"_id": "_design/0f11ca4ef1ea06de05b31e6bd8265916c1bbe821",
"_rev": "6-adce50034e870aa02dc7e1e075c78361",
"language": "query",
"views": {
"categoryIds-json-index": {
"map": {
"fields": {
"categoryIds": "asc"
},
"partial_filter_selector": {}
},
"reduce": "_count",
"options": {
"def": {
"fields": [
"categoryIds"
]
}
}
}
}
}
And run the Mango Query like this:
{
"selector": {
"categoryIds": {
"$in": [
"e0bd5f97ac35bdf6893351337d269230"
]
}
},
"use_index": "categoryIds-json-index"
}
It still does return the results but they are not sorted in the order I want by publicationDate. So I am not clear what you are suggesting the solution is.
You can create an index as documented here
In your case, you will need an index on the "categoryIds" field.
You can specify the index using "use_index": "_design/<name>"
Note:The query planner should automatically pick this index if it's compatible.
I have following type of documents:
{
"_id": "0710b1dd6cc2cdc9c2ffa099c8000f7b",
"_rev": "1-93687d40f54ff6ca72e66ca7fc99caff",
"date": "2018-06-04T07:46:08.848Z",
"topic": "some topic",
}
The collection is not very large. Only 20k documents.
However, the following query is very slow. Takes ca 5 secs!
{
selector: {
topic: 'some topic'
},
sort: ['date'],
}
I tried various indexes, e.g.
index: {
fields: ['topic', 'date']
}
but nothing really worked well.
What I am missing here?
When sorting in a Mango query, you need to ensure that the sort order you are asking for matches the index that you are using.
If you are indexing the data set in topic,date order then you can use the following query on "topic" to get the data out in data order using the index:
{
"selector": {
"topic": "some topic"
},
"sort": [
"topic",
"date"
]
}
Because the sort matches the form of the data in the index, the index is used to answer the query which should speed up your query time considerably.
Let's take this document for example:
{
"id":1
"planet":"earth-616"
"data":[
["wolverine","mutant"],
["Storm","mutant"],
["Mark Zuckerberg","human"]]
}
I created a search index to index the name and type, for example if searched for name:wolverine or type:mutant I'd get the document that has it. But as per my requirement I don't want the whole document, I only want ["wolverine","mutant"] I've created a view that outputs as:
{
"id":1,
"key":"earth-616",
"value":["earth-616","wolverine","mutant"]
}
Then I found out I can query only with keys. (Is it possible to create search indexes on views?, Couldn't find anything in the documentation)
Or should I create views along with the one above like this:
{
"id":1,
"key":"wolverine",
"value":["earth-616","wolverine","mutant"]
}
And
{
"id":,
"key":"mutant"
"value":["earth-616","wolverine","mutant"]
}
This way I can query with keys that I want but I can't seem to partial match keys(Am I missing something?)
If you need the output to be exactly as described then I believe you have to use views, and to support wildcard searches I believe you will have to index every substring of a key.
One alternative is to use Cloudant Query, although admittedly you cannot get the exact output you are looking for. If you issue a query like so:
{
"selector": {
"_id": {
"$gt": 0
},
"data": {
"$elemMatch": {
"$elemMatch": {
"$regex": "(?i)zuck"
}
}
}
},
"fields": [
"data"
]
}
The result will be the entire data array:
{
"data": [
["wolverine", "mutant"],
["Storm", "mutant"],
["Mark Zuckerberg", "human"]
]
}