ArangoDB aggregation counts of objects in array

I'm trying to generate facets (aggregation counts) for the following documents in a graph (based on collections rather than a named graph):
{
  "relation": "decreases",
  "edge_type": "primary",
  "subject_lbl": "act(p(HGNC:AKT1), ma(DEFAULT:kin))",
  "object_lbl": "act(p(HGNC:CDKN1B), ma(DEFAULT:act))",
  "annotations": [
    {
      "type": "Disease",
      "label": "cancer",
      "id": "cancer"
    },
    {
      "type": "Anatomy",
      "label": "liver",
      "id": "liver"
    }
  ]
}
The following works great to get facets (aggregation counts) for the edge_type:
FOR doc IN edges
  COLLECT edge_type = doc.edge_type WITH COUNT INTO edge_type_cnt
  RETURN {edge_type, edge_type_cnt}
I tried the following to get counts for the annotations[*].type value:
FOR doc IN edges
  COLLECT
    edge_type = doc.edge_type WITH COUNT INTO edge_type_cnt,
    annotations = doc.annotations[*].type WITH COUNT INTO anno_cnt
  RETURN {edge_type, edge_type_cnt, annotations, anno_cnt}
This results in an error. Any ideas what I'm doing wrong? Thanks!

This thread: https://groups.google.com/forum/#!topic/arangodb/vNFNVrYo9Yo, linked from the question "ArangoDB Faceted Search Performance", pointed me in the right direction.
FOR doc IN edges
  FOR anno IN doc.annotations
    COLLECT anno_type = anno.type WITH COUNT INTO anno_cnt
    RETURN {anno_type, anno_cnt}
Results in:
Anatomy 4275
Cell 2183
CellLine 2093
CellStructure 2081
Disease 2126
Organism 2075
TextLocation 2121
Looping over the edges and then the annotations array is the key that I was missing.
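If you also need the annotation counts broken down per edge type, the same nested loop can COLLECT on both values at once. A minimal sketch (untested, assuming the same edges collection as above):
FOR doc IN edges
  FOR anno IN doc.annotations
    COLLECT edge_type = doc.edge_type, anno_type = anno.type WITH COUNT INTO cnt
    RETURN {edge_type, anno_type, cnt}
Each returned row is one (edge_type, annotation type) pair with its count, so the flat facet counts above can be recovered by summing over either column.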

Related

Moving specific Index Data into a new Index within Elasticsearch

I have several million docs that I need to move into a new index, but there is a condition on which docs should flow into the index. Say I have a field named offsets that needs to be queried against. The values I need to query for are: [1,7,99,32, ....., 10000432] (very large list) in the offsets field.
Does anyone have thoughts on how I can move the specific docs with those values in the list into a new Elasticsearch index? My first thought was reindexing with a query, but there is no pattern for the offsets list.
Would it be a Python loop appending each doc to a new index? Looking for any guidance.
Thanks
Are the documents really large, or can you add them into a jsonl file for bulk ingestion?
In what form is the selector list, the one shown as "[1,7,99,32, ....., 10000432]"?
I'd do it in Pandas, but here is an idea in ES parlance.
Whatever you do, do use the _bulk API, or the job will never finish.
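For reference, the _bulk body interleaves one action line with one source line per document (NDJSON); a minimal sketch with a hypothetical index name:
POST _bulk
{ "index": { "_index": "dest_index" } }
{ "offsets": 1 }
{ "index": { "_index": "dest_index" } }
{ "offsets": 7 }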
You can run a query based upon a file, as per
GET my_index/_search?_file="myquery_file"
You can put all the ids into a file, myquery_file, as below:
{
  "query": {
    "ids": {
      "values": ["1", "4", "100"]
    }
  },
  "format": "jsonl"
}
and output as jsonl to ingest.
You can do the above for the reindex API.
{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
Unfortunately, I was facing a time crunch and had to throw in a personalized loop to query a very specific subset of indices.
import pandas as pd
from elasticsearch import Elasticsearch

client = Elasticsearch()  # assumes a locally reachable cluster

df = pd.read_csv('C://code//part_1_final.csv')
offsets = df['OFFSET'].tolist()
# Offsets are the "unique" values I need to identify the docs by.
# There is no pattern in these values, thus I must go one by one.
missedDocs = []
for i in offsets:
    print(i)
    try:
        client.reindex({
            "source": {
                "index": "<source_index>",
                "query": {
                    "bool": {
                        "must": [
                            { "match": { "<index_field_1>": "1" } },
                            { "match": { "<field_that_needs_values_to_match>": i } }
                        ]
                    }
                }
            },
            "dest": {
                "index": "<dest_index>"
            }
        })
    except KeyError:
        print('error')
        # missedDocs.append(query)
        print('DOC ERROR')
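Since a single terms query can match many values at once (up to index.max_terms_count, 65,536 by default), the per-offset loop can usually be collapsed into a few batched reindex calls. A hedged sketch under the same assumptions as above (placeholder index and field names):
# Batch the offsets into chunks and reindex each chunk with one
# terms query, instead of one reindex call per offset value.
CHUNK = 1024
for start in range(0, len(offsets), CHUNK):
    chunk = offsets[start:start + CHUNK]
    client.reindex({
        "source": {
            "index": "<source_index>",
            "query": { "terms": { "<offset_field>": chunk } }
        },
        "dest": { "index": "<dest_index>" }
    })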

How to Generate Counts of Elements Returned from Map Function?

I have a map function
function (doc) {
  for (var n = 0; n < doc.Observations.length; n++) {
    emit(doc.Scenario, doc.Observations[n].Label);
  }
}
the above returns the following:
{"key":"Splunk","value":"Organized"},
{"key":"Splunk","value":"Organized"},
{"key":"Splunk","value":"Organized"},
{"key":"Splunk","value":"Generate"},
{"key":"Splunk","value":"Ingest"}
I"m looking to design a reduce function that will then return the counts of the above values, something akin to:
Organized: 3
Generate: 1
Ingest: 1
My map function has to filter on my Scenario field, which is why I emit it as the key in the map function.
I've tried a number of the built-in reduce functions, but I end up getting counts of rows, or nothing at all, because the available functions don't apply.
I just need the counts of each of the elements that appear in the value field. Also, the values shown here are representative; there could be hundreds of different values in the value field, for what that's worth.
I really appreciate the help!
Here's sample input:
{
  "_id": "dummyId",
  "test": "test",
  "Team": "Alpha",
  "CreatedOnUtc": "2019-06-20T21:39:09.5940830Z",
  "CreatedOnLocal": "2019-06-20T17:39:09.5940830-04:00",
  "Participants": [
    {
      "Name": "A",
      "Role": "Person"
    }
  ],
  "Observations": [
    { "Label": "Report" },
    { "Label": "Ingest" },
    { "Label": "Generate" },
    { "Label": "Ingest" }
  ]
}
You can make the value part of the emitted key and associate an increment with it, so that a count is maintained per label; grouping on that key then returns exactly the map you are asking for.
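A minimal sketch of that idea, assuming CouchDB's built-in reducers (the composite key preserves the ability to filter on Scenario):
// Map: put the label into the key so the reducer can group on it.
function (doc) {
  for (var n = 0; n < doc.Observations.length; n++) {
    emit([doc.Scenario, doc.Observations[n].Label], 1);
  }
}
Use the built-in _sum (or _count) as the reduce function and query with group_level=2; each row then comes back as {"key": ["Splunk", "Organized"], "value": 3}, and startkey=["Splunk"]&endkey=["Splunk", {}] restricts the counts to a single scenario.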

limit in _source in elasticsearch

This is my source from ES:
"_source": {
"queryHash": "query412236215",
"id": "query412236215",
"content": {
"columns": [
{
"name": "Catalog",
"type": "varchar(10)",
"typeSignature": {
"rawType": "varchar",
"typeArguments": [],
"literalArguments": [],
"arguments": [
{
"kind": "LONG_LITERAL",
"value": 10
}
]
}
}
],
"data": [
[
"apm"
],
[
"postgresql"
],
[
"rest"
],
[
"system"
],
[
"tpch"
]
],
"query_string": "show catalogs",
"execution_time": 1979
},
"createdOn": "1514269074289"
}
How can I get n records at a time from inside _source.data?
Let's say _source.data has 100 records; I want only 10 at a time. Is it also possible to set an offset to get the next 10 records?
Thanks
Take a look at scripting. As far as I know there isn't any built-in solution because Elasticsearch is primarily built for searching and filtering with a document store only as a secondary concern.
First, the order in _source is stable, so it's not totally impossible:
When you get a document back from Elasticsearch, any arrays will be in the same order as when you indexed the document. The _source field that you get back contains exactly the same JSON document that you indexed.
However, arrays are indexed (made searchable) as multivalue fields, which are unordered. At search time, you can't refer to "the first element" or "the last element." Rather, think of an array as a bag of values.
However, source filtering doesn't cover this, so you're out of luck with arrays.
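(Source filtering can only include or exclude whole fields, e.g. "_source": { "includes": [ "content.data" ] }; it has no notion of array offsets, so there is no way to express a slice like the first 10 entries of content.data.)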
Also inner hits won't help you. They do have options for sort, size, and from, but those will only return the matched subdocuments and I assume you want to page freely through all of them.
So your final hope is scripting, where you can build whatever you want. But this is probably not what you want:
Do you really need paging here? Results are transferred in a compressed fashion, so the overhead of paging is probably much larger than transferring the data in one go.
If you do need paging, because your array is huge, you probably want to restructure your documents.
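For completeness, a rough, untested sketch of the scripting route, using a script field to slice the array server-side (the index name is hypothetical; the field path comes from the document above):
GET my_index/_search
{
  "_source": false,
  "script_fields": {
    "data_page": {
      "script": {
        "lang": "painless",
        "source": "def data = params['_source'].content.data; int from = (int)params.from; int to = (int)Math.min(from + (int)params.size, data.size()); return data.subList(from, to);",
        "params": { "from": 0, "size": 10 }
      }
    }
  }
}
Bumping from pages through the array, but as noted above, fetching _source once and slicing client-side is usually cheaper.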

Count duplicate values via Elasticsearch terms aggregation

I am trying to run an Elasticsearch terms aggregation on multiple fields of the documents in my index. Each document contains multiple fields with hashtags, which can be extracted using a custom hashtag analyzer. The goal is to find the most common hashtags in the system.
As stated in the Elasticsearch documentation, it is not possible to run a terms aggregation on multiple fields of a document. I am thus trying to use a copy_to field. The problem now is, that if the document contains the same hashtag in multiple fields, it should count the term multiple times. This is not the case with the default terms aggregation:
Given Mapping:
{
  "properties": {
    "field_one": {
      "type": "string",
      "copy_to": "hashtags"
    },
    "field_two": {
      "type": "string",
      "copy_to": "hashtags"
    }
  }
}
Given Document:
{
  "field_one": "Hello #World",
  "field_two": "One #World"
}
The aggregation will return a single bucket {"key": "#World", "doc_count": 1}. What I need is a single bucket {"key": "#World", "doc_count": 2}.
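One possible workaround, not from the original thread: a terms aggregation counts each parent document once per bucket, but nested documents are counted individually, so storing every hashtag occurrence as its own nested entry yields occurrence counts. A sketch in current keyword/nested syntax, with hypothetical index and field names, assuming the hashtags can be extracted at index time:
PUT hashtag_index
{
  "mappings": {
    "properties": {
      "hashtags": {
        "type": "nested",
        "properties": {
          "tag": { "type": "keyword" }
        }
      }
    }
  }
}
GET hashtag_index/_search
{
  "size": 0,
  "aggs": {
    "all_tags": {
      "nested": { "path": "hashtags" },
      "aggs": {
        "popular": { "terms": { "field": "hashtags.tag" } }
      }
    }
  }
}
A document indexed with two {"tag": "#World"} entries then contributes 2 to that bucket's doc_count.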

Faceted Search and Aggregations Do Not Work Properly

I have an index for a Store with a type for Items, i.e. /store/items. Among other properties, Items have a Title (analyzed text), a Description (analyzed text), and Tags (not_analyzed text).
I want to be able to show the facets over Tags with counts, so if a facet of the Tag "Yellow" has a count of 12, for example, then when the user adds that Tag to the filter she will see only 12 items.
I am using a Filtered Query with Aggs, as shown below, on Elasticsearch 1.1.1 on a single node:
GET _search
{
  "query": {
    "filtered": {
      "query": {
        "multi_match": {
          "query": "Large Widgets",
          "fields": [
            "title^3",
            "description"
          ]
        }
      },
      "filter": {
        "terms": {
          "tags": [
            "Colorful"
          ],
          "execution": "and"
        }
      }
    }
  },
  "aggs": {
    "available_tags": {
      "terms": {
        "field": "tags"
      }
    },
    "size": 20
  }
}
I have two problems:
1. No matter what value I pass for the aggs size, I get 10 aggregations. I want to get more than 10.
2. The hits count that comes back when adding the new tag to the filter doesn't match the doc_count that came with the aggregations. For example, the aggregations might show a doc_count of 12 for the tag "Yellow", but if I add "Yellow" to the filter terms so that it reads "tags": ["Colorful", "Yellow"], I get 17 hits instead of the expected 12. This usually does not happen at the first level, but only in subsequent drill-downs.
Am I doing something wrong? Is there a bug somewhere?
This is a cross-post from the Elasticsearch mailing list, which didn't get enough attention.
shard_size cannot be smaller than size, so if you request a size larger than shard_size, Elasticsearch will override shard_size and reset it to be equal to size.
Do the filtered results for "Colorful" and "Yellow" total 17 documents instead of the 12 "Yellow" documents?
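On the first problem: in the query above, "size": 20 sits next to the available_tags aggregation rather than inside the terms body, so the terms aggregation falls back to its default of 10 buckets. The size parameter belongs inside terms:
"aggs": {
  "available_tags": {
    "terms": {
      "field": "tags",
      "size": 20
    }
  }
}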
