We have an index with a join datatype and the indexing speed is very slow.
At best we index around 100 documents per second, but mostly around 50/sec; the rate varies with document size. We index from multiple threads using the .NET NEST client, but both bulk and single inserts are slow. We have tested various batch sizes but still get no speed worth mentioning. Even with small documents containing only "metadata" it is slow, and throughput drops radically as document size increases. Document size in this solution varies from small up to 6 MB.
What indexing performance can we expect when using the join datatype? How big a penalty should we expect from using it? We did of course try to avoid it when designing the solution, but we did not find any way around it. Any tips or tricks?
We are using a 3-node cluster in Azure, each node with 32 GB of RAM and premium SSD disks. The Java heap size is set to 16 GB and swapping is disabled. Memory usage on the VMs is stable at about 60% of total, but CPU usage is very low (< 10%). We are running Elasticsearch 6.2.3.
A short version of the mapping:
"mappings": {
"log": {
"_routing": {
"required": true
},
"properties": {
"body": {
"type": "text"
},
"description": {
"type": "text"
},
"headStepJoinField": {
"type": "join",
"eager_global_ordinals": true,
"relations": {
"head": "step"
}
},
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"statusId": {
"type": "keyword"
},
"stepId": {
"type": "keyword"
}
}
}
}
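To illustrate how the documents are structured, here is a sketch of what one of our batches looks like as a raw _bulk request (the index name, IDs, and routing values are placeholders; the relation names come from the headStepJoinField mapping above). Every step document is indexed with the routing value of its head document, since children have to live on the same shard as their parent:

POST /logs/log/_bulk
{ "index": { "_id": "head-1", "routing": "head-1" } }
{ "headStepJoinField": { "name": "head" }, "id": "head-1", "description": "head document" }
{ "index": { "_id": "step-1", "routing": "head-1" } }
{ "headStepJoinField": { "name": "step", "parent": "head-1" }, "id": "step-1", "stepId": "step-1", "body": "step document, can be up to 6 MB" }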
Has anyone here used the PUT /MoveEntry call successfully before? I can make the call to create the record, but I was expecting the API to populate the lot number, and it does not. It works through the UI, but not through the API. Is there a trick that I'm missing?
Update 1:
PUT /MoveEntry
{
  "Hold": {
    "value": true
  },
  "Details": [
    {
      "OrderType": {
        "value": "RO"
      },
      "ProductionNbr": {
        "value": "RO0000001"
      },
      "Quantity": {
        "value": 1
      },
      "Location": {
        "value": "PRODRECPT"
      },
      "Warehouse": {
        "value": "ABBOTSFORD"
      }
    }
  ]
}
It always records the document successfully, but the lot number is never populated.
Could it be a lot class configuration issue?
Update 2:
Acu support agrees this looks like a defect and has passed the case on to Acu development.
I would take a look at the Numbering Sequence called AMBatch as that is the one that seems to be used by default for me on the Move Entry screen.
After working with Acu support, it came back that the API requires the "OperationNbr" field to be populated with the Bill of Materials operation number. The lot number is then generated as expected.
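For completeness, this is roughly what the corrected payload looks like with OperationNbr added to the detail line (the operation number value is a placeholder; use the operation number from the relevant Bill of Materials):

PUT /MoveEntry
{
  "Hold": {
    "value": true
  },
  "Details": [
    {
      "OrderType": {
        "value": "RO"
      },
      "ProductionNbr": {
        "value": "RO0000001"
      },
      "OperationNbr": {
        "value": "<BOM operation number>"
      },
      "Quantity": {
        "value": 1
      },
      "Location": {
        "value": "PRODRECPT"
      },
      "Warehouse": {
        "value": "ABBOTSFORD"
      }
    }
  ]
}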
I have a Copy activity that should copy 100 GB of Excel files between two Azure Data Lake stores.
"properties": {
"activities": [
{
"name": "Copy Data1",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
"recursive": true,
"maxConcurrentConnections": 256
},
"sink": {
"type": "AzureDataLakeStoreSink",
"maxConcurrentConnections": 256
},
"enableStaging": false,
"parallelCopies": 32,
"dataIntegrationUnits": 256
},
"inputs": [
{
"referenceName": "SourceLake",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "DestLake",
"type": "DatasetReference"
}
]
}
],
My throughput is about 4 MB/s. As I read here, it should be 56 MB/s. What should I do to reach that throughput?
You can use the copy activity performance tuning guidance to tune the performance of your Azure Data Factory service with the copy activity.
Summary:
Take these steps to tune the performance of your Azure Data Factory service with the copy activity.
Establish a baseline. During the development phase, test your pipeline by using the copy activity against a representative data sample. Collect execution details and performance characteristics following copy activity monitoring.
Diagnose and optimize performance. If the performance you observe doesn't meet your expectations, identify performance bottlenecks. Then, optimize performance to remove or reduce the effect of bottlenecks.
In some cases, when you run a copy activity in Azure Data Factory, you see a "Performance tuning tips" message on top of the copy activity monitoring page, as shown in the following example. The message tells you the bottleneck that was identified for the given copy run. It also guides you on what to change to boost copy throughput.
Your data is about 100 GB in size, but the published test files for file-based stores are multiple files of 10 GB each, so the performance may differ.
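As a baseline test, you could also strip the explicit tuning overrides from typeProperties and let the service pick the values itself; a sketch of the activity above with parallelCopies, dataIntegrationUnits, and maxConcurrentConnections omitted (Data Factory then chooses them automatically):

"typeProperties": {
  "source": {
    "type": "AzureDataLakeStoreSource",
    "recursive": true
  },
  "sink": {
    "type": "AzureDataLakeStoreSink"
  },
  "enableStaging": false
}

Comparing that run's monitoring details against your current settings should show whether the explicit values are helping or hurting.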
Hope this helps.
With a standard configuration of the ELK stack (deviant/docker-elk),
and with the template at http://localhost:9200/logstash-alimentaris/_mapping/?pretty=true set to:
{
  "string_fields": {
    "mapping": {
      "fielddata": {},
      "index": "analyzed",
      "omit_norms": true,
      "type": "string",
      "fields": {
        "raw": {
          "ignore_above": 256,
          "index": "not_analyzed",
          "type": "string",
          "doc_values": true
        }
      }
    },
    "match": "*",
    "match_mapping_type": "string"
  }
}
In Kibana, all raw fields are empty. What are some ways to investigate what prevents Elasticsearch from filling the raw fields?
One possibility is to create a custom template: Change default mapping of string to "not analyzed" in Elasticsearch
But it is documented that the raw fields work out of the box, and in many cases it would be better to stick with the default configuration. What are possible solutions or hints?
The logic in Kibana is to not show the raw fields in the Discover pane. They show up only if one selects "show missing fields", and even then they are displayed as empty.
In visualizations, one can and should use the .raw fields in most aggregations.
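For example, a terms aggregation on a raw sub-field looks like this (the field name message.raw is only an illustration; substitute any string field from your own index):

GET /logstash-alimentaris/_search
{
  "size": 0,
  "aggs": {
    "top_values": {
      "terms": {
        "field": "message.raw",
        "size": 10
      }
    }
  }
}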
The Lucene documentation mentions that:
If the documents you are indexing are very large, Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors,
though this can be configured via IndexWriter.setMaxFieldLength(int).
I created an index in Elasticsearch (http://localhost:9200/twitter) and posted a document with 40,000 terms in it.
The mapping:
{
  "twitter": {
    "mappings": {
      "tweet": {
        "properties": {
          "filter": {
            "properties": {
              "term": {
                "properties": {
                  "message": {
                    "type": "string"
                  }
                }
              }
            }
          },
          "message": {
            "type": "string",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
I indexed a document whose message field has 40,000 terms: message: "text1 text2 .... text40000".
Since the standard analyzer splits on whitespace, it indexed 40,000 terms.
My question is: does Elasticsearch set a limit on the number of terms Lucene indexes? If yes, what is that limit?
If not, how did all 40,000 of my terms get indexed? It shouldn't have indexed more than 10,000 terms.
The source you're citing doesn't seem up to date: IndexWriter.setMaxFieldLength(int) was deprecated in Lucene 3.4 and is no longer available in Lucene 4+, which ES is based on. It was replaced by LimitTokenCountAnalyzer. However, I don't think such a limit exists anymore, or at least it is not set explicitly within the Elasticsearch codebase.
The only limit you might encounter while indexing documents would be related either to the HTTP payload size or to Lucene's internal buffer size, as explained in this post.
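If you do want to enforce such a cap yourself, the closest equivalent on the Elasticsearch side is the limit token filter wired into a custom analyzer, roughly like this sketch (index and analyzer names are made up; max_token_count plays the role of the old Lucene setting):

PUT /twitter_limited
{
  "settings": {
    "analysis": {
      "filter": {
        "limit_10000": {
          "type": "limit",
          "max_token_count": 10000
        }
      },
      "analyzer": {
        "standard_limited": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "limit_10000"]
        }
      }
    }
  },
  "mappings": {
    "tweet": {
      "properties": {
        "message": {
          "type": "string",
          "analyzer": "standard_limited"
        }
      }
    }
  }
}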
I have a simple aggregation like
"aggs": {
"firm_aggregation": {
"terms": {
"field": "experience.company_name.slug",
"size": 10
}
}
}
and this gives me a result like
"aggregations": {
"firm_aggregation": {
"buckets": [
... (some others)
{
"key": "freelancer",
"doc_count": 33
},
but when I increase the aggregation size to 2000 I get
"aggregations": {
"firm_aggregation": {
"buckets": [
... (some others)
{
"key": "freelancer",
"doc_count": 35
},
Why is this happening? I thought that size would only increase the number of buckets that Elasticsearch returns.
This is due to the estimation done at the shard level.
For a size of 5, only the top 5 terms are taken from each shard, and these are combined to produce the final result. This need not be very accurate.
There is a very good explanation about this here.
Along with size, you can pass the shard_size parameter, which controls this behavior without affecting the data that is returned.
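For example (building on the aggregation above; the shard_size value is just an illustration, anything comfortably larger than size reduces the error):

"aggs": {
  "firm_aggregation": {
    "terms": {
      "field": "experience.company_name.slug",
      "size": 10,
      "shard_size": 100
    }
  }
}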