"Content" too large when indexing blob content for Azure Search - azure

I set up blob indexing and full-text searching for Azure as described in this article: Indexing Documents in Azure Blob Storage with Azure Search.
Some of my documents are failing in the indexer with the following error:
Field 'content' contains a term that is too large to process. The max length for UTF-8 encoded terms is 32766 bytes. The most likely cause of this error is that filtering, sorting, and/or faceting are enabled on this field, which causes the entire field value to be indexed as a single term. Please avoid the use of these options for large fields.
The particular PDF that is producing this error is 3.68 MB and contains a variety of content (text, tables, images, etc.).
The index and indexer are set up exactly as described in that article, with the addition of some file type restrictions.
Index:
{
  "name": "my-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true,
      "searchable": false
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true
    }
  ]
}
Indexer:
{
  "name": "my-indexer",
  "dataSourceName": "my-data-source",
  "targetIndexName": "my-index",
  "schedule": {
    "interval": "PT2H"
  },
  "parameters": {
    "maxFailedItems": 10,
    "configuration": {
      "indexedFileNameExtensions": ".pdf,.doc,.docx,.xls,.xlsx,.ppt,.pptx,.html,.xml,.eml,.msg,.txt,.text"
    }
  }
}
I tried searching through their docs and some other related articles, but I couldn't really find any information. I'm guessing this is because this feature is still in preview.

There's a limit on the size of a single term in the search index, and it also happens to be 32 KB. If the content field in your search index is marked as filterable, facetable, or sortable, you'll hit this limit (regardless of whether the field is marked as searchable or not). Typically, for large searchable content you want to enable searchable and sometimes retrievable, but not the rest. That way you won't hit limits on content length from the index side.
Please see this answer for more context as well.
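For reference, a content field definition that stays under that limit just leaves filtering, sorting, and faceting off. A minimal sketch (attribute names as in the Azure Search REST API; whether you need retrievable depends on whether you want the text returned in results):
{
  "name": "content",
  "type": "Edm.String",
  "searchable": true,
  "retrievable": true,
  "filterable": false,
  "sortable": false,
  "facetable": false
}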

Related

Forge-Get Item Path along with custom attributes in BIM360 document

Two requirements are needed:
Get the item path of the document in BIM 360 Document Management.
Get all custom attributes for that item.
For Req. 1 an API exists to fetch the path, and for the custom attributes another API exists from which the data can be retrieved.
Is there a way to satisfy both requirements in a single API call instead of using two?
With a large number of records, the API to retrieve the item path takes more than an hour to fetch 19,000+ records and the token expires even though a refresh token is used, while the custom attribute API processes data in batches of 50 and completes in only 5 minutes.
Please suggest.
Batch-Get Custom Attributes covers the additional attributes that are specific to Document Management, while the path in project is general information from Data Management.
The Data Management API provides some endpoints in the form of commands, which ask the backend to process data for a batch of items.
https://forge.autodesk.com/en/docs/data/v2/reference/http/ListItems/
This command retrieves metadata for up to 50 specified items in one call. It also supports the flag includePathInProject, although the usage is tricky and the API documentation does not mention it. The response includes the pathInProject of each item, which may save time compared with iterating item by item.
{
  "jsonapi": {
    "version": "1.0"
  },
  "data": {
    "type": "commands",
    "attributes": {
      "extension": {
        "type": "commands:autodesk.core:ListItems",
        "version": "1.0.0",
        "data": {
          "includePathInProject": true
        }
      }
    },
    "relationships": {
      "resources": {
        "data": [
          {
            "type": "items",
            "id": "urn:adsk.wipprod:dm.lineage:vkLfPabPTealtEYoXU6m7w"
          },
          {
            "type": "items",
            "id": "urn:adsk.wipprod:dm.lineage:bcg7gqZ6RfG4BoipBe3VEQ"
          }
        ]
      }
    }
  }
}
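For reference, this body is POSTed to the Data Management commands endpoint with the JSON:API content type; to the best of my knowledge the call looks like this:
POST https://developer.api.autodesk.com/data/v1/projects/:project_id/commands
Content-Type: application/vnd.api+json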
Get the item path of the document in BIM 360 Document Management.
Is this question about getting the hierarchy of the item, e.g. rootfolder >> subfolder >> item? With that endpoint, by specifying the query parameter includePathInProject=true, it will return the relative path of the item (pathInProject) within the folder structure.
https://forge.autodesk.com/en/docs/data/v2/reference/http/projects-project_id-items-item_id-GET/
"data": {
"type": "items",
"id": "urn:adsk.wipprod:dm.lineage:xxx",
"attributes": {
"displayName": "my-issue-att.png",
"createTime": "2021-03-12T04:51:01.0000000Z",
"createUserId": "xxx",
"createUserName": "Xiaodong Liang",
"lastModifiedTime": "2021-03-12T04:51:02.0000000Z",
"lastModifiedUserId": "200902260532621",
"lastModifiedUserName": "Xiaodong Liang",
"hidden": false,
"reserved": false,
"extension": {
"type": "items:autodesk.bim360:File",
"version": "1.0",
"schema": {
"href": "https://developer.api.autodesk.com/schema/v1/versions/items:autodesk.bim360:File-1.0"
},
"data": {
"sourceFileName": "my-issue-att.png"
}
},
"pathInProject": "/Project Files"
}
Or you can iterate upwards using the item's parent data:
"parent": {
"data": {
"type": "folders",
"id": "urn:adsk.wipprod:fs.folder:co.sdfedf8wef"
},
"links": {
"related": {
"href": "https://developer.api.autodesk.com/data/v1/projects/b.project.id.xyz/items/urn:adsk.wipprod:dm.lineage:hC6k4hndRWaeIVhIjvHu8w/parent"
}
}
},
Get all custom attributes for that item. For Req. 1 an API exists, and for custom attributes another API exists from which the data can be retrieved. Is there a way to satisfy both requirements in a single API call instead of using two? With a large number of records, the item-path API takes more than an hour to fetch 19,000+ records and the token expires even though a refresh token is used, while the custom attribute API processes data in batches of 50 and finishes in only 5 minutes. Please suggest.
Let me try to understand the question better. There are two things here: Custom Attribute Definitions and Custom Attribute Values (attached to the documents). Could you clarify which of them the 19,000+ records refer to?
If Custom Attribute Definitions, the API to fetch them is
https://forge.autodesk.com/en/docs/bim360/v1/reference/http/document-management-custom-attribute-definitions-GET/
It supports setting a limit per call; the maximum for one call is 200, which means you can fetch 19,000+ records in about 95 calls. Each call should be quick (in my experience < 10 seconds), so roughly 15 minutes in total instead of more than an hour.
Or does each call with 200 records take much longer on your side?
If Custom Attribute Values, the API to fetch them is
https://forge.autodesk.com/en/docs/bim360/v1/reference/http/document-management-versionsbatch-get-POST/
As you know, that is 50 records per call. And it seems this is already quick on your side, since fetching the values for 19,000+ records takes only 5 minutes?
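For completeness, a rough sketch of the batch-get request as I recall it (the project ID and version URNs are placeholders; please double-check the exact body shape against the linked reference):
POST https://developer.api.autodesk.com/bim360/docs/v1/projects/:project_id/versions:batch-get
{
  "urns": [
    "urn:adsk.wipprod:fs.file:vf.xxxxxxxx?version=1",
    "urn:adsk.wipprod:fs.file:vf.yyyyyyyy?version=1"
  ]
}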

How to page-wise index a blob document in Azure Cognitive Search?

I am new to Azure Search. I am indexing a few PDF documents using this method.
But I want to get search results page-wise. It currently returns results from the whole document; instead, I want results to be shown per page, and I also need the file name and page number of the page with the highest score.
As you have noticed, the document cracking by default shoves all text into one field (content). If you have an OCR skill involved (assuming you have images within the PDF that contain text), it does the same thing by default in merged_content. I do not believe there is a way to force these two tasks to break your data out into pages.
I say "believe" because it difficult to find documentation on the shape of the document object that is input into your skillsets. For example, look at the input to this merge skillset. It uses /document/content and other document related data and pushes it all into a field called merged_content. If you could find documentation on all the fields in document, it MIGHT have your pages broken down.
{
  "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
  "name": "#BookMergeSkill",
  "description": "Some description",
  "context": "/document",
  "insertPreTag": " ",
  "insertPostTag": " ",
  "inputs": [
    {
      "name": "text",
      "source": "/document/content"
    },
    {
      "name": "itemsToInsert",
      "source": "/document/normalized_images/*/text"
    },
    {
      "name": "offsets",
      "source": "/document/normalized_images/*/contentOffset"
    }
  ],
  "outputs": [
    {
      "name": "mergedText",
      "targetName": "merged_content"
    }
  ]
},
The only way I know to approach this is to use a custom skill, which would reside in an Azure Function and be called as part of the document skillset pipeline. Inside that Azure Function, you would have to use a PDF reader, like iText7, and crack open the documents yourself and return data that you would place in the index document as an array of text or custom objects.
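If you do go that route, the custom skill is wired into the skillset as a WebApiSkill that points at your function. A minimal sketch, where the skill name, function URL, and the pages output are hypothetical placeholders for whatever your function actually returns:
{
  "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
  "name": "#SplitIntoPagesSkill",
  "description": "Calls an Azure Function that cracks the PDF and returns per-page text",
  "context": "/document",
  "uri": "https://my-function-app.azurewebsites.net/api/split-pages",
  "httpMethod": "POST",
  "timeout": "PT230S",
  "batchSize": 1,
  "inputs": [
    {
      "name": "metadata_storage_path",
      "source": "/document/metadata_storage_path"
    }
  ],
  "outputs": [
    {
      "name": "pages",
      "targetName": "pages"
    }
  ]
},
Whatever the function returns under pages can then be mapped into an index field of your own design, for example a collection of page texts.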
We were going to go down a custom cracking process with a client (not to do this but for other reasons), but the project was canned due to the cost of holding large amounts of data within an index.

limit in _source in elasticsearch

This is my source from ES:
"_source": {
"queryHash": "query412236215",
"id": "query412236215",
"content": {
"columns": [
{
"name": "Catalog",
"type": "varchar(10)",
"typeSignature": {
"rawType": "varchar",
"typeArguments": [],
"literalArguments": [],
"arguments": [
{
"kind": "LONG_LITERAL",
"value": 10
}
]
}
}
],
"data": [
[
"apm"
],
[
"postgresql"
],
[
"rest"
],
[
"system"
],
[
"tpch"
]
],
"query_string": "show catalogs",
"execution_time": 1979
},
"createdOn": "1514269074289"
}
How can I get n records inside _source.data?
Let's say _source.data has 100 records; I want only 10 at a time. Also, is it possible to set an offset to get the next 10 records?
Thanks
Take a look at scripting. As far as I know there isn't any built-in solution because Elasticsearch is primarily built for searching and filtering with a document store only as a secondary concern.
First, the order in _source is stable, so it's not totally impossible:
When you get a document back from Elasticsearch, any arrays will be in
the same order as when you indexed the document. The _source field
that you get back contains exactly the same JSON document that you
indexed.
However, arrays are indexed—made searchable—as multivalue fields,
which are unordered. At search time, you can’t refer to "the first
element" or "the last element." Rather, think of an array as a bag of
values.
However, source filtering doesn't cover this, so you're out of luck with arrays.
Also inner hits won't help you. They do have options for sort, size, and from, but those will only return the matched subdocuments and I assume you want to page freely through all of them.
So your final hope is scripting, where you can build whatever you want. But this is probably not what you want:
Do you really need paging here? Results are transferred in a compressed fashion, so the overhead of paging is probably much larger than transferring the data in one go.
If you do need paging, because your array is huge, you probably want to restructure your documents.
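If you do go the scripting route, one option is a script_fields request that slices the array server-side. A minimal sketch posted to the _search endpoint, assuming Elasticsearch 5.6+ with Painless (older 5.x uses "inline" instead of "source") and the field names from the document above; from and size here are script parameters I made up for paging, not the regular search options:
{
  "query": {
    "match": { "queryHash": "query412236215" }
  },
  "script_fields": {
    "data_page": {
      "script": {
        "lang": "painless",
        "source": "def data = params['_source']['content']['data']; int from = (int) params.from; int size = (int) params.size; int to = (int) Math.min(from + size, data.size()); return data.subList(from, to);",
        "params": { "from": 0, "size": 10 }
      }
    }
  }
}
Note that the slice comes back under fields.data_page rather than in _source, and Elasticsearch still loads the whole array from _source internally, so this saves transfer, not work.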

Max terms indexed in a document by Elasticsearch?

The Lucene documentation mentions that:
if the documents you are indexing are very large, Lucene by default only indexes the first 10,000 terms of a document to avoid OutOfMemory errors,
though we can configure it via IndexWriter.setMaxFieldLength(int).
I created an index in Elasticsearch - http://localhost:9200/twitter - and posted a document with 40,000 terms in it.
mapping -
{
  "twitter": {
    "mappings": {
      "tweet": {
        "properties": {
          "filter": {
            "properties": {
              "term": {
                "properties": {
                  "message": {
                    "type": "string"
                  }
                }
              }
            }
          },
          "message": {
            "type": "string",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
I indexed a document whose message field has 40,000 terms - message: "text1 text2 .... text40000".
Since the standard analyzer tokenizes on whitespace, it indexed 40,000 terms.
My question is: does Elasticsearch set a limit on the number of indexed terms on top of Lucene? If yes, what is that limit?
If not, how did all 40,000 of my terms get indexed? It shouldn't have indexed more than 10,000 terms.
The source you're citing doesn't seem up to date, as IndexWriter.setMaxFieldLength(int) was deprecated in Lucene 3.4 and is no longer available in Lucene 4+, which ES is based on. It has been replaced by LimitTokenCountAnalyzer. However, I don't think such a limit exists anymore, or at least it is not set explicitly within the Elasticsearch codebase.
The only limit you might encounter while indexing documents would be related to either the HTTP payload size or Lucene's internal buffer size, as explained in this post.
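For what it's worth, if you wanted to reproduce Lucene's old 10,000-term cap yourself, the closest equivalent in Elasticsearch is the limit token count filter. A sketch of index settings (the analyzer and filter names are my own; the message field would then use "analyzer": "limited_standard"):
{
  "settings": {
    "analysis": {
      "filter": {
        "max_10k_tokens": {
          "type": "limit",
          "max_token_count": 10000
        }
      },
      "analyzer": {
        "limited_standard": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "max_10k_tokens"]
        }
      }
    }
  }
}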

No results when in the mapping, the field _all has specified an index_analyzer

With Elasticsearch I have created an index using a custom mapping and a custom set of analyzers; however, I'm not able to search on the _all field.
I'm using these analyzers:
{
  "analysis": {
    "analyzer": {
      "case_insensitive": {
        "type": "custom",
        "tokenizer": "keyword",
        "filter": [
          "lowercase",
          "asciifolding"
        ],
        "char_filter": "punctuation"
      }
    },
    "char_filter": {
      "punctuation": {
        "type": "mapping",
        "mappings": [
          ".=>\\u0020",
          "-=>\\u0020",
          "_=>\\u0020"
        ]
      }
    }
  }
}
and this mapping:
{
  "article": {
    "_all": {
      "enabled": true,
      "store": "yes",
      "index_analyzer": "case_insensitive",
      "search_analyzer": "case_insensitive"
    },
    "properties": {
      "title": {
        "type": "string",
        "index": "analyzed"
      },
      "subtitle": {
        "type": "string",
        "analyzer": "case_insensitive"
      },
      "comment": {
        "type": "string",
        "index": "not_analyzed"
      },
      "review": {
        "type": "string",
        "index": "not_analyzed",
        "include_in_all": false
      }
    }
  }
}
Then I add a document like this:
{
  "title": "This is the story of a wonderful man.",
  "subtitle": "A man goes on vacation in the worst place possible.",
  "comment": "I like the movie very much, however I did not undertand it.",
  "review": "Very well"
}
and I expect 3 of the 4 fields to be included in _all, in particular title, subtitle, and comment.
The analyzer works as follows (tested using the analyze API in Elasticsearch):
"I like the movie very much, however I did not undertand it." -> "i like the movie very much, however i did not undertand it "
"This is the story of a wonderful man." -> "this is the story of a wonderful man "
I expect that, at least when searching on _all with the query "This is the story of a wonderful man.", I should be able to find the document.
What am I doing wrong?
How is Elasticsearch populating the _all field?
If the field 'title' is to be included in the _all field, which data is used and how? Is the output of the analyzer selected for the 'title' field used as input for the analyzer of _all, or is the raw data used?
How is the flow of data in the _all field? For example
input -> analyzer -> title -> index_analyser -> _all
or
input -> analyzer -> title
-> index_analyser -> _all
Thank you in advance...
Your mapping looks ok to me. The only thing I would try is to set one of the fields explicitly to include_in_all=true and then rerun your query.
According to the docs, it may be that, because you are overriding the default value of include_in_all for one of the fields, the default has changed for all the other fields of the object. See the _all documentation here.
Relevant text from the documentation is below:
Inclusion in the _all field can be controlled on a field-by-field basis by using the include_in_all setting, which defaults to true. Setting include_in_all on an object (or on the root object) changes the default for all fields within that object.
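For example, marking one of the fields explicitly, as suggested above (only the include_in_all line is added to your existing mapping):
"title": {
  "type": "string",
  "index": "analyzed",
  "include_in_all": true
}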
UPDATE:
I think I know why it's not working. Here is what I did. First, I removed the custom analyzers from the _all field (so it uses the standard analyzer). With this I was able to query and get the results as expected. Results were returned for terms that were in any of the document attributes except review. At least this confirms that the general behaviour of _all is correct. Next, to test the analyzers, I did a query on the subtitle field with the exact text (as it uses the keyword analyzer). This also worked. Then I realised that _all is an aggregated field which is then analysed.
So the query has to include all the text from all the fields to work. But again, how do we know in which order they were aggregated :)
This link _all custom analyser has some information. Relevant bits extracted below (from Shay).
You don't want to set the analyzer for _all to be keyword; _all is an aggregation of all the other fields in the doc, so you would basically treat the whole aggregation of text as a single token.
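To make the update above concrete, the change that made the test work was simply dropping the keyword-based analyzer from _all, so that it falls back to the standard analyzer:
"_all": {
  "enabled": true,
  "store": "yes"
}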
