How to page-wise index a blob document in Azure Cognitive Search?

I am new to Azure Search. I am indexing a few PDF documents using this method.
However, I want search results page-wise. Currently the results come from the whole document as a single unit; instead, I want results returned per page, and I also need the file name and page number of the page with the highest score.

As you have noticed, document cracking by default shoves all of the text into one field (content). If an OCR skill is involved (assuming you have images within the PDF that contain text), it does the same thing by default into merged_content. I do not believe there is a way to force these two steps to break your data out into pages.
I say "believe" because it is difficult to find documentation on the shape of the document object that is input into your skillsets. For example, look at the input to this merge skillset. It uses /document/content and other document-related data and pushes it all into a field called merged_content. If you could find documentation on all the fields in document, it MIGHT have your pages broken down.
{
  "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
  "name": "#BookMergeSkill",
  "description": "Some description",
  "context": "/document",
  "insertPreTag": " ",
  "insertPostTag": " ",
  "inputs": [
    {
      "name": "text",
      "source": "/document/content"
    },
    {
      "name": "itemsToInsert",
      "source": "/document/normalized_images/*/text"
    },
    {
      "name": "offsets",
      "source": "/document/normalized_images/*/contentOffset"
    }
  ],
  "outputs": [
    {
      "name": "mergedText",
      "targetName": "merged_content"
    }
  ]
},
The only way I know to approach this is to use a custom skill, which would live in an Azure Function and be called as part of the skillset pipeline. Inside that Azure Function you would use a PDF reader, such as iText 7, crack open the documents yourself, and return data that you would place in the index document as an array of text or custom objects.
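For reference, registering such a custom skill in the skillset is done with a WebApiSkill pointing at your function's endpoint. A minimal sketch, where the function URL, the skill name, and the pages output are hypothetical placeholders (the function would download the blob, split it with a PDF library, and return one text string per page):

{
  "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
  "name": "#SplitIntoPages",
  "description": "Hypothetical custom skill that cracks the PDF and returns per-page text",
  "context": "/document",
  "uri": "https://my-function-app.azurewebsites.net/api/split-pages",
  "httpMethod": "POST",
  "timeout": "PT230S",
  "batchSize": 1,
  "inputs": [
    {
      "name": "url",
      "source": "/document/metadata_storage_path"
    }
  ],
  "outputs": [
    {
      "name": "pages",
      "targetName": "pages"
    }
  ]
},

The resulting /document/pages array could then be sent to a Collection(Edm.String) field in the index through the indexer's output field mappings.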
We were going to go down a custom cracking path with a client (not for this, but for other reasons), but the project was canned due to the cost of holding large amounts of data within an index.

Related

Azure Cognitive Search - Retrieve Search Score in Search Result

I am looking for a way to retrieve the search score in the search result (as an index field value), similar to the other metadata fields like metadata_storage_name or metadata_storage_path. In the indexer definition, I tried retrieving the search score in the following way. Please correct me if I am missing something or retrieving it the wrong way.
"fieldMappings": [
{
"sourceFieldName": "#search.score",
"targetFieldName": "search_score",
"mappingFunction": null
}
]
The search score is an attribute added to each search result in the search response. Try issuing a simple search request using your favourite REST client or the Azure Portal. Below is an example of a response object; @search.score is what you're looking for.
"value": [
{
"#search.score": 7.3617697,
"HotelId": "21",
"HotelName": "Nova Hotel & Spa",
"Description": "1 Mile from the airport. Free WiFi, Outdoor Pool, Complimentary Airport Shuttle, 6 miles from the beach & 10 miles from downtown.",
"Category": "Resort and Spa",
"Tags": [
"pool",
"continental breakfast",
"free parking"
]
},
{
"#search.score": 2.5560288,
"HotelId": "25",
"HotelName": "Scottish Inn",
"Description": "Newly Redesigned Rooms & airport shuttle. Minutes from the airport, enjoy lakeside amenities, a resort-style pool & stylish new guestrooms with Internet TVs.",
"Category": "Luxury",
"Tags": [
"24-hour front desk service",
"continental breakfast",
"free wifi"
]
}]
Example is from here: https://learn.microsoft.com/en-us/azure/search/search-query-simple-examples#example-1-full-text-search
'@search.score' is not a field in an index, but a relevance score computed for each search result. If your search criteria match and a result is returned, you can read that value from the HTTP response under '@search.score'.
Field mappings, on the other hand, are used to map a field found in your data source to a different name in the index when the source name does not match the name you would like to use.
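For example, this is the kind of rename a field mapping is meant for, using metadata_storage_name, which your blob data source already provides:

"fieldMappings": [
  {
    "sourceFieldName": "metadata_storage_name",
    "targetFieldName": "file_name"
  }
]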
For more information on the HTTP response of Search Documents REST API and search scoring, please visit:
https://learn.microsoft.com/rest/api/searchservice/search-documents and
https://learn.microsoft.com/azure/search/index-similarity-and-scoring

Forge - Get Item Path along with custom attributes in BIM360 document

Two requirements:
Get the item path of a document in BIM360 Document Management.
Get all custom attributes for that item.
For requirement 1 an API exists, and for the custom attributes another API exists from which the data can be retrieved.
Is there a way to satisfy both requirements in a single API call instead of using two?
With a large number of records, the item-path API takes more than an hour to fetch 19,000+ records and the token expires even though a refresh token is used, while the custom attribute API processes data in batches of 50 and completes in only 5 minutes.
Please suggest.
Batch-Get Custom Attributes is specific to the additional attributes of Document Management, while the path within the project is general information from Data Management.
The Data Management API provides some endpoints in the form of commands, which ask the backend to process data for a batch of items:
https://forge.autodesk.com/en/docs/data/v2/reference/http/ListItems/
This command retrieves metadata for up to 50 specified items at a time. It also supports the flag includePathInProject, though the usage is tricky and the API documentation does not spell it out. The response will include the pathInProject of each item, which may save more time than iterating item by item.
{
  "jsonapi": {
    "version": "1.0"
  },
  "data": {
    "type": "commands",
    "attributes": {
      "extension": {
        "type": "commands:autodesk.core:ListItems",
        "version": "1.0.0",
        "data": {
          "includePathInProject": true
        }
      }
    },
    "relationships": {
      "resources": {
        "data": [
          {
            "type": "items",
            "id": "urn:adsk.wipprod:dm.lineage:vkLfPabPTealtEYoXU6m7w"
          },
          {
            "type": "items",
            "id": "urn:adsk.wipprod:dm.lineage:bcg7gqZ6RfG4BoipBe3VEQ"
          }
        ]
      }
    }
  }
}
Get item path of the document in a BIM360 document management.
Is this about getting the hierarchy of the item, e.g. root folder >> subfolder >> item? With the endpoint below, specifying the query parameter includePathInProject=true returns the relative path of the item (pathInProject) within the folder structure.
https://forge.autodesk.com/en/docs/data/v2/reference/http/projects-project_id-items-item_id-GET/
"data": {
"type": "items",
"id": "urn:adsk.wipprod:dm.lineage:xxx",
"attributes": {
"displayName": "my-issue-att.png",
"createTime": "2021-03-12T04:51:01.0000000Z",
"createUserId": "xxx",
"createUserName": "Xiaodong Liang",
"lastModifiedTime": "2021-03-12T04:51:02.0000000Z",
"lastModifiedUserId": "200902260532621",
"lastModifiedUserName": "Xiaodong Liang",
"hidden": false,
"reserved": false,
"extension": {
"type": "items:autodesk.bim360:File",
"version": "1.0",
"schema": {
"href": "https://developer.api.autodesk.com/schema/v1/versions/items:autodesk.bim360:File-1.0"
},
"data": {
"sourceFileName": "my-issue-att.png"
}
},
"pathInProject": "/Project Files"
}
Alternatively, you can iterate upward using the item's parent data:
"parent": {
"data": {
"type": "folders",
"id": "urn:adsk.wipprod:fs.folder:co.sdfedf8wef"
},
"links": {
"related": {
"href": "https://developer.api.autodesk.com/data/v1/projects/b.project.id.xyz/items/urn:adsk.wipprod:dm.lineage:hC6k4hndRWaeIVhIjvHu8w/parent"
}
}
},
Get all custom attributes for that item. [...] Is there a way to get both the requirements in a single API call instead of using two? [...] Please suggest.
Let me try to understand the question better. There are two different things here: Custom Attribute Definitions and Custom Attribute Values (attached to the documents). Could you clarify which of the two the 19,000+ records are?
If Custom Attribute Definitions, the API to fetch them is
https://forge.autodesk.com/en/docs/bim360/v1/reference/http/document-management-custom-attribute-definitions-GET/
It supports setting a limit per call. The max limit for one call is 200, which means you can fetch 19,000+ records in about 95 calls, and each call should be quick (in my experience < 10 seconds), i.e. around 15 minutes in total instead of more than an hour.
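As a sketch, the paged calls could look like the following (the folder-scoped route and the limit/offset parameters are my reading of the linked reference; the project and folder IDs are placeholders, so adjust to your tenant):

GET https://developer.api.autodesk.com/bim360/docs/v1/projects/:project_id/folders/:folder_id/custom-attribute-definitions?limit=200&offset=0
GET https://developer.api.autodesk.com/bim360/docs/v1/projects/:project_id/folders/:folder_id/custom-attribute-definitions?limit=200&offset=200
...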
Or, on your side, does each call with 200 records take much longer than that?
If Custom Attribute Values, the API to fetch them is
https://forge.autodesk.com/en/docs/bim360/v1/reference/http/document-management-versionsbatch-get-POST/
which, as you know, returns 50 records per call. And that already seems quick on your side if it fetches the values for 19,000+ records in only 5 minutes.
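For completeness, a minimal sketch of one batch request (the version URNs are hypothetical; the endpoint accepts up to 50 per call):

POST https://developer.api.autodesk.com/bim360/docs/v1/projects/:project_id/versions:batch-get

{
  "urns": [
    "urn:adsk.wipprod:fs.file:vf.AAAA?version=1",
    "urn:adsk.wipprod:fs.file:vf.BBBB?version=1"
  ]
}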

"Content" too large when indexing blob content for Azure Search

I set up blob indexing and full-text searching for Azure as described in this article: Indexing Documents in Azure Blob Storage with Azure Search.
Some of my documents are failing in the indexer, returning the following error:
Field 'content' contains a term that is too large to process. The max length for UTF-8 encoded terms is 32766 bytes. The most likely cause of this error is that filtering, sorting, and/or faceting are enabled on this field, which causes the entire field value to be indexed as a single term. Please avoid the use of these options for large fields.
The particular PDF producing this error is 3.68 MB and contains a variety of content (text, tables, images, etc.).
The index and indexer are set up exactly as described in that article, with the addition of some file type restrictions.
Index:
{
  "name": "my-index",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true,
      "searchable": false
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true
    }
  ]
}
Indexer:
{
  "name": "my-indexer",
  "dataSourceName": "my-data-source",
  "targetIndexName": "my-index",
  "schedule": {
    "interval": "PT2H"
  },
  "parameters": {
    "maxFailedItems": 10,
    "configuration": {
      "indexedFileNameExtensions": ".pdf,.doc,.docx,.xls,.xlsx,.ppt,.pptx,.html,.xml,.eml,.msg,.txt,.text"
    }
  }
}
I tried searching through their docs and some other related articles, but I couldn't really find any information. I'm guessing this is because this feature is still in preview.
There's a limit on the size of a single term in the search index, which also happens to be 32 KB. If the content field in your search index is marked as filterable, facetable, or sortable, you'll hit this limit regardless of whether the field is marked as searchable. Typically, for large searchable content you want to enable searchable and sometimes retrievable, but not the rest; that way you won't hit limits on content length from the index side.
Please see this answer for more context as well.
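Concretely, that means defining the content field with only searchable (and, if you need to read it back, retrievable) enabled, and being explicit about the rest, since defaults for Edm.String fields can enable them:

{
  "name": "content",
  "type": "Edm.String",
  "searchable": true,
  "retrievable": true,
  "filterable": false,
  "sortable": false,
  "facetable": false
}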

display computed text in JSON format

In an XPage I have several calls that collect data in JSON format from several Notes views via Java class files.
To check or visualize the data I have a "debug mode" option that displays this data in computed fields.
The data is JSON, but I would like it formatted in the computed text so it is easier to read.
Does anyone know how I can format the display so it is easier to read instead of one long line of text?
e.g. from
{"locationName":"","gender":"Male","companyName":"","name":"Patrick Kwinten","docUNID":"845AB7AF45FF1260C1257E88003DACFA","notesName":"CN=Patrick Kwinten\/O=quintessens","branchName":"Quintessens Global Services","phone": ["+49 1525 161 223"],"info": ["IT Specialsit"],"sourceUNID":"","pictureURL":"http:\/\/dev1\/apps\/banking\/ServiceData.nsf\/0\/845AB7AF45FF1260C1257E88003DACFA\/$FILE\/PortalPicture.jpg","mail": ["patrickkwinten#ghotmail.com"],"reportsTo":"CN=Eva Fahlgren\/O=quintessens","job":"Managaer","departmentName":"Collaboration Services"}
to
{
  "locationName": "",
  "gender": "Male",
  "companyName": "",
  "name": "Patrick Kwinten",
  "docUNID": "845AB7AF45FF1260C1257E88003DACFA",
  "notesName": "CN=Patrick Kwinten\/O=quintessens",
  "branchName": "Quintessens Global Services",
  "phone": [
    "+49 1525 161 223"
  ],
  "info": [
    "IT Specialsit"
  ],
  "sourceUNID": "",
  "pictureURL": "http:\/\/dev1\/apps\/banking\/ServiceData.nsf\/0\/845AB7AF45FF1260C1257E88003DACFA\/$FILE\/PortalPicture.jpg",
  "mail": [
    "patrickkwinten@ghotmail.com"
  ],
  "reportsTo": "CN=Eva Fahlgren\/O=quintessens",
  "job": "Managaer",
  "departmentName": "Collaboration Services"
}
I do it in a different way. I use Postman to fire the request (with whatever headers you need) and then view the result in Postman as "pretty". This way I don't have to build anything like this into the application, and I also prefer to see the "raw" data rather than risk changing anything by manipulating it before display as you suggest :-)
I really can't live without this utility since I discovered it.
/John

View with geospatial and non geospatial keys with CouchDB

I'm using CouchDB with GeoCouch, and I'm trying to understand whether it is possible to build a geospatial index and "query" the database by both a location and a value from another field.
Data
{
  "_id": "1",
  "profession": "medic",
  "location": [15.12, 30.22]
}
{
  "_id": "2",
  "profession": "secretary",
  "location": [15.12, 30.22]
}
{
  "_id": "3",
  "profession": "clown",
  "location": [27.12, 2.2]
}
Questions
Is there any way to perform the following queries on these documents:
Find all documents with profession = "medic" near location [15.12, 30.22] (more important)
List all the different professions near the location [15.12, 30.22] (a plus)
If that's not possible, what options do I have? I'm already considering switching to MongoDB, but I'd rather solve this a different way.
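For context, a GeoCouch spatial view emits a GeoJSON geometry as its key, so the view over these documents would be something like the sketch below (design document and view names are illustrative):

{
  "_id": "_design/geo",
  "spatial": {
    "points": "function(doc) { if (doc.location) { emit({ type: 'Point', coordinates: doc.location }, doc.profession); } }"
  }
}

queried with a bounding box such as GET /db/_design/geo/_spatial/points?bbox=15.0,30.0,15.2,30.4. Emitting the profession as the value would at least allow filtering the bbox results client-side; whether it can be filtered server-side is exactly what I'm asking.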
Notes
Data changes quickly; new documents might be added and many might be removed.
References
Faceted search with geo-index using CouchDB
