Is there a way to exclude NULL values from Azure Cognitive Search Indexes - azure

For example, say we have fields 1 through 10. I want to index all of the fields in Azure Search so that you can filter and search on them.
My question is: is there a way to exclude the fields that are NULL for a specific ID, so that they are not stored in Azure Search? See the example below.
The data itself is initially stored in Azure Cosmos DB.
In Azure Cosmos DB it would look like this:
Id 1
field 1: a
field 2: b
field 5: c
field 6: d
field 8: e
Id 2
field 3: a
field 2: b
field 5: c
field 9: d
field 10: e
However in Azure Search Index, it looks like this:
Id 1
field 1:a
field 2:b
field 3:NULL
field 4:NULL
field 5:c
field 6:d
field 7:NULL
field 8:e
field 9:NULL
field 10:NULL
Id 2
field 1:NULL
field 2:b
field 3:a
field 4:NULL
field 5:c
field 6:NULL
field 7:NULL
field 8:NULL
field 9:d
field 10:e

The shortest answer to your question is "no", but it's a little deeper than that.
When you add documents to an Azure Cognitive Search index, the values of each field are stored in a data structure called an inverted index. This stores a dictionary of terms found in the field, and each entry contains a list of document IDs containing that term. It is somewhat similar to a column-oriented database in that regard. The null value that you see in document JSON is never actually stored in the inverted index. This can make it expensive to test whether a field is null, since the query needs to look for all document IDs not contained in the inverted index, but it is perfectly efficient in terms of storage (because it doesn't consume any).
This article has a few simplified examples of how inverted indexes work, although it's about a different topic than your question.
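To make the storage argument concrete, here is a toy sketch of a per-field inverted index in Python. It is not how the service is implemented; the function names and sample documents are made up for illustration. It only shows that a null value contributes no entry, and that a null test has to be computed from what is absent.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each value found in `field` to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        value = doc.get(field)
        if value is None:
            continue  # a null adds no entry, so it consumes no index storage
        index[value].add(doc_id)
    return index


def ids_with_null(docs, index):
    """IDs that appear in no posting list: the 'expensive' null test."""
    present = set().union(*index.values())
    return set(docs) - present


# Hypothetical documents modeled on the question's example.
docs = {
    1: {"field2": "b", "field3": None},
    2: {"field2": "b", "field3": "a"},
}

field3_index = build_inverted_index(docs, "field3")
null_ids = ids_with_null(docs, field3_index)
```

Here `field3_index` holds a single term, "a", pointing at document 2, and document 1 only shows up when the null test subtracts the indexed IDs from the full ID set.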
Your broader concern about having many fields defined in your index is a valid one. There is a tradeoff between schema flexibility and resource utilization as you increase the number of fields in your index. However, this is due to the bookkeeping overhead required for each field, not the "number of nulls in the field" (which doesn't really mean anything since nulls aren't stored).
From your question, it sounds like you're trying to model different "entity types" in the same index, resulting in a sparse index where some subset of the documents have one subset of fields defined, while another subset of documents have different fields defined. This is a scenario that we want to better support in the service. One promising future direction could be supporting multi-index query, so each subset of your schema could have its own index with its own distinct (but perhaps overlapping) set of fields. This is not on our immediate roadmap, but it's something we want to investigate further. Please vote on this User Voice item to help us prioritize.

As far as not saving the null values, AFAIK it is not possible. An index in Cognitive Search has a pre-defined schema (much like a relational database table) and based on an attribute's data type an attribute's value will be initialized with a default value (null for most of the data types).

If your concern is storage, it's not a problem since it's an inverted index.
If you have an issue with the complexity of the JSON data returned, you could implement your own intermediate service that hides all NULL values from the JSON. Your application queries your own query service, which in turn queries the actual Azure service, passing along all parameters as-is. The only difference is that your service removes every key whose value is null from the JSON, which makes the responses easier to manage.
The response from search would then appear to be identical to your Cosmos record.
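The null-stripping step of such an intermediate service could look roughly like the Python sketch below. The function name is made up, and the sample response only assumes that results arrive as a JSON object with a "value" array of documents; the HTTP plumbing to and from the search service is left out.

```python
def strip_nulls(value):
    """Recursively drop dictionary keys whose value is None (JSON null)."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(item) for item in value]
    return value


# Hypothetical search response for Id 1 from the question (shortened).
response = {
    "value": [
        {"id": "1", "field1": "a", "field2": "b", "field3": None, "field4": None, "field5": "c"}
    ]
}

cleaned = strip_nulls(response)
```

After the call, the document in `cleaned` contains only `id`, `field1`, `field2` and `field5`, which matches the shape of the original Cosmos DB record.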

Related

MongoDB, should number fields be indexed?

I'm trying to get a proper understanding of using MongoDB to optimise queries. In this case it's for fields that would hold an integer. So say I have a collection with two fields, value and cid, where value will store data of type string and cid will store data of type number.
I intend to write queries that will search for records by matching the fields value and cid. The expectation is also that the number of saved records in this collection would get very large, and hence queries could benefit from MongoDB indexes. It makes sense to me to index the value field, which holds a string. But I wonder if the cid field requires indexing, or if it's okay as is, given that it will be holding integers.
I'm asking because I was going through a code base with this exact scenario and I can't figure out why the number field was not indexed. I hope my question makes sense.
Regardless of data types, generally speaking all queries should use an index, so a number field benefits from one just as a string field does. If you use a sort predicate, you can assist the database by having a compound index on both the equality portion of the query (the filter predicate) and the sorting portion (the sort predicate). MongoDB recommends following the index strategy referred to as the E.S.R. (Equality, Sort, Range) rule - see Performance Best Practices for E.S.R. rule.
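As a rough illustration of the E.S.R. ordering, the Python sketch below assembles a compound index key list with equality fields first, then sort fields, then range fields. The helper name and the `createdAt` sort field are hypothetical; only `value` and `cid` come from the question.

```python
ASCENDING, DESCENDING = 1, -1


def esr_index_keys(equality, sort, range_fields):
    """Order compound index keys as Equality, then Sort, then Range."""
    keys = [(field, ASCENDING) for field in equality]
    keys += list(sort)  # sort entries already carry their direction
    keys += [(field, ASCENDING) for field in range_fields]
    return keys


# Equality match on value and cid, sorted by a hypothetical createdAt field.
index_keys = esr_index_keys(
    equality=["value", "cid"],
    sort=[("createdAt", DESCENDING)],
    range_fields=[],
)
```

A list of (field, direction) pairs like `index_keys` is the shape that PyMongo's `create_index` accepts, so it could be passed along as-is, assuming PyMongo is the driver in use.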

solr query to sort result in descending order on basis of price

I am a complete beginner in Solr and I am trying to query my data. I am trying to find data with name=plant and sort it by price, highest first.
In my schema, both name and price are of text type.
For example, let's say the data is
name:abc, price:25;
name:plant, price:35;
name:plant,price:45; //1000 other data
My Approach
/query?q=(name:"Plant")&stopwords=true
The above gives me the plant results, but I am not sure how to sort the results by the price field.
Any help will be appreciated.
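You can use the sort parameter to achieve the sorting.
Your query would look like q=(name:"Plant")&sort=price desc (URL-encode the space as %20 or + when sending it over HTTP). Note that your price field is a text type; as the documentation quoted below explains, sorting needs a non-tokenized, single-valued field, so you will probably want to change price to a numeric field type to get a reliable numeric ordering.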
The sort parameter arranges search results in either ascending (asc) or descending (desc) order. The parameter can be used with either numerical or alphabetical content. The directions can be entered in either all lowercase or all uppercase letters (i.e., both asc or ASC).
Solr can sort query responses according to document scores or the value of any field with a single value that is either indexed or uses DocValues (that is, any field whose attributes in the Schema include multiValued="false" and either docValues="true" or indexed="true" – if the field does not have DocValues enabled, the indexed terms are used to build them on the fly at runtime), provided that:
the field is non-tokenized (that is, the field has no analyzer and its contents have been parsed into tokens, which would make the sorting inconsistent), or
the field uses an analyzer (such as the KeywordTokenizer) that produces only a single term.
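For completeness, here is a small Python sketch that builds the request URL with the sort parameter properly encoded. The base URL (host, port and core name) is an assumption for illustration; only the /query path, the q value and the sort value come from the posts above.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host, port and core name to your setup.
BASE_URL = "http://localhost:8983/solr/mycore/query"


def build_sorted_query(term, sort_field, direction="desc"):
    """Return a query URL that filters on name and sorts by the given field."""
    params = {
        "q": f'name:"{term}"',
        "sort": f"{sort_field} {direction}",  # urlencode turns the space into '+'
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_sorted_query("Plant", "price")
```

Letting `urlencode` handle the quoting avoids the usual problem of a raw space in `sort=price desc` breaking the request.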

What are the limits for SharePoint field values?

Are there any limits to the amount of data that can be stored in an individual SharePoint field? If there are, what are they?
Is there a limit in terms of the number of bytes or string length, say, that can be stored as a value of an individual field?
SharePoint stores the list items in a SQL Server table called AllUserData, so the maximum values are determined by the data types of the columns.
You can find the complete structure here. However, I cannot find any resource discussing the mapping between the SharePoint field types and SQL Server column types, probably because accessing the SharePoint tables directly is strongly discouraged. That's not a big problem though: query the table, look at the results, and you will be able to match the fields to the columns (e.g. nvarchar1 corresponds to the first 'Single line of text' field).

How to retrieve search results from two fields in lucene index, giving one query?

Suppose I run a query against Field A, and I want to retrieve the corresponding fields B and C from my index. How should I go about it? I am using Lucene 3.6.0.
The results of your query will be returned as a set of documents, not fields. Once you've got a document, you can load whichever field contents you're interested in.
One thing that's probably worth watching out for is to ensure that your fields have been "stored".
Good luck,

Azure Table Storage: Order by

I am building a web site that has a wish list. I want to store the wish list(s) in azure table storage, but also want the user to be able to sort their wish list, when viewing it, a number of different ways - date added, date added reversed, item name etc. I also want to implement paging which I believe I can implement by making use of the continuation token.
As I understand it, "order by" isn't implemented and the order that results are returned from table storage is based on the partition key and row key. Therefore if I want to implement the paging and sorting that I describe, is the best way to implement this by storing the wish list multiple times with different partition key / row key?
In this simple case, it is likely that the wish list won't be that large and I could in fact restrict the maximum number of items that can appear in the list, then get rid of paging and sort in memory. However, I have more complex cases that I also need to implement paging and sorting for.
On today's hardware, holding thousands of rows in an in-memory list and sorting them is easily supportable. The real issue is whether you can access the rows in table storage using the keys, without having to do a table scan. Duplicating rows across multiple tables could get quite cumbersome to maintain.
An alternative solution would be to temporarily stage your rows in SQL Azure and apply an ORDER BY there. This may be effective if your result set is too large to work with in memory. For best results, the temporary table would need to have the necessary indexes.
Azure Storage keeps entities in lexicographical order, indexed by Partition Key as the primary index and Row Key as the secondary index. In general, for your scenario it sounds like UserId would be a good fit for the partition key, which leaves the Row Key to optimize for each query.
If you want the user to see the latest wish lists on top, you can use the log tail pattern, where your row key is the inverted DateTime ticks of the moment the wish list was entered by the user.
https://learn.microsoft.com/azure/storage/tables/table-storage-design-patterns#log-tail-pattern
If you want the user to see their wish lists ordered by item name, you could use the item name as your row key, and the entities will be naturally sorted by Azure.
When you are writing the data, you may want to denormalize it and do multiple writes with these different row key schemas. Since every copy has the same partition key (the user ID), you can do a batch insert operation at that stage and not worry about consistency, since Azure table batch operations are atomic.
To differentiate the row key schemas, you may want to prepend each with a constant string value. For instance, your inverted ticks row key value would be something like "InvertedTicks_[InvertedDateTimeTicksOfTheWishList]" and your item name row key value would be "ItemName_[ItemNameOfTheWishList]".
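The inverted ticks row key described above can be sketched in Python as follows. This assumes .NET-style ticks (100-nanosecond intervals since 0001-01-01) and naive UTC datetimes; the function name is made up, and the "InvertedTicks_" prefix is the one suggested in the answer. Zero-padding matters because row keys are compared as strings.

```python
from datetime import datetime

# Value of DateTime.MaxValue.Ticks in .NET.
MAX_TICKS = 3_155_378_975_999_999_999
EPOCH = datetime(1, 1, 1)


def inverted_ticks_row_key(created_at):
    """Build a row key that sorts newest-first under lexicographical ordering."""
    delta = created_at - EPOCH
    ticks = (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    inverted = MAX_TICKS - ticks
    # Pad to 19 digits so string order matches numeric order.
    return f"InvertedTicks_{inverted:019d}"


older = inverted_ticks_row_key(datetime(2024, 1, 1, 12, 0, 0))
newer = inverted_ticks_row_key(datetime(2024, 1, 2, 12, 0, 0))
```

Because `newer` compares lower than `older` as a string, a plain query over the partition returns the most recently added wish list entries first.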
Why not do all of this in .NET using a List?
For this type of application, I would have thought SQL Azure would be more appropriate.
Something like this worked just fine for me:
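// Fetch the matching rows from the partition, then sort them in memory with LINQ.
List<TableEntityType> rawData =
    (from c in ctx.CreateQuery<TableEntityType>("insysdata")
     where c.PartitionKey == "PartitionKey" && c.Field == fieldvalue
     select c).AsTableServiceQuery().ToList();
// OrderBy sorts ascending; use OrderByDescending for newest-first.
List<TableEntityType> sortedData = rawData.OrderBy(c => c.DateTime).ToList();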
