What exactly is a Lotus Notes document's "size" field? - lotus-notes

Background is that I am trying to eliminate duplicate documents that have arisen (from a rogue replication?) in a huge database - tens of thousands of docs. I am composing a view column that uniquely identifies a document so that I can sort on that column and remove the duplicates via LotusScript code. Trouble is that there is a rich text "Body" field that contains most of the action/variability, and you can't (AFAICS) use @Abstract in a view column...
Looking around for an alternative, I can see that there is a system variable "size", which includes the size of any attachments, so I am tentatively using that. I note that documents that look identical to the eye can differ in reported size by up to about 30 bytes.
Why the small differences? I can code the LotusScript to allow a 30-byte leeway, but I'd like to know the reason for it. And is it really 30 bytes, or something else?

Probably the document's $Revisions item has one more entry, which means that the document was saved one more time.
If the cause of this "copy thousands of documents" accident was replication, then you might be lucky and the documents will contain a $Conflict item. This item contains the DocumentUniqueID of the original document, so you could pair them up that way.

Related

XPages: how to sort a preselected large amount of data from a view

I have a domino database with the following view:
Project_no Realization_date Author
1/2005 2015-01-02 Alex/Acme
3/2015 2015-02-20 John/Acme
33/2015 2016-06-20 Henry/Acme
44/2015 2015-02-13 John/Acme
...
Now I want to get all projects from this view whose Project_no starts with e.g. "3" (partial match), sort them by Realization_date descending, and display the first 1000 of them on an XPage.
The view is large - some selections can give me 500,000 documents.
The FT search view option is not acceptable because it returns only 5,000 docs.
Creating an ArrayList or ListMap resulted in a Java out-of-memory exception (the Java Domino objects are recycled). Increasing the memory may help, of course, but we have 30k users, so it may still be insufficient.
Do you have any ideas how I can achieve this?
I think the key is going to be what the users want to do with the output, as Frantisek says.
If it's for an export, I'd export the data without sorting, then sort in the spreadsheet.
If it's for display, I would hope there's some paging involved, otherwise it will take a very long time to push the HTML from the server to the browser, so I'd recommend doing an FT Search on Project_no and Realization_date between certain ranges and "chunking" your requests. You'll need a manual pager to load the next set of results, but if you're expecting that many, you won't get a pager that calculates the total number of pages anyway.
Also, if it's an XAgent or displaying everything in one go, set viewState="nostate" on the relevant XPage. Otherwise, every request will get serialized to disk. So the results of your search get serialized to disk / memory, which is probably what's causing the Java memory issues you're seeing.
Remember FT_MAX_SEARCH_RESULTS notes.ini variable can be amended on the server to increase the (default) maximum from 5000.
500,000 is a very high number of results and is probably not going to be very user-friendly for any subsequent actions. I'd probably also recommend restricting the search, e.g. forcing a separate entry of the "2015" portion or preventing entry of just one digit, so it has to be e.g. "30" instead of just "3". That may also mean amending your view so the Project_no format displays as @Right("0000" + @Left(Project_no; "/"); 4), so users don't get 3, 30, 31, 32...300, 301, 302..., but can search for "003" and find just 30, 31, 32..., 39. It really depends on what the users want to do, and it may require a bit of thinking outside the box to quickly give them access to the targeted set of documents they want to act on.
I would optimize the data structure for your view. For example, build an ArrayList of "view entry" objects that hold the minimum information from your view - it mimics the index. The "view entry" here is NOT a Notes object (Document, ViewEntry), but a simplified POJO that holds just enough information to sort it (via a comparator) and to show or look up the real data - for example, a Subject column to be shown and the UNID to make a link to open that document.
This structure should fit into a few hundred bytes per document. The troublesome part is populating it - even with a ViewNavigator it may take minutes to build such a list.
Proper recycling should be OK, but...
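As a rough illustration of that lightweight-index idea, here is a plain-Java sketch (no Notes objects involved; the class name, fields, and sample values are all hypothetical) of the kind of POJO and comparator that could back the XPage:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ViewIndexDemo {

    // Lightweight stand-in for one view row: just enough to sort it and to
    // link back to the real document later. NOT a Notes object.
    static final class EntrySummary {
        final String unid;             // UniversalID, to open the document
        final String projectNo;        // column shown in the result list
        final String realizationDate;  // ISO-8601 strings sort correctly as text

        EntrySummary(String unid, String projectNo, String realizationDate) {
            this.unid = unid;
            this.projectNo = projectNo;
            this.realizationDate = realizationDate;
        }
    }

    // Filter by a partial match on Project_no, then sort by date descending.
    static List<EntrySummary> filterAndSort(List<EntrySummary> index, String prefix) {
        List<EntrySummary> result = new ArrayList<>();
        for (EntrySummary e : index) {
            if (e.projectNo.startsWith(prefix)) {
                result.add(e);
            }
        }
        result.sort(Comparator.comparing((EntrySummary e) -> e.realizationDate).reversed());
        return result;
    }

    public static void main(String[] args) {
        // In real code these rows would come from one ViewNavigator pass
        // (read getColumnValues() and getUniversalID(), then recycle the entry).
        List<EntrySummary> index = new ArrayList<>();
        index.add(new EntrySummary("A1", "3/2015", "2015-02-20"));
        index.add(new EntrySummary("B2", "33/2015", "2016-06-20"));
        index.add(new EntrySummary("C3", "44/2015", "2015-02-13"));

        for (EntrySummary e : filterAndSort(index, "3")) {
            System.out.println(e.projectNo + "  " + e.realizationDate + "  " + e.unid);
        }
    }
}
```

Only the filtered, sorted slice (e.g. the first 1000 summaries) would ever need to be rendered; the full documents are fetched by UNID on demand.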
You could also "revert" to classic Domino URLs, e.g. /yourviewname?ReadViewEntries&StartKey=3&OutputFormat=JSON, and render that JSON via a JavaScript UI component of some kind.
If the filtering is based on a partial match on the first sorted column, there's a pure Domino-based solution. It requires that the Domino server is 8.5.3 or newer (View.resortView was introduced in 8.5.3) and that the realization_date column has click-to-sort enabled.
Create a filtered collection with getAllEntriesByKey( key, false ) <-- partial match
Call view.resortView( "name_of_realization_date_column" )
Create a collection of all entries, now sorted by realization_date
Intersect the sorted collection with the filtered collection. This gives you the entries you want sorted by realization_date. E.g. sortedCollection.intersect( filteredCollection )
Pseudocode:
View view = currentDb.getView( "projectsView" );
view.setAutoUpdate( false );
// Partial match on the first sorted column
ViewEntryCollection filteredCollection = view.getAllEntriesByKey( userFilter, false );
// Switch to the index where the view is sorted by realization_date
view.resortView( "realization_date" );
// All entries, sorted by realization_date
ViewEntryCollection resortedCollection = view.getAllEntries();
resortedCollection.intersect( filteredCollection );
// resortedCollection now contains only the entries in filteredCollection,
// sorted by realization_date
I'm not certain whether this would be faster than creating a custom data structure, but I think it's worth testing :)

How to perform duplicate detection when your matchcode length is > 450?

I have an entity that has three fields that need to form a unique constraint in my CRM 2011 Organizations, but when I enter them in for a Duplicate Detection Rule, the resulting matchcode length is too long.
At first I was going to just add an OData query in JavaScript on the form to ensure that no record exists for the unique constraint, but that doesn't catch data-import issues.
Is there some way to get around the 450 character limit, or am I most likely going to need to create a plugin?
Using a new field that contains the combined values of the 3 fields you want in the duplicate detection rule may be an option. You maintain the state of this field with a workflow (on create/update) and apply the duplicate detection rule to it (provided the new field itself does not exceed the matchcode limit).
The plugin approach may be the better choice if the above is not a convenient solution for your scenario.
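CRM 2011 plugins and workflows are .NET-based, so the following is only a language-neutral sketch (written in Java, with made-up class and method names) of the combined-field idea: collapse the three field values into a fixed-length key - here a SHA-256 hex digest, 64 characters, comfortably under the 450-character matchcode limit - and let the duplicate detection rule match on that single field. Note that hashing trades an exact comparison for a negligible collision risk:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MatchcodeKey {

    // Join the three fields with a separator that cannot occur in the data,
    // then hash, so the stored key always has a fixed 64-character length.
    public static String keyFor(String a, String b, String c) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest((a + "\u0001" + b + "\u0001" + c)
                    .getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte x : digest) {
                hex.append(String.format("%02x", x));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(keyFor("very long field one", "field two", "field three"));
    }
}
```

The separator matters: without it, ("ab", "c") and ("a", "bc") would hash to the same key.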
You can choose to use only part of one or more of your fields, e.g. the first 150 characters - do you really need to check the whole of these long fields for absolute uniqueness?
In how many cases would the first 150 characters of each of the three fields be identical but not the rest (which would be the false positives this causes)?

Solr Merging Results of 2 Cores Into Only Those Results That Have A Matching Field

I am trying to figure out whether, and how, I can accomplish the following; none of the answers I have found so far seem to fit:
I have a fairly static and large set of resources I need to have indexed and searchable. Solr seems to be a perfect fit for that. In addition I need to have the ability for my users to add resources from the main data set to a 'Favourites' folder (which can include a few more tags added by them). The Favourites needs to be searchable in the same manner as the main data set, across all the same fields plus the additional ones.
My first thought was to have two separate schemas
- the first for the main data set and its metadata
- the second for the Favourites folder with all of the metadata from the main set copied over and then adding the additional fields.
Then I thought that would probably waste quite a bit of space (the number of users is much larger than the number of main resources).
So then I thought I could have the main data set with its metadata (Core0), same as above, with the resourceId as the unique identifier. Then there would be a second core (Core1) for the Favourites folder, with a unique id made of the resourceId, userId, grade, and folder all concatenated. The resourceId would also be a separate field. In addition, I would create another schema/core (Core3) with all the fields from the other two and have a request handler defined on it that searches across the other 2 cores and returns the results through this core.
This third core would have searches run against it where the results would be expected for only a single user. For example, a user searches their Favourites folder for all the items with Foo. The result is only those items the user has added to their Favourites with Foo somewhere in their main data set metadata. I guess the request handler on Core3 would break the search up into a search for all documents with Foo in Core0, a search across Core1 for userId and folder, and then match up the resourceIds from both and eliminate those not in both. Or run a search on Core1 with the userId and folder and then, having gotten that result set back, extract all the resourceIds and append an AND onto the search query to Core0, like: AND (resourceId:1232232312 OR resourceId:838388383 OR resourceId:8637626491).
Could this be made to work? Or is there some simpler mechanism in Solr to resolve the merging of two searches across two cores, returning only the results that match on a (not necessarily unique) field in both?
Thanks.
The problem looks like a database join of 2 tables with resource id as the foreign key.
Ignore this post if I've understood it wrong.
First, I would probably do it with a single core, with a userid field (indexed, but not stored), reindexing a document every time a new user favourites it by appending their user id (delimited by something the analyzer ignores).
Searching then gets easier (userId:"kaka's id" will fetch all my favourites).
It takes some work to do this, though, and if the number of users who can favourite a document grows, the userid field gets really long.
In that case I would move on to my next idea, which is similar to yours: have a second core with (userid, resource id). Write a wrapper which first searches this core for all the favourites, then searches the other core for those resources in a filter condition - but again, if a user favourites many resources, the query might exceed the GET method's size limit...
If neither works, it's time to think of something more scalable, which leaves us with the same space-wasting option.
Am I missing something?
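The wrapper idea above - query the favourites core first, then turn the returned resourceIds into an OR filter against the main core - can be sketched in plain Java. The resourceId field name comes from the question; the class and method names are hypothetical, and sending the resulting query via POST rather than GET would sidestep the URL size limit mentioned:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FavouritesQueryBuilder {

    // Combine the user's free-text query with an OR clause over the
    // resourceIds returned by the favourites-core search.
    public static String buildMainQuery(String userQuery, List<String> resourceIds) {
        String idClause = resourceIds.stream()
                .map(id -> "resourceId:" + id)
                .collect(Collectors.joining(" OR ", "(", ")"));
        return "(" + userQuery + ") AND " + idClause;
    }

    public static void main(String[] args) {
        System.out.println(buildMainQuery("text:Foo",
                List.of("1232232312", "838388383", "8637626491")));
    }
}
```

For the sample ids this produces (text:Foo) AND (resourceId:1232232312 OR resourceId:838388383 OR resourceId:8637626491), which is the shape of query described in the question.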

Storing index values in FAST-ESP without modifications

first of all, I'm totally new to FAST, but I already have a couple of issues I need to solve, so I'm sorry if my questions are very basic =)
Well, the problem is that I have a field in the FAST index which in the source document is something like "ABC  12345" (please note the intentional multiple whitespaces) but which is stored in the index as "ABC 12345" (note that now there is a single space).
If I retrieve all the document values, this specific value is OK (with all the whitespaces); my only problem is with the way the value is stored in the index, since I need to retrieve it and display it to my user just as it appears in the original document. I don't want to go to the full document just for this value - I want the value I already have in the index. I think I need to update one of the FAST XML configuration files, but I don't have enough documentation at hand to decide where to make the change: index_profile.xml? the XMLMapper file?
I've found the answer myself. I'm using an XMLMapper for my collection; all I had to do was add the ignore-whitespace attribute to the Mapping element and set its value to "false". This solved the problem, and the raw data retrieved from the index now contains the expected inner whitespaces.
Thanks.

How to find related items by tags in Lucene.NET

My indexed documents have a field containing a pipe-delimited set of ids:
a845497737704e8ab439dd410e7f1328|0a2d7192f75148cca89b6df58fcf2e54|204fce58c936434598f7bd7eccf11771
This field represents a list of tags. The list may contain 0 to n tag Ids.
When users of my site view a particular document, I want to display a list of related documents.
This list of related document must be determined by tags:
Only documents with at least one matching tag should appear in the "related documents" list.
Documents with the most matching tags should appear at the top of the "related documents" list.
I was thinking of using a WildcardQuery for this but queries starting with '*' are not allowed.
Any suggestions?
Setting aside for a minute the possible uses of Lucene for this task (which I am not overly familiar with) - consider checking out the LinkDatabase.
Sitecore will, behind the scenes, track all your references to and from items. And since your multiple tags are indeed (I assume) selected from a meta hierarchy of tags represented as Sitecore Items somewhere - the LinkDatabase would be able to tell you all items referencing it.
In some sort of pseudo code mockup, this would then become
for each ID in tags
    get all documents referencing this tag
    for each document found
        if master-list contains document: increase its usage-count
        else: add document to master-list
sort master-list by usage-count descending
Forgive me that I am not more precise, but I am unable to mock up a fully working example right at this stage.
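A minimal Java rendering of the pseudocode above, assuming the tag-to-documents lookup (e.g. from the LinkDatabase or an index) is available as a simple map; all names are illustrative:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RelatedByTags {

    // Rank candidate documents by how many tags they share with the current
    // document, exactly as in the pseudocode: count matches, sort descending.
    public static List<String> related(List<String> currentTags,
                                       Map<String, List<String>> docsByTag) {
        Map<String, Integer> usageCount = new HashMap<>();
        for (String tag : currentTags) {
            for (String doc : docsByTag.getOrDefault(tag, List.of())) {
                usageCount.merge(doc, 1, Integer::sum); // add or increment
            }
        }
        List<String> result = new ArrayList<>(usageCount.keySet());
        result.sort(Comparator.comparing((String d) -> usageCount.get(d)).reversed());
        return result;
    }
}
```

Documents sharing more tags with the current one sort first, matching the second requirement in the question.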
You can find an article about the LinkDatabase here http://larsnielsen.blogspirit.com/tag/XSLT. Be aware that if you're tagging documents using a TreeListEx field, there is a known flaw in earlier versions of Sitecore. Documented here: http://www.cassidy.dk/blog/sitecore/2008/12/treelistex-not-registering-links-in.html
Your pipe-delimited set of ids should really have been separated into individual fields when the documents were indexed. This way, you could simply do a query for the desired tag, sorting by relevance descending.
You can have the same field multiple times in a document. In this case, you would add multiple "tag" fields at index time by splitting on |. Then, when you search, you just have to search on the "tag" field.
Try this query on the tag field.
+(tag1 OR tag2 OR ... tagN)
where tag1, .. tagN are the tags of a document.
This query will return documents with at least one tag match. The scoring will automatically take care of bringing the documents with the highest number of matches to the top, as the final score is the sum of the individual scores.
Also, be aware that if you search for documents similar to the tags of Doc1, Doc1 itself will come out at the top of the results, so handle that case accordingly.
