Solr - Is there a way to speed up my import

Solr - Is there a way to speed up my import - search

I have a relational database model
This is the basics of my data-config.xml
<entity name="MyMainEntity" pk="pID" query="select ... from [dbo].[TableA] inner join TableB on ...">
<entity name="Entity1" pk="Id1" query="SELECT [Text] Tag from [Table2] where ResourceId = '${MyMainEntity.pId}'"></entity>
<entity name="Entity1" pk="Id2" query="SELECT [Text] Tag from [Table2] where ResourceId2 = '${MyMainEntity.pId}'"></entity>
<entity name="LibraryItem" pk="ResourceId"
query="select SKU
FROM [TableB]
INNER JOIN ...
ON ...
INNER JOIN ...
ON ...
WHERE ... AND ...'">
</entity>
</entity>
Now, this takes a lot of time.
10000 rows in the first query and then each other inner entities are fetched later (around 10 rows each).
If I use a db profiler I see a the three inner entities query running over and over (3 select sentences than again 3 select sentences over and over)
This is really not efficient.
And the import can run over 40 hrs ()
Now,
What are my options to run it faster .
Obviously there is an option to flat the tables to one big table - but that will create a lot of other side effects. I would really like to avoid that extra effort and run solr on my production relational tables.
So far it works great out of the box and I am searching here if there is a configuration tweak.
If I will flat the rows that - does the schema.xml need to be change too? or the same fields that are multivalued will keep being multivalued.
Thanks.

without changing the schema of the DB, the first thing to try is caching. If the inner entities cache well, gains will be substantial.
Maybe the wiki is not uptodate so you should check the jira issues, namely solr-2382 and maybe have a look at solr-2948 too.
A second path could be trying multithreading DIH, but it's more tricky. At one point this was optional, but later was removed cause it was buggy, and I think now there was some jira issue trying to reimplement it, try look it up, but I recommend caching first.

Related

What library do you use for postgres+jsonb in Node?

I would like to do more complex queries on jsonb/documents that contain arrays of objects. Is there any library anyone would recommend for Node? I am using pg but I want to do more advanced queries like select the document where a document has an array with an object with a certain key/value. If there aren't any libraries that do this, does anyone know how I could do it with json functions/etc in psql? or point me to a book/resource where I could learn this advanced querying?

If you need to do really complicated things you're going to be writing SQL no matter what. But for basic queries that involve working with JSONB fields Massive (full disclosure, it's my project) has you covered, and executing handwritten prepared statements is as easy as anything else since scripts are loaded into the API.
Searching an embedded array falls into the 'really complicated' category, unfortunately, but if you know your element positions you could do this quite simply with Massive:
await db.mytable.find({
'somejson.arrayfield[0].key': 'value'
});
This would return all records from mytable where the somejson column has an arrayfield array, the first element in which contains a "key": "value" pair.
For searching, check out the Postgres docs. The specific question you have requires a lateral join on the jsonb_array_elements function like so:
SELECT somejson
FROM mytable
JOIN LATERAL jsonb_array_elements(mytable.somejson->'arrayfield') AS elements
ON TRUE
WHERE elements->>'key' = $1;
With Massive, you'd put this query in a script in your application's /db directory and run it as db.myScriptName('value'). You can use folders to group similar scripts too.

Cassandra Full-Text Search

Full-Text search in Cassandra;
I am fairly new to Cassandra, and wish to understand it more properly. I am attempting to perform a Full-Text search in Cassandra, but after some research I have found that there may not be a "simple" approach for this.. and I say maybe because the first page of Google hasn't said much of anything.
So I am trying to understand now instead, what is the best approach here.. This sort of lead me to take make up my own assumptions based on what I've learned so far about Cassandra, that is based on these two principals; a) design your tables based on your queries, rather than the data, and b) more-data is a good thing, as long as it is being used properly.
With that being said, I've come up with a couple of solutions I'd like to share, and also ask that if anyone has a better idea, please fill me on it before I commit to anything unreasonable/naive.
First Solution: Create a Column Family(CF), with two primary keys and an Index like so:
CREATE TABLE "FullTextSearch" (
"PartialText" text,
"TargetIdentifier" uuid,
"CompleteText" text,
"Type" int,
PRIMARY KEY ("PartialText","TargetIdentifier")
);
CREATE INDEX IX_FullTextSearch_Type "keyspace"."FullTextSearch" ("Type");
With the above table, I would need to insert rows for the text "Hello World" as follows:
BATCH APPLY;
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("H",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("He",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Hel",000000000-0000-0000-0000-000000000,"Hello World",1);
.....
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Hello Wor",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Hello Worl",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Hello World",000000000-0000-0000-0000-000000000,"Hello World",1);
.....
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Wor",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("Worl",000000000-0000-0000-0000-000000000,"Hello World",1);
INSERT INTO "FullTextSearch" ("PartialText","TargetIdentifier","CompleteText","Type") VALUES ("World",000000000-0000-0000-0000-000000000,"Hello World",1);
END BATCH;
Basically, the above will satisfy the following wildcards/partialtext "%o W%", "Hello%", "Worl%"; However it will not satisfy partial words such as "%ell%" for "Hello", which I can feel alright about for now..... (OCD sorta kicks in here)
This approach sort of sucks for me because I would now have to delete/re-insert any time a save/name change occurs on the "TargetIdentifier";
The Second Solution, would be very similar only this time making use of wide-columns; where the table might look like:
CREATE TABLE "FullTextSearch" (
"TargetIdentifier" uuid,
"Type" int,
"CompleteText" text,
PRIMARY KEY("TargetIdentifier")
);
and now during a search something like:
SELECT * FROM "FullTextSearch" WHERE "He" = 1;
so that if the column exists, the respective rows are returned;
Third Solution:
similar to the one above, only this time instead of using wide-columns we use a set column such as map for the partial texts, and perform a query like:
SELECT * FROM "FullTextSearch" WHERE "PartialTexts"['He'] = 1;
Anyways, I am all out of ideas, it is late, and I can only hope for a great response! Please, let me know what I should be doing here... am I even on the right path?

AFAIK Datastax Enterprise Search is the (commercial) successor of Solandra.
Cassandra 2.0 supports so called "custom secondary indexes".
Custom secondary indexes are Java code. Your own implementation has to implement the abstract class org.apache.cassandra.db.index.SecondaryIndex
(See http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/create_index_r.html)
I'm not sure whether implementations exist for Elasticsearch or Solr.
I would not recommend to code all the weird full text search logic like stemming, multiple/exotic language support or even geo spatial stuff.
But SecondaryIndexwould be a good point to start integrating your favorite search engine.

If your dataset is relative small you can simply use a inmemory instance of lucene, update the index at a set interval and you are ready to go.

Use elassandra which comes elasticsearch as a plugin in cassandra.
An example can be found from here

Check out SOLANDRA (former Lucandra)
But I think Solandra is not being actively developed any more, the author moved to Datastax and continued his work there.
So You can also take a look at Datastax Enterprise Search
There are some limitation also, look at DistributedSearch

The very basic thing about cassandra is if you want to use where clause for filtration of records that column is either primary key or you have to assign index to it, so what i can see is you have given primary key to "TargetIdentifier" field and index to "Type" and using "CompleteText" in where clause so this may not work..
Assign secondary index to "CompleteTex" and check whether you are getting desired output or not.

A couple other options you have:
Stratio Lucene Plugin. This uses Lucene for implementing a native secondary index.
You also have SSTable Attached Secondary Index (SASI) available to use for free text searching.
Be forewarned that both of these strategies use locally distributed indexes such that queries will not be very performant since searches will end up being broadcast across the entire cluster. For SASI, you can avoid this if you can use a partition key as part of your query.

Use Solr for fullText search
Cassandra is not good for fullText.
1 Db Cassandra for Archive
2 Solr for full text search

solr - modeling multiple values on 1:n connection

I try to model my db using this example from solr wiki.
I have a table called item and a table called features with id,featureName,description
here is the updated xml (added featureName)
<dataConfig>
<dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
<document>
<entity name="item" query="select * from item">
<entity name="feature" query="select description, featureName as features from feature where item_id='${item.ID}'"/>
</entity>
</document>
Now I get two lists in the xml element
<doc>
<arr name="featureName">
<str>number of miles in every direction the universal cataclysm was gathering</str>
<str>All around the Restaurant people and things relaxed and chatted. The</str>
<str>- Do we have... - he put up a hand to hold back the cheers, - Do we</str>
</arr>
<arr name="description">
<str>to a stupefying climax. Glancing at his watch, Max returned to the stage</str>
<str>air was filled with talk of this and that, and with the mingled scents of</str>
<str>have a party here from the Zansellquasure Flamarion Bridge Club from</str>
</arr>
</doc>
But I would like to see the list together (using xml attributes) so that I dont have to join the values.
Is it possible?

I wanted to suggest the ScriptTransformer, it gives you the flexibility to alter the data as needed, but it will not work in your case since it's working at the row level.
You can always define an aggregation function for string concatenation in SQL(example), but you will potentially have performance issues.
If you would use a http/xml data source the solution would have been to use the flatten atribute.
Nevertheless the search functionality will work as expected even if you ended up with multi-valued fields. The down side would be on the client where you will concatenate them before the presentation layer, which is not really a problem if you use some sort of pagination.

How to preform a relative complement query in CRM?

Background (ie what the heck is a relative complement?)
Relative Complement
What I'm trying to do
Let's say I've got a custom Vehicle entity that has a VehicleType option set that is either "Car", or "Truck". There is a 1 to many relationship between Contact and Vehicle (ie. ContactId is on the vehicle entity). How do I write an XRM query (Linq To CRM, QueryExpression, fetch Xml, whatever) that returns the contacts with only cars?

Option 1:
I’d prefer a modification of the proposal that AdamV makes above. I can’t think of a way that you’d get this particular query answered using Linq to CRM, Query Expressions, FetchXML alone. Daryl doesn’t offer what the client is, but I would suppose if Linq and Query Expressions were acceptable offerings, .NET is on the table. Creating aggregate fields containing the count of the related entity on the parent entity (contact in this case) offers more than the Boolean option. If the query requirements ever changed to a threshold (more than X cars, less than Y trucks, between X and Y total vehicles) the Boolean options fails to deliver. The client in this question isn’t known, but I can’t think of many (any?) cases where pulling all the records to the client on a set of 500K+ rows is more efficient than a the SQL query that CRM would make on your behalf against several integer fields with range clauses.
Upside:
Maintains client purity in Query approach
Simple client query
Probably as performant as possible
Downside:
Setups for Aggregate fields
Workflow or plugin to manage the increment and decrement of the aggregate fields
SQL Script for initial load of the aggregates.
Risk that aggregate fields get out of sync (workflow or plugin fails)
Option 2:
If purity within the client isn’t essential, and .NET is on the table – skip the aggregate fields and the setup and just run SQL against the Views. If you don’t want to work with the ADO.NET, a thin ORM like Dapper, Massive, or PetaPOCO can still give you an object model. As Andreas offers in his comment on the OP’s first answer, it seems like something fairly trivial to do in SQL.
Sketching something from top of mind:
SELECT c.*
FROM Contact
WHERE C.Contactid in (
Select contactid
FROM Vehicle v
group by v.contactid , v.type
having v.type = ‘Car’ and count(contactid) > 1
)
AND NOT IN (
Select contactid
FROM Vehicle v
group by v.contactid , v.type
having v.type <> ‘Car’ and count(contactid) > 1
)
Upside:
Much less work
CRM Entities get left alone
Downside:
Depending on the client and/or the application mixing DataAccess methods is a bit kludgy.
Likely less performant than Option 1
Option 3:
Mix and Match: Take the aggregate fields from Option 1. But update them using a scheduled SQL job (or something similar) with a query similar to the initial load job you’d need to write in Option 1
Upside:
Takes most of the work and risk out of Option 1
Keeps all of the performance of Option 1
Downside:
Some will see this as an unsupported feature.

In order to order to perform a true Relative Complement Query you need to be able to perform a subquery.
Your query would basically say give me all the contacts with cars, and then, within those results, remove any contacts that have a vehicle that isn't a car. This is what the SQL in #JasonKoopmans answer does. Unfortunetly, CRM does not support SubQueries.
Therefore, the only way to achieve this is to either perform the sub query on the client side, as I resorted to doing, or storing the results of what would be the subquery in a manner that can be accessed through the main query (ie storing counts on the contact entity).
You could theoretically do this "on the fly" by making a SubQueryResult entity that stores a ContactId, and SubQueryId. You'd first pull back the contacts that have at least 1 car, and create a SubQueryResult record for each record, with it's contactId, and a single SubQueryId that is generated client side to tie them all together.
Then you'd do another query that says give me all the contacts that are in this SubQueryResult with this SubQueryId, that do not have any vehicles that aren't cars.
I could only assume that this wouldn't be any more efficient than performing the two separate queries and performing the filter client side. Although with the new ExecuteMultipleRequests in the new CRM release, it may be close.

I have resorted to pulling back all of my records in CRM, and performing the check on the client side since CRM 2011 doesn't support this via Query Expressions.
You could write two Fetch XML statements, one to return all contacts and the count of their vehicles, and another to return all contacts and the count of their cars, then compare the list on the client side. But once again, you're having to return every contact and filter it client side.

It's not tested but how about this query expression? I'm linking in the Vehicle entity as an inner join, requiring that it's a Car. I'm assuming that the field VehicleType is a String because I'm a bit lazy and don't want to test it (I'm typing this hardcore style, no compilation - pure brain work).
Optionally, you might want to add a Criteria section as well to control which of the Contact instances that actually get retrieved. Do tell how it went!
Sorry for the verbosity. I know you like it short. My brains work better when circumlocutory.
new QueryExpression
{
EntityName = "contact",
ColumnSet = new ColumnSet("fullname"),
LinkEntities =
{
new LinkEntity
{
JoinOperator = JoinOperator.Inner,
LinkFromEntityName = "contact",
LinkFromAttributeName = "contactid",
LinkToEntityName = "vehicle",
LinkToAttributeName = "contactid",
Columns = new ColumnSet("vehicletype"),
EntityAlias = "Vroom",
//LinkCriteria = { Conditions =
//{
// new ConditionExpression(
// "vehicletype", ConditionOperator.Equal, "car")
//} }
LinkCriteria = { Conditions =
{
new ConditionExpression(
"vehicletype", ConditionOperator.NotEqual, "truck")
} }
}
}
};
EDIT:
I've talk to my MVP Gustaf Westerlund and he's suggested the following work-around. Let me stress that it's not an answer to your original question. It's just a way to solve it. And it's cumbersome. :)
So, the hint is to add a flag in the Contact or Person entity. Then, every time you create a new instance of Vehicle, you need to fire a message and using a plugin, update the information on the first about the creation of the latter.
This has several drawbacks.
It requires us to do stuff.
It's not the straight-forward do-this-and-that type of approach.
Maintenance is higher for every new type of Vehicle one adds.
Buggibility is elevated since there are many cases to regard (what happens to the flagification when a Vehicle instance is reasigned, deleted etc.).
So, my answer to your question is changed to: "can't be done". This remains effective until (gladly) proven wrong by presented alternative solution. Duck!
Personally, I'd fetch (almost) everything and unleash the hounds of LINQ onto it. But I'd do that without smiling nor proud. :)

CouchDB views - Multiple join... Can it be done?

I have three document types MainCategory, Category, SubCategory... each have a parentid which relates to the id of their parent document.
So I want to set up a view so that I can get a list of SubCategories which sit under the MainCategory (preferably just using a map function)... I haven't found a way to arrange the view so this is possible.
I currently have set up a view which gets the following output -
{"total_rows":16,"offset":0,"rows":[
{"id":"11098","key":["22056",0,"11098"],"value":"MainCat...."},
{"id":"11098","key":["22056",1,"11098"],"value":"Cat...."},
{"id":"33610","key":["22056",2,"null"],"value":"SubCat...."},
{"id":"33989","key":["22056",2,"null"],"value":"SubCat...."},
{"id":"11810","key":["22245",0,"11810"],"value":"MainCat...."},
{"id":"11810","key":["22245",1,"11810"],"value":"Cat...."},
{"id":"33106","key":["22245",2,"null"],"value":"SubCat...."},
{"id":"33321","key":["22245",2,"null"],"value":"SubCat...."},
{"id":"11098","key":["22479",0,"11098"],"value":"MainCat...."},
{"id":"11098","key":["22479",1,"11098"],"value":"Cat...."},
{"id":"11810","key":["22945",0,"11810"],"value":"MainCat...."},
{"id":"11810","key":["22945",1,"11810"],"value":"Cat...."},
{"id":"33123","key":["22945",2,"null"],"value":"SubCat...."},
{"id":"33453","key":["22945",2,"null"],"value":"SubCat...."},
{"id":"33667","key":["22945",2,"null"],"value":"SubCat...."},
{"id":"33987","key":["22945",2,"null"],"value":"SubCat...."}
]}
Which QueryString parameters would I use to get say the rows which have a key that starts with ["22945".... When all I have (at query time) is the id "11810" (at query time I don't have knowledge of the id "22945").
If any of that makes sense.
Thanks

The way you store your categories seems to be suboptimal for the query you try to perform on it.
MongoDB.org has a page on various strategies to implement tree-structures (they should apply to Couch and other doc dbs as well) - you should consider Array of Ancestors, where you always store the full path to your node. This makes updating/moving categories more difficult, but querying is easy and fast.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string