Search strategies in ORMs

I am looking for information on handling search in different ORMs.
Currently I am redeveloping an old application in PHP, and one of the requirements is to make everything, or almost everything, searchable: the user just types "punkrock live" and the app finds video clips, music tracks, reviews, upcoming events or even user comments labeled that way.
In an environment where everything is searchable, the ORM needs to support this feature in two ways:
providing some indexing API on the "O" side of the ORM
providing means for bulk database retrieval on the "R" side
The ideal solution would return ready-made objects based on the search string.
Do you know any good end-to-end solutions that do the job, not necessarily in PHP?
If you have dealt with a similar problem, it would be nice to hear what your experience is. Something more than "Use Lucene" or "semantic web is the way" one-liners, though ;-)

I have recently integrated the Compass search engine into a Java EE 5 application. It is based on Lucene Java and supports different ORM frameworks as well as other types of models like XML or no real model at all ;)
In the case of an object model managed by an ORM framework, you can annotate your classes with special annotations (e.g. @Searchable), register your classes, and let Compass index them on application startup and listen to changes to the model automatically.
When it comes to searching, you have the power of Lucene at hand. Compass then gives you instances of your model objects as search results.
It's not PHP, but you said it didn't have to be PHP necessarily ;) Don't know if this helps, though...

In a Propel 1.3 schema.xml file, you can specify that you'd like all your models to extend a "BaseModel" class that YOU create.
In that BaseModel, you're going to re-define the save() method to be something like this:
public function save(PropelPDO $con = null)
{
    if ($this->getIsSearchable())
    {
        // update your search index here. Lucene, Sphinx, or otherwise
    }
    // pass the same connection through ($conn was an undefined variable)
    return parent::save($con);
}
That takes care of keeping everything indexed. As for searching, I'd suggest creating a Search class with a few methods.
class Search
{
    protected $_searchableTypes = array('music', 'video', 'blog');
    public function findAll($search_term)
    {
        $results = array();
        foreach ($this->_searchableTypes as $type)
        {
            // findType() runs the search for a single type; it is not shown here
            $results[] = $this->findType($type, $search_term);
        }
        return $results;
    }
}
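To make the two halves above concrete (index on save, then search across types and get objects back), here is a minimal sketch in Python rather than PHP. Everything in it is an assumption made for illustration: the in-memory SearchIndex stands in for Lucene or Sphinx, and the names BaseModel, Video, Track and find_all are hypothetical, not Propel API.

```python
from collections import defaultdict


class SearchIndex:
    """Tiny in-memory inverted index: term -> set of (kind, id) keys."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._objects = {}

    def add(self, obj):
        key = (obj.kind, obj.id)
        # Drop stale postings first so that re-saving an object re-indexes it.
        for refs in self._postings.values():
            refs.discard(key)
        self._objects[key] = obj
        for term in obj.search_text().lower().split():
            self._postings[term].add(key)

    def find(self, query, kind=None):
        terms = query.lower().split()
        if not terms:
            return []
        # An object matches only if it contains every query term.
        keys = set.intersection(*(set(self._postings.get(t, ())) for t in terms))
        return [self._objects[k] for k in sorted(keys) if kind in (None, k[0])]


class BaseModel:
    """Hypothetical base class; save() mirrors the overridden save() above."""

    kind = "base"
    is_searchable = True

    def __init__(self, index, id, title, tags=()):
        self.index, self.id, self.title, self.tags = index, id, title, tuple(tags)

    def search_text(self):
        return " ".join((self.title,) + self.tags)

    def save(self):
        if self.is_searchable:
            self.index.add(self)
        # A real model would persist itself to the database here.
        return self


class Video(BaseModel):
    kind = "video"


class Track(BaseModel):
    kind = "music"


def find_all(index, query, kinds=("music", "video")):
    """Group hits by type, in the spirit of Search::findAll() above."""
    return {kind: index.find(query, kind) for kind in kinds}
```

With a Video tagged "punkrock" and "live" saved through this base class, find_all(index, "punkrock live") returns that Video object under the "video" key, so the caller gets ready-made objects rather than raw rows.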


DDD/CQRS: Combining read models for UI requirements

Let's use the classic example of blog context. In our domain we have the following scenarios: Users can write Posts. Posts must be cataloged at least in one Category. Posts can be described using Tags. Users can comment on Posts.
The four entities (Post, Category, Tag, Comment) are implemented as separate aggregates because I have not found any rule that requires one entity's data to interfere with another's. So each aggregate has its own repository, and each aggregate references the others by ID.
Following CQRS, from this scenario I have derived typical use cases that result in commands such as WriteNewPostCommand, PublishPostCommand, DeletePostCommand, etc., along with their respective queries to get data from repositories: FindPostByIdQuery, FindTagByTagNameQuery, FindPostsByAuthorIdQuery, etc.
Depending on which side of the app we are on (backend or frontend), the queries will be more or less complex. On the front page we may need to build some widgets to get the last comments, the latest posts of a category, etc. Those queries involve a simple Query object (few search criteria) and a very simple QueryHandler (a single repository as a dependency of the handler class).
But in other places these queries can be more complex. In an admin panel we need to show, in a table, rows that satisfy complex search criteria. It might be interesting to search posts by author name (not ID), category names, tag names, publish date... criteria that belong to different aggregates and different repositories.
In addition, in our table of posts we don't want to show the post along with the author ID or category IDs. We need to show all the information (user name, avatar, category name, category icon, etc.).
My questions are:
At the infrastructure layer, when we design repositories, should the search methods (findAll, findById, findByCriterias...) return the corresponding entity holding only the IDs of its associations? I mean, if I have a method findPostById(uuid) or findPostByCustomFilter(filter), should it return a Post instance with references to all its category IDs, tag IDs and author ID? Or should my repo have some kind of method that populates a given Post instance with the associations I want?
If I want to search posts created from 12/12/2014, written by John, categorised under the "News" and "Videos" categories and tagged "sci-fi" and "adventure", and get the full details of each aggregate, how should I create my Query and QueryHandler?
a) Create a Query with all my parameters (authorName, categoriesNames, TagsNames, and whether I want to retrieve the User, Category and Tag associations in full detail) and then have its QueryHandler assemble the different read models into a single one. Or...
b) Create different queries (FindCategoryByName, FindTagByName, FindUserByName) and have my web controller call them first, then call FindPostQuery passing it the authorId, categoryId and tagId returned from the other queries?
Solution b) appears cleaner, but it seems more expensive to me.
On the query side, there are no entities. You are free to populate your read models in whatever way suits your requirements best. Whatever data you need to display on (a part of) the screen, you put in the read model. It's not the command-side repositories that return these read models but specialized query-side data access objects.
You mentioned "complex search criteria"; I recommend you model them with a corresponding SearchCriteria object. This object would be technology agnostic, but it would be passed to your query-side data access object, which would know how to combine the criteria to build a lower-level query for the specific data store it targets.
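A rough sketch of that split, in Python with SQLite purely for illustration: the criteria object knows nothing about storage, and the query-side data access object turns it into SQL. The names PostSearchCriteria and PostReadDao, and the denormalized post_list table with its columns, are assumptions of this example and do not come from the question.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PostSearchCriteria:
    """Technology-agnostic description of what the caller wants."""
    author_name: Optional[str] = None
    category_names: Sequence[str] = ()
    published_from: Optional[str] = None  # ISO date, e.g. "2014-12-12"


class PostReadDao:
    """Query-side data access object; it alone knows the store is SQL."""

    def __init__(self, conn):
        self.conn = conn

    def search(self, criteria):
        clauses, params = [], []
        if criteria.author_name is not None:
            clauses.append("author_name = ?")
            params.append(criteria.author_name)
        if criteria.category_names:
            marks = ", ".join("?" for _ in criteria.category_names)
            clauses.append(f"category_name IN ({marks})")
            params.extend(criteria.category_names)
        if criteria.published_from is not None:
            clauses.append("published_on >= ?")
            params.append(criteria.published_from)
        # Only add a WHERE clause when at least one criterion was set.
        where = " WHERE " + " AND ".join(clauses) if clauses else ""
        sql = ("SELECT title, author_name, category_name FROM post_list"
               + where + " ORDER BY title")
        return self.conn.execute(sql, params).fetchall()
```

Swapping the data store later means writing another DAO that accepts the same PostSearchCriteria; the callers that build the criteria stay untouched.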
With simple applications like this, it's easier not to get distracted by aggregates. Do event sourcing, and have one set of tables subscribe to the events, shaped so that it is easy to query the way you want.
In other words, it sounds like your main goal is to be able to query easily for the scenarios you describe. Start with that end goal. Now write your event handlers to adjust your tables accordingly.
Start with events and the UI. Then everything else will fit easily. Google "Event Modeling", as it will help you formulate ideas around what and how you want to build in this style of application.
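As an illustration of "write your event handlers to adjust your tables", here is a small Python sketch of a projection that keeps one denormalized table up to date. The event names PostWritten and CommentAdded, their fields, and the post_list table are all hypothetical; a real system would receive the events from its event store or message bus.

```python
import sqlite3


class PostListProjection:
    """Keeps one denormalized, easy-to-query table in step with domain events."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS post_list ("
            "post_id TEXT PRIMARY KEY, title TEXT, author_name TEXT, "
            "comment_count INTEGER NOT NULL DEFAULT 0)"
        )

    def handle(self, event):
        # Dispatch on the event type; events this projection ignores are skipped.
        handler = getattr(self, "_on_" + event["type"], None)
        if handler is not None:
            handler(event)

    def _on_PostWritten(self, event):
        self.conn.execute(
            "INSERT INTO post_list (post_id, title, author_name) VALUES (?, ?, ?)",
            (event["post_id"], event["title"], event["author_name"]),
        )

    def _on_CommentAdded(self, event):
        self.conn.execute(
            "UPDATE post_list SET comment_count = comment_count + 1 WHERE post_id = ?",
            (event["post_id"],),
        )

    def rows(self):
        return self.conn.execute(
            "SELECT title, author_name, comment_count FROM post_list ORDER BY title"
        ).fetchall()
```

The admin table then reads straight from post_list with no joins, because the handlers already shaped the data the way the screen needs it.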
I can see three problems in your approach and they need to be solved separately:
In CQRS the queries are completely separate from the commands. So, don't try to solve your queries with the repositories of your command pipeline. The point of CQRS is precisely to allow you to solve the commands and queries in very different ways, as they have very different requirements.
You mention DDD in the question title, but you don't mention your Bounded Contexts in the question itself. If you follow DDD, you'll most likely have more than one BC. For example, in your question, it could be that CategoryName and AuthorName belong to two different BCs, which are also different from the BC where the blog posts are. If that is the case and each BC properly owns its own data, the data that you want to search by and show in the UI will be stored potentially in different databases, therefore implementing a query in the DB with a join might not even be possible.
Searching and Reading data are two different concerns and can/should be solved differently. When you search, you get some search criteria (including sorting and paging) and the result is basically a list of IDs (authorIds, postIds, commentIds). When you Read data, you get one or more Ids and the result is one or more DTOs with all the required data properties. It is normal that you need to read data from multiple BCs to populate a single page, that's called UI composition.
So if we agree on these 3 points and especially focussing on point 3, I would suggest the following:
Figure out all the searches that you want to do and see if you can decompose them to simple searches by BC. For example, search blog posts by author name is a problem, because the author information could be in a different BC than the blog posts. So, why not implement a SearchAuthorByName in the Authors BC and then a SearchPostsByAuthorId in the Posts BC. You can do this from the Client itself or from the API. Doing it in the client gives the client a lot of flexibility because there are many ways a client can get an authorId (from a MyFavourites list, from a paginated list or from a search by name) and then get the posts by authorId is a separate operation. You can do the same by tags, categories and other things. The Post will have Ids, but not the extra details about those IDs.
Potentially, you might want more complicated searches. As long as the search criteria (including sorting fields) contain fields from a single BC, you can easily create a read model and execute the search there. Note that this is only for the search criteria. If the search result needs data from multiple BCs you can solve it with UI composition. But if the search criteria contain fields from multiple BCs, then you'll need some sort of Search engine capable of indexing data coming from multiple sources. This is especially evident if you want to do full-text search, search by categories, tags, etc. with large quantities of data. You will need to use some specialized service like Elastic Search and it won't belong to any of your existing BCs, it'll be like a supporting service.
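The first suggestion (search returns IDs, read returns DTOs, and the caller composes them) might look like the following Python sketch. The two classes stand in for separate bounded contexts with private data; their names, method names and sample records are invented for the example.

```python
class AuthorsContext:
    """Hypothetical Authors BC with its own private data."""

    def __init__(self):
        self._authors = {
            "a1": {"name": "John", "avatar": "john.png"},
            "a2": {"name": "Mary", "avatar": "mary.png"},
        }

    def search_author_by_name(self, name):
        # Searching yields IDs only.
        return [aid for aid, a in self._authors.items() if a["name"] == name]

    def read_authors(self, author_ids):
        # Reading yields DTO-like dicts for the requested IDs.
        return {aid: dict(self._authors[aid]) for aid in author_ids}


class PostsContext:
    """Hypothetical Posts BC; it stores author IDs, never author details."""

    def __init__(self):
        self._posts = {
            "p1": {"title": "Hello", "author_id": "a1"},
            "p2": {"title": "World", "author_id": "a2"},
            "p3": {"title": "Again", "author_id": "a1"},
        }

    def search_posts_by_author_id(self, author_id):
        return [pid for pid, p in self._posts.items() if p["author_id"] == author_id]

    def read_posts(self, post_ids):
        return [dict(self._posts[pid], id=pid) for pid in post_ids]


def posts_by_author_name(authors, posts, name):
    """UI composition: chain two single-BC searches, then join the reads."""
    rows = []
    for author_id in authors.search_author_by_name(name):
        author = authors.read_authors([author_id])[author_id]
        for post in posts.read_posts(posts.search_posts_by_author_id(author_id)):
            rows.append({"title": post["title"], "author": author["name"],
                         "avatar": author["avatar"]})
    return rows
```

Each context answers only questions about its own data; the composition function is the only place that knows both exist, and it could live in the client or in an API layer.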
With CQRS you will have a separate stack for queries and commands. Your query stack should live in a different module, namespace, DLL or package in your project.
a) You will create one QueryModel, and this query model will return whatever you need. If you are familiar with Entity Framework or NHibernate, you will create a façade (DbContext or Session) to hold these queries together.
b) You can create these separate queries, but again, if you are familiar with any ORM you should return the set that represents the model: return every set as IQueryable and use LET (LINQ expression trees) to make your query stack more dynamic.
Using Entity Framework and C#, for example:
public class QueryModelDatabase : DbContext, IQueryModelDatabase
{
    public QueryModelDatabase() : base("dbname")
    {
        _products = base.Set<Product>();
        _orders = base.Set<Order>();
    }
    private readonly DbSet<Order> _orders = null;
    private readonly DbSet<Product> _products = null;
    public IQueryable<Order> Orders
    {
        get { return this._orders.Include("Items").Include("Items.Product"); }
    }
    public IQueryable<Product> Products
    {
        get { return _products; }
    }
}
Then you can write queries the way you need and return anything:
using (var db = new QueryModelDatabase())
{
    // db.Orders already includes "Items" and "Items.Product" (see the property above)
    var queryable = from o in db.Orders
                    where o.OrderId == orderId
                    select new OrderFoundViewModel
                    {
                        Id = o.OrderId,
                        State = o.State.ToString(),
                        Total = o.Total,
                        OrderDate = o.Date,
                        Details = o.Items
                    };
    try
    {
        // named "order" so it does not clash with the range variable "o"
        var order = queryable.First();
        return order;
    }
    catch (InvalidOperationException)
    {
        // First() throws when no order matches; return an empty view model instead
        return new OrderFoundViewModel();
    }
}

Retrieving a value object without an aggregate root

I'm developing an application with a Domain-Driven Design approach. In a special case I have to retrieve the list of value objects of an aggregate and present them. To do that, I've created a read-only repository like this:
public interface IBlogTagReadOnlyRepository : IReadOnlyRepository<BlogTag, string>
{
    IEnumerable<BlogTag> GetAllBlogTagsQuery(string tagName);
}
BlogTag is a value object in the Blog aggregate. It works fine for now, but when I think about this way of handling it and the future of the project, my concerns grow! It's not a good idea to create a separate read-only repository for every value object in such cases, is it?
Does anybody know a better solution?
You should not keep value objects in their own repository since only aggregate roots belong there. Instead you should review your domain model carefully.
If you need to keep track of value objects spanning multiple aggregates, then maybe they belong to another aggregate (e.g. a tag cloud) that could even serve as sort of a factory for the tags.
This doesn't mean you don't need a BlogTag value object in your Blog aggregate. A value object in one aggregate could be an entity in another or even an aggregate root by itself.
Maybe you should take a look at this question. It addresses a similar problem.
I think you just need a query service, as this method serves the user interface; it's just for presentation (reporting). Do something like:
public IEnumerable<BlogTagViewModel> GetDistinctListOfBlogTagsForPublishedPosts()
{
    var tags = new List<BlogTagViewModel>();
    // Go to database and run query
    // transform to collection of BlogTagViewModel
    return tags;
}
This code would be at the application layer level not the domain layer.
And notice the language I use in the method name, it makes it a bit more explicit and tells people using the query exactly what the method does (if this is your intent - I am guessing a little, but hopefully you get what I mean).
Cheers
Scott
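For readers who want to see the "go to database and run query" step filled in, here is one possible shape of such a query service, sketched in Python with SQLite. The blogs and blog_tags tables and their columns are assumptions made for the example; the point is only that the function runs plain SQL and returns view models without touching a domain repository.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class BlogTagViewModel:
    name: str


def get_distinct_blog_tags_for_published_posts(conn):
    """Application-layer query: plain SQL in, view models out, no repository."""
    rows = conn.execute(
        "SELECT DISTINCT t.name FROM blog_tags t "
        "JOIN blogs b ON b.id = t.blog_id "
        "WHERE b.published = 1 ORDER BY t.name"
    )
    # Each row is a one-element tuple; unpack it into a view model.
    return [BlogTagViewModel(name) for (name,) in rows]
```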

Preventing StackOverflowException while serializing Entity Framework object graph into Json

I want to serialize an Entity Framework Self-Tracking Entities full object graph (parent + children in one to many relationships) into Json.
For serializing I use ServiceStack.JsonSerializer.
This is what my database looks like (for simplicity, I dropped all irrelevant fields):
I fetch a full profile graph in this way:
public Profile GetUserProfile(Guid userId)
{
    using (var db = new AcmeEntities())
    {
        // the parameter was declared as userID but used as userId; the names now match
        return db.Profiles.Include("ProfileImages").Single(p => p.UserId == userId);
    }
}
The problem is that attempting to serialize it:
Profile profile = GetUserProfile(userId);
ServiceStack.JsonSerializer.SerializeToString(profile);
produces a StackOverflowException.
I believe that this is because EF provides an infinite model that trips the serializer up. That is, I can technically call: profile.ProfileImages[0].Profile.ProfileImages[0].Profile ... and so on.
How can I "flatten" my EF object graph or otherwise prevent ServiceStack.JsonSerializer from running into a stack overflow situation?
Note: I don't want to project my object into an anonymous type (like these suggestions) because that would introduce a very long and hard-to-maintain fragment of code.
You have conflicting concerns: the EF model is optimized for storing your data model in an RDBMS, not for serialization, which is the role that separate DTOs would play. Otherwise your clients will be bound to your database, where every change to your data model has the potential to break your existing service clients.
With that said, the right thing to do would be to maintain separate DTOs that you map to, which define the desired shape (aka wire format) that you want the models to have from the outside world.
ServiceStack.Common includes built-in mapping functions (i.e. TranslateTo/PopulateFrom) that simplifies mapping entities to DTOs and vice-versa. Here's an example showing this:
https://groups.google.com/d/msg/servicestack/BF-egdVm3M8/0DXLIeDoVJEJ
The alternative is to decorate the properties you want to serialize on your data model with [DataContract] / [DataMember] attributes. Any properties not attributed with [DataMember] won't be serialized, so you would use this to hide the cyclical references which are causing the StackOverflowException.
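The DTO idea is language independent, so here is a minimal sketch of it in Python rather than C#. The Profile and ProfileImage classes imitate the parent/child cycle from the question, and ProfileDto, to_dto and serialize_profile are hypothetical names; this does not use ServiceStack's TranslateTo/PopulateFrom.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


class Profile:
    """Stand-in for the EF entity: owns a list of images."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.profile_images = []


class ProfileImage:
    def __init__(self, profile, url):
        self.profile = profile  # back-reference that closes the cycle
        self.url = url
        profile.profile_images.append(self)


@dataclass
class ProfileDto:
    """Wire format: a flat shape with no way back to the parent."""
    user_id: str
    image_urls: List[str] = field(default_factory=list)


def to_dto(profile):
    # Copy only what the client needs; the back-reference is never visited.
    return ProfileDto(profile.user_id, [img.url for img in profile.profile_images])


def serialize_profile(profile):
    return json.dumps(asdict(to_dto(profile)), sort_keys=True)
```

Serializing the entity graph directly, for example with json.dumps(profile, default=vars), fails in Python with a circular reference error, which is the same underlying problem; the DTO sidesteps it because the cycle is simply not part of the shape being serialized.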
For the sake of my fellow StackOverflowers that get into this question, I'll explain what I eventually did:
In the case I described, you have to use the standard .NET serializer (rather than ServiceStack's): System.Web.Script.Serialization.JavaScriptSerializer. The reason is that you can decorate the navigation properties you don't want the serializer to handle with a [ScriptIgnore] attribute.
By the way, you can still use ServiceStack.JsonSerializer for deserializing - it's faster than .NET's and you don't have the StackOverflowException issues I asked this question about.
The other problem is how to get the Self-Tracking Entities to decorate relevant navigation properties with [ScriptIgnore].
Explanation: Without [ScriptIgnore], serializing (using the .NET JavaScript serializer) will also raise an exception about circular references (similar to the issue that raises StackOverflowException in ServiceStack). We need to eliminate the circularity, and this is done using [ScriptIgnore].
So I edited the .TT file that came with the ADO.NET Self-Tracking Entity Generator template and set it to emit [ScriptIgnore] in the relevant places (if someone wants the code diff, write me a comment). Some say that it's bad practice to edit these "external", not-meant-to-be-edited files, but heck, it solves the problem, and it's the only way that doesn't force me to re-architect my whole application (use POCOs instead of STEs, use DTOs for everything, etc.).
@mythz: I don't entirely agree with your argument about using DTOs; see my comments on your answer. I really appreciate your enormous efforts building ServiceStack (all of the modules!) and making it free to use and open-source. I just encourage you to either respect the [ScriptIgnore] attribute in your text serializers or come up with an attribute of your own. Otherwise, even if one actually can use DTOs, they can't add navigation properties from a child object back to a parent one because they'll get a StackOverflowException.
I do mark your answer as "accepted" because, after all, it helped me find my way in this issue.
Be sure to detach the entity from the ObjectContext before serializing it.
I also used the Newtonsoft Json.NET serializer:
JsonConvert.SerializeObject(EntityObject, Formatting.Indented, new JsonSerializerSettings { PreserveReferencesHandling = PreserveReferencesHandling.Objects });

CouchDB simple find

I have a CouchDB database,
and I want, as in MongoDB, to find one item,
something like db.find({user : "John"})
What is the easiest way to do it?
If you have your queries predefined, you can use views to query your database.
There is also the ability to use temporary views for ad-hoc searches, but they are never recommended for production use because the index is not saved.
If you need something more along the lines of full-text search, check out couchdb-lucene.
What is your programming language of choice?
CouchDB's API is HTTP based.
Basically you could set up a view which uses the username as key and query that via an HTTP request or with the help of a "driver" for your specific language.
Views are defined as map/reduce functions. An easy introduction can be found at the official wiki, for example.
Taking a look at the CouchDB guide is also a good place to start.
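As a sketch of what "a view keyed by username, queried over HTTP" could look like, the Python below builds a design document and the matching GET request without sending it. The design document name users, the view name by_name, the database name and the base URL are assumptions for the example; the map function itself is JavaScript because that is what CouchDB executes.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Design document to store in the database once (for example with a PUT).
# The map function emits each document under its "user" field.
DESIGN_DOC = {
    "_id": "_design/users",
    "views": {
        "by_name": {
            "map": "function (doc) { if (doc.user) { emit(doc.user, null); } }"
        }
    },
}


def view_request(base_url, db, key):
    """Build (but do not send) the GET request that looks up one key in the view."""
    # View keys are JSON values, so a string key is sent with its quotes.
    query = urlencode({"key": json.dumps(key), "include_docs": "true"})
    return Request(f"{base_url}/{db}/_design/users/_view/by_name?{query}")
```

Passing the resulting request to urllib.request.urlopen against a running CouchDB would return the matching rows as JSON; with include_docs=true each row carries the full document.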
I prefer using elasticsearch.
It has a couchdb _river for integration. It will listen to _changes of couchdb, then fetch and index documents.
That way you get the awesome power of elasticsearch (powered by Lucene), with its RESTful interfaces and clustering ability.
You get good separation of "searching" vs. your core documents.
Which means you can index and search across different document stores.
Admittedly you don't get a nice small all in one package, but for flexibility for my use cases it wins hands down.
I have a new project to do this: http://github.com/iriscouch/query_couchdb
(Hopefully I can add an intro and documentation today.)
The idea is to copy the Google App Engine Python API.
new Query("User")
  .filter("name =", "John")
  .order('-age')
  .get(function(er, view) {
    if(er)
      throw(er);
    console.log("Got " + view.rows.length + " rows!");
    for(var a = 0; a < view.rows.length; a++) {
      var row = view.rows[a];
      console.log("Row " + a + " = " + JSON.stringify(row));
    }
  });
Unfortunately it is missing unit tests and examples, but I am already using this in production.
There is an initiative to implement a Mongo-like find with the query syntax offered by MongoDB. Cloudant announced the initiative and started contributing through Mango, a MongoDB-inspired query language interface for Apache CouchDB.
The Cloudant project should allow queries of this type: find({user : "John"}), find({user:{$in : ["Doe", "Smith"]}}) or find({"age": {"$gt": 21}}) for age > 21
A similar alternative, pouchdb-find, is also being developed for PouchDB.
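In CouchDB releases that ship Mango (2.0 and later), these selectors are sent as the JSON body of a POST to the database's _find endpoint. The Python sketch below only builds such a request; the base URL and database name are placeholders, and whether your server exposes _find depends on its version.

```python
import json
from urllib.request import Request


def find_request(base_url, db, selector, limit=1):
    """Build (but do not send) a Mango query against POST /{db}/_find."""
    body = json.dumps({"selector": selector, "limit": limit}).encode("utf-8")
    return Request(
        f"{base_url}/{db}/_find",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

For instance, find_request("http://localhost:5984", "people", {"user": "John"}) corresponds to the find({user : "John"}) call above, and {"age": {"$gt": 21}} can be passed as the selector in the same way.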

RetryPolicy in Subsonic

I am using Subsonic with a SqlAzure database and it's working great. I would like to improve my application by implementing the suggested best practice mentioned in this blog article.
According to the article, I could do something like:
var sqlAzureRetryPolicy = ... code omitted ...;
return sqlAzureRetryPolicy.ExecuteAction<IEnumerable<Product>>(() =>
{
    // Invoke a LINQ query.
    return result;
});
However, this would mean copying and pasting this code snippet all over my solution, which would be tedious and error prone: other members of my team could forget it, etc. I don't think it's the best solution, and I'm wondering if there's a better way.
Does anybody have suggestions on how to do it?
Have you looked into implementing an extension method? You can "add" a method to existing classes (including IEnumerable and IQueryable in your case) to make it look like the method is part of the class. Extension methods can be useful to centralize this type of code. Here is a simple article showing how to extend the string class: http://www.developer.com/net/csharp/article.php/3592216/Using-the-New-Extension-Methods-Feature-in-C-30.htm
I personally use extension methods (I created a TryOpen and TryExecuteReader) to do just what you are asking for, but against SqlCommand and SqlConnection classes. So my code sample won't help you for LINQ.
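The underlying idea, writing the retry logic once and applying it wherever a query runs, can be sketched outside C# as well. Python has no extension methods, so the example below uses a decorator to play the same centralizing role. TransientError and with_retry are hypothetical names, and the policy (a fixed number of attempts with a linearly growing delay) is an assumption, not the policy from the blog article.

```python
import functools
import time


class TransientError(Exception):
    """Stand-in for a transient database failure such as a dropped connection."""


def with_retry(attempts=3, delay=0.0, retry_on=(TransientError,)):
    """Wrap a function so that transient failures are retried in one place."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay * attempt)  # back off a little more each time

        return wrapper

    return decorator
```

A data access function is then declared once with @with_retry(attempts=3, delay=0.5) above it, and every caller gets the retry behavior without repeating the wrapper code.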
