Doctrine2: Limiting with Left Joins / Pagination - Best Practice

I have a big query (in my query builder) with a lot of left joins, so I get articles with their comments, tags, and so on.
Let's say I have the following DQL:
$dql = 'SELECT blogpost, comments, tags
FROM BlogPost blogpost
LEFT JOIN blogpost.comments comments
LEFT JOIN blogpost.tags tags';
Now let's say my database has more than 100 blog posts, but I only want the first 10, with all the comments and tags of those 10, if they exist.
If I use setMaxResults, it limits the rows. So I might get the first two posts, but the last of those is missing some of its comments or tags. So the following doesn't work:
$result = $em->createQuery($dql)->setMaxResults(15)->getResult();
Using the barely documented pagination solution that ships with Doctrine 2.2 doesn't really work for me either, since it is so slow I might as well load all the data.
I tried the solutions in the Stack Overflow article, but even that article is still missing a best practice, and the presented solution is deadly slow.
Isn't there a best practice for how to do this?
Is nobody using Doctrine 2.2 in production?

Getting the proper results with a query like this is problematic. There is a tutorial on the Doctrine website explaining this problem.
Pagination
The tutorial is more about pagination rather than getting the top 5 results, but the overall idea is that you need to do a "SELECT DISTINCT a.id FROM articles a ... LIMIT 5" instead of a normal SELECT. It's a little more complicated than this, but the last 2 points in that tutorial should put you on the right track.
Update:
The problem here is not Doctrine, or any other ORM. The problem lies squarely with the database and what it can return for a query like this; it is just how joins work.
If you run EXPLAIN on the query, it will give you a more in-depth picture of what is happening. It would be a good idea to add its output to your initial question.
Building on what is discussed in the pagination article, it appears you need at least 2 queries to get your desired results. Adding DISTINCT to a query can dramatically slow it down, and it's only really needed when the query contains joins. You could write another query that just retrieves the first 10 posts ordered by creation date, without the joins. Once you have the IDs of those 10 posts, run a second query with your joins and a WHERE blogpost.id IN (...) ORDER BY blogpost.created. This method should be much more efficient.
SELECT
bp
FROM
Blogpost bp
ORDER BY
bp.created DESC
LIMIT 10
Since all you care about in the first query are the IDs, you could set Doctrine to use Scalar Hydration.
SELECT
bp, c, t
FROM
Blogpost bp
LEFT JOIN
bp.comments c
LEFT JOIN
bp.tags t
WHERE
bp.id IN (...)
ORDER BY
bp.created DESC
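The two-query pattern is easy to try out in plain SQL. Below is a minimal, self-contained sketch using Python's sqlite3; the table layout, column names, and data are invented stand-ins for the Doctrine entities, not the asker's real schema:

```python
import sqlite3

# Minimal sketch of the two-query pagination pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blogpost (id INTEGER PRIMARY KEY, created TEXT);
CREATE TABLE comment (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
INSERT INTO blogpost VALUES (1,'2012-01-01'),(2,'2012-01-02'),(3,'2012-01-03');
INSERT INTO comment VALUES (1,1,'a'),(2,1,'b'),(3,2,'c'),(4,3,'d');
""")

# Query 1: fetch only the IDs of the newest posts (no joins, cheap).
ids = [row[0] for row in conn.execute(
    "SELECT id FROM blogpost ORDER BY created DESC LIMIT 2")]

# Query 2: fetch the joined data, restricted to those IDs, so the
# LIMIT cannot truncate a post's comments mid-way.
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    f"SELECT bp.id, c.body FROM blogpost bp "
    f"LEFT JOIN comment c ON c.post_id = bp.id "
    f"WHERE bp.id IN ({placeholders}) ORDER BY bp.created DESC", ids).fetchall()
print(ids)   # [3, 2]
print(rows)  # [(3, 'd'), (2, 'c')]
```

The LIMIT applies only to the join-free first query, so the second query always returns every comment of the selected posts.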
You could also probably do it in one query using a correlated subquery. The idea that subqueries are always bad is a myth: sometimes they are faster than joins. You will need to experiment to find out which solution is best for you.

Edit in light of the clarified question:
You can do what you want in native MySQL using a subquery in the FROM clause as such:
SELECT * FROM
(SELECT * FROM articles ORDER BY date LIMIT 5) AS limited_articles,
comments,
tags
WHERE
limited_articles.article_id=comments.article_id
AND limited_articles.article_id=tags.article_id
As far as I know, DQL does not support subqueries in the FROM clause like this, so you can use Doctrine's NativeQuery class instead.
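The derived-table form can be tried in any SQL engine. Here is a small sqlite3 sketch with made-up data (the table names mirror the example above, but the columns are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (article_id INTEGER PRIMARY KEY, date TEXT);
CREATE TABLE comments (article_id INTEGER, body TEXT);
INSERT INTO articles VALUES (1,'2012-01-01'),(2,'2012-01-02'),(3,'2012-01-03');
INSERT INTO comments VALUES (1,'a'),(3,'b'),(3,'c');
""")

# The subquery in FROM limits the articles first; the join then only
# sees those rows, so no comments of a kept article can be cut off.
rows = conn.execute("""
    SELECT la.article_id, c.body
    FROM (SELECT * FROM articles ORDER BY date DESC LIMIT 2) AS la
    JOIN comments AS c ON c.article_id = la.article_id
""").fetchall()
```

Only article 3 of the two newest articles has comments here, so both of its comments come back and nothing is truncated by the LIMIT.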

Related

Solr sort and limit the results of a sub-query

I am using Solr as my search engine, and what I want to do is sort and limit the result of a subquery. For example, let's say I have an Amazon product review dataset and I want to get all products whose title contains "iphone" OR products in the smart-phone category.
I'd write a Solr query something like "name:iphone OR category:smartphone". However, the problem is that there are too many products in the "smartphone" category, so I want to limit that part to popular products, where popularity is defined by something like reviewCount. That is, for the second subquery, I'd like to sort its results by reviewCount and take only the top K, something like:
name:iphone OR (category:smartphone AND sort:reviewCount desc AND rows=100)
So that I get the products matching "iphone" OR the top 100 popular smartphones.
Does Solr support something like this ?
I'm sorry to tell you that this is not possible. Lucene-based search engines spread indexes over multiple shards. Every shard then calculates matches and scores independently. At the very end, the results become aggregated and the number of result rows is cropped. That's why subqueries do not exist here. You can only boost on the score (which should be preferred over sorting) or make two parallel requests and combine the results on the client side (which should be fairly easy with your example).
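The "two parallel requests, combine client-side" option from the answer can be sketched in a few lines. The two lists below stand in for the documents returned by the hypothetical requests `q=name:iphone` and `q=category:smartphone&sort=reviewCount desc&rows=100`; the field names are assumptions:

```python
# Stand-ins for two parallel Solr responses.
iphone_hits = [{"id": "p1", "reviewCount": 10}, {"id": "p2", "reviewCount": 3}]
popular_smartphones = [{"id": "p2", "reviewCount": 3}, {"id": "p3", "reviewCount": 99}]

# OR-combine the two result sets: union by id, keeping the first
# occurrence of each document so duplicates are dropped.
seen, merged = set(), []
for doc in iphone_hits + popular_smartphones:
    if doc["id"] not in seen:
        seen.add(doc["id"])
        merged.append(doc)

print([d["id"] for d in merged])  # ['p1', 'p2', 'p3']
```

Since the second request already sorts and crops to the top 100 on the server, the client only has to deduplicate the union.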

How to better implement a more complex sorting strategy

I have an application with posts. Those posts are shown in the home view in descending order by creation date.
I want to implement a more complex sorting strategy based on, for example, posts from users who have more posts, or posts with more likes or views. Nothing complex, simple things, with some randomness mixed in: say I take the 100 most-liked posts and pick 10 of them.
I don't want to do this in the same query, since I don't want to affect its performance. I am using MongoDB, and I would need a lookup, which is not advisable in the most critical query of the app.
What would be the best approach to implement this?
I thought of doing all those calculations with, for example, AWS Lambda, or maybe triggers in MongoDB Atlas, every 30 seconds, and storing the result in the database, where the main query could consume it.
That way, every 30 seconds the first 30 posts would be updated according to the criteria.
I don't really know if this is a good approach. I need something simple that can still "mix" the posts and show first the ones that comply with the criteria.
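The precompute step described above ("take the 100 most-liked posts, pick 10 at random") is small enough to sketch without any database. Here the posts are plain dicts standing in for MongoDB documents; a scheduled job (cron, AWS Lambda, an Atlas trigger) would store the result in its own collection so the home query just reads it:

```python
import random

# Stand-in posts; in the real app these would come from MongoDB.
posts = [{"id": i, "likes": random.randrange(1000)} for i in range(500)]

# Take the 100 most-liked posts, then sample 10 at random to "mix" them.
top100 = sorted(posts, key=lambda p: p["likes"], reverse=True)[:100]
featured = random.sample(top100, 10)

# The periodic job would write `featured` to a small "featured_posts"
# collection; the critical home query then reads that collection and
# never pays for the ranking aggregation itself.
```

This keeps the expensive ranking out of the hot path entirely, at the cost of results being up to one refresh interval stale.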
Thanks!

MariaDB: group_concat with subquery failing

With the exact same database structure and data on MySQL 5.6.34 (my new dev server) and MariaDB 10.2.8 (my new production server, where I thought I was finally deploying the code today - sigh!), MySQL is working and MariaDB is not. This is code that has been working fine for years on MySQL 5.0.95. I have simplified my query to the minimum example that shows the problem - it seems that GROUP_CONCAT() and subqueries do not mix. Here is the query:
SELECT person.PersonID,
GROUP_CONCAT(CategoryID ORDER BY CategoryID SEPARATOR ',') AS categories
FROM person LEFT JOIN percat ON person.PersonID=percat.PersonID
WHERE person.PersonID IN (SELECT PersonID FROM action WHERE ActionTypeID=3)
GROUP BY person.PersonID
Screenshots in the original question showed the structure of all three tables involved, the correct result and EXPLAIN on MySQL, and the incorrect result on MariaDB.
I don't know the inner workings of the DB engine well enough to follow EXPLAIN, but I assume the clue is in there somewhere. I found this bug report that sounds related, but I don't really understand what they're saying about it, and more importantly, what I should do about it.
This is a bug. Apparently it is not quite the same as the one you found (the test case from that bug report works fine on 10.2.8, while yours, indeed, does not). Please feel free to report a new one in the MariaDB JIRA.
Meanwhile, I think you should be able to work around it by setting
optimizer_switch=orderby_uses_equalities=off
in your cnf file. It's a newly enabled optimization, obviously not flawless.
UPDATE: The bug is now reported as https://jira.mariadb.org/browse/MDEV-13694
Workaround: This won't answer why there is a difference, but you should add DISTINCT to the GROUP_CONCAT.
The "why": Probably the answer is rooted deep in the optimizers. There have been a lot of changes since 5.0: 5.6 had a lot of new code, and at the same time MariaDB was forking off into 10.0. In that forking, the optimizers diverged significantly. 10.2 has moved further forward, but not necessarily in optimizing this type of query.
Improved query: There are several things that could be done to the query. Some will probably make it faster:
SELECT p.PersonID,
( SELECT GROUP_CONCAT(pc.CategoryID
ORDER BY pc.CategoryID SEPARATOR ',')
FROM percat AS pc
WHERE pc.PersonID = p.PersonID
) AS categories
FROM person AS p
JOIN action AS a ON p.PersonID = a.PersonID
WHERE a.ActionTypeID = 3
GROUP BY p.PersonID
Transforming the LEFT JOIN will probably decrease the load on GROUP BY. Possibly the GROUP BY can be removed.
Because of PRIMARY KEY(PersonID, CategoryID), there should be no need for DISTINCT.
Needed index: This "covering index" would speed things up more: INDEX(ActionTypeID, PersonID).
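The correlated-subquery rewrite can be checked against a toy dataset in sqlite3, which also supports GROUP_CONCAT (the data below is invented; the ORDER BY/SEPARATOR options are omitted since sqlite's GROUP_CONCAT does not accept them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (PersonID INTEGER PRIMARY KEY);
CREATE TABLE percat (PersonID INTEGER, CategoryID INTEGER,
                     PRIMARY KEY (PersonID, CategoryID));
CREATE TABLE action (PersonID INTEGER, ActionTypeID INTEGER);
INSERT INTO person VALUES (1),(2);
INSERT INTO percat VALUES (1,10),(1,20),(2,30);
INSERT INTO action VALUES (1,3),(2,3),(2,3);
""")

# Correlated subquery builds the category list per person; the join on
# action restricts to ActionTypeID = 3, and GROUP BY collapses the
# duplicate action rows for person 2.
rows = conn.execute("""
    SELECT p.PersonID,
           (SELECT GROUP_CONCAT(pc.CategoryID)
              FROM percat AS pc
             WHERE pc.PersonID = p.PersonID) AS categories
      FROM person AS p
      JOIN action AS a ON p.PersonID = a.PersonID
     WHERE a.ActionTypeID = 3
     GROUP BY p.PersonID
""").fetchall()
```

Each person comes back exactly once with all of their categories, even though person 2 has two matching action rows.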

Query on all columns cassandra

I have close to six tables, each of which has from 20 to 60 columns, in Cassandra. I am designing the schema for this database.
The requirement is that all the columns must be queryable individually.
I know that if the data has high cardinality, using secondary indexes is not encouraged.
Materialized views would solve my purpose to an extent, letting me query on other columns as well.
My question is:
In this scenario, if each table has 30 to 50+ materialized views, is that an okay pattern to follow, or is it a totally wrong track? Is it taking this functionality to its extreme? Maybe writes will start to become expensive on the system (I know materialized views are updated asynchronously, not with the immediate write to the base table).
You definitely do not want 30 to 50 materialized views.
It sounds like the use case you're trying to satisfy is search, more so than a specific query.
If the queries that are going to be done on each column can be predefined, then you can also go the denormalization route, trading search flexibility for better performance and less operational overhead.
If you're interested in the search route, here's what I suggest you take a look at:
SASI Indexes (depending on Cassandra version you're using)
Elastic Search
Solr
DataStax Enterprise Search (disclaimer: I work for DataStax)
Elassandra
Stratio
Those are just the ones I know off the top of my head; there may be others (sorry if I missed yours). Look into each so you can make your own informed decision about which makes more sense for your use case.

Multiple queries in Solr

My problem is that I have n fields (say around 10) in Solr that are searchable; they are all indexed and stored. I would like to run a query first on my whole index of, say, 5000 docs, which will hit around 500 docs on average. Next I would like to query, using a different set of keywords, on those 500 docs and NOT on the whole index.
So the first time I send a query, a score is generated; the second time I run a query, the new score should be based on the 500 documents of the previous query. In other words, Solr should treat only those 500 docs as the whole index.
To summarise: an index of 5000 will be filtered to 500 and then 50 (5000 > 500 > 50). It's basically filtering, but I would like to do this in Solr.
I have reasonable basic knowledge and still learning.
Update: If represented mathematically it would look like this:
results1=f(query1)
results2=f(query2, results1)
final_results=f(query3, results2)
I would like this to be accomplished programmatically; the end user will only see 50 results. So faceting is not an option.
Two likely implementations occur to me. The simplest approach would be to just add the first query to the second query:
+(first query) +(new query)
This is a good approach if the first query, which you want to filter on, changes often. If the first query is something like a category of documents, or something similar where you can benefit from reuse of the same filter, then a filter query is the better approach, using the fq parameter, something like:
q=field:query2&fq=categoryField:query1
Filter queries cache a set of document IDs to filter against, so for commonly used searches (categories, common date ranges, etc.) a significant performance benefit can be gained. For uncommon searches or user-entered search strings, it may just incur needless overhead to cache the results and pollute the cache with a useless result set.
Filter queries (fq) are specifically designed to do quick restriction of the result set by not doing any score calculation.
So, if you put your first query into fq parameter and your second score-generating query in the normal 'q' parameter, it should do what you ask for.
See also a question discussing this issue from the opposite direction.
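The q/fq split is just a matter of which request parameter each query goes into. A minimal sketch of assembling such a request with the standard library (the field names are illustrative, and actually sending the string to a Solr /select endpoint is omitted):

```python
from urllib.parse import urlencode

# The score-generating query goes in q; the cached, non-scoring
# restriction (the "first query") goes in fq.
params = {
    "q": "text:roses",       # second, score-generating query
    "fq": "category:poems",  # first query, cached as a filter
    "rows": 50,
}
query_string = urlencode(params)
print(query_string)  # q=text%3Aroses&fq=category%3Apoems&rows=50
```

Because fq never contributes to scoring, only the q part determines the ranking of the final 50 results.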
I believe you want to use a nested query like this:
text:"roses are red" AND _query_:"type:poems"
You can read more about nested queries here:
http://searchhub.org/2009/03/31/nested-queries-in-solr/
You should take a look at "faceted search" in Solr: http://wiki.apache.org/solr/SolrFacetingOverview This will help with this kind of "iterative" search.
