MariaDB: GROUP_CONCAT with subquery failing

With the exact same database structure and data on MySQL 5.6.34 (my new dev server) and MariaDB 10.2.8 (my new production server, where I thought I was finally deploying the code today - sigh!), MySQL is working and MariaDB is not. This is code that has been working fine for years on MySQL 5.0.95. I have simplified my query to the minimum example that shows the problem - it seems that GROUP_CONCAT() and subqueries do not mix. Here is the query:
SELECT person.PersonID,
GROUP_CONCAT(CategoryID ORDER BY CategoryID SEPARATOR ',') AS categories
FROM person LEFT JOIN percat ON person.PersonID=percat.PersonID
WHERE person.PersonID IN (SELECT PersonID FROM action WHERE ActionTypeID=3)
GROUP BY person.PersonID
Here is a composite image of screenshots showing the structure of all three tables involved: [screenshots omitted]
On MySQL, it works fine, as it has worked for years. Here are the result and EXPLAIN: [screenshot omitted]
And this is the crazy result I get on MariaDB: [screenshot omitted]
I don't know the inner workings of the DB engine well enough to follow EXPLAIN, but I assume the clue is in there somewhere. I found this bug report that sounds related, but I don't really understand what they're saying about it, and more importantly, what I should do about it.

This is a bug. Apparently it is not quite the same as the one you found (the test case from the bug report you mention works fine on 10.2.8, while yours, indeed, does not). Please feel free to report a new one in the MariaDB JIRA.
Meanwhile, I think you should be able to work around it by setting
optimizer_switch=orderby_uses_equalities=off
in your cnf file. It's a newly enabled optimization, obviously not flawless.
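If you'd rather test the workaround without editing the config file and restarting, the same switch can be flipped at runtime (standard MariaDB syntax; verify the flag name on your build):
-- Session scope: affects only the current connection, handy for a quick test.
SET SESSION optimizer_switch = 'orderby_uses_equalities=off';
-- Global scope: affects new connections until the server restarts.
SET GLOBAL optimizer_switch = 'orderby_uses_equalities=off';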
UPDATE: The bug is now reported as https://jira.mariadb.org/browse/MDEV-13694

Workaround
This won't answer why there is a difference, but you should add DISTINCT to the GROUP_CONCAT.
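Applied to the query from the question (presumably the bad plan feeds duplicate rows into the aggregate, which DISTINCT collapses):
SELECT person.PersonID,
       GROUP_CONCAT(DISTINCT CategoryID ORDER BY CategoryID SEPARATOR ',') AS categories
FROM person
LEFT JOIN percat ON person.PersonID = percat.PersonID
WHERE person.PersonID IN (SELECT PersonID FROM action WHERE ActionTypeID = 3)
GROUP BY person.PersonID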
The "why"
Probably the answer is rooted deep in the optimizers. There have been a lot of changes since 5.0: 5.6 added a lot of new code, and around the same time MariaDB was forking off into 10.0. In that fork, the optimizers diverged significantly. 10.2 has moved farther forward, but not necessarily in optimizing this type of query.
Improved query
There are several things that could be done to the query; some will probably make it faster:
SELECT p.PersonID,
       ( SELECT GROUP_CONCAT(pc.CategoryID
                    ORDER BY pc.CategoryID SEPARATOR ',')
           FROM percat AS pc
           WHERE pc.PersonID = p.PersonID
       ) AS categories
FROM person AS p
JOIN action AS a ON p.PersonID = a.PersonID
WHERE a.ActionTypeID = 3
GROUP BY p.PersonID
Transforming the LEFT JOIN will probably decrease the load on GROUP BY. Possibly the GROUP BY can be removed.
Because of PRIMARY KEY(PersonID, CategoryID), there should be no need for DISTINCT.
Needed index
This "covering index" would speed things up further: INDEX(ActionTypeID, PersonID).
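Presumably the index belongs on the action table, since that is the table the IN subquery filters; the index name below is made up:
ALTER TABLE action
    ADD INDEX idx_actiontype_person (ActionTypeID, PersonID);
-- "Covering" means the subquery (SELECT PersonID FROM action WHERE ActionTypeID = 3)
-- can be answered from the index alone, without reading any table rows.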

How to use database views in EF Core 3.0?

I know this question was asked before, but at the time it was, we had EF Core 2.x. The short answer was "no you can't", which obviously was not very helpful.
The other answers involved ugly hacks like changing migration files after they were created by the tool.
I'm building an application Code First. I created my models with lots of foreign keys and database joins in mind.
But here comes the unpleasant surprise (I'm a little new to EF): those joins written in LINQ are pretty slow; in fact, they do not produce a database JOIN at all, but fetch whole tables instead.
Of course this is totally unacceptable. I'm importing an old database with millions of records; with the joins I get results in milliseconds, without them I get lags of a couple of seconds, and that is on my very fast internet connection (in a real-world scenario it would be much worse).
I need views, and AFAIK EF won't create them for me. Is that still true for EF Core 3.0?
And if so, what would be the best and cleanest way to create the views in SQL and to make entities for them, considering that the models will change over time and the database structure will have to be updated along with them?
Well, I would prefer not to do my joins in SQL views; I just want queries that return "JOIN" statement results, especially for some non-obvious joins. Let's say table B has a column that is a foreign key referencing table A. I want to get results from table A joined with B for details, with normal SQL JOIN performance.
I checked the database: there is no significant performance difference between "select * from A" and "select * from A join B...". In LINQ - the difference is huge.
I figured out that in Code First database views are redundant.
The "views" can be created as models (ordinary classes) that have a field or property set to a joined entity. I use private fields for that purpose, then use a LINQ Join() to create my view entity. The query may refer ONLY to the fields set to joined entities, nothing else. Such a query, if written properly, translates cleanly to a SQL JOIN and runs at full speed. In my application it is the equivalent of a database view.
Why private fields and not properties, you may ask. Partly because the joined entities are "implementation details", but another reason is that my presentation code uses reflection to operate on entities' public properties, so it's good to keep those joined entities hidden from it. Otherwise I would probably need to use attributes to hide those "columns".
BTW, such views can be ordered with OrderBy() and filtered with Where() at virtually no cost. The constraint is to maintain the collection's IQueryable interface and never refer to joined entities indirectly: even if X refers to A.B, never use X in a LINQ query, always A.B, where A is the direct entity reference assigned in the Join() query.
To build dynamic queries at runtime one must use expressions.
This set of properties of EF Core 3.0 makes it possible to build a database application without using SQL while keeping full SQL speed. However, the database/entity structure must be relatively simple to achieve that.

Azure SQL Database Autotuning - Develop without worrying about indexes?

Azure SQL Database's auto-tuning feature creates and drops indexes based on database usage. I've imported an old database into Azure that did not have comprehensive indexes defined, and auto-tuning seems to have done a great job of reducing CPU and DTU usage over a relatively short period of time.
It feels wrong, but does this mean I can develop going forward without defining indexes? Does anyone do this? The SSMS index editor is a pain and slow to work with; not having to worry about or maintain indexes would speed up development.
Auto-tuning takes advantage of three things: the Query Store, missing-index suggestions, and machine learning.
First, the last known good plan is a weaponization of the Query Store. This is in both Azure and SQL Server 2017. It will spot queries whose performance has degraded after a plan change (over quite a few executions, not just one) and will revert to the last known good plan. If performance degrades again, it turns the forced plan off. This is great. However, if you write crap code or have bad data structures or out-of-date statistics, it doesn't help very much.
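For what it's worth, you can also switch this behavior on per database yourself; this is the documented T-SQL for Azure SQL Database and SQL Server 2017 (double-check availability on your tier):
-- Enable automatic plan correction (revert to the last known good plan).
ALTER DATABASE CURRENT
    SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);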
The automatic indexes in Azure use two things: missing-index suggestions from the optimizer, and machine learning on Azure. With those, if the same missing index comes up a lot over a period of 12-18 hours (read this blog post on automating it), you'll get an index suggestion. It measures for another 12-18 hours, and if the index helped, it stays; if not, it goes. This is also great. However, it suffers from two problems. First, same as before, if you have crap code, etc., this will only really help at the margins. Second, the missing-index suggestions from the optimizer are not always the best index. For example, when I wrote the blog post, it identified a missing index appropriately, but it missed the fact that an INCLUDE column would have been even better than the index it suggested.
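To illustrate that last point, here is a sketch with made-up table, column, and index names:
-- What the missing-index suggestion typically looks like: key columns only.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID);
-- Often better: INCLUDE the columns the query actually selects, so the
-- index covers the query and avoids key lookups back into the table.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_Covering
    ON dbo.Orders (CustomerID)
    INCLUDE (OrderDate, TotalDue);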
A human brain and eyeball are still going to be needed to solve the more difficult problems. These automations take away a lot of the easier, low-hanging problems. Overall, that's a great thing. However, don't confuse it with a panacea for all things performance-related. It's absolutely not.

Querying on all columns in Cassandra

I have close to six tables in Cassandra, each with 20 to 60 columns. I am designing the schema for this database.
The requirement is that all the columns must be queryable individually.
I know that using secondary indexes is discouraged when the data has high cardinality.
Materialized views would serve my purpose to an extent, letting me query on the other columns as well.
My question is:
In this scenario, where each table would have 30 to 50+ materialized views, is that an okay pattern to follow, or is it going down a totally wrong track and taking the functionality to its extreme? Presumably writes will start to become expensive (I know the views are written eventually, not with the immediate write to the base table).
You definitely do not want 30 to 50 materialized views.
It sounds like the use case you're trying to satisfy is search, more so than a specific query.
If the queries that will be run on each column can be pre-defined, then you can also go the denormalization route, trading flexibility of search for better performance and less operational overhead.
If you're interested in the search route, here's what I suggest you take a look at:
SASI indexes (depending on the Cassandra version you're using; see the example after this list)
Elastic Search
Solr
DataStax Enterprise Search (disclaimer I work for DataStax)
Elassandra
Stratio
Those are just the ones I know off the top of my head; there may be others (sorry if I missed you). Look into each of them so you can make your own informed decision as to which makes the most sense for your use case.
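As a concrete example of the first option in the list above, a SASI index is created in CQL roughly like this (the keyspace, table, and column names are placeholders; SASI requires Cassandra 3.4+):
-- Allows substring-style searches on person.name without a separate search engine.
CREATE CUSTOM INDEX person_name_sasi
    ON my_keyspace.person (name)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = { 'mode': 'CONTAINS' };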

Doctrine2: Limiting with Left Joins / Pagination - Best Practice

I have a big query (in my query builder) with a lot of left joins, so I get articles with their comments and tags and so on.
Let's say I have the following DQL:
$dql = 'SELECT blogpost, comments, tags
FROM BlogPost blogpost
LEFT JOIN blogpost.comments comments
LEFT JOIN blogpost.tags tags';
Now let's say my database has more than 100 blog posts, but I only want the first 10, with all the comments of those 10 and all their tags, if they exist.
If I use setMaxResults, it limits the rows, so I might get the first two posts but with the last one missing some of its comments or tags. So the following doesn't work:
$result = $em->createQuery($dql)->setMaxResults(15)->getResult();
Using the barely documented pagination solution that ships with Doctrine 2.2 doesn't really work for me either, since it is so slow I might as well load all the data.
I tried the solutions in the Stack Overflow article, but even that article is still missing a best practice, and the presented solution is terribly slow.
Isn't there a best practice for how to do this?
Is nobody using Doctrine 2.2 in production?
Getting the proper results with a query like this is problematic. There is a tutorial on the Doctrine website explaining this problem.
Pagination
The tutorial is more about pagination rather than getting the top 5 results, but the overall idea is that you need to do a "SELECT DISTINCT a.id FROM articles a ... LIMIT 5" instead of a normal SELECT. It's a little more complicated than this, but the last 2 points in that tutorial should put you on the right track.
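In raw SQL terms, the idea looks roughly like this; the table and column names are assumptions borrowed from the native-SQL example at the end of this answer:
-- Step 1 of the tutorial's approach: fetch only the IDs of the 5 newest
-- articles. DISTINCT collapses the duplicate article rows produced by the
-- joins; a.date is selected too so the ORDER BY is valid in strict modes.
SELECT DISTINCT a.article_id, a.date
FROM articles a
LEFT JOIN comments c ON c.article_id = a.article_id
LEFT JOIN tags t ON t.article_id = a.article_id
ORDER BY a.date DESC
LIMIT 5;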
Update:
The problem here is not Doctrine, or any other ORM. The problem lies squarely on the database being able to return the results you're asking for. This is just how joins work.
If you do an EXPLAIN on the query, it will give you a more in depth answer of what is happening. It would be a good idea to add the results of that to your initial question.
Building on what is discussed in the Pagination article, it would appear that you need at least 2 queries to get your desired results. Adding DISTINCT to a query has the potential to dramatically slow it down, but it's only really needed if you have joins in it. You could write one query that just retrieves the first 10 posts, ordered by created date, without the joins. Once you have the IDs of those 10 posts, do another query with your joins and a WHERE blogpost.id IN (...) ORDER BY blogpost.created. This method should be much more efficient.
SELECT
    bp
FROM
    Blogpost bp
ORDER BY
    bp.created DESC
Note that DQL has no LIMIT clause; apply the limit with setMaxResults(10) on the query object.
Since all you care about in the first query are the IDs, you could set Doctrine to use scalar hydration (Query::HYDRATE_SCALAR).
SELECT
    bp, c, t
FROM
    Blogpost bp
LEFT JOIN
    bp.comments c
LEFT JOIN
    bp.tags t
WHERE
    bp.id IN (...)
ORDER BY
    bp.created DESC
You could also probably do it in one query using a correlated subquery. The idea that subqueries are always bad is a myth: sometimes they are faster than joins. You will need to experiment to find out which solution is best for you.
Edit in light of the clarified question:
You can do what you want in native MySQL using a subquery in the FROM clause, like this:
SELECT * FROM
    (SELECT * FROM articles ORDER BY date LIMIT 5) AS limited_articles,
    comments,
    tags
WHERE
    limited_articles.article_id = comments.article_id
    AND limited_articles.article_id = tags.article_id
As far as I know, DQL does not support subqueries like this, so you can use the NativeQuery class.

Can I query any attribute in a Windows Azure Tablestorage row?

Sorry if this sounds like a rather dumb question, but I would like to do a "select" on data from a Windows Azure table. I tried the following and it worked:
from status in _statusTable.GetAll()
where status.RowKey.StartsWith(name)
select status
I then tried:
from status in _statusTable.GetAll()
where status.Description.StartsWith(name)
select status
This one gave me nothing. Can anyone explain whether and how I can query on properties that are not part of the RowKey or PartitionKey?
You can query on any property, but the types of query supported are limited; e.g. StartsWith isn't supported server-side. Also, if you aren't querying on PartitionKey and RowKey, there are some very important performance issues to understand, and you always need to be aware of ContinuationTokens: almost any query result can contain them.
You can see the sorts of queries supported by looking at the REST API: http://msdn.microsoft.com/en-us/library/dd894031.aspx - it's pretty limited (but quick as a result):
Equal
GreaterThan
GreaterThanOrEqual
LessThan
LessThanOrEqual
NotEqual
If you need to do more, then:
you can mimic things like StartsWith("Fred") by combining GreaterThanOrEqual("Fred") with LessThan("Free"), since "Free" is the first string that sorts after every string beginning with "Fred"
or client-side filtering will work, but that means pulling back all the rows from storage, which could be a lot of data and could be computationally and transactionally expensive!
What does GetAll() do? StartsWith isn't supported by Windows Azure tables, so I'm assuming GetAll() pulls all the data down locally and your query runs over objects in memory. If so, this has nothing to do with Windows Azure, and I'd take a look at whether your data looks like you expect it to.
