Hoping that someone here will be able to provide some mysql advice...
I am working on a categorical searchtag system. I have tables like the following:
EXERCISES
    exerciseID
    exerciseTitle
SEARCHTAGS
    searchtagID
    parentID ( -> searchtagID)
    searchtag
EXERCISESEARCHTAGS
    exerciseID (Foreign key -> EXERCISES)
    searchtagID (Foreign key -> SEARCHTAGS)
Searchtags can be arranged in an arbitrarily deep tree. So for example I might have a tree of searchtags that looks like this...
Body Parts
    Head
    Neck
    Arm
        Shoulder
        Elbow
    Leg
        Hip
        Knee
Muscles
    Pecs
    Biceps
    Triceps
Now...
I want to select all of the searchtags in ONE branch of the tree that reference at least ONE record in the subset of records referenced by a SINGLE searchtag in a DIFFERENT branch of the tree.
For example, let's say the searchtag "Arm" points to a subset of exercises. If any of the exercises in that subset are also referenced by searchtags from the "Muscles" branch of SEARCHTAGS, I would like to select for them. So my query could potentially return "Biceps," "Triceps".
Two questions:
1) What would the SELECT query for something like this look like? (If such a thing is even possible without creating a lot of slow down. I'm not sure where to start...)
2) Is there anything I should do to tweak my datastructure to ensure this query will continue to run fast - even as the tables get big?
Thanks in advance for your help, it's much appreciated.
An idea: consider using a cache table that saves all ancestor relationships in your searchtags:
CREATE TABLE SEARCHTAGRELATIONS (
    parentID INT,
    descendantID INT
);
Also include the tag itself as both parent and descendant (so, for the searchtag with ID 1, the relations table includes a row with (1,1)).
That way, you get rid of the parent/descendant relationships and can join a flat table. Assuming "Muscles" has the ID 5,
SELECT descendantID FROM SEARCHTAGRELATIONS WHERE parentID=5
returns all searchtags contained in muscles.
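A minimal sketch of how the relations table could answer the original question. It uses Python's built-in sqlite3 module in place of MySQL, and the sample rows and IDs are made up for illustration. It also assumes that "exercises referenced by Arm" should include exercises tagged with any descendant of Arm; if only the single tag counts, the inner join on SEARCHTAGRELATIONS can be replaced by a plain e.searchtagID = ? test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SEARCHTAGS (searchtagID INTEGER PRIMARY KEY, parentID INT, searchtag TEXT);
CREATE TABLE SEARCHTAGRELATIONS (parentID INT, descendantID INT);
CREATE TABLE EXERCISESEARCHTAGS (exerciseID INT, searchtagID INT);
""")

# Hypothetical sample tree: Body Parts(1) > Arm(2); Muscles(5) > Pecs(6), Biceps(7), Triceps(8)
tags = [(1, None, "Body Parts"), (2, 1, "Arm"), (5, None, "Muscles"),
        (6, 5, "Pecs"), (7, 5, "Biceps"), (8, 5, "Triceps")]
conn.executemany("INSERT INTO SEARCHTAGS VALUES (?, ?, ?)", tags)

# Fill the relations table: one row per (ancestor, descendant) pair, including (tag, tag).
parent_of = {tag_id: parent for tag_id, parent, _ in tags}
for tag_id in parent_of:
    ancestor = tag_id
    while ancestor is not None:
        conn.execute("INSERT INTO SEARCHTAGRELATIONS VALUES (?, ?)", (ancestor, tag_id))
        ancestor = parent_of[ancestor]

# Hypothetical exercises: 100 and 101 are tagged Arm; 100 is also Biceps,
# 101 is also Triceps, and 102 is tagged Pecs only.
conn.executemany("INSERT INTO EXERCISESEARCHTAGS VALUES (?, ?)",
                 [(100, 2), (101, 2), (100, 7), (101, 8), (102, 6)])

# Tags in one branch (first parameter) that share at least one exercise
# with the selected tag from another branch (second parameter).
QUERY = """
SELECT DISTINCT t.searchtag
FROM SEARCHTAGRELATIONS branch
JOIN EXERCISESEARCHTAGS et ON et.searchtagID = branch.descendantID
JOIN SEARCHTAGS t ON t.searchtagID = branch.descendantID
WHERE branch.parentID = ?
  AND et.exerciseID IN (
      SELECT e.exerciseID
      FROM EXERCISESEARCHTAGS e
      JOIN SEARCHTAGRELATIONS sel ON sel.descendantID = e.searchtagID
      WHERE sel.parentID = ?)
ORDER BY t.searchtag
"""
result = [row[0] for row in conn.execute(QUERY, (5, 2))]  # 5 = Muscles, 2 = Arm
print(result)
```

With this sample data the query picks out Biceps and Triceps but not Pecs. On large tables, indexes on SEARCHTAGRELATIONS (parentID, descendantID) and on both columns of EXERCISESEARCHTAGS would probably be the first thing to try.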
Alternatively, use modified preorder tree traversal, also known as the nested set model. It requires two fields (left and right) instead of one (parent id), and makes certain operations harder, but makes selecting whole branches much easier.
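To illustrate the nested set idea, here is a rough sketch in Python with sqlite3 standing in for MySQL. The table name NESTEDTAGS, the column names lft and rgt (LEFT and RIGHT are reserved words in MySQL), the artificial "All" root, and the sample tree are all assumptions made for the example.

```python
import sqlite3

# Hypothetical tree as (name, children) pairs.
tree = ("All", [
    ("Body Parts", [("Arm", []), ("Leg", [])]),
    ("Muscles", [("Pecs", []), ("Biceps", []), ("Triceps", [])]),
])

def number(node, counter, rows):
    """Assign left/right values in a preorder walk; return the next free counter."""
    name, children = node
    left = counter
    counter += 1
    for child in children:
        counter = number(child, counter, rows)
    rows.append((name, left, counter))  # counter is now this node's right value
    return counter + 1

rows = []
number(tree, 1, rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NESTEDTAGS (searchtag TEXT, lft INT, rgt INT)")
conn.executemany("INSERT INTO NESTEDTAGS VALUES (?, ?, ?)", rows)

# A whole branch is every row whose lft lies inside the branch root's interval.
branch = [row[0] for row in conn.execute("""
    SELECT child.searchtag
    FROM NESTEDTAGS parent
    JOIN NESTEDTAGS child ON child.lft BETWEEN parent.lft AND parent.rgt
    WHERE parent.searchtag = ?
    ORDER BY child.lft
""", ("Muscles",))]
print(branch)
```

Selecting a branch is a single range condition, but inserting or moving a tag means renumbering part of the tree, which is the "harder" part mentioned above.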
Quite often, the new feature of multisetAgg is used along with LEFT JOINs.
Let's say, I have a user as dimension table and fact table paid_subscriptions. I want to query a specific user with all of his paid subscriptions and for each subscription do some processing (like sending an email or whatever).
I would write some JOOQ like this:
ctx
    .select(row(
        USER.ID,
        USER.USERNAME,
        multisetAgg(PAIDSUBSCRIPTIONS.SUBNAME).as("subscr").convertFrom(r -> r.intoSet(Record1::value1))
    ).mapping(MyUserWithSubscriptionPOJO::new)
    )
    .from(USER)
    .leftJoin(PAIDSUBSCRIPTIONS).onKey()
    .where(someCondition)
    .groupBy(USER)
    .fetch(Record1::value1);
The problem here is: the multisetAgg produces a Set which can contain null as element.
I either have to filter out the null subscriptions I don't care about after the JOOQ select, or I have to rewrite my query with something like this:
multisetAgg(PAIDSUBSCRIPTIONS.SUBNAME).as("subscr").convertFrom(r -> {
    final Set<String> res = r.intoSet(Record1::value1);
    res.remove(null); // remove possible nulls
    return res;
})
Both don't look too nice in code.
I wonder if there is a better approach to write this with less code, or even automatic filtering of null values or some other kind of syntactic sugar available in JOOQ? After all, I think it is quite a common use case, especially considering that often enough I end up with some Java 8 style stream processing of my left-joined collection, and the first step is to filter out null, which is something I often forget :)
You're asking for a few things here:
SET instead of MULTISET (will be addressed with #12033)
Adding NULL filtering (is already possible with FILTER)
The implied idea that such NULL values could be removed automatically (might be addressed with #13776)
SET instead of MULTISET
The SQL standard has some notions of a SET as opposed to MULTISET or ARRAY. For example:
#13795
It isn't as powerful as MULTISET, and it doesn't have to be, because usually, just by adding DISTINCT you can turn any MULTISET into a SET. Nevertheless, Informix (possibly the most powerful ORDBMS) does have SET data types and constructors:
LIST (ARRAY)
MULTISET
SET
So, we might add support for this in the future, perhaps. I'm not sure yet of its utility, as opposed to using DISTINCT with MULTISET (already possible) or MULTISET_AGG (possible soon):
#12033
Adding NULL filtering
You already have the FILTER clause to do this directly in SQL. It's a SQL standard and supported by jOOQ natively, or via CASE emulations. A native SQL example, as supported by e.g. PostgreSQL:
select
  t.a,
  json_agg(u.c),
  json_agg(u.c) filter (where u.b is not null)
from (values (1), (2)) t (a)
left join (values (2, 'a'),(2, 'b'),(3, 'c'),(3, 'd')) u (b, c) on t.a = u.b
group by t.a
Producing:
|a |json_agg |json_agg |
|---|----------|----------|
|1 |[null] | |
|2 |["a", "b"]|["a", "b"]|
So, just write:
multisetAgg(PAIDSUBSCRIPTIONS.SUBNAME).filterWhere(PAIDSUBSCRIPTIONS.ID.isNotNull())
The implied idea that such NULL values could be removed automatically
Note, I understand that you'd probably like this to be done automatically. There's a thorough discussion on that subject here: #13776. As always, it's a desirable thing that is far from easy to implement consistently.
I'm positive that this will be done eventually, but it's a very big change.
I have some data that is processed and modeled with case classes, and the classes can also contain other case classes, so the final table has complex data: struct, array. Using the case class, I save the data in Hive using dataframe.saveAsTextFile(path).
This data sometimes changes or needs to have a different model, so for each iteration I use a suffix in the table name (some_data_v01, some_data_v03, etc.).
I also have queries that run on a schedule on these tables, using Impala, so in order not to modify the query each time I save a new table, I wanted to use a view that is updated whenever I change the model.
The problem with that is I can't use Impala to create the view, because of the complex nature of the data in the tables (nested complex types). Apart from being a lot of work to expand the complex types, I want these types to be preserved (lots of level of nesting, duplication of data when joining arrays).
One solution was to create the view using Hive, like this:
create view some_data as select * from some_data_v01;
But if I do this, when I want to use the table from Impala,
select * from some_data;
or even something simple, like
select some_value_not_nested, struct_type.some_int, struct_type.some_other_int from some_data;
the error is the following:
AnalysisException: Expr 'some_data_v01.struct_type' in select list returns a complex type
'STRUCT< some_int:INT, some_other_int:INT, nested_struct:STRUCT< nested_int:INT, nested_other_int:INT>, last_int:INT>'. Only scalar types are allowed in the select list.
Is there any way to access this view, or create it in some other way for it to work?
I need some help with a problem related to complex data types in C#. I have the following kind of data and want to hold it in a variable, but it needs to be performance-efficient, because I have to search it and it will contain a lot of data. A data sample follows:
ParentNode1
    ChildNode1
    ChildNode2
    ChildNode3
ParentNode2
ParentNode3
ParentNode4
    ChildNode1
    ChildNode2
        Node1
        Node2
        Node3
            Nth level Node1
    ChildNode3
ParentNode5
The data above is just a sample to show the hierarchy. I'm not sure which of nested List, Dictionary, IEnumerable, or linked list would perform best. Thanks
If you know that a search will take place at a single level, then you might want a list of lists: one list for each level. If your hierarchy has N levels, then you have N lists. Each one contains nodes that are:
ListNode
    Data // string
    ParentIndex // index of parent in the previous list
So to search level 4, you go to the list for that level and do your contains or regex test on each node in that level. If it matches, then the ParentIndex value will get you the parent, and its ParentIndex will get you the grandparent, etc.
This way, you don't have to worry about navigating the hierarchy except when you find a match, and you don't have to write nested or recursive algorithms to traverse the tree.
You could maintain your hierarchy, as well, with each top-level node containing a list of child nodes, and build this secondary list only for searching.
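The list-per-level idea can be sketched as follows. This is written in Python rather than C# purely for illustration; the sample nodes, the search_level helper, and the use of None for a top-level node's parent index are assumptions, not part of the question.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class ListNode:
    data: str
    parent_index: Optional[int]  # index of the parent in the previous level's list; None at level 0

# levels[n] holds every node at depth n (hypothetical sample data).
levels: List[List[ListNode]] = [
    [ListNode("ParentNode1", None), ListNode("ParentNode4", None)],
    [ListNode("ChildNode1", 0), ListNode("ChildNode2", 1)],
    [ListNode("Node1", 1), ListNode("Node2", 1)],
]

def search_level(levels: List[List[ListNode]], depth: int,
                 predicate: Callable[[str], bool]) -> List[List[str]]:
    """Scan one level only; for each match, return the path from the root down to it."""
    paths = []
    for node in levels[depth]:
        if predicate(node.data):
            path = [node.data]
            parent_index = node.parent_index
            # Walk upward one level at a time using the stored parent indexes.
            for level in range(depth - 1, -1, -1):
                parent = levels[level][parent_index]
                path.append(parent.data)
                parent_index = parent.parent_index
            paths.append(list(reversed(path)))
    return paths

matches = search_level(levels, 2, lambda s: s.endswith("2"))
print(matches)
```

The search itself is a flat loop over one list; the walk up through the parents only happens for nodes that match.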
I need to make a view that emits a value for each pair of documents (A cartesian product of _all_docs with itself)
For example, assume DB has documents with IDs a, b, c -> then the view should emit 9 keys aa, ab, ac, ba, ... , cc (assuming no grouping)
E.g. if the documents are "cities" with coordinates, the view returns pairs of cities and distance between them (real example is more complicated), so I could then use _list function to compute "top10 closest cities" and so on.
This looks like a very simple task, but searching Google and SO gives no results. Am I missing some magic keyword here?
I can't think of a way to do this in CouchDB - fundamentally, this doesn't lend itself to map/reduce indexes - in the map function you only have access to one document at a time and in the reduce stage you need to reduce the result (computing the cartesian product would expand it).
If you use another system to precompute the distances between the cities then CouchDB is likely a good fit for storing and querying the result of that cartesian product (to e.g. find the top 10 closest cities). However, you might also want to look at a graph database (Neo4j or Giraph) as well.
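As a rough sketch of the "precompute elsewhere" step, the pairs could be generated by a small script outside CouchDB and then stored as ordinary documents. The city data, the flat-plane distance, and the document fields ("from", "to", "distance") are all made up for illustration.

```python
import heapq
import itertools
import math

# Hypothetical documents: city ID -> (x, y) coordinates on a flat plane.
cities = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (6.0, 8.0)}

def pair_documents(cities):
    """Yield one document per ordered pair of cities, including a city paired with itself."""
    for (id1, p1), (id2, p2) in itertools.product(cities.items(), repeat=2):
        distance = math.hypot(p1[0] - p2[0], p1[1] - p2[1])
        yield {"_id": id1 + id2, "from": id1, "to": id2, "distance": distance}

pairs = list(pair_documents(cities))

def closest(pairs, city, n=10):
    """Return the n nearest other cities to `city`, nearest first."""
    others = (p for p in pairs if p["from"] == city and p["to"] != city)
    return heapq.nsmallest(n, others, key=lambda p: p["distance"])

top = closest(pairs, "a")
print([(p["to"], p["distance"]) for p in top])
```

Once documents like these are stored, a map function that emits a key such as [from, distance] would let a normal view return a city's neighbours already sorted by distance. Note that the number of pairs grows quadratically with the number of cities.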
How do you define a directed acyclic graph (DAG) (of strings) (with one root) best in Haskell?
I especially need to apply the following two functions on this data structure as fast as possible:
Find all (direct and indirect) ancestors of one element (including the parents of the parents etc.).
Find all (direct) children of one element.
I thought of [(String,[String])] where each pair is one element of the graph consisting of its name (String) and a list of strings ([String]) containing the names of (direct) parents of this element. The problem with this implementation is that it's hard to do the second task.
You could also use [(String,[String])] again while the list of strings ([String]) contain the names of the (direct) children. But here again, it's hard to do the first task.
What can I do? What alternatives are there? Which is the most efficient way?
EDIT: One more remark: I'd also like it to be easy to define. I have to define the instance of this data type myself "by hand", so I'd like to avoid unnecessary repetition.
Have you looked at the tree implementation in Martin Erwig's Functional Graph Library? Each node is represented as a context containing both its children and its parents. See the graph type class for how to access this. It might not be as easy as you requested, but it is already there, well-tested and easy to use. I have used it for more than a decade in a large project.
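The underlying idea, independent of that library, is to write the edges down once and keep both directions available per node. A language-neutral sketch of that idea in Python (this is not the library's API, and the sample graph is hypothetical):

```python
from collections import defaultdict

# Hypothetical DAG with one root, written by hand once as node -> direct parents.
parents = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
}

# Derive the reverse direction (node -> direct children) instead of typing it twice.
children = defaultdict(list)
for node, direct_parents in parents.items():
    for parent in direct_parents:
        children[parent].append(node)

def ancestors(node):
    """Return all direct and indirect ancestors of `node`."""
    seen = set()
    stack = list(parents[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

print(sorted(ancestors("c")))
print(children["root"])
```

Direct children become a single lookup, and each ancestor is visited once even when several paths lead to it.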