Easier way to use transactions in JOOQ - jooq

We use JOOQ code generation which works like a charm. For 99% of our use cases we simply reuse DAOs generated by JOOQ. For one usecase we need to use transactions.
I am looking at other questions such as the one here JOOQ & transactions and using JOOQ transactions to insert rows into two tables as a transaction.
The tables are quite big with 15+ columns and I am thinking if there is a better way to do this other than
DSL.using(configuration)
.transaction(ctx -> {
DSL.using(ctx)
.update(TABLE)
.set(TABLE.COL, newValue)
.where(...)
.execute();
});
Specifying the 15 columns seems like a lot of work which has to be redone when new columns etc are added. We already have POJOs is there any way to simply convert POJO into a record and simply using the DSL syntax.

You can convert a POJO to a record like this:
ctx.dsl()
.update(TABLE)
.set(ctx.dsl().newRecord(TABLE, pojo))
.where(...)
.execute();
This will set all the columns in the table, which may be undesired (e.g. you may not want to set the primary key to itself). To prevent certain fields from being included, use Record.changed(Field, boolean), to reset the internal changed flag per column.
Alternatively, if your WHERE clause is just a predicate on the primary key, you can also just use UpdatableRecord.update():
ctx.dsl().newRecord(TABLE, pojo).update();

Related

How can you query all the data with an empty set in cassandra db?

As the title says, I'm trying to query all the data I got with no value stored in it. I've been searching for a while, and the only operation allowed that I've found is CONTAINS, which doesn't fit my need.
consider the following table:
CREATE TABLE environment(
id uuid,
name varchar,
message text,
public Boolean,
participants set<varchar>,
PRIMARY KEY (id)
)
How can I get all entries in the table with an empty set? E.g. participants = {} or null?
Unfortunately, you really can't. Cassandra makes queries like this difficult by design, because there's no way it can be done without doing a full table scan (scanning each and every node). This is why a big part of Cassandra data modeling is understanding all the ways that table will be queried, and building it to support those queries.
The other issue that you'll have to deal with, is that (generally speaking) Cassandra does not allow filtering by nulls. Again, it's a design choice...it's much easier to query for data that exists, rather than data that does not exist. Although, when writing with lightweight transactions, there are ways around this one (using the IF clause).
If you knew all of the ids ahead of time, you could write something to iterate through them, SELECT and check for null on the app-side. Although that approach will be slow (but it won't stress the cluster). Probably the better approach, is to use a distributed OLAP layer like Apache Spark. It still wouldn't be fast, but this is probably the best way to handle a situation like this.

cassandra where clause, one field from udt

How can I query based on only one field in UDT (Cassandra) ?
I have a UDT which contains marital status and I need data for all the married people
but when I query based on one field it gives empty output
How can I do this ?
short answer - NO, at least not in the stock Cassandra. It's possible to do using the DSE Search, but it adds its own constraints. Maybe when SAI will be implemented some day, it will support indexing of UDT fields. I don't remember if SASI supports this, but it's really not recommended to use. But even if indexing was supported, it's still bad case (maybe except SAI) for Cassandra because it will create big partitions to represent each status.
The general rule is that you need to always query by partition key, with possibility to use secondary indexes when you need to search for another field, but only inside partition. If you need to query by multiple fields that not primary keys, I would suggest to take another database, or use Elasticsearch/Solr (although they aren't very good in geo-distributed environment).

How do you UPDATE a Cassandra column without directly knowing the primary key?

Given a scenario where you have a User table, with id as PRIMARY KEY.
You have a column called email, and a column called name.
You want to UPDATE User.name based on User.email
I realized that the UPDATE command requires you to pass in a PRIMARY KEY. Does this mean I can't use a pure CQL migration, and would need to first query for the User.id primary key before I can UPDATE?
In this case, I DO know the PRIMARY KEY because the UUIDs are the same for dev and prod, but it feels dirty.
Yes, you're correct - you need to know primary key of the record to perform an update on the data, or deletion of specific record. There are several options here, depending of your data model:
Perform full scan of the table using effective token range scan (Look to this answer for more details);
If this is required very often, you can create a materialized view, with User.email as partition key, and fetch all message IDs that you can update (but you'll need to do this from your application, there is no nested query support in CQL). But also be aware that materialized views are "experimental" feature in Cassandra, and may not work all the time (it's more stable in DataStax Enterprise). Also, if you have some users with hundreds of thousands of emails, this may create big partitions.
Do like 2nd item with your code, by using an additional table
I think Alex's answer covers your question -- "how can I find a value in a PK column working backwards from a non-PK column's value?".
However, I think it's worth noting that asking this question indicates you should reconsider your data model. A rule of thumb in C* data model design is that you begin by considering the queries you need, and you've missed the UPDATE query use case. You can probably make things work without changing your model for now, but if you find you need to make other queries you're unprepared for, you'll run into operational issues with lots of indexes and/or MVs.
More generally, search around for articles and other resources about Cassandra data modeling. It sounds like you're basically using C* for a relational use case so you'll want to look into that.

Approaches to generate auto-incrementing numeric ids in CouchDB

Since CouchDB does not have support for SQL alike AUTO_INCREMENT what would be your approach to generate sequential unique numeric ids for your documents?
I am using numeric ids for:
User-friendly IDs (e.g. TASK-123, RQ-001, etc.)
Integration with libraries/systems that require numeric primary key
I am aware of the problems with replication, etc. That's why I am interested in how people try to overcome this issue.
As Dominic Barnes says, auto-increment integers are not scalable, not distributed-friendly or cloud-friendly. It seems every app nowadays needs a mobile version with offline support, and that is not directly compatible with auto-increment integers. We all know this, but it's true: auto-increment integers are necessary for legacy code and arguably other stuff.
In both scenarios, you are responsible for producing the auto-incrementing integer. A view is running emit(the_numeric_id, null). (You could also have a "type" namespace, e.g. by emit([doc.type, the_numeric_id], null). Query for the final row (e.g. with a startkey=MAXINT&descending=true&limit=1, increment the value returned, and that is your next id. The attempt to save is in a loop which can retry if there was a collision.
You can also play tricks if you don't need 100% density of the list of IDs. For example, you can add timestamps to the emit() rows, and estimate the document creation velocity, and increment by that velocity times your computation and transmit time. You could also simply increment by a random integer between 1 and N, so most of the time the first insert works, at a cost of non-homogeneous ID numbers.
About where to store the integer, I think there is the id strategy and the try and check strategy.
The id strategy is simpler and quicker in the short term. Document IDs are an integer (perhaps prefixed with a type to add a namespace). Since Couch guarantees uniqueness on the _id field, you just worry about the auto-incrementing. Do this in a loop: 409 Conflict triggers a retry, 201 Accepted means you're done.
I think the major pain with this trick is, that if and when you get conflicts, you have two completely unrelated documents, and one of them must be copied into a fresh document. If there were relationships with other documents, they must all be corrected. (The CouchDB 0.11 emit(key, {_id: some_foreign_doc_id}) trick comes to mind.)
The try and check strategy uses the default UUID as the doc._id, so every insert will succeed. Ideally, all or most of your inter-document relations are based on the immutable UUID _id, not the integer. That is just used for users and UI. The auto-incrementing integer is simply a field in the document, {"int_id":20}. The view of course does emit(doc.int_id, null). (You can look up a document by integer id with a ?key=23?include_docs=true parameter of the view.
Of course, after a replication, you might have id conflicts (not official CouchDB conflicts, but just documents using the same numeric id). The view which emits by ID would also have a reduce phase: simply _count should be enough. Next you must patrol the DB, querying this view with ?group=true and looking for any row (corresponding to an integer id) which has a count > 1. On the plus side, correcting the numeric id of a document is a minor change because it does not require new document creation.
Those are my ideas. Now that I wrote them down, I feel like you must do relation-shepherding regardless of where the id is stored; so perhaps using _id is better after all. The only other downside I see is that you are permanently married to a fundamentally broken naming model—for some definition of "permanently."
Is there any particular reason you want to use numeric IDs over the UUIDs that CouchDB can generate for you? UUIDs are perfect for the distributed paradigm that CouchDB uses, stick with what is built in.
If you find yourself with any more than 1 CouchDB node in your architecture, you're going to get conflicting document IDs if you rely on something like "auto increment" when it comes time for replication. Even if you're only using 1 node now, that's probably not always going to be the case, especially since CouchDB works so well in a distributed and "offline" architecture.
I have had pretty good luck just using an iso formatted date as my key:
http://wiki.apache.org/couchdb/IsoFormattedDateAsDocId
It's pretty simple to do, human-readable and it basically builds in a few querying options by just existing. :-)
Keeping in mind the issues around replication and conflicts, you can use an update function to generate incrementing IDs that are guaranteed unique in a single master setup.
function(doc, req) {
if (!doc) {
doc = {
_id: req.id,
type: 'idGenerator',
count: 0
};
}
doc.count++;
return [doc, toJSON(doc.count)];
}
Include this function in a design document like so:
{
"_id": "_design/application",
"language": "javascript",
"updates": {
"generateId": "function (doc, req) {\n\t\t\tif (!doc) {\n\t\t\t\tdoc = {\n\t\t\t\t\t_id: req.id,\n\t\t\t\t\ttype: 'idGenerator',\n\t\t\t\t\tcount: 0\n\t\t\t\t};\n\t\t\t}\n\n\t\t\tdoc.count++;\n\t\t\t\n\t\t\treturn [doc, toJSON(doc.count)];\n\t\t}"
}
}
Then call it like so:
curl -XPOST http://localhost:5984/mydb/_design/application/_update/generateId/entityId
Replace entityId with whatever you like to create several independent ID sequences.
Not a perfect solution but something that worked for me. Create an independent service that generates auto-incremented ids. Yes, you probably say "this breaks the offline model of couchdb" but what if you get a pool of N ids that you can then use whenever you need to get a new auto-incremented id. Then every time you're online you get some more ids and if you are running out of ids you tell your users - please go online. If the pool is big enough (say the monthly traffic) this shouldn't happen. Again, not perfect but maybe can be helpful to some people.
Instead of explicitly constructing an increasing integer key, you could use the implicit index couchDB accepts for paging.
The skip parameter accepts an integer that will effectively provide the auto-incrementing index you are used to.
http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
The drawback is that it is not a viable solution for "User-friendly IDs". The index is not tied to the doc, and is subject to change if you are rewriting history.
If your only constraint is "integration with libraries/systems that require numeric primary key", this will bridge the gap without loosing the benefits of couchDB's key structure.

What is the best way to store and search through object transactions?

We have a decent sized object-oriented application. Whenever an object in the app is changed, the object changes are saved back to the DB. However, this has become less than ideal.
Currently, transactions are stored as a transaction and a set of transactionLI's.
The transaction table has fields for who, what, when, why, foreignKey, and foreignTable. The first four are self-explanatory. ForeignKey and foreignTable are used to determine which object changed.
TransactionLI has timestamp, key, val, oldVal, and a transactionID. This is basically a key/value/oldValue storage system.
The problem is that these two tables are used for every object in the application, so they're pretty big tables now. Using them for anything is slow. Indexes only help so much.
So we're thinking about other ways to do something like this. Things we've considered so far:
- Sharding these tables by something like the timestamp.
- Denormalizing the two tables and merge them into one.
- A combination of the two above.
- Doing something along the lines of serializing each object after a change and storing it in subversion.
- Probably something else, but I can't think of it right now.
The whole problem is that we'd like to have some mechanism for properly storing and searching through transactional data. Yeah you can force feed that into a relational database, but really, it's transactional data and should be stored accordingly.
What is everyone else doing?
We have taken the following approach:-
All objects are serialised (using the standard XMLSeriliser) but we have decorated our classes with serialisation attributes so that the resultant XML is much smaller (storing elements as attributes and dropping vowels on field names for example). This could be taken a stage further by compressing the XML if necessary.
The object repository is accessed via a SQL view. The view fronts a number of tables that are identical in structure but the table name appended with a GUID. A new table is generated when the previous table has reached critical mass (a pre-determined number of rows)
We run a nightly archiving routine that generates the new tables and modifies the views accordingly so that calling applications do not see any differences.
Finally, as part of the overnight routine we archive any old object instances that are no longer required to disk (and then tape).
I've never found a great end all solution for this type of problem. Some things you can try is if your DB supports partioning (or even if it doesn't you can implement the same concept your self), but partion this log table by object type and then you can further partion by date/time or by your object ID (if your ID is a numeric this works nicely not sure how a guid would partion).
This will help maintain the size of the table and keep all related transactions to a single instance of an object to itself.
One idea you could explore is instead of storing each field in a name value pair table, you could store the data as a blob (either text or binary). For example serialize the object to Xml and store it in a field.
The downside of this is that as your object changes you have to consider how this affects all historical data if your using Xml then there are easy ways to update the historical xml structures, if your using binary there are ways but you have to be more concious of the effort.
I've had awsome success storing a rather complex object model that has tons of interelations as a blob (the xml serializer in .net didn't handle the relationships btw the objects). I could very easily see myself storing the binary data. A huge downside of storing it as binary data is that to access it you have to take it out of the database with Xml if your using a modern database like MSSQL you can access the data.
One last approach is to split the two patterns, you could define a Difference Schema (and I assume more then one property changes at a time) so for example imagine storing this xml:
<objectDiff>
<field name="firstName" newValue="Josh" oldValue="joshua"/>
<field name="lastName" newValue="Box" oldValue="boxer"/>
</objectDiff>
This will help alleviate the number of rows, and if your using MSSQL you can define an XML Schema and get some of the rich querying ability around the object. You can still partition the table.
Josh
Depending on the characteristics of your specific application an alternative approach is to keep revisions of the entities themselves in their respective tables, together with the who, what, why and when per revision. The who, what and when can still be foreign keys.
Although I would be very careful to use this approach, since this is only viable for applications with a relatively small amount of changes per entity/entity type.
If querying the data is important I would use true Partitioning in SQL Server 2005 and above if you have enterprise edition of SQL Server. We have millions of rows partitioned by year down to day for the current month - you can be as granular as your application demands with a maximum number of 1000 partitions.
Alternatively , if you are using SQL 2008 you could look into filtered indexes.
These are solutions that will enable you to retain the simplified structure you have whilst providing the performance you need to query that data.
Splitting/Archiving older changes obviously should be considered.

Resources