How single partition batch in Cassandra function for multiple column update? - cassandra

We have multiple update queries in a single partition of a single columnfamily. Like below
update t1 set username = 'abc', url = 'www.something.com', age = ? where userid = 100;
update t1 set username = 'abc', url = 'www.something.com', weight = ? where userid = 100;
update t1 set username = 'abc', url = 'www.something.com', height = ? where userid = 100;
username, url will be always same and are mandatory fields. But depending on the information given there will be extra columns.
As this is a single partition operation and we need atomicity + isolation. We will execute this in a batch.
As per Doc
A BATCH statement combines multiple data modification language (DML) statements (INSERT, UPDATE, DELETE) into a single logical operation, and sets a client-supplied timestamp for all columns written by the statements in the batch.
Now as we are updating columns(username, url) with same value in multiple statement, will C* combines it as a single statement before executing it like
update t1 set username = 'abc', url = 'www.something.com', age = ?, weight = ?, height = ? where userid = 100;
or same value will be upsert?
Another question is that, as they all have the same timestamp how C* resolves that conflict. Will C* compare every column (username, url) value.
As they all have the same timestamp C* resolves the conflict by choosing the largest value for the cells. Atomic Batch in Cassandra
Or should we add queries in batch like below. In this case we have to check username, url has already been added in statement.
update t1 set username = 'abc', url = 'www.something.com', age = ? where userid = 100;
update t1 set weight = ? where userid = 100;
update t1 set height = ? where userid = 100;
In short what will be the best way to do it.

For your first question(will C* combines it as a single statement?) answer is yes.
A single partition batch is applied as a single row mutation.
check this link for details:
https://issues.apache.org/jira/browse/CASSANDRA-6737
For your second question(Will C* compare every column (username, url) value?) answer is also yes.
As given in the answer of your provided link "Conflict is resolved by choosing the largest value for the cells"
So, you can write queries in batch in either way(given in your question).
As it will ultimately converted to a single write internally.

You are using Single partition batch so everything goes into a single partition.So all of your update will be merge and applied with a single RowMutation.
And so your update will be applied with no batch log, atomic, isolated

Related

How to construct a `select ... in` SQL query in nim?

I'm using nim and db_sqlite to fetch some rows with certain _ids from a database table. For example:
for row in db.fastRows("SELECT * FROM t WHERE _id IN (?, ?)", #["1", "2"]):
echo row
This works as expected, however, the sequence at the end is constructed dynamically at runtime, which means I need a variable amount of ? in the query. I end up creating a sequence with question marks, joining them, interpolating the string and turning it into a database query:
var qs : seq[string]
for id in ids:
qs.add("?")
let query_string = """SELECT * FROM t WHERE _id IN ($1)""" % join(qs, ",")
let query = SqlQuery(query_string)
for row in db.fastRows(query, ids):
echo row
Is there a better way to construct a select ... in query in nim? Ideally one with just one ? in the SqlQuery.
(For what it's worth, the current behavior is similar to other languages I've used)
you could do the replacement manually, here's one way using strformat and map
import strformat,db_sqlite,sequtils,strutils
#assuming ids is a seq[string] here
let query = sql(&"SELECT * FROM t WHERE _id IN ({ids.map(dbQuote).join(\",\")})")
for row in db.fastRows(query):
echo row

How to count the number of rows from query with SQLAlchemy when no model is specified?

I'm looking for a way to count with SQLAlchemy the number of rows that is returned from a given query (that potentially includes filters) but everything I find on the net makes explicit use of a model (example here). My problem is that I don't have a model, I only have a Table object (because I'm dealing with temporary tables that vary in format from time to time). For the moment I can do the following:
tbl = Table(mytablename,metadata,autoload=True, autoload_with=myengine, schema=myschemaname)
query = select([tbl])
filters = build_filters(...) #my function that build filters
query = query.where(and_(*filters))
conn = myengine.connect()
ResultProxy = conn.execute(query)
totalCount = len(ResultProxy.fetchall())
but it's very inefficient. Is there a way to do the count efficiently and without referring to any model?
Try the SQLAlchemy Core 'count' function documented here. I believe you can attach your filters on to that like you're doing now. So, (not guaranteeing my syntax here, but here's something to start you with)...
query = select([func.count()]).select_from(my_table).where(and_(*filters))
conn = myengine.connect()
ResultProxy = conn.execute(query)
totalCount = ResultProxy.fetchone()[0]
According to the documentation, I believe this will actually generate a SELECT COUNT from the database, not actually bring all the rows back from the DB and then count them.

Getting AutoQuery pagination to work with left join

In my AutoQuery request I have a left join specified so I can query on properties in the joined table.
public class ProductSearchRequest : QueryDb<Book>
, ILeftJoin<Book, BookAuthor>, ILeftJoin<BookAuthor, Author>
{}
If I use standard way of autoquery like so:
var q = AutoQuery.CreateQuery(request, base.Request);
var results = AutoQuery.Execute(request, q);
And 100 are being requested, then often less than 100 will be retuned as the Take() is based on results with a left join.
To remedy this I am doing this instead:
var q = AutoQuery.CreateQuery(request, base.Request);
q.OrderByExpression = null //throws error if orderby exists
var total = Db.Scalar<int>(q.Select(x => Sql.CountDistinct(x.Id))); //returns 0
var q1 = AutoQuery.CreateQuery(request, base.Request).GroupBy(x => x);
var results = Db.Select<Book>(q1);
return new QueryResponse<Book>
{
Offset = q1.Offset.GetValueOrDefault(0),
Total = total
Results = results
};
The group by appears to return correct number of results so paging works but the Total returns 0.
I also tried:
var total2 = (int)Db.Count(q1);
But even though q1 has a GroupBy() it returns the number of results including the left join and not the actual query
How can I get the true total of the query?
(Getting some official docs on how to do paging and totals with autoquery & left join would be very helpful as right now it's a bit confusing)
Your primary issue stems from trying to return a different total then the actual query AutoQuery executes. If you have multiple left joins, the total is the total results of the query it executes not the number of rows in your source table.
So you're not looking for the "True total", rather you're looking to execute a different query to get a different total than the query that's executed, but still deriving from the original query as its basis. First consider using normal INNER JOINS (IJoin<>) instead of LEFT JOINS so it only returns results for related rows in joined tables which the total will reflect accordingly.
Your total query that returns 0 is likely returning no results, so I'd look at looking at the query in an SQL Profiler so you can see the query that's executed. You can also enable logging of OrmLite queries with Debug logging enabled and in your AppHost:
OrmLiteUtils.PrintSql();
Also note that GroupBy() of the entire table is unusual, you would normally group by a single or multiple explicit selected columns, e.g:
.GroupBy(x => x.Id);
.GroupBy(x => new { x.Id, x.Name });

Cassandra Modelling for Date Range

Cassandra Newbie here. Cassandra v 3.9.
I'm modelling the Travellers Flight Checkin Data.
My Main Query Criteria is Search for travellers with a date range (max of 7 day window).
Here is what I've come up with with my limited exposure to Cassandra.
create table IF NOT EXISTS travellers_checkin (checkinDay text, checkinTimestamp bigint, travellerName text, travellerPassportNo text, flightNumber text, from text, to text, bookingClass text, PRIMARY KEY (checkinDay, checkinTimestamp)) WITH CLUSTERING ORDER BY (checkinTimestamp DESC)
Per day, I'm expecting upto a million records - resulting in the partition to have a million records.
Now my users want search in which the date window is mandatory (max a week window). In this case should I use a IN clause that spans across multiple partitions? Is this the correct way or should I think of re-modelling the data? Alternatively, I'm also wondering if issuing 7 queries (per day) and merging the responses would be efficient.
Your Data Model Seems Good.But If you could add more field to the partition key it will scale well. And you should use Separate Query with executeAsync
If you are using in clause, this means that you’re waiting on this single coordinator node to give you a response, it’s keeping all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing
Source : https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Instead of using IN clause, use separate query of each day and execute it with executeAsync.
Java Example :
PreparedStatement statement = session.prepare("SELECT * FROM travellers_checkin where checkinDay = ? and checkinTimestamp >= ? and checkinTimestamp <= ?");
List<ResultSetFuture> futures = new ArrayList<>();
for (int i = 1; i < 4; i++) {
ResultSetFuture resultSetFuture = session.executeAsync(statement.bind(i, i));
futures.add(resultSetFuture);
}
for (ResultSetFuture future : futures){
ResultSet rows = future.getUninterruptibly();
//You get the result set of each query, merge them here
}

Error in Linq: The text data type cannot be selected as DISTINCT because it is not comparable

I've a problem with LINQ. Basically a third party database that I need to connect to is using the now depreciated text field (I can't change this) and I need to execute a distinct clause in my linq on results that contain this field.
I don't want to do a ToList() before executing the Distinct() as that will result in thousands of records coming back from the database that I don't require and will annoy the client as they get charged for bandwidth usage. I only need the first 15 distinct records.
Anyway query is below:
var query = (from s in db.tSearches
join sc in db.tSearchIndexes on s.GUID equals sc.CPSGUID
join a in db.tAttributes on sc.AttributeGUID equals a.GUID
where s.Notes != null && a.Attribute == "Featured"
select new FeaturedVacancy
{
Id = s.GUID,
DateOpened = s.DateOpened,
Notes = s.Notes
});
return query.Distinct().OrderByDescending(x => x.DateOpened);
I know I can do a subquery to do the same thing as above (tSearches contains unique records) but I'd rather a more straightfoward solution if available as I need to change a number of similar queries throughout the code to get this working.
No answers on how to do this so I went with my first suggestion and retrieved the unique records first from tSearch then constructed a subquery with the non unique records and filtered the search results by this subquery. Answer below:
var query = (from s in db.tSearches
where s.DateClosed == null && s.ConfidentialNotes != null
orderby s.DateOpened descending
select new FeaturedVacancy
{
Id = s.GUID,
Notes = s.ConfidentialNotes
});
/* Now filter by our 'Featured' attribute */
var subQuery = from sc in db.tSearchIndexes
join a in db.tAttributes on sc.AttributeGUID equals a.GUID
where a.Attribute == "Featured"
select sc.CPSGUID;
query = query.Where(x => subQuery.Contains(x.Id));
return query;

Resources