I am trying to execute 3 conditional inserts into different tables inside a batch using the Cassandra cpp-driver:
BEGIN BATCH
INSERT INTO table1 (...) VALUES (...) IF NOT EXISTS
INSERT INTO table2 (...) VALUES (...) IF NOT EXISTS
INSERT INTO table3 (...) VALUES (...) IF NOT EXISTS
APPLY BATCH
But I am getting the following error:
Batch with conditions cannot span multiple tables
If the above is not possible in Cassandra, what is the alternative to perform multiple conditional inserts as a transaction and ensure that all succeed or all fail?
I'm afraid there are no real alternatives. Conditional statements in a BATCH are limited to a single table, and I don't think there's room for that to change in the future.
This is due to how Cassandra works internally: a batch containing a conditional update (a so-called lightweight transaction) can only operate on a single partition, because lightweight transactions are based on Cassandra's Paxos implementation, and Paxos itself works at the partition level only. Moreover, when the same BATCH contains multiple conditional statements, all the conditions must hold for the batch to succeed: if even a single conditional update fails, the entire batch fails.
You can read more about BATCH statements in the documentation.
You'd basically pay a performance penalty for the conditional update and another one for the batched operation, so C* stops you before you get that far.
It seems to me you designed this in an RDBMS-like way. A NoSQL alternative (I don't know whether it can be applied to your use case, though) would be to denormalize your data into a 4th table that combines the other 3 tables, and then supply a single conditional write to this 4th table.
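For example, a minimal sketch of that idea using the Python driver (the combined table, keyspace, and column names here are hypothetical; the same approach works from the cpp-driver):
import uuid
from cassandra.cluster import Cluster

# Hypothetical denormalized table that merges the columns of table1/2/3:
# CREATE TABLE combined (id uuid PRIMARY KEY, t1_data text, t2_data text, t3_data text);
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

insert = session.prepare(
    "INSERT INTO combined (id, t1_data, t2_data, t3_data) "
    "VALUES (?, ?, ?, ?) IF NOT EXISTS")

# A single conditional insert touches one partition, so it stays within
# what Paxos can handle.
result = session.execute(insert, (uuid.uuid4(), "a", "b", "c"))
if not result.was_applied:
    print("Row already existed; nothing was written")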
We know that in SQL, an index can be created on a column if it is frequently used for filtering. Is there anything similar I can do in Spark? Let's say I have a big table T containing a column C that I want to filter on. I want to filter tens of thousands of id sets on the column C. Can I sort/orderBy column C, cache the result, and then filter all the id sets against the sorted table? Will it help like indexing in SQL?
You should absolutely build the table/dataset/dataframe with a sorted id if you will query on it often. It will help predicate pushdown and, in general, give a boost in performance.
When executing queries in the most generic and basic manner, filtering happens very late in the process. Moving filtering to an earlier phase of query execution provides significant performance gains by eliminating non-matches earlier, and therefore saving the cost of processing them at a later stage. This group of optimizations is collectively known as predicate pushdown.
Even if you aren't sorting the data, you may want to look at storing it in files with DISTRIBUTE BY or CLUSTER BY. This is very similar to repartitioning, and again it only boosts performance if you query the data the same way you have distributed it.
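As a rough sketch of that idea in PySpark (the paths and the column name C are just placeholders), clustering the data on C before writing Parquet lets the Parquet min/max statistics prune files and row groups when you later filter on C:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/T")  # hypothetical source table

# Roughly what DISTRIBUTE BY C SORT BY C does in Spark SQL:
# shuffle rows by C, then sort each partition on C before writing.
(df.repartition("C")
   .sortWithinPartitions("C")
   .write.mode("overwrite")
   .parquet("/data/T_clustered_by_C"))

# Filters on C are pushed down to the Parquet reader, which can skip
# row groups whose min/max range does not contain the requested ids.
clustered = spark.read.parquet("/data/T_clustered_by_C")
clustered.filter(clustered.C.isin([101, 202, 303])).show()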
If you intend to re-query the data often, then yes, you should cache it, but in general there are no indexes. (There are file formats that help boost performance if you have specific query needs, e.g. row-based vs. columnar formats.)
You should also look at the Spark-specific performance tuning options. Adaptive Query Execution is a newer feature that helps boost performance (without indexes).
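For instance, a minimal sketch of turning it on (these are standard Spark 3.x settings):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive Query Execution re-optimizes the plan at runtime using shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")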
If you are working with Hive (note that Hive has its own version of partitions):
Depending on how you will query the data, you may also want to look at partitioning:
[hive] Partitioning is mainly helpful when we need to filter our data based on specific column values. When we partition tables, subdirectories are created under the table's data directory for each unique value of a partition column. Therefore, when we filter the data based on a specific column, Hive does not need to scan the whole table; it rather goes to the appropriate partition, which improves the performance of the query. Similarly, if the table is partitioned on multiple columns, nested subdirectories are created based on the order of partition columns provided in our table definition.
Hive partitioning is not a magic bullet and will slow down querying if the pattern of access is different from the partitioning. It makes a lot of sense to partition by month if you write a lot of queries looking at monthly totals. If, on the other hand, the same table is used to look at sales of product 'x' from the beginning of time, it will actually run slower than if the table weren't partitioned. (It's a tool in your tool shed.)
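A minimal sketch of month-based partitioning from Spark (the table, columns, and values are made up; this assumes a Hive-enabled SparkSession):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

sales = spark.createDataFrame(
    [("2021-05", "x", 10.0), ("2021-06", "y", 25.0)],
    ["month", "product", "amount"])

# Each distinct month becomes its own subdirectory under the table's data
# directory, so a filter on month only scans the matching partition.
sales.write.partitionBy("month").mode("overwrite").saveAsTable("sales_by_month")

spark.sql("SELECT sum(amount) FROM sales_by_month WHERE month = '2021-06'").show()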
Another Hive-specific tip:
The other thing you want to think about is keeping your table stats up to date. The cost-based optimizer uses those statistics to plan queries over your data, so you should make sure to keep them current. (Re-run after ~30% of your data has changed.)
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] -- (Note: Fully support qualified table name
since Hive 1.2.0, see HIVE-10007.)
COMPUTE STATISTICS
[FOR COLUMNS] -- (Note: Hive 0.10.0 and later.)
[CACHE METADATA] -- (Note: Hive 2.1.0 and later.)
[NOSCAN];
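For example, against the hypothetical sales_by_month table from the sketch above, the stats refresh could look like this (a sketch, run via spark.sql here, or directly in Hive/beeline):
spark.sql("ANALYZE TABLE sales_by_month COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales_by_month COMPUTE STATISTICS FOR COLUMNS month, product, amount")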
Why does the order of rows displayed via show differ when I take a subset of the dataframe columns to display?
Here is the original dataframe:
Here the dates are in the given order, as you can see via show.
Now the order of rows displayed via show changes when I select a subset of the columns of predict_df into a new dataframe.
Because a Spark dataframe itself is unordered. This is due to the parallel processing principles which Spark uses: different records may be located in different files (and on different nodes), and different executors may read the data at different times and in a different sequence.
So you have to explicitly specify the ordering in a Spark action using the orderBy (or sort) method. E.g.:
df.orderBy('date').show()
In this case the result will be ordered by the date column and will be more predictable. But if many records have the same date value, then the records within each date subset will still be unordered. So, in order to obtain strongly ordered data, we have to perform orderBy on a set of columns whose combined values are unique across all rows. E.g.:
from pyspark.sql.functions import col
df.orderBy(col("date").asc(), col("other_column").desc())
In general, unordered datasets are the normal case for data processing systems. Even "traditional" DBMSs like PostgreSQL or MS SQL Server generally return unordered records, and we have to explicitly use an ORDER BY clause in the SELECT statement. Even if we sometimes see the same result from one query, the DBMS does not guarantee that another execution will return the same order, especially when a large amount of data is read.
The situation occurs because show is an action, and it is called twice.
As no .cache is applied, the whole cycle starts again from the start for the second action. Moreover, I tried this a few times and got both the same order and a different order than the questioner observed; the processing is non-deterministic.
As soon as I used .cache, I always got the same result.
This means that ordering is preserved over a narrow transformation on a dataframe if caching has been applied; otherwise the 2nd action will invoke processing from the start again - the basics are evident here as well. And maybe the bottom line is: always do ordering explicitly, if it matters.
Like @Ihor Konovalenko and @mck mentioned, a Spark dataframe is unordered by its nature. Also, it looks like your dataframe doesn't have a reliable key to order by, so one solution is using monotonically_increasing_id (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) to create an id column that lets you keep your dataframe ordered. However, if your dataframe is big, be aware this function might take some time to generate an id for each row.
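A rough sketch of that approach (the source path and column names are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()
predict_df = spark.read.parquet("/data/predictions")  # hypothetical source

# Capture the current row order in an id column, then sort on it explicitly
# every time you display or select a subset of columns.
with_id = predict_df.withColumn("row_id", monotonically_increasing_id())

with_id.orderBy("row_id").show()
with_id.orderBy("row_id").select("date").show()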
As per this datastax doc Atomicity in Cassandra is:
In Cassandra, a write is atomic at the partition-level, meaning inserting or updating columns in a row is treated as one write operation.
Whereas according to this datastax doc Atomicity in Cassandra is:
In Cassandra, a write operation is atomic at the partition level, meaning the insertions or updates of two or more rows in the same partition are treated as one write operation.
My confusion is whether atomicity applies on a single-row basis, or whether it can include multiple rows of a table at the partition level.
I assume it is a combination of both depending on the type of query we are executing in Cassandra.
For example:
If I have an insert query, it will always insert one row into a partition, so Cassandra ensures that this row is inserted successfully at the partition level.
But if I have an update query whose WHERE clause qualifies multiple rows, then the update operation being atomic at the partition level means that either all of the qualifying rows will be updated or none will be.
Is my understanding correct?
"row" and "partition" get conflated since previously row meant partition and now row means a part of a partition.
They are atomic to the partition. Keep in mind thats in reference to a single replica, so a or multiple rows in a batch containing 5 columns are all updated in a single operation on the one replica (no cross node isolation). If your setting (key, value) VALUES ('abc', 'def') you will never see just the key and not the value set. However you might make a read and only 1 replica has it set while other does not. Meaning depending on your replication factor and consistency level requested you will either see the whole thing or nothing. This can apply to multiple rows within a partition as well but you cannot update 2 rows with a single update statement without a batch (logged or unlogged).
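To make the single-partition case concrete, here is a hedged sketch with the Python driver (keyspace, table, and values are made up); both updates share the partition key, so each replica applies them as one write:
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Sketch of the table: CREATE TABLE events (pk text, ck int, value text, PRIMARY KEY (pk, ck));
batch = BatchStatement()
batch.add(SimpleStatement("UPDATE events SET value = 'def' WHERE pk = 'abc' AND ck = 1"))
batch.add(SimpleStatement("UPDATE events SET value = 'ghi' WHERE pk = 'abc' AND ck = 2"))

# Both rows live in partition 'abc', so a reader hitting a given replica will
# see either both updates or neither, never just one of them.
session.execute(batch)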
Little problem here with Cassandra. Basically my data has a status (INITIALIZED, PERFORMED, ENDED...), and I have different scheduled tasks that query this data based on the status with an "IN" clause. So one scheduler will work with the data that is INITIALIZED, one with PERFORMED, some with both, etc...
Once the data is retrieved, it is processed and the status changes accordingly (INITIALIZED -> PERFORMED -> ENDED).
The problem: in order to be able to use the IN clause, the status has to be part of the primary key of my table. But when I update the status... it creates a new record in my table, since the UPSERT doesn't find any data with the given primary key...
How do I solve that ?
Instead of including the status column in your primary key, you can create a secondary index on the column. However, the IN clause is not (yet) supported for secondary index columns. But as you have a very limited number of values to look up, you could use equality conditions in your WHERE clause and then merge the results client-side?
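A hedged sketch of that suggestion with the Python driver (the table, index, and status values are assumptions based on the question):
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# One-time DDL sketch: CREATE INDEX IF NOT EXISTS tasks_status_idx ON tasks (status);
query = session.prepare("SELECT * FROM tasks WHERE status = ?")

# One equality query per status this scheduler handles, merged client-side
# instead of a single IN (...) query against the secondary index.
rows = []
for status in ("INITIALIZED", "PERFORMED"):
    rows.extend(session.execute(query, (status,)))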
Beware that using secondary indexes comes at a cost. Check out "when not to use an index". In your case these points may apply:
1. On a frequently updated or deleted column. See "Problems using an index on a frequently updated or deleted column" below.
2. To look for a row in a large partition unless narrowly queried. See "Problems using an index to look for a row in a large partition unless narrowly queried" below.
In Cassandra, I want to add a row, and if it already exists, only update it if the existing date is earlier than the new date. This is how it's done:
INSERT INTO tbl (...) VALUES (...) IF NOT EXISTS;
If the first query is not applied, I'm running this second one:
UPDATE tbl SET ...
WHERE ...
IF date <= ?;
Is it possible to merge the two queries into one? Maybe by using the UPDATE as an upsert while keeping the IF condition. We are having performance issues with these statements (timeouts), which is why I want to change this.
Regular updates (without IFs) also perform inserts if the row doesn't exist, but a lightweight transaction doesn't. Maybe it's possible to "trick" it into inserting as well.
Thanks!
An LWT basically performs a check before executing the data mutation. Conditional execution is enabled only for INSERT and UPDATE, with these conditions:
1. IF NOT EXISTS for INSERT
2. IF column = 'value' for UPDATE
You cannot mix and match these conditions across different operations. If there were an option to say UPDATE ... IF column <= 'value', it would have to hit all the nodes and propose a transaction to all of them, and that would be a huge performance impact. LWTs impact performance even with equality conditions, while hitting only the replica nodes.
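Since the two statements can't be merged, here is a hedged sketch of the two-step flow with the Python driver (table and column names follow the question but are otherwise assumptions), using the was_applied flag of the first LWT to decide whether to run the second:
import uuid
from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

row_id, payload, new_date = uuid.uuid4(), "data", datetime.now(timezone.utc)

insert = session.prepare(
    "INSERT INTO tbl (id, payload, date) VALUES (?, ?, ?) IF NOT EXISTS")
update = session.prepare(
    "UPDATE tbl SET payload = ?, date = ? WHERE id = ? IF date <= ?")

# First attempt the conditional insert; was_applied mirrors the [applied] column.
result = session.execute(insert, (row_id, payload, new_date))
if not result.was_applied:
    # The row already exists: overwrite it only if its stored date is not newer.
    session.execute(update, (payload, new_date, row_id, new_date))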