Using multiple DDLs within a transaction in YugabyteDB

Does anyone know the reason, yugabyte-specific or otherwise, that I cannot alternate between truncating and inserting within the same transaction?
These steps:
Truncate a table.
Insert a row into that table.
Truncate again.
Insert another row into the table.
Result in this error on the final step:
ERROR: Operation failed. Try again.: Unknown transaction, could be recently aborted: e415ae05-0d46-42f5-b18d-f27b344b5642 (SQLSTATE 40001)
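For reference, the sequence above sketched in SQL, using a hypothetical table t:
BEGIN;
TRUNCATE TABLE t;                   -- step 1
INSERT INTO t VALUES (1, 'first');  -- step 2
TRUNCATE TABLE t;                   -- step 3
INSERT INTO t VALUES (2, 'second'); -- step 4: fails with SQLSTATE 40001
COMMIT;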
[Disclaimer]: This question was first asked on the YugabyteDB Community Slack channel.

In YugabyteDB, 'truncate' is currently not transactional. The recommendation is to avoid:
a) using truncate inside a multi-step transaction,
OR
b) running truncate concurrently with other read/write operations on the same table.
To my knowledge, other distributed SQL databases also either:
a) do not support truncate at all (like Google Cloud Spanner; see Does Cloud Spanner support a TRUNCATE TABLE command?),
OR
b) support truncate, but not in a transactional manner.
We do plan to lift this restriction in the future. In the near term, perhaps
delete from T;
can be used as a workaround; this is a bit heavier weight than using truncate, but will be transactional.
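A minimal sketch of that workaround, again assuming a hypothetical table t:
BEGIN;
DELETE FROM t;                      -- transactional, unlike TRUNCATE
INSERT INTO t VALUES (1, 'first');
DELETE FROM t;                      -- safe to repeat within the same transaction
INSERT INTO t VALUES (2, 'second');
COMMIT;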

Related

spark sql databricks - transaction log error after optimize

I have two tables, written like this:
f_em.write.format('delta').mode("overwrite").saveAsTable('rens.f_em')
f_dial.write.format('delta').mode("overwrite").saveAsTable('rens.f_dial')
These tables work fine. I can query them. However, they are large (ca. 11 billion rows), so to enhance performance, I want to optimize them.
%sql
optimize rens.f_em
zorder by (RKNR)
and
%sql
optimize rens.f_dial
zorder by (rknr)
I have no clue how optimize exactly works and what zorder by exactly does. I used the optimize function before on another table, and just used the attribute I use the most for linking/joining in the zorder by statement. This improved performance significantly, so I tried the same approach here.
After running the optimize statement, I cannot query from the tables any longer:
For one of the tables, I receive this error after a simple select statement:
You are trying to read from `dbfs:/user/hive/warehouse/rens.db/f_em` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to read from the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://learn.microsoft.com/azure/databricks/delta/index
;
and this error for the other table:
Error in SQL statement: FileNotFoundException: dbfs:/user/hive/warehouse/rens.db/f_dial/_delta_log/00000000000000000000.json: Unable to reconstruct state at version 2 as the transaction log has been truncated due to manual deletion or the log retention policy (delta.logRetentionDuration=30 days) and checkpoint retention policy (delta.checkpointRetentionDuration=2 days)
Just guessing: check the path that the new query resolves to. It looks for the data under
dbfs:/user/hive/warehouse/rens.db/f_em
but most likely you saved the table to
dbfs:/user/hive/warehouse/rens.f_em
This might be due to the dot notation in your saveAsTable('rens.f_em'): in the SQL query, the dot is interpreted by the SQL API as a database qualifier, not as part of a Delta table named rens.f_em.
EDIT: Given your reply, I would like to propose the following workaround, which I personally always use and favor for its robustness.
table_dir = "/path/to/table"
# write the DataFrame as a path-based Delta table
f_em.write.format('delta').mode("overwrite").save(f"{table_dir}/f_em")
# register the existing Delta files in the metastore
spark.sql("CREATE DATABASE IF NOT EXISTS rens")
spark.sql(f"CREATE TABLE rens.f_em USING DELTA LOCATION '{table_dir}/f_em'")
# optimize directly against the path
spark.sql(f"OPTIMIZE delta.`{table_dir}/f_em` ZORDER BY (RKNR)")

Copy Data pipeline on Azure Data Factory from SQL Server to Blob Storage

I'm trying to move some data from Azure SQL Server Database to Azure Blob Storage with the "Copy Data" pipeline in Azure Data Factory. In particular, I'm using the "Use query" option with the ?AdfDynamicRangePartitionCondition hook, as suggested by Microsoft's pattern here, in the Source tab of the pipeline, and the copy operation is parallelized by the presence of a partition key used in the query itself.
The source on SQL Server Database consists of two views with ~300k and ~3M rows, respectively.
Additionally, the views have the same query structure, e.g. (pseudo-code)
with
v as (
select hashbyte(field1) [Key1], hashbyte(field2) [Key2]
from Table
)
select *
from v
and so do the tables that are queried by the views. On top of this, the views query the same number of partitions with a roughly equally distributed number of rows.
The unexpected behavior of the copy operation - most likely due to my lack of experience - is that it takes much longer for the view that queries fewer rows. In fact, the copy operation with ~300k rows shows a throughput of ~800 KB/s, whereas the one with ~3M rows shows a throughput of ~15 MB/s (!). Lastly, the write to blob storage is fast in both cases, as opposed to the read from the source.
I don't expect anyone to provide an actual solution - as the information provided is limited -, but I'd rather like some hints on what could be affecting the copy performance so badly for the case where the view queries much (roughly an order of magnitude) fewer rows, taking into account that the tables under the views have a comparable number of fields, and also the same data types: both the tables that the views query contain int, datetime, and varchar data types.
Thanks in advance for any heads up.
To whoever might stumble upon the same issue: I found out, rather empirically, that the bottleneck was caused by the several key-hash computations in the view on the SQL DB. Once I removed these (they are now calculated later, on Azure Synapse Analytics, the data warehouse), I observed a massive performance boost in the copy operation.
When there's a copy activity performance issue in ADF and the root cause is not obvious (as opposed to, say, a fast source with a throttled sink where we know why), here's how I would go about it:
Start with the Integration Runtime (IR) (doc.). This might be a job concurrency issue, a network throughput issue, or just an undersized VM (in the case of a self-hosted IR). Easily >80% of all issues in my prod ETL are caused by IRs, in one way or another.
Replicate copy activity behavior both on source & sink. Query the views from your local machine (ideally, from a VM in the same environment as your IR), write the flat files to blob, etc. I'm assuming you've done that already, but having another observation rarely hurts.
Test various configurations of the copy activity. Changing isolationLevel, partitionOption, parallelCopies and enableStaging would be my first steps here. This won't fix the root cause of your issue, obviously, but it can point you in a direction to dig further.
Try searching the documentation (this doc., provided by @Leon, is a good start). This probably should have been step #1; however, I find the ADF documentation somewhat lacking.
N.B. this is based on my personal experience with Data Factory.
Providing a specific solution in this case is, indeed, quite hard.

Truncate tables on databricks

I'm working with two environments in Azure: Databricks and SQL Database. I have a function that generates a dataframe which is then used to overwrite a table stored in the SQL Database. I'm having problems because df.write.jdbc(mode = 'overwrite') only drops the table and, I'm guessing, my user doesn't have the right permissions to create it again (I've already checked the DML and DDL permissions I need for that). In short, my function only drops the table without recreating it.
We discussed what the problem could be and concluded that the best thing I can do is probably to truncate the table and re-add the new data. I'm trying to find out how to truncate the table; I tried these two approaches but can't find more information about them:
df.write.jdbc()
&
spark.read.jdbc()
Can you help me with this? The overwrite doesn't work (maybe I don't have adequate permissions) and I can't figure out how to truncate that table using JDBC.
It's in the Spark documentation - you need to add the truncate option when writing:
df.write.mode("overwrite").option("truncate", "true")....save()
Also, if you have a lot of data, then maybe it's better to use Microsoft's Spark connector for SQL Server - it has some performance optimizations that should allow you to write faster.
You can create a stored procedure for truncating or dropping the table in SQL Server and call that stored procedure from Databricks over an ODBC connection.
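A minimal sketch of such a procedure, assuming a hypothetical target table dbo.my_table (the EXECUTE AS OWNER clause is one optional way to let a caller without ALTER permission on the table run the truncate):
CREATE PROCEDURE dbo.usp_truncate_my_table
WITH EXECUTE AS OWNER  -- runs under the owner's permissions, not the caller's
AS
BEGIN
    TRUNCATE TABLE dbo.my_table;
END;
The procedure can then be called from Databricks over the ODBC connection before inserting the new data.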

In Azure Synapse, is it true that creating a temporary table is discouraged?

I had a good discussion with one of my colleagues, and he mentioned that creating a temporary table degrades performance in Azure Synapse because Synapse creates the temporary table first on the master node and then distributes it to the child nodes. Is that true? He recommended that I create a permanent table instead of a temporary table.
That’s not correct. Temp tables don’t necessarily funnel through the control node. Let’s say you are selecting from a table distributed on ProductKey and loading it into a #temp table distributed on ProductKey. The data will never leave each compute node since it’s a distribution compatible insert.
On the other hand, if you run a query that uses a ROW_NUMBER function, for example, that would have to be calculated on the control node and then the data would be sent back to the compute nodes to be stored in the distributed temp table. But that only happens in the presence of some types of functions and some types of queries. It is not the norm. If you are worried about a particular query then add the word EXPLAIN to the front of it and paste the explain plan XML into your question so we can help you interpret it.
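For example, a sketch against a hypothetical dbo.FactSales table:
EXPLAIN
SELECT ROW_NUMBER() OVER (ORDER BY SalesAmount DESC) AS rn,
       ProductKey
FROM dbo.FactSales;
The result is the distributed plan as XML, which shows any data movement steps the query requires.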
If you load a #temp table with a SELECT INTO statement you can’t specify the table geometry so it will be a round robin distributed columnstore. Usually this isn’t ideal since it takes extra time and memory to compress a columnstore and because round robin distribution isn’t ideal unless there is no good distribution key. Usually the next query which uses the round robin distributed temp table will just reshuffle it so it’s best to properly hash distribute a temp table initially. To do this do a CTAS statement as described here.
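A hedged sketch of such a CTAS, with hypothetical table, column, and distribution-key names:
CREATE TABLE #stg_sales
WITH
(
    DISTRIBUTION = HASH(ProductKey),  -- match the distribution key of the downstream insert/join
    HEAP                              -- skip columnstore compression for a short-lived temp table
)
AS
SELECT ProductKey, SUM(SalesAmount) AS SalesAmount
FROM dbo.FactSales
GROUP BY ProductKey;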

Is it bad to use INDEX in Cassandra if performance is not important?

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, its distribution, and how you access it; you were right earlier when you compared this process to an RDBMS. For Cassandra, it's best to define your queries first and then build the data model around them.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by an indexed value, each node has to query its own records to build the final result set (as opposed to a primary-key query, where it is known exactly which node needs to be queried). So there's an impact not just on writes, but on read performance as well.
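In CQL terms, a sketch with hypothetical table and column names:
CREATE INDEX user_events_feature_idx ON user_events (feature);

-- the indexed column can now be used in WHERE, but every node has to
-- consult its own local index to contribute rows to the result set:
SELECT * FROM user_events WHERE feature = 'login';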
In terms of working out the performance impact on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that DataStax has built to quickly generate profile YAMLs:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.
