In Azure Synapse, is it true that creating a temporary table is discouraged? - azure

I had a good discussion with one of my colleagues and he mentioned creating a temporary table degrades the performance in Azure Synapse because Synapse creates the temporary table first in the master node then distribute them to child node. Is it true? He recommended me to create create permanent table instead of temporary table.

That’s not correct. Temp tables don’t necessarily funnel through the control node. Let’s say you are selecting from a table distributed on ProductKey and loading it into a #temp table distributed on ProductKey. The data will never leave each compute node since it’s a distribution compatible insert.
On the other hand, if you run a query that uses a ROW_NUMBER function, for example, that would have to be calculated on the control node and then the data would be sent back to the compute nodes to be stored in the distributed temp table. But that only happens in the presence of some types of functions and some types of queries. It is not the norm. If you are worried about a particular query then add the word EXPLAIN to the front of it and paste the explain plan XML into your question so we can help you interpret it.
If you load a #temp table with a SELECT INTO statement you can’t specify the table geometry so it will be a round robin distributed columnstore. Usually this isn’t ideal since it takes extra time and memory to compress a columnstore and because round robin distribution isn’t ideal unless there is no good distribution key. Usually the next query which uses the round robin distributed temp table will just reshuffle it so it’s best to properly hash distribute a temp table initially. To do this do a CTAS statement as described here.

Related

Is it hacky to do RF=ALL + CL=TWO for a small frequently used Cassandra table?

I plan to enhance the search for our retail service, which is managed by DataStax. We have data of about 500KB in raw from our wheels and tires and could be compressed and encrypted to about 20KB. This table is frequently used and changes about every day. We send the data to the frontend, which will be processed with Next.js later. Now we want to store this data in a single row table in a separate keyspace with a consistency level of TWO and RF equal to all nodes, replicating the table to all of the nodes.
Now the question: Is this solution hacky or abnormal? Is any solution rather this that fits best in this situation?
The quick answer to your question is yes, it is a hacky solution to do RF=ALL.
The table is very small so there is no benefit to replicating it to all nodes in the cluster. In practice, the tables are so small that the data will be cached anyway.
Since you are running with DataStax Enterprise (DSE), you might as well take advantage of the DSE In-Memory feature which allows you to keep data in RAM to save from disk seeks. Since your table can easily fit in RAM, it is a perfect use case for DSE In-Memory.
To configure the table to run In-Memory, set the table's compaction strategy to MemoryOnlyStrategy:
CREATE TABLE inmemorytable (
...
PRIMARY KEY ( ... )
) WITH compaction= {'class': 'MemoryOnlyStrategy'}
AND caching = {'keys':'NONE', 'rows_per_partition':'NONE'};
To alter the configuration of an existing table:
ALTER TABLE inmemorytable
WITH compaction= {'class': 'MemoryOnlyStrategy'}
AND caching = {'keys':'NONE', 'rows_per_partition':'NONE'};
Note that tables configured with DSE In-Memory are still persisted to disk so you won't lose any data in the event of a power outage or service disruption. In-Memory tables operate the same as regular tables so the same backup and restore processes still apply with the only difference being that a copy of the data is kept in memory for faster read performance.
For details, see DataStax Enterprise In-Memory. Cheers!

Copy Data pipeline on Azure Data Factory from SQL Server to Blob Storage

I'm trying to move some data from Azure SQL Server Database to Azure Blob Storage with the "Copy Data" pipeline in Azure Data Factory. In particular, I'm using the "Use query" option with the ?AdfDynamicRangePartitionCondition hook, as suggested by Microsoft's pattern here, in the Source tab of the pipeline, and the copy operation is parallelized by the presence of a partition key used in the query itself.
The source on SQL Server Database consists of two views with ~300k and ~3M rows, respectively.
Additionally, the views have the same query structure, e.g. (pseudo-code)
with
v as (
select hashbyte(field1) [Key1], hashbyte(field2) [Key2]
from Table
)
select *
from v
and so do the tables that are queried by the views. On top of this, the views query the same number of partitions with a roughly equally distributed number of rows.
The unexpected behavior - most likely due to the lack of experience from my side - of the copy operation is that it lasts much longer for the view that query fewer rows. In fact, the copy operation with ~300k rows shows a throughput of ~800 KB/s, whereas the one with ~3M rows shows a throughput of ~15MB/s (!). Lastly, the writing operation to the blob storage is pretty fast for both cases, as opposite to the reading-from-source operation.
I don't expect anyone to provide an actual solution - as the information provided is limited -, but I'd rather like some hints on what could be affecting the copy performance so badly for the case where the view queries much (roughly an order of magnitude) fewer rows, taking into account that the tables under the views have a comparable number of fields, and also the same data types: both the tables that the views query contain int, datetime, and varchar data types.
Thanks in advance for any heads up.
To whoever might stumble upon the same issue, I managed to find out, rather empirically, that the bottleneck was being caused by the presence of several key-hash computations in the view on SQL DB. In fact, once I removed these - calculated later on Azure Synapse Analytics (data warehouse) - I observed a massive performance boost of the copy operation.
When there's a copy activity performance issue in ADF and the root cause is not obvious (e.g. if source is fast, but sink is throttled, and we know why) -- here's how I would go about it :
Start with the Integration Runtime (IR) (doc.). This might be a jobs' concurrency issue, a network throughput issue, or just an undersized VM (in case of self-hosted). Like, >80% of all issues in my prod ETL are caused by IR-s, in one way or another.
Replicate copy activity behavior both on source & sink. Query the views from your local machine (ideally, from a VM in the same environment as your IR), write the flat files to blob, etc. I'm assuming you've done that already, but having another observation rarely hurts.
Test various configurations of copy activity. Changing isolationLevel, partitionOption, parallelCopies and enableStaging would be my first steps here. This won't fix the root cause of your issue, obviously, but can point a direction for you to dig in further.
Try searching the documentation (this doc., provided by #Leon is a good start). This should have been a step #1, however, I find ADF documentation somewhat lacking.
N.B. this is based on my personal experience with Data Factory.
Providing a specific solution in this case is, indeed, quite hard.

Write to a datepartitioned Bigquery table using the beam.io.gcp.bigquery.WriteToBigQuery module in apache beam

I'm trying to write a dataflow job that needs to process logs located on storage and write them in different BigQuery tables. Which output tables are going to be used depends on the records in the logs. So I do some processing on the logs and yield them with a key based on a value in the log. After which I group the logs on the keys. I need to write all the logs grouped on the same key to a table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument as described in the documentation here
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
The CREATE_IF_NEEDED gives an error because it has to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step and if not create them.
If i load older data I get the following error:
The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."
This seems like a limitation of streaming inserts, any way to do batch inserts ?
Maybe I'm approaching this wrong, and should use another method.
Any guidance as how to tackle these issues are appreciated.
Im using python 3.5 and apache-beam=2.13.0
That error message can be logged when one mixes the use of an ingestion-time partitioned table a column-partitioned table (see this similar issue). Summarizing from the link, it is not possible to use column-based partitioning (not ingestion-time partitioning) and write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the use of the partition decorator when selecting which table (use "[prefix]_YYYYMMDD") and then have each individual table be column-based partitioned.

Is it bad to use INDEX in Cassandra if performance is not important?

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, distribution and how you access it; you were right before when you compared this process to RDMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query where it is known exactly which node needs to be quired). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that Datastax have built, to quickly generate profile yamls:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.

Cassandra - Distributing data and Multiple tables (Data modeling)

I am trying to learn cassandra. One thing I am not clear is how to ask Cassandra to distribute various tables. ie say I have time series data coming into table t1,t2,t3
T1 is heavily loaded ( ratio is 2000: 2:4 for num of rows).
I want the data of T1 for a given day to be not on the same machine as T2 or T3; so my queries are equally distributed ie not put too much load on one machine.
Also as the data gets older, its queried less, how can I take into account this factor.
regards
Cassandra is automatically distributed, you do not have a direct control on how the data gets distributed. In most cases, by default it makes use of an md5 on the row key and depending on that selects which nodes (computers) will be used to save the data.
What you are talking about would be more of a planning for a standard SQL database. However, if you generate extremely large amount of statistical data that is only to be used by some backend processes and users, you could have a separate cluster of 2 or 3 nodes. That way your other tables would not be affected by those statistics.
However, the true power of Cassandra is to be used with one large cluster. If it slows down, add nodes to it and do the necessary repair to spread the data properly. That's it... pretty much.
As for the way a table is used, you can use all the parameters defined on a table to tweak its setup. If you mainly do writes to a table, then you can tweak the parameters to get faster writes and slower reads. The other way around is also available: one write, many reads. And also many writes and many reads. To tweak those settings, in most cases you will need to have your software running and gather various stats and make changes as time passes.
Update:
There is actually a solution, thinking about it, just... I never use that mode so I did not think about it.
When you use a cluster which supports sorted rows, you can use a specific row name and the data will then go to a specific node. Again, you do not directly have control over what goes where, but if you really really want to do it that way, that's probably the solution you are looking for.
In this case, the row name would start with a number such as 0x0001 for T1 data, and 0x0100 and 0x0200 for T2 and T3. Since you do not know where the data really goes and how Cassandra decides to use it, it is rather complicated to obtain the right results here. And if you change your cluster (i.e. add nodes) then all your assumptions of where the data goes may very well go to the toilet! (and that's not speaking of upgrading to a new version of Cassandra...)

Resources