Help me understand how global temporary tables work
I have a process which is going to be threaded and requires data visible only to that thread's session, so we opted for a global temporary table.
Is it better to leave the global temporary table in place after all threads have completed, or is it wise to drop the table? Calls to this process can happen once or twice a day.
Around four tables are required.
Oracle temp tables are NOT like SQL Server #temp tables. I can't see any reason to continuously drop and recreate the tables. The data is gone on a per-session basis anyway once the transaction or session completes (depending on the table's creation options). If you have multiple threads using the same DB session, they will see each other's data. If you have one session per thread, then the data is limited in scope as you mentioned. See example here.
If you drop a global temporary table and recreate it, that does not impact other database activity or server disk I/O, because global temporary tables live in the temp tablespace, where no archive logs are generated and no checkpoint updates the tempfile headers. The purpose of the temporary table is still properly served in this case.
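For reference, here is a minimal sketch of the two creation options mentioned above (table and column names are just placeholders):

-- Rows are removed at COMMIT: suits work that fits in a single transaction.
CREATE GLOBAL TEMPORARY TABLE stage_per_txn (
    id      NUMBER,
    payload VARCHAR2(100)
) ON COMMIT DELETE ROWS;

-- Rows survive COMMIT but vanish when the session ends; each session
-- still sees only the rows it inserted itself.
CREATE GLOBAL TEMPORARY TABLE stage_per_session (
    id      NUMBER,
    payload VARCHAR2(100)
) ON COMMIT PRESERVE ROWS;

The definition itself is permanent and shared; only the rows are private to each session, which is why there is no need to drop and recreate the tables between runs.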
I had a good discussion with one of my colleagues, and he mentioned that creating a temporary table degrades performance in Azure Synapse because Synapse creates the temporary table first on the master node and then distributes it to the child nodes. Is this true? He recommended creating a permanent table instead of a temporary table.
That’s not correct. Temp tables don’t necessarily funnel through the control node. Let’s say you are selecting from a table distributed on ProductKey and loading it into a #temp table distributed on ProductKey. The data will never leave each compute node since it’s a distribution compatible insert.
On the other hand, if you run a query that uses a ROW_NUMBER function, for example, that would have to be calculated on the control node and then the data would be sent back to the compute nodes to be stored in the distributed temp table. But that only happens in the presence of some types of functions and some types of queries. It is not the norm. If you are worried about a particular query then add the word EXPLAIN to the front of it and paste the explain plan XML into your question so we can help you interpret it.
If you load a #temp table with a SELECT INTO statement you can't specify the table geometry, so it will be a round-robin distributed columnstore. Usually this isn't ideal, since it takes extra time and memory to compress a columnstore, and round-robin distribution isn't ideal unless there is no good distribution key. Usually the next query that uses the round-robin distributed temp table will just reshuffle it, so it's best to hash-distribute the temp table properly from the start. To do this, use a CTAS statement as described here.
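As a rough sketch of that CTAS pattern (table, column, and distribution-key names here are only placeholders):

CREATE TABLE #SalesStage
WITH
(
    DISTRIBUTION = HASH(ProductKey),  -- match the downstream join/aggregation key
    HEAP                              -- skip columnstore compression for short-lived data
)
AS
SELECT ProductKey, OrderDateKey, SalesAmount
FROM dbo.FactSales
WHERE OrderDateKey >= 20230101;

Unlike SELECT INTO, this lets you choose the distribution up front, so the next query that joins on ProductKey avoids a reshuffle.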
I am using Cassandra as a NoSQL DB in my solution and have a data model with two tables: a parent table and a child table.
Here is the scenario:
Client A updates a record in the parent table as well as records in the child table.
At the same time, client B issues a select request (which hits both the parent and the child table).
Client B receives the latest record from the parent table but an older record from the child table.
I can use a logged batch operation to achieve atomicity when updating both tables, but I am not sure how to isolate or lock the read request from client B so as to avoid the dirty-read problem.
I have also tried evaluating lightweight transactions, but they don't seem to work in this case.
I am wondering whether I can use some middleware application to implement locking, since there seems to be nothing available in Cassandra out of the box.
Please help me understand how to achieve read/write synchronization in this regard.
As you mentioned, Cassandra provides only atomicity when you choose to batch. It does provide isolation when you use a single-partition batch, which unfortunately is not your case.
To respond to your question: if you really need transactions, I would think about the problem and possible solutions once again. Either eliminate the need for locking or change the technology stack.
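For completeness, a logged batch over the two tables (table and column names below are placeholders) gives you atomicity but, because two partitions are touched, no isolation; a concurrent read can still see the parent updated while the child is not yet updated:

BEGIN BATCH
    UPDATE parent SET status = 'UPDATED' WHERE parent_id = 42;
    UPDATE child  SET status = 'UPDATED' WHERE parent_id = 42 AND child_id = 7;
APPLY BATCH;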
We're exploring the possibility to use temporary views in spark and relate this to some actual file storage - or to other temporary views. We want to achieve something like:
1. Some user uploads data to some S3/HDFS file storage.
2. A (temporary) view is defined so that Spark SQL queries can be run against the data.
3. Some other temporary view (referring to some other data) is created.
4. A third temporary view is created that joins data from (2) and (3).
5. By selecting from the view in (4), the user gets a table that reflects the joined data from (2) and (3). This result can be further processed, stored in a new temporary view, and so on.
So we end up with a tree of temporary views, each querying its parent temporary views until they finally load data from the filesystem. Basically we want to store transformation steps (selecting, joining, filtering, modifying, etc.) on the data without storing new versions. The Spark SQL support and temporary views seem like a good fit.
We did some successful testing. The idea is to store the specification of these temporary views in our application and recreate them during startup (as temporary or global views).
Is this a viable solution? One problem is that we need to know how the temporary views are related (which one queries which). We create them like:
sparkSession.sql("select * from other_temp_view").createTempView(name)
So, when this is run, we have to make sure that other_temp_view has already been created in the session. I'm not sure how this can be achieved. One idea is to store a timestamp and recreate the views in the same order. This could be OK since our views will most likely have to be "immutable": we're not allowed to change a query that other queries rely on.
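For example (view names, paths, and columns below are just placeholders), the dependency order we would have to preserve looks roughly like this:

CREATE OR REPLACE TEMPORARY VIEW orders_raw AS
    SELECT * FROM parquet.`s3://some-bucket/orders/`;

CREATE OR REPLACE TEMPORARY VIEW customers_raw AS
    SELECT * FROM parquet.`s3://some-bucket/customers/`;

-- must be created only after the two views above exist
CREATE OR REPLACE TEMPORARY VIEW orders_enriched AS
    SELECT o.*, c.segment
    FROM orders_raw o
    JOIN customers_raw c ON o.customer_id = c.customer_id;

These statements would be issued through sparkSession.sql in the same way as above.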
Any thoughts would be most appreciated.
I would definitely go with the SessionCatalog object:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SessionCatalog.html
You can access it with spark.sessionState.catalog
I am not sure whether I understand the idea of reference tables correctly. In my mind it is a table that contains the same data in every shard. Am I wrong? I am asking because I have no idea how I should insert data into the reference table so that the data is replicated to every shard. Or maybe it is impossible? Can anyone clarify this issue?
Yes, the idea of a reference table is that the same data is contained on every shard. If you have a small number of shards and data changes are rare, you can open multiple connections in your application and apply the changes to multiple DBs simultaneously. Or you can construct a management script that periodically iterates across all shards to update reference data or performs a bulk insert of a fresh image of rows.
A feature currently previewing in Azure SQL Database, called Elastic DB Jobs, allows you to define a SQL script for operations that you want to run on all shards, and then runs the script asynchronously with eventual-completion guarantees. You can potentially use this to update reference tables. Details on the feature are here.
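For example, the per-shard script could simply replace the reference data with a fresh image of rows (table and column names here are only placeholders) and be applied identically to every shard, either by such a job or by a loop over your shard connections:

BEGIN TRANSACTION;
DELETE FROM dbo.CountryCodes;
INSERT INTO dbo.CountryCodes (CountryCode, CountryName)
VALUES ('US', 'United States'),
       ('DE', 'Germany'),
       ('JP', 'Japan');
COMMIT TRANSACTION;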
The attributes for the <jdbc:inbound-channel-adapter> component in Spring Integration include data-source, sql and update. These allow separate SELECT and UPDATE statements to be run against tables in the specified database. Both SQL statements will be part of the same transaction.
The limitation here is that both the SELECT and the UPDATE will be performed against the same data source. Is there a workaround for the case where the UPDATE will be on a table in a different data source (not just a separate database on the same server)?
Our specific requirement is to select rows in a table which have a timestamp prior to a specific time. That time is stored in a table in a separate data source. (It could also be stored in a file.) If both SQL statements used the same database, the <jdbc:inbound-channel-adapter> would work well for us out of the box. In that case, the SELECT could use the time stored, say, in table A as part of the WHERE clause of the query run against table B. The time in table A would then be updated to the current time, and all of this would be part of one transaction.
One idea I had was, within the sql and update attributes of the adapter, to use SpEL to call methods in a bean. The method defined for sql would look up a time stored in a file, and then return the full SELECT statement. The method defined for update would update the time in the same file and return an empty string. However, I don't think such an approach is failsafe, because the reading and writing of the file would not be part of the same transaction that the data source is using.
If, however, the update was guaranteed to fire only upon commit of the data source transaction, that would work for us. In the event of a failure, the database transaction would commit, but the file would not be updated. We would then get duplicate rows, but we should be able to handle that. The issue would be if the file was updated and the database transaction failed. That would mean lost messages, which we could not handle.
If anyone has any insights as to how to approach this scenario it is greatly appreciated.
Use two different channel adapters with a pub-sub channel, or an outbound gateway followed by an outbound channel adapter.
If necessary, start the transaction(s) upstream of both; if you want true atomicity you would need to use an XA transaction manager and XA datasources. Or, you can get close by synchronizing the two transactions so they get committed very close together.
See Dave Syer's article "Distributed transactions in Spring, with and without XA" and specifically the section on Best Efforts 1PC.
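As a rough sketch of the first option, in the same XML namespace style as the question (bean names, channel names, and queries are placeholders, namespace declarations are omitted, and the cutoff lookup itself is simplified away), the inbound adapter polls data source A while a second subscriber on the pub-sub channel runs the UPDATE against data source B:

<int:publish-subscribe-channel id="polledRows"/>

<!-- Polls data source A inside a transaction and publishes the selected rows. -->
<int-jdbc:inbound-channel-adapter channel="polledRows"
        data-source="dataSourceA"
        query="SELECT * FROM events WHERE processed = 0">
    <int:poller fixed-delay="5000">
        <int:transactional transaction-manager="transactionManagerA"/>
    </int:poller>
</int-jdbc:inbound-channel-adapter>

<!-- First subscriber: whatever actually processes the selected rows. -->
<int:service-activator channel="polledRows" ref="rowProcessor" method="process"/>

<!-- Second subscriber: the UPDATE against data source B (Best Efforts 1PC;
     not truly atomic with the poller's transaction unless XA is used). -->
<int-jdbc:outbound-channel-adapter channel="polledRows"
        data-source="dataSourceB"
        query="UPDATE cutoff_times SET cutoff = CURRENT_TIMESTAMP WHERE name = 'events'"/>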