I would like to execute over a hundred user-defined type (UDT) statements, which are encapsulated in a .cql file.
Every time I execute the .cql file for new cases, I find that many of the statements within it get skipped.
Therefore, I would like to know whether there are any performance issues with executing hundreds of statements composed in a .cql file.
Note: I am executing the .cql files from a Python script via the os.system method.
The time taken to execute hundreds of DDL statements via code (or a .cql file/cqlsh) is proportional to the number of nodes in the cluster. In a distributed system like Cassandra, all nodes have to agree on each schema change, and the more nodes there are, the longer schema agreement takes.
There is essentially a timeout value, maxSchemaAgreementWaitSeconds, which determines how long the coordinator node will wait before replying to the client. The typical schema deployment involves one or two tables, and the default value for this parameter works just fine.
In the special case of many DDL statements executed at once via code/cqlsh, it's better to increase the value of maxSchemaAgreementWaitSeconds, say to 20 seconds. Schema deployment will take a little longer, but it makes sure the deployment succeeds.
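As a minimal sketch, assuming the DataStax Python driver (cassandra-driver), which exposes this timeout as max_schema_agreement_wait; the file and keyspace names are illustrative:

from cassandra.cluster import Cluster

# Wait up to 20s (instead of the default 10s) for all nodes to agree
# on each schema change before giving up.
cluster = Cluster(['127.0.0.1'], max_schema_agreement_wait=20)
session = cluster.connect()

# Run the UDT/DDL statements one by one from the driver instead of
# shelling out to cqlsh via os.system. The split on ';' is naive and
# assumes no semicolons inside string literals.
with open('types.cql') as f:  # file name is illustrative
    for stmt in f.read().split(';'):
        if stmt.strip():
            session.execute(stmt)

This also surfaces per-statement errors in Python, making skipped statements visible rather than silently lost behind os.system.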
Java reference
Python reference
Cassandra doesn't guarantee the particular order in which statements are executed.
For instance, the statements below do not execute in order.
INSERT INTO channel
JSON '{"cuid":"NQAA0WAL6drA"
,"owner":"123"
,"status":"open"
,"post_count":0
,"mem_count":1
,"link":"FWsA609l2Og1AADRYODkzNjE2MTIyOTE="
,"create_at":"1543328307953"}';
BEGIN BATCH
UPDATE channel
SET title = ? , description = ? WHERE cuid = ? ;
INSERT INTO channel_subscriber
JSON '{"cuid":"NQAA0WAL6drA"
,"user_id":"123"
,"status":"subscribed"
,"priority":"owner"
,"mute":false
,"setting":{"create_at":"1543328307956"}}';
APPLY BATCH ;
According to system_traces.sessions, each of them is received by a different node.
Sometimes the started_at times of both queries are equal (to the millisecond), and sometimes the started_at time of the second query is earlier than that of the first.
This ruins the ordering of the statements and of the data.
We use Erlang with the marina driver, the consistency level is QUORUM, and the clocks of all Cassandra nodes and of the application server are synchronized.
How can I force Cassandra to execute queries in order?
Because of Cassandra's distributed nature, queries can be received by different nodes, and depending on the load on a particular node, a query sent later may be executed earlier. In your case, you can put the first insert into the batch itself. Or, as is implemented in some drivers (for example, the Java driver), use a whitelist policy to send all queries to a single node, although that node then becomes a bottleneck (and I'm really not sure that your driver has such functionality).
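Your driver is marina (Erlang), but as a sketch of the first suggestion using the DataStax Python driver: all statements in one logged batch are applied with a single timestamp, so one write can no longer be ordered before another (keyspace name and values are illustrative and shortened):

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')  # keyspace name is illustrative

# Moving the channel INSERT into the same batch as the dependent
# statements means it can no longer "lose the race" against them.
batch = BatchStatement()
batch.add(SimpleStatement(
    "INSERT INTO channel JSON "
    "'{\"cuid\": \"NQAA0WAL6drA\", \"owner\": \"123\", \"status\": \"open\"}'"))
batch.add(SimpleStatement(
    "UPDATE channel SET title = %s, description = %s WHERE cuid = %s"),
    ("a title", "a description", "NQAA0WAL6drA"))
batch.add(SimpleStatement(
    "INSERT INTO channel_subscriber JSON "
    "'{\"cuid\": \"NQAA0WAL6drA\", \"user_id\": \"123\", \"status\": \"subscribed\"}'"))
session.execute(batch)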
Is there any way to execute VoltDB stored procedures at a regular interval, or to schedule a stored procedure to run at a specific time?
I am exploring VoltDB to shift our product from an RDBMS to VoltDB. Our product is written in Java.
Most of the queries can be migrated into VoltDB stored procedures. But in our product we have cron jobs in Oracle which execute at regular intervals, and I cannot find such a feature in VoltDB.
I know VoltDB stored procedures can be called from the application at a regular interval, but our product deploys in an active-active mode, so every application instance would call the stored procedure at that interval, which is not a good solution; otherwise, we would have to develop some mechanism to run the procedure from one instance only.
So it would be good to have a cron-job feature in VoltDB.
I work at VoltDB. There isn't currently a feature like this in VoltDB, analogous to DBMS_JOB in Oracle.
You could certainly use a cron job on one of the servers in your cluster, or on some other server within your network, that invokes sqlcmd to run a script, or echoes individual SQL statements or "execute procedure" commands through sqlcmd to the database. Making cron jobs highly available is a general problem; you might find these other discussions helpful:
How to convert Linux cron jobs to "the Amazon way"?
https://www.reddit.com/r/linuxadmin/comments/3j3bz4/run_cronjob_only_on_one_node_in_cluster/
You could also look into something like rcron.
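For instance, a minimal sketch of the job body such a cron entry could run, piping an execute-procedure command through sqlcmd as described above (the procedure name and server address are hypothetical):

import subprocess

# Equivalent to: echo "exec PurgeOldRows;" | sqlcmd --servers=voltdb-host
# PurgeOldRows and voltdb-host are hypothetical.
subprocess.run(
    ["sqlcmd", "--servers=voltdb-host"],
    input=b"exec PurgeOldRows;\n",
    check=True,
)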
One thing to be careful of when converting from an RDBMS to VoltDB is that VoltDB is optimized for processing many small transactions in parallel across many partitions. While the architecture of serialized execution per partition excels for many operational and streaming workloads, it is not designed to perform bulk operations on many rows at a time, especially transactions that need to perform writes on many rows that may be in different partitions within one transaction.
If you have a periodic job that does something like "process all the new rows that meet some criteria" you may find this transaction is slow and every time it runs it could delay other parts of the workload, especially if many rows have accumulated. It would be more the "VoltDB Way" to replace a simple INSERT statement that you may be using to ingest data (to be processed later by a scheduled job) with a procedure that inserts and immediately processes the row of data. You might even need a procedure that checks for other records and processes small sets of rows as a group, for example stitching together segments of data that go together but may have arrived out of order. By operating on fewer records at a time within one partition at a time, this type of procedure would be more scalable and would keep the data closer to your desired finished state in real time, rather than always having some data waiting to be processed.
Is it possible to run a table-to-table mapping scenario in parallel (multi-threaded)?
We have a huge table, and we have already created a table mapping and a scenario on that mapping.
We are also executing it from a load plan.
But is there a way to run the scenario in multiple threads to make the data transfer faster?
I am using Groovy to script all these tasks.
It would be even better if there were some way to script this in Groovy.
A load plan with parallel steps, or a package with scenarios in asynchronous mode, will do for the parallelism part.
An issue you might run into, depending on which KMs are used, is that the same name will be used for the temporary tables in all mappings. To avoid that, select the "Use Unique Temporary Object Names" checkbox that appears in the Physical tab of your mapping. It will generate a different name for these objects for each execution.
It is possible on the ODI side, but you may need some modifications to the mapping so that no duplicate data is loaded. We have a similar flow where we use a modulo function on a numeric key to split the source data into partitions. This data then gets loaded into the target.
To run this interface in a multi-threaded way, we have a package with a loop that asynchronously executes the scenario of this mapping, passing a MODULO_VALUE variable to each execution.
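As a generic sketch of that modulo-split idea (not ODI- or Groovy-specific; the table name, key column, and partition count are all illustrative):

from concurrent.futures import ThreadPoolExecutor

N_PARTITIONS = 8  # illustrative; corresponds to the loop count in the package

def load_partition(modulo_value: int) -> None:
    # Each worker selects only the rows of its own partition, mirroring
    # the MODULO_VALUE variable passed to each scenario execution.
    query = (
        "SELECT * FROM source_table "
        f"WHERE MOD(numeric_key, {N_PARTITIONS}) = {modulo_value}"
    )
    # ... execute `query` against the source and load the rows into the target

with ThreadPoolExecutor(max_workers=N_PARTITIONS) as pool:
    pool.map(load_partition, range(N_PARTITIONS))

Because each row lands in exactly one partition, the threads never load duplicate data.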
For loading the data we use the Oracle SQL*Loader utility, which is able to load data into one target table in parallel. I am not sure whether the Data Pump utility also has this ability, but I do know that if you try to load the data via SQL with a multi-threaded approach, you can get an "ORA-00054: resource busy and acquire with NOWAIT specified" error.
As you can see, there is no Groovy code involved in this flow; it is all handled by ODI mappings, packages and KMs. I hope this helps.
I have around 100 threads running in parallel and dumping data into a single table using an sqlldr control (.ctl) file. The control file generates values for ID using the expression ID SEQUENCE(MAX,1).
The process fails to load the files properly due to the parallel execution: two or more threads may get the same ID. It works fine when I run it sequentially with a single thread.
Please suggest a workaround.
Each CSV file contains data associated with a test case, and the cases are supposed to run in parallel. I cannot concatenate all the files into one.
You could load the data first and then run a separate update that populates ID from a traditional Oracle sequence, which is safe under concurrency.
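A minimal sketch of that approach, assuming the python-oracledb driver; the table, sequence, and connection details are illustrative:

import oracledb

# Connection details are illustrative.
conn = oracledb.connect(user="loader", password="secret", dsn="localhost/XEPDB1")
cur = conn.cursor()

# Load the rows via sqlldr without SEQUENCE(MAX,1), leaving ID null,
# then assign IDs in one pass from a real Oracle sequence; NEXTVAL can
# never hand the same value to two sessions.
cur.execute("UPDATE target_table SET id = id_seq.NEXTVAL WHERE id IS NULL")
conn.commit()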
I noticed that if I have a Java method containing a PreparedStatement using the JDBC driver that comes with Cassandra, it is always slow. But if I put the same query twice in the method, the second time it is 20x faster. Why is that? I would think the second, third, and fourth times I call the Java method it would be faster than the first. I am using Cassandra 1.2.5. I have also cached 100 MB of rows in the row cache and set the table to caching = "all". In cassandra-cli I verified these settings, and I also verified that the second, third, and fourth times I fetch rows from the same table that I run the JDBC calls against, I get faster response times.
Any ideas?
Thanks,
-Tony
From the all-knowing CQL3 documentation (always a great starting point, by the way):
Prepared statement is an optimization that allows to parse a query only once but execute it multiple times with different concrete values.
The statement gets cached, and that is the difference-maker you are experiencing. Prepared statements also get pre-compiled, which typically means an execution plan is prepared before the query is run against the database. Knowing the work in advance makes the process faster.
On the first run, your prepared statement is cached in case you run the same query again, which you do, and since it's cached, the query executes much faster.
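To illustrate the prepare-once, execute-many pattern (shown here with the DataStax Python driver rather than the JDBC driver from the question; the table, column, and values are illustrative):

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')  # keyspace name is illustrative

# The parse/plan cost is paid once, here...
prepared = session.prepare("SELECT * FROM my_table WHERE id = ?")

# ...and each later execution only sends the bound values, reusing the
# cached plan, which is why repeated executions are so much faster.
for key in ("key1", "key2"):
    rows = session.execute(prepared, [key])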