Is it possible to run more than one Cassandra query from a single CQL file?
So that if I share that file, others can run it to replicate the database on all their systems.
The easiest way is to pass the file containing CQL statements either to cqlsh (using the -f option) or to DevCenter.
If you are using Java, the Achilles framework has a class called ScriptExecutor that you can use to run CQL statements from a file and even plug in parameters to dynamically change the statements during execution.
ScriptExecutor documentation
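If you are in Python rather than Java, a rough equivalent is to read the file yourself and feed each statement to the DataStax Python driver. This is only a minimal sketch; the file name, the contact point, and the assumption that statements are semicolon-terminated (with no semicolons inside string literals) are all mine:

from cassandra.cluster import Cluster

# naive split on ";" -- assumes no semicolons appear inside string literals
with open("statements.cql") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

cluster = Cluster(["127.0.0.1"])   # contact point is an assumption
session = cluster.connect()
for stmt in statements:
    session.execute(stmt)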
I am trying to run a Hive query with pyspark. I am using Hortonworks so I need to use the Hive WarehouseConnector.
Running one or even multiple queries is easy and works. My problem is that I want to issue set commands beforehand, for instance to set the DAG name in the Tez UI (set hive.query.name=something relevant) or to set up some memory configuration (set hive.tez.container.size=8192). For these statements to take effect, they need to run in the same session as the main query, and that's my issue.
I tried 2 ways:
The first one was to generate a new Hive session for each query, with a properly set up URL, e.g.:
from pyspark_llap import HiveWarehouseSession  # Hive Warehouse Connector entry point

url = 'jdbc:hive2://hiveserver:10000/default?hive.query.name=relevant'
builder = HiveWarehouseSession.session(self.spark)
builder.hs2url(url)
hive = builder.build()
hive.execute("select * from whatever")
It works well for the first query, but the same URL is reused for the next one (even if I try to manually delete builder and hive), so this does not work.
The second way is to set spark.sql.hive.thriftServer.singleSession=true globally in the spark thrift server. This does seem to work, but I find it a shame to limit the global spark thrift server for the benefit of one application only.
Is there a way to achieve what I am looking for? Maybe there could be a way to pin a query to one executor, so hopefully one session?
This has been a big peeve of mine... still is, actually.
The solution that resolved this issue for me was putting all the queries in a query file, with each query separated by a semicolon. Then I run that file using beeline from within a Python script.
Unfortunately, it does not work with queries that return results; it is only suitable for set, overwrite, and insert kinds of queries.
In case you might have discovered a more efficient way to do this, please do share.
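For illustration, a minimal sketch of that beeline-from-Python approach; the JDBC URL, the username, and the query file name are all assumptions:

import subprocess

# queries.hql holds the set statements and the main query, separated by semicolons,
# so they all run in the same beeline session
subprocess.run(
    ["beeline", "-u", "jdbc:hive2://hiveserver:10000/default", "-n", "user", "-f", "queries.hql"],
    check=True,
)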
Description
In MySQL world, one can create an SQL file and execute it from the command line.
mysql -h hostname -u user database < path/to/sqlfile.sql
This is especially useful for test data.
I've checked the Arangodb documentation and the best I can see for putting in test data is this from the cookbook
https://docs.arangodb.com/3.2/Cookbook/AQL/CreatingTestData.html
Is it possible to write an AQL file and execute it from the command line, as with MySQL?
In contrast to MySQL's SQL, which contains both DML and DDL language elements, AQL by definition only contains DML statements. Therefore, "executing AQL" is most probably not what suits your needs.
arangosh can be used to read files from the filesystem (for example via arangosh --javascript.execute yourScript.js), which you could then use to send AQL queries or to create collections and indices.
If you want a reproducible state, you could create a data set that you like (maybe by creating a CSV and importing it with arangoimp), create indices and so on, then dump your database using arangodump, and for setting up your SUT use arangorestore.
There are also tools by community members, Migrant Verde and ArangoMiGO, that enable you to do schema evolution etc., which may be of interest for this purpose too.
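If you would rather drive this from a script than from arangosh, a rough sketch with the python-arango driver might look like the following; the endpoint, credentials, collection name, and documents are all made up:

from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")      # assumed endpoint
db = client.db("_system", username="root", password="")   # assumed credentials

# create a collection, load a couple of test documents, then query them with AQL
if not db.has_collection("users"):
    db.create_collection("users")
db.collection("users").insert_many([{"name": "alice"}, {"name": "bob"}])
print(list(db.aql.execute("FOR u IN users RETURN u.name")))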
I'm using cqlengine from Django. Is there a way to run DESCRIBE KEYSPACES from it? It works from cqlsh.
I couldn't find anything in the docs.
Your best bet is getting it from the cluster metadata. DESCRIBE is not part of CQL, but cqlsh uses the Python driver just like cqlengine, so you can use the same mechanism:
https://github.com/apache/cassandra/blob/16490a48b02b6f206a78717e9b816983f0b76bb1/bin/cqlsh.py#L619
map(str, yourconnection.metadata.keyspaces.keys())
From the metadata you can collect most of the data you want like this. You can also query the system schema tables, but those change a bit between versions, so I would recommend letting the driver do it for you.
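Spelled out a bit more, a small sketch with the plain Python driver; the contact point is an assumption, and with cqlengine you would grab the already-registered cluster/session instead of building a new one:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # contact point is an assumption
session = cluster.connect()

# roughly what DESCRIBE KEYSPACES prints in cqlsh
print(list(cluster.metadata.keyspaces.keys()))

# per-keyspace detail, e.g. the tables in each keyspace
for name, ks in cluster.metadata.keyspaces.items():
    print(name, list(ks.tables.keys()))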
I am looking at exploring optimization of Cassandra for a limited set of commands. For that I wanted to know which among SELECT, INSERT, UPDATE, DELETE & BATCH is the CQL command with the highest frequency of use in real-time systems. Any pointers or thoughts on this would be a great help.
There is no such thing as a set of common CQL commands; it all depends on the use case for which Cassandra is deployed.
So instead of optimizing individual commands, you could go for use-case-based optimization.
E.g., use case: a write-oriented workload:
Optimize the INSERT and UPDATE commands.
I'm looking for a tool to load CSV into Cassandra. I was hoping to use RazorSQL for this but I've been told that it will be several months out.
What is a good tool?
Thanks
1) If you have all the data to be loaded in place, you can try the sstableloader utility (Cassandra 0.8.x onwards only) to bulk load the data. For more details see: cassandra bulk loader
2) Cassandra has introduced BulkOutputFormat for bulk loading data into Cassandra with a Hadoop job in the latest versions, i.e. cassandra-1.1.x onwards.
For more details see: Bulkloading to Cassandra with Hadoop
I'm dubious that tool support would help a great deal with this, since a Cassandra schema needs to reflect the queries that you want to run, rather than just being a generic model of your domain.
The built-in bulk loading mechanism for cassandra is via BinaryMemtables: http://wiki.apache.org/cassandra/BinaryMemtable
However, whether you use this or the more usual Thrift interface, you still probably need to manually design a mapping from your CSV into Cassandra ColumnFamilies, taking into account the queries you need to run. A generic CSV -> Cassandra mapping may not be appropriate since secondary indexes and denormalisation are commonly needed.
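As a rough illustration of such a hand-rolled mapping using today's Python driver (the keyspace, table, columns, and CSV layout are all invented for the example):

import csv
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])    # contact point is an assumption
session = cluster.connect("demo")   # assumed keyspace

# the partition key (email here) should be chosen to match the queries you plan to run
insert = session.prepare("INSERT INTO users (email, name, city) VALUES (?, ?, ?)")
with open("users.csv", newline="") as f:
    for row in csv.DictReader(f):
        session.execute(insert, (row["email"], row["name"], row["city"]))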
For Cassandra 1.1.3 and higher, there is the CQL COPY command, which is available for importing (or exporting) data to (or from) a table. According to the documentation, if you are importing roughly fewer than 2 million rows, then this is a good option. It is much easier to use than the sstableloader and less error prone. The sstableloader requires you to create strictly formatted .db files, whereas the CQL COPY command accepts a delimited text file. Documentation here:
http://www.datastax.com/docs/1.1/references/cql/COPY
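For example, COPY can also be scripted from Python by shelling out to cqlsh; the keyspace, table, column, and file names are made up, and this assumes a cqlsh recent enough to have the -e flag:

import subprocess

# COPY is a cqlsh command rather than a server-side CQL statement, so run it through cqlsh
copy_stmt = "COPY demo.users (id, name) FROM 'users.csv' WITH HEADER = true;"
subprocess.run(["cqlsh", "hostname", "-e", copy_stmt], check=True)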
For larger data sets, you should use the sstableloader: http://www.datastax.com/docs/1.1/references/bulkloader. A working example is described here: http://www.datastax.com/dev/blog/bulk-loading.