I am going through the self-paced training provided by DataStax and have downloaded the CCM tool to get started, but in my list of keyspaces I cannot see "MusicDB", which is used for training purposes. Has anyone faced the same issue? If so, please help.
CCM is the Cassandra Cluster Manager, not the MusicDB sample database. I am unable to find the import script for the MusicDB sample collection, but if you use the setup on the virtual machine that can be downloaded here (after logging in), you could export the data if you really wanted to use it in your own instance.
Pretty new to Databricks.
I've got a requirement to access data in the Lakehouse using a JDBC driver. This works fine.
I now want to stub the Lakehouse using a docker image for some tests I want to write. Is it possible to get a Databricks / spark docker image with a database in it? I would also want to bootstrap the database on startup to create a bunch of tables.
No - Databricks is not a database but a hosted service (PaaS). You could theoretically use OSS Spark with the Thrift server started on it, but the connection strings and other functionality would be very different, so it makes no sense to spend time on it (imho). The real solution would depend on the type of tests that you want to do.
Regarding bootstrapping the database and creating a bunch of tables - just issue commands such as create database if not exists or create table if not exists when your application starts up (see the documentation for the exact syntax).
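The startup bootstrap described above could be sketched like this. It is a minimal sketch, assuming a DB-API style cursor (for example one obtained from whatever JDBC/ODBC bridge the application already uses); the names bootstrap_statements, bootstrap, test_db and the table definitions are hypothetical and only illustrate the idea.

```python
def bootstrap_statements(database, tables):
    """Build idempotent DDL statements.

    `tables` maps a table name to its column definitions, e.g.
    {"users": "id INT, name STRING"}. Identifiers are interpolated
    directly, so only pass trusted, hard-coded names.
    """
    statements = [f"CREATE DATABASE IF NOT EXISTS {database}"]
    for name, columns in tables.items():
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {database}.{name} ({columns})"
        )
    return statements


def bootstrap(cursor, database, tables):
    """Run the DDL on any cursor object that exposes execute(sql)."""
    for statement in bootstrap_statements(database, tables):
        cursor.execute(statement)


if __name__ == "__main__":
    # Hypothetical schema, printed instead of executed.
    for sql in bootstrap_statements("test_db", {"users": "id INT, name STRING"}):
        print(sql)
```

Because every statement uses IF NOT EXISTS, calling bootstrap on each application start is safe: a second run leaves existing databases and tables untouched.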
We have a requirement to replicate a Cassandra cluster with existing nodes and existing data in it. Approximately 2.5 TB of data is on Azure and 3.5 TB on AWS. We need to pull the remaining data from AWS to Azure. Your kind help is appreciated.
There are many options here.
You can connect the two using GPFS (GossipingPropertyFileSnitch): stand up a DC in Azure, replicate across, then remove the old DC.
You could unload the data via the Cassandra loader. https://github.com/brianmhess/cassandra-loader
You could take a snapshot and then stream the data to the new cluster via sstableloader.
It's hard to give a complete answer - it would depend on so many factors. The above should get you started at least.
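As a rough illustration of the first and third options, the helpers below only build the CQL statement and command lines involved; they do not run anything. This is a sketch under assumptions: the keyspace name, data center names, replica counts, host addresses, snapshot tag and path are all hypothetical, and the right values depend on your cluster.

```python
def alter_keyspace_cql(keyspace, replicas_per_dc):
    """CQL that makes a keyspace replicate to every listed data center."""
    opts = ", ".join(f"'{dc}': {n}" for dc, n in replicas_per_dc.items())
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {opts}}};"
    )


def rebuild_command(source_dc):
    """Run on each node of the new DC to stream existing data from source_dc."""
    return ["nodetool", "rebuild", "--", source_dc]


def snapshot_command(keyspace, tag):
    """Take a named snapshot of one keyspace on the node where it is run."""
    return ["nodetool", "snapshot", "-t", tag, keyspace]


def sstableloader_command(target_hosts, table_dir):
    """Stream the SSTables in table_dir to the cluster behind target_hosts."""
    return ["sstableloader", "-d", ",".join(target_hosts), table_dir]


if __name__ == "__main__":
    # Hypothetical names and addresses, for illustration only.
    print(alter_keyspace_cql("my_ks", {"aws": 3, "azure": 3}))
    print(" ".join(rebuild_command("aws")))
    print(" ".join(snapshot_command("my_ks", "migration")))
    print(" ".join(sstableloader_command(["10.0.0.1", "10.0.0.2"], "/path/to/my_ks/my_table")))
```

Note that sstableloader expects the directory path to end in keyspace/table, so snapshot files usually need to be copied into a directory with that layout first.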
I am working with a cluster I created with ccm. We are using 3 tables in 2 keyspaces, so 6 tables in total. My problem is that it let me create one table in one keyspace and 2 in the other, but even when I removed my IF NOT EXISTS check, it would give me an error saying the table already exists. It seems that the create is ignoring the fact that these are supposed to be in 2 separate keyspaces.
These are the same CQL script files that we run against our dev cloud Cassandra cluster, so I know it's not an issue with the scripts. That, and the create statements are pretty simple and straightforward.
So does CCM only support one keyspace? If so, that seems like a pretty big limitation and makes it much much less useful, if we can even use it at all for our local dev and testing purposes.
Thanks!
The answer to your question is: No, CCM doesn't support only one keyspace.
CCM doesn't have any restrictions at all built into it. Under the covers it is just a set of python scripts for configuring and launching a cassandra cluster on a single machine.
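Since CCM itself is not the limit, one thing worth ruling out is CREATE TABLE statements that omit the keyspace prefix and therefore land in whichever keyspace the session last selected with USE. That cause is an assumption, since the original scripts are not shown. A small, hypothetical check for it might look like this:

```python
import re

# Matches CREATE TABLE [IF NOT EXISTS] <name>, capturing the table name.
CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?([A-Za-z0-9_\".]+)",
    re.IGNORECASE,
)


def unqualified_tables(cql_script):
    """Return table names from CREATE TABLE statements with no keyspace prefix."""
    names = CREATE_TABLE.findall(cql_script)
    return [name for name in names if "." not in name]


if __name__ == "__main__":
    # Hypothetical script: the second statement relies on a prior USE.
    script = """
    CREATE TABLE IF NOT EXISTS ks_one.users (id uuid PRIMARY KEY);
    CREATE TABLE events (id uuid PRIMARY KEY);
    """
    print(unqualified_tables(script))
```

Any name it reports depends on the session's current keyspace, which can differ between a local ccm cluster and a shared dev cluster depending on how the scripts are executed.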
Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try out my hands with Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, then I want to process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only Python boilerplate is supported? There are a few pieces of JNI as well beneath my Java API.
Can I perform the batch operation with the Bluemix Spark service, or is it just for interactive purposes?
Can I create something like a pipeline (output of one stage goes to another) with Bluemix, or do I need to code for it?
I will appreciate any and all help coming my way with respect to above queries.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark service is now available, and it allows you to submit Java code or a batch program with spark-submit, alongside the notebook interface for both Python and Scala.
Earlier, the beta was limited to the interactive notebook interface.
Regards
Anup
I am currently using Achilles Embedded to spin up a local, temporary Cassandra instance and test my functionality there. While this is working to some extent, there seems to be a memory leak: the more tests I run, the more I see messages like PS Scavenge GC in xx ms, and my system slows to a crawl, even freezing the mouse pointer.
So, is there a better way to automatically spin up a small Cassandra instance to run my tests against?
The tool I use for quickly creating a local Cassandra cluster is the ccm (Cassandra Cluster Manager) utility. You can easily create a multi-node cluster on your local machine for any release. See more information here.
I believe some of the Cassandra developers use ccm for their development work, so ccm is kept up to date with the newest releases.
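For use from a test harness, the ccm invocations can be wrapped in a few helpers. This is a minimal sketch, assuming ccm is installed and on the PATH; the cluster name, Cassandra version and node count below are hypothetical placeholders.

```python
import subprocess


def ccm_create_command(cluster_name, version, nodes, start=True):
    """Build `ccm create <name> -v <version> -n <nodes> [-s]`."""
    command = ["ccm", "create", cluster_name, "-v", version, "-n", str(nodes)]
    if start:
        command.append("-s")  # start the nodes right after creating them
    return command


def ccm_remove_command(cluster_name):
    """Build the command that stops and deletes the named cluster."""
    return ["ccm", "remove", cluster_name]


def run(command):
    """Execute a command, raising CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Printed rather than executed; call run(...) to actually launch ccm.
    print(" ".join(ccm_create_command("test_cluster", "3.11.4", 1)))
    print(" ".join(ccm_remove_command("test_cluster")))
```

Running the create command once before the test suite and the remove command afterwards keeps Cassandra in its own process, outside the test JVM.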
I agree, you can use CCM. If you have a test cluster, try using the cassandra-stress tool (either standalone or with a YAML profile). If I understand your question correctly, it will solve your problem.