I have to create a report from the data present in a Cassandra DB. I have more of an RDBMS background and am pretty new to Cassandra. In an RDBMS that task is quite easy: you can create complex queries with joins and build the report. But how can we achieve the same thing in Cassandra?
You can use the DataStax Bulk Loader (DSBulk) tool to export data from a Cassandra table.
However, it's not a BI tool, and there are no joins in Cassandra, so you can't do complex queries. Once you've exported a table to a CSV file, you can manipulate it whichever way you want.
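For example, a minimal unload to CSV might look like the following (the host, keyspace, table, and output path here are just placeholders):
dsbulk unload -h 127.0.0.1 -k my_keyspace -t my_table -url /path/to/export_dir
The output is CSV by default, and the files in the output directory can then be fed into whatever reporting tool you use.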
Here are some references with examples to help you get started quickly:
Blog - Introduction to DSBulk
Blog - DSBulk Unloading examples
Docs - More data export examples
DSBulk is open-source so it's free to use. Cheers!
I am very new to Cassandra and any help here would be appreciated. I have a cluster of 6 nodes that spans 2 datacenters (3 nodes in each datacenter). My client has decided that they do not want to renew their Cassandra license with DataStax anymore and want their data exported into a format that can be easily imported into another database in the future. I was thinking of exporting the data as a CSV file, but since the data is distributed across all the nodes, I am not sure what the best way to export all the data is.
One option - you should be able to use the CQL COPY command, which copies the data into CSV format. The nice thing about COPY is that you can run it from a single node (i.e. it is not a "node"-level tool). The command would be (once in cqlsh):
cqlsh> COPY <keyspace>.<table> TO '/path/to/file'
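For instance, a hedged example that exports just a couple of columns with a header row (keyspace, table, column names, and path are placeholders):
cqlsh> COPY my_keyspace.my_table (id, name, created_at) TO '/path/to/my_table.csv' WITH HEADER = true;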
If there is a LOT of data, or a lot of tables, this tool may not be a great fit. But for a small number of tables that don't have huge row counts (less than several million), this works well. Hope that helps.
-Jim
Since 2018 you can use DSBulk with DSE to export or import data to/from CSV (by default), or JSON. Since the end of 2019 it's possible to use it with open source Cassandra as well.
It could be as simple as:
dsbulk unload -k keyspace -t table -u user -p password -url filename
DSBulk is heavily optimized for fast data export without putting too much load on the coordinator node, which is what happens when you just run select * from table.
You can control which columns to export, and even provide your own query, etc. (see the sketch after the list below). The DataStax blog has a series of posts about different aspects of using DSBulk:
Introduction and Loading
More Loading
Common Settings
Unloading
Counting
Examples for Loading From Other Locations
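As a rough sketch of the custom-query option mentioned above (keyspace, table, and column names are placeholders), you can hand DSBulk the SELECT yourself so that only the projected columns get unloaded:
dsbulk unload -h 127.0.0.1 -query "SELECT id, name FROM my_keyspace.my_table" -url /path/to/export_dir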
You can use the CQL COPY command for exporting data from a Cassandra cluster. However, it only performs well for small data sets; if you have a big volume of data, this command is not useful because it will run into errors or timeout issues. Alternatively, you can use sstabledump and export each node's data into JSON format. Hope this is useful for you.
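A hedged sketch of the sstabledump route (the data path and SSTable file name are placeholders and will differ per node, table, and Cassandra version): flush the memtables first, then dump each SSTable to JSON.
nodetool flush my_keyspace my_table
sstabledump /var/lib/cassandra/data/my_keyspace/my_table-<table_id>/md-1-big-Data.db > my_table_node1.json
Each node only holds its own replicas, so you would repeat this on every node and de-duplicate the results afterwards.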
I have implemented a small script for this purpose. It isn't the best way, since it is slow and, in my experience, produces connection errors on system tables. But it could be useful for inspecting Cassandra on small datasets: https://github.com/kirillt/cassandra-utils
I wanted to know what data sources can be called 'smart' in Spark. As per the book "Mastering Apache Spark 2.x", a data source can be called smart if Spark can process data at the data source side, for example JDBC sources.
I want to know if MongoDB, Cassandra and Parquet could be considered smart data sources as well?
I believe those can be smart data sources as well. Slides 41 to 42 of this Databricks presentation (https://www.slideshare.net/databricks/bdtc2) mention smart data sources and show logos that include those sources. Note that the MongoDB logo isn't there, but I believe it supports the same thing (https://www.mongodb.com/products/spark-connector, see the section "Leverage the Power of MongoDB").
I was also able to find some information supporting that MongoDB is a smart data source, since it's used as an example in the "Mastering Apache Spark 2.x" book:
"Predicate push-down on smart data sources Smart data sources are those that support data processing directly in their own engine-where the data resides--by preventing unnecessary data to be sent to Apache Spark.
On example is a relational SQL database with a smart data source. Consider a table with three columns: column1, column2, and column3, where the third column contains a timestamp. In addition, consider an ApacheSparkSQL query using this JDBC data source but only accessing a subset of columns and rows based using projection and selection. The following SQL query is an example of such a task:
select column2,column3 from tab where column3>1418812500
Running on a smart data source, data locality is made use of by letting the SQL database do the filtering of rows based on the timestamp and the removal of column1. Let's have a look at a practical example of how this is implemented in the Apache Spark MongoDB connector."
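To make the pushdown idea concrete, here is a minimal, hedged PySpark sketch against a generic JDBC source (the connection URL, credentials, and table name are assumptions, and the appropriate JDBC driver is assumed to be on the classpath). The explain() output should show the filter and the pruned column list being pushed to the database:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

# Read the table through the JDBC data source (a "smart" source).
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # placeholder connection details
      .option("dbtable", "tab")
      .option("user", "user")
      .option("password", "password")
      .load())

# Projection and selection, mirroring the book's query: only column2 and
# column3 are requested, and the filter on column3 can be evaluated by
# the database itself.
result = df.select("column2", "column3").filter("column3 > 1418812500")

# The physical plan should list PushedFilters and the pruned columns,
# confirming that the work happens at the data source, not in Spark.
result.explain()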
In a project that I am working on, we are planning to migrate from Cassandra to another technology.
The problem is how to get all of the data out of Cassandra (we are talking about 4M-8M records).
My thought was to export the data to a CSV file and then import it into the other database.
To export data to CSV you can start with the COPY TO command; if that does not work, then a simple Java program like the one described in this can help you with a bigger set of data.
But more importantly, you should understand the data model of the other technology before importing data into it; you may need to change your data model.
You can also look at other tools like https://github.com/brianmhess/cassandra-loader. I have imported/exported data on the order of hundreds of millions of rows using this application.
I have to change the schema of one of my tables in Cassandra. It cannot be done by simply using the ALTER TABLE command, because there are some changes in the primary key.
So the question is: How to do such a migration in the best way?
Using the COPY command in cqlsh is not an option here because the dump file can be really huge.
Can I solve this problem without creating some custom application?
As Guillaume has suggested in the comment, you can't do this directly in Cassandra; schema-altering operations are very limited here. You have to perform such a migration manually using one of the tools suggested there, or, if you have very large tables, you can leverage Spark.
Spark can efficiently read data from your nodes, transform it locally, and save it back to the database. Remember that such a migration requires reading the whole table's content, so it might take a while. It might be the most performant solution, but it needs some bigger preparation: a Spark cluster setup.
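A minimal, hedged PySpark sketch of that approach, assuming the Spark Cassandra Connector is available and that the keyspace, table, and column names below are placeholders; the target table with the new primary key is assumed to be created beforehand:

from pyspark.sql import SparkSession

# Assumes the connector package is supplied at submit time, e.g.
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 migrate.py
spark = (SparkSession.builder
         .appName("cassandra-schema-migration")
         .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder contact point
         .getOrCreate())

# Full scan of the source table (this is the part that can take a while).
old_df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_keyspace", table="old_table")
          .load())

# Reshape the rows for the new primary key (placeholder transformation).
new_df = old_df.selectExpr("id", "bucket", "value")

# Append into the pre-created table that has the new schema.
(new_df.write.format("org.apache.spark.sql.cassandra")
 .options(keyspace="my_keyspace", table="new_table")
 .mode("append")
 .save())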
I'm looking for a tool to load CSV into Cassandra. I was hoping to use RazorSQL for this but I've been told that it will be several months out.
What is a good tool?
Thanks
1) If you have all the data to be loaded in place, you can try the sstableloader utility (only for Cassandra 0.8.x onwards) to bulk load the data. For more details see: cassandra bulk loader
2) Cassandra has introduced BulkOutputFormat for bulk loading data into Cassandra with a Hadoop job in the latest version, that is, Cassandra 1.1.x onwards.
For more details see: Bulk loading to Cassandra with Hadoop
I'm dubious that tool support would help a great deal with this, since a Cassandra schema needs to reflect the queries that you want to run, rather than just being a generic model of your domain.
The built-in bulk loading mechanism for cassandra is via BinaryMemtables: http://wiki.apache.org/cassandra/BinaryMemtable
However, whether you use this or the more usual Thrift interface, you still probably need to manually design a mapping from your CSV into Cassandra ColumnFamilies, taking into account the queries you need to run. A generic CSV -> Cassandra mapping may not be appropriate since secondary indexes and denormalisation are commonly needed.
For Cassandra 1.1.3 and higher, there is the CQL COPY command that is available for importing (or exporting) data to (or from) a table. According to the documentation, if you are importing less than roughly 2 million rows, then this is a good option. It is much easier to use than the sstableloader and less error prone. The sstableloader requires you to create strictly formatted .db files, whereas the CQL COPY command accepts a delimited text file. Documentation here:
http://www.datastax.com/docs/1.1/references/cql/COPY
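As a hedged example of such an import (keyspace, table, column names, and the CSV path are placeholders), a COPY FROM with a header row looks like:
cqlsh> COPY my_keyspace.my_table (id, name, created_at) FROM '/path/to/data.csv' WITH HEADER = true;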
For larger data sets, you should use the sstableloader: http://www.datastax.com/docs/1.1/references/bulkloader. A working example is described here: http://www.datastax.com/dev/blog/bulk-loading.