Export data from Cassandra query to a file - cassandra

I have x GB (x varies from 25-40 GB) of daily data which resides in the cassandra and i want to export it in a file. So, I came cross this SO link. Using which you can export the data of query having format to be like this :
select column1, column2 from table where condition = xy
So, I have schedule the same method in the cron job. But due to huge amount of data process get killed while writing to the text file. So, what are other options to export the huge data given the query format.

Have looked into using Spark to retrieve and process your data? If you are using Datastax you have this as part of your isntallation (DSE Analytics). With Spark you should be able to read the data from your C* instance and write it to a text file without the limitations of a direct CQL statement.

Hava a look into the following python script where you can use the scralling using to get the huge data from the cassandra without timeout.
query = "SELECT * FROM table_name",statement = SimpleStatement(query, fetch_size=100),results=session.execute(statement),for user_row in session.execute(statement):,for rw in user_row:,This is work for me very efficiently. I didn't mention the cassandra connection i think we can easley get the code for cassandra connection in python.

Related

What is the best way to get row count from a table in Cassandra?

Is there any best way that we can get the total number of rows from the Cassandra table?
Regards,
Mani
DatastaxBulk is probably the easiest to install and run.
Apache Spark Cassandra connector could be handy. Once the dataframe is loaded with sc.cassandraTable() you can count
Avoid counting in your code, it does not scale as it performs a full scan of the cluster, the response time will be in seconds.
Avoid counting with CQL select count(*) as you will likely hit the timeout quickly.
you can simply use Count(*) to get row numbers from the table.
For example,
Syntax:
SELECT Count(*)
FROM tablename;
and the expected output looks like this,
count
-------
4
(1 rows)
Background
Cassandra has a built-in CQL function COUNT() which counts the number of rows returned by a query. If you execute an unbounded query (no filter or WHERE clause), it will retrieve all the partitions in the table which you can count, for example:
SELECT COUNT(*) FROM table_name;
Pitfalls
However, this is NOT recommended since it requires a full table scan that would query every single node which is very expensive and will affect the performance of the cluster.
It might work for very small clusters (for example, 1 to 3 nodes) with very small datasets (for example, a few thousand partitions) but in practice it would likely timeout and not return results. I've explained in detail why you shouldn't do this in Why COUNT() is bad in Cassandra.
Recommended solution
There are different techniques for counting records in the database but the easiest way is to use the DataStax Bulk Loader (DSBulk). It is open-source so it's free to use. It was originally designed for bulk-loading data to and exporting data from a Cassandra cluster as a scalable solution for the cqlsh COPY command.
DSBulk has a count command that provides the same functionality as the CQL COUNT() function but has optimisations that break up the table scan into small range queries so doesn't suffer from the same problems as brute-force counting.
DSBulk is quite simple to use and only takes a few minutes to setup. First, you need to download the binaries from DataStax Downloads then unpack the tarball. For details, see the DSBulk Installation Instructions.
Once you've got it installed, you can count the partitions in a table with one command:
$ cd path/to/dsbulk_installation
$ bin/dsbulk count -h <node_ip> -k ks_name -t table_name
Here are some references with examples to help you get started quickly:
Docs - Counting data in tables
Blog - Counting records with DSBulk
Blog - DSBulk Intro + Loading data
You can also use cqlsh as an alternate for small tables.
Refer this documentation
https://www.datastax.com/blog/running-count-expensive-cassandra

Incremental and parallelism read from RDBMS in Spark using JDBC

I'm working on a project that involves reading data from RDBMS using JDBC and I succeeded reading the data. This is something I will be doing fairly constantly, weekly. So I've been trying to come up with a way to ensure that after the initial read, subsequent ones should only pull updated records instead of pulling the entire data from the table.
I can do this with sqoop incremental import by specifying the three parameters (--check-column, --incremental last-modified/append and --last-value). However, I dont want to use sqoop for this. Is there a way I can replicate same in Spark with Scala?
Secondly, some of the tables do not have unique column which can be used as partitionColumn, so I thought of using a row-number function to add a unique column to these table and then get the MIN and MAX of the unique column as lowerBound and upperBound respectively. My challenge now is how to dynamically parse these values into the read statement like below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"
val df = spark.read.format("jdbc").
option("driver", driver).
option("url",url ).
option("partitionColumn",row_nums).
option("lowerBound", min(row_nums)).
option("upperBound", max(row_nums)).
option("numPartitions", some value).
option("fetchsize",some value).
option("dbtable", queryNum).
option("user", user).
option("password",password).
load()
I know the above code is not right and might be missing a whole lot of processes but I guess it'll give a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read in the max row of your prior output either directly from loading all data files or via a log file that you write out each time. If your data files are massive you may need to use the log file, if smaller you could potentially load.

What is the best way to export all of my data from a Cassandra Cluster?

I am very new to Cassandra and any help here would be appreciated. I have a cluster of 6 nodes that spans 2 datacenters (3 nodes to each cluster). My client has decided that they do not want to renew their Cassandra license with Datastax anymore and want their data exported into a format that can be easily imported into another Database in the future. I was thinking of exporting the data as a CSV file, but since the data is distributed between all the nodes, I am not sure what is the best way to export all the data.
One option - You should be able to use the CQL COPY command - which copies the data into a CSV format. The nice thing about copy is that you can run it from a single node (i.e. it is not a "node" level tool). Command would be (once in cqlsh):
CQL> COPY . to '/path/to/file'
If there is a LOT of data, or a lot of tables, this tool may not be a great fit. But for small number of tables that don't have HUGE rowcounts (< several million), this works well. Hope that helps.
-Jim
Since 2018 you can use DSBulk with DSE to export or import data to/from CSV (by default), or JSON. Since the end of 2019 it's possible to use it with open source Cassandra as well.
It could be as simple as:
dsbulk unload -k keyspace -t table -u user -p password -url filename
DSBulk is heavily optimized for fast data export, without putting too much load onto the coordinator node that happens when you just run select * from table.
You can control what columns to export, and even provide your own query, etc. DataStax blog has a series of blog posts about different aspects of using DSBulk:
Introduction and Loading
More Loading
Common Settings
Unloading
Counting
Examples for Loading From Other Locations
You can use CQL COPY command for exporting the data from Cassandra cluster. However it is performant for small set of data if you are having big size of data this command is not useful cause it will give some error or timeout issue. Also, you may use sstabledump and export your node-wise date into JSON format. Hope, this will useful for you.
I have implemented small script for this purpose. It isn't the best way, since it slow and, in my experience, produces connection errors on system tables. But it could be useful for inspecting Cassandra on small datasets: https://github.com/kirillt/cassandra-utils

Select All Records From Cassandra

I am trying to select all records from one Cassandra table (~10M records) which should be distributed over 4 nodes using CQL shell, but every time I do that it partitions the output to 1K records Max. So my question is, it is possible to select all records at once as I am trying to see how much time it takes Cassandra to retrieve all records.
When you write "SELECT * from CF" CQL client will never select everything at once. It's just a stupid action for large data. Instead it will load only first page and give you an iterator. Cassandra from 2.0 version supports automatic query paging. So you should call your select all query and ITERATE over pages to load full column family. See an example for python client. There is no way to load all in one action in CQL now and it shouldn't be.
While it was already pointed out that it's a bad idea to try and load all data in cqlsh, what you're trying to do is still somewhat possible. You just need to set a limit and probably increase the timeout for cqlsh.
user#host:~# cqlsh --request-timeout=600
This will start the shell with a request timeout of 10 minutes.
select * from some_table limit 10000000;
Please do not use this in a production environment, as it might have terrible implications for performance and cluster availability!

Spark Cassandra connector - where clause

I am trying to do some analytics on time series data stored in cassandra by using spark and the new connector published by Datastax.
In my schema the Partition key is the meter ID and I want to run spark operations only on specifics series, therefore I need to filter by meter ID.
I would like then to run a query like: Select * from timeseries where series_id = X
I have tried to achieve this by doing:
JavaRDD<CassandraRow> rdd = sc.cassandraTable("test", "timeseries").select(columns).where("series_id = ?",ids).toJavaRDD();
When executing this code the resulting query is:
SELECT "series_id", "timestamp", "value" FROM "timeseries" WHERE token("series_id") > 1059678427073559546 AND token("series_id") <= 1337476147328479245 AND series_id = ? ALLOW FILTERING
A clause is automatically added on my partition key (token("series_id") > X AND token("series_id") <=Y) and then mine is appended after that. This obviously does not work and I get an error saying: "series_id cannot be restricted by more than one relation if it includes an Equal".
Is there a way to get rid of the clause added automatically? Am I missing something?
Thanks in advance
The driver automatically determines the partition key using table metadata it fetches from the cluster itself. It then uses this to append the token ranges to your CQL so that it can read a chunk of data from the specific node it's trying to query. In other words, Cassandra thinks series_id is your partition key and not meter_id. If you run a describe command on your table, I bet you'll be surprised.

Resources