Cassandra Database Problem

I am using Cassandra for a large-scale application and am new to it. I have a schema for a particular keyspace, whose columns I created with the Cassandra Command Line Interface (CLI). When I copied a dataset into the folder /var/lib/cassandra/data/, I could not access the values using the key of a particular column; the query reports zero rows. The files are present, though, with names of the form XXXX-Data.db, XXXX-Filter.db and XXXX-Index.db. Can anyone tell me how to access the columns of an existing dataset?

(a) Cassandra doesn't expect you to move its data files around out from underneath it. You'll need to restart if you do any manual surgery like that.
(b) If you didn't also copy the schema definition, it will ignore data files for unknown column families.

For what you are trying to achieve, it would probably be better to export and import your SSTables.
You should have a look at bin/sstable2json and bin/json2sstable.
Documentation is there (near the end of the page): Cassandra Operations

Related

How to solve the maximum view depth error in Spark?

I have a very long task that creates a bunch of views using Spark SQL and I get the following error at some step: pyspark.sql.utils.AnalysisException: The depth of view 'foobar' exceeds the maximum view resolution depth (100).
I have been searching in Google and SO and couldn't find anyone with a similar error.
I have tried caching the view foobar, but that doesn't help. I'm thinking of creating temporary tables as a workaround, as I would like not to change the current Spark Configuration if possible, but I'm not sure if I'm missing something.
UPDATE:
I tried creating tables in parquet format to reference tables and not views, but I still get the same error. I applied that to all the input tables to the SQL query that causes the error.
If it makes a difference, I'm using ANSI SQL, not the Python API.
Solution
Using Parquet tables worked for me after all. I spotted that I was still missing one table to persist, which is why it wouldn't work.
So I changed my SQL statements from this:
CREATE OR REPLACE TEMPORARY VIEW `VIEW_NAME` AS
SELECT ...
To:
CREATE TABLE `TABLE_NAME` USING PARQUET AS
SELECT ...
This moves all the critical views to Parquet tables under spark_warehouse/, or whatever location you have configured.
Note:
This will write the table to the master node's disk. Make sure you have enough disk space, or consider writing to an external data store such as S3. An alternative, and now preferred, solution uses checkpoints.
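The statement rewrite described above can be sketched as a small Python helper. This is only an illustration: the function name view_to_parquet_table is hypothetical, and it assumes each statement starts with CREATE [OR REPLACE] TEMPORARY VIEW name AS. Statements that do not match are returned unchanged.

```python
import re

# Head of a "CREATE [OR REPLACE] TEMP[ORARY] VIEW <name> AS" statement.
# The view name may be backtick-quoted or a bare identifier.
_VIEW_HEAD = re.compile(
    r"^\s*CREATE\s+(?:OR\s+REPLACE\s+)?(?:TEMPORARY|TEMP)\s+VIEW\s+(`[^`]+`|\S+)\s+AS\b",
    re.IGNORECASE,
)


def view_to_parquet_table(statement):
    """Rewrite a temporary-view definition into a Parquet CREATE TABLE ... AS statement.

    Hypothetical helper: statements that are not temporary-view
    definitions are returned untouched.
    """
    match = _VIEW_HEAD.match(statement)
    if match is None:
        return statement
    name = match.group(1)
    # Keep everything after "AS" (the SELECT body) exactly as written.
    return "CREATE TABLE {} USING PARQUET AS".format(name) + statement[match.end():]
```

The result could then be passed to spark.sql(...). Note that, unlike CREATE OR REPLACE, a plain CREATE TABLE fails if the table already exists, so a rerun would presumably need a DROP TABLE IF EXISTS first.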

What is the best way to export all of my data from a Cassandra Cluster?

I am very new to Cassandra and any help here would be appreciated. I have a cluster of 6 nodes that spans 2 datacenters (3 nodes in each datacenter). My client has decided not to renew their Cassandra license with DataStax and wants the data exported into a format that can easily be imported into another database in the future. I was thinking of exporting the data as a CSV file, but since the data is distributed across all the nodes, I am not sure what the best way to export all of it is.
One option: you should be able to use the cqlsh COPY command, which copies the data into CSV format. The nice thing about COPY is that you can run it from a single node (i.e. it is not a "node"-level tool). The command would be (once in cqlsh):
cqlsh> COPY keyspace_name.table_name TO '/path/to/file';
If there is a LOT of data, or a lot of tables, this tool may not be a great fit. But for a small number of tables that don't have HUGE row counts (fewer than several million), it works well. Hope that helps.
-Jim
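For readers who would rather script the export than use cqlsh, a similar CSV dump can be sketched in Python. This is not the COPY command itself but a swapped-in alternative, assuming the DataStax Python driver (cassandra-driver); the function name export_table_to_csv and its parameters are hypothetical. The session argument is expected to behave like a driver Session, whose execute() result is iterable and exposes column_names.

```python
import csv


def export_table_to_csv(session, keyspace, table, path):
    """Write every row of keyspace.table to a CSV file and return the row count.

    `session` is assumed to be a cassandra-driver Session (or any object
    with a compatible execute()). The names are interpolated directly
    into the query, so only pass trusted keyspace and table names.
    """
    result = session.execute("SELECT * FROM {}.{}".format(keyspace, table))
    count = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(result.column_names)  # header row
        for row in result:  # the driver fetches further pages while iterating
            writer.writerow(list(row))
            count += 1
    return count
```

With the real driver, the session would come from something like Cluster(["10.0.0.1"]).connect(), where the contact point is a placeholder. Like COPY, this reads the whole table through a coordinator, so it is only a reasonable fit for modest row counts.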
Since 2018 you can use DSBulk with DSE to export or import data to/from CSV (by default), or JSON. Since the end of 2019 it's possible to use it with open source Cassandra as well.
It could be as simple as:
dsbulk unload -k keyspace -t table -u user -p password -url filename
DSBulk is heavily optimized for fast data export and avoids the heavy load on the coordinator node that you get when you just run select * from table.
You can control which columns to export, and even provide your own query, etc. The DataStax blog has a series of posts about different aspects of using DSBulk:
Introduction and Loading
More Loading
Common Settings
Unloading
Counting
Examples for Loading From Other Locations
You can use the CQL COPY command to export data from a Cassandra cluster. However, it only performs well on small datasets; with a large amount of data it is not useful, because it will fail with errors or timeouts. You can also use sstabledump to export the data node by node into JSON format. I hope this is useful for you.
I have implemented a small script for this purpose. It isn't the best way, since it is slow and, in my experience, produces connection errors on system tables. But it could be useful for inspecting Cassandra on small datasets: https://github.com/kirillt/cassandra-utils

Write to a datepartitioned Bigquery table using the beam.io.gcp.bigquery.WriteToBigQuery module in apache beam

I'm trying to write a Dataflow job that needs to process logs located on storage and write them to different BigQuery tables. Which output tables are used depends on the records in the logs. So I do some processing on the logs and yield them with a key based on a value in the log, after which I group the logs by key. I need to write all the logs grouped under the same key to one table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument, as described in the documentation.
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
The CREATE_IF_NEEDED gives an error because it has to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step and if not create them.
If I load older data I get the following error:
"The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."
This seems like a limitation of streaming inserts. Is there any way to do batch inserts?
Maybe I'm approaching this wrong, and should use another method.
Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam==2.13.0.
That error message can be logged when one mixes the use of an ingestion-time partitioned table and a column-partitioned table (see this similar issue). Summarizing from the link, it is not possible to use column-based partitioning (as opposed to ingestion-time partitioning) and write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the use of the partition decorator when selecting which table (use "[prefix]_YYYYMMDD") and then have each individual table be column-based partitioned.
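A callable that picks a "[prefix]_YYYYMMDD" table per record, as suggested above, might look like the following sketch. The record fields "key" and "timestamp", the timestamp format, and the default dataset "my-project:logs" are all assumptions made for illustration; they do not come from the question.

```python
from datetime import datetime


def destination_table(record, dataset="my-project:logs"):
    """Return '<dataset>.<key>_<YYYYMMDD>' for a single log record.

    Assumes (hypothetically) that each record is a dict with a "key"
    field naming the table prefix and an ISO-like "timestamp" field.
    No "$" partition decorator is used in the resulting table name.
    """
    day = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%S")
    return "{}.{}_{}".format(dataset, record["key"], day.strftime("%Y%m%d"))
```

Such a function could be passed as the table argument, for example beam.io.WriteToBigQuery(table=destination_table, ...), since the documented behaviour of a callable there is to receive each element and return its destination table.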

Cassandra internal structure

I have a table named thresholds. When I look at Cassandra's internal data directory, I see entries like this:
thresholds-013f8630812a11e885581d42d0727985
thresholds-0ab8f8e0713511e8849d85e38fe317db
thresholds-18c19550b2a411e7845cd18bd49bef86
trojan_horse-18b0f380b2a411e7845cd18bd49bef86
Why does Cassandra create so many data files for the table thresholds?
Also, how can I tell which one holds my data, so that I can back it up?

How to migrate data between two tables in Cassandra properly

I have to change the schema of one of my tables in Cassandra. It cannot be done simply with the ALTER TABLE command, because there are some changes to the primary key.
So the question is: what is the best way to do such a migration?
Using the COPY command in cqlsh is not an option here, because the dump file can be really huge.
Can I solve this problem without writing a custom application?
As Guillaume suggested in the comment, you can't do this directly in Cassandra. Schema-altering operations are very limited here. You have to perform such a migration manually using one of the tools suggested there, or, if you have very large tables, you can leverage Spark.
Spark can efficiently read data from your nodes, transform it locally and save it back to the database. Remember that such a migration requires reading the whole table, so it might take a while. It might be the most performant solution, but it needs more preparation: a Spark cluster has to be set up.
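A minimal PySpark sketch of that read-transform-write flow is shown below. It assumes the DataStax spark-cassandra-connector is on the Spark classpath and that the target table with the new primary key already exists; the function name migrate_table and its parameters are hypothetical.

```python
# Data source name registered by the spark-cassandra-connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def migrate_table(spark, keyspace, source_table, target_table,
                  transform=lambda df: df):
    """Copy rows from source_table to target_table within one keyspace.

    `spark` is assumed to be a SparkSession configured for the cluster.
    `transform` is an optional DataFrame -> DataFrame function for
    reshaping rows to fit the new schema (identity by default).
    """
    source = (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=source_table)
        .load()
    )
    (
        transform(source)
        .write.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=target_table)
        .mode("append")  # add rows to the existing target table
        .save()
    )
```

A call might look like migrate_table(spark, "my_ks", "events_old", "events_new"), with all three names being placeholders.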
