Transfer data from one column family DB to another column family DB - Cassandra

I'm new to Cassandra and the column family database world. I have a scenario where I need to move data from one column family database, such as ScyllaDB, to another column family database, DataStax Cassandra. The amount of data to be transferred will be in the millions of rows, and I want this transfer to happen at a regular interval, let's say every 2 minutes. I was exploring the sstableloader option, with no luck yet. Is there any other, better approach for my scenario? Any suggestions will be highly appreciated.

(Disclaimer: I'm a ScyllaDB employee)
There are three ways to accomplish this:
1. Dual writes from the client side, with client-side timestamps, to both DBs.
2. Use the sstableloader tool to migrate the data from one DB to the other.
3. Use the nodetool refresh command to load sstables.
You can read more about migration from Cassandra to Scylla in the following doc, which also describes how to perform dual writes from the client side (option 1), with code examples, and how to use the sstableloader tool (option 2):
http://docs.scylladb.com/procedures/cassandra_to_scylla_migration_process/
For the nodetool refresh usage you may look here: http://docs.scylladb.com/nodetool-commands/refresh/

A common approach is to instrument the client to write to both databases in parallel, instead of synchronizing the two databases. This keeps the two databases in sync on every single write.
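
As a minimal sketch of that dual-write pattern (assuming the Python cassandra-driver, hypothetical contact points, and an identical users table in both clusters), the client sends every write to both databases with an explicit client-side timestamp, so both sides resolve conflicts identically:

from time import time
from cassandra.cluster import Cluster

# Hypothetical contact points; replace with your real nodes.
scylla = Cluster(["scylla-node1"]).connect("my_keyspace")
cassandra = Cluster(["cassandra-node1"]).connect("my_keyspace")

# USING TIMESTAMP pins the write time on the client, so both
# databases agree on which version of a row is newest.
INSERT_CQL = "INSERT INTO users (id, name) VALUES (%s, %s) USING TIMESTAMP %s"

def dual_write(user_id, name):
    ts = int(time() * 1_000_000)  # microseconds, as CQL timestamps expect
    for session in (scylla, cassandra):
        session.execute(INSERT_CQL, (user_id, name, ts))

Error handling (retries, and a reconciliation path for writes that succeed on only one side) is the part that needs real care in production; this only shows the timestamping.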

Related

How to use join in cassandra database tables like mysql

I am new to Node.js. I use the Express framework and Cassandra as a database.
My question: is it possible to use joins across multiple related tables, like in MySQL?
For example, in MySQL:
select * from `table1`
left join `table2` on table1.id=table2.id
Short: no.
There are no joins in Cassandra (or other NoSQL databases) - this is different from relational databases.
In Cassandra the standard way is to denormalize data, and maybe store multiple copies if necessary. As a rule of thumb, think query first - store your data the way you will need to query it later.
Cassandra will perform very well if your data is evenly spread across your cluster and the everyday queries hit only one or a few primary keys (to be exact, partition keys).
Have a look at: http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
And there are trainings from DataStax: https://academy.datastax.com/resources/ds220-data-modeling (and others too).
No. Joins are not possible in Cassandra (or most other NoSQL databases).
You should design your tables in Cassandra based on your query requirements.
And while using a NoSQL system, it's recommended to denormalize your data.
Basic Rules of Cassandra Data Modeling
Basic googling returns this blog post from DataStax, the company commercially backing Cassandra: https://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise
In short, it says that joins are possible via Spark or ODBC connections.
It comes down to the performance characteristics of such joins, especially compared to making a "join" by hand, i.e. running a lookup query on one table for every (relevant) row of the other. Any ideas?
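
For a sense of what the hand-rolled variant looks like, here is a sketch (assuming the Python cassandra-driver and two hypothetical tables, orders and customers, related by customer id): a client-side join is just a partition-key lookup on the second table for every row of the first.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Hypothetical schema: orders(order_id, customer_id, total) and
# customers(id, name), each keyed by its first column.
lookup = session.prepare("SELECT name FROM customers WHERE id = ?")

joined = []
for order in session.execute("SELECT customer_id, total FROM orders"):
    customer = session.execute(lookup, (order.customer_id,)).one()
    joined.append((customer.name if customer else None, order.total))

Each lookup is a fast single-partition read, but the round trips add up linearly, which is why Spark-side joins or up-front denormalization usually win at scale.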

Running a website/web application that analyzes big data

Hello, I have a website where the server holds 2-3 GB of data in its database. I want the user to run a query to fetch data and analyze it (for example, the user can filter on age > 15), then press a button that says "cluster" to run clustering on that data, and then view the result with libraries like d3.js.
How do I do this? Can I link Hadoop or something like that with PHP/Node.js?
Any suggestions?
I think your data size is not big enough to justify a big data stack.
Configuring your RDBMS to perform well for your requests could solve your problem.
At the GB scale, Hadoop will not give you good response times. In your case, if you need low latency, I suggest Cassandra or maybe Redis for the requests.
Don't use Hadoop for gigabytes.
You should use an RDBMS, which will provide better results if configured right. RDBMSs are easy to integrate into web applications.
Hadoop is a distributed storage and processing framework, and should be used for far more than GBs of data; otherwise it will just slow you down.
We need more information: depending on the data store and the type of data, we can go with different options.
Option 1:
A relational database can store terabytes of data on a clustered platform, with a replication set maintained through log shipping or streaming, so GBs of storage are easy to handle. Then comes the analysis, which depends on how the data is stored. MS SQL Server can easily handle terabytes of data and run an analytics engine on top. This is the option if we are storing the data in a denormalized way and ACID is a key factor; it is transaction-aware.
Option 2:
If the data is received and stored in a document model (JSON), and consistency and replication matter more than availability, MongoDB is the best on the market and can be set up in a primary/secondary configuration. The JavaScript interpreter in the mongo shell makes data handling very efficient.
Option 3:
If consistency and ACID are not constraints, availability matters, and the data is stored as key-value, the best bet is Cassandra. Build a good hash and terabytes of data will be handled with ease, as it replicates across nodes within a DC or across DCs. A good hash (partition) key definition is the major factor for sharding here.

Big data solution for frequent queries

I need a big data storage solution for batch inserts of denormalized data which happen infrequently and queries on the inserted data which happen frequently.
I've gone through Cassandra and feel that it's not that good for batch inserts, but an OK solution for querying. Also, it would be good if there were a mechanism to segregate data based on a data attribute.
Since you mentioned Cassandra, I will talk about it.
Can you insert in an unbatched way, or is batching imposed by the system? If you can insert unbatched, Cassandra will probably handle it easily; see the sketch below.
Batched inserts should also be manageable for Cassandra nodes, but they won't distribute the load properly among all the nodes (note: I'm talking about load balancing, not about data balance, which depends only on your partition key setup). If you are not very familiar with Cassandra, you could tell us your data structure and your query types, and we could suggest how to fit them to Cassandra's data model.
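
A sketch of the unbatched route (assuming the Python cassandra-driver and a hypothetical events table): many parallel single-row inserts let the driver send each row directly to the replicas that own its partition, spreading the load.

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Hypothetical table: events(id int PRIMARY KEY, payload text).
insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")

rows = [(i, "payload-%d" % i) for i in range(10_000)]

# Many concurrent single-row inserts instead of one big batch, so
# every write is routed straight to the nodes owning its partition.
execute_concurrent_with_args(session, insert, rows, concurrency=100)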
For the filtering part of the question, Cassandra has clustering keys and secondary indexes; an index is basically like adding another queryable column configuration on top of the clustering key, so that you can filter by both. A sketch follows.
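
To make that concrete, here is a sketch with a hypothetical sensor-readings table: the clustering column orders rows within a partition (enabling range queries), and a secondary index makes one more column queryable.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# (sensor_id, day) is the partition key; ts is the clustering column,
# so readings inside each partition are stored sorted by time.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id int, day date, ts timestamp, status text, value double,
        PRIMARY KEY ((sensor_id, day), ts))""")

# A secondary index makes 'status' queryable on its own
# (best kept to low-cardinality columns).
session.execute("CREATE INDEX IF NOT EXISTS ON readings (status)")

# Range query over the clustering column within one partition:
morning = session.execute(
    "SELECT * FROM readings WHERE sensor_id = 42 AND day = '2024-01-15' "
    "AND ts >= '2024-01-15 08:00:00'")

# Equality query on the indexed column:
ok_rows = session.execute("SELECT * FROM readings WHERE status = 'ok'")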

Tables already created to insert into a cassandra keyspace to test

I want to test my cluster a little: how data replicates, etc.
I have a Cassandra cluster formed by 5 machines (CentOS 7 and Cassandra 3.4 on them).
Are there tables already created for testing anywhere that I can import into my DB in some keyspace?
If yes, please be kind enough to explain how to import them into a keyspace and where to get them.
You can use cassandra-stress. It is great for creating data shaped like your own tables, and it also ships with some default tables.
http://docs.datastax.com/en/cassandra_win/3.0/cassandra/tools/toolsCStress.html
I highly recommend it.
Actually, there is a lot of data on the internet that can be used for testing, e.g.:
https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
http://bigdata-madesimple.com/70-websites-to-get-large-data-repositories-for-free/
Cassandra provides the cqlsh tool for executing CQL commands, such as COPY for importing CSV data into the database.
P.S. Pay attention to the fact that cqlsh has some restrictions related to timeouts. That is why it can be better to use a Cassandra driver/connector to make this process more efficient.
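
A minimal sketch of that driver-based alternative (assuming the Python cassandra-driver, a hypothetical users.csv with an id,name header, and a matching users table): reading the CSV yourself and issuing concurrent prepared inserts sidesteps cqlsh's client-side timeouts on large files.

import csv
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

session = Cluster(["127.0.0.1"]).connect("my_keyspace")
insert = session.prepare("INSERT INTO users (id, name) VALUES (?, ?)")

# Hypothetical CSV file with an 'id,name' header row.
with open("users.csv", newline="") as f:
    rows = [(int(r["id"]), r["name"]) for r in csv.DictReader(f)]

# Concurrent single-row inserts; no cqlsh COPY timeout to hit.
execute_concurrent_with_args(session, insert, rows, concurrency=50)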

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure and works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seem to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values of a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase, but I'm not sure what they would be in this case.
I also think crate.io looks like it could be interesting, but I wonder if there might be unforeseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?
I currently work with Cassandra and I might help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency level on each query you make, so you can have consistent data if you want it. Normally you use:
ONE - Only one node has to acknowledge the read or write. This means fast reads/writes but low consistency (another replica can deliver older information while consistency has not yet been achieved).
QUORUM - A majority (more than half) of your nodes must acknowledge the read or write. This means slower reads and writes, but you get full consistency IF you use it for BOTH reads and writes. That's because if more than half of your nodes have your data after you insert/update/delete, then, when reading from more than half of your nodes, at least one node will have the most recent information, and that is the version that will be delivered.
Both of these options are the recommended ones because they avoid single points of failure: if all machines had to acknowledge, then whenever one node was down or busy you wouldn't be able to query.
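
A sketch of picking the consistency level per query with the Python cassandra-driver (keyspace and table hypothetical): the same table can be written at QUORUM and read at ONE or QUORUM, depending on the guarantee each query needs.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# QUORUM write + QUORUM read overlap in at least one replica,
# so the read is guaranteed to observe the write.
write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (1, 'alice')",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write)

# ONE is faster but may return stale data from a lagging replica.
read = SimpleStatement(
    "SELECT name FROM users WHERE id = 1",
    consistency_level=ConsistencyLevel.ONE)
print(session.execute(read).one().name)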
Pros
Cassandra is the solution for performance and linear scalability, and it avoids single points of failure (you can have machines down; the others will take over the work). And it does most of its management work automatically: you don't need to manage data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't care much about which queries will be made, and you work to normalize the data.
With Cassandra the strategy is different: you model the tables to serve the queries. That is because you can't join, and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores, and you want one query that returns all products of a certain store (e.g. New York City) and another query that returns all products of a certain department (e.g. Computers), you would have two tables, "ProductsByStore" and "ProductsByDepartment", with the same data but organized differently to serve each query, as sketched below.
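
A sketch of that two-table layout (hypothetical columns, Python cassandra-driver): the same product rows live in both tables, each partitioned for its own query, and the application writes to both.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Same data, two layouts: partitioned by store, and by department.
session.execute("""
    CREATE TABLE IF NOT EXISTS products_by_store (
        store text, product_id int, name text, department text,
        PRIMARY KEY (store, product_id))""")
session.execute("""
    CREATE TABLE IF NOT EXISTS products_by_department (
        department text, product_id int, name text, store text,
        PRIMARY KEY (department, product_id))""")

# Every product is inserted twice, once per table, so each query
# below stays a fast single-partition read.
session.execute("INSERT INTO products_by_store (store, product_id, name, department) "
                "VALUES ('New York City', 1, 'Laptop', 'Computers')")
session.execute("INSERT INTO products_by_department (department, product_id, name, store) "
                "VALUES ('Computers', 1, 'Laptop', 'New York City')")

by_store = session.execute(
    "SELECT name FROM products_by_store WHERE store = 'New York City'")
by_dept = session.execute(
    "SELECT name FROM products_by_department WHERE department = 'Computers'")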
Materialized views can help with this, avoiding the need to write the change to multiple tables yourself, but the example shows how differently things work with Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.
