Is there a schema versioning tool for cassandra [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
In the SQL world, it's quite common to have a tool that goes through a folder of schema scripts to set up a schema. A widely used approach is to have a table holding the current DB version number, plus DDL scripts, so that we can start from any version of the DB and update to any subsequent version in a controlled manner. Visual Studio has DB projects, and Redgate has similar tools.
I was wondering if there's something like this for Cassandra as well. I know it wouldn't be too difficult to implement something basic for Cassandra, but I was wondering if somebody's already done it.
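The version-number approach described above can be sketched roughly as follows. This is a minimal Python illustration, not any existing tool: the function name `upgrade`, the `migrations` mapping and the `execute` callback are all hypothetical, and the caller is assumed to read the current version from the version table beforehand and write the returned version back afterwards.

```python
from typing import Callable, Dict, List


def upgrade(current_version: int,
            target_version: int,
            migrations: Dict[int, List[str]],
            execute: Callable[[str], None]) -> int:
    """Apply, in ascending order, every migration whose version is
    greater than current_version and at most target_version.

    `migrations` maps a version number to the DDL statements that
    bring the schema from the previous version up to that version.
    `execute` is whatever runs one statement against the database
    (hypothetical; e.g. a thin wrapper around a driver session).
    Returns the version the schema is at afterwards.
    """
    for version in sorted(migrations):
        if current_version < version <= target_version:
            for statement in migrations[version]:
                execute(statement)
            # Only advance once all statements for this version ran.
            current_version = version
    return current_version
```

Because only versions above the stored one are applied, the same set of scripts can bring a database at any earlier version up to any later one.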

Pillar manages migrations for your Cassandra data stores.
Pillar grew from a desire to automatically manage Cassandra schema as code. Managing schema as code enables automated build and deployment, a foundational practice for an organization striving to achieve Continuous Delivery.
Pillar is to Cassandra what Rails ActiveRecord migrations or Play Evolutions are to relational databases with one key difference: Pillar is completely independent from any application development framework.
https://github.com/comeara/pillar

Your initial question doesn't specify a language, though you later indicate you'd like C#. I don't have a C# answer, but I've extracted the Java versioning component that I'm using for my project. I also created a small sample project that shows how to integrate it. It's bare-bones. There are different approaches to this problem, so I picked one that was simple to build and does what I need. Here are the two GitHub projects:
https://github.com/DonBranson/cql_schema_versioning
https://github.com/DonBranson/cql_schema_versioning_example
This component doesn't store a version # in the schema, but stores the list of scripts it's run. It depends on the sort order of the script names to determine run order. Very basic.
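To illustrate the idea (this is not the code from the linked projects), here is a minimal Python sketch of tracking the set of scripts already run and using name order to decide run order. The names `run_pending`, `scripts`, `applied` and `execute` are made up for the example, and the in-memory set stands in for whatever table would record the applied scripts.

```python
from typing import Callable, Dict, List, Set


def run_pending(scripts: Dict[str, str],
                applied: Set[str],
                execute: Callable[[str], None]) -> List[str]:
    """Run every script whose name is not yet in `applied`.

    `scripts` maps a script name to its CQL text. Scripts run in
    sorted name order, so names need a sortable prefix such as
    001_, 002_, and so on. `applied` is updated in place and would
    normally be loaded from, and saved to, a bookkeeping table.
    Returns the names of the scripts run by this call.
    """
    newly_applied = []
    for name in sorted(scripts):
        if name in applied:
            continue  # already run on this keyspace
        execute(scripts[name])
        applied.add(name)
        newly_applied.append(name)
    return newly_applied
```

Calling it a second time with the same `applied` set runs nothing, which is what makes it safe to invoke on every deployment.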

Cassandra is by its nature 'schemaless': it is a structured key-value store, so it is very different from a traditional RDBMS in that regard.
Cassandra has now evolved to be 'schema-optional', in that it allows you to describe general datatypes that live in a particular column family.
Try looking at Liquibase and/or Flyway to see if their extensions provide the versioning capability you require.
http://bungeedata.blogspot.com/2013/12/liquibase-and-cassandra.html
http://www.datastax.com/dev/blog/schema-in-cassandra-1-1
http://planetcassandra.org/blog/schema-vs-schema-less/

I was looking for a schema migration tool that could be used for the following scenarios:
Automated upgrade to schema when an application is deployed.
Allow test Cassandra databases to be populated for integration tests.
After some searching, I've found the following two that look like potential candidates:
https://github.com/Contrast-Security-OSS/cassandra-migration
https://github.com/DonBranson/cql_schema_versioning

I'm not aware of anything that exists today.
To the extent that you're using CQL, you could probably come up with something, but you'll likely run into problems with the limited ability of CQL to modify tables, and then with the data transformation phase.
When I've used these types of tools with SQL, I always ended up with a bunch of SQL to update the data set after the application of updated DDL.
With CQL, I've had to write code to be applied after the schema change.
If all you're doing is adding or dropping tables, columns and indexes, it should be do-able.

Related

Mongodb vs Postgres in Nodejs [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I'm building a NodeJS application and am utterly torn between NoSQL MongoDB and RDBMS PostgreSQL. My project is to create an open source example project for logging visitors and displaying visitor statistics in real time on a webpage using NodeJS. I was planning on using MongoDB at first, because a lot of NodeJS examples and tutorials, albeit mostly older ones, used it, and PaaS hosts with a free tier abound. However, I have seen a lot of bashing of MongoDB recently and found that people who tried to use MongoDB ended up switching to Postgres:
http://blog.engineering.kiip.me/post/20988881092/a-year-with-mongodb
http://dieswaytoofast.blogspot.com/2012/09/mysql-vs-postgres-vs-mongodb.html
http://www.plotprojects.com/why-we-use-postgresql-and-slick/
I am also a fan of Heroku and have heard a lot about Postgres because of that, and I find that SQL queries can be nice sometimes.
I'm not a database expert, so I can't tell for the life of me which way to go. I would really appreciate it if you could give some advice on which one to consider and why.
I have a few criteria:
Since I want this to be an example, it would be nice to have a way to host a decently sized amount of data. I know that MongoDB definitely offers this, but Postgres PaaS offerings like Heroku seem to have pretty small databases (and I am logging every visitor to the website).
A database that is simplistic and easy to explain to others.
Performance doesn't really matter, but speed can't hurt
Thanks for all of the help!
Note: Please no flame wars, everyone has their own opinion :)

Choosing between an SQL database and a NoSQL database is certainly being debated heavily right now and there are plenty of good articles on the subject. I'll list a couple at the end. I have no problem recommending SQL over NoSQL for your particular needs.
NoSQL has a niche group of use cases where data is stored in large, tightly coupled packages called documents rather than in a relational model. In a nutshell, data that is tightly coupled to a single entity (like all the text documents used by a single user) is better stored in a NoSQL document model. Data that behaves like Excel spreadsheets, fits nicely in rows and is subject to aggregate calculations is better stored in a SQL database, of which PostgreSQL is only one of several good choices.
A third option that you might consider is Redis (http://redis.io/), which is a simple key-value data store that is extremely fast to query but not as rigidly typed as SQL.
The example you cite seems to be a straightforward row/column type problem. You will find the SQL syntax is much less arcane than the query syntax for MongoDB. Node has many things to recommend it and the toolset has matured significantly in the past year. I would recommend using the mongoose npm package as it reduces the amount of boilerplate code that is required with native MongoDB, and I have not noticed any performance degradation.
http://slashdot.org/topic/bi/sql-vs-nosql-which-is-better/
http://www.infoivy.com/2013/07/nosql-database-comparison-chart-only.html

Entity Framework migrations on legacy database

We have several legacy SQL Server databases that we occasionally make schema changes to. We currently have a utility written in C++ that allows users to update their DBs with these schema changes. The utility currently generates dynamic SQL to create all DB objects. I am looking into redoing this and thought EF migrations might be a good way to go. I have read up a bit on the subject and I have a general idea of how it works, but I'm having a bit of a hard time figuring out how I would set it up to replace our current procedure (or if it is even possible).
Currently, a client could be on any one of a number of previous versions. I'm assuming I would have to go back to the oldest possible version and create my model/initial migration from that, then generate incremental migrations for each version change in order to support updates from all versions. Is that a correct assumption? Also, our clients could currently be using SQL Server 2000, 2005, or 2008. Would this have any effect on how I would set things up (or whether I even could)?
Further, the goal is to create a utility with a (C#, probably WPF) UI that the user can use to manipulate the migrations (up or down, preferably). I've seen a lot of examples of how to manipulate migrations from the command line within the package manager, but not a lot on how to create a utility with a friendly UI for upgrading/downgrading DBs in production.
Also, I have not seen anything that shows how to create stored procedures in a migration (our DBs rely on some stored procedures). I'm assuming that, if nothing else, I can use the Sql() method to run a SQL statement that creates an SP. Is that correct? Is there a better way?
I know my questions are a bit non-specific and I apologize for that. But I'm still in the beginning processes of learning this and I'd like to get an idea of whether or not this is a good way to go. Any guidance would be greatly appreciated.
Thanks,
Dennis

Firstly, on SQL Server support, Entity Framework doesn't really support SQL Server 2000. See this question:
EntityFramework SQL Server 2000?
On the question of supporting all the multiple versions, you have the right idea about needing to generate an initial migration for the oldest version first then incrementally altering the model and generating migrations to support the later versions. This will be a pain as the migrations are opinionated about how they represent the model in the database and you will be doing a lot of messing about to end up with a model and a set of migrations that fully represent that. Specific concerns are indexes, column lengths, data types, stored procedures, triggers, functions, partitioning.
The Sql() function gets you around most issues, though also helpful in the migrations are functions like CreateIndex and AlterColumn.
For automating this, the migrations are definitely available as PowerShell cmdlets, which are themselves just .NET objects and so can be called programmatically.
As this question is a year old, I assume you will have made a decision on whether to do this. My opinion is that it is hard to see that it's worth the effort. If you were re-platforming the code base that uses this database to Entity Framework then it would make sense. Otherwise there are bound to be better tools out there for database version management. My first port of call would be Redgate.

About Java Cassandra Client, which one is better? How about CQL? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am trying to develop an application using Hive as the database, and I have also been looking at NoSQL solutions as an alternative to it.
Having now decided to develop using Cassandra, my next problem is which client I should use. Which one is better: Hector, a pure Java solution, or Kundera, with JPA-like development?
I prefer Hector, but I am curious about Kundera. Is there anyone using Kundera? Which is better?
I'm also curious about CQL (Cassandra Query Language). Can it integrate with Hector?

Hector is slowly moving towards CQL integration. The first steps have been made, but because of their experience with an unstable API, the developers seem to have postponed a new release. The CQL API is rather new and is meant to be nearly equivalent to SQL syntax. I made some basic steps with CRUD operations to verify that data could be written and read via CQL.
Nevertheless, the CQL JAR is not usable out of the box like a standard JDBC driver as of now, and lacks some important features. Having looked at the more or less difficult to understand Thrift API and the not really much simpler Hector API, I am convinced that CQL will be established as the state-of-the-art access API for Cassandra in versions 0.8.1 and 1.0, while Thrift will remain the native, raw access for some time.
The competition between both APIs has nothing to do with the decision of Hector. Hector itself provides additional services like failure and connection handling in the cluster. These are features being addressed by neither thrift nor CQL.
I don't really believe in all other O/R mappers, or even those claiming to provide a full-fledged JPA. I cannot imagine how this should work.

Answering your question about clients - Hector essentially provides access to the Cassandra native API (columns, column families, rows etc) whereas Kundera aims to hide these details and provide object-database mapping.
Kundera therefore probably makes it easier to quickly persist a range of Java objects into Cassandra - but may not provide an efficient mapping, perhaps losing some of the performance that noSQL approaches provide.
Hector expects you to adapt to the Cassandra data model - this will be harder work, but is likely to deliver more performance.

There is now a new client, Astyanax, released by Netflix in January 2012.
"Astyanax is a Java Cassandra client. It borrows many concepts from
Hector but diverges in the connection pool implementation as well as
the client API. One of the main design considerations was to provide a
clean abstraction between the connection pool and Cassandra API so
that each may be customized and improved separately. Astyanax provides
a fluent style API which guides the caller to narrow the query from
key to column as well as providing queries for more complex use cases
that we have encountered. The operational benefits of Astyanax over
Hector include lower latency, reduced latency variance, and better
error handling."
The source code for Astyanax is hosted at Github: https://github.com/Netflix/astyanax

For details about using CQL with Cassandra and Hector, see:
https://github.com/rantav/hector/wiki/Using-CQL
The following mail list thread is a good discussion on where we will be going with CQL as an API:
http://groups.google.com/group/hector-users/browse_thread/thread/540dc9c3908fbb44/f5ee488f2178e2f4

For the sake of completeness I think the Pelops library should be mentioned too. Hector seems to be the most used, but Pelops has a simpler API. Pelops does not support CQL.
Coming from Ruby I find both to be extremely verbose and imperative, though.

Kundera no longer relies on Solandra for indexing. It now lets you use the secondary indexing support provided by Cassandra, and it also gives you a way to run JPA queries over OPP (range queries, etc.). We are working to enable native CQL support.
Take a look at:
http://mevivs.wordpress.com/2012/02/13/how-to-crud-and-jpa-association-handling-using-kundera/
for more details.
-Vivek

There is no Java client on the same level as Hector. Hector is the best, and there is work in progress on the Hector side to support CQL. I saw CQL commits for Hector on GitHub this month, but I don't know their final state. You can ask the Hector users group: http://groups.google.com/group/hector-users
Also there is a very simple object mapper in hector
https://github.com/rantav/hector/wiki/Hector-Object-Mapper-%28HOM%29
My Best,
Serdar Irmak

Kundera 2.0.4 released:
Major changes in this release:
Cross-datastore persistence (easy to migrate an existing MySQL app over to NoSQL).
Support for relational databases (e.g. MySQL).
Solandra replaced with Lucene-based indexing.
Support added for bi-directional associations.
Performance improvement fixes.
In our tests, 1 million inserts with proper indexing completed in 6 minutes.
Vivek

I have yet to try Hector, but I am involved in the latest Kundera 2.0.1 release. I suggest you give it a try. It has undergone major changes since its inception, and you can see a lot of new features getting added and bugs being fixed. Currently it supports JPA 1.0 and Cassandra 0.7.6, but we are planning to add support for Cassandra 0.8 and JPA 2.0 very soon. There is a pretty good example here: https://github.com/impetus-opensource/Kundera/wiki/Getting-started that may help you get started.

Astyanax api produces human-readable code and does include connection pooling.

CQL support for Cassandra has been integrated in Kundera 2.0.6 (yet to be released). It now allows you to execute CQL as a native query.
-Vivek

Cassandra vs Riak [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I am looking for an eventually consistent data store and it looks like it may be coming down to Riak or Cassandra. Has anyone got experiences or a view on this?

As you probably know, they are both architecturally strongly influenced by Dynamo (eventually consistent, no single points of failure, etc). Both also go beyond Dynamo in providing a "richer than pure K/V" data model -- in Cassandra's case, providing a Bigtable-like ColumnFamily model, in Riak's, a document-oriented one. I have seen sane people choose both.
I believe points that favor Cassandra include
speed
support for clusters spanning multiple data centers
big names using it (digg, twitter, facebook, webex, ... -- http://n2.nabble.com/Cassandra-users-survey-tp4040068p4040393.html)
Points that favor Riak include
map/reduce support out of the box
/Cassandra dev, fwiw

Riak is used by
Mozilla Foundation
Ask.com sponsored listings
Comcast
Citigroup
Bet365
I think they both pass the test of credible reference customers/users.
Cassandra seems more mature, and is currently doing better in benchmarks. Riak seems easier to add a node to as your cluster grows.

For completeness: A good (probably biased) comparison between the two can be found at http://docs.basho.com/riak/1.3.2/references/appendices/comparisons/Riak-Compared-to-Cassandra/

Use and download are different. Best to get references.
Perhaps a private conversation could be had where Riak references in these companies could be shared? Not sure how to get such with Cassandra, but there is a community of companies that support Cassandra that seem like a good place to start. As these probably have community participants in Cassandra development, it may be a REALLY reasonable place to start.
I would like to hear Riak's answer to recent and large deployments where customers are happy.
I also would like to see the roadmap for each product. Cassandra is a bit easier to track (http://wiki.apache.org/cassandra/) than Riak in my view as Cassandra's wiki discusses limitations and things that are probably going to change going forward, but neither outline futures well. I could understand that of an open source community ... perhaps ... but I cannot for a product for which I must pay.

I also would suggest research of Cloudant, which has what appears to be a very nice layering of capabilities. It also looks like it is bringing to bear the capabilities elsewhere in Apache land. CouchDB is the Apache platform on which Cloudant is based. BUT the indexing with Lucene seems but the tip of the iceberg when it comes to where Cloudant could go. Creating and managing an index is a very systematic process, a kind of data pipeline, that could be scripted using other Apache community assets. AND capabilities like NLP also could be added through Lucene indirectly, or maybe directly into what is persisted.
It would be nice to see a proposed Cloudant roadmap, especially since the team could mine the riches of the Apache community and integrate such into Cloudant. Such probably exists as there is an operational component to the Cloudant revenue model that will require it, if for no other reason.
Another area of interest ... Cloudant's pricing model ... it is clear their revenue model is not based on software, but around service. That is quite attractive, and it seems consistent with the ecosystem surrounding Cassandra too. I don't know if the Basho folks have won over enough of the nosql community as yet ... don't see such from any buzz around their web site or product.
I like this Cloudant web page (https://cloudant.com/the-data-layer/). I was surprised to see the embedded Erlang capability ... I did not know CouchDB was written in Erlang as this seems unusual to me in the Apache community (my ignorance); CouchDB appears to be older than other nosql products I know (now) to be written in Erlang. Whatever their strategy, they at least count Amazon EC2 and Microsoft Azure as hosting partners, indicating an appreciation of Microsoft and !Microsoft worlds - all very important if properly recognizing the middleware value potential (beyond cache or hash table applications) that these types of data stores could have.
Finally, while I don't know the board well, Andy Palmer's guidance looks like it will be valuable. He can bring some guidance vis-a-vis structured data (through VoltDB) to a world that rightly or wrongly may be unfairly branded as KVP hash tables of unstructured data. The need for structure and ecosystem surrounding nosql "databases" is being recognized ... witness Google's efforts with Spanner ... KVP/little structure/need for search-ability motivated Google's investment in the Spanner space. While we all may not need something like Spanner, we probably do need an improving and robust "enterprise" management and interoperability capability in these nosql databases to make it reasonable to incorporate them into modern cloud architectures. The needed structure can come from ease of interoperability and functional richness. It can also come from new capabilities that support conversion of unstructured data to structured data (e.g. indexes, use of NLP to create structured and parsed renderings of things inside of a KVP blob, and plenty of other things that, if put into a roadmap and published, could entice and grow a user base). Cloudant looks like it has a good chance of success ... I will take a closer look at it ...
And look what I found about CouchDB ...
CouchDB comes with a suite of features, such as on-the-fly document transformation and real-time change notifications, that makes web app development a breeze. It even comes with an easy to use web administration console. You guessed it, served up directly out of CouchDB! We care a lot about distributed scaling. CouchDB is highly available and partition tolerant, but is also eventually consistent. And we care a lot about your data. CouchDB has a fault-tolerant storage engine that puts the safety of your data first.

Can you recommend a PostgreSQL Visual Database Designer for Linux?

When I'm in Windows, I use the excellent MicroOLAP Database Designer for PostgreSQL, but it's not open source or multiplatform.
Do you know of, or can you recommend, an alternative to this software that I can use in Linux?
EDIT: Just to clarify, I don't want to use Wine to run MicroOLAP Database Designer for PostgreSQL. It doesn't work too well, and I would prefer something native or Java based.

pgDesigner is a database design application for PostgreSQL, for versions 7.x and 8.x.
pgDesigner provides the following features:
Complete datamodel editor
Support for PostgreSQL objects: tables, views, relations, tablespaces, procedures, triggers, types, domains and sequences
Automatic updating of relations between tables.
Wizard for the construction of views.
Report generator, with statistics
Printing the diagram
SQL export
Creation of the database
Management of the project on a diagram chart

I stopped using software database designers years ago, and reverted back to the trusty pen and paper which is just easier to use in my experience.
To answer your question though, take a look at dbDesigner4 which is what I used to use. I remember it being fantastic. It's open source and multiplatform.

How about Clay? It's a plugin for Eclipse, and the free version supports generating Postgres DDL.

I really like dbWrench. It's commercial as well, but not expensive and is Java based. It can reverse engineer a database and generates pretty good HTML based documentation.
http://www.dbwrench.com/

This is a crappy answer for which I should be taken out and shot, but you can search over nearly all PostgreSQL related projects at PgFoundry. I don't know from GUI database design tools, but I'd imagine you should be able to find something there, if it exists.
