Our company use Cassandra 4.0 as our DWH and we are trying to change our ETL tool to nifi.
But nifi only supports Cassandra 3.0 which Cassandra supports only python 2.7.
Is there any way to use nifi with Cassandra 4.0?
what open source ETL tool would you recommend?
I will provide some information that may be helpful. It looks like there is a request for using Cassandra 4.0 in a future version of NiFi, as a ticket was just submitted. If it were to get enough noise, then I think they might add it in. https://issues.apache.org/jira/projects/NIFI/issues/NIFI-10285
Related
I am new to both spark and talend.
But I read everywhere that both of these are ETL tools. I read another stackoverflow answer here. From the other answer what I understood is talend do use spark for large data processing. But can talend do all the ETL work efficiently that spark is doing without using spark under the hood? Or is it essentially a wrapper over spark where all the data is send to talend is actually put inside the spark inside talend for processing?
I am quite confused with this. Can someone clarify this?
Unlike Informatica BDM which has its own Blaze framework for processing on Hadoop (native), Talend relies on other frameworks such as Map Reduce (Hadoop using possibly tez underneath) or Spark engine. So you could avoid Spark, but there is less point in doing so. The key point is that we could expect I think some productivity using Talend as it is graphical based, which is handy when there are many fields and you do not need possibly the most skilled staff.
For NOSQL, like HBase, they provide specific connectors or could use the Phoenix route. Talend also has connectors for KAFKA.
Spark is just one of the frameworks supported by Talend. When you create a new job, you can pick Spark from the dropdown list. You can get more details in the docs.
I have a cassandra ubuntu visual cluster and need to benchmark it.
I try to do it with yahoo's ycsb (without use of maven if possible).
I use cassandra 3.0.1 but I cant find a suitbale version of ycsb.
I dont want to change to an oldest version of cassandra (ycsb latest cassandra-binding is for cassandra 2.x)
What should I do?
As suggested here, despite Cassandra 3.x is not officially supported, you can use the cassandra-cql binding.
For instance:
/bin/ycsb load cassandra-cql -threads 4 -P workloads/workloada
I just tested it on Cassandra 3.11.0 and it works for both load and run.
That said, the benchmark software to use depends on your test schedule. If you want to benchmark only Cassandra, then #gsteiner 's solution might be the best. If you want to benchmark different databases using the same tool to avoid variability, then YCSB is the right one.
I would recommend using Cassandra-stress to perform a load/performance test on your Cassandra cluster. It is very customizable, to the point that you can test distributions with different data models as well as specify how hard you want to push your cluster.
Here is a link to the Datastax documentation for it that goes into how to use the tool in depth.
https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsCStress_t.html
In order to configure kundera for Cassandra, I notice there are 3 possible options for kundera.client.lookup.class as below
com.impetus.client.cassandra.pelops.PelopsClientFactory
com.impetus.kundera.client.cassandra.dsdriver.DSClientFactory
com.impetus.client.cassandra.thrift.ThriftClientFactory
I am not sure of the Pros and Cons of the above 3 and hence not sure which one to use. Please help me decide
I suggest you to use com.impetus.client.cassandra.thrift.ThriftClientFactory. It is the implementation using just Cassandra's thrift api.
PelopsClient is not in active development.
DSClient is built over datastax driver of cassandra.
There is no real advantage of using either DSClient or ThriftClient.
After further research, I found the following
Don't use PelopsClient as its not in active development as mentioned by #karthik , but more importantly because of the issue reported here
Data Stax Driver is better than thrift client as it over comes few limitations of thrift and they use a different binary protocol specific to cassandra which gives a better performance. Refer Datastax java driver support for Cassandra using Kundera
I am working for a small concern and very new to apache cassandra. Studying about cassandra and performing some small analytics like sum function on cassandra DB for creating reports. For the same, Hive and Accunu can be choices.
Datastax Enterprise provides the solution for Apache Cassandra and Hive Integration. Is Datastax Enterprise is the only solution for such integration. Is there any way to resolve the hive and cassandra integration. If so, Can I get the links or documents regarding the same. Is that possible to work the same with the windows platform.
Is any other solution to perform analytics on cassandra DB?
Thanks in advance .
I was trying to download DataStax Enterprise (DSE) for Windows but found there is no such option on their website. I suppose they do not support DSE for Windows.
Apache Cassandra does have builtin Hadoop support. You need to set up a standalone Hadoop cluster colocated with Apache Cassandra nodes and then use ColumnFamilyInputFormat and ColumnFamilyOutputFormat to read/write data from/to your Hadoop cluster.
I'm currently starting a project that use Cassandra Apache. So I'm interesting in accessing to my database cassandra from Java. For that, I'm using Hector Cassandra. However, I've some doubts about what's the differences between the access via Hector or JDBC Cassandra (specifically this: https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/).
I believe the following (although I not sure if I'm right):
one difference between both could be that are API of different level (I consider that Hector Cassandra is an API of higher-level than JDBC Cassandra)?
in JDBC Cassandra is used CQL for accessing/modifying the database, while Hector Cassandra don't use CQL (only use the methods provided for that).
I'll be thankful if someone can help me and tell me if I'm right/wrong in the previous lines and more differences between both (Hector and JDBC Cassandra).
Thank in advance!
Official Cassandra Java Driver (https://github.com/datastax/java-driver) is probably the best (IMHO, the only) choice for a new project for several reasons:
New features
All other Cassandra clients (Hector, Astyanax, etc) are based on legacy Thrift RPC protocol. RPC "One response per one request" model has severe limitations, for example it doesn't allow processing several requests at the same time in a single connection or streaming large ResultSets.
So, DataStax developed a new protocol that doesn't have RPC limitations. Thrift API won't be getting new features, it's only kept for backward-compatibility. In contrast, Java Driver is actively developed to incorporate the new features of Cassandra 2.0, like conditional updates, batching prepared statements, etc. The overview of new features is here: http://www.datastax.com/dev/blog/cql-in-cassandra-2-0
Convenience
In early Cassandra days (0.7) in our company we have used in-house low-level Thrift client. Later on we have used Hector, Pelops and Astyanax in various projects. I can say that the clients based on Java Driver look the most simple and clean to me.
Performance
We have made some performance testing of Cassandra Java Driver vs other clients. In most scenarios the performance is roughly the same. However, there are certain situations when Cassandra Java Driver significantly outperforms other clients due to its asynchronous nature.
Btw, there's a couple of related questions with excellent answers:
Advantages of using cql over thrift
Cassandra Client Java API's
EDIT: When I wrote this, I wasn't aware that Achilles (https://github.com/doanduyhai/Achilles) mentioned in another answer has CQL implementation that works via Java Driver. For the same of completeness I must say that Achilles' DAO on top of CQL might be (or might became one day) viable alternative to plain CQL via Java Driver.
#mol
Why do you restrict to Hector and cassandra-jdbc if you're starting a new project ?
There are many other interesting choices:
Astyanax as Martin mentioned (Thrift & CQL3)
FireBrand (Thrift via Hector)
Achilles I've just developed (CQL3 & Cassandra 2.0 via Java driver core)
Java Driver Core for plain CQL3
Hector is indeed a higher-level API. Internally it will use Cassandra's Thrift API to execute its functions. It will not convert them to equivalent CQL calls. But its API also provides access to CQL. In this case it will pass the CQL (via Thrift) to Cassandra's APIs for CQL.
CQL in Cassandra is a SQL-like language that works via the Cassandra APIs. So it does not provide any additional capability in the use of Cassandra than the APIs but does make it easier at times to use. If you are considering using Hector I would also look at Astyanax which is a newer take on a high-level Java API to Cassandra.
Since you are starting a new project, it is best to start with CQL as Java native driver:
http://www.datastax.com/documentation/developer/java-driver/1.0/webhelp/index.html#common/drivers/introduction/introArchOverview_c.html
Per DataStax, it is 10-15% faster than Thrift APIs, as it uses Binary Protocol.