I could not find a Hazelcast Jet source connector for Apache Pulsar. Has anybody tried this? I would appreciate any directions, pointers, sources, or considerations if I have to write a custom stream connector for Pulsar as a source for Jet.
An initial version of the Jet connector for Apache Pulsar was recently implemented here. It hasn't been extensively tested yet. For details, see the design document, which describes what the connector does and does not support, and the tutorial. If anything in them is confusing, feel free to ask again.
Hazelcast Jet doesn't have a connector for Apache Pulsar as of now (version 4.0). If you'd like to contribute one, you can look at the Source Builder class and its section in the reference manual as a starting point.
Also, please check out the existing connector implementations in the Hazelcast Jet extension modules repository, which use the Source Builder API, and contribute yours there.
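For anyone writing such a source by hand, the general shape is a function that creates a per-processor context holding the client, a fill-buffer function that is called repeatedly and should return promptly, and a destroy function that releases the client. The sketch below models that shape in plain Java so that it runs without Jet or Pulsar on the classpath. `PulsarReaderContext`, `fillBuffer`, `MAX_BATCH` and the in-memory queue are hypothetical stand-ins for illustration, not Jet or Pulsar APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for the context object that would hold a Pulsar
// consumer in a real connector. Nothing here is Jet or Pulsar API.
final class PulsarReaderContext implements AutoCloseable {
    // Cap per call so that each call returns promptly (assumed value).
    private static final int MAX_BATCH = 2;

    private final Queue<String> incoming; // stands in for the topic subscription
    private boolean closed;

    PulsarReaderContext(Queue<String> incoming) {
        this.incoming = incoming;
    }

    // Plays the role of the fill-buffer callback: drain what is available
    // right now (up to MAX_BATCH) and return without waiting for more.
    void fillBuffer(List<String> buffer) {
        for (int i = 0; i < MAX_BATCH; i++) {
            String message = incoming.poll(); // non-blocking; null when empty
            if (message == null) {
                return;
            }
            buffer.add(message);
        }
    }

    // Plays the role of the destroy callback: release the client.
    @Override
    public void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }

    public static void main(String[] args) {
        Queue<String> topic = new ConcurrentLinkedQueue<>(List.of("m1", "m2", "m3"));
        List<String> buffer = new ArrayList<>();
        try (PulsarReaderContext ctx = new PulsarReaderContext(topic)) {
            ctx.fillBuffer(buffer); // takes at most MAX_BATCH items
            ctx.fillBuffer(buffer); // takes the remainder
        }
        System.out.println(buffer);
    }
}
```

In a real connector the queue would be replaced by a Pulsar consumer and the three roles would be wired into the Source Builder; the reference manual section mentioned above is the place to check the exact callbacks.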
Our company uses Cassandra 4.0 as our DWH and we are trying to change our ETL tool to NiFi.
But NiFi only supports Cassandra 3.0, which in turn only supports Python 2.7.
Is there any way to use NiFi with Cassandra 4.0?
What open source ETL tool would you recommend?
I will provide some information that may be helpful. It looks like there is a request for using Cassandra 4.0 in a future version of NiFi, as a ticket was just submitted. If it were to get enough noise, then I think they might add it in. https://issues.apache.org/jira/projects/NIFI/issues/NIFI-10285
From the wikis of these two projects, it seems they do a similar job. But there must be some difference, or there would be no need for two.
So what are the differences, and what is the practical advice for choosing one over the other?
thx a lot!
Great answers above.
Just a quick update following the Cloudera+Hortonworks merger last year.
These companies have decided to standardize on Ranger.
CDH5 and CDH6 will still use Sentry until the CDH product line is retired in ~2-3 years.
Ranger will be used for Cloudera+Hortonworks' combined "Unity" platform / CDP product.
Cloudera told us that Ranger is a more "mature" product.
Since Unity hasn't been released yet (as of May 2019), something may come up in the future, but that's the current direction. (Oct 2019 update: Unity is now known as CDP and is available for beta testing; it will be available for cloud deployments soon, and in 2020 for on-prem customers.)
If you're a former Cloudera customer or a CDH user, you would still have to use Apache Sentry. There is a significant overlap between Sentry and Ranger, but if you are starting fresh, definitely look at Ranger.
You can use either Sentry or Ranger, depending on which Hadoop distribution you are using, such as Cloudera or Hortonworks.
Apache Sentry - Owned by Cloudera. Supports HDFS, Hive, Solr and Impala. (Ranger will not support Impala)
Apache Ranger - Owned by Hortonworks. Apache Ranger offers a centralized security framework to manage fine-grained access control across: HDFS, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN
https://cwiki.apache.org/confluence/display/SENTRY/Sentry+Tutorial
http://hortonworks.com/apache/ranger/
Thx Kumar
Apache Ranger overlaps with Apache Sentry since it also deals with authorization and permissions. It adds an authorization layer to Hive, HBase, and Knox. Both Sentry and Ranger support column-level permissions in Hive (starting from the 1.5 release).
Ref: https://www.xplenty.com/blog/2014/11/5-hadoop-security-projects/
You can also check out RecordService.
RecordService provides an abstraction layer between compute frameworks and data storage. It provides row- and column-level security, and other advantages.
Ref: http://blog.cloudera.com/blog/2015/09/recordservice-for-fine-grained-security-enforcement-across-the-hadoop-ecosystem/
http://recordservice.io/
Both manage permissions based on role-table grants. Ranger provides dynamic data masking (in transit). Both are integrated with Informatica's Secure at Source (which identifies risky data stores in the enterprise) to deliver a data governance solution.
To configure Kundera for Cassandra, I notice there are three possible options for kundera.client.lookup.class, as below:
com.impetus.client.cassandra.pelops.PelopsClientFactory
com.impetus.kundera.client.cassandra.dsdriver.DSClientFactory
com.impetus.client.cassandra.thrift.ThriftClientFactory
I am not sure of the pros and cons of the above three and hence not sure which one to use. Please help me decide.
I suggest you use com.impetus.client.cassandra.thrift.ThriftClientFactory. It is the implementation that uses just Cassandra's Thrift API.
PelopsClient is not in active development.
DSClient is built on top of the DataStax driver for Cassandra.
Neither DSClient nor ThriftClient has a real advantage over the other.
After further research, I found the following:
Don't use PelopsClient, as it's not in active development, as mentioned by #karthik, but more importantly because of the issue reported here.
The DataStax driver is better than the Thrift client, as it overcomes a few limitations of Thrift and uses a different binary protocol specific to Cassandra, which gives better performance. Refer to Datastax java driver support for Cassandra using Kundera.
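Whichever factory is chosen, it is selected through the `kundera.client.lookup.class` property. Here is a minimal sketch of building that setting as a plain map, using only the Java standard library. The `KunderaClientConfig` class and the `forClient` helper are hypothetical names for illustration, and handing the map to the JPA bootstrap is an assumption about how the setting would be supplied, not something this thread confirms.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; only the property key and the factory class names
// come from the question above.
final class KunderaClientConfig {
    static final String LOOKUP_KEY = "kundera.client.lookup.class";

    static final String DS_DRIVER =
            "com.impetus.kundera.client.cassandra.dsdriver.DSClientFactory";
    static final String THRIFT =
            "com.impetus.client.cassandra.thrift.ThriftClientFactory";

    // Builds a property map that selects the given client factory.
    static Map<String, String> forClient(String factoryClass) {
        Map<String, String> props = new HashMap<>();
        props.put(LOOKUP_KEY, factoryClass);
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props = forClient(DS_DRIVER);
        // Assumption: the map would then be passed to the JPA bootstrap, e.g.
        // Persistence.createEntityManagerFactory("my_unit", props), or the same
        // key would be set as a <property> in persistence.xml.
        System.out.println(props);
    }
}
```

Switching between the DataStax-based client and the Thrift client then comes down to changing that one value.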
I am new to Bluemix and also to Apache Spark. I want to do a small task using IBM Analytics for Apache Spark: create a virtual sensor using Bluemix's virtual sensors (https://virtualsensors.mybluemix.net/), use the generated data as input to the Spark Streaming service, and do some analytics on that input. But I don't know exactly how to connect the instances of those two applications, and I am stuck. It would be great if someone could help me.
Thanks,
From the documentation the Virtual Sensors just emit their sensor data using MQTT, so I imagine this would be as easy as importing an MQTT library in your language of choice and simply connecting that to the Virtual Sensors.
You haven't really specified what language you're working with on the Spark side, but they'll probably all shake out to either:
Paho (Python, Java, Scala)
Scala-MQTT-client (specifically Scala)
For how to use it, the Paho project also includes some basic documentation about how MQTT works.
Some of the other basics are covered in the MQTT FAQ and this YouTube video.
If you need to add the JAR to your notebook, you should be able to use the %AddJar command. You can read about that here -- scroll down to the section titled "Deploy your custom library jar to a Jupyter Notebook" for the instructions and example use.
I suggest going through this recipe, which shows how to configure Apache Spark Streaming running in IBM Bluemix to get data from actual sensor devices. I believe you can just tweak the topic ID to get the data from the virtual sensor as well.
Also, look at the GitHub project that shows how to create the Spark-mqtt-connector DStream so that the Spark service can consume the events in real time.
I'm currently starting a project that uses Apache Cassandra, so I'm interested in accessing my Cassandra database from Java. For that, I'm using Hector. However, I have some doubts about the differences between access via Hector and via Cassandra JDBC (specifically this: https://code.google.com/a/apache-extras.org/p/cassandra-jdbc/).
I believe the following (although I'm not sure if I'm right):
One difference could be that they are APIs of different levels (I consider Hector a higher-level API than Cassandra JDBC).
Cassandra JDBC uses CQL to access and modify the database, while Hector doesn't use CQL (it only uses its own methods).
I'd be thankful if someone could tell me whether I'm right or wrong in the lines above and point out more differences between the two (Hector and Cassandra JDBC).
Thanks in advance!
The official Cassandra Java Driver (https://github.com/datastax/java-driver) is probably the best (IMHO, the only) choice for a new project, for several reasons:
New features
All other Cassandra clients (Hector, Astyanax, etc.) are based on the legacy Thrift RPC protocol. The RPC "one response per one request" model has severe limitations; for example, it doesn't allow processing several requests at the same time on a single connection or streaming large ResultSets.
So, DataStax developed a new protocol that doesn't have RPC limitations. Thrift API won't be getting new features, it's only kept for backward-compatibility. In contrast, Java Driver is actively developed to incorporate the new features of Cassandra 2.0, like conditional updates, batching prepared statements, etc. The overview of new features is here: http://www.datastax.com/dev/blog/cql-in-cassandra-2-0
Convenience
In the early Cassandra days (0.7), our company used an in-house low-level Thrift client. Later on we used Hector, Pelops and Astyanax in various projects. I can say that clients based on the Java Driver look the simplest and cleanest to me.
Performance
We have done some performance testing of the Cassandra Java Driver against other clients. In most scenarios the performance is roughly the same. However, there are certain situations in which the Cassandra Java Driver significantly outperforms other clients due to its asynchronous nature.
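The effect of issuing requests asynchronously can be illustrated without a cluster. The sketch below uses only `java.util.concurrent`; `fakeQuery` is a hypothetical stand-in for one network round trip, not a driver call, and the 50 ms latency is an arbitrary assumption. A thread pool is used here only to simulate overlapping requests, whereas a driver would overlap them on its connections. The sketch contrasts waiting for each response before sending the next request with submitting all requests first and collecting the results afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class AsyncVsBlocking {
    // Hypothetical stand-in for one request/response round trip.
    static String fakeQuery(int key) {
        try {
            Thread.sleep(50); // simulated network latency (assumed value)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "row-" + key;
    }

    // One request at a time: each call waits for its response before the next starts.
    static List<String> blocking(List<Integer> keys) {
        List<String> rows = new ArrayList<>();
        for (int key : keys) {
            rows.add(fakeQuery(key));
        }
        return rows;
    }

    // All requests are submitted before any result is awaited;
    // results are then collected in key order.
    static List<String> async(List<Integer> keys, ExecutorService pool) {
        List<CompletableFuture<String>> pending = new ArrayList<>();
        for (int key : keys) {
            pending.add(CompletableFuture.supplyAsync(() -> fakeQuery(key), pool));
        }
        List<String> rows = new ArrayList<>();
        for (CompletableFuture<String> future : pending) {
            rows.add(future.join());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Integer> keys = List.of(1, 2, 3, 4);
        ExecutorService pool = Executors.newFixedThreadPool(keys.size());
        try {
            System.out.println(blocking(keys));
            System.out.println(async(keys, pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

Both variants return the same rows; the difference is that the blocking variant pays the latency once per key, while the asynchronous one lets the waits overlap.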
Btw, there's a couple of related questions with excellent answers:
Advantages of using cql over thrift
Cassandra Client Java API's
EDIT: When I wrote this, I wasn't aware that Achilles (https://github.com/doanduyhai/Achilles), mentioned in another answer, has a CQL implementation that works via the Java Driver. For the sake of completeness I must say that Achilles' DAO on top of CQL might be (or might one day become) a viable alternative to plain CQL via the Java Driver.
#mol
Why do you restrict yourself to Hector and cassandra-jdbc if you're starting a new project?
There are many other interesting choices:
Astyanax as Martin mentioned (Thrift & CQL3)
FireBrand (Thrift via Hector)
Achilles I've just developed (CQL3 & Cassandra 2.0 via Java driver core)
Java Driver Core for plain CQL3
Hector is indeed a higher-level API. Internally it will use Cassandra's Thrift API to execute its functions. It will not convert them to equivalent CQL calls. But its API also provides access to CQL. In this case it will pass the CQL (via Thrift) to Cassandra's APIs for CQL.
CQL in Cassandra is a SQL-like language that works via the Cassandra APIs. So it does not provide any capability beyond what the APIs offer, but it does make Cassandra easier to use at times. If you are considering Hector, I would also look at Astyanax, which is a newer take on a high-level Java API for Cassandra.
Since you are starting a new project, it is best to start with CQL via the native Java driver:
http://www.datastax.com/documentation/developer/java-driver/1.0/webhelp/index.html#common/drivers/introduction/introArchOverview_c.html
Per DataStax, it is 10-15% faster than the Thrift APIs, as it uses a binary protocol.