DSE 6 comes pre-bundled with Cassandra and Spark SQL. Has anyone also set up 'Hive on Spark' there? I wonder whether Spark version conflicts would be an issue. The reason I want this is that Hive seems to allow masking/authorization with Ranger, but Spark SQL doesn't.
This answer is not directly related to setting up Hive, but DSE has security (authentication, authorization, etc.) built in (see the FAQ), and it is supported by all components, including Spark SQL. If you want more granular permissions, you can set up row-level access control.
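To make the idea of row-level access control concrete, here is a minimal conceptual sketch in plain Python. It is not DSE's API: the function name `visible_rows`, the `department` column, and the sample data are all made up for illustration. It only shows the principle that each user sees just the rows whose filtering column matches a value they were granted.

```python
from typing import Dict, Iterable, List, Set


def visible_rows(
    rows: Iterable[Dict[str, str]],
    filter_column: str,
    granted_values: Set[str],
) -> List[Dict[str, str]]:
    """Keep only rows whose filter column holds a value granted to the user.

    Hypothetical helper for illustration; a real database enforces this
    server-side, before any data reaches the client.
    """
    return [row for row in rows if row.get(filter_column) in granted_values]


# Hypothetical sample table and grant.
employees = [
    {"name": "alice", "department": "sales"},
    {"name": "bob", "department": "hr"},
    {"name": "carol", "department": "sales"},
]
sales_view = visible_rows(employees, "department", {"sales"})
```

The point of doing this inside the database rather than in application code is that users can send arbitrary queries and still never receive rows outside their grant.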
I would like to have row level security enforced in Apache Spark SQL. Is that supported? (allow users to send raw HiveQL / Spark SQL queries but only show the data they are supposed to see). Is there a built in way to do so in Apache Spark?
No, Spark does not provide security at this level. If you want that kind of security, look at Apache Accumulo. Accumulo was created in 2008 by the US National Security Agency and contributed to the Apache Foundation. It is a system built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift. Written in Java, Accumulo has cell-level access labels and server-side programming mechanisms. You can refer to the book "Accumulo: Application Development, Table Design, and Best Practices".
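To illustrate what cell-level access labels mean, here is a simplified sketch in Python. It is not Accumulo's implementation or API; the function `is_visible` and the label names are assumptions made up for this example. It assumes each cell carries a boolean expression over labels, combined with `&` and `|` (with parentheses required when both are mixed), and that a user can read the cell only if their set of authorizations satisfies that expression.

```python
import re

# A label is a run of letters, digits or underscores (a simplification).
TOKEN = re.compile(r"[A-Za-z0-9_]+|[&|()]")


def tokenize(expr):
    tokens = TOKEN.findall(expr)
    # Reject anything the token pattern silently skipped.
    if "".join(tokens) != expr.replace(" ", ""):
        raise ValueError(f"invalid visibility expression: {expr!r}")
    return tokens


def is_visible(expr, authorizations):
    """Return True if a user holding `authorizations` may read a cell labelled `expr`."""
    if not expr.strip():
        return True  # unlabelled cell: visible to everyone
    tokens = tokenize(expr)
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in authorizations

    def parse_expr():
        nonlocal pos
        value = parse_term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is not None and tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            value = (value and rhs) if op == "&" else (value or rhs)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens in expression")
    return result
```

For example, a cell labelled `admin&(audit|ops)` would be readable by a user holding `{"admin", "ops"}` but not by one holding only `{"ops"}`. Because the check happens per cell, two users running the same scan can get different results from the same table.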
Apache Apex is an open source, enterprise-grade unified stream and batch processing platform. It is used in the GE Predix platform for IoT.
What are the key differences between these 2 platforms?
Questions
From a data science perspective, how is it different from Spark?
Does Apache Apex provide functionality like Spark MLlib? If we have to build scalable ML models on Apache Apex, how do we do it and which language should we use?
Will data scientists have to learn Java to build scalable ML models? Does it have a Python API like PySpark?
Can Apache Apex be integrated with Spark, and can we use Spark MLlib on top of Apex to build ML models?
Apache Apex is an engine for processing streaming data. Others that try to achieve the same are Apache Storm and Apache Flink. The differentiating factor for Apache Apex is that it comes with built-in support for fault tolerance and scalability, and a focus on operability, which are key considerations in production use cases.
Comparing it with Spark: Apache Spark is actually a batch processing engine. If you consider Spark Streaming (which uses Spark underneath), then it is micro-batch processing. In contrast, Apache Apex is true stream processing, in the sense that an incoming record does NOT have to wait for the next record before it is processed. Each record is processed and sent to the next stage of processing as soon as it arrives.
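The difference between the two models can be sketched in a few lines of plain Python. This is only a conceptual illustration and uses neither Spark nor Apex; all function names are made up for the example, and the micro-batch here is cut by record count, whereas Spark Streaming cuts batches by time interval.

```python
def process_per_record(records, fn):
    """True streaming: each record goes downstream as soon as it arrives."""
    for record in records:
        yield fn(record)


def process_micro_batch(records, fn, batch_size):
    """Micro-batching: records are buffered, then processed together."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield [fn(r) for r in batch]
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield [fn(r) for r in batch]


def consumed_before_first_output(make_pipeline):
    """Count how many input records are read before the first output appears."""
    consumed = 0

    def source():
        nonlocal consumed
        for i in range(6):
            consumed += 1
            yield i

    next(make_pipeline(source()))
    return consumed


def double(x):
    return x * 2


per_record = consumed_before_first_output(
    lambda src: process_per_record(src, double)
)
micro_batch = consumed_before_first_output(
    lambda src: process_micro_batch(src, double, batch_size=3)
)
```

With these definitions the per-record pipeline produces its first output after reading one record, while the micro-batch pipeline has to read a full batch of three first. That waiting time is the latency cost of micro-batching.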
Currently, work is in progress to add support for integrating Apache Apex with machine learning libraries like Apache Samoa and H2O.
Refer https://issues.apache.org/jira/browse/SAMOA-49
Currently, it supports Java and Scala.
https://www.datatorrent.com/blog/blog-writing-apache-apex-application-in-scala/
For Python, you may try using Jython, but I haven't tried it myself, so I'm not sure about it.
Integration with Spark may not be a good idea, considering they are two different processing engines. But Apache Apex integration with machine learning libraries is in progress.
If you have any other questions or feature requests, you can post them on the mailing list for Apache Apex users: https://mail-archives.apache.org/mod_mbox/incubator-apex-users/
I'm trying to implement security on my Hadoop data. I'm using Cloudera Hadoop.
Below are the two specific things I'm looking for
1. Role based authorization and authentication
2. Encryption on data residing in HDFS
I have looked into Kerberos, but it doesn't provide encryption for data already residing in HDFS.
Are there any other security tools I can go for? Has anyone implemented the above two security features in Cloudera Hadoop?
Please suggest
I think Apache Sentry will be best for you. You can find more information here.
Is DataStax Cassandra the only available Cassandra that can be used in a production environment? Are there any free alternatives available? What about the Cassandra available on the Apache site?
Datastax Community Edition is also free, it contains a basic version of OpsCenter -- http://planetcassandra.org/cassandra/
Here is the difference between the community edition and DSE
http://www.datastax.com/download/dse-vs-dsc
They can both be used in production. DataStax Enterprise comes with a bunch of extra features on top of Apache Cassandra, and also comes with support.
DataStax is a commercial company that supports C*. The base source code of Cassandra is taken from the Apache repositories, then some of their own code is merged in. Besides this, as already mentioned by others, the DataStax version comes with some additional tools for maintaining a Cassandra cluster.
One of the benefits of DataStax Enterprise is its seamless integration with Solr, another great Apache Foundation project.
Cassandra comes with a Query Language called CQL (Cassandra Query Language) which is "similar" to SQL, you should however think of CQL like a cousin of SQL, not a brother.
One of the great features of the Enterprise edition is that you can query a Solr index through its CQL integration. A Cassandra cluster also shares its resources with Solr, so you don't need a second cluster for Solr.
You could set up Apache or DataStax Cassandra and you would get almost the same thing. But if you need something similar to the SQL LIKE statement (not natively available in Cassandra), or you have a heavily denormalized database and you need search capabilities, then DataStax Enterprise (DSE) is your only viable choice.
As someone already has mentioned, DSE is free for startups until they reach an annual revenue of 3m USD, or are funded with 30m. This should give everybody the opportunity to leverage the power of NoSQL and use one of the most reliable databases for big data out there.
For the Cassandra product, you can use the Apache open source offering in production, if your organisation is comfortable with open source.
You can also use the Datastax Community version of Cassandra, which is also open source and free to deploy; that gives you a bit more assurance from DataStax who offer commercial support.
Then there is DataStax Enterprise, which is the version that you pay to use, with a support model included. This still uses open source Cassandra, with additional code from DataStax. They have also put this release through their internal test processes, so that they are happy to support it. That generally means the releases will lag the Apache and Community versions, if that matters to you.
The DataStax 'Dev Center' product is a GUI tool that allows you to enter CQL commands against a Cassandra installation - it is free to use against any release. You may find it useful, though the CQLSH command-line should offer much of what you may need (and Cassandra CLI).
The DataStax 'Ops Center' product is available in a free version, which can run against any Cassandra with the associated 'DataStax Agent' used to collect data from each node. The Enterprise version of Ops Center includes additional functionality; that is available if you purchase the fully supported DSE (DataStax Enterprise) stack.
Hope that helps. Much more information available at Planet Cassandra and the DataStax web sites.
Besides Apache Cassandra, there's Scylla, which is a drop-in replacement for Cassandra written in C++. It claims to be 10 times faster than Apache Cassandra. However, Scylla is still in alpha, and you should stay away from it in a production environment.
Scylla aims to support all Cassandra features together with the tooling. It also supports JMX monitoring.
Apache Cassandra has all the same features as the DataStax Community edition, so you can run Apache Cassandra in a production environment.
Another good feature of DSE is the ability to do backup and recovery of your Cassandra database which I would think is very important if you are planning to use this in a production setup.