How do I write a LIKE query in Cassandra?

    select * from user where user_name like '%abcd%'

How do I write this in CQL (Cassandra Query Language)? I need to search some content based on a keyword.
If it doesn't need to be real-time, you could use Hive or Shark. This enables you to run exactly the query you're speaking about. If you use DSE it works out of the box with Hive. If not, you'll want to check out this Hive driver.
To get this working with open source Cassandra, you'll need:
HDFS running co-located with your Cassandra nodes
If you use Spark, you'll need Spark workers (ideally co-located as well, though this isn't a hard requirement)
Hive or Shark running on a machine that can access the cluster
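For illustration only, here is a rough sketch of the non-real-time approach expressed with Spark SQL on top of the DataStax spark-cassandra-connector (mentioned further below) rather than Hive/Shark itself; the keyspace, table name and connection host are placeholder assumptions:

    import org.apache.spark.sql.SparkSession

    object CassandraLikeQuery {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cassandra-like-query")
          .config("spark.cassandra.connection.host", "cassandra-host")
          .getOrCreate()

        // Expose the Cassandra table to SQL; the LIKE predicate from the question
        // then works as-is, with the filtering done on the Spark side.
        spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "my_keyspace", "table" -> "user"))
          .load()
          .createOrReplaceTempView("user")

        spark.sql("SELECT * FROM user WHERE user_name LIKE '%abcd%'").show()
      }
    }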
Related
I want to use Spark SQL (installed on Machine 1) with connectors for different data stores like HBase, Hive, Cassandra, and MySQL (installed on Machine 2) to perform simple analytics like min/max, averaging, etc.
My question: is the processing of these queries done on Machine 1, or does Spark SQL act as just an interface that performs the analytics on the data store side (i.e. Machine 2)?
Yes and no. It depends on your Spark job.
Spark SQL is a separate implementation and is datastore agnostic. When you write a Spark SQL job, Spark transforms it into a DAG (directed acyclic graph).
This is similar to a database query plan, but it runs entirely on the Spark cluster.
For a simple min/max, the query might be translated into a direct query against the underlying store. But it might also be translated into a plan that preselects a bunch of records and then does its own processing on the Spark side. This is also what makes it possible to join and aggregate data from different data sources.
You can inspect the Spark SQL plan with the usual EXPLAIN statement or via the Spark web UI.
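A minimal sketch of inspecting the plan from code (the table and column names are made up for illustration); the scan nodes in the physical plan show which work is delegated to the data source and which is done by Spark:

    import org.apache.spark.sql.SparkSession

    object ExplainDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("explain-demo")
          .master("local[*]")
          .getOrCreate()

        // Assume a table "events" has already been registered from some external store.
        val query = spark.sql("SELECT MIN(value), MAX(value) FROM events")

        // Prints the parsed, analyzed, optimized and physical plans.
        query.explain(true)
      }
    }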
Hive has a metastore, and HiveServer2 listens for SQL requests; with the help of the metastore, the query is executed and the result is passed back.
HiveServer2 is essentially a customised Thrift service; in this way, Hive acts as a service, and from a programming language we can use Hive like a database.
The relationship between Spark SQL and Hive is that:
Spark SQL just utilises the Hive setup (the HDFS file system, the Hive metastore, HiveServer2). When we invoke sbin/start-thriftserver.sh (present in the Spark installation), we are supposed to give the HiveServer2 port number and the hostname. Then, via Spark's beeline, we can actually create, drop and manipulate tables in Hive. The API can be either Spark SQL or HiveQL.
If we create or drop a table, it will be clearly visible if we log into Hive and check (say via Hive beeline or the Hive CLI). To put it another way, changes made via Spark can be seen in the Hive tables.
My understanding is that Spark does not have its own metastore setup like Hive; Spark just utilises the Hive setup, and the SQL execution simply happens via the Spark SQL API.
Is my understanding correct here?
Then I am a little confused about the usage of bin/spark-sql (which is also present in the Spark installation). The documentation says that via this SQL shell we can create tables like we do above (via the Thrift server and beeline). Now my question is: how is the metadata information maintained by Spark then?
Or, like the first approach, can we make the spark-sql CLI communicate with Hive (to be specific, Hive's HiveServer2)?
If yes, how can we do that?
Thanks in advance!
My understanding is that Spark does not have its own meta store setup like HIVE
Spark will start its own embedded Derby-backed metastore if an external Hive metastore is not provided.
can we make spark-sql CLI to communicate to HIVE
Start an external metastore process and add a hive-site.xml file to $SPARK_CONF_DIR with hive.metastore.uris set, or use SET SQL statements to the same effect.
Then the spark-sql CLI should be able to query Hive tables. From code, you need to use the enableHiveSupport() method on the SparkSession builder.
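A minimal sketch of the code path, assuming a metastore thrift service is already reachable (the hostname/port and table name below are placeholders):

    import org.apache.spark.sql.SparkSession

    object HiveBackedSession {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-backed-session")
          .config("hive.metastore.uris", "thrift://metastore-host:9083")
          .enableHiveSupport()   // use the Hive metastore as Spark SQL's catalog
          .getOrCreate()

        // Tables created here are registered in the shared metastore and are
        // therefore visible from Hive beeline / the Hive CLI as well.
        spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (id INT, name STRING)")
        spark.sql("SHOW TABLES").show()
      }
    }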
Hive has a metastore and stores the table, column and partition information there.
If I do not want to use Hive, can we create a metastore for Spark the same way Hive does?
I want to query Spark SQL (not using the DataFrame API) the way I query Hive (SELECT, FROM and WHERE). Can we do that? If yes, which relational DB can we use for metadata storage?
Can we create a metastore for Spark the same way Hive does?
Spark does this for you and you don't have to use a separate installation of Hive or even just part of it (e.g. a Hive metastore).
Regardless of the installation of Apache Spark you use, Spark SQL uses a Hive metastore internally for the same purpose as Hive does (but the metastore is now part of Spark SQL).
If yes, which relational DB can we use for metadata storage?
Anything that Hive supports, e.g. Oracle, MySQL, PostgreSQL. The configuration is pretty much as you would do with a separate Hive installation (which is usually the case in such enterprisey installations).
You may want to read Hive Metastore.
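A minimal sketch of what that configuration can look like from Spark, assuming a MySQL database on a host called db-host (the URL, credentials and driver class are placeholder assumptions; the more common setup is to put the same javax.jdo.* keys into a hive-site.xml under $SPARK_CONF_DIR, just as with a standalone Hive installation):

    import org.apache.spark.sql.SparkSession

    object RdbmsBackedMetastore {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("rdbms-backed-metastore")
          // spark.hadoop.* properties are copied into the Hadoop/Hive configuration
          .config("spark.hadoop.javax.jdo.option.ConnectionURL",
                  "jdbc:mysql://db-host:3306/metastore_db?createDatabaseIfNotExist=true")
          .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver")
          .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hiveuser")
          .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "hivepass")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("SHOW DATABASES").show()
      }
    }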
Spark is essentially a distributed computation system, not a distributed storage system. Therefore, we mostly use Spark to do the computation work, which needs metadata from the various storage systems.
However, Spark internally provides an InMemoryCatalog to store the metadata if it's not configured with Hive.
You can take a look at this for more information.
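A minimal sketch of that default: a SparkSession built without enableHiveSupport() falls back to the built-in in-memory catalog, so registered tables and views only live for the lifetime of the application (the names below are made up):

    import org.apache.spark.sql.SparkSession

    object InMemoryCatalogDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("in-memory-catalog-demo")
          .master("local[*]")
          .getOrCreate()               // no enableHiveSupport(), so the in-memory catalog is used
        import spark.implicits._

        Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("demo")

        spark.catalog.listTables().show()              // "demo" is tracked by the in-memory catalog
        spark.sql("SELECT COUNT(*) FROM demo").show()
      }
    }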
The heap usage of hiveserver2 is constantly increasing (first pic).
There are applications such as NiFi, Zeppelin and Spark connected to Hive. NiFi uses PutHiveQL, Zeppelin uses JDBC (Hive), and Spark uses spark-sql. I couldn't find any clue to what is causing this.
Hive requires a lot of resources to establish a connection, so the first likely reason is the large number of queries in your PutHiveQL processor, because Hive needs to open a connection for every one of them. Keep an eye on your Hive job browser (you can use Hue for this purpose).
Possible resolution: if you use INSERT queries, load the data in bulk from ORC files instead; if you use UPDATE queries, use a temporary table and a MERGE query.
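A minimal sketch of the "bulk ORC write instead of many single-row INSERTs" idea, assuming a Hive-enabled SparkSession and a target table called events (the table and column names are placeholders):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object OrcBatchInsert {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("orc-batch-insert")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        val batch = Seq((1, "click"), (2, "view")).toDF("id", "event_type")

        // One bulk ORC write replaces many per-row INSERT statements, so only a
        // single metastore interaction is needed per batch.
        batch.write
          .format("orc")
          .mode(SaveMode.Append)
          .saveAsTable("events")
      }
    }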
Currently we are building a reporting platform, and as the data store we used Shark. Since the development of Shark has stopped, we are in the phase of evaluating Spark SQL. Based on our use cases we have a few questions.
1) We have data from various sources (MySQL, Oracle, Cassandra, Mongo). We would like to know how we can get this data into Spark SQL. Is there any utility we can use? Does such a utility support continuous refresh of data (syncing new adds/updates/deletes on the data store to Spark SQL)?
2) Is there a way to create multiple databases in Spark SQL?
3) For the reporting UI we use Jasper, and we would like to connect from Jasper to Spark SQL. In our initial search we learned that there is currently no support for consumers to connect to Spark SQL through JDBC, but that this is planned for future releases. We would like to know when Spark SQL will have a stable release with JDBC support. Meanwhile we took the source code from https://github.com/amplab/shark/tree/sparkSql, but we had some difficulty setting it up locally and evaluating it. It would be great if you could help us with setup instructions. (I can share the issues we are facing; please let me know where I can post the error logs.)
4) We would also require a SQL prompt where we can execute queries. Currently the Spark shell provides a Scala prompt where Scala code can be executed, and from Scala code we can fire SQL queries. Like Shark, we would like to have a SQL prompt in Spark SQL. In our search we found that this would be added in a future release of Spark. It would be great if you could tell us which release of Spark will address this.
As for:
3) Spark 1.1 provides better support for the Spark SQL Thrift server interface, which you may want to use for JDBC interfacing. Hive JDBC clients that support v0.12.0 are able to connect and interface with such a server (a small sketch follows below).
4) Spark 1.1 also provides a Spark SQL CLI interface that can be used for entering queries, in the same fashion as the Hive CLI or the Impala shell.
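A minimal sketch of connecting to the Spark SQL Thrift server through the Hive JDBC driver (the host, port and user are placeholders, and the hive-jdbc jar is assumed to be on the classpath):

    import java.sql.DriverManager

    object ThriftServerJdbcDemo {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection(
          "jdbc:hive2://thriftserver-host:10000/default", "user", "")
        try {
          val rs = conn.createStatement().executeQuery("SHOW TABLES")
          while (rs.next()) println(rs.getString(1))
        } finally {
          conn.close()
        }
      }
    }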
Please provide more details about what you are trying to achieve for 1 and 2.
I can answer (1):
Apache Sqoop was made specifically to solve this problem for the relational databases. The tool was made for HDFS, HBase, and Hive -- as such it can be used to make data available to Spark, via HDFS and the Hive metastore.
http://sqoop.apache.org/
I believe Cassandra is available to SparkContext via this connector from DataStax: https://github.com/datastax/spark-cassandra-connector -- which I have never used.
I'm not aware of any connector for MongoDB.
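For what it's worth, here is a minimal sketch of what reading a Cassandra table with the DataStax spark-cassandra-connector looks like (the keyspace, table and contact point are placeholders; I have not verified this against a live cluster):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    object CassandraReadDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("cassandra-read-demo")
          .set("spark.cassandra.connection.host", "cassandra-host")
        val sc = new SparkContext(conf)

        // Returns an RDD of CassandraRow; any filtering or aggregation then
        // happens on the Spark side.
        val rows = sc.cassandraTable("my_keyspace", "user")
        println(rows.count())
      }
    }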
1) We have data from various sources( MySQL, Oracle, Cassandra, Mongo)
You have to use a different driver for each case. For Cassandra there is the DataStax driver (but I encountered some compatibility problems with Spark SQL). For any SQL system you can use JdbcRDD. The usage is straightforward; look at this Scala example:
test("basic functionality") {
sc = new SparkContext("local", "test")
val rdd = new JdbcRDD(
sc,
() => { DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb") },
"SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?",
1, 100, 3,
(r: ResultSet) => { r.getInt(1) } ).cache()
assert(rdd.count === 100)
assert(rdd.reduce(_+_) === 10100)
}
But note that it's just an RDD, so you should work with this data through the RDD (map/reduce-style) API, not through SQLContext.
Is there any utility we can use?
There is the Apache Sqoop project, but it's still in active development. The current stable version doesn't even save files in Parquet format.
Spark SQL is a capability of the Spark framework. It shouldn't be compared to Shark because Shark is a service. (Recall that with Shark, you run a ThriftServer that you can then connect to from your Thrift app or even ODBC.)
Can you elaborate on what you mean by "get this data into Spark SQL"?
There are a couple of Spark - MongoDB connectors:
- the MongoDB connector for Hadoop (which doesn't actually need Hadoop at all!): https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
- the Stratio MongoDB connector: https://github.com/Stratio/spark-mongodb
If your data is huge and you need to perform a lot of transformations, then Spark SQL can be used for ETL purposes; otherwise Presto could solve all your problems. Addressing your queries one by one:
As your data is in MySQL, Oracle, Cassandra and Mongo, all of these can be integrated with Presto, since it has connectors (https://prestodb.github.io/docs/current/connector.html) for all of these databases.
Once you install Presto in cluster mode, you can query all these databases together from one platform, which also lets you join a table from Cassandra with tables from Mongo; this flexibility is unparalleled.
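A minimal sketch of such a cross-connector join, issued to Presto over JDBC (the coordinator host, catalog/schema/table names and user are placeholder assumptions, and the Presto JDBC driver jar is assumed to be on the classpath):

    import java.sql.DriverManager
    import java.util.Properties

    object PrestoCrossJoinDemo {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.setProperty("user", "analyst")

        // The catalog names ("cassandra", "mongodb") are whatever names the Presto
        // admin gave the connector configuration files.
        val conn = DriverManager.getConnection(
          "jdbc:presto://presto-coordinator:8080/cassandra/my_keyspace", props)
        try {
          val rs = conn.createStatement().executeQuery(
            """SELECT c.user_id, m.last_login
              |FROM cassandra.my_keyspace.users AS c
              |JOIN mongodb.my_db.sessions AS m ON c.user_id = m.user_id""".stripMargin)
          while (rs.next()) println(s"${rs.getString(1)} ${rs.getString(2)}")
        } finally {
          conn.close()
        }
      }
    }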
Presto can be connected to Apache Superset (https://superset.incubator.apache.org/), which is open source and provides a full set of dashboarding features. Presto can also be connected to Tableau.
You can also set up MySQL Workbench with the Presto connection details, which gives you a UI for all your databases in one place.