I have a complex analytical Neo4j Cypher query which I run each time in run-time. According to the following documentation, https://neo4j.com/developer/apache-spark/ looks like I may execute the query on Apache Spark cluster:
org.neo4j.spark.Neo4j(sc).cypher("MATCH (n:Person) RETURN n.name").partitions(5).batch(10000).loadRowRdd
Does this mean that instead of doing such a simple query, I can do a Cypher query of any complexity this way to take advantage of Spark's in-memory parallel processing?
The documentation for the new Neo4j Spark Connector states:
partitions: This defines the parallelization level while pulling data
from Neo4j. Note: as more parallelization does not mean better query
performance, tune wisely in according to your Neo4j installation.
You can definitely try it out, but it isn't a given that the performance will be better. Read more in the docs: https://neo4j.com/docs/spark/current/reading/
Related
I know the question had been asked years ago, but I am still wondering the true purpose of using SparkSQL / HiveContext.
Spark approach gives a more generic distributed way that the built-in MapReduce.
I read a lot of articles claiming that MR way is already dead and Spark is the best (I understand that I can implement an MR approach through Spark).
When it is recommended to query data using HiveContext, I am a little bit confused.
Indeed, running a query from SparkSQL/HiveContext doesn't it imply running a MR job ? Isn't it to back to the main problematic ? TEZ isn't it enought if I don't need to encapsulate the query result in more complex code ?
Am I wrong (I am sure I am :-)) ?
Indeed, running a query from SparkSQL/HiveContext doesn't it imply running a MR job ?
It does not. In fact using HiveContext or SparkSession with "Hive support" doesn't imply any connection to Hive, other than using Hive metastore. This approach is used by many other systems, both ETL solutions and databases.
Finally:
Hive is a database with modular components. It supports relatively rich permissions system, mutations and transactions.
Spark is general purpose processing engine. Despite having SQL-ish component it doesn't attempt to be a database.
I would like to store result of continuous queries running against streaming data in such a manner so that results are persisted into distributed nodes to ensure failover and scalability.
Can Spark SQL experts please shed some light on
- (1) which storage option I should choose so that OLAP queries are faster
- (2) how to ensure data available for query even if one node is down
- (3) internally how does Spark SQL store the resultset ?
Thanks
Kaniska
It depends what kind of latency you can afford.
One way is to persist the result into HDFS/Cassandra using Persist() API. If your data is small then cache() of each RDD should give you a good result.
Store where your spark executors are co-located. For example:
It is also possible to use Memory based storage like tachyon to persist your stream (i.e. each RDD of your stream) and query against it.
If latency is not an issue then persist(MEMORY_OR_DISK_2) should give you what you need. Mind you performance is a hit or miss in that scenario. Also this stores the data in two executors.
In other cases if your clients are more comfortable in OLTP like database where they just need to query the constantly updating result you can use conventional database like postgres or mysql. This is a preferred method among many as query time is consistent and predictable. If the result is not update heavy but partitioned (say by time) then Greenplum like systems are also a choice.
I am new with Spark and trying to understand performance difference in below approaches (Spark on hadoop)
Scenario : As per batch processing I have 50 hive queries to run.Some can run parallel and some sequential.
- First approach
All of queries can be stored in a hive table and I can write a Spark driver to read all queries at once and run all queries in parallel ( with HiveContext) using java multi-threading
pros: easy to maintain
Cons: all resources may get occupied and
performance tuning can be tough for each query.
- Second approach
using oozie spark actions run each query individual
pros:optimization can be done at query level
cons: tough to maintain.
I couldn't find any document about the first approach that how Spark will process queries internally in first approach.From performance point of view which approach is better ?
The only thing on Spark multithreading I could found is:
"within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads"
Thanks in advance
Since your requirement is to run hive queries in parallel with the condition
Some can run parallel and some sequential
This kinda of workflows are best handled by a DAG processor which Apache Oozie is. This approach will be clearner than you managing your queries by code i.e. you will be building your own DAG processor instead of using the one provided by oozie.
Can anyone recommend me which technology can be explored if I am having a large data set in Cassandra table (3 node cluster) and I need to perform a sum operation on records received on daily basis. The count so calculated needs to be updated in a MySQL table.
Steps to perform -
1. Fetch Ids from MY SQL table
2. Run Sum operation from Cassandra table
3. Insert/update the calculated sum value in MYSQL table
Currently I am using plain Java to perform these tasks using SQL and CQL queries but its very slow and in future data will be growing exponentially.
Can anyone suggest technologies that can be explored to get this task accomplish in fastest possible way and lowest development time.
There's not much to recommend, it depends only on the task you have and your own preferences.
Apache Storm is a streaming engine, it would be good if you want to process stream of entries, not a batch of data like in your case.
Both Apache Spark and Apache Flink will allow you to perform batch job once a day or make a streaming application that will calculate results from one day.
I prefer Apache Spark, as it has unified API for batch and streaming jobs (so you can easily change code from batch to streaming) and strong community support. Apache Flink supports real time streaming, however it's not necessary in your case.
However, you should look and these two frameworks on your own and choose this framework, which looks better for you. In my opinion both of them will be ok
I would like to hear your thoughts and experiences on the usage of CQL and in-memory query engine Spark/Shark. From what I know, CQL processor is running inside Cassandra JVM on each node. Shark/Spark query processor attached with a Cassandra cluster is running outside in a separated cluster. Also, Datastax has DSE version of Cassandra which allows to deploy Hadoop/Hive. The question is in which use case we would pick a specific solution instead of the other.
I will share a few thoughts based on my experience. But, if possible for you, please let us know about your use-case. It'll help us in answering your queries in a better manner.
1- If you are going to have more writes than reads, Cassandra is obviously a good choice. Having said that, if you are coming from SQL background and planning to use Cassandra then you'll definitely find CQL very helpful. But if you need to perform operations like JOIN and GROUP BY, even though CQL solves primitive GROUP BY use cases through write time and compact time sorts and implements one-to-many relationships, CQL is not the answer.
2- Spark SQL (Formerly Shark) is very fast for the two reasons, in-memory processing and planning data pipelines. In-memory processing makes it ~100x faster than Hive. Like Hive, Spark SQL handles larger than memory data types very well and up to 10x faster thanks to planned pipelines. Situation shifts to Spark SQL benefit when multiple data pipelines like filter and groupBy are present. Go for it when you need ad-hoc real time querying. Not suitable when you need long running jobs over gigantic amounts of data.
3- Hive is basically a warehouse that runs on top of your existing Hadoop cluster and provides you SQL like interface to handle your data. But Hive is not suitable for real-time needs. It is best suited for offline batch processing. Doesn't need any additional infra as it uses underlying HDFS for data storage. Go for it when you have to perform operations like JOIN, GROUP BY etc on large dataset and for OLAP.
Note : Spark SQL emulates Apache Hive behavior on top of Spark, so it supports virtually all Hive features but potentially faster. It supports the existing Hive Query language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.
But I think you will be able to evaluate the pros and cons of all these tools properly only after getting your hands dirty. I could just suggest based on your questions.
Hope this answers some of your queries.
P.S. : The above answer is based on solely my experience. Comments/corrections are welcome.
There is a very good effort for benchmark documented here - https://amplab.cs.berkeley.edu/benchmark/