How to write a Spark data frame to a Neo4j database

I'd like to build this workflow:
preprocess some data with Spark, ending with a data frame
write that data frame to Neo4j as a set of nodes
My idea is really basic: write each row in the df as a node, where each column value represents the value of the node's attribute
I have seen many articles, including neo4j-spark-connector and Introducing the Neo4j 3.0 Apache Spark Connector, but they all focus on importing data from a Neo4j db into Spark. So far, I haven't been able to find a clear example of writing a Spark data frame to a Neo4j database.
Any pointers to documentation or very basic examples are much appreciated.

Reading this issue answered my question.
Long story short, neo4j-spark-connector can write Spark data to a Neo4j db, and yes, the documentation of the new release is lacking on this point.

You can write your own routine and use an open-source Neo4j Java driver, for example:
https://github.com/neo4j/neo4j-java-driver
Simply serialise the rows to JSON (for example with toJSON on the data frame) and then use the above driver to create your Neo4j nodes and push them into your Neo4j instance.
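A minimal sketch of the serialise-then-write idea, in Python for brevity. It only shows how one JSON-serialised row could be turned into a parameterised Cypher statement. The label Record, the function name row_to_cypher and the sample row are assumptions for illustration, the $props syntax assumes a Neo4j version that accepts $name parameters, and actually sending the statement through a driver session is left out.

```python
import json


def row_to_cypher(row_json, label="Record"):
    """Turn one JSON-serialised row into (statement, parameters).

    `label` is a hypothetical node label; every column of the row
    becomes a property of the created node.
    """
    if not label.isidentifier():
        # Labels cannot be passed as parameters, so validate before interpolating.
        raise ValueError("unsafe label: %r" % label)
    props = json.loads(row_json)
    statement = f"CREATE (n:{label}) SET n = $props"
    return statement, {"props": props}


# Hypothetical row, as it might come out of a JSON serialisation step.
statement, params = row_to_cypher('{"name": "Alice", "age": 33}')
print(statement)
print(params)
```

Passing the column values as parameters rather than pasting them into the query text keeps the statement identical for every row, which is what makes it easy to reuse.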

I know the question is pretty old, but I don't think the neo4j-spark-connector can solve your issue. The full story, sample code and the details are available here, but to cut a long story short: if you look carefully at the Neo4jDataFrame.mergeEdgeList example (which has been suggested), you'll notice that it instantiates a driver for each row in the data frame. That will work in a unit test with 10 rows, but you can't expect it to work in a real-world scenario with millions or billions of rows. Besides, there are other defects explained in the link above, where you can also find a CSV-based solution. Hope it helps.
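One common way to avoid a driver per row is to open one connection per partition and send the rows in batches. This is not the CSV-based solution from the link, just a sketch of the batching idea in plain Python. The names batches and write_partition, the label Record, the batch size and the run callback (standing in for a driver session's run method) are all assumptions.

```python
from itertools import islice

# Hypothetical statement: one node per element of the $rows list.
UNWIND_STATEMENT = "UNWIND $rows AS row CREATE (n:Record) SET n = row"


def batches(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_partition(rows, run, batch_size=1000):
    """Send all rows of one partition through a single `run` callable.

    `run(statement, parameters)` is a stand-in for a session opened
    once per partition; it is called once per batch, not once per row.
    Returns the number of batches sent.
    """
    sent = 0
    for chunk in batches(rows, batch_size):
        run(UNWIND_STATEMENT, {"rows": chunk})
        sent += 1
    return sent


# Fake writer that records the size of each batch it receives.
calls = []
n = write_partition(
    ({"id": i} for i in range(2500)),
    lambda stmt, params: calls.append(len(params["rows"])),
)
print(n, calls)
```

In a Spark job, a function like write_partition would typically be called from a per-partition hook, so the connection cost is paid once per partition rather than once per row.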

Related

How to speed up spark sql filter queries if the where clause is already fixed?

In my case, the data resides in Spark tables which are created by calling the createOrReplaceTempView API on a data frame. Once the table is created, several queries are going to run on top of it. Most of the time, the where clause is going to be based on one particular column, whose name is already known. I would like to know if some sort of optimization can be done to improve the performance of the filter query.
I tried exploring the approach of indexing, but it turns out Spark does not support indexing a particular column.
Have you looked at the Spark UI to see where most of your time is being consumed? Is it really the query where most of the time is spent? Usually reading the data from disk is where most of the time goes. Learn to read the Spark UI and find where the real bottleneck is. The SQL tab is a really good place to start figuring things out.
Here are some tricks for running faster in Spark that apply to most jobs:
Can you reframe the problem? Is the data you are using in a format that helps you answer the query? Can you change how it's written to change the problem? (Could you start "pre-chewing" the data before you even query it, so that it is stored in the best format for the issue you want to solve?) Most performance gains come from changing the parameters of the problem to make it easier/faster to solve.
What format are you storing the data in, and what format is the incoming data? Are you using Parquet/ORC? They give a great payoff in disk space/compression and are worth using. They can also enable file-level filtering to speed up reads. Is there transformation work that you can push upstream so the query does less work? Could you write the data with a partition scheme that would aid lookups?
How many files is your input? Can you consolidate files to maximize read throughput? Reading/listing a lot of small files as input slows down the processing of data.
If the temp view query is of similar size every time, you could look at tweaking the partition count so that files are smaller than, but approximately the size of, your HDFS block size (assuming you are using HDFS). With HDFS you have to read an entire block whether you use all the data or not. Try to fit the partition count to some multiple of your executors so that tasks finish together and don't straggle. This is hard to get perfect, but you can make decent strides towards a good ratio.
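The partition-count advice above boils down to a little arithmetic. Here is a rough sketch, assuming you know the total data size, the HDFS block size and the number of executor slots; the function name suggest_partitions and the example figures are made up for illustration.

```python
import math


def suggest_partitions(total_bytes, block_bytes, executor_slots):
    """Suggest a partition count for data of `total_bytes`.

    Start from one partition per block, then round up to a multiple
    of the executor slots so that tasks run in full waves and each
    partition ends up slightly smaller than a block.
    """
    by_size = max(1, math.ceil(total_bytes / block_bytes))
    return math.ceil(by_size / executor_slots) * executor_slots


# Hypothetical figures: 10 GiB of data, 128 MiB blocks, 12 executor slots.
print(suggest_partitions(10 * 1024**3, 128 * 1024**2, 12))  # 84
```

With these numbers, 80 block-sized partitions are rounded up to 84, which is 7 full waves of 12 tasks. Treat the result as a starting point for experiments, not as a tuned value.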
There is no need to optimize filter conditions by hand in Spark. Spark is already smart enough to optimize the conditions in the where clause so that it fetches the minimum number of rows first. The best you can do, I guess, is to persist your temp view if you are querying the same view again and again.

What is your approach for querying Cassandra with Spark (in R or Python)?

I am working with about a TB of data stored in Cassandra and trying to query it using Spark and R (could be Python).
My preference for querying the data would be to abstract the Cassandra table I'm querying as a Spark RDD (using sparklyr and the spark-cassandra-connector with spark-sql) and simply do an inner join on the column of interest (it is a partition key column). The company I'm working with says that this approach is a bad idea, as it will translate into an IN clause in CQL and thus cause a big slow-down.
Instead I'm using their preferred method: write a closure that extracts the data for a single id in the partition key using a JDBC connection, and then apply that closure 200k times, once for each id I'm interested in. I use spark_apply to apply that closure in parallel on each executor. I also set my spark.executor.cores to 1 so I get a lot of parallelization.
I'm having a lot of trouble with this approach and am wondering what the best practice is. Is it true that Spark SQL does not account for the slowdown associated with pulling multiple ids from a partition key column (IN operator)?
A few points here:
Working with Spark SQL is not always the most performant option; the optimizer might not always do as good a job as a job you write yourself.
Check the logs carefully during your work, always check how your high-level queries are translated to CQL queries. In particular, make sure you avoid a full table scan if you can.
If you are joining on the partition key, you should look into leveraging the methods repartitionByCassandraReplica and joinWithCassandraTable. Have a look at the official doc here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md and Tip 4 of this blog post: https://www.instaclustr.com/cassandra-connector-for-spark-5-tips-for-success/
Final note: it's quite common to have two Cassandra data centers when using Spark. The first one serves regular reads/writes, the second one is used for running Spark. It's a separation-of-concerns best practice (at the cost of an additional DC, of course).
Hope it helps!

Use Cases for Spark

We have an application which clients use to track their procurement cycle. We need to build a solution that lets users pull any column from any table in a particular subject area and see all the rows resulting from the join of the tables those columns were pulled from. It needs to be similar to a Salesforce kind of reporting solution. We are looking at HDFS and Spark in Azure HDInsight to support this kind of querying capability. We would like to know if this is a valid use case for Spark. The volume of the joins of all tables can easily touch 500 million rows which will be pulled into the Spark driver memory before being displayed to the user.
Please let me know if this is something that can be done using Spark.
As per my understanding, Spark is mostly used for batch processing. If your use case is directly user-facing, then I am doubtful about using Spark, because there may be better solutions (or alternative architectures). Joining 500 million rows in real time sounds crazy!
"The volume of the joins of all tables can easily touch 500 million rows which will be pulled into the Spark driver memory before being displayed to the user."
This is another thing that puzzled me. Pulling all 500 million rows into the RAM of a single Java process doesn't sound right, for obvious reasons.
Updated
Using Spark on its own to process huge data will not be effective for real-time solutions (like your use case). But Spark will be very effective if you pre-process your data, cache the results using some other system, and prepare views from those results that can be served to your users. This is more or less similar to the Lambda Architecture.
For example: Spark on a YARN cluster to periodically process the data and generate/update the different views, a distributed storage system (preferably a columnar one) to cache the views, and a REST API to serve the views to users.
Late reply to the question, but in case someone else reads this in the future: AWS Redshift does exactly this.

Spark Cassandra connector - Range query on partition key

I'm evaluating spark-cassandra-connector and I'm struggling to get a range query on the partition key to work.
According to the connector's documentation, it seems possible to do server-side filtering on the partition key using equality or the IN operator, but unfortunately my partition key is a timestamp, so I cannot use it.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also I can assure that there is data to be returned since running the query on cqlsh (doing the appropriate conversion using 'token' function) DOES return data.
I'm using spark 1.1.0 with standalone mode. Cassandra is 2.1.2 and connector version is 'b1.1' branch. Cassandra driver is DataStax 'master' branch.
Cassandra cluster is overlaid on spark cluster with 3 servers with replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over that CassandraRDD.
So your code (in Scala) should look something like this:
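import com.datastax.spark.connector._
// Parse the bounds once; "X" accepts the trailing "Z" (UTC)
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val (from, to) = (fmt.parse("2013-01-01T00:00:00.000Z"), fmt.parse("2013-12-31T00:00:00.000Z"))
val cassRDD = sc.cassandraTable("keyspace name", "table name").filter { row => val ts = row.getDate("timestamp"); !ts.before(from) && ts.before(to) }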
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write times will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take full advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small to medium result set. If you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that. You would have to use the direct Cassandra driver, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
However, as I said before, I don't know if it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option, and the least efficient one, is to bring the full data set into your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
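To make the truncated-timestamp idea concrete, here is a small sketch in plain Python that builds the list of day buckets and the corresponding IN query. It assumes the partition key already stores values truncated to the day. The helper names day_buckets and in_query are made up, and the string interpolation is for illustration only; a real client would bind the values as parameters.

```python
from datetime import date, timedelta


def day_buckets(start, end):
    """Return every day from `start` to `end` (inclusive) as ISO strings."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def in_query(table, column, buckets):
    """Build a CQL-style IN query over the given bucket values."""
    values = ", ".join(f"'{b}'" for b in buckets)
    return f"select * from {table} where {column} IN ({values})"


buckets = day_buckets(date(2013, 1, 1), date(2013, 12, 31))
query = in_query("datastore.data", "timestamp", buckets)
print(len(buckets))
print(query[:90] + " ...")
```

A full year already produces hundreds of values in the IN list, which is one reason this option only fits when the time range is short or the buckets are coarse.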
I hope it helps!

Can GraphX be used to store, process, query and update large distributed graphs?

Can GraphX store, process, query and update large distributed graphs?
Does GraphX support these features, or must the data be extracted from a graph database source and subsequently processed by GraphX?
I want to avoid costs related to network communication and data movement.
It can actually be done, albeit with some pretty complicated measures. MLnick from GraphFlow posted on the Titan mailing list here that he managed to use Spark 0.8 on a Titan/Cassandra graph using FaunusVertex and TitanCassandraInputFormat, and that there was a problem with Groovy 1.8.9 and a newer Kryo version.
In his GraphFlow presentation at Spark Summit, he seemed to have made Titan/HBase work over Spark 0.7.x.
Or if you're savvy enough to implement the TitanInputFormat/TitanOutputFormat from Titan 0.5, perhaps you could keep us in the loop. And Titan developers said they do want to support Spark but haven't got the time/resources to do so.
Using Spark on a Titan database is pretty much the only option I can think of regarding your question.
Spark doesn't really have support in itself for long-term storage yet, other than through HDFS (technically it doesn't need to run on HDFS, but it is heavily integrated with it). So you could just store all the edges and vertices in files, but this is clearly not the most efficient way. Another option is to use a graph database like Neo4j.

Resources