Geohash NEO4j Graph with Spark - apache-spark

I am using Neo4j/Cypher; my data is about 200 GB, so I thought of a scalable solution: Spark.
Two solutions are available to build Neo4j graphs with Spark:
1) Cypher for Apache Spark (CAPS)
2) Neo4j-Spark-Connector
I used the first one, CAPS.
The pre-processed CSV contains two "geohash" columns per row: one for the pickup and one for the drop-off.
What I want is to build a connected graph of geohash nodes.
CAPS only allows building a graph by mapping nodes:
if the node with id 0 is to be connected to the node with id 1, you need a relationship with start id 0 and end id 1.
A very simple layout would be:
Nodes: (just id, no properties)
id
0
1
2
Relationships: (just the mandatory fields)
id | start | end
0 | 0 | 1
1 | 0 | 2
Based on that, I loaded my CSV into a Spark DataFrame and then split it into:
Pickup DataFrame
Drop-off DataFrame
Trip DataFrame
I generated an id for the first two DataFrames and created a mapping by adding columns to the third one.
The result was a pair of nodes (pickup)-[:TRIP]->(drop-off) generated for each mapped row.
The problems I have are:
1) the geohash of a pickup or a drop-off can be repeated across different trips => I want to merge the creation of those nodes
2) the drop-off of one trip can be the pickup of another trip, so those two nodes need to be merged into one
I tried to modify the graph, but I was surprised to find that Spark graphs are immutable => you cannot apply Cypher queries to change them.
So, is there a way to build a connected, directed and merged geohash graph with Spark?
This is my code:
package org.opencypher.spark.examples
import org.opencypher.spark.api.CAPSSession
import org.opencypher.spark.api.io.{CAPSNodeTable, CAPSRelationshipTable}
import org.opencypher.spark.util.ConsoleApp
import java.net.URI
import org.opencypher.okapi.api.io.conversion.NodeMapping
import org.opencypher.okapi.api.io.conversion.RelationshipMapping
import org.opencypher.spark.api.io.neo4j.Neo4jPropertyGraphDataSource
import org.opencypher.spark.api.io.neo4j.Neo4jConfig
import org.apache.spark.sql.functions._
import org.opencypher.okapi.api.graph.GraphName
object GreenCabsInputDataFrames extends ConsoleApp {

  //1) Create CAPS session and retrieve Spark session
  implicit val session: CAPSSession = CAPSSession.local()
  val spark = session.sparkSession

  //2) Load the csv into a dataframe
  val df = spark.read.csv("C:\\Users\\Ahmed\\Desktop\\green_result\\green_data.csv").select("_c0","_c3","_c4","_c9","_c10","_c11","_c12","_c13","_c14","_c15","_c16","_c17","_c18","_c19","_c20","_c21","_c22","_c23")

  //3) cache the dataframe
  val df1 = df.cache()

  //4) subset the dataframe
  val pickup_dataframe = df1.select("_c0","_c3","_c4","_c9","_c10","_c11","_c12","_c13","_c14","_c15","_c16","_c17","_c18","_c19","_c20","_c21")
  val dropoff_dataframe = df1.select("_c22","_c23")

  //5) uncache the dataframe
  df1.unpersist()

  //6) add id columns to the pickup, dropoff and trip dataframes
  val pickup_dataframe2 = pickup_dataframe.withColumn("id1", monotonically_increasing_id + pickup_dataframe.count()).select("id1", pickup_dataframe.columns:_*)
  val dropoff_dataframe2 = dropoff_dataframe.withColumn("id2", monotonically_increasing_id + pickup_dataframe2.count() + pickup_dataframe.count()).select("id2", dropoff_dataframe.columns:_*)

  //7) create the relationship "trip" as a dataframe
  val trip_data_dataframe2 = pickup_dataframe2.withColumn("idj", monotonically_increasing_id).join(dropoff_dataframe2.withColumn("idj", monotonically_increasing_id), "idj")

  //drop unnecessary columns
  val pickup_dataframe3 = pickup_dataframe2.drop("_c0","_c3","_c4","_c9","_c10","_c11","_c12","_c13","_c14","_c15","_c16","_c17","_c18","_c19")
  val trip_data_dataframe3 = trip_data_dataframe2.drop("_c20","_c21","_c22","_c23")

  //8) reorder the columns of the trip dataframe
  val trip_data_dataframe4 = trip_data_dataframe3.select("idj", "id1", "id2", "_c0", "_c10", "_c11", "_c12", "_c13", "_c14", "_c15", "_c16", "_c17", "_c18", "_c19", "_c3", "_c4", "_c9")

  //8.1) display the dataframes in the console
  pickup_dataframe3.show()
  dropoff_dataframe2.show()
  trip_data_dataframe4.show()

  //9) map the columns
  val Pickup_mapping = NodeMapping.withSourceIdKey("id1").withImpliedLabel("HashNode").withPropertyKeys("_c21","_c20")
  val Dropoff_mapping = NodeMapping.withSourceIdKey("id2").withImpliedLabel("HashNode").withPropertyKeys("_c23","_c22")
  val Trip_mapping = RelationshipMapping.withSourceIdKey("idj").withSourceStartNodeKey("id1").withSourceEndNodeKey("id2").withRelType("TRIP").withPropertyKeys("_c0","_c3","_c4","_c9","_c10","_c11","_c12","_c13","_c14","_c15","_c16","_c17","_c18","_c19")

  //10) create tables
  val Pickup_Table2 = CAPSNodeTable(Pickup_mapping, pickup_dataframe3)
  val Dropoff_Table = CAPSNodeTable(Dropoff_mapping, dropoff_dataframe2)
  val Trip_Table = CAPSRelationshipTable(Trip_mapping, trip_data_dataframe4)

  //11) create the graph
  val graph = session.readFrom(Pickup_Table2, Dropoff_Table, Trip_Table)

  //12) connect to Neo4j
  val boltWriteURI: URI = new URI("bolt://localhost:7687")
  val neo4jWriteConfig: Neo4jConfig = new Neo4jConfig(boltWriteURI, "neo4j", Some("wakarimashta"), true)
  val neo4jResult: Neo4jPropertyGraphDataSource = new Neo4jPropertyGraphDataSource(neo4jWriteConfig)(session)

  //13) store the graph in Neo4j
  val neo4jResultName: GraphName = new GraphName("neo4jgraphs151")
  neo4jResult.store(neo4jResultName, graph)
}

You are right: CAPS, just like Spark, is an immutable system. However, with CAPS you can create new graphs from within a Cypher statement: https://github.com/opencypher/cypher-for-apache-spark/blob/master/spark-cypher-examples/src/main/scala/org/opencypher/spark/examples/MultipleGraphExample.scala
At the moment the CONSTRUCT clause has limited support for MERGE. It only allows adding already bound nodes to the newly constructed graph, and each bound node is added exactly once, independent of how many times it occurs in the binding table.
Consider the following query:
MATCH (n), (m)
CONSTRUCT
CREATE (n), (m)
RETURN GRAPH
The resulting graph will have as many nodes as the input graph.
To solve your problem you could take one of two approaches: a) deduplicate the node data before creating the graph, or b) use Cypher queries. Approach b) would look like the following (a sketch of approach a) is shown after the Cypher example):
// assuming that graph is the graph created at step 11
session.catalog.store("inputGraph", graph)
session.cypher("""
CATALOG CREATE GRAPH temp {
FROM GRAPH session.inputGraph
MATCH (n)
WITH DISTINCT n.a AS a, n.b as b
CONSTRUCT
CREATE (:HashNode {a: a, b as b})
RETURN GRAPH
}
""".stripMargin)
val mergeGraph = session.cypher("""
  FROM GRAPH inputGraph
  MATCH (from)-[via]->(to)
  FROM GRAPH temp
  MATCH (n), (m)
  WHERE from.a = n.a AND from.b = n.b AND to.a = m.a AND to.b = m.b
  CONSTRUCT
    CREATE (n)-[COPY OF via]->(m)
  RETURN GRAPH
""".stripMargin).graph
Note: use the same property names for both the pickup and the drop-off nodes (e.g. a and b).
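For approach a), here is a minimal sketch under the assumptions taken from the question's own code: df1 is the cached input DataFrame, (_c20, _c21) hold the pickup geohash and (_c22, _c23) the drop-off geohash. The idea is to build one deduplicated node table of all geohashes, whether they occur as pickup or drop-off, and then resolve each trip's start and end ids by joining back on the geohash columns:
// Sketch only: deduplicate geohash nodes before building the CAPS tables.
// Column names _c20/_c21 (pickup geohash) and _c22/_c23 (drop-off geohash)
// are taken from the question; adapt them to your real schema.
import org.apache.spark.sql.functions._

// one row per distinct geohash, no matter whether it occurred as pickup or drop-off
val nodes = df1.select(col("_c20").as("lat"), col("_c21").as("lng"))
  .union(df1.select(col("_c22").as("lat"), col("_c23").as("lng")))
  .distinct()
  .withColumn("id", monotonically_increasing_id())

// resolve start and end node ids for every trip by joining back on the geohash columns
val rels = df1
  .join(nodes.select(col("id").as("start"), col("lat").as("_c20"), col("lng").as("_c21")), Seq("_c20", "_c21"))
  .join(nodes.select(col("id").as("end"), col("lat").as("_c22"), col("lng").as("_c23")), Seq("_c22", "_c23"))
  .withColumn("rel_id", monotonically_increasing_id())
  .select("rel_id", "start", "end")

val nodeMapping = NodeMapping.withSourceIdKey("id").withImpliedLabel("HashNode").withPropertyKeys("lat", "lng")
val relMapping = RelationshipMapping.withSourceIdKey("rel_id").withSourceStartNodeKey("start").withSourceEndNodeKey("end").withRelType("TRIP")

val dedupedGraph = session.readFrom(CAPSNodeTable(nodeMapping, nodes), CAPSRelationshipTable(relMapping, rels))
Because every distinct geohash becomes exactly one node, a location that is a drop-off for one trip and a pickup for another is automatically represented by the same node.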

Related

Spark joins performance issue

I'm trying to merge historical and incremental data. As part of the incremental data, I'm getting deletes. Below is the case.
historical data - 100 records ( 20 columns, id is the key column)
incremental data - 10 records ( 20 columns, id is the key column)
Out of the 10 records in incremental data, only 5 will match with historical data.
Now I want 100 records in the final dataframe, of which 95 records belong to the historical data and 5 records belong to the incremental data (wherever the id column matches).
Update timestamp field is available in both historical and incremental data.
Below is the approach I tried.
DF1 - Historical Data
DF2 - Incremental Delete Dataset
DF3 = DF1 LEFTANTIJOIN DF2
DF4 = DF2 INNERJOIN DF1
DF5 = DF3 UNION DF4
However, I observed that this has a lot of performance issues, as I'm running these joins on billions of records. Is there a better way to do this?
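For reference, a minimal DataFrame sketch of the three steps described above; the DataFrame names are placeholders, and the left_anti/left_semi join types assume Spark 2.x:
// historical and incremental are assumed to share the same 20-column schema, keyed by id
val unchanged = historical.join(incremental, Seq("id"), "left_anti")   // the 95 rows without an incoming update
val updated   = incremental.join(historical, Seq("id"), "left_semi")   // the 5 incremental rows whose id matches
val result    = unchanged.union(updated)                               // both sides keep the same column order
Whether this performs better than the cogroup or full-outer-join variants below still depends on partitioning and data skew, as the answers point out.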
You can use the cogroup operator combined with a user-defined function to construct the different variations of the join.
Suppose we have these two RDDs as an example :
visits = sc.parallelize([("h", "1.2.3.4"), ("a", "3.4.5.6"), ("h","1.3.3.1")] )
pageNames = sc.parallelize([("h", "Home"), ("a", "About"), ("o", "Other")])
cg = visits.cogroup(pageNames).map(lambda x :(x[0], ( list(x[1][0]), list(x[1][1]))))
You can implement an inner join as such :
innerjoin = cg.flatMap(lambda x: J(x))
Where J is defined as such :
def J(x):
    j = []
    k = x[0]
    if x[1][0] != [] and x[1][1] != []:
        for l in x[1][0]:
            for r in x[1][1]:
                j.append((k, (l, r)))
    return j
For a right outer join, for example, you just need to change the J function to a roJ function defined as such:
def roJ(x):
    j = []
    k = x[0]
    if x[1][0] != [] and x[1][1] != []:
        for l in x[1][0]:
            for r in x[1][1]:
                j.append((k, (l, r)))
    elif x[1][1] != []:
        for r in x[1][1]:
            j.append((k, (None, r)))
    return j
And call it like so :
rightouterjoin = cg.flatMap(lambda x: roJ(x))
And so on for the other types of join you'd wish to implement.
Performance issues are not just related to the size of your data. They depend on many other parameters, such as the keys you use for partitioning, your partition file sizes and the cluster configuration you are running the job on. I would recommend going through the official documentation on tuning Spark jobs and making the necessary changes.
https://spark.apache.org/docs/latest/tuning.html
Below is the approach I ended up using.
historical_data.as("a").join(
incr_data.as("b"),
$"a.id" === $"b.id", "full")
.select(historical_data.columns.map(f => expr(s"""case when a.id=b.id then b.${f} else a.${f} end as $f""")): _*)

spark graph frames aggregate messages multiple iterations

Spark graphFrames documentation has a nice example how to apply aggregate messages function.
To me, it seems to only aggregate over the direct (first-level) neighbours of each vertex and not to iterate deeper into the graph the way GraphX's Pregel operator does.
How can I accomplish such iterations in GraphFrames as well, using aggregateMessages, similar to how iteration is handled in GraphX here: https://github.com/sparkling-graph/sparkling-graph/blob/master/operators/src/main/scala/ml/sparkling/graph/operators/measures/vertex/eigenvector/EigenvectorCentrality.scala ?
import org.graphframes.examples
import org.graphframes.lib.AggregateMessages
val g: GraphFrame = examples.Graphs.friends // get example graph
// We will use AggregateMessages utilities later, so name it "AM" for short.
val AM = AggregateMessages
// For each user, sum the ages of the adjacent users.
val msgToSrc = AM.dst("age")
val msgToDst = AM.src("age")
val agg = g.aggregateMessages
.sendToSrc(msgToSrc) // send destination user's age to source
.sendToDst(msgToDst) // send source user's age to destination
.agg(sum(AM.msg).as("summedAges")) // sum up ages, stored in AM.msg column
agg.show()
http://graphframes.github.io/user-guide.html#message-passing-via-aggregatemessages
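The documented call above performs a single round of message passing. One way to iterate it is sketched below, under the assumption that feeding the aggregate back in as the new vertex attribute is the desired update rule; AM.getCachedDataFrame is the lineage-truncation helper used in the GraphFrames belief-propagation example, and a plain cache() would also do:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum
import org.graphframes.GraphFrame
import org.graphframes.examples
import org.graphframes.lib.AggregateMessages

val AM = AggregateMessages
var g: GraphFrame = examples.Graphs.friends   // start from the example graph

val numIter = 5
for (_ <- 1 to numIter) {
  // one round of message passing: every vertex receives its neighbours' current "age"
  val agg: DataFrame = g.aggregateMessages
    .sendToSrc(AM.dst("age"))
    .sendToDst(AM.src("age"))
    .agg(sum(AM.msg).as("age"))

  // feed the aggregate back in as the new vertex attribute and rebuild the graph;
  // vertices that received no message end up with a null "age" after the left join
  val newVertices = AM.getCachedDataFrame(
    g.vertices.drop("age").join(agg, Seq("id"), "left"))
  g = GraphFrame(newVertices, g.edges)
}
g.vertices.show()
A real measure such as eigenvector centrality would additionally normalise the values in each round, but the loop structure stays the same.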

mimic ' group by' and window function logic in spark

I have a large csv file with the columns id,time,location. I made it an RDD, and want to compute some aggregated metrics of the trips, when a trip is defined as a time-contiguous set of records of the same id, separated by at least 1 hour on either side. I am new to spark. (related)
To do that, I think to create an RDD with elements of the form (trip_id,(time, location)) and use reduceByKey to calculate all the needed metrics.
To calculate the trip_id, I try to implement the SQL approach of the linked question: build an indicator field for whether the record is the start of a trip, and take a cumulative sum of this indicator field. This does not sound like a distributed approach: is there a better one?
Furthermore, how can I add this indicator field? It should be 1 if the time difference to the previous record of the same id is above an hour, and 0 otherwise. At first I thought of doing a groupBy on id and then sorting within each group, but the values will be inside an Array and thus not amenable to sortByKey, and there is no lead function as in SQL to get the previous value.
Example of the suggested aforementioned approach: for the RDD
(1,9:30,50)
(1,9:37,70)
(1,9:50,80)
(2,19:30,10)
(2,20:50,20)
We want to turn it first into the RDD with the time differences,
(1,9:30,50,inf)
(1,9:37,70,00:07:00)
(1,9:50,80,00:13:00)
(2,19:30,10,inf)
(2,20:50,20,01:20:00)
(The value of the earliest record is, say, scala's PositiveInfinity constant)
and turn this last field into an indicator field of whether it is above 1 hour, which indicates whether we start a trip,
(1,9:30,50,1)
(1,9:37,70,0)
(1,9:50,80,0)
(2,19:30,10,1)
(2,20:50,20,1)
and then turn it into a trip_id
(1,9:30,50,1)
(1,9:37,70,1)
(1,9:50,80,1)
(2,19:30,10,2)
(2,20:50,20,3)
and then use this trip_id as the key to aggregations.
The preprocessing was simply to load the file and delete the header,
val rawdata=sc.textFile("some_path")
def isHeader(line:String)=line.contains("id")
val data=rawdata.filter(!isHeader(_))
Edit
While trying to implement with spark SQL, I ran into an error regarding the time difference:
val lags = sqlContext.sql("""
  select time - lag(time) over (partition by id order by time) as diff_time from data
""")
since Spark doesn't know how to take the difference between two timestamps. I'm trying to check whether this difference is above 1 hour.
It also doesn't recognize the function getTime, which I found online in an answer; the following returns an error too (Couldn't find window function time.getTime):
val lags = sqlContext.sql("""
  select time.getTime() - (lag(time)).getTime() over (partition by id order by time)
  from data
""")
Even though making a similar lag difference for a numeric attribute works:
val lag_numeric = sqlContext.sql("""
  select longitude - lag(longitude) over (partition by id order by time)
  from data
""") // works
Spark didn't recognize the function Hours.hoursBetween either. I'm using Spark 1.4.0.
I also tried to define an appropriate user-defined function, but UDFs are oddly not recognized inside queries:
val timestamp_diff: ((Timestamp, Timestamp) => Double) =
  (d1: Timestamp, d2: Timestamp) => (d1.getTime() - d2.getTime()).toDouble
val lags = sqlContext.sql("""
  select timestamp_diff(time, lag(time)) over (partition by id order by time) from data
""")
So, how can spark test whether the difference between timestamps is above an hour?
Full code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.HiveContext // for window functions
import java.util.Date
import java.sql.Timestamp
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext._
import sqlContext.implicits._
case class Record(id: Int, time:Timestamp, longitude: Double, latitude: Double)
val raw_data=sc.textFile("file:///home/sygale/merged_table.csv")
val data_records=
raw_data.map(line=>
Record( line.split(',')(0).toInt,
Timestamp.valueOf(line.split(',')(1)),
line.split(',')(2).toDouble,
line.split(',')(3).toDouble
))
val data=data_records.toDF()
data.registerTempTable("data")
val lags = sqlContext.sql("""
  select time - lag(time) over (partition by id order by time) as diff_time from data
""")

Accessing a global lookup Apache Spark

I have a list of csv files each with a bunch of category names as header columns. Each row is a list of users with a boolean value (0, 1) whether they are part of that category or not. Each of the csv files does not have the same set of header categories.
I want to create a composite csv across all the files which has the following output:
Header is a union of all the headers
Each row is a unique user with a boolean value corresponding to the category column
The way I wanted to tackle this is to create a tuple of a user_id and a unique category_id for each cell with a '1'. Then reduce all these columns for each user to get the final output.
How do I create the tuple to begin with? Can I have a global lookup for all the categories?
Example Data:
File 1
user_id,cat1,cat2,cat3
21321,,,1
21322,1,1,1
21323,1,,
File 2
user_id,cat4,cat5
21321,1,
21323,,1
Output
user_id,cat1,cat2,cat3,cat4,cat5
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
The title of the question is probably misleading in the sense that it conveys a certain implementation choice: there's no need for a global lookup in order to solve the problem at hand.
In big data, there's a basic principle guiding most solutions: divide and conquer. In this case, the input CSV files can be divided into tuples of (user, category).
Any number of CSV files containing an arbitrary number of categories can be transformed into this simple format. The final CSV results from the union of the output of the previous step, the extraction of the total number of categories present, and some data transformation to get it into the desired format.
In code this algorithm would look like this:
import org.apache.spark.SparkContext._
val file1 = """user_id,cat1,cat2,cat3|21321,,,1|21322,1,1,1|21323,1,,""".split("\\|")
val file2 = """user_id,cat4,cat5|21321,1,|21323,,1""".split("\\|")
val csv1 = sparkContext.parallelize(file1)
val csv2 = sparkContext.parallelize(file2)
import org.apache.spark.rdd.RDD
def toTuples(csv: RDD[String]): RDD[(String, String)] = {
  val headerLine = csv.first
  val header = headerLine.split(",")
  val data = csv.filter(_ != headerLine).map(line => line.split(","))
  data.flatMap { elem =>
    val merged = elem.zip(header)
    val id = elem.head
    merged.tail.collect { case (v, cat) if v == "1" => (id, cat) }
  }
}
val data1 = toTuples(csv1)
val data2 = toTuples(csv2)
val union = data1.union(data2)
val categories = union.map{case (id, cat) => cat}.distinct.collect.sorted //sorted category names
val categoriesByUser = union.groupByKey.mapValues(v=>v.toSet)
val numericCategoriesByUser = categoriesByUser.mapValues{catSet => categories.map(cat=> if (catSet(cat)) "1" else "")}
val asCsv = numericCategoriesByUser.collect.map{case (id, cats)=> id + "," + cats.mkString(",")}
Results in:
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
(Generating the header is simple and left as an exercise for the reader)
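For completeness, a one-line sketch of that exercise, reusing the sorted categories array from above:
val header = "user_id," + categories.mkString(",")
val csvWithHeader = header +: asCsv   // Array[String] ready to be written out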
You don't need to do this as a two-step process if all you need are the resulting values.
A possible design:
1/ Parse your csv. You don't mention whether your data is on a distributed FS, so I'll assume it is not.
2/ Enter your (K,V) pairs into a mutable, parallelized (to take advantage of Spark) map.
pseudo-code:
val directory = ..
mutable.ParHashMap map = new mutable.ParHashMap()
while (files[i] != null)
{
    val file = directory.spark.textFile("/myfile...")
    val cols = file.map(_.split(","))
    map.put(col[0], col[i++])
}
and then you can access your (K/V) tuples by way of an iterator on the map.

non-ordinal access to rows returned by Spark SQL query

In the Spark documentation, it is stated that the result of a Spark SQL query is a SchemaRDD. Each row of this SchemaRDD can in turn be accessed by ordinal. I am wondering if there is any way to access the columns using the field names of the case class on top of which the SQL query was built. I appreciate the fact that the case class is not associated with the result, especially if I have selected individual columns and/or aliased them: however, some way to access fields by name rather than ordinal would be convenient.
A simple way is to use the "language-integrated" select method on the resulting SchemaRDD to select the column(s) you want -- this still gives you a SchemaRDD, and if you select more than one column then you will still need to use ordinals, but you can always select one column at a time. Example:
// setup and some data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Score(name: String, value: Int)
val scores =
sc.textFile("data.txt").map(_.split(",")).map(s => Score(s(0),s(1).trim.toInt))
scores.registerAsTable("scores")
// initial query
val original =
sqlContext.sql("Select value AS myVal, name FROM scores WHERE name = 'foo'")
// now a simple "language-integrated" query -- no registration required
val secondary = original.select('myVal)
secondary.collect().foreach(println)
Now secondary is a SchemaRDD with just one column, and it works despite the alias in the original query.
Edit: but note that you can register the resulting SchemaRDD and query it with straight SQL syntax without needing another case class.
original.registerAsTable("original")
val secondary = sqlContext.sql("select myVal from original")
secondary.collect().foreach(println)
Second edit: when processing an RDD one row at a time, it's possible to access the columns by name by using pattern matching syntax:
val secondary = original.map {case Row(myVal: Int, _) => myVal}
although this could get cumbersome if the right-hand side of the '=>' requires access to a lot of the columns, as each of them would need to be matched on the left. (This is from a very useful comment in the source code for the Row companion object.)
