The Spark GraphFrames documentation has a nice example of how to apply the aggregateMessages function.
To me, it seems to aggregate only over each vertex's direct friends/connections and not to iterate deeper into the graph the way GraphX's Pregel operator does.
How can I accomplish such iterations in GraphFrames using aggregateMessages, similar to how iteration is handled in GraphX?
import org.apache.spark.sql.functions.sum
import org.graphframes.{examples, GraphFrame}
import org.graphframes.lib.AggregateMessages
val g: GraphFrame = examples.Graphs.friends // get example graph
// We will use AggregateMessages utilities later, so name it "AM" for short.
val AM = AggregateMessages
// For each user, sum the ages of the adjacent users.
val msgToSrc = AM.dst("age")
val msgToDst = AM.src("age")
val agg = g.aggregateMessages
  .sendToSrc(msgToSrc) // send destination user's age to source
  .sendToDst(msgToDst) // send source user's age to destination
  .agg(sum(AM.msg).as("summedAges")) // sum up ages, stored in AM.msg column


Avoid lazy evaluation of code in spark without cache

How can I avoid lazy evaluation in Spark? I have a DataFrame that needs to be populated all at once, because I need to split the data based on a random number generated for each row: if the random number is >= 0.5, the row goes to dataA, and if it is < 0.5, it goes to dataB.
val randomNumberDF = df.withColumn("num", Math.random())
val dataA = randomNumberDF.filter(col("num") >= 0.5)
val dataB = randomNumberDF.filter(col("num") < 0.5)
Because Spark evaluates lazily, the "num" column is recomputed for each filter, so the split between dataA and dataB is unreliable (sometimes the same row is present in both dataA and dataB).
How can I avoid this recomputation of the "num" column? I have tried using cache, which worked, but since my data is going to be big, I am ruling out that solution.
I have also tried running other actions on randomNumberDF, but these didn't solve the problem.
Please suggest something other than cache/persist or writing the data to HDFS and reading it back.
References I have already checked:
How to force spark to avoid Dataset re-computation?
How to force Spark to only execute a transformation once?
How to force Spark to evaluate DataFrame operations inline
If all you're looking for is a way to ensure that the same values are in randomNumberDF.num, then you can generate random numbers with a seed (using org.apache.spark.sql.functions.rand()):
The example below uses 112 as the seed value:
val randomNumberDF = df.withColumn("num", rand(112))
val dataA = randomNumberDF.filter(col("num") >= 0.5)
val dataB = randomNumberDF.filter(col("num") < 0.5)
That will ensure that the values in num are the same across the multiple evaluations of randomNumberDF.
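To illustrate why a fixed seed helps, here is a minimal sketch in plain Python (not Spark code). The helper names num_column, data_a and data_b are made up for this illustration, and Python's random.Random merely stands in for rand(seed). Each "filter" recomputes the random column from scratch, the way a lazily evaluated plan would, and the sketch assumes the rows are visited in the same order on every evaluation.

```python
import random

def num_column(rows, seed):
    """Recompute the "num" column from scratch, as a lazy plan would on each evaluation."""
    rng = random.Random(seed)  # seeded generator: yields the same sequence every time
    return [rng.random() for _ in rows]

def data_a(rows, seed):
    # Rows whose random number is >= 0.5
    return [row for row, num in zip(rows, num_column(rows, seed)) if num >= 0.5]

def data_b(rows, seed):
    # Rows whose random number is < 0.5 (the column is recomputed here)
    return [row for row, num in zip(rows, num_column(rows, seed)) if num < 0.5]

rows = list(range(1000))
a = data_a(rows, seed=112)
b = data_b(rows, seed=112)
```

Because both evaluations see the same numbers, every row lands in exactly one of the two lists, even though the column was computed twice.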
Besides using org.apache.spark.sql.functions.rand with a given seed, you could use eager checkpointing:
This will materialize the DataFrame to disk.

How do I write a standalone application in Spark to find the 20 most frequent mentions in a text file filled with extracted tweets

I'm creating a standalone application in Spark that needs to read a text file filled with tweets. Every mention starts with the symbol "#". The objective is to go through this file and find the 20 most frequent mentions. Punctuation should be stripped from all mentions, and if a tweet contains the same mention more than once, it should be counted only once. There can be multiple unique mentions in a single tweet, and there are many tweets in the file.
I am new to Scala and Apache Spark. I was thinking of using the filter function and placing the results in a list, then converting the list into a set so that the items are unique. But the syntax, the regular expressions, and reading the file are problems I face.
def main(args: Array[String]){
val locationTweetFile = args(0)
val spark = SparkSession.builder.appName("does this matter?").getOrCreate()
The tweet file is huge; is the command below safe?
val tweetsFile =
val mentionsExp = """([#])+""".r
If the tweet had said
"Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER."
Then the output should be something like, ((honda, 1),(customer,1))
Since there are multiple tweets, another tweet can say,
"#HoNdA I am the same #cuSTomER #STACKEXCHANGE."
Then the Final output will be something like
Let's go step by step.
1) appName("does this matter?") doesn't matter in your case.
2) Reading the file is safe because it is lazy: the file won't be loaded into memory all at once.
Now, about the implementation:
Spark is about transforming data, so you need to think about how to transform raw tweets into a list of unique mentions per tweet. Next, you transform the list of mentions into a Map[Mention, Int], where the Int is the total count of that mention in the RDD.
Transformation is usually done via the map(f: A => B) method, where f is a function mapping a value of type A to a value of type B.
def tweetToMentions(tweet: String): Seq[String] =
  tweet.split(" ").collect {
    case s if s.startsWith("#") => s.replaceAll("[,.;!?]", "").toLowerCase
  }.distinct.toSeq // keep each mention only once per tweet
val mentions = tweetToMentions("Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER.")
// mentions: Seq("#honda", "#customer")
The next step is to apply this function to each element in our RDD:
val mentions = tweetsFile.flatMap(tweetToMentions)
Note that we use flatMap instead of map because tweetToMentions returns a Seq[String] and we want our RDD to contain only mentions; flatMap flattens the result.
To count the occurrences of each mention in the RDD, we need to apply some magic:
First, we map each mention to a pair (mention, 1).
Then we use reduceByKey, which counts how many times each mention occurs in our RDD. Lastly, we order the mentions by their counts and retrieve the result.
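val result = mentions
  .map(mention => (mention, 1))
  .reduceByKey((a, b) => a + b)
  .sortBy(_._2, ascending = false) // most frequent mentions first
  .take(20) // retrieve the 20 most frequent mentions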
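The same pipeline can be sketched without Spark to check the logic on a couple of tweets. This is plain Python using only the standard library; the names tweet_to_mentions and top_mentions are hypothetical, and the in-memory Counter stands in for the map/reduceByKey/sort steps described above.

```python
from collections import Counter

PUNCTUATION = ",.;!?"

def tweet_to_mentions(tweet):
    """Return the set of unique, lower-cased mentions in one tweet."""
    strip_table = str.maketrans("", "", PUNCTUATION)
    return {
        word.translate(strip_table).lower()
        for word in tweet.split()
        if word.startswith("#")
    }

def top_mentions(tweets, n=20):
    """Count each mention at most once per tweet and return the n most frequent."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet_to_mentions(tweet))  # a set, so duplicates within a tweet count once
    return counts.most_common(n)

tweets = [
    "Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER.",
    "#HoNdA I am the same #cuSTomER #STACKEXCHANGE.",
]
result = top_mentions(tweets)
```

Like the Scala version, this keeps the leading "#" in each mention; drop it in tweet_to_mentions if the bare word is wanted.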

Geohash Neo4j Graph with Spark

I am using Neo4j/Cypher. My data is about 200 GB, so I thought of a scalable solution: Spark.
Two solutions are available for building Neo4j graphs with Spark:
1) Cypher for Apache Spark (CAPS)
2) Neo4j-Spark-Connector
I used the first one, CAPS.
The pre-processed CSV has two geohash values per row: one for the pickup and another for the drop-off.
What I want is to build a connected graph of geohash nodes.
CAPS only allows building a graph by mapping nodes:
If the node with id 0 is to be connected to the node with id 1, you need a relationship with start id 0 and end id 1.
A very simple layout would be:
Nodes: (just id, no properties)
Relationships: (just the mandatory fields)
id | start | end
0 | 0 | 1
1 | 0 | 2
Based on that, I loaded my CSV into a Spark DataFrame and then split it into:
Pickup dataframe
Drop-off dataframe
Trip dataframe
I generated an id for the first two dataframes and created a mapping by adding columns to the third dataframe.
This was the result:
A pair of nodes (pickup-[Trip]->drop-off) generated for each mapped row.
The problems I have are:
1) The geohash of a pickup or a drop-off can be repeated across different trips, so I want to merge the creation of those nodes.
2) The drop-off of one trip can be the pickup of another trip, so I need to merge these two nodes into one.
I tried to change the graph, but I was surprised to find that Spark graphs are immutable, so you can't apply Cypher queries to change them.
So is there a way to build a connected, directed, and merged geohash graph with Spark?
This is my code :
package org.opencypher.spark.examples
import org.opencypher.spark.api.CAPSSession
import{CAPSNodeTable, CAPSRelationshipTable}
import org.opencypher.spark.util.ConsoleApp
import org.apache.spark.sql.functions._
import org.opencypher.okapi.api.graph.GraphName
object GreenCabsInputDataFrames extends ConsoleApp {
//1) Create CAPS session and retrieve Spark session
implicit val session: CAPSSession = CAPSSession.local()
val spark = session.sparkSession
//2) Load a csv into dataframe
//3) cache the dataframe
val df1=df.cache()
//4) subset the dataframe
//5) uncache the dataframe
//6) add id columns to pickup , dropoff and trip dataframes
val pickup_dataframe2= pickup_dataframe.withColumn("id1",monotonically_increasing_id+pickup_dataframe.count()).select("id1",pickup_dataframe.columns:_*)
val dropoff_dataframe2= dropoff_dataframe.withColumn("id2",monotonically_increasing_id+pickup_dataframe2.count()+pickup_dataframe.count()).select("id2",dropoff_dataframe.columns:_*)
//7) create the relationship "trip" is dataframe
val trip_data_dataframe2=pickup_dataframe2.withColumn("idj",monotonically_increasing_id).join(dropoff_dataframe2.withColumn("idj",monotonically_increasing_id),"idj")
//drop unnecessary columns
val pickup_dataframe3=pickup_dataframe2.drop("_c0","_c3","_c4","_c9","_c10","_c11","_c12","_c13","_c14","_c15","_c16","_c17","_c18","_c19")
val trip_data_dataframe3=trip_data_dataframe2.drop("_c20","_c21","_c22","_c23")
//8) reordering the columns of trip dataframe
val trip_data_dataframe4 = trip_data_dataframe3.select("idj", "id1", "id2", "_c0", "_c10", "_c11", "_c12", "_c13", "_c14", "_c15", "_c16", "_c17", "_c18", "_c19", "_c3", "_c4","_c9")
//8.1)displaying dataframes in console
//9) mapping the columns
val Pickup_mapping=NodeMapping.withSourceIdKey("id1").withImpliedLabel("HashNode").withPropertyKeys("_c21","_c20")
val Dropoff_mapping=NodeMapping.withSourceIdKey("id2").withImpliedLabel("HashNode").withPropertyKeys("_c23","_c22")
val Trip_mapping=RelationshipMapping.withSourceIdKey("idj").withSourceStartNodeKey("id1").withSourceEndNodeKey("id2").withRelType("TRIP").withPropertyKeys("_c0","_c3","_c4","_c9","_c10","_c11","_c12","_c13","_c14","_c15","_c16","_c17","_c18","_c19")
//10) create tables
val Pickup_Table2 = CAPSNodeTable(Pickup_mapping, pickup_dataframe3)
val Dropoff_Table = CAPSNodeTable(Dropoff_mapping, dropoff_dataframe2)
val Trip_Table = CAPSRelationshipTable(Trip_mapping,trip_data_dataframe4)
//11) Create graph
val graph = session.readFrom(Pickup_Table2,Dropoff_Table, Trip_Table)
//12) Connect to Neo4j
val boltWriteURI: URI = new URI("bolt://localhost:7687")
val neo4jWriteConfig: Neo4jConfig = new Neo4jConfig(boltWriteURI, "neo4j", Some("wakarimashta"), true)
val neo4jResult: Neo4jPropertyGraphDataSource = new Neo4jPropertyGraphDataSource(neo4jWriteConfig)(session)
//13) Store graph in neo4j
val neo4jResultName: GraphName = new GraphName("neo4jgraphs151")
neo4jResult.store(neo4jResultName, graph)
You are right: CAPS is, just like Spark, an immutable system. However, with CAPS you can create new graphs from within a Cypher statement:
At the moment, the CONSTRUCT clause has limited support for MERGE. It only allows adding already bound nodes to the newly created graph, and each bound node is added exactly once, regardless of how many times it occurs in the binding table.
Consider the following query:
MATCH (n), (m)
CREATE (n), (m)
The resulting graph will have as many nodes as the input graph.
To solve your problem, you could use one of two approaches: a) deduplicate before creating the graph, or b) use Cypher queries. Approach b) would look like:
// assuming that graph is the graph created at step 11
session.catalog.store("inputGraph", graph)
FROM GRAPH session.inputGraph
WITH DISTINCT n.a AS a, n.b as b
CREATE (:HashNode {a: a, b: b})
val mergeGraph = session.cypher("""
FROM GRAPH inputGraph
MATCH (from)-[via]->(to)
MATCH (n), (m)
WHERE from.a = n.a AND from.b = n.b AND to.a = m.a AND to.b = m.b
CREATE (n)-[COPY OF via]->(m)
Note: Use the property names for both pickup and dropoff nodes (e.g. a and b)
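Approach a), deduplicating before the graph is created, can be sketched as follows. This is plain Python rather than Spark or CAPS code: build_geohash_graph is a hypothetical helper, and the geohash strings are invented sample values. The idea is to give every distinct geohash exactly one node id, whether it appears as a pickup or as a drop-off, and to express each trip as a relationship between those ids.

```python
def build_geohash_graph(trips):
    """trips: iterable of (pickup_geohash, dropoff_geohash) pairs.

    Returns (node_ids, relationships), where node_ids maps each distinct
    geohash to one id and relationships holds (rel_id, start_id, end_id).
    """
    node_ids = {}
    relationships = []
    for rel_id, (pickup, dropoff) in enumerate(trips):
        # setdefault reuses the existing id if the geohash was seen before
        start = node_ids.setdefault(pickup, len(node_ids))
        end = node_ids.setdefault(dropoff, len(node_ids))
        relationships.append((rel_id, start, end))
    return node_ids, relationships

# Invented sample data: "u09w" is a drop-off in one trip and a pickup in another
trips = [("u09t", "u09w"), ("u09w", "u09x"), ("u09t", "u09x")]
node_ids, relationships = build_geohash_graph(trips)
```

With DataFrames, the equivalent would be to build one node table from the distinct geohashes of both columns and then look up the start and end ids for the trip table, instead of generating separate ids for pickups and drop-offs.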

PySpark isin function

I am converting my legacy Python code to Spark using PySpark.
I would like to get a PySpark equivalent of:
usersofinterest = actdataall[actdataall['ORDValue'].isin(orddata['ORDER_ID'].unique())]['User ID']
Both actdataall and orddata are Spark dataframes.
I don't want to use the toPandas() function, given the drawbacks associated with it.
If both dataframes are big, you should consider using an inner join, which will work as a filter:
First let's create a dataframe containing the order IDs we want to keep:
orderid_df = orddata.select(orddata.ORDER_ID.alias("ORDValue")).distinct()
Now let's join it with our actdataall dataframe:
usersofinterest = actdataall.join(orderid_df, "ORDValue", "inner").select('User ID').distinct()
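To see why the inner join behaves like a filter, here is a small sketch in plain Python (no Spark involved). Rows are modelled as dictionaries, the function users_of_interest is a hypothetical stand-in for the join, and the sample values are invented.

```python
def users_of_interest(actdataall, orddata):
    """Keep the users whose ORDValue matches some ORDER_ID, like an inner join on distinct keys."""
    order_ids = {row["ORDER_ID"] for row in orddata}  # distinct order IDs
    # An inner join keeps only rows with a matching key; the set mirrors the final distinct()
    return {row["User ID"] for row in actdataall if row["ORDValue"] in order_ids}

# Invented sample rows
actdataall = [
    {"User ID": "u1", "ORDValue": 10},
    {"User ID": "u2", "ORDValue": 11},
    {"User ID": "u3", "ORDValue": 12},
    {"User ID": "u1", "ORDValue": 12},
]
orddata = [{"ORDER_ID": 10}, {"ORDER_ID": 12}, {"ORDER_ID": 12}]
selected = users_of_interest(actdataall, orddata)
```

Rows of actdataall without a matching order ID simply drop out, which is the same result the isin filter gives.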
If your target list of order IDs is small, then you can use the pyspark.sql isin function as mentioned in furianpandit's post. Don't forget to broadcast your variable before using it (Spark will copy the object to every node, making their tasks a lot faster):
orderid_list = orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x:x).collect()
The most direct translation of your code would be:
from pyspark.sql import functions as F
# collect all the unique ORDER_IDs to the driver
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
# filter ORDValue column by list of order_ids, then select only User ID column
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids)).select('User ID')
However, you should filter like this only if the number of 'ORDER_ID' values is definitely small (perhaps <100,000 or so).
If the number of 'ORDER_ID's is large, you should use a broadcast variable, which sends the list of order_ids to each executor so that it can compare against the order_ids locally for faster processing. Note that this will also work if the number of 'ORDER_ID's is small.
order_ids = [x.ORDER_ID for x in orddata.select('ORDER_ID').distinct().collect()]
order_ids_broadcast = sc.broadcast(order_ids) # send to broadcast variable
usersofinterest = actdataall.filter(F.col('ORDValue').isin(order_ids_broadcast.value)).select('User ID')
For more information on broadcast variables, check out:
So, you have two Spark dataframes: one is actdataall and the other is orddata. Use the following command to get your desired result.
usersofinterest = actdataall.where(actdataall['ORDValue'].isin(orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x:x).collect())).select('User ID')

Mimic 'group by' and window function logic in Spark

I have a large CSV file with the columns id, time, and location. I made it an RDD and want to compute some aggregated metrics of the trips, where a trip is defined as a time-contiguous set of records with the same id, separated by at least 1 hour on either side. I am new to Spark. (related)
To do that, I plan to create an RDD with elements of the form (trip_id, (time, location)) and use reduceByKey to calculate all the needed metrics.
To calculate the trip_id, I am trying to implement the SQL approach of the linked question: make an indicator field for whether the record is the start of a trip, and take a cumulative sum of this indicator field. This does not sound like a distributed approach: is there a better one?
Furthermore, how can I add this indicator field? It should be 1 if the time difference to the previous record of the same id is above an hour, and 0 otherwise. I thought at first of grouping by id and then sorting within each group's values, but they will be inside an Array and thus not amenable to sortByKey, and there is no lag function as in SQL to get the previous value.
Example of the suggested aforementioned approach: for the RDD
We want to turn it first into the RDD with the time differences,
(The value of the earliest record is, say, scala's PositiveInfinity constant)
and turn this last field into an indicator field of whether it is above 1, which indicates whether we start a trip,
and then turn it into a trip_id
and then use this trip_id as the key to aggregations.
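The steps above (time difference, indicator field, cumulative sum) can be sketched on a single machine as follows. This is plain Python, not Spark code: assign_trip_ids is a hypothetical helper, the sample records are invented, and the first record of each id is treated as the start of a trip, which plays the role of the PositiveInfinity difference mentioned above.

```python
from datetime import datetime, timedelta
from itertools import groupby

def assign_trip_ids(records, gap=timedelta(hours=1)):
    """records: iterable of (id, time, location). Returns a list of (trip_id, record)."""
    ordered = sorted(records, key=lambda r: (r[0], r[1]))  # by id, then by time
    trip_id = 0
    result = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        previous_time = None
        for record in group:
            # Indicator: first record of this id, or more than `gap` since the previous record
            starts_trip = previous_time is None or record[1] - previous_time > gap
            trip_id += int(starts_trip)  # cumulative sum of the indicator
            result.append((trip_id, record))
            previous_time = record[1]
    return result

# Invented sample records: (id, time, location)
records = [
    (1, datetime(2015, 1, 1, 10, 0), "A"),
    (1, datetime(2015, 1, 1, 10, 30), "B"),
    (1, datetime(2015, 1, 1, 12, 0), "C"),
    (2, datetime(2015, 1, 1, 10, 0), "D"),
]
trips = assign_trip_ids(records)
```

The resulting trip_id can then serve as the key for the aggregations; in Spark the per-id ordering would have to come from a window or a per-key sort rather than a global sorted() call.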
The preprocessing was simply to load the file and delete the header,
val rawdata=sc.textFile("some_path")
def isHeader(line:String)=line.contains("id")
val data=rawdata.filter(!isHeader(_))
While trying to implement with spark SQL, I ran into an error regarding the time difference:
val lags=sqlContext.sql("
select time - lag(time) over (partition by id order by time) as diff_time from data
since Spark doesn't know how to take the difference between two timestamps. I'm trying to check whether this difference is above 1 hour.
It also doesn't recognize the function getTime, which I found online as an answer; the following returns an error too (Couldn't find window function time.getTime):
val lags=sqlContext.sql("
select time.getTime() - (lag(time)).getTime() over (partition by id order by time)
from data
Even though making a similar lag difference for a numeric attribute works:
val lag_numeric=sqlContext.sql("
select longitude - lag(longitude) over (partition by id order by time)
from data"); //works
Spark didn't recognize the function Hours.hoursBetween either. I'm using Spark 1.4.0.
I also tried to define an appropriate user-defined function, but UDFs are oddly not recognized inside queries:
val timestamp_diff: ((Timestamp,Timestamp) => Double) =
(d1: Timestamp,d2: Timestamp) => d1.getTime()-d2.getTime()
val lags=sqlContext.sql("select timestamp_diff(time,lag(time))
over (partition by id order by time) from data");
So, how can Spark test whether the difference between timestamps is above an hour?
Full code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import sqlContext._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.hive.HiveContext//For window functions
import java.util.Date
import java.sql.Timestamp
case class Record(id: Int, time:Timestamp, longitude: Double, latitude: Double)
val raw_data=sc.textFile("file:///home/sygale/merged_table.csv")
val data_records=>
Record( line.split(',')(0).toInt,
val data=data_records.toDF()
val lags=sqlContext.sql("
select time - lag(time) over (partition by id order by time) as diff_time from data
