How to decide whether to use Spark RDD filter or not - apache-spark

I am using Spark to read and analyse a data file. The file contains data like the following:
1,unit1,category1_1,100
2,unit1,category1_2,150
3,unit2,category2_1,200
4,unit3,category3_1,200
5,unit3,category3_2,300
The file contains around 20 million records. If the user supplies a unit or a category, Spark needs to filter the data by inputUnit or inputCategory.
Solution 1:
sc.textFile(file).map(line => {
  val Array(id, unit, category, amount) = line.split(",")
  if ((StringUtils.isNotBlank(inputUnit) && unit != inputUnit) ||
      (StringUtils.isNotBlank(inputCategory) && category != inputCategory)) {
    null
  } else {
    val obj = new MyObj(id, unit, category, amount)
    (id, obj)
  }
}).filter(_ != null).collectAsMap()
Solution 2:
var rdd = sc.textFile(file).map(line => {
  val Array(id, unit, category, amount) = line.split(",")
  (id, unit, category, amount)
})
if (StringUtils.isNotBlank(inputUnit)) {
  rdd = rdd.filter(_._2 == inputUnit)
}
if (StringUtils.isNotBlank(inputCategory)) {
  rdd = rdd.filter(_._3 == inputCategory)
}
rdd.map(e => {
  val obj = new MyObj(e._1, e._2, e._3, e._4)
  (e._1, obj)
}).collectAsMap()
I want to understand which solution is better, or whether both of them are poor. If both are poor, how would I write a good one? Personally, I think the second one is better, but I am not sure whether it is good practice to declare an RDD as a var. (I am new to Spark, using Spark 1.5.0 and Scala 2.10.4. This is my first question on Stack Overflow, so feel free to edit it if it is not well formatted.) Thanks.
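For reference, a minimal sketch of a third variant that avoids both the null placeholder and the var by applying both checks in a single filter; MyObj, inputUnit and inputCategory are the ones from the question, and it assumes every line has exactly four fields:
sc.textFile(file)
  .map(_.split(","))
  .filter { case Array(_, unit, category, _) =>
    (StringUtils.isBlank(inputUnit) || unit == inputUnit) &&
    (StringUtils.isBlank(inputCategory) || category == inputCategory)
  }
  .map { case Array(id, unit, category, amount) =>
    (id, new MyObj(id, unit, category, amount))
  }
  .collectAsMap()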

Related

Collect all pickups while passing on the street for the first time in Jsprit

I'm trying to solve a simple TSP with Jsprit. I only have 1 vehicle and a total of about 200 pickups.
This is my vehicle routing algorithm code:
val vrpBuilder = VehicleRoutingProblem.Builder.newInstance()
vrpBuilder.setFleetSize(VehicleRoutingProblem.FleetSize.FINITE)
vrpBuilder.setRoutingCost(costMatrix)
vrpBuilder.addAllVehicles(vehicles)
vrpBuilder.addAllJobs(pickups)
val vrp = vrpBuilder.build()

val builder = Jsprit.Builder.newInstance(vrp)
val stateManager = StateManager(vrp)
val constraintManager = ConstraintManager(vrp, stateManager)
constraintManager.addConstraint(object : HardActivityConstraint {
    override fun fulfilled(
        context: JobInsertionContext?,
        prevActivity: TourActivity?,
        newActivity: TourActivity?,
        nextActivity: TourActivity?,
        previousActivityDepartureTime: Double
    ): HardActivityConstraint.ConstraintsStatus {
        if (prevActivity !== null && newActivity !== null && nextActivity !== null && context !== null) {
            if (prevActivity.location.id === nextActivity.location.id) {
                return HardActivityConstraint.ConstraintsStatus.FULFILLED
            }
            val distanceBetweenPrevAndNew = costMatrix.getDistance(
                prevActivity.location,
                newActivity.location,
                prevActivity.endTime,
                context.newVehicle
            )
            val distanceBetweenPrevAndNext = costMatrix.getDistance(
                prevActivity.location,
                nextActivity.location,
                prevActivity.endTime,
                context.newVehicle
            )
            if (distanceBetweenPrevAndNext > distanceBetweenPrevAndNew) {
                return HardActivityConstraint.ConstraintsStatus.FULFILLED
            } else {
                return HardActivityConstraint.ConstraintsStatus.NOT_FULFILLED
            }
        }
        return HardActivityConstraint.ConstraintsStatus.FULFILLED
    }
}, ConstraintManager.Priority.CRITICAL)
builder.setProperty(Jsprit.Parameter.FAST_REGRET, true.toString())
builder.setProperty(Jsprit.Parameter.CONSTRUCTION, Construction.REGRET_INSERTION.toString())
builder.setProperty(Jsprit.Parameter.THREADS, threads.toString())
builder.setStateAndConstraintManager(stateManager, constraintManager)
val vra = builder.buildAlgorithm()
As routing cost I'm using a distance-only FastVehicleRoutingTransportCostsMatrix built with the GraphHopper Matrix API.
The vehicle should avoid passing in front of an uncollected pickup, so I tried to set up a hard activity constraint that checks whether the newly inserted activity is further away than the next one. If it's not further away, the constraint is not fulfilled. However, the constraint does not work very well.
Here's an example of an optimized route with the constraint:
[example image of the optimized route]
The correct order for my case should be 46, 43, 45, 44; however, Jsprit orders them that way because after 44 the vehicle has to make a U-turn and run through the street again to reach 47, 48, 49...
I'm not sure whether setting up a constraint is the right way to solve this. Do you have any advice?
Thanks

Is it possible to build Spark code on the fly and execute it?

I am trying to create a generic function to read a CSV file using the Databricks CSV reader. But the options are not mandatory; they can differ based on my input JSON configuration file.
Example 1:
"ReaderOption": {
  "delimiter": ";",
  "header": "true",
  "inferSchema": "true",
  "schema": """some custom schema.."""
},
Example 2:
"ReaderOption": {
  "delimiter": ";",
  "schema": """some custom schema.."""
},
Is it possible to construct the options, or the entire read statement, at runtime and run it in Spark?
Like below:
def readCsvWithOptions(): DataFrame = {
  val options: Map[String, String] = Map("inferSchema" -> "true")
  val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
    .option(options)
    .load(inputPath)
  readDF
}
def readCsvWithOptions(): DataFrame = {
  val options: Map[String, String] = Map("inferSchema" -> "true")
  val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
    .options(options)
    .load(inputPath)
  readDF
}
There is an options method that takes a Map of key/value pairs, as in the second version above.
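As a rough sketch of how the ReaderOption block could be wired through at runtime (assuming it has already been parsed from the JSON config into a Map[String, String]; jobContext and inputPath are the ones from the question):
import org.apache.spark.sql.DataFrame

def readCsvWithOptions(readerOptions: Map[String, String]): DataFrame = {
  jobContext.spark.read
    .format("com.databricks.spark.csv")
    .options(readerOptions) // passes through whatever keys the config supplies: delimiter, header, inferSchema, ...
    .load(inputPath)
}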

Spark accumulator: I always get a value of 0

I'm using a LongAccumulator to count the number of records that I save in Cassandra.
object Main extends App {
  val conf = args(0)
  val ssc = StreamingContext.getStreamingContext(conf)
  Runner.apply(conf).startJob(ssc)
  StreamingContext.startStreamingContext(ssc)
  StreamingContext.stopStreamingContext(ssc)
}

class Runner(conf: Conf) {
  var accTotal: LongAccumulator = _ // shared by startJob and saveToCassandra

  override def startJob(ssc: StreamingContext): Unit = {
    accTotal = ssc.sparkContext.longAccumulator("total")
    val inputKafka = createDirectStream(ssc, kafkaParams, topicsSet)
    val rddAvro = inputKafka.map { x => x.value() }
    saveToCassandra(rddAvro)
    println("XXX: " + accTotal.value) // --> 0
  }

  def saveToCassandra(upserts: DStream[Data]) = {
    val rddCassandraUpsert = upserts.map { record =>
      accTotal.add(1)
      println("ACC: " + accTotal.value) // --> 1, 2, 3, 4... OK; the Spark web UI shows it too.
      DataExt(record.data, record.data1)
    }
    rddCassandraUpsert.saveToCassandra(keyspace, table)
  }
}
I see that the code executes correctly and data is saved in Cassandra, but when I finally print the accumulator its value is 0; if I print it inside the map function I can see the right values. Why?
I'm using Spark 2.0.2 and executing from IntelliJ in local mode. I have checked the Spark web UI and I can see the accumulator being updated.
The problem is probably here:
object Main extends App {
...
Spark doesn't support applications extending App; doing so can result in non-deterministic behavior:
Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
You should always use a standard application with a main method:
object Main {
  def main(args: Array[String]) {
    ...
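For example, the entry point from the question could be rewritten along these lines (just a sketch reusing the helpers from the question):
object Main {
  def main(args: Array[String]): Unit = {
    val conf = args(0)
    val ssc = StreamingContext.getStreamingContext(conf)
    Runner.apply(conf).startJob(ssc)
    StreamingContext.startStreamingContext(ssc)
    StreamingContext.stopStreamingContext(ssc)
  }
}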

How can I retrieve the alias for a DataFrame in Spark

I'm using Spark 2.0.2. I have a DataFrame that has an alias on it, and I'd like to be able to retrieve that. A simplified example of why I'd want that is below.
def check(ds: DataFrame) = {
  assert(ds.count > 0, s"${ds.getAlias} has zero rows!")
}
The above code of course fails because DataFrame has no getAlias function. Is there a way to do this?
You can try something like this, but I wouldn't go so far as to claim it is supported:
Spark < 2.1:
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias
import org.apache.spark.sql.Dataset

def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
  case SubqueryAlias(alias, _) => Some(alias)
  case _ => None
}
Spark 2.1+:
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
  case SubqueryAlias(alias, _, _) => Some(alias)
  case _ => None
}
Example usage:
val plain = Seq((1, "foo")).toDF
getAlias(plain)
// Option[String] = None

val aliased = plain.alias("a dataset")
getAlias(aliased)
// Option[String] = Some(a dataset)
Disclaimer: as stated above, this code relies on undocumented APIs subject to change. It works as of Spark 2.3.
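Tying this back to the check helper from the question, a small sketch that uses the getAlias defined above (the fallback name is just an assumption):
import org.apache.spark.sql.Dataset

def check(ds: Dataset[_]): Unit = {
  val name = getAlias(ds).getOrElse("unnamed dataset")
  assert(ds.count > 0, s"$name has zero rows!")
}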
After much digging into mostly undocumented Spark methods, here is the full code to pull the list of fields, along with the table alias, for a DataFrame in PySpark:
def schema_from_plan(df):
    plan = df._jdf.queryExecution().analyzed()
    all_fields = _schema_from_plan(plan)
    iterator = plan.output().iterator()
    output_fields = {}
    while iterator.hasNext():
        field = iterator.next()
        queryfield = all_fields.get(field.exprId().id(), {})
        if not queryfield == {}:
            tablealias = queryfield["tablealias"]
        else:
            tablealias = ""
        output_fields[field.exprId().id()] = {
            "tablealias": tablealias,
            "dataType": field.dataType().typeName(),
            "name": field.name()
        }
    return list(output_fields.values())

def _schema_from_plan(root, tablealias=None, fields={}):
    iterator = root.children().iterator()
    while iterator.hasNext():
        node = iterator.next()
        nodeClass = node.getClass().getSimpleName()
        if nodeClass == "SubqueryAlias":
            # get the alias and process the subnodes with this alias
            _schema_from_plan(node, node.alias(), fields)
        else:
            if tablealias:
                # add all the fields, along with the unique IDs, and a new tablealias field
                # (use a separate iterator so the outer loop over children is not exhausted)
                field_iterator = node.output().iterator()
                while field_iterator.hasNext():
                    field = field_iterator.next()
                    fields[field.exprId().id()] = {
                        "tablealias": tablealias,
                        "dataType": field.dataType().typeName(),
                        "name": field.name()
                    }
            _schema_from_plan(node, tablealias, fields)
    return fields

# example: fields = schema_from_plan(df)
For Java:
As #veinhorn mentioned, it is also possible to get the alias in Java. Here is a utility method example:
public static <T> Optional<String> getAlias(Dataset<T> dataset) {
    final LogicalPlan analyzed = dataset.queryExecution().analyzed();
    if (analyzed instanceof SubqueryAlias) {
        SubqueryAlias subqueryAlias = (SubqueryAlias) analyzed;
        return Optional.of(subqueryAlias.alias());
    }
    return Optional.empty();
}

Couchbase Spark connector DCP: recover from the last position

I have this spark application:
val conf = new SparkConf().setMaster("local[*]")
  .setAppName("StreamingSample")
  .set("com.couchbase.bucket.test", "")
  .set("com.couchbase.nodes", "test-machine")

val ssc = new StreamingContext(conf, Seconds(5))

ssc.couchbaseStream(from = FromNow, to = ToInfinity)
  .filter(!_.isInstanceOf[Snapshot]) // Don't print snapshots, just mutations and deletions
  .checkpoint(Seconds(2))
  .foreachRDD(rdd => {
    val om: Broadcast[ObjectMapper] = ScalaObjectMapper.getInstance(rdd.sparkContext)
    rdd.foreach {
      case m: Mutation =>
        val content: Map[String, Object] = om.value.readValue(m.content, classOf[Map[String, Object]])
        content("objectType") match {
          case "o" => println("o")
          case "c" => println("c")
          case "s" => println("s")
          case unsupportedType => println("unsupported")
        }
      case m: Deletion => println("delete")
    }
  })
When Spark fails and recovers, how do I resume streaming from the last position?
Unfortunately, the current connector version (1.2.1) can only stream either from the beginning or from the current position (end of the stream). So in your example, you have no choice but to change FromNow to FromBeginning and then skip (in code) past all the messages you've already seen until you catch up.
The client team is currently working on a new implementation that will be able to remember state, so you'll be able to restore from a specific point in the stream.
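A rough sketch of the workaround described above; FromBeginning comes from the answer, while alreadyProcessed, handleMutation and handleDeletion are hypothetical placeholders for your own dedup check and the handling already shown in the question:
ssc.couchbaseStream(from = FromBeginning, to = ToInfinity)
  .filter(!_.isInstanceOf[Snapshot])
  .foreachRDD(rdd => {
    rdd.foreach {
      case m: Mutation if alreadyProcessed(m) => () // skip messages already seen before the failure
      case m: Mutation => handleMutation(m)
      case d: Deletion => handleDeletion(d)
    }
  })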
