Spark NullPointerException for customized filter function - apache-spark

I have a customized filter function written as:
def filterSpiderBots(inputDF: DataFrame, whitelistMatcher: Matcher): DataFrame = {
  val filterFunc = udf(
    (ua: String) => ua == null || whitelistMatcher.matches(ua)
  )
  inputDF.filter(filterFunc($"ua"))
}
The matcher reads from
new FileInputStream(SparkFiles.get("iab-whitelist.txt"))
I made sure the Matcher is not null, and I have a check for a null ua in place,
but when I run the program I still get a NullPointerException. I don't know what is causing it.
Update:
After removing the matcher, I still get the same error:
def filterSpiderBots(inputDF: DataFrame): DataFrame = {
  val filterFunc = udf(
    (ua: String) => ua == null
  )
  inputDF.filter(filterFunc($"ua"))
}

filter() preserves the records that evaluate to true, and in your case all the null records evaluate to true. So change your filterFunc to "ua != null"?
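For example, a minimal sketch of that fix (assuming the DataFrame/Matcher setup from the question, and that spark.implicits._ is in scope for the $"ua" syntax):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

def filterSpiderBots(inputDF: DataFrame, whitelistMatcher: Matcher): DataFrame = {
  // filter() KEEPS rows for which the predicate is true, so return false
  // for null user agents instead of true
  val filterFunc = udf((ua: String) => ua != null && whitelistMatcher.matches(ua))
  inputDF.filter(filterFunc($"ua"))
}
Note that whitelistMatcher is captured by the UDF's closure, so it also has to be serializable in order to be shipped to the executors.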

Related

Collect all pickups while passing on the street for the first time in Jsprit

I'm trying to solve a simple TSP with Jsprit. I only have 1 vehicle and a total of about 200 pickups.
This is my vehicle routing algorithm code:
val vrpBuilder = VehicleRoutingProblem.Builder.newInstance()
vrpBuilder.setFleetSize(VehicleRoutingProblem.FleetSize.FINITE)
vrpBuilder.setRoutingCost(costMatrix)
vrpBuilder.addAllVehicles(vehicles)
vrpBuilder.addAllJobs(pickups)
val vrp = vrpBuilder.build()
val builder = Jsprit.Builder.newInstance(vrp)
val stateManager = StateManager(vrp)
val constraintManager = ConstraintManager(vrp, stateManager)
constraintManager.addConstraint(object : HardActivityConstraint {
    override fun fulfilled(
        context: JobInsertionContext?,
        prevActivity: TourActivity?,
        newActivity: TourActivity?,
        nextActivity: TourActivity?,
        previousActivityDepartureTime: Double
    ): HardActivityConstraint.ConstraintsStatus {
        if (prevActivity != null && newActivity != null && nextActivity != null && context != null) {
            // compare ids by value (==), not by reference (===)
            if (prevActivity.location.id == nextActivity.location.id) {
                return HardActivityConstraint.ConstraintsStatus.FULFILLED
            }
            val distanceBetweenPrevAndNew = costMatrix.getDistance(
                prevActivity.location,
                newActivity.location,
                prevActivity.endTime,
                context.newVehicle
            )
            val distanceBetweenPrevAndNext = costMatrix.getDistance(
                prevActivity.location,
                nextActivity.location,
                prevActivity.endTime,
                context.newVehicle
            )
            return if (distanceBetweenPrevAndNext > distanceBetweenPrevAndNew) {
                HardActivityConstraint.ConstraintsStatus.FULFILLED
            } else {
                HardActivityConstraint.ConstraintsStatus.NOT_FULFILLED
            }
        }
        return HardActivityConstraint.ConstraintsStatus.FULFILLED
    }
}, ConstraintManager.Priority.CRITICAL)
builder.setProperty(Jsprit.Parameter.FAST_REGRET, true.toString())
builder.setProperty(Jsprit.Parameter.CONSTRUCTION, Construction.REGRET_INSERTION.toString())
builder.setProperty(Jsprit.Parameter.THREADS, threads.toString())
builder.setStateAndConstraintManager(stateManager, constraintManager)
val vra = builder.buildAlgorithm()
As routing cost I'm using a distance-only FastVehicleRoutingTransportCostsMatrix built with the GraphHopper Matrix API.
The vehicle should avoid passing in front of an uncollected pickup, so I tried to set up a hard activity constraint that checks whether the newly inserted activity is further away than the next one. If it's not further away, the constraint is not fulfilled. However, the constraint does not work quite well.
Here's an example of an optimized route with the constraint:
[route image omitted]
The correct order for my case should be 46, 43, 45, 44; however, Jsprit orders them that way because after 44 the vehicle has to make a U-turn and run through the street again to reach 47,48,49...
I'm not sure if setting up a constraint is the right way to solve this. Do you have any advice?
Thanks

Using an object name in an if statement: both true and false?

I'm trying to use my object's name in an if statement, but both branches come up as true. Why?
var moduleInfo = new Object("moduleInfo");
moduleInfo["name"] = "Module: Export";
if (moduleInfo !== "moduleInfo") {
    console.log("window is NOT moduleInfo");
}
if (moduleInfo == "moduleInfo") {
    console.log("window IS moduleInfo");
}
!== is a strict comparison that also checks the type, and you are comparing an object against a primitive string. Replacing that operator with !=, or replacing the second one with ===, will probably get you a more consistent/desired result.
== converts the operands to the same type before making the comparison:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Comparison_Operators
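A quick sketch of what is happening (new Object("moduleInfo") actually produces a String wrapper object, which is the part that makes both branches run):
var moduleInfo = new Object("moduleInfo"); // returns a String wrapper object
console.log(typeof moduleInfo);            // "object"
console.log(moduleInfo !== "moduleInfo");  // true  - strict comparison: object vs. primitive string
console.log(moduleInfo == "moduleInfo");   // true  - loose comparison coerces the object to "moduleInfo"
console.log(moduleInfo === "moduleInfo");  // false - same value, but the types differ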

Spock: check the query parameter count in URI

I have just started with Spock. I have a piece of functionality where a Java function makes an HTTP call. Per the requirements, the URI used in the HTTP call must contain a "loc" parameter, and it must appear exactly once.
I am writing a Spock test case and have written the snippet below.
def "prepareURI" () {
given: "Search Object"
URI uri = new URI();
when:
uri = handler.prepareURI( properties) // it will return URI like http://example.com?query=abc&loc=US
then:
with(uri)
{
def map = uri.getQuery().split('&').inject([:]) {map, kv-> def (key, value) = kv.split('=').toList(); map[key] = value != null ? URLDecoder.decode(value) : null; map }
assert map.loc != null
}
}
With the above snippet, my two checks pass:
It exists
It is not null
I want to check the count of the "loc" query parameter: it should be passed exactly once. With the map above, if I pass the "loc" parameter twice, the map overrides the old value with the second one.
Does anyone know how to access the query parameters as a list, so that I can count the entries which start with "loc"?
Thanks in advance.
Perhaps an example would be the best start:
def uri = new URI('http://example.com?query=abc&loc=US')
def parsed = uri.query.tokenize('&').collect { it.tokenize('=') }
println "parsed to list: $parsed"
println "count of 'loc' params: " + parsed.count { it.first() == 'loc' }
println "count of 'bob' params: " + parsed.count { it.first() == 'bob' }
println "count of params with value 'abc': " + parsed.count { it.last() == 'abc' }
prints:
$ groovy test.groovy
parsed to list: [[query, abc], [loc, US]]
count of 'loc' params: 1
count of 'bob' params: 0
count of params with value 'abc': 1
The problem, as you correctly noted, is that you cannot put your params into a map if your intent is to count the number of params with a certain name.
In the above, we parse the params into a list of lists, where the inner lists are key/value pairs. This way we can call it.first() to get the param names and it.last() to get the param values. The Groovy List.count { } method lets us count the occurrences of a certain item in the list of params.
As for your code, there is no need to call new URI() at the beginning of your test, as you set the value anyway a few lines down.
Also, the with(uri) call is unnecessary, as you don't use any of the uri methods without prefixing them with uri. anyway. I.e. you can either write:
def uri = new URI('http://example.com?query=abc&loc=US')
def parsed = uri.query.tokenize('&').collect { it.tokenize('=') }
or:
def uri = new URI('http://example.com?query=abc&loc=US')
uri.with {
def parsed = query.tokenize('&').collect { it.tokenize('=') }
}
(note that we are using query directly in the second example)
But there is not much point in using with if you are still prefixing everything with uri.
The resulting test case might look something like:
def "prepareURI"() {
given: "Search Object"
def uri = handler.prepareURI( properties) // it will return URI like http://example.com?query=abc&loc=US
when:
def parsed = query.tokenize('&').collect { it.tokenize('=') }
then:
assert parsed.count { it.first() == 'loc' } == 1
}

How can I retrieve the alias for a DataFrame in Spark

I'm using Spark 2.0.2. I have a DataFrame that has an alias on it, and I'd like to be able to retrieve that. A simplified example of why I'd want that is below.
def check(ds: DataFrame) = {
  assert(ds.count > 0, s"${ds.getAlias} has zero rows!")
}
The above code of course fails because DataFrame has no getAlias function. Is there a way to do this?
You can try something like this, but I wouldn't go so far as to claim it is supported:
Spark < 2.1:
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias
import org.apache.spark.sql.Dataset
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
  case SubqueryAlias(alias, _) => Some(alias)
  case _ => None
}
Spark 2.1+:
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
  case SubqueryAlias(alias, _, _) => Some(alias)
  case _ => None
}
Example usage:
val plain = Seq((1, "foo")).toDF
getAlias(plain)
// Option[String] = None

val aliased = plain.alias("a dataset")
getAlias(aliased)
// Option[String] = Some(a dataset)
Disclaimer: as stated above, this code relies on undocumented APIs subject to change. It works as of Spark 2.3.
After much digging into mostly undocumented Spark methods, here is the full code to pull the list of fields, along with the table alias, for a DataFrame in PySpark:
def schema_from_plan(df):
    plan = df._jdf.queryExecution().analyzed()
    all_fields = _schema_from_plan(plan)
    iterator = plan.output().iterator()
    output_fields = {}
    while iterator.hasNext():
        field = iterator.next()
        queryfield = all_fields.get(field.exprId().id(), {})
        if queryfield:
            tablealias = queryfield["tablealias"]
        else:
            tablealias = ""
        output_fields[field.exprId().id()] = {
            "tablealias": tablealias,
            "dataType": field.dataType().typeName(),
            "name": field.name()
        }
    return list(output_fields.values())

def _schema_from_plan(root, tablealias=None, fields=None):
    # avoid a mutable default argument so repeated calls don't accumulate
    # fields from earlier DataFrames
    if fields is None:
        fields = {}
    iterator = root.children().iterator()
    while iterator.hasNext():
        node = iterator.next()
        nodeClass = node.getClass().getSimpleName()
        if nodeClass == "SubqueryAlias":
            # get the alias and process the subnodes with this alias
            _schema_from_plan(node, node.alias(), fields)
        else:
            if tablealias:
                # add all the fields, along with the unique IDs, and a new tablealias field
                # (use a separate iterator so the outer children iterator is not clobbered)
                field_iterator = node.output().iterator()
                while field_iterator.hasNext():
                    field = field_iterator.next()
                    fields[field.exprId().id()] = {
                        "tablealias": tablealias,
                        "dataType": field.dataType().typeName(),
                        "name": field.name()
                    }
            _schema_from_plan(node, tablealias, fields)
    return fields
# example: fields = schema_from_plan(df)
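For instance, a hypothetical session might look like this (the table name, alias, columns, and types below are invented for illustration):
df = spark.table("orders").alias("o").select("id", "amount")
fields = schema_from_plan(df)
# e.g. [{'tablealias': 'o', 'dataType': 'long', 'name': 'id'},
#       {'tablealias': 'o', 'dataType': 'double', 'name': 'amount'}]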
For Java:
As #veinhorn mentioned, it is also possible to get the alias in Java. Here is a utility method example:
import java.util.Optional;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias;

public static <T> Optional<String> getAlias(Dataset<T> dataset) {
    final LogicalPlan analyzed = dataset.queryExecution().analyzed();
    if (analyzed instanceof SubqueryAlias) {
        SubqueryAlias subqueryAlias = (SubqueryAlias) analyzed;
        return Optional.of(subqueryAlias.alias());
    }
    return Optional.empty();
}

Method parameters in findAll closure are empty

There is a method call in the code:
def createdOrders = getValuesByStatus(extractedResponse, status)
where extractedResponse is a JsonSlurper parse result and status is just a String value.
Method:
def getValuesByStatus(def jsonData, String status) {
    jsonData.findAll { json ->
        json.responseStatus.status == status
    }
}
Testing getValuesByStatus with Spock works perfectly,
but running the main application somehow results in
java.lang.NullPointerException: Cannot get property 'responseStatus' on null object
Running the debugger in IntelliJ IDEA I can see that jsonData is not null and not empty, but a breakpoint in the findAll closure shows that the json element is null...
Just use the null safe operator:
def getValuesByStatus(def jsonData, String status) {
    jsonData.findAll { json ->
        json?.responseStatus?.status == status
    }
}
jsonData contains null values, but in debug mode you can't see those, so null-checking in the findAll closure helped.
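A minimal sketch of what was going on (the list contents are invented for illustration):
// a parsed response with a null element hiding in it
def jsonData = [
    [responseStatus: [status: 'CREATED']],
    null,
    [responseStatus: [status: 'FAILED']]
]

// without ?. this throws: Cannot get property 'responseStatus' on null object
def created = jsonData.findAll { json ->
    json?.responseStatus?.status == 'CREATED'
}
assert created.size() == 1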
