How can I retrieve the alias for a DataFrame in Spark

How can I retrieve the alias for a DataFrame in Spark - apache-spark

I'm using Spark 2.0.2. I have a DataFrame that has an alias on it, and I'd like to be able to retrieve that. A simplified example of why I'd want that is below.
def check(ds: DataFrame) = {
assert(ds.count > 0, s"${df.getAlias} has zero rows!")
}
The above code of course fails because DataFrame has no getAlias function. Is there a way to do this?

You can try something like this but I wouldn't go so far to claim it is supported:
Spark < 2.1:
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias
import org.apache.spark.sql.Dataset
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
case SubqueryAlias(alias, _) => Some(alias)
case _ => None
}
Spark 2.1+:
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
case SubqueryAlias(alias, _, _) => Some(alias)
case _ => None
}
Example usage:
val plain = Seq((1, "foo")).toDF
getAlias(plain)
Option[String] = None
val aliased = plain.alias("a dataset")
getAlias(aliased)
Option[String] = Some(a dataset)

Disclaimer: as stated above, this code relies on undocumented APIs subject to change. It works as of Spark 2.3.
After much digging into mostly undocumented Spark methods, here is the full code to pull the list of fields, along with the table alias for a dataframe in PySpark:
def schema_from_plan(df):
plan = df._jdf.queryExecution().analyzed()
all_fields = _schema_from_plan(plan)
iterator = plan.output().iterator()
output_fields = {}
while iterator.hasNext():
field = iterator.next()
queryfield = all_fields.get(field.exprId().id(),{})
if not queryfield=={}:
tablealias = queryfield["tablealias"]
else:
tablealias = ""
output_fields[field.exprId().id()] = {
"tablealias": tablealias,
"dataType": field.dataType().typeName(),
"name": field.name()
}
return list(output_fields.values())
def _schema_from_plan(root,tablealias=None,fields={}):
iterator = root.children().iterator()
while iterator.hasNext():
node = iterator.next()
nodeClass = node.getClass().getSimpleName()
if (nodeClass=="SubqueryAlias"):
# get the alias and process the subnodes with this alias
_schema_from_plan(node,node.alias(),fields)
else:
if tablealias:
# add all the fields, along with the unique IDs, and a new tablealias field
iterator = node.output().iterator()
while iterator.hasNext():
field = iterator.next()
fields[field.exprId().id()] = {
"tablealias": tablealias,
"dataType": field.dataType().typeName(),
"name": field.name()
}
_schema_from_plan(node,tablealias,fields)
return fields
# example: fields = schema_from_plan(df)

For Java:
As #veinhorn mentioned, it is also possible to get the alias in Java. Here is a utility method example:
public static <T> Optional<String> getAlias(Dataset<T> dataset){
final LogicalPlan analyzed = dataset.queryExecution().analyzed();
if(analyzed instanceof SubqueryAlias) {
SubqueryAlias subqueryAlias = (SubqueryAlias) analyzed;
return Optional.of(subqueryAlias.alias());
}
return Optional.empty();
}

Related

Gatling Rest API Testing - retrieve a value from json response and add it to the list, iterate through list

I am new to Gatling, I am trying to do the performance testing for couple of rest calls. In my scenario I need to extract a value from json response of the 1st call and add those values to the list after looping for few times. Again after looping for few times and adding the values into the list, I want to reuse each value in my next rest call by iterating over the values in the list. Can anyone please suggest on how to implement this. I tried something as below,
var datasetIdList = List.empty[String]
val datasetidsFeeder = datasetIdList.map(datasetId => Map("datasetId" -> datasetId)).iterator
def createData() = {
repeat(20){
feed("").exec(http("create dataset").post("/create/data").header("content-type", "application/json")
.body(StringBody("""{"name":"name"}"""))
.asJson.check(jsonPath("$.id").saveAs("userId"))))
.exec(session => { var usrid = session("userId").as[String].trim
datasetIdList:+= usrid session})
}}
def upload()= feed(datasetidsFeeder).exec(http("file upload").post("/compute-metaservice/datasets/${datasetId}/uploadFile")
.formUpload("File","./src/test/resources/data/File.csv")
.header("content-type","multipart/form-data")
.check(status is 200))
val scn = scenario("create data and upload").exec(createData()).exec(upload())
setUp(scn.inject(atOnceUsers(1))).protocols(httpConf)
}
I am seeing an exception that ListFeeder is empty when trying to run above script. Can someone please help
Updated Code:
class ParallelcallsSimulation extends Simulation{
var idNumbers = (1 to 50).iterator
val customFeeder = Iterator.continually(Map(
"name" -> ("test_gatling_"+ idNumbers.next())
))
val httpConf = http.baseUrl("http://localhost:8080")
.header("Authorization","Bearer 6a4aee03-9172-4e31-a784-39dea65e9063")
def createDatasetsAndUpload() = {
repeat(3) {
//create dataset
feed(customFeeder).exec(http("create data").post("/create/data").header("content-type", "application/json")
.body(StringBody("""{ "name": "${name}","description": "create data and upload file"}"""))
.asJson.check(jsonPath("$.id").saveAs("userId")))
.exec(session => {
val name = session("name").asOption[String]
println(name.getOrElse("COULD NOT FIND NAME"))
val userId = session("userId").as[String].trim
println("%%%%% User ID ====>"+userId)
val datasetIdList = session("datasetIdList").asOption[List[_]].getOrElse(Nil)
session.set("datasetIdList", userId :: datasetIdList)
})
}
}
// File Upload
def fileUpload() = foreach("${datasetIdList}","datasetId"){
exec(http("file upload").post("/uploadFile")
.formUpload("File","./src/test/resources/data/File.csv")
.header("content-type","multipart/form-data")
.check(status is 200))
}
def getDataSetId() = foreach("${datasetIdList}","datasetId"){
exec(http("get datasetId")
.get("/get/data/${datasetId}")
.header("content-type","application/json")
.asJson.check(jsonPath("$.dlp.dlp_job_status").optional
.saveAs("dlpJobStatus")).check(status is 200)
).exec(session => {
val datastId = session("datasetId").asOption[String]
println("request for datasetId >>>>>>>>"+datastId.getOrElse("datasetId not found"))
val jobStatus = session("dlpJobStatus").asOption[String]
println("JOB STATUS:::>>>>>>>>>>"+jobStatus.getOrElse("Dlp Job Status not Found"))
println("Time: >>>>>>"+System.currentTimeMillis())
session
}).pause(10)
}
val scn1 = scenario("create multiple datasets and upload").exec(createDatasetsAndUpload()).exec(fileUpload())
val scn2 = scenario("get datasetId").pause(100).exec(getDataSetId())
setUp(scn1.inject(atOnceUsers(1)),scn2.inject(atOnceUsers(1))).protocols(httpConf)
}
I see below error when I try to execute above script
[ERROR] i.g.c.s.LoopBlock$ - Condition evaluation crashed with message 'No attribute named 'datasetIdList' is defined', exiting loop

var datasetIdList = List.empty[String] defines a mutable variable pointing to a immutable list.
val datasetidsFeeder = datasetIdList.map(datasetId => Map("datasetId" -> datasetId)).iterator uses the immutable list. Further changes to datasetIdList is irrelevant to datasetidsFeeder.
Mutating a global variable with your virtual user is usually not a good idea.
You can save the value into the user's session instead.
In the exec block, you can write:
val userId = session("userId").as[String].trim
val datasetIdList = session("datasetIdList").asOption[List[_]].getOrElse(Nil)
session.set("datasetIdList", userId :: datasetIdList)
Then you can use foreach to iterate them all without using a feeder at all.
foreach("${datasetIdList}", "datasetId") {
exec(http("file upload")
...
}
You should put more work in your question.
Your code is not syntax-highlighted, and is formatted poorly.
You said "I am seeing an exception that ListFeeder is empty" but the words "ListFeeder" are not seen anywhere.
You should post the error message so that it's easier to see what went wrong.
In the documentation linked, there is a Warning. Quoted below:
Session instances are immutable!
Why is that so? Because Sessions are messages that are dealt with in a multi-threaded concurrent way, so immutability is the best way to deal with state without relying on synchronization and blocking.
A very common pitfall is to forget that set and setAll actually return new instances.
This is why the code in the updated question doesn't update the list.
session => {
...
session.set("datasetIdList", userId :: datasetIdList)
println("%%%% List =====>>>" + datasetIdList.toString())
session
}
The updated session is simply discarded. And the original session is returned in the anonymous function.

BSON to Play JSON support for Long values

I've started using the play-json/play-json-compat libraries with reactivemongo 0.20.11.
So I can use JSON Play reads/writes while importing the 'reactivemongo.play.json._' package and then easily fetch data from a JSONCollection instead of a BSONCollection.
For most cases, this works great but for Long fields, it doesn't :(
For example:
case class TestClass(name: String, age: Long)
object TestClass {
implicit val reads = Json.reads[TestClass]
}
If I try querying using the following func:
def getData: Map[String, TestClass] = {
val res = collection.find(emptyDoc)
.cursor[TestClass]()
.collect[List](-1, Cursor.ContOnError[List[TestClass]] { case (_, t) =>
failureLogger.error(s"Failed deserializing TestClass from Mongo", t)
})
.map { items =>
items map { item =>
item.name -> item.age
} toMap
}
Await.result(res, 10 seconds)
}
Then I get the following error:
play.api.libs.json.JsResultException: JsResultException(errors:List((/age,List(ValidationError(List(error.expected.jsnumber),WrappedArray())))))
I've debugged the reading of the document and noticed that when it first converts the BSON to a JsObject, then the long field is as following:
"age": {"$long": 1526389200000}
I found a way to make this work but I really don't like it:
case class MyBSONLong(`$long`: Long)
object MyBSONLong {
implicit val longReads = Json.reads[MyBSONLong]
}
case class TestClass(name: String, age: Long)
object TestClass {
implicit val reads = (
(__ \ "name").read[String] and
(__ \ "age").read[MyBSONLong].map(_.`$long`)
) (apply _)
}
So this works, but it's a very ugly solution.
Is there a better way to do this?
Thanks in advance :)

Failed to obtain broadcast value

I create a spark application like below.
When run with local client mode, everything goes fine.
But when I submit into YARN with cluster deploy mode on prod environment, variable applicationAction in last match block always be null.
So is there any problem which I'm using broadcast here, or there's any other method I could pass the variables to the last match case block.
Thanks.
object SparkTask {
private sealed trait AppAction {}
case class Action1() extends AppAction
case class Action2() extends AppAction
def main(args: Array[String]): Unit = {
var applicationAction: Broadcast[AppAction] = null
val sparkSession = SparkSession.builder.appName("SparkTask").getOrCreate
args(0) match {
case "action-1" => applicationAction = sparkSession.sparkContext.broadcast(Action1())
case "action-2" => applicationAction = sparkSession.sparkContext.broadcast(Action2())
case _ => sys.exit(255)
}
// Here goes some df action and get a persisted dataset
val df1 = ...
val df2 = ...
val df3 = ...
applicationAction.value match {
case Action1() => handleAction1(df3)
case Action2() => handleAction2(df3)
}
}
}

The purpose of broadcast variables it to share some data with executors.
I think in your use-case there are two possibilites:
You're trying to get some information from executors to driver: for this you shouldn't use broadcast variables but accumulators or something like take/collect.
You want take a decision based on applicationAction.value (immutable): in this case you can then use directly the value of args(0).

Scala - Remove all elements in a list/map of strings from a single String

Working on an internal website where the URL contains the source reference from other systems. This is a business requirement and cannot be changed.
i.e. "http://localhost:9000/source.address.com/7808/project/repo"
"http://localhost:9000/build.address.com/17808/project/repo"
I need to remove these strings from the "project/repo" string/variables using a trait so this can be used natively from multiple services. I also want to be able to add more sources to this list (which already exists) and not modify the method.
"def normalizePath" is the method accessed by services, 2 non-ideal but reasonable attempts so far. Getting stuck on a on using foldLeft which I woudl like some help with or an simpler way of doing the described. Code Samples below.
1st attempt using an if-else (not ideal as need to add more if/else statements down the line and less readable than pattern match)
trait NormalizePath {
def normalizePath(path: String): String = {
if (path.startsWith("build.address.com/17808")) {
path.substring("build.address.com/17808".length, path.length)
} else {
path
}
}
}
and 2nd attempt (not ideal as likely more patterns will get added and it generates more bytecode than if/else)
trait NormalizePath {
val pattern = "build.address.com/17808/"
val pattern2 = "source.address.com/7808/"
def normalizePath(path: String) = path match {
case s if s.startsWith(pattern) => s.substring(pattern.length, s.length)
case s if s.startsWith(pattern2) => s.substring(pattern2.length, s.length)
case _ => path
}
}
Last attempt is to use an address list(already exists elsewhere but defined here as MWE) to remove occurrences from the path string and it doesn't work:
trait NormalizePath {
val replacements = (
"build.address.com/17808",
"source.address.com/7808/")
private def remove(path: String, string: String) = {
path-string
}
def normalizePath(path: String): String = {
replacements.foldLeft(path)(remove)
}
}
Appreciate any help on this!

If you are just stripping out those strings:
val replacements = Seq(
"build.address.com/17808",
"source.address.com/7808/")
replacements.foldLeft("http://localhost:9000/source.address.com/7808/project/repo"){
case(path, toReplace) => path.replaceAll(toReplace, "")
}
// http://localhost:9000/project/repo
If you are replacing those string by something else:
val replacementsMap = Seq(
"build.address.com/17808" -> "one",
"source.address.com/7808/" -> "two/")
replacementsMap.foldLeft("http://localhost:9000/source.address.com/7808/project/repo"){
case(path, (toReplace, replacement)) => path.replaceAll(toReplace, replacement)
}
// http://localhost:9000/two/project/repo
The replacements collection can come from elsewhere in the code and will not need to be redeployed.
// method replacing by empty string
def normalizePath(path: String) = {
replacements.foldLeft(path){
case(startingPoint, toReplace) => startingPoint.replaceAll(toReplace, "")
}
}
normalizePath("foobar/build.address.com/17808/project/repo")
// foobar/project/repo
normalizePath("whateverPath")
// whateverPath
normalizePath("build.address.com/17808build.address.com/17808/project/repo")
// /project/repo

A very simple replacement could be made as follows:
val replacements = Seq(
"build.address.com/17808",
"source.address.com/7808/")
def normalizePath(path: String): String = {
replacements.find(path.startsWith(_)) // find the first occurrence
.map(prefix => path.substring(prefix.length)) // remove the prefix
.getOrElse(path) // if not found, return the original string
}
Since the expected replacements are very similar, have you tried to generalize them and use regex matching?

There are a million and one ways to extract /project/repo from a String in Scala. Here are a few I came up with:
val list = List("build.address.com/17808", "source.address.com/7808") //etc
def normalizePath(path: String) = {
path.stripPrefix(list.find(x => path.contains(x)).getOrElse(""))
}
Output:
scala> normalizePath("build.address.com/17808/project/repo")
res0: String = /project/repo
val list = List("build.address.com/17808", "source.address.com/7808") //etc
def normalizePath(path: String) = {
list.map(x => if (path.contains(x)) {
path.takeRight(path.length - x.length)
}).filter(y => y != ()).head
}
Output:
scala> normalizePath("build.address.com/17808/project/repo")
res0: Any = /project/repo
val list = List("build.address.com/17808", "source.address.com/7808") //etc
def normalizePath(path: String) = {
list.foldLeft(path)((a, b) => a.replace(b, ""))
}
Output:
scala> normalizePath("build.address.com/17808/project/repo")
res0: String = /project/repo
Depends how complicated you want your code to look (or how silly you want to be), really. Note that the second example has return type Any, which might not be ideal for your scenario. Also, these examples aren't meant to be able to just take the String out of the middle of your path... they can be fairly easily modified if you want to do that though. Let me know if you want me to add some examples just stripping things like build.address.com/17808 out of a String - I'd be happy to do so.

How to decide whether to use Spark RDD filter or not

I am using spark to read and analyse a data file, file contains data like following:
1,unit1,category1_1,100
2,unit1,category1_2,150
3,unit2,category2_1,200
4,unit3,category3_1,200
5,unit3,category3_2,300
The file contains around 20 million records. If user input unit or category, spark need filter the data by inputUnit or inputCategory.
Solution 1:
sc.textFile(file).map(line => {
val Array(id,unit,category,amount) = line.split(",")
if ( (StringUtils.isNotBlank(inputUnit) && unit != inputUnit ) ||
(StringUtils.isNotBlank(inputCategory) && category != inputCategory)){
null
} else {
val obj = new MyObj(id,unit,category,amount)
(id,obj)
}
}).filter(_!=null).collectAsMap()
Solution 2:
var rdd = sc.textFile(file).map(line => {
val (id,unit,category,amount) = line.split(",")
(id,unit,category,amount)
})
if (StringUtils.isNotBlank(inputUnit)) {
rdd = rdd.filter(_._2 == inputUnit)
}
if (StringUtils.isNotBlank(inputCategory)) {
rdd = rdd.filter(_._3 == inputCategory)
}
rdd.map(e => {
val obj = new MyObject(e._1, e._2, e._3, e._4)
(e._1, obj)
}).collectAsMap
I want to understand, which solution is better, or both of them are poor? If both are poor, how to make a good one? Personally, I think second one is better, but I am not quite sure whether it is nice to declare a rdd as var... (I am new to Spark, and I am using Spark 1.5.0 and Scala 2.10.4 to write the code, this is my first time asking a question in StackOverFlow, feel free to edit if it is not well formatted) Thanks.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How can I retrieve the alias for a DataFrame in Spark - apache-spark

Related

Gatling Rest API Testing - retrieve a value from json response and add it to the list, iterate through list

BSON to Play JSON support for Long values

Failed to obtain broadcast value

Scala - Remove all elements in a list/map of strings from a single String

How to decide whether to use Spark RDD filter or not

Categories

Resources