I create a Spark application like below.
When run in local client mode, everything goes fine.
But when I submit it to YARN in cluster deploy mode on the prod environment, the variable applicationAction in the last match block is always null.
So is there any problem with how I'm using broadcast here, or is there another way to pass the variable to the last match block?
Thanks.
object SparkTask {

  private sealed trait AppAction {}
  case class Action1() extends AppAction
  case class Action2() extends AppAction

  def main(args: Array[String]): Unit = {
    var applicationAction: Broadcast[AppAction] = null
    val sparkSession = SparkSession.builder.appName("SparkTask").getOrCreate

    args(0) match {
      case "action-1" => applicationAction = sparkSession.sparkContext.broadcast(Action1())
      case "action-2" => applicationAction = sparkSession.sparkContext.broadcast(Action2())
      case _ => sys.exit(255)
    }

    // Here goes some df action and get a persisted dataset
    val df1 = ...
    val df2 = ...
    val df3 = ...

    applicationAction.value match {
      case Action1() => handleAction1(df3)
      case Action2() => handleAction2(df3)
    }
  }
}
The purpose of broadcast variables is to share some data with the executors.
I think in your use-case there are two possibilities:
You're trying to get some information from the executors back to the driver: for this you shouldn't use broadcast variables but accumulators, or something like take/collect.
You want to make a decision based on applicationAction.value (which is immutable): in this case you can use the value of args(0) directly, as sketched below.
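For the second case, a minimal sketch of the question's main method without the broadcast (reusing the question's AppAction hierarchy and handler functions) could look like this:

// Resolve the action once on the driver; no broadcast is needed because the
// decision is only taken in driver-side code, never inside an executor task.
val applicationAction: AppAction = args(0) match {
  case "action-1" => Action1()
  case "action-2" => Action2()
  case _          => sys.exit(255)
}

// Here goes some df action and get a persisted dataset, as in the question
val df3 = ...

applicationAction match {
  case Action1() => handleAction1(df3)
  case Action2() => handleAction2(df3)
}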
I'm writing a custom Spark Catalyst Expression with custom codegen, but it seems that Spark (3.0.0) doesn't want to use the generated code and falls back to interpreted mode.
I create my SparkSession in a pretty standard way, except that I try to force codegen:
val spark = SparkSession.builder()
  .appName("test-spark")
  .master("local[5]")
  .config("spark.sql.codegen.factoryMode", "CODEGEN_ONLY")
  .config("spark.sql.codegen.fallback", "false")
  .getOrCreate()
And then I have this custom Expression with both interpreted mode and codegen defined:
case class IsTrimmedExpr(child: Expression) extends UnaryExpression with ExpectsInputTypes {
  override def inputTypes: Seq[DataType] = Seq(StringType)
  override lazy val dataType: DataType = BooleanType

  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    throw new RuntimeException("expected code gen")
    nullSafeCodeGen(ctx, ev, input => s"($input.trim().equals($input))")
  }

  override protected def nullSafeEval(input: Any): Any = {
    throw new RuntimeException("should not eval")
    val str = input.asInstanceOf[org.apache.spark.unsafe.types.UTF8String]
    str.trim.equals(str)
  }
}
which I register into the session's registry:
spark.sessionState.functionRegistry.registerFunction(
  FunctionIdentifier("is_trimmed"), {
    case Seq(s) => IsTrimmedExpr(s)
  }
)
To invoke the function/Expression, I do
val df = Seq(" abc", "def", "56 ", " 123 ", "what is a trim").toDF("word")
df.selectExpr("word", "is_trimmed(word)").show()
But instead of the expected exception from the doGenCode function, I got the exception from the nullSafeEval function which should not be run at all.
How do I force Spark to use codegen mode?
Enabling codegen is done by setting spark.sql.codegen to true.
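If that setting applies to your Spark version, it is just one more line in the builder from the question (hedging: spark.sql.codegen is an older flag, so it is worth verifying that Spark 3.0.0 still honours it alongside the codegen.factoryMode and codegen.fallback options already set above):

val spark = SparkSession.builder()
  .appName("test-spark")
  .master("local[5]")
  .config("spark.sql.codegen", "true") // the setting suggested in this answer
  .config("spark.sql.codegen.factoryMode", "CODEGEN_ONLY")
  .config("spark.sql.codegen.fallback", "false")
  .getOrCreate()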
I'm doing some statistics using Spark Streaming and Cassandra. When reading Cassandra tables with the spark-cassandra-connector and turning the Cassandra row RDD into a DStream via ConstantInputDStream, the "CurrentDate" value in the where clause stays at the day the program started.
The purpose is to analyze the total score by some dimensions up to the current date, but right now the code only analyzes data up to the day it started running. I ran the code on 2019-05-25, and data inserted into the table after that time cannot be taken in.
The code I use is like below:
class TestJob extends Serializable {

  def test(ssc: StreamingContext): Unit = {
    val readTableRdd = ssc.cassandraTable(Configurations.getInstance().keySpace1, Constants.testTable)
      .select(
        "code",
        "date",
        "time",
        "score"
      ).where("date<= ?", new Utils().getCurrentDate())

    val DStreamRdd = new ConstantInputDStream(ssc, readTableRdd)

    DStreamRdd.foreachRDD { r =>
      //DO SOMETHING
    }
  }
}
object GetSSC extends Serializable {

  def getSSC(): StreamingContext = {
    val conf = new SparkConf()
      .setMaster(Configurations.getInstance().sparkHost)
      .setAppName(Configurations.getInstance().appName)
      .set("spark.cassandra.connection.host", Configurations.getInstance().casHost)
      .set("spark.cleaner.ttl", "3600")
      .set("spark.default.parallelism", "3")
      .set("spark.ui.port", "5050")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    @transient lazy val ssc = new StreamingContext(sc, Seconds(30))
    ssc
  }
}
object Main {

  val logger: Log = LogFactory.getLog(Main.getClass)

  def main(args: Array[String]): Unit = {
    val ssc = GetSSC.getSSC()
    try {
      new TestJob().test(ssc)
      ssc.start()
      ssc.awaitTermination()
    } catch {
      case e: Exception =>
        logger.error(Main.getClass.getSimpleName + " error: " + e.printStackTrace())
    }
  }
}
Table used in this Demo like:
CREATE TABLE test.test_table (
code text PRIMARY KEY, //UUID
date text, // '20190520'
time text, // '12:00:00'
score int); // 90
Any help is appreciated!
In general, the RDDs returned by the Spark Cassandra Connector aren't streaming RDDs - there is no functionality in Cassandra that lets you subscribe to a change feed and analyze it. You can implement something like that by explicitly looping and fetching the data, but it requires careful design of the tables, and it's hard to say more without digging deeper into the requirements for latency, amount of data, etc.
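For example, one way to do that explicit looping with the code from the question is to build the Cassandra RDD inside foreachRDD, so the where clause (and getCurrentDate()) is re-evaluated on every batch instead of once when the DStream is defined. A rough sketch, reusing the question's Configurations, Constants and Utils helpers:

class TestJob extends Serializable {

  def test(ssc: StreamingContext): Unit = {
    // An empty RDD only serves as a per-batch trigger; the real read happens below.
    val trigger = new ConstantInputDStream(ssc, ssc.sparkContext.emptyRDD[Unit])

    trigger.foreachRDD { _ =>
      // foreachRDD bodies run on the driver, so the table scan is rebuilt each batch
      // and the date filter moves forward with the current date.
      val readTableRdd = ssc.cassandraTable(Configurations.getInstance().keySpace1, Constants.testTable)
        .select("code", "date", "time", "score")
        .where("date <= ?", new Utils().getCurrentDate())
      // DO SOMETHING with readTableRdd
    }
  }
}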
Getting a null pointer error when unit testing a controller. The issue seems to be in these lines:
def signupUser = Action.async {
  implicit request => { // request seems to be null
I suspect so because the stack trace from previous tests points to the implicit request line. But I don't know what could be wrong here, because I am using FakeRequest like so: val request = FakeRequest("POST", "/ws/users/signup").withJsonBody(Json.parse("""{"bad": "field"}"""))
Following is a snippet of a controller I want to unit-test
class UserController @Inject()(userRepo: UsersRepository, cc: ControllerComponents, silhouette: Silhouette[JWTEnv])(implicit exec: ExecutionContext) extends AbstractController(cc) {
  def signupUser = Action.async {
    implicit request => {...}
  }
}
I only want to test that the controller returns an error when it gets a request without a JSON body. Thus I don't need Silhouette, and I want to mock it. But I am getting a null pointer error.
Following is how I have written my unit test case:
class UserControllerUnitSpec extends PlaySpec with MockitoSugar {
  "User signup request with non-JSON body" should {
    "return 400 (Bad Request) and the validation text 'Incorrect body type. Body type must be JSON'" in {
      val email = "d@d.com"
      val loginInfo = LoginInfo(CredentialsProvider.ID, email);
      val passwordInfo = PasswordInfo("someHasher", "somePassword", Some("someSalt"))
      val internalUserProfile = InternalUserProfile(loginInfo, true, Some(passwordInfo))
      val externalUserProfile = ExternalUserProfile(email, "d", "d", Some("somePassword"))
      val userProfile = UserProfile(Some(internalUserProfile), externalUserProfile)
      val user = User(UUID.randomUUID(), userProfile)
      println("testing with mocked User value", user);

      val mockUserRepository = mock[UsersRepository]
      when(mockUserRepository.findUser(loginInfo)).thenReturn(Future(Some(user)))
      when(mockUserRepository.saveUser(user)).thenReturn(Future(Some(user)))

      val mockSilhouette = mock[Silhouette[JWTEnv]] // I am probably not doing this correctly
      val mockControllerComponents = mock[ControllerComponents] // I am not sure if this is correct either

      val controller = new UserController(mockUserRepository, mockControllerComponents, mockSilhouette)
      val result: Future[Result] = controller.signupUser(FakeRequest())

      (result.map(response => {
        println("response: ", response)
        response mustBe BadRequest
      }))
    }
  }
}
Regarding mockControllerComponents, Helpers.stubControllerComponents can be used instead of a mock:
val mockControllerComponents = Helpers.stubControllerComponents()
Regarding mockSilhouette, you have to set up the mock using when(...).thenReturn(...) similarly to how you have done it for mockUserRepository, that is, inspect all the usages of silhouette inside signupUser and provide the appropriate method stubs:
val mockSilhouette = mock[Silhouette[JWTEnv]]
when(mockSilhouette.foo(...)).thenReturn(...)
when(mockUserRepository.bar(...)).thenReturn(...)
...
(Posted solution on behalf of the question author).
Here is the answer which worked. Thanks Mario.
class UserControllerUnitSpec extends PlaySpec /*with MockitoSugar*/ {
  "User signup request with non-JSON body" should {
    "return 400 (Bad Request) and the validation text 'Incorrect body type. Body type must be JSON'" in {
      val email = "d@d.com"
      val loginInfo = LoginInfo(CredentialsProvider.ID, email);
      val passwordInfo = PasswordInfo("someHasher", "somePassword", Some("someSalt"))
      val internalUserProfile = InternalUserProfile(loginInfo, true, Some(passwordInfo))
      val externalUserProfile = ExternalUserProfile(email, "d", "d", Some("somePassword"))
      val userProfile = UserProfile(Some(internalUserProfile), externalUserProfile)
      val user = User(UUID.randomUUID(), userProfile)
      println("testing with mocked User value", user);

      val mockUserRepository = mock(classOf[UsersRepository])
      // when(mockUserRepository.findUser(loginInfo)).thenReturn(Future(Some(user)))
      // when(mockUserRepository.saveUser(user)).thenReturn(Future(Some(user)))
      // val mockSilhouette = mock(classOf[Silhouette[JWTEnv]])
      val mockControllerComponents = Helpers.stubControllerComponents() // mock(classOf[ControllerComponents])

      /*
        The controller needs Silhouette. Using Silhouette's test kit to create fake instances.
        If you would like to test this controller, you must provide an environment that can handle your Identity and Authenticator implementation.
        For this case Silhouette provides a FakeEnvironment which automatically sets up all components needed to test your specific actions.
        You must only specify one or more LoginInfo -> Identity pairs that should be returned by calling request.identity in your action and
        the authenticator instance that tracks this user.
      */
      // User extends the Identity trait
      /*
        Under the hood, the environment instantiates a FakeIdentityService which stores your given identities and returns them if needed.
        It also instantiates the appropriate AuthenticatorService based on your defined Authenticator type. All Authenticator services are real
        service instances set up with their default values and dependencies.
      */
      implicit val sys = ActorSystem("MyTest")
      implicit val mat = ActorMaterializer()
      implicit val env = FakeEnvironment[JWTEnv](Seq(loginInfo -> user))

      val defaultParser = new mvc.BodyParsers.Default()
      val securedAction = new DefaultSecuredAction(new DefaultSecuredRequestHandler(new DefaultSecuredErrorHandler(stubMessagesApi())), defaultParser)
      val unsecuredAction = new DefaultUnsecuredAction(new DefaultUnsecuredRequestHandler(new DefaultUnsecuredErrorHandler(stubMessagesApi())), defaultParser)
      val userAware = new DefaultUserAwareAction(new DefaultUserAwareRequestHandler(), defaultParser)
      val mockSilhouette = new SilhouetteProvider[JWTEnv](env, securedAction, unsecuredAction, userAware)

      val controller = new UserController(mockUserRepository, mockControllerComponents, mockSilhouette)
      val request = FakeRequest("POST", "ws/users/signup")
      println("sending request", request)
      //val result = controller.someMethod()
      val result: Future[Result] = controller.signupUser(request)
      status(result) mustBe BAD_REQUEST
    }
  }
}
I'm using a LongAccumulator to count the number of records that I save in Cassandra.
object Main extends App {
  val conf = args(0)
  val ssc = StreamingContext.getStreamingContext(conf)
  Runner.apply(conf).startJob(ssc)
  StreamingContext.startStreamingContext(ssc)
  StreamingContext.stopStreamingContext(ssc)
}

class Runner(conf: Conf) {

  override def startJob(ssc: StreamingContext): Unit = {
    accTotal = ssc.sparkContext.longAccumulator("total")
    val inputKafka = createDirectStream(ssc, kafkaParams, topicsSet)
    val rddAvro = inputKafka.map { x => x.value() }
    saveToCassandra(rddAvro)
    println("XXX:" + accTotal.value) // --> 0
  }

  def saveToCassandra(upserts: DStream[Data]) = {
    val rddCassandraUpsert = upserts.map { record =>
      accTotal.add(1)
      println("ACC: " + accTotal.value) // --> 1, 2, 3, 4... OK. Spark web UI, OK too.
      DataExt(record.data, record.data1)
    }
    rddCassandraUpsert.saveToCassandra(keyspace, table)
  }
}
I see that the code executes correctly and I save data in Cassandra, but when I finally print the accumulator its value is 0; if I print it inside the map function I can see the right values. Why?
I'm using Spark 2.0.2 and executing from IntelliJ in local mode. I have checked the Spark web UI and I can see the accumulator being updated there.
The problem is probably here:
object Main extends App {
...
Spark doesn't support applications extending App; doing so can result in non-deterministic behavior:
Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
You should always use standard applications with main:
object Main {
  def main(args: Array[String]) {
    ...
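Applied to the code in the question, that is just a matter of moving the body into main (keeping the question's StreamingContext and Runner helpers):

object Main {
  def main(args: Array[String]): Unit = {
    val conf = args(0)
    val ssc = StreamingContext.getStreamingContext(conf)
    Runner.apply(conf).startJob(ssc)
    StreamingContext.startStreamingContext(ssc)
    StreamingContext.stopStreamingContext(ssc)
  }
}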
I'm using Spark 2.0.2. I have a DataFrame that has an alias on it, and I'd like to be able to retrieve that. A simplified example of why I'd want that is below.
def check(ds: DataFrame) = {
  assert(ds.count > 0, s"${ds.getAlias} has zero rows!")
}
The above code of course fails because DataFrame has no getAlias function. Is there a way to do this?
You can try something like this, but I wouldn't go so far as to claim it is supported:
Spark < 2.1:
import org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias
import org.apache.spark.sql.Dataset
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
  case SubqueryAlias(alias, _) => Some(alias)
  case _ => None
}
Spark 2.1+:
def getAlias(ds: Dataset[_]) = ds.queryExecution.analyzed match {
  case SubqueryAlias(alias, _, _) => Some(alias)
  case _ => None
}
Example usage:
val plain = Seq((1, "foo")).toDF
getAlias(plain)
Option[String] = None
val aliased = plain.alias("a dataset")
getAlias(aliased)
Option[String] = Some(a dataset)
Disclaimer: as stated above, this code relies on undocumented APIs subject to change. It works as of Spark 2.3.
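With such a helper in scope, the check function from the question could, for example, become:

def check(ds: DataFrame): Unit = {
  // Fall back to a placeholder name when the Dataset has no alias.
  val name = getAlias(ds).getOrElse("unnamed dataset")
  assert(ds.count > 0, s"$name has zero rows!")
}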
After much digging into mostly undocumented Spark methods, here is the full code to pull the list of fields, along with the table alias for a dataframe in PySpark:
def schema_from_plan(df):
    plan = df._jdf.queryExecution().analyzed()
    all_fields = _schema_from_plan(plan)
    iterator = plan.output().iterator()
    output_fields = {}
    while iterator.hasNext():
        field = iterator.next()
        queryfield = all_fields.get(field.exprId().id(), {})
        if not queryfield == {}:
            tablealias = queryfield["tablealias"]
        else:
            tablealias = ""
        output_fields[field.exprId().id()] = {
            "tablealias": tablealias,
            "dataType": field.dataType().typeName(),
            "name": field.name()
        }
    return list(output_fields.values())

def _schema_from_plan(root, tablealias=None, fields={}):
    iterator = root.children().iterator()
    while iterator.hasNext():
        node = iterator.next()
        nodeClass = node.getClass().getSimpleName()
        if nodeClass == "SubqueryAlias":
            # get the alias and process the subnodes with this alias
            _schema_from_plan(node, node.alias(), fields)
        else:
            if tablealias:
                # add all the fields, along with the unique IDs, and a new tablealias field;
                # use a separate iterator here so the outer loop's iterator isn't clobbered
                field_iterator = node.output().iterator()
                while field_iterator.hasNext():
                    field = field_iterator.next()
                    fields[field.exprId().id()] = {
                        "tablealias": tablealias,
                        "dataType": field.dataType().typeName(),
                        "name": field.name()
                    }
            _schema_from_plan(node, tablealias, fields)
    return fields

# example: fields = schema_from_plan(df)
For Java:
As @veinhorn mentioned, it is also possible to get the alias in Java. Here is a utility method example:
public static <T> Optional<String> getAlias(Dataset<T> dataset) {
    final LogicalPlan analyzed = dataset.queryExecution().analyzed();
    if (analyzed instanceof SubqueryAlias) {
        SubqueryAlias subqueryAlias = (SubqueryAlias) analyzed;
        return Optional.of(subqueryAlias.alias());
    }
    return Optional.empty();
}