How can a CDI bean be stored for the entire duration of a batch job?
I have a chunk job that I would like to monitor. I thought I could do it with a bean instance scoped to the entire duration of the job.
It's not really feasible to do this. Serialization doesn't work quite the way most people think for CDI.
What I would recommend is having your batch job fire events at various points and using CDI to observe those events and take action.
You could also develop a custom scope for the batch and use it that way, but if you end up with multiple nodes executing the batch job it won't work quite right.
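A minimal sketch of the event approach (class and event names are illustrative, assuming a JSR-352 chunk step running in a container with CDI available):

import java.util.List;
import javax.batch.api.chunk.listener.AbstractItemWriteListener;
import javax.enterprise.context.ApplicationScoped;
import javax.enterprise.event.Event;
import javax.enterprise.event.Observes;
import javax.inject.Inject;
import javax.inject.Named;

// Hypothetical event payload (its own file in practice)
class ChunkWritten {
    final int itemCount;
    ChunkWritten(int itemCount) { this.itemCount = itemCount; }
}

// Registered as a chunk listener in the job XML; fires a CDI event after each chunk write
@Named
public class MonitoringWriteListener extends AbstractItemWriteListener {

    @Inject
    Event<ChunkWritten> chunkWritten;

    @Override
    public void afterWrite(List<Object> items) {
        chunkWritten.fire(new ChunkWritten(items.size()));
    }
}

// Any CDI bean can observe the events and aggregate monitoring state
@ApplicationScoped
class JobMonitor {
    private int totalItems;

    void onChunkWritten(@Observes ChunkWritten event) {
        totalItems += event.itemCount;
    }
}

How the monitor exposes its state (logging, JMX, a REST endpoint) is then independent of the batch runtime.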
Using the DataFrame.show() API, we can take a glance at the underlying data.
Is it good to use this method in a production Spark job?
Basically, I know we can comment out this kind of code before launching the job, but if we just keep it, is it a good practice?
Or will it cause a performance issue?
The show() command is an action.
Adding an unnecessary action to the code might disturb the Spark optimizer: the optimizer can change the order of the transformations, but it must trigger execution every time there is an action.
In other words, using unnecessary actions limits the optimizer's work.
See Actions vs Transformations
No, it's not a good method. Spark is a lazy evaluator, which means that execution won't start until necessary. It creates a Directed Acyclic Graph to keep track of the requested operations in order, but it won't execute anything until an action is called. Hence unnecessarily calling actions like show should be avoided.
The show() command is an action, so we should not use it in production code, as it would materialize the computation unnecessarily and ultimately slow down the job to an extent.
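One common way to keep show() out of the production path is to guard it behind a debug switch so no extra job is triggered in normal runs. A sketch in the Java API (the flag name and paths are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ShowGuardExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("show-guard").getOrCreate();
        Dataset<Row> df = spark.read().parquet("/path/to/input"); // made-up input path

        // show() is an action and triggers its own job; keep it behind a debug switch
        boolean debug = Boolean.parseBoolean(spark.conf().get("spark.myapp.debug", "false"));
        if (debug) {
            df.show(20, false);
        }

        df.write().parquet("/path/to/output"); // the one action production actually needs
        spark.stop();
    }
}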
I wonder if anyone has any experience with fish tagging* Flink batch runs.
*Just as a fish can be tagged and have its movement tracked, stamping log events with a common tag or set of data elements allows the complete flow of a transaction or a request to be tracked. We call this Fish Tagging.
source
Specifically, I would like to make sure that a batch ID is added to each line of the log which has anything to do with that particular batch execution. This will allow me to track batches in Kibana.
I don't see how using log4j's MDC would propagate through multiple Flink nodes, and using a system property lookup to inject an ID through VM params would not allow me to run batches concurrently (would it?).
Thanks in advance for any pointers
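To make the MDC concern concrete, here is a minimal sketch (illustrative names, assuming the DataSet-era RichMapFunction API and the slf4j/log4j MDC): because the MDC is per thread/JVM, it does not travel with the job, so a batch ID would have to be re-set inside each task, for example from a global job parameter in open():

import java.util.Map;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.slf4j.MDC;

public class TaggedMapper extends RichMapFunction<String, String> {

    @Override
    public void open(Configuration parameters) {
        // Global job parameters are set on the client, e.g.
        // env.getConfig().setGlobalJobParameters(ParameterTool.fromArgs(args));
        Map<String, String> params =
                getRuntimeContext().getExecutionConfig().getGlobalJobParameters().toMap();
        // "batch.id" and "batchId" are made-up names; the MDC must be populated on every
        // task, not just on the client that submits the job
        MDC.put("batchId", params.getOrDefault("batch.id", "unknown"));
    }

    @Override
    public String map(String value) {
        // log lines from here can include the tag via the pattern layout, e.g. %X{batchId}
        return value;
    }
}

Because the MDC value is set per task thread, concurrently running batches with different IDs would not interfere with each other, which a single JVM system property could not guarantee.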
So I'm asking if anyone knows a way to change Spark properties (e.g. spark.executor.memory, spark.shuffle.spill.compress, etc.) during runtime, so that a change may take effect between the tasks/stages during a job...
So I know that...
1) The documentation for Spark 2.0+ (and previous versions too) states that once the Spark Context has been created, it can't be changed at runtime.
2) There is SparkSession.conf.set, which may change a few things for SQL, but I was looking for more general, all-encompassing configurations.
3) I could start a new context in the program with new properties, but the case here is to actually tune the properties once a job is already executing.
Ideas...
1) Would killing an Executor force it to read a configuration file again, or does it just get what was already configured at the beginning of the job?
2) Is there any command to force a "refresh" of the properties in the Spark context?
So hoping there might be a way or other ideas out there (thanks in advance)...
After submitting the Spark application, we can change a few parameter values at runtime, and a few we cannot.
By using the spark.conf.isModifiable() method, we can check whether a parameter value can be modified at runtime or not. If it returns true, then we can modify the parameter value; otherwise, we can't modify it at runtime.
Examples:
>>> spark.conf.isModifiable("spark.executor.memory")
False
>>> spark.conf.isModifiable("spark.sql.shuffle.partitions")
True
So based on the above testing, we can't modify the spark.executor.memory parameter value at runtime.
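For completeness, the same check can gate a runtime change in the Java API as well; a sketch, assuming a Spark version that exposes RuntimeConfig.isModifiable:

import org.apache.spark.sql.SparkSession;

public class RuntimeConfCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("conf-check").getOrCreate();
        // Only runtime-modifiable (mostly SQL) settings can be changed on a live session.
        if (spark.conf().isModifiable("spark.sql.shuffle.partitions")) {
            spark.conf().set("spark.sql.shuffle.partitions", "400"); // affects stages planned after this point
        }
        spark.stop();
    }
}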
No, it is not possible to change settings like spark.executor.memory at runtime.
In addition, there are probably not too many great tricks in the direction of 'quickly switching to a new context', as the strength of Spark is that it can pick up data and keep going. What you are essentially asking for is a map-reduce framework. Of course you could rewrite your job into this structure, and divide the work across multiple Spark jobs, but then you would lose some of the ease and performance that Spark brings. (Though possibly not all.)
If you really think the request makes sense on a conceptual level, you could consider making a feature request. This can be through your Spark supplier, or directly by logging a Jira on the Apache Spark project.
I've come across a situation where I'd like to do a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately, I have some existing Spark pipelines (potentially DataFrames) that I could reuse.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Not considering the performance implications, is this even possible?
Is it possible to get and use a JavaSparkContext from within a task?
No. The spark context is only valid on the driver and Spark will prevent serialization of it. Therefore it's not possible to use the Spark context from within a task.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Without more details, my umbrella answer would be: Probably not a good idea.
Not considering the performance implications, is this even possible?
Yes, probably by bringing the base collection to the driver (collect) and iterating over it. If that collection doesn't fit in the driver's memory, please see the previous point.
If we need to process every record, consider performing some form of join with the 'decorating' dataset - that will be only 1 large job instead of tons of small ones.
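A sketch of that join-based decoration in the Java Dataset API (class and column names are made up):

import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class DecorateWithJoin {

    // Decorate every incoming record in one job by joining against the lookup dataset,
    // instead of trying to launch a Spark job per record from inside a task.
    public static Dataset<Row> decorate(Dataset<Row> incoming, Dataset<Row> lookup) {
        return incoming.join(
                broadcast(lookup), // broadcast only if the lookup side is small enough
                incoming.col("lookupKey").equalTo(lookup.col("key")), // assumed key columns
                "left_outer");
    }
}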
In a web app, I need to display a view of 6 objects (rows of a DB table) per page via JSF. To advance to the next view, another 6 different random objects are to be displayed, and so on...
So I was thinking of having a @Singleton that queries all the rows of the table with a @Schedule job at a given interval, let's say every hour. It will have a getCollection() method.
Then every visitor will have a @SessionScoped CDI bean that will query the Collection of the @Singleton, then shuffle it to make a random view for the specific user.
With many visits, many CDI beans will be created that will access the getCollection() method concurrently.
Is this approach correct? Are any specific annotations needed for this case? Any other way of doing this?
-----UPDATE---
After speaking with friends, especially Luiggi Mendoza, they tell me that the best thing here is to use Ehcache or similar, instead of a Singleton. I think that's the way.
Will you have a cluster of webservers? In that case you need a distributed cache or you need to update state through the database.
Otherwise I would just go for a simple map in a bean that is @ApplicationScoped.
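A minimal sketch of that idea (entity and query names are made up; assumes a container where @PersistenceContext works in CDI beans):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.annotation.PostConstruct;
import javax.enterprise.context.ApplicationScoped;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@ApplicationScoped
public class ItemCache {

    @PersistenceContext
    private EntityManager em; // only used at construction time here

    private volatile List<Item> items = new ArrayList<>(); // Item is the (assumed) JPA entity

    @PostConstruct
    void load() {
        items = em.createQuery("SELECT i FROM Item i", Item.class).getResultList();
    }

    // Each caller gets its own shuffled copy, so concurrent readers never share mutable state.
    public List<Item> randomSix() {
        List<Item> copy = new ArrayList<>(items);
        Collections.shuffle(copy);
        return copy.subList(0, Math.min(6, copy.size()));
    }
}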
I use JPA in a solution where a big bunch of data is almost always the same. There's more than one Tomcat involved, so a pure cache in a bean that is @ApplicationScoped won't work. To fix it, I have no second-level cache and instead cache the results of the database queries. That means each Tomcat has its own cache.
For every login a timestamp is read; if the data in the cache is not stale, it is used. Otherwise the cache is updated. The timestamp is updated when changes occur, with the help of database triggers.
// JPA lifecycle callbacks on the entity (or an entity listener)
@PrePersist
@PreUpdate
@PreRemove
public void newTimeStamp() {
    // save a new timestamp
}
@Singleton is not part of the CDI specification, so I would refrain from using it.
Also I would keep my client bean @RequestScoped and reload the 6 objects with @PostConstruct (sketched below). That way you will get a fresh 6 for every request.
Or if that is too short-lived, maybe @ViewScoped (requires MyFaces CODI), @ConversationScoped or @ViewAccessScoped (requires MyFaces CODI).
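For completeness, a sketch of the @RequestScoped variant (names are made up; ItemCache is the cache bean sketched earlier):

import java.util.List;
import javax.annotation.PostConstruct;
import javax.enterprise.context.RequestScoped;
import javax.inject.Inject;
import javax.inject.Named;

@Named
@RequestScoped
public class RandomItemsBean {

    @Inject
    private ItemCache itemCache; // the @ApplicationScoped cache sketched earlier

    private List<Item> six;

    @PostConstruct
    void init() {
        // a fresh random selection for every request
        six = itemCache.randomSix();
    }

    public List<Item> getSix() {
        return six; // referenced from the JSF page, e.g. #{randomItemsBean.six}
    }
}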