SparkListener not working in Spark on YARN cluster? - apache-spark

My main purpose is to get the appId after submitting a yarn-cluster task through Java code, so that I can use it for further business operations.
I add --conf spark.extraListeners=Mylistener to the submit arguments.
While SparkListener does work when I use Spark in standalone mode, it doesn't work when I run Spark on a cluster over YARN. Is it possible for SparkListener to work when running over YARN? If so, what steps do I need to take to enable that?
Here is the Mylistener class code:
public class Mylistener extends SparkListener {
    private static final Logger logger = LoggerFactory.getLogger(Mylistener.class);

    @Override
    public void onApplicationStart(SparkListenerApplicationStart sparkListenerApplicationStart) {
        Option<String> appId = sparkListenerApplicationStart.appId();
        EnvelopeSubmit.appId = appId.get();
        logger.info("====================start");
    }

    @Override
    public void onBlockManagerAdded(SparkListenerBlockManagerAdded blockManagerAdded) {
        logger.info("=====================add");
    }
}
Here is the Main class to submit the application:
public static void main(String[] args) {
    String jarpath = args[0];
    String childArg = args[1];
    System.out.println("jarpath:" + jarpath);
    System.out.println("childArg:" + childArg);
    System.setProperty("HADOOP_USER_NAME", "hdfs");
    String[] arg = {"--verbose=true", "--class=com.cloudera.labs.envelope.EnvelopeMain",
            "--master=yarn", "--deploy-mode=cluster", "--conf=spark.extraListeners=Mylistener",
            "--conf", "spark.eventLog.enabled=true",
            "--conf", "spark.yarn.jars=hdfs://192.168.6.188:8020/user/hdfs/lib/*", jarpath, childArg};
    SparkSubmit.main(arg);
}

If you just want to get the app id, you can simply do this:
logger.info(s"Application id: ${sparkSession.sparkContext.applicationId}")
Hope this answers your question!

Related

How to prevent duplicate tasks when running the same IScheduledExecutorService on apps in a cluster?

I want to understand the difference between the Hazelcast IScheduledExecutorService methods so that I can prevent duplicate tasks.
I have two Java apps, each with its own HazelcastInstance, so together they form a Hazelcast cluster with two HazelcastInstances (servers).
I use an IMap and want to reset the AtomicLong values every midnight.
config.getScheduledExecutorConfig("my scheduler")
      .setPoolSize(16)
      .setCapacity(100)
      .setDurability(1);

class DelayedResetTask implements Runnable, HazelcastInstanceAware, Serializable {
    static final long serialVersionUID = -7588380448693010399L;
    private transient HazelcastInstance client;

    @Override
    public void run() {
        final IMap<Long, AtomicLong> map = client.getMap(HazelcastConfiguration.mapName);
        final ILogger logger = client.getLoggingService().getLogger(HazelcastInstance.class);
        logger.info("Show data in cache before reset: " + map.entrySet());
        map.keySet().forEach(key -> map.put(key, new AtomicLong(0)));
        logger.info("Data was reset: " + map.entrySet());
    }

    @Override
    public void setHazelcastInstance(HazelcastInstance hazelcastInstance) { this.client = hazelcastInstance; }
}

private void resetAtMidnight() {
    final Long midnight = LocalDateTime.now().until(LocalDate.now().plusDays(1).atStartOfDay(), ChronoUnit.MINUTES);
    executor.scheduleAtFixedRate(new DelayedResetTask(), midnight, TimeUnit.DAYS.toMinutes(1), TimeUnit.MINUTES);
}
I don't want to execute this task on each instance in parallel. After reading the documentation I still don't understand how I can execute the reset once for both servers (without duplicate tasks and without it running on both servers at the same time).
Which method should I use for my task: scheduleOnAllMembersAtFixedRate, scheduleAtFixedRate, or scheduleOnMembersAtFixedRate?
You need to run your code only once in the cluster, since the map you are resetting can be accessed from any member. Both members access the same map instance; only the entries are kept on different members. You can use scheduleAtFixedRate to run it once.
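Note that each call to scheduleAtFixedRate registers its own task, so if both apps run the same scheduling code at startup you can still end up with two schedules. Hazelcast's scheduled executor lets you guard against that by naming the task; a minimal sketch, reusing the "my scheduler" executor and DelayedResetTask from the question (the method and task names here are just placeholders):
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.scheduledexecutor.DuplicateTaskException;
import com.hazelcast.scheduledexecutor.IScheduledExecutorService;
import com.hazelcast.scheduledexecutor.TaskUtils;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.TimeUnit;

void resetAtMidnightOnce(HazelcastInstance hazelcastInstance) {
    IScheduledExecutorService executor = hazelcastInstance.getScheduledExecutorService("my scheduler");
    long minutesUntilMidnight = LocalDateTime.now()
            .until(LocalDate.now().plusDays(1).atStartOfDay(), ChronoUnit.MINUTES);
    try {
        // The name makes the schedule unique cluster-wide; the second app that
        // tries to register it simply gets a DuplicateTaskException.
        executor.scheduleAtFixedRate(TaskUtils.named("midnight-reset", new DelayedResetTask()),
                minutesUntilMidnight, TimeUnit.DAYS.toMinutes(1), TimeUnit.MINUTES);
    } catch (DuplicateTaskException e) {
        // Already scheduled by the other member; nothing to do.
    }
}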
Additionally, you do not need to call IMap#keySet().forEach() to traverse over all entries in the map. Instead, you can use EntryProcessor as below:
public static class DelayedResetTask implements Runnable, HazelcastInstanceAware, Serializable {
    static final long serialVersionUID = -7588380448693010399L;
    private transient HazelcastInstance client;

    @Override
    public void run() {
        final IMap<Long, AtomicLong> map = client.getMap(HazelcastConfiguration.mapName);
        final ILogger logger = client.getLoggingService().getLogger(HazelcastInstance.class);
        logger.info("Show data in cache before reset: " + map.entrySet());
        map.executeOnEntries(new AbstractEntryProcessor() {
            @Override
            public Object process(Map.Entry entry) {
                entry.setValue(new AtomicLong(0));
                return null;
            }
        });
        logger.info("Data was reset: " + map.entrySet());
    }

    @Override
    public void setHazelcastInstance(HazelcastInstance hazelcastInstance) { this.client = hazelcastInstance; }
}

Cluster of two different machines with high availability?

I have two different machines: one is configured with IP 192.168.2.100 and the other one with 192.168.2.101.
This is the code of the first verticle:
public class Sender extends AbstractVerticle {
    public static void main(String... args) {
        // Cluster manager configuration
        Config config = new Config();
        config.getNetworkConfig().getJoin().getTcpIpConfig().addMember("192.168.2.101");
        VertxOptions options = new VertxOptions();
        options.setClusterManager(new HazelcastClusterManager(config));
        options.setClusterHost("192.168.2.100");
        options.setClustered(true);
        options.setHAEnabled(true);
        Vertx.clusteredVertx(options, vertx ->
            vertx.result().deployVerticle(Sender.class.getName(), new DeploymentOptions().setHa(true))
        );
    }

    @Override
    public void start() throws Exception {
        vertx.setPeriodic(5000, id -> {
            vertx.eventBus().send("Address", "message", rep -> {
                System.out.println("response : " + rep.result().body());
            });
        });
    }
}
And this is the code of the second verticle:
package com.vetx.Consumer;

import com.hazelcast.config.Config;
import io.vertx.core.AbstractVerticle;
import io.vertx.core.DeploymentOptions;
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.spi.cluster.hazelcast.HazelcastClusterManager;

public class Consumer extends AbstractVerticle {
    private String name = null;

    public Consumer(String name) {
        this.name = name;
    }

    public Consumer() {
    }

    public static void main(String... args) {
        // Cluster manager configuration
        Config config = new Config();
        config.getNetworkConfig().getJoin().getTcpIpConfig().addMember("192.168.2.100");
        VertxOptions options = new VertxOptions();
        options.setClusterManager(new HazelcastClusterManager(config));
        options.setClusterHost("192.168.2.101");
        options.setClustered(true);
        options.setHAEnabled(true);
        Vertx.clusteredVertx(options, vertx ->
            vertx.result().deployVerticle(Consumer.class.getName(), new DeploymentOptions().setHa(true))
        );
    }

    @Override
    public void start() throws Exception {
        vertx.eventBus().consumer("Address", message -> {
            System.out.println(" received message: " + message.body());
            message.reply("Success");
        });
    }
}
I am trying to use high availability with the cluster: a consumer that consumes a message and a sender that sends it. When I kill the sender so that its verticle gets redeployed after failover, I get the following exception:
SEVERE: Failed to redeploy verticle after failover
java.lang.ClassNotFoundException: com.vetx.Sender.Sender
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at io.vertx.core.impl.JavaVerticleFactory.createVerticle(JavaVerticleFactory.java:37)
at io.vertx.core.impl.DeploymentManager.createVerticles(DeploymentManager.java:229)
at io.vertx.core.impl.DeploymentManager.lambda$doDeployVerticle$2(DeploymentManager.java:202)
at io.vertx.core.impl.FutureImpl.setHandler(FutureImpl.java:76)
at io.vertx.core.impl.DeploymentManager.doDeployVerticle(DeploymentManager.java:171)
at io.vertx.core.impl.DeploymentManager.doDeployVerticle(DeploymentManager.java:143)
at io.vertx.core.impl.DeploymentManager.deployVerticle(DeploymentManager.java:131)
at io.vertx.core.impl.HAManager.doDeployVerticle(HAManager.java:281)
at io.vertx.core.impl.HAManager.processFailover(HAManager.java:553)
at io.vertx.core.impl.HAManager.checkFailover(HAManager.java:489)
at io.vertx.core.impl.HAManager.nodeLeft(HAManager.java:309)
at io.vertx.core.impl.HAManager.access$100(HAManager.java:102)
at io.vertx.core.impl.HAManager$1.nodeLeft(HAManager.java:152)
at io.vertx.spi.cluster.hazelcast.HazelcastClusterManager.memberRemoved(HazelcastClusterManager.java:325)
at com.hazelcast.internal.cluster.impl.ClusterServiceImpl.dispatchEvent(ClusterServiceImpl.java:916)
at com.hazelcast.internal.cluster.impl.ClusterServiceImpl.dispatchEvent(ClusterServiceImpl.java:88)
at com.hazelcast.spi.impl.eventservice.impl.LocalEventDispatcher.run(LocalEventDispatcher.java:56)
at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:217)
at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:200)
Both of the cluster members should have the class com.vetx.Sender.Sender on their classpath, since it will be serialized/deserialized on both sides. It seems one of the members doesn't have it.
Also, did you make sure the members can form a cluster of two? I see your network config has only one IP on each host (getTcpIpConfig().addMember("192.168.2.101")); best practice is to add all IPs on all hosts, i.e. to have an identical network config on all hosts, to avoid confusion.
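For illustration, a minimal sketch of an identical join config both machines could share (the IPs are taken from the question; the helper method name is a placeholder):
import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;

static Config clusterConfig() {
    Config config = new Config();
    JoinConfig join = config.getNetworkConfig().getJoin();
    join.getMulticastConfig().setEnabled(false);   // rely on TCP/IP discovery only
    join.getTcpIpConfig().setEnabled(true)
        .addMember("192.168.2.100")
        .addMember("192.168.2.101");               // same member list on both machines
    return config;
}
Both Sender and Consumer could then pass this config to new HazelcastClusterManager(clusterConfig()).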

Pass parameters to the jar when using spark launcher

I am trying to create an executable jar which uses the Spark launcher to run another jar with a data transformation task (that jar creates the Spark session).
I need to pass Java parameters (some Java arrays) to the jar which is executed by the launcher.
object launcher {
  @throws[Exception]
  // How do I pass parameters to spark_job_with_spark_session.jar?
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setAppResource("spark_job_with_spark_session.jar")
      .setVerbose(true)
      .setMaster("local[*]")
      .setConf(SparkLauncher.DRIVER_MEMORY, "4g")
      .launch()
  }
}
How can I do that?
need to pass java parameters(some java arrays)
It is equivalent to executing spark-submit, so you cannot pass Java objects directly. Use app args
addAppArgs(String... args)
to pass application arguments, and parse them in your app.
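Since only strings cross the process boundary, one simple convention (an illustration, not a SparkLauncher feature; the class name and "--ids" argument are placeholders) is to flatten the array into a single delimited argument and split it again inside the launched jar:
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LauncherWithArgs {
    public static void main(String[] args) throws Exception {
        String[] ids = {"1", "2", "3"};                      // the Java array you want to pass
        SparkAppHandle handle = new SparkLauncher()
                .setAppResource("spark_job_with_spark_session.jar")
                .setMaster("local[*]")
                .setConf(SparkLauncher.DRIVER_MEMORY, "4g")
                .addAppArgs("--ids", String.join(",", ids))  // arrays become plain strings
                .startApplication();
        // In the launched jar's main(): String[] ids = args[1].split(",");
    }
}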
package com.meow.woof.meow_spark_launcher.app;

import com.meow.woof.meow_spark_launcher.common.TaskListener;
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

/**
 * @author hahattpro
 */
public class ExampleSparkLauncherApp {

    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
                .setAppResource("/home/cpu11453/workplace/experiment/SparkPlayground/target/scala-2.11/SparkPlayground-assembly-0.1.jar")
                .setMainClass("me.thaithien.playground.ConvertToCsv")
                .setMaster("spark://cpu11453:7077")
                .setConf(SparkLauncher.DRIVER_MEMORY, "3G")
                .addAppArgs("--input", "/data/download_hdfs/data1/2019_08_13/00/", "--output", "/data/download_hdfs/data1/2019_08_13/00_csv_output/")
                .startApplication(new TaskListener());

        handle.addListener(new SparkAppHandle.Listener() {
            @Override
            public void stateChanged(SparkAppHandle handle) {
                System.out.println(handle.getState() + " new state");
            }

            @Override
            public void infoChanged(SparkAppHandle handle) {
                System.out.println(handle.getState() + " new state");
            }
        });

        System.out.println(handle.getState().toString());

        while (!handle.getState().isFinal()) {
            // await until job finishes
            Thread.sleep(1000L);
        }
    }
}
The example code above works.

Spring Integration Cassandra persistence workflow

I am trying to realize the following workflow with Spring Integration:
1) Poll a REST API
2) Store the resulting POJO in a Cassandra cluster
It's my first try with Spring Integration, so I'm still a bit overwhelmed by the mass of information in the reference documentation. After some research, I got the following to work:
1) Poll the REST API
2) Transform the mapped POJO JSON result into a string
3) Save the string to a file
Here's the code:
@Configuration
public class ConsulIntegrationConfig {

    @InboundChannelAdapter(value = "consulHttp", poller = @Poller(maxMessagesPerPoll = "1", fixedDelay = "1000"))
    public String consulAgentPoller() {
        return "";
    }

    @Bean
    public MessageChannel consulHttp() {
        return MessageChannels.direct("consulHttp").get();
    }

    @Bean
    @ServiceActivator(inputChannel = "consulHttp")
    MessageHandler consulAgentHandler() {
        final HttpRequestExecutingMessageHandler handler =
                new HttpRequestExecutingMessageHandler("http://localhost:8500/v1/agent/self");
        handler.setExpectedResponseType(AgentSelfResult.class);
        handler.setOutputChannelName("consulAgentSelfChannel");
        LOG.info("Created bean 'consulAgentHandler'");
        return handler;
    }

    @Bean
    public MessageChannel consulAgentSelfChannel() {
        return MessageChannels.direct("consulAgentSelfChannel").get();
    }

    @Bean
    public MessageChannel consulAgentSelfFileChannel() {
        return MessageChannels.direct("consulAgentSelfFileChannel").get();
    }

    @Bean
    @ServiceActivator(inputChannel = "consulAgentSelfFileChannel")
    MessageHandler consulAgentFileHandler() {
        final Expression directoryExpression = new SpelExpressionParser().parseExpression("'./'");
        final FileWritingMessageHandler handler = new FileWritingMessageHandler(directoryExpression);
        handler.setFileNameGenerator(message -> "../../agent_self.txt");
        handler.setFileExistsMode(FileExistsMode.APPEND);
        handler.setCharset("UTF-8");
        handler.setExpectReply(false);
        return handler;
    }
}
@Component
public final class ConsulAgentTransformer {

    @Transformer(inputChannel = "consulAgentSelfChannel", outputChannel = "consulAgentSelfFileChannel")
    public String transform(final AgentSelfResult json) throws IOException {
        final String result = new StringBuilder(json.toString()).append("\n").toString();
        return result;
    }
}
This works fine!
But now, instead of writing the object to a file, I want to store it in a Cassandra cluster with spring-data-cassandra. For that, I commented out the file handler in the config, returned the POJO from the transformer, and created the following:
@MessagingGateway(name = "consulCassandraGateway", defaultRequestChannel = "consulAgentSelfFileChannel")
public interface CassandraStorageService {

    @Gateway(requestChannel = "consulAgentSelfFileChannel")
    void store(AgentSelfResult agentSelfResult);
}

@Component
public final class CassandraStorageServiceImpl implements CassandraStorageService {

    @Override
    public void store(AgentSelfResult agentSelfResult) {
        // use spring-data-cassandra repository to store
        LOG.info("Received 'AgentSelfResult': {} in Cassandra cluster...");
        LOG.info("Trying to store 'AgentSelfResult' in Cassandra cluster...");
    }
}
But this seems to be the wrong approach; the service method is never triggered.
So my question is: what would be a correct approach for my use case? Do I have to implement the MessageHandler interface in my service component and use a @ServiceActivator in my config? Or is there something missing in my current "gateway approach"? Or maybe there is another solution that I'm not able to see...
As mentioned before, I'm new to Spring Integration, so this may be a stupid question...
Nevertheless, thanks a lot in advance!
It's not clear how you are wiring in your CassandraStorageService bean.
The Spring Integration Cassandra Extension Project has a message-handler implementation.
The Cassandra Sink in spring-cloud-stream-modules uses it with Java configuration so you can use that as an example.
So I finally made it work. All I needed to do was the following:
@Component
public final class CassandraStorageServiceImpl implements CassandraStorageService {

    @ServiceActivator(inputChannel = "consulAgentSelfFileChannel")
    @Override
    public void store(AgentSelfResult agentSelfResult) {
        // use spring-data-cassandra repository to store
        LOG.info("Received 'AgentSelfResult': {}...");
        LOG.info("Trying to store 'AgentSelfResult' in Cassandra cluster...");
    }
}
The CassandraMessageHandler and spring-cloud-stream seemed like too big an overhead for my use case, and I don't really understand them yet... And with this solution, I keep control over what happens in my Spring component.

How to write a client proxy for SPI, and what is the difference between client and server proxies?

I have developed my own idGenerator based on the Hazelcast IdGenerator class (storing each last_used_id in a db). Now I want to run the Hazelcast cluster as a standalone Java application and my web application as another app (a web-application restart shouldn't move the id values to the next block). I moved MyIdGeneratorProxy and MyIdGeneratorService to the new application, ran it, ran the web application as a Hazelcast client, and got:
IllegalArgumentException: No factory registered for service: ecs:impl:idGeneratorService
It was okay when client and server were the same application.
It seems it cannot work without some client proxy. I have compared IdGeneratorProxy and ClientIdGeneratorProxy and they look the same. What is the idea? How do I write a client proxy for my service? I have found no documentation yet. Is this direction of investigation correct? I thought it was possible to separate Hazelcast internal services (like an id generator service) from my business processes. Should I keep a custom ClientProxy (for the custom SPI) in my web application?
This is a demo of how to create a client proxy. The missing part, the CustomClientProxy method calls, is quite complicated (more like a server proxy; on the client side it is called a ReadRequest, on the server side an Operation); you can look at how AtomicLong is implemented. For every client proxy method you have to make a request.
@Test
public void client() throws InterruptedException, IOException
{
    ClientConfig cfg = new XmlClientConfigBuilder("hazelcast-client.xml").build();
    ServiceConfig serviceConfig = new ServiceConfig();
    serviceConfig.setName(ConnectorService.NAME)
                 .setClassName(ConnectorService.class.getCanonicalName())
                 .setEnabled(true);
    ProxyFactoryConfig proxyFactoryConfig = new ProxyFactoryConfig();
    proxyFactoryConfig.setService(ConnectorService.NAME);
    proxyFactoryConfig.setClassName(CustomProxyFactory.class.getName());
    cfg.addProxyFactoryConfig(proxyFactoryConfig);
    HazelcastInstance hz = HazelcastClient.newHazelcastClient(cfg);
    Thread.sleep(1000);
    for (int i = 0; i < 10; i++)
    {
        Connector c = hz.getDistributedObject(ConnectorService.NAME, "Connector:" + ThreadLocalRandom.current()
                .nextInt(10000));
        System.out.println(c.snapshot());
    }
}

private static class CustomProxyFactory implements ClientProxyFactory
{
    @Override
    public ClientProxy create(String id)
    {
        return new CustomClientProxy(ConnectorService.NAME, id);
    }
}

private static class CustomClientProxy extends ClientProxy implements Connector
{
    protected CustomClientProxy(String serviceName, String objectName)
    {
        super(serviceName, objectName);
    }

    @Override
    public ConnectorState snapshot()
    {
        return null;
    }

    @Override
    public void loadState(ConnectorState state)
    {
    }

    @Override
    public boolean reconnect(HostNode node)
    {
        return false;
    }

    @Override
    public boolean connect()
    {
        return false;
    }
}
EDIT
In Hazelcast, IdGenerator is implemented as a wrapper around AtomicLong; you should implement your own IdGenerator instead of extending it.
So you have to implement these (more like a todo list XD); a minimal sketch of the API piece follows the list:
API
interface MyIdGenerate
Server
MyIdGenerateService
MyIdGenerateProxy
MyIdGenerateXXXOperation
Client
ClientMyIdGenerateFactory
ClientMyIdGenerateProxy
MyIdGenerateXXXRequest
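As a starting point, a minimal sketch of the API piece (all names are the hypothetical ones from the list above; the server- and client-side pieces still have to implement and register it):
import com.hazelcast.core.DistributedObject;

// Shared API: both the member-side proxy and the client-side proxy implement this.
public interface MyIdGenerate extends DistributedObject {
    long newId();   // hypothetical operation, backed by MyIdGenerateXXXOperation / MyIdGenerateXXXRequest
}
On the client you would then look it up by the service name from your config, e.g. hz.getDistributedObject("ecs:impl:idGeneratorService", "orders").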
I also made a sequence generator (same idea as IdGenerator) here; it is backed by ZooKeeper or Redis, and it would also be easy to add a db backend. I will integrate it into Hazelcast if I get time.
