Setting up two instance Cassandra cluster on a single physical node (laptop with windows 10) - cassandra

My main objective is to experiment with a Cassandra cluster. I have a Laptop so my options are, as I understand, (1) Use some virtualization software (e.g. Hyper-V) to create multiple VMs and then run each VM with a Cassandra instance, or (2) Use docker to create multiple instances of Cassandra, or (3) Directly run multiple instances.
I thought (3) would be faster, and provide me with more insights. So I tried this (by following https://stackoverflow.com/a/25348301/1029599). But I'm getting strange situations when I see that I'm not able to change the JMX port. More details below:
I've created two folders of Cassandra 3.11.7 - one in C drive and other in D drive.
For C drive folder, I've edited cassandra.yaml to replace 'listen_address: localhost' by 'listen_address: 127.0.0.1' and 'rpc_address: localhost' by 'rpc_address: 127.0.0.1'. In addition, set seeds to to point to D-drive-instance as '- seeds: "127.0.0.2"'. I've NOT edited cassandra-env.sh to let JMX_PORT be the default 7199.
For D drive folder, I've edited cassandra.yaml to point localhost as '127.0.0.2' and seed as '- seeds: "127.0.0.1"'. In addition, I've edited cassandra-env.sh to let JMX_PORT=7200.
Surprisingly, when I'm starting D drive's cassandra instance, it's always picking JMX_PORT as 7199 and not 7200.
Log (relevant portion from the start):
D:\apache-cassandra-3.11.7\bin>.\cassandra
WARNING! Powershell script execution unavailable. Please use 'powershell Set-ExecutionPolicy Unrestricted' on this user-account to run cassandra with fully featured functionality on this platform. Starting with legacy startup options Starting Cassandra Server INFO [main] 2020-08-17 17:31:21,632 YamlConfigurationLoader.java:89 - Configuration location: file:/D:/apache-cassandra-3.11.7/conf/cassandra.yaml INFO [main] 2020-08-17 17:31:22,359 Config.java:534 - Node configuration:[allocate_tokens_for_keyspace=null; authenticator=AllowAllAuthenticator; authorizer=AllowAllAuthorizer; auto_bootstrap=true; auto_snapshot=true; back_pressure_enabled=false; back_pressure_strategy=org.apache.cassandra.net.RateBasedBackPressure{high_ratio=0.9, factor=5, flow=FAST}; batch_size_fail_threshold_in_kb=50; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; broadcast_address=null; broadcast_rpc_address=null; buffer_pool_use_heap_if_exhausted=true; cas_contention_timeout_in_ms=1000; cdc_enabled=false; cdc_free_space_check_interval_ms=250; cdc_raw_directory=null; cdc_total_space_in_mb=0; check_for_duplicate_rows_during_compaction=true; check_for_duplicate_rows_during_reads=true; client_encryption_options=<REDACTED>; cluster_name=Test Cluster; column_index_cache_size_in_kb=2; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_compression=null; commitlog_directory=null; commitlog_max_compression_buffers_in_pool=3; commitlog_periodic_queue_size=-1; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_batch_window_in_ms=NaN; commitlog_sync_period_in_ms=10000; commitlog_total_space_in_mb=null; compaction_large_partition_warning_threshold_mb=100; compaction_throughput_mb_per_sec=16; concurrent_compactors=null; concurrent_counter_writes=32; concurrent_materialized_view_writes=32; concurrent_reads=32; concurrent_replicates=null; concurrent_writes=32; counter_cache_keys_to_save=2147483647; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; credentials_cache_max_entries=1000; credentials_update_interval_in_ms=-1; credentials_validity_in_ms=2000; cross_node_timeout=false; data_file_directories=[Ljava.lang.String;#235834f2; disk_access_mode=auto; disk_failure_policy=stop; disk_optimization_estimate_percentile=0.95; disk_optimization_page_cross_chance=0.1; disk_optimization_strategy=ssd; dynamic_snitch=true; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; enable_materialized_views=true; enable_sasi_indexes=true; enable_scripted_user_defined_functions=false; enable_user_defined_functions=false; enable_user_defined_functions_threads=true; encryption_options=null; endpoint_snitch=SimpleSnitch; file_cache_round_up=null; file_cache_size_in_mb=null; gc_log_threshold_in_ms=200; gc_warn_threshold_in_ms=1000; hinted_handoff_disabled_datacenters=[]; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; hints_compression=null; hints_directory=null; hints_flush_period_in_ms=10000; incremental_backups=false; index_interval=null; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; initial_token=null; inter_dc_stream_throughput_outbound_megabits_per_sec=200; inter_dc_tcp_nodelay=false; internode_authenticator=null; internode_compression=dc; internode_recv_buff_size_in_bytes=0; internode_send_buff_size_in_bytes=0; key_cache_keys_to_save=2147483647; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=127.0.0.2; listen_interface=null; listen_interface_prefer_ipv6=false; listen_on_broadcast_address=false; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; max_hints_file_size_in_mb=128; max_mutation_size_in_kb=null; max_streaming_retries=3; max_value_size_in_mb=256; memtable_allocation_type=heap_buffers; memtable_cleanup_threshold=null; memtable_flush_writers=0; memtable_heap_space_in_mb=null; memtable_offheap_space_in_mb=null; min_free_space_per_drive_in_mb=50; native_transport_flush_in_batches_legacy=true; native_transport_max_concurrent_connections=-1; native_transport_max_concurrent_connections_per_ip=-1; native_transport_max_concurrent_requests_in_bytes=-1; native_transport_max_concurrent_requests_in_bytes_per_ip=-1; native_transport_max_frame_size_in_mb=256; native_transport_max_negotiable_protocol_version=-2147483648; native_transport_max_threads=128; native_transport_port=9042; native_transport_port_ssl=null; num_tokens=256; otc_backlog_expiration_interval_ms=200; otc_coalescing_enough_coalesced_messages=8; otc_coalescing_strategy=DISABLED; otc_coalescing_window_us=200; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_cache_max_entries=1000; permissions_update_interval_in_ms=-1; permissions_validity_in_ms=2000; phi_convict_threshold=8.0; prepared_statements_cache_size_mb=null; range_request_timeout_in_ms=10000; read_request_timeout_in_ms=5000; repair_session_max_tree_depth=18; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_scheduler_id=null; request_scheduler_options=null; request_timeout_in_ms=10000; role_manager=CassandraRoleManager; roles_cache_max_entries=1000; roles_update_interval_in_ms=-1; roles_validity_in_ms=2000; row_cache_class_name=org.apache.cassandra.cache.OHCProvider; row_cache_keys_to_save=2147483647; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=127.0.0.2; rpc_interface=null; rpc_interface_prefer_ipv6=false; rpc_keepalive=true; rpc_listen_backlog=50; rpc_max_threads=2147483647; rpc_min_threads=16; rpc_port=9160; rpc_recv_buff_size_in_bytes=null; rpc_send_buff_size_in_bytes=null; rpc_server_type=sync; saved_caches_directory=null; seed_provider=org.apache.cassandra.locator.SimpleSeedProvider{seeds=127.0.0.1}; server_encryption_options=<REDACTED>; slow_query_log_timeout_in_ms=500; snapshot_before_compaction=false; snapshot_on_duplicate_row_detection=false; ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=false; storage_port=7000; stream_throughput_outbound_megabits_per_sec=200; streaming_keep_alive_period_in_secs=300; streaming_socket_timeout_in_ms=86400000; thrift_framed_transport_size_in_mb=15; thrift_max_message_length_in_mb=16; thrift_prepared_statements_cache_size_mb=null; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; tracetype_query_ttl=86400; tracetype_repair_ttl=604800; transparent_data_encryption_options=org.apache.cassandra.config.TransparentDataEncryptionOptions#5656be13; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=60000; unlogged_batch_across_partitions_warn_threshold=10; user_defined_function_fail_timeout=1500; user_defined_function_warn_timeout=500; user_function_timeout_policy=die; windows_timer_interval=1; write_request_timeout_in_ms=2000] INFO [main] 2020-08-17 17:31:22,361 DatabaseDescriptor.java:381 - DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap INFO [main] 2020-08-17 17:31:22,366 DatabaseDescriptor.java:439 - Global memtable on-heap threshold is enabled at 503MB INFO [main] 2020-08-17 17:31:22,367 DatabaseDescriptor.java:443 - Global memtable off-heap threshold is enabled at 503MB INFO [main] 2020-08-17 17:31:22,538 RateBasedBackPressure.java:123 - Initialized back-pressure with high ratio: 0.9, factor: 5, flow: FAST, window size: 2000. INFO [main] 2020-08-17 17:31:22,538 DatabaseDescriptor.java:773 - Back-pressure is disabled with strategy org.apache.cassandra.net.RateBasedBackPressure{high_ratio=0.9, factor=5, flow=FAST}. INFO [main] 2020-08-17 17:31:22,686 JMXServerUtils.java:252 - Configured JMX server at: service:jmx:rmi://127.0.0.1/jndi/rmi://127.0.0.1:7199/jmxrmi INFO [main] 2020-08-17 17:31:22,700 CassandraDaemon.java:490 - Hostname: DESKTOP-NQ7673H INFO [main] 2020-08-17 17:31:22,703 CassandraDaemon.java:497 - JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.8.0_261 INFO [main] 2020-08-17 17:31:22,709 CassandraDaemon.java:498 - Heap size: 1.968GiB/1.968GiB INFO [main] 2020-08-17 17:31:22,712 CassandraDaemon.java:503 - Code Cache Non-heap memory: init = 2555904(2496K) used = 5181696(5060K) committed = 5242880(5120K) max = 251658240(245760K) INFO [main] 2020-08-17 17:31:22,714 CassandraDaemon.java:503 - Metaspace Non-heap memory: init = 0(0K) used = 19412472(18957K) committed = 20054016(19584K) max
= -1(-1K) INFO [main] 2020-08-17 17:31:22,738 CassandraDaemon.java:503 - Compressed Class Space Non-heap memory: init = 0(0K) used = 2373616(2317K) committed = 2621440(2560K) max = 1073741824(1048576K) INFO [main] 2020-08-17 17:31:22,739 CassandraDaemon.java:503 - Par Eden Space Heap memory: init = 279183360(272640K) used = 111694624(109076K) committed = 279183360(272640K) max = 279183360(272640K) INFO [main] 2020-08-17 17:31:22,740 CassandraDaemon.java:503 - Par Survivor Space Heap memory: init = 34865152(34048K) used = 0(0K) committed = 34865152(34048K) max = 34865152(34048K) INFO [main] 2020-08-17 17:31:22,743 CassandraDaemon.java:503 - CMS Old Gen Heap memory: init
= 1798569984(1756416K) used = 0(0K) committed = 1798569984(1756416K) max = 1798569984(1756416K) INFO [main] 2020-08-17 17:31:22,744 CassandraDaemon.java:505 - Classpath: D:\apache-cassandra-3.11.7\conf;D:\apache-cassandra-3.11.7\lib\airline-0.6.jar;D:\apache-cassandra-3.11.7\lib\antlr-runtime-3.5.2.jar;D:\apache-cassandra-3.11.7\lib\apache-cassandra-3.11.7.jar;D:\apache-cassandra-3.11.7\lib\apache-cassandra-thrift-3.11.7.jar;D:\apache-cassandra-3.11.7\lib\asm-5.0.4.jar;D:\apache-cassandra-3.11.7\lib\caffeine-2.2.6.jar;D:\apache-cassandra-3.11.7\lib\cassandra-driver-core-3.0.1-shaded.jar;D:\apache-cassandra-3.11.7\lib\commons-cli-1.1.jar;D:\apache-cassandra-3.11.7\lib\commons-codec-1.9.jar;D:\apache-cassandra-3.11.7\lib\commons-lang3-3.1.jar;D:\apache-cassandra-3.11.7\lib\commons-math3-3.2.jar;D:\apache-cassandra-3.11.7\lib\compress-lzf-0.8.4.jar;D:\apache-cassandra-3.11.7\lib\concurrent-trees-2.4.0.jar;D:\apache-cassandra-3.11.7\lib\concurrentlinkedhashmap-lru-1.4.jar;D:\apache-cassandra-3.11.7\lib\disruptor-3.0.1.jar;D:\apache-cassandra-3.11.7\lib\ecj-4.4.2.jar;D:\apache-cassandra-3.11.7\lib\guava-18.0.jar;D:\apache-cassandra-3.11.7\lib\HdrHistogram-2.1.9.jar;D:\apache-cassandra-3.11.7\lib\high-scale-lib-1.0.6.jar;D:\apache-cassandra-3.11.7\lib\hppc-0.5.4.jar;D:\apache-cassandra-3.11.7\lib\jackson-annotations-2.9.10.jar;D:\apache-cassandra-3.11.7\lib\jackson-core-2.9.10.jar;D:\apache-cassandra-3.11.7\lib\jackson-databind-2.9.10.4.jar;D:\apache-cassandra-3.11.7\lib\jamm-0.3.0.jar;D:\apache-cassandra-3.11.7\lib\javax.inject.jar;D:\apache-cassandra-3.11.7\lib\jbcrypt-0.3m.jar;D:\apache-cassandra-3.11.7\lib\jcl-over-slf4j-1.7.7.jar;D:\apache-cassandra-3.11.7\lib\jctools-core-1.2.1.jar;D:\apache-cassandra-3.11.7\lib\jflex-1.6.0.jar;D:\apache-cassandra-3.11.7\lib\jna-4.2.2.jar;D:\apache-cassandra-3.11.7\lib\joda-time-2.4.jar;D:\apache-cassandra-3.11.7\lib\json-simple-1.1.jar;D:\apache-cassandra-3.11.7\lib\jstackjunit-0.0.1.jar;D:\apache-cassandra-3.11.7\lib\libthrift-0.9.2.jar;D:\apache-cassandra-3.11.7\lib\log4j-over-slf4j-1.7.7.jar;D:\apache-cassandra-3.11.7\lib\logback-classic-1.1.3.jar;D:\apache-cassandra-3.11.7\lib\logback-core-1.1.3.jar;D:\apache-cassandra-3.11.7\lib\lz4-1.3.0.jar;D:\apache-cassandra-3.11.7\lib\metrics-core-3.1.5.jar;D:\apache-cassandra-3.11.7\lib\metrics-jvm-3.1.5.jar;D:\apache-cassandra-3.11.7\lib\metrics-logback-3.1.5.jar;D:\apache-cassandra-3.11.7\lib\netty-all-4.0.44.Final.jar;D:\apache-cassandra-3.11.7\lib\ohc-core-0.4.4.jar;D:\apache-cassandra-3.11.7\lib\ohc-core-j8-0.4.4.jar;D:\apache-cassandra-3.11.7\lib\reporter-config-base-3.0.3.jar;D:\apache-cassandra-3.11.7\lib\reporter-config3-3.0.3.jar;D:\apache-cassandra-3.11.7\lib\sigar-1.6.4.jar;D:\apache-cassandra-3.11.7\lib\slf4j-api-1.7.7.jar;D:\apache-cassandra-3.11.7\lib\snakeyaml-1.11.jar;D:\apache-cassandra-3.11.7\lib\snappy-java-1.1.1.7.jar;D:\apache-cassandra-3.11.7\lib\snowball-stemmer-1.3.0.581.1.jar;D:\apache-cassandra-3.11.7\lib\ST4-4.0.8.jar;D:\apache-cassandra-3.11.7\lib\stream-2.5.2.jar;D:\apache-cassandra-3.11.7\lib\thrift-server-0.3.7.jar;D:\apache-cassandra-3.11.7\build\classes\main;D:\apache-cassandra-3.11.7\build\classes\thrift;D:\apache-cassandra-3.11.7\lib\jamm-0.3.0.jar INFO [main] 2020-08-17 17:31:22,747 CassandraDaemon.java:507 - JVM Arguments: [-ea,
-javaagent:D:\apache-cassandra-3.11.7\lib\jamm-0.3.0.jar, -Xms2G, -Xmx2G, -XX:+HeapDumpOnOutOfMemoryError, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC, -XX:+CMSParallelRemarkEnabled, -XX:SurvivorRatio=8, -XX:MaxTenuringThreshold=1, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Dlogback.configurationFile=logback.xml, -Djava.library.path=D:\apache-cassandra-3.11.7\lib\sigar-bin, -Dcassandra.jmx.local.port=7199, -Dcassandra, -Dcassandra-foreground=yes, -Dcassandra.logdir=D:\apache-cassandra-3.11.7\logs, -Dcassandra.storagedir=D:\apache-cassandra-3.11.7\data] WARN [main] 2020-08-17 17:31:22,763 StartupChecks.java:169 - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info. WARN [main] 2020-08-17 17:31:22,768 StartupChecks.java:220 - The JVM is not configured to stop on OutOfMemoryError which can cause data corruption. Use one of the following JVM options to configure the behavior on OutOfMemoryError: -XX:+ExitOnOutOfMemoryError,
-XX:+CrashOnOutOfMemoryError, or -XX:OnOutOfMemoryError="<cmd args>;<cmd args>" INFO [main] 2020-08-17 17:31:22,774 SigarLibrary.java:44 - Initializing SIGAR library
Can you pls help me resolve the port issue which is stopping me from running two instances.
In addition, I'll appreciate if you suggest other ways to get this done e.g. options 1 or 2 as I've mentioned in the beginning of my post or GCP or AWS with free account.

Option 2 - using docker - is very easy. This is described in details here: cassandra - docker hub
First of all you must increase resources in docker, because default configuration migh be too small for running more than 1-2 nodes - I am using 8 GB RAM and 3 cpu on my laptop and I am able to start 3 nodes of Cassandra simultanously.
The first step is to create a bridge network which will be used by Cassandra's cluster:
docker network create mynet
Then start a first node :
docker run --name cas1 --network mynet -d cassandra
then wait about one minute until the node starts, and then start another node:
docker run --name cas2 --network mynet -d -e CASSANDRA_SEEDS=cas1 cassandra
and again, start third node:
docker run --name cas3 --network mynet -d -e CASSANDRA_SEEDS=cas1 cassandra
Finally start another docker container instance in interactive mode, run cqlsh in it and connect cqlsh to your cluster (to the first node named cas1):
docker run -it --network mynet --rm cassandra cqlsh cas1
Connected to Test Cluster at cas1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.7 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>

Related

WARN - Running Spark Locally with Docker - Initial job has not accepted any resources

I launched Spark master and worker in my laptop using Docker bridge network spark
docker network create spark
I put the following command
docker run -ti -p 8080:8080 -p 7077:7077 -p 4040:4040 -e SPARK_NO_DAEMONIZE=true --network=spark --name spark-master apache/spark:v3.3.0 /opt/spark/sbin/start-master.sh
docker run -ti -p 8080:8080 -p 7077:7077 -p 4040:4040 -e SPARK_NO_DAEMONIZE=true --network=spark --name spark-master apache/spark:v3.3.0 /opt/spark/sbin/start-worker.sh spark://<master>:7077
Once they both start, I try and launch the following application code from my IDE (written in Kotlin, but doesn't matter if it's also in Java)
var sparkSession = SparkSession.builder()
.appName("mapreduce")
.master("spark://localhost:7077")
.config("spark.dynamicAllocation.enabled", "false")
.orCreate
var dataset: Dataset<String> = sparkSession.createDataset(listOf("Banana", "Car", "Glass", "Banana", "Computer", "Car"),
Encoders.STRING())
dataset = dataset.map(MapFunction{c: String->"word: "+c} , Encoders.STRING())
dataset.show()
The code works if master is local. I opened localhost:8080 and localhost:8081 and I can see the job getting registered. So, why am I getting a warning message as follows
22/09/05 01:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220905001717-0003/1 is now EXITED (Command exited with code 1)
22/09/05 01:17:38 INFO StandaloneSchedulerBackend: Executor app-20220905001717-0003/1 removed: Command exited with code 1
22/09/05 01:17:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
22/09/05 01:17:38 INFO BlockManagerMaster: Removal of executor 1 requested
22/09/05 01:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220905001717-0003/2 on worker-20220905000436-172.18.0.3-8082 (172.18.0.3:8082) with 8 core(s)
22/09/05 01:17:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
22/09/05 01:17:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20220905001717-0003/2 on hostPort 172.18.0.3:8082 with 8 core(s), 1024.0 MiB RAM
22/09/05 01:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220905001717-0003/2 is now RUNNING
22/09/05 01:17:40 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Spark Pod restarting every hour in Kubernetes

I have deployed spark applications in cluster-mode in kubernetes. The spark application pod is getting restarted almost every hour.
The driver log has this message before restart:
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: The executor with id 1 was deleted by a user or the framework.
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: The executor with id 2 was deleted by a user or the framework.
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, x.x.x.x, 44879, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 2 (epoch 1)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, y.y.y.y, 46191, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 1)
20/07/11 13:34:02 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
20/07/11 13:34:16 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
And the Executor log has:
20/07/11 15:55:01 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/07/11 15:55:01 INFO MemoryStore: MemoryStore cleared
20/07/11 15:55:01 INFO BlockManager: BlockManager stopped
20/07/11 15:55:01 INFO ShutdownHookManager: Shutdown hook called
How can I find what's causing the executors deletion?
Deployment:
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 0 max surge
Pod Template:
Labels: app=test
chart=test-2.0.0
heritage=Tiller
product=testp
release=test
service=test-spark
Containers:
test-spark:
Image: test-spark:2df66df06c
Port: <none>
Host Port: <none>
Command:
/spark/bin/start-spark.sh
Args:
while true; do sleep 30; done;
Limits:
memory: 4Gi
Requests:
memory: 4Gi
Liveness: exec [/spark/bin/liveness-probe.sh] delay=300s timeout=1s period=30s #success=1 #failure=10
Environment:
JVM_ARGS: -Xms256m -Xmx1g
KUBERNETES_MASTER: https://kubernetes.default.svc
KUBERNETES_NAMESPACE: test-spark
IMAGE_PULL_POLICY: Always
DRIVER_CPU: 1
DRIVER_MEMORY: 2048m
EXECUTOR_CPU: 1
EXECUTOR_MEMORY: 2048m
EXECUTOR_INSTANCES: 2
KAFKA_ADVERTISED_HOST_NAME: kafka.default:9092
ENRICH_KAFKA_ENRICHED_EVENTS_TOPICS: test-events
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: test-spark-5c5997b459 (1/1 replicas created)
Events: <none>
I did a quick research on running Spark on Kubernetes, and it seems that Spark by design will terminate executor pod when they finished running Spark applications. Quoted from the official Spark website:
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
Therefore, I believe there is nothing to worry about the restarts as long as your Spark instance still manages to start executor pods as and when required.
Reference: https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#how-it-works
I don't know how you configured your application pod but you can use this to stop restarting pod include this in your deployment yaml file so that pod will never restart and you can debug the pod onwards.
restartPolicy: Never

Flink Job Deployment In Kubernetes

I am trying to deploy a Flink job in Kubernetes cluster (Azure AKS). The Job Cluster is getting aborted just after starting but Task manager is running fine.
The docker image is created successfully without any exception. I am able to run the docker image as well as able to SSH to docker image.
I have followed steps mentioned in the below link:
https://github.com/apache/flink/tree/release-1.9/flink-container/kubernetes
While creating image I have provided Job jar and it has been copied on "/opt/artifacts" inside the image. But still not getting why getting below exception in Job Cluster pod log.
Caused by: org.apache.flink.util.FlinkException: Failed to find job JAR on class path. Please provide the job class name explicitly.
I am new in Kubernetes, Could you please give me some clue to debug this issue.
Please find below complete logs:
A. flink-job-cluster Pod Log
develk#ACIDLAELKV01:~/cntx_eng$ kubectl logs flink-job-cluster-kszwf
Starting the job-cluster
Starting standalonejob as a console application on host flink-job-cluster-kszwf.
2019-12-12 10:37:17,170 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2019-12-12 10:37:17,172 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneJobClusterEntryPoint (Version: 1.8.0, Rev:4caec0d, Date:03.04.2019 # 13:25:54 PDT)
2019-12-12 10:37:17,172 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: flink
2019-12-12 10:37:17,173 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: <no hadoop dependency found>
2019-12-12 10:37:17,173 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - IcedTea - 1.8/25.212-b04
2019-12-12 10:37:17,173 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 989 MiBytes
2019-12-12 10:37:17,173 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /usr/lib/jvm/java-1.8-openjdk/jre
2019-12-12 10:37:17,174 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - No Hadoop Dependency available
2019-12-12 10:37:17,174 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
2019-12-12 10:37:17,174 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xms1024m
2019-12-12 10:37:17,174 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xmx1024m
2019-12-12 10:37:17,174 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:/opt/flink-1.8.0/conf/log4j-console.properties
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:/opt/flink-1.8.0/conf/logback-console.xml
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --configDir
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - /opt/flink-1.8.0/conf
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Djobmanager.rpc.address=flink-job-cluster
2019-12-12 10:37:17,175 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dparallelism.default=1
2019-12-12 10:37:17,176 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dblob.server.port=6124
2019-12-12 10:37:17,176 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dqueryable-state.server.ports=6125
2019-12-12 10:37:17,176 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Classpath: /opt/flink-1.8.0/lib/log4j-1.2.17.jar:/opt/flink-1.8.0/lib/slf4j-log4j12-1.7.15.jar:/opt/flink-1.8.0/lib/flink-dist_2.11-1.8.0.jar:::
2019-12-12 10:37:17,176 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2019-12-12 10:37:17,178 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-12-12 10:37:17,306 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2019-12-12 10:37:17,306 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2019-12-12 10:37:17,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2019-12-12 10:37:17,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
2019-12-12 10:37:17,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2019-12-12 10:37:17,307 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2019-12-12 10:37:17,336 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneJobClusterEntryPoint.
2019-12-12 10:37:17,336 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install default filesystem.
2019-12-12 10:37:17,343 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2019-12-12 10:37:17,352 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install security context.
2019-12-12 10:37:17,362 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2019-12-12 10:37:17,381 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2019-12-12 10:37:17,382 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services.
2019-12-12 10:37:17,638 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at flink-job-cluster:6123
2019-12-12 10:37:18,163 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2019-12-12 10:37:18,237 INFO akka.remote.Remoting - Starting remoting
2019-12-12 10:37:18,366 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink#flink-job-cluster:6123]
2019-12-12 10:37:18,375 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor system started at akka.tcp://flink#flink-job-cluster:6123
2019-12-12 10:37:18,398 INFO org.apache.flink.configuration.Configuration - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
2019-12-12 10:37:18,407 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-63338044-67c1-4872-a3d9-c94563b3a7c3
2019-12-12 10:37:18,412 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2019-12-12 10:37:18,428 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2019-12-12 10:37:18,430 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to start actor system at flink-job-cluster:0
2019-12-12 10:37:18,464 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2019-12-12 10:37:18,472 INFO akka.remote.Remoting - Starting remoting
2019-12-12 10:37:18,480 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink-metrics#flink-job-cluster:33529]
2019-12-12 10:37:18,482 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor system started at akka.tcp://flink-metrics#flink-job-cluster:33529
2019-12-12 10:37:18,490 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /tmp/blobStore-ba64dcdb-5095-41fc-9c98-0f1528d95c40
2019-12-12 10:37:18,514 INFO org.apache.flink.configuration.Configuration - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
2019-12-12 10:37:18,515 WARN org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Upload directory /tmp/flink-web-f6be0c2d-5099-4bd6-bc72-a0ae1fc6448e/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2019-12-12 10:37:18,516 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Created directory /tmp/flink-web-f6be0c2d-5099-4bd6-bc72-a0ae1fc6448e/flink-web-upload for file uploads.
2019-12-12 10:37:18,603 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Starting rest endpoint.
2019-12-12 10:37:18,872 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - Log file environment variable 'log.file' is not set.
2019-12-12 10:37:18,872 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (fallback keys: [{key=jobmanager.web.log.path, isDeprecated=true}])'.
2019-12-12 10:37:19,115 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Rest endpoint listening at flink-job-cluster:8081
2019-12-12 10:37:19,116 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - http://flink-job-cluster:8081 was granted leadership with leaderSessionID=00000000-0000-0000-0000-000000000000
2019-12-12 10:37:19,116 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Web frontend listening at http://flink-job-cluster:8081.
2019-12-12 10:37:19,239 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2019-12-12 10:37:19,262 INFO org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever - Scanning class path for job JAR
2019-12-12 10:37:19,270 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shutting down rest endpoint.
2019-12-12 10:37:19,295 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Removing cache directory /tmp/flink-web-f6be0c2d-5099-4bd6-bc72-a0ae1fc6448e/flink-web-ui
2019-12-12 10:37:19,299 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - http://flink-job-cluster:8081 lost leadership
2019-12-12 10:37:19,299 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shut down complete.
2019-12-12 10:37:19,302 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status FAILED. Diagnostics org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:257)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:224)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:172)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:171)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:535)
at org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.main(StandaloneJobClusterEntryPoint.java:105)
Caused by: org.apache.flink.util.FlinkException: Failed to find job JAR on class path. Please provide the job class name explicitly.
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.getJobClassNameOrScanClassPath(ClassPathJobGraphRetriever.java:131)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:114)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.retrieveJobGraph(ClassPathJobGraphRetriever.java:96)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:62)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:41)
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:184)
... 6 more
Caused by: java.util.NoSuchElementException: No JAR with manifest attribute for entry class
at org.apache.flink.container.entrypoint.JarManifestParser.findOnlyEntryClass(JarManifestParser.java:80)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.scanClassPathForJobJar(ClassPathJobGraphRetriever.java:137)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.getJobClassNameOrScanClassPath(ClassPathJobGraphRetriever.java:129)
... 11 more
.
2019-12-12 10:37:19,305 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6124
2019-12-12 10:37:19,305 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2019-12-12 10:37:19,315 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
2019-12-12 10:37:19,320 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-12-12 10:37:19,321 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-12-12 10:37:19,323 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-12-12 10:37:19,325 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-12-12 10:37:19,354 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-12-12 10:37:19,356 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-12-12 10:37:19,378 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
2019-12-12 10:37:19,382 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster entrypoint StandaloneJobClusterEntryPoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint StandaloneJobClusterEntryPoint.
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:190)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:535)
at org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.main(StandaloneJobClusterEntryPoint.java:105)
Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:257)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:224)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:172)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:171)
... 2 more
Caused by: org.apache.flink.util.FlinkException: Failed to find job JAR on class path. Please provide the job class name explicitly.
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.getJobClassNameOrScanClassPath(ClassPathJobGraphRetriever.java:131)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:114)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.retrieveJobGraph(ClassPathJobGraphRetriever.java:96)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:62)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:41)
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:184)
... 6 more
Caused by: java.util.NoSuchElementException: No JAR with manifest attribute for entry class
at org.apache.flink.container.entrypoint.JarManifestParser.findOnlyEntryClass(JarManifestParser.java:80)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.scanClassPathForJobJar(ClassPathJobGraphRetriever.java:137)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.getJobClassNameOrScanClassPath(ClassPathJobGraphRetriever.java:129)
... 11 more
develk#ACIDLAELKV01:~/cntx_eng$
Now, I have added Job class name as in argument section of "job-cluster-job.yaml.template" file.
Like below:
args: ["job-cluster",
"--job-classname", "com.flink.wordCountSimple",
"-Djobmanager.rpc.address=flink-job-cluster",
But after that I am getting below exception:
Caused by: org.apache.flink.util.FlinkException: Could not load the provided entrypoint class.
Please see below detail log.
2019-12-13 19:08:34,323 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint - Shut down complete.
2019-12-13 19:08:34,329 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting StandaloneJobClusterEntryPoint down with application status FAILED. Diagnostics org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:257)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:224)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:172)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:171)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:535)
at org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.main(StandaloneJobClusterEntryPoint.java:105)
Caused by: org.apache.flink.util.FlinkException: Could not load the provided entrypoint class.
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:119)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.retrieveJobGraph(ClassPathJobGraphRetriever.java:96)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:62)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:41)
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:184)
... 6 more
Caused by: java.lang.ClassNotFoundException: com.flink.wordCountSimple
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:116)
... 10 more
.
2019-12-13 19:08:34,337 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6124
2019-12-13 19:08:34,338 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2019-12-13 19:08:34,364 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
2019-12-13 19:08:34,368 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-12-13 19:08:34,372 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-12-13 19:08:34,392 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-12-13 19:08:34,392 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-12-13 19:08:34,406 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-12-13 19:08:34,410 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-12-13 19:08:34,434 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
2019-12-13 19:08:34,443 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster entrypoint StandaloneJobClusterEntryPoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint StandaloneJobClusterEntryPoint.
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:190)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:535)
at org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.main(StandaloneJobClusterEntryPoint.java:105)
Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:257)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:224)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:172)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:171)
... 2 more
Caused by: org.apache.flink.util.FlinkException: Could not load the provided entrypoint class.
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:119)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.retrieveJobGraph(ClassPathJobGraphRetriever.java:96)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:62)
at org.apache.flink.runtime.dispatcher.JobDispatcherFactory.createDispatcher(JobDispatcherFactory.java:41)
at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:184)
... 6 more
Caused by: java.lang.ClassNotFoundException: com.flink.wordCountSimple
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.flink.container.entrypoint.ClassPathJobGraphRetriever.createPackagedProgram(ClassPathJobGraphRetriever.java:116)
... 10 more
There's a complete, working example of creating and running a flink job cluster on kubernetes in https://github.com/alpinegizmo/flink-containers-example. Maybe that will help. See also https://www.youtube.com/watch?v=ceZtUDgh2TE.
version: "2.1"
services:
jobmanager:
build:
context: ./
args:
JAR_FILE: flink-event-tracker-bundled-1.6.0.jar
image: test/flink-event-tracker
expose:
- "6123"
ports:
- "8081:8081"
- "6123:6123"
command: job-cluster --job-classname com.company.test.flink.pipelines.KafkaPipelineConsumer -Djobmanager.rpc.address=jobmanager --runner=FlinkRunner --streaming=true --checkpointingInterval=30000
environment:
- JOB_MANAGER_RPC_ADDRESS=jobmanager
- JOB_MANAGER=jobmanager
volumes:
- data-volume:/docker/volumes
taskmanager:
image: test/flink-event-tracker
expose:
- "6121"
- "6122"
depends_on:
- jobmanager
command: task-manager -Djobmanager.rpc.address=jobmanager
links:
- "jobmanager:jobmanager"
environment:
- JOB_MANAGER_RPC_ADDRESS=jobmanager
- JOB_MANAGER=jobmanager
volumes:
- data-volume:/docker/volumes
volumes:
data-volume:
driver: local
driver_opts:
o: bind
type: none
device: /Users/home/Development/docker/volumes/flink
Docker file
FROM flink:1.9
ARG JAR_FILE=""
ENV APP_OPTS ""
ENV JAVA_OPTS ""
ENV JOB_MANAGER=""
# Build arg allows passing the version at runtime
ARG VERSION=unset-version
COPY flink-conf.yml $FLINK_HOME/conf/flink-conf.yaml
COPY target/$JAR_FILE $FLINK_HOME/lib/event-tracker.jar
COPY docker-cluster-entrypoint.sh /docker-cluster-entrypoint.sh
RUN apt-get update && apt-get install procps -y && apt-get install curl -y
RUN echo "root:root" | chpasswd
RUN chmod 777 /docker-cluster-entrypoint.sh
RUN chmod 777 $FLINK_HOME/lib/event-tracker.jar
ENTRYPOINT [ "bash","/docker-cluster-entrypoint.sh" ]
docker-cluster-entrypoint.sh
FLINK_HOME=${FLINK_HOME:-"/opt/flink/bin"}
JOB_CLUSTER="job-cluster"
TASK_MANAGER="task-manager"
CMD="$1"
shift;
if [ "${CMD}" = "--help" -o "${CMD}" = "-h" ]; then
echo "Usage: $(basename $0) (${JOB_CLUSTER}|${TASK_MANAGER})"
exit 0
elif [ "${CMD}" = "${JOB_CLUSTER}" -o "${CMD}" = "${TASK_MANAGER}" ]; then
echo "Starting the ${CMD}"
if [ "${CMD}" = "${TASK_MANAGER}" ]; then
exec $FLINK_HOME/bin/taskmanager.sh start-foreground "$#"
else
exec $FLINK_HOME/bin/standalone-job.sh start-foreground "$#"
fi
fi
How to run:-
mvn clean install
docker-compose -f docker-compose.local.yml up --scale taskmanager=2 > exceptionlog.log
docker-compose -f docker-compose.local.yml build
this is the entire conf that runs your docker. but if you want to run in kube, just convert the docker-compose file to its corresponding kube files...remaining can stay same.. may be do a helm that way kube maintenance is better.
Note:- we are using apache beam to code the job

how to check when did cassandra got shut down or started/re-started from logs?

nodetool status -- shows some nodes as down and also application is not able to reach the cassandra nodes. When I connect to cassandra nodes and check the logs there are some errors in the logs (system.log and debug.log).
Didn't understand how to check if when did Cassandra got started / re-started and got shutdown. Is there any way to check this from logs ? if so which log and how ?
In the logs folder of your Cassandra installation you should find system.log file. The default location for the log files is /var/log/cassandra.
When Cassandra starts the output looks like this:
INFO [main] 2018-08-18 19:10:27,162 YamlConfigurationLoader.java:89 - Configuration location: file:/home/20171127/.ccm/3113/node1/conf/cassandra.yaml
INFO [main] 2018-08-18 19:10:27,696 Config.java:495 - Node configuration:[allocate_tokens_for_keyspace=null; authenticator=AllowAllAuthenticator;INFO [main] 2018-08-18 19:10:27,698 DatabaseDescriptor.java:367 - DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO [main] 2018-08-18 19:10:27,699 DatabaseDescriptor.java:425 - Global memtable on-heap threshold is enabled at 123MB
INFO [main] 2018-08-18 19:10:27,700 DatabaseDescriptor.java:429 - Global memtable off-heap threshold is enabled at 123MB
WARN [main] 2018-08-18 19:10:27,974 DatabaseDescriptor.java:550 - Only 30.922GiB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots
INFO [main] 2018-08-18 19:10:28,025 RateBasedBackPressure.java:123 - Initialized back-pressure with high ratio: 0.9, factor: 5, flow: FAST, window size: 2000.
INFO [main] 2018-08-18 19:10:28,026 DatabaseDescriptor.java:729 - Back-pressure is disabled with strategy org.apache.cassandra.net.RateBasedBackPressure{high_ratio=0.9, factor=5, flow=FAST}.
INFO [main] 2018-08-18 19:10:28,265 JMXServerUtils.java:246 - Configured JMX server at: service:jmx:rmi://127.0.0.1/jndi/rmi://127.0.0.1:7100/jmxrmi
INFO [main] 2018-08-18 19:10:28,282 CassandraDaemon.java:473 - Hostname: 2VVg5
INFO [main] 2018-08-18 19:10:28,284 CassandraDaemon.java:480 - JVM vendor/version: OpenJDK 64-Bit Server VM/1.8.0_181
INFO [main] 2018-08-18 19:10:28,285 CassandraDaemon.java:481 - Heap size: 495.000MiB/495.000MiB
INFO [main] 2018-08-18 19:10:28,287 CassandraDaemon.java:486 - Code Cache Non-heap memory: init = 2555904(2496K) used = 4178816(4080K) committed = 4194304(4096K) max = 251658240(245760K)
INFO [main] 2018-08-18 19:10:28,289 CassandraDaemon.java:486 - Metaspace Non-heap memory: init = 0(0K) used = 18530200(18095K) committed = 19005440(18560K) max = -1(-1K)
INFO [main] 2018-08-18 19:10:28,289 CassandraDaemon.java:486 - Compressed Class Space Non-heap memory: init = 0(0K) used = 2092504(2043K) committed = 2228224(2176K) max = 1073741824(1048576K)
INFO [main] 2018-08-18 19:10:28,290 CassandraDaemon.java:486 - Par Eden Space Heap memory: init = 41943040(40960K) used = 41943040(40960K) committed = 41943040(40960K) max = 41943040(40960K)
INFO [main] 2018-08-18 19:10:28,291 CassandraDaemon.java:486 - Par Survivor Space Heap memory: init = 5242880(5120K) used = 5242880(5120K) committed = 5242880(5120K) max = 5242880(5120K)
INFO [main] 2018-08-18 19:10:28,293 CassandraDaemon.java:486 - CMS Old Gen Heap memory: init = 471859200(460800K) used = 669336(653K) committed = 471859200(460800K) max = 471859200(460800K)
When Cassandra shuts down properly, the log entries look like this:
INFO [StorageServiceShutdownHook] 2018-08-18 19:28:23,661 Gossiper.java:1559 - Announcing shutdown
INFO [StorageServiceShutdownHook] 2018-08-18 19:28:23,662 StorageService.java:2289 - Node /127.0.0.1 state jump to shutdown
INFO [StorageServiceShutdownHook] 2018-08-18 19:28:25,664 MessagingService.java:992 - Waiting for messaging service to quiesce
INFO [ACCEPT-/127.0.0.1] 2018-08-18 19:28:25,665 MessagingService.java:1346 - MessagingService has terminated the accept() thread
INFO [StorageServiceShutdownHook] 2018-08-18 19:28:25,729 HintsService.java:220 - Paused hints dispatch
Depending on how Cassandra stopped is possible to not have any log entries where you could see that Cassandra stopped.
Other interesting files that you could look into are debug.log and gc.log.

Cassandra stopped working after nodetool repair

After running "nodetool repair" command cassandra node gone down and did not start again.
INFO [main] 2016-10-19 12:44:50,244 ColumnFamilyStore.java:405 - Initializing system_schema.aggregates
INFO [main] 2016-10-19 12:44:50,247 ColumnFamilyStore.java:405 - Initializing system_schema.indexes
INFO [main] 2016-10-19 12:44:50,248 ViewManager.java:139 - Not submitting build tasks for views in keyspace system_schema as storage service is not initialized
Cassandra version 3.7
Turned on the node and it's fine. It took too long to start (more than 30 minutes).
INFO [main] 2016-10-19 15:32:48,348 ColumnFamilyStore.java:405 - Initializing system_schema.indexes
INFO [main] 2016-10-19 15:32:48,354 ViewManager.java:139 - Not submitting build tasks for views in keyspace system_schema as storage service is not initialized
INFO [main] 2016-10-19 16:07:36,529 ColumnFamilyStore.java:405 - Initializing system_distributed.parent_repair_history
INFO [main] 2016-10-19 16:07:36,546 ColumnFamilyStore.java:405 - Initializing system_distributed.repair_history
Now I'm trying to figure out why it is so slow.

Resources