Spark Streaming integration with Kinesis not receiving records in EMR - apache-spark

I'm trying to run the word count example described here, but the DStream reading from the Kinesis stream is always empty.
This is how I'm running it:
Launched an AWS EMR cluster on release 6.5.0 (running Spark 3.1.2)
SSHed into the master instance
ran: spark-example --packages org.apache.spark:spark-streaming-kinesis-asl_2.12:3.1.2 streaming.JavaKinesisWordCountASL streaming_test streaming_test https://kinesis.sa-east-1.amazonaws.com
In another tab, ran: spark-example --packages org.apache.spark:spark-streaming-kinesis-asl_2.12:3.1.2 streaming.KinesisWordProducerASL streaming-test https://kinesis.sa-east-1.amazonaws.com 100 10
Additional info:
EMR cluster with 2 m5.xlarge instances
Kinesis with a single shard only
I can fetch records from the stream using boto3 (see the sketch after this list)
A DynamoDB table was indeed created for storing checkpoints, but nothing was written to it
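For reference, this is roughly how I fetch records with boto3 (a minimal sketch; the stream name and region are taken from the commands above, and TRIM_HORIZON is just one choice of iterator type):
import boto3

# Stream name/region as in the commands above; TRIM_HORIZON reads from the
# oldest available record, which is enough for a quick sanity check.
kinesis = boto3.client('kinesis', region_name='sa-east-1')
shard_id = kinesis.describe_stream(
    StreamName='streaming_test')['StreamDescription']['Shards'][0]['ShardId']
iterator = kinesis.get_shard_iterator(
    StreamName='streaming_test', ShardId=shard_id,
    ShardIteratorType='TRIM_HORIZON')['ShardIterator']
print(kinesis.get_records(ShardIterator=iterator)['Records'])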
Logs (this is just a sample: after it finishes initializing, it keeps repeating this pattern of a pprint with no records, followed by a bunch of Spark-related logs, then another pprint with no records):
22/01/27 21:39:46 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 77) (ip-10-0-13-187.sa-east-1.compute.internal, executor 1, partition 6, PROCESS_LOCAL, 4443 bytes) taskResourceAssignments Map()
22/01/27 21:39:46 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 76) in 19 ms on ip-10-0-13-187.sa-east-1.compute.internal (executor 1) (1/3)
22/01/27 21:39:46 INFO TaskSetManager: Starting task 2.0 in stage 8.0 (TID 78) (ip-10-0-13-187.sa-east-1.compute.internal, executor 1, partition 7, PROCESS_LOCAL, 4443 bytes) taskResourceAssignments Map()
22/01/27 21:39:46 INFO TaskSetManager: Finished task 1.0 in stage 8.0 (TID 77) in 10 ms on ip-10-0-13-187.sa-east-1.compute.internal (executor 1) (2/3)
22/01/27 21:39:46 INFO TaskSetManager: Finished task 2.0 in stage 8.0 (TID 78) in 8 ms on ip-10-0-13-187.sa-east-1.compute.internal (executor 1) (3/3)
22/01/27 21:39:46 INFO YarnScheduler: Removed TaskSet 8.0, whose tasks have all completed, from pool
22/01/27 21:39:46 INFO DAGScheduler: ResultStage 8 (print at JavaKinesisWordCountASL.java:190) finished in 0,042 s
22/01/27 21:39:46 INFO DAGScheduler: Job 4 is finished. Cancelling potential speculative or zombie tasks for this job
22/01/27 21:39:46 INFO YarnScheduler: Killing all running tasks in stage 8: Stage finished
22/01/27 21:39:46 INFO DAGScheduler: Job 4 finished: print at JavaKinesisWordCountASL.java:190, took 0,048372 s
-------------------------------------------
Time: 1643319586000 ms
-------------------------------------------
22/01/27 21:39:46 INFO JobScheduler: Finished job streaming job 1643319586000 ms.0 from job set of time 1643319586000 ms
22/01/27 21:39:46 INFO JobScheduler: Total delay: 0,271 s for time 1643319586000 ms (execution: 0,227 s)
22/01/27 21:39:46 INFO ReceivedBlockTracker: Deleting batches:
Also, the library apparently does manage to connect to the Kinesis stream:
22/01/27 21:39:44 INFO KinesisInputDStream: Slide time = 2000 ms
22/01/27 21:39:44 INFO KinesisInputDStream: Storage level = Serialized 1x Replicated
22/01/27 21:39:44 INFO KinesisInputDStream: Checkpoint interval = null
22/01/27 21:39:44 INFO KinesisInputDStream: Remember interval = 2000 ms
22/01/27 21:39:44 INFO KinesisInputDStream: Initialized and validated org.apache.spark.streaming.kinesis.KinesisInputDStream@7cc3580b
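For completeness, I believe the example wires up the receiver along the lines of this PySpark equivalent (a sketch based on my reading of the example, not its actual Java source; in particular, the LATEST initial position is my assumption about its default, and it would mean only records produced while the consumer is already running get picked up):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName='KinesisSanityCheck')
ssc = StreamingContext(sc, 2)  # 2 s batches, matching the slide time above

# The app name doubles as the DynamoDB checkpoint table name; LATEST means
# records produced before the consumer starts are never received.
lines = KinesisUtils.createStream(
    ssc, 'streaming_test', 'streaming_test',
    'https://kinesis.sa-east-1.amazonaws.com', 'sa-east-1',
    InitialPositionInStream.LATEST, 2)
lines.pprint()
ssc.start()
ssc.awaitTermination()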
Any help would be much appreciated!

Related

How to correctly parallelize multiple JSON file aggregation in PySpark

I have a large set of json_lines files on S3 with some logs that I would like to aggregate (basically just count the number of requests by path, location, etc.). I've been doing the following, but judging by the logs I'm not sure it's actually parallelized: first it takes about 3 minutes to download the individual S3 files one by one, and then the rest still seems split into 1000 executions. I thought Spark would break this down into a map-reduce kind of approach by itself, but maybe I totally misunderstood what it does and doesn't do. Could someone provide a hint, please?
from pyspark.sql.functions import col, from_unixtime
from pyspark.sql.types import DateType

# Count GET requests with 2xx status by path, country, date and file size.
df = (
    spark.read
    .json(test_paths, schema=schema)
    .filter(col('method') == 'GET')
    .filter((col('status_code') == 200) | (col('status_code') == 206))
    .withColumn('date', from_unixtime('timestamp').cast(DateType()))
    .groupBy('path', 'client_country_code', 'date', 'file_size')
    .count()
)
Here's the driver log for 1000 urls:
20/11/15 19:15:23 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under 1000 paths. The first several paths are: s3n://bucket../10004.json_lines.gz.
20/11/15 19:15:23 INFO SparkContext: Starting job: json at NativeMethodAccessorImpl.java:0
20/11/15 19:15:23 INFO DAGScheduler: Got job 49 (json at NativeMethodAccessorImpl.java:0) with 1000 output partitions
20/11/15 19:15:23 INFO DAGScheduler: Final stage: ResultStage 75 (json at NativeMethodAccessorImpl.java:0)
20/11/15 19:15:23 INFO DAGScheduler: Parents of final stage: List()
20/11/15 19:15:23 INFO DAGScheduler: Missing parents: List()
20/11/15 19:15:23 INFO DAGScheduler: Submitting ResultStage 75 (MapPartitionsRDD[206] at json at NativeMethodAccessorImpl.java:0), which has no missing parents
20/11/15 19:15:23 INFO MemoryStore: Block broadcast_77 stored as values in memory (estimated size 84.3 KiB, free 2.2 GiB)
20/11/15 19:15:23 INFO MemoryStore: Block broadcast_77_piece0 stored as bytes in memory (estimated size 29.9 KiB, free 2.2 GiB)
20/11/15 19:15:23 INFO BlockManagerInfo: Added broadcast_77_piece0 in memory on e05e979b7108:34999 (size: 29.9 KiB, free: 2.2 GiB)
20/11/15 19:15:23 INFO SparkContext: Created broadcast 77 from broadcast at DAGScheduler.scala:1223
20/11/15 19:15:23 INFO DAGScheduler: Submitting 1000 missing tasks from ResultStage 75 (MapPartitionsRDD[206] at json at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
20/11/15 19:15:23 INFO TaskSchedulerImpl: Adding task set 75.0 with 1000 tasks
20/11/15 19:15:23 INFO TaskSetManager: Starting task 0.0 in stage 75.0 (TID 33224, e05e979b7108, executor driver, partition 0, PROCESS_LOCAL, 7473 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 1.0 in stage 75.0 (TID 33225, e05e979b7108, executor driver, partition 1, PROCESS_LOCAL, 7473 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 2.0 in stage 75.0 (TID 33226, e05e979b7108, executor driver, partition 2, PROCESS_LOCAL, 7474 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 3.0 in stage 75.0 (TID 33227, e05e979b7108, executor driver, partition 3, PROCESS_LOCAL, 7475 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 4.0 in stage 75.0 (TID 33228, e05e979b7108, executor driver, partition 4, PROCESS_LOCAL, 7476 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 5.0 in stage 75.0 (TID 33229, e05e979b7108, executor driver, partition 5, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 6.0 in stage 75.0 (TID 33230, e05e979b7108, executor driver, partition 6, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:23 INFO TaskSetManager: Starting task 7.0 in stage 75.0 (TID 33231, e05e979b7108, executor driver, partition 7, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:23 INFO Executor: Running task 0.0 in stage 75.0 (TID 33224)
20/11/15 19:15:23 INFO Executor: Running task 1.0 in stage 75.0 (TID 33225)
20/11/15 19:15:23 INFO Executor: Running task 2.0 in stage 75.0 (TID 33226)
20/11/15 19:15:23 INFO Executor: Running task 5.0 in stage 75.0 (TID 33229)
20/11/15 19:15:23 INFO Executor: Running task 3.0 in stage 75.0 (TID 33227)
20/11/15 19:15:23 INFO Executor: Running task 7.0 in stage 75.0 (TID 33231)
20/11/15 19:15:23 INFO Executor: Running task 6.0 in stage 75.0 (TID 33230)
20/11/15 19:15:23 INFO Executor: Running task 4.0 in stage 75.0 (TID 33228)
20/11/15 19:15:24 INFO Executor: Finished task 1.0 in stage 75.0 (TID 33225). 2025 bytes result sent to driver
20/11/15 19:15:24 INFO Executor: Finished task 0.0 in stage 75.0 (TID 33224). 2025 bytes result sent to driver
20/11/15 19:15:24 INFO TaskSetManager: Starting task 8.0 in stage 75.0 (TID 33232, e05e979b7108, executor driver, partition 8, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:24 INFO TaskSetManager: Finished task 1.0 in stage 75.0 (TID 33225) in 567 ms on e05e979b7108 (executor driver) (1/1000)
20/11/15 19:15:24 INFO Executor: Running task 8.0 in stage 75.0 (TID 33232)
20/11/15 19:15:24 INFO Executor: Finished task 6.0 in stage 75.0 (TID 33230). 2033 bytes result sent to driver
20/11/15 19:15:24 INFO TaskSetManager: Starting task 9.0 in stage 75.0 (TID 33233, e05e979b7108, executor driver, partition 9, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:24 INFO TaskSetManager: Starting task 10.0 in stage 75.0 (TID 33234, e05e979b7108, executor driver, partition 10, PROCESS_LOCAL, 7477 bytes)
20/11/15 19:15:24 INFO TaskSetManager: Finished task 0.0 in stage 75.0 (TID 33224) in 570 ms on e05e979b7108 (executor driver) (2/1000)
20/11/15 19:15:24 INFO Executor: Running task 9.0 in stage 75.0 (TID 33233)
20/11/15 19:15:24 INFO Executor: Running task 10.0 in stage 75.0 (TID 33234)
20/11/15 19:15:24 INFO TaskSetManager: Finished task 6.0 in stage 75.0 (TID 33230) in 571 ms on e05e979b7108 (executor driver) (3/1000)
....
20/11/15 19:15:43 INFO TaskSetManager: Finished task 998.0 in stage 75.0 (TID 34222) in 158 ms on e05e979b7108 (executor driver) (999/1000)
20/11/15 19:15:43 INFO Executor: Finished task 999.0 in stage 75.0 (TID 34223). 2033 bytes result sent to driver
20/11/15 19:15:43 INFO TaskSetManager: Finished task 999.0 in stage 75.0 (TID 34223) in 175 ms on e05e979b7108 (executor driver) (1000/1000)
20/11/15 19:15:43 INFO TaskSchedulerImpl: Removed TaskSet 75.0, whose tasks have all completed, from pool
20/11/15 19:15:43 INFO DAGScheduler: ResultStage 75 (json at NativeMethodAccessorImpl.java:0) finished in 19.850 s
20/11/15 19:15:43 INFO DAGScheduler: Job 49 is finished. Cancelling potential speculative or zombie tasks for this job
20/11/15 19:15:43 INFO TaskSchedulerImpl: Killing all running tasks in stage 75: Stage finished
20/11/15 19:15:43 INFO DAGScheduler: Job 49 finished: json at NativeMethodAccessorImpl.java:0, took 19.890458 s
20/11/15 19:15:43 INFO InMemoryFileIndex: It took 19936 ms to list leaf files for 1000 paths.
There's a lot of setup overhead, especially with many small files. JSON is also a very inefficient storage format, as the whole file needs to be read every time. Ideally each file should be 64+ MB to give the Spark workers enough data to process efficiently.
Have you considered making step 1 of your workflow just reading in the JSON files and then saving them in a columnar format like Parquet, to a smaller number of files?
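As a rough sketch of that two-step idea (the output path and the target file count below are placeholder assumptions; test_paths and schema are from your snippet):
# Step 1: one-off conversion of the many small gzipped JSON files into
# fewer, larger Parquet files (100 is an arbitrary example target).
raw = spark.read.json(test_paths, schema=schema)
raw.coalesce(100).write.mode('overwrite').parquet('s3://bucket/logs-parquet/')

# Step 2: run the aggregation against the compact columnar copy instead.
df = spark.read.parquet('s3://bucket/logs-parquet/')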

Apache Spark driver logs don't specify the reason for stage cancellation

I run Apache Spark on AWS EMR under YARN.
The cluster has 1 master and 10 executors.
After some hours of processing, my cluster failed, so I went to look at the logs.
I see that all the working executors were trying to kill their tasks at the same time (this is the log of one executor):
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 66.0 in stage 2.0 (TID 466), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 65.0 in stage 2.0 (TID 465), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 67.0 in stage 2.0 (TID 467), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 64.0 in stage 2.0 (TID 464), reason: Stage cancelled
20/03/05 00:02:12 ERROR Utils: Aborting a task
I see that the reason is Stage cancelled, but I can't get any details about it. I looked at the driver logs and found that their last record is from a much earlier time.
So I have 2 questions:
Why are the driver logs much shorter than the executor logs?
How can I get the real reason why the stage was cancelled?
20/03/04 18:39:40 INFO TaskSetManager: Starting task 159.0 in stage 1.0 (TID 359, ip-172-31-6-236.us-west-2.compute.internal, executor 40, partition 159, RACK_LOCAL, 8421 bytes)
20/03/04 18:39:40 INFO ExecutorAllocationManager: New executor 40 has registered (new total is 40)
20/03/04 18:39:41 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-6-236.us-west-2.compute.internal:33589 with 2.8 GB RAM, BlockManagerId(40, ip-172-31-6-236.us-west-2.compute.internal, 33589, None)
20/03/04 18:39:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-6-236.us-west-2.compute.internal:33589 (size: 44.7 KB, free: 2.8 GB)
20/03/04 18:39:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-6-236.us-west-2.compute.internal:33589 (size: 37.4 KB, free: 2.8 GB)

HDP3.1.4 - Spark2 with Hive Warehouse Connector error using spark-submit and pyspark shell: KeeperErrorCode = ConnectionLoss

Environment:
HDP 3.1.4 - configured and tested
Hive Server 2 - tested and working
Hive Server 2 LLAP - tested and working
Spark configured per the documentation to use the Hive Warehouse Connector (HWC)
Apache Zeppelin - spark2 interpreter configured to use HWC
Trying to execute the following script:
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession
# Create spark session
spark = SparkSession.builder.appName("LLAP Test - CLI").enableHiveSupport().getOrCreate()
# Create HWC session
hive = HiveWarehouseSession.session(spark).userPassword('hive','hive').build()
# Execute a query to read from Spark using HWC
hive.executeQuery("select * from wifi_table where partit='2019-12-02'").show(20)
Problem:
When submitting an application with spark-submit, or using the pyspark shell with the above script (or any script that executes a query with the HiveWarehouseSession), the Spark job gets stuck and throws an exception: java.lang.RuntimeException: java.io.IOException: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
The command to execute is the following:
$ /usr/hdp/current/spark2-client/bin/spark-submit --master yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip spark_compare_test.py
Here is the stacktrace:
[...]
20/01/03 12:39:55 INFO SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:0
20/01/03 12:39:56 INFO DAGScheduler: Got job 0 (showString at NativeMethodAccessorImpl.java:0) with 1 output partitions
20/01/03 12:39:56 INFO DAGScheduler: Final stage: ResultStage 0 (showString at NativeMethodAccessorImpl.java:0)
20/01/03 12:39:56 INFO DAGScheduler: Parents of final stage: List()
20/01/03 12:39:56 INFO DAGScheduler: Missing parents: List()
20/01/03 12:39:56 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at showString at NativeMethodAccessorImpl.java:0), which has no missing parents
20/01/03 12:39:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 9.5 KB, free 366.3 MB)
20/01/03 12:39:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.6 KB, free 366.3 MB)
20/01/03 12:39:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on EDGE01.machine:38050 (size: 3.6 KB, free: 366.3 MB)
20/01/03 12:39:56 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1039
20/01/03 12:39:56 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at showString at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
20/01/03 12:39:56 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
20/01/03 12:39:56 WARN TaskSetManager: Stage 0 contains a task of very large size (465 KB). The maximum recommended task size is 100 KB.
20/01/03 12:39:56 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, DN02.machine, executor 2, partition 0, NODE_LOCAL, 476705 bytes)
20/01/03 12:39:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on DN02.machine:41521 (size: 3.6 KB, free: 366.3 MB)
20/01/03 12:42:08 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, DN02.machine, executor 2): java.lang.RuntimeException: java.io.IOException: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReaderFactory.createDataReader(HiveWarehouseDataReaderFactory.java:66)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.hadoop.hive.registry.impl.ZkRegistryBase.ensureInstancesCache(ZkRegistryBase.java:619)
at org.apache.hadoop.hive.llap.registry.impl.LlapZookeeperRegistryImpl.getInstances(LlapZookeeperRegistryImpl.java:422)
at org.apache.hadoop.hive.llap.registry.impl.LlapZookeeperRegistryImpl.getInstances(LlapZookeeperRegistryImpl.java:63)
at org.apache.hadoop.hive.llap.registry.impl.LlapRegistryService.getInstances(LlapRegistryService.java:181)
at org.apache.hadoop.hive.llap.registry.impl.LlapRegistryService.getInstances(LlapRegistryService.java:177)
at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getServiceInstanceForHost(LlapBaseInputFormat.java:415)
at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getServiceInstance(LlapBaseInputFormat.java:397)
at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getRecordReader(LlapBaseInputFormat.java:160)
at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReader.getRecordReader(HiveWarehouseDataReader.java:72)
at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReader.<init>(HiveWarehouseDataReader.java:50)
at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReaderFactory.getDataReader(HiveWarehouseDataReaderFactory.java:72)
at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReaderFactory.createDataReader(HiveWarehouseDataReaderFactory.java:64)
... 18 more
Caused by: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at shadecurator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
at shadecurator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
at shadecurator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
at shadecurator.org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:489)
at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199)
at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193)
at shadecurator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190)
at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175)
at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32)
at shadecurator.org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194)
at shadecurator.org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61)
at shadecurator.org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53)
at shadecurator.org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576)
at shadecurator.org.apache.curator.framework.recipes.cache.PathChildrenCache.rebuild(PathChildrenCache.java:326)
at shadecurator.org.apache.curator.framework.recipes.cache.PathChildrenCache.start(PathChildrenCache.java:303)
at org.apache.hadoop.hive.registry.impl.ZkRegistryBase.ensureInstancesCache(ZkRegistryBase.java:597)
... 29 more
[...]
I have tried the following with no effect whatsoever:
Checked zookeeper health and connection limiting
Changed zookeeper hosts
Increased the zookeeper timeout to 10s, 120s and 600s, with no effect
Tried to submit the application on multiple nodes, the error persists
There is another strange behavior: running the script on the Zeppelin spark2 interpreter, there is no error and HWC works. I have compared the environments, and there is no configuration mismatch in the main variables.
At this point I'm stuck and don't know where to look for further troubleshooting. I can add more information as requested.

First query to Cassandra tables through the Thrift server takes too long

I am trying to query a Cassandra table through the Thrift server. I have set up my Spark cluster with one master and one worker on the same node.
I am starting the Thrift server with the following command, without any custom configuration.
$SPARK_HOME/sbin/start-thriftserver.sh --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 --conf spark.cassandra.connection.host=127.0.0.1 --master spark://<spark-master>:7077
I have created the following table in Cassandra, inserted no more than 10 records into it, and configured it in the Hive metastore.
CREATE TABLE IF NOT EXISTS places_for_research(
research_id uuid,
tenant_id uuid,
country text,
place_id uuid,
PRIMARY KEY((tenant_id,research_id),country,place_id)
);
Now when I query this table from beeline, the first time it takes around 19 seconds, and on subsequent executions this drops to about half a second.
Following is the query I execute from beeline, which returns 2 records.
select * from places_for_research where tenant_id='340276cb-389b-4f57-a2cf-6ff5ec3e4d91' and research_id='95dafbe7-78d0-4509-9553-899dfaa7b858';
I'm wondering what is causing the first request to take so much time. How can I optimise first-request performance?
Following are the Thrift server logs for your reference:
17/11/03 20:12:50 INFO SparkExecuteStatementOperation: Running query 'select * from places_for_research where tenant_id='340276cb-389b-4f57-a2cf-6ff5ec3e4d91' and research_id='95dafbe7-78d0-4509-9553-899dfaa7b858'' with 9d9a5c7c-2766-48c3-ab58-348b461b6577
17/11/03 20:12:50 INFO SparkSqlParser: Parsing command: select * from places_for_research where tenant_id='340276cb-389b-4f57-a2cf-6ff5ec3e4d91' and research_id='95dafbe7-78d0-4509-9553-899dfaa7b858'
17/11/03 20:12:51 INFO HiveMetaStore: 2: get_table : db=default tbl=places_for_research
17/11/03 20:12:51 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=places_for_research
17/11/03 20:12:51 INFO HiveMetaStore: 2: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
17/11/03 20:12:51 INFO ObjectStore: ObjectStore, initialize called
17/11/03 20:12:51 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
17/11/03 20:12:51 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
17/11/03 20:12:51 INFO ObjectStore: Initialized ObjectStore
17/11/03 20:12:52 INFO CatalystSqlParser: Parsing command: array<string>
17/11/03 20:12:52 INFO HiveMetaStore: 2: get_table : db=default tbl=places_for_research
17/11/03 20:12:52 INFO audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=places_for_research
17/11/03 20:12:52 INFO CatalystSqlParser: Parsing command: array<string>
17/11/03 20:12:53 INFO ClockFactory: Using native clock to generate timestamps.
17/11/03 20:12:53 WARN NettyUtil: Found Netty's native epoll transport, but not running on linux-based operating system. Using NIO instead.
17/11/03 20:12:54 INFO Cluster: New Cassandra host /127.0.0.1:9042 added
17/11/03 20:12:54 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
17/11/03 20:12:55 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(tenant_id), IsNotNull(research_id), EqualTo(tenant_id,340276cb-389b-4f57-a2cf-6ff5ec3e4d91), EqualTo(research_id,95dafbe7-78d0-4509-9553-899dfaa7b858)]
17/11/03 20:12:55 INFO CassandraSourceRelation: Input Predicates: [IsNotNull(tenant_id), IsNotNull(research_id), EqualTo(tenant_id,340276cb-389b-4f57-a2cf-6ff5ec3e4d91), EqualTo(research_id,95dafbe7-78d0-4509-9553-899dfaa7b858)]
17/11/03 20:12:57 INFO CodeGenerator: Code generated in 652.925772 ms
17/11/03 20:12:57 INFO SparkContext: Starting job: run at AccessController.java:0
17/11/03 20:12:57 INFO DAGScheduler: Got job 0 (run at AccessController.java:0) with 1 output partitions
17/11/03 20:12:57 INFO DAGScheduler: Final stage: ResultStage 0 (run at AccessController.java:0)
17/11/03 20:12:57 INFO DAGScheduler: Parents of final stage: List()
17/11/03 20:12:57 INFO DAGScheduler: Missing parents: List()
17/11/03 20:12:57 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at run at AccessController.java:0), which has no missing parents
17/11/03 20:12:58 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 12.8 KB, free 366.3 MB)
17/11/03 20:12:58 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 6.3 KB, free 366.3 MB)
17/11/03 20:12:58 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.110:57001 (size: 6.3 KB, free: 366.3 MB)
17/11/03 20:12:58 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
17/11/03 20:12:58 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at run at AccessController.java:0)
17/11/03 20:12:58 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/11/03 20:12:58 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.110, executor 0, partition 0, ANY, 8403 bytes)
17/11/03 20:13:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.110:57005 (size: 6.3 KB, free: 366.3 MB)
17/11/03 20:13:05 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
17/11/03 20:13:09 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 11709 ms on 192.168.1.110 (executor 0) (1/1)
17/11/03 20:13:09 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/11/03 20:13:10 INFO DAGScheduler: ResultStage 0 (run at AccessController.java:0) finished in 11.734 s
17/11/03 20:13:10 INFO DAGScheduler: Job 0 finished: run at AccessController.java:0, took 12.189787 s
17/11/03 20:13:10 INFO CodeGenerator: Code generated in 63.249603 ms
17/11/03 20:13:10 INFO SparkExecuteStatementOperation: Result Schema: StructType(StructField(tenant_id,StringType,true), StructField(research_id,StringType,true), StructField(country,StringType,true), StructField(place_id,StringType,true))
Thanks.
The Spark Thrift Server is lazy, which means it doesn't actually start any of the machinery for running queries until the first query is launched. The delay you see is the actual startup and requesting of remote resources. This will always take some non-zero amount of time, but you could possibly hide it by having your Thrift server immediately queried with a dummy request after it starts up.
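For example, a minimal warm-up sketch (assuming the PyHive client and the default Thrift port 10000; a beeline one-liner issuing any query would work just as well):
from pyhive import hive  # assumption: PyHive is installed; any JDBC/ODBC client works too

# Fire a throwaway query right after start-thriftserver.sh so the first
# real query doesn't pay the resource-acquisition cost.
conn = hive.connect(host='localhost', port=10000)
cur = conn.cursor()
cur.execute('SELECT 1')  # any dummy request forces Spark to acquire resources
print(cur.fetchall())
cur.close()
conn.close()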

How to use foreach in foreachRDD in Spark Streaming?

I want to read each element with foreachRDD and do something with each tuple.
Spark worker memory is set to 756m.
def main(args: Array[String]) {
  val sc = new StreamingContext(".....")  // context setup elided in the original
  val dataSet = sc.textFileStream("<HDFS_FILE_PATH>")
  dataSet.foreachRDD(rdd => {
    rdd.foreachPartition((iterator: Iterator[String]) => {
      // guard against empty partitions; this prints only each partition's first element
      if (iterator.hasNext) println("1 : " + iterator.next())
    })
  })
  sc.start()            // these two calls were outside main in the original
  sc.awaitTermination()
}
When I compile the source with sbt and run it on Spark, it does not work like this.
Nothing is shown on the console.
14/10/01 18:22:50 INFO MemoryStore: ensureFreeSpace(171438) called with curMem=0, maxMem=1109498265
14/10/01 18:22:50 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 167.4 KB, free 1057.9 MB)
14/10/01 18:22:50 INFO FileInputFormat: Total input paths to process : 1
14/10/01 18:22:50 INFO JobScheduler: Added jobs for time 1412155370000 ms
14/10/01 18:22:50 INFO JobScheduler: Starting job streaming job 1412155370000 ms.0 from job set of time 1412155370000 ms
14/10/01 18:22:50 INFO SparkContext: Starting job: foreachPartition at SbclogCep.scala:54
14/10/01 18:22:50 INFO DAGScheduler: Got job 0 (foreachPartition at SbclogCep.scala:54) with 1 output partitions (allowLocal=false)
14/10/01 18:22:50 INFO DAGScheduler: Final stage: Stage 0(foreachPartition at SbclogCep.scala:54)
14/10/01 18:22:50 INFO DAGScheduler: Parents of final stage: List()
14/10/01 18:22:51 INFO DAGScheduler: Missing parents: List()
14/10/01 18:22:51 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[2] at map at MappedDStream.scala:35), which has no missing parents
14/10/01 18:22:51 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[2] at map at MappedDStream.scala:35)
14/10/01 18:22:51 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/10/01 18:23:00 INFO FileInputDStream: Finding new files took 4 ms
14/10/01 18:23:00 INFO FileInputDStream: New files at time 1412155380000 ms:
14/10/01 18:23:00 INFO JobScheduler: Added jobs for time 1412155380000 ms
14/10/01 18:23:06 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/10/01 18:23:10 INFO FileInputDStream: Finding new files took 3 ms
14/10/01 18:23:10 INFO FileInputDStream: New files at time 1412155390000 ms:
14/10/01 18:23:10 INFO JobScheduler: Added jobs for time 1412155390000 ms
14/10/01 18:23:20 INFO FileInputDStream: Finding new files took 8 ms
14/10/01 18:23:20 INFO FileInputDStream: New files at time 1412155400000 ms:
14/10/01 18:23:20 INFO JobScheduler: Added jobs for time 1412155400000 ms
14/10/01 18:23:21 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/10/01 18:23:30 INFO FileInputDStream: Finding new files took 4 ms
14/10/01 18:23:30 INFO FileInputDStream: New files at time 1412155410000 ms:
14/10/01 18:23:30 INFO JobScheduler: Added jobs for time 1412155410000 ms
14/10/01 18:23:36 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/10/01 18:23:40 INFO FileInputDStream: Finding new files took 3 ms
14/10/01 18:23:40 INFO FileInputDStream: New files at time 1412155420000 ms:
14/10/01 18:23:40 INFO JobScheduler: Added jobs for time 1412155420000 ms
14/10/01 18:23:50 INFO FileInputDStream: Finding new files took 8 ms
14/10/01 18:23:50 INFO FileInputDStream: New files at time 1412155430000 ms:
14/10/01 18:23:50 INFO JobScheduler: Added jobs for time 1412155430000 ms
14/10/01 18:23:51 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/10/01 18:24:00 INFO FileInputDStream: Finding new files took 4 ms
14/10/01 18:24:00 INFO FileInputDStream: New files at time 1412155440000 ms:
14/10/01 18:24:00 INFO JobScheduler: Added jobs for time 1412155440000 ms
14/10/01 18:24:06 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/10/01 18:24:10 INFO FileInputDStream: Finding new files took 3 ms
14/10/01 18:24:10 INFO FileInputDStream: New files at time 1412155450000 ms:
14/10/01 18:24:10 INFO JobScheduler: Added jobs for time 1412155450000 ms
Help me please.
I want to go home.
