Need to tune a long-running Spark job

I need to pull data from Oracle into Hive. The job takes about 24 hours to complete.
I am using the Spark JDBC API to pull the data. How can I tune this job?
Oracle table info:
Number of blocks: 54014592
Memory in MB: 421989
Spark job settings:
DRIVER_MEMORY: 25 GB
EXECUTOR_CORES: 5
EXECUTOR_INSTANCES: 25
EXECUTOR_MEMORY: 20 GB
Spark parallel degree: 25
The table has 29 partitions and the largest partition is 93 GB.
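For reference, the usual lever for this kind of job is the partitioned JDBC read; below is a minimal sketch in Scala, where the connection details, partition column, bounds and counts are placeholders rather than values taken from this job.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("oracle-to-hive")
  .enableHiveSupport()
  .getOrCreate()

// Partitioned read: Spark opens numPartitions parallel JDBC connections,
// each scanning a slice of partitionColumn between lowerBound and upperBound.
val oracleDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")   // placeholder
  .option("dbtable", "SCHEMA.BIG_TABLE")                      // placeholder
  .option("user", "user")
  .option("password", "password")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("partitionColumn", "ID")        // a roughly evenly distributed numeric column
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "64")
  .option("fetchsize", "10000")           // rows fetched per round trip; the default is small
  .load()

// Write the result into Hive.
oracleDF.write.mode("overwrite").saveAsTable("target_db.big_table")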
Logs (they show many full garbage collections):
18/12/03 05:11:55 INFO scheduler.TaskSetManager: Finished task 16.0 in stage 4.0 (TID 144) in 1448053 ms on bdgtr004d02h1u.nam.nsroot.net (executor 18) (36/64)
1837.013: [Full GC (System.gc()) 549M->296M(25G), 0.5553646 secs]
18/12/03 05:22:11 INFO storage.BlockManagerInfo: Added rdd_10_44 in memory on bdgtr015d07h2u.nam.nsroot.net:36517 (size: 498.7 MB, free: 10.3 GB)
18/12/03 05:58:59 INFO scheduler.TaskSetManager: Finished task 38.0 in stage 4.0 (TID 166) in 4271907 ms on bdgtr007d17i2u.nam.nsroot.net (executor 5) (59/64)
18/12/03 06:16:17 INFO storage.BlockManagerInfo: Added rdd_10_22 in memory on bdgtr006d20i2u.nam.nsroot.net:34124 (size: 705.2 MB, free: 8.4 GB)
5437.013: [Full GC (System.gc()) 1121M->297M(25G), 0.6317014 secs]
18/12/03 06:17:00 INFO scheduler.TaskSetManager: Finished task 22.1 in stage 4.0 (TID 192) in 2686834 ms on bdgtr006d20i2u.nam.nsroot.net (executor 9) (60/64)
7237.013: [Full GC (System.gc()) 1112M->297M(25G), 0.7000144 secs]
18/12/03 07:02:15 INFO storage.BlockManagerInfo: Added rdd_10_63 in memory on bdgtr007d17i2u.nam.nsroot.net:43841 (size: 318.9 MB, free: 9.0 GB)
18/12/03 07:02:39 INFO scheduler.TaskSetManager: Finished task 63.0 in stage 4.0 (TID 191) in 8091801 ms on bdgtr007d17i2u.nam.nsroot.net (executor 5) (61/64)
9037.014: [Full GC (System.gc()) 1097M->297M(25G), 0.6828210 secs]
18/12/03 07:17:57 INFO storage.BlockManagerInfo: Added rdd_10_58 in memory on bdgtr002d16i2u.nam.nsroot.net:41262 (size: 247.2 MB, free: 9.6 GB)
18/12/03 07:18:17 INFO scheduler.TaskSetManager: Finished task 58.0 in stage 4.0 (TID 186) in 9030124 ms on bdgtr002d16i2u.nam.nsroot.net (executor 25) (62/64)
18/12/03 07:21:11 INFO storage.BlockManagerInfo: Added rdd_10_0 in memory on bdgtr001d01h1u.nam.nsroot.net:41190 (size: 515.8 MB, free: 10.0 GB)
18/12/03 07:21:49 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 128) in 9241836 ms on bdgtr001d01h1u.nam.nsroot.net (executor 8) (63/64)
10837.013: [Full GC (System.gc()) 1095M->297M(25G), 0.7272104 secs]
18/12/03 07:51:01 INFO storage.BlockManagerInfo: Added rdd_10_59 in memory on bdgtr009d08i2u.nam.nsroot.net:44716 (size: 287.4 MB, free: 9.4 GB)

Related

Do not see data written from Spark in ADX

I am using azure-kusto-spark to write data to ADX. I can see the schema created in ADX, but I do not see any data, and there is no error in the log. Note that I am trying this with local Spark.
df.show();
df.write()
.format("com.microsoft.kusto.spark.datasource")
.option(KustoSinkOptions.KUSTO_CLUSTER(), cluster)
.option(KustoSinkOptions.KUSTO_DATABASE(), db)
.option(KustoSinkOptions.KUSTO_TABLE(), table)
.option(KustoSinkOptions.KUSTO_AAD_APP_ID(), client_id)
.option(KustoSinkOptions.KUSTO_AAD_APP_SECRET(), client_key)
.option(KustoSinkOptions.KUSTO_AAD_AUTHORITY_ID(), "microsoft.com")
.option(KustoSinkOptions.KUSTO_TABLE_CREATE_OPTIONS(), "CreateIfNotExist")
.mode(SaveMode.Append)
.save();
22/12/13 12:06:14 INFO QueuedIngestClient: Creating a new IngestClient
22/12/13 12:06:14 INFO ResourceManager: Refreshing Ingestion Auth Token
22/12/13 12:06:16 INFO ResourceManager: Refreshing Ingestion Resources
22/12/13 12:06:16 INFO KustoConnector: ContainerProvider: Got 2 storage SAS with command :'.create tempstorage'. from service 'ingest-engineermetricdata.eastus'
22/12/13 12:06:16 INFO KustoConnector: ContainerProvider: Got 2 storage SAS with command :'.create tempstorage'. from service 'ingest-engineermetricdata.eastus'
22/12/13 12:06:16 INFO KustoConnector: KustoWriter$: finished serializing rows in partition 0 for requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:16 INFO KustoConnector: KustoWriter$: finished serializing rows in partition 1 for requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:17 INFO KustoConnector: KustoWriter$: Ingesting from blob - partition: 0 requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:17 INFO KustoConnector: KustoWriter$: Ingesting from blob - partition: 1 requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:19 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2135 bytes result sent to driver
22/12/13 12:06:19 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2135 bytes result sent to driver
22/12/13 12:06:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6306 ms on 192.168.50.160 (executor driver) (1/2)
22/12/13 12:06:19 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6231 ms on 192.168.50.160 (executor driver) (2/2)
22/12/13 12:06:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/12/13 12:06:19 INFO DAGScheduler: ResultStage 0 (foreachPartition at KustoWriter.scala:107) finished in 7.070 s
22/12/13 12:06:19 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/12/13 12:06:19 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
22/12/13 12:06:19 INFO DAGScheduler: Job 0 finished: foreachPartition at KustoWriter.scala:107, took 7.157414 s
22/12/13 12:06:19 INFO KustoConnector: KustoClient: Polling on ingestion results for requestId: 9065b634-3b74-4993-830b-16ee534409d5, will move data to destination table when finished
22/12/13 12:13:30 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.50.160:56364 in memory (size: 4.9 KiB, free: 2004.6 MiB)
Local Spark writes data to ADX
The following code works. Tested on Azure Databricks 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12), using connector com.microsoft.azure.kusto:kusto-spark_3.0_2.12:3.1.6.
import com.microsoft.kusto.spark.datasink.KustoSinkOptions
import org.apache.spark.sql.{SaveMode, SparkSession}
val cluster = "..."
val client_id = "..."
val client_key = "..."
val authority = "..."
val db = "mydb"
val table = "mytable"
val df = spark.range(10)
df.show()
df.write
.format("com.microsoft.kusto.spark.datasource")
.option(KustoSinkOptions.KUSTO_CLUSTER, cluster)
.option(KustoSinkOptions.KUSTO_DATABASE, db)
.option(KustoSinkOptions.KUSTO_TABLE, table)
.option(KustoSinkOptions.KUSTO_AAD_APP_ID, client_id)
.option(KustoSinkOptions.KUSTO_AAD_APP_SECRET, client_key)
.option(KustoSinkOptions.KUSTO_AAD_AUTHORITY_ID, authority)
.option(KustoSinkOptions.KUSTO_TABLE_CREATE_OPTIONS, "CreateIfNotExist")
.mode(SaveMode.Append)
.save()
The ingestion time depends on the ingestion batching policy of the table.
Defaults and limits
Type            | Property                | Default | Low latency setting | Minimum value | Maximum value
Number of items | MaximumNumberOfItems    | 1000    | 1000                | 1             | 25,000
Data size (MB)  | MaximumRawDataSizeMB    | 1024    | 1024                | 100           | 4096
Time (sec)      | MaximumBatchingTimeSpan | 300     | 20 - 30             | 10            | 1800

spark-submit --packages returns Error: Missing application resource

I installed .NET for Apache Spark using the following guide:
https://learn.microsoft.com/en-us/dotnet/spark/tutorials/get-started?WT.mc_id=dotnet-35129-website&tabs=windows
The Hello World worked.
Now I am trying to connect to and read from a Kafka cluster.
The following sample code should be able to get me connected to a Confluent Cloud Kafka cluster:
var df = spark
.ReadStream()
.Format("kafka")
.Option("kafka.bootstrap.servers", "my-bootstrap-server:9092")
.Option("subscribe", "wallet_txn_log")
.Option("startingOffsets", "earliest")
.Option("kafka.security.protocol", "SASL_SSL")
.Option("kafka.sasl.mechanism", "PLAIN")
.Option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"xxx\" password=\"xxx\";")
.Load();
When running the code, I get the following error:
Failed to find data source: kafka. Please deploy the application as
per the deployment section of "Structured Streaming + Kafka
Integration Guide".
The guide says that I need to add the spark-sql-kafka library in the correct version:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.13:3.2.1
When I run that, I get this error:
C:\Code\MySparkApp\bin\Debug\net6.0>spark-submit --packages
org.apache.spark:spark-sql-kafka-0-10_2.13:3.2.1 Error: Missing
application resource.
I have installed spark-3.2.1-bin-hadoop2.7.
I assume that spark-submit is not able to pull the correct package from Maven.
How do I proceed from here?
Edit 1:
I figured I should include --packages in the full "run" command.
Here is the latest command:
C:\Code\MySparkApp\bin\Debug\net6.0>spark-submit --class
org.apache.spark.deploy.dotnet.DotnetRunner --master local
C:\Code\MySparkApp\bin\Debug\net6.0\microsoft-spark-3-2_2.12-2.1.1.jar
dotnet MySparkApp.dll C:\Code\MySparkApp\input.txt --packages
org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1
Now again it is giving the error:
Failed to find data source: kafka
Maybe this is the wrong way to reference the Kafka library in a Spark .NET application?
Log output:
C:\Code\MySparkApp\bin\Debug\net6.0>spark-submit --class
> org.apache.spark.deploy.dotnet.DotnetRunner --master local
> C:\Code\MySparkApp\bin\Debug\net6.0\microsoft-spark-3-2_2.12-2.1.1.jar
> dotnet MySparkApp.dll C:\Code\MySparkApp\input.txt --packages
> org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1 Using Spark's default
> log4j profile: org/apache/spark/log4j-defaults.properties 22/10/06
> 18:57:07 INFO DotnetRunner: Starting DotnetBackend with dotnet.
> 22/10/06 18:57:07 INFO DotnetBackend: The number of DotnetBackend
> threads is set to 10. 22/10/06 18:57:08 INFO DotnetRunner: Port number
> used by DotnetBackend is 55998 22/10/06 18:57:08 INFO DotnetRunner:
> Adding key=spark.jars and
> value=file:/C:/Code/MySparkApp/bin/Debug/net6.0/microsoft-spark-3-2_2.12-2.1.1.jar
> to environment 22/10/06 18:57:08 INFO DotnetRunner: Adding
> key=spark.app.name and
> value=org.apache.spark.deploy.dotnet.DotnetRunner to environment
> 22/10/06 18:57:08 INFO DotnetRunner: Adding key=spark.submit.pyFiles
> and value= to environment 22/10/06 18:57:08 INFO DotnetRunner: Adding
> key=spark.submit.deployMode and value=client to environment 22/10/06
> 18:57:08 INFO DotnetRunner: Adding key=spark.master and value=local to
> environment [2022-10-06T16:57:08.2893549Z] [DESKTOP-PR6Q966] [Info]
> [ConfigurationService] Using port 55998 for connection.
> [2022-10-06T16:57:08.2932382Z] [DESKTOP-PR6Q966] [Info] [JvmBridge]
> JvMBridge port is 55998 [2022-10-06T16:57:08.2943994Z]
> [DESKTOP-PR6Q966] [Info] [JvmBridge] The number of JVM backend thread
> is set to 10. The max number of concurrent sockets in JvmBridge is set
> to 7. 22/10/06 18:57:08 INFO SparkContext: Running Spark version 3.2.1
> 22/10/06 18:57:08 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where
> applicable 22/10/06 18:57:08 INFO ResourceUtils:
> ============================================================== 22/10/06 18:57:08 INFO ResourceUtils: No custom resources configured
> for spark.driver. 22/10/06 18:57:08 INFO ResourceUtils:
> ============================================================== 22/10/06 18:57:08 INFO SparkContext: Submitted application:
> word_count_sample 22/10/06 18:57:08 INFO ResourceProfile: Default
> ResourceProfile created, executor resources: Map(cores -> name: cores,
> amount: 1, script: , vendor: , memory -> name: memory, amount: 1024,
> script: , vendor: , offHeap -> name: offHeap, amount: 0, script: ,
> vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
> 22/10/06 18:57:08 INFO ResourceProfile: Limiting resource is cpu
> 22/10/06 18:57:08 INFO ResourceProfileManager: Added ResourceProfile
> id: 0 22/10/06 18:57:08 INFO SecurityManager: Changing view acls to:
> Kenan 22/10/06 18:57:08 INFO SecurityManager: Changing modify acls to:
> Kenan 22/10/06 18:57:08 INFO SecurityManager: Changing view acls
> groups to: 22/10/06 18:57:08 INFO SecurityManager: Changing modify
> acls groups to: 22/10/06 18:57:08 INFO SecurityManager:
> SecurityManager: authentication disabled; ui acls disabled; users
> with view permissions: Set(Kenan); groups with view permissions:
> Set(); users with modify permissions: Set(Kenan); groups with modify
> permissions: Set() 22/10/06 18:57:08 INFO Utils: Successfully started
> service 'sparkDriver' on port 56006. 22/10/06 18:57:08 INFO SparkEnv:
> Registering MapOutputTracker 22/10/06 18:57:08 INFO SparkEnv:
> Registering BlockManagerMaster 22/10/06 18:57:08 INFO
> BlockManagerMasterEndpoint: Using
> org.apache.spark.storage.DefaultTopologyMapper for getting topology
> information 22/10/06 18:57:08 INFO BlockManagerMasterEndpoint:
> BlockManagerMasterEndpoint up 22/10/06 18:57:08 INFO SparkEnv:
> Registering BlockManagerMasterHeartbeat 22/10/06 18:57:08 INFO
> DiskBlockManager: Created local directory at
> C:\Users\Kenan\AppData\Local\Temp\blockmgr-ca3af1bf-634a-45b2-879d-ca2c6db97299
> 22/10/06 18:57:08 INFO MemoryStore: MemoryStore started with capacity
> 366.3 MiB 22/10/06 18:57:08 INFO SparkEnv: Registering OutputCommitCoordinator 22/10/06 18:57:09 INFO Utils: Successfully
> started service 'SparkUI' on port 4040. 22/10/06 18:57:09 INFO
> SparkUI: Bound SparkUI to 0.0.0.0, and started at
> http://DESKTOP-PR6Q966.mshome.net:4040 22/10/06 18:57:09 INFO
> SparkContext: Added JAR
> file:/C:/Code/MySparkApp/bin/Debug/net6.0/microsoft-spark-3-2_2.12-2.1.1.jar
> at
> spark://DESKTOP-PR6Q966.mshome.net:56006/jars/microsoft-spark-3-2_2.12-2.1.1.jar
> with timestamp 1665075428422 22/10/06 18:57:09 INFO Executor: Starting
> executor ID driver on host DESKTOP-PR6Q966.mshome.net 22/10/06
> 18:57:09 INFO Executor: Fetching
> spark://DESKTOP-PR6Q966.mshome.net:56006/jars/microsoft-spark-3-2_2.12-2.1.1.jar
> with timestamp 1665075428422 22/10/06 18:57:09 INFO
> TransportClientFactory: Successfully created connection to
> DESKTOP-PR6Q966.mshome.net/172.24.208.1:56006 after 11 ms (0 ms spent
> in bootstraps) 22/10/06 18:57:09 INFO Utils: Fetching
> spark://DESKTOP-PR6Q966.mshome.net:56006/jars/microsoft-spark-3-2_2.12-2.1.1.jar
> to
> C:\Users\Kenan\AppData\Local\Temp\spark-91d1752d-a8f0-42c7-a340-e4e7c3ea84b0\userFiles-6a2073f2-d8d9-4a42-8aac-b5c0c7142763\fetchFileTemp6627445237981542962.tmp
> 22/10/06 18:57:09 INFO Executor: Adding
> file:/C:/Users/Kenan/AppData/Local/Temp/spark-91d1752d-a8f0-42c7-a340-e4e7c3ea84b0/userFiles-6a2073f2-d8d9-4a42-8aac-b5c0c7142763/microsoft-spark-3-2_2.12-2.1.1.jar
> to class loader 22/10/06 18:57:09 INFO Utils: Successfully started
> service 'org.apache.spark.network.netty.NettyBlockTransferService' on
> port 56030. 22/10/06 18:57:09 INFO NettyBlockTransferService: Server
> created on DESKTOP-PR6Q966.mshome.net:56030 22/10/06 18:57:09 INFO
> BlockManager: Using
> org.apache.spark.storage.RandomBlockReplicationPolicy for block
> replication policy 22/10/06 18:57:09 INFO BlockManagerMaster:
> Registering BlockManager BlockManagerId(driver,
> DESKTOP-PR6Q966.mshome.net, 56030, None) 22/10/06 18:57:09 INFO
> BlockManagerMasterEndpoint: Registering block manager
> DESKTOP-PR6Q966.mshome.net:56030 with 366.3 MiB RAM,
> BlockManagerId(driver, DESKTOP-PR6Q966.mshome.net, 56030, None)
> 22/10/06 18:57:09 INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, DESKTOP-PR6Q966.mshome.net, 56030, None)
> 22/10/06 18:57:09 INFO BlockManager: Initialized BlockManager:
> BlockManagerId(driver, DESKTOP-PR6Q966.mshome.net, 56030, None)
> 22/10/06 18:57:09 INFO SharedState: Setting
> hive.metastore.warehouse.dir ('null') to the value of
> spark.sql.warehouse.dir. 22/10/06 18:57:09 INFO SharedState: Warehouse
> path is 'file:/C:/Code/MySparkApp/bin/Debug/net6.0/spark-warehouse'.
> 22/10/06 18:57:10 INFO InMemoryFileIndex: It took 21 ms to list leaf
> files for 1 paths. 22/10/06 18:57:12 INFO FileSourceStrategy: Pushed
> Filters: 22/10/06 18:57:12 INFO FileSourceStrategy: Post-Scan Filters:
> (size(split(value#0, , -1), true) > 0),isnotnull(split(value#0, ,
> -1)) 22/10/06 18:57:12 INFO FileSourceStrategy: Output Data Schema: struct<value: string> 22/10/06 18:57:12 INFO CodeGenerator: Code
> generated in 181.3829 ms 22/10/06 18:57:12 INFO MemoryStore: Block
> broadcast_0 stored as values in memory (estimated size 286.3 KiB, free
> 366.0 MiB) 22/10/06 18:57:12 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.1 KiB,
> free 366.0 MiB) 22/10/06 18:57:12 INFO BlockManagerInfo: Added
> broadcast_0_piece0 in memory on DESKTOP-PR6Q966.mshome.net:56030
> (size: 24.1 KiB, free: 366.3 MiB) 22/10/06 18:57:12 INFO SparkContext:
> Created broadcast 0 from showString at <unknown>:0 22/10/06 18:57:12
> INFO FileSourceScanExec: Planning scan with bin packing, max size:
> 4194406 bytes, open cost is considered as scanning 4194304 bytes.
> 22/10/06 18:57:12 INFO DAGScheduler: Registering RDD 3 (showString at
> <unknown>:0) as input to shuffle 0 22/10/06 18:57:12 INFO
> DAGScheduler: Got map stage job 0 (showString at <unknown>:0) with 1
> output partitions 22/10/06 18:57:12 INFO DAGScheduler: Final stage:
> ShuffleMapStage 0 (showString at <unknown>:0) 22/10/06 18:57:12 INFO
> DAGScheduler: Parents of final stage: List() 22/10/06 18:57:12 INFO
> DAGScheduler: Missing parents: List() 22/10/06 18:57:12 INFO
> DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at
> showString at <unknown>:0), which has no missing parents 22/10/06
> 18:57:12 INFO MemoryStore: Block broadcast_1 stored as values in
> memory (estimated size 38.6 KiB, free 366.0 MiB) 22/10/06 18:57:12
> INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory
> (estimated size 17.6 KiB, free 365.9 MiB) 22/10/06 18:57:12 INFO
> BlockManagerInfo: Added broadcast_1_piece0 in memory on
> DESKTOP-PR6Q966.mshome.net:56030 (size: 17.6 KiB, free: 366.3 MiB)
> 22/10/06 18:57:12 INFO SparkContext: Created broadcast 1 from
> broadcast at DAGScheduler.scala:1478 22/10/06 18:57:13 INFO
> DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0
> (MapPartitionsRDD[3] at showString at <unknown>:0) (first 15 tasks are
> for partitions Vector(0)) 22/10/06 18:57:13 INFO TaskSchedulerImpl:
> Adding task set 0.0 with 1 tasks resource profile 0 22/10/06 18:57:13
> INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0)
> (DESKTOP-PR6Q966.mshome.net, executor driver, partition 0,
> PROCESS_LOCAL, 4850 bytes) taskResourceAssignments Map() 22/10/06
> 18:57:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 22/10/06
> 18:57:13 INFO CodeGenerator: Code generated in 10.268 ms 22/10/06
> 18:57:13 INFO CodeGenerator: Code generated in 4.9722 ms 22/10/06
> 18:57:13 INFO CodeGenerator: Code generated in 6.0205 ms 22/10/06
> 18:57:13 INFO CodeGenerator: Code generated in 5.18 ms 22/10/06
> 18:57:13 INFO FileScanRDD: Reading File path:
> file:///C:/Code/MySparkApp/input.txt, range: 0-102, partition values:
> [empty row] 22/10/06 18:57:13 INFO LineRecordReader: Found UTF-8 BOM
> and skipped it 22/10/06 18:57:13 INFO Executor: Finished task 0.0 in
> stage 0.0 (TID 0). 2845 bytes result sent to driver 22/10/06 18:57:13
> INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 319 ms
> on DESKTOP-PR6Q966.mshome.net (executor driver) (1/1) 22/10/06
> 18:57:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have
> all completed, from pool 22/10/06 18:57:13 INFO DAGScheduler:
> ShuffleMapStage 0 (showString at <unknown>:0) finished in 0.379 s
> 22/10/06 18:57:13 INFO DAGScheduler: looking for newly runnable stages
> 22/10/06 18:57:13 INFO DAGScheduler: running: Set() 22/10/06 18:57:13
> INFO DAGScheduler: waiting: Set() 22/10/06 18:57:13 INFO DAGScheduler:
> failed: Set() 22/10/06 18:57:13 INFO ShufflePartitionsUtil: For
> shuffle(0), advisory target size: 67108864, actual target size
> 1048576, minimum partition size: 1048576 22/10/06 18:57:13 INFO
> CodeGenerator: Code generated in 11.5441 ms 22/10/06 18:57:13 INFO
> HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enabled is
> set to true, but current version of codegened fast hashmap does not
> support this aggregate. 22/10/06 18:57:13 INFO CodeGenerator: Code
> generated in 10.7919 ms 22/10/06 18:57:13 INFO SparkContext: Starting
> job: showString at <unknown>:0 22/10/06 18:57:13 INFO DAGScheduler:
> Got job 1 (showString at <unknown>:0) with 1 output partitions
> 22/10/06 18:57:13 INFO DAGScheduler: Final stage: ResultStage 2
> (showString at <unknown>:0) 22/10/06 18:57:13 INFO DAGScheduler:
> Parents of final stage: List(ShuffleMapStage 1) 22/10/06 18:57:13 INFO
> DAGScheduler: Missing parents: List() 22/10/06 18:57:13 INFO
> DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[7] at
> showString at <unknown>:0), which has no missing parents 22/10/06
> 18:57:13 INFO MemoryStore: Block broadcast_2 stored as values in
> memory (estimated size 37.4 KiB, free 365.9 MiB) 22/10/06 18:57:13
> INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory
> (estimated size 17.7 KiB, free 365.9 MiB) 22/10/06 18:57:13 INFO
> BlockManagerInfo: Added broadcast_2_piece0 in memory on
> DESKTOP-PR6Q966.mshome.net:56030 (size: 17.7 KiB, free: 366.2 MiB)
> 22/10/06 18:57:13 INFO SparkContext: Created broadcast 2 from
> broadcast at DAGScheduler.scala:1478 22/10/06 18:57:13 INFO
> DAGScheduler: Submitting 1 missing tasks from ResultStage 2
> (MapPartitionsRDD[7] at showString at <unknown>:0) (first 15 tasks are
> for partitions Vector(0)) 22/10/06 18:57:13 INFO TaskSchedulerImpl:
> Adding task set 2.0 with 1 tasks resource profile 0 22/10/06 18:57:13
> INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 1)
> (DESKTOP-PR6Q966.mshome.net, executor driver, partition 0, NODE_LOCAL,
> 4453 bytes) taskResourceAssignments Map() 22/10/06 18:57:13 INFO
> Executor: Running task 0.0 in stage 2.0 (TID 1) 22/10/06 18:57:13 INFO
> BlockManagerInfo: Removed broadcast_1_piece0 on
> DESKTOP-PR6Q966.mshome.net:56030 in memory (size: 17.6 KiB, free:
> 366.3 MiB) 22/10/06 18:57:13 INFO ShuffleBlockFetcherIterator: Getting 1 (864.0 B) non-empty blocks including 1 (864.0 B) local and 0 (0.0 B)
> host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
> 22/10/06 18:57:13 INFO ShuffleBlockFetcherIterator: Started 0 remote
> fetches in 8 ms 22/10/06 18:57:13 INFO Executor: Finished task 0.0 in
> stage 2.0 (TID 1). 6732 bytes result sent to driver 22/10/06 18:57:13
> INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 1) in 124 ms
> on DESKTOP-PR6Q966.mshome.net (executor driver) (1/1) 22/10/06
> 18:57:13 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have
> all completed, from pool 22/10/06 18:57:13 INFO DAGScheduler:
> ResultStage 2 (showString at <unknown>:0) finished in 0.136 s 22/10/06
> 18:57:13 INFO DAGScheduler: Job 1 is finished. Cancelling potential
> speculative or zombie tasks for this job 22/10/06 18:57:13 INFO
> TaskSchedulerImpl: Killing all running tasks in stage 2: Stage
> finished 22/10/06 18:57:13 INFO DAGScheduler: Job 1 finished:
> showString at <unknown>:0, took 0.149812 s 22/10/06 18:57:13 INFO
> CodeGenerator: Code generated in 7.0234 ms 22/10/06 18:57:13 INFO
> CodeGenerator: Code generated in 7.0701 ms
> +------+-----+ | word|count|
> +------+-----+ | .NET| 3| |Apache| 2| | This| 2| | Spark| 2| | app| 2| | World| 1| | for| 1| |counts| 1| |
> words| 1| | with| 1| | uses| 1| | Hello| 1|
> +------+-----+
>
> Moo 22/10/06 18:57:13 ERROR DotnetBackendHandler: Failed to execute
> 'load' on 'org.apache.spark.sql.streaming.DataStreamReader' with
> args=() [2022-10-06T16:57:13.6895055Z] [DESKTOP-PR6Q966] [Error]
> [JvmBridge] JVM method execution failed: Nonstatic method 'load'
> failed for class '22' when called with no arguments
> [2022-10-06T16:57:13.6895347Z] [DESKTOP-PR6Q966] [Error] [JvmBridge]
> org.apache.spark.sql.AnalysisException: Failed to find data source:
> kafka. Please deploy the application as per the deployment section of
> "Structured Streaming + Kafka Integration Guide".
> at org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindKafkaDataSourceError(QueryCompilationErrors.scala:1037)
> at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:668)
> at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:156)
> at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:143)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.apache.spark.api.dotnet.DotnetBackendHandler.handleMethodCall(DotnetBackendHandler.scala:165)
> at org.apache.spark.api.dotnet.DotnetBackendHandler.$anonfun$handleBackendRequest$2(DotnetBackendHandler.scala:105)
> at org.apache.spark.api.dotnet.ThreadPool$$anon$1.run(ThreadPool.scala:34)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
>
> [2022-10-06T16:57:13.6986588Z] [DESKTOP-PR6Q966] [Exception]
> [JvmBridge] JVM method execution failed: Nonstatic method 'load'
> failed for class '22' when called with no arguments at
> Microsoft.Spark.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic,
> Object classNameOrJvmObjectReference, String methodName, Object[]
> args) Unhandled exception. System.Exception: JVM method execution
> failed: Nonstatic method 'load' failed for class '22' when called with
> no arguments ---> Microsoft.Spark.JvmException:
> org.apache.spark.sql.AnalysisException: Failed to find data source:
> kafka. Please deploy the application as per the deployment section of
> "Structured Streaming + Kafka Integration Guide".
> at org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindKafkaDataSourceError(QueryCompilationErrors.scala:1037)
> at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:668)
> at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:156)
> at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:143)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.apache.spark.api.dotnet.DotnetBackendHandler.handleMethodCall(DotnetBackendHandler.scala:165)
> at org.apache.spark.api.dotnet.DotnetBackendHandler.$anonfun$handleBackendRequest$2(DotnetBackendHandler.scala:105)
> at org.apache.spark.api.dotnet.ThreadPool$$anon$1.run(ThreadPool.scala:34)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at
..
--packages must be supplied before --class. Look at the Mongo example.
Otherwise, it is passed to the main method as arguments along with your other application arguments - C:\Code\MySparkApp\bin\Debug\net6.0\microsoft-spark-3-2_2.12-2.1.1.jar dotnet MySparkApp.dll C:\Code\MySparkApp\input.txt. Print the main method arguments to debug further...
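For example, reordering the command above (same jar, app and arguments, with --packages moved in front of --class) would look like this:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1 --class org.apache.spark.deploy.dotnet.DotnetRunner --master local C:\Code\MySparkApp\bin\Debug\net6.0\microsoft-spark-3-2_2.12-2.1.1.jar dotnet MySparkApp.dll C:\Code\MySparkApp\input.txt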
You can also set spark.jars.packages in your SparkSession Config options.
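For illustration, in Scala syntax that configuration approach looks roughly like the sketch below (the .NET session builder exposes a similar Config option); note that spark.jars.packages only takes effect if it is set before the underlying SparkContext is created.
import org.apache.spark.sql.SparkSession

// Resolve the Kafka connector at session startup instead of via the spark-submit flag.
val spark = SparkSession.builder()
  .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1")
  .getOrCreate()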
Regarding versions: it is unclear which Scala version you have, but spark-sql-kafka-0-10_2.12:3.2.1 is correct for Spark 3.2.1 with Scala 2.12, which seems to match your Microsoft JAR.

Does Spark create two datasets or stages that work on the same logic?

I was trying to read from a CSV file and insert those entries into a database.
I found that internally Spark created two RDDs, i.e. rdd_0_0 and rdd_0_1, that work on the same data and do all the processing.
Can anyone help me figure out why the call method is called twice by different datasets?
If two datasets/stages are created, why are both of them working on the same logic?
Please help me confirm whether this is how Spark works.
public final class TestJavaAggregation1 implements Serializable {
private static final long serialVersionUID = 1L;
static CassandraConfig config = null;
static PreparedStatement statement = null;
private transient SparkConf conf;
private PersonAggregationRowWriterFactory aggregationWriter = new PersonAggregationRowWriterFactory();
public Session session;
private TestJavaAggregation1(SparkConf conf) {
this.conf = conf;
}
public static void main(String[] args) throws Exception {
SparkConf conf = new SparkConf().setAppName("REadFromCSVFile").setMaster("local[1]").set("spark.executor.memory", "1g");
conf.set("spark.cassandra.connection.host", "localhost");
TestJavaAggregation1 app = new TestJavaAggregation1(conf);
app.run();
}
private void run() {
JavaSparkContext sc = new JavaSparkContext(conf);
aggregateData(sc);
sc.stop();
}
private JavaRDD<String> sparkConfig(JavaSparkContext sc) {
JavaRDD<String> lines = sc.textFile("PersonAggregation1_500.csv", 1);
System.out.println(lines.getCheckpointFile());
lines.cache();
final String heading = lines.first();
System.out.println(heading);
String headerValues = heading.replaceAll("\t", ",");
System.out.println(headerValues);
CassandraConnector connector = CassandraConnector.apply(sc.getConf());
Session session = connector.openSession();
try {
session.execute("DROP KEYSPACE IF EXISTS java_api5");
session.execute("CREATE KEYSPACE java_api5 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
session.execute("CREATE TABLE java_api5.person (hashvalue INT, id INT, state TEXT, city TEXT, country TEXT, full_name TEXT, PRIMARY KEY((hashvalue), id, state, city, country, full_name)) WITH CLUSTERING ORDER BY (id DESC);");
} catch (Exception ex) {
ex.printStackTrace();
}
return lines;
}
@SuppressWarnings("serial")
public void aggregateData(JavaSparkContext sc) {
JavaRDD<String> lines = sparkConfig(sc);
System.out.println("FirstRDD" + lines.partitions().size());
JavaRDD<PersonAggregation> result = lines.map(new Function<String, PersonAggregation>() {
int i = 0;
public PersonAggregation call(String row) {
PersonAggregation aggregate = new PersonAggregation();
row = row + "," + this.hashCode();
String[] parts = row.split(",");
aggregate.setId(Integer.valueOf(parts[0]));
aggregate.setFull_name(parts[1]);
aggregate.setState(parts[4]);
aggregate.setCity(parts[5]);
aggregate.setCountry(parts[6]);
aggregate.setHashValue(Integer.valueOf(parts[7]));
// The save below inserts 200 entries into the database while the CSV file has only 100 records.
saveToJavaCassandra(aggregate);
return aggregate;
}
});
System.out.println(result.collect().size());
List<PersonAggregation> personAggregationList = result.collect();
JavaRDD<PersonAggregation> aggregateRDD = sc.parallelize(personAggregationList);
javaFunctions(aggregateRDD).writerBuilder("java_api5", "person",
aggregationWriter).saveToCassandra();
}
}
Please find the logs below too:
15/05/29 12:40:37 INFO FileInputFormat: Total input paths to process : 1
15/05/29 12:40:37 INFO SparkContext: Starting job: first at TestJavaAggregation1.java:89
15/05/29 12:40:37 INFO DAGScheduler: Got job 0 (first at TestJavaAggregation1.java:89) with 1 output partitions (allowLocal=true)
15/05/29 12:40:37 INFO DAGScheduler: Final stage: Stage 0(first at TestJavaAggregation1.java:89)
15/05/29 12:40:37 INFO DAGScheduler: Parents of final stage: List()
15/05/29 12:40:37 INFO DAGScheduler: Missing parents: List()
15/05/29 12:40:37 INFO DAGScheduler: Submitting Stage 0 (PersonAggregation_5.csv MappedRDD[1] at textFile at TestJavaAggregation1.java:84), which has no missing parents
15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(2560) called with curMem=157187, maxMem=1009589944
15/05/29 12:40:37 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.5 KB, free 962.7 MB)
15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(1897) called with curMem=159747, maxMem=1009589944
15/05/29 12:40:37 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1897.0 B, free 962.7 MB)
15/05/29 12:40:37 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54664 (size: 1897.0 B, free: 962.8 MB)
15/05/29 12:40:37 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/05/29 12:40:37 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/05/29 12:40:37 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PersonAggregation_5.csv MappedRDD[1] at textFile at TestJavaAggregation1.java:84)
15/05/29 12:40:37 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/05/29 12:40:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1326 bytes)
15/05/29 12:40:37 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/05/29 12:40:37 INFO CacheManager: Partition rdd_1_0 not found, computing it
15/05/29 12:40:37 INFO HadoopRDD: Input split: file:/F:/workspace/apoorva/TestProject/PersonAggregation_5.csv:0+230
15/05/29 12:40:37 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/05/29 12:40:37 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/05/29 12:40:37 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/05/29 12:40:37 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/05/29 12:40:37 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(680) called with curMem=161644, maxMem=1009589944
15/05/29 12:40:37 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 680.0 B, free 962.7 MB)
15/05/29 12:40:37 INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:54664 (size: 680.0 B, free: 962.8 MB)
15/05/29 12:40:37 INFO BlockManagerMaster: Updated info of block rdd_1_0
15/05/29 12:40:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2335 bytes result sent to driver
15/05/29 12:40:37 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 73 ms on localhost (1/1)
15/05/29 12:40:37 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/05/29 12:40:37 INFO DAGScheduler: Stage 0 (first at TestJavaAggregation1.java:89) finished in 0.084 s
15/05/29 12:40:37 INFO DAGScheduler: Job 0 finished: first at TestJavaAggregation1.java:89, took 0.129536 s
1,FName1,MName1,LName1,state1,city1,country1
1,FName1,MName1,LName1,state1,city1,country1
15/05/29 12:40:37 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
15/05/29 12:40:37 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
FirstRDD1
SecondRDD1
15/05/29 12:40:47 INFO SparkContext: Starting job: collect at TestJavaAggregation1.java:147
15/05/29 12:40:47 INFO DAGScheduler: Got job 1 (collect at TestJavaAggregation1.java:147) with 1 output partitions (allowLocal=false)
15/05/29 12:40:47 INFO DAGScheduler: Final stage: Stage 1(collect at TestJavaAggregation1.java:147)
15/05/29 12:40:47 INFO DAGScheduler: Parents of final stage: List()
15/05/29 12:40:47 INFO DAGScheduler: Missing parents: List()
15/05/29 12:40:47 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[2] at map at TestJavaAggregation1.java:117), which has no missing parents
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(3872) called with curMem=162324, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.8 KB, free 962.7 MB)
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(2604) called with curMem=166196, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.5 KB, free 962.7 MB)
15/05/29 12:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54664 (size: 2.5 KB, free: 962.8 MB)
15/05/29 12:40:47 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/05/29 12:40:47 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/05/29 12:40:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[2] at map at TestJavaAggregation1.java:117)
15/05/29 12:40:47 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/05/29 12:40:47 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1326 bytes)
15/05/29 12:40:47 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/05/29 12:40:47 INFO BlockManager: Found block rdd_1_0 locally
com.local.myProj1.TestJavaAggregation1$1#2f877f16,797409046,state1,city1,country1
15/05/29 12:40:47 INFO DCAwareRoundRobinPolicy: Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
15/05/29 12:40:47 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
Connected to cluster: Test Cluster
Datacenter: datacenter1; Host: localhost/127.0.0.1; Rack: rack1
com.local.myProj1.TestJavaAggregation1$1#2f877f16,797409046,state2,city2,country1
com.local.myProj1.TestJavaAggregation1$1#2f877f16,797409046,state3,city3,country1
com.local.myProj1.TestJavaAggregation1$1#2f877f16,797409046,state4,city4,country1
com.local.myProj1.TestJavaAggregation1$1#2f877f16,797409046,state5,city5,country1
15/05/29 12:40:47 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2343 bytes result sent to driver
15/05/29 12:40:47 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 184 ms on localhost (1/1)
15/05/29 12:40:47 INFO DAGScheduler: Stage 1 (collect at TestJavaAggregation1.java:147) finished in 0.185 s
15/05/29 12:40:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/05/29 12:40:47 INFO DAGScheduler: Job 1 finished: collect at TestJavaAggregation1.java:147, took 0.218779 s
______________________________5_______________________________
15/05/29 12:40:47 INFO SparkContext: Starting job: collect at TestJavaAggregation1.java:150
15/05/29 12:40:47 INFO DAGScheduler: Got job 2 (collect at TestJavaAggregation1.java:150) with 1 output partitions (allowLocal=false)
15/05/29 12:40:47 INFO DAGScheduler: Final stage: Stage 2(collect at TestJavaAggregation1.java:150)
15/05/29 12:40:47 INFO DAGScheduler: Parents of final stage: List()
15/05/29 12:40:47 INFO DAGScheduler: Missing parents: List()
15/05/29 12:40:47 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[2] at map at TestJavaAggregation1.java:117), which has no missing parents
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(3872) called with curMem=168800, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.8 KB, free 962.7 MB)
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(2604) called with curMem=172672, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.5 KB, free 962.7 MB)
15/05/29 12:40:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:54664 (size: 2.5 KB, free: 962.8 MB)
15/05/29 12:40:47 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/05/29 12:40:47 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:838
15/05/29 12:40:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 2 (MappedRDD[2] at map at TestJavaAggregation1.java:117)
15/05/29 12:40:47 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/05/29 12:40:47 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1326 bytes)
15/05/29 12:40:47 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
15/05/29 12:40:47 INFO BlockManager: Found block rdd_1_0 locally
com.local.myProj1.TestJavaAggregation1$1#17b560af,397762735,state1,city1,country1
com.local.myProj1.TestJavaAggregation1$1#17b560af,397762735,state2,city2,country1
com.local.myProj1.TestJavaAggregation1$1#17b560af,397762735,state3,city3,country1
com.local.myProj1.TestJavaAggregation1$1#17b560af,397762735,state4,city4,country1
com.local.myProj1.TestJavaAggregation1$1#17b560af,397762735,state5,city5,country1
15/05/29 12:40:47 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 2343 bytes result sent to driver
15/05/29 12:40:47 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 16 ms on localhost (1/1)
15/05/29 12:40:47 INFO DAGScheduler: Stage 2 (collect at TestJavaAggregation1.java:150) finished in 0.016 s
15/05/29 12:40:47 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/05/29 12:40:47 INFO DAGScheduler: Job 2 finished: collect at TestJavaAggregation1.java:150, took 0.026302 s
When you run a Spark job on a cluster, Spark distributes the data across the cluster as RDDs, and the partitioning of that data is handled by Spark. When you create the lines RDD in your sparkConfig method by reading a file, Spark partitions the data and creates RDD partitions internally, so that in-memory computation is done over data distributed across the partitions in your cluster. Your JavaRDD lines is therefore internally a union of several RDD partitions. Hence, when you run a map over JavaRDD lines, it runs over all the data spread amongst the internal partitions of that RDD. In your case Spark created two internal partitions of the lines RDD, which is why the map function is called twice, once for each internal partition. Please tell me if you have any other questions.

Why does spark-shell throw ArrayIndexOutOfBoundsException when reading a large file from HDFS?

I am using Hadoop 2.4.1 and Spark 1.1.0. I uploaded a dataset of food reviews to HDFS from here, and then used the following code to read the file and process it in the Spark shell:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
var path = "hdfs:///user/hduser/finefoods.txt"
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "\n\n")
var dataset = sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf).map(_._2.toString)
var datasetObj = dataset.map{ rowStr => rowStr.split("\n")}
var tupleSet = datasetObj.map( strArr => strArr.map( elm => elm.split(": ")(1))).map( arr => (arr(0),arr(1),arr(4).toDouble))
tupleSet.groupBy(t => t._2)
When I run the last line tupleSet.groupBy(t => t._2), the spark shell throws the following exception:
scala> tupleSet.groupBy( t => t._2).first()
14/11/15 22:46:59 INFO spark.SparkContext: Starting job: first at <console>:28
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Registering RDD 11 (groupBy at <console>:28)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Got job 1 (first at <console>:28) with 1 output partitions (allowLocal=true)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Final stage: Stage 1(first at <console>:28)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 2)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Missing parents: List(Stage 2)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Submitting Stage 2 (MappedRDD[11] at groupBy at <console>:28), which has no missing parents
14/11/15 22:46:59 INFO storage.MemoryStore: ensureFreeSpace(3592) called with curMem=221261, maxMem=278302556
14/11/15 22:46:59 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.5 KB, free 265.2 MB)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from Stage 2 (MappedRDD[11] at groupBy at <console>:28)
14/11/15 22:46:59 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 3 tasks
14/11/15 22:46:59 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, localhost, ANY, 1221 bytes)
14/11/15 22:46:59 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 4, localhost, ANY, 1221 bytes)
14/11/15 22:46:59 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 3)
14/11/15 22:46:59 INFO executor.Executor: Running task 1.0 in stage 2.0 (TID 4)
14/11/15 22:46:59 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:0+134217728
14/11/15 22:46:59 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:134217728+134217728
14/11/15 22:47:02 ERROR executor.Executor: Exception in task 1.0 in stage 2.0 (TID 4)
java.lang.ArrayIndexOutOfBoundsException
14/11/15 22:47:02 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 5, localhost, ANY, 1221 bytes)
14/11/15 22:47:02 INFO executor.Executor: Running task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:268435456+102361028
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException:
14/11/15 22:47:02 ERROR scheduler.TaskSetManager: Task 1 in stage 2.0 failed 1 times; aborting job
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Cancelling stage 2
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Stage 2 was cancelled
14/11/15 22:47:02 INFO executor.Executor: Executor is trying to kill task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 INFO executor.Executor: Executor is trying to kill task 0.0 in stage 2.0 (TID 3)
14/11/15 22:47:02 INFO scheduler.DAGScheduler: Failed to run first at <console>:28
14/11/15 22:47:02 INFO executor.Executor: Executor killed task 0.0 in stage 2.0 (TID 3)
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, localhost): TaskKilled (killed intentionally)
14/11/15 22:47:02 INFO executor.Executor: Executor killed task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 2.0 (TID 5, localhost): TaskKilled (killed intentionally)
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
But when I use a dummy dataset like the following, it works fine:
var tupleSet = sc.parallelize(List(
("B001E4KFG0","A3SGXH7AUHU8GW",3.0),
("B001E4KFG1","A3SGXH7AUHU8GW",4.0),
("B001E4KFG2","A3SGXH7AUHU8GW",4.0),
("B001E4KFG3","A3SGXH7AUHU8GW",4.0),
("B001E4KFG4","A3SGXH7AUHU8GW",5.0),
("B001E4KFG5","A3SGXH7AUHU8GW",5.0),
("B001E4KFG0","bbb",5.0)
))
Any idea why?
There is probably an entry in the dataset that does not follow the format, so elm.split(": ")(1) fails because there is no element at that index.
You can avoid that error by checking the result of the split before accessing index (1). One way of doing that could be something like this:
var tupleSet = datasetObj.map(strArr => strArr.map(_.split(": ")).collect { case x if x.length > 1 => x(1) })
One note: Your examples do not seem to match the parsing pipeline in the code. They do not contain the ": " tokens.
Since transformations are lazy, Spark won't tell you much about your input dataset (and you may not notice the problem) until you execute an action like groupBy().
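A slightly fuller sketch of the same guard applied to the original pipeline (assuming the same datasetObj as in the question, and that a valid record has at least five ": "-delimited fields):
// Keep only rows with enough well-formed fields, then index safely.
val safeTuples = datasetObj
  .map(strArr => strArr.map(_.split(": ")))
  .filter(fields => fields.length > 4 && fields.forall(_.length > 1))
  .map(fields => (fields(0)(1), fields(1)(1), fields(4)(1).toDouble))  // toDouble may still fail on a malformed score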
It could also be due to empty/blank lines in your dataset, since you are applying a split function to the data. In that case, filter out the empty lines first, e.g.:
myrdd.filter(_.nonEmpty).map(...)
I had a similar problem when I was converting log data into a DataFrame using PySpark.
When a log entry was invalid, I returned a null value instead of a Row instance. Before converting to a DataFrame, I filtered out these null values, but I still got the above problem. Finally, the error went away when I returned a Row with null values instead of a single null value.
Pseudocode below:
Didn't work:
rdd = Parse log (log lines to Rows if valid else None)
filtered_rdd = rdd.filter(lambda x:x!=None)
logs = sqlContext.inferSchema(filtered_rdd)
Worked:
rdd = Parse log (log lines to Rows if valid else Row(None,None,...))
logs = sqlContext.inferSchema(rdd)
filtered_rdd = logs.filter(logs['id'].isNotNull())

How to run more executors in Apache Spark cluster mode

I have 50 workers and I would like to run my job on all of them.
In master:8080 I can see all the workers, and in master:4040/executors I can see 50 executors, but when I run my job the information shows up like this:
14/10/19 14:57:07 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/10/19 14:57:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, slave11, NODE_LOCAL, 1302 bytes)
14/10/19 14:57:07 INFO nio.ConnectionManager: Accepted connection from [slave11/10.10.10.21:42648]
14/10/19 14:57:07 INFO nio.SendingConnection: Initiating connection to [slave11/10.10.10.21:54398]
14/10/19 14:57:07 INFO nio.SendingConnection: Connected to [slave11/10.10.10.21:54398], 1 messages pending
14/10/19 14:57:07 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on slave11:54398 (size: 2.4 KB, free: 267.3 MB)
14/10/19 14:57:08 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slave11:54398 (size: 18.4 KB, free: 267.2 MB)
14/10/19 14:57:12 INFO storage.BlockManagerInfo: Added rdd_2_0 in memory on slave11:54398 (size: 87.4 MB, free: 179.8 MB)
14/10/19 14:57:12 INFO scheduler.DAGScheduler: Stage 0 (first at GeneralizedLinearAlgorithm.scala:141) finished in 5.473 s
14/10/19 14:57:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 5463 ms on slave11 (1/1)
14/10/19 14:57:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
And my job code is like this (command line):
master: $ ./spark-shell --master spark://master:7077
and this (Scala code):
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
val fileName = "bc.txt"
val data = sc.textFile(fileName)
val splits = data.randomSplit(Array(0.9, 0.1), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
val training_1 = training.map { line =>
val parts = line.split(' ')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble).toArray))
}
val test_1 = test.map { line =>
val parts = line.split(' ')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble).toArray))
}
val numIterations = 200
val model = SVMWithSGD.train(training_1, numIterations)
My question is: why do only one or two (sometimes) tasks run on my cluster?
Is there any way to configure the number of tasks, or is it scheduled automatically by the scheduler?
When my job runs with two tasks, it runs on two executors (which I observe on master:4040) and gives a 2x speedup, so I want to run my job on all executors. How can I do that?
Thanks everyone.
You can use the minPartitions parameter of textFile to set the minimum number of tasks, for example:
val data = sc.textFile(fileName, 10)
However, more partitions usually mean more network traffic, because with more partitions it is harder for Spark to dispatch every task to an executor that holds the data locally. You need to find a balanced value of minPartitions yourself.
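A related option (not from the answer above, just standard Spark) is to repartition the cached training set so that more tasks, and therefore more executors, take part in the iterations:
// Spread the cached training data across roughly one partition per worker.
val training = splits(0).repartition(50).cache()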
