saveAsTable in Spark 1.4 is not working as expected - apache-spark

I want to save a DataFrame as a table, using the following commands:
>>> access_df = sqlContext.read.load("hdfs://10.0.0.220/user/nanounanue/access", format="parquet")
>>> df_writer = pyspark.sql.DataFrameWriter(access_df)
>>> df_writer.saveAsTable('test_access', format='parquet', mode='overwrite')
But when I run the last line, I get the following stack trace:
15/06/24 13:21:38 INFO HiveMetaStore: 0: get_table : db=default tbl=test_access
15/06/24 13:21:38 INFO audit: ugi=nanounanue ip=unknown-ip-addr cmd=get_table : db=default tbl=test_access
15/06/24 13:21:38 INFO HiveMetaStore: 0: get_table : db=default tbl=test_access
15/06/24 13:21:38 INFO audit: ugi=nanounanue ip=unknown-ip-addr cmd=get_table : db=default tbl=test_access
15/06/24 13:21:38 INFO HiveMetaStore: 0: get_database: default
15/06/24 13:21:38 INFO audit: ugi=nanounanue ip=unknown-ip-addr cmd=get_database: default
15/06/24 13:21:38 INFO HiveMetaStore: 0: get_table : db=default tbl=test_access
15/06/24 13:21:38 INFO audit: ugi=nanounanue ip=unknown-ip-addr cmd=get_table : db=default tbl=test_access
15/06/24 13:21:38 INFO MemoryStore: ensureFreeSpace(231024) called with curMem=343523, maxMem=278302556
15/06/24 13:21:38 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 225.6 KB, free 264.9 MB)
15/06/24 13:21:38 INFO MemoryStore: ensureFreeSpace(19848) called with curMem=574547, maxMem=278302556
15/06/24 13:21:38 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.4 KB, free 264.8 MB)
15/06/24 13:21:38 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:44271 (size: 19.4 KB, free: 265.3 MB)
15/06/24 13:21:38 INFO SparkContext: Created broadcast 2 from saveAsTable at NativeMethodAccessorImpl.java:-2
15/06/24 13:21:38 ERROR FileOutputCommitter: Mkdirs failed to create file:/user/hive/warehouse/test_access/_temporary/0
15/06/24 13:21:39 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1$$anon$2: Using Task Side Metadata Split Strategy
15/06/24 13:21:39 INFO SparkContext: Starting job: saveAsTable at NativeMethodAccessorImpl.java:-2
15/06/24 13:21:39 INFO DAGScheduler: Got job 1 (saveAsTable at NativeMethodAccessorImpl.java:-2) with 2 output partitions (allowLocal=false)
15/06/24 13:21:39 INFO DAGScheduler: Final stage: ResultStage 1(saveAsTable at NativeMethodAccessorImpl.java:-2)
15/06/24 13:21:39 INFO DAGScheduler: Parents of final stage: List()
15/06/24 13:21:39 INFO DAGScheduler: Missing parents: List()
15/06/24 13:21:39 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at ), which has no missing parents
15/06/24 13:21:39 INFO MemoryStore: ensureFreeSpace(68616) called with curMem=594395, maxMem=278302556
15/06/24 13:21:39 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 67.0 KB, free 264.8 MB)
15/06/24 13:21:39 INFO MemoryStore: ensureFreeSpace(24003) called with curMem=663011, maxMem=278302556
15/06/24 13:21:39 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 23.4 KB, free 264.8 MB)
15/06/24 13:21:39 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:44271 (size: 23.4 KB, free: 265.3 MB)
15/06/24 13:21:39 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:874
15/06/24 13:21:39 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at )
15/06/24 13:21:39 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/06/24 13:21:39 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, ANY, 1777 bytes)
15/06/24 13:21:39 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, ANY, 1778 bytes)
15/06/24 13:21:39 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
15/06/24 13:21:39 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
15/06/24 13:21:39 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1: Input split: ParquetInputSplit{part: hdfs://10.0.0.220/user/nanounanue/arquimedes_access/part-r-00001.gz.parquet start: 0 end: 259022 length: 259022 hosts: [] requestedSchema: message root {
optional binary client_ident (UTF8);
optional binary content_size (UTF8);
optional binary date_time (UTF8);
optional binary endpoint (UTF8);
optional binary ip_address (UTF8);
optional binary method (UTF8);
optional binary protocol (UTF8);
optional binary referer (UTF8);
optional binary response_code (UTF8);
optional binary response_time (UTF8);
optional binary user_agent (UTF8);
optional binary user_id (UTF8);
}
readSupportMetadata: {org.apache.spark.sql.parquet.row.metadata={"type":"struct","fields":[{"name":"client_ident","type":"string","nullable":true,"metadata":{}},{"name":"content_size","type":"string","nullable":true,"metadata":{}},{"name":"date_time","type":"string","nullable":true,"metadata":{}},{"name":"endpoint","type":"string","nullable":true,"metadata":{}},{"name":"ip_addres
s","type":"string","nullable":true,"metadata":{}},{"name":"method","type":"string","nullable":true,"metadata":{}},{"name":"protocol","type":"string","nullable":true,"metadata":{}},{"name":"referer","type":"string","nullable":true,"metadata":{}},{"name":"response_code","type":"string","nullable":true,"metadata":{}},{"name":"response_time","type":"string","nullable":true,"metadata":
{}},{"name":"user_agent","type":"string","nullable":true,"metadata":{}},{"name":"user_id","type":"string","nullable":true,"metadata":{}}]}, org.apache.spark.sql.parquet.row.requested_schema={"type":"struct","fields":[{"name":"client_ident","type":"string","nullable":true,"metadata":{}},{"name":"content_size","type":"string","nullable":true,"metadata":{}},{"name":"date_time","type"
:"string","nullable":true,"metadata":{}},{"name":"endpoint","type":"string","nullable":true,"metadata":{}},{"name":"ip_address","type":"string","nullable":true,"metadata":{}},{"name":"method","type":"string","nullable":true,"metadata":{}},{"name":"protocol","type":"string","nullable":true,"metadata":{}},{"name":"referer","type":"string","nullable":true,"metadata":{}},{"name":"resp
onse_code","type":"string","nullable":true,"metadata":{}},{"name":"response_time","type":"string","nullable":true,"metadata":{}},{"name":"user_agent","type":"string","nullable":true,"metadata":{}},{"name":"user_id","type":"string","nullable":true,"metadata":{}}]}}}
15/06/24 13:21:39 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
15/06/24 13:21:39 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1: Input split: ParquetInputSplit{part: hdfs://10.0.0.220/user/nanounanue/arquimedes_access/part-r-00002.gz.parquet start: 0 end: 315140 length: 315140 hosts: [] requestedSchema: message root {
optional binary client_ident (UTF8);
optional binary content_size (UTF8);
optional binary date_time (UTF8);
optional binary endpoint (UTF8);
optional binary ip_address (UTF8);
optional binary method (UTF8);
optional binary protocol (UTF8);
optional binary referer (UTF8);
optional binary response_code (UTF8);
optional binary response_time (UTF8);
optional binary user_agent (UTF8);
optional binary user_id (UTF8);
}
readSupportMetadata: {org.apache.spark.sql.parquet.row.metadata={"type":"struct","fields":[{"name":"client_ident","type":"string","nullable":true,"metadata":{}},{"name":"content_size","type":"string","nullable":true,"metadata":{}},{"name":"date_time","type":"string","nullable":true,"metadata":{}},{"name":"endpoint","type":"string","nullable":true,"metadata":{}},{"name":"ip_addres
s","type":"string","nullable":true,"metadata":{}},{"name":"method","type":"string","nullable":true,"metadata":{}},{"name":"protocol","type":"string","nullable":true,"metadata":{}},{"name":"referer","type":"string","nullable":true,"metadata":{}},{"name":"response_code","type":"string","nullable":true,"metadata":{}},{"name":"response_time","type":"string","nullable":true,"metadata":
{}},{"name":"user_agent","type":"string","nullable":true,"metadata":{}},{"name":"user_id","type":"string","nullable":true,"metadata":{}}]}, org.apache.spark.sql.parquet.row.requested_schema={"type":"struct","fields":[{"name":"client_ident","type":"string","nullable":true,"metadata":{}},{"name":"content_size","type":"string","nullable":true,"metadata":{}},{"name":"date_time","type"
:"string","nullable":true,"metadata":{}},{"name":"endpoint","type":"string","nullable":true,"metadata":{}},{"name":"ip_address","type":"string","nullable":true,"metadata":{}},{"name":"method","type":"string","nullable":true,"metadata":{}},{"name":"protocol","type":"string","nullable":true,"metadata":{}},{"name":"referer","type":"string","nullable":true,"metadata":{}},{"name":"resp
onse_code","type":"string","nullable":true,"metadata":{}},{"name":"response_time","type":"string","nullable":true,"metadata":{}},{"name":"user_agent","type":"string","nullable":true,"metadata":{}},{"name":"user_id","type":"string","nullable":true,"metadata":{}}]}}}
15/06/24 13:21:39 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
15/06/24 13:21:39 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 47428 records.
15/06/24 13:21:39 INFO CodecConfig: Compression: GZIP
15/06/24 13:21:39 INFO ParquetOutputFormat: Parquet block size to 134217728
15/06/24 13:21:39 INFO ParquetOutputFormat: Parquet page size to 1048576
15/06/24 13:21:39 INFO ParquetOutputFormat: Parquet dictionary page size to 1048576
15/06/24 13:21:39 INFO ParquetOutputFormat: Dictionary is on
15/06/24 13:21:39 INFO ParquetOutputFormat: Validation is off
15/06/24 13:21:39 INFO ParquetOutputFormat: Writer version is: PARQUET_1_0
15/06/24 13:21:39 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
java.io.IOException: Mkdirs failed to create file:/user/hive/warehouse/test_access/_temporary/0/_temporary/attempt_201506241321_0001_m_000001_0 (exists=false, cwd=file:/home/nanounanue)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:442)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:428)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:154)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:279)
at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
at org.apache.spark.sql.parquet.ParquetOutputWriter.<init>(newParquet.scala:111)
...
The user nanounanue has write permission on that directory:
[hdfs@ip-10-0-0-209 ec2-user]$ hadoop fs -ls -R /user/hive/ | grep warehouse
drwxrwxrwt - hive hive 0 2015-06-23 21:16 /user/hive/warehouse
What is missing?

I also encountered this issue when I moved from Spark 1.2 to Spark 1.3. It is actually a permissions issue. Try using Apache Spark instead of Cloudera's Spark, as this solved my problem.

This seems to be a bug related to the creation of new directories under the Hive metastore directory
(in your case /user/hive/warehouse).
As a workaround, try changing the default permissions on your metastore directory, granting your user rwx permissions recursively.

Based on your log:
file:/user/hive/warehouse/test_access/_temporary/0/_temporary/attempt_201506241321_0001_m_000001_0 (exists=false, cwd=file:/home/nanounanue)
Spark is trying to create files under the path /user/hive/warehouse/test_access/.
When you use Spark's default settings, Derby is used as the Hive metastore, which leads to this default local path /user/hive/warehouse/ (note the file: scheme), where your process does not have the privilege to create directories.
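
To make the point about the local path concrete, here is a minimal Python sketch (standard library only) that pulls the target path out of the error message and checks its URI scheme. The message text is copied from the log in the question; the helper name path_scheme is hypothetical and exists only for this illustration, it is not part of Spark or Hadoop.

```python
from urllib.parse import urlparse

# Error text copied from the question's log.
message = ("Mkdirs failed to create "
           "file:/user/hive/warehouse/test_access/_temporary/0")


def path_scheme(error_message):
    """Return the URI scheme of the last whitespace-separated token.

    Hypothetical helper for illustration only; not a Spark or Hadoop API.
    """
    target = error_message.split()[-1]
    return urlparse(target).scheme


scheme = path_scheme(message)
print(scheme)
# A "file" scheme means the write was attempted on the local filesystem,
# not on HDFS, which is where the `hadoop fs -ls` permission check was run.
```

If the scheme were hdfs, the permissions shown by hadoop fs -ls would be the relevant ones; with file, the local directory /user/hive/warehouse on the machine running Spark is what matters.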

Related

spark-submit --packages returns Error: Missing application resource

I installed .NET for Apache Spark using the following guide:
https://learn.microsoft.com/en-us/dotnet/spark/tutorials/get-started?WT.mc_id=dotnet-35129-website&tabs=windows
The Hello World worked.
Now I am trying to connect to and read from a Kafka cluster.
The following sample code should be able to get me connected to a Confluent Cloud Kafka cluster:
var df = spark
    .ReadStream()
    .Format("kafka")
    .Option("kafka.bootstrap.servers", "my-bootstrap-server:9092")
    .Option("subscribe", "wallet_txn_log")
    .Option("startingOffsets", "earliest")
    .Option("kafka.security.protocol", "SASL_SSL")
    .Option("kafka.sasl.mechanism", "PLAIN")
    .Option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"xxx\" password=\"xxx\";")
    .Load();
When running the code, I get the following error:
Failed to find data source: kafka. Please deploy the application as
per the deployment section of "Structured Streaming + Kafka
Integration Guide".
The guide says that I need to add the spark-sql-kafka library in the correct version:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.13:3.2.1
When I run that, I get this error:
C:\Code\MySparkApp\bin\Debug\net6.0>spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.13:3.2.1
Error: Missing application resource.
I have installed spark-3.2.1-bin-hadoop2.7
I assume that spark-submit is not able to pull the correct package from Maven.
How should I proceed from here?
Edit 1:
I figured I should pass --packages as part of the full "run" command.
Here is the latest command:
C:\Code\MySparkApp\bin\Debug\net6.0>spark-submit --class
org.apache.spark.deploy.dotnet.DotnetRunner --master local
C:\Code\MySparkApp\bin\Debug\net6.0\microsoft-spark-3-2_2.12-2.1.1.jar
dotnet MySparkApp.dll C:\Code\MySparkApp\input.txt --packages
org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1
Now again it is giving the error:
Failed to find data source: kafka
Maybe this is the wrong way to reference the Kafka library in a Spark .NET application?
Log output:
C:\Code\MySparkApp\bin\Debug\net6.0>spark-submit --class
> org.apache.spark.deploy.dotnet.DotnetRunner --master local
> C:\Code\MySparkApp\bin\Debug\net6.0\microsoft-spark-3-2_2.12-2.1.1.jar
> dotnet MySparkApp.dll C:\Code\MySparkApp\input.txt --packages
> org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1 Using Spark's default
> log4j profile: org/apache/spark/log4j-defaults.properties 22/10/06
> 18:57:07 INFO DotnetRunner: Starting DotnetBackend with dotnet.
> 22/10/06 18:57:07 INFO DotnetBackend: The number of DotnetBackend
> threads is set to 10. 22/10/06 18:57:08 INFO DotnetRunner: Port number
> used by DotnetBackend is 55998 22/10/06 18:57:08 INFO DotnetRunner:
> Adding key=spark.jars and
> value=file:/C:/Code/MySparkApp/bin/Debug/net6.0/microsoft-spark-3-2_2.12-2.1.1.jar
> to environment 22/10/06 18:57:08 INFO DotnetRunner: Adding
> key=spark.app.name and
> value=org.apache.spark.deploy.dotnet.DotnetRunner to environment
> 22/10/06 18:57:08 INFO DotnetRunner: Adding key=spark.submit.pyFiles
> and value= to environment 22/10/06 18:57:08 INFO DotnetRunner: Adding
> key=spark.submit.deployMode and value=client to environment 22/10/06
> 18:57:08 INFO DotnetRunner: Adding key=spark.master and value=local to
> environment [2022-10-06T16:57:08.2893549Z] [DESKTOP-PR6Q966] [Info]
> [ConfigurationService] Using port 55998 for connection.
> [2022-10-06T16:57:08.2932382Z] [DESKTOP-PR6Q966] [Info] [JvmBridge]
> JvMBridge port is 55998 [2022-10-06T16:57:08.2943994Z]
> [DESKTOP-PR6Q966] [Info] [JvmBridge] The number of JVM backend thread
> is set to 10. The max number of concurrent sockets in JvmBridge is set
> to 7. 22/10/06 18:57:08 INFO SparkContext: Running Spark version 3.2.1
> 22/10/06 18:57:08 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where
> applicable 22/10/06 18:57:08 INFO ResourceUtils:
> ============================================================== 22/10/06 18:57:08 INFO ResourceUtils: No custom resources configured
> for spark.driver. 22/10/06 18:57:08 INFO ResourceUtils:
> ============================================================== 22/10/06 18:57:08 INFO SparkContext: Submitted application:
> word_count_sample 22/10/06 18:57:08 INFO ResourceProfile: Default
> ResourceProfile created, executor resources: Map(cores -> name: cores,
> amount: 1, script: , vendor: , memory -> name: memory, amount: 1024,
> script: , vendor: , offHeap -> name: offHeap, amount: 0, script: ,
> vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
> 22/10/06 18:57:08 INFO ResourceProfile: Limiting resource is cpu
> 22/10/06 18:57:08 INFO ResourceProfileManager: Added ResourceProfile
> id: 0 22/10/06 18:57:08 INFO SecurityManager: Changing view acls to:
> Kenan 22/10/06 18:57:08 INFO SecurityManager: Changing modify acls to:
> Kenan 22/10/06 18:57:08 INFO SecurityManager: Changing view acls
> groups to: 22/10/06 18:57:08 INFO SecurityManager: Changing modify
> acls groups to: 22/10/06 18:57:08 INFO SecurityManager:
> SecurityManager: authentication disabled; ui acls disabled; users
> with view permissions: Set(Kenan); groups with view permissions:
> Set(); users with modify permissions: Set(Kenan); groups with modify
> permissions: Set() 22/10/06 18:57:08 INFO Utils: Successfully started
> service 'sparkDriver' on port 56006. 22/10/06 18:57:08 INFO SparkEnv:
> Registering MapOutputTracker 22/10/06 18:57:08 INFO SparkEnv:
> Registering BlockManagerMaster 22/10/06 18:57:08 INFO
> BlockManagerMasterEndpoint: Using
> org.apache.spark.storage.DefaultTopologyMapper for getting topology
> information 22/10/06 18:57:08 INFO BlockManagerMasterEndpoint:
> BlockManagerMasterEndpoint up 22/10/06 18:57:08 INFO SparkEnv:
> Registering BlockManagerMasterHeartbeat 22/10/06 18:57:08 INFO
> DiskBlockManager: Created local directory at
> C:\Users\Kenan\AppData\Local\Temp\blockmgr-ca3af1bf-634a-45b2-879d-ca2c6db97299
> 22/10/06 18:57:08 INFO MemoryStore: MemoryStore started with capacity
> 366.3 MiB 22/10/06 18:57:08 INFO SparkEnv: Registering OutputCommitCoordinator 22/10/06 18:57:09 INFO Utils: Successfully
> started service 'SparkUI' on port 4040. 22/10/06 18:57:09 INFO
> SparkUI: Bound SparkUI to 0.0.0.0, and started at
> http://DESKTOP-PR6Q966.mshome.net:4040 22/10/06 18:57:09 INFO
> SparkContext: Added JAR
> file:/C:/Code/MySparkApp/bin/Debug/net6.0/microsoft-spark-3-2_2.12-2.1.1.jar
> at
> spark://DESKTOP-PR6Q966.mshome.net:56006/jars/microsoft-spark-3-2_2.12-2.1.1.jar
> with timestamp 1665075428422 22/10/06 18:57:09 INFO Executor: Starting
> executor ID driver on host DESKTOP-PR6Q966.mshome.net 22/10/06
> 18:57:09 INFO Executor: Fetching
> spark://DESKTOP-PR6Q966.mshome.net:56006/jars/microsoft-spark-3-2_2.12-2.1.1.jar
> with timestamp 1665075428422 22/10/06 18:57:09 INFO
> TransportClientFactory: Successfully created connection to
> DESKTOP-PR6Q966.mshome.net/172.24.208.1:56006 after 11 ms (0 ms spent
> in bootstraps) 22/10/06 18:57:09 INFO Utils: Fetching
> spark://DESKTOP-PR6Q966.mshome.net:56006/jars/microsoft-spark-3-2_2.12-2.1.1.jar
> to
> C:\Users\Kenan\AppData\Local\Temp\spark-91d1752d-a8f0-42c7-a340-e4e7c3ea84b0\userFiles-6a2073f2-d8d9-4a42-8aac-b5c0c7142763\fetchFileTemp6627445237981542962.tmp
> 22/10/06 18:57:09 INFO Executor: Adding
> file:/C:/Users/Kenan/AppData/Local/Temp/spark-91d1752d-a8f0-42c7-a340-e4e7c3ea84b0/userFiles-6a2073f2-d8d9-4a42-8aac-b5c0c7142763/microsoft-spark-3-2_2.12-2.1.1.jar
> to class loader 22/10/06 18:57:09 INFO Utils: Successfully started
> service 'org.apache.spark.network.netty.NettyBlockTransferService' on
> port 56030. 22/10/06 18:57:09 INFO NettyBlockTransferService: Server
> created on DESKTOP-PR6Q966.mshome.net:56030 22/10/06 18:57:09 INFO
> BlockManager: Using
> org.apache.spark.storage.RandomBlockReplicationPolicy for block
> replication policy 22/10/06 18:57:09 INFO BlockManagerMaster:
> Registering BlockManager BlockManagerId(driver,
> DESKTOP-PR6Q966.mshome.net, 56030, None) 22/10/06 18:57:09 INFO
> BlockManagerMasterEndpoint: Registering block manager
> DESKTOP-PR6Q966.mshome.net:56030 with 366.3 MiB RAM,
> BlockManagerId(driver, DESKTOP-PR6Q966.mshome.net, 56030, None)
> 22/10/06 18:57:09 INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, DESKTOP-PR6Q966.mshome.net, 56030, None)
> 22/10/06 18:57:09 INFO BlockManager: Initialized BlockManager:
> BlockManagerId(driver, DESKTOP-PR6Q966.mshome.net, 56030, None)
> 22/10/06 18:57:09 INFO SharedState: Setting
> hive.metastore.warehouse.dir ('null') to the value of
> spark.sql.warehouse.dir. 22/10/06 18:57:09 INFO SharedState: Warehouse
> path is 'file:/C:/Code/MySparkApp/bin/Debug/net6.0/spark-warehouse'.
> 22/10/06 18:57:10 INFO InMemoryFileIndex: It took 21 ms to list leaf
> files for 1 paths. 22/10/06 18:57:12 INFO FileSourceStrategy: Pushed
> Filters: 22/10/06 18:57:12 INFO FileSourceStrategy: Post-Scan Filters:
> (size(split(value#0, , -1), true) > 0),isnotnull(split(value#0, ,
> -1)) 22/10/06 18:57:12 INFO FileSourceStrategy: Output Data Schema: struct<value: string> 22/10/06 18:57:12 INFO CodeGenerator: Code
> generated in 181.3829 ms 22/10/06 18:57:12 INFO MemoryStore: Block
> broadcast_0 stored as values in memory (estimated size 286.3 KiB, free
> 366.0 MiB) 22/10/06 18:57:12 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.1 KiB,
> free 366.0 MiB) 22/10/06 18:57:12 INFO BlockManagerInfo: Added
> broadcast_0_piece0 in memory on DESKTOP-PR6Q966.mshome.net:56030
> (size: 24.1 KiB, free: 366.3 MiB) 22/10/06 18:57:12 INFO SparkContext:
> Created broadcast 0 from showString at <unknown>:0 22/10/06 18:57:12
> INFO FileSourceScanExec: Planning scan with bin packing, max size:
> 4194406 bytes, open cost is considered as scanning 4194304 bytes.
> 22/10/06 18:57:12 INFO DAGScheduler: Registering RDD 3 (showString at
> <unknown>:0) as input to shuffle 0 22/10/06 18:57:12 INFO
> DAGScheduler: Got map stage job 0 (showString at <unknown>:0) with 1
> output partitions 22/10/06 18:57:12 INFO DAGScheduler: Final stage:
> ShuffleMapStage 0 (showString at <unknown>:0) 22/10/06 18:57:12 INFO
> DAGScheduler: Parents of final stage: List() 22/10/06 18:57:12 INFO
> DAGScheduler: Missing parents: List() 22/10/06 18:57:12 INFO
> DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at
> showString at <unknown>:0), which has no missing parents 22/10/06
> 18:57:12 INFO MemoryStore: Block broadcast_1 stored as values in
> memory (estimated size 38.6 KiB, free 366.0 MiB) 22/10/06 18:57:12
> INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory
> (estimated size 17.6 KiB, free 365.9 MiB) 22/10/06 18:57:12 INFO
> BlockManagerInfo: Added broadcast_1_piece0 in memory on
> DESKTOP-PR6Q966.mshome.net:56030 (size: 17.6 KiB, free: 366.3 MiB)
> 22/10/06 18:57:12 INFO SparkContext: Created broadcast 1 from
> broadcast at DAGScheduler.scala:1478 22/10/06 18:57:13 INFO
> DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0
> (MapPartitionsRDD[3] at showString at <unknown>:0) (first 15 tasks are
> for partitions Vector(0)) 22/10/06 18:57:13 INFO TaskSchedulerImpl:
> Adding task set 0.0 with 1 tasks resource profile 0 22/10/06 18:57:13
> INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0)
> (DESKTOP-PR6Q966.mshome.net, executor driver, partition 0,
> PROCESS_LOCAL, 4850 bytes) taskResourceAssignments Map() 22/10/06
> 18:57:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 22/10/06
> 18:57:13 INFO CodeGenerator: Code generated in 10.268 ms 22/10/06
> 18:57:13 INFO CodeGenerator: Code generated in 4.9722 ms 22/10/06
> 18:57:13 INFO CodeGenerator: Code generated in 6.0205 ms 22/10/06
> 18:57:13 INFO CodeGenerator: Code generated in 5.18 ms 22/10/06
> 18:57:13 INFO FileScanRDD: Reading File path:
> file:///C:/Code/MySparkApp/input.txt, range: 0-102, partition values:
> [empty row] 22/10/06 18:57:13 INFO LineRecordReader: Found UTF-8 BOM
> and skipped it 22/10/06 18:57:13 INFO Executor: Finished task 0.0 in
> stage 0.0 (TID 0). 2845 bytes result sent to driver 22/10/06 18:57:13
> INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 319 ms
> on DESKTOP-PR6Q966.mshome.net (executor driver) (1/1) 22/10/06
> 18:57:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have
> all completed, from pool 22/10/06 18:57:13 INFO DAGScheduler:
> ShuffleMapStage 0 (showString at <unknown>:0) finished in 0.379 s
> 22/10/06 18:57:13 INFO DAGScheduler: looking for newly runnable stages
> 22/10/06 18:57:13 INFO DAGScheduler: running: Set() 22/10/06 18:57:13
> INFO DAGScheduler: waiting: Set() 22/10/06 18:57:13 INFO DAGScheduler:
> failed: Set() 22/10/06 18:57:13 INFO ShufflePartitionsUtil: For
> shuffle(0), advisory target size: 67108864, actual target size
> 1048576, minimum partition size: 1048576 22/10/06 18:57:13 INFO
> CodeGenerator: Code generated in 11.5441 ms 22/10/06 18:57:13 INFO
> HashAggregateExec: spark.sql.codegen.aggregate.map.twolevel.enabled is
> set to true, but current version of codegened fast hashmap does not
> support this aggregate. 22/10/06 18:57:13 INFO CodeGenerator: Code
> generated in 10.7919 ms 22/10/06 18:57:13 INFO SparkContext: Starting
> job: showString at <unknown>:0 22/10/06 18:57:13 INFO DAGScheduler:
> Got job 1 (showString at <unknown>:0) with 1 output partitions
> 22/10/06 18:57:13 INFO DAGScheduler: Final stage: ResultStage 2
> (showString at <unknown>:0) 22/10/06 18:57:13 INFO DAGScheduler:
> Parents of final stage: List(ShuffleMapStage 1) 22/10/06 18:57:13 INFO
> DAGScheduler: Missing parents: List() 22/10/06 18:57:13 INFO
> DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[7] at
> showString at <unknown>:0), which has no missing parents 22/10/06
> 18:57:13 INFO MemoryStore: Block broadcast_2 stored as values in
> memory (estimated size 37.4 KiB, free 365.9 MiB) 22/10/06 18:57:13
> INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory
> (estimated size 17.7 KiB, free 365.9 MiB) 22/10/06 18:57:13 INFO
> BlockManagerInfo: Added broadcast_2_piece0 in memory on
> DESKTOP-PR6Q966.mshome.net:56030 (size: 17.7 KiB, free: 366.2 MiB)
> 22/10/06 18:57:13 INFO SparkContext: Created broadcast 2 from
> broadcast at DAGScheduler.scala:1478 22/10/06 18:57:13 INFO
> DAGScheduler: Submitting 1 missing tasks from ResultStage 2
> (MapPartitionsRDD[7] at showString at <unknown>:0) (first 15 tasks are
> for partitions Vector(0)) 22/10/06 18:57:13 INFO TaskSchedulerImpl:
> Adding task set 2.0 with 1 tasks resource profile 0 22/10/06 18:57:13
> INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 1)
> (DESKTOP-PR6Q966.mshome.net, executor driver, partition 0, NODE_LOCAL,
> 4453 bytes) taskResourceAssignments Map() 22/10/06 18:57:13 INFO
> Executor: Running task 0.0 in stage 2.0 (TID 1) 22/10/06 18:57:13 INFO
> BlockManagerInfo: Removed broadcast_1_piece0 on
> DESKTOP-PR6Q966.mshome.net:56030 in memory (size: 17.6 KiB, free:
> 366.3 MiB) 22/10/06 18:57:13 INFO ShuffleBlockFetcherIterator: Getting 1 (864.0 B) non-empty blocks including 1 (864.0 B) local and 0 (0.0 B)
> host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
> 22/10/06 18:57:13 INFO ShuffleBlockFetcherIterator: Started 0 remote
> fetches in 8 ms 22/10/06 18:57:13 INFO Executor: Finished task 0.0 in
> stage 2.0 (TID 1). 6732 bytes result sent to driver 22/10/06 18:57:13
> INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 1) in 124 ms
> on DESKTOP-PR6Q966.mshome.net (executor driver) (1/1) 22/10/06
> 18:57:13 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have
> all completed, from pool 22/10/06 18:57:13 INFO DAGScheduler:
> ResultStage 2 (showString at <unknown>:0) finished in 0.136 s 22/10/06
> 18:57:13 INFO DAGScheduler: Job 1 is finished. Cancelling potential
> speculative or zombie tasks for this job 22/10/06 18:57:13 INFO
> TaskSchedulerImpl: Killing all running tasks in stage 2: Stage
> finished 22/10/06 18:57:13 INFO DAGScheduler: Job 1 finished:
> showString at <unknown>:0, took 0.149812 s 22/10/06 18:57:13 INFO
> CodeGenerator: Code generated in 7.0234 ms 22/10/06 18:57:13 INFO
> CodeGenerator: Code generated in 7.0701 ms
> +------+-----+ | word|count|
> +------+-----+ | .NET| 3| |Apache| 2| | This| 2| | Spark| 2| | app| 2| | World| 1| | for| 1| |counts| 1| |
> words| 1| | with| 1| | uses| 1| | Hello| 1|
> +------+-----+
>
> Moo 22/10/06 18:57:13 ERROR DotnetBackendHandler: Failed to execute
> 'load' on 'org.apache.spark.sql.streaming.DataStreamReader' with
> args=() [2022-10-06T16:57:13.6895055Z] [DESKTOP-PR6Q966] [Error]
> [JvmBridge] JVM method execution failed: Nonstatic method 'load'
> failed for class '22' when called with no arguments
> [2022-10-06T16:57:13.6895347Z] [DESKTOP-PR6Q966] [Error] [JvmBridge]
> org.apache.spark.sql.AnalysisException: Failed to find data source:
> kafka. Please deploy the application as per the deployment section of
> "Structured Streaming + Kafka Integration Guide".
> at org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindKafkaDataSourceError(QueryCompilationErrors.scala:1037)
> at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:668)
> at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:156)
> at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:143)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.apache.spark.api.dotnet.DotnetBackendHandler.handleMethodCall(DotnetBackendHandler.scala:165)
> at org.apache.spark.api.dotnet.DotnetBackendHandler.$anonfun$handleBackendRequest$2(DotnetBackendHandler.scala:105)
> at org.apache.spark.api.dotnet.ThreadPool$$anon$1.run(ThreadPool.scala:34)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
>
> [2022-10-06T16:57:13.6986588Z] [DESKTOP-PR6Q966] [Exception]
> [JvmBridge] JVM method execution failed: Nonstatic method 'load'
> failed for class '22' when called with no arguments at
> Microsoft.Spark.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic,
> Object classNameOrJvmObjectReference, String methodName, Object[]
> args) Unhandled exception. System.Exception: JVM method execution
> failed: Nonstatic method 'load' failed for class '22' when called with
> no arguments ---> Microsoft.Spark.JvmException:
> org.apache.spark.sql.AnalysisException: Failed to find data source:
> kafka. Please deploy the application as per the deployment section of
> "Structured Streaming + Kafka Integration Guide".
> at org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindKafkaDataSourceError(QueryCompilationErrors.scala:1037)
> at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:668)
> at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:156)
> at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:143)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.apache.spark.api.dotnet.DotnetBackendHandler.handleMethodCall(DotnetBackendHandler.scala:165)
> at org.apache.spark.api.dotnet.DotnetBackendHandler.$anonfun$handleBackendRequest$2(DotnetBackendHandler.scala:105)
> at org.apache.spark.api.dotnet.ThreadPool$$anon$1.run(ThreadPool.scala:34)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at
..
--packages must be supplied before --class. Look at the Mongo example.
Otherwise, it is passed to the main method as an argument, along with your other class arguments: C:\Code\MySparkApp\bin\Debug\net6.0\microsoft-spark-3-2_2.12-2.1.1.jar dotnet MySparkApp.dll C:\Code\MySparkApp\input.txt. Print the main method arguments to debug further.
You can also set spark.jars.packages in your SparkSession Config options.
Regarding versions, it is unclear which Scala version you have, but spark-sql-kafka-0-10_2.12:3.2.1 is correct for Spark 3.2.1 with Scala 2.12, which seems to match your Microsoft JAR.
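To make the ordering concrete, here is a small sketch (not taken from the question) that assembles a spark-submit command line in a safe order: spark-submit options first (--packages, then --class), then the application JAR, then the arguments meant for the main class. The helper SubmitCommandSketch.build is hypothetical, and the runner class name org.apache.spark.deploy.dotnet.DotnetRunner is an assumption about your setup; substitute whatever your own command uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the argument list for spark-submit so that
// every spark-submit option comes before the application JAR.
final class SubmitCommandSketch {

    static List<String> build(String packages, String mainClass,
                              String appJar, List<String> appArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        // Options for spark-submit itself go first.
        cmd.add("--packages");
        cmd.add(packages);
        cmd.add("--class");
        cmd.add(mainClass);
        // The application JAR ends the option list ...
        cmd.add(appJar);
        // ... so everything after it is handed to the main class as arguments.
        cmd.addAll(appArgs);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build(
                "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1",
                "org.apache.spark.deploy.dotnet.DotnetRunner", // assumed runner class
                "microsoft-spark-3-2_2.12-2.1.1.jar",
                Arrays.asList("dotnet", "MySparkApp.dll", "input.txt"));
        System.out.println(String.join(" ", cmd));
    }
}
```

If --packages shows up in the printed main method arguments instead of being consumed by spark-submit, it was placed too late in the command.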

Why does textFileStream dstream give empty RDDs as if no files were processed?

I have a very basic Spark application that streams an input file. Every line contains a JSON string that I want to turn into a model object.
public final class SparkStreamingApplication {
    public static JavaSparkContext javaSparkContext() {
        final SparkConf conf = new SparkConf()
                .setAppName("SparkApplication")
                .setMaster("local[2]");
        return new JavaSparkContext(conf);
    }
    public static void main(String[] args) {
        final JavaSparkContext sparkContext = javaSparkContext();
        final String path = "data/input.txt";
        final JavaStreamingContext streamingContext = new JavaStreamingContext(sparkContext, Durations.seconds(10));
        final JavaDStream<String> linesDStream = streamingContext.textFileStream(path);
        final JavaDStream<String> tokens = linesDStream.flatMap(x -> Arrays.asList(x.split("|")));
        final JavaDStream<Long> count = tokens.count();
        count.print();
        streamingContext.start();
        streamingContext.awaitTermination();
    }
}
This results in:
16/01/24 18:44:56 INFO SparkContext: Running Spark version 1.6.0
16/01/24 18:44:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/24 18:44:58 WARN Utils: Your hostname, markus-lenovo resolves to a loopback address: 127.0.1.1; using 192.168.2.103 instead (on interface wlp2s0)
16/01/24 18:44:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/01/24 18:44:58 INFO SecurityManager: Changing view acls to: markus
16/01/24 18:44:58 INFO SecurityManager: Changing modify acls to: markus
16/01/24 18:44:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(markus); users with modify permissions: Set(markus)
16/01/24 18:44:59 INFO Utils: Successfully started service 'sparkDriver' on port 38761.
16/01/24 18:44:59 INFO Slf4jLogger: Slf4jLogger started
16/01/24 18:44:59 INFO Remoting: Starting remoting
16/01/24 18:45:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.2.103:45438]
16/01/24 18:45:00 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 45438.
16/01/24 18:45:00 INFO SparkEnv: Registering MapOutputTracker
16/01/24 18:45:00 INFO SparkEnv: Registering BlockManagerMaster
16/01/24 18:45:00 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-82c4981c-0b78-47c0-a8c7-e6fe8bc6ac84
16/01/24 18:45:00 INFO MemoryStore: MemoryStore started with capacity 1092.4 MB
16/01/24 18:45:00 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/24 18:45:00 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/24 18:45:00 INFO SparkUI: Started SparkUI at http://192.168.2.103:4040
16/01/24 18:45:00 INFO Executor: Starting executor ID driver on host localhost
16/01/24 18:45:00 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35429.
16/01/24 18:45:00 INFO NettyBlockTransferService: Server created on 35429
16/01/24 18:45:00 INFO BlockManagerMaster: Trying to register BlockManager
16/01/24 18:45:00 INFO BlockManagerMasterEndpoint: Registering block manager localhost:35429 with 1092.4 MB RAM, BlockManagerId(driver, localhost, 35429)
16/01/24 18:45:00 INFO BlockManagerMaster: Registered BlockManager
16/01/24 18:45:01 INFO FileInputDStream: Duration for remembering RDDs set to 60000 ms for org.apache.spark.streaming.dstream.FileInputDStream@3c35c345
16/01/24 18:45:02 INFO ForEachDStream: metadataCleanupDelay = -1
16/01/24 18:45:02 INFO MappedDStream: metadataCleanupDelay = -1
16/01/24 18:45:02 INFO MappedDStream: metadataCleanupDelay = -1
16/01/24 18:45:02 INFO ShuffledDStream: metadataCleanupDelay = -1
16/01/24 18:45:02 INFO TransformedDStream: metadataCleanupDelay = -1
16/01/24 18:45:02 INFO MappedDStream: metadataCleanupDelay = -1
16/01/24 18:45:02 INFO FlatMappedDStream: metadataCleanupDelay = -1
16/01/24 18:45:02 INFO MappedDStream: metadataCleanupDelay = -1
16/01/24 18:45:02 INFO FileInputDStream: metadataCleanupDelay = -1
16/01/24 18:45:02 INFO FileInputDStream: Slide time = 10000 ms
16/01/24 18:45:02 INFO FileInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/01/24 18:45:02 INFO FileInputDStream: Checkpoint interval = null
16/01/24 18:45:02 INFO FileInputDStream: Remember duration = 60000 ms
16/01/24 18:45:02 INFO FileInputDStream: Initialized and validated org.apache.spark.streaming.dstream.FileInputDStream@3c35c345
16/01/24 18:45:02 INFO MappedDStream: Slide time = 10000 ms
16/01/24 18:45:02 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/01/24 18:45:02 INFO MappedDStream: Checkpoint interval = null
16/01/24 18:45:02 INFO MappedDStream: Remember duration = 10000 ms
16/01/24 18:45:02 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream@45f27baa
16/01/24 18:45:02 INFO FlatMappedDStream: Slide time = 10000 ms
16/01/24 18:45:02 INFO FlatMappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/01/24 18:45:02 INFO FlatMappedDStream: Checkpoint interval = null
16/01/24 18:45:02 INFO FlatMappedDStream: Remember duration = 10000 ms
16/01/24 18:45:02 INFO FlatMappedDStream: Initialized and validated org.apache.spark.streaming.dstream.FlatMappedDStream@18d0e76e
16/01/24 18:45:02 INFO MappedDStream: Slide time = 10000 ms
16/01/24 18:45:02 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/01/24 18:45:02 INFO MappedDStream: Checkpoint interval = null
16/01/24 18:45:02 INFO MappedDStream: Remember duration = 10000 ms
16/01/24 18:45:02 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream@eb2c23e
16/01/24 18:45:02 INFO TransformedDStream: Slide time = 10000 ms
16/01/24 18:45:02 INFO TransformedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/01/24 18:45:02 INFO TransformedDStream: Checkpoint interval = null
16/01/24 18:45:02 INFO TransformedDStream: Remember duration = 10000 ms
16/01/24 18:45:02 INFO TransformedDStream: Initialized and validated org.apache.spark.streaming.dstream.TransformedDStream@26b276d3
16/01/24 18:45:02 INFO ShuffledDStream: Slide time = 10000 ms
16/01/24 18:45:02 INFO ShuffledDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/01/24 18:45:02 INFO ShuffledDStream: Checkpoint interval = null
16/01/24 18:45:02 INFO ShuffledDStream: Remember duration = 10000 ms
16/01/24 18:45:02 INFO ShuffledDStream: Initialized and validated org.apache.spark.streaming.dstream.ShuffledDStream@704b6684
16/01/24 18:45:02 INFO MappedDStream: Slide time = 10000 ms
16/01/24 18:45:02 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/01/24 18:45:02 INFO MappedDStream: Checkpoint interval = null
16/01/24 18:45:02 INFO MappedDStream: Remember duration = 10000 ms
16/01/24 18:45:02 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream@6fbf1474
16/01/24 18:45:02 INFO MappedDStream: Slide time = 10000 ms
16/01/24 18:45:02 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/01/24 18:45:02 INFO MappedDStream: Checkpoint interval = null
16/01/24 18:45:02 INFO MappedDStream: Remember duration = 10000 ms
16/01/24 18:45:02 INFO MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream@7784888f
16/01/24 18:45:02 INFO ForEachDStream: Slide time = 10000 ms
16/01/24 18:45:02 INFO ForEachDStream: Storage level = StorageLevel(false, false, false, false, 1)
16/01/24 18:45:02 INFO ForEachDStream: Checkpoint interval = null
16/01/24 18:45:02 INFO ForEachDStream: Remember duration = 10000 ms
16/01/24 18:45:02 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@42b57c42
16/01/24 18:45:02 INFO RecurringTimer: Started timer for JobGenerator at time 1453657510000
16/01/24 18:45:02 INFO JobGenerator: Started JobGenerator at 1453657510000 ms
16/01/24 18:45:02 INFO JobScheduler: Started JobScheduler
16/01/24 18:45:02 INFO StreamingContext: StreamingContext started
16/01/24 18:45:10 INFO FileInputDStream: Finding new files took 184 ms
16/01/24 18:45:10 INFO FileInputDStream: New files at time 1453657510000 ms:
16/01/24 18:45:10 INFO JobScheduler: Added jobs for time 1453657510000 ms
16/01/24 18:45:10 INFO JobScheduler: Starting job streaming job 1453657510000 ms.0 from job set of time 1453657510000 ms
16/01/24 18:45:10 INFO SparkContext: Starting job: print at SparkStreamingApplication.java:33
16/01/24 18:45:10 INFO DAGScheduler: Registering RDD 5 (union at DStream.scala:617)
16/01/24 18:45:10 INFO DAGScheduler: Got job 0 (print at SparkStreamingApplication.java:33) with 1 output partitions
16/01/24 18:45:10 INFO DAGScheduler: Final stage: ResultStage 1 (print at SparkStreamingApplication.java:33)
16/01/24 18:45:10 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/01/24 18:45:10 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/01/24 18:45:10 INFO DAGScheduler: Submitting ShuffleMapStage 0 (UnionRDD[5] at union at DStream.scala:617), which has no missing parents
16/01/24 18:45:10 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.6 KB, free 4.6 KB)
16/01/24 18:45:10 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.6 KB, free 7.2 KB)
16/01/24 18:45:10 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35429 (size: 2.6 KB, free: 1092.4 MB)
16/01/24 18:45:10 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/01/24 18:45:10 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (UnionRDD[5] at union at DStream.scala:617)
16/01/24 18:45:10 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/01/24 18:45:10 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2148 bytes)
16/01/24 18:45:10 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/01/24 18:45:10 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1159 bytes result sent to driver
16/01/24 18:45:11 INFO DAGScheduler: ShuffleMapStage 0 (union at DStream.scala:617) finished in 0.211 s
16/01/24 18:45:11 INFO DAGScheduler: looking for newly runnable stages
16/01/24 18:45:11 INFO DAGScheduler: running: Set()
16/01/24 18:45:11 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/01/24 18:45:11 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 174 ms on localhost (1/1)
16/01/24 18:45:11 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/01/24 18:45:11 INFO DAGScheduler: failed: Set()
16/01/24 18:45:11 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[8] at count at SparkStreamingApplication.java:32), which has no missing parents
16/01/24 18:45:11 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.5 KB, free 10.8 KB)
16/01/24 18:45:11 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.0 KB, free 12.8 KB)
16/01/24 18:45:11 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:35429 (size: 2.0 KB, free: 1092.4 MB)
16/01/24 18:45:11 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/01/24 18:45:11 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[8] at count at SparkStreamingApplication.java:32)
16/01/24 18:45:11 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/01/24 18:45:11 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,NODE_LOCAL, 1813 bytes)
16/01/24 18:45:11 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/01/24 18:45:11 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/01/24 18:45:11 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 8 ms
16/01/24 18:45:11 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1241 bytes result sent to driver
16/01/24 18:45:11 INFO DAGScheduler: ResultStage 1 (print at SparkStreamingApplication.java:33) finished in 0.068 s
16/01/24 18:45:11 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 72 ms on localhost (1/1)
16/01/24 18:45:11 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/01/24 18:45:11 INFO DAGScheduler: Job 0 finished: print at SparkStreamingApplication.java:33, took 0.729150 s
16/01/24 18:45:11 INFO SparkContext: Starting job: print at SparkStreamingApplication.java:33
16/01/24 18:45:11 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 144 bytes
16/01/24 18:45:11 INFO DAGScheduler: Got job 1 (print at SparkStreamingApplication.java:33) with 1 output partitions
16/01/24 18:45:11 INFO DAGScheduler: Final stage: ResultStage 3 (print at SparkStreamingApplication.java:33)
16/01/24 18:45:11 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
16/01/24 18:45:11 INFO DAGScheduler: Missing parents: List()
16/01/24 18:45:11 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[8] at count at SparkStreamingApplication.java:32), which has no missing parents
16/01/24 18:45:11 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.5 KB, free 16.3 KB)
16/01/24 18:45:11 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.0 KB, free 18.3 KB)
16/01/24 18:45:11 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:35429 (size: 2.0 KB, free: 1092.4 MB)
16/01/24 18:45:11 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/01/24 18:45:11 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[8] at count at SparkStreamingApplication.java:32)
16/01/24 18:45:11 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
16/01/24 18:45:11 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 2, localhost, partition 1,PROCESS_LOCAL, 1813 bytes)
16/01/24 18:45:11 INFO Executor: Running task 0.0 in stage 3.0 (TID 2)
16/01/24 18:45:11 INFO ContextCleaner: Cleaned accumulator 1
16/01/24 18:45:11 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
16/01/24 18:45:11 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/01/24 18:45:11 INFO Executor: Finished task 0.0 in stage 3.0 (TID 2). 1163 bytes result sent to driver
16/01/24 18:45:11 INFO DAGScheduler: ResultStage 3 (print at SparkStreamingApplication.java:33) finished in 0.048 s
16/01/24 18:45:11 INFO DAGScheduler: Job 1 finished: print at SparkStreamingApplication.java:33, took 0.112123 s
-------------------------------------------
Time: 1453657510000 ms
-------------------------------------------
0
16/01/24 18:45:11 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 2) in 48 ms on localhost (1/1)
16/01/24 18:45:11 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
16/01/24 18:45:11 INFO JobScheduler: Finished job streaming job 1453657510000 ms.0 from job set of time 1453657510000 ms
16/01/24 18:45:11 INFO JobScheduler: Total delay: 1.318 s for time 1453657510000 ms (execution: 0.963 s)
16/01/24 18:45:11 INFO FileInputDStream: Cleared 0 old files that were older than 1453657450000 ms:
16/01/24 18:45:11 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:35429 in memory (size: 2.0 KB, free: 1092.4 MB)
16/01/24 18:45:11 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer()
16/01/24 18:45:11 INFO ContextCleaner: Cleaned accumulator 2
16/01/24 18:45:11 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:35429 in memory (size: 2.6 KB, free: 1092.4 MB)
16/01/24 18:45:11 INFO InputInfoTracker: remove old batch metadata
As you can see at the lower end of the output, the printed count for the tokens DStream is 0. But the result should be 3, because each line of my input file is in the format xx | yy | zz.
Is there something wrong in my Spark configuration or in the usage of DStreams? Thanks for any ideas and suggestions!
Spark's textFileStream creates a stream that watches a directory for new files only.
You have to change path to "data/" and then put the file into that directory after your stream has started.
Please note that, according to the documentation, only new files are detected and processed:
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
However, when a file is renamed, Spark detects it.
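A minimal sketch of how a file could be fed into the watched directory, assuming the stream points at "data/": write the file somewhere else first, then move it in with a single rename, so the directory only ever sees complete files. The class and method names (StreamInputSketch, dropIntoWatchedDir, fields) are made up for illustration. The second helper is an extra observation about the question's code, not part of the point above: String.split takes a regular expression, so split("|") does not split on a literal pipe, and the character has to be quoted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class StreamInputSketch {

    // Writes the lines to a staging file outside the watched directory and then
    // moves the finished file in. ATOMIC_MOVE needs both directories to be on
    // the same file system; otherwise the move throws.
    static Path dropIntoWatchedDir(Path stagingDir, Path watchedDir,
                                   String fileName, List<String> lines) {
        try {
            Files.createDirectories(stagingDir);
            Files.createDirectories(watchedDir);
            Path staged = stagingDir.resolve(fileName);
            Files.write(staged, lines, StandardCharsets.UTF_8);
            return Files.move(staged, watchedDir.resolve(fileName),
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Splits on a literal '|'. Surrounding whitespace is kept in each field.
    static String[] fields(String line) {
        return line.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        // Run this while the streaming job is already watching "data/".
        Path moved = dropIntoWatchedDir(Paths.get("staging"), Paths.get("data"),
                "input.txt", Arrays.asList("xx | yy | zz"));
        System.out.println(moved + ": " + fields("xx | yy | zz").length + " fields");
    }
}
```

Note also that count() on the tokens stream counts every token in the batch, so a file with several lines would report the number of lines times the number of fields per line, not the number of fields in one line.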

Does Spark create two datasets or stages that work on the same logic?

I was trying to read from a CSV file and insert those entries into a database.
I found that internally Spark created two RDDs, i.e. rdd_0_0 and rdd_0_1, that work on the same data and do all the processing.
Can anyone help me figure out why the call method is called twice by different datasets?
If two datasets/stages are created, why are both of them working on the same logic?
Please help me confirm whether that is how Spark works.
public final class TestJavaAggregation1 implements Serializable {
    private static final long serialVersionUID = 1L;
    static CassandraConfig config = null;
    static PreparedStatement statement = null;
    private transient SparkConf conf;
    private PersonAggregationRowWriterFactory aggregationWriter = new PersonAggregationRowWriterFactory();
    public Session session;
    private TestJavaAggregation1(SparkConf conf) {
        this.conf = conf;
    }
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("REadFromCSVFile").setMaster("local[1]").set("spark.executor.memory", "1g");
        conf.set("spark.cassandra.connection.host", "localhost");
        TestJavaAggregation1 app = new TestJavaAggregation1(conf);
        app.run();
    }
    private void run() {
        JavaSparkContext sc = new JavaSparkContext(conf);
        aggregateData(sc);
        sc.stop();
    }
    private JavaRDD<String> sparkConfig(JavaSparkContext sc) {
        JavaRDD<String> lines = sc.textFile("PersonAggregation1_500.csv", 1);
        System.out.println(lines.getCheckpointFile());
        lines.cache();
        final String heading = lines.first();
        System.out.println(heading);
        String headerValues = heading.replaceAll("\t", ",");
        System.out.println(headerValues);
        CassandraConnector connector = CassandraConnector.apply(sc.getConf());
        Session session = connector.openSession();
        try {
            session.execute("DROP KEYSPACE IF EXISTS java_api5");
            session.execute("CREATE KEYSPACE java_api5 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE java_api5.person (hashvalue INT, id INT, state TEXT, city TEXT, country TEXT, full_name TEXT, PRIMARY KEY((hashvalue), id, state, city, country, full_name)) WITH CLUSTERING ORDER BY (id DESC);");
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return lines;
    }
    @SuppressWarnings("serial")
    public void aggregateData(JavaSparkContext sc) {
        JavaRDD<String> lines = sparkConfig(sc);
        System.out.println("FirstRDD" + lines.partitions().size());
        JavaRDD<PersonAggregation> result = lines.map(new Function<String, PersonAggregation>() {
            int i = 0;
            public PersonAggregation call(String row) {
                PersonAggregation aggregate = new PersonAggregation();
                row = row + "," + this.hashCode();
                String[] parts = row.split(",");
                aggregate.setId(Integer.valueOf(parts[0]));
                aggregate.setFull_name(parts[1]);
                aggregate.setState(parts[4]);
                aggregate.setCity(parts[5]);
                aggregate.setCountry(parts[6]);
                aggregate.setHashValue(Integer.valueOf(parts[7]));
                // below save inserts 200 entries into the database while the CSV file has only 100 records.
                saveToJavaCassandra(aggregate);
                return aggregate;
            }
        });
        System.out.println(result.collect().size());
        List<PersonAggregation> personAggregationList = result.collect();
        JavaRDD<PersonAggregation> aggregateRDD = sc.parallelize(personAggregationList);
        javaFunctions(aggregateRDD).writerBuilder("java_api5", "person",
                aggregationWriter).saveToCassandra();
    }
}
Please find the logs below too:
15/05/29 12:40:37 INFO FileInputFormat: Total input paths to process : 1
15/05/29 12:40:37 INFO SparkContext: Starting job: first at TestJavaAggregation1.java:89
15/05/29 12:40:37 INFO DAGScheduler: Got job 0 (first at TestJavaAggregation1.java:89) with 1 output partitions (allowLocal=true)
15/05/29 12:40:37 INFO DAGScheduler: Final stage: Stage 0(first at TestJavaAggregation1.java:89)
15/05/29 12:40:37 INFO DAGScheduler: Parents of final stage: List()
15/05/29 12:40:37 INFO DAGScheduler: Missing parents: List()
15/05/29 12:40:37 INFO DAGScheduler: Submitting Stage 0 (PersonAggregation_5.csv MappedRDD[1] at textFile at TestJavaAggregation1.java:84), which has no missing parents
15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(2560) called with curMem=157187, maxMem=1009589944
15/05/29 12:40:37 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.5 KB, free 962.7 MB)
15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(1897) called with curMem=159747, maxMem=1009589944
15/05/29 12:40:37 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1897.0 B, free 962.7 MB)
15/05/29 12:40:37 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54664 (size: 1897.0 B, free: 962.8 MB)
15/05/29 12:40:37 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/05/29 12:40:37 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/05/29 12:40:37 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PersonAggregation_5.csv MappedRDD[1] at textFile at TestJavaAggregation1.java:84)
15/05/29 12:40:37 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/05/29 12:40:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1326 bytes)
15/05/29 12:40:37 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/05/29 12:40:37 INFO CacheManager: Partition rdd_1_0 not found, computing it
15/05/29 12:40:37 INFO HadoopRDD: Input split: file:/F:/workspace/apoorva/TestProject/PersonAggregation_5.csv:0+230
15/05/29 12:40:37 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/05/29 12:40:37 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/05/29 12:40:37 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/05/29 12:40:37 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/05/29 12:40:37 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(680) called with curMem=161644, maxMem=1009589944
15/05/29 12:40:37 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 680.0 B, free 962.7 MB)
15/05/29 12:40:37 INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:54664 (size: 680.0 B, free: 962.8 MB)
15/05/29 12:40:37 INFO BlockManagerMaster: Updated info of block rdd_1_0
15/05/29 12:40:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2335 bytes result sent to driver
15/05/29 12:40:37 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 73 ms on localhost (1/1)
15/05/29 12:40:37 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/05/29 12:40:37 INFO DAGScheduler: Stage 0 (first at TestJavaAggregation1.java:89) finished in 0.084 s
15/05/29 12:40:37 INFO DAGScheduler: Job 0 finished: first at TestJavaAggregation1.java:89, took 0.129536 s
1,FName1,MName1,LName1,state1,city1,country1
1,FName1,MName1,LName1,state1,city1,country1
15/05/29 12:40:37 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
15/05/29 12:40:37 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
FirstRDD1
SecondRDD1
15/05/29 12:40:47 INFO SparkContext: Starting job: collect at TestJavaAggregation1.java:147
15/05/29 12:40:47 INFO DAGScheduler: Got job 1 (collect at TestJavaAggregation1.java:147) with 1 output partitions (allowLocal=false)
15/05/29 12:40:47 INFO DAGScheduler: Final stage: Stage 1(collect at TestJavaAggregation1.java:147)
15/05/29 12:40:47 INFO DAGScheduler: Parents of final stage: List()
15/05/29 12:40:47 INFO DAGScheduler: Missing parents: List()
15/05/29 12:40:47 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[2] at map at TestJavaAggregation1.java:117), which has no missing parents
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(3872) called with curMem=162324, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.8 KB, free 962.7 MB)
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(2604) called with curMem=166196, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.5 KB, free 962.7 MB)
15/05/29 12:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54664 (size: 2.5 KB, free: 962.8 MB)
15/05/29 12:40:47 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/05/29 12:40:47 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/05/29 12:40:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[2] at map at TestJavaAggregation1.java:117)
15/05/29 12:40:47 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/05/29 12:40:47 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1326 bytes)
15/05/29 12:40:47 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/05/29 12:40:47 INFO BlockManager: Found block rdd_1_0 locally
com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state1,city1,country1
15/05/29 12:40:47 INFO DCAwareRoundRobinPolicy: Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
15/05/29 12:40:47 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
Connected to cluster: Test Cluster
Datacenter: datacenter1; Host: localhost/127.0.0.1; Rack: rack1
com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state2,city2,country1
com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state3,city3,country1
com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state4,city4,country1
com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state5,city5,country1
15/05/29 12:40:47 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2343 bytes result sent to driver
15/05/29 12:40:47 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 184 ms on localhost (1/1)
15/05/29 12:40:47 INFO DAGScheduler: Stage 1 (collect at TestJavaAggregation1.java:147) finished in 0.185 s
15/05/29 12:40:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/05/29 12:40:47 INFO DAGScheduler: Job 1 finished: collect at TestJavaAggregation1.java:147, took 0.218779 s
______________________________5_______________________________
15/05/29 12:40:47 INFO SparkContext: Starting job: collect at TestJavaAggregation1.java:150
15/05/29 12:40:47 INFO DAGScheduler: Got job 2 (collect at TestJavaAggregation1.java:150) with 1 output partitions (allowLocal=false)
15/05/29 12:40:47 INFO DAGScheduler: Final stage: Stage 2(collect at TestJavaAggregation1.java:150)
15/05/29 12:40:47 INFO DAGScheduler: Parents of final stage: List()
15/05/29 12:40:47 INFO DAGScheduler: Missing parents: List()
15/05/29 12:40:47 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[2] at map at TestJavaAggregation1.java:117), which has no missing parents
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(3872) called with curMem=168800, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.8 KB, free 962.7 MB)
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(2604) called with curMem=172672, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.5 KB, free 962.7 MB)
15/05/29 12:40:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:54664 (size: 2.5 KB, free: 962.8 MB)
15/05/29 12:40:47 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/05/29 12:40:47 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:838
15/05/29 12:40:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 2 (MappedRDD[2] at map at TestJavaAggregation1.java:117)
15/05/29 12:40:47 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/05/29 12:40:47 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1326 bytes)
15/05/29 12:40:47 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
15/05/29 12:40:47 INFO BlockManager: Found block rdd_1_0 locally
com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state1,city1,country1
com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state2,city2,country1
com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state3,city3,country1
com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state4,city4,country1
com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state5,city5,country1
15/05/29 12:40:47 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 2343 bytes result sent to driver
15/05/29 12:40:47 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 16 ms on localhost (1/1)
15/05/29 12:40:47 INFO DAGScheduler: Stage 2 (collect at TestJavaAggregation1.java:150) finished in 0.016 s
15/05/29 12:40:47 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/05/29 12:40:47 INFO DAGScheduler: Job 2 finished: collect at TestJavaAggregation1.java:150, took 0.026302 s
When you run a Spark job on a cluster, Spark distributes the data across the cluster as RDDs, and the partitioning of the data is handled by Spark. When you create the lines RDD in your sparkConfig method by reading a file, Spark partitions the data and creates RDD partitions internally, so that in-memory computation runs over data distributed across your cluster. Your JavaRDD lines is therefore internally a union of several RDD partitions. Hence, when you run a map job on the JavaRDD lines, it runs over all the data partitioned among the internal RDDs that belong to the JavaRDD on which you called map. In your case Spark created two internal partitions of the JavaRDD lines, which is why the map function is called twice, once for each internal partition. Please tell me if you have any other questions.
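One more thing worth checking, going only by the code and logs in the question: the logs print FirstRDD1 (a single partition) and show two separate jobs, "collect at TestJavaAggregation1.java:147" and "collect at TestJavaAggregation1.java:150", both running the same MappedRDD[2]. The code calls result.collect() twice, and only lines is cached, not result, so each collect would re-run the map function, including the save inside it. The sketch below is plain Java, not the Spark API; LazyMapSketch and its methods are invented stand-ins that only mimic "every action recomputes an uncached mapping".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Plain-Java stand-in for a lazily evaluated dataset (NOT Spark code).
final class LazyMapSketch<T, R> {
    private final List<T> source;
    private final Function<T, R> fn;
    private boolean cacheRequested;
    private List<R> cached; // filled by the first collect() after cache()

    LazyMapSketch(List<T> source, Function<T, R> fn) {
        this.source = source;
        this.fn = fn;
    }

    LazyMapSketch<T, R> cache() {
        cacheRequested = true;
        return this;
    }

    // Every call re-applies fn to all rows unless a cached result exists.
    List<R> collect() {
        if (cached != null) {
            return cached;
        }
        List<R> out = new ArrayList<>();
        for (T row : source) {
            out.add(fn.apply(row));
        }
        if (cacheRequested) {
            cached = out;
        }
        return out;
    }

    // Counts how often the mapping function runs across two collect() calls.
    static int callsForTwoCollects(boolean useCache) {
        AtomicInteger calls = new AtomicInteger();
        LazyMapSketch<String, String> mapped = new LazyMapSketch<>(
                Arrays.asList("r1", "r2", "r3", "r4", "r5"),
                row -> {
                    calls.incrementAndGet(); // stands in for the save side effect
                    return row.toUpperCase();
                });
        if (useCache) {
            mapped.cache();
        }
        mapped.collect();
        mapped.collect();
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println(callsForTwoCollects(false)); // 10: five rows, mapped twice
        System.out.println(callsForTwoCollects(true));  // 5: second collect reuses the result
    }
}
```

If that is what happens here, collecting once and reusing the list, or caching result before the first collect, would likely remove the duplicate inserts. Keeping the write out of the map function altogether avoids depending on how often Spark evaluates it.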

Why does spark-shell throw ArrayIndexOutOfBoundsException when reading a large file from HDFS?

I am using Hadoop 2.4.1 and Spark 1.1.0. I have uploaded a dataset of food reviews to HDFS from here, and then I used the following code to read the file and process it in the Spark shell:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
var path = "hdfs:///user/hduser/finefoods.txt"
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "\n\n")
var dataset = sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf).map(_._2.toString)
var datasetObj = dataset.map{ rowStr => rowStr.split("\n")}
var tupleSet = datasetObj.map( strArr => strArr.map( elm => elm.split(": ")(1))).map( arr => (arr(0),arr(1),arr(4).toDouble))
tupleSet.groupBy(t => t._2)
When I run the last line tupleSet.groupBy(t => t._2), the spark shell throws the following exception:
scala> tupleSet.groupBy( t => t._2).first()
14/11/15 22:46:59 INFO spark.SparkContext: Starting job: first at <console>:28
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Registering RDD 11 (groupBy at <console>:28)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Got job 1 (first at <console>:28) with 1 output partitions (allowLocal=true)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Final stage: Stage 1(first at <console>:28)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 2)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Missing parents: List(Stage 2)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Submitting Stage 2 (MappedRDD[11] at groupBy at <console>:28), which has no missing parents
14/11/15 22:46:59 INFO storage.MemoryStore: ensureFreeSpace(3592) called with curMem=221261, maxMem=278302556
14/11/15 22:46:59 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.5 KB, free 265.2 MB)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from Stage 2 (MappedRDD[11] at groupBy at <console>:28)
14/11/15 22:46:59 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 3 tasks
14/11/15 22:46:59 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, localhost, ANY, 1221 bytes)
14/11/15 22:46:59 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 4, localhost, ANY, 1221 bytes)
14/11/15 22:46:59 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 3)
14/11/15 22:46:59 INFO executor.Executor: Running task 1.0 in stage 2.0 (TID 4)
14/11/15 22:46:59 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:0+134217728
14/11/15 22:46:59 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:134217728+134217728
14/11/15 22:47:02 ERROR executor.Executor: Exception in task 1.0 in stage 2.0 (TID 4)
java.lang.ArrayIndexOutOfBoundsException
14/11/15 22:47:02 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 5, localhost, ANY, 1221 bytes)
14/11/15 22:47:02 INFO executor.Executor: Running task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:268435456+102361028
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException:
14/11/15 22:47:02 ERROR scheduler.TaskSetManager: Task 1 in stage 2.0 failed 1 times; aborting job
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Cancelling stage 2
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Stage 2 was cancelled
14/11/15 22:47:02 INFO executor.Executor: Executor is trying to kill task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 INFO executor.Executor: Executor is trying to kill task 0.0 in stage 2.0 (TID 3)
14/11/15 22:47:02 INFO scheduler.DAGScheduler: Failed to run first at <console>:28
14/11/15 22:47:02 INFO executor.Executor: Executor killed task 0.0 in stage 2.0 (TID 3)
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, localhost): TaskKilled (killed intentionally)
14/11/15 22:47:02 INFO executor.Executor: Executor killed task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 2.0 (TID 5, localhost): TaskKilled (killed intentionally)
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
But when I use a dummy dataset like the following, it works well:
var tupleSet = sc.parallelize(List(
("B001E4KFG0","A3SGXH7AUHU8GW",3.0),
("B001E4KFG1","A3SGXH7AUHU8GW",4.0),
("B001E4KFG2","A3SGXH7AUHU8GW",4.0),
("B001E4KFG3","A3SGXH7AUHU8GW",4.0),
("B001E4KFG4","A3SGXH7AUHU8GW",5.0),
("B001E4KFG5","A3SGXH7AUHU8GW",5.0),
("B001E4KFG0","bbb",5.0)
))
Any idea why?
There's probably an entry in the dataset that does not follow the format, so elm.split(": ")(1) fails because there is no element at that index.
You can avoid that error by checking the result of the split before accessing index (1). One way of doing that could be something like this:
var tupleSet = datasetObj.map(strArr => strArr.map(elm => elm.split(": ")).collect { case x if x.length > 1 => x(1) })
One note: your example rows do not seem to match the parsing pipeline in the code. They do not contain the ": " tokens.
Since transformations are lazy, Spark won't tell you much about your input dataset (and you may not notice a problem) until you execute an action, such as first() on the result of groupBy().
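The length check suggested above can also be written as a small parsing function that returns nothing for malformed records instead of throwing. The following is a minimal sketch in plain Python, assuming the same layout as the question's code (value after ": ", fields at line positions 0, 1 and 4); value_of and parse_review are hypothetical names, and the sample record used to try it is made up.

```python
def value_of(line):
    """Return the text after the first ': ', or None if the separator is missing."""
    parts = line.split(": ", 1)
    return parts[1] if len(parts) == 2 else None


def parse_review(record):
    """Return [(product, user, score)] for a well-formed record, else []."""
    lines = record.split("\n")
    if len(lines) < 5:
        return []  # too few lines, e.g. a blank or truncated record
    product = value_of(lines[0])
    user = value_of(lines[1])
    score = value_of(lines[4])
    if product is None or user is None or score is None:
        return []  # a required line has no ': ' separator
    try:
        return [(product, user, float(score))]
    except ValueError:
        return []  # the score is not a number


# Made-up sample record, for illustration only.
sample = "\n".join([
    "product/productId: B001E4KFG0",
    "review/userId: A3SGXH7AUHU8GW",
    "review/profileName: someone",
    "review/helpfulness: 1/1",
    "review/score: 5.0",
])
parsed = parse_review(sample)
```

Because the function returns a list with zero or one element, it could be passed to flatMap on a PySpark RDD of record strings so that bad records are silently dropped; counting the dropped records separately would show how many entries break the format.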
It could also be due to empty or blank lines in your dataset, since you are applying a split function to the data. In that case, filter out the empty lines.
E.g.: myrdd.filter(_.nonEmpty).map(...)
I had a similar problem when converting log data into a DataFrame using PySpark.
When a log entry was invalid, I returned a null value instead of a Row instance. Before converting to a DataFrame, I filtered out these null values, but I still got the above problem. The error finally went away when I returned a Row with null values instead of a single null value.
Pseudo code below:
Didn't work:
rdd = Parse log (log lines to Rows if valid else None)
filtered_rdd = rdd.filter(lambda x: x is not None)
logs = sqlContext.inferSchema(filtered_rdd)
Worked:
rdd = Parse log (log lines to Rows if valid else Row(None,None,...))
logs = sqlContext.inferSchema(rdd)
filtered_rdd = logs.filter(logs['id'].isNotNull())

How to run more executors on Apache Spark Cluster mode

I have 50 workers, and I would like to run my job on all of them.
On master:8080 I can see all the workers, and on master:4040/executors I can see 50 executors, but when I run my job, the log shows this:
14/10/19 14:57:07 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/10/19 14:57:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, slave11, NODE_LOCAL, 1302 bytes)
14/10/19 14:57:07 INFO nio.ConnectionManager: Accepted connection from [slave11/10.10.10.21:42648]
14/10/19 14:57:07 INFO nio.SendingConnection: Initiating connection to [slave11/10.10.10.21:54398]
14/10/19 14:57:07 INFO nio.SendingConnection: Connected to [slave11/10.10.10.21:54398], 1 messages pending
14/10/19 14:57:07 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on slave11:54398 (size: 2.4 KB, free: 267.3 MB)
14/10/19 14:57:08 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slave11:54398 (size: 18.4 KB, free: 267.2 MB)
14/10/19 14:57:12 INFO storage.BlockManagerInfo: Added rdd_2_0 in memory on slave11:54398 (size: 87.4 MB, free: 179.8 MB)
14/10/19 14:57:12 INFO scheduler.DAGScheduler: Stage 0 (first at GeneralizedLinearAlgorithm.scala:141) finished in 5.473 s
14/10/19 14:57:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 5463 ms on slave11 (1/1)
14/10/19 14:57:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
I launch the job like this (command line):
master: $ ./spark-shell --master spark://master:7077
and run this (Scala code):
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
val fileName = "bc.txt"
val data = sc.textFile(fileName)
val splits = data.randomSplit(Array(0.9, 0.1), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
val training_1 = training.map { line =>
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble).toArray))
}
val test_1 = test.map { line =>
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble).toArray))
}
val numIterations = 200
val model = SVMWithSGD.train(training_1, numIterations)
My question is: why do only one or two (sometimes) tasks run on my cluster?
Is there any way to configure the number of tasks, or is it set automatically by the scheduler?
When my job runs as two tasks, it uses two executors (as I can see on master:4040) and gives a 2x speedup.
So I want to run my job on all executors. How can I do that?
Thanks everyone.
You can use the minPartitions parameter of textFile to set the minimum number of partitions, and therefore the minimum number of tasks, for example:
val data = sc.textFile(fileName, 10)
However, more partitions usually mean more network traffic, because with more partitions it becomes harder for Spark to dispatch each task to an executor that holds the data locally. You need to find a balanced value of minPartitions yourself.
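To get a feel for how minPartitions translates into a task count, here is a rough estimate in plain Python. It is a sketch only, assuming the split-size rule used by Hadoop's file input formats (split size = max(minimum split size, min(file size / requested splits, block size))) and a 128 MB block size. It ignores details such as the slack allowed on the last split, and estimate_text_file_partitions is a hypothetical helper, not a Spark function.

```python
import math

MB = 1024 * 1024


def estimate_text_file_partitions(file_bytes, min_partitions,
                                  block_bytes=128 * MB, min_split_bytes=1):
    """Approximate how many input splits (and so tasks) a single file yields."""
    # Target size per split if the file were cut into min_partitions pieces.
    goal = max(file_bytes // max(min_partitions, 1), 1)
    # A split never exceeds a block and never drops below the minimum size.
    split = max(min_split_bytes, min(goal, block_bytes))
    return math.ceil(file_bytes / split)


small_default = estimate_text_file_partitions(100 * MB, 2)
small_wide = estimate_text_file_partitions(100 * MB, 50)
large_default = estimate_text_file_partitions(300 * MB, 1)
```

Under these assumptions, a file smaller than one block yields only as many partitions as minPartitions asks for, which would be consistent with seeing just one or two tasks when no value is passed.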

Resources