How to run more executors on Apache Spark Cluster mode - apache-spark

I have 50 workers, and I would like to run my job on all of them.
At master:8080 I can see all the workers, and at master:4040/executors I can see 50 executors, but when I run my job the log shows this:
14/10/19 14:57:07 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/10/19 14:57:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, slave11, NODE_LOCAL, 1302 bytes)
14/10/19 14:57:07 INFO nio.ConnectionManager: Accepted connection from [slave11/10.10.10.21:42648]
14/10/19 14:57:07 INFO nio.SendingConnection: Initiating connection to [slave11/10.10.10.21:54398]
14/10/19 14:57:07 INFO nio.SendingConnection: Connected to [slave11/10.10.10.21:54398], 1 messages pending
14/10/19 14:57:07 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on slave11:54398 (size: 2.4 KB, free: 267.3 MB)
14/10/19 14:57:08 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on slave11:54398 (size: 18.4 KB, free: 267.2 MB)
14/10/19 14:57:12 INFO storage.BlockManagerInfo: Added rdd_2_0 in memory on slave11:54398 (size: 87.4 MB, free: 179.8 MB)
14/10/19 14:57:12 INFO scheduler.DAGScheduler: Stage 0 (first at GeneralizedLinearAlgorithm.scala:141) finished in 5.473 s
14/10/19 14:57:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 5463 ms on slave11 (1/1)
14/10/19 14:57:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
I start my job like this (command line):
master: $ ./spark-shell --master spark://master:7077
and then run this Scala code:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
val fileName = "bc.txt"
val data = sc.textFile(fileName)
val splits = data.randomSplit(Array(0.9, 0.1), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
val training_1 = training.map { line =>
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble).toArray))
}
val test_1 = test.map { line =>
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble).toArray))
}
val numIterations = 200
val model = SVMWithSGD.train(training_1, numIterations)
My question is: why do only one or two (sometimes) tasks run on my cluster?
Is there any way to configure the number of tasks, or is it decided automatically by the scheduler?
When my job runs as two tasks, it runs on two executors, as I can observe at master:4040,
and that gives a 2x speedup. So I want to run my job on all executors. How can I do that?
Thanks, everyone.

You can use the minPartitions parameter of textFile to set the minimum number of partitions, and therefore of tasks, for example:
val data = sc.textFile(fileName, 10)
However, more partitions usually mean more network traffic, because it becomes harder for Spark to schedule every task on an executor that holds the data locally. You will need to find a balanced value for minPartitions yourself.
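To see why a small input yields only one or two tasks, the sketch below mimics in plain Python how Hadoop's old-API FileInputFormat, which sc.textFile relies on, sizes its input splits. Spark's default for minPartitions is min(defaultParallelism, 2), so unless the file spans several blocks it is read as at most two partitions. This is an illustration only, not Spark code: the function name, the 100 MB file size and the 64 MB block size are assumptions.

```python
def split_sizes(total_size, num_splits, block_size, min_size=1):
    """Mimic the split-size rule of Hadoop's FileInputFormat for one splittable file.

    num_splits plays the role of minPartitions in sc.textFile(path, minPartitions).
    """
    split_slop = 1.1  # the last split may be up to 10% larger than the others
    goal_size = total_size // max(num_splits, 1)
    split_size = max(min_size, min(goal_size, block_size))

    sizes = []
    remaining = total_size
    while remaining / split_size > split_slop:
        sizes.append(split_size)
        remaining -= split_size
    if remaining > 0:
        sizes.append(remaining)
    return sizes


MB = 1024 * 1024
file_size = 100 * MB   # hypothetical input size
block_size = 64 * MB   # assumed file system block size

print(len(split_sizes(file_size, 2, block_size)))   # default minPartitions: 2 splits
print(len(split_sizes(file_size, 50, block_size)))  # one split per worker: 50 splits
```

Passing a larger minPartitions lowers the goal size and so raises the number of splits. The real count also depends on the file system's block size and on whether the file format is splittable.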

Related

Do not see data written from Spark in ADX

I am using azure-kusto-spark to write data to ADX. I can see the schema created in ADX, but I do not see any data, and there is no error in the log. Note that I am trying this with a local Spark instance.
df.show();
df.write()
.format("com.microsoft.kusto.spark.datasource")
.option(KustoSinkOptions.KUSTO_CLUSTER(), cluster)
.option(KustoSinkOptions.KUSTO_DATABASE(), db)
.option(KustoSinkOptions.KUSTO_TABLE(), table)
.option(KustoSinkOptions.KUSTO_AAD_APP_ID(), client_id)
.option(KustoSinkOptions.KUSTO_AAD_APP_SECRET(), client_key)
.option(KustoSinkOptions.KUSTO_AAD_AUTHORITY_ID(), "microsoft.com")
.option(KustoSinkOptions.KUSTO_TABLE_CREATE_OPTIONS(), "CreateIfNotExist")
.mode(SaveMode.Append)
.save();
22/12/13 12:06:14 INFO QueuedIngestClient: Creating a new IngestClient
22/12/13 12:06:14 INFO ResourceManager: Refreshing Ingestion Auth Token
22/12/13 12:06:16 INFO ResourceManager: Refreshing Ingestion Resources
22/12/13 12:06:16 INFO KustoConnector: ContainerProvider: Got 2 storage SAS with command :'.create tempstorage'. from service 'ingest-engineermetricdata.eastus'
22/12/13 12:06:16 INFO KustoConnector: ContainerProvider: Got 2 storage SAS with command :'.create tempstorage'. from service 'ingest-engineermetricdata.eastus'
22/12/13 12:06:16 INFO KustoConnector: KustoWriter$: finished serializing rows in partition 0 for requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:16 INFO KustoConnector: KustoWriter$: finished serializing rows in partition 1 for requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:17 INFO KustoConnector: KustoWriter$: Ingesting from blob - partition: 0 requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:17 INFO KustoConnector: KustoWriter$: Ingesting from blob - partition: 1 requestId: '9065b634-3b74-4993-830b-16ee534409d5'
22/12/13 12:06:19 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2135 bytes result sent to driver
22/12/13 12:06:19 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2135 bytes result sent to driver
22/12/13 12:06:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6306 ms on 192.168.50.160 (executor driver) (1/2)
22/12/13 12:06:19 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6231 ms on 192.168.50.160 (executor driver) (2/2)
22/12/13 12:06:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/12/13 12:06:19 INFO DAGScheduler: ResultStage 0 (foreachPartition at KustoWriter.scala:107) finished in 7.070 s
22/12/13 12:06:19 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/12/13 12:06:19 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
22/12/13 12:06:19 INFO DAGScheduler: Job 0 finished: foreachPartition at KustoWriter.scala:107, took 7.157414 s
22/12/13 12:06:19 INFO KustoConnector: KustoClient: Polling on ingestion results for requestId: 9065b634-3b74-4993-830b-16ee534409d5, will move data to destination table when finished
22/12/13 12:13:30 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.50.160:56364 in memory (size: 4.9 KiB, free: 2004.6 MiB)
Local Spark writes data to ADX
The following code works.
Tested on Azure Databricks.
11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12).
com.microsoft.azure.kusto:kusto-spark_3.0_2.12:3.1.6
import com.microsoft.kusto.spark.datasink.KustoSinkOptions
import org.apache.spark.sql.{SaveMode, SparkSession}
val cluster = "..."
val client_id = "..."
val client_key = "..."
val authority = "..."
val db = "mydb"
val table = "mytable"
val df = spark.range(10)
df.show()
df.write
.format("com.microsoft.kusto.spark.datasource")
.option(KustoSinkOptions.KUSTO_CLUSTER, cluster)
.option(KustoSinkOptions.KUSTO_DATABASE, db)
.option(KustoSinkOptions.KUSTO_TABLE, table)
.option(KustoSinkOptions.KUSTO_AAD_APP_ID, client_id)
.option(KustoSinkOptions.KUSTO_AAD_APP_SECRET, client_key)
.option(KustoSinkOptions.KUSTO_AAD_AUTHORITY_ID, authority)
.option(KustoSinkOptions.KUSTO_TABLE_CREATE_OPTIONS, "CreateIfNotExist")
.mode(SaveMode.Append)
.save()
The ingestion time depends on the ingestion batching policy of the table.
Defaults and limits
Type             Property                 Default  Low latency setting  Minimum value  Maximum value
Number of items  MaximumNumberOfItems     1000     1000                 1              25,000
Data size (MB)   MaximumRawDataSizeMB     1024     1024                 100            4096
Time (sec)       MaximumBatchingTimeSpan  300      20 - 30              10             1800
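As a rough illustration of how the three limits interact, the plain-Python sketch below treats a batch as sealed as soon as any one limit is reached. With the default policy, a small write can therefore wait up to 300 seconds before it becomes visible in the table. The class and function names are hypothetical and are not part of the Kusto API; the default values are the ones listed in the "Default" column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchingPolicy:
    """Hypothetical container for the three batching limits."""
    max_items: int = 1000
    max_raw_size_mb: int = 1024
    max_time_span_sec: int = 300


def is_sealed(policy, items, raw_size_mb, elapsed_sec):
    """A batch is sealed when the first of the three limits is reached."""
    return (
        items >= policy.max_items
        or raw_size_mb >= policy.max_raw_size_mb
        or elapsed_sec >= policy.max_time_span_sec
    )


default_policy = BatchingPolicy()
low_latency = BatchingPolicy(max_time_span_sec=30)  # assumed low-latency setting

# A tiny write (2 items, 1 MB) ten seconds after it was queued:
print(is_sealed(default_policy, items=2, raw_size_mb=1, elapsed_sec=10))   # False
# The same write once the default time span has elapsed:
print(is_sealed(default_policy, items=2, raw_size_mb=1, elapsed_sec=300))  # True
# With a 30 second time span the batch is sealed much earlier:
print(is_sealed(low_latency, items=2, raw_size_mb=1, elapsed_sec=30))      # True
```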

Spark-ftp : DataFrame is Saved to FTP incorrectly

I am struggling with spark-sftp. I am reading from an Oracle DB and then want to write the output data (from a DataFrame) to FTP. Everything runs fine, but why is it copying a file called 1part-XXX..csv.crc instead of the .csv?
Here is the code:
val jdbcSqlConnStr = "jdbc:oracle:thin://@Server:1601/WW"
val jdbcDbTable = "(select CAST(ID as INT) Program_ID, Program_name from PROGRAM WHERE ROWNUM <=100) P"
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> jdbcSqlConnStr,
"dbtable" -> jdbcDbTable,
"driver" -> "oracle.jdbc.driver.OracleDriver",
"user" -> "user",
"password" -> "pass"
)).load
jdbcDF.write.
format("com.springml.spark.sftp").
option("host", "ftp.Server.com").
option("username", "user").
option("password", "*****").
option("fileType", "csv").
option("delimiter", "|").
save("/Test/sample.csv")
But the output file uploaded to FTP is binary and I found this in console output:
18/02/08 17:08:43 INFO FileOutputCommitter: Saved output of task 'attempt_20180208170840_0000_m_000000_0' to file:/C:/Users/aarafeh/AppData/Local/Temp/spark_sftp_connection_temp286/_temporary/0/task_20180208170840_0000_m_000000
18/02/08 17:08:43 INFO SparkHadoopMapRedUtil: attempt_20180208170840_0000_m_000000_0: Committed
18/02/08 17:08:43 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1565 bytes result sent to driver
18/02/08 17:08:43 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3591 ms on localhost (executor driver) (1/1)
18/02/08 17:08:43 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/08 17:08:43 INFO DAGScheduler: ResultStage 0 (csv at DefaultSource.scala:243) finished in 3.611 s
18/02/08 17:08:43 INFO DAGScheduler: Job 0 finished: csv at DefaultSource.scala:243, took 3.814856 s
18/02/08 17:08:44 INFO FileFormatWriter: Job null committed.
18/02/08 17:08:44 INFO DefaultSource: Copying C:\Users\aarafeh\AppData\Local\Temp\spark_sftp_connection_temp286.part-00000-1efdd0f1-8201-49b4-af15-5878204e57ea-c000.csv.crc to /J28446_Engage/Test/sample.csv
18/02/08 17:08:46 INFO SFTPClient: Copying files from C:\Users\aarafeh\AppData\Local\Temp\spark_sftp_connection_temp286.part-00000-1efdd0f1-8201-49b4-af15-5878204e57ea-c000.csv.crc to /J28446_Engage/Test/sample.csv
18/02/08 17:08:47 INFO SFTPClient: Copied files successfully...
The file (sample.csv) was uploaded successfully, but it is binary because the .crc file was uploaded instead of the CSV.
Any idea why this happens and how to solve it?
I raised this as an issue on the spark-sftp project, as shown here:
https://github.com/springml/spark-sftp/issues/18
and they fixed it.
Thanks.
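The log above shows the Hadoop checksum side file (.crc) being picked from the temporary directory instead of the CSV part file. As a minimal sketch of the kind of selection logic that avoids this, here is a plain-Python version with a made-up directory listing. It is not the actual spark-sftp patch, and the function name is hypothetical.

```python
def pick_data_file(file_names, extension):
    """Return the single part file with the wanted extension.

    Hadoop side files are skipped: checksum files start with "." and end in
    ".crc", and marker files such as _SUCCESS start with "_".
    """
    candidates = [
        name
        for name in file_names
        if name.endswith("." + extension) and not name.startswith((".", "_"))
    ]
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one .{extension} part file, found {candidates}"
        )
    return candidates[0]


# Hypothetical listing of the temporary output directory
listing = [
    "_SUCCESS",
    "._SUCCESS.crc",
    "part-00000-1efdd0f1-c000.csv",
    ".part-00000-1efdd0f1-c000.csv.crc",
]
print(pick_data_file(listing, "csv"))  # part-00000-1efdd0f1-c000.csv
```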

Spark Cassandra connector saveToCassandra() is sending data to driver and causing an OOM exception

I am trying to use the Spark Cassandra connector.
Here is my code:
JavaRDD<UserStatistics> rdd = CassandraJavaUtil.javaFunctions(sparkContext)
    .cassandraTable(ConfigStore.read("cassandra", "keyspace"), "user_activity_" + type)
    .where("bucket =?", date)
    .select("user_id", "code")
    .mapToPair(row -> new Tuple2<String, Integer>(row.getString("user_id"), 1))
    .reduceByKey((value1, value2) -> value1 + value2)
    .map(s -> {
        List<UserStatistics> userStatistics = new ArrayList<>();
        UserStatistics userStatistic = new UserStatistics();
        userStatistic.setUser_id(s._1);
        userStatistic.setStatistics_type(type);
        long total = s._2;
        int failureCount = 0; //s._2._2().iterator().next();
        int selectedCount = 0; //s._2._2().iterator().next();
        userStatistic.setTotal_count((int) total);
        userStatistic.setFailure_count(failureCount);
        userStatistic.setSelected_count(selectedCount);
        userStatistics.add(userStatistic);
        return userStatistic;
    });
CassandraJavaUtil.javaFunctions(rdd)
    .writerBuilder(ConfigStore.read("cassandra", "keyspace"), "user_statistics", mapToRow(UserStatistics.class))
    .saveToCassandra();
After I execute this, it outputs the following and eventually throws an OOM exception on the driver.
I am not sure why it is trying to send data to the driver.
Executor: Finished task 1007.0 in stage 0.0 (TID 1007). 84821 bytes result sent to driver
15/09/29 13:57:32 INFO TaskSetManager: Starting task 1016.0 in stage 0.0 (TID 1016, localhost, NODE_LOCAL, 2096 bytes)
15/09/29 13:57:32 INFO TaskSetManager: Finished task 1007.0 in stage 0.0 (TID 1007) in 78 ms on localhost (1009/640442)
15/09/29 13:57:32 INFO Executor: Running task 1016.0 in stage 0.0 (TID 1016)

Does Spark create two datasets or stages that run the same logic?

I was trying to read from a CSV file and insert the entries into a database.
I found that Spark internally created two RDDs, rdd_0_0 and rdd_0_1, that work on the same data and do all the processing.
Can anyone help me figure out why the call method is invoked twice by different datasets?
If two datasets/stages are created, why do both of them run the same logic?
Please help me confirm whether this is how Spark works.
public final class TestJavaAggregation1 implements Serializable {
    private static final long serialVersionUID = 1L;
    static CassandraConfig config = null;
    static PreparedStatement statement = null;
    private transient SparkConf conf;
    private PersonAggregationRowWriterFactory aggregationWriter = new PersonAggregationRowWriterFactory();
    public Session session;
    private TestJavaAggregation1(SparkConf conf) {
        this.conf = conf;
    }
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("REadFromCSVFile").setMaster("local[1]").set("spark.executor.memory", "1g");
        conf.set("spark.cassandra.connection.host", "localhost");
        TestJavaAggregation1 app = new TestJavaAggregation1(conf);
        app.run();
    }
    private void run() {
        JavaSparkContext sc = new JavaSparkContext(conf);
        aggregateData(sc);
        sc.stop();
    }
    private JavaRDD<String> sparkConfig(JavaSparkContext sc) {
        JavaRDD<String> lines = sc.textFile("PersonAggregation1_500.csv", 1);
        System.out.println(lines.getCheckpointFile());
        lines.cache();
        final String heading = lines.first();
        System.out.println(heading);
        String headerValues = heading.replaceAll("\t", ",");
        System.out.println(headerValues);
        CassandraConnector connector = CassandraConnector.apply(sc.getConf());
        Session session = connector.openSession();
        try {
            session.execute("DROP KEYSPACE IF EXISTS java_api5");
            session.execute("CREATE KEYSPACE java_api5 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE java_api5.person (hashvalue INT, id INT, state TEXT, city TEXT, country TEXT, full_name TEXT, PRIMARY KEY((hashvalue), id, state, city, country, full_name)) WITH CLUSTERING ORDER BY (id DESC);");
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return lines;
    }
    @SuppressWarnings("serial")
    public void aggregateData(JavaSparkContext sc) {
        JavaRDD<String> lines = sparkConfig(sc);
        System.out.println("FirstRDD" + lines.partitions().size());
        JavaRDD<PersonAggregation> result = lines.map(new Function<String, PersonAggregation>() {
            int i = 0;
            public PersonAggregation call(String row) {
                PersonAggregation aggregate = new PersonAggregation();
                row = row + "," + this.hashCode();
                String[] parts = row.split(",");
                aggregate.setId(Integer.valueOf(parts[0]));
                aggregate.setFull_name(parts[1]);
                aggregate.setState(parts[4]);
                aggregate.setCity(parts[5]);
                aggregate.setCountry(parts[6]);
                aggregate.setHashValue(Integer.valueOf(parts[7]));
                // The save below inserts 200 entries into the database, while the CSV file has only 100 records.
                saveToJavaCassandra(aggregate);
                return aggregate;
            }
        });
        System.out.println(result.collect().size());
        List<PersonAggregation> personAggregationList = result.collect();
        JavaRDD<PersonAggregation> aggregateRDD = sc.parallelize(personAggregationList);
        javaFunctions(aggregateRDD).writerBuilder("java_api5", "person",
            aggregationWriter).saveToCassandra();
    }
}
Please find the logs below too:
15/05/29 12:40:37 INFO FileInputFormat: Total input paths to process : 1
15/05/29 12:40:37 INFO SparkContext: Starting job: first at TestJavaAggregation1.java:89
15/05/29 12:40:37 INFO DAGScheduler: Got job 0 (first at TestJavaAggregation1.java:89) with 1 output partitions (allowLocal=true)
15/05/29 12:40:37 INFO DAGScheduler: Final stage: Stage 0(first at TestJavaAggregation1.java:89)
15/05/29 12:40:37 INFO DAGScheduler: Parents of final stage: List()
15/05/29 12:40:37 INFO DAGScheduler: Missing parents: List()
15/05/29 12:40:37 INFO DAGScheduler: Submitting Stage 0 (PersonAggregation_5.csv MappedRDD[1] at textFile at TestJavaAggregation1.java:84), which has no missing parents
15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(2560) called with curMem=157187, maxMem=1009589944
15/05/29 12:40:37 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.5 KB, free 962.7 MB)
15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(1897) called with curMem=159747, maxMem=1009589944
15/05/29 12:40:37 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1897.0 B, free 962.7 MB)
15/05/29 12:40:37 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54664 (size: 1897.0 B, free: 962.8 MB)
15/05/29 12:40:37 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/05/29 12:40:37 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/05/29 12:40:37 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PersonAggregation_5.csv MappedRDD[1] at textFile at TestJavaAggregation1.java:84)
15/05/29 12:40:37 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/05/29 12:40:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1326 bytes)
15/05/29 12:40:37 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/05/29 12:40:37 INFO CacheManager: Partition rdd_1_0 not found, computing it
15/05/29 12:40:37 INFO HadoopRDD: Input split: file:/F:/workspace/apoorva/TestProject/PersonAggregation_5.csv:0+230
15/05/29 12:40:37 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/05/29 12:40:37 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/05/29 12:40:37 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/05/29 12:40:37 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/05/29 12:40:37 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/05/29 12:40:37 INFO MemoryStore: ensureFreeSpace(680) called with curMem=161644, maxMem=1009589944
15/05/29 12:40:37 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 680.0 B, free 962.7 MB)
15/05/29 12:40:37 INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:54664 (size: 680.0 B, free: 962.8 MB)
15/05/29 12:40:37 INFO BlockManagerMaster: Updated info of block rdd_1_0
15/05/29 12:40:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2335 bytes result sent to driver
15/05/29 12:40:37 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 73 ms on localhost (1/1)
15/05/29 12:40:37 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/05/29 12:40:37 INFO DAGScheduler: Stage 0 (first at TestJavaAggregation1.java:89) finished in 0.084 s
15/05/29 12:40:37 INFO DAGScheduler: Job 0 finished: first at TestJavaAggregation1.java:89, took 0.129536 s
1,FName1,MName1,LName1,state1,city1,country1
1,FName1,MName1,LName1,state1,city1,country1
15/05/29 12:40:37 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
15/05/29 12:40:37 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
FirstRDD1
SecondRDD1
15/05/29 12:40:47 INFO SparkContext: Starting job: collect at TestJavaAggregation1.java:147
15/05/29 12:40:47 INFO DAGScheduler: Got job 1 (collect at TestJavaAggregation1.java:147) with 1 output partitions (allowLocal=false)
15/05/29 12:40:47 INFO DAGScheduler: Final stage: Stage 1(collect at TestJavaAggregation1.java:147)
15/05/29 12:40:47 INFO DAGScheduler: Parents of final stage: List()
15/05/29 12:40:47 INFO DAGScheduler: Missing parents: List()
15/05/29 12:40:47 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[2] at map at TestJavaAggregation1.java:117), which has no missing parents
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(3872) called with curMem=162324, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.8 KB, free 962.7 MB)
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(2604) called with curMem=166196, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.5 KB, free 962.7 MB)
15/05/29 12:40:47 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54664 (size: 2.5 KB, free: 962.8 MB)
15/05/29 12:40:47 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/05/29 12:40:47 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/05/29 12:40:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[2] at map at TestJavaAggregation1.java:117)
15/05/29 12:40:47 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/05/29 12:40:47 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1326 bytes)
15/05/29 12:40:47 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/05/29 12:40:47 INFO BlockManager: Found block rdd_1_0 locally
com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state1,city1,country1
15/05/29 12:40:47 INFO DCAwareRoundRobinPolicy: Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor)
15/05/29 12:40:47 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
Connected to cluster: Test Cluster
Datacenter: datacenter1; Host: localhost/127.0.0.1; Rack: rack1
com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state2,city2,country1
com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state3,city3,country1
com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state4,city4,country1
com.local.myProj1.TestJavaAggregation1$1@2f877f16,797409046,state5,city5,country1
15/05/29 12:40:47 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2343 bytes result sent to driver
15/05/29 12:40:47 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 184 ms on localhost (1/1)
15/05/29 12:40:47 INFO DAGScheduler: Stage 1 (collect at TestJavaAggregation1.java:147) finished in 0.185 s
15/05/29 12:40:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/05/29 12:40:47 INFO DAGScheduler: Job 1 finished: collect at TestJavaAggregation1.java:147, took 0.218779 s
______________________________5_______________________________
15/05/29 12:40:47 INFO SparkContext: Starting job: collect at TestJavaAggregation1.java:150
15/05/29 12:40:47 INFO DAGScheduler: Got job 2 (collect at TestJavaAggregation1.java:150) with 1 output partitions (allowLocal=false)
15/05/29 12:40:47 INFO DAGScheduler: Final stage: Stage 2(collect at TestJavaAggregation1.java:150)
15/05/29 12:40:47 INFO DAGScheduler: Parents of final stage: List()
15/05/29 12:40:47 INFO DAGScheduler: Missing parents: List()
15/05/29 12:40:47 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[2] at map at TestJavaAggregation1.java:117), which has no missing parents
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(3872) called with curMem=168800, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.8 KB, free 962.7 MB)
15/05/29 12:40:47 INFO MemoryStore: ensureFreeSpace(2604) called with curMem=172672, maxMem=1009589944
15/05/29 12:40:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.5 KB, free 962.7 MB)
15/05/29 12:40:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:54664 (size: 2.5 KB, free: 962.8 MB)
15/05/29 12:40:47 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/05/29 12:40:47 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:838
15/05/29 12:40:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 2 (MappedRDD[2] at map at TestJavaAggregation1.java:117)
15/05/29 12:40:47 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/05/29 12:40:47 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1326 bytes)
15/05/29 12:40:47 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
15/05/29 12:40:47 INFO BlockManager: Found block rdd_1_0 locally
com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state1,city1,country1
com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state2,city2,country1
com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state3,city3,country1
com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state4,city4,country1
com.local.myProj1.TestJavaAggregation1$1@17b560af,397762735,state5,city5,country1
15/05/29 12:40:47 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 2343 bytes result sent to driver
15/05/29 12:40:47 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 16 ms on localhost (1/1)
15/05/29 12:40:47 INFO DAGScheduler: Stage 2 (collect at TestJavaAggregation1.java:150) finished in 0.016 s
15/05/29 12:40:47 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/05/29 12:40:47 INFO DAGScheduler: Job 2 finished: collect at TestJavaAggregation1.java:150, took 0.026302 s
When you run a job on a Spark cluster, Spark distributes the data across the cluster as RDDs, and Spark itself handles the partitioning. When you create the lines RDD in your sparkConfig method by reading a file, Spark splits the data into RDD partitions so that in-memory computation runs over data distributed across the cluster. Your JavaRDD lines is therefore, internally, a union of several RDD partitions. Hence, when you run a map on lines, the function runs over all the data in every partition that belongs to that JavaRDD. In your case Spark created two internal partitions of lines, which is why the map function is called two times, once for each partition. Note also that transformations are lazy: the posted code calls result.collect() twice, and the log shows two collect jobs that each rerun the map, so the save inside call executes once per collect. Please tell me if you have any other questions.
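The log in the question shows two separate jobs, "collect at TestJavaAggregation1.java:147" and "collect at TestJavaAggregation1.java:150", and each one runs the map stage again. The plain-Python analogy below involves no Spark at all: LazyMapped is a hypothetical stand-in for an RDD produced by map, and the list saved stands in for the database table. It shows how a side effect inside a lazily evaluated map runs once per action.

```python
class LazyMapped:
    """Stand-in for a mapped RDD: it remembers the function, not the results."""

    def __init__(self, source, func, cache=False):
        self._source = list(source)
        self._func = func
        self._cache_enabled = cache
        self._cached = None

    def collect(self):
        # Without caching, every call re-applies the function to every element.
        if self._cached is not None:
            return self._cached
        result = [self._func(item) for item in self._source]
        if self._cache_enabled:
            self._cached = result
        return result


saved = []  # stands in for the rows written to the database


def save_and_return(row):
    saved.append(row)  # side effect, like saveToJavaCassandra(aggregate)
    return row.upper()


rows = ["a", "b", "c"]
result = LazyMapped(rows, save_and_return)
print(len(result.collect()))  # 3, first action
result.collect()              # second action reruns the map
print(len(saved))             # 6, twice the number of input rows
```

In Spark terms, the usual ways out are to call collect() once and reuse the returned list, to persist the mapped RDD before running several actions on it (which avoids recomputation only while the cached blocks are not evicted), or to keep writes out of the map function and let saveToCassandra do them.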

Why does spark-shell throw ArrayIndexOutOfBoundsException when reading a large file from HDFS?

I am using Hadoop 2.4.1 and Spark 1.1.0. I uploaded a dataset of food reviews to HDFS and then used the following code to read the file and process it in the Spark shell:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
var path = "hdfs:///user/hduser/finefoods.txt"
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "\n\n")
var dataset = sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf).map(_._2.toString)
var datasetObj = dataset.map{ rowStr => rowStr.split("\n")}
var tupleSet = datasetObj.map( strArr => strArr.map( elm => elm.split(": ")(1))).map( arr => (arr(0),arr(1),arr(4).toDouble))
tupleSet.groupBy(t => t._2)
When I run the last line tupleSet.groupBy(t => t._2), the spark shell throws the following exception:
scala> tupleSet.groupBy( t => t._2).first()
14/11/15 22:46:59 INFO spark.SparkContext: Starting job: first at <console>:28
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Registering RDD 11 (groupBy at <console>:28)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Got job 1 (first at <console>:28) with 1 output partitions (allowLocal=true)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Final stage: Stage 1(first at <console>:28)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 2)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Missing parents: List(Stage 2)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Submitting Stage 2 (MappedRDD[11] at groupBy at <console>:28), which has no missing parents
14/11/15 22:46:59 INFO storage.MemoryStore: ensureFreeSpace(3592) called with curMem=221261, maxMem=278302556
14/11/15 22:46:59 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.5 KB, free 265.2 MB)
14/11/15 22:46:59 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from Stage 2 (MappedRDD[11] at groupBy at <console>:28)
14/11/15 22:46:59 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 3 tasks
14/11/15 22:46:59 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, localhost, ANY, 1221 bytes)
14/11/15 22:46:59 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 4, localhost, ANY, 1221 bytes)
14/11/15 22:46:59 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 3)
14/11/15 22:46:59 INFO executor.Executor: Running task 1.0 in stage 2.0 (TID 4)
14/11/15 22:46:59 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:0+134217728
14/11/15 22:46:59 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:134217728+134217728
14/11/15 22:47:02 ERROR executor.Executor: Exception in task 1.0 in stage 2.0 (TID 4)
java.lang.ArrayIndexOutOfBoundsException
14/11/15 22:47:02 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 5, localhost, ANY, 1221 bytes)
14/11/15 22:47:02 INFO executor.Executor: Running task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 INFO rdd.NewHadoopRDD: Input split: hdfs://10.12.0.245/user/hduser/finefoods.txt:268435456+102361028
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException:
14/11/15 22:47:02 ERROR scheduler.TaskSetManager: Task 1 in stage 2.0 failed 1 times; aborting job
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Cancelling stage 2
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Stage 2 was cancelled
14/11/15 22:47:02 INFO executor.Executor: Executor is trying to kill task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 INFO executor.Executor: Executor is trying to kill task 0.0 in stage 2.0 (TID 3)
14/11/15 22:47:02 INFO scheduler.DAGScheduler: Failed to run first at <console>:28
14/11/15 22:47:02 INFO executor.Executor: Executor killed task 0.0 in stage 2.0 (TID 3)
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 3, localhost): TaskKilled (killed intentionally)
14/11/15 22:47:02 INFO executor.Executor: Executor killed task 2.0 in stage 2.0 (TID 5)
14/11/15 22:47:02 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 2.0 (TID 5, localhost): TaskKilled (killed intentionally)
14/11/15 22:47:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 4, localhost): java.lang.ArrayIndexOutOfBoundsException:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
But when I use dummy dataset like the following, it works well:
var tupleSet = sc.parallelize(List(
("B001E4KFG0","A3SGXH7AUHU8GW",3.0),
("B001E4KFG1","A3SGXH7AUHU8GW",4.0),
("B001E4KFG2","A3SGXH7AUHU8GW",4.0),
("B001E4KFG3","A3SGXH7AUHU8GW",4.0),
("B001E4KFG4","A3SGXH7AUHU8GW",5.0),
("B001E4KFG5","A3SGXH7AUHU8GW",5.0),
("B001E4KFG0","bbb",5.0)
))
Any idea why?
There is probably an entry in the dataset that does not follow the format, so elm.split(": ")(1) fails because there is no element at that index.
You can avoid the error by checking the result of the split before accessing index (1). One way of doing that could be something like this:
var tupleSet = datasetObj.map(strArr => strArr.map(_.split(": ")).collect { case x if x.length > 1 => x(1) })
One note: your examples do not seem to match the parsing pipeline in the code. They do not contain the ": " tokens.
Since transformations are lazy, Spark will not report problems in your input dataset (and you may not notice them) until an action such as first() is executed.
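The same defensive idea can be sketched in plain Python: parse each record, and return None instead of raising when a line lacks the ": " separator or a field is missing. The sample records are made up, parse_record is a hypothetical helper, and the field positions (0, 1 and 4) are the ones used in the question's code.

```python
def parse_record(record):
    """Return (field0, field1, score) or None if the record is malformed."""
    values = []
    for line in record.split("\n"):
        # maxsplit=1 keeps any later ": " inside the value itself
        parts = line.split(": ", 1)
        if len(parts) < 2:
            return None  # no separator on this line
        values.append(parts[1])
    if len(values) < 5:
        return None  # too few fields to reach index 4
    try:
        return (values[0], values[1], float(values[4]))
    except ValueError:
        return None  # the score is not a number


# Made-up sample input: one valid record, one blank, one without separator
records = [
    "id: B001E4KFG0\nuser: A3SGXH7AUHU8GW\nname: x\nhelpful: 1/1\nscore: 5.0",
    "",
    "id B001E4KFG1 (separator missing)",
]
parsed = [t for t in (parse_record(r) for r in records) if t is not None]
print(parsed)  # [('B001E4KFG0', 'A3SGXH7AUHU8GW', 5.0)]
```

Counting how many records return None is also a cheap way to find out how much of the input is malformed before running the full job.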
It could also be due to empty or blank lines in your dataset, since you are applying a split function to the data. In that case, filter out the empty lines.
E.g.: myrdd.filter(_.nonEmpty).map(...)
I had a similar problem when converting log data into a DataFrame using PySpark.
When a log entry was invalid, I returned a null value instead of a Row instance, and I filtered out these null values before converting to a DataFrame. Still, I got the problem above. The error finally went away when I returned a Row with null values instead of a single null value.
Pseudocode below.
Did not work:
rdd = Parse log (log lines to Rows if valid else None)
filtered_rdd = rdd.filter(lambda x: x is not None)
logs = sqlContext.inferSchema(filtered_rdd)
Worked:
rdd = Parse log (log lines to Rows if valid else Row(None,None,...))
logs = sqlContext.inferSchema(rdd)
filtered_rdd = logs.filter(logs['id'].isNotNull())
