Spark job getting stuck forever

I am trying to run a Spark job in Docker, but it gets stuck after printing its output. Am I missing something?
FROM bitnami/spark:3
USER root
RUN curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.231/aws-java-sdk-bundle-1.12.231.jar --output /opt/bitnami/spark/jars/aws-java-sdk-bundle-1.12.231.jar
RUN curl https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar --output /opt/bitnami/spark/jars/jets3t-0.9.4.jar
RUN curl https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/2.1.0.10/redshift-jdbc42-2.1.0.10.jar --output /opt/bitnami/spark/jars/redshift-jdbc42-2.1.0.10.jar
RUN curl https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.28.0.jar --output /opt/bitnami/spark/jars/spark-bigquery-with-dependencies_2.12-0.28.0.jar
COPY ./requirements_copy.txt /
RUN pip install -r /requirements_copy.txt
version: '3'
services:
  spark:
    image: spark-air:latest
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - AWS_ACCESS_KEY=${AWS_ACCESS_KEY}
      - AWS_SECRET_KEY=${AWS_SECRET_KEY}
    volumes:
      - ./dags:/opt/bitnami/spark/dags/:rw
    ports:
      - '8090:8080'
  spark-worker:
    image: spark-air:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - AWS_ACCESS_KEY=${AWS_ACCESS_KEY}
      - AWS_SECRET_KEY=${AWS_SECRET_KEY}
    volumes:
      - ./dags:/opt/bitnami/spark/dags/:rw
from pyspark import SparkContext
sc = SparkContext("local", "First App")
data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True},
{"Category": 'B', "ID": 2, "Value": 300.01, "Truth": False},
{"Category": 'C', "ID": 3, "Value": 10.99, "Truth": None},
{"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True}]
df = sc.parallelize(data)
df = df.collect()
print(df)
23/02/15 05:32:58 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1090 bytes result sent to driver
23/02/15 05:32:58 DEBUG ExecutorMetricsPoller: stageTCMP: (0, 0) -> 0
23/02/15 05:32:58 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 0
23/02/15 05:32:58 DEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
23/02/15 05:32:58 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1867 ms on d981929c6c2c (executor driver) (1/1)
23/02/15 05:32:58 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
23/02/15 05:32:58 INFO DAGScheduler: ResultStage 0 (collect at /opt/bitnami/spark/dags/test.py:12) finished in 3.275 s
23/02/15 05:32:58 DEBUG DAGScheduler: After removal of stage 0, remaining stages = 0
23/02/15 05:32:58 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
23/02/15 05:32:58 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
23/02/15 05:32:58 INFO DAGScheduler: Job 0 finished: collect at /opt/bitnami/spark/dags/test.py:12, took 3.794097 s
[{'Category': 'A', 'ID': 1, 'Value': 121.44, 'Truth': True}, {'Category': 'B', 'ID': 2, 'Value': 300.01, 'Truth': False}, {'Category': 'C', 'ID': 3, 'Value': 10.99, 'Truth': None}, {'Category': 'E', 'ID': 4, 'Value': 33.87, 'Truth': True}]
23/02/15 05:33:04 DEBUG ExecutorMetricsPoller: removing (0, 0) from stageTCMP
It produces the expected output, but then it stays stuck forever until I kill it manually. I want to integrate this with Airflow, where I can't kill it manually.
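No answer is recorded here for this question. Because the stated constraint is that the job cannot be killed by hand once Airflow runs it, one defensive option is to bound the run time from the launcher side. The following is a minimal Python sketch under that assumption, not a fix for the hang itself: run_with_timeout is a hypothetical helper, and the spark-submit command line in the comment is an assumed way of starting the script.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return None


if __name__ == "__main__":
    # Assumed real-world command (hypothetical, adjust to your setup):
    #   ["spark-submit", "/opt/bitnami/spark/dags/test.py"]
    # A harmless stand-in command is used here so the sketch runs anywhere.
    exit_code = run_with_timeout([sys.executable, "-c", "print('job output')"], 60)
    print("timed out" if exit_code is None else f"exit code {exit_code}")
```

A wrapper like this only stops the task from blocking forever; the reason the process does not exit still needs to be found separately.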

Related

spark long delay before submitting jobs to the executors

I'm using the spark-cassandra connector through Spark SQL to query my Cassandra cluster. Each Cassandra node has a co-located Spark worker.
Problem: there is a long delay before tasks are submitted to the executors (based on the timestamps in the web UI and in the driver logs). The query is a simple select that specifies all Cassandra partition keys and contains two stages and two tasks. Previously, the query took 300 ms on another server with a co-located driver and master.
But I had to move my application and the Spark master to another server (the same setup as before, just on a different physical server), and now the query takes 40 seconds. Although the task duration is about 7 seconds, the job takes 40 seconds, and I cannot figure out what the extra delay is for.
I also tested Spark with a job that has no connection to Cassandra, and it took 200 ms, so I think the problem is related to spark-cassandra rather than to Spark itself.
Here are the Spark logs during execution of the job:
[INFO ] 2019-03-04 06:59:07.067 [qtp1151421920-470] SparkSqlParser 54 - Parsing command: select * from ...
[INFO ] 2019-03-04 06:59:07.276 [qtp1151421920-470] CassandraSourceRelation 35 - Input Predicates: ...
[INFO ] 2019-03-04 06:59:07.279 [qtp1151421920-470] ClockFactory 52 - Using native clock to generate timestamps.
[INFO ] 2019-03-04 06:59:07.439 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.201:9042 added
[INFO ] 2019-03-04 06:59:07.440 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.202:9042 added
[INFO ] 2019-03-04 06:59:07.440 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.203:9042 added
[INFO ] 2019-03-04 06:59:07.440 [qtp1151421920-470] Cluster 1543 - New Cassandra host /192.168.1.204:9042 added
[INFO ] 2019-03-04 06:59:07.446 [qtp1151421920-470] CassandraConnector 35 - Connected to Cassandra cluster: Digger Cluster
[INFO ] 2019-03-04 06:59:07.526 [qtp1151421920-470] CassandraSourceRelation 35 - Input Predicates: ...
[INFO ] 2019-03-04 06:59:07.848 [qtp1151421920-470] CodeGenerator 54 - Code generated in 120.31952 ms
[INFO ] 2019-03-04 06:59:08.264 [qtp1151421920-470] CodeGenerator 54 - Code generated in 15.084165 ms
[INFO ] 2019-03-04 06:59:08.289 [qtp1151421920-470] CodeGenerator 54 - Code generated in 17.893182 ms
[INFO ] 2019-03-04 06:59:08.379 [qtp1151421920-470] SparkContext 54 - Starting job: collectAsList at MyClass.java:5
[INFO ] 2019-03-04 06:59:08.394 [dag-scheduler-event-loop] DAGScheduler 54 - Registering RDD 12 (toJSON at MyClass.java.java:5)
[INFO ] 2019-03-04 06:59:08.397 [dag-scheduler-event-loop] DAGScheduler 54 - Got job 0 (collectAsList at MyClass.java.java:5) with 1 output partitions
[INFO ] 2019-03-04 06:59:08.398 [dag-scheduler-event-loop] DAGScheduler 54 - Final stage: ResultStage 1 (collectAsList at MyClass.java.java:5)
[INFO ] 2019-03-04 06:59:08.398 [dag-scheduler-event-loop] DAGScheduler 54 - Parents of final stage: List(ShuffleMapStage 0)
[INFO ] 2019-03-04 06:59:08.400 [dag-scheduler-event-loop] DAGScheduler 54 - Missing parents: List(ShuffleMapStage 0)
[INFO ] 2019-03-04 06:59:08.405 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[12] at toJSON at MyClass.java.java:5), which has no missing parents
[INFO ] 2019-03-04 06:59:15.703 [pool-44-thread-1] CassandraConnector 35 - Disconnected from Cassandra cluster: Digger Cluster
-----------------long delay here
[INFO ] 2019-03-04 06:59:43.547 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_0 stored as values in memory (estimated size 20.6 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:43.579 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.5 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:43.581 [dispatcher-event-loop-1] BlockManagerInfo 54 - Added broadcast_0_piece0 in memory on 192.168.1.94:38311 (size: 9.5 KB, free: 17.8 GB)
[INFO ] 2019-03-04 06:59:43.584 [dag-scheduler-event-loop] SparkContext 54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1006
[INFO ] 2019-03-04 06:59:43.597 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[12] at toJSON at MyClass.java.java:5) (first 15 tasks are for partitions Vector(0))
[INFO ] 2019-03-04 06:59:43.598 [dag-scheduler-event-loop] TaskSchedulerImpl 54 - Adding task set 0.0 with 1 tasks
[INFO ] 2019-03-04 06:59:43.619 [dag-scheduler-event-loop] FairSchedulableBuilder 54 - Added task set TaskSet_0.0 tasks to pool rest
[INFO ] 2019-03-04 06:59:43.652 [dispatcher-event-loop-35] TaskSetManager 54 - Starting task 0.0 in stage 0.0 (TID 0, 192.168.1.210, executor 11, partition 0, NODE_LOCAL, 6357 bytes)
[INFO ] 2019-03-04 06:59:43.920 [dispatcher-event-loop-36] BlockManagerInfo 54 - Added broadcast_0_piece0 in memory on 192.168.1.210:42612 (size: 9.5 KB, free: 912.3 MB)
[INFO ] 2019-03-04 06:59:46.591 [task-result-getter-0] TaskSetManager 54 - Finished task 0.0 in stage 0.0 (TID 0) in 2963 ms on 192.168.1.210 (executor 11) (1/1)
[INFO ] 2019-03-04 06:59:46.594 [task-result-getter-0] TaskSchedulerImpl 54 - Removed TaskSet 0.0, whose tasks have all completed, from pool rest
[INFO ] 2019-03-04 06:59:46.601 [dag-scheduler-event-loop] DAGScheduler 54 - ShuffleMapStage 0 (toJSON at MyClass.java.java:5) finished in 2.981 s
[INFO ] 2019-03-04 06:59:46.602 [dag-scheduler-event-loop] DAGScheduler 54 - looking for newly runnable stages
[INFO ] 2019-03-04 06:59:46.603 [dag-scheduler-event-loop] DAGScheduler 54 - running: Set()
[INFO ] 2019-03-04 06:59:46.603 [dag-scheduler-event-loop] DAGScheduler 54 - waiting: Set(ResultStage 1)
[INFO ] 2019-03-04 06:59:46.604 [dag-scheduler-event-loop] DAGScheduler 54 - failed: Set()
[INFO ] 2019-03-04 06:59:46.608 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting ResultStage 1 (MapPartitionsRDD[18] at collectAsList at MyClass.java.java:5), which has no missing parents
[INFO ] 2019-03-04 06:59:46.615 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_1 stored as values in memory (estimated size 20.8 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:46.618 [dag-scheduler-event-loop] MemoryStore 54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 9.8 KB, free 17.8 GB)
[INFO ] 2019-03-04 06:59:46.619 [dispatcher-event-loop-21] BlockManagerInfo 54 - Added broadcast_1_piece0 in memory on 192.168.1.94:38311 (size: 9.8 KB, free: 17.8 GB)
[INFO ] 2019-03-04 06:59:46.620 [dag-scheduler-event-loop] SparkContext 54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1006
[INFO ] 2019-03-04 06:59:46.622 [dag-scheduler-event-loop] DAGScheduler 54 - Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[18] at collectAsList at MyClass.java.java:5) (first 15 tasks are for partitions Vector(0))
[INFO ] 2019-03-04 06:59:46.622 [dag-scheduler-event-loop] TaskSchedulerImpl 54 - Adding task set 1.0 with 1 tasks
[INFO ] 2019-03-04 06:59:46.622 [dag-scheduler-event-loop] FairSchedulableBuilder 54 - Added task set TaskSet_1.0 tasks to pool rest
[INFO ] 2019-03-04 06:59:46.627 [dispatcher-event-loop-25] TaskSetManager 54 - Starting task 0.0 in stage 1.0 (TID 1, 192.168.1.212, executor 9, partition 0, PROCESS_LOCAL, 4730 bytes)
[INFO ] 2019-03-04 06:59:46.851 [dispatcher-event-loop-9] BlockManagerInfo 54 - Added broadcast_1_piece0 in memory on 192.168.1.212:43471 (size: 9.8 KB, free: 912.3 MB)
[INFO ] 2019-03-04 06:59:47.257 [dispatcher-event-loop-38] MapOutputTrackerMasterEndpoint 54 - Asked to send map output locations for shuffle 0 to 192.168.1.212:46794
[INFO ] 2019-03-04 06:59:47.262 [map-output-dispatcher-0] MapOutputTrackerMaster 54 - Size of output statuses for shuffle 0 is 141 bytes
[INFO ] 2019-03-04 06:59:47.763 [task-result-getter-1] TaskSetManager 54 - Finished task 0.0 in stage 1.0 (TID 1) in 1140 ms on 192.168.1.212 (executor 9) (1/1)
[INFO ] 2019-03-04 06:59:47.763 [task-result-getter-1] TaskSchedulerImpl 54 - Removed TaskSet 1.0, whose tasks have all completed, from pool rest
[INFO ] 2019-03-04 06:59:47.765 [dag-scheduler-event-loop] DAGScheduler 54 - ResultStage 1 (collectAsList at MyClass.java.java:5) finished in 1.142 s
[INFO ] 2019-03-04 06:59:47.771 [qtp1151421920-470] DAGScheduler 54 - Job 0 finished: collectAsList at MyClass.java.java:5, took 39.391066 s
[INFO ] 2019-03-04 07:00:09.014 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 4
[INFO ] 2019-03-04 07:00:09.015 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 0
[INFO ] 2019-03-04 07:00:09.015 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 3
[INFO ] 2019-03-04 07:00:09.015 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 1
[INFO ] 2019-03-04 07:00:09.028 [dispatcher-event-loop-10] BlockManagerInfo 54 - Removed broadcast_1_piece0 on 192.168.1.94:38311 in memory (size: 9.8 KB, free: 17.8 GB)
[INFO ] 2019-03-04 07:00:09.045 [dispatcher-event-loop-0] BlockManagerInfo 54 - Removed broadcast_1_piece0 on 192.168.1.212:43471 in memory (size: 9.8 KB, free: 912.3 MB)
[INFO ] 2019-03-04 07:00:09.063 [Spark Context Cleaner] ContextCleaner 54 - Cleaned shuffle 0
[INFO ] 2019-03-04 07:00:09.065 [dispatcher-event-loop-16] BlockManagerInfo 54 - Removed broadcast_0_piece0 on 192.168.1.94:38311 in memory (size: 9.5 KB, free: 17.8 GB)
[INFO ] 2019-03-04 07:00:09.071 [dispatcher-event-loop-37] BlockManagerInfo 54 - Removed broadcast_0_piece0 on 192.168.1.210:42612 in memory (size: 9.5 KB, free: 912.3 MB)
[INFO ] 2019-03-04 07:00:09.074 [Spark Context Cleaner] ContextCleaner 54 - Cleaned accumulator 2
I have also attached screenshots of the Spark web UI for the job and its tasks. The logs and the images are not from the same job.
P.S.: Does the spark-cassandra connector create a new session each time I run a query (I see a connect and a disconnect to the Cassandra cluster every time)? I run many queries in parallel; isn't that going to be much slower than pure Cassandra?
Checking with jvisualvm, the executors had no activity during the time gap, but the driver (my application) had a thread called "dag-scheduler..." that was running only during the gap. The thread dump showed that it was stuck in InetAddress.getHostName().
Then, in debug mode, I put a breakpoint there and found out that it was doing a reverse lookup (IP to hostname) for every node of my Cassandra cluster. So I added "IP HOSTNAME" entries for all of my Cassandra nodes to the end of /etc/hosts, and the problem was solved!
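To check whether reverse lookups are slow on a given machine before editing /etc/hosts, the lookup can be timed directly. This is a small Python sketch, not the code the author used: time_reverse_lookup and hosts_lines are hypothetical helpers, and any hostnames passed to hosts_lines are placeholders you would replace with your real node names.

```python
import socket
import sys
import time


def time_reverse_lookup(ip):
    """Return (hostname or None, seconds spent) for a reverse lookup of ip."""
    start = time.perf_counter()
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        hostname = None
    return hostname, time.perf_counter() - start


def hosts_lines(ip_to_name):
    """Format 'IP HOSTNAME' lines suitable for appending to /etc/hosts."""
    return [f"{ip} {name}" for ip, name in ip_to_name.items()]


if __name__ == "__main__":
    # Pass the Cassandra node IPs as arguments; defaults to localhost.
    for ip in sys.argv[1:] or ["127.0.0.1"]:
        name, elapsed = time_reverse_lookup(ip)
        print(f"{ip} -> {name} ({elapsed:.3f} s)")
```

A lookup that takes several seconds per node would be consistent with the gap seen in the driver log, since the scheduler thread waits on each lookup in turn.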

How to use Spark SQL to convert ".txt" to ".parquet" in Spark 2.1.0?

I used the "spark-shell" command to test it, following https://spark.apache.org/docs/latest/sql-programming-guide.html:
scala> case class IP(country: String) extends Serializable
17/07/05 11:20:09 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.50.3:42868 in memory (size: 33.1 KB, free: 93.3 MB)
17/07/05 11:20:09 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.50.3:40888 in memory (size: 33.1 KB, free: 93.3 MB)
17/07/05 11:20:09 INFO ContextCleaner: Cleaned accumulator 0
17/07/05 11:20:09 INFO ContextCleaner: Cleaned accumulator 1
defined class IP
scala> import spark.implicits._
import spark.implicits._
scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode
scala> val df = spark.sparkContext.textFile("/test/guchao/ip.txt").map(x => x.split("\\|", -1)).map(x => IP(x(0))).toDF()
17/07/05 11:20:36 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 216.5 KB, free 92.9 MB)
17/07/05 11:20:36 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 20.8 KB, free 92.8 MB)
17/07/05 11:20:36 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.50.3:42868 (size: 20.8 KB, free: 93.3 MB)
17/07/05 11:20:36 INFO SparkContext: Created broadcast 2 from textFile at :33
df: org.apache.spark.sql.DataFrame = [country: string]
scala> df.write.mode(SaveMode.Overwrite).save("/test/guchao/ip.parquet")
17/07/05 11:20:44 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
17/07/05 11:20:44 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
17/07/05 11:20:44 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
17/07/05 11:20:44 INFO CodeGenerator: Code generated in 88.405717 ms
17/07/05 11:20:44 INFO FileInputFormat: Total input paths to process : 1
17/07/05 11:20:44 INFO SparkContext: Starting job: save at :36
17/07/05 11:20:44 INFO DAGScheduler: Got job 1 (save at :36) with 2 output partitions
17/07/05 11:20:44 INFO DAGScheduler: Final stage: ResultStage 1 (save at :36)
17/07/05 11:20:44 INFO DAGScheduler: Parents of final stage: List()
17/07/05 11:20:44 INFO DAGScheduler: Missing parents: List()
17/07/05 11:20:44 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[12] at save at :36), which has no missing parents
17/07/05 11:20:44 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 77.3 KB, free 92.8 MB)
17/07/05 11:20:44 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 29.3 KB, free 92.7 MB)
17/07/05 11:20:44 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.50.3:42868 (size: 29.3 KB, free: 93.2 MB)
17/07/05 11:20:44 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:996
17/07/05 11:20:44 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[12] at save at :36)
17/07/05 11:20:44 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
17/07/05 11:20:44 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 192.168.50.3, executor 0, partition 0, ANY, 6027 bytes)
17/07/05 11:20:44 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.50.3:40888 (size: 29.3 KB, free: 93.3 MB)
17/07/05 11:20:45 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.50.3:40888 (size: 20.8 KB, free: 93.2 MB)
17/07/05 11:20:45 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 192.168.50.3, executor 0, partition 1, ANY, 6027 bytes)
17/07/05 11:20:45 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 679 ms on 192.168.50.3 (executor 0) (1/2)
17/07/05 11:20:46 INFO DAGScheduler: ResultStage 1 (save at :36) finished in 1.476 s
17/07/05 11:20:46 INFO DAGScheduler: Job 1 finished: save at :36, took 1.597097 s
17/07/05 11:20:46 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 804 ms on 192.168.50.3 (executor 0) (2/2)
17/07/05 11:20:46 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/07/05 11:20:46 INFO FileFormatWriter: Job null committed.
But the result is:
[root@master ~]# hdfs dfs -ls -h /test/guchao
17/07/05 11:20:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwxr-xr-x - root supergroup 0 2017-07-05 11:20 /test/guchao/ip.parquet
-rw-r--r-- 1 root supergroup 23.9 M 2017-07-05 10:05 /test/guchao/ip.txt
Why is the size of "ip.parquet" 0? I don't understand and am confused.
Thanks!
hdfs dfs -ls -h <path> shows the size of files and shows 0 for a directory.
df.write.mode(SaveMode.Overwrite).save("/test/guchao/ip.parquet")
This creates /test/guchao/ip.parquet as a directory with the part files inside it; that's why it shows a size of 0.
hadoop fs -ls /test/guchao/ip.parquet
This should show you the actual size of the output files.
If you want to get the size of the directory, you can use
hadoop fs -du -s /test/guchao/ip.parquet
Hope this helps!
/test/guchao/ip.parquet is a directory. Look inside it and you should find something like part-00000, which is the file you are looking for.
hadoop fs -ls /test/guchao/ip.parquet
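The point of both answers is that the size of the output is the sum of the part files inside the directory, not a property of the directory entry. As a local-filesystem analogy only (it does not talk to HDFS), here is a short Python sketch of that summation; directory_size is a hypothetical helper, and the part-file names and sizes in the demo are made up for illustration.

```python
import os
import tempfile


def directory_size(path):
    """Sum the sizes, in bytes, of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = os.path.join(tmp, "ip.parquet")
        os.mkdir(out_dir)
        # Hypothetical part files, for illustration only.
        for name, size in [("part-00000", 10), ("part-00001", 20)]:
            with open(os.path.join(out_dir, name), "wb") as f:
                f.write(b"\0" * size)
        print(directory_size(out_dir))
```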

I am trying to understand the log generated by a Spark program

I am trying to understand the log output generated by the simple program below. I need help understanding each step; a reference to a write-up that explains it would also be fine.
Command
sc.parallelize(Array(("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 1), ("b", 1), ("b", 1)), 3).map(a=> a).reduceByKey(_ + _ ).collect()
Output :
16/12/08 23:41:57 INFO spark.SparkContext: Starting job: collect at <console>:28
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Registering RDD 1 (map at <console>:28)
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:28) with 3 output partitions
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at <console>:28)
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[1] at map at <console>:28), which has no missing parents
16/12/08 23:41:57 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.6 KB, free 2.6 KB)
16/12/08 23:41:57 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1588.0 B, free 4.2 KB)
16/12/08 23:41:57 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.17.0.6:31122 (size: 1588.0 B, free: 511.5 MB)
16/12/08 23:41:57 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[1] at map at <console>:28)
16/12/08 23:41:57 INFO cluster.YarnScheduler: Adding task set 0.0 with 3 tasks
16/12/08 23:41:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 34b943b3f6ea, partition 0,PROCESS_LOCAL, 2183 bytes)
16/12/08 23:41:57 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 34b943b3f6ea, partition 1,PROCESS_LOCAL, 2199 bytes)
16/12/08 23:41:57 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 34b943b3f6ea:28772 (size: 1588.0 B, free: 511.5 MB)
16/12/08 23:41:57 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 34b943b3f6ea:39570 (size: 1588.0 B, free: 511.5 MB)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 34b943b3f6ea, partition 2,PROCESS_LOCAL, 2200 bytes)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 740 ms on 34b943b3f6ea (1/3)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 778 ms on 34b943b3f6ea (2/3)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 66 ms on 34b943b3f6ea (3/3)
16/12/08 23:41:58 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (map at <console>:28) finished in 0.792 s
16/12/08 23:41:58 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/12/08 23:41:58 INFO scheduler.DAGScheduler: running: Set()
16/12/08 23:41:58 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
16/12/08 23:41:58 INFO scheduler.DAGScheduler: failed: Set()
16/12/08 23:41:58 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[2] at reduceByKey at <console>:28), which has no missing parents
16/12/08 23:41:58 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/08 23:41:58 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.6 KB, free 6.7 KB)
16/12/08 23:41:58 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1589.0 B, free 8.3 KB)
16/12/08 23:41:58 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.17.0.6:31122 (size: 1589.0 B, free: 511.5 MB)
16/12/08 23:41:58 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/12/08 23:41:58 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (ShuffledRDD[2] at reduceByKey at <console>:28)
16/12/08 23:41:58 INFO cluster.YarnScheduler: Adding task set 1.0 with 3 tasks
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 34b943b3f6ea, partition 1,NODE_LOCAL, 1894 bytes)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 4, 34b943b3f6ea, partition 2,NODE_LOCAL, 1894 bytes)
16/12/08 23:41:58 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 34b943b3f6ea:39570 (size: 1589.0 B, free: 511.5 MB)
16/12/08 23:41:58 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 34b943b3f6ea:28772 (size: 1589.0 B, free: 511.5 MB)
16/12/08 23:41:58 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 34b943b3f6ea:60986
16/12/08 23:41:58 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 163 bytes
16/12/08 23:41:58 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 34b943b3f6ea:60984
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 5, 34b943b3f6ea, partition 0,PROCESS_LOCAL, 1894 bytes)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 4) in 331 ms on 34b943b3f6ea (1/3)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 351 ms on 34b943b3f6ea (2/3)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 5) in 29 ms on 34b943b3f6ea (3/3)
16/12/08 23:41:58 INFO scheduler.DAGScheduler: ResultStage 1 (collect at <console>:28) finished in 0.359 s
16/12/08 23:41:58 INFO cluster.YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/12/08 23:41:58 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:28, took 1.381102 s
res14: Array[(String, Int)] = Array((a,3), (b,5))
As you can see, the processing started at collect(). This is the lazy evaluation that Spark uses: map and reduceByKey are transformations, so even though you called them first, the job only kicked off at collect(), which is an action.
You can see 3 partitions, each with its own task, since you initialized the RDD with 3 partitions.
Another point is how map and reduceByKey each handled data locality. All three tasks in the map stage have PROCESS_LOCAL. reduceByKey needs a data shuffle, so you might have both PROCESS_LOCAL and NODE_LOCAL.
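The two stages in the log (ShuffleMapStage 0 with 3 tasks, then ResultStage 1 with 3 tasks) can be mimicked in plain Python. This is only a rough model under simplifying assumptions, not how Spark is implemented: the slicing in split and the ord-based routing that stands in for Spark's hash partitioner are both illustrative choices, and all function names are hypothetical.

```python
from collections import defaultdict

# The same eight pairs as in the command above.
pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1),
         ("b", 1), ("b", 1), ("b", 1), ("b", 1)]


def split(data, num_partitions):
    """Cut data into num_partitions contiguous slices (illustrative only)."""
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]


def map_side_combine(partition):
    """Stage 0: one task per partition pre-aggregates its own pairs by key."""
    sums = defaultdict(int)
    for key, value in partition:
        sums[key] += value
    return dict(sums)


def shuffle_and_reduce(combined, num_partitions):
    """Stage 1: route each key to one reduce partition, then sum per key."""
    buckets = [defaultdict(int) for _ in range(num_partitions)]
    for partial in combined:
        for key, value in partial.items():
            # Stand-in for a hash partitioner: the same key always lands
            # in the same bucket.
            buckets[ord(key[0]) % num_partitions][key] += value
    return [dict(bucket) for bucket in buckets]


partitions = split(pairs, 3)                           # 3 input partitions
combined = [map_side_combine(p) for p in partitions]   # 3 map-side tasks
reduced = shuffle_and_reduce(combined, 3)              # 3 reduce-side tasks
result = sorted(kv for bucket in reduced for kv in bucket.items())
print(result)  # [('a', 3), ('b', 5)]
```

The result matches the res14 line in the log. The shuffle boundary between the two functions is what makes Spark split the job into two stages.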

Issues with reading external hive partitioned table using spark hivecontext

I have an external, partitioned Hive table that I'm trying to read from Spark using HiveContext, but I'm getting null values.
val maxClose = hiveContext.sql("select max(Close) from stock_partitioned_data where symbol = 'AAPL'");
maxClose.collect().foreach (println )
=====
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala> val hiveContext = new HiveContext(sc);
16/09/22 00:12:47 INFO HiveContext: Initializing execution hive, version 1.1.0
16/09/22 00:12:47 INFO ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.0
16/09/22 00:12:47 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.0
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@455aef06
scala> val maxClose = hiveContext.sql("select max(Close) from stock_data2")
16/09/22 00:12:53 INFO ParseDriver: Parsing command: select max(Close) from stock_data2
16/09/22 00:12:54 INFO ParseDriver: Parse Completed
16/09/22 00:12:54 INFO ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.0
16/09/22 00:12:54 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.0
maxClose: org.apache.spark.sql.DataFrame = [_c0: double]
scala> maxClose.collect().foreach (println )
16/09/22 00:13:04 INFO deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/09/22 00:13:04 INFO MemoryStore: ensureFreeSpace(425824) called with curMem=0, maxMem=556038881
16/09/22 00:13:04 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 415.8 KB, free 529.9 MB)
16/09/22 00:13:05 INFO MemoryStore: ensureFreeSpace(44793) called with curMem=425824, maxMem=556038881
16/09/22 00:13:05 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 43.7 KB, free 529.8 MB)
16/09/22 00:13:05 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.2.15:47553 (size: 43.7 KB, free: 530.2 MB)
16/09/22 00:13:05 INFO SparkContext: Created broadcast 0 from collect at <console>:27
16/09/22 00:13:05 INFO SparkContext: Starting job: collect at <console>:27
16/09/22 00:13:06 INFO FileInputFormat: Total input paths to process : 1
16/09/22 00:13:06 INFO DAGScheduler: Registering RDD 5 (collect at <console>:27)
16/09/22 00:13:06 INFO DAGScheduler: Got job 0 (collect at <console>:27) with 1 output partitions
16/09/22 00:13:06 INFO DAGScheduler: Final stage: ResultStage 1(collect at <console>:27)
16/09/22 00:13:06 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/09/22 00:13:06 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/09/22 00:13:06 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[5] at collect at <console>:27), which has no missing parents
16/09/22 00:13:06 INFO MemoryStore: ensureFreeSpace(18880) called with curMem=470617, maxMem=556038881
16/09/22 00:13:06 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 18.4 KB, free 529.8 MB)
16/09/22 00:13:06 INFO MemoryStore: ensureFreeSpace(8367) called with curMem=489497, maxMem=556038881
16/09/22 00:13:06 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 8.2 KB, free 529.8 MB)
16/09/22 00:13:06 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.0.2.15:47553 (size: 8.2 KB, free: 530.2 MB)
16/09/22 00:13:06 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
16/09/22 00:13:06 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[5] at collect at <console>:27)
16/09/22 00:13:06 INFO YarnScheduler: Adding task set 0.0 with 2 tasks
16/09/22 00:13:07 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
16/09/22 00:13:08 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 2)
16/09/22 00:13:11 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver@10.0.2.15:45637] <- [akka.tcp://driverPropsFetcher@quickstart.cloudera:33635]: Error [Shut down address: akka.tcp://driverPropsFetcher@quickstart.cloudera:33635] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://driverPropsFetcher@quickstart.cloudera:33635
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
akka.event.Logging$Error$NoCause$
16/09/22 00:13:12 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@quickstart.cloudera:49490/user/Executor#-842589632]) with ID 1
16/09/22 00:13:12 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 1)
16/09/22 00:13:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, quickstart.cloudera, partition 0,NODE_LOCAL, 2291 bytes)
16/09/22 00:13:13 INFO BlockManagerMasterEndpoint: Registering block manager quickstart.cloudera:56958 with 530.3 MB RAM, BlockManagerId(1, quickstart.cloudera, 56958)
16/09/22 00:13:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on quickstart.cloudera:56958 (size: 8.2 KB, free: 530.3 MB)
16/09/22 00:13:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on quickstart.cloudera:56958 (size: 43.7 KB, free: 530.2 MB)
16/09/22 00:13:31 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, quickstart.cloudera, partition 1,NODE_LOCAL, 2291 bytes)
16/09/22 00:13:31 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 18583 ms on quickstart.cloudera (1/2)
16/09/22 00:13:31 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 157 ms on quickstart.cloudera (2/2)
16/09/22 00:13:31 INFO DAGScheduler: ShuffleMapStage 0 (collect at <console>:27) finished in 25.082 s
16/09/22 00:13:31 INFO DAGScheduler: looking for newly runnable stages
16/09/22 00:13:31 INFO DAGScheduler: running: Set()
16/09/22 00:13:31 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/09/22 00:13:31 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/09/22 00:13:31 INFO DAGScheduler: failed: Set()
16/09/22 00:13:31 INFO DAGScheduler: Missing parents for ResultStage 1: List()
16/09/22 00:13:31 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[8] at collect at <console>:27), which is now runnable
16/09/22 00:13:31 INFO MemoryStore: ensureFreeSpace(16544) called with curMem=497864, maxMem=556038881
16/09/22 00:13:31 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 16.2 KB, free 529.8 MB)
16/09/22 00:13:31 INFO MemoryStore: ensureFreeSpace(7375) called with curMem=514408, maxMem=556038881
16/09/22 00:13:31 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.2 KB, free 529.8 MB)
16/09/22 00:13:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.0.2.15:47553 (size: 7.2 KB, free: 530.2 MB)
16/09/22 00:13:31 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:861
16/09/22 00:13:31 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[8] at collect at <console>:27)
16/09/22 00:13:31 INFO YarnScheduler: Adding task set 1.0 with 1 tasks
16/09/22 00:13:31 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, quickstart.cloudera, partition 0,PROCESS_LOCAL, 1914 bytes)
16/09/22 00:13:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on quickstart.cloudera:56958 (size: 7.2 KB, free: 530.2 MB)
16/09/22 00:13:31 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to quickstart.cloudera:49490
16/09/22 00:13:31 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 157 bytes
16/09/22 00:13:31 INFO DAGScheduler: ResultStage 1 (collect at <console>:27) finished in 0.245 s
16/09/22 00:13:31 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 245 ms on quickstart.cloudera (1/1)
16/09/22 00:13:31 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/09/22 00:13:31 INFO DAGScheduler: Job 0 finished: collect at <console>:27, took 26.194947 s
[null]
===
But if I run the same query directly from the Hive console, I get the result.
hive> select max(Close) from stock_data2
> ;
Query ID = cloudera_20160922001414_4b684522-3e42-4957-8260-ff6b4da67c8f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1474445009419_0005, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1474445009419_0005/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1474445009419_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-22 00:14:45,000 Stage-1 map = 0%, reduce = 0%
2016-09-22 00:14:55,165 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2016-09-22 00:15:03,707 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.68 sec
MapReduce Total cumulative CPU time: 2 seconds 680 msec
Ended Job = job_1474445009419_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.68 sec HDFS Read: 43379 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
52.369999
Time taken: 42.57 seconds, Fetched: 1 row(s)
In Spark, count(*) returns the right result, but selecting a column value or its max returns null.
This problem has been resolved in Spark version 1.6.

PySpark join two RDD results in an empty RDD

I'm a Spark newbie trying to adapt this movie recommendation tutorial (https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html) to my own dataset, but it keeps throwing this error:
ValueError: Can not reduce() empty RDD
This is the function that computes the Root Mean Squared Error of the model:
from math import sqrt
from operator import add

def computeRmse(model, data, n):
    """
    Compute RMSE (Root Mean Squared Error).
    """
    predictions = model.predictAll(data.map(lambda x: (x[0], x[1])))
    print predictions.count()
    print predictions.first()
    print "predictions above"
    print data.count()
    print data.first()
    print "validation data above"
    # LINE56: the join in the statement below
    predictionsAndRatings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
        .join(data.map(lambda line: line.split(',')).map(lambda x: ((x[0], x[1]), x[2]))) \
        .values()
    print predictionsAndRatings.count()
    print "predictions And Ratings above"
    # LINE63: the return statement below
    return sqrt(predictionsAndRatings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))
The model is trained with model = ALS.train(training, rank, numIter, lambda), and data is the validation data set.
The training and validation sets both come from a ratings.txt file whose lines have the format: userID,productID,rating,ratingopID
These are parts of the output:
879
...
Rating(user=0, product=656, rating=4.122132631144641)
predictions above
...
1164
...
(u'640085', u'1590', u'5')
validation data above
...
16/08/26 12:47:18 INFO DAGScheduler: Registering RDD 259 (join at /path/myapp/MyappALS.py:56)
16/08/26 12:47:18 INFO DAGScheduler: Got job 20 (count at /path/myapp/MyappALS.py:59) with 12 output partitions
16/08/26 12:47:18 INFO DAGScheduler: Final stage: ResultStage 238 (count at /path/myapp/MyappALS.py:59)
16/08/26 12:47:18 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 237)
16/08/26 12:47:18 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 237)
16/08/26 12:47:18 INFO DAGScheduler: Submitting ShuffleMapStage 237 (PairwiseRDD[259] at join at /path/myapp/MyappALS.py:56), which has no missing parents
....
0
predictions And Ratings above
...
Traceback (most recent call last):
File "/path/myapp/MyappALS.py", line 130, in <module>
validationRmse = computeRmse(model, validation, numValidation)
File "/path/myapp/MyappALS.py", line 63, in computeRmse
return sqrt(predictionsAndRatings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))
File "/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 805, in reduce
ValueError: Can not reduce() empty RDD
So from the count() calls I'm sure the initial RDDs are not empty.
Also, does the INFO log line "Registering RDD 259 (join at /path/myapp/MyappALS.py:56)" mean that the join job was launched?
Is there something I'm missing?
Thank you.
That error disappeared when I added int() conversions to the join:
predictionsAndRatings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
    .join(data.map(lambda x: ((int(x[0]), int(x[1])), int(x[2])))) \
    .values()
We think this is because the predictions come from predictAll, which returns Rating objects with integer user and product IDs, while the validation data was parsed manually and its fields were still strings, so the join keys never matched.
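To see why the key types matter, here is a small sketch in plain Python (no Spark needed). The inner_join helper is a hypothetical stand-in for what an RDD join does with (key, value) pairs, and the 4.5 prediction value is made up for illustration. The IDs are taken from the output shown in the question.

```python
# Hypothetical stand-in for an RDD inner join; plain Python, no Spark required.
def inner_join(left, right):
    """Pair up values whose keys compare equal, like an inner join on (key, value) pairs."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

# Prediction keys are integers, as in Rating(user=0, product=656, ...).
# The rating value 4.5 is invented for this example.
predictions = [((640085, 1590), 4.5)]

# Validation rows parsed from text keep string fields, as in (u'640085', u'1590', u'5').
validation_rows = [("640085", "1590", "5")]

# Same rows keyed two ways: raw strings versus values converted with int().
string_keyed = [((u, p), r) for u, p, r in validation_rows]
int_keyed = [((int(u), int(p)), int(r)) for u, p, r in validation_rows]

no_match = inner_join(predictions, string_keyed)
matched = inner_join(predictions, int_keyed)

print(len(no_match))  # 0
print(matched)        # [((640085, 1590), (4.5, 5))]
```

The string-keyed join comes back empty because ("640085", "1590") never compares equal to (640085, 1590), which is the same situation that left predictionsAndRatings with a count of 0 and made reduce() fail.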

Resources