Here is how I configured the Hadoop and Java environment variables. I installed Hadoop, but when I execute the command sudo -u hdfs hdfs dfsadmin -safemode leave I get hdfs: command not found. I have already uninstalled and reinstalled, but the problem has not been solved. I have attached the output of the command hdfs namenode -format, with which I reformatted the namenode.
[root@MASTER ~]# sudo -u hdfs hdfs dfsadmin -safemode leave
sudo: hdfs: command not found
# JAVA VARIABLES
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME:$JAVA_HOME/bin
# HADOOP VARIABLES
export HADOOP_HOME=/usr/local/hadoop-2.7.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
#export HADOOP_USER_NAME=$user
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
[root@MASTER ~]# hdfs namenode -format
21/06/22 20:20:02 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: user = root
STARTUP_MSG: host = MASTER/192.168.1.5
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.8.1
STARTUP_MSG: java = 1.8.0_291
************************************************************/
21/06/22 20:20:02 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
21/06/22 20:20:03 INFO namenode.NameNode: createNameNode [-format]
21/06/22 20:20:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-1b107e40-11f9-4e20-8a4b-c966088e6eb9
21/06/22 20:20:06 INFO namenode.FSEditLog: Edit logging is async:false
21/06/22 20:20:06 INFO namenode.FSNamesystem: KeyProvider: null
21/06/22 20:20:06 INFO namenode.FSNamesystem: fsLock is fair: true
21/06/22 20:20:06 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
21/06/22 20:20:06 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
21/06/22 20:20:06 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
21/06/22 20:20:06 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
21/06/22 20:20:06 INFO blockmanagement.BlockManager: The block deletion will start around 2021 Jun 22 20:20:06
21/06/22 20:20:06 INFO util.GSet: Computing capacity for map BlocksMap
21/06/22 20:20:06 INFO util.GSet: VM type = 64-bit
21/06/22 20:20:06 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
21/06/22 20:20:06 INFO util.GSet: capacity = 2^21 = 2097152 entries
21/06/22 20:20:06 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
21/06/22 20:20:06 INFO blockmanagement.BlockManager: defaultReplication = 1
21/06/22 20:20:06 INFO blockmanagement.BlockManager: maxReplication = 512
21/06/22 20:20:06 INFO blockmanagement.BlockManager: minReplication = 1
21/06/22 20:20:06 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
21/06/22 20:20:06 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
21/06/22 20:20:06 INFO blockmanagement.BlockManager: encryptDataTransfer = false
21/06/22 20:20:06 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
21/06/22 20:20:06 INFO namenode.FSNamesystem: fsOwner = root (auth:SIMPLE)
21/06/22 20:20:06 INFO namenode.FSNamesystem: supergroup = supergroup
21/06/22 20:20:06 INFO namenode.FSNamesystem: isPermissionEnabled = true
21/06/22 20:20:06 INFO namenode.FSNamesystem: HA Enabled: false
21/06/22 20:20:06 INFO namenode.FSNamesystem: Append Enabled: true
21/06/22 20:20:06 INFO util.GSet: Computing capacity for map INodeMap
21/06/22 20:20:06 INFO util.GSet: VM type = 64-bit
21/06/22 20:20:06 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
21/06/22 20:20:06 INFO util.GSet: capacity = 2^20 = 1048576 entries
21/06/22 20:20:06 INFO namenode.FSDirectory: ACLs enabled? false
21/06/22 20:20:06 INFO namenode.FSDirectory: XAttrs enabled? true
21/06/22 20:20:06 INFO namenode.NameNode: Caching file names occurring more than 10 times
21/06/22 20:20:06 INFO util.GSet: Computing capacity for map cachedBlocks
21/06/22 20:20:06 INFO util.GSet: VM type = 64-bit
21/06/22 20:20:06 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
21/06/22 20:20:06 INFO util.GSet: capacity = 2^18 = 262144 entries
21/06/22 20:20:06 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
21/06/22 20:20:06 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
21/06/22 20:20:06 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
21/06/22 20:20:06 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
21/06/22 20:20:06 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
21/06/22 20:20:06 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
21/06/22 20:20:06 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
21/06/22 20:20:06 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
21/06/22 20:20:06 INFO util.GSet: Computing capacity for map NameNodeRetryCache
21/06/22 20:20:06 INFO util.GSet: VM type = 64-bit
21/06/22 20:20:06 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
21/06/22 20:20:06 INFO util.GSet: capacity = 2^15 = 32768 entries
Re-format filesystem in Storage Directory /hadoop/hdfs/namenode ? (Y or N) y
21/06/22 20:20:11 INFO namenode.FSImage: Allocated new BlockPoolId: BP-963069543-192.168.1.5-1624393211127
21/06/22 20:20:11 INFO common.Storage: Storage directory /hadoop/hdfs/namenode has been successfully formatted.
21/06/22 20:20:11 INFO namenode.FSImageFormatProtobuf: Saving image file /hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
21/06/22 20:20:11 INFO namenode.FSImageFormatProtobuf: Image file /hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 321 bytes saved in 0 seconds.
21/06/22 20:20:11 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
21/06/22 20:20:11 INFO util.ExitUtil: Exiting with status 0
21/06/22 20:20:11 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at MASTER/192.168.1.5
************************************************************/
hdfs is not a command anymore; use:
hadoop dfsadmin
Correct usage is:
hdfs dfsadmin -safemode leave
Result: Safe mode is OFF
For more information: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode
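A note in case it helps: if hdfs resolves in your own root shell (as in the hdfs namenode -format run above) but fails only under sudo, it is usually a PATH issue rather than a broken install. On many distributions sudo replaces PATH with the secure_path configured in /etc/sudoers, so the exports shown above are not visible to the hdfs user. A minimal workaround sketch, assuming the HADOOP_HOME from the variables above, is to call the binary by its full path:
sudo -u hdfs /usr/local/hadoop-2.7.1/bin/hdfs dfsadmin -safemode leave
Alternatively, add $HADOOP_HOME/bin to the secure_path entry in /etc/sudoers.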
While running one of my Spark stages, I see the unexpected log message below on the executors, and the stage stops making progress.
Could anyone tell me what is happening when this message appears?
And is there any limitation on how high this setting (spark.sql.objectHashAggregate.sortBased.fallbackThreshold) can be set?
# encountered message on executor
21/04/29 07:25:58 INFO ObjectAggregationIterator: Aggregation hash map size 128 reaches threshold capacity (128 entries), spilling and falling back to sort based aggregation. You may change the threshold by adjust option spark.sql.objectHashAggregate.sortBased.fallbackThreshold
21/04/29 07:26:51 INFO PythonUDFRunner: Times: total = 87019, boot = -361, init = 380, finish = 87000
21/04/29 07:26:51 INFO MemoryStore: Block rdd_36_1765 stored as values in memory (estimated size 19.6 MB, free 5.2 GB)
21/04/29 07:26:53 INFO PythonRunner: Times: total = 2154, boot = 6, init = 1, finish = 2147
21/04/29 07:26:53 INFO Executor: Finished task 1765.0 in stage 6.0 (TID 11172). 5310 bytes result sent to driver
21/04/29 07:27:33 INFO PythonUDFRunner: Times: total = 93086, boot = -461, init = 480, finish = 93067
21/04/29 07:27:33 INFO MemoryStore: Block rdd_36_1792 stored as values in memory (estimated size 19.7 MB, free 5.2 GB)
21/04/29 07:27:35 INFO PythonRunner: Times: total = 1999, boot = -40047, init = 40051, finish = 1995
21/04/29 07:27:35 INFO Executor: Finished task 1792.0 in stage 6.0 (TID 11199). 5267 bytes result sent to driver
21/04/29 07:27:35 INFO PythonUDFRunner: Times: total = 97305, boot = -313, init = 350, finish = 97268
21/04/29 07:27:35 INFO MemoryStore: Block rdd_36_1789 stored as values in memory (estimated size 19.7 MB, free 5.3 GB)
21/04/29 07:27:37 INFO PythonRunner: Times: total = 1928, boot = -2217, init = 2220, finish = 1925
21/04/29 07:27:37 INFO Executor: Finished task 1789.0 in stage 6.0 (TID 11196). 5310 bytes result sent to driver
# about Spark Stage I did
#
# given dataframe is
# (uid is given by monotonically_increasing_id)
# |-------|-------|
# | uids | score |
# |-------|-------|
# |[1,2,3]| 50 |
# |[1,2] | 70 |
# |[1] | 90 |
#
# expected result
# |-------|-------|
# | uid | score |
# |-------|-------|
# | 1 | 90 |
# | 2 | 70 |
# | 3 | 50 |
from pyspark.sql import functions as F

rdd = df.select(F.explode('uids').alias('uid'), 'score') \
    .rdd.map(lambda x: (x['uid'], x)) \
    .reduceByKey(func=max, numPartitions=1800) \
    .cache()
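For reference, the threshold in this job is the 128 entries shown in the message, and it can be raised when the job is submitted. A minimal, hedged sketch (the value 10000 is only an illustration; raising it keeps the hash-based aggregation longer at the cost of more executor memory):
spark-submit --conf spark.sql.objectHashAggregate.sortBased.fallbackThreshold=10000 ...
The log line itself is an INFO-level notice, not an error: once the in-memory hash map for the object aggregation reaches the threshold, Spark spills it and continues with sort-based aggregation, which is slower but produces the same result.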
When writing to Parquet with partitionBy, the job takes much longer. Analyzing the logs, I found that Spark is listing files in the directory, and during that listing I observed the behavior below: the job appears to be idle for more than an hour and then starts again.
20/01/30 07:33:09 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/01/30 07:33:09 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 1 blocks
20/01/30 07:33:09 INFO Executor: Finished task 195.0 in stage 241.0 (TID 15820). 18200 bytes result sent to driver
20/01/30 07:33:09 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/01/30 07:33:09 INFO Executor: Finished task 198.0 in stage 241.0 (TID 15823). 18200 bytes result sent to driver
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedDirectMemory, value=50331648
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedHeapMemory, value=50331648
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedDirectMemory, value=50331648
20/01/30 07:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedHeapMemory, value=50331648
and again
20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.compilationTime, count=484, min=2, max=622, mean=16.558694661661132, stddev=13.859676272407238, median=12.0, p75=20.0, p95=47.0, p98=62.0, p99=64.0, p999=70.0
20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.generatedClassSize, count=990, min=546, max=97043, mean=2058.574386565769, stddev=2153.50835266105, median=1374.0, p75=2693.0, p95=5009.0, p98=11509.0, p99=11519.0, p999=11519.0
20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.generatedMethodSize, count=4854, min=1, max=1574, mean=95.19245880884911, stddev=158.289763457333, median=39.0, p75=142.0, p95=339.0, p98=618.0, p99=873.0, p999=1234.0
20/01/30 07:55:22 INFO metrics: type=HISTOGRAM, name=application_1577238363313_38955.3.CodeGenerator.sourceCodeSize, count=484, min=430, max=467509, mean=4743.632894656119, stddev=5893.941708479697, median=2346.0, p75=4946.0, p95=24887.0, p98=24890.0, p99=24890.0, p999=24890.0
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedDirectMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedHeapMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedDirectMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedHeapMemory, value=50331648
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.executor.filesystem.file.largeRead_ops, value=0
20/01/30 08:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.executor.filesystem.file.read_bytes, value=0
and again
20/01/30 08:55:28 INFO TaskMemoryManager: Memory used in task 15249
20/01/30 08:55:28 INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@3cadc5a3: 65.0 MB
20/01/30 08:55:28 INFO TaskMemoryManager: Acquired by HybridRowQueue(org.apache.spark.memory.TaskMemoryManager@7c64db53,/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1577238363313_38955/spark-487c8d3d-391c-47b3-9a1b-d816d9505f5c,11,org.apache.spark.serializer.SerializerManager@55a990cc): 4.2 GB
20/01/30 08:55:28 INFO TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@785b4080: 65.0 MB
20/01/30 08:55:28 INFO TaskMemoryManager: 0 bytes of memory were used by task 15249 but are not associated with specific consumers
20/01/30 08:55:28 INFO TaskMemoryManager: 4643196305 bytes of memory are used for execution and 608596591 bytes of memory are used for storage
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedDirectMemory, value=50331648
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-client.usedHeapMemory, value=50331648
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedDirectMemory, value=50331648
20/01/30 09:55:22 INFO metrics: type=GAUGE, name=application_1577238363313_38955.3.NettyBlockTransfer.shuffle-server.usedHeapMemory, value=50331648
Right now it takes around 3 hours to complete the job. Are there any ways to improve the performance?
I noticed the same behaviour when I was writing a dataframe to HDFS using the partitionBy method. Later I found that I should apply in-memory partitioning (repartition) before disk partitioning (partitionBy): that way the rows for each output partition end up in the same task, so each task writes into a single partition directory instead of opening files in all of them.
So first repartition your dataframe on the same columns that you want to use in partitionBy, like below:
df2 = df1.repartition($"year", $"month", $"day")
df2.repartition(3).write.mode("overwrite").partitionBy("year", "month", "day").save("path to hdfs")
I used the spark-shell command to test this (see https://spark.apache.org/docs/latest/sql-programming-guide.html):
scala> case class IP(country: String) extends Serializable
17/07/05 11:20:09 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.50.3:42868 in memory (size: 33.1 KB, free: 93.3 MB)
17/07/05 11:20:09 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.50.3:40888 in memory (size: 33.1 KB, free: 93.3 MB)
17/07/05 11:20:09 INFO ContextCleaner: Cleaned accumulator 0
17/07/05 11:20:09 INFO ContextCleaner: Cleaned accumulator 1
defined class IP
scala> import spark.implicits._
import spark.implicits._
scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode
scala> val df = spark.sparkContext.textFile("/test/guchao/ip.txt").map(x => x.split("\\|", -1)).map(x => IP(x(0))).toDF()
17/07/05 11:20:36 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 216.5 KB, free 92.9 MB)
17/07/05 11:20:36 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 20.8 KB, free 92.8 MB)
17/07/05 11:20:36 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.50.3:42868 (size: 20.8 KB, free: 93.3 MB)
17/07/05 11:20:36 INFO SparkContext: Created broadcast 2 from textFile at <console>:33
df: org.apache.spark.sql.DataFrame = [country: string]
scala> df.write.mode(SaveMode.Overwrite).save("/test/guchao/ip.parquet")
17/07/05 11:20:44 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
17/07/05 11:20:44 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
17/07/05 11:20:44 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
17/07/05 11:20:44 INFO CodeGenerator: Code generated in 88.405717 ms
17/07/05 11:20:44 INFO FileInputFormat: Total input paths to process : 1
17/07/05 11:20:44 INFO SparkContext: Starting job: save at <console>:36
17/07/05 11:20:44 INFO DAGScheduler: Got job 1 (save at <console>:36) with 2 output partitions
17/07/05 11:20:44 INFO DAGScheduler: Final stage: ResultStage 1 (save at <console>:36)
17/07/05 11:20:44 INFO DAGScheduler: Parents of final stage: List()
17/07/05 11:20:44 INFO DAGScheduler: Missing parents: List()
17/07/05 11:20:44 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[12] at save at <console>:36), which has no missing parents
17/07/05 11:20:44 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 77.3 KB, free 92.8 MB)
17/07/05 11:20:44 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 29.3 KB, free 92.7 MB)
17/07/05 11:20:44 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.50.3:42868 (size: 29.3 KB, free: 93.2 MB)
17/07/05 11:20:44 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:996
17/07/05 11:20:44 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[12] at save at <console>:36)
17/07/05 11:20:44 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
17/07/05 11:20:44 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 192.168.50.3, executor 0, partition 0, ANY, 6027 bytes)
17/07/05 11:20:44 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.50.3:40888 (size: 29.3 KB, free: 93.3 MB)
17/07/05 11:20:45 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.50.3:40888 (size: 20.8 KB, free: 93.2 MB)
17/07/05 11:20:45 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 192.168.50.3, executor 0, partition 1, ANY, 6027 bytes)
17/07/05 11:20:45 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 679 ms on 192.168.50.3 (executor 0) (1/2)
17/07/05 11:20:46 INFO DAGScheduler: ResultStage 1 (save at <console>:36) finished in 1.476 s
17/07/05 11:20:46 INFO DAGScheduler: Job 1 finished: save at <console>:36, took 1.597097 s
17/07/05 11:20:46 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 804 ms on 192.168.50.3 (executor 0) (2/2)
17/07/05 11:20:46 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/07/05 11:20:46 INFO FileFormatWriter: Job null committed.
but the result is:
[root@master ~]# hdfs dfs -ls -h /test/guchao
17/07/05 11:20:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwxr-xr-x - root supergroup 0 2017-07-05 11:20 /test/guchao/ip.parquet
-rw-r--r-- 1 root supergroup 23.9 M 2017-07-05 10:05 /test/guchao/ip.txt
Why is the size of "ip.parquet" shown as 0? I don't understand and am confused.
Thanks!
hdfs dfs -ls -h <path> shows the size of files and shows 0 for the directory.
df.write.mode(SaveMode.Overwrite).save("/test/guchao/ip.parquet")
This creates the directory /test/guchao/ip.parquet, which has the part files inside it; that's why it shows 0 size.
hadoop fs -ls /test/guchao/ip.parquet
This should show you the actual sizes of the output files.
If you want to get the size of the directory, you can use:
hadoop fs -du -s /test/guchao/ip.parquet
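Both -ls and -du also accept -h for human-readable sizes, for example:
hadoop fs -du -s -h /test/guchao/ip.parquet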
Hope this helps!
/test/guchao/ip.parquet is a directory, get into the directory and you should find something like part-00000 which will be the file you are looking for.
hadoop fs -ls /test/guchao/ip.parquet
I am trying to understand the log output generated by the simple program below. I need help understanding each step; a reference to such a write-up would also be fine.
Command
sc.parallelize(Array(("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1), ("b", 1), ("b", 1), ("b", 1)), 3).map(a=> a).reduceByKey(_ + _ ).collect()
Output:
16/12/08 23:41:57 INFO spark.SparkContext: Starting job: collect at <console>:28
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Registering RDD 1 (map at <console>:28)
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:28) with 3 output partitions
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at <console>:28)
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[1] at map at <console>:28), which has no missing parents
16/12/08 23:41:57 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.6 KB, free 2.6 KB)
16/12/08 23:41:57 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1588.0 B, free 4.2 KB)
16/12/08 23:41:57 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.17.0.6:31122 (size: 1588.0 B, free: 511.5 MB)
16/12/08 23:41:57 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/12/08 23:41:57 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[1] at map at <console>:28)
16/12/08 23:41:57 INFO cluster.YarnScheduler: Adding task set 0.0 with 3 tasks
16/12/08 23:41:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 34b943b3f6ea, partition 0,PROCESS_LOCAL, 2183 bytes)
16/12/08 23:41:57 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 34b943b3f6ea, partition 1,PROCESS_LOCAL, 2199 bytes)
16/12/08 23:41:57 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 34b943b3f6ea:28772 (size: 1588.0 B, free: 511.5 MB)
16/12/08 23:41:57 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 34b943b3f6ea:39570 (size: 1588.0 B, free: 511.5 MB)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 34b943b3f6ea, partition 2,PROCESS_LOCAL, 2200 bytes)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 740 ms on 34b943b3f6ea (1/3)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 778 ms on 34b943b3f6ea (2/3)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 66 ms on 34b943b3f6ea (3/3)
16/12/08 23:41:58 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (map at <console>:28) finished in 0.792 s
16/12/08 23:41:58 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/12/08 23:41:58 INFO scheduler.DAGScheduler: running: Set()
16/12/08 23:41:58 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
16/12/08 23:41:58 INFO scheduler.DAGScheduler: failed: Set()
16/12/08 23:41:58 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[2] at reduceByKey at <console>:28), which has no missing parents
16/12/08 23:41:58 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/08 23:41:58 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.6 KB, free 6.7 KB)
16/12/08 23:41:58 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1589.0 B, free 8.3 KB)
16/12/08 23:41:58 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.17.0.6:31122 (size: 1589.0 B, free: 511.5 MB)
16/12/08 23:41:58 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/12/08 23:41:58 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 1 (ShuffledRDD[2] at reduceByKey at <console>:28)
16/12/08 23:41:58 INFO cluster.YarnScheduler: Adding task set 1.0 with 3 tasks
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 34b943b3f6ea, partition 1,NODE_LOCAL, 1894 bytes)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 4, 34b943b3f6ea, partition 2,NODE_LOCAL, 1894 bytes)
16/12/08 23:41:58 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 34b943b3f6ea:39570 (size: 1589.0 B, free: 511.5 MB)
16/12/08 23:41:58 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 34b943b3f6ea:28772 (size: 1589.0 B, free: 511.5 MB)
16/12/08 23:41:58 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 34b943b3f6ea:60986
16/12/08 23:41:58 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 163 bytes
16/12/08 23:41:58 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 34b943b3f6ea:60984
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 5, 34b943b3f6ea, partition 0,PROCESS_LOCAL, 1894 bytes)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 4) in 331 ms on 34b943b3f6ea (1/3)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 351 ms on 34b943b3f6ea (2/3)
16/12/08 23:41:58 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 5) in 29 ms on 34b943b3f6ea (3/3)
16/12/08 23:41:58 INFO scheduler.DAGScheduler: ResultStage 1 (collect at <console>:28) finished in 0.359 s
16/12/08 23:41:58 INFO cluster.YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/12/08 23:41:58 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:28, took 1.381102 s
res14: Array[(String, Int)] = Array((a,3), (b,5))
As you can see, processing only started at collect(). This is the lazy evaluation that Spark uses: even though you had map and reduceByKey earlier, nothing ran until collect, because map and reduceByKey are transformations while collect is an action.
You can see 3 partitions, each with its own task, since you initialized the RDD with 3 partitions.
Another point is how map and reduceByKey handled data locality. All three map tasks ran PROCESS_LOCAL. reduceByKey needs a data shuffle, so its tasks may be PROCESS_LOCAL or NODE_LOCAL.
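If you want to see where the stage boundary comes from without reading the scheduler log, the RDD lineage shows it directly. A small sketch you can run in the same shell (the variable name is only for illustration; the exact output format varies slightly between Spark versions):
scala> val counts = sc.parallelize(Array(("a", 1), ("b", 1), ("a", 1)), 3).map(a => a).reduceByKey(_ + _)
scala> counts.toDebugString
The printed lineage shows the ShuffledRDD on top of the MapPartitionsRDD, and the indentation step between them marks the shuffle dependency, which is exactly the boundary between ShuffleMapStage 0 and ResultStage 1 in the log above.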
I have an external Hive partitioned table which I'm trying to read from Spark using HiveContext, but I'm getting null values.
val maxClose = hiveContext.sql("select max(Close) from stock_partitioned_data where symbol = 'AAPL'");
maxClose.collect().foreach (println )
=====
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala> val hiveContext = new HiveContext(sc);
16/09/22 00:12:47 INFO HiveContext: Initializing execution hive, version 1.1.0
16/09/22 00:12:47 INFO ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.0
16/09/22 00:12:47 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.0
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@455aef06
scala> val maxClose = hiveContext.sql("select max(Close) from stock_data2")
16/09/22 00:12:53 INFO ParseDriver: Parsing command: select max(Close) from stock_data2
16/09/22 00:12:54 INFO ParseDriver: Parse Completed
16/09/22 00:12:54 INFO ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.5.0
16/09/22 00:12:54 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.5.0
maxClose: org.apache.spark.sql.DataFrame = [_c0: double]
scala> maxClose.collect().foreach (println )
16/09/22 00:13:04 INFO deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/09/22 00:13:04 INFO MemoryStore: ensureFreeSpace(425824) called with curMem=0, maxMem=556038881
16/09/22 00:13:04 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 415.8 KB, free 529.9 MB)
16/09/22 00:13:05 INFO MemoryStore: ensureFreeSpace(44793) called with curMem=425824, maxMem=556038881
16/09/22 00:13:05 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 43.7 KB, free 529.8 MB)
16/09/22 00:13:05 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.2.15:47553 (size: 43.7 KB, free: 530.2 MB)
16/09/22 00:13:05 INFO SparkContext: Created broadcast 0 from collect at <console>:27
16/09/22 00:13:05 INFO SparkContext: Starting job: collect at <console>:27
16/09/22 00:13:06 INFO FileInputFormat: Total input paths to process : 1
16/09/22 00:13:06 INFO DAGScheduler: Registering RDD 5 (collect at <console>:27)
16/09/22 00:13:06 INFO DAGScheduler: Got job 0 (collect at <console>:27) with 1 output partitions
16/09/22 00:13:06 INFO DAGScheduler: Final stage: ResultStage 1(collect at <console>:27)
16/09/22 00:13:06 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/09/22 00:13:06 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/09/22 00:13:06 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[5] at collect at <console>:27), which has no missing parents
16/09/22 00:13:06 INFO MemoryStore: ensureFreeSpace(18880) called with curMem=470617, maxMem=556038881
16/09/22 00:13:06 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 18.4 KB, free 529.8 MB)
16/09/22 00:13:06 INFO MemoryStore: ensureFreeSpace(8367) called with curMem=489497, maxMem=556038881
16/09/22 00:13:06 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 8.2 KB, free 529.8 MB)
16/09/22 00:13:06 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.0.2.15:47553 (size: 8.2 KB, free: 530.2 MB)
16/09/22 00:13:06 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
16/09/22 00:13:06 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[5] at collect at <console>:27)
16/09/22 00:13:06 INFO YarnScheduler: Adding task set 0.0 with 2 tasks
16/09/22 00:13:07 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1)
16/09/22 00:13:08 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 2)
16/09/22 00:13:11 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver@10.0.2.15:45637] <- [akka.tcp://driverPropsFetcher@quickstart.cloudera:33635]: Error [Shut down address: akka.tcp://driverPropsFetcher@quickstart.cloudera:33635] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://driverPropsFetcher@quickstart.cloudera:33635
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
akka.event.Logging$Error$NoCause$
16/09/22 00:13:12 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@quickstart.cloudera:49490/user/Executor#-842589632]) with ID 1
16/09/22 00:13:12 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 1)
16/09/22 00:13:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, quickstart.cloudera, partition 0,NODE_LOCAL, 2291 bytes)
16/09/22 00:13:13 INFO BlockManagerMasterEndpoint: Registering block manager quickstart.cloudera:56958 with 530.3 MB RAM, BlockManagerId(1, quickstart.cloudera, 56958)
16/09/22 00:13:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on quickstart.cloudera:56958 (size: 8.2 KB, free: 530.3 MB)
16/09/22 00:13:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on quickstart.cloudera:56958 (size: 43.7 KB, free: 530.2 MB)
16/09/22 00:13:31 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, quickstart.cloudera, partition 1,NODE_LOCAL, 2291 bytes)
16/09/22 00:13:31 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 18583 ms on quickstart.cloudera (1/2)
16/09/22 00:13:31 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 157 ms on quickstart.cloudera (2/2)
16/09/22 00:13:31 INFO DAGScheduler: ShuffleMapStage 0 (collect at <console>:27) finished in 25.082 s
16/09/22 00:13:31 INFO DAGScheduler: looking for newly runnable stages
16/09/22 00:13:31 INFO DAGScheduler: running: Set()
16/09/22 00:13:31 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/09/22 00:13:31 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/09/22 00:13:31 INFO DAGScheduler: failed: Set()
16/09/22 00:13:31 INFO DAGScheduler: Missing parents for ResultStage 1: List()
16/09/22 00:13:31 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[8] at collect at <console>:27), which is now runnable
16/09/22 00:13:31 INFO MemoryStore: ensureFreeSpace(16544) called with curMem=497864, maxMem=556038881
16/09/22 00:13:31 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 16.2 KB, free 529.8 MB)
16/09/22 00:13:31 INFO MemoryStore: ensureFreeSpace(7375) called with curMem=514408, maxMem=556038881
16/09/22 00:13:31 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 7.2 KB, free 529.8 MB)
16/09/22 00:13:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.0.2.15:47553 (size: 7.2 KB, free: 530.2 MB)
16/09/22 00:13:31 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:861
16/09/22 00:13:31 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[8] at collect at <console>:27)
16/09/22 00:13:31 INFO YarnScheduler: Adding task set 1.0 with 1 tasks
16/09/22 00:13:31 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, quickstart.cloudera, partition 0,PROCESS_LOCAL, 1914 bytes)
16/09/22 00:13:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on quickstart.cloudera:56958 (size: 7.2 KB, free: 530.2 MB)
16/09/22 00:13:31 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to quickstart.cloudera:49490
16/09/22 00:13:31 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 157 bytes
16/09/22 00:13:31 INFO DAGScheduler: ResultStage 1 (collect at <console>:27) finished in 0.245 s
16/09/22 00:13:31 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 245 ms on quickstart.cloudera (1/1)
16/09/22 00:13:31 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/09/22 00:13:31 INFO DAGScheduler: Job 0 finished: collect at <console>:27, took 26.194947 s
[null]
===
But if I run it directly from the Hive console, I get the results.
hive> select max(Close) from stock_data2
> ;
Query ID = cloudera_20160922001414_4b684522-3e42-4957-8260-ff6b4da67c8f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1474445009419_0005, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1474445009419_0005/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1474445009419_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-22 00:14:45,000 Stage-1 map = 0%, reduce = 0%
2016-09-22 00:14:55,165 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2016-09-22 00:15:03,707 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.68 sec
MapReduce Total cumulative CPU time: 2 seconds 680 msec
Ended Job = job_1474445009419_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.68 sec HDFS Read: 43379 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 680 msec
OK
52.369999
Time taken: 42.57 seconds, Fetched: 1 row(s)
I'm getting count(*) just fine, but querying column values and max values returns null.
This problem was resolved in Spark version 1.6.
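Since the fix is version dependent, two quick checks from the same spark-shell session can help narrow it down (a hedged sketch, not from the original thread; the table name is the one from the transcript above):
scala> sc.version
scala> hiveContext.sql("select Close from stock_data2 limit 5").collect().foreach(println)
If the raw column values already come back as null here, the problem is in how this Spark version reads the partitioned table's files, not in the max() aggregation itself.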