Apache Spark not giving correct output - apache-spark

I am a beginner and want to learn about Spark. I am working with spark-shell and running some experiments; to get fast results I want the results computed on the Spark worker nodes.
I have two machines in total: a driver and one worker on a single machine, and another worker on the other machine.
When I run a count, the result does not come from both nodes. I have a JSON file to read and am doing some performance checking.
Here is the code:
spark-shell --conf spark.sql.warehouse.dir=C:\spark-warehouse --master spark://192.168.0.31:7077
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val dfs = sqlContext.read.json("file:///C:/order.json")
dfs.count
I have distributed the order.json file to both machines, but I am still getting different output.

1. If you are running Spark on different nodes, then you should use an S3 or HDFS path and make sure each node can access your data source.
val dfs = sqlContext.read.json("file:///C:/order.json")
Change it to
val dfs = sqlContext.read.json("hdfs:///order.json")
2. If your data sources are pretty small, then you can try using a Spark broadcast variable to share that data with the other nodes, so each node sees consistent data (see the sketch after this list): https://spark.apache.org/docs/latest/rdd-programming-guide.html#shared-variables
3. To print your logs to the console, configure the log4j file in your Spark conf folder.
For details, see Override Spark log4j configurations.
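As a rough sketch of option 2 in spark-shell, assuming the JSON data is small enough to collect on the driver (the variable names here are illustrative, not from the original question):
// Collect the small dataset on the driver; only safe for genuinely small data.
val smallData = sqlContext.read.json("file:///C:/order.json").collect()
// Broadcast it so every executor receives the same read-only copy.
val broadcastData = sc.broadcast(smallData)
// Use the broadcast copy inside a transformation; each node now works on
// identical data regardless of which files happen to exist locally.
sc.parallelize(Seq(1, 2, 3)).map(_ => broadcastData.value.length).collect()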

Related

Problem with data locality when running Spark query with local nature on apache Hadoop

I have a Hadoop cluster that uses Apache Spark to query parquet files saved on Hadoop. For example, when I'm using the following PySpark code to find a word in parquet files:
df = spark.read.parquet("hdfs://test/parquets/*")
df.filter(df['word'] == "jhon").show()
After running this code, I go to the Spark application UI, Stages tab. I see that the locality level summary is set to Any. In contrast, because of this query's nature, it should run locally, at the NODE_LOCAL locality level at least. When I check the network IO of the cluster while running this, I find that the query uses the network (network IO increases while the query is running). The strange part of this situation is that the number shown in the Spark UI's shuffle section is minimal.
With Russell Spitzer's help on the Apache Spark mailing list, I ran the following code to find each partition's preferred locations and determine the root cause of this problem. The result brought me one step closer to solving it. I found out that the preferred locations are datanode hostnames rather than IPs, but Spark uses the executors' IPs to match against the preferred locations and achieve data locality.
scala> def getRootRdd( rdd:RDD[_] ): RDD[_] = { if(rdd.dependencies.size == 0) rdd else getRootRdd(rdd.dependencies(0).rdd)}
getRootRdd: (rdd: org.apache.spark.rdd.RDD[_])org.apache.spark.rdd.RDD[_]
scala> val rdd = spark.read.parquet("hdfs://test/parquets/*").rdd
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[38] at rdd at <console>:24
scala> val scan = getRootRdd(rdd)
scan: org.apache.spark.rdd.RDD[_] = FileScanRDD[33] at rdd at <console>:24
scala> scan.partitions.map(scan.preferredLocations)
res12: Array[Seq[String]] = Array(WrappedArray(datanode-1, datanode-2, datanode-3), WrappedArray(datanode-2, datanode-3, datanode-4), WrappedArray(datanode-3, datanode-4, datanode-5),...
Now I am trying to find a way to make Spark resolve the hostnames first and then match them with the executors' IPs. Is there any suggestion?
This problem arises because the preferred locations Spark gets from Hadoop for the partitions are datanode hostnames, but the Spark workers registered with the Spark master by IP. Spark tries to schedule each task on the executor that holds the local partition; because executors are identified by IP and partitions by hostname, the scheduler cannot match the IPs with the hostnames, and tasks always run at the "Any" locality level.
To solve this, run the Spark workers with the -h [hostname] flag. As a result, the workers register with the master by hostname instead of IP, which solves the problem.
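For reference, a worker can be started by hand with an explicit hostname roughly like this (the hostname and master URL below are placeholders for your own values):
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker -h datanode-1 spark://master-host:7077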

Accessing HDFS thru spark-scala program

I have written a simple program to join the orders and order_items files which are in HDFS.
My Code to read the data:
val orders = sc.textFile ("hdfs://quickstart.cloudera:8022/user/root/retail_db/orders/part-00000")
val orderItems = sc.textFile ("hdfs://quickstart.cloudera:8022/user/root/retail_db/order_items/part-00000")
I got the below exception:
**Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://quickstart.cloudera:8020/user/root/retail_db, expected: file:///**
Can you please let me know the issue here? Thanks!!
You are currently using the Cloudera Quickstart VM, which most likely means you are running Spark 1.6, as those are the parcels that can be installed directly from Cloudera Manager and it is the default version for CDH 5.x.
If that is the case, Spark on YARN points to HDFS by default, so you don't need to specify hdfs://.
Simply do this:
val orderItems = sc.textFile ("/user/cloudera/retail_db/order_items/part-00000")
Note that I also changed the path to /user/cloudera. Make sure your current user has permissions there.
The hdfs:// prefix is only needed if you are using Spark standalone.
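Once both files can be read, the original join might look roughly like the sketch below. The field positions are assumptions about the retail_db CSV layout (order_id in column 0 of orders and column 1 of order_items), not something stated in the question, so adjust them to your schema:
// Read both datasets (default filesystem, i.e. HDFS on this setup).
val orders = sc.textFile("/user/cloudera/retail_db/orders/part-00000")
val orderItems = sc.textFile("/user/cloudera/retail_db/order_items/part-00000")
// Key each RDD by the order id (column positions assumed, see above).
val ordersByKey = orders.map(_.split(",")).map(r => (r(0), r))
val itemsByKey = orderItems.map(_.split(",")).map(r => (r(1), r))
// Join on order id and inspect a few joined rows.
ordersByKey.join(itemsByKey).take(5).foreach(println)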

Spark Standalone cluster cannot read the files in local filesystem

I have a Spark standalone cluster having 2 worker nodes and 1 master node.
Using spark-shell, I was able to read data from a file on the local filesystem, then did some transformations and saved the final RDD to /home/output (let's say).
The RDD got saved successfully, but only on one worker node; on the master node only the _SUCCESS file was there.
Now, if I try to read this output data back from /home/output, I don't get any data: the read finds 0 records on the master, and I assume it does not check the other worker nodes for the data.
It would be great if someone could shed some light on why Spark is not reading from all the worker nodes, or on the mechanism Spark uses to read data from worker nodes.
scala> sc.wholeTextFiles("/home/output/")
res7: org.apache.spark.rdd.RDD[(String, String)] = /home/output/ MapPartitionsRDD[5] at wholeTextFiles at <console>:25
scala> res7.count
res8: Long = 0
SparkContext (i.e. sc) by default picks up the configuration in HADOOP_CONF_DIR, which generally sets the default filesystem to hdfs://. This means that when you say sc.textFile("/home/output/"), Spark looks for the file/dir as hdfs:///home/output, which in your case is not present on HDFS. file:// points to the local filesystem.
Try sc.textFile("file:///home/output"), thus explicitly telling Spark to read from the local filesystem.
You should also put the file on every worker machine, with the same path and name.
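A minimal sketch of why the output ended up split across nodes and how to read it back (the output path here is illustrative):
// With a local file:// path, each worker writes only its own partitions to its own disk.
val demo = sc.parallelize(Seq("a", "b", "c"))
demo.saveAsTextFile("file:///home/output_demo")
// Reading back with file:// only sees the partitions present on the nodes that run
// the read tasks, so either copy the files to every node or use a shared store like HDFS.
sc.textFile("file:///home/output_demo").count()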

What is the correct way to query Hive on Spark for maximum performance?

Spark newbie here.
I have a pretty large table in Hive (~130M records, 180 columns) and I'm trying to use Spark to write it out as a Parquet file.
I'm using the default EMR cluster configuration, 6 * r3.xlarge instances, to submit my Spark application written in Python. I then run it on YARN in cluster mode, usually giving a small amount of memory (a couple of GB) to the driver and the rest to the executors. Here's my code to do so:
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext(appName="ParquetTest")
hiveCtx = HiveContext(sc)
data = hiveCtx.sql("select * from my_table")
data.repartition(20).write.mode('overwrite').parquet("s3://path/to/myfile.parquet")
Later, I submit it with something similar to this:
spark-submit --master yarn --deploy-mode cluster --num-executors 5 --driver-memory 4g --driver-cores 1 --executor-memory 24g --executor-cores 2 --py-files test_pyspark.py test_pyspark.py
However, my task takes forever to complete. Spark shuts down all but one worker very quickly after the job starts, since the others are not being used, and it takes a few hours before it has pulled all the data from Hive. The Hive table itself is not partitioned or clustered yet (I also need some advice on that).
Could you help me understand what I'm doing wrong, where should I go from here and how to get the maximum performance out of resources I have?
Thank you!
I had a similar use case where I used Spark to write to S3 and had performance issues. The primary reason was that Spark was creating a lot of zero-byte part files, and renaming temp files to the actual file names was slowing down the write process. I tried the approaches below as workarounds:
1. Write the output of Spark to HDFS and use Hive to write to S3. Performance was much better, as Hive was creating a smaller number of part files. The problem I had (also present when using Spark) is that the delete action was not allowed by the policy in the prod environment for security reasons; the S3 bucket was KMS encrypted in my case.
2. Write the Spark output to HDFS, copy the HDFS files to local disk, and use aws s3 cp to push the data to S3. This gave the second-best results. I created a ticket with Amazon and they suggested going with this approach.
3. Use s3-dist-cp to copy files from HDFS to S3 (see the example command after this list). This worked with no issues, but was not performant.
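For the last option, the invocation on EMR looks roughly like this (the HDFS path and bucket name are placeholders):
s3-dist-cp --src hdfs:///user/hadoop/parquet_output --dest s3://my-bucket/path/to/myfile.parquet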

How do you read and write from/into different ElasticSearch clusters using spark and elasticsearch-hadoop?

Original title: Besides HDFS, what other DFS does Spark support (and are recommended)?
I am happily using Spark and Elasticsearch (with the elasticsearch-hadoop driver) with several gigantic clusters.
From time to time, I would like to pull an entire cluster's data out, process each doc, and put all of them into a different Elasticsearch (ES) cluster (yes, data migration too).
Currently, there is no way to read ES data from one cluster into RDDs and write those RDDs into a different cluster with spark + elasticsearch-hadoop, because that would involve swapping the SparkContext out from under the RDD. So I would like to write the RDDs to object files and later read them back into RDDs with different SparkContexts.
However, here comes the problem: I then need a DFS (Distributed File System) to share the big files across my entire Spark cluster. The most popular solution is HDFS, but I would very much like to avoid introducing Hadoop into my stack. Is there any other recommended DFS that Spark supports?
Update Below
Thanks to Daniel Darabos's answer below, I can now read and write data from/into different Elasticsearch clusters using the following Scala code:
val conf = new SparkConf().setAppName("Spark Migrating ES Data")
conf.set("es.nodes", "from.escluster.com")
val sc = new SparkContext(conf)
val allDataRDD = sc.esRDD("some/lovelydata")
val cfg = Map("es.nodes" -> "to.escluster.com")
allDataRDD.saveToEsWithMeta("clone/lovelydata", cfg)
Spark uses the hadoop-common library for file access, so whatever file systems Hadoop supports will work with Spark. I've used it with HDFS, S3 and GCS.
I'm not sure I understand why you don't just use elasticsearch-hadoop. You have two ES clusters, so you need to access them with different configurations. sc.newAPIHadoopFile and rdd.saveAsHadoopFile take hadoop.conf.Configuration arguments. So you can use two ES clusters with the same SparkContext without any problems.
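Along the same lines, a minimal sketch of migrating with a single SparkContext by passing per-operation configuration maps to elasticsearch-hadoop (the node addresses and index names are placeholders, and the esRDD overload taking a cfg map is assumed to be available in your connector version):
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// One SparkContext for both clusters; the cluster addresses live in per-operation config maps.
val sc = new SparkContext(new SparkConf().setAppName("ES to ES migration"))

// Read from the source cluster.
val readCfg = Map("es.nodes" -> "from.escluster.com")
val sourceRdd = sc.esRDD("some/lovelydata", readCfg)

// Write to the target cluster, preserving document ids via the (id, doc) tuples from esRDD.
val writeCfg = Map("es.nodes" -> "to.escluster.com")
sourceRdd.saveToEsWithMeta("clone/lovelydata", writeCfg)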
