Accessing HDFS through a Spark Scala program - apache-spark

I have written a simple program to join the orders and order_items files which are in HDFS.
My Code to read the data:
val orders = sc.textFile("hdfs://quickstart.cloudera:8022/user/root/retail_db/orders/part-00000")
val orderItems = sc.textFile("hdfs://quickstart.cloudera:8022/user/root/retail_db/order_items/part-00000")
I got the below exception:
**Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://quickstart.cloudera:8020/user/root/retail_db, expected: file:///**
Can you please let me know the issue here? Thanks!!

You are currently using the Cloudera QuickStart VM, which most likely means you are running Spark 1.6, as those are the parcels that can be installed directly from Cloudera Manager and it is the default version for CDH 5.x.
If that is the case, Spark on YARN points to HDFS by default, so you don't need to specify the hdfs:// scheme.
Simply do this:
val orderItems = sc.textFile("/user/cloudera/retail_db/order_items/part-00000")
Note that I also changed the path to /user/cloudera. Make sure your current user has permissions on it.
The hdfs:// prefix is only needed if you are running Spark standalone.
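To see why the "Wrong FS" exception appears, it helps to look at how a Hadoop path splits into a scheme, an authority, and a path: Hadoop picks the FileSystem implementation from the scheme, and complains when it disagrees with the configured default. A minimal illustration in plain Python (urllib.parse is only a stand-in for Hadoop's own URI handling):

```python
from urllib.parse import urlparse

# A fully qualified Hadoop path carries the filesystem scheme and authority.
# "Wrong FS ... expected: file:///" means the scheme in the path (hdfs://)
# disagrees with the filesystem the configuration expects (file:///).
full = urlparse("hdfs://quickstart.cloudera:8020/user/root/retail_db/orders/part-00000")
print(full.scheme)   # hdfs
print(full.netloc)   # quickstart.cloudera:8020
print(full.path)     # /user/root/retail_db/orders/part-00000

# A scheme-less path inherits whatever fs.defaultFS is set to
# (HDFS under YARN, the local file system otherwise).
bare = urlparse("/user/cloudera/retail_db/order_items/part-00000")
print(bare.scheme)   # '' -> the default filesystem applies
```

This is why dropping the scheme, as in the answer above, sidesteps the mismatch: the path then resolves against whatever the cluster's default filesystem is.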

Related

Problem with data locality when running a Spark query with local nature on Apache Hadoop

I have a Hadoop cluster that uses Apache Spark to query parquet files saved on Hadoop. For example, when I'm using the following PySpark code to find a word in parquet files:
df = spark.read.parquet("hdfs://test/parquets/*")
df.filter(df['word'] == "jhon").show()
After running this code, I go to the Spark application UI, Stages tab. I see that the locality level summary is set to Any. In contrast, because of this query's nature, it should run locally, at the NODE_LOCAL locality level at least. When I check the network IO of the cluster while running this, I find that the query uses the network (network IO increases while the query is running). The strange part is that the number shown in the Spark UI's shuffle section is minimal.
With Russell Spitzer's help on the Apache Spark mailing list, I ran the following code to determine the root cause of this problem and find each partition's preferred location. The result brought me one step closer to solving the problem: I found out that the preferred locations are in hostname form, not IP form, but Spark uses the executors' IPs to match against the preferred locations and achieve data locality.
scala> def getRootRdd(rdd: RDD[_]): RDD[_] = { if (rdd.dependencies.isEmpty) rdd else getRootRdd(rdd.dependencies(0).rdd) }
getRootRdd: (rdd: org.apache.spark.rdd.RDD[_])org.apache.spark.rdd.RDD[_]
scala> val rdd = spark.read.parquet("hdfs://test/parquets/*").rdd
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[38] at rdd at <console>:24
scala> val scan = getRootRdd(rdd)
scan: org.apache.spark.rdd.RDD[_] = FileScanRDD[33] at rdd at <console>:24
scala> scan.partitions.map(scan.preferredLocations)
res12: Array[Seq[String]] = Array(WrappedArray(datanode-1, datanode-2, datanode-3), WrappedArray(datanode-2, datanode-3, datanode-4), WrappedArray(datanode-3, datanode-4, datanode-5),...
Now I am trying to find a way to make Spark resolve the hostnames first and then match them with the executors' IPs. Is there any suggestion?
This problem arises because Spark's preferred locations for partitions, obtained from Hadoop, are datanode hostnames, but the Spark workers registered with the Spark master by IP. Spark tries to schedule each task on the executor that holds the local partition; because executors are identified by IP and partitions by hostname, the scheduler cannot match them, and tasks always run at the "Any" locality level.
To solve this problem, run the Spark workers with the -h [hostname] flag. As a result, workers register with the master by hostname instead of IP, which solves the problem.
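The mismatch described above can be sketched with plain Python sets standing in for the scheduler's lookup (hostnames and IPs here are invented for illustration):

```python
# Preferred locations from HDFS are hostnames, while executors registered
# by IP, so a naive string comparison never finds a local executor.
preferred_locations = {"datanode-1", "datanode-2", "datanode-3"}
executor_addresses = {"10.0.0.11", "10.0.0.12", "10.0.0.13"}

local_candidates = preferred_locations & executor_addresses
print(local_candidates)  # set() -> no match, task falls back to ANY locality

# If workers register with `-h <hostname>`, both sides use the same names
# and the intersection is non-empty, so NODE_LOCAL scheduling is possible.
executor_hostnames = {"datanode-1", "datanode-2", "datanode-4"}
print(preferred_locations & executor_hostnames)  # non-empty -> NODE_LOCAL
```

The real scheduler's matching is more involved than a set intersection, but the failure mode is the same: two spellings of the same machine never compare equal.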

Saving dataframe to local file system results in empty results

We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size:
scala> df.count
res0: Long = 4067
Writing df to HDFS works fine; reading it back confirms the count:
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However, using the same code to write to a local parquet or csv file ends up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
I have tried this maybe twenty times, for both csv and parquet, and on two different EMR clusters: the same behavior is exhibited in all cases.
Is this an EMR-specific bug? A more general EC2 bug? Something else? This code works in Spark on macOS.
In case it matters - here is the versioning info:
Release label: emr-5.13.0
Hadoop distribution: Amazon 2.8.3
Applications: Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3
That is not a bug; it is the expected behavior. Spark does not really support writes to non-distributed storage (it works in local mode only because you then have a shared file system).
A local path is interpreted not (only) as a path on the driver (that would require collecting the data) but as a local path on each executor. Therefore each executor writes its own chunk to its own local file system.
Not only is the output not readable back (to load the data, each executor and the driver would have to see the same state of the file system), but depending on the commit algorithm, it might not even be finalized (moved out of the temporary directory).
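A toy simulation of this behavior in plain Python, with directories standing in for each machine's local file system (no Spark involved; all names are invented for illustration):

```python
import os
import tempfile

# Model each machine's local file system as its own temp directory.
hosts = {name: tempfile.mkdtemp(prefix=name + "-") for name in ("driver", "exec1", "exec2")}

# Each "executor" writes its partition to its OWN local file system;
# none of these files ever land on the driver's machine.
for i, host in enumerate(("exec1", "exec2")):
    out = os.path.join(hosts[host], "topVendors")
    os.makedirs(out, exist_ok=True)
    with open(os.path.join(out, f"part-{i:05d}"), "w") as f:
        f.write("data\n")

# The driver only writes the commit marker after the job finishes,
# which matches the lone zero-byte _SUCCESS in the question's ls -l.
out = os.path.join(hosts["driver"], "topVendors")
os.makedirs(out, exist_ok=True)
open(os.path.join(out, "_SUCCESS"), "w").close()

print(sorted(os.listdir(out)))  # ['_SUCCESS']
```

Writing to HDFS (or S3) first and then copying to the local disk avoids the problem, because every executor writes to storage that all machines see.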
This error usually occurs when you try to read an empty directory as parquet.
You could check:
1. whether the DataFrame is empty, with outcome.rdd.isEmpty(), before writing it;
2. whether the path you are giving is correct.
Also, in what mode are you running your application? Try running it in client mode if you are currently running in cluster mode.

Apache Spark not giving correct output

I am a beginner and want to learn about Spark. I am working with spark-shell and doing some experiments to get fast results; I want to get the results from the Spark worker nodes.
I have two machines in total: the driver and one worker run on one machine, and another worker runs on the other machine.
When I get the count, the result is not from both nodes. I have a JSON file to read and am doing some performance checking.
here is the code :
spark-shell --conf spark.sql.warehouse.dir=C:\spark-warehouse --master spark://192.168.0.31:7077
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val dfs = sqlContext.read.json("file:///C:/order.json")
dfs.count
The order.json file is present on both machines, but I am still getting different output.
1. If you are running Spark on different nodes, then you must use an S3 or HDFS path; make sure each node can access your data source.
val dfs = sqlContext.read.json("file:///C:/order.json")
Change to
val dfs = sqlContext.read.json("hdfs:///order.json")
2. If your data sources are pretty small, then you can try using a Spark broadcast variable to share that data with the other nodes, so each node has consistent data: https://spark.apache.org/docs/latest/rdd-programming-guide.html#shared-variables
3. To print your logs to the console, configure the log4j file in your Spark conf folder. For details, see Override Spark log4j configurations.

Why does reading from Hive fail with "java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found"?

I use Spark v1.6.1 and Hive v1.2.x with Python v2.7
For Hive, I have some tables (ORC files) stored in HDFS and some stored in S3. If we try to join two tables, one in HDFS and the other in S3, a java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found is thrown.
For example, this works when querying a Hive table in HDFS:
df1 = sqlContext.sql('select * from hdfs_db.tbl1')
This works when querying a Hive table in S3:
df2 = sqlContext.sql('select * from s3_db.tbl2')
The code below throws the above RuntimeException:
sql = """
select *
from hdfs_db.tbl1 a
join s3_db.tbl2 b on a.id = b.id
"""
df3 = sqlContext.sql(sql)
We are migrating from HDFS to S3, which is why the storage backing the Hive tables differs (basically, ORC files in HDFS and in S3). One interesting thing is that if we use DBeaver or the beeline client to connect to Hive and issue the joined query, it works. I can also use sqlalchemy to issue the joined query and get the result. The problem only shows up with Spark's sqlContext.
More information on execution and environment: this code is executed in a Jupyter notebook on an edge node (which already has Spark, Hadoop, Hive, Tez, etc. set up and configured). The Python environment is managed by conda for Python v2.7. Jupyter is started with pyspark as follows.
IPYTHON_OPTS="notebook --port 7005 --notebook-dir='~/' --ip='*' --no-browser" \
pyspark \
--queue default \
--master yarn-client
When I go to the Spark Application UI under Environment, the Classpath Entries section contains the following.
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-api-jdo-3.2.6.jar
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-core-3.2.10.jar
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-rdbms-3.2.9.jar
/usr/hdp/2.4.2.0-258/spark/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar
/usr/hdp/current/hadoop-client/conf/
/usr/hdp/current/spark-historyserver/conf/
The sun.boot.class.path has the following value: /usr/jdk64/jdk1.8.0_60/jre/lib/resources.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/rt.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/sunrsasign.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jsse.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jce.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/charsets.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jfr.jar:/usr/jdk64/jdk1.8.0_60/jre/classes.
The spark.executorEnv.PYTHONPATH has the following value: /usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip:/usr/hdp/2.4.2.0-258/spark/python/:<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.9-src.zip.
The Hadoop distribution is via CDH: Hadoop 2.7.1.2.4.2.0-258
Quoting Steve Loughran (who, given his track record in Spark development, seems to be the source of truth on the topic of accessing S3 file systems) from SPARK-15965 No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1:
This is being fixed with tests in my work in SPARK-7481; the manual workaround is
Spark 1.6+
This needs my patch and a rebuild of the Spark assembly. However, once that patch is in, trying to use the assembly without the AWS JARs will stop Spark from starting, unless you move up to Hadoop 2.7.3
There are also some other sources where you can find workarounds:
https://issues.apache.org/jira/browse/SPARK-7442
https://community.mapr.com/thread/9635
How to access s3a:// files from Apache Spark?
https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html
https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.html
I've got no environment (and experience) to give the above a shot, so after you try it, please report back so we have a better understanding of the current situation regarding S3 support in Spark. Thanks.
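For reference, the workarounds in the linked articles generally boil down to putting the hadoop-aws module and a matching AWS SDK on the classpath at launch time. A sketch against the pyspark invocation above (the artifact versions here are assumptions; they must match your Hadoop build, e.g. hadoop-aws 2.7.1 pairs with aws-java-sdk 1.7.4):

```shell
# Pull in the S3A filesystem classes and the matching AWS SDK at submit time.
pyspark \
  --queue default \
  --master yarn-client \
  --packages org.apache.hadoop:hadoop-aws:2.7.1,com.amazonaws:aws-java-sdk:1.7.4 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```

The fs.s3a.impl setting is redundant on Hadoop 2.7+, where it is already in core-default.xml, but it is harmless and makes the intent explicit.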

How do you read and write from/into different ElasticSearch clusters using spark and elasticsearch-hadoop?

Original title: Besides HDFS, what other DFS does Spark support (and which are recommended)?
I am happily using Spark and Elasticsearch (with the elasticsearch-hadoop driver) with several gigantic clusters.
From time to time, I would like to pull the entire cluster of data out, process each doc, and put all of them into a different Elasticsearch (ES) cluster (yes, data migration too).
Currently, there is no way to read ES data from one cluster into RDDs and write those RDDs into a different cluster with spark + elasticsearch-hadoop, because that would involve swapping the SparkContext out from under the RDD. So I would like to write the RDDs into object files and later read them back into RDDs with a different SparkContext.
However, here comes the problem: I then need a DFS (Distributed File System) to share the big files across my entire Spark cluster. The most popular solution is HDFS, but I would very much like to avoid introducing Hadoop into my stack. Is there another recommended DFS that Spark supports?
Update Below
Thanks to @Daniel Darabos's answer below, I can now read and write data from/into different Elasticsearch clusters using the following Scala code:
val conf = new SparkConf().setAppName("Spark Migrating ES Data")
conf.set("es.nodes", "from.escluster.com")
val sc = new SparkContext(conf)
val allDataRDD = sc.esRDD("some/lovelydata")
val cfg = Map("es.nodes" -> "to.escluster.com")
allDataRDD.saveToEsWithMeta("clone/lovelydata", cfg)
Spark uses the hadoop-common library for file access, so whatever file systems Hadoop supports will work with Spark. I've used it with HDFS, S3, and GCS.
That said, I'm not sure I understand why you don't just use elasticsearch-hadoop. You have two ES clusters, so you need to access them with different configurations. sc.newAPIHadoopFile and rdd.saveAsHadoopFile take hadoop.conf.Configuration arguments, so you can use two ES clusters with the same SparkContext without any problems.