Failing to enable Hive support in spark-submit (Spark 3) - apache-spark

I am using Spark in an integration test suite. It has to run locally and read/write files to the local file system. I also want to read/write this data as tables.
In the first step of the suite I write some Hive tables in the database feature_store, specifying
spark.sql.warehouse.dir=/opt/spark/work-dir/warehouse. The step completes correctly and I see the files in the folder I expect.
Afterwards I run a spark-submit step with (among others) these confs
--conf spark.sql.warehouse.dir=/opt/spark/work-dir/warehouse --conf spark.sql.catalogImplementation=hive
and when trying to read a table previously written I get
Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'feature_store' not found
However, if I do exactly the same thing with exactly the same configs in a spark-shell, I am able to read the data.
In the spark-submit step I use the following code to get the SparkSession:
SparkSession spark = SparkSession.active();
I have also tried using
SparkSession spark = SparkSession.builder().enableHiveSupport().getOrCreate();
but I keep getting the same error as above.
I have understood that the problem is related to spark-submit not picking up Hive as the
catalog implementation. In fact, I see that spark.catalog is not an instance of HiveCatalogImpl during the spark-submit (while it is when using spark-shell).
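For what it's worth, here is a minimal sketch (in Scala, with a hypothetical app name) of building the session with Hive support before anything else touches SparkSession. SparkSession.active() only returns a session that already exists, and getOrCreate() reuses an existing session and ignores new builder settings, so enableHiveSupport() has no effect unless it runs before the first session is created:

import org.apache.spark.sql.SparkSession

// Build the session before any other code creates one; otherwise
// getOrCreate() returns the existing session and these settings are ignored.
val spark = SparkSession.builder()
  .appName("integration-test") // hypothetical name
  .config("spark.sql.warehouse.dir", "/opt/spark/work-dir/warehouse")
  .enableHiveSupport() // sets spark.sql.catalogImplementation=hive
  .getOrCreate()

// Sanity check: should print "hive", not "in-memory".
println(spark.conf.get("spark.sql.catalogImplementation"))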

Related

Out of memory error while running spark submit

I am trying to load 60 GB of table data into a Spark Python dataframe and then write it into a Hive table.
I have set the driver memory, executor memory, and max result size high enough to handle the data, but I get an error when I run through spark-submit with all of the above configs specified on the command line.
Note: through the Spark Python shell (specifying driver & executor memory while launching the shell), I am able to populate the target Hive table.
Any thoughts?
Try using the syntax:
./spark-submit --conf ...
for the memory-related configuration. What I suspect you're doing is setting these options while initializing the SparkSession, which is too late: the driver JVM has already started by then. The same parameters you set for running the shell will do.
https://spark.apache.org/docs/latest/submitting-applications.html
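For illustration, a hedged example of such a command line (memory sizes and the script name are placeholders, not values from the question):

./spark-submit \
  --driver-memory 8g \
  --executor-memory 16g \
  --conf spark.driver.maxResultSize=4g \
  your_job.py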

Apache Spark not giving correct output

I am a beginner and want to learn about Spark. I am working with spark-shell and doing some experiments to get fast results. I want to get the results from the Spark worker nodes.
I have two machines in total: the driver and one worker on a single machine, and another worker on the other machine.
When I want to get the count, the result is not from both nodes. I have a JSON file to read and am doing some performance checking.
Here is the code:
spark-shell --conf spark.sql.warehouse.dir=C:\spark-warehouse --master spark://192.168.0.31:7077
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val dfs = sqlContext.read.json("file:///C:/order.json")
dfs.count
The order.json file is present on both machines, but I am still getting different output.
1. If you are running Spark on different nodes, then you need an S3 or HDFS path; make sure each node can access your data source.
val dfs = sqlContext.read.json("file:///C:/order.json")
Change to
val dfs = sqlContext.read.json("hdfs:///order.json")
2. If your data sources are pretty small, then you can try using a Spark broadcast to share the data with the other nodes, so each node has consistent data (see the sketch after this list): https://spark.apache.org/docs/latest/rdd-programming-guide.html#shared-variables
3. In order to print your logs in the console, configure your log4j file in your Spark conf folder.
For details see Override Spark log4j configurations.
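A minimal sketch of the broadcast approach from point 2, as it might look in spark-shell (where sc is predefined); the file path and the final computation are illustrative only:

import scala.io.Source

// Read the small file once on the driver and ship it to every executor.
val lines = Source.fromFile("C:/order.json").getLines().toList
val broadcastLines = sc.broadcast(lines)

// Every task sees the same copy of the data via .value.
val sizes = sc.parallelize(1 to 4).map(_ => broadcastLines.value.size).collect()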

Why does reading from Hive fail with "java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found"?

I use Spark v1.6.1 and Hive v1.2.x with Python v2.7
For Hive, I have some tables (ORC files) stored in HDFS and some stored in S3. If we are trying to join 2 tables, where one is in HDFS and the other is in S3, a java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found is thrown.
For example, this works when querying a Hive table in HDFS.
df1 = sqlContext.sql('select * from hdfs_db.tbl1')
This works when querying a Hive table in S3.
df2 = sqlContext.sql('select * from s3_db.tbl2')
The code below throws the above RuntimeException.
sql = """
select *
from hdfs_db.tbl1 a
join s3_db.tbl2 b on a.id = b.id
"""
df3 = sqlContext.sql(sql)
We are migrating from HDFS to S3, which is why the Hive tables are backed by different storage (basically, ORC files in HDFS and S3). One interesting thing is that if we use DBeaver or the beeline client to connect to Hive and issue the joined query, it works. I can also use sqlalchemy to issue the joined query and get the result. This problem only shows up with Spark's sqlContext.
More information on execution and environment: this code is executed in a Jupyter notebook on an edge node (that already has Spark, Hadoop, Hive, Tez, etc. set up/configured). The Python environment is managed by conda for Python v2.7. Jupyter is started with pyspark as follows.
IPYTHON_OPTS="notebook --port 7005 --notebook-dir='~/' --ip='*' --no-browser" \
pyspark \
--queue default \
--master yarn-client
When I go to the Spark Application UI, under Environment the Classpath Entries are the following.
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-api-jdo-3.2.6.jar
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-core-3.2.10.jar
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-rdbms-3.2.9.jar
/usr/hdp/2.4.2.0-258/spark/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar
/usr/hdp/current/hadoop-client/conf/
/usr/hdp/current/spark-historyserver/conf/
The sun.boot.class.path has the following value: /usr/jdk64/jdk1.8.0_60/jre/lib/resources.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/rt.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/sunrsasign.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jsse.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jce.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/charsets.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jfr.jar:/usr/jdk64/jdk1.8.0_60/jre/classes.
The spark.executorEnv.PYTHONPATH has the following value: /usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip:/usr/hdp/2.4.2.0-258/spark/python/:<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.9-src.zip.
The Hadoop distribution is via HDP: Hadoop 2.7.1.2.4.2.0-258
Quoting Steve Loughran (who, given his track record in Spark development, seems to be the source of truth on the topic of accessing S3 file systems) from SPARK-15965 No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1:
This is being fixed with tests in my work in SPARK-7481; the manual
workaround is
Spark 1.6+
This needs my patch and a rebuild of the Spark assembly. However, once that patch is in, trying to use the assembly without the AWS JARs will stop Spark from starting, unless you move up to Hadoop 2.7.3.
There are also some other sources where you can find workarounds:
https://issues.apache.org/jira/browse/SPARK-7442
https://community.mapr.com/thread/9635
How to access s3a:// files from Apache Spark?
https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html
https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.html
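A workaround that recurs in those links (hedged; the jar versions are illustrative and must match your Hadoop build) is to put the S3A classes on the classpath explicitly and name the filesystem implementation:

pyspark \
  --jars /path/to/hadoop-aws-2.7.1.jar,/path/to/aws-java-sdk-1.7.4.jar \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem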
I've got no environment (or experience) to give the above a shot, so after you give it a try please report back, to build a better understanding of the current situation regarding S3 support in Spark. Thanks.

spark streaming application and kafka log4j appender issue

I am testing my Spark Streaming application, and I have multiple functions in my code:
- some of them operate on a DStream[RDD[XXX]], some of them on RDD[XXX] (after I do DStream.foreachRDD).
I use the Kafka log4j appender to log business cases that occur within my functions, both those that operate on DStream[RDD] and those that operate on the RDD itself.
But data gets appended to Kafka only from the functions that operate on RDD; it doesn't work when I want to append data to Kafka from my functions that operate on DStream.
Does anyone know the reason for this behaviour?
I am working on a single virtual machine, where I have Spark & Kafka. I submit applications using spark-submit.
EDITED
Actually I have figured out part of the problem. Data gets appended to Kafka only from the part of the code that is in my main function. All the code that is outside of my main doesn't write data to Kafka.
In main I declared the logger like this:
val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
While outside of my main, I had to declare it like:
@transient lazy val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
in order to avoid serialization issues.
The reason might lie in the JVM serialization concept, or simply in the workers not seeing the log4j configuration file (but my log4j file is in my source code, in the resources folder).
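For reference, a minimal sketch of that pattern (the object name and the DStream variable are hypothetical): wrapping the logger in an object with @transient lazy val means it is re-created lazily on each executor instead of being serialized from the driver.

object LogHolder extends Serializable {
  // Not serialized with the closure; re-initialized lazily on each JVM.
  @transient lazy val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
}

// 'dstream' stands for an existing DStream in the application (assumption).
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    LogHolder.kafkaLogger.info(s"processed: $record")
  }
}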
Edited 2
I have tried many ways to ship the log4j file to the executors, but none of them work. I tried:
sending the log4j file with the --files option of spark-submit
setting --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/home/vagrant/log4j.properties" in spark-submit
putting the log4j.properties file on the --driver-class-path of spark-submit...
None of these options worked.
Does anyone have the solution? I do not see any errors in my error log.
Thank you
I think you are close. First you want to make sure all the files are exported to the WORKING DIRECTORY (not the CLASSPATH) on all nodes using the --files flag. Then you want to reference these files in the extraClassPath option of the executors and driver. I have attached the following command; hope it helps. The key is to understand that once the files are exported, they can all be accessed on the node using just the file name in the working directory (not a URL path).
Note: putting the log4j file in the resources folder will not work (at least when I tried, it didn't).
sudo -u hdfs spark-submit \
  --class "SampleAppMain" \
  --master yarn \
  --deploy-mode cluster \
  --verbose \
  --files file:///path/to/custom-log4j.properties,hdfs:///path/to/jar/kafka-log4j-appender-0.9.0.0.jar \
  --conf "spark.driver.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
  --conf "spark.executor.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  /path/to/your/jar/SampleApp-assembly-1.0.jar

How to Access RDD Tables via Spark SQL as a JDBC Distributed Query Engine?

Several postings on Stack Overflow have responses with partial information about how to access RDD tables via Spark SQL as a JDBC distributed query engine. So I'd like to ask the following questions for complete information about how to do that:
In the Spark SQL app, do we need to use HiveContext to register tables? Or can we use just SQL Context?
Where and how do we use HiveThriftServer2.startWithContext?
When we run start-thriftserver.sh as in
/opt/mapr/spark/spark-1.3.1/sbin/start-thriftserver.sh --master spark://spark-master:7077 --hiveconf hive.server2.thrift.bind.host=spark-master --hiveconf hive.server2.thrift.port=10001
besides specifying the jar and main class of the Spark SQL app, do we need to specify any other parameters?
Are there any other things we need to do?
Thanks.
To expose DataFrame temp tables through HiveThriftServer2.startWithContext(), you may need to write and run a simple application; you may not need to run start-thriftserver.sh.
To your questions:
HiveContext is needed; in spark-shell the sqlContext is implicitly a HiveContext.
Write a simple application, for example:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver._

val hiveContext = new HiveContext(sparkContext)
hiveContext.parquetFile(path).registerTempTable("my_table1")
HiveThriftServer2.startWithContext(hiveContext)
No need to run start-thriftserver.sh; run your own application instead, e.g.:
spark-submit --class com.xxx.MyJdbcApp ./package_with_my_app.jar
Nothing else is needed on the server side; it should start on the default port 10000.
You may verify by connecting to the server with beeline.
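For example (assuming the default port above, connecting from the machine running the application):
beeline -u jdbc:hive2://localhost:10000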
In Java I was able to expose a dataframe as a temp table and read the table content via beeline (just like a regular Hive table).
I haven't posted the entire program (with the assumption that you already know how to create dataframes):
import org.apache.spark.sql.hive.thriftserver.*;
HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
DataFrame orgDf = sqlContext.createDataFrame(orgPairRdd.values(), OrgMaster.class);
orgPairRdd is a JavaPairRDD; orgPairRdd.values() contains the entire class value (rows fetched from HBase).
OrgMaster is a serializable Java bean class.
orgDf.registerTempTable("spark_org_master_table");
HiveThriftServer2.startWithContext(sqlContext);
I submitted the program locally (as the Hive thrift server was not running on port 10000 on that machine):
hadoop_classpath=$(hadoop classpath)
HBASE_CLASSPATH=$(hbase classpath)
spark-1.5.2/bin/spark-submit \
  --name tempSparkTable \
  --class packageName.SparkCreateOrgMasterTableFile \
  --master local[4] \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 8G \
  --conf "spark.executor.extraClassPath=${HBASE_CLASSPATH}:${hadoop_classpath}" \
  --conf "spark.driver.extraClassPath=${HBASE_CLASSPATH}" \
  --jars /path/programName-SNAPSHOT-jar-with-dependencies.jar \
  /path/programName-SNAPSHOT.jar
In another terminal, start beeline pointing to the thrift service started by this Spark program:
/opt/hive/hive-1.2/bin/beeline -u jdbc:hive2://<ipaddressofMachineWhereSparkPgmRunninglocally>:10000 -n anyUsername
show tables -> this command will display the table that you registered in Spark.
You can also run describe; in this example:
describe spark_org_master_table;
Then you can run regular queries in beeline against this table (until you kill the Spark program).
