Local spark submit does not return any results - apache-spark

I am trying to run the script below locally using spark-submit, but it does not return any results. The same code works in spark-shell. What am I missing?
Spark Submit
./bin/spark-submit \
--master local \
~/Desktop/projects/S3_Snowflake_Prototype/main.py
Code: main.py
from pyspark.sql import SparkSession

spark = SparkSession\
.builder\
.appName("PythonPi")\
.getOrCreate()
df = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("~/Desktop/projects/S3_Snowflake_Prototype/csv/source.csv")
df.count()
Current Output
21/08/03 08:36:48 WARN Utils: Your hostname, vinays-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.0.3 instead (on interface en0)
21/08/03 08:36:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/08/03 08:36:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/08/03 08:36:49 INFO ShutdownHookManager: Shutdown hook called
21/08/03 08:36:49 INFO ShutdownHookManager: Deleting directory /private/var/folders/_l/r0yqws8j5hl5bsc5rvzjkm5c0000gn/T/spark-a3ea7970-7ef7-4edd-a539-f5c1f264b59d
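A hedged note, not part of the original post: spark-shell echoes the value of every expression, so df.count() appears to "return" there, but spark-submit prints nothing unless you print it yourself; Spark also does not expand ~ in load paths outside the shell. A minimal sketch of both fixes, under those assumptions:
# Sketch: print the count explicitly and expand "~" to an absolute path.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

csv_path = os.path.expanduser("~/Desktop/projects/S3_Snowflake_Prototype/csv/source.csv")
df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load(csv_path)

print(df.count())  # spark-submit does not echo expression values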

Related

Spark accessing remote master

I'm trying to run my code in a Jupyter notebook locally, accessing a Spark cluster on my own server, but without success. This is the code.
I've tried this
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName('SparkApp').setMaster('spark://X.X.X.123:7077')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
and this way
spark = SparkSession.builder.master("spark://X.X.X.123:7077").getOrCreate()
I received this error [updated]
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
New error after opening port 7077:
21/11/23 20:39:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/11/23 20:39:33 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 192.168.0.123:7077
With 'local' it works normally.
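A hedged suggestion, not from the thread: a standalone master only accepts connections on the exact spark:// URL it advertises (shown in the master web UI on port 8080), and the workers must be able to reach the driver back. A sketch under those assumptions; the driver IP here is hypothetical:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("spark://X.X.X.123:7077")  # must match the URL the master UI reports
    # hypothetical: the notebook machine's IP, reachable from the workers
    .config("spark.driver.host", "X.X.X.200")
    .config("spark.driver.bindAddress", "0.0.0.0")
    .getOrCreate())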

How to forward Spark logs to a Jupyter notebook?

I know I can set the log level via spark.sparkContext.setLogLevel('INFO'). Logs such as the following appear in the terminal, but not in the Jupyter notebook.
2019-03-25 11:42:37 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-03-25 11:42:37 WARN SparkConf:66 - In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
2019-03-25 11:42:38 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
The Spark session is created in local mode in a Jupyter notebook cell.
spark = SparkSession \
.builder \
.master('local[7]') \
.appName('Notebook') \
.getOrCreate()
Is there any way to forward the logs to the Jupyter notebook?
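A hedged note, not from the thread: these lines are log4j output from the driver JVM, which writes to the stderr of the process that launched the kernel, not to notebook cells. One workaround (file paths here are assumptions) is to point log4j at a file and read that file from a cell:
# Hypothetical log4j.properties, placed where Spark picks it up
# (e.g. $SPARK_HOME/conf/log4j.properties):
#   log4j.rootCategory=INFO, file
#   log4j.appender.file=org.apache.log4j.FileAppender
#   log4j.appender.file.File=/tmp/spark-driver.log
#   log4j.appender.file.layout=org.apache.log4j.PatternLayout
#   log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Then, in a notebook cell, read what the JVM has logged so far:
with open("/tmp/spark-driver.log") as f:
    print(f.read())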

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use the AWS Glue Data Catalog as its metastore. I followed the steps in the official AWS documentation (reference link below), but I am facing a discrepancy in accessing the Glue Catalog databases/tables. Both the EMR cluster and AWS Glue are in the same account, and the appropriate IAM permissions have been granted.
AWS Documentation : https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
Observations:
- Using spark-shell (From EMR Master Node):
Works. I am able to access Glue DB/tables using the commands below:
spark.catalog.setCurrentDatabase("test_db")
spark.catalog.listTables
- Using spark-submit (From EMR Step):
Does not work. I keep getting the error "Database 'test_db' does not exist".
The error trace is below:
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO HiveMetaStore: 0: get_database: default
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: default
INFO HiveMetaStore: 0: get_database: global_temp
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: global_temp
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/6d0f6b2c-cccd-4e90-a524-93dcc5301e20_resources
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/yarn/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20/_tmp_space.db
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
INFO CodeGenerator: Code generated in 191.063411 ms
INFO CodeGenerator: Code generated in 10.27313 ms
INFO HiveMetaStore: 0: get_database: test_db
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: test_db
WARN ObjectStore: Failed to get database test_db, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Database 'test_db' does not exist.;
at org.apache.spark.sql.internal.CatalogImpl.requireDatabaseExists(CatalogImpl.scala:44)
at org.apache.spark.sql.internal.CatalogImpl.setCurrentDatabase(CatalogImpl.scala:64)
at org.griffin_test.GriffinTest.ingestGriffinRecords(GriffinTest.java:97)
at org.griffin_test.GriffinTest.main(GriffinTest.java:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
After a lot of research and going through many suggestions in blogs, I tried the fixes below, but to no avail; we are still facing the discrepancy.
Reference Blogs:
https://forums.aws.amazon.com/thread.jspa?threadID=263860
Spark Catalog w/ AWS Glue: database not found
https://okera.zendesk.com/hc/en-us/articles/360005768434-How-can-we-configure-Spark-to-use-the-Hive-Metastore-for-metadata-
Fixes Tried:
- Enabling Hive support in spark-defaults.conf & SparkSession (Code):
Hive classes are on the CLASSPATH, and I have set the spark.sql.catalogImplementation internal configuration property to hive:
spark.sql.catalogImplementation hive
- Adding Hive metastore config:
.config("hive.metastore.connect.retries", 15)
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
Code Snippet:
SparkSession spark = SparkSession.builder()
    .appName("Test_Glue_Catalog")
    .config("spark.sql.catalogImplementation", "hive")
    .config("hive.metastore.connect.retries", 15)
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate();
Any suggestions in figuring out the root cause for this discrepancy would be really helpful.
Appreciate your help! Thank you!
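A hedged observation, not confirmed in the thread: Hive settings passed through SparkSession.config() reach the Hadoop/Hive configuration only when prefixed with spark.hadoop., so the Glue factory class may be silently dropped under spark-submit while the EMR-provisioned spark-shell picks it up from hive-site.xml. A sketch of that variant (shown in PySpark; the same prefix applies to the Java builder above):
# Sketch: forward the Glue client factory to the Hive config via the
# spark.hadoop. prefix; behavior on EMR 5.11 is an assumption here.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("Test_Glue_Catalog")
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate())

spark.catalog.setCurrentDatabase("test_db")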

Continuously getting error in spark-submit job

I am continuously getting this error:
16/02/29 14:49:40 WARN BlockManager: Block input-0-1456737579500 replicated to only 0 peer(s) instead of 1 peers
while running the spark-submit job.
./spark-submit --jars jar_names --driver-class-path --packages --executor-memory 6g --executor-cores 4 --master local[4] script.py
I also get these warnings as soon as it starts:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN Utils: Your hostname, host_name resolves to a loopback address: 127.0.0.1; using ip instead (on interface eth0)
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
I have also tried using --master[4], but I still get the error.
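A hedged note, not from the thread: the "replicated to only 0 peer(s)" warning is typical of a Spark Streaming receiver running with a single executor (as under --master local[4]), where no second peer exists to replicate blocks to. Dropping replication from the receiver's storage level avoids it; the socket source below is hypothetical:
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

sc = SparkContext("local[4]", "NoReplicationDemo")
ssc = StreamingContext(sc, 5)  # 5-second batches

# MEMORY_AND_DISK (no _2 suffix) keeps a single copy, so Spark never
# attempts peer replication and the warning disappears.
lines = ssc.socketTextStream("localhost", 9999,
                             storageLevel=StorageLevel.MEMORY_AND_DISK)
lines.count().pprint()

ssc.start()
ssc.awaitTermination()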

trouble in adding spark-csv package in Cloudera VM

I am using the Cloudera QuickStart VM to test out some pyspark work. For one task, I need to add the spark-csv package, and here is what I did:
PYSPARK_DRIVER_PYTHON=ipython pyspark -- packages com.databricks:spark-csv_2.10:1.3.0
pyspark started up fine; however, I did get warnings like:
16/02/09 17:41:22 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
16/02/09 17:41:22 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/02/09 17:41:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
then I ran my code in pyspark:
yelp_df = sqlCtx.load(
    source="com.databricks.spark.csv",
    header='true',
    inferSchema='true',
    path='file:///directory/file.csv')
But I am getting an error message:
Py4JJavaError: An error occurred while calling o19.load.: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv at scala.sys.package$.error(package.scala:27)
What could have gone wrong? Thanks in advance for your help.
Try this:
PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.3.0
The space in -- packages is the typo; the flag must be --packages with no space, or the package is never loaded.
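As a hedged aside, not in the original answer: once the package loads, the read.format(...) API available since Spark 1.4 is equivalent to the sqlCtx.load(...) call above:
# Equivalent reader call, assuming the spark-csv package loaded correctly:
yelp_df = (sqlCtx.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("file:///directory/file.csv"))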
