Local spark submit does not return any results - apache-spark

I am trying to run the script below locally using spark-submit, but it does not return any results. The same code works in spark-shell. What am I missing?
Spark Submit
./bin/spark-submit \
--master local \
~/Desktop/projects/S3_Snowflake_Prototype/main.py
Code: main.py
from pyspark.sql import SparkSession

spark = SparkSession\
.builder\
.appName("PythonPi")\
.getOrCreate()
df = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("~/Desktop/projects/S3_Snowflake_Prototype/csv/source.csv")
df.count()
Current Output
21/08/03 08:36:48 WARN Utils: Your hostname, vinays-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.0.3 instead (on interface en0)
21/08/03 08:36:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/08/03 08:36:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/08/03 08:36:49 INFO ShutdownHookManager: Shutdown hook called
21/08/03 08:36:49 INFO ShutdownHookManager: Deleting directory /private/var/folders/_l/r0yqws8j5hl5bsc5rvzjkm5c0000gn/T/spark-a3ea7970-7ef7-4edd-a539-f5c1f264b59d
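A hedged note, not part of the original post: spark-shell echoes the value of every expression, so df.count() appears to "return" there, but spark-submit prints nothing unless you print it yourself; Spark also does not expand ~ in load paths outside the shell. A minimal sketch of both fixes, under those assumptions:
# Sketch: print the count explicitly and expand "~" to an absolute path.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

csv_path = os.path.expanduser("~/Desktop/projects/S3_Snowflake_Prototype/csv/source.csv")
df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load(csv_path)

print(df.count())  # spark-submit does not echo expression values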

Related

Spark accessing remote master

I'm trying to run my code in a Jupyter notebook locally, accessing a Spark cluster on my own server, but without success. This is the code.
I've tried this
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName('SparkApp').setMaster('spark://X.X.X.123:7077')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
and this way
spark = SparkSession.builder.master("spark://X.X.X.123:7077").getOrCreate()
I received this error [updated]
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
New error after opening port 7077:
21/11/23 20:39:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/11/23 20:39:33 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 192.168.0.123:7077
With 'local' it works normally.
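A hedged suggestion, not from the thread: a standalone master only accepts connections on the exact spark:// URL it advertises (shown in the master web UI on port 8080), and the workers must be able to reach the driver back. A sketch under those assumptions; the driver IP here is hypothetical:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("spark://X.X.X.123:7077")  # must match the URL the master UI reports
    # hypothetical: the notebook machine's IP, reachable from the workers
    .config("spark.driver.host", "X.X.X.200")
    .config("spark.driver.bindAddress", "0.0.0.0")
    .getOrCreate())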

How to forward Spark logs to a Jupyter notebook?

I know I can set the log level via spark.sparkContext.setLogLevel('INFO'). Logs such as the following appear in the terminal, but not in the Jupyter notebook.
2019-03-25 11:42:37 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-03-25 11:42:37 WARN SparkConf:66 - In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
2019-03-25 11:42:38 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
The Spark session is created in local mode in a Jupyter notebook cell.
spark = SparkSession \
.builder \
.master('local[7]') \
.appName('Notebook') \
.getOrCreate()
Is there any way to forward the logs to the Jupyter notebook?
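A hedged note, not from the thread: these lines are log4j output from the driver JVM, which writes to the stderr of the process that launched the kernel, not to notebook cells. One workaround (file paths here are assumptions) is to point log4j at a file and read that file from a cell:
# Hypothetical log4j.properties, placed where Spark picks it up
# (e.g. $SPARK_HOME/conf/log4j.properties):
#   log4j.rootCategory=INFO, file
#   log4j.appender.file=org.apache.log4j.FileAppender
#   log4j.appender.file.File=/tmp/spark-driver.log
#   log4j.appender.file.layout=org.apache.log4j.PatternLayout
#   log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Then, in a notebook cell, read what the JVM has logged so far:
with open("/tmp/spark-driver.log") as f:
    print(f.read())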

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use the AWS Glue Data Catalog as its metastore. I followed the steps in the official AWS documentation (reference link below), but I am facing a discrepancy in accessing the Glue Catalog databases/tables. Both the EMR cluster and AWS Glue are in the same account, and the appropriate IAM permissions have been granted.
AWS Documentation : https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
Observations:
- Using spark-shell (From EMR Master Node):
Works. I am able to access Glue DB/tables using the commands below:
spark.catalog.setCurrentDatabase("test_db")
spark.catalog.listTables
- Using spark-submit (From EMR Step):
Does not work. I keep getting the error "Database 'test_db' does not exist".
The error trace is below:
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO HiveMetaStore: 0: get_database: default
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: default
INFO HiveMetaStore: 0: get_database: global_temp
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: global_temp
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/6d0f6b2c-cccd-4e90-a524-93dcc5301e20_resources
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/yarn/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20/_tmp_space.db
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
INFO CodeGenerator: Code generated in 191.063411 ms
INFO CodeGenerator: Code generated in 10.27313 ms
INFO HiveMetaStore: 0: get_database: test_db
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: test_db
WARN ObjectStore: Failed to get database test_db, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Database 'test_db' does not exist.;
at org.apache.spark.sql.internal.CatalogImpl.requireDatabaseExists(CatalogImpl.scala:44)
at org.apache.spark.sql.internal.CatalogImpl.setCurrentDatabase(CatalogImpl.scala:64)
at org.griffin_test.GriffinTest.ingestGriffinRecords(GriffinTest.java:97)
at org.griffin_test.GriffinTest.main(GriffinTest.java:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
After a lot of research and going through many suggestions in blogs, I tried the fixes below, but to no avail; we are still facing the discrepancy.
Reference Blogs:
https://forums.aws.amazon.com/thread.jspa?threadID=263860
Spark Catalog w/ AWS Glue: database not found
https://okera.zendesk.com/hc/en-us/articles/360005768434-How-can-we-configure-Spark-to-use-the-Hive-Metastore-for-metadata-
Fixes Tried:
- Enabling Hive support in spark-defaults.conf & SparkSession (Code):
Hive classes are on the CLASSPATH, and I have set the spark.sql.catalogImplementation internal configuration property to hive:
spark.sql.catalogImplementation hive
- Adding Hive metastore config:
.config("hive.metastore.connect.retries", 15)
.config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
Code Snippet:
SparkSession spark = SparkSession.builder()
    .appName("Test_Glue_Catalog")
    .config("spark.sql.catalogImplementation", "hive")
    .config("hive.metastore.connect.retries", 15)
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate();
Any suggestions in figuring out the root cause for this discrepancy would be really helpful.
Appreciate your help! Thank you!
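A hedged observation, not confirmed in the thread: Hive settings passed through SparkSession.config() reach the Hadoop/Hive configuration only when prefixed with spark.hadoop., so the Glue factory class may be silently dropped under spark-submit while the EMR-provisioned spark-shell picks it up from hive-site.xml. A sketch of that variant (shown in PySpark; the same prefix applies to the Java builder above):
# Sketch: forward the Glue client factory to the Hive config via the
# spark.hadoop. prefix; behavior on EMR 5.11 is an assumption here.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("Test_Glue_Catalog")
    .config("spark.hadoop.hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate())

spark.catalog.setCurrentDatabase("test_db")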

Continuously getting error in spark-submit job

I am continuously getting this error:
16/02/29 14:49:40 WARN BlockManager: Block input-0-1456737579500 replicated to only 0 peer(s) instead of 1 peers
while running the spark-submit job.
./spark-submit --jars jar_names --driver-class-path --packages --executor-memory 6g --executor-cores 4 --master local[4] script.py
I also get these warnings as soon as it starts:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN Utils: Your hostname, host_name resolves to a loopback address: 127.0.0.1; using ip instead (on interface eth0)
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
I have also tried using --master[4], but I still get the error.
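A hedged note, not from the thread: the "replicated to only 0 peer(s)" warning is typical of a Spark Streaming receiver running with a single executor (as under --master local[4]), where no second peer exists to replicate blocks to. Dropping replication from the receiver's storage level avoids it; the socket source below is hypothetical:
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

sc = SparkContext("local[4]", "NoReplicationDemo")
ssc = StreamingContext(sc, 5)  # 5-second batches

# MEMORY_AND_DISK (no _2 suffix) keeps a single copy, so Spark never
# attempts peer replication and the warning disappears.
lines = ssc.socketTextStream("localhost", 9999,
                             storageLevel=StorageLevel.MEMORY_AND_DISK)
lines.count().pprint()

ssc.start()
ssc.awaitTermination()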

trouble in adding spark-csv package in Cloudera VM

I am using the Cloudera QuickStart VM to test out some pyspark work. For one task, I need to add the spark-csv package, and here is what I did:
PYSPARK_DRIVER_PYTHON=ipython pyspark -- packages com.databricks:spark-csv_2.10:1.3.0
pyspark started up fine; however, I did get warnings like:
16/02/09 17:41:22 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
16/02/09 17:41:22 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/02/09 17:41:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
then I ran my code in pyspark:
yelp_df = sqlCtx.load(
    source="com.databricks.spark.csv",
    header='true',
    inferSchema='true',
    path='file:///directory/file.csv')
But I am getting an error message:
Py4JJavaError: An error occurred while calling o19.load.: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv at scala.sys.package$.error(package.scala:27)
What could have gone wrong? Thanks in advance for your help.
Try this:
PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.3.0
The space in -- packages is the typo; the flag must be --packages with no space, or the package is never loaded.
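As a hedged aside, not in the original answer: once the package loads, the read.format(...) API available since Spark 1.4 is equivalent to the sqlCtx.load(...) call above:
# Equivalent reader call, assuming the spark-csv package loaded correctly:
yelp_df = (sqlCtx.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("file:///directory/file.csv"))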
