How to forward Spark logs to a Jupyter notebook?

I know I can set the log level via spark.sparkContext.setLogLevel('INFO'). Logs such as the following appear in the terminal, but not in the Jupyter notebook:
2019-03-25 11:42:37 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-03-25 11:42:37 WARN SparkConf:66 - In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
2019-03-25 11:42:38 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
The Spark session is created in local mode in a Jupyter notebook cell:
spark = SparkSession \
    .builder \
    .master('local[7]') \
    .appName('Notebook') \
    .getOrCreate()
Is there any way to forward the logs to the jupyter notebook?
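The messages above come from the JVM driver process, which typically writes to the stderr of the terminal that started Jupyter rather than to the notebook's output stream. One possible workaround, assuming the driver's log4j configuration has been changed to also write to a file, is to read the tail of that file from a notebook cell. The sketch below uses only the standard library; tail_log and the log path are hypothetical names, not part of Spark.

```python
from collections import deque
from pathlib import Path


def tail_log(path, n=20):
    """Return the last n lines of a text log file as a list of strings."""
    with Path(path).expanduser().open("r", encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the most recent n lines while streaming
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


# Hypothetical usage in a notebook cell; the path is an assumption and
# depends on how log4j was configured for the driver:
# print("\n".join(tail_log("/tmp/spark-driver.log", n=50)))
```

Re-running the cell shows the newest lines, so this is a polling approach rather than a live stream.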

Related

Local spark submit does not return any results

I am trying to run the script below locally using spark-submit, but it does not return any results. The same code works in the Spark shell. I am not sure what I am missing.
Spark Submit
./bin/spark-submit \
    --master local \
    ~/Desktop/projects/S3_Snowflake_Prototype/main.py
Code: main.py
spark = SparkSession \
    .builder \
    .appName("PythonPi") \
    .getOrCreate()
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("~/Desktop/projects/S3_Snowflake_Prototype/csv/source.csv")
df.count()
Current Output
21/08/03 08:36:48 WARN Utils: Your hostname, vinays-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.0.3 instead (on interface en0)
21/08/03 08:36:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/08/03 08:36:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/08/03 08:36:49 INFO ShutdownHookManager: Shutdown hook called
21/08/03 08:36:49 INFO ShutdownHookManager: Deleting directory /private/var/folders/_l/r0yqws8j5hl5bsc5rvzjkm5c0000gn/T/spark-a3ea7970-7ef7-4edd-a539-f5c1f264b59d
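
Two differences between an interactive shell and a submitted script may explain the missing output. A script does not echo the value of a bare expression such as df.count(), so the result has to be printed explicitly. Also, a leading ~ inside a Python string literal is not expanded by the shell, so the path passed to load() may not point where intended. The sketch below covers the path part with the standard library only; resolve_input_path is a hypothetical helper name, and the commented lines assume the spark session from main.py.

```python
from pathlib import Path


def resolve_input_path(raw_path):
    """Expand a leading '~' to the user's home directory and return a string path."""
    return str(Path(raw_path).expanduser())


# Hypothetical usage inside main.py (assumes the SparkSession `spark` exists):
# df = spark.read.format("csv") \
#     .option("header", "true") \
#     .option("inferSchema", "true") \
#     .load(resolve_input_path("~/Desktop/projects/S3_Snowflake_Prototype/csv/source.csv"))
# print(df.count())  # a script must print the result explicitly
```

Whether either point is the actual cause here cannot be confirmed from the output shown, since it contains no error message.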

PySpark freezes in client mode with YARN cluster manager

Following these instructions: https://www.linode.com/docs/databases/hadoop/install-configure-run-spark-on-top-of-hadoop-yarn-cluster/ I set up a 3-node cluster and am able to run spark-shell. But when I try to run pyspark, I get these messages:
hadoop@master:~$ pyspark
Python 3.7.1 (default, Dec 14 2018, 19:28:38)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/02/15 21:51:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/02/15 21:51:06 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/02/15 21:51:12 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
and the screen freezes (there are no other messages).
I have no idea how I could solve this issue.
PS: As explained in the link, I first deployed a 3-node Hadoop YARN cluster and then installed Spark on the master node (after launching start-yarn.sh).

PySpark WARN messages

How can I disable the following WARN messages when running PySpark code:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/06/08 21:04:55 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
18/06/08 21:04:55 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
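I spent some time playing with log4j.properties, but cannot figure out exactly which class logs these.
Put this right after creating the Spark context (ERROR hides WARN messages, whereas INFO would show more output, not less). Messages logged during startup, before this call, are not affected:
sc.setLogLevel("ERROR")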

Continuously getting error in spark-submit job

I am continuously getting this error:
16/02/29 14:49:40 WARN BlockManager: Block input-0-1456737579500 replicated to only 0 peer(s) instead of 1 peers
while running the spark-submit job.
./spark-submit --jars jar_names--driver-class-path --packages --executor-memory 6g --executor-cores 4 --master local[4] script.py
I get these messages as soon as the job starts:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN Utils: Your hostname, host_name resolves to a loopback address: 127.0.0.1; using ip instead (on interface eth0)
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
I have also tried using --master[4], but I still get the error.

Trouble adding the spark-csv package in Cloudera VM

I am using the Cloudera QuickStart VM to test out some PySpark work. For one task, I need to add the spark-csv package. Here is what I did:
PYSPARK_DRIVER_PYTHON=ipython pyspark -- packages com.databricks:spark-csv_2.10:1.3.0
pyspark started up fine, but I did get these warnings:
16/02/09 17:41:22 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
16/02/09 17:41:22 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/02/09 17:41:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Then I ran my code in pyspark:
yelp_df = sqlCtx.load(
    source="com.databricks.spark.csv",
    header='true',
    inferSchema='true',
    path='file:///directory/file.csv')
But I am getting an error message:
Py4JJavaError: An error occurred while calling o19.load.
: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
    at scala.sys.package$.error(package.scala:27)
What could have gone wrong? Thanks in advance for your help.
Try this:
PYSPARK_DRIVER_PYTHON=ipython pyspark --packages com.databricks:spark-csv_2.10:1.3.0
That is, without the space between -- and packages; your command has a typo.
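To see why the extra space matters, it can help to look at how the two commands are split into arguments. The sketch below uses Python's shlex.split as a stand-in for the shell's word splitting (an approximation for illustration, not the launcher's actual parsing). With the space, --packages never reaches the program as a single option token.

```python
import shlex

# The command from the question (with the stray space) and the corrected one.
broken = "pyspark -- packages com.databricks:spark-csv_2.10:1.3.0"
fixed = "pyspark --packages com.databricks:spark-csv_2.10:1.3.0"

broken_args = shlex.split(broken)
fixed_args = shlex.split(fixed)

print(broken_args)  # '--' and 'packages' arrive as two separate arguments
print(fixed_args)   # '--packages' arrives as one option
print("--packages" in broken_args, "--packages" in fixed_args)  # False True
```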
