Windows log path for running Spark HistoryServer

I have followed the instructions on the Spark website for configuring the PySpark HistoryServer locally on Windows, but I cannot get past this error when I run spark-class.cmd org.apache.spark.deploy.history.HistoryServer:
Log directory specified does not exist: file:/tmp/spark-events Did you configure the correct one through spark.history.fs.logDirectory?
spark-defaults.conf has:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:/tmp/spark-events
spark.eventLog.dir file:/tmp/spark-events
I can get pyspark to run and I can successfully submit a .py script with spark-submit.
I have created the directory /tmp/spark-events in both SPARK_HOME and SPARK_HOME/bin because I'm not exactly sure where "file:/tmp/spark-events" should actually be located. Where exactly on Windows do I need to create this directory "tmp/spark-events" so it can be found? Am I missing anything else? Also, even if I change the paths in the .conf file it still gives an error saying it can't find tmp/spark-events, so it seems like it's not even using the values in the config.

You can choose where spark.history.fs.logDirectory points to! In your case, it should be a Windows path. The idea is the following:
You make a directory wherever you would like it, with the proper permissions (more info on that here).
When that is done, you should be able to start up your history server, with spark.history.fs.logDirectory pointing to that directory you made. This is not a relative path w.r.t. your $SPARK_HOME env variable, but an absolute path.
If that works, you should see a rather uninteresting screen (the default port is 18080, so locally you should visit localhost:18080): since none of your applications have written to your directory yet, you will see an empty History Server screen.
If you want to make use of the history server, you have to make your apps write to the event log directory you made. That can be done by adding --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=<your-dir> to your spark-submit call (see the sketch after these steps).
If that was successful, you should see a file in the log directory you made!
Have a look at your History Server (by default on localhost:18080). You should see your application's logs in there!
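For example, a minimal sketch on Windows (the C:\tmp\spark-events location and job.py are just placeholders; use whatever directory you created). In spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir file:///C:/tmp/spark-events
spark.history.fs.logDirectory file:///C:/tmp/spark-events
and on the command line:
mkdir C:\tmp\spark-events
spark-class.cmd org.apache.spark.deploy.history.HistoryServer
spark-submit --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=file:///C:/tmp/spark-events job.py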
Hope this helps :)

Related

Spark Event log directory

I am using PySpark (standalone, without Hadoop etc.) and calling my PySpark jobs as below, and it works fine:
PYSPARK_PYTHON=python3 JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre" SPARK_HOME=~/.local/lib/python3.6/site-packages/pyspark spark-submit job.py --master local
The History Server is running; however, I am trying to configure it to read the correct directory. The settings I have configured are in /pyspark/conf/spark-env.sh:
....
SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=/home/Documents/Junk/logs/ -Dspark.history.fs.logDirectory=/home/Documents/Junk/logs"
....
But when I run jobs, this directory is empty (logs are not written to this directory).
Am I specifying the directory addresses correctly? (These are local addresses in my file system.)
To get it working, do not use spark-env.sh; instead, edit the conf/spark-defaults.conf file with the following (note the file:// prefix):
spark.eventLog.enabled true
spark.eventLog.dir file:///home/user/.local/lib/python3.6/site-packages/pyspark/logs
spark.history.fs.logDirectory file:///home/user/.local/lib/python3.6/site-packages/pyspark/logs
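A quick way to verify, assuming the paths above: restart the history server so it picks up the new settings, run a job, and check that an event log file appears in the configured directory.
$SPARK_HOME/sbin/stop-history-server.sh
$SPARK_HOME/sbin/start-history-server.sh
spark-submit job.py --master local
ls /home/user/.local/lib/python3.6/site-packages/pyspark/logs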

How to specify cluster location in HADOOP_CONF_DIR?

The Spark documentation about submitting applications says:
Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
I am afraid I did not get it. I found that HADOOP_CONF_DIR is set to /etc/hadoop, which contains many shell scripts and configuration files.
Where exactly should I find the cluster location there?
HADOOP_CONF_DIR is the directory with the configuration files that the Hadoop libraries use for various Hadoop-specific stuff. I wrote "various Hadoop-specific stuff" to highlight that there's not much here that is Spark-related.
What's more important is that HADOOP_CONF_DIR can also point to an empty directory (which says to assume the defaults).
To answer your question, you can define the cluster location in yarn-site.xml using yarn.resourcemanager.address. If yarn-site.xml is not found, the YARN cluster is assumed to be available at localhost (a minimal yarn-site.xml is sketched below).
Where should I place yarn-site.xml so spark-submit will use it?
I used to use YARN_CONF_DIR to point to the directory with yarn-site.xml.
YARN_CONF_DIR=/tmp ./bin/spark-shell --master yarn
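As a minimal sketch, the yarn-site.xml inside that directory only needs to point at the ResourceManager (the hostname below is a hypothetical example; 8032 is the default ResourceManager port):
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>my-yarn-master.example.com:8032</value>
  </property>
</configuration>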

pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'" in Windows 10

I have installed Spark 2.2 with winutils on Windows 10. When I try to run pyspark I am facing the exception below:
pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
I have already tried the permission 777 command on the tmp/hive folder as well, but it is not working for now:
winutils.exe chmod -R 777 C:\tmp\hive
After applying this the problem remains the same. I am using PySpark 2.2 on Windows 10.
Here is the spark-shell environment.
Here is the pyspark shell.
Kindly help me figure this out.
Thank you.
I had the same problem using the command 'pyspark' as well as 'spark-shell' (for Scala) on macOS with Apache Spark 2.2. Based on some research I figured it's because of my JDK version 9.0.1, which does not work well with Apache Spark. Both errors got resolved by switching back from JDK 9 to JDK 8.
Maybe that might help with your Windows Spark installation too.
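For example, on macOS you can check which JDK is being picked up and point JAVA_HOME at a JDK 8 install (the java_home utility is macOS-specific; on Windows you would set JAVA_HOME in the environment variables instead):
java -version
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)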
Port 9000?! It must be something Hadoop-related, as I don't remember that port for Spark. I'd recommend using spark-shell first, as that would eliminate any additional "hops", i.e. spark-shell does not require two runtimes for Spark itself and Python.
Given the exception, I'm pretty sure that the issue is that you've got some Hive- or Hadoop-related configuration lying around somewhere and Spark apparently uses it.
The "Caused by" seems to show that 9000 is used when Spark SQL is created which is when Hive-aware subsystem is loaded.
Caused by: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.net.ConnectException: Call From DESKTOP-SDNSD47/192.168.10.143 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused
Please review the environment variables in Windows 10 (possibly using the set command on the command line) and remove anything Hadoop-related.
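For example, from a Windows command prompt you could list any suspicious variables like this (a quick sketch; HADOOP and HIVE are just the usual suspects to search for):
set | findstr /I "HADOOP HIVE"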
Posting this answer for posterity. I faced the same error.
The way I solved it is by first trying out spark-shell instead of pyspark. The error message was more direct.
This gave a better idea: there was an S3 access error.
Next, I checked the EC2 role/instance profile for that instance; it had S3 administrator access.
Then I did a grep for s3:// in all the conf files under the /etc/ directory.
Then I found that core-site.xml contains this property:
<property>
  <!-- URI of NN. Fully qualified. No IP. -->
  <name>fs.defaultFS</name>
  <value>s3://arvind-glue-temp/</value>
</property>
Then I remembered. I had removed HDFS as the default file system and set it to S3. I had created the EC2 instance from an earlier AMI and had forgotten to update the S3 bucket corresponding to the newer account.
Once I updated the S3 bucket to one accessible by the current EC2 instance profile, it worked.
To use Spark on Windows OS, you may follow this guide.
NOTE: Ensure that you have correctly resolved your IP address against your hostname as well as localhost; lack of localhost resolution has caused problems for us in the past.
Also, you should provide the full stack trace as it helps to debug the issue quickly and saves the guesswork.
Let me know if this helps. Cheers.
Try this; it worked for me! Open up a command prompt in administrator mode and then run the command 'pyspark'. This should help open a Spark session without errors.
I also came across this error on Ubuntu 16.04:
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
This is because I had already run ./bin/spark-shell.
So, just kill that spark-shell and re-run ./bin/pyspark.
I also came across this error on macOS 10, and I solved it by using Java 8 instead of Java 9.
When Java 9 is the default version resolved in the environment, pyspark will throw the error below, and you will see a "name 'xx' is not defined" error when trying to access sc, spark, etc. from the shell / Jupyter.
For more details you can see this link.
You must have a hive-site.xml file in the Spark configuration directory.
Changing the port from 9000 to 9083 resolved the problem for me.
Please ensure that the property is updated in both hive-site.xml files, the one under the Hive config directory and the one under the Spark config directory.
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
  <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
For me on Ubuntu, the locations for hive-site.xml are:
/home/hadoop/hive/conf/
and
/home/hadoop/spark/conf/

Custom log4j.properties on AWS EMR

I am unable to override and use a custom log4j.properties on Amazon EMR. I am running Spark on EMR (YARN) and have tried all of the combinations below in spark-submit to try and use the custom log4j.properties:
--driver-java-options "-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
I have also tried picking it up from the local filesystem using file://// instead of hdfs. None of these seem to work. However, I can get this working when running on my local YARN setup.
Any ideas?
Basically, after chatting with support and reading the documentation, I see that there are 2 options available to do this:
1 - Pass the log4j.properties through the configuration passed when bringing up EMR. Jonathan has mentioned this in his answer.
2 - Include the --files /path/to/log4j.properties switch in your spark-submit command. This will distribute the log4j.properties file to the working directory of each Spark executor; then change your -Dlog4j.configuration to point to the filename only: "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties". A sketch of such a call follows below.
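For example, a full spark-submit call using option 2 might look like this (my_job.py and the local path are hypothetical placeholders):
spark-submit --master yarn \
  --files /home/hadoop/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  my_job.py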
log4j knows nothing about HDFS, so it can't accept an hdfs:// path as its configuration file. See here for more information about configuring log4j in general.
To configure log4j on EMR, you may use the Configuration API to add key-value pairs to the log4j.properties file that is loaded by the driver and executors. Specifically, you want to add your Properties to the spark-log4j configuration classification.
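A minimal sketch of such a classification, assuming you want to change the root logging level (the com.mycompany logger is a hypothetical example):
[
  {
    "Classification": "spark-log4j",
    "Properties": {
      "log4j.rootCategory": "WARN, console",
      "log4j.logger.com.mycompany": "DEBUG"
    }
  }
]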
Here is the simplest solution, which worked quite well in my case:
ssh to the EMR cluster via terminal
Go to the conf directory (cd /usr/lib/spark/conf)
Replace the log4j.properties file with your custom values.
Make sure you are editing the file with root user access (type sudo -i to log in as the root user).
Note: All the spark applications running in this cluster will output the logs defined in the custom log4j.properties file.
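Roughly, the commands look like this (the master hostname and the source file path are placeholders):
ssh hadoop@<your-emr-master-dns>
sudo -i
cd /usr/lib/spark/conf
cp /home/hadoop/my-log4j.properties log4j.properties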
For those using Terraform, it can be cumbersome to define a bootstrap action to create a new log4j file within EMR or to update the default one, /etc/spark/conf/log4j.properties, because this will recreate the EMR cluster.
In this case, it's possible to use S3 paths in the --files option, so something like --files=s3://my-bucket/log4j.properties is valid. As mentioned by @Kaptrain, EMR will distribute the log4j.properties file to the working directory of each Spark executor. Then we can pass these two flags to our Spark jobs in order to use the new log4j configuration:
spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties
spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties
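Putting it together, a spark-submit call under this approach might look like the following (my_job.py is a placeholder; the bucket name comes from the example above):
spark-submit --master yarn --deploy-mode cluster \
  --files s3://my-bucket/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  my_job.py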

PDI hadoop file browser no list

I have a Hadoop single-instance cluster configured to run with a specific IP address (instead of localhost) on CentOS Linux. I was able to execute an example MapReduce job correctly. That tells me that the Hadoop setup appears to be fine.
I have also added a couple of data files to the Hadoop filesystem under the "/data" folder, and they are visible through the "dfs" command:
bin/hadoop dfs -ls /data
I am trying to connect to this HDFS system from PDI/Kettle. In the HDFS file browser, if I put in the HDFS connection parameters incorrectly, e.g. an incorrect port, it says it cannot connect to the HDFS server. Instead, if I put in all parameters correctly (server, port, user, password) and click 'connect', it does not give the error, meaning it is able to connect. But the file list only shows "/".
It doesn't show the data folder. What could be going wrong?
I've already tried this:
tried chmod 777 on the data files using "bin/hadoop dfs -chmod -R 777 /data"
tried using root and also the hdfs Linux user in the PDI file browser
tried adding the data files in some other location
re-formatting HDFS several times and adding the data files again
copying the hadoop-core jar file from the Hadoop installation to PDI's extlib
but it does not list the files in the PDI browser. I cannot see anything in the PDI log either... Need quick help... thanks!!!
-abhay
I got past this issue. On Windows, PDI was not logging anything in the log file. I tried the same thing on Linux, where the log showed that it was missing an Apache library, commons-configuration. I downloaded the latest version, put it under the extlib/pentaho folder, and boom! It worked!
