Spark Notebook not working on HUE for EMR - apache-spark

OK, I've got Hue 3.8 pointing at my EMR cluster, and it's mostly working. The one thing I'm missing that I really care about at this point is the Spark notebook.
When I attempt to choose a language for a snippet, there is an error, "No usable value for lang Did not find value which can be converted into java.lang.String (error 400)", and the logs say this:
[03/Jun/2015 11:38:59 -0700] decorators ERROR error running <function create_session at 0x7fe30acd1d70>
Traceback (most recent call last):
File "/usr/local/hue/apps/spark/src/spark/decorators.py", line 77, in decorator
return func(*args, **kwargs)
File "/usr/local/hue/apps/spark/src/spark/api.py", line 44, in create_session
response['session'] = get_api(request.user, snippet).create_session(lang=snippet['type'])
File "/usr/local/hue/apps/spark/src/spark/models.py", line 284, in create_session
response = api.create_session(kind=lang)
File "/usr/local/hue/apps/spark/src/spark/job_server_api.py", line 87, in create_session
return self._root.post('sessions', data=json.dumps(kwargs), contenttype='application/json')
File "/usr/local/hue/desktop/core/src/desktop/lib/rest/resource.py", line 122, in post
return self.invoke("POST", relpath, params, data, self._make_headers(contenttype, headers))
File "/usr/local/hue/desktop/core/src/desktop/lib/rest/resource.py", line 78, in invoke
urlencode=self._urlencode)
File "/usr/local/hue/desktop/core/src/desktop/lib/rest/http_client.py", line 161, in execute
raise self._exc_class(ex)
RestException: No usable value for lang
Did not find value which can be converted into java.lang.String (error 400)
Is this a problem with the software or my config?
This might be tied to the fact that attempting to run sudo ./hue livy_server yields:
Failed to run spark-submit executable: java.io.IOException:
Cannot run program "spark-submit": error=2, No such file or directory
spark-submit does in fact exist and is on the PATH.

The spark-submit command comes from Spark; it needs to be present on the Hue machine.
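As a quick sanity check, here is a minimal sketch (my own addition, not part of Hue or Livy) that verifies whether spark-submit is resolvable on the PATH of the process you run it from, since a daemonized Livy server can see a different PATH than your login shell:
import os
import shutil

# "error=2, No such file or directory" usually means the launching process
# cannot resolve spark-submit on its own PATH, even if your shell can.
location = shutil.which("spark-submit")
if location:
    print("spark-submit found at:", location)
else:
    print("spark-submit not on PATH; PATH =", os.environ.get("PATH"))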

Related

Python Firestore insert return error 503 DNS resolution failed

I have a problem during the execution of my Python script from crontab; the script performs an insert operation into the Firestore database.
db.collection(u'ab').document(str(row["Name"])).collection(str(row["id"])).document(str(row2["id"])).set(self.packStructure(row2))
When I execute it normally with the python3 script.py command it works, but when I execute it from crontab it returns the following error:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/axatel/angel_bridge/esportazione_firebase/main.py", line 23, in <module>
dato.getDati(dato, db, cursor, cursor2, fdb, select, anagrafica)
File "/home/axatel/angel_bridge/esportazione_firebase/dati.py", line 19, in getDati
db.collection(u'ab').document(str(row["Name"])).collection(str(row["id"])).document(str(row2["id"])).set(self.packStructure(row2))
File "/home/axatel/.local/lib/python3.7/site-packages/google/cloud/firestore_v1/document.py", line 234, in set
write_results = batch.commit()
File "/home/axatel/.local/lib/python3.7/site-packages/google/cloud/firestore_v1/batch.py", line 147, in commit
metadata=self._client._rpc_metadata,
File "/home/axatel/.local/lib/python3.7/site-packages/google/cloud/firestore_v1/gapic/firestore_client.py", line 1121, in commit
request, retry=retry, timeout=timeout, metadata=metadata
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 145, in __call__
return wrapped_func(*args, **kwargs)
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
on_error=on_error,
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/retry.py", line 184, in retry_target
return target()
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
return func(*args, **kwargs)
File "/home/axatel/.local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
six.raise_from(exceptions.from_grpc_error(exc), exc)
File "<string>", line 3, in raise_from
google.api_core.exceptions.ServiceUnavailable: 503 DNS resolution failed for service: firestore.googleapis.com:443
I really don't understand what the problem is, because the database connection works every time the script is started either way.
Is there a fix for this kind of issue?
I found something that might be helpful. There is a nice troubleshooting guide, and there is a part of it which seems to be related:
If your command works by invoking a runtime like python some-command.py perform a few checks to determine that the runtime
version and environment is correct. Each language runtime has quirks
that can cause unexpected behavior under crontab.
For python you might find that your web app is using a virtual
environment you need to invoke in your crontab.
I haven't seen such an error running the Firestore API, but this seems to match your issue.
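One way to confirm this (a small diagnostic sketch I am adding, not part of the original script) is to log which interpreter and environment cron actually uses and compare it with an interactive run:
import os
import sys

# Run once from the shell and once from crontab, then diff the output.
# A different interpreter, PATH, or missing VIRTUAL_ENV usually explains
# "works with python3 script.py but fails under cron" behaviour.
print("interpreter:", sys.executable)
print("version:", sys.version)
print("PATH:", os.environ.get("PATH"))
print("VIRTUAL_ENV:", os.environ.get("VIRTUAL_ENV"))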
I found the solution.
The problem occurred because the sleep() timeout value was lower than expected, so the database connection function started too early during the machine's boot phase. Increasing this value to 45 or 60 seconds fixed the problem.
# time.sleep(10)  # old version
time.sleep(60)  # working version
fdb = firebaseConnection()

def firebaseConnection():
    # firebase connection
    cred = credentials.Certificate('/database/axatel.json')
    firebase_admin.initialize_app(cred)
    fdb = firestore.client()
    if fdb:
        return fdb
    else:
        print("Error")
        sys.exit()
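As an alternative to a fixed delay, a sketch like the following (a hypothetical helper, not from the original script) could wait until DNS for the Firestore endpoint actually resolves before connecting:
import socket
import sys
import time

def wait_for_dns(host="firestore.googleapis.com", port=443, timeout=120, interval=5):
    # Poll name resolution instead of guessing how long boot takes.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            socket.getaddrinfo(host, port)
            return True
        except socket.gaierror:
            time.sleep(interval)
    return False

if wait_for_dns():
    fdb = firebaseConnection()
else:
    sys.exit("DNS for firestore.googleapis.com still unavailable")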

Unable to fix UnknownHostException while reading a csv file from a HDFS dir

My Spark program is running on a server: serverA. I am running the code from the pyspark terminal. From that program, I am trying to read a CSV file from another cluster set up on another server (server: serverB, HDFS cluster: clusterB), as below:
spark = SparkSession.builder \
    .master('yarn') \
    .appName("Detector") \
    .config('spark.app.name', 'dummy_App') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.yarn.keytab', '/home/testuser/testuser.keytab') \
    .config('spark.yarn.principal', 'krbtgt/HADOOP.NAME.COM#NAME.COM') \
    .config('spark.executor.instances', '1') \
    .config('hadoop.security.authentication', 'kerberos') \
    .config('spark.yarn.access.hadoopFileSystems', 'hdfs://clusterB') \
    .config('spark.yarn.principal', 'testuser#NAME.COM') \
    .getOrCreate()
The file I am trying to read is on cluster: clusterB as below:
(base) testuser#hdptetl:[~] {46} $ hadoop fs -df -h
Filesystem Size Used Available Use%
hdfs://clusterB 787.3 T 554.5 T 230.7 T 70%
The keytab details (keytab path, KDC realm) I mentioned in the Spark configuration are present on the server serverB.
When I try to load the file as:
csv_df = spark.read.format('csv').load('hdfs://botest01/test/mr/wc.txt')
The code results in an UnknownHostException, as below:
>>> tdf = spark.read.format('csv').load('hdfs://clusterB/test/mr/wc.txt')
20/07/15 15:40:36 WARN FileStreamSink: Error while looking for metadata directory.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'java.net.UnknownHostException: clusterB'
Could anyone let me know what mistake I made here and how I can fix it?

Unable to Start Scheduler

I am new to Python and trying to install Airflow on my Mac by following this tutorial.
While these two commands work fine:
$ airflow initdb
$ airflow webserver -p 8080
The scheduler command (airflow scheduler) throws the following error:
[2020-02-18 13:18:09,012] {scheduler_job.py:1382} ERROR - Exception when executing execute_helper
Traceback (most recent call last):
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1380, in _execute
self._execute_helper()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1413, in _execute_helper
self.processor_agent.start()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 554, in start
self._process.start()
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
return Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'SchedulerJob._execute.<locals>.processor_factory'
[2020-02-18 13:18:09,035] {helpers.py:322} INFO - Sending Signals.SIGTERM to GPID None
Traceback (most recent call last):
File "/Users/mac/Workspace/airflow/airflow_venv/bin/airflow", line 37, in <module>
args.func(args)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/cli.py", line 75, in wrapper
return f(*args, **kwargs)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/bin/cli.py", line 1040, in scheduler
job.run()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 221, in run
self._execute()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1384, in _execute
self.processor_agent.end()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 707, in end
reap_process_group(self._process.pid, log=self.log)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 324, in reap_process_group
signal_procs(sig)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 293, in signal_procs
os.killpg(pgid, sig)
TypeError: an integer is required (got type NoneType)
EDIT: Python 3.8 is supported now https://github.com/apache/airflow#requirements. So this answer might not be relevant now.
This is due to the Python version you are using. Airflow doesn't support Python 3.8 yet: https://github.com/apache/airflow#stable-version-1109.
Downgrade your Python to 3.7 and check.
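For context, the AttributeError in the traceback is what Python raises when it tries to pickle a function defined inside another function; Python 3.8 changed the default multiprocessing start method on macOS to "spawn", which needs to pickle the process target. A minimal illustration (my own example, not Airflow code):
import pickle

def make_worker():
    def worker():  # a local function, like SchedulerJob._execute.<locals>.processor_factory
        pass
    return worker

# The "spawn" start method pickles the process target; local functions
# cannot be pickled, hence the scheduler's AttributeError.
try:
    pickle.dumps(make_worker())
except (AttributeError, pickle.PicklingError) as e:
    print(e)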
Maybe there are some compatibility problems?
Using Python 3.6.10 and Airflow v1.10.4, I can get Airflow running. Maybe you could try some other versions?
This worked for me!
1. Make sure you are using a Celery version that supports your other packages, such as RabbitMQ (v5 doesn't support AMQP in its usual format); my advice is to use v4.6.x.
2. This has nothing to do with the Python version if you are using Airflow v2.0.
3. Simply make yourself happy with airflow db reset (the command may differ if you are using an Airflow version < 2.0).
4. Avoid deleting any DAG the way you would delete a file; use the airflow dag ... commands to do so (it makes a mess in your environment that you won't like, trust me on this).
Wish you luck bearing with the Python stuff.

Pyspark in Docker based on Hortonworks 2.6.1 is throwing error with EnableHiveSupport()

I am trying to build an edge node using Docker with HDP 2.6.1. Everything is available and running except Spark support. I was able to install and run pyspark, but only when I comment out enableHiveSupport(). I have copied hive-site.xml from Ambari to /etc/spark2/conf as well, and all the Spark confs match the cluster settings. But I still get this error:
17/10/27 02:35:57 WARN conf.HiveConf: HiveConf of name hive.groupby.position.alias does not exist
17/10/27 02:35:57 WARN conf.HiveConf: HiveConf of name hive.mv.files.thread does not exist
Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/shell.py", line 43, in <module>
spark = SparkSession.builder\
File "/usr/hdp/current/spark2-client/python/pyspark/sql/session.py", line 187, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
>>> spark.createDataFrame([(1,'a'), (2,'b')], ['id', 'nm'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'spark' is not defined
I have tried searching for this error, but all the results I get are possible Windows errors related to permissions and a missing hive-site.xml. But I am building it on centos:7.3.1611 and installing the following:
RUN wget http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.1.0/hdp.repo
RUN cp hdp.repo /etc/yum.repos.d
RUN yum -y install hadoop sqoop spark2_2_6_1_0_129-master spark2_2_6_1_0_129-python hive-hcatalog
So the solution to the above problem is that hive-site.xml needs to contain only the hive.metastore.uris property and NOTHING ELSE (reference: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_spark-component-guide/content/spark-config-hive.html). Once you take the other properties out, it works like a charm!
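If you'd rather confirm the same thing without editing files, a sketch like this (with a placeholder metastore host; substitute the hive.metastore.uris value from your cluster) points the session at the metastore directly from PySpark, assuming the usual spark.hadoop.* passthrough for Hadoop/Hive properties:
from pyspark.sql import SparkSession

# 'thrift://metastore-host:9083' is a placeholder; use your cluster's
# hive.metastore.uris value from Ambari.
spark = SparkSession.builder \
    .appName("hive_smoke_test") \
    .config("spark.hadoop.hive.metastore.uris", "thrift://metastore-host:9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("show databases").show()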

Why "Error loading column family" in OpsCenter when reading Column Family?

I'm trying to use OpsCenter with my local multi-node development cluster created with CCM. I have manually installed and configured the agents for each node using these instructions. I created my custom keyspace and its column families by uploading a SOURCE file in the cqlsh interface.
I get the following error when clicking on Data > MyKeySpace > MyColumnFamily:
Error loading column family: Call to /test_cluster/keyspaces/flashcardsapp/cf/tag timed out.
I am however able to view the column families in the OpsCenter keyspace.
I am seeing the following in the OpsCenter log:
2015-03-14 07:58:35-0600 [] Unhandled Error
Traceback (most recent call last):
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 1076, in gotResult
_inlineCallbacks(r, g, deferred)
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 1063, in _inlineCallbacks
deferred.callback(e.value)
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 361, in callback
self._startRunCallbacks(result)
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 455, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/Users/justinrobbins/Documents/dev/cassandra/opscenter-5.1.0/lib/py-osx/2.7/amd64/twisted/internet/defer.py", line 542, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "build/lib/python2.7/site-packages/opscenterd/TwistedRouter.py", line 226, in controllerSucceeded
File "build/lib/python2.7/site-packages/opscenterd/WebServer.py", line 3953, in default_write
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 250, in dumps
sort_keys=sort_keys, **kw).encode(obj)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
File "build/lib/python2.7/site-packages/opscenterd/WebServer.py", line 261, in default
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 184, in default
raise TypeError(repr(o) + " is not JSON serializable")
exceptions.TypeError: UUID('457d5450-ca0b-11e4-a99a-53fff8597215') is not JSON serializable
My environment is as follows:
Cassandra: dsc-cassandra-2.1.2
OpsCenter: opscenter-5.1.0
Agents: datastax-agent-5.1.0
OS: OSX 10.10.1
There’s a known bug in OpsCenter where UUID columns in Cassandra 2.1.x are not handled properly. I am not aware of any workarounds (switching from UUID columns or downgrading C* to 2.0.x should work, but it might be a bit too much work.)
It’s going to be fixed in the upcoming patch release of OpsCenter 5.1 (not 5.1.1 though)
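The traceback boils down to the standard library json encoder not knowing how to serialize uuid.UUID values. A tiny illustration of the failure mode, outside OpsCenter (and the kind of string conversion a fix has to perform):
import json
import uuid

u = uuid.uuid1()
try:
    json.dumps(u)          # TypeError: ... is not JSON serializable
except TypeError as e:
    print(e)

print(json.dumps(str(u)))  # serializing the string form works fine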
