How to read a .hql file (to run Hive queries) in PySpark - apache-spark

I have a .hql file with a huge number of queries. It runs slowly in Hive, so I want to read and run the .hql file using PySpark/Spark SQL.
I tried count = sqlContext.sql(open("file.hql").read()).count(), but it gives the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/CDH-5.7.1-1.cdh5.7.1/lib/spark/python/pyspark/sql/context.py", line 580, in sql
    return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File "/data/CDH-5.7.1-1.cdh5.7.1/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/data/CDH-5.7.1-1.cdh5.7.1/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"missing EOF at ';' near 'db'; line 1 pos 36"
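sqlContext.sql() executes a single statement, so the first semicolon in the script is what the parser trips over. A minimal sketch of a workaround, assuming the file is plain ;-separated HiveQL with no semicolons inside string literals or comments:

    # split the script into individual ;-separated statements and run them one by one
    with open("file.hql") as f:
        statements = [s.strip() for s in f.read().split(";") if s.strip()]

    result = None
    for stmt in statements:
        result = sqlContext.sql(stmt)  # DDL/DML statements return empty DataFrames

    # if the last statement is the SELECT you care about, count its rows
    if result is not None:
        count = result.count()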

Related

Azure-quantum: problems with job submission

I use pyQuil with Azure Quantum and submit jobs with the run_batch method of the AzureQuantumComputer class. Batches of up to 10 circuits work fine, but larger batches result in the error below.
Traceback (most recent call last):
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1436, in _deserialize
    found_value = key_extractor(attr, attr_desc, data)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1180, in rest_key_extractor
    return working_data.get(key)
AttributeError: 'str' object has no attribute 'get'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1509, in failsafe_deserialize
    return self(target_obj, data, content_type=content_type)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1376, in __call__
    return self._deserialize(target_obj, data)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1454, in _deserialize
    raise_with_traceback(DeserializationError, msg, err)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\exceptions.py", line 51, in raise_with_traceback
    raise error.with_traceback(exc_traceback)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1436, in _deserialize
    found_value = key_extractor(attr, attr_desc, data)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\msrest\serialization.py", line 1180, in rest_key_extractor
    return working_data.get(key)
azure.core.exceptions.DeserializationError: ("Unable to deserialize to object: type, AttributeError: 'str' object has no attribute 'get'", AttributeError("'str' object has no attribute 'get'"))
Traceback (most recent call last):
  File "C:\Users\Enter\PycharmProjects\QREM_pipline_development\pyquil_experiments.py", line 209, in <module>
    unprocessed_results_now = pyquil_utilities.run_batches_parametric(backend_name=backend_name,
  File "C:\Users\Enter\PycharmProjects\QREM_SECRET_DEVELOPMENT_LOC\backends_support\pyquil\pyquil_utilities.py", line 415, in run_batches_parametric
    results = backend_instance.run_batch(executable,
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\pyquil_for_azure_quantum\__init__.py", line 141, in run_batch
    return qam.run_batch(executable, memory_map)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\pyquil_for_azure_quantum\__init__.py", line 336, in run_batch
    job = self._target.submit(
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\target\rigetti\target.py", line 183, in submit
    return super().submit(input_data, name, input_params, **kwargs)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\target\target.py", line 141, in submit
    return Job.from_input_data(
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\job\base_job.py", line 117, in from_input_data
    return cls.from_storage_uri(
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\job\base_job.py", line 207, in from_storage_uri
    job.submit()
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\job\job.py", line 45, in submit
    job = self.workspace.submit_job(self)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\workspace.py", line 265, in submit_job
    details = client.create(
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\core\tracing\decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "C:\Users\Enter\anaconda3\envs\qiskit_env\lib\site-packages\azure\quantum\_client\operations\_jobs_operations.py", line 387, in create
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: Operation returned an invalid status 'Forbidden'
Content:
403 Forbidden
403 Forbidden
Microsoft-Azure-Application-Gateway/v2
I tried running different circuits; the failure seems to depend only on the number of circuits in a batch, not on the structure of any individual circuit. The programs are compiled to native Quil locally.
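Since batches of up to 10 circuits go through, one workaround to try is splitting a large batch into groups of at most 10 and submitting each group separately. A rough sketch under that assumption; BATCH_LIMIT, chunked, and the per-group loop are mine, not part of the pyquil-for-azure-quantum API (only run_batch(executable, memory_map) appears in the traceback above):

    BATCH_LIMIT = 10  # largest batch size observed to succeed

    def chunked(items, size):
        # yield successive slices of `items` with at most `size` elements each
        for start in range(0, len(items), size):
            yield items[start:start + size]

    # placeholder list standing in for your compiled native-Quil programs
    circuits = ["circuit_%d" % i for i in range(25)]

    for group in chunked(circuits, BATCH_LIMIT):
        # build and submit one batch per group here (e.g. via run_batch),
        # then collect the per-group results
        print("would submit a batch of %d circuits" % len(group))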

Mismatch in number of rows imported into cassandra table (COPY command)

I am trying to dump a CSV file into a Cassandra table using the COPY command, but the number of rows in my CSV file and the number of rows in Cassandra are not consistent.
Number of rows in the CSV file: 49765 (excluding the header)
Number of rows in the Cassandra table:
cqlsh:test_df> select Count(*) from test_table;
count
-------
46982
(1 rows)
Warning:
Aggregation query used without partition key
COPY command:
COPY test_table (column1,column2,column3) from 'temp.csv' with delimiter = ',' and header = True;
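For reference, a quick client-side check of the CSV row count to compare against the cqlsh count (a minimal sketch; it assumes temp.csv has the single header row implied by header = True above):

    import csv

    # count data rows in temp.csv, excluding the header, to compare with
    # SELECT count(*) in cqlsh
    with open('temp.csv', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        print(sum(1 for _ in reader))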
Error:
Starting copy of test_df.test_bhavcopy with columns [symbol, instrument, expiry_dt, strike_pr, option_typ, open, high, low, close, settle_pr, contracts, val_inlakh, open_int, ch_in_oi, price_date, key].
Process ImportProcess-1 (ImportProcess-2 and ImportProcess-3 fail the same way):
Traceback (most recent call last):
  File "X:\Anaconda\lib\multiprocessing\process.py", line 267, in _bootstrap
    self.run()
  File "X:\apache-cassandra-3.11.3\bin\..\pylib\cqlshlib\copyutil.py", line 2328, in run
    self.close()
  File "X:\apache-cassandra-3.11.3\bin\..\pylib\cqlshlib\copyutil.py", line 2332, in close
    self._session.cluster.shutdown()
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\cluster.py", line 1259, in shutdown
    self.control_connection.shutdown()
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\cluster.py", line 2850, in shutdown
    self._connection.close()
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\io\asyncorereactor.py", line 373, in close
    AsyncoreConnection.create_timer(0, partial(asyncore.dispatcher.close, self))
  File "X:\apache-cassandra-3.11.3\bin\..\lib\cassandra-driver-internal-only-3.11.0-bb96859b.zip\cassandra-driver-3.11.0-bb96859b\cassandra\io\asyncorereactor.py", line 335, in create_timer
    cls._loop.add_timer(timer)
AttributeError: 'NoneType' object has no attribute 'add_timer'
Processed: 49765 rows; Rate: 4193 rows/s; Avg. rate: 3906 rows/s
49765 rows imported from 1 files in 12.742 seconds (0 skipped).
Maybe it's due to this error.
Found a fix:
I edited asyncorereactor.py in cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/io/asyncorereactor.py, changing the AsyncoreConnection.create_timer() call to self.create_timer(), as suggested in this issue:
https://datastax-oss.atlassian.net/browse/PYTHON-862?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel
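A sketch of that edit inside close() in asyncorereactor.py (the "before" line is taken from the traceback above; the "after" line is the change described in the linked issue):

    # cassandra/io/asyncorereactor.py, inside close()

    # before: the call that appears in the traceback above
    AsyncoreConnection.create_timer(0, partial(asyncore.dispatcher.close, self))

    # after: the edit described above, per PYTHON-862
    self.create_timer(0, partial(asyncore.dispatcher.close, self))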

Spark EC2 - Broken Pipe error

Getting the following error while spinning up a Spark (1.6) cluster on EC2 with the spark-ec2 script. I tried the --resume option, but the error is consistent.
packet_write_wait: Connection to <<host>>: Broken pipe
Traceback (most recent call last):
  File "./spark_ec2.py", line 1535, in <module>
    main()
  File "./spark_ec2.py", line 1527, in main
    real_main()
  File "./spark_ec2.py", line 1363, in real_main
    setup_cluster(conn, master_nodes, slave_nodes, opts, True)
  File "./spark_ec2.py", line 811, in setup_cluster
    dot_ssh_tar = ssh_read(master, opts, ['tar', 'c', '.ssh'])
  File "./spark_ec2.py", line 1209, in ssh_read
    ssh_command(opts) + ['%s@%s' % (opts.user, host), stringify_command(command)])
  File "./spark_ec2.py", line 1203, in _check_output
    raise subprocess.CalledProcessError(retcode, cmd, output=output)

SyntaxError raised when using joblib with lxml on python 3.5

I am trying to parallelize the task of correcting texts in many documents with Python, so I naturally found joblib. I want each task to be the correction of a single document. Here is the structure of the code:
if __name__ == '__main__':
    lexicon = build_compact_lexicon()
    from joblib import Parallel, delayed
    import multiprocessing

    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(delayed(find_errors)('GDL', i, 1, lexicon) for i in range(1798, 1820))
I am using the function find_errors, summarized here:
def find_errors(newspaper, year, month, lexicon):
    # parse the input newspaper text data using the etree parser from lxml
    # detect errors in the text
    return found_errors_type1, found_errors_type2, found_errors_type3
This raises a few errors:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 130, in __call__
return self.func(*args, **kwargs)
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 72, in <listcomp>
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "hellowordParallel.py", line 85, in find_errors
tree = etree.parse(xml_file_path)
File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:79801)
File "src/lxml/parser.pxi", line 1805, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:116293)
TypeError: cannot parse from 'NoneType'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 392, in find_cookie
line_string = line.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 24: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 139, in __call__
tb_offset=1)
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/format_stack.py", line 373, in format_exc
frames = format_records(records)
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/format_stack.py", line 274, in format_records
for token in generate_tokens(linereader):
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 514, in _tokenize
line = readline()
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/format_stack.py", line 265, in linereader
line = getline(file, lnum[0])
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/linecache.py", line 16, in getline
lines = getlines(filename, module_globals)
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/linecache.py", line 47, in getlines
return updatecache(filename, module_globals)
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/linecache.py", line 136, in updatecache
with tokenize.open(fullname) as fp:
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 456, in open
encoding, lines = detect_encoding(buffer.readline)
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 433, in detect_encoding
encoding = find_cookie(first)
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 397, in find_cookie
raise SyntaxError(msg)
File "<string>", line None
SyntaxError: invalid or missing encoding declaration for '/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/lxml/etree.so'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "hellowordParallel.py", line 160, in <module>
results = Parallel(n_jobs=num_cores)(delayed(find_errors)('GDL', i, 1, lexicon) for i in range(1798, 1820))
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 810, in __call__
self.retrieve()
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 727, in retrieve
self._output.extend(job.get())
File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
SyntaxError: invalid or missing encoding declaration for '/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/lxml/etree.so'
I don't understand whether this is due to something configuration-related or whether my function doesn't fit a parallel implementation... (I guess it should...)
Has this happened to any of you before?
I hope my question is clear and that there is enough information for someone to give me some help!
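For what it's worth, the root failure in the worker is the TypeError: cannot parse from 'NoneType' raised by etree.parse(xml_file_path); the SyntaxError only appears afterwards, when joblib tries to format that traceback and trips over lxml's compiled etree.so. A minimal sketch of guarding the path before parsing (build_xml_path is a hypothetical stand-in for however the real code derives the document path):

    import os
    from lxml import etree

    def find_errors(newspaper, year, month, lexicon):
        # build_xml_path is a hypothetical helper standing in for however the
        # original code derives the document path from (newspaper, year, month)
        xml_file_path = build_xml_path(newspaper, year, month)

        # etree.parse(None) is exactly what raises
        # "TypeError: cannot parse from 'NoneType'" in the traceback above,
        # so bail out early if the path could not be resolved
        if not xml_file_path or not os.path.exists(xml_file_path):
            return [], [], []

        tree = etree.parse(xml_file_path)
        found_errors_type1, found_errors_type2, found_errors_type3 = [], [], []
        # ... detect errors in tree using `lexicon` and fill the three lists ...
        return found_errors_type1, found_errors_type2, found_errors_type3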

"Illegal MIME-Type" Error when trying to play_uri() a song using SoCo

SoCo is a Python library for controlling Sonos speakers. I'm trying to play a locally stored song:
device = SoCo("192.168.209.7")
device.play_uri("/home/myuser/mysong.ogg")
If I've read the docs correctly, mysong.ogg should start playing on the Sonos. However, this code immediately results in the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/site-packages/soco/core.py", line 95, in inner_function
return function(self, *args, **kwargs)
File "/usr/lib/python3.4/site-packages/soco/core.py", line 470, in play_uri
('CurrentURIMetaData', meta)
File "/usr/lib/python3.4/site-packages/soco/services.py", line 156, in _dispatcher
return self.send_command(action, *args, **kwargs)
File "/usr/lib/python3.4/site-packages/soco/services.py", line 357, in send_command
self.handle_upnp_error(response.text)
File "/usr/lib/python3.4/site-packages/soco/services.py", line 417, in handle_upnp_error
error_xml=xml_error
soco.exceptions.SoCoUPnPException: UPnP Error 714 received: Illegal MIME-Type from 192.168.209.7
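One thing worth checking: play_uri hands the URI to the speaker, which has to fetch the track itself, so a bare local filesystem path like /home/myuser/mysong.ogg may be what triggers the "Illegal MIME-Type" rejection. A minimal sketch of serving the file over HTTP first (192.168.209.10:8000 is a placeholder for the address of the machine running the script):

    # In a separate shell, serve the directory containing the song, e.g.:
    #   python3 -m http.server 8000 --directory /home/myuser
    from soco import SoCo

    device = SoCo("192.168.209.7")
    # point the speaker at an http:// URI it can fetch; replace 192.168.209.10
    # with the address of the machine running http.server
    device.play_uri("http://192.168.209.10:8000/mysong.ogg")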
