Dask dataframe throws an error when reading a Parquet file in S3 - python-3.x

I'm trying to use Dask to read a Parquet table in S3 like this:
import dask.dataframe as dd

s3_path = "s3://my_bucket/my_table"
times = dd.read_parquet(
    s3_path,
    storage_options={
        "client_kwargs": {
            "endpoint_url": bucket_endpoint_url,
        },
        "profile_name": bucket_profile,
    },
)
result = (times.groupby(['account', 'system_id'])['exec_time']
               .sum()
               .nlargest(num_row)
               .compute()
               .reset_index()
               .to_dict(orient='records'))
I only have pyarrow and s3fs installed.
When I read it using a LocalCluster like below, it works great:

client = Client(LocalCluster(n_workers=1, threads_per_worker=1, processes=False))

But when I read it using a real cluster, it throws this error:
client = Client('master_ip:8786')
TypeError: ('Could not serialize object of type tuple.', "(<function apply at 0x7f9f9c9942f0>, <function _apply_chunk at 0x7f9f76ed1510>, [(<function _read_pyarrow_parquet_piece at 0x7f9f76eedea0>, <dask.bytes.s3.DaskS3FileSystem object at 0x7f9f5a83edd8>, ParquetDatasetPiece('my_bucket/my_table/0a0a6e71438a43cd82985578247d5c97.parquet', row_group=None, partition_keys=[]), ['account', 'system_id', 'upload_time', 'name', 'exec_time'], [], False, <pyarrow.parquet.ParquetPartitions object at 0x7f9f5a565278>, []), 'account', 'system_id'], {'chunk': <methodcaller: sum>, 'columns': 'exec_time'})")
distributed.batched - ERROR - Error in batched write
Traceback (most recent call last):
File "/project_folder/lib64/python3.6/site-packages/distributed/batched.py", line 94, in _background_send
on_error='raise')
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/project_folder/lib64/python3.6/site-packages/distributed/comm/tcp.py", line 224, in write
'recipient': self._peer_addr})
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/project_folder/lib64/python3.6/site-packages/distributed/comm/utils.py", line 50, in to_frames
res = yield offload(_to_frames)
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/project_folder/lib64/python3.6/site-packages/distributed/comm/utils.py", line 43, in _to_frames
context=context))
File "/project_folder/lib64/python3.6/site-packages/distributed/protocol/core.py", line 54, in dumps
for key, value in data.items()
File "/project_folder/lib64/python3.6/site-packages/distributed/protocol/core.py", line 55, in <dictcomp>
if type(value) is Serialize}
File "/project_folder/lib64/python3.6/site-packages/distributed/protocol/serialize.py", line 164, in serialize
raise TypeError(msg, str(x)[:10000])
Do you know what the problem could be?
Thanks,

Serialisation of pyarrow objects has been problematic in pyarrow 0.13.0; this should be fixed in the next release. Can you try downgrading your pyarrow version?
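In the meantime it is worth confirming that the client, scheduler, and workers all agree on the pyarrow version, since only the distributed code path serializes these objects. A minimal sketch, reusing the scheduler address from the question (check=True is available in recent distributed releases and raises on a mismatch):

from distributed import Client

client = Client('master_ip:8786')
# Collect package versions from the client, the scheduler, and every
# worker; check=True raises an error if they disagree.
client.get_versions(check=True)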

Related

ValueError returning a function from a function in a model during makemigrations

I have a model defined in Django 3.1 like
class MyModel(models.Model):
    def get_path(path):
        def wrapper(instance, filename):
            # code to make the path
            return 'path/created/for/filename.jpg'
        return wrapper

    image = models.ImageField(upload_to=get_path('files/'))
but when I do
python3 manage.py makemigrations
I am getting a ValueError:
Migrations for 'myproject':
myproject/migrations/0023_auto_20210128_1148.py
- Create model MyModel
Traceback (most recent call last):
File "manage.py", line 22, in <module>
main()
File "manage.py", line 18, in main
execute_from_command_line(sys.argv)
File "/Library/Python/3.7/site-packages/django/core/management/__init__.py", line 401, in execute_from_command_line
utility.execute()
File "/Library/Python/3.7/site-packages/django/core/management/__init__.py", line 395, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/Library/Python/3.7/site-packages/django/core/management/base.py", line 330, in run_from_argv
self.execute(*args, **cmd_options)
File "/Library/Python/3.7/site-packages/django/core/management/base.py", line 371, in execute
output = self.handle(*args, **options)
File "/Library/Python/3.7/site-packages/django/core/management/base.py", line 85, in wrapped
res = handle_func(*args, **kwargs)
File "/Library/Python/3.7/site-packages/django/core/management/commands/makemigrations.py", line 182, in handle
self.write_migration_files(changes)
File "/Library/Python/3.7/site-packages/django/core/management/commands/makemigrations.py", line 219, in write_migration_files
migration_string = writer.as_string()
File "/Library/Python/3.7/site-packages/django/db/migrations/writer.py", line 141, in as_string
operation_string, operation_imports = OperationWriter(operation).serialize()
File "/Library/Python/3.7/site-packages/django/db/migrations/writer.py", line 99, in serialize
_write(arg_name, arg_value)
File "/Library/Python/3.7/site-packages/django/db/migrations/writer.py", line 51, in _write
arg_string, arg_imports = MigrationWriter.serialize(item)
File "/Library/Python/3.7/site-packages/django/db/migrations/writer.py", line 271, in serialize
return serializer_factory(value).serialize()
File "/Library/Python/3.7/site-packages/django/db/migrations/serializer.py", line 37, in serialize
item_string, item_imports = serializer_factory(item).serialize()
File "/Library/Python/3.7/site-packages/django/db/migrations/serializer.py", line 199, in serialize
return self.serialize_deconstructed(path, args, kwargs)
File "/Library/Python/3.7/site-packages/django/db/migrations/serializer.py", line 86, in serialize_deconstructed
arg_string, arg_imports = serializer_factory(arg).serialize()
File "/Library/Python/3.7/site-packages/django/db/migrations/serializer.py", line 159, in serialize
'Could not find function %s in %s.\n' % (self.value.__name__, module_name)
ValueError: Could not find function wrapper in myproject.models.
Normally Python allows a function to return a function, so what is the error here? If I create the model with a dummy path, run makemigrations, and then add the function, it works fine in my code. It just seems to error during makemigrations.
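For what it's worth, the error message points at the migration serializer: Django stores upload_to in a migration as a dotted import path, and a function defined inside another function has no importable path, which is why it reports it "could not find function wrapper in myproject.models". A sketch of the usual workaround under that reading, with upload_to_files as a hypothetical name:

import os

from django.db import models

def upload_to_files(instance, filename):
    # A module-level callable has an importable dotted path
    # ('myproject.models.upload_to_files'), so migrations can serialize it.
    return os.path.join('files', filename)

class MyModel(models.Model):
    image = models.ImageField(upload_to=upload_to_files)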

Dask - trying to read HDFS data, getting error ArrowIOError: HDFS file does not exist

I tried creating a dataframe from a CSV stored in HDFS. Connecting succeeds, but when I try to get the output of the len function I get an error.
Code:
from dask_yarn import YarnCluster
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import subprocess
import os

# GET HDFS CLASSPATH
classpath = subprocess.Popen(["/usr/hdp/current/hadoop-client/bin/hadoop", "classpath", "--glob"], stdout=subprocess.PIPE).communicate()[0]

os.environ["HADOOP_HOME"] = "/usr/hdp/current/hadoop-client"
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/hdp/3.1.4.0-315/usr/lib/"
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java/"
os.environ["CLASSPATH"] = classpath.decode("utf-8")

cluster = YarnCluster(environment='python:///opt/anaconda3/bin/python3', worker_vcores=32, worker_memory="128GiB", n_workers=10)
client = Client(cluster)
client

df = dd.read_csv('hdfs://masterha/data/batch/82.csv')
len(df)
Error:
>>> len(ddf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 504, in __len__
len, np.sum, token="len", meta=int, split_every=False
File "/opt/anaconda3/lib/python3.7/site-packages/dask/base.py", line 165, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/opt/anaconda3/lib/python3.7/site-packages/dask/base.py", line 436, in compute
results = schedule(dsk, keys, **kwargs)
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 2539, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 1839, in gather
asynchronous=asynchronous,
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 756, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py", line 333, in sync
raise exc.with_traceback(tb)
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py", line 317, in f
result[0] = yield future
File "/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/opt/anaconda3/lib/python3.7/site-packages/distributed/client.py", line 1695, in _gather
raise exception.with_traceback(traceback)
File "/opt/anaconda3/lib/python3.7/site-packages/dask/bytes/core.py", line 181, in read_block_from_file
with copy.copy(lazy_file) as f:
File "/opt/anaconda3/lib/python3.7/site-packages/fsspec/core.py", line 88, in __enter__
f = self.fs.open(self.path, mode=mode)
File "/opt/anaconda3/lib/python3.7/site-packages/fsspec/implementations/hdfs.py", line 116, in <lambda>
return lambda *args, **kw: getattr(PyArrowHDFS, item)(self, *args, **kw)
File "/opt/anaconda3/lib/python3.7/site-packages/fsspec/spec.py", line 708, in open
path, mode=mode, block_size=block_size, autocommit=ac, **kwargs
File "/opt/anaconda3/lib/python3.7/site-packages/fsspec/implementations/hdfs.py", line 116, in <lambda>
return lambda *args, **kw: getattr(PyArrowHDFS, item)(self, *args, **kw)
File "/opt/anaconda3/lib/python3.7/site-packages/fsspec/implementations/hdfs.py", line 72, in _open
return HDFSFile(self, path, mode, block_size, **kwargs)
File "/opt/anaconda3/lib/python3.7/site-packages/fsspec/implementations/hdfs.py", line 171, in __init__
self.fh = fs.pahdfs.open(path, mode, block_size, **kwargs)
File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /data/batch/82.csv
It looks like your file "/data/batch/82.csv" doesn't exist. You might want to verify that you have the right path.
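One quick way to check is to list the directory through the same fsspec/PyArrow HDFS layer that Dask uses, from the same environment as the workers. A sketch, assuming the 'masterha' nameservice from the question resolves with your Hadoop configuration:

import fsspec

# Open the PyArrow-backed HDFS filesystem and list the parent directory
# to confirm the file's actual path and name.
fs = fsspec.filesystem('hdfs', host='masterha')
print(fs.ls('/data/batch'))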

Upload Excel file to Python

I'm simply trying to load an Excel .xlsx file with the Python pandas package so I can tokenize the text. I've tried for hours but nothing works. Any help would be great:
import pandas as pd

excel_file = open(r'C:\Users\farid-PC\Desktop\Tester.xlsx', errors='ignore')
movies = pd.read_excel(excel_file)
movies.head()
Errors:
Traceback (most recent call last):
File "C:/Users/farid-PC/PycharmProjects/fake_news/fault.py", line 4, in <module>
movies = pd.read_excel(excel_file)
File "C:\Users\farid-PC\PycharmProjects\fake_news\venv\lib\site-packages\pandas\util\_decorators.py", line 178, in wrapper
return func(*args, **kwargs)
File "C:\Users\farid-PC\PycharmProjects\fake_news\venv\lib\site-packages\pandas\util\_decorators.py", line 178, in wrapper
return func(*args, **kwargs)
File "C:\Users\farid-PC\PycharmProjects\fake_news\venv\lib\site-packages\pandas\io\excel.py", line 307, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\farid-PC\PycharmProjects\fake_news\venv\lib\site-packages\pandas\io\excel.py", line 392, in __init__
self.book = xlrd.open_workbook(file_contents=data)
File "C:\Users\farid-PC\PycharmProjects\fake_news\venv\lib\site-packages\xlrd\__init__.py", line 162, in open_workbook
ragged_rows=ragged_rows,
File "C:\Users\farid-PC\PycharmProjects\fake_news\venv\lib\site-packages\xlrd\book.py", line 91, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Users\farid-PC\PycharmProjects\fake_news\venv\lib\site-packages\xlrd\book.py", line 1267, in getbof
opcode = self.get2bytes()
File "C:\Users\farid-PC\PycharmProjects\fake_news\venv\lib\site-packages\xlrd\book.py", line 672, in get2bytes
return (BYTES_ORD(hi) << 8) | BYTES_ORD(lo)
TypeError: unsupported operand type(s) for <<: 'str' and 'int'
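The traceback suggests xlrd is being handed text rather than bytes: open() without 'rb' opens the file in text mode, and errors='ignore' silently mangles the binary content. A minimal sketch of the usual fix, which is to let pandas open the file itself:

import pandas as pd

# Passing the path (instead of a text-mode file handle) lets pandas open
# the workbook in binary mode, so xlrd receives the bytes it expects.
movies = pd.read_excel(r'C:\Users\farid-PC\Desktop\Tester.xlsx')
print(movies.head())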

Problems building duktape using low_memory.yaml and pointer compression options

I'm trying to build duktape using the low_memory.yaml profile with the pointer compression options enabled. Specifically, I uncommented the following lines:
DUK_USE_STRTAB_PTRCOMP: true # sometimes useful with pointer compression
DUK_USE_REFCOUNT16: true
DUK_USE_REFCOUNT32: false
DUK_USE_STRHASH16: true
DUK_USE_STRLEN16: true
DUK_USE_BUFLEN16: true
DUK_USE_OBJSIZES16: true
DUK_USE_HSTRING_CLEN: false
DUK_USE_HSTRING_LAZY_CLEN: false
DUK_USE_HOBJECT_HASH_PART: false
DUK_USE_HEAPPTR16
DUK_USE_HEAPPTR_DEC16
DUK_USE_HEAPPTR_ENC16
The remaining lines are left untouched. When I run the python utility like this:
python tools/configure.py --output-directory ~/duktape-src/low_mem_t --option-file config/examples/low_memory_t1.yaml
I get a lot of exceptions:
Traceback (most recent call last):
File "/home/pi/duktape-2.2.1/tools/genconfig.py", line 1522, in <module>
main()
File "/home/pi/duktape-2.2.1/tools/genconfig.py", line 1519, in main
genconfig(opts, args)
File "/home/pi/duktape-2.2.1/tools/genconfig.py", line 1498, in genconfig
result, active_opts = generate_duk_config_header(opts, meta_dir)
File "/home/pi/duktape-2.2.1/tools/genconfig.py", line 970, in generate_duk_config_header
forced_opts = get_forced_options(opts)
File "/home/pi/duktape-2.2.1/tools/genconfig.py", line 795, in get_forced_options
doc = yaml.load(StringIO(val))
File "/home/pi/.local/lib/python2.7/site-packages/yaml/__init__.py", line 71, in load
return loader.get_single_data()
File "/home/pi/.local/lib/python2.7/site-packages/yaml/constructor.py", line 37, in get_single_data
node = self.get_single_node()
File "/home/pi/.local/lib/python2.7/site-packages/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/home/pi/.local/lib/python2.7/site-packages/yaml/composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "/home/pi/.local/lib/python2.7/site-packages/yaml/composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "/home/pi/.local/lib/python2.7/site-packages/yaml/composer.py", line 127, in compose_mapping_node
while not self.check_event(MappingEndEvent):
File "/home/pi/.local/lib/python2.7/site-packages/yaml/parser.py", line 98, in check_event
self.current_event = self.state()
File "/home/pi/.local/lib/python2.7/site-packages/yaml/parser.py", line 428, in parse_block_mapping_key
if self.check_token(KeyToken):
File "/home/pi/.local/lib/python2.7/site-packages/yaml/scanner.py", line 115, in check_token
while self.need_more_tokens():
File "/home/pi/.local/lib/python2.7/site-packages/yaml/scanner.py", line 149, in need_more_tokens
self.stale_possible_simple_keys()
File "/home/pi/.local/lib/python2.7/site-packages/yaml/scanner.py", line 289, in stale_possible_simple_keys
"could not find expected ':'", self.get_mark())
yaml.scanner.ScannerError: while scanning a simple key
in "<file>", line 85, column 1
could not find expected ':'
in "<file>", line 86, column 1
Traceback (most recent call last):
File "tools/configure.py", line 993, in <module>
main()
File "tools/configure.py", line 605, in main
exec_print_stdout(cmd)
File "tools/configure.py", line 60, in exec_print_stdout
ret = exec_get_stdout(cmd, input=input, print_stdout=True)
File "tools/configure.py", line 51, in exec_get_stdout
raise Exception('command failed, return code %d: %r' % (proc.returncode, cmd))
Exception: command failed, return code 1: ['/usr/bin/python', '/home/pi/duktape-2.2.1/tools/genconfig.py', '--output', '/tmp/tmp-duk-prepare-Xu0Jx4/duk_config.h.tmp', '--output-active-options', '/tmp/tmp-duk-prepare-Xu0Jx4/duk_config_active_options.json', '--git-commit', 'external', '--git-describe', 'external', '--git-branch', 'external', '--used-stridx-metadata', '/tmp/tmp-duk-prepare-Xu0Jx4/duk_used_stridx_bidx_defs.json.tmp', '--metadata', '/home/pi/duktape-2.2.1/config', '--option-file', '/tmp/tmp-duk-prepare-Xu0Jx4/genconfig0.yaml', 'duk-config-header']
Thanks for the advice.
The error seems to indicate that the config file cannot be parsed as a YAML file -- could you check that the uncommented lines still respect YAML syntax?
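That reading matches the option file shown above: the last three uncommented lines (DUK_USE_HEAPPTR16 and friends) have no values, and a bare key with no ':' is exactly what the scanner error around line 85/86 complains about. A sketch of what those lines would need to look like; my_enc16 and my_dec16 are hypothetical stand-ins for your own pointer compression macros:

DUK_USE_HEAPPTR16: true
# ENC16/DEC16 must expand to your own compression macros;
# 'my_enc16'/'my_dec16' below are placeholders, not working code.
DUK_USE_HEAPPTR_ENC16: "my_enc16((udata), (ptr))"
DUK_USE_HEAPPTR_DEC16: "my_dec16((udata), (ptr))"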

sqlalchemy insert - string argument without an encoding

The code below worked when using Python 2.7, but raises a StatementError when using Python 3.5. I haven't found a good explanation for this online yet.
Why doesn't sqlalchemy accept simple Python 3 string objects in this situation? Is there a better way to insert rows into a table?
from sqlalchemy import Table, MetaData, create_engine
import json

def add_site(site_id):
    engine = create_engine('mysql+pymysql://root:password@localhost/database_name', encoding='utf8', convert_unicode=True)
    metadata = MetaData()
    conn = engine.connect()
    table_name = Table('table_name', metadata, autoload=True, autoload_with=engine)
    site_name = 'Buffalo, NY'
    p_profile = {"0": 300, "1": 500, "2": 100}
    conn.execute(table_name.insert().values(updated=True,
                                            site_name=site_name,
                                            site_id=site_id,
                                            p_profile=json.dumps(p_profile)))

add_site(121)
EDIT: The table was previously created with this function:
from sqlalchemy import (Table, MetaData, Column, Integer, Sequence,
                        Boolean, BLOB, SMALLINT, create_engine)

def create_table():
    engine = create_engine('mysql+pymysql://root:password@localhost/database_name')
    metadata = MetaData()
    # Create table for updating sites.
    table_name = Table('table_name', metadata,
                       Column('id', Integer, Sequence('user_id_seq'), primary_key=True),
                       Column('updated', Boolean),
                       Column('site_name', BLOB),
                       Column('site_id', SMALLINT),
                       Column('p_profile', BLOB))
    metadata.create_all(engine)
EDIT: Full error:
>>> scd.add_site(121)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1073, in _execute_context
context = constructor(dialect, self, conn, *args)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/default.py", line 610, in _init_compiled
for key in compiled_params
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/default.py", line 610, in <genexpr>
for key in compiled_params
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/sql/sqltypes.py", line 834, in process
return DBAPIBinary(value)
File "/usr/local/lib/python3.5/dist-packages/pymysql/__init__.py", line 79, in Binary
return bytes(x)
TypeError: string argument without an encoding
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user1/Desktop/server_algorithm/database_tools.py", line 194, in add_site
failed_acks=json.dumps(p_profile)))
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 914, in execute
return meth(self, multiparams, params)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
compiled_sql, distilled_params
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1078, in _execute_context
None, None)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
exc_info
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/util/compat.py", line 202, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/util/compat.py", line 185, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/base.py", line 1073, in _execute_context
context = constructor(dialect, self, conn, *args)
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/default.py", line 610, in _init_compiled
for key in compiled_params
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/engine/default.py", line 610, in <genexpr>
for key in compiled_params
File "/usr/local/lib/python3.5/dist-packages/sqlalchemy/sql/sqltypes.py", line 834, in process
return DBAPIBinary(value)
File "/usr/local/lib/python3.5/dist-packages/pymysql/__init__.py", line 79, in Binary
return bytes(x)
sqlalchemy.exc.StatementError: (builtins.TypeError) string argument without an encoding [SQL: 'INSERT INTO table_name (updated, site_name, site_id, p_profile) VALUES (%(updated)s, %(site_name)s, %(site_id)s, %(p_profile)s)']
As univerio mentioned, the solution was to encode the string as follows:
conn.execute(table_name.insert().values(updated=True,
                                        site_name=site_name,
                                        site_id=site_id,
                                        p_profile=bytes(json.dumps(p_profile), 'utf8')))
BLOBs require binary data, so we need bytes in Python 3 and str in Python 2, since Python 2 strings are sequences of bytes.
If we want to use Python 3 str, we need to use TEXT instead of BLOB.
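For reference, a sketch of the TEXT variant of the same table definition (column names mirror the create_table function above):

from sqlalchemy import Table, MetaData, Column, Integer, Sequence, Boolean, SMALLINT, TEXT

metadata = MetaData()
table_name = Table('table_name', metadata,
                   Column('id', Integer, Sequence('user_id_seq'), primary_key=True),
                   Column('updated', Boolean),
                   # TEXT columns accept Python 3 str directly, so the
                   # json.dumps() result can be inserted without encoding.
                   Column('site_name', TEXT),
                   Column('site_id', SMALLINT),
                   Column('p_profile', TEXT))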
You simply need to convert your string to a byte string, e.g.:

site_name=str.encode(site_name),
site_id=site_id,
p_profile=json.dumps(p_profile)))

or

site_name = b'Buffalo, NY'