Pass a partitioned TabularDataset into ParallelRunStep with azureml sdkv1

I am trying to pass a partitioned TabularDataset into a ParallelRunStep as input, but I get the error below and can't figure out why ParallelRunStep doesn't recognize the dataset as partitioned:
UserInputNotPartitionedByGivenKeys: The input dataset 'partitioned_combined_scored_dataset_input' is not partitioned by 'model_name'.
Traceback (most recent call last):
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role_process.py", line 111, in run
loop.run_until_complete(self.master_role.start())
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role.py", line 303, in start
await self.wait_for_first_task()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role.py", line 288, in wait_for_first_task
await self.wait_for_input_init()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/master_role.py", line 126, in wait_for_input_init
self.future_create_tasks.result()
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_producer.py", line 199, in create_tasks
raise exc
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_producer.py", line 190, in create_tasks
for task_group in self.get_task_groups(provider.get_tasks()):
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_producer.py", line 169, in get_task_groups
for index, task in enumerate(tasks):
File "/tmp/48a0ec47-b89c-41ff-89f8-3482d2823d20/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/partition_by_keys_provider.py", line 77, in get_tasks
raise UserInputNotPartitionedByGivenKeys(message=message, compliant_message=compliant_message)
UserInputNotPartitionedByGivenKeys: The input dataset 'partitioned_combined_scored_dataset_input' is not partitioned by 'model_name'.
ParallelRunConfig & ParallelRunStep
parallel_run_config = ParallelRunConfig(
    source_directory=source_dir_for_snapshot,
    entry_script="src/steps/script.py",
    partition_keys=["model_name"],
    error_threshold=10,
    allowed_failed_count=15,
    allowed_failed_percent=10,
    run_max_try=3,
    output_action="append_row",
    append_row_file_name="output_file.csv",
    environment=aml_run_config.environment,
    compute_target=aml_run_config.target,
    node_count=2
)
parallelrun_step = ParallelRunStep(
    name="Do Some Parallel Stuff on Each model_name",
    parallel_run_config=parallel_run_config,
    inputs=[partitioned_combined_scored_dataset],
    output=OutputFileDatasetConsumptionConfig(name='output_dataset'),
    arguments=["--score-id", score_id_pipeline_param,
               "--partitioned-combined-dataset", partitioned_combined_scored_dataset],
    allow_reuse=True
)
partitioned_combined_scored_dataset
partitioned_combined_scored_dataset = DatasetConsumptionConfig(
    name="partitioned_combined_scored_dataset_input",
    dataset=PipelineParameter(
        name="partitioned_combined_dataset",
        default_value=future_partitioned_dataset)
)
The dataset behind partitioned_combined_scored_dataset was previously created and uploaded using:
partitioned_dataset = TabularDatasetFactory.from_parquet_files(path=(Datastore.get(ws, ), f"{partitioned_combined_datasets_dir}/*.parquet"))\
    .partition_by(
        partition_keys=['model_name'],
        target=DataPath(Datastore(), 'some/path/to/partitioned')
    )
I know from the documentation that TabularDataset.partition_by() uploads to a GUID folder generated by AML, so some/path/to/partitioned actually becomes some/path/to/partitioned/XXXXXXXX/{model_name}/part0.parquet for each partition on model_name. We accounted for this when defining the tabular dataset passed into the PipelineParameter for partitioned_combined_scored_dataset at runtime, using:
TabularDatasetFactory.from_parquet_files(path=(Datastore(), f"{partitioned_combined_dataset_dir}/*/*/*.parquet"))
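One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: with the layout described above, the model_name value lives in the folder name, so a dataset re-created from a plain glob has no obvious way to know that this folder is a partition key. from_parquet_files accepts a partition_format argument for describing such layouts, and TabularDataset exposes a partition_keys property that can be inspected before submitting the pipeline. The sketch below is plain Python, not AzureML code; partition_value and the sample paths are hypothetical and only illustrate how the key could be read back from the folder structure.

```python
import re


def partition_value(path, base_dir):
    """Return the model_name folder from <base_dir>/<guid>/<model_name>/<file>.parquet, else None."""
    pattern = (
        re.escape(base_dir.rstrip("/"))
        + r"/[^/]+"                   # GUID folder generated on upload
        + r"/(?P<model_name>[^/]+)"   # partition folder named after the key value
        + r"/[^/]+\.parquet$"         # data file inside the partition folder
    )
    match = re.match(pattern, path)
    return match.group("model_name") if match else None


# Hypothetical paths that follow the layout described in the question.
paths = [
    "some/path/to/partitioned/XXXXXXXX/model_a/part0.parquet",
    "some/path/to/partitioned/XXXXXXXX/model_b/part0.parquet",
]
partitions = sorted({partition_value(p, "some/path/to/partitioned") for p in paths})
print(partitions)
```

If the re-created dataset reports no partition keys while the dataset returned by partition_by() does, passing the latter (or describing the folder layout when loading) would be the next thing to try.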

Related

pytorch: Merge three datasets with predefined and custom datasets

I am training an AI model to recognize handwritten Hangul characters along with English characters and numbers. That means I need three datasets: a custom Korean character dataset plus two other datasets.
I am now merging the three datasets, but when I print the train_set path it shows only MJSynth, which is wrong.
긴장_1227682.jpg is in my custom Korean dataset, not in MJSynth.
Code
custom_train_set = RecognitionDataset(
    parts[0].joinpath("images"),
    parts[0].joinpath("labels.json"),
    img_transforms=Compose(
        [
            T.Resize((args.input_size, 4 * args.input_size), preserve_aspect_ratio=True),
            # Augmentations
            T.RandomApply(T.ColorInversion(), 0.1),
            ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.02),
        ]
    ),
)
if len(parts) > 1:
    for subfolder in parts[1:]:
        custom_train_set.merge_dataset(
            RecognitionDataset(subfolder.joinpath("images"), subfolder.joinpath("labels.json"))
        )
train_set = MJSynth(
    train=True,
    img_folder='/media/cvpr/CM_22/mjsynth/mnt/ramdisk/max/90kDICT32px',
    label_path='/media/cvpr/CM_22/mjsynth/mnt/ramdisk/max/90kDICT32px/imlist.txt',
    img_transforms=T.Resize((args.input_size, 4 * args.input_size), preserve_aspect_ratio=True),
)
_train_set = SynthText(
    train=True,
    recognition_task=True,
    download=True,  # NOTE: download can take really long depending on your bandwidth
    img_transforms=T.Resize((args.input_size, 4 * args.input_size), preserve_aspect_ratio=True),
)
train_set.data.extend([(np_img, target) for np_img, target in _train_set.data])
train_set.data.extend([(np_img, target) for np_img, target in custom_train_set.data])
Traceback
Traceback (most recent call last):
File "/media/cvpr/CM_22/doctr/references/recognition/train_pytorch.py", line 485, in <module>
main(args)
File "/media/cvpr/CM_22/doctr/references/recognition/train_pytorch.py", line 396, in main
fit_one_epoch(model, train_loader, batch_transforms, optimizer, scheduler, mb, amp=args.amp)
File "/media/cvpr/CM_22/doctr/references/recognition/train_pytorch.py", line 118, in fit_one_epoch
for images, targets in progress_bar(train_loader, parent=mb):
File "/home/cvpr/anaconda3/envs/pytesseract/lib/python3.9/site-packages/fastprogress/fastprogress.py", line 50, in __iter__
raise e
File "/home/cvpr/anaconda3/envs/pytesseract/lib/python3.9/site-packages/fastprogress/fastprogress.py", line 41, in __iter__
for i,o in enumerate(self.gen):
File "/home/cvpr/anaconda3/envs/pytesseract/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/home/cvpr/anaconda3/envs/pytesseract/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
return self._process_data(data)
File "/home/cvpr/anaconda3/envs/pytesseract/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
data.reraise()
File "/home/cvpr/anaconda3/envs/pytesseract/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/cvpr/anaconda3/envs/pytesseract/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/cvpr/anaconda3/envs/pytesseract/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/cvpr/anaconda3/envs/pytesseract/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/media/cvpr/CM_22/doctr/doctr/datasets/datasets/base.py", line 48, in __getitem__
img, target = self._read_sample(index)
File "/media/cvpr/CM_22/doctr/doctr/datasets/datasets/pytorch.py", line 37, in _read_sample
else read_img_as_tensor(os.path.join(self.root, img_name), dtype=torch.float32)
File "/media/cvpr/CM_22/doctr/doctr/io/image/pytorch.py", line 52, in read_img_as_tensor
pil_img = Image.open(img_path, mode="r").convert("RGB")
File "/home/cvpr/anaconda3/envs/pytesseract/lib/python3.9/site-packages/PIL/Image.py", line 2912, in open
fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/media/cvpr/CM_22/mjsynth/mnt/ramdisk/max/90kDICT32px/긴장_1227682.jpg'
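The traceback above shows the sample path being built with os.path.join(self.root, img_name), which would explain the symptom: extending train_set.data copies file names into the MJSynth dataset, so every name is then resolved against the MJSynth folder. Below is a minimal sketch of that effect and of one way around it, keeping each dataset's own root and chaining the datasets by index. FolderDataset, ChainedDataset and the /data/... paths are hypothetical stand-ins, not docTR classes; in PyTorch, torch.utils.data.ConcatDataset follows the same idea.

```python
import posixpath


class FolderDataset:
    """Hypothetical stand-in for a dataset that stores file names relative to one root."""

    def __init__(self, root, samples):
        self.root = root
        self.data = list(samples)  # (file_name, label) pairs

    def __len__(self):
        return len(self.data)

    def path_of(self, index):
        name, _ = self.data[index]
        return posixpath.join(self.root, name)


class ChainedDataset:
    """Index across several datasets while each one keeps its own root."""

    def __init__(self, datasets):
        self.datasets = list(datasets)

    def __len__(self):
        return sum(len(d) for d in self.datasets)

    def path_of(self, index):
        for dataset in self.datasets:
            if index < len(dataset):
                return dataset.path_of(index)
            index -= len(dataset)  # move on to the next dataset
        raise IndexError("index out of range")


mjsynth = FolderDataset("/data/mjsynth", [("word_1.jpg", "word")])
custom = FolderDataset("/data/korean", [("긴장_1227682.jpg", "긴장")])

# Extending .data keeps the first dataset's root, so the custom file is looked up in the wrong folder.
extended = FolderDataset(mjsynth.root, mjsynth.data + custom.data)
wrong_path = extended.path_of(1)

# Chaining keeps each root, so the custom file resolves to its own folder.
chained = ChainedDataset([mjsynth, custom])
right_path = chained.path_of(1)
print(wrong_path)
print(right_path)
```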

pyarrow ds.dataset fails with FileNotFoundError using Azure blob filesystem in azure functions but not locally

I have a couple functions in Azure Functions which download data into my data lake v2 storage. They work by downloading CSVs, converting them to pyarrow and then saving to individual parquet files on Azure's storage which works fine.
I have another function in the same app which is supposed to consolidate and repartition the data. That function results in a FileNotFoundError exception when trying to create a dataset.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.compute as pc
import os
from adlfs import AzureBlobFileSystem
abfs=AzureBlobFileSystem(connection_string=os.environ['Synblob'])
newds=ds.dataset(f"stdatalake/miso/dailyfiles/{filetype}", filesystem=abfs, partitioning="hive")
The code runs fine locally, but when it runs in the cloud I get:
Result: Failure Exception: FileNotFoundError: stdatalake/miso/dailyfiles/daprices/exante/date=20210926
Stack:
File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py", line 402, in _handle__invocation_request
call_result = await self._loop.run_in_executor(
File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py", line 611, in _run_sync_func
return ExtensionManager.get_sync_invocation_wrapper(context,
File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/extension.py", line 215, in _raw_invocation_wrapper
result = function(**args)
File "/home/site/wwwroot/consolidate/__init__.py", line 47, in main
newds=ds.dataset(f"stdatalake/miso/dailyfiles/{filetype}", filesystem=abfs, partitioning="hive")
File "/home/site/wwwroot/.python_packages/lib/site-packages/pyarrow/dataset.py", line 667, in dataset
return _filesystem_dataset(source, **kwargs)
File "/home/site/wwwroot/.python_packages/lib/site-packages/pyarrow/dataset.py", line 422, in _filesystem_dataset
return factory.finish(schema)
File "pyarrow/_dataset.pyx", line 1680, in pyarrow._dataset.DatasetFactory.finish
File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/_fs.pyx", line 1179, in pyarrow._fs._cb_open_input_file
File "/home/site/wwwroot/.python_packages/lib/site-packages/pyarrow/fs.py", line 394, in open_input_file
raise FileNotFoundError(path)

Dask dataframe throws error when read parquet file in s3

I am trying to use Dask to read a Parquet table in S3 like this:
import dask.dataframe as dd
s3_path = "s3://my_bucket/my_table"
times = dd.read_parquet(
    s3_path,
    storage_options={
        "client_kwargs": {
            "endpoint_url": bucket_endpoint_url,
        },
        "profile_name": bucket_profile,
    }
)
result = times.groupby(['account', 'system_id'])['exec_time'].sum().nlargest(num_row).compute().reset_index().to_dict(orient='records')
I only have pyarrow and s3fs installed.
When I read it using a LocalCluster as below, it works fine:
client = LocalCluster(n_workers=1, threads_per_worker=1, processes=False)
But when I read it using a real cluster, it throws this error:
client = Client('master_ip:8786')
TypeError: ('Could not serialize object of type tuple.', "(<function apply at 0x7f9f9c9942f0>, <function _apply_chunk at 0x7f9f76ed1510>, [(<function _read_pyarrow_parquet_piece at 0x7f9f76eedea0>, <dask.bytes.s3.DaskS3FileSystem object at 0x7f9f5a83edd8>, ParquetDatasetPiece('my_bucket/my_table/0a0a6e71438a43cd82985578247d5c97.parquet', row_group=None, partition_keys=[]), ['account', 'system_id', 'upload_time', 'name', 'exec_time'], [], False, <pyarrow.parquet.ParquetPartitions object at 0x7f9f5a565278>, []), 'account', 'system_id'], {'chunk': <methodcaller: sum>, 'columns': 'exec_time'})")
distributed.batched - ERROR - Error in batched write
Traceback (most recent call last):
File "/project_folder/lib64/python3.6/site-packages/distributed/batched.py", line 94, in _background_send
on_error='raise')
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/project_folder/lib64/python3.6/site-packages/distributed/comm/tcp.py", line 224, in write
'recipient': self._peer_addr})
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/project_folder/lib64/python3.6/site-packages/distributed/comm/utils.py", line 50, in to_frames
res = yield offload(_to_frames)
File "/project_folder/lib64/python3.6/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/project_folder/lib64/python3.6/site-packages/distributed/comm/utils.py", line 43, in _to_frames
context=context))
File "/project_folder/lib64/python3.6/site-packages/distributed/protocol/core.py", line 54, in dumps
for key, value in data.items()
File "/project_folder/lib64/python3.6/site-packages/distributed/protocol/core.py", line 55, in <dictcomp>
if type(value) is Serialize}
File "/project_folder/lib64/python3.6/site-packages/distributed/protocol/serialize.py", line 164, in serialize
raise TypeError(msg, str(x)[:10000])
Do you know what could be the problem?
Thanks,

Serialisation of pyarrow objects has been problematic in pyarrow 0.13.0, which should be fixed in the next release. Can you try downgrading your pyarrow version?

How to remove a field in Django model - variation on answered question

Although a couple of similar questions have been posted, none of them matches my case.
I had a model that looked like this:
class Person(models.Model):
    """Definition of persons that will fulfill a role in a committee
    or will be in a way associated with a committee as an administrator
    """
    ClientId = models.ForeignKey('clients.Client', on_delete=models.CASCADE,
                                 to_field='id')
    PersNumber = models.PositiveIntegerField(null=False)
    PersSurName = models.CharField(max_length=40, null=False)
    PersNames = models.CharField(max_length=40, null=False)
I set the uniqueness of the record on ClientId and PersNumber.
I have created 3 records in the database.
Along the way I became convinced that I might as well use the auto-generated id of the record as the person number (I am learning Django).
I removed PersNumber from my model and ran makemigrations.
All was well until I ran migrate.
I get the following error:
django.core.exceptions.FieldDoesNotExist: Person has no field named 'PersNumber'
Any idea how to get past this error?
The full trace looks like this:
Operations to perform:
Apply all migrations: admin, auth, clients, contenttypes, komadm_conf, sessions
Running migrations:
Applying komadm_conf.0017_auto_20180830_1806...Traceback (most recent call last):
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\db\models\options.py", line 564, in get_field
return self.fields_map[field_name]
KeyError: 'PersNumber'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "manage.py", line 15, in <module>
execute_from_command_line(sys.argv)
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\core\management\__init__.py", line 381, in execute_from_command_line
utility.execute()
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\core\management\__init__.py", line 375, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\core\management\base.py", line 316, in run_from_argv
self.execute(*args, **cmd_options)
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\core\management\base.py", line 353, in execute
output = self.handle(*args, **options)
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\core\management\base.py", line 83, in wrapped
res = handle_func(*args, **kwargs)
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\core\management\commands\migrate.py", line 203, in handle
fake_initial=fake_initial,
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\db\migrations\executor.py", line 117, in migrate
state = self._migrate_all_forwards(state, plan, full_plan, fake=fake, fake_initial=fake_initial)
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\db\migrations\executor.py", line 147, in _migrate_all_forwards
state = self.apply_migration(state, migration, fake=fake, fake_initial=fake_initial)
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\db\migrations\executor.py", line 244, in apply_migration
state = migration.apply(state, schema_editor)
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\db\migrations\migration.py", line 124, in apply
operation.database_forwards(self.app_label, schema_editor, old_state, project_state)
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\db\migrations\operations\fields.py", line 150, in database_forwards
schema_editor.remove_field(from_model, from_model._meta.get_field(self.name))
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\db\backends\sqlite3\schema.py", line 318, in remove_field
self._remake_table(model, delete_field=field)
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\db\backends\sqlite3\schema.py", line 257, in _remake_table
self.create_model(temp_model)
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\db\backends\base\schema.py", line 300, in create_model
columns = [model._meta.get_field(field).column for field in fields]
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\db\backends\base\schema.py", line 300, in <listcomp>
columns = [model._meta.get_field(field).column for field in fields]
File "C:\ApplicationDef\za\co\drie_p\Komadmin.db\KomAdmin\KomadmTest\komadm_app\komadm_env\lib\site-packages\django\db\models\options.py", line 566, in get_field
raise FieldDoesNotExist("%s has no field named '%s'" % (self.object_name, field_name))
django.core.exceptions.FieldDoesNotExist: Person has no field named 'PersNumber'

Given that nobody else has answered, I will (though I am still learning too).
Did you check whether this field still exists (in the shell, for instance)?
I'd recommend manually deleting the associated migration file and trying again (makemigrations + migrate).
The migration file I am talking about can be found at MyProject/Myapp/migrations/00xx_something.py
If it still does not work, you can delete the table directly in your database, along with the associated migration file.

Gremlin paging for big dataset query

I am using Gremlin Server with a big data set, and I am paging through the results. Here is a sample of the query:
query = """g.V().both().both().count()"""
data = execute_query(query)
for x in range(0,int(data[0]/10000)+1):
    print(x*10000, " - ",(x+1)*10000)
    query = """g.V().both().both().range({0}*10000, {1}*10000)""".format(x,x+1)
    data = execute_query(query)
def execute_query(query):
    """query execution"""
The query above works fine, but for pagination I have to know the range at which to stop executing the query. To get that range I first have to fetch the count and pass it to the for loop. Is there any other way to do pagination in Gremlin?
Pagination is required because the query fails when fetching 100k results in a single query, e.g. g.V().both().both().count()
If we don't use pagination, it gives me the following error:
ERROR:tornado.application:Uncaught exception, closing connection.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 554, in wrapper
return callback(*args)
File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 343, in wrapped
raise_exc_info(exc)
File "<string>", line 3, in raise_exc_info
File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 314, in wrapped
ret = fn(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 807, in _on_frame_data
self._receive_frame()
File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 697, in _receive_frame
self.stream.read_bytes(2, self._on_frame_start)
File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 312, in read_bytes
assert isinstance(num_bytes, numbers.Integral)
File "/usr/lib/python3.5/abc.py", line 182, in __instancecheck__
if subclass in cls._abc_cache:
File "/usr/lib/python3.5/_weakrefset.py", line 75, in __contains__
return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
ERROR:tornado.application:Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f3e1c409ae8>)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 604, in _run_callback
ret = callback()
File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 554, in wrapper
return callback(*args)
File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 343, in wrapped
raise_exc_info(exc)
File "<string>", line 3, in raise_exc_info
File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 314, in wrapped
ret = fn(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 807, in _on_frame_data
self._receive_frame()
File "/usr/local/lib/python3.5/dist-packages/tornado/websocket.py", line 697, in _receive_frame
self.stream.read_bytes(2, self._on_frame_start)
File "/usr/local/lib/python3.5/dist-packages/tornado/iostream.py", line 312, in read_bytes
assert isinstance(num_bytes, numbers.Integral)
File "/usr/lib/python3.5/abc.py", line 182, in __instancecheck__
if subclass in cls._abc_cache:
File "/usr/lib/python3.5/_weakrefset.py", line 75, in __contains__
return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
Traceback (most recent call last):
File "/home/rgupta/Documents/BitBucket/ecodrone/ecodrone/test2.py", line 59, in <module>
data = execute_query(query)
File "/home/rgupta/Documents/BitBucket/ecodrone/ecodrone/test2.py", line 53, in execute_query
results = future_results.result()
File "/usr/lib/python3.5/concurrent/futures/_base.py", line 405, in result
return self.__get_result()
File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
raise self._exception
File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/resultset.py", line 81, in cb
f.result()
File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
return self.__get_result()
File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
raise self._exception
File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/connection.py", line 77, in _receive
self._protocol.data_received(data, self._results)
File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received
self.data_received(data, results_dict)
File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received
self.data_received(data, results_dict)
File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received
self.data_received(data, results_dict)
File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received
this line repeats 100 times File "/usr/local/lib/python3.5/dist-packages/gremlin_python/driver/protocol.py", line 100, in data_received

This question is largely answered here but I'll add some more comments.
Your approach to pagination is really expensive as I'm not aware of any graphs that will optimize that particular traversal and you're basically iterating all that data a lot of times. You do it once for the count(), then you iterate the first 10000, then for the second 10000, you iterate the first 10000 followed by the second 10000, and then on the third 10000, you iterate the first 20000 followed by the third 10000 and so on...
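To put a rough number on that, here is a small sketch that counts how many results are walked in total under the assumption stated above, namely that each range(lo, hi) call re-iterates everything before lo. The function name traversers_iterated is hypothetical; the loop bound mirrors the one in the question.

```python
def traversers_iterated(total, page_size):
    """Estimate how many results are walked for one count() pass plus range() paging."""
    iterated = total  # the initial count() touches every result once
    pages = total // page_size + 1  # same loop bound as the question's code
    for page in range(pages):
        upper = (page + 1) * page_size
        # range(lo, hi) has to pass over the first lo results before returning its page
        iterated += min(upper, total)
    return iterated


cost = traversers_iterated(100_000, 10_000)
print(cost)
```

For 100,000 results and pages of 10,000 this comes to 750,000 results iterated, compared with 100,000 for a single pass.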
I'm not sure if there is more to your logic, but what you have looks like a form of "batching" to get smaller bunches of results. There isn't much need to do it that way as Gremlin Server is already doing that for you internally. Were you to just send g.V().both().both() Gremlin Server is going to batch up results given the resultIterationBatchSize configuration option.
Anyway, there isn't really a better way to make paging work that I am aware of beyond what was explained in the other question that I mentioned.
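If client-side paging is kept anyway, the separate count() pass can at least be dropped by requesting pages until one comes back short. This is a minimal sketch, assuming a hypothetical fetch_page(lo, hi) helper that would wrap execute_query with g.V().both().both().range(lo, hi); here it is faked with a list so the example runs on its own. Each range() call still has to skip the earlier results on the server, so the cost described above remains.

```python
def iterate_pages(fetch_page, page_size):
    """Yield pages of results until a page comes back shorter than page_size."""
    start = 0
    while True:
        page = fetch_page(start, start + page_size)
        if page:
            yield page
        if len(page) < page_size:
            break  # a short (or empty) page means there is nothing left
        start += page_size


# Stand-in for the server: 25 fake results, sliced the way range(lo, hi) would be.
fake_results = list(range(25))


def fetch_page(lo, hi):
    return fake_results[lo:hi]


page_sizes = [len(page) for page in iterate_pages(fetch_page, 10)]
print(page_sizes)
```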
