How to pass deep learning model data to map function in Spark

I have a very simple use case: I read a large number of images from S3 as an RDD using the sc.binaryFiles method. Once the RDD is created, I pass its contents to a VGG16 feature extractor function. Feature extraction needs the model data, so I put the model data into a broadcast variable and access its value in each map function. Below is the code:
s3_files_rdd = sc.binaryFiles(RESOLVED_IMAGE_PATH)
s3_files_rdd.persist()
model_data = initVGG16()
broadcast_model = sc.broadcast(model_data)
features_rdd = s3_files_rdd.mapPartitions(extract_features_)
response_rdd = features_rdd.map(lambda x: (x[0], write_to_s3(x, OUTPUT, FORMAT_NAME)))
extract_features_ method:-
def extract_features_(xs):
    model_data = initVGG16()
    for k, v in xs:
        yield k, extract_features2(model_data, v)
extract_features method:-
import numpy as np
from keras.preprocessing import image
from keras.applications.vgg16 import VGG16
from keras.models import Model
from io import BytesIO
from keras.applications.vgg16 import preprocess_input
def extract_features(model, obj):
    try:
        print('executing vgg16 feature extractor...')
        # load_img expects target_size as (height, width)
        img = image.load_img(BytesIO(obj), target_size=(224, 224))
        img_data = image.img_to_array(img)
        img_data = np.expand_dims(img_data, axis=0)
        img_data = preprocess_input(img_data)
        vgg16_feature = model.predict(img_data)[0]
        print('++++++++++++++++++++++++++++', vgg16_feature.shape)
        return vgg16_feature
    except Exception as e:
        print('Error......{}'.format(e.args))
        return []
write to s3 method:-
def write_to_s3(rdd, output_path, format_name):
    file_path = rdd[0]
    file_name_without_ext = get_file_name_without_ext(file_path)
    bucket_name = output_path.split('/', 1)[0]
    final_path = 'deepak' + '/' + file_name_without_ext + '.' + format_name
    LOGGER.info("Saving to S3....")
    cci = cc.get_interface(bucket_name, ACCESS_KEY=os.environ.get("AWS_ACCESS_KEY_ID"),
                           SECRET_KEY=os.environ.get("AWS_SECRET_ACCESS_KEY"), endpoint_url='https://s3.amazonaws.com')
    response = cci.upload_npy_array(final_path, rdd[1])
    return response
Inside the write_to_s3 method I receive an RDD element and extract the key name to be saved and the bucket. I then use a library called cottoncandy to directly save the element's content, which is a NumPy array in my case, without writing any intermediate file.
I am getting the error below:
127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/cloudpickle.py", line 600, in save_reduce
save(state)
File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib64/python2.7/pickle.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib64/python2.7/pickle.py", line 687, in _batch_setitems
save(v)
File "/usr/lib64/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
TypeError: can't pickle thread.lock objects
Traceback (most recent call last):
File "one_file5.py", line 98, in <module>
run()
File "one_file5.py", line 89, in run
LOGGER.info('features_rdd rdd created,...... %s',features_rdd.count())
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 1041, in count
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 1032, in sum
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 906, in fold
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 809, in collect
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 2388, in _wrap_function
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/rdd.py", line 2374, in _prepare_for_python_RDD
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/serializers.py", line 464, in dumps
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/cloudpickle.py", line 704, in dumps
File "/mnt/yarn/usercache/hadoop/appcache/application_1541576150127_0010/container_1541576150127_0010_01_000001/pyspark.zip/pyspark/cloudpickle.py", line 162, in dump
pickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock objects.
When I comment out the features_rdd part of the code, the program runs fine, which means something is wrong in the features_rdd part. I am not sure what I am doing wrong here.
I am running the program in AWS EMR, with 4 executors.
executor core 7
executor RAM 8GB
Spark version 2.2.1

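Replace your current code with mapPartitions, so that the model is initialized on the workers inside each partition instead of being serialized from the driver:
def extract_features_(xs):
    model_data = initVGG16()
    for k, v in xs:
        yield k, extract_features(model_data, v)
features_rdd = s3_files_rdd.mapPartitions(extract_features_)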
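
To illustrate why this pattern helps, here is a minimal sketch that runs without Spark or Keras. The names make_partition_mapper, fake_load_model and fake_apply are hypothetical stand-ins for initVGG16 and extract_features; the point is only that the loader runs once per partition, inside the function that the workers execute, so the model object itself never has to be pickled.

```python
def make_partition_mapper(load_model, apply_model):
    """Build a function with the shape that rdd.mapPartitions expects.

    load_model and apply_model are hypothetical callables standing in for
    initVGG16 and extract_features from the question.
    """
    def map_partition(records):
        model = load_model()  # built once per partition, on the worker
        for key, value in records:
            yield key, apply_model(model, value)
    return map_partition


# Local demonstration with stand-ins; no Spark or Keras is needed.
load_calls = []

def fake_load_model():
    load_calls.append(1)          # record how often the "model" is built
    return lambda data: len(data)  # toy "model": returns the payload length

def fake_apply(model, data):
    return model(data)

mapper = make_partition_mapper(fake_load_model, fake_apply)
partition = iter([("a.jpg", b"123"), ("b.jpg", b"12345")])
results = list(mapper(partition))
print(results, "model loads:", len(load_calls))
```

With a real RDD, the returned function would presumably be passed as s3_files_rdd.mapPartitions(mapper), assuming the loader and the apply function are importable on the executors.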

Related

KeyError when trying to fine tuning Bert for text classification

I am trying to fine tune Bert for text classification on my dataset and I am getting the following error:
KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'
Here is the full error:
1/1 * Epoch (train): 0% 0/613 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 47, in <module>
runner.train(
File "/usr/local/lib/python3.8/dist-packages/catalyst/runners/runner.py", line 377, in train
self.run()
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py", line 422, in run
self._run_event("on_exception")
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py", line 365, in _run_event
getattr(self, event)(self)
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py", line 357, in on_exception
raise self.exception
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py", line 419, in run
self._run()
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py", line 410, in _run
self.engine.spawn(self._run_local)
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/engine.py", line 59, in spawn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py", line 405, in _run_local
self._run_experiment()
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py", line 399, in _run_experiment
self._run_epoch()
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py", line 391, in _run_epoch
self._run_loader()
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py", line 384, in _run_loader
self._run_event("on_batch_start")
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py", line 361, in _run_event
getattr(self, event)(self)
File "/usr/local/lib/python3.8/dist-packages/catalyst/runners/supervised.py", line 150, in on_batch_start
super().on_batch_start(runner)
File "/usr/local/lib/python3.8/dist-packages/catalyst/core/runner.py", line 321, in on_batch_start
self.batch_size = len(self.batch[0])
File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 241, in __getitem__
raise KeyError(
KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'
The code I am using for data preparation:
import logging
from pathlib import Path
from typing import List, Mapping, Tuple
import pandas as pd
import torch
from catalyst.utils import set_global_seed
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer
class TextClassificationDataset(Dataset):
    """
    Wrapper around Torch Dataset to perform text classification
    """

    def __init__(
        self,
        texts: List[str],
        labels: List[str] = None,
        label_dict: Mapping[str, int] = None,
        max_seq_length: int = 512,
        model_name: str = "GroNLP/hateBERT",
    ):
        """
        Args:
            texts (List[str]): a list with texts to classify or to train the
                classifier on
            labels List[str]: a list with classification labels (optional)
            label_dict (dict): a dictionary mapping class names to class ids,
                to be passed to the validation data (optional)
            max_seq_length (int): maximal sequence length in tokens,
                texts will be stripped to this length
            model_name (str): transformer model name, needed to perform
                appropriate tokenization
        """
        self.texts = texts
        self.labels = labels
        self.label_dict = label_dict
        self.max_seq_length = max_seq_length
        if self.label_dict is None and labels is not None:
            # {'class1': 0, 'class2': 1, 'class3': 2, ...}
            # to easily handle unknown target values
            self.label_dict = dict(zip(sorted(set(labels)), range(len(set(labels)))))
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # suppresses tokenizer warnings
        logging.getLogger("transformers.tokenization_utils").setLevel(logging.FATAL)

    def __len__(self) -> int:
        """
        Returns:
            int: length of the dataset
        """
        return len(self.texts)

    def __getitem__(self, index) -> Mapping[str, torch.Tensor]:
        """Gets element of the dataset
        Args:
            index (int): index of the element in the dataset
        Returns:
            Single element by index
        """
        # encoding the text
        x = self.texts[index]
        # a dictionary with `input_ids` and `attention_mask` as keys
        output_dict = self.tokenizer.encode_plus(
            x,
            add_special_tokens=True,
            padding="max_length",
            max_length=self.max_seq_length,
            return_tensors="pt",
            truncation=True,
            return_attention_mask=True,
        )
        # for Catalyst, there needs to be a key called features
        output_dict["features"] = output_dict["input_ids"].squeeze(0)
        del output_dict["input_ids"]
        # encoding target
        if self.labels is not None:
            y = self.labels[index]
            y_encoded = torch.Tensor([self.label_dict.get(y, -1)]).long().squeeze(0)
            output_dict["targets"] = y_encoded
        return output_dict
What is the problem?
I know questions about a similar error have already been asked, but they were not of much help in solving this problem.
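
One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the tokenizer output is a dict-like object that is not a plain dict, so a runner that falls back to batch[0] for anything that is not a dict ends up indexing it with an integer. The sketch below reproduces that situation with toy stand-ins only. batch_size_of and StringKeyedBatch are hypothetical names, not Catalyst or transformers APIs.

```python
from collections import UserDict


def batch_size_of(batch):
    """Hypothetical stand-in for a runner that infers the batch size."""
    if isinstance(batch, dict):
        return len(next(iter(batch.values())))
    return len(batch[0])  # integer indexing for non-dict batches


class StringKeyedBatch(UserDict):
    """Toy mapping that, like the tokenizer output here, rejects integer keys."""

    def __getitem__(self, key):
        if isinstance(key, int):
            raise KeyError("integer indexing is not supported")
        return super().__getitem__(key)


encoded = StringKeyedBatch({"features": [[101, 102], [101, 103]]})

# A UserDict subclass is not a dict, so the fallback branch is taken.
try:
    batch_size_of(encoded)
    failed = False
except KeyError:
    failed = True

# Converting to a plain dict takes the dict branch instead.
plain = dict(encoded)
print(failed, batch_size_of(plain))
```

If this reading applies, building and returning a plain dict from __getitem__ (for example dict(output_dict)) may be enough to avoid the error.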

Part of the data was lost when the data was processed in the way provided by huggingface(pytorch)

I'm reproducing a project at https://github.com/xplip/pixel.
It is implemented on top of Hugging Face's datasets and Trainer.
I loaded sst2 under glue without any problems:
raw_datasets = load_dataset("glue", data_args.task_name, cache_dir=model_args.cache_dir)
Printing the information in the dataset is also normal:
train_dataset = raw_datasets["train"]
...
if training_args.do_train:
    for index in random.sample(range(len(train_dataset)), 3):
        logger.info(f"Sample {index} of the training set: {train_dataset[index]}.")
>>> examples: {'sentence': ["although german cooking does not come readily to mind when considering the world 's best cuisine , mostly martha could make deutchland a popular destination for hungry tourists . "], 'label': [1], 'idx': [558]}
...
However, during the training phase of the model, an error is reported.
trainer = PIXELTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    compute_metrics=compute_metrics,
    tokenizer=processor,
    data_collator=get_collator(training_args, processor, modality, is_regression=is_regression),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=training_args.early_stopping_patience)]
    if training_args.early_stopping
    else None,
)
# PIXELTrainer is a subclass of Trainer
...
train_result = trainer.train(resume_from_checkpoint=checkpoint)
The following is my error message:
Traceback (most recent call last):
File "C:\Development\Anaconda3\envs\pixels\lib\site-packages\torch\utils\data\dataloader.py", line 517, in __next__
data = self._next_data()
File "C:\Development\Anaconda3\envs\pixels\lib\site-packages\torch\utils\data\dataloader.py", line 557, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "C:\Development\Anaconda3\envs\pixels\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "C:\Development\Anaconda3\envs\pixels\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "C:\Development\Anaconda3\envs\pixels\lib\site-packages\datasets\arrow_dataset.py", line 2125, in __getitem__
key,
File "C:\Development\Anaconda3\envs\pixels\lib\site-packages\datasets\arrow_dataset.py", line 2110, in _getitem
pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns
File "C:\Development\Anaconda3\envs\pixels\lib\site-packages\datasets\formatting\formatting.py", line 533, in format_table
return formatter(pa_table, query_type=query_type)
File "C:\Development\Anaconda3\envs\pixels\lib\site-packages\datasets\formatting\formatting.py", line 281, in __call__
return self.format_row(pa_table)
File "C:\Development\Anaconda3\envs\pixels\lib\site-packages\datasets\formatting\formatting.py", line 387, in format_row
formatted_batch = self.format_batch(pa_table)
File "C:\Development\Anaconda3\envs\pixels\lib\site-packages\datasets\formatting\formatting.py", line 419, in format_batch
return self.transform(batch)
File "E:/Github_Pro/pixel/scripts/training/run_glue.py", line 358, in image_preprocess_fn
encodings = [processor(text=format_fn(a)) for a in examples[sentence1_key]]
KeyError: 'sentence'
According to the error message, I located this part:
def image_preprocess_fn(examples):
    # I print the information about examples
    print("examples:", examples)
    if sentence2_key:
        encodings = [
            processor(text=(format_fn(a), format_fn(b)))
            for a, b in zip(examples[sentence1_key], examples[sentence2_key])
        ]
    else:
        encodings = [processor(text=format_fn(a)) for a in examples[sentence1_key]]
    examples["pixel_values"] = [transforms(Image.fromarray(e.pixel_values)) for e in encodings]
    examples["attention_mask"] = [
        get_attention_mask(e.num_text_patches, seq_length=data_args.max_seq_length) for e in encodings
    ]
    return examples
Then I print examples:
>>> examples: {'label': [1]}
There is indeed no 'sentence'.
But there is 'sentence' in my original data, and since I am not familiar with datasets and Trainer, I don't know how to deal with this error anymore.
I would like someone to tell me why this is and show me how to modify the code to make it work.

WSQ files not opening with Pillow/wsq when using joblib.Parallel

I am trying to preprocess large amounts of WSQ images for model training using both the Pillow and wsq libraries. To speed up my code, I am trying to use Parallel but this causes an UnidentifiedImageError.
I verified that the files are there where they should be, and that the function runs without errors when used in a regular for-loop. Other files (eg csv files) can be opened inside the function without errors, so I presume that the error lies with the combination of Parallel and Pillow/wsq. All libraries are up to date. As I am just starting out with Pillow and multiprocessing, I have no idea yet on how to fix this and any help would be highly appreciated.
Code:
from joblib import Parallel, delayed
from PIL import Image
import multiprocessing
import wsq
import numpy as np
def process_image(i):
    path = "/home/user/project/wsq/image_"+str(i)+".wsq"
    img = np.array(Image.open(path))
    #some preprocessing, saving as npz
    output_path = "/home/user/project/npz/image_"+str(i)+".npz"
    np.savez_compressed(output_path, img)
    return None
inputs = range(100000)
num_cores = multiprocessing.cpu_count()
Parallel(n_jobs=num_cores)(delayed(process_image)(i) for i in inputs)
Output:
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
r = call_item()
File "/home/user/.local/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/user/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "/home/user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "preprocess_images.py", line 9, in process_image
img = np.array(Image.open(path))
File "/home/user/.local/lib/python3.8/site-packages/PIL/Image.py", line 2967, in open
raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file '/home/user/project/wsq/image_1.wsq'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "preprocess_images.py", line 18, in <module>
Parallel(n_jobs=num_cores)(delayed(process_image)(i) for i in inputs)
File "/home/user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 1054, in __call__
self.retrieve()
File "/home/user/.local/lib/python3.8/site-packages/joblib/parallel.py", line 933, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/user/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/usr/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
PIL.UnidentifiedImageError: cannot identify image file '/home/user/project/wsq/image_1.wsq'

tensorflow graph error when trying to generate data in transfer learning

I'm trying to use transfer learning on the pretrained inception model, so I created a class for feature extraction from the model:
from prototype import Dataset, VideoStreamHandler
import numpy
import random
from keras.applications.inception_v3 import preprocess_input
from keras.preprocessing import image
from scipy.misc import imresize
import time
class Extractor(Dataset.Dataset):
    """
    """
    def __init__(self, path_to_data, seq_len, base_model, image_shape=(299, 299, 3)):
        super().__init__(path_to_data, seq_len, input_shape=image_shape)
        self._extractor = base_model

    def extract_features(self, batch_size):
        """
        passes the data through the base model to get the feature map to later train on
        :return: feature map
        """
        class_one_hot = self.one_hot_encode() # get the one hot for the classes
        data = self.clean_data(self.get_data(), self._input_shape[0])
        print("Processing {} videos".format(len(self.get_data())))
        transfer_maps, labels = [], []
        rand = random.SystemRandom()
        while True:
            for _ in range(batch_size):
                row = rand.choice(data)
                sequence = self.get_frames(row[0])
                if len(sequence) > self._input_shape[0]:
                    sequence = self.rescale_frame_list(sequence, self._input_shape[0])
                print("{} video processing is complete".format(row[0].split('\\')[-1]))
                features = []
                for frame in sequence:
                    frame_arr = image.img_to_array(frame) # turn image to numpy array
                    frame_arr = numpy.expand_dims(frame_arr, axis=0)
                    frame_arr = preprocess_input(frame_arr)
                    features.append(self._extractor.predict(frame_arr))
                transfer_maps.append(features)
                labels.append(class_one_hot[row[1]])
            yield numpy.array(transfer_maps), numpy.array(labels)

    def get_frames(self, pth):
        """
        :type: string
        :param pth: path to the specific file from which we take the frames
        :return: the frames in the file
        """
        f_queue = VideoStreamHandler.VideoStream(pth) # This object opens a thread that reads frames with opencv
        # capture independently from the frame processing to prevent i/o delay and speed up processing
        f_queue.start()
        time.sleep(1.0) # wait a moment so the thread could start reading frames
        sequence = []
        while f_queue.isnt_empty():
            frame = f_queue.read()
            # resize is used to keep all frames from all videos the same size
            frame = imresize(frame, (self._input_shape[1], self._input_shape[2]))
            sequence.append(frame)
        f_queue.close() # close the thread
        return sequence
Then, I attempt to train a new model with keras's fit_generator:
my_model.fit_generator(generator=train_gen, epochs=10, steps_per_epoch=steps_per_epoch, verbose=1, workers=4)
However, I get this error:
Traceback (most recent call last):
File "C:/Users/Aviad Lazar/Desktop/project/prototype/transfer_learning.py", line 41, in <module>
main()
File "C:/Users/Aviad Lazar/Desktop/project/prototype/transfer_learning.py", line 34, in main
my_model.fit_generator(generator=train_gen, epochs=10, steps_per_epoch=steps_per_epoch, verbose=1, workers=4)
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\keras\models.py", line 1315, in fit_generator
initial_epoch=initial_epoch)
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\keras\engine\training.py", line 2194, in fit_generator
generator_output = next(output_generator)
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\keras\utils\data_utils.py", line 793, in get
six.reraise(value.__class__, value, value.__traceback__)
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\keras\utils\data_utils.py", line 658, in _data_generator_task
generator_output = next(self._generator)
File "C:\Users\Aviad Lazar\Desktop\project\prototype\FeatureExtractor.py", line 48, in extract_features
features.append(self._extractor.predict(frame_arr))
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\keras\engine\training.py", line 1832, in predict
self._make_predict_function()
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\keras\engine\training.py", line 1031, in _make_predict_function
**kwargs)
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 2506, in function
return Function(inputs, outputs, updates=updates, **kwargs)
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 2449, in __init__
with tf.control_dependencies(self.outputs):
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 4863, in control_dependencies
return get_default_graph().control_dependencies(control_inputs)
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 4481, in control_dependencies
c = self.as_graph_element(c)
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 3478, in as_graph_element
return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
File "C:\Users\Aviad Lazar\Desktop\project\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 3557, in _as_graph_element_locked
raise ValueError("Tensor %s is not an element of this graph." % obj)
ValueError: Tensor Tensor("global_average_pooling2d_1/Mean:0", shape=(?, 2048), dtype=float32) is not an element of this graph.

Uploading CSV files to Fusion Tables through Python

I am trying to grab data from Looker and insert it directly into Google Fusion Tables using MediaFileUpload, uploading from memory rather than downloading any files first. My current code below returns a TypeError. Any help would be appreciated. Thanks!
Error returned to me:
Traceback (most recent call last):
File "csvpython.py", line 96, in <module>
main()
File "csvpython.py", line 88, in main
media = MediaFileUpload(dataq, mimetype='application/octet-stream', resumable=True)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/oauth2client/_helpers.py", line 133, in positional_wrapper
return wrapped(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 548, in __init__
fd = open(self._filename, 'rb')
TypeError: expected str, bytes or os.PathLike object, not NoneType
Code in question:
for x, y, z in zip(look, destination, fusion):
    look_data = lc.run_look(x)
    df = pd.DataFrame(look_data)
    stream = io.StringIO()
    dataq = df.to_csv(path_or_buf=stream, sep=";", index=False)
    media = MediaFileUpload(dataq, mimetype='application/octet-stream', resumable=True)
    replace = ftserv.table().replaceRows(tableId=z, media_body=media, startLine=None, isStrict=False, encoding='UTF-8', media_mime_type='application/octet-stream', delimiter=';', endLine=None).execute()
After switching dataq to stream in MediaFileUpload, I have had the following returned to me:
Traceback (most recent call last):
File "quicktestbackup.py", line 96, in <module>
main()
File "quicktestbackup.py", line 88, in main
media = MediaFileUpload(stream, mimetype='application/octet-stream', resumable=True)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/oauth2client/_helpers.py", line 133, in positional_wrapper
return wrapped(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 548, in __init__
fd = open(self._filename, 'rb')
TypeError: expected str, bytes or os.PathLike object, not _io.StringIO

DataFrame.to_csv returns None when it is given a path_or_buf; the CSV text is written to stream and not to dataq. That is, dataq is None and has no data - your CSV data is in stream.
MediaFileUpload expects a filename and opens it from disk, which is why neither dataq nor stream is accepted. To upload from memory, take the data out of the stream with its getvalue() method, encode it, and pass it as a bytes buffer to MediaIoBaseUpload (from googleapiclient.http) instead:
df.to_csv(path_or_buf=stream, ...)
media = MediaIoBaseUpload(io.BytesIO(stream.getvalue().encode('utf-8')), mimetype='application/octet-stream', resumable=True)
The call to FusionTables looks to be perfectly valid.
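
As a minimal sketch of the in-memory step, the snippet below writes CSV text to a StringIO and converts it to a bytes buffer, which is the kind of file-like object that an upload helper such as googleapiclient's MediaIoBaseUpload accepts. It uses the standard csv module in place of pandas so that it runs without dependencies; csv_bytes_buffer is a hypothetical helper name and the rows are made-up sample data.

```python
import csv
import io


def csv_bytes_buffer(rows, delimiter=";"):
    """Serialize rows to CSV in memory and return a readable bytes buffer."""
    text_buffer = io.StringIO()
    writer = csv.writer(text_buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    # getvalue() returns the text written so far; encode it for a binary upload
    return io.BytesIO(text_buffer.getvalue().encode("utf-8"))


buffer = csv_bytes_buffer([["id", "name"], [1, "alpha"]])
payload = buffer.read()
print(payload)
```

With pandas, the same bytes buffer could be built from the stream that df.to_csv wrote to.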
