How can I read LIBSVM models (saved using LIBSVM) into PySpark?

I have a LIBSVM scaling model (generated with svm-scale) that I would like to port over to PySpark. I've naively tried the following:
scaler_path = "path to model"
a = MinMaxScaler().load(scaler_path)
But this throws an error, because the loader expects a metadata directory:
Py4JJavaErrorTraceback (most recent call last)
<ipython-input-22-1942e7522174> in <module>()
----> 1 a = MinMaxScaler().load(scaler_path)
/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/util.pyc in load(cls, path)
226 def load(cls, path):
227 """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 228 return cls.read().load(path)
229
230
/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/util.pyc in load(self, path)
174 if not isinstance(path, basestring):
175 raise TypeError("path should be a basestring, got type %s" % type(path))
--> 176 java_obj = self._jread.load(path)
177 if not hasattr(self._clazz, "_from_java"):
178 raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.pyc in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/lib/python2.7/dist-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o321.load.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:[filename]/metadata
Is there a simple workaround for loading this? The format of the LIBSVM model is:
x
0 1
1 -1050 1030
2 0 1
3 0 3
4 0 1
5 0 1

First, the file you present isn't in LIBSVM format. The correct format of a LIBSVM data file is the following:
<label> <index1>:<value1> <index2>:<value2> ... <indexN>:<valueN>
So your data preparation is incorrect to begin with.
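To make the format concrete, here is a minimal sketch of a parser for a single line of that shape. The function name parse_libsvm_line is made up for illustration and is not part of Spark or LIBSVM; indices are kept 1-based, exactly as they appear in the file.

```python
def parse_libsvm_line(line):
    """Parse one '<label> <index>:<value> ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        # Each token is '<index>:<value>'; indices are kept as written (1-based).
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("1.1 1:1.23 3:4.56")
print(label, features)
```

A line such as "1 -1050 1030" from the file in the question has no index:value pairs, which is why it cannot be read as LIBSVM data.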
Second, the class method load(path) that you are calling on MinMaxScaler reads a saved Spark ML instance from the input path. It does not parse files written by other tools.
Remember that MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel. The model can then transform each feature individually so that it lies in the given range.
For example:
from pyspark.ml.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.feature import MinMaxScaler
df = spark.createDataFrame([(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])) ,(0.0, Vectors.dense([1.01, 2.02, 3.03]))],['label','features'])
df.show(truncate=False)
# +-----+---------------------+
# |label|features |
# +-----+---------------------+
# |1.1 |(3,[0,2],[1.23,4.56])|
# |0.0 |[1.01,2.02,3.03] |
# +-----+---------------------+
mmScaler = MinMaxScaler(inputCol="features", outputCol="scaled")
temp_path = "/tmp/spark/"
minMaxScalerPath = temp_path + "min-max-scaler"
mmScaler.save(minMaxScalerPath)
The snippet above saves the MinMaxScaler feature transformer so that it can be loaded later with the class method load.
Now, let's look at what actually happened. The save method creates the following file structure:
/tmp/spark/
└── min-max-scaler
    └── metadata
        ├── part-00000
        └── _SUCCESS
Let's check the content of that part-00000 file:
$ cat /tmp/spark/min-max-scaler/metadata/part-00000 | python -m json.tool
{
"class": "org.apache.spark.ml.feature.MinMaxScaler",
"paramMap": {
"inputCol": "features",
"max": 1.0,
"min": 0.0,
"outputCol": "scaled"
},
"sparkVersion": "2.0.0",
"timestamp": 1480501003244,
"uid": "MinMaxScaler_42e68455a929c67ba66f"
}
So when you load the transformer:
loadedMMScaler = MinMaxScaler.load(minMaxScalerPath)
you are actually loading that metadata file. It won't accept a LIBSVM file!
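The "Input path does not exist: .../metadata" error in the question follows directly from this layout. As a rough illustration, a pre-flight check along these lines could tell the two kinds of path apart before calling load. The helper name looks_like_spark_ml_dir is hypothetical, and the sketch only handles local file systems, not HDFS or S3 paths.

```python
import os


def looks_like_spark_ml_dir(path):
    """Return True if `path` has the metadata/part-* layout shown above."""
    metadata_dir = os.path.join(path, "metadata")
    if not os.path.isdir(metadata_dir):
        return False
    # The JSON description lives in one or more part-* files.
    return any(name.startswith("part-") for name in os.listdir(metadata_dir))
```

For a plain text file such as the svm-scale output, this returns False, which matches the failure seen in the traceback.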
Now you can apply your transformer to create the model and transform your DataFrame :
model = loadedMMScaler.fit(df)
model.transform(df).show(truncate=False)
# +-----+---------------------+-------------+
# |label|features |scaled |
# +-----+---------------------+-------------+
# |1.1 |(3,[0,2],[1.23,4.56])|[1.0,0.0,1.0]|
# |0.0 |[1.01,2.02,3.03] |[0.0,1.0,0.0]|
# +-----+---------------------+-------------+
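The scaled column above can be checked by hand. The following is a plain-Python sketch of the per-column rescaling (x - min) / (max - min) * (upper - lower) + lower, not Spark's implementation. The function name min_max_scale is made up for illustration, and the handling of constant columns follows what the Spark documentation describes, as far as I recall.

```python
def min_max_scale(rows, lower=0.0, upper=1.0):
    """Rescale each column of `rows` (a list of equal-length lists) to [lower, upper]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        out = []
        for x, lo, hi in zip(row, mins, maxs):
            if hi == lo:
                # Constant column: use the midpoint of the target range (assumption).
                out.append(0.5 * (upper + lower))
            else:
                out.append((x - lo) / (hi - lo) * (upper - lower) + lower)
        scaled.append(out)
    return scaled


# Dense form of the two feature vectors from the DataFrame above.
rows = [[1.23, 0.0, 4.56], [1.01, 2.02, 3.03]]
print(min_max_scale(rows))  # [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

With only two rows, every value is either the column minimum or the column maximum, which is why the output contains nothing but 0.0 and 1.0.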
Now let's get back to that LIBSVM file. Let's create some dummy data and save it in LIBSVM format using MLUtils:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.util import MLUtils
data = sc.parallelize([LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])), LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))])
MLUtils.saveAsLibSVMFile(data, temp_path + "data")
Back to our file structure:
/tmp/spark/
├── data
│   ├── part-00000
│   ├── part-00001
│   ├── part-00002
│   ├── part-00003
│   ├── part-00004
│   ├── part-00005
│   ├── part-00006
│   ├── part-00007
│   └── _SUCCESS
└── min-max-scaler
    └── metadata
        ├── part-00000
        └── _SUCCESS
You can check the content of those files, which are now in LIBSVM format:
$ cat /tmp/spark/data/part-0000*
1.1 1:1.23 3:4.56
0.0 1:1.01 2:2.02 3:3.03
Now let's load that data and apply the scaler model to it:
loadedData = MLUtils.loadLibSVMFile(sc, temp_path + "data")
loadedDataDF = spark.createDataFrame(loadedData.map(lambda lp : (lp.label, lp.features.asML())), ['label','features'])
loadedDataDF.show(truncate=False)
# +-----+----------------------------+
# |label|features |
# +-----+----------------------------+
# |1.1 |(3,[0,2],[1.23,4.56]) |
# |0.0 |(3,[0,1,2],[1.01,2.02,3.03])|
# +-----+----------------------------+
Note that converting MLlib vectors to ML vectors (the asML() call above) is essential: the pyspark.ml transformers expect pyspark.ml.linalg vectors, not pyspark.mllib.linalg ones.
model.transform(loadedDataDF).show(truncate=False)
# +-----+----------------------------+-------------+
# |label|features |scaled |
# +-----+----------------------------+-------------+
# |1.1 |(3,[0,2],[1.23,4.56]) |[1.0,0.0,1.0]|
# |0.0 |(3,[0,1,2],[1.01,2.02,3.03])|[0.0,1.0,0.0]|
# +-----+----------------------------+-------------+
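As for the svm-scale file from the question itself, one possible workaround is to parse it by hand, since it is a small text file. The sketch below assumes the layout shown in the question: a line "x", a line with the target lower and upper bounds, then one "index min max" line per feature. The function names are hypothetical, a leading "y" section is not handled, and mapping constant features to the lower bound is my own assumption rather than documented svm-scale behaviour.

```python
def parse_svm_scale_range(text):
    """Parse the x-section of an svm-scale range file.

    Returns ((lower, upper), {index: (feature_min, feature_max)}).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "x":
        raise ValueError("expected the file to start with an 'x' line")
    lower, upper = (float(v) for v in lines[1].split())
    ranges = {}
    for ln in lines[2:]:
        index, fmin, fmax = ln.split()
        ranges[int(index)] = (float(fmin), float(fmax))
    return (lower, upper), ranges


def scale_value(value, fmin, fmax, lower, upper):
    """Map `value` linearly from [fmin, fmax] to [lower, upper]."""
    if fmax == fmin:
        return lower  # assumption: constant features map to the lower bound
    return lower + (upper - lower) * (value - fmin) / (fmax - fmin)


sample = "x\n0 1\n1 -1050 1030\n2 0 1\n3 0 3\n"
(lower, upper), ranges = parse_svm_scale_range(sample)
print(scale_value(-10.0, *ranges[1], lower, upper))  # 0.5
```

The parsed minimums and maximums could then be applied to a DataFrame column, for example inside a UDF, instead of trying to load the file with MinMaxScaler.load.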
I hope that this answers your question!

Related

Cannot interpret SVM model using Shapash

I'm currently exploring machine learning interpretability tools for one of my projects. I came across Shapash, a fairly new tool that many people recommend for creating a few easily interpretable charts for an ML model. When I tried it with RandomForestClassifier it worked fine and generated a webpage full of different charts, but I cannot achieve the same with an SVM (I'm just exploring the library, not looking for the perfect ML model for a problem).
Note: I'm using Shapash.
# Fit blackbox model
svc = svm.SVC()
svc.fit(X_train_smote, y_train_smote)
y_pred = svc.predict(X_test)
print(f"F1 Score {f1_score(y_test, y_pred, average='macro')}")
print(f"Accuracy {accuracy_score(y_test, y_pred)}")
from shapash import SmartExplainer
xpl = SmartExplainer(model=svc)
The error I'm getting:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
/tmp/ipykernel_13648/1233939729.py in <module>
----> 1 xpl = SmartExplainer(model=svc)
~/Python_AI/ai_env/lib/python3.8/site-packages/shapash/explainer/smart_explainer.py in __init__(self, model, backend, preprocessing, postprocessing, features_groups, features_dict, label_dict, title_story, palette_name, colors_dict, **kwargs)
194 if isinstance(backend, str):
195 backend_cls = get_backend_cls_from_name(backend)
--> 196 self.backend = backend_cls(
197 model=self.model, preprocessing=preprocessing, **kwargs)
198 elif isinstance(backend, BaseBackend):
~/Python_AI/ai_env/lib/python3.8/site-packages/shapash/backend/shap_backend.py in __init__(self, model, preprocessing, explainer_args, explainer_compute_args)
16 self.explainer_args = explainer_args if explainer_args else {}
17 self.explainer_compute_args = explainer_compute_args if explainer_compute_args else {}
---> 18 self.explainer = shap.Explainer(model=model, **self.explainer_args)
19
20 def run_explainer(self, x: pd.DataFrame) -> dict:
~/Python_AI/ai_env/lib/python3.8/site-packages/shap/explainers/_explainer.py in __init__(self, model, masker, link, algorithm, output_names, feature_names, **kwargs)
166 # if we get here then we don't know how to handle what was given to us
167 else:
--> 168 raise Exception("The passed model is not callable and cannot be analyzed directly with the given masker! Model: " + str(model))
169
170 # build the right subclass
Exception: The passed model is not callable and cannot be analyzed directly with the given masker! Model: SVC()

Unable to load a PPO model

Hello, I've trained a PPO model from stable_baselines3 on Colab and saved it:
model.save("model")
but when I tried loading it I got the following error:
m = PPO.load("model", env=env)
AttributeError Traceback (most recent call last)
/tmp/ipykernel_25649/121834194.py in <module>
2 env = e.MinitaurBulletEnv(render=False)
3 env.reset()
----> 4 m2 = PPO.load("model", env=env)
5 for episode in range(1, 6):
6 obs = env.reset()
~/anaconda3/lib/python3.8/site-packages/stable_baselines3/common/base_class.py in load(cls, path, env, device, custom_objects, **kwargs)
668 env = cls._wrap_env(env, data["verbose"])
669 # Check if given env is valid
--> 670 check_for_correct_spaces(env, data["observation_space"], data["action_space"])
671 else:
672 # Use stored env, if one exists. If not, continue as is (can be used for predict)
~/anaconda3/lib/python3.8/site-packages/stable_baselines3/common/utils.py in check_for_correct_spaces(env, observation_space, action_space)
217 :param action_space: Action space to check against
218 """
--> 219 if observation_space != env.observation_space:
220 raise ValueError(f"Observation spaces do not match: {observation_space} != {env.observation_space}")
221 if action_space != env.action_space:
~/anaconda3/lib/python3.8/site-packages/gym/spaces/box.py in __eq__(self, other)
138
139 def __eq__(self, other):
--> 140 return isinstance(other, Box) and (self.shape == other.shape) and np.allclose(self.low, other.low) and np.allclose(self.high, other.high)
AttributeError: 'Box' object has no attribute 'shape'
Note that the env is a Box env from pybullet:
import pybullet_envs.bullet.minitaur_gym_env as e
import gym
env = e.MinitaurBulletEnv(render=False)
env.reset()
Additional info: the model loaded perfectly in Colab.

From your question, I can't tell whether you are working on Google Colab, but if you are, I think you should definitely include the whole path to the saved model when you load it. You may need to do this even outside Colab.
What I mean is that your line of code should probably look something like this when you load the model:
m = PPO.load("./model.zip/", env=env)
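If the path is the concern, a small helper like the following could make it explicit before loading. This is only a sketch: resolve_model_path is a hypothetical name, and it assumes the model was saved with a ".zip" suffix, which is what I believe stable_baselines3 adds when the file name has no extension.

```python
import os


def resolve_model_path(name, base_dir="."):
    """Return an absolute path to the saved model, adding '.zip' if it is missing."""
    path = os.path.abspath(os.path.join(base_dir, name))
    if not path.endswith(".zip"):
        path += ".zip"
    if not os.path.isfile(path):
        raise FileNotFoundError("no saved model at " + path)
    return path


# Intended use (not executed here):
# m = PPO.load(resolve_model_path("model"), env=env)
```

A FileNotFoundError from the helper would confirm a path problem. If the file is found and the same AttributeError still appears, the path was not the cause.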
I hope this helps!

last.ckpt | RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading zip archive: failed finding central directory

I am using JupyterLab on AWS SageMaker. Kernel: conda_pytorch_latest_p36.
I have successfully performed training.
Now, I attempt to set up the model for predictions, i.e. testing.
I suspect the last.ckpt file is corrupt, as loading fails on this line:
model = OntologyTaggerModel.load_from_checkpoint('last.ckpt.2cCC2f52', map_location=torch.device(device), from_checkpoint=True)
Where does last.ckpt file come from - BERT download or my own model definition?
How do I regenerate it?
Update: I was able to re-generate it: last.ckpt.E342d53e.
Run model load with last.ckpt.**E342d53e**:
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading zip archive: failed finding central directory
Run model load with last.ckpt (without unique string in filename):
FileNotFoundError: [Errno 2] No such file or directory: '/home/ec2-user/SageMaker/last.ckpt'
I launched a new AWS SageMaker instance without luck.
Suspect Code (2nd last line):
def get_device():
    device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
    return device

def preprocess(input, preprocessor):
    result = [torch.tensor(preprocessor.tokenise(i)).unsqueeze(dim=0) for i in input]
    result = torch.cat(result)
    return result

def predict_fn(input, model_artifacts):
    preprocessor, model, label_mapper = model_artifacts
    # Pre-process
    input_tensor = preprocess(input, preprocessor)
    # Copy input to gpu if available
    device = get_device()
    input_tensor = input_tensor.to(device=device)
    # Invoke
    model.eval()
    classes = []
    probs = []
    with torch.no_grad():
        output_tensors = model(input_tensor)[1]
        # Convert to probabilities
        softmax = torch.nn.Softmax()
        for class_index, output_tensor in enumerate(output_tensors):
            output_tensor = softmax(output_tensor)
            prob, predictions = torch.max(output_tensor, dim=1)
            classes.append(label_mapper.reverse_map(predictions, class_index))
            probs.append(prob)
    classes = [c for c in zip(*classes)]
    probs = [c for c in zip(*probs)]
    return classes, probs

device = get_device()
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
label_mapper = LabelMapper('classes.txt')
model = OntologyTaggerModel.load_from_checkpoint('last.ckpt.2cCC2f52', map_location=torch.device(device), from_checkpoint=True) # CRASH !
model = model.to(device)
Traceback:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-8-ba98e0974205> in <module>
36 tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
37 label_mapper = LabelMapper('classes.txt')
---> 38 model = OntologyTaggerModel.load_from_checkpoint('last.ckpt.2cCC2f52', map_location=torch.device(device), from_checkpoint=True)
39 model = model.to(device)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/core/saving.py in load_from_checkpoint(cls, checkpoint_path, map_location, hparams_file, strict, **kwargs)
131 """
132 if map_location is not None:
--> 133 checkpoint = pl_load(checkpoint_path, map_location=map_location)
134 else:
135 checkpoint = pl_load(checkpoint_path, map_location=lambda storage, loc: storage)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/utilities/cloud_io.py in load(path_or_url, map_location)
44 fs = get_filesystem(path_or_url)
45 with fs.open(path_or_url, "rb") as f:
---> 46 return torch.load(f, map_location=map_location)
47
48
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
585 # reset back to the original position.
586 orig_position = opened_file.tell()
--> 587 with _open_zipfile_reader(opened_file) as opened_zipfile:
588 if _is_torchscript_zip(opened_zipfile):
589 warnings.warn("'torch.load' received a zip file that looks like a TorchScript archive"
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/serialization.py in __init__(self, name_or_buffer)
240 class _open_zipfile_reader(_opener):
241 def __init__(self, name_or_buffer) -> None:
--> 242 super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
243
244
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading zip archive: failed finding central directory
Please let me know if I should add anything else.
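One way to test the corruption suspicion without PyTorch: recent PyTorch versions write checkpoints as zip archives, and "failed finding central directory" usually indicates a truncated or partially written archive. The sketch below uses only the standard library. The helper name describe_checkpoint is made up for illustration, and a False result for is_zip is a hint rather than proof, since older non-zip checkpoint formats also report False.

```python
import os
import zipfile


def describe_checkpoint(path):
    """Report whether `path` exists, its size in bytes, and whether it is a readable zip archive."""
    if not os.path.isfile(path):
        return {"exists": False, "size": 0, "is_zip": False}
    return {
        "exists": True,
        "size": os.path.getsize(path),
        "is_zip": zipfile.is_zipfile(path),
    }


# Intended use (file name taken from the question):
# print(describe_checkpoint("last.ckpt.E342d53e"))
```

A file that exists but is unexpectedly small, or is not a valid zip, would suggest an interrupted write or an incomplete copy.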

Tensorflow load dataset: UnimplementedError: Append(absl::Cord) is not implemented [Op:TakeDataset]

I am trying to extract batches from my Tensorflow dataset using Tensorflow 2.4, and I get a very strange error:
--> 221 for batch, (input_seq, target_seq_in, target_seq_out) in enumerate(dataset.take(-1)):
222 # Train and get the loss value
223 loss, accuracy = train_step(input_seq, target_seq_in, target_seq_out, en_initial_states, optimizer)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py in take(self, count)
1417 Dataset: A `Dataset`.
1418 """
-> 1419 return TakeDataset(self, count)
1420
1421 def skip(self, count):
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py in __init__(self, input_dataset, count)
3856 input_dataset._variant_tensor, # pylint: disable=protected-access
3857 count=self._count,
-> 3858 **self._flat_structure)
3859 super(TakeDataset, self).__init__(input_dataset, variant_tensor)
3860
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_dataset_ops.py in take_dataset(input_dataset, count, output_types, output_shapes, name)
6608 return _result
6609 except _core._NotOkStatusException as e:
-> 6610 _ops.raise_from_not_ok_status(e, name)
6611 except _core._FallbackException:
6612 pass
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
6860 message = e.message + (" name: " + name if name is not None else "")
6861 # pylint: disable=protected-access
-> 6862 six.raise_from(core._status_to_exception(e.code, message), None)
6863 # pylint: enable=protected-access
6864
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
UnimplementedError: Append(absl::Cord) is not implemented [Op:TakeDataset]
My process is as follows:
dataset = tf.data.Dataset.from_tensor_slices((encoder_inputs, decoder_inputs, decoder_targets))
dataset = dataset.batch(batch_size, drop_remainder=True)
tf.data.experimental.save(dataset, save_path + 'dataset_' + str(index))
...
dataset = tf.data.experimental.load(folder_path +'dataset_'+str(index), (tf.TensorSpec(shape=(MAX_LEN,), dtype=tf.int64, name=None), tf.TensorSpec(shape=(MAX_LEN,), dtype=tf.int64, name=None), tf.TensorSpec(shape=(MAX_LEN,), dtype=tf.int64, name=None)))
I don't understand where this error could come from, and I wasn't able to find anything related.

Your stack trace seems to be missing the actual line that triggered the error, but I'm going to try to guess anyway.
The error seems related to the dataset being written to a file that already exists: it then tries to append to that file, but whatever it uses as a WritableFile did not override Append (see: https://github.com/tensorflow/tensorflow/blob/516ae286f6cc796e646d14671d94959b129130a4/tensorflow/core/platform/file_system.h#L783)
To continue with the wild guess, if this line:
tf.data.experimental.save(dataset, save_path + 'dataset_' + str(index))
is triggering the error, try something silly like changing the file name.
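To try that guess systematically, the save path could be chosen so that it never collides with an existing one. This is a sketch using only the standard library; fresh_save_path is a hypothetical helper and has nothing to do with the TensorFlow API.

```python
import os


def fresh_save_path(base):
    """Return `base` if nothing exists there, otherwise the first free 'base_N'."""
    if not os.path.exists(base):
        return base
    n = 1
    while os.path.exists(f"{base}_{n}"):
        n += 1
    return f"{base}_{n}"


# Intended use (not executed here):
# tf.data.experimental.save(dataset, fresh_save_path(save_path + 'dataset_' + str(index)))
```

If the error disappears with a fresh path, that would support the idea that an existing file was being appended to.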

CelebA Dataset inaccessible using tfds.load()

I am trying to use the CelebA dataset in a deep learning project. I have the zipped folder from Kaggle.
I wanted to unzip and then split the images into training, testing, and validation, but then found out that it would not be possible on my not-so-powerful system.
So, to avoid wasting time, I wanted to use the TensorFlow-datasets method to load the CelebA dataset. But unfortunately, the dataset is inaccessible with the following error:
(Code first)
ds = tfds.load('celeb_a', split='train', download=True)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-69-d7b9371eb674> in <module>
----> 1 ds = tfds.load('celeb_a', split='train', download=True)
c:\users\aman\appdata\local\programs\python\python38\lib\site-packages\tensorflow_datasets\core\load.py in load(name, split, data_dir, batch_size, shuffle_files, download, as_supervised, decoders, read_config, with_info, builder_kwargs, download_and_prepare_kwargs, as_dataset_kwargs, try_gcs)
344 if download:
345 download_and_prepare_kwargs = download_and_prepare_kwargs or {}
--> 346 dbuilder.download_and_prepare(**download_and_prepare_kwargs)
347
348 if as_dataset_kwargs is None:
c:\users\aman\appdata\local\programs\python\python38\lib\site-packages\tensorflow_datasets\core\dataset_builder.py in download_and_prepare(self, download_dir, download_config)
383 self.info.read_from_directory(self._data_dir)
384 else:
--> 385 self._download_and_prepare(
386 dl_manager=dl_manager,
387 download_config=download_config)
c:\users\aman\appdata\local\programs\python\python38\lib\site-packages\tensorflow_datasets\core\dataset_builder.py in _download_and_prepare(self, dl_manager, download_config)
1020 def _download_and_prepare(self, dl_manager, download_config):
1021 # Extract max_examples_per_split and forward it to _prepare_split
-> 1022 super(GeneratorBasedBuilder, self)._download_and_prepare(
1023 dl_manager=dl_manager,
1024 max_examples_per_split=download_config.max_examples_per_split,
c:\users\aman\appdata\local\programs\python\python38\lib\site-packages\tensorflow_datasets\core\dataset_builder.py in _download_and_prepare(self, dl_manager, **prepare_split_kwargs)
959 split_generators_kwargs = self._make_split_generators_kwargs(
960 prepare_split_kwargs)
--> 961 for split_generator in self._split_generators(
962 dl_manager, **split_generators_kwargs):
963 if str(split_generator.split_info.name).lower() == "all":
c:\users\aman\appdata\local\programs\python\python38\lib\site-packages\tensorflow_datasets\image\celeba.py in _split_generators(self, dl_manager)
137 all_images = {
138 os.path.split(k)[-1]: img for k, img in
--> 139 dl_manager.iter_archive(downloaded_dirs["img_align_celeba"])
140 }
141
c:\users\aman\appdata\local\programs\python\python38\lib\site-packages\tensorflow_datasets\core\download\download_manager.py in iter_archive(self, resource)
559 if isinstance(resource, six.string_types):
560 resource = resource_lib.Resource(path=resource)
--> 561 return extractor.iter_archive(resource.path, resource.extract_method)
562
563 def extract(self, path_or_paths):
c:\users\aman\appdata\local\programs\python\python38\lib\site-packages\tensorflow_datasets\core\download\extractor.py in iter_archive(path, method)
221 An iterator of `(path_in_archive, f_obj)`
222 """
--> 223 return _EXTRACT_METHODS[method](path)
KeyError: <ExtractMethod.NO_EXTRACT: 1>
Could someone explain what I am doing wrong?
On a side-note, if this does not work, is there a way to convert the already downloaded zipped file from Kaggle into the required format without unzipping and then iterating over each image individually? Basically, I cannot go down the unzip-then-split route for such a large dataset...
TIA!
EDIT: I tried the same on Colab, but getting a similar error:

It seems there is some sort of quota limit for downloading from GDrive. Go to the Google Drive link shown in the error and make a copy to your own drive. Alternatively, you can download the copy through libraries such as gdown or google_drive_downloader.

Upgrade tfds to the nightly version; that worked for me.
