I store some of the images in the database as 128-dimensional vector arrays. My problem is that when I add new images to the dataset or delete images from it, the script re-encodes and re-pickles the images that were saved previously; it does not know that their vector arrays are already in the file.
As a result, when there are a lot of pictures in the dataset, a lot of time is spent re-saving them to the pickle. How can I fix this?
├── dataset
│ ├── jack [10 entries]
│ ├── john [7 entries]
│ ├── mori [24 entries]
import argparse
import os
import pickle
import cv2
import face_recognition
from imutils import paths

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--dataset", required=True, help="path to input directory of faces + images")
ap.add_argument("-e", "--encodings", required=True, help="path to serialized db of facial encodings")
args = vars(ap.parse_args())

imagePaths = list(paths.list_images(args["dataset"]))
knownEncodings = []
knownNames = []
for (i, imagePath) in enumerate(imagePaths):
    name = imagePath.split(os.path.sep)[-2]
    image = cv2.imread(imagePath)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    boxes = face_recognition.face_locations(rgb)
    encodings = face_recognition.face_encodings(rgb, boxes)
    for encoding in encodings:
        knownEncodings.append(encoding)
        knownNames.append(name)

# dump ALL encodings and names to disk in a single pickle
data = {"encodings": knownEncodings, "names": knownNames}
f = open(args["encodings"], "wb")
f.write(pickle.dumps(data))
f.close()
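One way to avoid re-encoding everything on every run (a sketch, not the code above; the update_encodings helper and the extra "paths" key are my assumptions) is to load the existing pickle when it is present, skip images whose paths are already recorded, and append encodings only for new files:
import os
import pickle
import face_recognition
from imutils import paths

def update_encodings(dataset_dir, encodings_path):
    # Load whatever was encoded on a previous run, if anything
    if os.path.exists(encodings_path):
        with open(encodings_path, "rb") as f:
            data = pickle.load(f)
    else:
        data = {"encodings": [], "names": [], "paths": []}

    already_done = set(data["paths"])
    for imagePath in paths.list_images(dataset_dir):
        if imagePath in already_done:
            continue  # this image was encoded on a previous run
        name = imagePath.split(os.path.sep)[-2]
        rgb = face_recognition.load_image_file(imagePath)
        boxes = face_recognition.face_locations(rgb)
        for encoding in face_recognition.face_encodings(rgb, boxes):
            data["encodings"].append(encoding)
            data["names"].append(name)
            data["paths"].append(imagePath)

    # Drop entries whose source image has been deleted from the dataset
    keep = [i for i, p in enumerate(data["paths"]) if os.path.exists(p)]
    data = {k: [v[i] for i in keep] for k, v in data.items()}

    with open(encodings_path, "wb") as f:
        pickle.dump(data, f)
    return data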
The annotations from the COCO dataset look like this:
[{"segmentation": [[510.66,423.01,511.72,420.03,510.45,416.0,510.34,413.02,510.77,410.26,510.77,407.5,510.34,405.16,407.71,476.68,409.41,479.23,409.73,481.56,410.69,480.4,411.85,481.35,414.93,479.86]],"area": 702.1057499999998,"iscrowd": 0,"image_id": 289343,"bbox": [473.07,395.93,38.65,28.67],"category_id": 18,"id": 1768},{"segmentation": [[290.26,471.25,285.94,472.33,283.79,464.78,280.01,462.62,284.33,454.53,285.94,453.45,282.71,448.59,288.64,444.27,291.88,443.74]],"area": 27718.476299999995,"iscrowd": 0,"image_id": 61471,"bbox": [272.1,200.23,151.97,279.77],"category_id": 18,"id": 1773}, ....
I have already written the image arrays corresponding to these annotations to an HDF5 file with h5py. How do I put the annotations into the same file, given that they contain different data types? I am using this code from https://realpython.com/storing-images-in-python/#reading-many-images
import h5py
import numpy as np
from pathlib import Path

hdf5_dir = Path("data/hdf5")  # output directory for the .h5 file

def store_many_hdf5(images):
    """ Stores an array of images to HDF5.
        Parameters:
        ---------------
        images   images array, (N, 32, 32, 3) to be stored
        labels   labels array, (N, 1) to be stored (currently commented out below)
    """
    num_images = len(images)

    # Create a new HDF5 file
    file = h5py.File(hdf5_dir / f"{num_images}_many.h5", "w")

    # Create a dataset in the file
    dataset = file.create_dataset(
        "images", np.shape(images), h5py.h5t.STD_U8BE, data=images
    )
    # meta_set = file.create_dataset(
    #     "meta", np.shape(labels), h5py.h5t.STD_U8BE, data=labels
    # )
    file.close()
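For the annotations themselves, one option (a sketch, not the tutorial's code; store_images_and_annotations is a hypothetical helper and needs h5py >= 2.10 for string_dtype) is to serialize each annotation dict to a JSON string and keep those strings in a variable-length string dataset in the same file, optionally duplicating numeric fields such as bbox and category_id as ordinary datasets for direct access:
import json
import h5py
import numpy as np

def store_images_and_annotations(h5_path, images, annotations):
    """images: (N, H, W, 3) uint8 array; annotations: list of COCO annotation dicts."""
    with h5py.File(h5_path, "w") as f:
        f.create_dataset("images", data=np.asarray(images, dtype=np.uint8))

        # Variable-length UTF-8 strings hold the raw annotation JSON
        str_dt = h5py.string_dtype(encoding="utf-8")
        ann_json = np.array([json.dumps(a) for a in annotations], dtype=object)
        f.create_dataset("annotations_json", data=ann_json, dtype=str_dt)

        # Numeric fields can also live in plain datasets for fast access
        f.create_dataset("bbox", data=np.array([a["bbox"] for a in annotations], dtype=np.float64))
        f.create_dataset("category_id", data=np.array([a["category_id"] for a in annotations], dtype=np.int64))
        f.create_dataset("image_id", data=np.array([a["image_id"] for a in annotations], dtype=np.int64))

# Reading the annotations back (.asstr() needs h5py >= 3.0):
# with h5py.File(h5_path, "r") as f:
#     anns = [json.loads(s) for s in f["annotations_json"].asstr()]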
There are already two posts about this topic, but they have not been updated for the recent TF 2.1 release...
In brief, I've got a lot of tif images to read and parse with a specific pipeline.
import functools

import numpy as np
import tensorflow as tf

files = [...]   # a list of str (paths to the tif files)
labels = [...]  # a list of int
n_unique_label = len(np.unique(labels))

# generator yields (image, label) pairs; x1 and x2 are parser parameters
gen = functools.partial(generator, file_list=files, label_list=labels, param1=x1, param2=x2)

dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.float32, tf.int32))
dataset = dataset.map(lambda b, c: (b, tf.one_hot(c, depth=n_unique_label)))
This processing works well. Nevertheless, I need to parallelize the file-parsing part, so I tried the following solution:
files = [...]  # a list of str
files = tf.data.Dataset.from_tensor_slices(files)

def wrapper(file_path):
    parser = functools.partial(tif_parser, param1=x1, param2=x2)
    return tf.py_function(parser, inp=[file_path], Tout=[tf.float32])

dataset = files.map(wrapper, num_parallel_calls=2)
The difference is that here I parse one file at a time with the parser function. However, it does not work:
File "loader.py", line 643, in tif_parser
image = numpy.array(Image.open(file_path)).astype(float)
File "python3.7/site-packages/PIL/Image.py", line 2815, in open
fp = io.BytesIO(fp.read())
AttributeError: 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'read'
[[{{node EagerPyFunc}}]] [Op:IteratorGetNextSync]
As far as I understand, the tif_parser function does not receive a string but an (unevaluated) tensor. For now, this function is fairly simple:
def tif_parser(file_path, param1=1, param2=2):
    image = numpy.array(Image.open(file_path)).astype(float)
    image /= 255.0
    return image
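Indeed, inside tf.py_function the argument arrives as an EagerTensor rather than a plain Python string, which is why PIL fails on it. A minimal sketch of the same parser that converts the tensor first (the .numpy()/decode step is the only change):
def tif_parser(file_path, param1=1, param2=2):
    # file_path is an EagerTensor when called through tf.py_function;
    # turn it into a plain Python string before handing it to PIL
    if hasattr(file_path, "numpy"):
        file_path = file_path.numpy().decode("utf-8")
    image = numpy.array(Image.open(file_path)).astype(float)
    image /= 255.0
    return image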
Here is how I have proceeded:
dataset = tf.data.Dataset.from_tensor_slices((files, labels))

def wrapper(file_path, label):
    import functools
    parser = functools.partial(tif_parser, param1=x1, param2=x2)
    return tf.data.Dataset.from_generator(parser, (tf.float32, tf.int32), args=(file_path, label))

dataset = dataset.interleave(wrapper, cycle_length=tf.data.experimental.AUTOTUNE)
# The labels are converted to 1-hot vectors, could be integrated in tif_parser
dataset = dataset.map(lambda i, l: (i, tf.one_hot(l, depth=unique_label_count)))
dataset = dataset.shuffle(buffer_size=file_count, reshuffle_each_iteration=True)
dataset = dataset.batch(batch_size=batch_size, drop_remainder=False)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
Concretely, I generate a dataset every time the parser is called. The parser is run cycle_length times at each call, meaning that cycle_length images are read at once. This suits my specific case, because I cannot load all the images into memory. I am unsure whether prefetch is used correctly here.
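For completeness, once tif_parser decodes its argument itself (as in the sketch above), the plain map-based variant can also run in parallel; a minimal sketch that assumes 2-D tif images and reuses files, labels, x1, x2, n_unique_label, file_count and batch_size from the snippets above:
import functools
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((files, labels))

def wrapper(file_path, label):
    parser = functools.partial(tif_parser, param1=x1, param2=x2)
    # py_function hands the parser an EagerTensor; tif_parser decodes it itself
    image = tf.py_function(parser, inp=[file_path], Tout=tf.float32)
    image.set_shape([None, None])  # shape is lost by py_function; assumed 2-D tif
    return image, tf.one_hot(label, depth=n_unique_label)

dataset = (dataset
           .map(wrapper, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(buffer_size=file_count, reshuffle_each_iteration=True)
           .batch(batch_size, drop_remainder=False)
           .prefetch(tf.data.experimental.AUTOTUNE))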
I have built a DBSCAN clustering model, but the output result and the result obtained after loading the pkl files do not match.
Below, for the 1st record the cluster is 0.
But after running it from the 'pkl' file, the predicted result is [-1].
Dataframe:
HD    MC     WT    Cluster
200   Other  4.5   0
150   Pep    5.6   0
100   Pla    35    -1
50    Same   15    0
Code
import pickle
from collections import Counter

import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import DBSCAN

# df is the dataframe shown above

######## Label encoder for column MC ##############
le = preprocessing.LabelEncoder()
df['MC_encoded'] = le.fit_transform(df['MC'])

col_1 = ['HD', 'MC_encoded', 'WT']
data = df[col_1]
data = data.fillna(value=0)

######### DBSCAN Clustering ##################
model = DBSCAN(eps=7, min_samples=2).fit(data)
outliers_df = pd.DataFrame(data)
print(Counter(model.labels_))

######## Predict ###############
x = model.fit_predict(data)
df["Cluster"] = x

####### Create model pkl file and dump ################
filename1 = 'model.pkl'
model_df = open(filename1, 'wb')
pickle.dump(model, model_df)
model_df.close()

######## Create Encoder pkl file and dump ############
output = open('MC.pkl', 'wb')
pickle.dump(le, output)
output.close()

####### Load the model pkl file ##############
with open('model.pkl', 'rb') as file:
    pickle_model = pickle.load(file)

########## Load Encoder pkl file ############
pkl_file = open('MC.pkl', 'rb')
le_mc = pickle.load(pkl_file)
pkl_file.close()

######## Function to predict new data ##############
def testing(HD, MC, WT):
    test = {'HD': [HD], 'MC': [MC], 'WT': [WT]}
    test = pd.DataFrame(test)
    test['MC_encoded'] = le_mc.transform(test['MC'])
    pred_val = pickle_model.fit_predict(test[['HD', 'MC_encoded', 'WT']])
    print(pred_val)
    return pred_val

###### Predict with new observation ###########
pred_val = testing(200, 'Other', 4.5)
Resulting cluster
[-1]
Expected cluster
[0]
Clustering is not predictive.
If you want to classify new instances, use a classifier.
So in my opinion you are using it entirely on the wrong premises...
Nevertheless, your mistake is that you use the wrong function.
fit_predict literally means: discard the old model, fit a new one, and return its labels. This stems from a rather poor design choice in sklearn that conflates learning algorithms with the resulting models: a model should not have a fit method, and a training algorithm should not have a predict method, because at that point there is no model yet...
Now, if you fit on a dataset of fewer than min_samples points, they must all be noise (-1) by definition. You meant to use predict only, which does not exist, because DBSCAN does not predict labels for new data points.
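If the goal really is to assign new observations to the clusters DBSCAN found, one common workaround (a sketch, not part of the original code) is to train a classifier on the non-noise points, using the DBSCAN labels as the target, and call its predict on new data:
from sklearn.neighbors import KNeighborsClassifier

# data, model and le are the fitted objects from the question's code
labels = model.labels_
mask = labels != -1                    # drop the points DBSCAN marked as noise
clf = KNeighborsClassifier(n_neighbors=3).fit(data[mask].values, labels[mask])

# New observation, encoded the same way as the training data
new_point = [[200, le.transform(['Other'])[0], 4.5]]
print(clf.predict(new_point))          # expected to print [0] for this row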
I'm trying to use the TensorBoard embedding visualizer to represent a set of 7307 verb embeddings that I've just generated, but the plotted points disappear when I select the Enable 3D labels mode.
Here's my code:
import os
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector  # TF 1.x import

def plot(tsne_matrix, labels_path):
    PATH = os.getcwd()
    LOG_DIR = PATH
    metadata = os.path.join(LOG_DIR, labels_path)

    # Set up a 2D tensor that holds the embeddings
    words = tf.Variable(tsne_matrix, name="words")

    with tf.Session() as session:
        # Periodically save the model variables in a checkpoint in LOG_DIR.
        saver = tf.train.Saver([words])
        session.run(words.initializer)
        saver.save(session, os.path.join(LOG_DIR, "model.ckpt"))

        config = projector.ProjectorConfig()
        embedding = config.embeddings.add()
        embedding.tensor_name = words.name
        embedding.metadata_path = metadata

        summary_writer = tf.summary.FileWriter(LOG_DIR)
        projector.visualize_embeddings(summary_writer, config)
The metadata that I want to use consists of just the names of the embeddings (in my case, verbs). They are stored in a list inside a dictionary with other lists, so I'm using this function to write them to a .tsv file (the required format):
# Extract list of labels:
def labels2tsv(name, path):
    output = json2dict("output_parsed.json")
    if name == 'verbs':
        labels_list = list(output["verbs"].keys())
    elif name == 'objects':
        labels_list = list(output["objects"].keys())
    with open(path, 'w') as f:
        wr = csv.writer(f, delimiter='\t')
        wr.writerow(str(labels_list))
The code that I execute then is:
# obtain labels
labels2tsv('verbs', 'verbs_metadata.tsv')
labels2tsv('objects', 'objects_metadata.tsv')
# plotting
tsne_verbs = np.load('verbs_tsne.npy')
plot(tsne_verbs, "verbs_metadata.tsv")
Finally, I access TensorBoard through the command tensorboard --logdir=LOG_DIR.
The generated projector_config.pbtxt file (which is also in LOG_DIR) has the following content:
embeddings {
  tensor_name: "Variable:0"
  metadata_path: "verbs_metadata.tsv"
}
I guess that the points disappear because I'm not doing a correct metadata association, but I can't see the mistake. It also crashes on both Chrome and Firefox.
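One likely culprit is the metadata file itself: for single-column metadata the projector expects one label per line with no header, and the number of lines must equal the number of embedding rows (7307 here), whereas wr.writerow(str(labels_list)) writes the string representation of the whole list, character by character, onto a single row. A sketch of a writer that produces the expected layout, reusing json2dict and the label order from the question:
def labels2tsv(name, path):
    output = json2dict("output_parsed.json")
    labels_list = list(output[name].keys())  # name is 'verbs' or 'objects'
    # one label per line, in the same order as the rows of the embedding matrix
    with open(path, 'w') as f:
        f.write('\n'.join(labels_list) + '\n')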
I have a LIBSVM scaling model (generated with svm-scale) that I would like to port over to PySpark. I've naively tried the following:
scaler_path = "path to model"
a = MinMaxScaler().load(scaler_path)
But an error is thrown, complaining about a missing metadata directory:
Py4JJavaErrorTraceback (most recent call last)
<ipython-input-22-1942e7522174> in <module>()
----> 1 a = MinMaxScaler().load(scaler_path)
/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/util.pyc in load(cls, path)
226 def load(cls, path):
227 """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 228 return cls.read().load(path)
229
230
/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/util.pyc in load(self, path)
174 if not isinstance(path, basestring):
175 raise TypeError("path should be a basestring, got type %s" % type(path))
--> 176 java_obj = self._jread.load(path)
177 if not hasattr(self._clazz, "_from_java"):
178 raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"
/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.pyc in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/srv/data/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/usr/local/lib/python2.7/dist-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o321.load.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:[filename]/metadata
Is there a simple workaround for loading this? The format of the LIBSVM model is:
x
0 1
1 -1050 1030
2 0 1
3 0 3
4 0 1
5 0 1
First, the file presented isn't in libsvm format. The correct format of a libsvm file is the following:
<label> <index1>:<value1> <index2>:<value2> ... <indexN>:<valueN>
Thus your data preparation is incorrect to start with.
Secondly, the class method load(path) that you are using with MinMaxScaler reads an ML instance from the input path.
Remember that MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel. The model can then transform each feature individually so that it is in the given range.
e.g.:
from pyspark.ml.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.feature import MinMaxScaler
df = spark.createDataFrame([(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])) ,(0.0, Vectors.dense([1.01, 2.02, 3.03]))],['label','features'])
df.show(truncate=False)
# +-----+---------------------+
# |label|features |
# +-----+---------------------+
# |1.1 |(3,[0,2],[1.23,4.56])|
# |0.0 |[1.01,2.02,3.03] |
# +-----+---------------------+
mmScaler = MinMaxScaler(inputCol="features", outputCol="scaled")
temp_path = "/tmp/spark/"
minMaxScalerPath = temp_path + "min-max-scaler"
mmScaler.save(minMaxScalerPath)
The snippet above will save the MinMaxScaler feature transformer so that it can be loaded later with the class method load.
Now, let's take a look at what actually happened. The class method save creates the following file structure:
/tmp/spark/
└── min-max-scaler
└── metadata
├── part-00000
└── _SUCCESS
Let's check the content of that part-00000 file:
$ cat /tmp/spark/min-max-scaler/metadata/part-00000 | python -m json.tool
{
"class": "org.apache.spark.ml.feature.MinMaxScaler",
"paramMap": {
"inputCol": "features",
"max": 1.0,
"min": 0.0,
"outputCol": "scaled"
},
"sparkVersion": "2.0.0",
"timestamp": 1480501003244,
"uid": "MinMaxScaler_42e68455a929c67ba66f"
}
So when you load the transformer:
loadedMMScaler = MinMaxScaler.load(minMaxScalerPath)
you are actually loading that file. It won't accept a libsvm file!
Now you can apply your transformer to create the model and transform your DataFrame:
model = loadedMMScaler.fit(df)
model.transform(df).show(truncate=False)
# +-----+---------------------+-------------+
# |label|features |scaled |
# +-----+---------------------+-------------+
# |1.1 |(3,[0,2],[1.23,4.56])|[1.0,0.0,1.0]|
# |0.0 |[1.01,2.02,3.03] |[0.0,1.0,0.0]|
# +-----+---------------------+-------------+
Now let's get back to that libsvm file. Let's create some dummy data and save it in libsvm format using MLUtils:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.util import MLUtils
data = sc.parallelize([LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])), LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))])
MLUtils.saveAsLibSVMFile(data, temp_path + "data")
Back to our file structure:
/tmp/spark/
├── data
│ ├── part-00000
│ ├── part-00001
│ ├── part-00002
│ ├── part-00003
│ ├── part-00004
│ ├── part-00005
│ ├── part-00006
│ ├── part-00007
│ └── _SUCCESS
└── min-max-scaler
└── metadata
├── part-00000
└── _SUCCESS
You can check the content of those files, which are now in libsvm format:
$ cat /tmp/spark/data/part-0000*
1.1 1:1.23 3:4.56
0.0 1:1.01 2:2.02 3:3.03
Now let's load that data and apply the model:
loadedData = MLUtils.loadLibSVMFile(sc, temp_path + "data")
loadedDataDF = spark.createDataFrame(loadedData.map(lambda lp : (lp.label, lp.features.asML())), ['label','features'])
loadedDataDF.show(truncate=False)
# +-----+----------------------------+
# |label|features |
# +-----+----------------------------+
# |1.1 |(3,[0,2],[1.23,4.56]) |
# |0.0 |(3,[0,1,2],[1.01,2.02,3.03])|
# +-----+----------------------------+
Note that converting MLlib Vectors to ML Vectors is very important. You can read more about it here.
model.transform(loadedDataDF).show(truncate=False)
# +-----+----------------------------+-------------+
# |label|features |scaled |
# +-----+----------------------------+-------------+
# |1.1 |(3,[0,2],[1.23,4.56]) |[1.0,0.0,1.0]|
# |0.0 |(3,[0,1,2],[1.01,2.02,3.03])|[0.0,1.0,0.0]|
# +-----+----------------------------+-------------+
I hope that this answers your question!