I'm trying to benchmark clustering across various frameworks, but when porting scikit-learn code from Python to Julia, I can't even get it to work. Here is the code:
using PyCall
Train = rand(Float64, 1611, 10)
py"""
def Silhouette_py(Train, k):
    from sklearn.metrics import silhouette_score
    from sklearn.cluster import KMeans
    model = KMeans(n_clusters=k)
    return silhouette_score(Train, model.labels_)
"""
function test(Train, k)
    py"Silhouette_py"(Train, k)
end
The following code leads to an error:
julia> test(Train, 3)
ERROR: PyError ($(Expr(:escape, :(ccall(#= C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:43 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'AttributeError'>
AttributeError("'KMeans' object has no attribute 'labels_'")
File "C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyeval.jl", line 5, in Silouhette_py
const _namespaces = Dict{Module,PyDict{String,PyObject,true}}()
^^^^^^^^^^^^^
Stacktrace:
[1] pyerr_check
@ C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\exception.jl:62 [inlined]
[2] pyerr_check
@ C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\exception.jl:66 [inlined]
[3] _handle_error(msg::String)
@ PyCall C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\exception.jl:83
[4] macro expansion
@ C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\exception.jl:97 [inlined]
[5] #107
@ C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:43 [inlined]
[6] disable_sigint
@ .\c.jl:473 [inlined]
[7] __pycall!
@ C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:42 [inlined]
[8] _pycall!(ret::PyObject, o::PyObject, args::Tuple{Matrix{Float64}, Int64}, nargs::Int64, kw::Ptr{Nothing})
@ PyCall C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:29
[9] _pycall!(ret::PyObject, o::PyObject, args::Tuple{Matrix{Float64}, Int64}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ PyCall C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:11
[10] (::PyObject)(::Matrix{Float64}, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ PyCall C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:86
[11] (::PyObject)(::Matrix{Float64}, ::Vararg{Any})
@ PyCall C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:86
[12] t(Train::Matrix{Float64}, k::Int64)
@ Main .\REPL[12]:2
[13] top-level scope
@ REPL[20]:1
My libpython and related configuration:
julia> PyCall.libpython
"C:\\Users\\Shayan\\AppData\\Local\\Programs\\Python\\Python311\\python311.dll"
julia> PyCall.pyversion
v"3.11.0"
julia> PyCall.current_python()
"C:\\Users\\Shayan\\AppData\\Local\\Programs\\Python\\Python311\\python.exe"
Further tests
However, if I run the steps directly in the REPL:
julia> sk = pyimport("sklearn")
julia> model = sk.cluster.KMeans(3)
PyObject KMeans(n_clusters=3)
julia> model.fit(Train)
sys:1: ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (3). Possibly due to duplicate points in X.
PyObject KMeans(n_clusters=3)
julia> model.labels_
1611-element Vector{Int32}:
0
0
0
0
0
0
⋮
But I need it to work in a function. As you can see, it doesn't throw AttributeError("'KMeans' object has no attribute 'labels_'") anymore in this case.
It seems this would work:
KMeans = pyimport("sklearn.cluster").KMeans
silhouette_score = pyimport("sklearn.metrics").silhouette_score
Train = rand(Float64, 1611, 10);
function test(Train, k)
    model = KMeans(k)
    model.fit(Train)
    return silhouette_score(Train, model.labels_)
end
julia> test(Train, 3)
0.7885442174636309
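For reference, the AttributeError in the original Python block comes from reading `model.labels_` before the model has been fitted; scikit-learn estimators only set their trailing-underscore attributes during `fit`. A minimal sketch of the Python side with the fit step added (the name `silhouette_py` and the `n_init` and `random_state` arguments are my own choices, not from the question):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_py(train, k):
    # fit_predict fits the model and returns the cluster label of each row,
    # so labels_ is never read from an unfitted estimator.
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(train)
    return float(silhouette_score(train, labels))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.random((200, 10))  # stand-in for the question's random matrix
    print(silhouette_py(train, 3))
```

The same function body should also work inside a PyCall `py"""` block, since the only change is that the model is fitted before its labels are used.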
[] : this indicates a batch. For example, if the batch size is 5, then the batch will look something like this [1,4,7,4,2]. The length of [] indicates the batch size.
I want to build a training set that looks something like this:
[1] -> [1] -> [1] -> [1] -> [1] -> [7] -> [7] -> [7] -> [7] -> [7] -> [3] -> [3] -> [3] -> [3] -> [3] -> ... and so on
That is, first five 1s (batch size = 1), then five 7s (batch size = 1), then five 3s (batch size = 1), and so on.
Can someone please give me an idea of how to do this?
It would be very helpful if someone could explain how to implement this with code.
Thank you! :)
If you want a DataLoader where you just define the class label for each sample, then you can make use of the torch.utils.data.Subset class. Despite its name, it doesn't necessarily need to define a subset of a dataset. For example:
import torch
import torchvision
import torchvision.transforms as T
from itertools import cycle
mnist = torchvision.datasets.MNIST(root='./', train=True, transform=T.ToTensor())
# not sure what "...and so on" implies, but define this list however you like
target_classes = [1, 1, 1, 1, 1, 7, 7, 7, 7, 7, 3, 3, 3, 3, 3]
# create cyclic iterators of indices for each class in MNIST
indices = dict()
for label in torch.unique(mnist.targets).tolist():
    indices[label] = cycle(torch.nonzero(mnist.targets == label).flatten().tolist())
# define the order of indices in the new mnist subset based on target_classes
new_indices = []
for t in target_classes:
    new_indices.append(next(indices[t]))
# create a Subset of MNIST based on new_indices
mnist_modified = torch.utils.data.Subset(mnist, new_indices)
dataloader = torch.utils.data.DataLoader(mnist_modified, batch_size=1, shuffle=False)
for idx, (x, y) in enumerate(dataloader):
    # training loop
    print(f'Batch {idx+1} labels: {y.tolist()}')
If you want a DataLoader that returns five samples in a row of the same class, but you don't want to define the class for each index manually, then you can create a custom sampler. For example:
import torch
import torchvision
import torchvision.transforms as T
from itertools import cycle
class RepeatClassSampler(torch.utils.data.Sampler):
    def __init__(self, targets, repeat_count, length, shuffle=False):
        if not torch.is_tensor(targets):
            targets = torch.tensor(targets)
        self.targets = targets
        self.repeat_count = repeat_count
        self.length = length
        self.shuffle = shuffle
        self.classes = torch.unique(targets).tolist()
        # indices of all samples belonging to each class
        self.class_indices = dict()
        for label in self.classes:
            self.class_indices[label] = torch.nonzero(targets == label).flatten()

    def __iter__(self):
        class_index_iters = dict()
        for label in self.classes:
            if self.shuffle:
                # permute over this class's own indices, not over the number of classes
                perm = torch.randperm(len(self.class_indices[label]))
                class_index_iters[label] = cycle(self.class_indices[label][perm].tolist())
            else:
                class_index_iters[label] = cycle(self.class_indices[label].tolist())
        if self.shuffle:
            target_iter = cycle(self.targets[torch.randperm(len(self.targets))].tolist())
        else:
            target_iter = cycle(self.targets.tolist())

        def index_generator():
            for i in range(self.length):
                # pick a new class every repeat_count samples
                if i % self.repeat_count == 0:
                    current_class = next(target_iter)
                yield next(class_index_iters[current_class])

        return index_generator()

    def __len__(self):
        return self.length
mnist = torchvision.datasets.MNIST(root='./', train=True, transform=T.ToTensor())
dataloader = torch.utils.data.DataLoader(
mnist,
batch_size=1,
sampler=RepeatClassSampler(
targets=mnist.targets,
repeat_count=5,
length=15, # How many total to pick from your dataset
shuffle=True))
for idx, (x, y) in enumerate(dataloader):
    # training loop
    print(f'Batch {idx+1} labels: {y.tolist()}')
I am doing deep learning using Keras in RStudio. I copied and pasted the code from this tutorial: https://tensorflow.rstudio.com/tutorials/beginners/basic-ml/tutorial_basic_regression/
boston_housing <- dataset_boston_housing()
c(train_data, train_labels) %<-% boston_housing$train
c(test_data, test_labels) %<-% boston_housing$test
paste0("Training entries: ", length(train_data), ", labels: ", length(train_labels))
train_data[1, ] # Display sample features, notice the different scales
library(dplyr)
column_names <- c('CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT')
train_df <- train_data %>%
as_tibble(.name_repair = "minimal") %>%
setNames(column_names) %>%
mutate(label = train_labels)
test_df <- test_data %>%
as_tibble(.name_repair = "minimal") %>%
setNames(column_names) %>%
mutate(label = test_labels)
train_labels[1:10] # Display first 10 entries
spec <- feature_spec(train_df, label ~ . ) %>%
step_numeric_column(all_numeric(), normalizer_fn = scaler_standard())
spec <- fit(spec)
layer <- layer_dense_features(
feature_columns = dense_features(spec),
dtype = tf$float32
)
layer(train_df)
layer(train_df)
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: ('We expected a dictionary here. Instead we got: ', CRIM ZN INDUS CHAS NOX ... TAX PTRATIO B LSTAT label
0 1.23247 0.0 8.14 0.0 0.5380 ... 307.0 21.0 396.90 18.72 15.2
1 0.02177 82.5 2.03 0.0 0.4150 ... 348.0 14.7 395.38 3.11 42.3
**sessionInfo()**
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Chile.1252 LC_CTYPE=Spanish_Chile.1252 LC_MONETARY=Spanish_Chile.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Chile.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.8.5 tfdatasets_2.0.0 keras_2.2.5.0 tensorflow_2.0.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 pillar_1.4.3 compiler_3.6.3 prettyunits_1.1.1 base64enc_0.1-3 tools_3.6.3
[7] progress_1.2.2 zeallot_0.1.0 digest_0.6.25 packrat_0.5.0 jsonlite_1.6.1 evaluate_0.14
[13] tibble_2.1.3 pkgconfig_2.0.3 rlang_0.4.5 cli_2.0.2 rstudioapi_0.11 yaml_2.2.1
[19] xfun_0.12 knitr_1.28 generics_0.0.2 vctrs_0.2.4 rappdirs_0.3.1 hms_0.5.3
[25] tidyselect_1.0.0 reticulate_1.14 glue_1.3.2 forge_0.2.0 R6_2.4.1 fansi_0.4.1
[31] rmarkdown_2.1 purrr_0.3.3 magrittr_1.5 whisker_0.4 tfestimators_1.9.1 tfruns_1.4
[37] htmltools_0.4.0 assertthat_0.2.1 crayon_1.3.4
Can you please try the fix mentioned here?
I have also included the solution below in case the link breaks:
To install the fix you should be sure to close all R sessions then open a fresh R session and execute:
devtools::install_github("rstudio/reticulate")
The reason you need to close all R sessions is that Windows shared libraries won't be successfully overwritten if they are in use during the installation.
Hope this works and fixes the issue you are facing.
I am getting an unhashable type: 'numpy.ndarray' error, so I cast the 'Views' column of df_subset to int. However, its dtype is still object.
Here is the script:
tsne = TSNE(n_components=2, verbose=1, perplexity=20, n_iter=1000)
tsne_results = tsne.fit_transform(logits_list)
df_subset = pd.DataFrame({'X':tsne_results[:,0], 'Y':tsne_results[:,1], 'Views':targets})
print(df_subset)
df_subset.astype({'Views': 'int'}).dtypes
print(df_subset.dtypes)
colors = {'A2CH':'red', 'A3CH':'green', 'A4CH_LV':'blue', 'A4CH_RV':'cyan', 'A5CH':'magenta', 'Apical_MV_LA_IAS':'yellow',
'PLAX_TV':'black', 'PLAX_full':'white', 'PLAX_valves':'orange', 'PSAX_AV':'purple', 'PSAX_LV':'dodgerblue', 'Subcostal_IVC':'lightgreen', 'Subcostal_heart':'darkcyan', 'Suprasternal':'grey'}
ax = sns.scatterplot(x= "X", y= "Y", hue='Views', legend = 'full',palette = colors, data=df_subset)
plt.show()
Here is a print of df_subset and its dtypes:
X Y Views
0 13.208739 -19.657906 [11]
1 7.932375 -31.547863 [6]
2 -3.896450 -23.075047 [9]
3 -11.836237 -12.138339 [9]
4 -8.077571 17.220371 [11]
5 9.463497 23.756912 [2]
6 8.354083 -47.790867 [10]
7 -2.848731 -0.220144 [9]
8 25.724466 -29.862696 [9]
9 -26.956612 -8.361418 [9]
10 -16.011475 2.309184 [7]
11 16.193329 -0.280985 [8]
12 5.060284 -9.906323 [9]
13 37.827713 -16.174528 [4]
14 -5.971475 -39.845860 [7]
15 6.608039 9.085782 [12]
16 -20.108206 -26.253906 [8]
17 32.851559 0.332044 [2]
18 23.818949 13.762548 [2]
19 23.625357 -12.107020 [3]
X float32
Y float32
Views object
dtype: object
I assume I am getting the unhashable type: 'numpy.ndarray' error because of the object dtype? Any help would be appreciated.
.astype() returns a copy rather than modifying the DataFrame in place, so it should work if you assign the result back:
df_subset = df_subset.astype({'Views': int})
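The printed frame shows each Views entry as a one-element array (`[11]`, `[6]`, ...), which is likely why the column is `object` and why hashing the values fails. One option, sketched below under the assumption that `targets` is a list of one-element NumPy arrays (the question does not show how it is built), is to flatten the labels before constructing the DataFrame. The sample values are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's t-SNE output and labels.
tsne_results = np.array(
    [[13.2, -19.7], [7.9, -31.5], [-3.9, -23.1]], dtype=np.float32
)
targets = [np.array([11]), np.array([6]), np.array([9])]

# Concatenating the one-element arrays yields a flat 1-D integer array,
# so the column holds plain integers instead of array objects.
views = np.concatenate(targets)

df_subset = pd.DataFrame(
    {"X": tsne_results[:, 0], "Y": tsne_results[:, 1], "Views": views}
)
print(df_subset.dtypes)
```

Note that the `colors` palette in the question is keyed by class names, so integer labels would presumably still need to be mapped to those names before plotting.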
I'm trying to map a tuple to a tuple in a dataset in TF 2 (please see the code below). My output (also below) shows that the map function is only called once, and I cannot seem to get at the contents of the tuple.
How do I get at the "a", "b", "c" from the input parameter, which is:
tuple Tensor("args_0:0", shape=(3,), dtype=string)
type <class 'tensorflow.python.framework.ops.Tensor'>
Edit: it seems that using Dataset.from_tensor_slices produces the data all at once, which would explain why map is only called once. So I probably need to make the dataset in some other way.
from __future__ import absolute_import, division, print_function, unicode_literals
from timeit import default_timer as timer
print('import tensorflow')
start = timer()
import tensorflow as tf
end = timer()
print('Elapsed time: ' + str(end - start),"for",tf.__version__)
import numpy as np
def map1(tuple):
    print("<<<")
    print("tuple",tuple)
    print("type",type(tuple))
    print("shape",tuple.shape)
    print("tuple 0",tuple[0])
    print("type 0",type(tuple[0]))
    print("shape 0",tuple.shape[0])
    # how do i get "a","b","c" from the input parameter?
    print(">>>")
    return ("1","2","3")
l=[]
l.append(("a","b","c"))
l.append(("d","e","f"))
print(l)
ds=tf.data.Dataset.from_tensor_slices(l)
print("ds",ds)
print("start mapping")
result = ds.map(map1)
print("end mapping")
$ py mapds.py
import tensorflow
Elapsed time: 12.002168990751619 for 2.0.0
[('a', 'b', 'c'), ('d', 'e', 'f')]
ds <TensorSliceDataset shapes: (3,), types: tf.string>
start mapping
<<<
tuple Tensor("args_0:0", shape=(3,), dtype=string)
type <class 'tensorflow.python.framework.ops.Tensor'>
shape (3,)
tuple 0 Tensor("strided_slice:0", shape=(), dtype=string)
type 0 <class 'tensorflow.python.framework.ops.Tensor'>
shape 0 3
>>>
end mapping
The value or values returned by the map function (map1) determine the structure of each element in the returned dataset. [Ref]
In your case, result is a tf.data dataset and there is nothing wrong with your code.
To check whether every tuple is mapped correctly, you can iterate over every sample of your dataset as follows:
[Updated Code]
def map1(tuple):
    print(tuple[0].numpy().decode("utf-8")) # Print first element of tuple
    return ("1","2","3")
l=[]
l.append(("a","b","c"))
l.append(("d","e","f"))
ds=tf.data.Dataset.from_tensor_slices(l)
ds = ds.map(lambda tpl: tf.py_function(map1, [tpl], [tf.string, tf.string, tf.string]))
for sample in ds:
    print(str(sample[0].numpy().decode()), sample[1].numpy().decode(), sample[2].numpy().decode())
Output:
a
1 2 3
d
1 2 3
Hope it will help.
I'm trying to speed up a comparison between two point clouds. I have some code which took up to an hour to complete, so I've cut it down to this and tried to apply numba. The code works, with the exception of the scipy cdist function. It's my first time using numba; where am I going wrong?
from numba import jit
from scipy.spatial.distance import cdist
@jit(nopython=True)
def near_dist_top(T, B):
    xi = [i[0] for i in T]
    yi = [i[1] for i in T]
    zi = [i[2] for i in T]
    XB = B
    insert_params = []
    for i in range(len(T)):
        XA = [T[i]]
        # distance from point i in T to its nearest point in B
        disti = cdist(XA, XB, metric='euclidean').min()
        insert_params.append((xi[i], yi[i], zi[i], disti))
        # print("Top: " + str(i) + " of " + str(len(T)))
        print(i)
    return insert_params
print(XB)
### Edits ###
Both T and B are lists of coordinates, for example:
(580992.507, 4275268.8321, 192.4599), (580992.507, 4275268.8391, 192.4209), (580992.507, 4275268.8391, 192.4209)
Hmm, does numba handle lists? Does it need to be a numpy array? Would cdist handle a numpy array?
The error
numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'cdist': cannot determine Numba type of <class 'function'>
File "scratch_28.py", line 132:
def near_dist_top(T, B):
<source elided>
XA = [T[i]]
disti = cdist(XA, XB, metric='euclidean').min()
^
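The error says that nopython mode has no type information for `cdist`, so the call cannot be compiled as written. One way to avoid the per-point Python loop, sketched here with plain NumPy broadcasting rather than Numba or SciPy (a swapped-in technique, not a fix to the decorated function), is to compute all pairwise distances at once. The function name `nearest_distances` is hypothetical, and the sketch assumes a `len(T) x len(B) x 3` array of differences fits in memory.

```python
import numpy as np


def nearest_distances(T, B):
    """Return rows of (x, y, z, distance to nearest point in B) for each point in T."""
    T = np.asarray(T, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    # Pairwise differences via broadcasting: shape (len(T), len(B), 3).
    diff = T[:, None, :] - B[None, :, :]
    # Euclidean distance between every point in T and every point in B.
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Smallest distance per point in T.
    nearest = dist.min(axis=1)
    return np.column_stack((T, nearest))


if __name__ == "__main__":
    T = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    B = [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0)]
    print(nearest_distances(T, B))
```

For clouds too large to hold the full difference array, the same computation could be run over slices of T and the results concatenated.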