survcomp package, y and z variables - survival-analysis

I am using the survcomp package and am wondering about the y and z values. I have several columns of clinical data:
> colnames(ClinicalDataHep)
[1] "follow_upTime"
[2] "RecurrenceTime"
[3] "Age"
[4] "OS"
[5] "Survival_dead0_alive1"
[6] "Tumour_size"
[7] "HVB_preop"
[8] "HCV_preop"
[9] "HBD_preop"
[10] "Cirrhosis_preop"
[11] "Status:_no_recurrence-0._recurrence-1_"
[12] "Surgery:_resection-1._tx-2;_rfa-3;_resection+rfa-4;tx+rfa-5"
[13] "new_time"
[14] "new_death"
[15] "death_event"
Is it correct to use overall survival (OS) as the y variable and dead/alive as the z variable?
cindexall.Hep.serum <- as.data.frame(t(apply(X=matrix_cpm, MARGIN=1, function(x, y, z) {
    tt <- concordance.index(x=x, surv.time=y, surv.event=z, method="noether", na.rm=TRUE);
    return(c("cindex"=tt$c.index, "cindex.se"=tt$se, "lower"=tt$lower, "upper"=tt$upper, "p.value"=tt$p.value)); },
    y=ClinicalData$OS, z=ClinicalData$Survival_dead0_alive1)))

Related

Utilizing Scikit-learn with a Python 3.11 path in Julia

I'm trying to benchmark clustering across various frameworks, but in the case of calling Scikit-learn from Julia, I can't even make it work. Here is the code:
using PyCall

Train = rand(Float64, 1611, 10)

py"""
def Silhouette_py(Train, k):
    from sklearn.metrics import silhouette_score
    from sklearn.cluster import KMeans
    model = KMeans(n_clusters=k)
    return silhouette_score(Train, model.labels_)
"""

function test(Train, k)
    py"Silhouette_py"(Train, k)
end
The following code leads to an error:
julia> test(Train, 3)
ERROR: PyError ($(Expr(:escape, :(ccall(#= C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:43 =# #pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'AttributeError'>
AttributeError("'KMeans' object has no attribute 'labels_'")
File "C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyeval.jl", line 5, in Silouhette_py
const _namespaces = Dict{Module,PyDict{String,PyObject,true}}()
^^^^^^^^^^^^^
Stacktrace:
[1] pyerr_check
# C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\exception.jl:62 [inlined]
[2] pyerr_check
# C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\exception.jl:66 [inlined]
[3] _handle_error(msg::String)
# PyCall C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\exception.jl:83
[4] macro expansion
# C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\exception.jl:97 [inlined]
[5] #107
# C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:43 [inlined]
[6] disable_sigint
# .\c.jl:473 [inlined]
[7] __pycall!
# C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:42 [inlined]
[8] _pycall!(ret::PyObject, o::PyObject, args::Tuple{Matrix{Float64}, Int64}, nargs::Int64, kw::Ptr{Nothing})
# PyCall C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:29
[9] _pycall!(ret::PyObject, o::PyObject, args::Tuple{Matrix{Float64}, Int64}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
# PyCall C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:11
[10] (::PyObject)(::Matrix{Float64}, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(),
Tuple{}}})
# PyCall C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:86
[11] (::PyObject)(::Matrix{Float64}, ::Vararg{Any})
# PyCall C:\Users\Shayan\.julia\packages\PyCall\ygXW2\src\pyfncall.jl:86
[12] t(Train::Matrix{Float64}, k::Int64)
# Main .\REPL[12]:2
[13] top-level scope
# REPL[20]:1
The libpython and related configuration:
julia> PyCall.libpython
"C:\\Users\\Shayan\\AppData\\Local\\Programs\\Python\\Python311\\python311.dll"
julia> PyCall.pyversion
v"3.11.0"
julia> PyCall.current_python()
"C:\\Users\\Shayan\\AppData\\Local\\Programs\\Python\\Python311\\python.exe"
Further tests
But if I say:
julia> sk = pyimport("sklearn")
julia> model = sk.cluster.KMeans(3)
PyObject KMeans(n_clusters=3)
julia> model.fit(Train)
sys:1: ConvergenceWarning: Number of distinct clusters (1) found smaller than n_clusters (3). Possibly due to duplicate points in X.
PyObject KMeans(n_clusters=3)
julia> model.labels_
1611-element Vector{Int32}:
0
0
0
0
0
0
⋮
But I need this to work inside a function. As you can see, it no longer throws AttributeError("'KMeans' object has no attribute 'labels_'") in this case.
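The traceback points at the embedded Python helper itself: a scikit-learn KMeans instance only gains a labels_ attribute after it has been fitted, and the py""" block above never calls fit. A minimal sketch of the corrected helper, shown here as plain Python (the same body can be pasted back into the py""" block):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def Silhouette_py(Train, k):
    model = KMeans(n_clusters=k)
    # fit_predict both fits the model and returns the cluster labels,
    # so the labels exist by the time the score is computed
    labels = model.fit_predict(Train)
    return silhouette_score(Train, labels)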
It seems this would work:
KMeans = pyimport("sklearn.cluster").KMeans
silhouette_score = pyimport("sklearn.metrics").silhouette_score
Train = rand(Float64, 1611, 10);

function test(Train, k)
    model = KMeans(k)
    model.fit(Train)
    return silhouette_score(Train, model.labels_)
end
julia> test(Train, 3)
0.7885442174636309

ValueError: Invalid classes inferred from unique values of `y`. Expected: [0 1 2 ... 1387 1388 1389], got [0 1 2 ... 18609 24127 41850]

Situation: I am trying to use the XGBoost classifier; however, this error pops up:
"ValueError: Invalid classes inferred from unique values of y. Expected: [0 1 2 ... 1387 1388 1389], got [0 1 2 ... 18609 24127 41850]".
Unlike this solved question (Invalid classes inferred from unique values of `y`. Expected: [0 1 2 3 4 5], got [1 2 3 4 5 6]), I seem to have a different scenario: my labels already start from 0 but are not consecutive.
Code:
X = data_concat
y = data_concat[['forward_count','comment_count','like_count']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=72)
#Train, test split
print ('Train set:', X_train.shape, y_train.shape) #Check the size after split
print ('Test set:', X_test.shape, y_test.shape)
xgb = XGBClassifier()
clf = xgb.fit(X_train, y_train, eval_metric='auc') #HERE IS WHERE GET THE ERROR
The DataFrame and its info are shown in the attached screenshots (DataFrame, DataFrame Info).
I have tried different y: when y has fewer or more columns, the expected list "[0 1 2 ... 1387 1388 1389]" shrinks or grows accordingly.
If you need further info, please let me know. Appreciate your help :)
You need to transform the y_train values to fit XGBoost: it expects class labels that start from 0 (not 1) and run consecutively.
Here is the code:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)
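For completeness, a small usage sketch of how the re-encoded labels fit together with the classifier and how to recover the original label values afterwards (the names X_train, X_test and le are taken from the code above; this is an illustration, not the poster's exact pipeline):

from xgboost import XGBClassifier

# the encoded labels are now consecutive integers 0..n_classes-1, which XGBoost expects
clf = XGBClassifier()
clf.fit(X_train, y_train)

# map the encoded predictions back to the original label values
pred_encoded = clf.predict(X_test)
pred_original = le.inverse_transform(pred_encoded)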

Keras Error in py_call_impl(callable, dots$args, dots$keywords) :

I am doing deep learning using Keras in RStudio. I copied and pasted the code from this tutorial: https://tensorflow.rstudio.com/tutorials/beginners/basic-ml/tutorial_basic_regression/
boston_housing <- dataset_boston_housing()
c(train_data, train_labels) %<-% boston_housing$train
c(test_data, test_labels) %<-% boston_housing$test
paste0("Training entries: ", length(train_data), ", labels: ", length(train_labels))
train_data[1, ] # Display sample features, notice the different scales
library(dplyr)

column_names <- c('CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                  'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT')

train_df <- train_data %>%
  as_tibble(.name_repair = "minimal") %>%
  setNames(column_names) %>%
  mutate(label = train_labels)

test_df <- test_data %>%
  as_tibble(.name_repair = "minimal") %>%
  setNames(column_names) %>%
  mutate(label = test_labels)

train_labels[1:10] # Display first 10 entries

spec <- feature_spec(train_df, label ~ . ) %>%
  step_numeric_column(all_numeric(), normalizer_fn = scaler_standard())

spec <- fit(spec)

layer <- layer_dense_features(
  feature_columns = dense_features(spec),
  dtype = tf$float32
)
layer(train_df)
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: ('We expected a dictionary here. Instead we got: ', CRIM ZN INDUS CHAS NOX ... TAX PTRATIO B LSTAT label
0 1.23247 0.0 8.14 0.0 0.5380 ... 307.0 21.0 396.90 18.72 15.2
1 0.02177 82.5 2.03 0.0 0.4150 ... 348.0 14.7 395.38 3.11 42.3
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Chile.1252 LC_CTYPE=Spanish_Chile.1252 LC_MONETARY=Spanish_Chile.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Chile.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.8.5 tfdatasets_2.0.0 keras_2.2.5.0 tensorflow_2.0.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 pillar_1.4.3 compiler_3.6.3 prettyunits_1.1.1 base64enc_0.1-3 tools_3.6.3
[7] progress_1.2.2 zeallot_0.1.0 digest_0.6.25 packrat_0.5.0 jsonlite_1.6.1 evaluate_0.14
[13] tibble_2.1.3 pkgconfig_2.0.3 rlang_0.4.5 cli_2.0.2 rstudioapi_0.11 yaml_2.2.1
[19] xfun_0.12 knitr_1.28 generics_0.0.2 vctrs_0.2.4 rappdirs_0.3.1 hms_0.5.3
[25] tidyselect_1.0.0 reticulate_1.14 glue_1.3.2 forge_0.2.0 R6_2.4.1 fansi_0.4.1
[31] rmarkdown_2.1 purrr_0.3.3 magrittr_1.5 whisker_0.4 tfestimators_1.9.1 tfruns_1.4
[37] htmltools_0.4.0 assertthat_0.2.1 crayon_1.3.4
Can you please try the fix mentioned here?
I've provided the solution below as well, in case the link is broken.
To install the fix, be sure to close all R sessions, then open a fresh R session and execute:
devtools::install_github("rstudio/reticulate")
The reason you need to close all R sessions is that Windows shared libraries won't be successfully overwritten if they are in use during the installation.
Hope this works and fixes the issue you are facing.

unhashable type: 'numpy.ndarray' for scatter plot

I am getting an unhashable type: 'numpy.ndarray' error, so I cast the 'Views' column of df_subset to int; however, it still comes back as object.
Here is the script:
tsne = TSNE(n_components=2, verbose=1, perplexity=20, n_iter=1000)
tsne_results = tsne.fit_transform(logits_list)
df_subset = pd.DataFrame({'X': tsne_results[:, 0], 'Y': tsne_results[:, 1], 'Views': targets})
print(df_subset)
df_subset.astype({'Views': 'int'}).dtypes
print(df_subset.dtypes)
colors = {'A2CH': 'red', 'A3CH': 'green', 'A4CH_LV': 'blue', 'A4CH_RV': 'cyan', 'A5CH': 'magenta', 'Apical_MV_LA_IAS': 'yellow',
          'PLAX_TV': 'black', 'PLAX_full': 'white', 'PLAX_valves': 'orange', 'PSAX_AV': 'purple', 'PSAX_LV': 'dodgerblue',
          'Subcostal_IVC': 'lightgreen', 'Subcostal_heart': 'darkcyan', 'Suprasternal': 'grey'}
ax = sns.scatterplot(x="X", y="Y", hue='Views', legend='full', palette=colors, data=df_subset)
plt.show()
Here is a printout of df_subset and its dtypes:
X Y Views
0 13.208739 -19.657906 [11]
1 7.932375 -31.547863 [6]
2 -3.896450 -23.075047 [9]
3 -11.836237 -12.138339 [9]
4 -8.077571 17.220371 [11]
5 9.463497 23.756912 [2]
6 8.354083 -47.790867 [10]
7 -2.848731 -0.220144 [9]
8 25.724466 -29.862696 [9]
9 -26.956612 -8.361418 [9]
10 -16.011475 2.309184 [7]
11 16.193329 -0.280985 [8]
12 5.060284 -9.906323 [9]
13 37.827713 -16.174528 [4]
14 -5.971475 -39.845860 [7]
15 6.608039 9.085782 [12]
16 -20.108206 -26.253906 [8]
17 32.851559 0.332044 [2]
18 23.818949 13.762548 [2]
19 23.625357 -12.107020 [3]
X float32
Y float32
Views object
dtype: object
I assume I am getting the unhashable type: 'numpy.ndarray' error because of the object dtype? Any help would be appreciated.
.astype() returns a copy, so it should work if you assign the result back:
df_subset = df_subset.astype({'Views': int})
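Note also that in the printout each Views entry appears to be a one-element array such as [11], which is itself unhashable, so the cast alone may not be enough. A small sketch of flattening those entries first (column name taken from the code above; adjust as needed):

import numpy as np

# pull the scalar out of each one-element array, then cast to int
df_subset['Views'] = df_subset['Views'].apply(lambda v: int(np.asarray(v).item()))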

EDITED: Not learning the data correctly

I'm studying deep learning.
I'm making a figure classifier for circles, rectangles, triangles, pentagons, and stars, with the labels one-hot encoded via label2idx = dict(rectangle=0, circle=1, pentagon=2, star=3, triangle=4).
But the learning results per epoch are the same and it does not learn from the images.
I built the network using the ReLU function as the activation, an affine (fully connected) transform for each layer, softmax for the last layer, and Adam to optimize the gradients.
I have 234 RGB images in total to learn from, created with the Windows Paint (2D) tool; each is 128 * 128, but the figure does not use the whole canvas.
(An example picture was attached to the original post.)
The training results are below; the left [] is the prediction and the right [] is the answer label (I picked random images to print the predicted value and the answer label):
epoch: 0.49572649572649574
[ 0.3149641 -0.01454905 -0.23183 -0.2493432 0.11655246] [0 0 0 0 1]
epoch: 0.6837606837606838
[ 1.67341673 0.27887525 -1.09800398 -1.12649948 -0.39533065] [1 0 0 0 0]
epoch: 0.7094017094017094
[ 0.93106499 1.49599772 -0.98549052 -1.20471573 -0.24997779] [0 1 0 0 0]
epoch: 0.7905982905982906
[ 0.48447043 -0.05460748 -0.23526179 -0.22869489 0.05468969] [1 0 0 0 0]
...
epoch: 0.9230769230769231
[14.13835867 0.32432293 -5.01623202 -6.62469261 -3.21594355] [1 0 0 0 0]
epoch: 0.9529914529914529
[ 1.61248239 -0.47768294 -0.41580036 -0.71899219 -0.0901478 ] [1 0 0 0 0]
epoch: 0.9572649572649573
[ 5.93142154 -1.16719891 -1.3656573 -2.19785097 -1.31258801] [1 0 0 0 0]
epoch: 0.9700854700854701
[ 7.42198941 -0.85870225 -2.12027192 -2.81081263 -1.83810873] [1 0 0 0 0]
I think that as it learns more, the prediction should look like [ 0.00143 0.09357 0.352 0.3 0.253 ] [ 1 0 0 0 0 ], meaning the predicted index should match the answer index 0, but it does not.
Even the training accuracy sometimes goes to 1.0 (100%).
I'm loading and normalizing the images with the code below.
# data_list = glob('dataset\\training\\*\\*.jpg')
dataset['train_img'] = _load_img()

def _load_img():
    # read each image into a numpy array and flatten it to one row
    data = [np.array(Image.open(v)) for v in data_list]
    a = np.array(data)
    a = a.reshape(-1, img_size * 3)
    return a

# normalize: scale to [0, 1] and subtract the per-image mean
for v in dataset:
    dataset['train_img'] = dataset['train_img'].astype(np.float32)
    dataset['train_img'] /= dataset['train_img'].max()
    dataset['train_img'] -= dataset['train_img'].mean(axis=1).reshape(len(dataset['train_img']), 1)
EDIT
I converted the images to grayscale with Image.open(v).convert('LA')
and checked my prediction values; here is an example:
[-3.98576886e-04 3.41216374e-05] [1 0]
[ 0.00698861 -0.01111879] [1 0]
[-0.42003415 0.42222863] [0 1]
It is still not learning from the images. I removed 3 figures to test it, so I now have only rectangles and triangles, 252 images in total (I drew more images).
The prediction values usually come out as near-opposite pairs (e.g. 3.1323, -3.1323 or 3.1323, -3.1303), and I cannot figure out the reason.
It is not just that the numbers grow in magnitude: when I use SGD as the optimizer, the accuracy does not increase at all; it just stays the same.
[ 0.02090227 -0.02085848] [1 0]
epoch: 0.5873015873015873
[ 0.03058879 -0.03086193] [0 1]
epoch: 0.5873015873015873
[ 0.04006064 -0.04004988] [1 0]
[ 0.04545139 -0.04547538] [1 0]
epoch: 0.5873015873015873
[ 0.05605123 -0.05595288] [0 1]
epoch: 0.5873015873015873
[ 0.06495255 -0.06500597] [1 0]
epoch: 0.5873015873015873
Yes, your model is performing pretty well. The problem is not related to normalization (that is not even a problem). The model predicts values outside of [0, 1] because these are raw scores, and large magnitudes mean the model is really confident.
The model will not try to push the raw output all the way towards [1, 0, 0, 0], because when it calculates the loss it first clips the values.
Hope this helps!
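To see why large raw scores mean a confident prediction, you can run one of the printed score vectors through a softmax; a small sketch using the output from epoch 0.92 above:

import numpy as np

def softmax(z):
    # subtract the max for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([14.13835867, 0.32432293, -5.01623202, -6.62469261, -3.21594355])
print(softmax(scores))
# nearly all of the probability mass lands on the first class,
# which matches the answer label [1 0 0 0 0]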
