I got this error. How should players_data be defined?
NameError                                 Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 data = np.vstack((asia[1:], eu[1:], na[1:], oc[1:], sea[1:], players_data[1:]))
      2 df = pd.DataFrame({data[0, i]: data[1:, i] for i in range(data.shape[1])})
      3 m = asfloat(data[1:, :4])
NameError: name 'players_data' is not defined
asia = open_exl('pubg_as.xls', 0)
eu = open_exl('pubg_eu.xls', 0)
na = open_exl('pubg_na.xls', 0)
oc = open_exl('pubg_oc.xls', 0)
sea = open_exl('pubg_sea.xls', 0)
# Load all data
all_data = np.genfromtxt('PUBG_Player_Statistics.csv', delimiter=',')
all_data[:, 28] = all_data[:, 28] * 100
# Train data
train_data = all_data[1:2000, :][:, [3, 2, 28, 9]]
test_data = all_data[2000:, :][:, [3, 2, 28, 9]]
data = np.vstack((asia[1:], eu[1:], na[1:], oc[1:], sea[1:], players_data[1:]))
df = pd.DataFrame({data[0, i]: data[1:, i] for i in range(data.shape[1])})
m = asfloat(data[1:, :4])
It is hard to know what value was originally stored in players_data, since this is incomplete code from another user; however, based on what they were doing, my guess is that players_data is:
players_data = train_data
But, why??
They used the k-means algorithm to create 6 clusters that represent the following categories:
['Normal player', 'Waller', 'Experienced Player', 'Both', 'God', 'Aimbot']
The first 5 variables used in vstack hold information about the best players on the 5 servers. They wanted to use this information and leverage it together with "normal players".
In the end, they used neither "train_data" nor "test_data"; however, in the README.md they mentioned the following:
The reason we mix data from the normal data set is to increase the density of normal players, and make the clustering tight. After test, we think 2000 rows of data has the best performance.
An important thing to notice is that in the train and test data they selected 4 columns:
[3, 2, 28, 9]
which are the same columns that they used in the "top performers" files (open_exl keeps only the first 4 columns):
def open_exl(address, idx):
    # read sheet idx from an .xls workbook and keep only the first 4 columns
    data = xlrd.open_workbook(address)
    table = data.sheets()[idx]
    rows = table.nrows
    ct_data = []
    for row in range(rows):
        ct_data.append(table.row_values(row))
    return np.array(ct_data)[:, :4]
Since the code has inconsistencies, it might not be possible to obtain the same results they did; however, it seems like a great opportunity to play with the data and compare the results you obtain with their previous investigation.
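For reference, here is a minimal sketch of how the pieces could fit together under the assumption players_data = train_data. File names, column indices, and the 6-cluster setup are taken from the snippets above; asfloat is assumed to be a simple float-cast helper, the scikit-learn KMeans call stands in for whatever k-means implementation the original repository used, and the mapping from cluster labels to the category names is not reproduced here.
import numpy as np
import xlrd
from sklearn.cluster import KMeans

def asfloat(arr):
    # assumption: asfloat simply casts the array to float
    return arr.astype(float)

# regional "top performer" sheets (first 4 columns each, via open_exl above)
asia = open_exl('pubg_as.xls', 0)
eu = open_exl('pubg_eu.xls', 0)
na = open_exl('pubg_na.xls', 0)
oc = open_exl('pubg_oc.xls', 0)
sea = open_exl('pubg_sea.xls', 0)

# "normal player" statistics, same 4 columns: [3, 2, 28, 9]
all_data = np.genfromtxt('PUBG_Player_Statistics.csv', delimiter=',')
all_data[:, 28] = all_data[:, 28] * 100
players_data = all_data[1:2000, :][:, [3, 2, 28, 9]]   # the guessed definition

data = np.vstack((asia[1:], eu[1:], na[1:], oc[1:], sea[1:], players_data[1:]))
m = asfloat(data[1:, :4])

km = KMeans(n_clusters=6).fit(m)   # 6 clusters, one per category
print(np.bincount(km.labels_))     # cluster sizes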
I am having issues with my code running out of memory on large data sets. I attempted to chunk the data to feed it into the calculation graph but I eventually get an out of memory error. Would setting it up to use the feed_dict functionality get around this problem?
My code is set up like the following, with a nested map_fn call to handle the result of the tf_itertools_product_2D_nest function.
The tf_itertools_product_2D_nest function is from Cartesian Product in Tensorflow.
I also tried a variation where I made a list of tensor-lists, which was significantly slower than doing it purely in TensorFlow, so I'd prefer to avoid that method.
import tensorflow as tf
import numpy as np
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.9
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())
tensorboard_log_dir = "../log/"
def tf_itertools_product_2D_nest(a, b):  # does not work on nested tensors
    a, b = a[None, :, None], b[:, None, None]
    # print(sess.run(tf.shape(a)))
    # print(sess.run(tf.shape(b)))
    n_feat_dimension_in_common = tf.shape(a)[-1]
    c = tf.concat([a + tf.zeros_like(b), tf.zeros_like(a) + b], axis=2)
    return c
def do_calc(arr_pair):
    arr_1 = arr_pair[0]
    arr_binary = arr_pair[1]
    return tf.reduce_max(tf.cumsum(arr_1 * arr_binary))

def calc_row_wrapper(row):
    return tf.map_fn(do_calc, row)
for i in range(0, 10):
    a = tf.constant(np.random.random((7, 10)) * 10, tf.float64)
    b = tf.constant(np.random.randint(2, size=(3, 10)), tf.float64)
    a_b_itertools_product = tf_itertools_product_2D_nest(a, b)
    '''Creates an array like this:
    [ [[arr_a0,arr_b0], [arr_a1,arr_b0],...],
      [[arr_a0,arr_b1], [arr_a1,arr_b1],...],
      [[arr_a0,arr_b2], [arr_a1,arr_b2],...],
      ...]
    '''
    with tf.summary.FileWriter(tensorboard_log_dir, sess.graph) as writer:
        result_array = sess.run(tf.map_fn(calc_row_wrapper, a_b_itertools_product),
                                options=run_options, run_metadata=run_metadata)
        writer.add_run_metadata(run_metadata, "iteration {}".format(i))
    print(result_array.shape)
    print(result_array)
    print("")
    # result_array should be an array with 3 rows (1 for each binary vector in b) and 7 columns (1 for each row in a)
I can imagine that this is unnecessarily consuming memory due to the extra dimension added. Is there a way to mimic the outcome of the standard itertools.product() function and output 1 long list of every possible combination of items in the 2 input iterables? Like the result of:
itertools.product([[1,2],[3,4]],[[5,6],[7,8]])
# [([1, 2], [5, 6]), ([1, 2], [7, 8]), ([3, 4], [5, 6]), ([3, 4], [7, 8])]
That would eliminate the need to call map_fn twice.
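For illustration, here is a minimal sketch of the kind of flattening I have in mind, written against the example shapes above (the tiling/reshaping approach and the variable names are my own, not taken from the linked question):
# a: (7, 10), b: (3, 10) as in the loop above
a_rep = tf.tile(a, [3, 1])                                        # (21, 10): a0..a6 repeated once per row of b
b_rep = tf.reshape(tf.tile(b[:, None, :], [1, 7, 1]), [21, 10])   # (21, 10): each row of b repeated 7 times
pairs = tf.stack([a_rep, b_rep], axis=1)                          # (21, 2, 10): one flat list of (row_a, row_b) pairs
flat_result = tf.map_fn(do_calc, pairs)                           # a single map_fn over all 21 pairs
result = tf.reshape(flat_result, [3, 7])                          # 3 rows (one per row of b), 7 columns (one per row of a)
print(sess.run(result).shape)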
When map_fn is called within a loop as my code shows, will it keep adding nodes to the graph for every iteration? There appears to be a big "map_" node for every iteration cycle in this code's TensorBoard graph.
[Screenshot: TensorBoard default view]
When I select a particular iteration based on the tag in TensorBoard, only the map node corresponding to that iteration is highlighted, with all the others grayed out. Does that mean that only the map node for that cycle is present, and that the nodes from previous cycles no longer exist in memory?
[Screenshot: TensorBoard single-iteration view]
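For context, this is the feed_dict variant I am considering: build the graph once with placeholders outside the loop and feed fresh NumPy arrays each iteration, so no new map_fn nodes are added per cycle (a sketch based on my code above, not a confirmed fix; a_ph, b_ph and the other names are mine):
# build the graph once, outside the loop
a_ph = tf.placeholder(tf.float64, shape=(7, 10))
b_ph = tf.placeholder(tf.float64, shape=(3, 10))
product_op = tf_itertools_product_2D_nest(a_ph, b_ph)
result_op = tf.map_fn(calc_row_wrapper, product_op)

for i in range(0, 10):
    a_np = np.random.random((7, 10)) * 10
    b_np = np.random.randint(2, size=(3, 10)).astype(np.float64)
    result_array = sess.run(result_op, feed_dict={a_ph: a_np, b_ph: b_np})
    print(result_array.shape)   # (3, 7)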
I am trying to plot the results of PCA on the dataset pima-indians-diabetes.csv. My code fails only in the plotting part:
import numpy
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
# Dataset Description:
# 1. Number of times pregnant
# 2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
# 3. Diastolic blood pressure (mm Hg)
# 4. Triceps skin fold thickness (mm)
# 5. 2-Hour serum insulin (mu U/ml)
# 6. Body mass index (weight in kg/(height in m)^2)
# 7. Diabetes pedigree function
# 8. Age (years)
# 9. Class variable (0 or 1)
path = 'pima-indians-diabetes.data.csv'
dataset = numpy.loadtxt(path, delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
features = ['1','2','3','4','5','6','7','8','9']
df = pd.read_csv(path, names=features)
x = df.loc[:, features].values # Separating out the values
y = df.loc[:,['9']].values # Separating out the target
x = StandardScaler().fit_transform(x) # Standardizing the features
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
# principalDf = pd.DataFrame(data=principalComponents, columns=['pca1', 'pca2'])
# finalDf = pd.concat([principalDf, df[['9']]], axis = 1)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
for color, i, target_name in zip(colors, [0, 1, 2], ['Negative', 'Positive']):
    plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of pima-indians-diabetes Dataset')
The error is located at the following line:
Traceback (most recent call last):
File "test.py", line 53, in <module>
plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
IndexError: too many indices for array
How can I fix this?
As the error indicates some kind of shape/dimension mismatch, a good starting point is to check the shapes of the arrays involved in the operation:
principalComponents.shape
yields
(768, 2)
while
(y == i).shape
yields
(768, 1)
This leads to a shape mismatch when trying to run
principalComponents[y == i, 0]
because the boolean index y == i is itself two-dimensional; together with the extra column index 0, you end up with too many indices for the array.
You can fix this by forcing the shape of y==i to a 1D array ((768,)), e.g. by changing your call to scatter to
plt.scatter(principalComponents[(y == i).reshape(-1), 0],
principalComponents[(y == i).reshape(-1), 1],
color=color, alpha=.8, lw=lw, label=target_name)
which then creates the plot for me
For more information on the difference between arrays of the shape (R, 1) and (R,), this question on StackOverflow provides a nice starting point.
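Alternatively, a minimal sketch of an equivalent fix: extract the target as a 1-D array in the first place, so every y == i comparison already has shape (768,). (Reducing the loop to the two classes that actually exist in the data is my addition, based on the dataset description above.)
y = df['9'].values   # shape (768,) instead of df.loc[:, ['9']].values, which is (768, 1)

for color, i, target_name in zip(colors, [0, 1], ['Negative', 'Positive']):
    plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1],
                color=color, alpha=.8, lw=lw, label=target_name)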
I am following the H2O example to run target mean encoding in Sparkling Water (Sparkling Water 2.4.2 and H2O 3.22.04). All of the following runs well:
from h2o.targetencoder import TargetEncoder
# change label to factor
input_df_h2o['label'] = input_df_h2o['label'].asfactor()
# add fold column for Target Encoding
input_df_h2o["cv_fold_te"] = input_df_h2o.kfold_column(n_folds = 5, seed = 54321)
# find all categorical features
cat_features = [k for (k,v) in input_df_h2o.types.items() if v in ('string')]
# convert string to factor
for i in cat_features:
    input_df_h2o[i] = input_df_h2o[i].asfactor()
# target mean encode
targetEncoder = TargetEncoder(x= cat_features, y = y, fold_column = "cv_fold_te", blending_avg=True)
targetEncoder.fit(input_df_h2o)
But when I call transform on the same data set that was used to fit the Target Encoder (see code below):
ext_input_df_h2o = targetEncoder.transform(frame=input_df_h2o,
                                           holdout_type="kfold",  # mean is calculated on out-of-fold data only; "loo" means leave-one-out
                                           is_train_or_valid=True,
                                           noise=0,  # determines if random noise should be added to the target average
                                           seed=54321)
I get an error like:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6773422589366407956.py", line 331, in <module>
exec(code)
File "<stdin>", line 5, in <module>
File "/usr/lib/envs/env-1101-ver-1619-a-4.2.9-py-3.5.3/lib/python3.5/site-packages/h2o/targetencoder.py", line 97, in transform
assert self._encodingMap.map_keys['string'] == self._teColumns
AssertionError
I found the assertion in the source code at http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/targetencoder.html, but how do I fix this issue? It is the same table that was used for the fit.
The issue arises because you are encoding multiple categorical features at once. I think that is a bug in H2O, but you can work around it by putting the transformer in a for loop that iterates over the categorical column names.
import numpy as np
import pandas as pd
import h2o
from h2o.targetencoder import TargetEncoder
h2o.init()
df = pd.DataFrame({
    'x_0': ['a'] * 5 + ['b'] * 5,
    'x_1': ['c'] * 9 + ['d'] * 1,
    'x_2': ['a'] * 3 + ['b'] * 7,
    'y_0': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
})
hf = h2o.H2OFrame(df)
hf['cv_fold_te'] = hf.kfold_column(n_folds=2, seed=54321)
hf['y_0'] = hf['y_0'].asfactor()
cat_features = ['x_0', 'x_1', 'x_2']
for item in cat_features:
    target_encoder = TargetEncoder(x=[item], y='y_0', fold_column='cv_fold_te')
    target_encoder.fit(hf)
    hf = target_encoder.transform(frame=hf, holdout_type='kfold',
                                  seed=54321, noise=0.0)
hf
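A quick way to see what the loop produced (a minimal check; the exact names of the encoded columns depend on the H2O version) is to compare the frame's columns with the original pandas columns:
print([c for c in hf.columns if c not in df.columns])   # fold column plus one encoded column per categorical feature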
Thanks everyone for letting us know. The assertion was a precaution, as I was not sure whether the column order could change. The rest of the code was written with this possibility in mind and is therefore safe to use with a changed order anyway, but the assertion was left in and forgotten. I added a test and removed the assertion. The issue is now fixed and merged, and should be available in the upcoming fix release. 0xdata.atlassian.net/browse/PUBDEV-6474
I am running into problems when I am trying to use pm.Categorical to sample many instances at once (either with multidimensional p or using theano.scan). What is the best way to go here? My goal is to sample one response per draw for each of many individuals, based on the probability of each response for each individual.
Here is a working example for many individuals, one draw:
import pymc3 as pm
import theano.tensor as T
import numpy as np
actions = T.as_tensor_variable(np.array([0, 1, 2])) # indiv0 selects response0, indiv1 response1, etc.
with pm.Model() as model:
    p = pm.Beta('p', alpha=2, beta=2, shape=[3, 3])  # prob. for each indiv. for each response
    actions = pm.Categorical('actions', p=p, observed=actions)
    trace = pm.sample()
Everything works fine here. The model samples nine p's, one for each individual and each response. But what I'd like to do is to sample several responses for each individual. I first tried to add a dimension of trials to the actions data:
n_trials = 10
actions = T.as_tensor_variable(np.array([[0, 1, 2]] * n_trials)) # same thing, 10 trials
with pm.Model() as model:
    p = pm.Beta('p', alpha=2, beta=2, shape=[3, 3])  # same as above, but more trials
    actions = pm.Categorical('actions', p=p, observed=actions)
    trace = pm.sample()
Obviously, this leads to an error:
Wrong number of dimensions: expected 1, got 2 with shape (10, 3).
Surprisingly, it also leads to an error when I draw a separate p for each sample, each individual, and each response:
p = pm.Beta('p', alpha=2, beta=2, shape=[n_trials, 3, 3])
It seems like pm.Categorical only accepts 2-dimensional p. To keep p 2-dimensional, I tried scanning:
import theano  # needed for theano.scan

actions = T.as_tensor_variable(np.array([[0, 1, 2]] * 10))
with pm.Model() as model:
    p = pm.Beta('p', alpha=2, beta=2, shape=[3, 3])  # same as above
    actions, _ = theano.scan(fn=lambda action, p: pm.Categorical('actions', p=p, observed=action),
                             sequences=[actions],
                             non_sequences=[p])
    trace = pm.sample()
The code fails with the following error message:
theano.gof.fg.MissingInputError: Input 0 of the graph (indices start from 0), used to compute Elemwise{Cast{int64}}(<TensorType(int32, vector)>), was not provided and not given a value. Use the Theano flag exception_verbosity='high', for more information on this error.
So maybe it is not possible to use pm.Categorical within theano.scan? Any help as to how to approach this problem would be super helpful!
The following solved my problem. I made actions 1-dimensional and p 2-dimensional:
n_trials, n_subj, n_actions = 10, 3, 3
actions = T.as_tensor_variable(np.array([0, 1, 2] * n_trials))  # shape (30,) instead of (10, 3)
with pm.Model() as model:
    p = pm.Uniform('p', lower=0, upper=1, shape=[n_subj, n_actions])
    indiv = T.tile(T.arange(n_subj), n_trials)
    p = p[indiv]
    actions = pm.Categorical('actions', p=p, observed=actions)
    trace = pm.sample()
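For illustration, here is the same index construction in plain NumPy (a sketch just to show how the flattened actions line up with the tiled subject indices; the variable names are mine):
import numpy as np

n_trials, n_subj = 10, 3
actions_flat = np.array([0, 1, 2] * n_trials)    # shape (30,): one observed response per entry
indiv = np.tile(np.arange(n_subj), n_trials)     # shape (30,): [0, 1, 2, 0, 1, 2, ...]
# observation k belongs to subject indiv[k], so p[indiv] selects one row of response
# probabilities per observation, giving a (30, 3) matrix that matches actions_flat
print(actions_flat[:6])  # [0 1 2 0 1 2]
print(indiv[:6])         # [0 1 2 0 1 2]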
I have a very small list of short strings which I want to (1) cluster and (2) use that model to predict which cluster a new string belongs to.
Running the first part works fine, getting a prediction for the new string does not.
First Part
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# List of short definitions to cluster
documents_lst = ['a small, narrow river',
'a continuous flow of liquid, air, or gas',
'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
'a group in which schoolchildren of the same age and ability are taught',
'(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
'put (schoolchildren) in groups of the same age and ability to be taught together',
'a natural body of running water flowing on or under the earth']
# 1. Vectorize the text
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
# 2. Get the number of clusters to make .. (find a better way than random)
num_clusters = 3
# 3. Cluster the defintions
km = KMeans(n_clusters=num_clusters, init='k-means++').fit(tfidf_matrix)
clusters = km.labels_.tolist()
print(clusters)
Which returns:
tfidf_matrix.shape: (8, 39)
[0, 1, 0, 2, 1, 0, 2, 0]
Second Part
The failing part:
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(predict_doc)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
km.predict(tfidf_matrix)
The error:
ValueError: Incorrect number of features. Got 7 features, expected 39
FWIW: I somewhat understand that the training and prediction data have a different number of features after vectorizing ...
I am open to any solution including changing from kmeans to an algorithm more suitable for short text clustering.
Thanks in advance
For completeness, I will answer my own question with an answer from here, which doesn't answer that question but does answer mine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
list1 = ["My name is xyz", "My name is pqr", "I work in abc"]
list2 = ["My name is xyz", "I work in abc"]
# note: charset_error is from an older scikit-learn; newer versions call it decode_error
vectorizer = TfidfVectorizer(min_df=0, max_df=0.5, stop_words="english", charset_error="ignore", ngram_range=(1, 3))
vec = vectorizer.fit(list1) # train vec using list1
vectorized = vec.transform(list1) # transform list1 using vec
km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=1000, tol=0.0001, precompute_distances=True, verbose=0, random_state=None, n_jobs=1)
km.fit(vectorized)
list2Vec = vec.transform(list2) # transform list2 using vec
km.predict(list2Vec)
The credit goes to #IrshadBhat
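Applied to the documents above, the same pattern would look roughly like this (a sketch; the point is to fit the vectorizer once on the training documents and reuse it for the new string):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vec = TfidfVectorizer(stop_words='english').fit(documents_lst)        # fit the vocabulary once
km = KMeans(n_clusters=3, init='k-means++').fit(vec.transform(documents_lst))

predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']
print(km.predict(vec.transform(predict_doc)))   # transform with the SAME vectorizer, then predict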