When trying to implement encoding for the categories below using a one-hot encoder, I got a "could not convert string to float" error.
['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']
I made something quick that should work. You will see that I had a really nasty-looking one-liner for preconditioning your limits; however, it is much easier if you just convert the limits directly to the proper format.
Essentially, this just iterates through a sorted list of limits and compares each sample against them. If the sample is less than or equal to a limit, we set that index to 1 and break.
import random

# str_limits = ['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']
#
# one-line conditioning for the limit string format
# limits = sorted(list(filter(lambda x: not x.endswith("+"), map(lambda v: v.split("-")[-1], str_limits))))
# limits.append('1000')
# do this instead
limits = sorted([17, 35, 50, 55, 45, 25, 1000])

# sample 100 random datapoints between 0 and 65 for testing
samples = [random.choice(list(range(65))) for i in range(100)]

onehot = []  # this is where we will store our one-hot encodings
for sample in samples:
    row = [0] * len(limits)  # preallocate a list of zeros
    for i, limit in enumerate(limits):
        if sample <= limit:
            row[i] = 1
            break
    # store that sample's one-hot row in the onehot list of lists
    onehot.append(row)

for i in range(10):
    print("{}: {}".format(onehot[i], samples[i]))
I am not sure about the specifics of your implementation, but you are probably forgetting to convert from a string to an integer at some point.
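As a side note, recent versions of scikit-learn's OneHotEncoder accept string categories directly, so if the goal is simply to one-hot encode the age ranges, a minimal sketch like this may be enough (assuming a reasonably recent scikit-learn):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

ages = np.array(['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']).reshape(-1, 1)

encoder = OneHotEncoder(sparse_output=False)  # on older scikit-learn versions use sparse=False
onehot = encoder.fit_transform(ages)          # one row per sample, one column per category

print(encoder.categories_)
print(onehot)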
I want to generate random numbers, uniformly between -1 and 1.
I know that using NumPy and generating an array of numbers is much better than generating them one by one in a for loop.
On the other hand, I only want to operate on each number once, so there is no reason to store them in an array.
My question is: what is the best solution to this? On one hand, a for loop is not time efficient, but I don't store unnecessary numbers; I generate them one by one and then throw them away. On the other hand, an array is not memory efficient: if I want to generate 10^10 numbers, I need to create an array of size 10^10, with horrible results.
I assume the best choice is to generate small arrays (10^3 or 10^4 elements) one at a time, but I want to know if there is a better solution to this problem (maybe a NumPy function that generates the numbers but produces something like an iterable that doesn't store them all in memory?).
Using NumPy to generate blocks of numbers is best, and you want to keep operations vectorised as much as possible.
A simple benchmark shows that somewhere between 4k and 64k is a reasonable block size:
from timeit import Timer
import numpy as np

for xp in range(20):
    size = 2**xp
    timer = Timer(
        f'rng.uniform(-1., 1., size={size})',
        'rng = np.random.default_rng()',
        globals=globals()
    )
    n, t = timer.autorange()
    t = min([t] + timer.repeat(3, n)) / n / size
    print(f'{size:8} = {1e-6/t:6.2f}M/s')
gives me
1 = 0.47M/s
2 = 0.95M/s
4 = 1.89M/s
8 = 3.80M/s
16 = 7.43M/s
32 = 14.26M/s
64 = 27.10M/s
128 = 48.60M/s
256 = 78.72M/s
512 = 119.07M/s
1024 = 158.71M/s
2048 = 191.51M/s
4096 = 218.71M/s
8192 = 233.25M/s
16384 = 241.23M/s
32768 = 245.35M/s
65536 = 248.75M/s
131072 = 250.53M/s
262144 = 252.62M/s
524288 = 253.99M/s
and working with numbers in a vectorised form is orders-of-magnitude faster.
For example, given a 64k array of values, a vectorised np.sum(x) takes 17µs, while the equivalent sum(x) going through a generator takes 3.5ms, i.e. about 200 times slower. Once you've paid the price of getting the floats out into the non-vectorised Python world, going through another yield from doesn't make much difference, only taking 4.5ms. For example, via the IPython %timeit magic:
def yield_from(it):
    yield from it

x = np.random.uniform(-1, 1, size=2**16)

%timeit np.sum(x)
%timeit sum(x)
%timeit sum(yield_from(x))
You could make a generator, as suggested in the comment by @Carcigenicate, and combine it with the speed of generating entire arrays by using a yield from expression.
That would look something like this:
import numpy as np

def random_numbers():
    while True:
        yield from np.random.random(1000) * 2 - 1
You can adjust the number of values generated at once to whatever you need; larger blocks are faster but use more memory.
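A small usage sketch on top of that generator (islice is just an illustrative consumer): because the generator is endless, you only pull the values you actually need, and at most one 1000-element block lives in memory at a time:
from itertools import islice

# consume 10_000 values from the generator above without ever materialising them all
total = sum(x * x for x in islice(random_numbers(), 10_000))
print(total)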
I have trained my word2vec model from gensim and I am getting the nearest neighbors for some words in the corpus. Here are the similarity scores:
top neighbors for الاحتلال:
الاحتلال: 1.0000001192092896
الاختلال: 0.9541053175926208
الاهتلال: 0.872565507888794
الاحثلال: 0.8386293649673462
الاكتلال: 0.8209128379821777
It is odd to get a similarity greater than 1. I cannot apply any stemming to my text because it includes many OCR spelling mistakes (I got the text from OCR-ed documents). How can I fix this issue?
Note: I am using model.similarity(t1, t2).
This is how I trained my Word2Vec Model:
documents = list()
tokenize = lambda x: gensim.utils.simple_preprocess(x)
t1 = time.time()
docs = read_files(TEXT_DIRS, nb_docs=5000)
t2 = time.time()
print('Reading docs took: {:.3f} mins'.format((t2 - t1) / 60))
print('Number of documents: %i' % len(docs))
# Training the model
model = gensim.models.Word2Vec(docs, size=EMBEDDING_SIZE, min_count=5)
if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR)
model.save(os.path.join(MODEL_DIR, 'word2vec'))
weights = model.wv.vectors
index_words = model.wv.index2word
vocab_size = weights.shape[0]
embedding_dim = weights.shape[1]
print('Shape of weights:', weights.shape)
print('Vocabulary size: %i' % vocab_size)
print('Embedding size: %i' % embedding_dim)
Below is the read_files function I defined:
def read_files(text_directories, nb_docs):
    """
    Read in text files
    """
    documents = list()
    tokenize = lambda x: gensim.utils.simple_preprocess(x)
    print('started reading ...')
    for path in text_directories:
        count = 0
        # Read in all files in directory
        if os.path.isdir(path):
            all_files = os.listdir(path)
            for filename in all_files:
                if filename.endswith('.txt') and filename[0].isdigit():
                    count += 1
                    with open('%s/%s' % (path, filename), encoding='utf-8') as f:
                        doc = f.read()
                        doc = clean_text_arabic_style(doc)
                        doc = clean_doc(doc)
                        documents.append(tokenize(doc))
                    if count % 100 == 0:
                        print('processed {} files so far from {}'.format(count, path))
                if count >= nb_docs and count <= nb_docs + 200:
                    print('REACHED END')
                    break
        if count >= nb_docs and count <= nb_docs:
            print('REACHED END')
            break
    return documents
I tried this thread, but it doesn't help me because my text is Arabic and contains misspellings.
Update
I tried the following (getting the similarity of the exact same word with itself):
print(model.similarity('الاحتلال','الاحتلال'))
and it gave me the following result:
1.0000001
Definitionally, the cosine-similarity measure should max at 1.0.
But in practice, floating-point number representations in computers have tiny imprecisions in the deep-decimals. And, especially when a number of calculations happen in a row (as with the calculation of this cosine-distance), those will sometimes lead to slight deviations from what the expected maximum or exactly-right answer "should" be.
(Similarly: sometimes calculations that, mathematically, should result in the exact same answer no matter how they are reordered/regrouped deviate slightly when done in different orders.)
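A quick plain-Python illustration of that reordering effect:
# mathematically identical, but binary floating point gives slightly different results
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)   # False
print(a - b)    # ~1.1e-16, a tiny representational deviation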
But, as these representational errors are typically "very small", they're usually not of practical concern. (They are especially small in the range of numbers around -1.0 to 1.0, but can become quite large when dealing with giant numbers.)
In your original case, the deviation is just 0.000000119209289. In the word-to-itself case, the deviation is just 0.0000001. That is, about one-ten-millionth off. (Your other sub-1.0 values have similar tiny deviations from perfect calculation, but they aren't noticeable.)
In most cases, you should just ignore it.
If you find it distracting to you or your users in numerical displays/logging, simply choosing to display all such values to a limited number of after-the-decimal-point digits – say 4 or even 5 or 6 – will hide those noisy digits. For example, using a Python 3 format-string:
sim = model.similarity('الاحتلال','الاحتلال')
print(f"{sim:.6}")
(Libraries like numpy that work with large arrays of such floats can even set a global default for display precision – see numpy.set_printoptions – though that shouldn't affect the raw Python floats you're examining.)
If for some reason you absolutely need the values to be capped at 1.0, you could add extra code to do that. But it's usually a better idea to make your tests and printouts robust to, and oblivious of, such tiny deviations from perfect math.
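For example, a one-line way to cap a single score (a sketch, only if you really need it):
sim = min(model.similarity('الاحتلال', 'الاحتلال'), 1.0)  # clamp any tiny float overshoot back to 1.0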
I'm trying to use Hyperopt on a regression model such that one of its hyperparameters is defined per variable and needs to be passed as a list. For example, if I have a regression with 3 independent variables (excluding constant), I would pass hyperparameter = [x, y, z] (where x, y, z are floats).
The values of this hyperparameter have the same bounds regardless of which variable they are applied to. If this hyperparameter was applied to all variables, I could simply use hp.uniform('hyperparameter', a, b). What I want the search space to be instead is a cartesian product of hp.uniform('hyperparameter', a, b) of length n, where n is the number of variables in a regression (so, basically, itertools.product(hp.uniform('hyperparameter', a, b), repeat = n))
I'd like to know whether this is possible within Hyperopt. If not, any suggestions for an optimizer where this is possible are welcome.
As noted in my comment, I am not 100% sure what you are looking for, but here is an example of using hyperopt to optimize a combination of 3 variables:
import random

# define an objective function
def objective(args):
    v1 = args['v1']
    v2 = args['v2']
    v3 = args['v3']
    result = random.uniform(v2, v3) / v1
    return result

# define a search space
from hyperopt import hp
space = {
    'v1': hp.uniform('v1', 0.5, 1.5),
    'v2': hp.uniform('v2', 0.5, 1.5),
    'v3': hp.uniform('v3', 0.5, 1.5),
}

# minimize the objective over the space
from hyperopt import fmin, tpe, space_eval
best = fmin(objective, space, algo=tpe.suggest, max_evals=100)

print(best)
All three variables have the same search space in this case (as I understand it, that was your problem definition). Hyperopt aims to minimize the objective function, so running this will end up with v2 and v3 near their minimum and v1 near its maximum, since that combination generally minimizes the result of the objective function.
You could use this function to create the space:
from hyperopt import hp

def get_spaces(a, b, num_spaces=9):
    return_set = {}
    for set_num in range(num_spaces):
        name = str(set_num)
        return_set = {
            **return_set,
            **{name: hp.uniform(name, a, b)}
        }
    return return_set
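A usage sketch (the objective here is a hypothetical placeholder, just to show how the generated names map back to the per-variable list you described):
from hyperopt import fmin, tpe

n_vars = 3                                      # number of regression variables
space = get_spaces(a=0.0, b=1.0, num_spaces=n_vars)

def objective(params):
    # rebuild the per-variable hyperparameter list from the named entries
    hyperparameter = [params[str(i)] for i in range(n_vars)]
    return sum(h ** 2 for h in hyperparameter)  # placeholder loss

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)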
I would first define my pre-combinatorial space as a dict. The keys are names. The values are tuples.
from hyperopt import hp
space = {'foo': (hp.choice, (False, True)), 'bar': (hp.quniform, 1, 10, 1)}
Next, produce the required combinatorial variants using loops or itertools. Each name is kept unique using a suffix or prefix.
types = (1, 2)
space = {f'{name}_{type_}': args for type_ in types for name, args in space.items()}
>>> space
{'foo_1': (<function hyperopt.pyll_utils.hp_choice(label, options)>,
(False, True)),
'bar_1': (<function hyperopt.pyll_utils.hp_quniform(label, *args, **kwargs)>,
1, 10, 1),
'foo_2': (<function hyperopt.pyll_utils.hp_choice(label, options)>,
(False, True)),
'bar_2': (<function hyperopt.pyll_utils.hp_quniform(label, *args, **kwargs)>,
1, 10, 1)}
Finally, initialize and create the actual hyperopt space:
space = {name: fn(name, *args) for name, (fn, *args) in space.items()}
values = tuple(space.values())
>>> space
{'foo_1': <hyperopt.pyll.base.Apply at 0x7f291f45d4b0>,
'bar_1': <hyperopt.pyll.base.Apply at 0x7f291f45d150>,
'foo_2': <hyperopt.pyll.base.Apply at 0x7f291f45d420>,
'bar_2': <hyperopt.pyll.base.Apply at 0x7f291f45d660>}
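As a usage sketch (the objective below is hypothetical, just to show the wiring), the finished dict can be handed to fmin like any other hyperopt space; the objective then receives one sampled value per generated name:
from hyperopt import fmin, tpe

def objective(params):
    # hypothetical loss over the combinatorial parameters defined above
    penalty = 1.0 if params['foo_1'] else 0.0
    return params['bar_1'] - params['bar_2'] + penalty

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)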
This was done with hyperopt 0.2.7. As a disclaimer, I strongly advise against using hyperopt, because in my experience its performance has been significantly worse than that of other optimizers.
Hi, so I implemented this solution with Optuna. The advantage of Optuna is that it creates a hyperparameter space for all of the individual values, but optimizes these values in a more intelligent way within a single hyperparameter optimization. For example, I optimized a neural network over the batch size, learning rate and dropout rate:
The search space is much larger than the set of values actually used, and this saves a lot of time compared to a grid search.
The code of the implementation looks roughly like this:
def objective(trial):  # optuna passes in a trial object, which suggests the next hyperparameter values
    a = trial.suggest_float("a", 0, 1)  # sampled from a uniform distribution over [0, 1]
    b = trial.suggest_float("b", 0, 1)
    return (a * b) - b
# This is the function which optuna tries to optimize/minimize
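Running the optimization is then a couple of standard Optuna calls (a minimal sketch):
import optuna

study = optuna.create_study(direction="minimize")  # we want to minimize the objective above
study.optimize(objective, n_trials=100)            # optuna picks the next "a" and "b" each trial
print(study.best_params)                           # best values found for "a" and "b"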
For more detailed source code, visit the Optuna documentation. It saved a lot of time for me and gave a really good result.
I have about 20,000 rows of data like this:
Id | value
1  | 30
2  | 3
3  | 22
.. | ..
n  | 27
I computed some statistics on my data: the average value is 33.85, the median 30.99, the min 2.8, the max 206, and the 95% confidence interval 0.21. So most values are around 33 and there are a few outliers, which suggests a distribution with a long tail.
I am new to both distributions and Python. I tried the fitter class (https://pypi.org/project/fitter/) to try many distributions from the SciPy package, and the loglaplace distribution showed the lowest error (although I don't quite understand it).
I read almost all the questions in this thread and came away with two approaches: (1) fit a distribution model and then draw random values from it in my simulation, or (2) compute the frequency of different groups of values, but that solution will never produce a value larger than 206, for example.
Given my data, which is numeric values, what is the best approach to fit a distribution to it in Python, since in my simulation I need to draw numbers? The random numbers must follow the same pattern as my data. I also need to validate that the model represents my data well by plotting my data against the model curve.
One way is to select the best model according to the Bayesian information criterion (called BIC).
OpenTURNS implements an automatic method of selection (see doc here).
Suppose you have an array x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; here is a quick example:
import openturns as ot
# Define x as a Sample object. It is a sample of size 11 and dimension 1
sample = ot.Sample([[xi] for xi in x])
# define distributions you want to test on the sample
tested_distributions = [ot.WeibullMaxFactory(), ot.NormalFactory(), ot.UniformFactory()]
# find the best distribution according to BIC and print its parameters
best_model, best_bic = ot.FittingTest.BestModelBIC(sample, tested_distributions)
print(best_model)
>>> Uniform(a = -0.769231, b = 10.7692)
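As a complementary sketch with SciPy (since fitter already pointed you at loglaplace), and assuming your roughly 20,000 values are in a 1-D NumPy array called data: fit the distribution, draw new random values from it for the simulation, and overlay the fitted density on a histogram of the data to validate the fit visually:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# placeholder data just so the sketch runs on its own; replace with your own ~20,000 values
data = stats.loglaplace.rvs(c=2.5, loc=0, scale=30, size=20000)

params = stats.loglaplace.fit(data)                    # fit shape, loc and scale to the observations
simulated = stats.loglaplace.rvs(*params, size=10000)  # new random values with the same pattern

# visual validation: histogram of the data against the fitted density curve
xs = np.linspace(data.min(), data.max(), 500)
plt.hist(data, bins=100, density=True, alpha=0.5, label='data')
plt.plot(xs, stats.loglaplace.pdf(xs, *params), label='fitted loglaplace')
plt.legend()
plt.show()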
I am trying to create a simple neural network program/module. Because I am using Pythonista on an iPad Pro, the speed could use some improvement. My understanding is that for loops have a lot of overhead, so I was wondering if it is possible to train 50000 input:target sets using some form of vectorization.
I have been trying to find resources on how numpy arrays pass through functions, but it’s very difficult to wrap my head around. I have been trying to create one large array that holds the “train” function inputs as small lists, but the function is operating on the entire array, not the smaller ones individually.
# train function that accepts inputs and targets
# currently called inside a for loop
# input: [[a], [b]], target: [[c]]
def train(self, input_mat, target_mat):
    # generate hidden layer neuron values
    hidden = self.weights_in_hid.dot(input_mat)
    hidden += self.bias_hid
    hidden = sigmoid(hidden)
    # generate output neuron values
    output_mat = self.weights_hid_out.dot(hidden)
    output_mat += self.bias_out
    # activation function
    output_mat = sigmoid(output_mat)
    # more of the function continues
    # ...
# Datum converts simple lists into numpy matrices, so they don’t have to be reinstantiated 50000 times
training_data = [
    Datum([0, 0], [0]),
    Datum([0, 1], [1]),
    Datum([1, 0], [1]),
    Datum([1, 1], [0]),
]
# ...
# XXX Does this for loop cause a lot of the slowdown?
for _ in range(50000):
    datum = rd.choice(training_data)
    brain.train(datum.inputs, datum.targets)
In the state shown, everything works, but somewhat slowly. Whenever I try to pass all the data in one matrix, the function cannot vectorize it; it instead attempts to operate on the matrix as a whole, obviously raising an error on the first step, as the array is way too big.
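For what it's worth, here is a minimal sketch of the vectorisation idea in question (the names mirror the train function above, but the layer sizes and the column-per-sample convention are my assumptions, not code from the original module): if every input is stored as one column of a single matrix, the same dot/bias/sigmoid steps process all samples in one call, because the matrix product and the broadcast bias apply column-wise.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
weights_in_hid = rng.standard_normal((4, 2))  # 4 hidden neurons, 2 inputs (assumed sizes)
bias_hid = rng.standard_normal((4, 1))

# all four XOR inputs stacked as columns: shape (2, 4) instead of four separate (2, 1) matrices
input_batch = np.array([[0, 0, 1, 1],
                        [0, 1, 0, 1]])

# one call computes the hidden activations for every sample at once
hidden = sigmoid(weights_in_hid.dot(input_batch) + bias_hid)
print(hidden.shape)  # (4, 4): one column of hidden values per sample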