Converting bit string to signed int - python-3.x

I encoded and decoded a bunch of coefficients (related to my previous question). The process is based on RLE, where the run-length encoding is applied only to runs of zeros. To cut it short, this was the original array:
[200, -145, 0, 0, 0, 0, 51, 0, 0, 0, 0, 0, 0, 0, 0, -34, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Encoded into binary data that looks like this:
['000011001000', '11001>101101111<', '000010001110110011', '00010000111>1011110<', '00011000110011101', '000100011']
To avoid binary numbers that look like -10010001 (-145), I manually performed two's complement (since I couldn't find a built-in way) on negative numbers. The results for the numbers in this case (-145, -34) were (101101111, 1011110).
To avoid confusion, I marked them with > and < in the array above for the purpose of this question.
This was padded to be divisible by 8 (the last element had 0s inserted at the beginning), divided into bytes, and written to a file.
When I read the file, I decoded most things successfully and the number of coefficients is identical to the starting one. The problem arose with the negative values:
[200, 367, 0, 0, 0, 0, 51, 0, 0, 0, 0, 0, 0, 0, 0, 94, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Instead of -145 I got 367, and instead of -34 I got 94.
Is there any built-in way (or any kind of way) to convert bitstrings into signed values? I feel like this would fix my issue. I haven't been able to find a way and I'm stuck now.

For unsigned numbers the word size is not important because leading zeroes have no meaning there. For instance 5=101=0101=00101=0...0101. However, for two's complement the word size makes a difference because the first bit indicates negative numbers. For instance, -3=101 != 0101=5. If you don't know what the first bit is, you cannot tell whether the number is negative or not.
It seems like your encoding uses a variable word width. Since you already can decode the numbers, you already know how wide each word is.
# these variables should be set by your decoder
# in this case we read -145 encoded as 101101111
width = 9
word = 367
# add this to your decoder to fix the sign
firstBit = word >> (width - 1)
if firstBit == 1:
    leadingOnes = -1 << width
    word = leadingOnes | word
The same could be done without the branch and in a single statement, but I think this could be slower on average for CPython and certainly less readable.
word |= -(word >> (width - 1)) << width
Of course you have to make sure that non-negative numbers are encoded with a leading 0 so that you can tell them apart from the negative ones.
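For the bit strings shown in the question, the same sign fix can be wrapped into a small helper. This is only a minimal sketch, assuming each word arrives as a string of '0'/'1' characters whose length is the word width; to_signed is a name made up for illustration, not a built-in.

```python
def to_signed(bits: str) -> int:
    """Interpret a string of '0'/'1' characters as a two's complement number."""
    width = len(bits)
    word = int(bits, 2)      # unsigned value, e.g. '101101111' -> 367
    if bits[0] == "1":       # a leading 1 marks a negative number
        word -= 1 << width   # same effect as OR-ing in the leading ones
    return word


print(to_signed("101101111"))  # -145
print(to_signed("1011110"))    # -34
print(to_signed("011001000"))  # 200 (non-negative values need a leading 0)
```

Subtracting 1 << width gives the same result as word | (-1 << width) whenever the top bit is set, so either form should work in the decoder.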

Related

tensorflow Keras: Dimensions must be equal ValueError

I'm trying to train a model in Keras to suggest the best possible next move when presented with a pawn chess board. The board is represented as a list of 64 integers (0 for empty, 1 for player, 2 for enemy). The output is represented by a list of a field and a direction that the figure on that field should move in, which means I need two output layers with size 64 (number of fields) and 5 (number of possible move directions, including two forward and no move for when the game is over).
I have a list of boards and a list of solutions. When I try to fit the model however, I get the above mentioned error.
The exact error message is:
Epoch 1/75
Traceback (most recent call last):
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\main.py", line 75, in <module>
model.fit(train_fig_starts, train_fig_moves, epochs=75)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\lulll\AppData\Local\Temp\__autograph_generated_filej0zia4d5.py", line 15, in tf__train_function
retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
ValueError: in user code:
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\training.py", line 1249, in train_function *
return step_function(self, iterator)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\training.py", line 1233, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\training.py", line 1222, in run_step **
outputs = model.train_step(data)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\training.py", line 1024, in train_step
loss = self.compute_loss(x, y, y_pred, sample_weight)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\training.py", line 1082, in compute_loss
return self.compiled_loss(
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\compile_utils.py", line 265, in __call__
loss_value = loss_obj(y_t, y_p, sample_weight=sw)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\losses.py", line 152, in __call__
losses = call_fn(y_true, y_pred)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\losses.py", line 284, in call **
return ag_fn(y_true, y_pred, **self._fn_kwargs)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\losses.py", line 2176, in binary_crossentropy
backend.binary_crossentropy(y_true, y_pred, from_logits=from_logits),
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\backend.py", line 5688, in binary_crossentropy
bce = target * tf.math.log(output + epsilon())
ValueError: Dimensions must be equal, but are 2 and 64 for '{{node binary_crossentropy/mul}} = Mul[T=DT_FLOAT](binary_crossentropy/Cast, binary_crossentropy/Log)' with input shapes: [?,2], [?,64].
I have absolutely no idea what is causing this. I've searched for the error already, but the only mentions I've found seem to be describing a completely different scenario.
Since it probably helps, here's the code used to create and fit the model:
inputs = tf.keras.layers.Input(shape=64)
x = tf.keras.layers.Dense(32, activation='relu')(inputs)
out_field = tf.keras.layers.Dense(64, name="field")(x)
out_movement = tf.keras.layers.Dense(5, name="movement")(x)
model = tf.keras.Model(inputs=inputs, outputs=[out_field, out_movement])
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
model.fit(train_fig_starts, train_fig_moves, epochs=75)  # train_fig_starts and train_fig_moves are defined above
EDIT 1: Here's a sample of the dataset I'm using (the whole thing is too long for the character limit)
train_fig_starts = [[0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 2, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 2, 1, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0], [0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 2, 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 2, 2, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]]
train_fig_moves = [[0, 0], [0, 0], [0, 0], [0, 0], [15, 2], [15, 2]]
EDIT 2:
I changed it to SparseCategoricalCrossentropy since that seems more like what I'm looking for. This is now the model code:
inputs = tf.keras.layers.Input(shape=64)
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
out_field = tf.keras.layers.Dense(64, activation="relu", name="field")(x)
out_field = tf.keras.layers.Dense(64, activation="softmax", name="field_softmax")(out_field)
out_movement = tf.keras.layers.Dense(5, activation="relu", name="movement")(x)
out_movement = tf.keras.layers.Dense(5, activation="softmax", name="movement_softmax")(out_movement)
model = tf.keras.Model(inputs=inputs, outputs=[out_field, out_movement])
print(model.summary())
tf.keras.utils.plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)
model.compile(optimizer='adam',
              loss=[tf.keras.losses.SparseCategoricalCrossentropy(),
                    tf.keras.losses.SparseCategoricalCrossentropy()],
              metrics=['accuracy'])
It still throws an error; this time it's the following:
Node: 'sparse_categorical_crossentropy_1/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits'
logits and labels must have the same first dimension, got logits shape [32,5] and labels shape [64]
[[{{node sparse_categorical_crossentropy_1/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits}}]] [Op:__inference_train_function_1666]
I have no idea why it's like that. Output logits and labels should both be [64, 2]. Since I'm using sparse cross-entropy, I should be able to use integers in my training data to signify the "index" of the output neuron with the highest logit, right? Correct me if I'm wrong. If it helps, here's a diagram of my model:
plot of the model
So I fixed the issue myself. Honestly, it was a pretty silly error to make, but the error messages didn't explain well what was going on. I switched the labels to one-hot encoding and changed the loss to CategoricalCrossentropy, which is also more fitting for a classification problem (the sparse variant didn't work with my integers for some reason). After that I needed to change the label list from a single list of [field, move] pairs into two separate lists, one holding the field one-hots and one holding the move one-hots. If anyone runs into a similar issue and can't make sense of it, maybe this will help.
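The label reshaping described above might look roughly like the sketch below. It is plain Python so that it runs on its own; one_hot and split_labels are hypothetical helper names, and the sizes 64 and 5 are taken from the question.

```python
def one_hot(index, size):
    """Return a list of `size` floats with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def split_labels(moves, n_fields=64, n_directions=5):
    """Turn [[field, direction], ...] into two separate one-hot label lists."""
    field_labels = [one_hot(field, n_fields) for field, _ in moves]
    move_labels = [one_hot(direction, n_directions) for _, direction in moves]
    return field_labels, move_labels


train_fig_moves = [[0, 0], [0, 0], [0, 0], [0, 0], [15, 2], [15, 2]]
field_labels, move_labels = split_labels(train_fig_moves)
print(len(field_labels), len(field_labels[0]))  # 6 64
print(len(move_labels), len(move_labels[0]))    # 6 5
```

With a two-output model, the two lists would then be passed together as the targets, one per output, for example as [field_labels, move_labels] (converted to arrays if the Keras version in use requires it), so that each loss sees labels with the same shape as its output.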

Optimising FOR LOOP or alternative to it

I am using a FOR LOOP to calculate a simple probability on a dataset with approximately 500K rows of data.
For loop
from collections import Counter

class_ = 4
class_freq = Counter(list_[-1] for list_ in train_list)  # Counter({5: 1476, 1: 1531, 4: 1562, 3: 1430, 2: 1498, 7: 1517, 6: 1486})

def cp(x, class_, freq_):  # x is the column index passed from another function
    pos = 0  # counters are initialised once, before the loop
    neg = 0
    for row in train_list:
        if row[x] == 1 and row[54] == class_:
            pos += 1
        else:
            neg += 1
    prob_0 = (neg + 0.1) / (freq_[class_] + 0.2)
    prob_1 = (pos + 0.1) / (freq_[class_] + 0.2)
    if prob_1 > prob_0:
        return prob_1
    else:
        return prob_0
Train_list sample
[3050, 180, 4, 277, -3, 5782, 221, 242, 156, 2721, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
[2818, 119, 19, 30, 10, 5213, 248, 220, 92, 4497, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
[3182, 115, 10, 553, 10, 4605, 237, 231, 124, 1768, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5]
[3024, 312, 18, 474, 177, 5785, 169, 224, 194, 4961, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
[3067, 32, 4, 30, -2, 6679, 219, 230, 147, 2947, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
[2716, 1, 10, 234, 27, 2100, 206, 222, 153, 5581, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
...
The FOR LOOP works as expected on a small dataset (a few hundred rows). Unfortunately, when I try to use it on 20K rows of data, the processing takes ages. I cannot imagine how long it will take to run on 500K rows.
A FOR LOOP performs very badly on a large dataset. What is an alternative? Will a lambda improve processing speed? I'd appreciate advice and assistance here, thanks.
Edited:
Thanks to everyone's comments, I have tried another algorithm to replace the FOR LOOP.
def cp(col, class_, freq_):
    filtered_list = [t for t in train_list if t[54] == class_]
    count_binary = Counter(binary[col] for binary in filtered_list)
    binary_1 = count_binary[1]
    binary_0 = count_binary[0]
    prob_0 = (binary_0 + 0.1) / (freq_[class_] + 0.2)
    prob_1 = (binary_1 + 0.1) / (freq_[class_] + 0.2)
    if prob_1 > prob_0:
        return prob_1
    else:
        return prob_0
I am still running the above code in my program and the process has not finished yet, so I can't tell whether it is much more efficient. I would appreciate it if someone could give their opinion on this new block of code.
FYI, if this is indeed better and more efficient code, then the processing-speed issue most likely lies in other parts of my code.
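One possible alternative, offered only as a sketch: both versions above rescan train_list on every call, so if cp is called once per column and class, the cost is multiplied accordingly. Counting everything in a single pass and then looking the numbers up may help. The names build_counts and ones are made up here, and the sketch assumes the class label is the last element of each row and that the columns of interest are binary.

```python
from collections import Counter, defaultdict


def build_counts(rows, label_idx=-1):
    """Single pass: class frequencies and, per class, how often each column is 1."""
    class_freq = Counter()
    ones = defaultdict(Counter)  # ones[label][col] = rows of that class with row[col] == 1
    for row in rows:
        label = row[label_idx]
        class_freq[label] += 1
        for col, value in enumerate(row):
            if value == 1:
                ones[label][col] += 1
    return class_freq, ones


def cp(col, label, class_freq, ones):
    """Same smoothed ratio as in the question, using the precomputed counts."""
    pos = ones[label][col]
    neg = class_freq[label] - pos
    denom = class_freq[label] + 0.2
    return max((pos + 0.1) / denom, (neg + 0.1) / denom)


rows = [[1, 0, 4], [1, 1, 4], [0, 1, 5]]  # tiny made-up sample: two columns plus label
class_freq, ones = build_counts(rows)
print(cp(1, 4, class_freq, ones))
```

Here neg counts the rows of the class whose value in that column is not 1, which matches the filtered version in the edit for binary columns.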

How to get important words using LGBM feature importance and Tfidf vectorizer?

I am working on a Kaggle dataset where the task is to predict the price of an item from its description and other attributes. Here is the link to the competition. As part of an experiment, I am currently using only an item's description to predict its price. The description is free text, and I use sklearn's TfidfVectorizer with unigrams and bigrams and max_features set to 60000 as input to a LightGBM model.
After training, I would like to know the most influential tokens for predicting the price. I assumed lightGBM's feature_importance method will be able to give me this. This will return a 60000 dim numpy array, whose index I can use to retrieve the token from the Tfidf's vectorizer's vocab dictionary.
Here is the code:
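import numpy as np
import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=60000)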
x_train = vectorizer.fit_transform(train_df['text'].values.astype('U'))
x_valid = vectorizer.transform(valid_df['text'].values.astype('U'))
idx2tok = {v: k for k, v in vectorizer.vocabulary_.items()}
features = [f'token_{i}' for i in range(len(vectorizer.vocabulary_))]
get_tok = lambda x, idxmap: idxmap[int(x[6:])]
lgb_train = lgb.Dataset(x_train, y_train)
lgb_valid = lgb.Dataset(x_valid, y_valid, reference=lgb_train)
gbm = lgb.train(lgb_params, lgb_train, num_boost_round=10, valid_sets=[lgb_train, lgb_valid], early_stopping_rounds=10, verbose_eval=True)
The model trains, however, after training when I call gbm.feature_importance(), I get a sparse array of integers, that really doesn't make sense to me:
fi = gbm.feature_importance()
fi[:100]
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
dtype=int32)
np.unique(fi)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 33, 34, 38, 45],
dtype=int32)
I'm not sure how to interpret this. I thought that earlier indices of the feature importance array would have higher values, and thus that tokens corresponding to those indices in the vectorizer's vocab would be more important/influential than other tokens. Is this assumption wrong? How do I get the most influential/important terms that determine the model outcome? Any help is appreciated.
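As far as I understand it, the array returned by feature_importance() is indexed by feature position (the same positions as the columns of the TF-IDF matrix), not sorted by importance, and with the default importance type each entry counts how often that feature was used in a split. With only a few boosting rounds, most of the 60000 entries would then be zero. Under that reading, ranking the tokens means sorting the indices by importance and mapping them back through the vocabulary. The sketch below is plain Python; top_tokens is a hypothetical helper, and vocabulary stands in for vectorizer.vocabulary_ (token to column index).

```python
def top_tokens(importances, vocabulary, k=20):
    """Return up to k (token, importance) pairs, most important first."""
    idx2tok = {idx: tok for tok, idx in vocabulary.items()}
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    # keep only features that were actually used by the model
    return [(idx2tok[i], int(importances[i]))
            for i in ranked[:k] if importances[i] > 0]


# made-up toy data standing in for gbm.feature_importance() and vectorizer.vocabulary_
importances = [0, 10, 0, 3]
vocabulary = {"cheap": 0, "brand new": 1, "used": 2, "leather": 3}
print(top_tokens(importances, vocabulary))  # [('brand new', 10), ('leather', 3)]
```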
Thanks.

Is there a method in Pytorch to count the number of unique values in a way that can be back propagated?

Given the following tensor (which is the result of a network [note the grad_fn]):
tensor([121., 241., 125., 1., 108., 238., 125., 121., 13., 117., 121., 229.,
161., 13., 0., 202., 161., 121., 121., 0., 121., 121., 242., 125.],
grad_fn=<MvBackward>)
Which we will define as:
xx = torch.tensor([121., 241., 125., 1., 108., 238., 125., 121., 13., 117., 121., 229.,
161., 13., 0., 202., 161., 121., 121., 0., 121., 121., 242., 125.]).requires_grad_(True)
I would like to define an operation which counts the number of occurrences of each value in such a way that the operation will output the following tensor:
tensor([2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 7, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 1, 1])
i.e. there are 2 zeros, 1 one, 2 thirteens, etc... the total number of possible values is set upstream, but in this example is 243
So far I have tried the following approaches, which successfully produce the desired tensor, but do not do so in a way that allows computing gradients back through the network:
Attempt 1
tt = []
for i in range(243):
    tt.append((xx == i).unsqueeze(0))
torch.cat(tt, dim=0).sum(dim=1)
Attempt 2
tvtensor = torch.tensor([i for i in range(243)]).unsqueeze(1).repeat(1,xx.shape[0]).float().requires_grad_(True)
(xx==tvtensor).sum(dim=1)
EDIT: Added Attempt
Attempt 3
-- Didn't really expect this to back prop, but figured I would give it a try anyway
ll = torch.zeros((1, 243))
for x in xx:
    ll[0, x.long()] += 1
Any help is appreciated
EDIT: As requested the end goal of this is the following:
I am using a technique for calculating structural similarity between two time sequences. One is real and the other is generated. The technique is outlined in this paper (https://link.springer.com/chapter/10.1007/978-3-642-02279-1_33) where a time series is converted to a sequence of code words and the distribution of code words (similar to the way that Bag of Words is used in NLP) is used to represent the time series. Two series are considered similar when the two signal distributions are similar. This is what the counting statistics tensor is for.
What I want is to construct a loss function that consumes this tensor and measures the distance between the real and generated signals (a Euclidean norm on the time-domain data directly does not work well, and this approach claims better results), so that it can update the generator appropriately.
I would do it with the torch.unique method (only to count occurrences): if you want to count the occurrences, you have to pass the parameter return_counts=True.
I did it in version 1.3.1.
This is the fast way to count occurrences; however, it is a non-differentiable operation, so this method is not recommended for your purpose (I have described it anyway as a way to count occurrences). To achieve what you want, I think you should turn the input into a distribution by means of a differentiable function (softmax is the most used) and then use some way of measuring the distance between the distributions (output and target), such as cross-entropy, KL (Kullback-Leibler), JS, or Wasserstein.
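One way the softmax suggestion could be realised is a "soft" count, where each element is assigned to the bins through a softmax over its negative squared distance to every bin centre. This is a sketch under my own assumptions, not the method from the paper: soft_counts, temperature, and the target histogram are made up for illustration.

```python
import torch


def soft_counts(x, num_bins, temperature=0.1):
    """Differentiable approximation of a histogram over the bins 0..num_bins-1."""
    centers = torch.arange(num_bins, dtype=x.dtype, device=x.device)
    # squared distance of every element to every bin centre: (len(x), num_bins)
    sq_dist = (x.unsqueeze(1) - centers.unsqueeze(0)) ** 2
    # soft one-hot assignment per element; each row sums to 1
    weights = torch.softmax(-sq_dist / temperature, dim=1)
    return weights.sum(dim=0)


xx = torch.tensor([121., 241., 0., 0., 13.], requires_grad=True)
counts = soft_counts(xx, 243)

target = torch.zeros(243)
target[120] = 5.0  # made-up target histogram, only to have something to compare with
loss = ((counts - target) ** 2).sum()
loss.backward()
print(counts[0].item(), xx.grad.shape)
```

With a small temperature the result stays close to the hard counts, but the gradient becomes very small when the values sit exactly on the bin centres; a larger temperature gives smoother gradients at the cost of blurrier counts.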
You will not be able to do that, as unique is simply a non-differentiable operation.
Furthermore, only floating-point tensors can have gradients, since a gradient is defined only over the real numbers, not over the integers.
Still, there might be another, differentiable way to do what you want to achieve, but that's a different question.
The "uniquify" operation is non-differentiable, but there might be ways to remedy this, for instance, by writing a custom operator, or by a clever combination of differentiable operators.
However, you need to ask yourself this question: what do you expect the gradient of such operation to be? Or, on a higher level, what are you trying to achieve with this operation?

Finding the position of the median of an array containing mostly zeros

I have a very large 1d array with most elements being zero while nonzero elements are all clustered around some few islands separated by many zeros: (here is a smaller version of that for the purpose of a MWE)
In [1]: import numpy as np
In [2]: A=np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3,6,20,14,10,5,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,4,5,5,18,18,16,14,10,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,3,6,16,4,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
I want to find the median and its position (even approximately) in terms of the index corresponding to the median value of each island. Not surprisingly, I am getting zero which is not what I desire:
In [3]: np.median(A)
Out[3]: 0.0
In [4]: np.argsort(A)[len(A)//2]
Out[4]: 12
In the case of a single island of nonzero elements, to work around this caveat and meet my requirement that only nonzero elements are physically meaningful, I remove all zeros first and then take the median of the remaining elements:
In [5]: masks = np.where(A>0)
In [6]: A[masks]
Out[6]: array([ 1, 3, 6, 20, 14, 10, 5, 1])
This time, I get the median of the new array correctly, however the position (index) would not be correct as it is evident and also pointed out in the comments as being ill-defined mathematically.
In [7]: np.median(A[masks])
Out[7]: 5.5
In [8]: np.argsort(A[masks])[len(A[masks])//2]
Out[8]: 2
According to this approximation, I know that real median is located in the third index of the modified array but I would like to translate it back into the format of the original array where the position (index) of the median should be somewhere in the middle of the first island of the nonzero elements corresponding to a larger index (where indices of zeros are all counted correctly). Also answered in the comments are two suggestions made to come up with the position of the median given one island of nonzero elements in the middle of a sea of zeros. But what if there is more than one such island? How could possibly one calculate the index corresponding to median of each island in the context of the original histogram array where zeros are all counted?
I am wondering if there is any easy way to calculate the position of the median in such arrays of many zeros. If not, what else should I add to my lines of code to make that possible after knowing the position in the modified array? Your help is greatly appreciated.
Based on the comment "A is actually a discrete histogram with many bins", I think what you want is the median of the values being counted. If A is an integer array of counts, then an exact (but probably very inefficient, if you have values as high as 1e7) formula for the median is
np.median(np.repeat(np.arange(len(A)), A)) # Do not use if A contains very large values!
Alternatively, you can use
np.searchsorted(A.cumsum(), 0.5*A.sum())
which will be the integer part of the median.
For example:
In [157]: A
Out[157]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3,
6, 20, 14, 10, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0])
In [158]: np.median(np.repeat(np.arange(len(A)), A))
Out[158]: 35.5
In [159]: np.searchsorted(A.cumsum(), 0.5*A.sum())
Out[159]: 35
Another example:
In [167]: B
Out[167]:
array([ 0, 0, 0, 1, 100, 21, 8, 3, 2, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
In [168]: np.median(np.repeat(np.arange(len(B)), B))
Out[168]: 4.0
In [169]: np.searchsorted(B.cumsum(), 0.5*B.sum())
Out[169]: 4
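For the case with several islands, the same cumulative-sum idea could be applied to each run of nonzero bins separately. The following is a sketch in plain Python (it also accepts a 1-D NumPy array); island_medians is a hypothetical name, and it assumes that any zero bin separates two islands.

```python
from bisect import bisect_left
from itertools import accumulate


def island_medians(counts):
    """Return (start, stop, median_index) for every run of nonzero counts."""
    counts = list(counts)
    results = []
    i, n = 0, len(counts)
    while i < n:
        if counts[i] == 0:
            i += 1
            continue
        start = i
        while i < n and counts[i] != 0:
            i += 1
        # cumulative counts inside this island only
        running = list(accumulate(counts[start:i]))
        # first bin where the cumulative count reaches half of the island total
        offset = bisect_left(running, 0.5 * running[-1])
        results.append((start, i - 1, start + offset))
    return results


A = [0, 0, 1, 3, 6, 20, 14, 10, 5, 1, 0, 0, 0, 2, 2, 0]
print(island_medians(A))  # [(2, 9, 5), (13, 14, 13)]
```

The indices refer to the original array, zeros included, and bisect_left plays the role of np.searchsorted with its default left side.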
