Avoid and count non-numerical values when computing basic statistics in Mathematica

Please consider:
dalist={{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{2.88`, 2.04`, 4.64`,0.56`, 4.92`, 2.06`, 3.46`, 2.68`, 2.72`,0.820},
{"Laura1", "Laura1", "Laura1", "Laura1", "Laura1",
"Laura1", "Laura1", "Laura1", "Laura1","Laura1"},
{"RIGHT", 0, 1, 15.1`, 0.36`, 505, 20.059375`,15.178125`, ".", "."}}
The actual dataset is about 6 000 rows and 147 columns. However the above reflects its content. I would like to compute some basic statistics, such as the mean. My attempt:
Table[Mean@dalist[[colNO]], {colNO, 1, 4}]
How could I create a function that would:
Skip non-numerical values, and
Count the number of non-numerical values found in each list?
I have not yet found the right pattern-matching mechanism.

First observation: you could use Mean /@ dalist if you wanted to average across rows. You don't need a Table function here.
Try using Cases (documentation), e.g. Mean /@ (Cases[#,_?NumericQ] & /@ dalist)
If you want to be tricky and eliminate rows from your data that have no numeric elements (eg your third column), try the following. It first picks only the rows that have some numeric elements, and then takes only the numeric elements from those rows.
Mean /@ (Cases[#,_?NumericQ] & /@ (Cases[dalist, {___,_?NumericQ,___}]))
To count the non-numeric elements, you would use a similar approach:
Length /@ (Cases[#,Except[_?NumericQ]] & /@ dalist)
This answer comes with a caveat: I typed it out without a Mathematica installation at hand to check my syntax, so some typos could remain.

Here is a variation of Verbeia's answer that you may consider.
Assuming that this is a rectangular array (all rows are the same length), then setting d to the row length (which can be found with Dimensions):
d = 10;
{d - Length@#, Mean@#} &@Select[#, NumericQ] & /@ dalist
(* Out: *) {{0, 11/2}, {0, 2.678}, {10, Mean[{}]}, {3, 79.5282}}
That is, pairs of {number_of_non-numeric, average}.
Mean[{}] appears where there are no numeric values to average. This could be removed from the list with DeleteCases but the results would no longer align with the rows of dalist. I think it would be better to use something like: /. Mean[{}] -> "NO AVERAGE" if needed.

The key to answering your question is the NumberQ function: "NumberQ[expr] gives True if expr is a number, and False otherwise."
To compute the mean of only numeric elements in each list:
Map[Function[lst, Mean[Select[lst, NumberQ]]], dalist]
To count the number of non-numeric elements in each list:
Map[Function[lst, Length[Select[lst, Function[x, !NumberQ[x]]]]], dalist]

Related

Efficient search for collisions in multiple lists

I have multiple lists with data of the following form. (This is a simplified example; in practice, the row-vectors have a much larger dimension.)
list 1: [num1] [[1,0,0,1,0], [0,0,1,0,1], [0,1,0,1,0], ...]
list 2: [num2] [[0,0,0,1,0], [1,0,0,1,0], [0,0,1,0,0], ...]
...
list n: [numn] [[1,1,0,1,0], [1,0,0,1,1], [0,0,1,0,1], ...]
Every list is marked with its own number [num] (the numbers are not repeated).
The main question is: how can I efficiently find every row-vector that occurs in more than one list, together with the num's of the lists that contain it?
In details:
For example, the row-vector [1,0,0,1,0] occurs in list 1 and list 2, so I should return [1,0,0,1,0] : [num1], [num2]
Hash tables come to mind first. I think they are the best choice given the large amount of data, but I know hash tables only superficially and I can't work out a clear algorithm for this case. Can anyone advise what I should pay attention to and which modules I should consider? Perhaps there are other efficient approaches?
It is beyond the scope of a regular question to dive into hash tables and such. But suffice to say that sets in Python are backed by hash tables and checking for set membership is almost instantaneous and much more efficient than searching through lists.
If order doesn't matter within your list of vectors, you should just think of them as unordered collections (sets). Sets need to contain immutable things, so you cannot put a list into a set, but you can put in tuples. So, if you re-structure your data to be sets of tuples, you are in good shape.
You have many "cases" of things you might do then, below are a few examples.
data = {1: {(1, 0, 0), (1, 1, 0)},
        2: {(0, 0, 0), (1, 0, 0)},
        3: {(1, 0, 0), (1, 0, 1), (1, 1, 0)}}

# find common vectors in 2 sets
def common_vecs(a, b):
    return a.intersection(b)

# find all the common vectors in a group of sets
def all_common_vecs(grps):
    return set.intersection(*grps)

# find which sets contain a specific vector
def find(vec, data):
    result = set()
    for idx, grp in data.items():
        if vec in grp:
            result.add(idx)
    return result

print(common_vecs(data[1], data[3]))
print(all_common_vecs(data.values()))
print(find((1,0,1), data))
Output:
{(1, 0, 0), (1, 1, 0)}
{(1, 0, 0)}
{3}
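The original question asks for every vector that appears in more than one list, along with the numbers of those lists. One possible way to do that with hash tables is an inverted index: a dict that maps each vector (as a tuple) to the set of list numbers containing it. The sketch below is only an illustration under assumptions: the function name find_collisions, the variable lists_by_num, and the sample data (taken from the rows visible in the question, with list 3 standing in for "list n") are hypothetical and not part of the answer above.

```python
from collections import defaultdict

def find_collisions(lists_by_num):
    """Return {vector: set of list numbers} for vectors found in 2+ lists."""
    index = defaultdict(set)
    for num, rows in lists_by_num.items():
        for row in rows:
            # Lists are unhashable, so each row is converted to a tuple.
            index[tuple(row)].add(num)
    # Keep only the vectors that occur in more than one list.
    return {vec: nums for vec, nums in index.items() if len(nums) > 1}

# Hypothetical sample data, keyed by list number.
lists_by_num = {
    1: [[1, 0, 0, 1, 0], [0, 0, 1, 0, 1], [0, 1, 0, 1, 0]],
    2: [[0, 0, 0, 1, 0], [1, 0, 0, 1, 0], [0, 0, 1, 0, 0]],
    3: [[1, 1, 0, 1, 0], [1, 0, 0, 1, 1], [0, 0, 1, 0, 1]],
}

collisions = find_collisions(lists_by_num)
for vec, nums in collisions.items():
    print(list(vec), ":", sorted(nums))
```

Each row is hashed once, so the work grows roughly linearly with the total number of rows, instead of comparing every list against every other list.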

Indexing a multi-dimensional tensor with a tensor in PyTorch

I have the following code:
a = torch.randint(0,10,[3,3,3,3])
b = torch.LongTensor([1,1,1,1])
I have a multi-dimensional index b and want to use it to select a single cell in a. If b wasn't a tensor, I could do:
a[1,1,1,1]
Which returns the correct cell, but:
a[b]
Doesn't work, because it just selects a[1] four times.
How can I do this? Thanks
A more elegant (and simpler) solution might be to simply cast b as a tuple:
a[tuple(b)]
Out[10]: tensor(5.)
I was curious to see how this works with "regular" numpy, and found a related article explaining this quite well here.
You can split b into 4 using chunk, and then use the chunked b to index the specific element you want:
>> a = torch.arange(3*3*3*3).view(3,3,3,3)
>> b = torch.LongTensor([[1,1,1,1], [2,2,2,2], [0, 0, 0, 0]]).t()
>> a[b.chunk(chunks=4, dim=0)] # here's the trick!
Out[24]: tensor([[40, 80, 0]])
What's nice about this is that it generalizes easily to any number of dimensions of a: you just need to make the number of chunks equal to the number of dimensions of a.

num2str sets a constant width for integer formatting

I am using num2str to print an array of integers. My problem is that the format %d, (notice no flag or field width) doesn't yield a comma-separated list of values as I would expect.
Instead, it seems that all elements are forced to the same width by introducing spaces. I would like to get rid of these spaces. For example:
>> num2str(randi(10,1,10),'%d,')
7, 8,10,10, 2, 2, 7, 1, 6, 6,
>> num2str(randi(10,1,10),'%d,')
9,5,4,7,8,6,4,2,6,3,
In the first example, you can see that all elements have a width of 2 -- this is the largest width among all elements, but I would prefer the output list to be compact: 7,8,10,10,2,2,7,1,6,6,. In the second example, the largest width is 1, and there are no spaces introduced. I don't understand why Matlab would force all elements to have equal field length.
num2str computes the max of the vector and pads numbers that have fewer digits with white space (type edit num2str in the command window to see the source code).
Try sprintf instead,
sprintf('%d,', randi(1000,1,10))

List of ints to list of strings in MATLAB

I have a list of integers
[0, 10, 20, 30, ...]
And I want to use them as a legend in a plot. My understanding is that I need to give the command
legend('0','10','20','30', ...)
So how do I get a list of strings from my original vector to pass to legend()?
num2str isn't working for me because I get just one long string. I'm still a little new to MATLAB syntax...
legend(num2str([0, 10, 20, 30]'))
Converting a column vector of numbers will produce a char array with m rows, where each row can be a legend entry.

Python range() with negative strides

Is there a way of using the range() function with stride -1?
E.g. using range(10, -10) instead of the square-bracketed values below?
I.e the following line:
for y in range(10,-10):
Instead of
for y in [10,9,8,7,6,5,4,3,2,1,0,-1,-2,-3,-4,-5,-6,-7,-8,-9,-10]:
Obviously one could do this with another kind of loop more elegantly but the range() example would work much better for what I want.
You can specify the stride (including a negative stride) as the third argument, so
range(10,-11,-1)
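gives (on Python 3, wrap it in list() to see the values, since range returns a lazy range object rather than a list)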
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10]
In general, it doesn't cost anything to try. You can simply type this into the interpreter and see what it does.
This is all documented here as:
range(start, stop[, step])
but mostly I'd like to encourage you to play around and see what happens. As you can see, your intuition was spot on.
Yes, by defining a step:
for i in range(10, -11, -1):
    print(i)
In addition to the other good answers, there is an alternative:
for y in reversed(range(-10, 11)):
See the documentation for reversed().
Without a third argument, range only counts upward. If the start is greater than the stop and no step is given, the range is empty, so the loop body never runs:
for i in range(10, -10):
This loop raises no error, but it iterates zero times.
To count down, pass a negative number as the third argument:
for i in range(10, -10, -1):
Yes, but you'll need to specify that you want to step backwards by setting the step argument to -1.
Use (the stop is -11 because the stop value itself is excluded):
for y in range(10, -11, -1):
For your case, range(10,-10,-1) will be helpful. The first argument is the start value, the second is the stop value (which is excluded from the result, so this range ends at -9), and the third is the step size.
When your range is ascending and you need every integer in between, you do not need to specify the step: range(-10,10) or range(-10,-5).
But when your range is descending, you need to specify a negative step, such as range(10,-10,-1), or a larger negative step.
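The argument rules described above can be checked directly in the interpreter. The following is a minimal sketch for Python 3, using only built-ins; the variable names are chosen for illustration.

```python
# Start 10, stop -11 (excluded), step -1: counts down from 10 to -10.
descending = list(range(10, -11, -1))
print(descending[0], descending[-1], len(descending))

# No step given and start > stop: the range is empty.
empty = list(range(10, -10))
print(empty)

# The stop value is excluded, so a stop of -10 ends at -9.
stops_early = list(range(10, -10, -1))
print(stops_early[-1])

# reversed() over an ascending range yields the same countdown.
same = list(reversed(range(-10, 11))) == descending
print(same)
```

The last check shows that reversed(range(-10, 11)) and range(10, -11, -1) produce the same sequence, so the choice between them is mostly a matter of readability.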
If you prefer to create a list from a range:
numbers = list(range(-10, 10))
To summarize, I believe these three approaches are the most efficient and relevant. Note that first and third count down from 10 to -10, while second counts up from -10 to 10:
first = list(x for x in range(10, -11, -1))
second = list(range(-10, 11))
third = [x for x in reversed(range(-10, 11))]
Alternatively, NumPy creates the whole sequence as a single array, which is typically faster than building a Python list element by element for large ranges. You can then convert the array to a list:
import numpy as np
first = -(np.arange(10, -11, -1))
Notice the negation sign for first: it turns the descending array 10, ..., -10 into the ascending array -10, ..., 10, the same values as second below.
second = np.arange(-10, 11)
Convert either array to a list as follows, or use it directly as a numpy.ndarray.
to_the_list = first.tolist()
# Traverse a list in reverse order
l1 = [2, 4, 3]
for i in range(len(l1) - 1, -1, -1):
    print(l1[i])

Resources