I have a banded sparse square matrix, A, of type <class 'scipy.sparse.csr.csr_matrix'> and size 400 x 400. I'd like to split it into overlapping square blocks of size 200 x 200 each. For instance, the blocks would be:
block1 = A[0:200, 0:200]
block2 = A[100:300, 100:300]
block3 = A[200:400, 200:400]
The same information about the slices is stored in a list of tuples.
[(0,200), (100, 300), (200, 400)]
Suggestions on how to split the sparse square matrix would be really helpful.
You can convert to a regular array and then split it:
from scipy.sparse import csr_matrix
import numpy as np
row = np.arange(400)[::2]
col = np.arange(400)[1::2]
data = np.random.randint(1, 10, (200))
compressed_matrix = csr_matrix((data, (row, col)), shape=(400, 400))
# Convert to a regular array
m = compressed_matrix.toarray()
# Split the matrix
sl = [(0,200), (100, 300), (200, 400)]
blocks = [m[i, i] for i in map(lambda x: slice(*x), sl)]
And if you want you can convert back each block to a compressed matrix:
blocks_csr = list(map(csr_matrix, blocks))
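As an alternative to the dense round trip, CSR matrices also accept slice indexing directly, so the blocks can stay sparse throughout. The following is a minimal sketch under that assumption; the names A and sparse_blocks are illustrative and not part of the original answer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same toy matrix as in the answer: one nonzero per even row
row = np.arange(400)[::2]
col = np.arange(400)[1::2]
data = np.random.randint(1, 10, 200)
A = csr_matrix((data, (row, col)), shape=(400, 400))

sl = [(0, 200), (100, 300), (200, 400)]

# Slice the sparse matrix directly; no dense intermediate is created
sparse_blocks = [A[start:stop, start:stop] for start, stop in sl]

# Sanity check against the dense route used in the answer
dense = A.toarray()
for (start, stop), block in zip(sl, sparse_blocks):
    assert np.array_equal(block.toarray(), dense[start:stop, start:stop])
```

For a 400 x 400 matrix the difference is negligible, but for large matrices avoiding toarray() keeps memory use proportional to the number of nonzeros.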
CODE EXPLANATION
The creation of the blocks is based on a list comprehension and basic slicing.
Each input tuple is converted to a slice object, which selects a range of row and column indexes; here, using the same slice on both axes is enough to select the requested square block. Slice objects are what the extended indexing syntax creates under the hood: a[start:stop:step] is equivalent to a[slice(start, stop, step)]. In our case, they let us change the selected indexes dynamically, according to the block we want to extract. So, for the first block, m[i, i] is equivalent to m[0:200, 0:200].
Slice objects are a form of basic indexing, so a view of the original array is created, rather than a copy (this means that if you modify the view, also the original array will be modified: you can easily create a copy of the original data using the copy method of the numpy array).
The map function is used to generate slice objects from the input tuples; map applies the function provided as its first argument to every element of its second argument.
lambda is used to create an anonymous function, i.e., a function defined without a name. Anonymous functions are useful for small, specific tasks that you do not want to write as a standard function, because you are not going to reuse them or you need them only briefly, as in this code. They keep the code more compact than defining the corresponding named functions.
*x is called unpacking, i.e., you extract (unpack) the elements from the tuple. Suppose you have a function f and a tuple a = (1, 2, 3); then f(*a) is equivalent to f(1, 2, 3) (as you can see, you can think of unpacking as removing a level of parentheses).
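The pieces described above can be checked in isolation. This is a small sketch on a toy 4 x 4 array (the array and the index ranges are made up for illustration), showing that slice(*x) matches the literal slicing syntax and that map with a lambda builds the same slices as a plain comprehension.

```python
import numpy as np

m = np.arange(16).reshape(4, 4)
bounds = (1, 3)

# Unpacking: slice(*bounds) is the same as slice(1, 3)
s = slice(*bounds)
assert s == slice(1, 3)

# Indexing with slice objects matches the literal slicing syntax
assert np.array_equal(m[s, s], m[1:3, 1:3])

# map + lambda and a list comprehension produce the same slice objects
sl = [(0, 2), (1, 3), (2, 4)]
via_map = list(map(lambda x: slice(*x), sl))
via_comprehension = [slice(*x) for x in sl]
assert via_map == via_comprehension
```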
So, looking back at the code:
blocks = [                        # this is a list comprehension
    m[i, i]                       # basic slicing of the input array
    for i in map(                 # map applies a function to every item of a list
        lambda x: slice(*x), sl   # build a slice object from each index range
    )
]
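Because the blocks are views, writing to one of them also writes to m. The sketch below uses a toy 4 x 4 array (not the matrix from the answer) to show the difference between a view and an explicit copy.

```python
import numpy as np

m = np.zeros((4, 4), dtype=int)
i = slice(0, 2)

# Basic slicing returns a view: writes go through to the original array
view = m[i, i]
view[0, 0] = 7
assert m[0, 0] == 7
assert np.shares_memory(view, m)

# .copy() detaches the block from the original array
independent = m[i, i].copy()
independent[1, 1] = 9
assert m[1, 1] == 0
assert not np.shares_memory(independent, m)
```

If the blocks are going to be modified independently, adding .copy() inside the list comprehension is one way to avoid surprises.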
So I had this statistics homework and I wanted to do it with Python and NumPy.
The question started with generating 1000 random samples that follow a normal distribution.
random_sample=np.random.randn(1000)
Then it asked to divide these numbers into subgroups. For example, suppose we divide them into five subgroups: the first subgroup is the random numbers in the range (-5, -3), and so on up to the last subgroup, (3, 5).
Is there any way to do this using NumPy (or anything else)?
And if possible, I want it to keep working when the number of subgroups changes.
You can get subgroup indices using numpy.digitize:
random_sample = 5 * np.random.randn(10)
random_sample
# -> array([-3.99645573, 0.44242061, 8.65191515, -1.62643622, 1.40187879,
# 5.31503683, -4.73614766, 2.00544974, -6.35537813, -7.2970433 ])
indices = np.digitize(random_sample, (-3,-1,1,3))
indices
# -> array([0, 2, 4, 1, 3, 4, 0, 3, 0, 0])
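To turn those indices into actual subgroups, and to let the number of subgroups vary, the bin edges can be generated with np.linspace. This is a sketch under the assumption that the subgroups are equal-width intervals on [-5, 5), as in the question; n_groups, edges and groups are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
random_sample = rng.standard_normal(1000)

n_groups = 5
edges = np.linspace(-5, 5, n_groups + 1)  # [-5, -3, -1, 1, 3, 5]

# digitize returns k when edges[k-1] <= x < edges[k];
# 0 and n_groups + 1 mark values outside the overall range
indices = np.digitize(random_sample, edges)

# One array per subgroup; out-of-range values are left out
groups = [random_sample[indices == k] for k in range(1, n_groups + 1)]
```

Changing n_groups is then enough to get a different number of subgroups over the same range.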
If you sort your random_sample, you can divide the array by finding the indices of the "breakpoint" values, i.e. the sample values closest to the boundaries you define, like -5 or -3. The code would be something like:
import numpy as np
my_range = [-5,-3,-1,1,3,5] # example of ranges
random_sample = np.random.randn(1000)
hist = np.sort(random_sample)
# argmin() will find index where absolute difference is closest to zero
idx = [np.abs(hist-i).argmin() for i in my_range]
groups=[hist[idx[i]:idx[i+1]] for i in range(len(idx)-1)]
Now groups is a list where each element is an array with all random values within your defined ranges.
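One caveat: argmin picks the element nearest to each boundary, which may lie on either side of it, so a group can gain or lose one element at its edges. A possible variation, which swaps in np.searchsorted in place of the argmin step above, finds the exact split points on the sorted array:

```python
import numpy as np

my_range = [-5, -3, -1, 1, 3, 5]  # example of ranges
rng = np.random.default_rng(0)
hist = np.sort(rng.standard_normal(1000))

# For each boundary, searchsorted returns the first index whose value is >= it
idx = np.searchsorted(hist, my_range)

# Consecutive split points delimit the groups on the sorted array
groups = [hist[idx[i]:idx[i + 1]] for i in range(len(idx) - 1)]
```

With this variant, every value in groups[i] satisfies my_range[i] <= value < my_range[i + 1].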
I am using num2str to print an array of integers. My problem is that the format %d, (notice no flag or field width) doesn't yield a comma-separated list of values as I would expect.
Instead, it seems that all elements are forced to the same width by introducing spaces. I would like to get rid of these spaces. For example:
>> num2str(randi(10,1,10),'%d,')
7, 8,10,10, 2, 2, 7, 1, 6, 6,
>> num2str(randi(10,1,10),'%d,')
9,5,4,7,8,6,4,2,6,3,
In the first example, you can see that all elements have a width of 2 -- this is the largest width among all elements, but I would prefer the output list to be compact: 7,8,10,10,2,2,7,1,6,6,. In the second example, the largest width is 1, and there are no spaces introduced. I don't understand why Matlab would force all elements to have equal field length.
num2str computes the max of the vector and pads the numbers that have fewer digits with white space (type edit num2str in the command window to see the source code).
Try sprintf instead,
sprintf('%d,', randi(1000,1,10))
When you take centiles of a variable in Stata, e.g.
*set directory
cd"C:\Etc\Etc Etc\"
*open data file
use "dataset.dta",clear
*get centiles
centile var1, centile(1,5(5)95,99)
is there some way to record the resulting centile table to Excel? The centile values are stored in r(c_#), where # indicates the centile at which you want the data. But I need a vector of the values at all the centiles, more or less as it appears in the output window.
I have attempted to use foreach loop to get the centiles into a vector, as follows:
*Create column of centiles
foreach i in r(centiles) {
xx[1,`i']=r(c_`i')
}
without success.
Thanks
EDIT:
I've since found this to work:
matrix X = 0,0
forvalues i=1/21 {
matrix X = `i',round(r(c_`i'),.001)\ X
}
The only inconveniences are: 1) I have to include a first row of 0,0 in the output, which I then have to drop. 2) In this case I have 21 centiles, but it would be nice to automate the number of centiles in case I want to change it, for example with something like this:
forvalues i=1/r(n_cent) {
matrix X = `i',round(r(c_`i'),.001)\ X
}
But the "i=1/r(n_cent)" is invalid syntax. Any advice as to how I might overcome these two inconveniences would be much appreciated.
Thanks
You can use the following syntax.
Load some data and compute the percentiles.
sysuse auto, clear
centile price, centile(1,5(5)95,99)
The matrix that is supposed to contain the results has to be initialized. This matrix is called X. It has as many rows as there are centiles requested via the centile command. It has two columns. At this stage, the matrix is populated with zeroes.
matrix X = J(`=wordcount("`r(centiles)'")', 2, 0)
The following loop steps through the results of the centile command and replaces the zeroes in matrix X with the appropriate results. The first column of the matrix contains the number of the centile (1, 5, 10, ...) and the second column contains the result.
forvalues i = 1 / `=wordcount("`r(centiles)'")' {
local cent: word `i' of `r(centiles)'
matrix X[`i', 1] = `cent'
matrix X[`i', 2] = r(c_`i')
}
Print the results:
matrix list X
If you are using round(), you are likely doing something wrong. There are few reasons to deliberately lose precision in the data; you can always display as many digits as you like using format this way or another (either applied to the data, or as an option of list or matrix list).
I wrote the epctile command, which returns percentiles as an estimation command, i.e., in the e(b) vector. This can be used immediately; findit epctile to download.
You can modify your proposal as follows:
local thenumlist 1, 5(5)95, 99
centile variable, centile(`thenumlist')
forvalues i=1/`=r(n_cent)' {
matrix X = nullmat(X) \ r(c_`i')
}
numlist "`thenumlist'"
matrix rownames X = `r(numlist)'
matrix list X, format(%9.3f)
Please consider:
dalist={{1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
{2.88`, 2.04`, 4.64`,0.56`, 4.92`, 2.06`, 3.46`, 2.68`, 2.72`,0.820},
{"Laura1", "Laura1", "Laura1", "Laura1", "Laura1",
"Laura1", "Laura1", "Laura1", "Laura1","Laura1"},
{"RIGHT", 0, 1, 15.1`, 0.36`, 505, 20.059375`,15.178125`, ".", "."}}
The actual dataset is about 6 000 rows and 147 columns. However the above reflects its content. I would like to compute some basic statistics, such as the mean. My attempt:
Table[Mean@dalist[[colNO]], {colNO, 1, 4}]
How could I create a function such as to:
Avoid non-numerical values, and
Count the number of non-numerical values found in each list.
I have not succeeded in finding the right pattern mechanism yet.
First observation: you could use Mean /@ dalist if you wanted to average across rows. You don't need a Table function here.
Try using Cases (documentation), e.g. Mean /@ (Cases[#, _?NumericQ] & /@ dalist)
If you want to be tricky and eliminate rows from your data that have no numeric elements (e.g. your third column), try the following. It first picks only the rows that have some numeric elements, and then takes only the numeric elements from those rows.
Mean /@ (Cases[#, _?NumericQ] & /@ (Cases[dalist, {___, _?NumericQ, ___}]))
To count the non-numeric elements, you would use a similar approach:
Length /@ (Cases[#, Except[_?NumericQ]] & /@ dalist)
This answer has the caveat that I typed it out without the benefit of a Mathematica installation to actually check my syntax. Some typos could remain.
Here is a variation of Verbeia's answer that you may consider.
Assuming that this is a rectangular array (all rows are the same length), then setting d to the row length (which can be found with Dimensions):
d = 10;
{d - Length@#, Mean@#} &@Select[#, NumericQ] & /@ dalist
(* Out: *) {{0, 11/2}, {0, 2.678}, {10, Mean[{}]}, {3, 79.5282}}
That is, pairs of {number_of_non-numeric, average}.
Mean[{}] appears where there are no numeric values to average. This could be removed from the list with DeleteCases but the results would no longer align with the rows of dalist. I think it would be better to use something like: /. Mean[{}] -> "NO AVERAGE" if needed.
The key to answering your question is the NumberQ function: "NumberQ[expr] gives True if expr is a number, and False otherwise."
To compute the mean of only numeric elements in each list:
Map[Function[lst, Mean[Select[lst, NumberQ]]], dalist]
To count the number of non-numeric elements in each list:
Map[Function[lst, Length[Select[lst, Function[x, !NumberQ[x]]]]], dalist]