When you take centiles of a variable in Stata, for eg.
*set directory
cd"C:\Etc\Etc Etc\"
*open data file
use "dataset.dta",clear
*get centiles
centile var1, centile(1,5(5)95,99)
is there some way to record the resulting centile table to excel? The centile values are stored in r(c_#), where # indicates the centile at which you want the data. But I need a vector of the values at all the centiles, more or less as it appears in the output window.
I have attempted to use foreach loop to get the centiles into a vector, as follows:
*Create column of centiles
foreach i in r(centiles) {
xx[1,`i']=r(c_`i')
}
without success.
Thanks
EDIT:
I've since found this to work:
matrix X = 0,0
forvalues i=1/21 {
matrix X = `i',round(r(c_`i'),.001)\ X
}
Only inconveniences are 1) I have to include a a first row of 0,0 in the output, which I will then subsequently drop. 2) In this case I have 21 centiles, but it would be nice to automate the number of centiles in case I want to change it, for example something like this:
forvalues i=1/r(n_cent) {
matrix X = `i',round(r(c_`i'),.001)\ X
}
But the "i=1/r(n_cent)" is invalid syntax. Any advice as to how I might overcome these two inconveniences would be much appreciated.
Thanks
You can use the following syntax.
Load some data and compute the percentiles.
sysuse auto, clear
centile price, centile(1,5(5)95,99)
The matrix that is supposed to contain the results has to be initialized. This matrix is called X. It has as many rows as there are centiles requested via the centile command. It has two columns. At this stage, the matrix is populated with zeroes.
matrix X = J(`=wordcount("`r(centiles)'")', 2, 0)
The following loop is stepping through the results of the centile command and is replacing the zeroes in matrix X with the appropriate results. The first column of the matrix contains the number of the centile (1, 5, 10, ...) and the second column contains the result
forvalues i = 1 / `=wordcount("`r(centiles)'")' {
local cent: word `i' of `r(centiles)'
matrix X[`i', 1] = `cent'
matrix X[`i', 2] = r(c_`i')
}
Print the results:
matrix list X
If you are using round(), you are likely doing something wrong. There are few reasons to deliberately lose precision in the data; you can always display as many digits as you like using format this way or another (either applied to the data, or as an option of list or matrix list).
I wrote epctile command that returns percentiles as an estimation command, i.e., in the e(b) vector. This can be usable immediately; findit epctile to download.
You can modify your proposal as follows:
local thenumlist 1, 5(5)95, 99
centile variable, centile(`thenumlist')
forvalues i=1/`=r(n_cent)' {
matrix X = nullmat(X) \ r(c_`i')
}
numlist "`thenumlist'"
matrix rownames X = `r(numlist)'
matrix list X, format(%9.3f)
Related
I have a banded sparse square matrix , A, of type <class 'scipy.sparse.csr.csr_matrix'> and size = 400 x 400. I'd like to split this into block square matrices of size 200 x 200 each. For instance, the first block
block1 = A[0:200, 0:200]
block2 = A[100:300, 100:300]
block3 = A[200:400, 200:400]
The same information about the slices is stored in a list of tuples.
[(0,200), (100, 300), (200, 400)]
Suggestions on how to split the spare square matrix will be really helpful.
You can convert to a regular array and then split it:
from scipy.sparse import csr_matrix
import numpy as np
row = np.arange(400)[::2]
col = np.arange(400)[1::2]
data = np.random.randint(1, 10, (200))
compressed_matrix = csr_matrix((data, (row, col)), shape=(400, 400))
# Convert to a regular array
m = compressed_matrix.toarray()
# Split the matrix
sl = [(0,200), (100, 300), (200, 400)]
blocks = [m[i, i] for i in map(lambda x: slice(*x), sl)]
And if you want you can convert back each block to a compressed matrix:
blocks_csr = list(map(csr_matrix, blocks))
CODE EXPLANATION
The creation of the blocks is based on a list comprehension and basic slicing.
Each input tuple is converted to a slice object, only to create a series of row and column indexes, corresponding to that of the elements to be selected; in this answer, this is sufficient to select the requested block squared matrix. Slice objects are generated when extended indexing syntax is used: To be clear, a[start:stop:step] will create a slice object equivalent to slice(start, stop, step). In our case, they are used to dynamically change the indexes to be selected, according to the matrix we want to extract. So, if you consider the first block, m[i, i] is equivalent to m[0:200, 0:200].
Slice objects are a form of basic indexing, so a view of the original array is created, rather than a copy (this means that if you modify the view, also the original array will be modified: you can easily create a copy of the original data using the copy method of the numpy array).
The map object is used to generate slice objects from the input tuples; map applies the function provided as its first argument to all the elements of its second argument.
lambda is used to create an anonymous function, i.e., a function defined without a name. Anonymous functions are useful to accomplish specific tasks that you do not want to code in a standard function, because you are not going to reuse them or you need only for a short period of time, like in the example of this code. They make code more compact rather than defining the correspondent functions.
*x is called unpacking, i.e you extract, unpack elements from the tuple. Suppose you have a function f and a tuple a = (1, 2, 3), then f(*a) is equivalent to f(1, 2, 3) (as you can see, you can think of unpacking as removing a level of parentheses).
So, looking back at the code:
blocks = [ # this is a list comprehension
m[i, i] # basic slicing of the input array
for i in map( # map apply a function to all the item of a list
lambda x: slice(*x), sl # creating a slice object out of the provided index ranges
)
]
Say you have an ordered array of values representing x coordinates.
[0,25,50,60,75,100]
You might notice that without the 60, the values would be evenly spaced (25). This would be indicative of a repeating pattern, something that I need to extract using this list (regardless of the length and the values of the list). In this particular example, the algorithm should find and remove the 60.
There are no time or space complexity requirements.
Both the values in the list and the ideal spacing (e.g 25) are unknown. So the algorithm must obtain this by looking at the values. In addition, the number of values, and where the outliers are in the array are not guaranteed. There may be more than one outlier. The algorithm should return a list with the outliers removed. Extra points if the algorithm uses a threshold for the spacing.
Edit: Here is an example image
Here there is one outlier on the x axis. (green-line) There are two on the y axis. The x-coordinates of the array represent the rho of the line on that axis.
arr = [0,25,50,60,75,100]
First construct the distances array
dist = np.array([arr[i+1] - arr[i] for (i, _) in enumerate(arr) if i < len(arr)-1])
print(dist)
>> [25 25 10 15 25]
Now I'm using np.where and np.percentile to cut the array in 3 part: the main , the upper values and the lower values. I arbitrary set them to 5%.
cond_sup = np.where(dist > np.percentile(dist, 95))
print(cond_sup)
>> (array([]),)
cond_inf = np.where(dist < np.percentile(dist, 5))
print(cond_inf)
>> (array([2]),)
You now got indexes where the value is different from the others.
So, dist[2] has a problem, which mean by construction the problem is between arr[2] and arr[2+1]
I don't know if you want to remove 1 or more numbers from this array. So I think the way to solve this problem will be like this:
array A[] = [0,25,50,60,75,100];
sort array (if needed).
create a new array B[] with value i-th: B[i] = A[i+1] - A[i]
find the value of B[] elements that appear most time. It's will be our distance.
find i such that A[i+1]-A[i] != distance
find k (k>i and k min) such that A[i+k]-A[i] == distance
so, we need remove A[i+1] => A[i+k-1]
I hope it is right.
In the following program I have a numpy array t.I divide it by one to get J values. Now after taking the inverse I want to get back same array t. I tried to do this in the following code but when I print t and T_list they give me different values. I want exactly the same values of t as in the beginning.
`t= np.linspace(1,4,10)
print(t)
J_values =[]
for i in range(len(t)):
j= 1.0/t[i]
J_values.append(j)
print(J_values)
T_list =[]
for i in range(len(J_values)):
T=1.0/J_values[i]
T_list.append(T)
print(T_list)`
One of the great things about NumPy arrays is that you rarely have to use loops to manipulate them. Inverting the values of t can be done like this:
t = np.linspace(1, 4, 10)
J_values = 1.0 / t
Getting the original values back is then:
T_list = 1.0 / J_values
Here is background information to the problem I am encountering:
1) output is a cell array, each cell contains a matrix of size = 1024 x 1024, type = double
2) labelbout is a cell array which is the identical to output, except that each matrix has been binarized.
3) I am using the function regionprops to extract the mean intensity and centroid values for ROIs (there are multiple ROIs in each image) for each cell of output
4) props is a 5 x 1 struct with 2 fields (centroid and mean intensity)
The problem: I would like to take the mean intensity values for each ROI in every matrix and export to excel. Here is what I have so far:
for i = 1:size(output,2)
props = regionprops(labelboutput{1,i},output{1,i},'MeanIntensity','Centroid');
end
for i = 1:size(output,2)
meanValues = getfield(props(1:length(props),'MeanIntensity'));
end
writetable(struct2table(props), 'advanced_test.xlsx');
There seem to be a few issues:
1) my getfield command is not working and gets the error: "Index exceeds matrix dimensions"
2) when the information is being stored into props, it overwrites the values for each matrix. How do I make props a 5 x n (where n = number of cells in output)?
Please help!!
1) my getfield command is not working and gets the error: "Index exceeds matrix dimensions"
An easier way to get numeric values out of the same field in an array of structs, as an array is: [structArray.fieldName]. In your case this will be:
meanValues = [props.MeanIntensity];
2) when the information is being stored into props, it overwrites the values for each matrix. How do I make props a 5 x n (where n = number of cells in output)?
One option would be to preallocate an empty cell of the necessary dimensions and then fill it in with your regionprops output. Like this:
props = cell(size(output,1),1);
for k = 1:size(output,2)
props{k} = regionprops(labelboutput{1,k},output{1,k},'MeanIntensity','Centroid');
end
for k = 1:size(output,2)
meanValues = [props{k}.MeanIntensity];
end
...
Another option would be to combine your loops so that you can use your matrix data before it is overwritten. Like this:
for i = 1:size(output,2)
props = regionprops(labelboutput{1,i},output{1,i},'MeanIntensity','Centroid');
meanValues = [props.MeanIntensity];
% update this call to place props in non-overlapping parts of your file (e.g. append)
% writetable(struct2table(props), 'advanced_test.xlsx');
end
The bad thing about this second one is it has a file I/O step right inside your loop which can really slow things down; not to mention you will need to curtail your writetable call so it places the resulting table in non-overlapping regions of 'advanced_test.xlsx'.
I'm aware of:
id,value = max(enumerate(trans_p), key=operator.itemgetter(1))
I'm trying to find something equivalent for matrices, where I'm looking for the value and row index of the max for each column of the matrix
so the function could take in any matrix, such as:
np.array([[0,0,1],[2,0,0],[5,0,0]])
and return two vectors: a vector of row numbers where the max is found, and the max values themselves - for each column. I'm trying to avoid a for-loop! Ideally the function returns two values, like that:
rowIdVect, maxVect = ...........
where the values for the example matrix above would be:
[2,0,0] #rowIdVect
[5,0,1] #maxVect
I can do this in two steps:
idVect = np.argmax( myMat , axis=0)
maxVect = np.max( trans_probs_mat, axis=0)
But is there a syntax that would perform both at the same time? Note: I'm trying to improve run times.
You can use the index to find the corresponding values:
In [201]: arr=np.array([[0,0,1],[2,0,0],[5,0,0]])
In [202]: idx=np.argmax(arr, axis=0)
In [203]: np.max(arr, axis=0)
Out[203]: array([5, 0, 1])
In [204]: arr[idx,np.arange(3)]
Out[204]: array([5, 0, 1])
Is this worth it? I doubt if the use of argmax and/or max is a bottleneck in your calculations. But feel free to time test with realistic data.