rpy2 glmnet getting coefficients with row names - rpy2

I am using glmnet in Python through rpy2, but I am not sure how to return the row names of the coefficient matrix in Python. The following only returns the matrix, not the variable names.
model = glm.cv_glmnet(x=XW_1, y=Y_1, **{'penalty.factor': penalty_factor})
coefs = np.array(base.as_matrix(glm.coef_glmnet(model, s="lambda.min")))

Did you try with .rownames ?
base.as_matrix(glm.coef_glmnet(model, s="lambda.min")).rownames
(see https://rpy2.github.io/doc/v2.9.x/html/vector.html#rpy2.robjects.vectors.Matrix.rownames)
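Once the row names are available, they can be paired with the coefficient values on the Python side. The sketch below uses plain Python lists as stand-ins for the two objects; the names row_names and coef_values, and the numbers in them, are hypothetical and are not output from a real model.

```python
# Hypothetical stand-ins: in practice row_names would come from the
# .rownames attribute of the R matrix and coef_values from the flattened
# coefficient array (for example coefs.ravel().tolist()).
row_names = ["(Intercept)", "age", "income", "height"]
coef_values = [1.25, 0.0, -0.4, 0.0]

# Pair each variable name with its coefficient.
named_coefs = dict(zip(row_names, coef_values))

# Keep only the variables whose coefficient is non-zero.
selected = {name: value for name, value in named_coefs.items() if value != 0.0}

print(selected)
```

A dict keeps the name and value together, which makes it easy to see which variables remain in the model at the chosen lambda.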

Related

How to preprocess a dataset with many types of missing data

I'm trying to do the beginner machine learning project Big Mart Sales.
The data set of this project contains many types of missing values (NaN), and values that need to be changed (lf -> Low Fat, reg -> Regular, etc.)
My current approach to preprocessing this data is to create an imputer for every type of value that needs to be fixed:
import numpy as np
from sklearn.impute import SimpleImputer as Imputer
# make the values consistent
lf_imputer = Imputer(missing_values='LF', strategy='constant', fill_value='Low Fat')
lowfat_imputer = Imputer(missing_values='low fat', strategy='constant', fill_value='Low Fat')
X[:,1:2] = lf_imputer.fit_transform(X[:,1:2])
X[:,1:2] = lowfat_imputer.fit_transform(X[:,1:2])
# nan for a categorical variable
nan_imputer = Imputer(missing_values=np.nan, strategy='most_frequent')
X[:, 7:8] = nan_imputer.fit_transform(X[:, 7:8])
# nan for a numerical variable
nan_num_imputer = Imputer(missing_values=np.nan, strategy='mean')
X[:, 0:1] = nan_num_imputer.fit_transform(X[:, 0:1])
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
In addition, it is frustrating that imputer.fit_transform() requires a 2D array as an input whereas I only want to fix the values in a single column (1D). Thus, I always have to use the column that I want to fix plus a column next to it as inputs. Is there any other way to get around this? Thanks.
Here are some rows of my data:
There is a Python package, ctrl4ai, that can do this for you in a simple way:
pip install ctrl4ai
from ctrl4ai import preprocessing
preprocessing.impute_nulls(dataset)
Usage: [arg1]:[pandas dataframe],[method(default=central_tendency)]:[Choose either central_tendency or KNN]
Description: Auto identifies the type of distribution in the column and imputes null values
Note: KNN consumes more system memory if the dataset is large
Returns: Dataframe [with separate column for each categorical values]
If you have a numerical column, you can use some approaches to fill the missing data:
A constant value that has meaning within the domain, such as 0, distinct from all other values.
A value from another randomly selected record.
A mean, median or mode value for the column.
A value estimated by another predictive model.
Let's see how this works with the mean of one column, for example.
One method is to use fillna from pandas:
X['Name'] = X['Name'].fillna(X['Name'].mean())
For categorical data please have a look here: Impute categorical missing values in scikit-learn
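As a rough sketch of a less cumbersome route, label clean-up and imputation can both be done directly in pandas, one column at a time and without 2D slicing. The column names Item_Fat_Content, Outlet_Size and Item_Weight and the sample values are assumptions made up for illustration; adapt them to the actual data set.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names and values are illustrative only.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 8.2],
    "Item_Fat_Content": ["LF", "low fat", "reg", "Regular"],
    "Outlet_Size": ["Medium", np.nan, "Medium", "Small"],
})

# Make inconsistent labels uniform with a single mapping.
label_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(label_map)

# Numerical column: fill missing values with the column mean.
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

# Categorical column: fill missing values with the most frequent value.
df["Outlet_Size"] = df["Outlet_Size"].fillna(df["Outlet_Size"].mode()[0])

print(df)
```

One mapping dict replaces the separate per-label imputers, and fillna works on a single column (a 1D Series), so no neighbouring column is needed.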

Iterating through dataframe columns and using .apply() gives KeyError

I'm trying to normalize my features by using .apply() iteratively on all columns of the dataframe, but it gives a KeyError. Can someone help?
I've tried the code below, but it doesn't work:
for x in df.columns:
    df[x+'_norm'] = df[x].apply(lambda x:(x-df[x].mean())/df[x].std())
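I don't think it's a good idea to call the mean and std functions inside the apply: they are recalculated every time a row gets its new value. The KeyError itself comes from the lambda's parameter x shadowing the loop variable x, so df[x] inside the lambda looks up a cell value as if it were a column name. Instead, calculate the mean and std at the beginning of each loop iteration and use them in the apply function, with a different name for the lambda's parameter, as below: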
for x in df.columns:
    mean = df[x].mean()
    std = df[x].std()
    df[x+'_norm'] = df[x].apply(lambda y:(y-mean)/std)
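The same normalization can also be written without a loop or .apply(), since pandas applies arithmetic column by column. This is a minimal sketch on a small made-up frame; the column names a and b and their values are hypothetical.

```python
import pandas as pd

# Hypothetical numeric data for illustration.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})

# Remember the original column names.
cols = list(df.columns)

# Column-wise z-score: the per-column mean and std are aligned by column name.
normed = (df[cols] - df[cols].mean()) / df[cols].std()

# Attach the results under the same "<name>_norm" naming used above.
df = df.join(normed.add_suffix("_norm"))

print(df)
```

Each mean and standard deviation is computed once per column, and there is no lambda whose parameter could shadow the column name.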

Create named list from matrix using rpy2

I have a 2D numpy array which I converted to an R matrix, and now I need to convert it further to a named list:
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
nr,nc = counts.shape
r_mtx = robjects.r.matrix(counts, nrow=nr, ncol=nc)
So, I got the matrix r_mtx, but I am not sure how to make a named list out of it similar to how we do it in R:
named_list <- list(counts=mtx)
I need it to feed into SingleCellExperiment object to do dataset normalization:
https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html
I tried using rpy2.rlike.container, both TaggedList and OrdDict, but can't figure out how to apply them to my case.
Ultimately I solved it by skipping the conversion of the numpy array to an R matrix and making the named list straight from the numpy array:
named_list = robjects.r.list(counts=counts)
Where counts is a 2D numpy array

How to use get_operation_by_name() in tensorflow, from a graph built from a different function?

I'd like to build a tensorflow graph in a separate function get_graph(), and to print out a simple op a in the main function. It turns out that I can print out the value of a if I return a from get_graph(). However, if I use get_operation_by_name() to retrieve a, it prints out None. What did I do wrong here? Any suggestions to fix it? Thank you!
import tensorflow as tf

def get_graph():
    graph = tf.Graph()
    with graph.as_default():
        a = tf.constant(5.0, name='a')
    return graph, a

if __name__ == '__main__':
    graph, a = get_graph()
    with tf.Session(graph=graph) as sess:
        print(sess.run(a))
        a = sess.graph.get_operation_by_name('a')
        print(sess.run(a))
it prints out
5.0
None
p.s. I'm using python 3.4 and tensorflow 1.2.
Naming conventions in tensorflow are subtle and a bit off-putting at first.
The thing is, when you write
a = tf.constant(5.0, name='a')
a is not the constant op, but its output. Names of op outputs derive from the op name by appending a colon and the index of the output. Here, constant has only one output, so its name is
print(a.name)
# `a:0`
When you run sess.graph.get_operation_by_name('a') you do get the constant op, and passing an op (rather than a tensor) to sess.run returns None. What you actually want is 'a:0', the tensor that is the output of this operation, whose evaluation returns the value.
a = sess.graph.get_tensor_by_name('a:0')
print(sess.run(a))
# 5.0

Selecting a specific row from an rpy2 DataFrame

My data frame is survey data that I have got from a .csv file. One of the columns is age and I am looking to remove all respondents under 18 years of age. I'll then need to isolate age groups (18-24, 25-35, etc) into their own dataframes that I can do frequency distributions for.
The R code is simple enough:
x.sub <- subset(x.df, y > 2)
But I can't figure out how to use the r() function to get my dataframe variable from python into an R statement. It feels as though there ought to be a .subset() function in the rpy2 DataFrame class. But if it exists, I can't find it.
Using rpy2 2.2.0-dev (should be the same with 2.1.x)
from rpy2.robjects.vectors import DataFrame
dataf = DataFrame.from_csvfile("my/file.csv")
dataf_subset = dataf.rx(dataf.rx2("age").ro >= 18, True)
That exact example is not in the documentation (and maybe it should be), but its constituent elements are: extracting elements, and R operators on vectors.

Resources