Convert numpy.ndarray to DataFrame maintaining indices after upsampling - multisampling

X_train_1, X_test_1, y_train_1, y_test = train_test_split(x, y,
test_size = .3)
X_train_sam, y_train_sam = ADASYN(random_state=42).fit_sample(X_train_1, y_train_1)
type(X_train_1)
pandas.core.frame.DataFrame
X_train_1.shape
(1668, 353)
type(X_train_sam)
numpy.ndarray
X_train_sam.shape
(2698, 353)
How can I convert X_train_sam back to a DataFrame, so that it has the same columns as X_train_1, keeps the original indices, and gets new indices for the added rows?

Something like this:
result = pd.DataFrame(X_train_sam)
result.columns = X_train_1.columns
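One possible way to do this is sketched below. It assumes the sampler returns the original rows first, in their original order, followed by the synthetic rows; that ordering is an assumption and should be checked for the imbalanced-learn version in use. The small X_train_1 frame and the synthetic array are made-up stand-ins for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real training frame (note the shuffled index).
X_train_1 = pd.DataFrame(
    {"f1": [1.0, 2.0, 3.0, 4.0], "f2": [0.1, 0.2, 0.3, 0.4]},
    index=[10, 3, 7, 42],
)

# Stand-in for the sampler output: original rows first, then synthetic rows
# (assumed ordering, not taken from the original post).
synthetic = np.array([[1.5, 0.15], [3.5, 0.35]])
X_train_sam = np.vstack([X_train_1.to_numpy(), synthetic])

# Keep the original index labels and give the new rows fresh labels.
n_new = len(X_train_sam) - len(X_train_1)
start = int(X_train_1.index.max()) + 1
new_index = X_train_1.index.append(pd.RangeIndex(start, start + n_new))

result = pd.DataFrame(X_train_sam, columns=X_train_1.columns, index=new_index)
print(result)
```

The original rows can then still be looked up by their old labels (for example result.loc[7]), while the synthetic rows are the ones whose labels are not in X_train_1.index.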


dtype='numeric' is not compatible with arrays of bytes/strings.Convert your data to numeric values explicitly instead

I have this data that I want to use for a logistic regression problem. Shape of the data:
((108, 2),##train input
(108,),##train output
(35, 2), ##val input
(35,),##val output
(28, 2),##test input
(28,),##test output
(171, 3), ## all data
I did this:
X = X_train.reshape(-2,2)
y = y_train.reshape(-1,1)
model_lr = LogisticRegression()
res = model_lr.fit(X,y)
X_test = np.array(X_test,dtype = float)
test = X_test.reshape(-2,2)
test = np.array(test,dtype = float)
pred = model_lr.predict(test)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
output_test = y_test.reshape(-1,1)
output_test = np.array(output_test,dtype = float)
logit_roc_auc = roc_auc_score(output_test, model_lr.predict(test))
and I have this error message:
logit_roc_auc = roc_auc_score(output_test, model_lr.predict(test))
ValueError: dtype='numeric' is not compatible with arrays of bytes/strings.Convert your data to numeric values explicitly instead.
Can anybody help? Thanks.
I tried reshaping the output variable, but I didn't succeed.
roc_auc_score should be able to handle an array of strings as y_true. But computing an ROC curve generally requires y_pred to be an array of floats (scores or probabilities).
Print your output_test and model_lr.predict(test) and make sure they look like the following—you'll probably see you need to switch to model_lr.predict_proba(test):
from sklearn.metrics import roc_auc_score
y_true = ["A", "A", "A", "B", "B", "B"]
y_pred = [0.2, 0.3, 0.6, 0.4, 0.7, 0.8]
print(roc_auc_score(y_true, y_pred))
# 0.8888
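As a rough illustration of both points (casting string data to numbers explicitly, and scoring with predict_proba rather than predict), here is a minimal sketch. The random data and the string round-trip are made up to imitate data that was loaded as text; they are not the asker's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_num = rng.normal(size=(200, 2))
y_num = (X_num[:, 0] + X_num[:, 1] > 0).astype(int)

# Imitate arrays that were read in as strings (made-up scenario).
X_str = X_num.astype(str)
y_str = y_num.astype(str)

# Convert explicitly to numeric dtypes before fitting and scoring.
X = X_str.astype(float)
y = y_str.astype(int)

model_lr = LogisticRegression().fit(X[:150], y[:150])

# Probability of the positive class, not hard 0/1 predictions.
scores = model_lr.predict_proba(X[150:])[:, 1]
auc = roc_auc_score(y[150:], scores)
print(round(auc, 3))
```

Passing y as a 1d array (no reshape(-1, 1)) also avoids the column-vector warning that LogisticRegression.fit emits.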

ValueError: y should be a 1d array, got an array of shape (74216, 2) instead

I am trying to apply a logistic regression model to text.
I vectorized my data with TF-IDF:
vectorizer = TfidfVectorizer(max_features=1500)
x = vectorizer.fit_transform(df['text_column'])
vectorizer_df = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names())
df.drop('text_column', axis=1, inplace=True)
result = pd.concat([df, vectorizer_df], axis=1)
I split my data:
x = result.drop('target', 1)
y = result['target']
and finally:
x_raw_train, x_raw_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
I build a classifier:
classifier = Pipeline([('clf', LogisticRegression(solver="liblinear"))])
classifier.fit(x_raw_train, y_train)
And I get this error:
ValueError: y should be a 1d array, got an array of shape (74216, 2) instead.
This is a strange thing because when I assign max_features=1000 it is working well, but when max_features=1500 I got an error.
Can someone help me, please?
Basically, the text_column column in df contains at least one occurrence of the word target. This word becomes a column name when you convert the TF-IDF feature matrix to a dataframe with the parameter columns=vectorizer.get_feature_names(). Then, when you concatenate df with vectorizer_df, both target columns end up in the final dataframe.
Therefore, result['target'] will return two columns instead of one as there are effectively two target columns in the result dataframe. This will naturally lead to a ValueError, because, as specified in the error description, you need a 1d target array to fit your estimator, whereas your target array has two columns.
The reason you only encounter this with the higher max_features value is simply that the word target doesn't make the cut at the lower threshold, which allows the process to run as it should.
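The effect of the duplicated column name can be reproduced with a tiny example. The frames below are made up; the add_prefix call is just one possible workaround (an assumption, not something from the original post) for keeping the two names apart if the features must stay in a dataframe.

```python
import pandas as pd

# Made-up label frame and TF-IDF frame that happens to contain the word "target".
df = pd.DataFrame({"target": [0, 1, 0]})
vectorizer_df = pd.DataFrame({"target": [0.1, 0.0, 0.3], "word": [0.0, 0.5, 0.2]})

# Concatenating keeps both columns named "target".
result = pd.concat([df, vectorizer_df], axis=1)
y_bad = result["target"]
print(y_bad.shape)  # (3, 2): a DataFrame with two columns, not a 1d Series

# Possible workaround: prefix the feature names so they cannot clash.
safe = pd.concat([df, vectorizer_df.add_prefix("tfidf_")], axis=1)
y_ok = safe["target"]
print(y_ok.shape)  # (3,)
```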
Unless you have a reason to vectorize separately, the best solution for this is to combine all your steps in a pipeline. It's as simple as:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1500)),
    ('clf', LogisticRegression(solver="liblinear")),
])
pipeline.fit(x_train.text_column, y_train.target)
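A self-contained version of that idea might look like the sketch below. The toy texts and labels are invented, and here the split is done on the raw text column so that y_train is already a 1d Series; the word "target" appears in the texts on purpose to show that it no longer causes a clash.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Invented toy data; column names follow the question.
df = pd.DataFrame({
    "text_column": [
        "great product on target", "really great service",
        "great value and quality", "love it great",
        "great target audience fit", "terrible product",
        "really terrible service", "terrible value",
        "hate it terrible", "terrible target miss",
    ],
    "target": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# Split the raw text, not a pre-vectorized dataframe.
x_train, x_test, y_train, y_test = train_test_split(
    df["text_column"], df["target"],
    test_size=0.3, random_state=0, stratify=df["target"],
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=1500)),
    ("clf", LogisticRegression(solver="liblinear")),
])
pipeline.fit(x_train, y_train)
pred = pipeline.predict(x_test)
print(pred)
```

Because the vectorizer lives inside the pipeline, it is fitted on the training texts only, and the feature names never share a dataframe with the label column.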

DataFrame has no Reshape Attribute

I'm trying to draw a scatter plot with the code below, but it fails to produce a graph. First, I got an error message saying that the X and Y sizes are not equal. Then, when I tried to reshape the data (13 rows and 4 columns), I got another error saying there is no attribute reshape. I need your help.
df.reshape((df.shape[0], df.shape[1], 1))
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_test, y_pred, color = 'green')
plt.title(' Test_Result vs Salary')
plt.xlabel('test_score')
plt.ylabel('salary')
plt.show()
Assuming that you want to plot X_train against y_train in a 2D plot, make sure that both have the same number of elements. The first error means that X_train.shape and y_train.shape are inconsistent, i.e. len(X_train) != len(y_train) (equivalently, X_train.shape[0] != y_train.shape[0]).
Besides, if your code is exactly what you posted, I don't see any connection between df.reshape((df.shape[0], df.shape[1], 1)) and plt.scatter(X_train, y_train, color = 'red'), since the reshaped result is never used. A pandas DataFrame also has no reshape method; convert it to a NumPy array first (for example with df.to_numpy() or df.values) and reshape that.
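A minimal sketch of both points follows. The 13 x 4 frame is filled with random numbers, and every column name except test_score and salary (which come from the axis labels in the question) is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up 13 x 4 frame; "experience" and "interview_score" are hypothetical names.
df = pd.DataFrame(
    rng.uniform(0, 10, size=(13, 4)),
    columns=["test_score", "experience", "interview_score", "salary"],
)

# DataFrame has no reshape; go through a NumPy array instead.
arr = df.to_numpy().reshape(df.shape[0], df.shape[1], 1)
print(arr.shape)  # (13, 4, 1)

# For a scatter plot, x and y only need the same number of rows.
X = df[["test_score"]].to_numpy()  # shape (13, 1)
y = df["salary"].to_numpy()        # shape (13,)
assert X.shape[0] == y.shape[0]
```

With matching lengths, plt.scatter(X[:, 0], y) can then be called without the size error.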

How to select specific columns in a table by using np.r_ in dataset.loc and deal with string data

I would like to solve a classification problem whose data rows look something like this:
In order to divide to test train data:
x_train, x_test, y_train, y_test = train_test_split(X, y,test_size = 0.25, random_state = 0)
Method 1:
X = dataset.loc[np.r_[0:5, 7:26]].values
y = dataset.loc[np.r_[6]].values
Method 2:
X = dataset.loc[:, ['x1', 'x2','x3','x4','x5','x6','x7','x8','x9','x10','x11','x12','x13','x14','x15','x16','x17','x18','x19','x20','x21','x22','x23','x24','x25','x26']].values
y = dataset.loc[:, ['y']].values
The first method encounters this problem:
ValueError: Found input variables with inconsistent numbers of samples: [24, 1]
while the second one is OK. I do not like to write all of the columns but I do not know how to solve the problem of first method.
Also, since the data is string I encounter this error:
ValueError: could not convert string to float: 'id8053'
I tried to solve with:
X = X.apply(lambda x: pd.factorize(x)[1])
y = y.apply(lambda x: pd.factorize(x)[0])
but I encounter this error:
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
What is wrong?
np.r_ should work fine in your case. Method 1 is missing the row selector: you are selecting columns by integer position, so you need to use .iloc with np.r_ for the columns and specify : for the rows.
Try this (note that the right end of each slice in np.r_ is increased by 1, because np.r_ expands a slice the way range does and excludes the right end):
Method 1:
X = dataset.iloc[:, np.r_[0:6, 7:27]].values
y = dataset.iloc[:, np.r_[6]].values
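The sketch below puts the column selection together with one way of handling the string values. It assumes a 27-column layout with the label in position 6, as implied by Method 1; the data itself is invented. Note that pd.factorize returns (codes, uniques), so element [0] is the integer encoding, and that apply must be called on the DataFrame before converting to a NumPy array with .values or .to_numpy().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumed layout: x1..x6, then y in position 6, then x7..x26 (27 columns in total).
cols = [f"x{i}" for i in range(1, 7)] + ["y"] + [f"x{i}" for i in range(7, 27)]
data = [[f"id{rng.integers(0, 3)}" for _ in cols] for _ in range(8)]
dataset = pd.DataFrame(data, columns=cols)

# np.r_ expands the slices to explicit positions, right end excluded.
X_df = dataset.iloc[:, np.r_[0:6, 7:27]]
y_ser = dataset.iloc[:, 6]

# Encode strings as integer codes while the data is still in pandas.
X = X_df.apply(lambda col: pd.factorize(col)[0]).to_numpy()
y = pd.factorize(y_ser)[0]

print(X.shape, y.shape)  # (8, 26) (8,)
```

Integer codes imply an ordering between categories, so for many models a one-hot encoding may be the better choice; the factorize step is shown only because the question uses it.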

How to convert 2 DataFrame columns to ascii?

I have a DataFrame with 2 columns of strings, imported from a tsv file. Both columns need to be converted to ascii. (This is because I want to pass the text through a CountVectorizer and TfidfTransformer pipeline in scikit-learn).
I have gone through dozens of posts both on stackoverflow as well as outside, but cannot figure this one out. My code is below, including some of the things I have tried.
Any suggestions to make this work?
# tried including adding encoding="utf-8", does not work
df = pd.read_csv(questions, usecols = [3, 4, 5], nrows = 10, header=0, sep="\t")
y = df["is_duplicate"].values
X = df.drop("is_duplicate", axis=1).values
for col in X:
    X = X.encode('utf-8') # does not work
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
                                                    random_state = 21, stratify = y)
def flat_list(my_list):
    return [str(item) for sublist in my_list for item in sublist]
def transform_data(trans_obj_list,dataset_splits):
    X_train = dataset_splits[0].astype(str)
    X_train = flat_list(X_train)
    for trfs in trans_obj_list:
        transformed_vector = trfs().fit(X_train)
    for x in range(0,len(dataset_splits)):
        dataset_splits[x] =flat_list(dataset_splits[x].astype(str))
    return dataset_splits
new_X_train, new_X_test = transform_data([CountVectorizer,TfidfTransformer],
                                         [X_train, X_test])
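You need to call X.str.encode(..) instead of X.encode(..). Note that .str is only available on a pandas Series, so keep X as a DataFrame (drop the .values) and apply it column by column, like this:
X = df.drop("is_duplicate", axis=1)
for col in X:
    X[col] = X[col].str.encode('utf-8')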
I found an answer to my question in this question: How do I use encode (Python 3) to fix non-ascii code for CSV import in Pandas?
file_obj = open(file_name, encoding="utf-8")
master = pd.read_csv(file_obj)
I just used "ascii" instead of "utf-8" for my case.
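If the goal is simply to end up with ASCII-only strings in the two text columns, one alternative is to drop the non-ASCII characters after loading. This is a sketch under assumptions: the column names question1 and question2 and the sample rows are hypothetical, and silently discarding characters may or may not be acceptable for the data at hand.

```python
import io
import pandas as pd

# Made-up TSV content standing in for the real file.
tsv = (
    "question1\tquestion2\tis_duplicate\n"
    "What is café?\tWhat is a cafe?\t1\n"
    "naïve Bayes\tnaive Bayes\t0\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t")

# Hypothetical names for the two text columns.
text_cols = ["question1", "question2"]
for col in text_cols:
    # Encode to ASCII bytes, dropping anything that does not fit,
    # then decode back so the column holds str again.
    df[col] = df[col].str.encode("ascii", errors="ignore").str.decode("ascii")

print(df)
```

CountVectorizer also accepts strip_accents='ascii', which maps accented characters to their ASCII base form during vectorization instead of dropping them, so the manual conversion may not be needed at all.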
