Why do I have to apply CountVectorizer to a smaller sample before building a DataFrame? Why can't I apply it to the full sample and create a DataFrame from the result?
Here is my code:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(data['Pros_clean_lemma'])
print(X.shape)
Output: (128734, 26537)
I have train and test datasets as below:
x_train:
inputs
[2,5,10]
[4,6,12]
...
x_test:
inputs
[7,8,14]
[5,5,7]
...
The inputs column is a vector containing the model's features, produced by applying VectorAssembler to 3 separate columns.
When I try to transform the test data with StandardScaler as below, I get an error saying it has no transform method:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaledTrainDF = scaler.fit(x_train).transform(x_train)
scaledTestDF = scaler.transform(x_test)
I am told that I should fit the standard scaler once, on the training data only, and use those parameters to transform the test set, so it would not be correct to do:
scaledTestDF = scaler.fit(x_test).transform(x_test)
So how do I deal with the error mentioned above?
Here is the correct syntax for using the scaler. StandardScaler is an estimator: fit returns a fitted StandardScalerModel, and you need to call transform on that model, not on the scaler itself.
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaler_model = scaler.fit(x_train)
scaledTrainDF = scaler_model.transform(x_train)
scaledTestDF = scaler_model.transform(x_test)
I have a few thousand rows of textual data. My sample data is:
I used sklearn's CountVectorizer and TfidfTransformer to calculate the top terms with their tf-idf weights. Below is the code I used:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
data = pd.read_csv('Sample_data.csv')
cvec = CountVectorizer(stop_words='english', min_df=5, max_df=0.95, ngram_range=(1,2))
cvec.fit(data['Text'])
list(islice(cvec.vocabulary_.items(), 30))
len(cvec.vocabulary_)
cvec_count = cvec.transform(data['Text'])
print('Sparse Matrix Shape : ', cvec_count.shape)
print('Non Zero Count : ', cvec_count.nnz)
# nnz / total cells is the share of non-zero entries, i.e. the density
print('density: %.2f%%' % (100.0 * cvec_count.nnz / (cvec_count.shape[0] * cvec_count.shape[1])))
occ = np.asarray(cvec_count.sum(axis=0)).ravel().tolist()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_df = pd.DataFrame({'term': cvec.get_feature_names_out(), 'occurrences' : occ})
term_freq = count_df.sort_values(by='occurrences', ascending=False).head(30)
print(term_freq)
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_count)
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weight_df = pd.DataFrame({'term' : cvec.get_feature_names_out(), 'weight' : weights})
tf_idf = weight_df.sort_values(by='weight', ascending=False).head(30)
print(tf_idf)
Now I want to plot (bar or line graph) the top 30 terms with their weights using matplotlib. How can I do this?
Thanks in Advance!
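One possible approach is a horizontal bar chart, which leaves room for long term labels. The sketch below assumes a DataFrame shaped like tf_idf above, with 'term' and 'weight' columns; the five terms and their weights are made up so that the example runs on its own:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the tf_idf DataFrame built in the question.
tf_idf = pd.DataFrame({
    "term": ["customer", "service", "account", "payment", "card"],
    "weight": [0.031, 0.052, 0.018, 0.044, 0.027],
})

# Same selection as in the question: highest weights first, at most 30 rows.
top = tf_idf.sort_values(by="weight", ascending=False).head(30)

fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(top["term"], top["weight"])  # one bar per term
ax.invert_yaxis()                    # put the highest weight at the top
ax.set_xlabel("mean tf-idf weight")
ax.set_title("Top terms by tf-idf weight")
fig.tight_layout()
```

Call plt.show() afterwards in an interactive session, or fig.savefig(...) to write the chart to a file. For a line graph, ax.plot(top["term"], top["weight"]) with rotated tick labels would be the equivalent.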
I created a program that predicts digits from a dataset. When it makes a prediction, I want two cases: if the prediction is right, the sample should be added to the dataset automatically; otherwise the program should take the right answer from the user and insert that into the dataset.
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pt
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv("train.csv").values
clf = DecisionTreeClassifier()
xtrain = data[0:21000,1:]
train_label=data[0:21000,0]
clf.fit(xtrain,train_label)
xtest = data[21000: ,1:]
actual_label=data[21000:,0]
d = xtest[9]
d.shape = (28,28)
pt.imshow(d,cmap='gray')
print(clf.predict([xtest[9]]))
pt.show()
I'm not sure I follow your question, but if you want to distinguish correct from incorrect predictions and handle each differently, you have to do that explicitly:
predictions = clf.predict(xtest)
good_predictions = xtest[predictions == actual_label]
bad_predictions = xtest[predictions != actual_label]
good_predictions then holds all the rows of xtest that were predicted correctly.
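If the goal is to grow the dataset one sample at a time, a rough sketch could look like the following. It is only an illustration under several assumptions: load_digits (8x8 images) stands in for train.csv, label_with_feedback is a hypothetical helper, and the user's feedback is simulated from held-out labels instead of being read with input():

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

# load_digits stands in for train.csv, which is not available here.
X, y = load_digits(return_X_y=True)
X_train, y_train = X[:1000], y[:1000]
X_new, y_new = X[1000:1050], y[1000:1050]

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

def label_with_feedback(clf, sample, is_correct, ask_label):
    """Return the predicted label if it is confirmed, else the corrected one."""
    predicted = clf.predict(sample.reshape(1, -1))[0]
    return predicted if is_correct(predicted) else ask_label()

n_before = len(X_train)
for sample, truth in zip(X_new, y_new):
    # Simulated feedback; interactively these callbacks could wrap input().
    label = label_with_feedback(
        clf,
        sample,
        is_correct=lambda p, t=truth: p == t,
        ask_label=lambda t=truth: t,
    )
    X_train = np.vstack([X_train, sample])  # append the new sample as a row
    y_train = np.append(y_train, label)     # append its confirmed label

clf.fit(X_train, y_train)  # retrain on the enlarged dataset
```

Either way the stored label ends up being the correct one; the two cases only differ in whether the user has to type it in.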
I am new to machine learning and am having trouble converting a scalar to a 2D array.
I am trying to implement polynomial regression in Spyder. Here is my code; please help!
# Polynomial Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
# Predicting a new result with Polynomial Regression
lin_reg_2.predict(poly_reg.fit_transform(6.5))
ValueError: Expected 2D array, got scalar array instead: array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
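You may see this error in one environment (for example Jupyter) and not in another (for example Spyder). The difference is most likely the installed scikit-learn version, not the IDE: newer releases require 2D input to predict and reject a bare scalar.
To resolve it, turn the value into a 2D NumPy array:
lin_reg.predict(np.array(6.5).reshape(1,-1))
lin_reg_2.predict(poly_reg.fit_transform(np.array(6.5).reshape(1,-1)))
In an environment with an older scikit-learn, the original calls may work as written:
lin_reg.predict(6.5)
lin_reg_2.predict(poly_reg.fit_transform(6.5))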
The issue with your code is lin_reg.predict(6.5).
The error message says that the model requires a 2D array, but 6.5 is a scalar.
Why? Your X data is 2D, so anything you want to predict with the model must also be 2D.
You can achieve this with .reshape(-1,1) if the data has a single feature (it creates a column vector), or with .reshape(1,-1) if it contains a single sample.
The thing to remember: to predict, prepare the input in the same way as the original training data.
If you need any more info let me know.
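A minimal sketch of that rule, using made-up data (levels 1 to 10 with salary = level squared) in place of Position_Salaries.csv and degree 2 instead of 4 to keep the numbers easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up training data: 10 samples, 1 feature, y = x**2.
X = np.arange(1, 11).reshape(-1, 1)  # shape (10, 1), 2D like the training X above
y = (X ** 2).ravel()

poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression().fit(X_poly, y)

# A single new sample must also be 2D: 1 row, 1 column.
x_new = np.array(6.5).reshape(1, -1)
pred = lin_reg_2.predict(poly_reg.transform(x_new))
print(x_new.shape, pred)
```

Here poly_reg.transform is enough for the new value because poly_reg was fitted on X and not refitted afterwards.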
You have to give the input as a 2D array, so try this:
lin_reg.predict([[6.5]])
lin_reg_2.predict(poly_reg.fit_transform([[6.5]]))
I have to classify some texts with a support vector machine. My train file has 5 different categories. I have to classify first with "Bag of Words" features, and then with SVD features that keep 90% of the total variance.
I'm using Python and sklearn, but I don't know how to create the SVD features.
My train set is tab-separated (\t); the texts are in the 'Content' column and the categories are in the 'Category' column.
The high level steps for a tf-idf/PCA/SVM workflow are as follows:
Load data (will be different in your case):
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
train_text = newsgroups_train.data
y = newsgroups_train.target
Preprocess features and train classifier:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(train_text)
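pca = PCA(.9)  # keep enough components to explain 90% of the variance
X = pca.fit_transform(X_tfidf.toarray())  # dense ndarray; recent scikit-learn rejects the np.matrix from todense()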
clf = SVC(kernel="linear")
clf.fit(X,y)
Finally, apply the same fitted preprocessing steps to the test dataset (transform, not fit_transform) and make predictions.
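As a rough end-to-end sketch of that last step, assuming a tiny made-up corpus in place of the real train and test files (all texts, labels and variable names below are illustrative):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up corpus: class 0 = sports, class 1 = cooking.
train_text = [
    "the team won the football match",
    "a late goal decided the match",
    "the coach praised the team",
    "bake the bread in a hot oven",
    "stir the sauce and add salt",
    "the recipe needs flour and butter",
]
y_train = [0, 0, 0, 1, 1, 1]
test_text = ["the team scored a goal", "add butter to the sauce"]

# Fit every preprocessing step on the training data only.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_text).toarray()
pca = PCA(n_components=0.9, svd_solver="full")  # keep 90% of the variance
X_train_pca = pca.fit_transform(X_train_tfidf)

clf = SVC(kernel="linear").fit(X_train_pca, y_train)

# Reuse the fitted vectorizer and PCA on the test data: transform only.
X_test_pca = pca.transform(vectorizer.transform(test_text).toarray())
pred = clf.predict(X_test_pca)
print(pred)
```

Calling fit_transform on the test set instead would learn a different vocabulary and different components, so the classifier would receive features it was not trained on.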
PS
If you wish, you may combine the preprocessing steps into a Pipeline:
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
preproc = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('todense', FunctionTransformer(lambda x: x.toarray(), validate=False)),  # dense ndarray for PCA
    ('pca', PCA(.9)),
])
X = preproc.fit_transform(train_text)
and use it later for dealing with test data as well.