How to iteratively extract features using sklearn.feature_extraction.text.CountVectorizer? - python-3.x

I have a sqlite3 database containing a lot of text data. I want to extract the text and convert it into a term-frequency matrix using CountVectorizer or HashingVectorizer. The only way I can think of is to use the fetchall method of sqlite3.Cursor.
The problem is that the dataset is too big to fit in memory. Is there a way to extract the features and build the matrix iteratively?
# extract text data using 'fetchall'
import sqlite3
from sklearn.feature_extraction.text import CountVectorizer

conn = sqlite3.connect('text.db')
c = conn.cursor()
c_exe = c.execute("SELECT * FROM table")
text_tuple = c_exe.fetchall()
text = [item[0] for item in text_tuple]

# convert the text into a tf matrix
vectorizer = CountVectorizer()
Y = vectorizer.fit_transform(text)

# is there a way to do it iteratively, e.g. with some 'modified_vectorizer'?
# for text in c_exe:
#     Y = modified_vectorizer(text)
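One approach that may help here: CountVectorizer.fit_transform accepts any iterable of documents, so a generator that streams rows from the cursor avoids materializing the whole result set with fetchall. A minimal sketch, where my_table and text_column are placeholder names:
import sqlite3
from sklearn.feature_extraction.text import CountVectorizer

def iter_text(db_path):
    # yield one document at a time instead of loading everything with fetchall()
    conn = sqlite3.connect(db_path)
    try:
        for row in conn.execute("SELECT text_column FROM my_table"):
            yield row[0]
    finally:
        conn.close()

vectorizer = CountVectorizer()
Y = vectorizer.fit_transform(iter_text('text.db'))
Note that CountVectorizer still builds the full vocabulary in memory; for strictly bounded memory, HashingVectorizer can transform the stream in fixed-size batches whose sparse outputs are stacked with scipy.sparse.vstack.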

Related

Is there a Python solution for mapping a pandas data frame with the unique values of a split string column?

I have a data frame (df) that contains a string column called supported_cpu. The supported_cpu data is a comma-separated string, and I want to use this data for an ML model.
I had to get the unique values for the supported_cpu column. The output is a list of unique values.
def pars_string(df, col):
    # separate the column from the string using split
    data = df[col].value_counts().reset_index()
    data['index'] = data['index'].str.split(",")
    # create a list including all of the items, which are separated by commas
    df_01 = []
    for i in range(data.shape[0]):
        for j in data['index'][i]:
            df_01.append(j)
    # get unique values from df_01
    list_01 = list(set(df_01))
    # strip the leading/trailing spaces in list_01 so duplicates collapse
    list_02 = [x.strip(' ') for x in list_01]
    # get unique values from list_02
    list_03 = list(set(list_02))
    return list_03

supported_cpu_list = pars_string(df=df, col='supported_cpu')
I want to map this output back onto the data frame to encode it for the ML model. How could I store the output in the data frame? Note: some rows have multiple values (more than one CPU).
Input: a comma-separated string.
Output: I did not know what it should be.
I really recommend that anyone who is starting out with pandas read about vectorization and think in terms of columns (aka Series). This is the way the library was built, and it is the way it is supposed to be used.
From what I understand (I may be wrong), you want to get the unique values from the supported_cpu column. You can use the Series string methods to split that column, then flatten the result using itertools.chain:
from itertools import chain

df['supported_cpu'] = df['supported_cpu'].str.split(pat=',')
unique_vals = set(chain(*df['supported_cpu'].tolist()))
unique_vals = {item for item in unique_vals if item}
Multi-values in some rows should be parsed into single values before ML model training. The list can then be converted to a data frame simply with pd.DataFrame(supported_cpu_list).
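For the encoding step itself, one option worth sketching is pandas' built-in multi-label one-hot encoding via Series.str.get_dummies, assuming supported_cpu still holds the raw comma-separated strings:
# one indicator column per unique CPU; a multi-valued row gets a 1 in several columns
cpu_dummies = df['supported_cpu'].str.get_dummies(sep=',')
df_encoded = df.join(cpu_dummies)
Each unique CPU becomes an indicator column, which is a common way to feed a multi-valued feature like this to an ML model.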

Adding labels to data in csv format for machine learning

I intend to make a model using sklearn to predict cuisines. However, one column in my data (Column B) gives me a ValueError: could not convert string to float: 'indian'.
Please help if you can.
You are probably trying to cast that column to a float somewhere in your code. If Column B is the label, scikit-learn estimators will generally accept the string values directly. If you want to map the string labels to integers yourself, you can do it like this:
label_mapper = dict(zip(set(df['Column B']), range(len(set(df['Column B'])))))
df['Column B'] = df['Column B'].apply(lambda x: label_mapper[x])
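A hedged alternative sketch: scikit-learn's LabelEncoder performs the same string-to-integer mapping and keeps the inverse mapping around for decoding predictions:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Column B'] = le.fit_transform(df['Column B'])
# le.classes_ holds the original strings; le.inverse_transform decodes predictions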

How to preprocess a dataset with many types of missing data

I'm trying to do the beginner machine learning project Big Mart Sales.
The data set for this project contains many kinds of missing values (NaN), as well as values that need to be normalized (lf -> Low Fat, reg -> Regular, etc.).
My current approach to preprocessing this data is to create an imputer for every kind of value that needs to be fixed:
import numpy as np
from sklearn.impute import SimpleImputer as Imputer

# make the values consistent
lf_imputer = Imputer(missing_values='LF', strategy='constant', fill_value='Low Fat')
lowfat_imputer = Imputer(missing_values='low fat', strategy='constant', fill_value='Low Fat')
X[:, 1:2] = lf_imputer.fit_transform(X[:, 1:2])
X[:, 1:2] = lowfat_imputer.fit_transform(X[:, 1:2])
# NaN for a categorical variable
nan_imputer = Imputer(missing_values=np.nan, strategy='most_frequent')
X[:, 7:8] = nan_imputer.fit_transform(X[:, 7:8])
# NaN for a numerical variable
nan_num_imputer = Imputer(missing_values=np.nan, strategy='mean')
X[:, 0:1] = nan_num_imputer.fit_transform(X[:, 0:1])
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
In addition, it is frustrating that imputer.fit_transform() requires a 2D array as input whereas I only want to fix the values in a single column (1D). Thus, I always seem to have to pass the column I want to fix together with a neighboring column as input. Is there any way around this? Thanks.
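A small aside that may help with the 2D requirement: slicing with a one-element range (or reshaping) keeps a single column two-dimensional, so no neighboring column is actually needed. A minimal sketch:
col = X[:, 7]                  # shape (n,)   -> 1-D, rejected by fit_transform
col_2d = X[:, 7:8]             # shape (n, 1) -> 2-D, but still only column 7
same = X[:, 7].reshape(-1, 1)  # equivalent 2-D view of the same column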
Here are some rows of my data:
There is a Python package which can do this for you in a simple way: ctrl4ai.
pip install ctrl4ai
from ctrl4ai import preprocessing
preprocessing.impute_nulls(dataset)
Usage: [arg1]: [pandas dataframe], [method (default=central_tendency)]: [choose either central_tendency or KNN]
Description: auto-identifies the type of distribution in the column and imputes the null values accordingly.
Note: KNN consumes more system memory if the dataset is large.
Returns: dataframe [with a separate column for each categorical value]
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
If you have a numerical column, there are several common approaches to filling in the missing data:
A constant value that has meaning within the domain, such as 0, distinct from all other values.
A value from another randomly selected record.
A mean, median or mode value for the column.
A value estimated by another predictive model.
Let's see how this works using the mean for one column. One method is to use fillna from pandas:
X['Name'] = X['Name'].fillna(X['Name'].mean())
For categorical data please have a look here: Impute categorical missing values in scikit-learn
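As for a neater overall structure, one hedged sketch is to bundle the per-column imputers into a single ColumnTransformer (the column indices below simply mirror the ones in the question):
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

preprocessor = ColumnTransformer(
    transformers=[
        ('num_mean', SimpleImputer(missing_values=np.nan, strategy='mean'), [0]),
        ('cat_mode', SimpleImputer(missing_values=np.nan, strategy='most_frequent'), [7]),
    ],
    remainder='passthrough',  # leave the other columns untouched
)
X_clean = preprocessor.fit_transform(X)
Note that ColumnTransformer reorders the output (transformed columns first, passthrough columns after), so downstream code should not rely on the original column order.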

How to export "simplices" array from Delaunay triangulation?

I am using the Delaunay triangulation module from scipy.spatial.
I am able to generate an array (actually an ndarray, since I am using x, y and z coordinates) from the simplices, but I am unable to export it in any format I can use for further processing.
The code is straightforward:
import numpy as np
from scipy.spatial import Delaunay

tri = Delaunay(points)
a = np.array(points)[tri.simplices]
What I get looks like this:
[[7.02192702e+05, 7.53337067e+06, 1.43116411e+02],
[7.02275075e+05, 7.53339801e+06, 1.53508313e+02],
[7.02073353e+05, 7.53340902e+06, 1.40979450e+02],
[7.02288667e+05, 7.53338498e+06, 1.52185457e+02]],
...,
[[7.02038856e+05, 7.53333613e+06, 1.39584833e+02],
[7.02069568e+05, 7.53327029e+06, 1.46902739e+02],
[7.02062213e+05, 7.53331215e+06, 1.31241316e+02],
[7.02040635e+05, 7.53329922e+06, 1.30787203e+02]],...
By playing around with it I can export it as one long string:
702299.971067+7533414.077516+163.2373+...
But I would prefer to have it in a .csv file with columns, or to convert that long string into a table or array with a set number of columns.
I assume I'm doing something wrong when saving or writing the output, but I can't find any obvious solutions for saving/exporting arrays online.
Any ideas or suggestions?
Once it's in np.ndarray form, just use np.savetxt() to save the array to a .txt file (see: https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html). This is the simplest method I know of.
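One caveat worth adding: for 3-D points, points[tri.simplices] has shape (n_simplices, 4, 3), and np.savetxt only accepts 1-D or 2-D input, so the array has to be flattened first. A minimal sketch (the file name and header are illustrative):
import numpy as np

# collapse (n_simplices, 4, 3) to (n_simplices * 4, 3): one vertex per row
flat = a.reshape(-1, a.shape[-1])
np.savetxt('simplices.csv', flat, delimiter=',', header='x,y,z', comments='')
Each group of four consecutive rows in the CSV then corresponds to one tetrahedron from the triangulation.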

Using lbsvm in Matlab with Excel data

My data are in Excel, so to convert them to libsvm format I save the Excel sheet as CSV and follow the procedure on the libsvm web site. Assuming the CSV file is SPECTF.train:
matlab> SPECTF = csvread('SPECTF.train'); % read a csv file
matlab> labels = SPECTF(:, 1); % labels from the 1st column
matlab> features = SPECTF(:, 2:end);
matlab> features_sparse = sparse(features); % features must be in a sparse matrix
matlab> libsvmwrite('SPECTFlibsvm.train', labels, features_sparse);
Then I read it back using libsvmread.
Is there a shorter way to convert Excel data to libsvm format directly? Thanks.
I don't think there is any need to convert to CSV. You can use xlsread to read the data directly from the Excel file and then use libsvmwrite to write it out in a form compatible with libsvm.
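A minimal sketch of that suggestion in Matlab, assuming the labels sit in the first column of the sheet and the file names are illustrative:
% read the numeric data straight from the Excel file (no CSV step)
data = xlsread('SPECTF.xlsx');
labels = data(:, 1);                      % labels from the 1st column
features_sparse = sparse(data(:, 2:end)); % libsvmwrite needs a sparse matrix
libsvmwrite('SPECTFlibsvm.train', labels, features_sparse);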
