Stratified Sampling in python scikit-learn - python-3.x

I want to divide my dataset into train and test sets using stratified sampling (scikit-learn). My approach is as follows:
1) I'm reading a CSV file and loading it with pandas read_csv, so ultimately I'm storing the loaded CSV in a dataframe named "dataset":
dataset = pd.read_csv('CSV_NAME)
2) Now I'm applying stratified sampling as:
train,test = train_test_split(dataset,test_size=0.20,stratify=True)
But it throws the following error:
TypeError: Singleton array array(True, dtype=bool) cannot be considered a valid collection.
So please suggest the correct way of doing it.

'train_test_split' needs to know what the target variable is: the stratify argument expects the array of class labels, not True. Therefore, you should change your call to something like:
X_train, X_test, y_train, y_test = train_test_split(dataset[needed_columns], dataset.target, test_size=0.20, stratify=dataset.target)
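For reference, a minimal, self-contained sketch of that call; the column names and data are made up for illustration:
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": range(10, 20),
    "target": [0, 1] * 5,  # class labels used for stratification
})

X_train, X_test, y_train, y_test = train_test_split(
    dataset[["feature_a", "feature_b"]],
    dataset["target"],
    test_size=0.20,
    stratify=dataset["target"],  # stratify on the labels, not on True
)
print(y_train.value_counts(), y_test.value_counts())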

Btw, there is a missing single quote in your first line of code.
You could convert the pandas dataframe to a numpy array as follows:
import numpy as np
import pandas as pd

dataset = pd.read_csv('CSV_NAME')
dataset = np.array(dataset)
as suggested in the second answer here: https://www.quora.com/How-does-python-pandas-go-along-with-scikit-learn-library-Has-anyone-doing-data-analysis-using-pandas-and-then-then-fit-models-using-scikit-learn
Or you could read the dataset into a numpy array directly.
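A hedged sketch of that direct-to-numpy option, assuming a comma-separated file with a header row (the file name is a placeholder):
import numpy as np

dataset = np.genfromtxt('CSV_NAME.csv', delimiter=',', skip_header=1)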

Related

Converting an Object (containing strings and integers) Pandas dataframe to a scipy sparse matrix

I have a dataframe with two columns. One column, medicine name, is of dtype object; it contains medicine names, some of which are followed by their mg (e.g. Avil25 in one row and Avil50 in another). The other column, Price, is of dtype int. I'm trying to convert the medicine name column into a scipy csr_matrix using the following lines of code:
from scipy.sparse import csr_matrix
sparse_matrix = csr_matrix(medName)
I am getting the following error message:
TypeError: no supported conversion for types: (dtype('O'),)
As an alternative, I tried removing the integers with medName.str.replace('\d+', '') and then tried sparse_matrix = csr_matrix(medName.astype(str)). I am still getting the same error.
What's going on wrong here?
What is another way to convert this dataframe to csr matrix?
You will have to encode the strings as numeric data types before they can be made sparse. One solution (probably not the most memory efficient) is to build a networkx graph whose nodes are the string words; using the graph's node list you can keep track of the word-to-numeric mapping.
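A minimal sketch of that idea with made-up data; the variable and column values are illustrative, not from the question:
import networkx as nx
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({"medName": ["Avil25", "Avil50", "Crocin"], "Price": [10, 15, 20]})

# nodes are the unique medicine names
G = nx.Graph()
G.add_nodes_from(df["medName"].unique())

# the node list gives a stable string -> integer mapping
name_to_id = {name: i for i, name in enumerate(G.nodes())}

# encode the column numerically, then convert to a sparse matrix
codes = df["medName"].map(name_to_id).to_numpy().reshape(-1, 1)
sparse_matrix = csr_matrix(codes)
print(sparse_matrix.toarray())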

How to convert an object of type "pandas.core.groupby.generic.SeriesGroupBy" to "pandas.core.series.Series"?

I have a variable of type "pandas.core.groupby.generic.SeriesGroupBy" which I got from grouping various fields of a pandas dataframe. I would like to convert that variable into a pandas Series, but it is not working well and produces a lot of errors.
Here is the code which I have tried:
w = data.groupby(['dt', 'b'])['w']
w = pd.Series(w)
When I try to run this code, it takes a lot of time to execute and also generates a lot of errors.
I am getting a pandas Series as follows:
But, I am expecting something similar to this:
Is there any other way to group the below column of a DataFrame and store it inside a pandas Series:
Pandas groupby objects are iterable. Using a list comprehension, you can extract the partitioned sub-series. Try:
list_of_series = [s for _, s in data.groupby(['dt', 'b'])['w']]
list_of_series is a list and should contain your desired pandas series.
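A small usage example with made-up data (the column names follow the question):
import pandas as pd

data = pd.DataFrame({
    "dt": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "b": ["x", "x", "y"],
    "w": [1.0, 2.0, 3.0],
})

list_of_series = [s for _, s in data.groupby(["dt", "b"])["w"]]
for s in list_of_series:
    print(s)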

Each csv as one training example

I have many csv files that have multiple rows and columns, which are mostly floating point numbers (some are categorical but one-hot encoded).
Each csv file is the representation of one training example. It contains dependent and independent variables in the same file.
(For example, it's not like a machine learning problem where each row contains all the information and predicts y1, y2, y3 of that row; rather, all the rows combined of x1 to x8 will predict all the rows combined of y1 to y3.) Hence each csv becomes one training example.
[Image in the original post: representation of one such csv file]
Please note that the length/size of each csv varies.
I want to build a simple ANN or any other neural net model. I have a problem processing the input data: as each csv is one single training example, in what format should I store the data to pass to a neural net?
Thanks in advance,
skw
Let's say you have some .csv files, all with the same data format, stored in a folder data.
First you can use glob to read the filenames, and use pandas to read each csv and convert it to a numpy array.
import glob
import numpy as np
import pandas as pd

csv = []  # each entry is one csv file read as a numpy array
for f in glob.glob('path/*.csv'):
    csv.append(pd.read_csv(f).to_numpy())

print(csv[0].shape)
# it should print (num_rows_csv, 11)  # as, 11 columns

# now, the first 8 columns are features and the last 3 columns are responses
X = []
y = []
for arr in csv:
    X.append(arr[:, :8])   # all rows, feature columns
    y.append(arr[:, 8:])   # all rows, response columns

# dtype=object because the csvs have different lengths (ragged arrays)
X = np.array(X, dtype=object)
y = np.array(y, dtype=object)
Now, it's easy to train this with CNN, LSTM, any model you want.
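As a hedged sketch (not from the original answer), here is one way such data could be fed to a Keras LSTM that maps each row's 8 features to its 3 targets; the layer sizes and the batch-size-1 loop are assumptions to cope with the varying csv lengths:
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 8)),  # None: number of rows varies per csv
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.Dense(3),  # 3 response columns per row
])
model.compile(optimizer='adam', loss='mse')

# stand-in for the X, y lists built above: two fake csvs of different lengths
X = [np.random.rand(50, 8), np.random.rand(80, 8)]
y = [np.random.rand(50, 3), np.random.rand(80, 3)]

for features, targets in zip(X, y):
    # add a batch dimension; each csv is trained as its own batch of size 1
    model.fit(features[None, ...], targets[None, ...], epochs=1, verbose=0)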

TensorFlow: extract data with a given feature, from NSynth Dataset

I have a dataset of TFRecord files of serialized TensorFlow Example protocol buffers, with one Example proto per note, downloaded from https://magenta.tensorflow.org/datasets/nsynth. I am using the test set, which is approximately 1 GB, in case someone wants to download it to check the code below. Each Example contains many features: pitch, instrument ...
The code that reads in this data is:
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
# Reading input data
dataset = tf.data.TFRecordDataset('../data/nsynth-test.tfrecord')
# Convert features into tensors
features = {
    "pitch": tf.FixedLenFeature([1], dtype=tf.int64),
    "audio": tf.FixedLenFeature([64000], dtype=tf.float32),
    "instrument_family": tf.FixedLenFeature([1], dtype=tf.int64)}
parse_function = lambda example_proto: tf.parse_single_example(example_proto, features)
dataset = dataset.map(parse_function)
# Consuming TFRecord data.
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(batch_size=3)
dataset = dataset.repeat()
iterator = dataset.make_one_shot_iterator()
batch = iterator.get_next()
sess.run(batch)
Now, the pitch ranges from 21 to 108. But I want to consider data of a given pitch only, e.g. pitch = 51. How do I extract this "pitch=51" subset from the whole dataset? Or alternatively, what do I do to make my iterator go through this subset only?
What you have looks pretty good, all you're missing is a filter function.
For example, if you only want to extract pitch=51, you should add the following after your map function:
dataset = dataset.filter(lambda example: tf.equal(example["pitch"][0], 51))
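For context, a sketch of where that line sits in the pipeline from the question (placement only; everything else is unchanged):
dataset = dataset.map(parse_function)
dataset = dataset.filter(lambda example: tf.equal(example["pitch"][0], 51))
dataset = dataset.shuffle(buffer_size=10000)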

Why do I get a wrong index when saving data in libsvm format using saveAsLibSVMFile?

I want to save data in libsvm format with Python, so I chose to use pyspark for this task. But the data I saved was not in libsvm format.
Here is my code.
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
d = c.map(lambda line: LabeledPoint(line[0],[line[1:]])) # c is rdd format
MLUtils.saveAsLibSVMFile(d, "D://spark-warehouse/part1")
When I run print(d.take(3)), it shows the following, which is the LabeledPoint format:
[LabeledPoint(-0.05643994211287995, [0.0142684401451,-0.0072049689441,-0.929159510172,-0.893124442121,-0.996100725507]), LabeledPoint(-0.02315484804630974, [0.0408706166868,-0.00372670807453,-0.891585462256,-0.839681870708,-0.96168588986]), LabeledPoint(0.03039073806078152, [0.0577992744861,-0.00621118012422,-0.898020043313,-0.847917899172,-0.968368717236])]
However, when I checked my saved data, it was not in libsvm format. It shows a wrong feature index: every feature ends up under index 1.
''.join(sorted(input(glob("D://spark-warehouse/part2" + "/part-0000*")))).
-0.05643994211287995 1:[ 0.01426844 -0.00720497 -0.92915951 -0.89312444 -0.99610073]\n-0.02315484804630974 1:[ 0.04087062 -0.00372671 -0.89158546 -0.83968187 -0.96168589]\n0.03039073806078152 1:[ 0.05779927 -0.00621118 -0.89802004 -0.8479179 -0.96836872]\n
It should be in the right format, like the following:
-0.05643994211287995 1:0.01426844 2:-0.00720497 3:-0.92915951 4:-0.89312444 5:-0.99610073\n ...
My python version is 3.5.2 and my pyspark version is 2.0.1. I have been searching the net for a long time, but with no luck. Please help, or try to give some ideas on how to achieve this.
Note: I want to do SVR, so my labels are floats rather than integers.
