Generating class name list based on class index list - python-3.x

I'm playing with iris_dataset from sklearn.datasets
I want to generate a list similar to iris_dataset['target'], but containing the name of each class instead of its index.
The way I did it:
from sklearn.datasets import load_iris
import numpy as np

iris_dataset = load_iris()
y = iris_dataset.target
print("Iris target: \n {}".format(iris_dataset.target))
unique_y = np.unique(y)
class_seq = [''] * y.shape[0]
for i in range(y.shape[0]):
    for (yy, tn) in zip(unique_y, iris_dataset.target_names):
        if y[i] == yy:
            class_seq[i] = tn
print("Class sequence: \n {}".format(class_seq))
but I would like to avoid looping through all of the elements of y. Is there a better way to do this?
The reason I need this list is for a pandas radviz plot, so that it gets a proper legend:
pd.plotting.radviz(iris_DataFrame,'class_seq',color=['blue','red','green'])
And further, I'd like this to work for any other dataset.

You can do it by looping over iris_dataset.target_names.size instead. That is only 3 here, so it should be a lot faster for large y arrays.
class_seq = np.empty(y.shape, dtype=iris_dataset.target_names.dtype)
for i in range(iris_dataset.target_names.size):
    mask = y == i  # boolean mask selecting every sample of class i
    class_seq[mask] = iris_dataset.target_names[i]
If you want to have class_seq as a list: class_seq = list(class_seq)

You can do it with a list comprehension:
class_seq = [iris_dataset.target_names[i] for i in iris_dataset.target]
or by using map:
class_seq = list(map(lambda x : iris_dataset.target_names[x], iris_dataset.target))
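For completeness: since iris_dataset.target_names is a NumPy array, NumPy fancy indexing can map the whole index array to names in one vectorised step, with no Python-level loop. A minimal sketch of that idea:

from sklearn.datasets import load_iris

iris_dataset = load_iris()
# Index the array of names by the integer class labels; each label picks out its name.
class_seq = list(iris_dataset.target_names[iris_dataset.target])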

Related

Can we get column names sorted in the order of their tf-idf values (if they exist) for each document?

I'm using sklearn's TfidfVectorizer. I'm trying to get the column names as a list, sorted in decreasing order of their tf-idf values, for each document. So basically, if a document contains only stop words, we don't need any column names.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

msg = ["My name is Venkatesh",
       "Trying to get the significant words for each vector",
       "I want to get the list of words name in the decreasing order of their tf-idf values for each vector",
       "is to my"]
stopwords = ['is', 'to', 'my', 'the', 'for', 'in', 'of', 'i', 'their']
tfidf_vect = TfidfVectorizer(stop_words=stopwords)
tfidf_matrix = tfidf_vect.fit_transform(msg)
pd.DataFrame(tfidf_matrix.toarray(),
             columns=tfidf_vect.get_feature_names_out())
I want to generate a column with the list of word names in decreasing order of their tf-idf values.
So the column would look like this:
['venkatesh','name']
['significant','trying','vector','words','each','get']
['decreasing','idf','list','order','tf','values','want','each','get','name','vector','words']
[] # empty list, since the document consists only of stop words
Above is the primary result I'm looking for. It would be great if we could also get, for each document, a sorted dict with tf-idf values as keys and the list of words associated with that tf-idf value as values.
So the result would be like the below:
{'0.785288':['venkatesh'],'0.619130':['name']}
{'0.47212':['significant','trying'],'0.372225':['vector','words','each','get']}
{'0.314534':['decreasing','idf','list','order','tf','values','want'],'0.247983':['each','get','name','vector','words']}
{} # empty dict, since the document consists only of stop words
I think this code does what you want and avoids using pandas:
from itertools import groupby

sort_func = lambda v: v[0]  # sort by the first value in each tuple (the tf-idf score)
all_dicts = []
for row in tfidf_matrix.toarray():
    sorted_vals = sorted(zip(row, tfidf_vect.get_feature_names_out()), key=sort_func, reverse=True)
    all_dicts.append({val: [g[1] for g in group]
                      for val, group in groupby(sorted_vals, key=sort_func)
                      if val != 0})
You could make it even less readable and put it all in a single comprehension! :-)
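One caveat worth noting: itertools.groupby only groups consecutive equal keys, which is why sorted_vals is sorted before grouping. A tiny standalone illustration:

from itertools import groupby

vals = [3, 3, 1, 3]
print([(k, len(list(g))) for k, g in groupby(vals)])
# [(3, 2), (1, 1), (3, 1)] -- the unsorted trailing 3 lands in its own group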
The combination of the following function and the to_dict() method on the dataframe can give you the desired output.
def ret_dict(_dict):
    # Get the unique tf-idf values
    list_keys = list(set(_dict.values()))
    processed_dict = {key: [] for key in list_keys}
    # Group each word under its tf-idf value
    for key, value in _dict.items():
        processed_dict[value].append(str(key))
    # Sort the keys in decreasing order, dropping zeros
    sorted_keys = sorted(processed_dict, reverse=True)
    sorted_keys = [keys for keys in sorted_keys if keys > 0]
    # Return the dictionary with sorted keys
    sorted_dict = {k: processed_dict[k] for k in sorted_keys}
    return sorted_dict
Then:
res = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vect.get_feature_names_out())
list_dict = res.to_dict('records')
processed_list = []
for _dict in list_dict:
    processed_list.append(ret_dict(_dict))
processed_list contains the output you desire. For instance: processed_list[1] would output:
{0.47212002654617047: ['significant', 'trying'], 0.3722248517590162: ['each', 'get', 'vector', 'words']}

List comprehension requiring values from separate lists for function input, with multiple return values

I have two lists. One contains many pandas.core.frame.DataFrame objects and is named X_train_frames; the other contains many pandas.core.series.Series objects and is named y_train_frames.
Each value in X_train_frames maps to a label in y_train_frames
I would like to use them in a function together and return a list.
I have tried:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 1, sampling_strategy = 'minority')
X_bal_frames, y_bal_frames = [smote.fit_resample(X_frame, y_frame) for X_frame, y_frame in zip(X_train_frames, y_train_frames)]
I receive the following error:
ValueError: too many values to unpack (expected 2)
I expect to return two lists of SMOTE resampled data in this case:
X_bal_frames will have a list of pandas.core.frame.DataFrames
and
y_bal_frames will have a list of pandas.core.series.Series
Given that zip(*x) transposes a list of pairs into two tuples, each part of the result can be unpacked using the syntax below.
a, b = zip(*x)
Applied to this example:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state = 1, sampling_strategy = 'minority')
X_bal_frames, y_bal_frames = zip(*[smote.fit_resample(X_frame, y_frame) for X_frame, y_frame in zip(X_train_frames, y_train_frames)])
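As a toy illustration of the zip(*...) transpose idiom (hypothetical values, not from the question):

pairs = [('X0', 'y0'), ('X1', 'y1'), ('X2', 'y2')]  # e.g. one (X, y) tuple per fit_resample call
X_parts, y_parts = zip(*pairs)
print(X_parts)  # ('X0', 'X1', 'X2')
print(y_parts)  # ('y0', 'y1', 'y2')

Note that zip(*...) returns tuples; wrap each in list(...) if you need lists.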

How to use Python 3 multiprocessing to append to a list?

I have an empty list empty_list = []
and two other lists: list1 = [[1,2,3],[4,5,6],[7,8,9]] and list2 = [[10,11,12],[13,14,15],[16,17,18]].
I would like to do two things:
pick up [1,2,3] from list1 and [10,11,12] from list2 to make [1,2,3,10,11,12]; [4,5,6] and [13,14,15] to form [4,5,6,13,14,15]; and finally [7,8,9] and [16,17,18] to form [7,8,9,16,17,18];
append listA = [1,2,3,10,11,12], listB = [4,5,6,13,14,15], listC = [7,8,9,16,17,18] to empty_list along axis=0.
I have done this without multiprocessing, but it is slow. How can I do it with multiprocessing?
I have two naive approaches but do not know how to implement them:
use a Pool: make a func0 that picks up the sub-lists and merges them via pool.map(func0, [list1, list2]), then a func1 that appends listA, listB, listC to the empty list via pool.map(func1, [listA, listB, listC]);
use multiprocessing.Array, but I have not figured out how to do it.
This small sample may not need multiprocessing, but my real lists have thousands of rows.
I am not sure if this can help, but you can avoid some list comprehensions:
empty_list = []
for l1, l2 in zip(list1, list2):
    empty_list.append(l1 + l2)
Let's check time performance with some random lists:
import timeit

code_to_test = """
import numpy as np
list1 = [np.random.randint(0, 10, 100).tolist() for i in range(10_000)]
list2 = [np.random.randint(0, 10, 100).tolist() for i in range(10_000)]
empty_list = []
for l1, l2 in zip(list1, list2):
    empty_list.append(l1 + l2)
"""
elapsed_time = timeit.timeit(code_to_test, number=100) / 100
print(elapsed_time, ' seconds')
0.12564824399999452 seconds
You can use dask to parallelize numpy operations:
import dask.array as da
list1 = da.from_array(list1)
list2 = da.from_array(list2)
result = da.hstack([list1,list2])
result.compute()
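If you specifically want the multiprocessing module despite the overhead, a minimal Pool sketch is below; for a workload this light, the cost of pickling the lists between processes usually outweighs any speedup:

from multiprocessing import Pool
from operator import add  # for lists, add(l1, l2) is l1 + l2

list1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
list2 = [[10, 11, 12], [13, 14, 15], [16, 17, 18]]

if __name__ == '__main__':
    with Pool() as pool:
        # Each worker concatenates one pair of sub-lists.
        empty_list = pool.starmap(add, zip(list1, list2))
    print(empty_list)  # [[1, 2, 3, 10, 11, 12], [4, 5, 6, 13, 14, 15], [7, 8, 9, 16, 17, 18]]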

Split list into randomised ordered sub lists

I would like to improve the code below, which splits a list of values into two sub-lists that are randomised and sorted. The code works, but I'm sure there is a better/cleaner way to do it.
import random
data = list(range(1, 61))
random.shuffle(data)
Intervention = data[:30]
Control = data[30:]
Intervention.sort()
Control.sort()
f = open('Randomised_Groups.txt', 'w')
f.write('Intervention Group = ' + str(Intervention) + '\n' + 'Control Group = ' + str(Control))
f.close()
The expected output is:
Intervention = [1,3,7,9]
Control = [2,4,5,6,8,10]
I think your code is short and clean already. Some changes you can make:
Call sorted() when you slice it.
Intervention = sorted(data[:30])
You can also define both Intervention and Control on one line:
Intervention, Control = data[:30], data[30:]
I would replace the 30 with a variable:
half = len(data)//2
It is safer to open a file with with; the file is closed automatically when the block ends.
with open('Randomised_Groups.txt', 'w') as f:
    ...
With the use of f-strings you can make the write statement shorter:
f.write(f'Intervention Group = {Intervention} \nControl Group = {Control}')
All combined:
import random

data = list(range(1, 61))
random.shuffle(data)
half = len(data) // 2
Intervention, Control = sorted(data[:half]), sorted(data[half:])
with open('Randomised_Groups.txt', 'w') as f:
    f.write(f'Intervention Group = {Intervention}\nControl Group = {Control}')
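A further variant, if you would rather not shuffle data in place: random.sample draws the intervention group directly. A sketch; the set difference relies on the values being unique, which holds for range(1, 61):

import random

data = list(range(1, 61))
half = len(data) // 2
Intervention = sorted(random.sample(data, half))
Control = sorted(set(data) - set(Intervention))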
Something like this might be what you want:
import random
my_rng = [random.randint(0,1) for i in range(60)]
Control = [i for i in range(60) if my_rng[i] == 0]
Intervention = [i for i in range(60) if my_rng[i] == 1]
print(Control)
The idea is to create 60 random 1s or 0s to use as indicators for which list to put each number in. This will only work if you do not need the two lists to be the same length. To get the same length would require changing how my_rng is created in this example.
I have tinkered a bit further and got lists of the same length:
import random
my_rng = [0 for i in range(30)]
my_rng.extend([1 for i in range(30)])
random.shuffle(my_rng)
Control = [i for i in range(60) if my_rng[i] == 0]
Intervention = [i for i in range(60) if my_rng[i] == 1]
Here, instead of randomly appending 1 or 0 to my_rng, I build a list of thirty 0s and thirty 1s and shuffle it, then continue as before.
Here is another solution that is more dynamic, using built-in random functionality; it only creates the lists needed (no extra memory) and works with lists containing any type of object (provided the objects can be sorted):
import random

def convert_to_random_list(data, num_list):
    """
    Takes in the data as one large list and converts it into
    [num_list] random sorted lists.
    """
    result_lists = [list() for _ in range(num_list)]  # one empty list per group
    for x in data:
        # Using randint we pick which list to insert into
        result_lists[random.randint(0, num_list - 1)].append(x)
    # You could use a list comprehension here with `sorted(...)`, but it would take a little extra memory.
    for _list in result_lists:
        _list.sort()
    return result_lists
Can be tested with:
data = list(range(1, 61))
random.shuffle(data)
temp = convert_to_random_list(data, 3)
print(temp)

What to pass to clf.predict()?

I started playing with Decision Trees lately and I wanted to train my own simple model with some manufactured data. I wanted to use this model to predict some further mock data, just to get a feel of how it works, but then I got stuck. Once your model is trained, how do you pass data to predict()?
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Docs state:
clf.predict(X)
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
But when I try to pass an np.array, np.ndarray, list, tuple or DataFrame, it just throws an error. Can you please help me understand why?
Code below:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import graphviz
import pandas as pd
import numpy as np
import random
from sklearn import tree
pd.options.display.max_seq_items=5000
pd.options.display.max_rows=20
pd.options.display.max_columns=150
length = 50000
miles_commuting = [random.choice([2,3,4,5,7,10,20,25,30]) for x in range(length)]
salary = [random.choice([1300,1600,1800,1900,2300,2500,2700,3300,4000]) for x in range(length)]
full_time = [random.choice([1,0,1,1,0,1]) for x in range(length)]
DataFrame = pd.DataFrame({'CommuteInMiles':miles_commuting,'Salary':salary,'FullTimeEmployee':full_time})
DataFrame['Moving'] = np.where((DataFrame.CommuteInMiles > 20) & (DataFrame.Salary > 2000) & (DataFrame.FullTimeEmployee == 1),1,0)
DataFrame['TargetLabel'] = np.where((DataFrame.Moving == 1),'Considering move','Not moving')
target = DataFrame.loc[:,'Moving']
data = DataFrame.loc[:,['CommuteInMiles','Salary','FullTimeEmployee']]
target_names = DataFrame.TargetLabel
features = data.columns.values
clf = tree.DecisionTreeClassifier()
clf = clf.fit(data, target)
clf.predict(?????) #### <===== What should go here?
clf.predict([30,4000,1])
ValueError: Expected 2D array, got 1D array instead:
array=[3.e+01 4.e+03 1.e+00].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
clf.predict(np.array(30,4000,1))
ValueError: only 2 non-keyword arguments accepted
Where is your "mock data" that you want to predict?
Your data should have the same shape that you used when calling fit(). From the code above, I see that your X has three columns: ['CommuteInMiles','Salary','FullTimeEmployee']. You need that many columns in your prediction data; the number of rows can be arbitrary.
Now when you do
clf.predict([30,4000,1])
The model is not able to tell whether these are the columns of a single row or data from different rows.
So you need to convert the input into a 2-d array, where each inner array represents a single row.
Do this:
clf.predict([[30,4000,1]]) #<== Observe the two square brackets
You can have multiple rows to be predicted, each in inner list. Something like this:
X_test = [[30, 4000, 1],
          [35, 15000, 0],
          [40, 2000, 1]]
clf.predict(X_test)
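Since the model was fit on a pandas DataFrame, you can also pass a one-row DataFrame with the same columns. A sketch reusing the features variable defined in the question; the column order must match the training data:

import pandas as pd

X_new = pd.DataFrame([[30, 4000, 1]], columns=features)
clf.predict(X_new)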
Now as for your last error, clf.predict(np.array(30,4000,1)): this has nothing to do with predict(). You are using np.array() incorrectly.
According to the documentation, the signature of np.array is:
(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
Leaving aside the first (object), all the others are keyword arguments, so they need to be used as such. But when you write np.array(30,4000,1), each value is treated as input to a separate param: object=30, dtype=4000, copy=1. This is not allowed, hence the error. If you want to make a numpy array from a list, you need to pass a list.
Like this: np.array([30,4000,1])
Now this will be considered correctly as input to object param.
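Tying this back to the first error message: its reshape(1, -1) suggestion produces the same one-row 2-D shape as the double brackets. A minimal sketch:

import numpy as np

clf.predict(np.array([30, 4000, 1]).reshape(1, -1))  # 1 sample, 3 features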
