Transform an integer into dummies vector in python - python-3.x

Hi there!
I'm working in Python with pandas' get_dummies function, and I am trying to transform an int into a vector. For example, with a 5-category feature:
1 -> [1,0,0,0,0]
2 -> [0,1,0,0,0]
...
Does a function exist for that?
If not, I can build a function, but I wanted to ask before reinventing the wheel.
Thanks !

Just cast the relevant Series to a string and then use get_dummies as usual.
pd.get_dummies(df['col'].astype(str))
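One caveat with the cast-to-string approach is that get_dummies only creates columns for values that actually occur in the data. A minimal sketch of one way around that, assuming the full set of categories (1 to 5 here) is known in advance; the sample DataFrame is made up for illustration:

```python
import pandas as pd

# Hypothetical sample data: a feature that can take the values 1..5,
# but only some of them occur in this particular column.
df = pd.DataFrame({"col": [1, 3, 3, 5]})

# Declaring the full category set up front keeps one column per category,
# even for values (2 and 4 here) that never appear in the data.
categories = [1, 2, 3, 4, 5]
col = pd.Categorical(df["col"], categories=categories)

# dtype=int requests 0/1 values instead of booleans.
dummies = pd.get_dummies(col, dtype=int)
rows = dummies.to_numpy().tolist()
print(rows)
```

Each row of the result is the 5-element vector for the corresponding input value.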

I think it's so easy you should just write a simple function to do that, instead of asking. Here is one of countless ways to do this.
import numpy as np
def get_dumm(lenn, num):
    arr = np.zeros(lenn, dtype='bool_')  # replace type with 'int8' if needed
    arr[num - 1] = True  # replace True with 1 if type of arr is 'int8'
    return arr
get_dumm(5,3)
Output:
array([False, False, True, False, False], dtype=bool)
Or if you use int8:
array([0, 0, 1, 0, 0], dtype=int8)
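If several integers need to be converted at once, the same idea can be vectorized by indexing into an identity matrix. This is only a sketch: the name one_hot is made up for illustration, and it assumes categories are numbered from 1 as in the question.

```python
import numpy as np

def one_hot(values, n_categories):
    """Return one row per value, with a 1 at position value - 1 (1-based categories)."""
    values = np.asarray(values)
    # Row i of the identity matrix is the one-hot vector for category i + 1.
    return np.eye(n_categories, dtype=np.int8)[values - 1]

encoded = one_hot([1, 2, 5], 5)
print(encoded.tolist())  # [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1]]
```

Note that a value of 0 would silently map to the last category because of negative indexing, so inputs may need validating first.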

Related

Fastest way of updating a Python list based on indices

I have a Python dictionary like this -
my_dict = {'Names':['Tom', 'Mariam', 'Lata', 'Tina', 'Abin'],
           'Attendance':[False, False, False, False, False]}
I also have a Python list of flags giving the indices that need to be changed to True in my_dict['Attendance'] -
flag_list = [0, 2, 3]
Based on the flag_list, my_dict needs to be changed to -
my_dict = {'Names':['Tom', 'Mariam', 'Lata', 'Tina', 'Abin'],
           'Attendance':[True, False, True, True, False]}
What would be the fastest way of achieving this? Can it be done without a loop? Thank you for any guidance.
Using a loop
for index in flag_list:
    my_dict['Attendance'][index] = True
A micro-optimization would be to fetch the list from the dict only once:
attendance_list = my_dict['Attendance']
for index in flag_list:
    attendance_list[index] = True
But unless flag_list is thousands of elements long, I wouldn't worry about it.
Using vectorization
If you are willing to take advantage of vectorization you can use a numpy array:
import numpy as np
my_dict = {'Names':['Tom', 'Mariam', 'Lata', 'Tina', 'Abin'],
           'Attendance': np.array([False, False, False, False, False])}
flag_list = [0, 2, 3]
my_dict['Attendance'][flag_list] = True
But again, unless your data is very big I wouldn't worry about optimizing this piece of code very much.
Example timings
import random
from timeit import Timer
import numpy as np
ATTENDANCE_LIST_SIZE = 100000
FLAG_LIST_SIZE = 60000
dict_with_numpy = {'Attendance': np.random.choice([False, True],
                                                  ATTENDANCE_LIST_SIZE)}
dict_without_numpy = {'Attendance': random.choices([False, True],
                                                   k=ATTENDANCE_LIST_SIZE)}
flag_list = random.choices(range(ATTENDANCE_LIST_SIZE), k=FLAG_LIST_SIZE)
def using_numpy():
    dict_with_numpy['Attendance'][flag_list] = True
def no_numpy_pre_fetching_list():
    attendance_list = dict_without_numpy['Attendance']
    for index in flag_list:
        attendance_list[index] = True
def no_numpy():
    for index in flag_list:
        dict_without_numpy['Attendance'][index] = True
print(f'no_numpy\t\t\t\t\t\t{min(Timer(no_numpy).repeat(3, 3))}')
print(f'no_numpy_pre_fetching_list\t\t{min(Timer(no_numpy_pre_fetching_list).repeat(3, 3))}')
print(f'using_numpy\t\t\t\t\t\t{min(Timer(using_numpy).repeat(3, 3))}')
For this amount of data, the output is (on my machine)
no_numpy 0.009737916999999985
no_numpy_pre_fetching_list 0.0048406370000000365
using_numpy 0.009164470000000036
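So for this data, vectorization is not the most efficient option: it is about as fast as the plain loop and roughly half as fast as the loop that fetches the list only once.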
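One plausible reason for the NumPy result, which the timings above do not isolate, is that flag_list is a plain Python list, so NumPy has to convert it to an index array on every call. The sketch below compares indexing with the list against indexing with an array built once beforehand. The names flag_array, using_numpy_list_index and using_numpy_array_index are made up for illustration, and the timings will vary by machine, so no expected numbers are shown.

```python
import random
from timeit import Timer

import numpy as np

ATTENDANCE_LIST_SIZE = 100000
FLAG_LIST_SIZE = 60000

attendance = np.zeros(ATTENDANCE_LIST_SIZE, dtype=bool)
flag_list = random.choices(range(ATTENDANCE_LIST_SIZE), k=FLAG_LIST_SIZE)
# Hypothetical variant: convert the indices to an array once, outside the timed code.
flag_array = np.array(flag_list)

def using_numpy_list_index():
    attendance[flag_list] = True   # the list is converted to an array on each call

def using_numpy_array_index():
    attendance[flag_array] = True  # the indices are already an array

print(f'list index\t{min(Timer(using_numpy_list_index).repeat(3, 3))}')
print(f'array index\t{min(Timer(using_numpy_array_index).repeat(3, 3))}')
```

Whether the conversion is worth doing up front depends on whether the indices arrive as a list each time or can be kept as an array.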

Convert a list of labels into number given a defined dictionary

I have the following dictionary defined in my code:
label_dict = {'positive': 1, 'negative': 0}
I also have a label_list that contains two possible values: "positive" and "negative".
I want to essentially map each label in label_list to the respective numeric value defined by label_dict.
I have the following for loop defined as well: for label in range(len(label_list)): for iterating through label_list.
How can I accomplish this? Any help is much appreciated.
One solution is to convert label_list to a pandas Series, map it through the dictionary, and convert the result back to a list:
import pandas as pd
label_dict = {'positive': 1, 'negative': 0}
label_list = ["positive","negative","negative","positive",
              "negative","positive","negative"]
new_lst = pd.Series(label_list).map(label_dict).tolist()
#output
print(new_lst) # [1, 0, 0, 1, 0, 1, 0]
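If pandas is not otherwise needed, a plain list comprehension gives the same result. A minimal sketch, using a shortened version of the example list:

```python
label_dict = {'positive': 1, 'negative': 0}
label_list = ["positive", "negative", "negative", "positive"]

# Look up each label directly; an unknown label raises KeyError.
new_lst = [label_dict[label] for label in label_list]
print(new_lst)  # [1, 0, 0, 1]
```

One difference in behavior: with Series.map, a label missing from the dictionary becomes NaN, whereas the comprehension raises KeyError.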

torch find indices of matching rows in 2 2D tensors

I have two 2D tensors of different lengths. Both are different subsets of the same original 2D tensor, and I would like to find all the matching "rows".
e.g.
A = [[1,2,3],[4,5,6],[7,8,9],[3,3,3]]
B = [[1,2,3],[7,8,9],[4,4,4]]
torch.2dintersect(A,B) -> [0,2] (the indices of A that B also has)
I've only seen numpy solutions, which rely on dtype views and do not work in PyTorch.
Here is how I do it in numpy:
arr1 = edge_index_dense.numpy().view(np.int32)
arr2 = edge_index2_dense.numpy().view(np.int32)
arr1_view = arr1.view([('', arr1.dtype)] * arr1.shape[1])
arr2_view = arr2.view([('', arr2.dtype)] * arr2.shape[1])
intersected = np.intersect1d(arr1_view, arr2_view, return_indices=True)
This answer was posted before the OP updated the question with other restrictions that changed the problem quite a bit.
TL;DR You can do something like this:
torch.where((A == B).all(dim=1))[0]
First, assuming you have:
import torch
A = torch.Tensor([[1,2,3],[4,5,6],[7,8,9]])
B = torch.Tensor([[1,2,3],[4,4,4],[7,8,9]])
We can check that A == B returns:
>>> A == B
tensor([[ True,  True,  True],
        [ True, False, False],
        [ True,  True,  True]])
So, what we want is: the rows in which they are all True. For that, we can use the .all() operation and specify the dimension of interest, in our case 1:
>>> (A == B).all(dim=1)
tensor([ True, False, True])
What you actually want to know is where the Trues are. For that, we can get the first output of the torch.where() function:
>>> torch.where((A == B).all(dim=1))[0]
tensor([0, 2])
If A and B are 2D tensors, the following code finds the indices such that A[indices] == B. If multiple indices satisfy this condition for a given row of B, the first index found is returned. Rows of B that are not present in A are ignored.
values, indices = torch.topk(((A.t() == B.unsqueeze(-1)).all(dim=1)).int(), 1, 1)
indices = indices[values!=0]
# indices = tensor([0, 2])
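For tensors of different lengths, the same pairwise comparison can also be written step by step with broadcasting. This is a sketch using the A and B from the question, with made-up intermediate names. It builds a len(A) x len(B) x n_cols boolean tensor, so it is only suitable when that fits in memory.

```python
import torch

A = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9], [3, 3, 3]])
B = torch.tensor([[1, 2, 3], [7, 8, 9], [4, 4, 4]])

# Compare every row of A with every row of B: shape (len(A), len(B), n_cols).
pairwise_equal = A.unsqueeze(1) == B.unsqueeze(0)
# Two rows match only if all of their columns agree: shape (len(A), len(B)).
row_matches = pairwise_equal.all(dim=2)
# Keep the rows of A that match at least one row of B.
indices = torch.where(row_matches.any(dim=1))[0]
print(indices)  # tensor([0, 2])
```

Unlike the topk version, the result here is always sorted by position in A.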

Evaluating the output of multiple functions which return True or False

So I came across this interesting problem. It basically involves a lot of functions which return True or False being called within another function, and I want that second function to return either True or False based on applying AND or OR logic to all of the calls within it. I know this is a terrible way to explain it, so let's see some code, which will hopefully explain it better:
#this is the first function that returns True or False
def f(x):
    if x == 1:
        return True
    elif x == 0:
        return False
#this is the function that takes the first one and I want it to return either True or False based on an AND logic
def g(f):
    f(1)
    f(0)
    f(0)
    f(1)
    f(1)
    f(0)
    f(1)
Now I know I can just write the second function with 'and' between all the f(x) calls, but that seems very ugly, so I want something that will just evaluate all of these and return a value. I don't have enough experience with writing functions that take multiple inputs, especially a varying number of them, so I would appreciate any help on this.
You can use all and a comprehension over the variable arguments (*args) of the function:
>>> def f(x):
...     if x == 1:
...         return True
...     elif x == 0:
...         return False
...
>>> def g(f, *args):
...     return all(f(x) for x in args)
...
>>> g(f, 1, 0, 0, 1)
False
>>> g(f, 1, 1, 1)
True
You can use the built-in all function, which is equivalent to a logical AND:
def f(x):
    return x < 5
all((f(1), f(2), f(3), f(4)))
Now, concerning function g, you can do this (for example):
def g(f, inputs):
    for i in inputs:
        yield f(i)
all(g(f, range(5)))
Here you can replace range(5) with any of [0, 1, 2, 3, 4], (0, 1, 2, 3, 4), {0, 1, 2, 3, 4}, and many more (i.e. any iterable).
Note that a function similar to g already exists in Python. It's called map, and you could use it this way:
all(map(f, range(5))) # More or less equivalent to all(g(f, range(5)))
You could also use a generator expression directly (an alternative to the yield generator form):
all(f(i) for i in range(5))
Which of these solutions is best really depends on the use case and on your personal preferences (though the last one is probably the one you will see most often).
For an AND function, you can use Python's all, and for an OR function, you can use Python's any:
>>> all([True, False])
False
>>> all([True, True])
True
>>> any([True, False])
True
>>> any([True, True])
True
>>> any([False, False])
False
Just collect all your outputs in a list and evaluate all or any. Using the function f you defined:
print(all([f(1),f(1)]))
print(all([f(0),f(0)]))
print(any([f(1), f(0)]))
print(any([f(0), f(0)]))
#True
#False
#True
#False
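Since the question asks for either AND or OR logic, the two built-ins can be combined in a single helper. A minimal sketch: the mode parameter is a hypothetical addition for illustration, and f is simplified to a single expression.

```python
def f(x):
    # Same predicate as in the question, written as a single expression.
    return x == 1

def g(f, *args, mode="and"):
    """Apply f to every argument and combine the results with AND or OR.

    The mode parameter is a hypothetical addition for illustration.
    """
    results = (f(x) for x in args)  # lazy, so all/any can stop early
    if mode == "and":
        return all(results)
    if mode == "or":
        return any(results)
    raise ValueError(f"unknown mode: {mode!r}")

print(g(f, 1, 0, 0, 1))             # False
print(g(f, 1, 0, 0, 1, mode="or"))  # True
print(g(f, 1, 1, 1))                # True
```

Passing a generator rather than a list means f is not called again once the outcome is decided, which matters if f is expensive.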

If I have duplicates in a list with brackets, what should I do

Suppose I have the following list:
m=[1,2,[1],1,2,[1]]
I wish to remove all duplicates. If it were not for the brackets inside the list, I could use:
m=list(set(m))
but when I do this, I get the error:
TypeError: unhashable type: 'list'
What command will help me remove duplicates so that I am left with only the list
m=[1,2,[1]]
Thank you
You can do something along these lines:
m=[1,2,[1],1,2,[1]]
seen=set()
nm=[]
for e in m:
    try:
        x={e}  # raises TypeError if e is unhashable (e.g. a list)
        x=e
    except TypeError:
        x=frozenset(e)  # hashable stand-in for the unhashable item
    if x not in seen:
        seen.add(x)
        nm.append(e)
>>> nm
[1, 2, [1]]
From comments: This method preserves the order of the original list. If you want the numeric types in order first and the other types second, you can do:
sorted(nm, key=lambda e: 0 if isinstance(e, (int,float)) else 1)
The first step will be to convert the inner lists to tuples:
>>> new_list = [tuple(i) if type(i) == list else i for i in m]
Then create a set to remove duplicates:
>>> no_duplicates = set(new_list)
>>> no_duplicates
{1, 2, (1,)}
and you can convert that into list if you wish.
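The tuple conversion can be combined with a seen-set to keep the original order and return the inner lists unchanged. This is a sketch under a few assumptions: the helper names freeze and dedupe are made up for illustration, only lists (possibly nested) are treated as unhashable, and each converted list is tagged with the list type so that [1] and (1,) stay distinct.

```python
def freeze(item):
    """Return a hashable stand-in for item, converting nested lists recursively."""
    if isinstance(item, list):
        # Tag with the list type so a list and an equal-looking tuple differ.
        return (list, tuple(freeze(x) for x in item))
    return item

def dedupe(items):
    """Remove duplicates while preserving first-seen order."""
    seen = set()
    result = []
    for item in items:
        key = freeze(item)
        if key not in seen:
            seen.add(key)
            result.append(item)  # keep the original object, not the stand-in
    return result

m = [1, 2, [1], 1, 2, [1]]
print(dedupe(m))  # [1, 2, [1]]
```

Unlike a frozenset-based key, this treats [1, 2] and [2, 1] as different items.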
For a more generic solution you can serialize each list item with pickle.dumps before passing them to set(), and then de-serialize the items with pickle.loads:
import pickle
m = list(map(pickle.loads, set(map(pickle.dumps, m))))
If you want the original order to be maintained, you can use a dict (which preserves insertion order in CPython 3.6 and, as a language guarantee, from Python 3.7) instead of a set:
import pickle
m = list(map(pickle.loads, {k: 1 for k in map(pickle.dumps, m)}))
Or if you need to be compatible with Python 3.5 or earlier versions, you can use collections.OrderedDict instead:
import pickle
from collections import OrderedDict
m = list(map(pickle.loads, OrderedDict((k, 1) for k in map(pickle.dumps, m))))
result = []
for i in m:
    flag = True
    for j in result:  # compare against the items already kept
        if i == j:
            flag = False
    if flag:
        result.append(i)
Result will be: [1,2,[1]]
There are ways to make this code shorter, but I'm writing it more verbosely for readability. Also, note that this method is O(n^2), so I wouldn't recommend it for long lists. Its benefit is simplicity.
Simple Solution,
m=[1,2,[1],1,2,[1]]
l= []
for i in m:
    if i not in l:
        l.append(i)
print(l)
[1, 2, [1]]
