Get a list of tensors from masked indices - pytorch

I'm trying to get a list of tensors based on different groups,
e.g.,
x = tensor([ 0.3018, -0.0079, 1.4995, -1.4422, 1.6007])
indices = torch.tensor([0,0,1,1,2])
res = func(x,indices)
I want my result to be
res= [[0.3018, -0.0079], [1.4995, -1.4422], [1.6007]]
I'm wondering how I can achieve this result. I checked gather and index_select,
but I can't get the result shown above.
Thank you!

How about
res = [x[indices == i_] for i_ in indices.unique()]
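A runnable sketch of that one-liner on the question's toy values (exact printed formatting may differ):

import torch

x = torch.tensor([0.3018, -0.0079, 1.4995, -1.4422, 1.6007])
indices = torch.tensor([0, 0, 1, 1, 2])

# one boolean-mask selection per unique group id
res = [x[indices == i_] for i_ in indices.unique()]
# res -> [tensor([ 0.3018, -0.0079]), tensor([ 1.4995, -1.4422]), tensor([1.6007])]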


Issue with pd.DataFrame.apply with arguments

I want to create augmented data in a new dataframe for every row of an original dataframe.
So, I've defined an augment method which I want to use in apply as follows:
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int):
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")

    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
        # print(target_df.shape)
        # display.display(target_df)
When I call this as follows, everything works:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
augment(testDF.iloc[0], column_name='binMap', target_df=tmp_df, num_samples=4)
augment(testDF.iloc[1], column_name='binMap', target_df=tmp_df, num_samples=4)
However, when I try it using the 'apply' method, the prints and the display work fine, but the resulting dataframe shows errors:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
testDF.apply(augment, args=('binMap', tmp_df, 4, ), axis=1)
This is what the output data looks like after the apply call:
,data
<Error>, <Error>
<Error>, <Error>
What am I doing wrong?
Your test is very nice, thank you for the clear exposition. I am happy to be your rubber duck.

In test A, you (successfully) mess with testDF.iloc[0] and [1], using kind of a Fortran-style API for augment(), leaving a side-effect result in tmp_df. Test B is carefully constructed to be "the same" except for the .apply() call. So let's see, what's different? Hard to say. Let's go examine the docs.

Oh, right! We're using the .apply() API, so we'd better follow it. Down at the end it explains:

Returns: Series or DataFrame
Result of applying func along the given axis of the DataFrame.

But you're offering return None instead. Now, I'm not here to pass judgement on whether it's best to have side effects on a target df -- that's up to you. But .apply() will be bent out of shape until you give it something nice to store as its own result. Happy hunting!
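For a concrete illustration (a toy dataframe and a made-up summarize helper, not the question's augment), this is the pattern .apply() expects:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

def summarize(row):
    # return something for .apply() to collect, row by row
    return row['a'] + row['b']

result = df.apply(summarize, axis=1)
print(result)   # a Series holding 4 and 6, one value per row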
Tiny little style nit. You wrote

args=('binMap', tmp_df, 4, )

to offer a 3-tuple. Better to write

args=('binMap', tmp_df, 4)

As written it tends to suggest 1-tuple notation. When is a trailing comma helpful?

In a 1-tuple it is essential: x = (7,) (see the short sketch below).
In multiline dict / list expressions it minimizes git diffs, when inevitably another entry ('cherry'?) will later be added:

fruits = [
    'apple',
    'banana',
]
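And the 1-tuple point in two lines, for the record:

x = (7,)   # a tuple with one element
y = (7)    # just the int 7; the parentheses alone do nothing
print(type(x), type(y))   # <class 'tuple'> <class 'int'>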
This change worked for me:
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int) -> pd.Series:
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")

    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
        # print(target_df.shape)
        # display.display(target_df)

    return row
And updated the call to apply as follows:
testDF = testDF.apply(augment, args=('binMap', tmp_df, 4, ), result_type='broadcast', axis=1)
Thank you @J_H.
If there is a better way to achieve what I'm doing, please feel free to suggest improvements.

Create a tuple in Python 3 with string and number

I have an object in Python like below:
contributor_detail = contributorId + ',' + contentFileName + ',' + productPriority
Output:
CSFBW23_1194968,CSFBW23_1194968.pdf,10
CSFBW23_1194969,CSFBW23_1194968.pdf,11
CSFBW23_1194970,CSFBW23_1194968.pdf,13
Apologies if the question was unclear. I will try to reframe it here. This is how I am building a list, from which I then need to make tuples to sort:
for record in event['Records']:
    # pull the body out & json load it
    jsonmaybe = (record["body"])
    jsonmaybe = json.loads(jsonmaybe)
    jsonmaybe1 = (jsonmaybe["Message"])
    jsonmaybe1 = json.loads(jsonmaybe1)

    # now the normal stuff works
    bucket_name = jsonmaybe1["Records"][0]["s3"]["bucket"]["name"]
    print(bucket_name)
    key = jsonmaybe1["Records"][0]["s3"]["object"]["key"]
    print(key)

    s3_clientobj = s3.get_object(Bucket=bucket_name, Key=key)
    s3_clientdata = s3_clientobj['Body'].read().decode('utf-8')
    employee_dict = json.loads(s3_clientdata)

    contributorId = employee_dict['Research']['Product']['#productID']
    contentFileName = employee_dict['Research']['Product']['Content']['Resource']['Name']
    productPriority = employee_dict['Research']['Product']['#productPriority']
    print("contributorId---------------" + contributorId)
    print("contentFileName---------------" + contentFileName)
    print("productPriority---------------" + productPriority)

    contributor_detail = contributorId + ',' + contentFileName + ',' + productPriority
    unsorted_contributors.append(contributor_detail)
I want to create a tuple out of each record and collect the tuples in a list.
Output I am getting now:
['CSFBW23_1194968,CSFBW23_1194968.pdf,6', 'CSFBW23_1194968,CSFBW23_1194968.pdf,7', 'CSFBW23_1194968,CSFBW23_1194968.pdf,9']
Expected output
[("CSFBW23_1194968","CSFBW23_1194968.pdf",6),("CSFBW23_1194968","CSFBW23_1194968.pdf",7),("CSFBW23_1194968","CSFBW23_1194968.pdf",9)]
I need tuples as above so that the list can be sorted based on the 3rd item in each tuple, which is a number:
sorted_contributors.sort(key=itemgetter(2))
Please help me create such a format in the loop.
If I understand you correctly, it looks like you just need to stop making contributor_detail a string as you are with contributorId + ',' + contentFileName + ',' + productPriority. Assuming those values are strings and an int, you are currently joining them into a single string with ','. It really isn't very clear what you're after, but I suspect what you want is:
contributor_detail = contributorId, contentFileName, productPriority
That yields a tuple, from which you can make a list.
from operator import itemgetter
contrib1 = "CSFBW23_1194968", "CSFBW23_1194968.pdf", 10
contrib2 = "CSFBW23_1194969", "CSFBW23_1194968.pdf", 11
contrib3 = "CSFBW23_1194970", "CSFBW23_1194968.pdf", 13
contributors = [contrib1, contrib2, contrib3]
sorted_contributors = sorted(contributors, key=itemgetter(2))
print(sorted_contributors)
Output:
[('CSFBW23_1194968', 'CSFBW23_1194968.pdf', 10), ('CSFBW23_1194969', 'CSFBW23_1194968.pdf', 11), ('CSFBW23_1194970', 'CSFBW23_1194968.pdf', 13)]
A simplified clarification of your issue:
# what you are doing (incorrect):
In [1]: contrib1 = "thingB"+","+"thingA"+","+"1"
In [2]: contrib2 = "thingC"+","+"thingD"+","+"2"
In [3]: print([contrib1, contrib2])
['thingB,thingA,1', 'thingC,thingD,2']
# what you really want is:
In [4]: contrib1 = "thingB", "thingA", 1
In [5]: contrib2 = "thingC", "thingD", 2
In [6]: print([contrib1, contrib2])
[('thingB', 'thingA', 1), ('thingC', 'thingD', 2)]
This will create the proper format:
my_list = []
my_list.append(contributorId)
my_list.append(contentFileName)
my_list.append(bucket_name)
my_list.append(key)
my_list.append(int(productPriority))
unsorted_contributors.append(my_list)
Concatenation was creating a format mismatch, so create a list and append it instead; that should work.
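One note on this layout: since bucket_name and key are appended as well, int(productPriority) lands at index 4 of each inner list, so the sort key would presumably be itemgetter(4) rather than the itemgetter(2) from the question:

from operator import itemgetter

unsorted_contributors.sort(key=itemgetter(4))   # sort by the int priority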

retain only 1 item in a list per unique prefix

I have an example situation where I have a list as follows:
test = ['a-nyc','a-chi','b-sf','c-dal','a-phx','c-la']
The items in this list are naturally ordered in some way, and the objective is to keep the first encountered value for each prefix; e.g. the desired result is a list as follows:
['a-nyc', 'b-sf', 'c-dal']
Is there a handy way of doing this?
Looks like this can be done this way:
newl = []
prel = []
for i in range(len(test)):
    if test[i].split('-')[0] not in prel:
        newl.append(test[i])
    else:
        pass
    prel.append(test[i].split('-')[0])
but I'm not sure if there is a more pythonic solution.
Yes, you can also try the following:
test = ['a-nyc','a-chi','b-sf','c-dal','a-phx','c-la']
prefix = []
newlist = []
for i in test:
    if i.split('-')[0] not in prefix:
        prefix.append(i.split('-')[0])
        newlist.append(i)
print(newlist)
If you have any query about this, let me know. Thank you.
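As an aside, a slightly more compact variant of the same idea tracks the seen prefixes in a set, so membership checks stay O(1) instead of scanning a list:

test = ['a-nyc', 'a-chi', 'b-sf', 'c-dal', 'a-phx', 'c-la']

seen = set()
newlist = []
for item in test:
    prefix = item.split('-')[0]
    if prefix not in seen:
        seen.add(prefix)
        newlist.append(item)

print(newlist)   # ['a-nyc', 'b-sf', 'c-dal']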

What to pass to clf.predict()?

I started playing with Decision Trees lately and I wanted to train my own simple model with some manufactured data. I wanted to use this model to predict some further mock data, just to get a feel of how it works, but then I got stuck. Once your model is trained, how do you pass data to predict()?
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Docs state:
clf.predict(X)
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
But when I try to pass an np.array, np.ndarray, list, tuple or DataFrame, it just throws an error. Can you help me understand why, please?
Code below:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import graphviz
import pandas as pd
import numpy as np
import random
from sklearn import tree
pd.options.display.max_seq_items=5000
pd.options.display.max_rows=20
pd.options.display.max_columns=150
length = 50000
miles_commuting = [random.choice([2,3,4,5,7,10,20,25,30]) for x in range(length)]
salary = [random.choice([1300,1600,1800,1900,2300,2500,2700,3300,4000]) for x in range(length)]
full_time = [random.choice([1,0,1,1,0,1]) for x in range(length)]
DataFrame = pd.DataFrame({'CommuteInMiles':miles_commuting,'Salary':salary,'FullTimeEmployee':full_time})
DataFrame['Moving'] = np.where((DataFrame.CommuteInMiles > 20) & (DataFrame.Salary > 2000) & (DataFrame.FullTimeEmployee == 1),1,0)
DataFrame['TargetLabel'] = np.where((DataFrame.Moving == 1),'Considering move','Not moving')
target = DataFrame.loc[:,'Moving']
data = DataFrame.loc[:,['CommuteInMiles','Salary','FullTimeEmployee']]
target_names = DataFrame.TargetLabel
features = data.columns.values
clf = tree.DecisionTreeClassifier()
clf = clf.fit(data, target)
clf.predict(?????) #### <===== What should go here?
clf.predict([30,4000,1])
ValueError: Expected 2D array, got 1D array instead:
array=[3.e+01 4.e+03 1.e+00].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
clf.predict(np.array(30,4000,1))
ValueError: only 2 non-keyword arguments accepted
Where is your "mock data" that you want to predict?
Your data should have the same shape as the data you used when calling fit(). From the code above, I see that your X has three columns: ['CommuteInMiles','Salary','FullTimeEmployee']. You need that many columns in your prediction data; the number of rows can be arbitrary.
Now when you do
clf.predict([30,4000,1])
the model cannot tell whether these are three columns of the same row or data from three different rows.
So you need to convert it into a 2-D array, where each inner array represents a single row.
Do this:
clf.predict([[30,4000,1]]) #<== Observe the two square brackets
You can have multiple rows to be predicted, each in an inner list. Something like this:
X_test = [[30, 4000, 1],
          [35, 15000, 0],
          [40, 2000, 1]]
clf.predict(X_test)
Now as for your last error, clf.predict(np.array(30,4000,1)), this has nothing to do with predict(). You are using np.array() incorrectly.
According to the documentation, the signature of np.array is:
(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
Apart from the first (object), all the others are keyword arguments, so they need to be used as such. But when you do np.array(30,4000,1), each value is treated as input to a separate parameter: object=30, dtype=4000, copy=1. This is not allowed, hence the error. If you want to make a numpy array from a list, you need to pass a list.
Like this: np.array([30,4000,1])
Now this will be correctly treated as the input to the object param.
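Putting both fixes together, a minimal sketch (assuming clf is the fitted classifier from the question, with the three feature columns in the same order):

import numpy as np

sample = np.array([[30, 4000, 1]])              # note the double brackets: shape (1, 3)
# equivalently, reshape a flat array into one row:
sample = np.array([30, 4000, 1]).reshape(1, -1)

print(clf.predict(sample))                      # one predicted class per row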

Calling separate theano functions with the same inputs?

I have something like this:
x=T.matrix('x')
params = [self.W, self.b1, self.b2]
hidden = self.activation_function(T.dot(x, self.W)+self.b1)
output = T.dot(hidden,T.transpose(self.W))+self.b2
output = self.output_function(output)
L = -T.sum(x*T.log(output) + (1-x)*T.log(1-output), axis=1)
cost=L.mean()
th_train = th.function(inputs=[index], outputs=[cost], updates=updates,
                       givens={x: self.X[index:index + mini_batch_size, :]})
This is working fine. I would now like to see what the mean of the hidden units is. I tried adding this before the line where L = -T.sum... is declared:
hm = T.mean(hidden)
hidden_mean_func = th.function(inputs=[hm], outputs=[hm], name="hidden_mean_function_printer")
print hidden_mean_func(hm)
I get the following error:
TypeError: ('Bad input argument to theano function with name "hidden_mean_function_printer" at index 0(0-based)', 'Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?')
I really have two questions: 1) Why can't I do this? 2) What is the correct way to achieve what I want?
Thank you
As far as I can see, you are passing the symbolic variable itself as the input. If you use your actual array/matrix of hidden unit values instead, the code should work.
hidden_mean_func = th.function(inputs=[hidden], outputs=[hm], name="hidden_mean_function_printer")
print hidden_mean_func(hidden_values)  # hidden_values: a numeric array of hidden unit activations, not the symbolic variable
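If you would rather not feed the hidden values in by hand, one common alternative (a sketch, assuming x, hidden, cost, index, updates, self.X and mini_batch_size are as defined in the question, and some_numeric_batch is a hypothetical numpy array shaped like x) is to compile a function from the original input x, or simply return the mean alongside the cost from the training function:

hm = T.mean(hidden)

# a function from the raw input x to the mean of the hidden units
hidden_mean_func = th.function(inputs=[x], outputs=hm)
print hidden_mean_func(some_numeric_batch)

# or expose hm as an extra output of the existing training function
th_train = th.function(inputs=[index], outputs=[cost, hm], updates=updates,
                       givens={x: self.X[index:index + mini_batch_size, :]})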
