Retrieving original data from PyTorch nn.Embedding

I'm passing a dataframe with 5 categories (e.g. car, bus, ...) into nn.Embedding.
When I do embedding.parameters(), I can see that there are 5 tensors, but how do I know which index corresponds to the original input (e.g. car, bus, ...)?

You can't, as tensors are unnamed (only dimensions can be named; see PyTorch's Named Tensors).
You have to keep the names in a separate data container, for example (4 categories here):
import pandas as pd
import torch

df = pd.DataFrame(
    {
        "bus": [1.0, 2, 3, 4, 5],
        "car": [6.0, 7, 8, 9, 10],
        "bike": [11.0, 12, 13, 14, 15],
        "train": [16.0, 17, 18, 19, 20],
    }
)

df_data = df.to_numpy().T
df_names = list(df)

embedding = torch.nn.Embedding(df_data.shape[0], df_data.shape[1])
embedding.weight.data = torch.from_numpy(df_data)
Now you can simply use it with any index you want:
index = 1
embedding(torch.tensor(index)), df_names[index]
This would give you (tensor([ 6.,  7.,  8.,  9., 10.], dtype=torch.float64), 'car'), i.e. the data and the respective column name.
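If a lookup by name is more convenient than positional indexing, a small dict can be built from the same df_names list (a minimal sketch; the name_to_index mapping and embed_by_name helper are illustrative additions, not part of the original answer):

name_to_index = {name: i for i, name in enumerate(df_names)}

def embed_by_name(name):
    # map a category name to its row in the embedding weight
    return embedding(torch.tensor(name_to_index[name]))

embed_by_name("car")  # tensor([ 6.,  7.,  8.,  9., 10.], dtype=torch.float64, ...)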

Related

Simplify numpy expression [duplicate]

How can I simplify this:
import numpy as np

ex = np.arange(27).reshape(3, 3, 3)

def get_plane(axe, index):
    return ex.swapaxes(axe, 0)[index]  # is there a better way?
I cannot find a numpy function to get a plane in a higher-dimensional array; is there one?
EDIT
The ex.take(index, axis=axe) method is great, but it copies the array instead of giving a view, which is what I originally wanted.
So what is the shortest way to index (without copying) an n-dimensional array to get a 2d slice of it, given an index and an axis?
Inspired by this answer, you can do something like this:
def get_plane(axe, index):
    slices = [slice(None)] * len(ex.shape)
    slices[axe] = index
    return ex[tuple(slices)]
get_plane(1,1)
output:
array([[ 3,  4,  5],
       [12, 13, 14],
       [21, 22, 23]])
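To double-check that this really returns a view rather than a copy (which was the point of the question), here is one quick sanity check, assuming the ex and get_plane defined above:

plane = get_plane(1, 1)
plane[0, 0] = 99        # write through the slice...
print(ex[0, 1, 0])      # ...and ex sees the change (prints 99), so it is a view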
What do you mean by a 'plane'?
In [16]: ex = np.arange(27).reshape(3, 3, 3)
Names like plane, row, and column are arbitrary conventions, not formally defined in numpy. The default display of this array looks like 3 'planes' or 'blocks', each with rows and columns:
In [17]: ex
Out[17]:
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])
Standard indexing lets us view any 2d block, in any dimension:
In [18]: ex[0]
Out[18]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [19]: ex[0,:,:]
Out[19]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [20]: ex[:,0,:]
Out[20]:
array([[ 0,  1,  2],
       [ 9, 10, 11],
       [18, 19, 20]])
In [21]: ex[:,:,0]
Out[21]:
array([[ 0,  3,  6],
       [ 9, 12, 15],
       [18, 21, 24]])
There are ways of saying "I want block 0 in dimension 1", etc., but first make sure you understand this indexing. This is the core numpy functionality.
In [23]: np.take(ex, 0, 1)
Out[23]:
array([[ 0,  1,  2],
       [ 9, 10, 11],
       [18, 19, 20]])
In [24]: idx = (slice(None), 0, slice(None))   # also np.s_[:,0,:]
In [25]: ex[idx]
Out[25]:
array([[ 0,  1,  2],
       [ 9, 10, 11],
       [18, 19, 20]])
And yes, you can swap axes (or transpose), if that suits your needs.
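One caveat worth adding, since the asker specifically wanted a view: np.take returns a copy, while basic indexing with slices returns a view. A quick way to confirm this, using np.shares_memory with the same ex as above:

np.shares_memory(np.take(ex, 0, 1), ex)   # False: take() copies the data
np.shares_memory(ex[:, 0, :], ex)         # True: basic indexing gives a view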

Combine multiple lists (of equal length) stored in dict into a single list of list

I have the following dictionary (which essentially resembles a table):
tbl = {'col0': [20, 30, 22, 15, 24],
       'col1': [13, 15, 10, 14, 15],
       'col2': [52, 12, 14, 36, 23]}
I want to convert this to a list of lists that combines the lists across the columns (i.e. elements with the same index become one element of the list of lists).
It should look somewhat like this:
[[20, 13, 52], [30, 15, 12], [22, 10, 14], [15, 14, 36], [24, 15, 23]]
It should also work for situations where my dict is something like this:
tbl = {'col0': 1.0,
       'col1': 7.0,
       'col2': 1.3}
# converted into
[[1.0, 7.0, 1.3]]
Is there a Pythonic way of doing this? I basically need it to print a table structure row-wise by overriding a __str__ method for a structure that currently stores table values in dict format.
You can always use an unreadable double list comprehension!
my_list_of_lists = [[tbl[key][idx] for key in tbl] for idx in range(len(tbl[list(tbl.keys())[0]]))]
If you might have data without a length, you can use this instead (as long as all columns are the same length):
def len_checker(item):
    try:
        return len(item)
    except:
        return 0
my_list_of_lists = [[tbl[key][idx] for key in tbl] for idx in range(len(tbl[list(tbl.keys())[0]]))] if len_checker(tbl[list(tbl.keys())[0]]) else [[tbl[key] for key in tbl]]
Aren't these fun?
Things are a little cleaner if you can guarantee that the key 'col0' is in your table.
my_list_of_lists = [[tbl[key][idx] for key in tbl] for idx in range(len(tbl['col0']))] if len_checker(tbl['col0']) else [[tbl[key] for key in tbl]]
In all seriousness, though, if you want clean code you should be using something like a Pandas DataFrame.
from pandas import DataFrame

try:
    df = DataFrame(tbl)
except:
    df = DataFrame(tbl, index=[0])

my_list_of_lists = [list(df.iloc[row]) for row in range(df.shape[0])]
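As a side note, the row-by-row iloc loop above can probably be replaced by a single call (a small simplification, assuming the same df):

my_list_of_lists = df.values.tolist()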
You can use numpy too.
import numpy as np
arr = np.vstack([np.array(tbl[key]) for key in tbl])
my_list_of_lists = [list(arr[...,col]) for col in range(arr.shape[1])]
zip is handy for this:
>>> list(zip(*tbl.values()))
[(20, 13, 52), (30, 15, 12), (22, 10, 14), (15, 14, 36), (24, 15, 23)]
For a list of lists instead of tuples, you can use a generator expression:
>>> list(list(x) for x in zip(*tbl.values()))
[[20, 13, 52], [30, 15, 12], [22, 10, 14], [15, 14, 36], [24, 15, 23]]
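The zip approach assumes list-valued columns; to also cover the scalar case from the question, a small wrapper could look like this (a sketch; the rows_from_table helper name is mine):

def rows_from_table(tbl):
    values = list(tbl.values())
    if isinstance(values[0], (list, tuple)):
        return [list(row) for row in zip(*values)]   # list columns -> many rows
    return [values]                                  # scalar columns -> a single row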

Incorrect ArrayType elements inside Pyspark pandas_udf

I am using Spark 2.3.0 and trying the pandas_udf user-defined functions within my Pyspark code. According to https://github.com/apache/spark/pull/20114, ArrayType is currently supported. My user-defined function is:
def transform(c):
    if not any(isinstance(x, (list, tuple, np.ndarray)) for x in c.values):
        nvalues = c.values
    else:
        nvalues = np.array(c.values.tolist())
    tvalues = some_external_function(nvalues)
    if not any(isinstance(y, (list, tuple, np.ndarray)) for y in tvalues):
        p = pd.Series(np.array(tvalues))
    else:
        p = pd.Series(list(tvalues))
    return p

transform = pandas_udf(transform, ArrayType(LongType()))
When I apply this function to a specific array column of a large Spark DataFrame, I notice that the first element of the pandas Series c has double the size compared to the others, and the last one has size 0:
0 [73, 10, 223, 46, 14, 73, 14, 5, 14, 21, 10, 2...
1 [223, 46, 14, 73, 14, 5, 14, 21, 30, 16]
2 [46, 14, 73, 14, 5, 14, 21, 30, 16, 15]
...
4695 []
Name: _70, Length: 4696, dtype: object
The first array has 20 elements, the second 10 (which is the correct one), and the last one 0. Then, of course, c.values fails with ValueError: setting an array element with a sequence., since the arrays have different sizes.
When I apply the same function to a column with arrays of strings, all sizes are correct, and so are the rest of the function's steps.
Any idea what might be the issue? Possible bug?
UPDATED with example:
I am attaching a simple example, just printing the values inside the pandas_udf function.
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("testing pandas_udf")\
        .getOrCreate()

    arr = []
    for i in range(100000):
        arr.append([2, 2, 2, 2, 2])

    df = spark.createDataFrame(arr, ArrayType(LongType()))

    def transform(c):
        print(c)
        print(c.values)
        return c

    transform = pandas_udf(transform, ArrayType(LongType()))

    df = df.withColumn('new_value', transform(col('value')))
    df.show()

    spark.stop()
If you check the executor's log, you will see entries like:
0 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
1 [2, 2, 2, 2, 2]
2 [2, 2, 2, 2, 2]
3 [2, 2, 2, 2, 2]
4 [2, 2, 2, 2, 2]
5 [2, 2, 2, 2, 2]
...
9996 [2, 2, 2, 2, 2]
9997 [2, 2, 2, 2, 2]
9998 []
9999 []
Name: _0, Length: 10000, dtype: object
SOLVED:
If you face the same issue, upgrade to Spark 2.3.1 and pyarrow 0.9.0.post1.
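For anyone checking whether their environment is affected, the installed versions can be printed like this (a quick sketch; both packages expose __version__):

import pyspark
import pyarrow

print(pyspark.__version__, pyarrow.__version__)   # should be >= 2.3.1 and 0.9.0.post1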
Yeah, looks like there is a bug in Spark. My situation concerns Spark 2.3.0 and PyArrow 0.13.0. The only remedy available to me was to convert the array into a string and then pass it to the Pandas UDF.
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

def _identity(sample_array):
    return sample_array.apply(lambda e: [int(i) for i in e.split(',')])

array_identity_udf = F.pandas_udf(_identity,
                                  returnType=ArrayType(IntegerType()),
                                  functionType=F.PandasUDFType.SCALAR)

test_df = (spark
           .table('test_table')
           .repartition(F.ceil(F.rand(seed=1234) * 100))
           .cache())

test1_df = (test_df
            .withColumn('array_test', array_identity_udf(F.concat_ws(',', F.col('sample_array')))))

Pass argument to array of functions

I have a 2D numpy array of lambda functions. Each function has 2 arguments and returns a float.
What's the best way to pass the same 2 arguments to all of these functions and get a numpy array of answers out?
I've tried something like:
np.reshape(np.fromiter((fn(1,2) for fn in np.nditer(J,order='K',flags=["refs_ok"])),dtype = float),J.shape)
to evaluate each function in J with arguments (1, 2) (J contains the functions).
But it seems very round the houses, and also doesn't quite work...
Is there a good way to do this?
A = J(1,2)
doesn't work!
You can use list comprehensions:
A = np.asarray([[f(1,2) for f in row] for row in J])
This should work for both numpy arrays and list of lists.
I don't think there is a really clean way, but this is reasonably clean and works:
import operator
import numpy as np
# create array of lambdas
a = np.array([[lambda x, y, i=i, j=j: x**i + y**j for i in range(4)] for j in range(4)])
# apply arguments 2 and 3 to all of them
np.vectorize(operator.methodcaller('__call__', 2, 3))(a)
# array([[ 2,  3,  5,  9],
#        [ 4,  5,  7, 11],
#        [10, 11, 13, 17],
#        [28, 29, 31, 35]])
Alternatively, and slightly more flexible:
from types import FunctionType
np.vectorize(FunctionType.__call__)(a, 2, 3)
# array([[ 2,  3,  5,  9],
#        [ 4,  5,  7, 11],
#        [10, 11, 13, 17],
#        [28, 29, 31, 35]])
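If np.vectorize feels too magical, np.frompyfunc does essentially the same thing and makes the element-wise call explicit (a sketch, assuming the same array a of lambdas; note it returns an object-dtype array, hence the astype):

apply_args = np.frompyfunc(lambda f: f(2, 3), 1, 1)   # one input, one output per element
apply_args(a).astype(int)
# array([[ 2,  3,  5,  9],
#        [ 4,  5,  7, 11],
#        [10, 11, 13, 17],
#        [28, 29, 31, 35]])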

Split RDD into n parts in pySpark

I want to split an RDD into n parts of equal length using PySpark.
If the RDD is something like
data = range(0,20)
d_rdd = sc.parallelize(data)
d_rdd.glom().collect()
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
I want any two numbers grouped together at random, like:
[[0,4],[6,11],[5,18],[3,14],[17,9],[12,8],[2,10],[1,15],[13,19],[7,16]]
Two methods:
Set the partition number when using parallelize(), and use the distinct() function:
data = range(0,20)
d_rdd = sc.parallelize(data, 10).distinct()
d_rdd.glom().collect()
Or use repartition() and distinct():
data = range(0,20)
d_rdd = sc.parallelize(data).repartition(10).distinct()
d_rdd.glom().collect()
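The distinct() trick relies on hash partitioning happening to spread the values evenly; if a deterministic, exactly equal split is needed, one alternative is to key each element by its position and partition on that (a sketch using zipWithIndex and partitionBy; n and the modulo lambda are my additions):

n = 10
data = range(0, 20)
d_rdd = (sc.parallelize(data)
           .zipWithIndex()                      # (value, index)
           .map(lambda vi: (vi[1], vi[0]))      # (index, value)
           .partitionBy(n, lambda i: i % n)     # index modulo n picks the partition
           .values())                           # drop the index again
d_rdd.glom().collect()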
