Incorrect ArrayType elements inside Pyspark pandas_udf - apache-spark

I am using Spark 2.3.0 and trying out pandas_udf user-defined functions within my PySpark code. According to https://github.com/apache/spark/pull/20114, ArrayType is currently supported. My user-defined function is:
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, LongType

def transform(c):
    if not any(isinstance(x, (list, tuple, np.ndarray)) for x in c.values):
        nvalues = c.values
    else:
        nvalues = np.array(c.values.tolist())
    tvalues = some_external_function(nvalues)
    if not any(isinstance(y, (list, tuple, np.ndarray)) for y in tvalues):
        p = pd.Series(np.array(tvalues))
    else:
        p = pd.Series(list(tvalues))
    return p

transform = pandas_udf(transform, ArrayType(LongType()))
When I apply this function to a specific array column of a large Spark DataFrame, I notice that the first element of the pandas Series c has double the size of the others, and the last one has size 0:
0 [73, 10, 223, 46, 14, 73, 14, 5, 14, 21, 10, 2...
1 [223, 46, 14, 73, 14, 5, 14, 21, 30, 16]
2 [46, 14, 73, 14, 5, 14, 21, 30, 16, 15]
...
4695 []
Name: _70, Length: 4696, dtype: object
The first array has 20 elements, the second has 10 (which is the correct size), and the last one has 0. The conversion of c.values then fails with ValueError: setting an array element with a sequence., since the arrays have different sizes.
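For context (my own minimal illustration, not part of the failing job, and exact behaviour depends on the NumPy version), the same ValueError is easy to reproduce with plain NumPy once the rows have different lengths:

import numpy as np

# Rows of equal length stack into a regular 2-D array:
np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)   # shape (2, 3)

# Rows of different lengths cannot form a rectangular array, and forcing
# a numeric dtype raises the error seen above on older NumPy releases:
np.array([[1, 2, 3], [4, 5]], dtype=np.int64)
# ValueError: setting an array element with a sequence.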
When I apply the same function to a column with arrays of strings, all sizes are correct, and the rest of the function's steps work as well.
Any idea what might be the issue? Possible bug?
UPDATED with example:
I am attaching a simple example, just printing the values inside the pandas_udf function.
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("testing pandas_udf")\
        .getOrCreate()

    arr = []
    for i in range(100000):
        arr.append([2, 2, 2, 2, 2])

    df = spark.createDataFrame(arr, ArrayType(LongType()))

    def transform(c):
        print(c)
        print(c.values)
        return c

    transform = pandas_udf(transform, ArrayType(LongType()))

    df = df.withColumn('new_value', transform(col('value')))
    df.show()

    spark.stop()
If you check the executor's log, you will see entries like:
0 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
1 [2, 2, 2, 2, 2]
2 [2, 2, 2, 2, 2]
3 [2, 2, 2, 2, 2]
4 [2, 2, 2, 2, 2]
5 [2, 2, 2, 2, 2]
...
9996 [2, 2, 2, 2, 2]
9997 [2, 2, 2, 2, 2]
9998 []
9999 []
Name: _0, Length: 10000, dtype: object
SOLVED:
If you face the same issue, upgrade to Spark 2.3.1 and pyarrow 0.9.0.post1.

Yeah, it looks like there is a bug in Spark. My situation involves Spark 2.3.0 and PyArrow 0.13.0. The only remedy available to me is to convert the array into a string and then pass that string to the pandas UDF.
# Needed imports (assuming a live SparkSession named `spark`):
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, IntegerType

def _identity(sample_array):
    return sample_array.apply(lambda e: [int(i) for i in e.split(',')])

array_identity_udf = F.pandas_udf(_identity,
                                  returnType=ArrayType(IntegerType()),
                                  functionType=F.PandasUDFType.SCALAR)

test_df = (spark
           .table('test_table')
           .repartition(F.ceil(F.rand(seed=1234) * 100))
           .cache())

test1_df = (test_df
            .withColumn('array_test',
                        array_identity_udf(F.concat_ws(',', F.col('sample_array')))))

Related

Remove a row in pandas using Chain rule

I have the below code:
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [7, 2, 54, 3, None],
                   "C": [20, 16, 11, 3, 8],
                   "D": [14, 3, None, 2, 6]})
df
df[df['A'].notna()]
The last line removes every row of df for which A is None.
However, to improve readability, I want to achieve this final dataframe in one chained expression, in the same statement where I create df.
Is there any way to achieve this?
Use loc with a function/lambda:
df = (pd.DataFrame({"A": [12, 4, 5, None, 1],
                    "B": [7, 2, 54, 3, None],
                    "C": [20, 16, 11, 3, 8],
                    "D": [14, 3, None, 2, 6]})
      .loc[lambda d: d['A'].notna()]
     )
output:
      A     B   C     D
0  12.0   7.0  20  14.0
1   4.0   2.0  16   3.0
2   5.0  54.0  11   NaN
4   1.0   NaN   8   6.0
Documentation:
Allowed inputs are:
[...]
A callable function with one argument (the calling Series or
DataFrame) and that returns valid output for indexing (one of the
above)
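As a side note (not part of the .loc documentation above), the same chained filter can also be written with dropna, which some readers find more direct:

df = (pd.DataFrame({"A": [12, 4, 5, None, 1],
                    "B": [7, 2, 54, 3, None],
                    "C": [20, 16, 11, 3, 8],
                    "D": [14, 3, None, 2, 6]})
      # drop rows whose A value is missing, still inside the chain
      .dropna(subset=["A"]))

Both versions produce the same four rows.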

TypeError: 'int' object does not support item deletion python3?

I want to delete the first element from a list after I convert an int to a list, but it always shows me this error:
line 18, in del q[i]
TypeError: 'int' object does not support item deletion
my code:
fn = 10534
sn = 67
tn = 1120

fnn = [int(x) for x in str(fn)]

i = 0
for q in fnn:
    del q[i]
    print(q)
How can I solve this error?
The error occurs because the loop variable q is each int taken from fnn, not the list itself, so del q[i] attempts item deletion on an integer. Beyond that, given that it is not recommended to delete from a list while iterating over it, your code can be refactored to follow best practice by creating a new list with the desired elements:
>>> fn = 21345678
>>> oldList = [int(x) for x in str(fn)]
>>> oldList
[2, 1, 3, 4, 5, 6, 7, 8]
>>> newList = oldList[1:]
>>> newList
[1, 3, 4, 5, 6, 7, 8]
If you are constrained by memory, or are using a large list, use the following syntax instead:
>>> fn = 21345678
>>> myList = [int(x) for x in str(fn)]
>>> myList
[2, 1, 3, 4, 5, 6, 7, 8]
>>> myList[:] = myList[1:]
>>> myList
[1, 3, 4, 5, 6, 7, 8]
This will replace the contents of myList with the contents of the list slice.
A good topic to review is list slicing.
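If the goal is simply to drop the first element in place, indexing the list object itself (rather than the loop variable from the question) also works; a minimal sketch:

fn = 10534

# Convert the int to a list of its digits.
fnn = [int(x) for x in str(fn)]

# Delete by index on the list, not on an int element of it.
del fnn[0]
print(fnn)  # [0, 5, 3, 4]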

Retrieving original data from PyTorch nn.Embedding

I'm passing a dataframe with 5 categories (ex. car, bus, ...) into nn.Embedding.
When I do embedding.parameters(), I can see that there are 5 tensors, but how do I know which index corresponds to the original input (e.g. car, bus, ...)?
You can't, as tensors are unnamed (only dimensions can be named; see PyTorch's Named Tensors).
You have to keep the names in a separate data container, for example (4 categories here):
import pandas as pd
import torch

df = pd.DataFrame(
    {
        "bus": [1.0, 2, 3, 4, 5],
        "car": [6.0, 7, 8, 9, 10],
        "bike": [11.0, 12, 13, 14, 15],
        "train": [16.0, 17, 18, 19, 20],
    }
)

df_data = df.to_numpy().T
df_names = list(df)

embedding = torch.nn.Embedding(df_data.shape[0], df_data.shape[1])
embedding.weight.data = torch.from_numpy(df_data)
Now you can simply use it with any index you want:
index = 1
embedding(torch.tensor(index)), df_names[index]
This would give you (tensor([6., 7., 8., 9., 10.]), "car"), i.e. the data and the respective column name.
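As a side note (not part of the original answer), torch.nn.Embedding.from_pretrained builds the same lookup more compactly; a minimal sketch reusing df_data and df_names from above:

# from_pretrained copies the matrix into the embedding weights;
# freeze=False keeps them trainable, matching the manual assignment above.
embedding = torch.nn.Embedding.from_pretrained(torch.from_numpy(df_data), freeze=False)

index = 1
embedding(torch.tensor(index)), df_names[index]  # same values and "car" as before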

I am solving the problem of saving and printing duplicate values in lists with Python, but I get the following error. How do I fix it?

# Input:
# - list_data_a: list of numeric or character values
# - list_data_b: list of numeric or character values
# Output:
# - the longer of list_data_a and list_data_b is returned
# If the lengths are the same, the value of list_data_a is returned.
# Examples:
# >>> import gowithflow as gwf
# >>> a = [1, 2, 3, 4, 5, 6]
# >>> b = [1, 2, 3]
# >>> gwf.comparison_list_size(a, b)
# [1, 2, 3, 4, 5, 6]
# >>> b = [1, 2, 3, 5, 7, 8, 9, 10]
# >>> gwf.comparison_list_size(a, b)
# [1, 2, 3, 5, 7, 8, 9, 10]
# '''
# === Modify codes below =============
# You can write more than one line of code,
# but the result must be assigned to the result variable and returned.
list_data = [5, 10, 15, 20]
element_value = [5, 10, 58, 88]

for i in range(len(list_data)):
    if list_data[i] in element_value:
        list_data.remove(list_data[i])

print(len(list_data))
print('-----------')
print(element_value)

    if list_data[i] in element_value:
IndexError: list index out of range
When using a list in a loop, isn't it possible to use for i in range(len(...)) to iterate over the size of the list?
PS: Sorry, I can't speak English well...
list_data = [5, 10, 15, 20]
element_value = [5, 10, 58, 88]

for i in list_data:
    if i in element_value:
        list_data.remove(i)
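Two caveats, not spelled out above: the IndexError in the question happens because range(len(list_data)) is computed once, before the list starts shrinking; and even the loop above can skip neighbouring matches, because removing an element shifts the rest of the list under the iterator (with these inputs, 10 slides into the slot of the removed 5 and is never checked). A safer sketch builds a new list instead:

list_data = [5, 10, 15, 20]
element_value = [5, 10, 58, 88]

# Keep only the values that are not in element_value.
list_data = [v for v in list_data if v not in element_value]
print(list_data)  # [15, 20]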

How to process different rows in a tensor based on the first column value in TensorFlow

let's say I have a 4 by 3 tensor:
sample = [[10, 15, 25], [1, 2, 3], [4, 4, 10], [5, 9, 8]]
I would like to return another tensor of shape 4, [r1, r2, r3, r4], where each r equals tf.reduce_sum(row) if row[0] is less than 5, or tf.reduce_mean(row) if row[0] is greater than or equal to 5.
output:
output = [16.67, 6, 18, 7.33]
I'm not adept at TensorFlow; please assist me with achieving the above in Python 3 without a for loop.
thank you
UPDATES:
So I've tried to adapt the answer given by @Onyambu to include two samples in the function, but it gave me an error in all instances.
Here is the answer for the first case:
def f(x):
    c = tf.constant(5, tf.float32)
    def fun1():
        return tf.reduce_sum(x)
    def fun2():
        return tf.reduce_mean(x)
    return tf.cond(tf.less(x[0], c), fun1, fun2)

a = tf.map_fn(f, tf.constant(sample, tf.float32))
The above works well.
The code for two samples:
sample1 = [[10, 15, 25], [1, 2, 3], [4, 4, 10], [5, 9, 8]]
sample2 = [[0, 15, 25], [1, 2, 3], [0, 4, 10], [1, 9, 8]]

def f2(x1, x2):
    c = tf.constant(1, tf.float32)
    def fun1():
        return tf.reduce_sum(x1[:, 0] - x2[:, 0])
    def fun2():
        return tf.reduce_mean(x1 - x2)
    return tf.cond(tf.less(x2[0], c), fun1, fun2)

a = tf.map_fn(f2, tf.constant(sample1, tf.float32), tf.constant(sample2, tf.float32))
The adaptation does give errors, but the principle is simple:
calculate the tf.reduce_sum of sample1[:,0] - sample2[:,0] if row[0] is less than 1
calculate the tf.reduce_sum of sample1 - sample2 if row[0] is greater or equal to 1
Thank you for your assistance in advance!
import tensorflow as tf

def f(x):
    y = tf.constant(5, tf.float32)
    def fun1():
        return tf.reduce_sum(x)
    def fun2():
        return tf.reduce_mean(x)
    return tf.cond(tf.less(x[0], y), fun1, fun2)

a = tf.map_fn(f, tf.constant(sample, tf.float32))
with tf.Session() as sess: print(sess.run(a))
[16.666666 6. 18. 7.3333335]
If you want to shorten it:
y = tf.constant(5, tf.float32)
f = lambda x: tf.cond(tf.less(x[0], y), lambda: tf.reduce_sum(x), lambda: tf.reduce_mean(x))
a = tf.map_fn(f, tf.constant(sample, tf.float32))
with tf.Session() as sess: print(sess.run(a))
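For the two-sample case in the update, a possible fix (my own sketch, not part of the original answer): tf.map_fn accepts a tuple of tensors as elems, and inside the mapped function each element is a single 1-D row, so the indexing is x1[0] rather than x1[:, 0]:

import tensorflow as tf

sample1 = [[10, 15, 25], [1, 2, 3], [4, 4, 10], [5, 9, 8]]
sample2 = [[0, 15, 25], [1, 2, 3], [0, 4, 10], [1, 9, 8]]

def f2(pair):
    x1, x2 = pair                      # each is one row of length 3
    c = tf.constant(1, tf.float32)
    def fun1():
        return tf.reduce_sum(x1[0] - x2[0])   # first-column difference for this row
    def fun2():
        return tf.reduce_mean(x1 - x2)
    return tf.cond(tf.less(x2[0], c), fun1, fun2)

elems = (tf.constant(sample1, tf.float32), tf.constant(sample2, tf.float32))
# dtype is required here because the output (one scalar per row) has a
# different structure than the input (a pair of rows).
a = tf.map_fn(f2, elems, dtype=tf.float32)
with tf.Session() as sess: print(sess.run(a))

With these inputs this should print roughly [10. 0. 4. 1.3333334]; adjust the two branches to whatever per-row reduction the real problem needs.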
