Split RDD into n parts in pySpark - apache-spark

I want to split an RDD into n parts of equal length using PySpark.
If the RDD is something like
data = range(0,20)
d_rdd = sc.parallelize(data)
d_rdd.glom().collect()
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
I want any two random numbers grouped together in each partition, like
[[0,4],[6,11],[5,18],[3,14],[17,9],[12,8],[2,10],[1,15],[13,19],[7,16]]

Two methods:
Set the number of partitions when calling parallelize(), then apply distinct():
data = range(0,20)
d_rdd = sc.parallelize(data, 10).distinct()
d_rdd.glom().collect()
Use repartition() followed by distinct():
data = range(0,20)
d_rdd = sc.parallelize(data).repartition(10).distinct()
d_rdd.glom().collect()
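If the goal is specifically to end up with pairs of two elements per group (rather than relying on the shuffle that distinct() happens to trigger), a minimal sketch using zipWithIndex() and groupByKey() is shown below; grouping by index // 2 is an assumption about how the pairs should be formed, and the input is shuffled first so the pairs come out random:
import random

data = list(range(0, 20))
random.shuffle(data)  # randomize the input so the resulting pairs are random

d_rdd = sc.parallelize(data)

pairs = (d_rdd
         .zipWithIndex()                       # (value, position)
         .map(lambda vi: (vi[1] // 2, vi[0]))  # key = position // 2 -> two values per key
         .groupByKey()
         .map(lambda kv: list(kv[1])))

pairs.collect()
# e.g. [[4, 0], [6, 11], [5, 18], ...]  -- 10 lists of 2 elements each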

Related

Retrieving original data from PyTorch nn.Embedding

I'm passing a dataframe with 5 categories (e.g. car, bus, ...) into nn.Embedding.
When I do embedding.parameters(), I can see that there are 5 tensors, but how do I know which index corresponds to the original input (e.g. car, bus, ...)?
You can't, as tensors are unnamed (only dimensions can be named; see PyTorch's Named Tensors).
You have to keep the names in a separate data container, for example (4 categories here):
import pandas as pd
import torch

df = pd.DataFrame(
    {
        "bus": [1.0, 2, 3, 4, 5],
        "car": [6.0, 7, 8, 9, 10],
        "bike": [11.0, 12, 13, 14, 15],
        "train": [16.0, 17, 18, 19, 20],
    }
)

df_data = df.to_numpy().T
df_names = list(df)

embedding = torch.nn.Embedding(df_data.shape[0], df_data.shape[1])
embedding.weight.data = torch.from_numpy(df_data)
Now you can simply use it with any index you want:
index = 1
embedding(torch.tensor(index)), df_names[index]
This would give you (tensor([6., 7., 8., 9., 10.], dtype=torch.float64), 'car'), i.e. the row of data and its respective column name.
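If you also need the reverse lookup, from a category name to its embedding row, a small sketch building on the df_names list above (name_to_index and embed_by_name are just illustrative names):
# Map each category name to its embedding index.
name_to_index = {name: i for i, name in enumerate(df_names)}

def embed_by_name(name):
    # Resolve the category name to its row index, then embed it.
    return embedding(torch.tensor(name_to_index[name]))

embed_by_name("car")  # same result as embedding(torch.tensor(1))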

Combine multiple lists (of equal length) stored in dict into a single list of list

I have the following dictionary (which essentially resembles a table):
tbl = {'col0': [20, 30, 22, 15, 24],
       'col1': [13, 15, 10, 14, 15],
       'col2': [52, 12, 14, 36, 23]}
I want to convert this to a list of lists that combines the lists across the columns (i.e. elements with the same index become one element of the list of lists).
It should look somewhat like this:
[[20, 13, 52], [30, 15, 12], [22, 10, 14], [15, 14, 36], [24, 15, 23]]
It should also work for situations where my dict is something like this:
tbl = {'col0': 1.0,
       'col1': 7.0,
       'col2': 1.3}
# converted into
[[1.0, 7.0, 1.3]]
Is there a Pythonic way of doing this? I basically need it to print a table structure row-wise by overriding the __str__ method of a structure that currently stores table values in dict format.
You can always use an unreadable double list comprehension!
my_list_of_lists = [[tbl[key][idx] for key in tbl] for idx in range(len(tbl[list(tbl.keys())[0]]))]
If you might have data without a length, you can use this instead (as long as all columns are the same length):
def len_checker(item):
    try:
        return len(item)
    except:
        return 0

my_list_of_lists = ([[tbl[key][idx] for key in tbl]
                     for idx in range(len(tbl[list(tbl.keys())[0]]))]
                    if len_checker(tbl[list(tbl.keys())[0]])
                    else [[tbl[key] for key in tbl]])
Aren't these fun?
Things are a little cleaner if you can guarantee that the key 'col0' is in your table.
my_list_of_lists = [[tbl[key][idx] for key in tbl] for idx in range(len(tbl['col0']))] if len_checker(tbl['col0']) else [[tbl[key] for key in tbl]]
In all seriousness, though, if you want clean code you should be using something like a Pandas DataFrame.
from pandas import DataFrame

try:
    df = DataFrame(tbl)
except:
    df = DataFrame(tbl, index=[0])
my_list_of_lists = [list(df.iloc[row]) for row in range(df.shape[0])]
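As a side note, once the data is in a DataFrame, pandas can produce the list of lists directly; a one-line sketch reusing the df built above:
my_list_of_lists = df.values.tolist()  # each DataFrame row becomes one inner list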
You can use numpy too.
import numpy as np
arr = np.vstack([np.array(tbl[key]) for key in tbl])
my_list_of_lists = [list(arr[...,col]) for col in range(arr.shape[1])]
zip is handy for this:
>>> list(zip(*tbl.values()))
[(20, 13, 52), (30, 15, 12), (22, 10, 14), (15, 14, 36), (24, 15, 23)]
For a list of lists instead of tuples, you can use a generator expression:
>>> list(list(x) for x in zip(*tbl.values()))
[[20, 13, 52], [30, 15, 12], [22, 10, 14], [15, 14, 36], [24, 15, 23]]
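The zip approach assumes every value in the dict is a list; if the dict may hold bare scalars (as in the second example), here is a small sketch that normalises scalars to one-element lists first (to_rows is a made-up helper name):
def to_rows(tbl):
    # Wrap scalar column values in one-element lists so zip works uniformly.
    cols = [v if isinstance(v, (list, tuple)) else [v] for v in tbl.values()]
    return [list(row) for row in zip(*cols)]

to_rows({'col0': 1.0, 'col1': 7.0, 'col2': 1.3})
# [[1.0, 7.0, 1.3]]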

(pandas) access all rows except for the first 3 for the specific columns at index

I want to access all rows except the first 3 for the specific columns at index 1, 2, 4, 5, 7, 8, 10, 11, 13, 14 of a csv file. How can I do this? All examples I have found show how to slice (for example 1:14), but I do not want all the columns in between, only specific ones.
When I try:
p = df[3:, [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]]
I get an error:
p = df[3:, [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]]
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2139, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2146, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 1840, in _get_item_cache
res = cache.get(item)
TypeError: unhashable type: 'slice'
It also does not work with the notation p = df[[3:], [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]].
IIUC you need DataFrame.iloc to filter by position, here all rows except the first 3 and the columns selected by their positions:
df.iloc[3:, [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]]
p = df.iloc[3:, [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]]
You were mostly right; you just had to go through .iloc rather than plain [] for positional selection.
Using loc/iloc is recommended for indexing.
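For a quick self-contained check of the .iloc call (the toy DataFrame below is only an illustration, not the asker's csv):
import numpy as np
import pandas as pd

# 10 rows x 15 columns of sample data
df = pd.DataFrame(np.arange(150).reshape(10, 15),
                  columns=['c%d' % i for i in range(15)])

# All rows except the first 3, only the listed column positions
p = df.iloc[3:, [1, 2, 4, 5, 7, 8, 10, 11, 13, 14]]
print(p.shape)  # (7, 10)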

Incorrect ArrayType elements inside Pyspark pandas_udf

I am using Spark 2.3.0 and trying the pandas_udf user-defined functions within my Pyspark code. According to https://github.com/apache/spark/pull/20114, ArrayType is currently supported. My user-defined function is:
def transform(c):
    if not any(isinstance(x, (list, tuple, np.ndarray)) for x in c.values):
        nvalues = c.values
    else:
        nvalues = np.array(c.values.tolist())
    tvalues = some_external_function(nvalues)
    if not any(isinstance(y, (list, tuple, np.ndarray)) for y in tvalues):
        p = pd.Series(np.array(tvalues))
    else:
        p = pd.Series(list(tvalues))
    return p

transform = pandas_udf(transform, ArrayType(LongType()))
When I apply this function to a specific array column of a large Spark DataFrame, I notice that the first element of the pandas Series c has double the size of the others, and the last one has size 0:
0 [73, 10, 223, 46, 14, 73, 14, 5, 14, 21, 10, 2...
1 [223, 46, 14, 73, 14, 5, 14, 21, 30, 16]
2 [46, 14, 73, 14, 5, 14, 21, 30, 16, 15]
...
4695 []
Name: _70, Length: 4696, dtype: object
The first array has 20 elements, the second 10 (which is the correct size), and the last one 0. Then, of course, c.values fails with ValueError: setting an array element with a sequence., since the arrays have different sizes.
When I try the same function on a column with arrays of strings, all sizes are correct, and the rest of the function's steps work as well.
Any idea what might be the issue? Possible bug?
UPDATED with example:
I am attaching a simple example, just printing the values inside the pandas_udf function.
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("testing pandas_udf")\
        .getOrCreate()

    arr = []
    for i in range(100000):
        arr.append([2, 2, 2, 2, 2])

    df = spark.createDataFrame(arr, ArrayType(LongType()))

    def transform(c):
        print(c)
        print(c.values)
        return c

    transform = pandas_udf(transform, ArrayType(LongType()))

    df = df.withColumn('new_value', transform(col('value')))
    df.show()

    spark.stop()
If you check the executor's log, you will see output like:
0 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
1 [2, 2, 2, 2, 2]
2 [2, 2, 2, 2, 2]
3 [2, 2, 2, 2, 2]
4 [2, 2, 2, 2, 2]
5 [2, 2, 2, 2, 2]
...
9996 [2, 2, 2, 2, 2]
9997 [2, 2, 2, 2, 2]
9998 []
9999 []
Name: _0, Length: 10000, dtype: object
SOLVED:
If you face the same issue, upgrade to Spark 2.3.1 and pyarrow 0.9.0.post1.
Yeah, looks like there is a bug in Spark. My situation concerns Spark 2.3.0 and PyArrow 0.13.0. The only remedy available to me is to convert the array into a string and then pass it to the pandas UDF.
def _identity(sample_array):
    return sample_array.apply(lambda e: [int(i) for i in e.split(',')])

array_identity_udf = F.pandas_udf(_identity,
                                  returnType=ArrayType(IntegerType()),
                                  functionType=F.PandasUDFType.SCALAR)

test_df = (spark
           .table('test_table')
           .repartition(F.ceil(F.rand(seed=1234) * 100))
           .cache())

test1_df = (test_df
            .withColumn('array_test',
                        array_identity_udf(F.concat_ws(',', F.col('sample_array')))))

RDD: Preserve total order when repartitioning

It seems one of my assumptions was incorrect regarding order in RDDs (related).
Suppose I wish to repartition a RDD after having sorted it.
import random

l = list(range(20))
random.shuffle(l)

spark.sparkContext\
    .parallelize(l)\
    .sortBy(lambda x: x)\
    .repartition(3)\
    .collect()
Which yields:
[16, 17, 18, 19, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
As we can see, the order is preserved within a partition but the total order is not preserved over all partitions.
I would like to preserve total order of the RDD, like so:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
I am having difficulty finding anything online which could be of assistance. Help would be appreciated.
It appears that we can provide the argument numPartitions=partitions to the sortBy function to partition the RDD and preserve total order:
import random

l = list(range(20))
random.shuffle(l)

partitions = 3

spark.sparkContext\
    .parallelize(l)\
    .sortBy(lambda x: x, numPartitions=partitions)\
    .collect()
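To confirm that the total order lines up with the partition boundaries, glom() can be used to inspect each resulting partition (the exact split points depend on sortBy's range sampling, so the output below is only indicative):
import random

l = list(range(20))
random.shuffle(l)

spark.sparkContext\
    .parallelize(l)\
    .sortBy(lambda x: x, numPartitions=3)\
    .glom()\
    .collect()
# e.g. [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13], [14, 15, 16, 17, 18, 19]]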
