Pyspark : ValueError: need more than 2 values to unpack - apache-spark

My data is in the following format after join
# (u'session_id', ((u'prod_id', u'user_id'), (u'prod_label', u'user_id')))
# (u'u'session_id', ((u'20133', u'129001032'), None))
# (u'u'session_id', ((u'2024574', u'61370212'), (u'Loc1', u'61370212')))
I want to treat the cases where the second tuple is None differently from those where it is not None. I tried filtering with the following code, but I get an error. How do I filter these out?
left_outer_joined_no_null = left_outer_joined.filter(lambda (session_id, ((tuple1), (tuple2))): (tuple2) != None)
ValueError: need more than 2 values to unpack
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

Your code works fine on my machine (Spark version 1.5.1).
I tried it like this:
left_outer_joined = sc.parallelize([(u'session_id', ((u'prod_id', u'user_id'), (u'prod_label', u'user_id'))),
                                    (u'session_id', ((u'20133', u'129001032'), None)),
                                    (u'session_id', ((u'2024574', u'61370212'), (u'Loc1', u'61370212')))])
left_outer_joined_no_null = left_outer_joined \
    .filter(lambda (session_id, ((tuple1), (tuple2))): (tuple2) != None)
for value in left_outer_joined_no_null.collect():
    print(value)
The result is as you expected:
(u'session_id', ((u'prod_id', u'user_id'), (u'prod_label', u'user_id')))
(u'session_id', ((u'2024574', u'61370212'), (u'Loc1', u'61370212')))
NOTE:
In your input lines you have an extra u' symbol in (u'u'session_id; it appears on both the second and third lines. Maybe the problem is there?
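Also, if you ever run this under Python 3 (e.g. with Spark 2+), note that tuple parameter unpacking in lambda arguments was removed there, so the filter has to index into the value instead. A minimal sketch:
left_outer_joined_no_null = left_outer_joined.filter(
    lambda kv: kv[1][1] is not None)  # kv is (session_id, (tuple1, tuple2))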

Related

how to solve array must not contain infs or nans

fa = FactorAnalyzer(rotation = None,n_factors=5)
fa.fit(dat)
print(pd.DataFrame(fa.get_factor_variance(),index=['Variance','Proportional Var','Cumulative Var']))
I tried this code and got the error: array must not contain infs or NaNs.
I checked for NaNs and infs, but none were shown.
So how can I solve this?
I think you need to clean your data set by removing all NaN and inf values:
import numpy as np
import pandas as pd

def dataset_cleaner(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    keep_indices = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[keep_indices].astype(np.float64)
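A quick usage sketch on a hypothetical toy frame (one row has a NaN, another an inf), just to illustrate what gets dropped:
dat = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                    "b": [2.0, 5.0, np.inf, 8.0]})
clean = dataset_cleaner(dat)
print(clean)                           # only the all-finite rows remain
print(np.isfinite(clean).all().all())  # True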

Pyspark - how to pass a column to a function after casting?

First I called the sha2 function from pyspark.sql.functions incorrectly, passing it a column of DoubleType, and got the following error:
cannot resolve 'sha2(`metric`, 256)' due to data type mismatch: argument 1 requires binary type, however, '`metric`' is of double type
Then I tried to first cast the columns to StringType, but I still get the same error. I'm probably missing something about how column transformations are processed by Spark.
I've noticed that when I just call df.withColumn(col_name, F.lit(df[col_name].cast(StringType()))) without calling .withColumn(col_name, F.sha2(df[col_name], 256)), the column's type is changed to StringType.
How should I apply a transformation correctly in this case?
def parse_to_sha2(df: DataFrame, cols: list):
    for col_name in cols:
        df = df.withColumn(col_name, F.lit(df[col_name].cast(StringType()))) \
               .withColumn(col_name, F.sha2(df[col_name], 256))
    return df
You don't need lit here. Try:
.withColumn(col_name, F.sha2(df[col_name].cast('string'), 256))
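Folded back into the original helper, a minimal sketch (assuming the usual imports and that cols contains existing column names):
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def parse_to_sha2(df: DataFrame, cols: list) -> DataFrame:
    # Cast inline and hash; no F.lit needed.
    for col_name in cols:
        df = df.withColumn(col_name, F.sha2(df[col_name].cast('string'), 256))
    return df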
I believe the issue here is the call to F.lit which creates a literal.
def parse_to_sha2(df: DataFrame, cols: list):
    for col_name in cols:
        df = df.withColumn(
            col_name,
            F.col(col_name).cast(StringType())  # withColumn keeps the name col_name, so no alias is needed
        ).withColumn(
            col_name,
            F.sha2(F.col(col_name), 256)
        )
    return df
This should generate a SHA value per column.
In case you need all of them, you would need to pass all columns to sha, since it takes col* arguments.
Edit: the last bit above is not correct; only F.hash takes multiple columns as arguments, while md5, crc32, and sha2 take only one, so sorry for the confusion.
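For example (hypothetical column names), F.hash can take several columns at once, while sha2 hashes a single string/binary column:
df = df.withColumn("row_hash", F.hash("col_a", "col_b"))                     # multiple columns
df = df.withColumn("col_a_sha", F.sha2(F.col("col_a").cast("string"), 256))  # single column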

Calculate cosine similarity between two columns of Spark DataFrame in PySpark [duplicate]

I have a dataframe in Spark in which one of the columns contains an array. Now, I have written a separate UDF which converts the array to another array containing only its distinct values. See the example below:
Ex: [24,23,27,23] should get converted to [24, 23, 27]
Code:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def uniq_array(col_array):
    x = np.unique(col_array)
    return x

uniq_array_udf = udf(uniq_array, ArrayType(IntegerType()))
Df3 = Df2.withColumn("age_array_unique", uniq_array_udf(Df2.age_array))
In the above code, Df2.age_array is the array on which I am applying the UDF to get a different column "age_array_unique" which should contain only unique values in the array.
However, as soon as I run the command Df3.show(), I get the error:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
Can anyone please let me know why this is happening?
Thanks!
The source of the problem is that the object returned from the UDF doesn't conform to the declared type. np.unique not only returns a numpy.ndarray but also converts numerics to the corresponding NumPy types, which are not compatible with the DataFrame API. You can try something like this:
udf(lambda x: list(set(x)), ArrayType(IntegerType()))
or this (to keep order)
udf(lambda xs: list(OrderedDict((x, None) for x in xs)),
    ArrayType(IntegerType()))
instead.
If you really want np.unique you have to convert the output:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
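A small end-to-end sketch with the order-preserving variant (assuming an active spark session and a toy frame mirroring the question):
from collections import OrderedDict
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

Df2 = spark.createDataFrame([(1, [24, 23, 27, 23])], ["id", "age_array"])
uniq_array_udf = udf(lambda xs: list(OrderedDict((x, None) for x in xs)),
                     ArrayType(IntegerType()))
Df2.withColumn("age_array_unique", uniq_array_udf("age_array")).show()
# |  1|[24, 23, 27, 23]|    [24, 23, 27]|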
You need to convert the final value to a Python list. Implement the function as follows:
def uniq_array(col_array):
    x = np.unique(col_array)
    return list(x)
This is because Spark doesn't understand the NumPy array format. In order to return a Python object that Spark DataFrames understand as an ArrayType, you need to convert the output to a Python list before returning it.
I also got this error when my UDF returned a float but I forgot to cast it to a Python float. I needed to do this:
retval = 0.5
return float(retval)
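For instance, a hedged sketch of a FloatType UDF where NumPy's float64 has to be converted back to a plain Python float:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# np.mean returns np.float64, which Spark's FloatType does not accept directly.
mean_udf = udf(lambda xs: float(np.mean(xs)), FloatType())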
As of PySpark version 2.4, you can use the array_distinct transformation.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
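For example, assuming the same Df2 from the question:
import pyspark.sql.functions as F

Df3 = Df2.withColumn("age_array_unique", F.array_distinct("age_array"))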
The below works fine for me:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
[x.item() for x in <any numpy array>]
converts it to plain Python.

Getting all acceptable string arguments to DataFrameGroupby.aggregate

So, I have a piece of code that takes a groupby object and a dictionary mapping columns in the groupby to strings, indicating aggregation types. I want to validate that all the values in the dictionary are strings that pandas accepts in its aggregation. However, I don't want to use a try/except (which, without a loop, will only catch a single problem value). How do I do this?
I've already tried importing the SelectionMixin from pandas.core.generic and checking against the values in SelectionMixin._cython_table, but this clearly isn't an exhaustive list. My version of pandas is 0.20.3.
Here's an example of how I want to use this
class SomeModule:
    ALLOWED_AGGREGATIONS = # this is where I would save the collection of allowed values

    @classmethod
    def aggregate(cls, df, groupby_cols, aggregation_dict):
        disallowed_aggregations = list(
            set(aggregation_dict.values()) - set(cls.ALLOWED_AGGREGATIONS)
        )
        if len(disallowed_aggregations):
            val_str = ', '.join(disallowed_aggregations)
            raise ValueError(
                f'Unallowed aggregations found: {val_str}'
            )
        return df.groupby(groupby_cols).agg(aggregation_dict)
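A hypothetical usage sketch, assuming the ALLOWED_AGGREGATIONS placeholder above has been filled with a hand-picked set such as {'sum', 'mean', 'max'}:
import pandas as pd

SomeModule.ALLOWED_AGGREGATIONS = {'sum', 'mean', 'max'}  # hypothetical fill for the sketch
df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})
print(SomeModule.aggregate(df, ['g'], {'x': 'sum', 'y': 'mean'}))  # works
SomeModule.aggregate(df, ['g'], {'x': 'bogus'})                    # raises ValueError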

PYSPARK - How I get only value in SQLCONTEXT result

I'm trying to return this SQLContext result as a list, but it returns everything and I want only the values.
PySpark code:
rdd = sqlcontext.sql('select * from Dicionario where CD_PROC="{}"'.format(processo)).rdd.map(lambda l: l).collect()
Result (one line):
[Row(CD_PROC=u'CAL-0004-LL0', NM_CAMPO=u'TIPO_REGISTRO', TIPO=u'A', TAMANHO=u'1', CD_DOMI=u'', ID_PK=u'0', DT_HOM=u'')]
I want:
[CAL-0004-LL0, TIPO_REGISTRO, A, 1, ,0,,]
A Row object is a subclass of tuple.
results = [Row(CD_PROC=u'CAL-0004-LL0', NM_CAMPO=u'TIPO_REGISTRO', TIPO=u'A', TAMANHO=u'1', CD_DOMI=u'', ID_PK=u'0', DT_HOM=u'')]
print(tuple(results[0])) # or use list() or [:] anything you like
Output
(u'', u'CAL-0004-LL0', u'', u'0', u'TIPO_REGISTRO', u'1', u'A')
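Note that in older PySpark versions a Row built from keyword arguments sorts its fields alphabetically, which is why the tuple above is not in the order you listed. To get a specific order, index by field name, e.g.:
r = results[0]
print([r['CD_PROC'], r['NM_CAMPO'], r['TIPO'], r['TAMANHO'],
       r['CD_DOMI'], r['ID_PK'], r['DT_HOM']])
# [u'CAL-0004-LL0', u'TIPO_REGISTRO', u'A', u'1', u'', u'0', u'']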
