Creating a DataFrame from Row results in 'infer schema issue' - apache-spark

When I began learning PySpark, I used a list to create a dataframe. Now that inferring the schema from a list has been deprecated, I got a warning suggesting that I use pyspark.sql.Row instead. However, when I try to create one using Row, I get a schema inference error. This is my code:
>>> row = Row(name='Severin', age=33)
>>> df = spark.createDataFrame(row)
This results in the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/spark2-client/python/pyspark/sql/session.py", line 390, in _createFromLocal
struct = self._inferSchemaFromList(data)
File "/spark2-client/python/pyspark/sql/session.py", line 322, in _inferSchemaFromList
schema = reduce(_merge_type, map(_infer_schema, data))
File "/spark2-client/python/pyspark/sql/types.py", line 992, in _infer_schema
raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>
So I created a schema
>>> schema = StructType([StructField('name', StringType()),
... StructField('age',IntegerType())])
>>> df = spark.createDataFrame(row, schema)
but then this error is thrown:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/spark2-client/python/pyspark/sql/session.py", line 387, in _createFromLocal
data = list(data)
File "/spark2-client/python/pyspark/sql/session.py", line 509, in prepare
verify_func(obj, schema)
File "/spark2-client/python/pyspark/sql/types.py", line 1366, in _verify_type
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 33 in type <type 'int'>

The createDataFrame function takes a list of Rows (among other options) plus the schema, so the correct code would be something like:
from pyspark.sql.types import *
from pyspark.sql import Row
schema = StructType([StructField('name', StringType()), StructField('age',IntegerType())])
rows = [Row(name='Severin', age=33), Row(name='John', age=48)]
df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show()
Out:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
+-------+---+
|   name|age|
+-------+---+
|Severin| 33|
|   John| 48|
+-------+---+
In the pyspark docs (link) you can find more details about the createDataFrame function.
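As a side note, createDataFrame also accepts plain tuples plus a list of column names, letting Spark infer the types (a minimal sketch on the same data):
rows = [('Severin', 33), ('John', 48)]
df = spark.createDataFrame(rows, ["name", "age"])
df.printSchema()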

You need to create a list of Row objects and pass that list, together with the schema, to the createDataFrame() method. Sample example:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
department1 = Row(id='AAAAAAAAAAAAAA', type='XXXXX',cost='2')
department2 = Row(id='AAAAAAAAAAAAAA', type='YYYYY',cost='32')
department3 = Row(id='BBBBBBBBBBBBBB', type='XXXXX',cost='42')
department4 = Row(id='BBBBBBBBBBBBBB', type='YYYYY',cost='142')
department5 = Row(id='BBBBBBBBBBBBBB', type='ZZZZZ',cost='149')
department6 = Row(id='CCCCCCCCCCCCCC', type='XXXXX',cost='15')
department7 = Row(id='CCCCCCCCCCCCCC', type='YYYYY',cost='23')
department8 = Row(id='CCCCCCCCCCCCCC', type='ZZZZZ',cost='10')
schema = StructType([StructField('id', StringType()), StructField('type',StringType()),StructField('cost', StringType())])
rows = [department1,department2,department3,department4,department5,department6,department7,department8 ]
df = spark.createDataFrame(rows, schema)

If you're just making a pandas dataframe, you can convert each Row to a dict and then rely on pandas' type inference, if that's good enough for your needs. This worked for me:
import pandas as pd
sample = output.head(5) #this returns a list of Row objects
df = pd.DataFrame([x.asDict() for x in sample])
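Alternatively, assuming output here is a Spark DataFrame and the five rows fit comfortably in driver memory (both assumptions on my part), the built-in conversion does the same job; pdf is just an illustrative name:
pdf = output.limit(5).toPandas()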

I have had a similar problem recently and the answers here helped me understand the problem better.
My code:
row = Row(name="Alice", age=11)
spark.createDataFrame(row).show()
resulted in a very similar error:
An error was encountered:
Can not infer schema for type: <class 'int'>
Traceback ...
The cause of the problem:
createDataFrame expects a list of rows. So if you only have one row and don't want to invent more, simply make it a list: [row]
row = Row(name="Alice", age=11)
spark.createDataFrame([row]).show()

Related

Pyspark - Use a dataclass inside map function - can't pickle _abc_data objects

from typing import Optional, List
from dataclasses_json import DataClassJsonMixin
from dataclasses import dataclass, field
from uuid import uuid4 as uuid
from pyspark.sql import SparkSession, Row
from datetime import datetime, date
from pyspark.sql.functions import concat_ws, concat, col, lit
from pyspark.sql.types import StructType, StructField, StringType
import requests, json
# import cloudpickle
# import pyspark.serializers

@dataclass
class example(DataClassJsonMixin):
    DataHeaders: Optional[str] = None
    DataValues: Optional[str] = None
    DateofData: str = datetime.now().isoformat()

spark = SparkSession.builder.getOrCreate()
# pyspark.serializers.cloudpickle = cloudpickle

df_test_row = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df_test_row.show()

cd = example(DataHeaders="a", DataValues="b")
json_req_str = cd.to_json()
print(json_req_str)

def function_odp_req(row):
    a = row.__getattr__("a")
    b = row.__getattr__("b")
    cd = example(DataHeaders=a, DataValues=b)
    json_req_str = cd.to_json()
    return (1, json_req_str)

df_res = df_test_row.rdd.map(lambda line: function_odp_req(line))
test = StructType([StructField('id', StringType(), True), StructField('json_request', StringType(), True)])
final_df_json = spark.createDataFrame(data=df_res, schema=test)
final_df_json.printSchema()
final_df_json.show()
The above is the full example. I am not able to get the actual JSON string from the data class when it is used inside the lambda function.
I have referred to multiple posts, but no luck. Any help is appreciated.
When the above is executed, it generates the response below:
+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 12:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 12:00:00|
|  4|5.0|string3|2000-03-01|2000-01-03 12:00:00|
+---+---+-------+----------+-------------------+
{"DataHeaders": "a", "DataValues": "b", "DateofData": "2021-03-24T15:05:10.845998"}
root
|-- id: string (nullable = true)
|-- json_request: string (nullable = true)
+---+------------+
| id|json_request|
+---+------------+
| 1| {}|
| 1| {}|
| 1| {}|
+---+------------+
If you look at the sample dataframe output above, json_request is coming out empty. When the same code is used outside of the map function, the JSON string is generated.
When I do not enable cloudpickle, I see the exception below:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4696972411027040040.py", line 468, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 49, in <module>
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 749, in createDataFrame
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2298, in _to_java_object_rdd
return self.ctx._jvm.SerDeUtil.pythonToJava(rdd._jrdd, True)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2532, in _jrdd
self._jrdd_deserializer, profiler)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2434, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 600, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: TypeError: can't pickle _abc_data objects
So, as I understand it, the example object used inside the function is not getting serialized.
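One possible workaround (a sketch only, not a confirmed fix for this exact setup; function_odp_req_plain is my own illustrative name): build the JSON payload with the standard json module inside the mapped function, so no dataclass is captured in the closure that Spark has to pickle:
import json
from datetime import datetime

def function_odp_req_plain(row):
    # hypothetical replacement for function_odp_req: same payload, built from a plain dict,
    # so only built-in objects travel with the pickled closure
    payload = {
        "DataHeaders": row["a"],
        "DataValues": row["b"],
        "DateofData": datetime.now().isoformat(),
    }
    return (1, json.dumps(payload))

df_res = df_test_row.rdd.map(function_odp_req_plain)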

PySpark trying to apply previous field's schema to next field

Having this weird issue with PySpark. It seems to be trying to apply the schema for the previous field to the next field as it processes.
Simplest test case I could come up with:
%pyspark
from pyspark.sql.types import (
    DateType,
    StructType,
    StructField,
    StringType,
)
from datetime import date
from pyspark.sql import Row

schema = StructType(
    [
        StructField("date", DateType(), True),
        StructField("country", StringType(), True),
    ]
)

test = spark.createDataFrame(
    [
        Row(
            date=date(2019, 1, 1),
            country="RU",
        ),
    ],
    schema
)
Stacktrace:
Fail to execute line 26: schema
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-8579306903394369208.py", line 380, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 26, in <module>
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 691, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 423, in _createFromLocal
data = [schema.toInternal(row) for row in data]
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 601, in toInternal
for f, v, c in zip(self.fields, obj, self._needConversion))
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 601, in <genexpr>
for f, v, c in zip(self.fields, obj, self._needConversion))
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 439, in toInternal
return self.dataType.toInternal(obj)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 175, in toInternal
return d.toordinal() - self.EPOCH_ORDINAL
AttributeError: 'str' object has no attribute 'toordinal'
Bonus information from running it locally rather than in Zeppelin:
self = DateType, d = 'RU'
def toInternal(self, d):
if d is not None:
> return d.toordinal() - self.EPOCH_ORDINAL
E AttributeError: 'str' object has no attribute 'toordinal'
i.e., it's trying to apply DateType to country. If I get rid of date, it's fine. If I get rid of country, it's fine. Both together is a no-go.
Any ideas? Am I missing something obvious?
If you're going to use a list of Rows, you don't need to specify the schema as well. This is because the Row already knows the schema.
The problem is happening because the pyspark.sql.Row object does not maintain the order that you specified for the fields.
print(Row(date=date(2019, 1, 1), country="RU"))
#Row(country='RU', date=datetime.date(2019, 1, 1))
From the docs:
Row can be used to create a row object by using named arguments, the fields will be sorted by names.
As you can see, the country field is being put first. When spark tries to create the DataFrame with the specified schema, it expects the first item to be a DateType.
One way to fix this is to put the fields in your schema in alphabetical order:
schema = StructType(
    [
        StructField("country", StringType(), True),
        StructField("date", DateType(), True)
    ]
)
test = spark.createDataFrame(
    [
        Row(date=date(2019, 1, 1), country="RU")
    ],
    schema
)
test.show()
#+-------+----------+
#|country| date|
#+-------+----------+
#| RU|2019-01-01|
#+-------+----------+
Or in this case, there's no need to even pass in the schema to createDataFrame. It will be inferred from the Rows:
test = spark.createDataFrame(
    [
        Row(date=date(2019, 1, 1), country="RU")
    ]
)
And if you wanted to reorder the columns, use select:
test = test.select("date", "country")
test.show()
#+----------+-------+
#| date|country|
#+----------+-------+
#|2019-01-01| RU|
#+----------+-------+
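One more option (a small sketch, keeping the schema in its original order): pass plain tuples instead of Rows, in the same order as the schema, which sidesteps the field sorting entirely:
schema = StructType(
    [
        StructField("date", DateType(), True),
        StructField("country", StringType(), True),
    ]
)
test = spark.createDataFrame([(date(2019, 1, 1), "RU")], schema)
test.show()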

Using broadcasted dataframe in pyspark UDF

Is it possible to use a broadcasted data frame in the UDF of a PySpark SQL application?
My code calls the broadcasted DataFrame inside a PySpark UDF like below.
fact_ent_df_data = sparkSession.sparkContext.broadcast(fact_ent_df.collect())

def generate_lookup_code(col1, col2, col3):
    fact_ent_df_count = fact_ent_df_data.\
        select(fact_ent_df_br.TheDate.between(col1, col2),
               fact_ent_df_br.Ent.isin('col3')).count()
    return fact_ent_df_count

sparkSession.udf.register("generate_lookup_code", generate_lookup_code)
sparkSession.sql('select sample4, generate_lookup_code(sample1, sample2, sample3) as count_hol from table_t')
I am getting a 'local variable used before assignment' error when I use the broadcasted df_bc. Any help is appreciated.
The error I am getting is:
Traceback (most recent call last):
File "C:/Users/Vignesh/PycharmProjects/gettingstarted/aramex_transit/spark_driver.py", line 46, in <module>
sparkSession.udf.register("generate_lookup_code" , generate_lookup_code )
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 323, in register
self.sparkSession._jsparkSession.udf().registerPython(name, register_udf._judf)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 148, in _judf
self._judf_placeholder = self._create_judf()
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 157, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 33, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\rdd.py", line 2391, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\serializers.py", line 575, in dumps
return cloudpickle.dumps(obj, 2)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\cloudpickle.py", line 918, in dumps
cp.dump(obj)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\cloudpickle.py", line 249, in dump
raise pickle.PicklingError(msg)
pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o24.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Think of a Spark broadcast variable as a simple Python data type, like a list; the problem is then just how to pass a variable into the UDF. Here is an example:
Suppose we have an ages list l and a data frame with columns name and age, and we want to check whether each person's age is in the ages list.
from pyspark.sql.functions import udf, col

l = [13, 21, 34]                      # ages list
d = [('Alice', 10), ('bob', 21)]      # data frame rows
rdd = sc.parallelize(l)
b_rdd = sc.broadcast(rdd.collect())   # define broadcast variable
df = spark.createDataFrame(d, ["name", "age"])

def check_age(age, age_list):
    if age in age_list:
        return "true"
    return "false"

def udf_check_age(age_list):
    return udf(lambda x: check_age(x, age_list))

df.withColumn("is_age_in_list", udf_check_age(b_rdd.value)(col("age"))).show()
Output:
+-----+---+--------------+
| name|age|is_age_in_list|
+-----+---+--------------+
|Alice| 10| false|
| bob| 21| true|
+-----+---+--------------+
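A variant worth noting (a sketch; the names ages_bc and is_age_in_list and the Boolean return type are my own choices, not from the question): pass the Broadcast handle itself into the closure and call .value inside the UDF, so the lookup list travels via the broadcast mechanism rather than being baked into the pickled function:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

ages_bc = spark.sparkContext.broadcast([13, 21, 34])

@udf(BooleanType())
def is_age_in_list(age):
    # .value is resolved on the executors; only the small broadcast handle is pickled
    return age in ages_bc.value

df = spark.createDataFrame([('Alice', 10), ('bob', 21)], ["name", "age"])
df.withColumn("is_age_in_list", is_age_in_list(col("age"))).show()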
Just trying to contribute with a simpler example based on Soheil's answer.
from pyspark.sql.functions import udf, col

def check_age(_age):
    return _age > 18

dict_source = {"alice": 10, "bob": 21}
broadcast_dict = sc.broadcast(dict_source)  # define broadcast variable
rdd = sc.parallelize(list(dict_source.keys()))

result = rdd.map(
    lambda _name: check_age(broadcast_dict.value.get(_name))  # access the broadcast var via `.value`
)
print(result.collect())

ValueError: Cannot convert column into bool

I'm trying to build a new column on a dataframe as below:
l = [(2, 1), (1, 1)]
df = spark.createDataFrame(l)

def calc_dif(x, y):
    if (x > y) and (x == 1):
        return x - y

dfNew = df.withColumn("calc", calc_dif(df["_1"], df["_2"]))
dfNew.show()
But, I get:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 346, in <module>
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 334, in <module>
File "<stdin>", line 38, in <module>
File "<stdin>", line 36, in calc_dif
File "/usr/hdp/current/spark2-client/python/pyspark/sql/column.py", line 426, in __nonzero__
raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Why does it happen? How can I fix it?
Either use udf:
from pyspark.sql.functions import udf

@udf("integer")
def calc_dif(x, y):
    if (x > y) and (x == 1):
        return x - y
or case when (recommended):
from pyspark.sql.functions import when

def calc_dif(x, y):
    return when((x > y) & (x == 1), x - y)

The first one computes on Python objects, the second one on Spark Columns.
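Put together (a minimal sketch on the question's toy data), the column-expression version looks like this:
from pyspark.sql.functions import when, col

l = [(2, 1), (1, 1)]
df = spark.createDataFrame(l)
dfNew = df.withColumn("calc", when((col("_1") > col("_2")) & (col("_1") == 1), col("_1") - col("_2")))
dfNew.show()
# both rows produce null: (2, 1) fails _1 == 1 and (1, 1) fails _1 > _2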
It is complaining because you give your calc_dif function the whole Column objects, not the actual data of the respective rows. You need to use a udf to wrap your calc_dif function:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

l = [(2, 1), (1, 1)]
df = spark.createDataFrame(l)

def calc_dif(x, y):
    # with the udf, calc_dif is called for every row in the dataframe;
    # x and y are the values of the two columns
    if (x > y) and (x == 1):
        return x - y

udf_calc = udf(calc_dif, IntegerType())
dfNew = df.withColumn("calc", udf_calc("_1", "_2"))
dfNew.show()
# calc_dif returns None for both rows: (2, 1) fails x == 1 and (1, 1) fails x > y
+---+---+----+
| _1| _2|calc|
+---+---+----+
| 2| 1|null|
| 1| 1|null|
+---+---+----+
For anyone who has a similar error: I was trying to pass an RDD when I needed a pandas object and got the same error. Obviously, I could simply solve it with .toPandas().
For anyone who faces the same error message, check the brackets. Sometimes a boolean expression needs more explicit grouping, like:
DF_New = df1.withColumn('EventStatus',
                        F.when((F.col("Adjusted_Timestamp") < F.col("Event_Finish")) &
                               (F.col("Adjusted_Timestamp") > F.col("Event_Start")), 1)
                         .otherwise(0))

read content of Column<COLUMN-NAME> in pyspark

I am using Spark 1.5.0.
I have a data frame created like below, and am trying to read a column from it:
>>> words = tokenizer.transform(sentenceData)
>>> words
DataFrame[label: bigint, sentence: string, words: array<string>]
>>> words['words']
Column<words>
I want to read all the words (the vocabulary) from the sentences. How can I read this?
Edit 1: Error still occurs
I have now run this in Spark 2.0.0 and am getting this error:
>>> wordsData.show()
+--------------------+--------------------+
| desc| words|
+--------------------+--------------------+
|Virat is good bat...|[virat, is, good,...|
| sachin was good| [sachin, was, good]|
|but modi sucks bi...|[but, modi, sucks...|
| I love the formulas|[i, love, the, fo...|
+--------------------+--------------------+
>>> wordsData
DataFrame[desc: string, words: array<string>]
>>> vocab = wordsData.select(explode('words')).rdd.flatMap(lambda x: x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 305, in flatMap
return self.mapPartitionsWithIndex(func, preservesPartitioning)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 330, in mapPartitionsWithIndex
return PipelinedRDD(self, f, preservesPartitioning)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 2383, in __init__
self._jrdd_deserializer = self.ctx.serializer
AttributeError: 'SparkSession' object has no attribute 'serializer'
Resolution for Edit - 1 - Link
You can:
from pyspark.sql.functions import explode
words.select(explode('words')).rdd.flatMap(lambda x: x)
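If the goal is the vocabulary itself, a follow-up sketch (deduplicating and collecting to the driver is only sensible when the vocabulary is small):
from pyspark.sql.functions import explode
vocab = words.select(explode('words').alias('word')).distinct().rdd.flatMap(lambda x: x).collect()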
