read content of Column<COLUMN-NAME> in pyspark - apache-spark

I am using Spark 1.5.0.
I have a data frame created as below, and I am trying to read a column from it:
>>> words = tokenizer.transform(sentenceData)
>>> words
DataFrame[label: bigint, sentence: string, words: array<string>]
>>> words['words']
Column<words>
I want to read all the words (the vocabulary) from the sentences. How can I read this column?
Edit 1: The error persists
I have now run this in Spark 2.0.0 and I am getting this error:
>>> wordsData.show()
+--------------------+--------------------+
| desc| words|
+--------------------+--------------------+
|Virat is good bat...|[virat, is, good,...|
| sachin was good| [sachin, was, good]|
|but modi sucks bi...|[but, modi, sucks...|
| I love the formulas|[i, love, the, fo...|
+--------------------+--------------------+
>>> wordsData
DataFrame[desc: string, words: array<string>]
>>> vocab = wordsData.select(explode('words')).rdd.flatMap(lambda x: x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 305, in flatMap
return self.mapPartitionsWithIndex(func, preservesPartitioning)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 330, in mapPartitionsWithIndex
return PipelinedRDD(self, f, preservesPartitioning)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 2383, in __init__
self._jrdd_deserializer = self.ctx.serializer
AttributeError: 'SparkSession' object has no attribute 'serializer'
Resolution for Edit - 1 - Link

You can:
from pyspark.sql.functions import explode
words.select(explode('words')).rdd.flatMap(lambda x: x)
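To actually get the vocabulary onto the driver, deduplicate and collect the resulting RDD. A short follow-up sketch, assuming the vocabulary fits in driver memory:
vocab = words.select(explode('words')).rdd.flatMap(lambda x: x)
print(vocab.distinct().collect())  # one entry per unique word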

Related

Spark: KMeans - ValueError: could not convert string to float: '0\x00\x00'

I'm trying to create a k-means model for the MNIST dataset. I have a way that works, but it is the dirtiest hack.
My input is a CSV file with 784 (= 28*28) values between 0 and 255 per row.
My first attempt was to simply read my CSV input, convert it to sparse arrays, and fit the model with the data. However, the code below throws an error:
data = spark.read.csv("datasets/mnist_test.csv").rdd\
    .map(lambda x: [float(str) for str in x])\
    .toDF()
features = VectorAssembler(inputCols=data.columns, outputCol="features").transform(data).select("features")
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(features)
Output:
22/01/25 10:44:41 ERROR Executor: Exception in task 4.0 in stage 113.0 (TID 131)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 619, in main
process()
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 611, in process
serializer.dump_stream(out_iter, outfile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/opt/spark/python/lib/pyspark.zip/pyspark/util.py", line 74, in wrapper
return f(*args, **kwargs)
File "/tmp/ipykernel_74/2701217925.py", line 2, in <lambda>
File "/tmp/ipykernel_74/2701217925.py", line 2, in <listcomp>
ValueError: could not convert string to float: '0\x00\x00'
...
My next attempt was to save the dataframe in LibSVM format and then load it again:
MLUtils.saveAsLibSVMFile(features.rdd.map(lambda x: LabeledPoint(0, MLLibVectors.fromML(x.features))), './libsvm')
data2 = MLUtils.loadLibSVMFile(spark.sparkContext, './libsvm').toDF()
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(data2)
Output:
22/01/25 10:47:06 ERROR Instrumentation: java.lang.IllegalArgumentException: requirement failed: Column features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
My final (working) attempt was to load the exported partitions with the spark.read.format("libsvm").load(...) method:
data3 = spark.read.format("libsvm").load("libsvm/part-00000").select("features")
data3arr = list()
for i in range(5):
    data3arr.append(spark.read.format("libsvm").load("libsvm/part-0000" + str(i)).select("features"))
data3cpl = data3arr[0]
for i in data3arr[1:]:
    data3cpl = data3cpl.union(i)  # reassign, since union() returns a new DataFrame
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(data3cpl)
If I look at the schemas, the dataframes appear quite similar in structure; only features gives me an error on .show():
features.printSchema()
features.show(1,False)
data2.printSchema()
data2.show(1,False)
data3cpl.printSchema()
data3cpl.show(1,False)
Output:
root
|-- features: vector (nullable = true)
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 186, in manager
File "/opt/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 74, in worker
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 663, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 564, in read_int
raise EOFError
EOFError
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|(784,[202,203,204,205,206,207,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,291,292,293,294,295,296,297,298,299,300,301,326,327,328,329,353,354,355,356,381,382,383,384,408,409,410,411,436,437,438,439,463,464,465,466,491,492,493,518,519,520,521,545,546,547,548,572,573,574,575,576,600,601,602,603,627,628,629,630,631,655,656,657,658,682,683,684,685,686,710,711,712,713,714,738,739,740,741],[84.0,185.0,159.0,151.0,60.0,36.0,222.0,254.0,254.0,254.0,254.0,241.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,170.0,52.0,67.0,114.0,72.0,114.0,163.0,227.0,254.0,225.0,254.0,254.0,254.0,250.0,229.0,254.0,254.0,140.0,17.0,66.0,14.0,67.0,67.0,67.0,59.0,21.0,236.0,254.0,106.0,83.0,253.0,209.0,18.0,22.0,233.0,255.0,83.0,129.0,254.0,238.0,44.0,59.0,249.0,254.0,62.0,133.0,254.0,187.0,5.0,9.0,205.0,248.0,58.0,126.0,254.0,182.0,75.0,251.0,240.0,57.0,19.0,221.0,254.0,166.0,3.0,203.0,254.0,219.0,35.0,38.0,254.0,254.0,77.0,31.0,224.0,254.0,115.0,1.0,133.0,254.0,254.0,52.0,61.0,242.0,254.0,254.0,52.0,121.0,254.0,254.0,219.0,40.0,121.0,254.0,207.0,18.0])|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 1 row
root
|-- features: vector (nullable = true)
|-- label: double (nullable = true)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|features |label|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|(778,[202,203,204,205,206,207,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,291,292,293,294,295,296,297,298,299,300,301,326,327,328,329,353,354,355,356,381,382,383,384,408,409,410,411,436,437,438,439,463,464,465,466,491,492,493,518,519,520,521,545,546,547,548,572,573,574,575,576,600,601,602,603,627,628,629,630,631,655,656,657,658,682,683,684,685,686,710,711,712,713,714,738,739,740,741],[84.0,185.0,159.0,151.0,60.0,36.0,222.0,254.0,254.0,254.0,254.0,241.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,170.0,52.0,67.0,114.0,72.0,114.0,163.0,227.0,254.0,225.0,254.0,254.0,254.0,250.0,229.0,254.0,254.0,140.0,17.0,66.0,14.0,67.0,67.0,67.0,59.0,21.0,236.0,254.0,106.0,83.0,253.0,209.0,18.0,22.0,233.0,255.0,83.0,129.0,254.0,238.0,44.0,59.0,249.0,254.0,62.0,133.0,254.0,187.0,5.0,9.0,205.0,248.0,58.0,126.0,254.0,182.0,75.0,251.0,240.0,57.0,19.0,221.0,254.0,166.0,3.0,203.0,254.0,219.0,35.0,38.0,254.0,254.0,77.0,31.0,224.0,254.0,115.0,1.0,133.0,254.0,254.0,52.0,61.0,242.0,254.0,254.0,52.0,121.0,254.0,254.0,219.0,40.0,121.0,254.0,207.0,18.0])|0.0 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
only showing top 1 row
root
|-- features: vector (nullable = true)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|(776,[202,203,204,205,206,207,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,291,292,293,294,295,296,297,298,299,300,301,326,327,328,329,353,354,355,356,381,382,383,384,408,409,410,411,436,437,438,439,463,464,465,466,491,492,493,518,519,520,521,545,546,547,548,572,573,574,575,576,600,601,602,603,627,628,629,630,631,655,656,657,658,682,683,684,685,686,710,711,712,713,714,738,739,740,741],[84.0,185.0,159.0,151.0,60.0,36.0,222.0,254.0,254.0,254.0,254.0,241.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,198.0,170.0,52.0,67.0,114.0,72.0,114.0,163.0,227.0,254.0,225.0,254.0,254.0,254.0,250.0,229.0,254.0,254.0,140.0,17.0,66.0,14.0,67.0,67.0,67.0,59.0,21.0,236.0,254.0,106.0,83.0,253.0,209.0,18.0,22.0,233.0,255.0,83.0,129.0,254.0,238.0,44.0,59.0,249.0,254.0,62.0,133.0,254.0,187.0,5.0,9.0,205.0,248.0,58.0,126.0,254.0,182.0,75.0,251.0,240.0,57.0,19.0,221.0,254.0,166.0,3.0,203.0,254.0,219.0,35.0,38.0,254.0,254.0,77.0,31.0,224.0,254.0,115.0,1.0,133.0,254.0,254.0,52.0,61.0,242.0,254.0,254.0,52.0,121.0,254.0,254.0,219.0,40.0,121.0,254.0,207.0,18.0])|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 1 row
Can anyone tell me how to properly convert my data so I can feed it into my kmeans fit?
I'm answering because I had a similar issue and didn't find any solution in this post.
In my case, the problem was caused by a filter operation on a DataFrame.
I solved it by calling cache() on that DataFrame.
In this case, then, one possible solution is to try to cache the RDD:
data = spark.read.csv("datasets/mnist_test.csv").rdd\
    .map(lambda x: [float(str) for str in x]).cache()\
    .toDF()
features = VectorAssembler(inputCols=data.columns, outputCol="features").transform(data).select("features")
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(features)
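If caching alone does not help, the Python round trip through the RDD can be avoided entirely by casting in the DataFrame API. A minimal sketch, assuming every CSV column holds a numeric string:
from pyspark.sql.functions import col

raw = spark.read.csv("datasets/mnist_test.csv")
# cast("float") yields null for unparseable strings instead of raising
data = raw.select([col(c).cast("float") for c in raw.columns])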

load jalali date from string in pyspark

I need to load a Jalali date from a string and then return it as a Gregorian date string. I'm using the following code:
def jalali_to_gregorian(col, format=None):
    if not format:
        format = "%Y/%m/d"
    gre = jdatetime.datetime.strptime(col, format=format).togregorian()
    return gre.strftime(format=format)
# register the function
spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType())
# load the date and show it:)
df = df.withColumn("financial_date", jalali_to_gregorian(df.PersianCreateDate))
df.select(['PersianCreateDate', 'financial_date']).show()
It throws a "ValueError: time data 'Column<PersianCreateDate>' does not match format '%Y/%m/%d'" error at me.
The string in the column does match the format, and I have tested it. The problem is in how Spark sends the column value to my function. Is there any way to solve it?
To test:
df=spark.createDataFrame([('1399/01/02',),('1399/01/01',)],['jalali'])
df = df.withColumn("gre", jalali_to_gregorian(df.jalali))
df.show()
should result in
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/21|
|1399/01/01|2020/03/20|
+----------+----------+
Instead, it throws:
Fail to execute line 2: df = df.withColumn("financial_date", jalali_to_gregorian(df.jalali))
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6468469233020961307.py", line 375, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 2, in <module>
File "<stdin>", line 7, in jalali_to_gregorian
File "/usr/local/lib/python2.7/dist-packages/jdatetime/__init__.py", line 929, in strptime
(date_string, format))
ValueError: time data 'Column<jalali>' does not match format '%Y/%m/%d'
Your problem is that you're trying to apply the function to the Column object itself, not to the values inside the column.
The code you used, spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType()), registers your function for use in Spark SQL (via spark.sql(...)), not in the PySpark DataFrame API.
To get a function that you can use inside withColumn, select, etc., you need to create a wrapper with the udf function and use that wrapper in withColumn:
from pyspark.sql.functions import udf
jalali_to_gregorian_udf = udf(jalali_to_gregorian, StringType())
df = df.withColumn("gre", jalali_to_gregorian_udf(df.jalali))
>>> df.show()
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/21|
|1399/01/01|2020/03/20|
+----------+----------+
See the documentation for more details.
You also have an error in the time format: instead of format = "%Y/%m/d" it should be format = "%Y/%m/%d".
P.S. If you're running on Spark 3.x, I recommend looking at vectorized UDFs (also known as Pandas UDFs); they are much faster than regular UDFs and will give better performance if you have a lot of data.
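A minimal sketch of that vectorized variant, assuming Spark 3.x with PyArrow available and the jdatetime package installed (the hard-coded "%Y/%m/%d" format is taken from the question):
import jdatetime
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def jalali_to_gregorian_pudf(dates: pd.Series) -> pd.Series:
    # convert a whole batch of Jalali date strings per call
    return dates.apply(
        lambda d: jdatetime.datetime.strptime(d, "%Y/%m/%d")
                          .togregorian()
                          .strftime("%Y/%m/%d"))

df = df.withColumn("gre", jalali_to_gregorian_pudf(df.jalali))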

Using broadcasted dataframe in pyspark UDF

Is it possible to use a broadcasted data frame in the UDF of a PySpark SQL application?
My code calls the broadcasted DataFrame inside a PySpark UDF like below.
fact_ent_df_data = sparkSession.sparkContext.broadcast(fact_ent_df.collect())

def generate_lookup_code(col1, col2, col3):
    fact_ent_df_count = fact_ent_df_data.\
        select(fact_ent_df_br.TheDate.between(col1, col2),
               fact_ent_df_br.Ent.isin('col3')).count()
    return fact_ent_df_count

sparkSession.udf.register("generate_lookup_code", generate_lookup_code)
sparkSession.sql('select sample4, generate_lookup_code(sample1, sample2, sample3) as count_hol from table_t')
I am getting a "local variable used before assignment" error when I use the broadcasted df_bc. Any help is appreciated.
And the Error i am getting is
Traceback (most recent call last):
File "C:/Users/Vignesh/PycharmProjects/gettingstarted/aramex_transit/spark_driver.py", line 46, in <module>
sparkSession.udf.register("generate_lookup_code" , generate_lookup_code )
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 323, in register
self.sparkSession._jsparkSession.udf().registerPython(name, register_udf._judf)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 148, in _judf
self._judf_placeholder = self._create_judf()
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 157, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\sql\udf.py", line 33, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\rdd.py", line 2391, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\serializers.py", line 575, in dumps
return cloudpickle.dumps(obj, 2)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\cloudpickle.py", line 918, in dumps
cp.dump(obj)
File "D:\spark-2.3.2-bin-hadoop2.6\spark-2.3.2-bin-hadoop2.6\python\pyspark\cloudpickle.py", line 249, in dump
raise pickle.PicklingError(msg)
pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o24.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Think of a Spark broadcast variable as a simple Python data type, like a list; the problem is then just how to pass a variable to the UDF function. Here is an example:
Suppose we have an ages list l and a data frame with columns name and age, and we want to check whether each person's age is in the ages list.
from pyspark.sql.functions import udf, col

l = [13, 21, 34] # ages list
d = [('Alice', 10), ('bob', 21)] # data frame rows
rdd = sc.parallelize(l)
b_rdd = sc.broadcast(rdd.collect()) # define broadcast variable
df = spark.createDataFrame(d, ["name", "age"])

def check_age(age, age_list):
    # check against the broadcasted list that was passed in,
    # not against the driver-side variable l
    if age in age_list:
        return "true"
    return "false"

def udf_check_age(age_list):
    return udf(lambda x: check_age(x, age_list))

df.withColumn("is_age_in_list", udf_check_age(b_rdd.value)(col("age"))).show()
Output:
+-----+---+--------------+
| name|age|is_age_in_list|
+-----+---+--------------+
|Alice| 10| false|
| bob| 21| true|
+-----+---+--------------+
Just trying to contribute with a simpler example based on Soheil's answer.
from pyspark.sql.functions import udf, col

def check_age(_age):
    return _age > 18

dict_source = {"alice": 10, "bob": 21}
broadcast_dict = sc.broadcast(dict_source) # define broadcast variable
rdd = sc.parallelize(list(dict_source.keys()))
result = rdd.map(
    lambda _name: check_age(broadcast_dict.value.get(_name)) # read the broadcasted var via `.value`
)
print(result.collect())
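Coming back to the original question, the same pattern applies to a small lookup DataFrame: collect it into a plain Python structure, broadcast that, and read only the broadcast value inside the UDF. A rough sketch under the assumption that fact_ent_df is small enough to collect (the names TheDate and Ent are taken from the question):
from pyspark.sql.types import IntegerType

# collect the small lookup DataFrame as a list of (date, ent) tuples
lookup = [(r['TheDate'], r['Ent']) for r in fact_ent_df.collect()]
lookup_bc = sparkSession.sparkContext.broadcast(lookup)

def generate_lookup_code(col1, col2, col3):
    # plain Python over the broadcast value; no DataFrame calls here,
    # so nothing unpicklable is captured by the UDF
    return sum(1 for the_date, ent in lookup_bc.value
               if col1 <= the_date <= col2 and ent == col3)

sparkSession.udf.register("generate_lookup_code", generate_lookup_code, IntegerType())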

ValueError: Cannot convert column into bool

I'm trying to build a new column on a dataframe as below:
l = [(2, 1), (1,1)]
df = spark.createDataFrame(l)
def calc_dif(x, y):
    if (x > y) and (x == 1):
        return x - y
dfNew = df.withColumn("calc", calc_dif(df["_1"], df["_2"]))
dfNew.show()
But, I get:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 346, in <module>
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 334, in <module>
File "<stdin>", line 38, in <module>
File "<stdin>", line 36, in calc_dif
File "/usr/hdp/current/spark2-client/python/pyspark/sql/column.py", line 426, in __nonzero__
raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Why does it happen? How can I fix it?
Either use udf:
from pyspark.sql.functions import udf

@udf("integer")
def calc_dif(x, y):
    if (x > y) and (x == 1):
        return x - y
or case when (recommended):
from pyspark.sql.functions import when

def calc_dif(x, y):
    return when((x > y) & (x == 1), x - y)
The first one computes on Python objects, the second one on Spark Columns.
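With the when-based version no UDF is needed, since calc_dif now returns a Column expression. Applying it to the example dataframe (a quick sketch) looks like:
dfNew = df.withColumn("calc", calc_dif(df["_1"], df["_2"]))
dfNew.show()  # rows where the condition fails get null in "calc"
Keeping the logic in Column expressions lets Spark optimize the whole query instead of calling back into Python for every row.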
It is complaining because you give your calc_dif function the whole Column objects, not the actual data of the respective rows. You need to use a udf to wrap your calc_dif function:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

l = [(2, 1), (1, 1)]
df = spark.createDataFrame(l)

def calc_dif(x, y):
    # through the udf, calc_dif is called for every row in the dataframe;
    # x and y are the values of the two columns
    if (x > y) and (x == 1):
        return x - y

udf_calc = udf(calc_dif, IntegerType())
dfNew = df.withColumn("calc", udf_calc("_1", "_2"))
dfNew.show()
# the condition is never satisfied for these rows, so calc_dif returns None
+---+---+----+
| _1| _2|calc|
+---+---+----+
| 2| 1|null|
| 1| 1|null|
+---+---+----+
For anyone who has a similar error: I was trying to pass an RDD when I needed a Pandas object and got the same error. Obviously, I could simply solve it with a .toPandas().
For anyone who faces the same error message: check the brackets. Sometimes a boolean expression needs more explicit grouping, like:
DF_New = df1.withColumn('EventStatus',
    F.when((F.col("Adjusted_Timestamp") < F.col("Event_Finish")) &
           (F.col("Adjusted_Timestamp") > F.col("Event_Start")), 1)
     .otherwise(0))

Creating a DataFrame from Row results in 'infer schema issue'

When I began learning PySpark, I used a list to create a dataframe. Now that inferring the schema from a list has been deprecated, I got a warning suggesting that I use pyspark.sql.Row instead. However, when I try to create one using Row, I get an 'infer schema' issue. This is my code:
>>> row = Row(name='Severin', age=33)
>>> df = spark.createDataFrame(row)
This results in the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/spark2-client/python/pyspark/sql/session.py", line 390, in _createFromLocal
struct = self._inferSchemaFromList(data)
File "/spark2-client/python/pyspark/sql/session.py", line 322, in _inferSchemaFromList
schema = reduce(_merge_type, map(_infer_schema, data))
File "/spark2-client/python/pyspark/sql/types.py", line 992, in _infer_schema
raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>
So I created a schema
>>> schema = StructType([StructField('name', StringType()),
... StructField('age',IntegerType())])
>>> df = spark.createDataFrame(row, schema)
but then, this error gets thrown.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/spark2-client/python/pyspark/sql/session.py", line 387, in _createFromLocal
data = list(data)
File "/spark2-client/python/pyspark/sql/session.py", line 509, in prepare
verify_func(obj, schema)
File "/spark2-client/python/pyspark/sql/types.py", line 1366, in _verify_type
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 33 in type <type 'int'>
The createDataFrame function takes a list of Rows (among other options) plus the schema, so the correct code would be something like:
from pyspark.sql.types import *
from pyspark.sql import Row
schema = StructType([StructField('name', StringType()), StructField('age',IntegerType())])
rows = [Row(name='Severin', age=33), Row(name='John', age=48)]
df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show()
Out:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
+-------+---+
| name|age|
+-------+---+
|Severin| 33|
| John| 48|
+-------+---+
In the pyspark docs (link) you can find more details about the createDataFrame function.
You need to create a list of type Row and pass that list together with the schema to your createDataFrame() method. A sample example:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
department1 = Row(id='AAAAAAAAAAAAAA', type='XXXXX',cost='2')
department2 = Row(id='AAAAAAAAAAAAAA', type='YYYYY',cost='32')
department3 = Row(id='BBBBBBBBBBBBBB', type='XXXXX',cost='42')
department4 = Row(id='BBBBBBBBBBBBBB', type='YYYYY',cost='142')
department5 = Row(id='BBBBBBBBBBBBBB', type='ZZZZZ',cost='149')
department6 = Row(id='CCCCCCCCCCCCCC', type='XXXXX',cost='15')
department7 = Row(id='CCCCCCCCCCCCCC', type='YYYYY',cost='23')
department8 = Row(id='CCCCCCCCCCCCCC', type='ZZZZZ',cost='10')
schema = StructType([StructField('id', StringType()), StructField('type',StringType()),StructField('cost', StringType())])
rows = [department1,department2,department3,department4,department5,department6,department7,department8 ]
df = spark.createDataFrame(rows, schema)
If you're just making a pandas dataframe, you can convert each Row to a dict and then rely on pandas' type inference, if that's good enough for your needs. This worked for me:
import pandas as pd
sample = output.head(5) #this returns a list of Row objects
df = pd.DataFrame([x.asDict() for x in sample])
I had a similar problem recently, and the answers here helped me understand the problem better.
My code:
row = Row(name="Alice", age=11)
spark.createDataFrame(row).show()
resulted in a very similar error:
An error was encountered:
Can not infer schema for type: <class 'int'>
Traceback ...
The cause of the problem: createDataFrame expects a list of rows, and a single Row is itself iterable, so Spark iterates over its fields (for example the int 33) and tries to infer a schema for each field as if it were a row. If you only have one row and don't want to invent more, simply wrap it in a list: [row]
row = Row(name="Alice", age=11)
spark.createDataFrame([row]).show()
