Error while converting RDD to Dataframe [PySpark] - apache-spark

I'm trying to convert an RDD back to a Spark DataFrame using the code below:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

schema = StructType([
    StructField("msn", StringType(), True),
    StructField("Input_Tensor", ArrayType(DoubleType()), True)
])
DF = spark.createDataFrame(rdd, schema=schema)
The dataset has only two columns:
msn, which contains only a string of characters.
Input_Tensor, a 2D array of floats.
But I keep getting this error and I'm not sure where it's coming from:
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/myproject/datasets/train.py", line 51, in EMA_detector
DF = spark.createDataFrame(rdd, schema=schema)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/sql/session.py", line 790, in createDataFrame
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2364, in _to_java_object_rdd
return self.ctx._jvm.SerDeUtil.pythonToJava(rdd._jrdd, True)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2599, in _jrdd
self._jrdd_deserializer, profiler)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2500, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2486, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/serializers.py", line 694, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: AttributeError: 'NoneType' object has no attribute 'items'
EDIT:
My RDD comes from this:
rdd = test_data.mapPartitions(lambda part: vectorizer.transform(part))
The dataset test_data is itself an RDD, but after the mapPartitions call the result is a PipelinedRDD, and that seems to be why it fails.

Assuming your rdd is defined by the following data:
data = [("row1", [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]), ("row2", [[7.7, 8.8, 9.9], [10.10, 11.11, 12.12]])]
Then you can use the toDF() method, which will infer the data types for you: in this case, string and array<array<double>>.
>>> sc.parallelize(data).toDF(["msn", "Input_Tensor"])
DataFrame[msn: string, Input_Tensor: array<array<double>>]
The end result:
>>> sc.parallelize(data).toDF(["msn", "Input_Tensor"]).show(truncate=False)
+----+---------------------------------------+
|msn |Input_Tensor |
+----+---------------------------------------+
|row1|[[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]] |
|row2|[[7.7, 8.8, 9.9], [10.1, 11.11, 12.12]]|
+----+---------------------------------------+
However, you can still use the createDataFrame method if the tensor is defined as a nested array of doubles in the schema.
from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType, StringType
data = [("row1", [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]), ("row2", [[7.7, 8.8, 9.9], [10.10, 11.11, 12.12]])]
rdd = sc.parallelize(data)
schema = StructType([
    StructField("msn", StringType(), True),
    StructField("Input_Tensor", ArrayType(ArrayType(DoubleType())), True)])
spark.createDataFrame(rdd, schema=schema).show(truncate=False)
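Separately, the traceback itself is a PicklingError, which points at serializing the Python closure rather than the schema. As a hypothetical sanity check (reusing the test_data and vectorizer names from the question), forcing the mapPartitions step on its own will show whether the closure is the culprit:
# Hypothetical check: evaluate the mapPartitions step by itself.
# If this raises the same PicklingError, the closure (e.g. the vectorizer
# object) is what cannot be serialized, not the createDataFrame call.
rdd = test_data.mapPartitions(lambda part: vectorizer.transform(part))
print(rdd.take(1))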

Related

Write parquet from another parquet with a new schema using pyspark

I am using PySpark DataFrames. I want to read a parquet file and write it with a different schema from the original file.
The original schema is (it has 9,000 variables; I am only including the first 5 for the example):
[('id', 'string'),
('date', 'string'),
('option', 'string'),
('cel1', 'string'),
('cel2', 'string')]
And I want to write:
[('id', 'integer'),
('date', 'integer'),
('option', 'integer'),
('cel1', 'integer'),
('cel2', 'integer')]
My code is:
df = sqlContext.read.parquet("PATH")
### SOME OPERATIONS ###
write_schema = StructType([StructField('id', IntegerType(), True),
                           StructField('date', IntegerType(), True),
                           StructField('option', IntegerType(), True),
                           StructField('cel1', IntegerType(), True),
                           StructField('cel2', IntegerType(), True)])
df.option("schema",write_schema).write("PATH")
After I run it, I still have the same schema as the original data: everything is string; the schema did not change.
Also I tried using
df = sqlContext.read.option("schema",write_schema).parquet(PATH)
This option does not change the schema when I read it; it still shows the original one, so I use (as suggested here):
df = sqlContext.read.schema(write_schema).parquet(PATH)
This one works for the reading part; if I check the types I get:
df.dtypes
#>>[('id', 'int'),
# ('date', 'int'),
# ('option', 'int'),
# ('cel1', 'int'),
# ('cel2', 'int')]
But when I try to write the parquet I get an error:
Parquet column cannot be converted. Column: [id], Expected: IntegerType, Found: BINARY
Cast your columns to int and then try writing to another parquet file. No schema specification needed.
df = spark.read.parquet("filepath")
df2 = df.select(*map(lambda col: df[col].cast('int'), df.columns))
df2.write.parquet("filepath")
For this you can actually enforce the schema right when reading the data. You can modify the code as follows:
df = sqlContext.read.schema(write_schema).parquet("PATH")
df.write.parquet("NEW_PATH")

Pyspark: Unable to turn RDD into DataFrame due to data type str instead of StringType

I'm doing some complex operations in PySpark where the last operation is a flatMap that yields an object of type pyspark.rdd.PipelinedRDD whose content is simply a list of strings:
print(output_data.take(8))
> ['a', 'abc', 'a', 'aefgtr', 'bcde', 'bc', 'bhdsjfk', 'b']
I'm starting my Spark session like this (a local session for testing):
spark = SparkSession.builder.appName("my_app")\
.config('spark.sql.shuffle.partitions', '2').master("local").getOrCreate()
My input data looks like this:
input_data = (('a', ('abc', [[('abc', 23)], 23, False, 3])),
('a', ('abcde', [[('abcde', 17)], 17, False, 5])),
('a', ('a', [[('a', 66)], 66, False, 1])),
('a', ('aefgtr', [[('aefgtr', 65)], 65, False, 6])),
('b', ('bc', [[('bc', 25)], 25, False, 2])),
('b', ('bcde', [[('bcde', 76)], 76, False, 4])),
('b', ('b', [[('b', 13)], 13, False, 1])),
('b', ('bhdsjfk', [[('bhdsjfk', 36)], 36, False, 7])))
input_data = sc.parallelize(input_data)
I want to turn that output RDD into a DataFrame with one column like this:
schema = StructType([StructField("term", StringType())])
df = spark.createDataFrame(output_data, schema=schema)
This doesn't work; I'm getting this error:
TypeError: StructType can not accept object 'a' in type <class 'str'>
So I tried it without schema and got this error:
TypeError: Can not infer schema for type: <class 'str'>
EDIT: The same error happens when trying toDF().
So for some reason I have a pyspark.rdd.PipelinedRDD whose elements are not StringType but standard Python str.
I'm relatively new to PySpark, so can someone enlighten me on why this might be happening?
I'm surprised PySpark isn't able to implicitly cast str to StringType.
I can't post the entire code; suffice it to say I'm doing some complex stuff with strings, including string comparison and for-loops. I'm not explicitly typecasting anything, though.
One solution would be to convert your RDD of strings into an RDD of Rows as follows:
from pyspark.sql import Row
df = spark.createDataFrame(output_data.map(lambda x: Row(x)), schema=schema)
# or with a simple list of names as a schema
df = spark.createDataFrame(output_data.map(lambda x: Row(x)), schema=['term'])
# or even use `toDF`:
df = output_data.map(lambda x: Row(x)).toDF(['term'])
# or another variant
df = output_data.map(lambda x: Row(term=x)).toDF()
Interestingly, as you mention, specifying a schema for an RDD of a raw type like string does not work. Yet if you only specify the type, it works, but you cannot specify the column name. Another approach is therefore to do just that and then rename the resulting column, which is called value, like this:
from pyspark.sql import functions as F
df = spark.createDataFrame(output_data, StringType())\
.select(F.col('value').alias('term'))
# or similarly
df = spark.createDataFrame(output_data, "string")\
.select(F.col('value').alias('term'))

How to replace a string value to int in Spark Dataset?

For example, input data:
1.0
\N
Schema:
val schema = StructType(Seq(
StructField("value", DoubleType, false)
))
Read into Spark Dataset:
val df = spark.read.schema(schema)
.csv("/path to csv file ")
When I use this Dataset, I get an exception because "\N" is invalid for a double. How can I replace "\N" with 0.0 throughout this dataset? Thanks.
If the data is malformed, don't use a schema with an inappropriate type. Define the input as StringType:
val schema = StructType(Seq(
StructField("value", StringType, false)
))
and cast data later:
val df = spark.read.schema(schema).csv("/path/to/csv/file")
.withColumn("value", $"value".cast("double"))
.na.fill(0.0)

How to create an empty DataFrame? Why "ValueError: RDD is empty"?

I am trying to create an empty DataFrame in Spark (PySpark).
I am using a similar approach to the one discussed here, but it is not working.
This is my code
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This is the error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
first = rdd.first()
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
raise ValueError("RDD is empty")
ValueError: RDD is empty
Extending Joe Widen's answer, you can actually create the schema with no fields like so:
schema = StructType([])
So when you create the DataFrame using that as your schema, you'll end up with a DataFrame[].
>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
DataFrame[]
>>> empty.schema
StructType(List())
In Scala, if you choose to use sqlContext.emptyDataFrame and check out the schema, it will return StructType().
scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []
scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()
At the time this answer was written, it looks like you need some sort of schema:
from pyspark.sql.types import *
field = [StructField("field1", StringType(), True)]
schema = StructType(field)
sc = spark.sparkContext
sqlContext.createDataFrame(sc.emptyRDD(), schema)
This will work with Spark version 2.0.0 or later:
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
spark.range(0).drop("id")
This creates a DataFrame with an "id" column and no rows, then drops the "id" column, leaving you with a truly empty DataFrame.
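A quick illustration of what that one-liner produces (the variable name here is just for the example):
# spark.range(0) gives a single "id" column with no rows; dropping "id"
# leaves a DataFrame with no columns and no rows.
truly_empty = spark.range(0).drop("id")
truly_empty.printSchema()   # prints only "root"
print(truly_empty.count())  # 0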
You can just use something like this:
pivot_table = sparkSession.createDataFrame([("99","99")], ["col1","col2"])
If you want an empty DataFrame based on an existing one, simply limit the rows to 0.
In PySpark :
emptyDf = existingDf.limit(0)
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType
spark = SparkSession.builder.appName('SparkPractice').getOrCreate()
schema = StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(),schema)
df.printSchema()
This is a roundabout but simple way to create an empty Spark DataFrame with an inferred schema:
from pyspark.sql.functions import col

# Initialize a Spark df using one row of data with the desired schema
init_sdf = spark.createDataFrame([('a_string', 0, 0)], ['name', 'index', 'seq_#'])
# Remove the row; this leaves the schema
empty_sdf = init_sdf.where(col('name') == 'not_match')
empty_sdf.printSchema()
# Output
root
|-- name: string (nullable = true)
|-- index: long (nullable = true)
|-- seq_#: long (nullable = true)
Seq.empty[String].toDF()
This will create an empty DataFrame. Helpful for testing purposes and such. (Scala Spark)
In Spark 3.1.2, the spark.sparkContext.emptyRDD() function throws an error. Passing an empty list together with the schema works instead:
df = spark.createDataFrame([], schema)
You can do it by loading an empty file (parquet, json etc.) like this:
df = sqlContext.read.json("my_empty_file.json")
Then when you try to check the schema you'll see:
>>> df.printSchema()
root
In Scala/Java, not passing a path should work too; in Python it throws an exception. Also, if you ever switch to Scala or Python, you can use this method to create one.
You can create an empty data frame by using the following syntax in PySpark:
df = spark.createDataFrame([], ["col1", "col2", ...])
where [] represents the empty data for col1 and col2. Then you can register it as a temp view for your SQL queries:
df.createOrReplaceTempView("artist")
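As a small, hypothetical follow-up, the registered view can then be queried with spark.sql (assuming the session and view name above):
# Query the temp view registered above; the result set is empty, but the
# column headers come back.
spark.sql("SELECT * FROM artist").show()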

Why does dropna() not work?

System: Spark 1.3.0 (Anaconda Python dist.) on Cloudera Quickstart VM 5.4
Here's a Spark DataFrame:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
data = sc.parallelize([('Foo',41,'US',3),
                       ('Foo',39,'UK',1),
                       ('Bar',57,'CA',2),
                       ('Bar',72,'CA',3),
                       ('Baz',22,'US',6),
                       (None,75,None,7)])
schema = StructType([StructField('Name', StringType(), True),
                     StructField('Age', IntegerType(), True),
                     StructField('Country', StringType(), True),
                     StructField('Score', IntegerType(), True)])
df = sqlContext.createDataFrame(data, schema)
df.show()
Name Age Country Score
Foo 41 US 3
Foo 39 UK 1
Bar 57 CA 2
Bar 72 CA 3
Baz 22 US 6
null 75 null 7
However, neither of these works!
df.dropna()
df.na.drop()
I get this message:
>>> df.show()
Name Age Country Score
Foo 41 US 3
Foo 39 UK 1
Bar 57 CA 2
Bar 72 CA 3
Baz 22 US 6
null 75 null 7
>>> df.dropna().show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 580, in __getattr__
jc = self._jdf.apply(name)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.apply.
: org.apache.spark.sql.AnalysisException: Cannot resolve column name "dropna" among (Name, Age, Country, Score);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Has anybody else experienced this problem? What's the workaround? PySpark seems to think that I am looking for a column called "na". Any help would be appreciated!
tl;dr The methods na and dropna are only available since Spark 1.3.1.
A few mistakes you made:
In data = sc.parallelize([.... ('',75,'', 7)]), you intended to use '' to represent None; however, it's just a string rather than a null.
na and dropna are both methods on the DataFrame class, so you should call them on your df.
Runnable code:
data = sc.parallelize([('Foo',41,'US',3),
                       ('Foo',39,'UK',1),
                       ('Bar',57,'CA',2),
                       ('Bar',72,'CA',3),
                       ('Baz',22,'US',6),
                       (None, 75, None, 7)])
schema = StructType([StructField('Name', StringType(), True),
                     StructField('Age', IntegerType(), True),
                     StructField('Country', StringType(), True),
                     StructField('Score', IntegerType(), True)])
df = sqlContext.createDataFrame(data,schema)
df.dropna().show()
df.na.drop().show()
I realize that the question was asked a year ago; I'm leaving the Scala solution below in case someone lands here looking for the same thing.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = sc.parallelize(List(
  ("Foo", 41, "US", 3), ("Foo", 39, "UK", 1),
  ("Bar", 57, "CA", 2), ("Bar", 72, "CA", 3),
  ("Baz", 22, "US", 6), (None, 75, None, 7)))
val schema = StructType(Array(
  StructField("Name", StringType, true),
  StructField("Age", IntegerType, true),
  StructField("Country", StringType, true),
  StructField("Score", IntegerType, true)))
val dat = data.map(d => Row(d._1, d._2, d._3, d._4))
val df = sqlContext.createDataFrame(dat, schema)
df.na.drop()
Note:
The above solution will still fail to give the right result in Scala; I'm not sure what is different between the Scala and Python bindings here. na.drop works if the missing data is represented as null; it fails for "" and None. One alternative is to make use of the withColumn function to handle missing values of different forms, as sketched below.
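For reference, a minimal PySpark sketch of that withColumn idea, assuming the column names from the example above and that the missing strings show up as empty strings (an illustration, not the original author's code):
from pyspark.sql import functions as F

# Turn empty-string placeholders in the string columns into real nulls,
# then drop incomplete rows with na.drop().
cleaned = (df
    .withColumn("Name", F.when(F.col("Name") == "", None).otherwise(F.col("Name")))
    .withColumn("Country", F.when(F.col("Country") == "", None).otherwise(F.col("Country")))
    .na.drop())
cleaned.show()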
