System: Spark 1.3.0 (Anaconda Python dist.) on Cloudera Quickstart VM 5.4
Here's a Spark DataFrame:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
data = sc.parallelize([('Foo',41,'US',3),
                       ('Foo',39,'UK',1),
                       ('Bar',57,'CA',2),
                       ('Bar',72,'CA',3),
                       ('Baz',22,'US',6),
                       (None,75,None,7)])
schema = StructType([StructField('Name', StringType(), True),
                     StructField('Age', IntegerType(), True),
                     StructField('Country', StringType(), True),
                     StructField('Score', IntegerType(), True)])
df = sqlContext.createDataFrame(data,schema)
data.show()
Name Age Country Score
Foo 41 US 3
Foo 39 UK 1
Bar 57 CA 2
Bar 72 CA 3
Baz 22 US 6
null 75 null 7
However, neither of these works!
df.dropna()
df.na.drop()
I get this message:
>>> df.show()
Name Age Country Score
Foo 41 US 3
Foo 39 UK 1
Bar 57 CA 2
Bar 72 CA 3
Baz 22 US 6
null 75 null 7
>>> df.dropna().show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 580, in __getattr__
jc = self._jdf.apply(name)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.apply.
: org.apache.spark.sql.AnalysisException: Cannot resolve column name "dropna" among (Name, Age, Country, Score);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Has anybody else experienced this problem? What's the workaround? PySpark seems to think that I am looking for a column called "na". Any help would be appreciated!
tl;dr The methods na and dropna have only been available since Spark 1.3.1.
A few mistakes you made:
In data = sc.parallelize([... ('', 75, '', 7)]), you intended '' to represent None, but it is just an empty String, not null.
na and dropna are both methods on the DataFrame class, so you should call them on your df (not on the RDD data).
Runnable code:

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sqlContext = SQLContext(sc)

data = sc.parallelize([('Foo', 41, 'US', 3),
                       ('Foo', 39, 'UK', 1),
                       ('Bar', 57, 'CA', 2),
                       ('Bar', 72, 'CA', 3),
                       ('Baz', 22, 'US', 6),
                       (None, 75, None, 7)])  # None becomes a real null in the DataFrame

schema = StructType([StructField('Name', StringType(), True),
                     StructField('Age', IntegerType(), True),
                     StructField('Country', StringType(), True),
                     StructField('Score', IntegerType(), True)])

df = sqlContext.createDataFrame(data, schema)

df.dropna().show()   # drops any row that contains a null
df.na.drop().show()  # equivalent to dropna()
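Once you are on 1.3.1 or later, dropna also accepts how, thresh, and subset arguments; a small sketch using the same df:

# keep rows that have at least 3 non-null values
df.dropna(thresh=3).show()

# drop a row only when the Name column itself is null
df.dropna(subset=['Name']).show()

# drop a row only when every column is null
df.dropna(how='all').show()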
I realize the question was asked a year ago, but I'm leaving a Scala solution below in case someone lands here looking for the same thing.
val data = sc.parallelize(List(
  ("Foo", 41, "US", 3),
  ("Foo", 39, "UK", 1),
  ("Bar", 57, "CA", 2),
  ("Bar", 72, "CA", 3),
  ("Baz", 22, "US", 6),
  (None, 75, None, 7)))

val schema = StructType(Array(
  StructField("Name", StringType, true),
  StructField("Age", IntegerType, true),
  StructField("Country", StringType, true),
  StructField("Score", IntegerType, true)))

val dat = data.map(d => Row(d._1, d._2, d._3, d._4))
val df = sqlContext.createDataFrame(dat, schema)
df.na.drop()
Note:
The above solution will still fail to give the right result in Scala; I am not sure what differs between the Scala and Python bindings here. na.drop works when the missing data is represented as null; it fails for "" and for None (which is a Scala object, not a SQL null). One alternative is to use the withColumn function to normalize the different forms of missing values, as sketched below.
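As an example of that workaround, here is a minimal PySpark sketch (the same idea applies in Scala via org.apache.spark.sql.functions); it assumes the df built above and treats the empty string as the only extra missing-value marker:

from pyspark.sql import functions as F

# Normalize empty strings in the string columns to real nulls, then drop incomplete rows
cleaned = (df
           .withColumn('Name', F.when(F.col('Name') == '', None).otherwise(F.col('Name')))
           .withColumn('Country', F.when(F.col('Country') == '', None).otherwise(F.col('Country'))))

cleaned.na.drop().show()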
Related
I am trying to solve the problems from the O'Reilly book Learning Spark.
The part of the code below works fine:
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
# define schema for our data
schema = StructType([
    StructField("Id", IntegerType(), False),
    StructField("First", StringType(), False),
    StructField("Last", StringType(), False),
    StructField("Url", StringType(), False),
    StructField("Published", StringType(), False),
    StructField("Hits", IntegerType(), False),
    StructField("Campaigns", ArrayType(StringType()), False)])

# create our data
data = [
    [1, "Jules", "Damji", "https://tinyurl.1", "1/4/2016", 4535, ["twitter", "LinkedIn"]],
    [2, "Brooke", "Wenig", "https://tinyurl.2", "5/5/2018", 8908, ["twitter", "LinkedIn"]],
    [3, "Denny", "Lee", "https://tinyurl.3", "6/7/2019", 7659, ["web", "twitter", "FB", "LinkedIn"]],
    [4, "Tathagata", "Das", "https://tinyurl.4", "5/12/2018", 10568, ["twitter", "FB"]],
    [5, "Matei", "Zaharia", "https://tinyurl.5", "5/14/2014", 40578, ["web", "twitter", "FB", "LinkedIn"]],
    [6, "Reynold", "Xin", "https://tinyurl.6", "3/2/2015", 25568, ["twitter", "LinkedIn"]],
]

# main program
if __name__ == "__main__":
    # create a SparkSession
    spark = (SparkSession
             .builder
             .appName("Example-3_6")
             .getOrCreate())

    # create a DataFrame using the schema defined above
    blogs_df = spark.createDataFrame(data, schema)
But when I try to execute .show(), I get a Java error. Can somebody help me resolve this error?
blogs_df.show()
Error:
Py4JJavaError: An error occurred while calling o95.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (<>.<>.com executor driver): java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:165)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
At the same time, when I execute the code below, I get the result of df.show():
from pyspark.sql.types import StructType, IntegerType, StringType
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
schema = StructType() \
    .add("city", StringType(), True) \
    .add("state", StringType(), True) \
    .add("pop", IntegerType(), True)

df_with_schema1 = spark.read.format("csv") \
    .option("delimiter", ",") \
    .option("header", True) \
    .schema(schema) \
    .load("<directory>\\pyspark-test.csv")

df_with_schema1.show()
You most likely don't have Python installed properly, so Spark cannot find a python3 executable to launch its workers. Also give this a try:
spark = (SparkSession
.builder
.master('local[*]') # add this
.appName("Example-3_6")
.getOrCreate())
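If Python is installed but the executable is not called python3 on your PATH (an assumption about your environment, not something the trace proves), another common fix is to point Spark's workers at the interpreter you are already running, before the session is created:

import os
import sys

# PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON tell Spark which interpreter to launch
# for its Python workers; sys.executable is the interpreter running this script.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .master("local[*]")
         .appName("Example-3_6")
         .getOrCreate())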
I'm trying to convert an RDD back to a Spark DataFrame using the code below
schema = StructType(
[StructField("msn", StringType(), True),
StructField("Input_Tensor", ArrayType(DoubleType()), True)]
)
DF = spark.createDataFrame(rdd, schema=schema)
The dataset has only two columns:
msn, which contains only a string of characters.
Input_Tensor, a 2D array of floats.
But I keep getting this error and I'm not sure where it's coming from:
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/myproject/datasets/train.py", line 51, in EMA_detector
DF = spark.createDataFrame(rdd, schema=schema)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/sql/session.py", line 790, in createDataFrame
jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2364, in _to_java_object_rdd
return self.ctx._jvm.SerDeUtil.pythonToJava(rdd._jrdd, True)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2599, in _jrdd
self._jrdd_deserializer, profiler)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2500, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/rdd.py", line 2486, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/tmp/conda-d3f87356-6008-4349-9075-f488e0870d02/real/envs/conda-env/lib/python3.6/site-packages/pyspark/serializers.py", line 694, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: AttributeError: 'NoneType' object has no attribute 'items'
EDIT:
My RDD comes from this:
rdd = test_data.mapPartitions(lambda part: vectorizer.transform(part))
The dataset test_data is itself an RDD, but after the mapPartitions it becomes a PipelinedRDD, and that seems to be why it fails.
Assuming your rdd is defined by the following data:
data = [("row1", [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]), ("row2", [[7.7, 8.8, 9.9], [10.10, 11.11, 12.12]])]
Then you can use the toDF() method, which will infer the data types for you; in this case string and array<array<double>>:
>>> sc.parallelize(data).toDF(["msn", "Input_Tensor"])
DataFrame[msn: string, Input_Tensor: array<array<double>>]
The end result:
>>> sc.parallelize(data).toDF(["msn", "Input_Tensor"]).show(truncate=False)
+----+---------------------------------------+
|msn |Input_Tensor |
+----+---------------------------------------+
|row1|[[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]] |
|row2|[[7.7, 8.8, 9.9], [10.1, 11.11, 12.12]]|
+----+---------------------------------------+
However, you can still use the createDataFrame method if the tensor is defined as a nested array of doubles in the schema.
from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType, StringType
data = [("row1", [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]]), ("row2", [[7.7, 8.8, 9.9], [10.10, 11.11, 12.12]])]
rdd = sc.parallelize(data)
schema = StructType([
StructField("msn", StringType(), True),
StructField("Input_Tensor", ArrayType(ArrayType(DoubleType())), True)])
spark.createDataFrame(rdd, schema=schema).show(truncate=False)
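As a quick sanity check (a sketch; the exact tree formatting can differ slightly between Spark versions), printSchema should confirm the nested element type:

spark.createDataFrame(rdd, schema=schema).printSchema()
# root
#  |-- msn: string (nullable = true)
#  |-- Input_Tensor: array (nullable = true)
#  |    |-- element: array (containsNull = true)
#  |    |    |-- element: double (containsNull = true)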
I converted an RDD to a DataFrame and compared the results with another DataFrame that I imported using read.csv, but the floating-point precision is not the same between the two approaches. I appreciate your help.
The data I am using is from here.
from pyspark.sql import Row
from pyspark.sql.types import *
RDD way
orders = sc.textFile("retail_db/orders")
order_items = sc.textFile('retail_db/order_items')
orders_comp = orders.filter(lambda line: ((line.split(',')[-1] == 'CLOSED') or (line.split(',')[-1] == 'COMPLETE')))
orders_compMap = orders_comp.map(lambda line: (int(line.split(',')[0]), line.split(',')[1]))
order_itemsMap = order_items.map(lambda line: (int(line.split(',')[1]),
(int(line.split(',')[2]), float(line.split(',')[4])) ))
joined = orders_compMap.join(order_itemsMap)
joined2 = joined.map(lambda line: ((line[1][0], line[1][1][0]), line[1][1][1]))
joined3 = joined2.reduceByKey(lambda a, b : a +b).sortByKey()
df1 = joined3.map(lambda x:Row(date = x[0][0], product_id = x[0][1], total = x[1])).toDF().select(['date','product_id', 'total'])
DataFrame
schema = StructType([StructField('order_id', IntegerType(), True),
StructField('date', StringType(), True),
StructField('customer_id', StringType(), True),
StructField('status', StringType(), True)])
orders2 = spark.read.csv("retail_db/orders",schema = schema)
schema = StructType([StructField('item_id', IntegerType(), True),
StructField('order_id', IntegerType(), True),
StructField('product_id', IntegerType(), True),
StructField('quantity', StringType(), True),
StructField('sub_total', FloatType(), True),
StructField('product_price', FloatType(), True)])
orders_items2 = spark.read.csv("retail_db/order_items", schema = schema)
orders2.registerTempTable("orders2t")
orders_items2.registerTempTable("orders_items2t")
df2 = spark.sql('select o.date, oi.product_id, sum(oi.sub_total) as total \
                 from orders2t as o \
                 inner join orders_items2t as oi on o.order_id = oi.order_id \
                 where o.status in ("CLOSED", "COMPLETE") \
                 group by o.date, oi.product_id \
                 order by o.date, oi.product_id')
Are they the same?
df1.registerTempTable("df1t")
df2.registerTempTable("df2t")
spark.sql("select d1.total - d2.total as difference from df1t as d1 inner
join df2t as d2 on d1.date = d2.date \
and d1.product_id =d2.product_id ").show(truncate = False)
Ignoring loss of precision in conversions, they are not the same.
Python
According to Python's Floating Point Arithmetic: Issues and Limitations, standard implementations use a 64-bit representation:
Almost all machines today (November 2000) use IEEE-754 floating point arithmetic, and almost all platforms map Python floats to IEEE-754 “double precision”. 754 doubles contain 53 bits of precision,
Spark SQL
In Spark SQL, FloatType uses a 32-bit representation:
FloatType: Represents 4-byte single-precision floating point numbers.
Using DoubleType might be closer:
DoubleType: Represents 8-byte double-precision floating point numbers.
but if predictable behavior is important you should use DecimalType with a well-defined precision, as sketched below.
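For illustration, a minimal sketch (assuming an existing SparkSession named spark; the repeated price value is made up) of how the same sum drifts once the column is narrowed to a 32-bit float:

from pyspark.sql import functions as F

# A single repeated price; the float32 representation of 129.99 is slightly off,
# so the per-row rounding error accumulates in the float sum but not in the others.
df = spark.createDataFrame([(129.99,)] * 1000, ["price"])

df.select(
    F.sum(F.col("price").cast("float")).alias("float_total"),
    F.sum(F.col("price").cast("double")).alias("double_total"),
    F.sum(F.col("price").cast("decimal(10,2)")).alias("decimal_total"),
).show(truncate=False)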
I have data in a list and want to convert it to a Spark DataFrame with one of the column names containing a ".".
I wrote the code below, which ran without any errors.
input_data = [('retail', '2017-01-03T13:21:00', 134),
('retail', '2017-01-03T13:21:00', 100)]
rdd_schema = StructType([StructField('business', StringType(), True), \
StructField('date', StringType(), True), \
StructField("`US.sales`", FloatType(), True)])
input_mock_df = spark.createDataFrame(input_mock_rdd_map, rdd_schema)
The below code returns the column names
input_mock_df.columns
But any operation on this DataFrame gives an error, for example:
input_mock_df.count()
How do I make a valid spark dataframe which contains a "."?
Note:
If I don't put "." in the column name, the code works perfectly.
I want to solve it using native Spark, not pandas etc.
I have run the code below
input_data = [('retail', '2017-01-03T13:21:00', 134),
('retail', '2017-01-03T13:21:00', 100)]
rdd_schema = StructType([StructField('business', StringType(), True), \
StructField('date', StringType(), True), \
StructField("US.sales", IntegerType(), True)])
input_mock_df = sqlContext.createDataFrame(input_data, rdd_schema)
input_mock_df.count()
and it works fine, returning the count as 2. Please try it and reply.
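One extra note (based on Spark's column-resolution rules rather than anything in the question): once the DataFrame has a dotted column name, wrap it in backticks whenever you reference it, otherwise Spark parses the dot as struct-field access:

# Backticks make Spark treat the whole name, dot included, as a single column
input_mock_df.select("`US.sales`").show()
input_mock_df.where("`US.sales` > 100").show()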
I want to convert an RDD to a DataFrame using StructType, but the item "Broken,Line," causes an error. Is there an elegant way to process records like this? Thanks.
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
val mySchema = StructType(Array(
StructField("colA", StringType, true),
StructField("colB", StringType, true),
StructField("colC", StringType, true)))
val x = List("97573,Start,eee", "9713,END,Good", "Broken,Line,")
val inputx = sc.parallelize(x).
  map((x: String) => Row.fromSeq(x.split(",").slice(0, mySchema.size).toSeq))
val df = spark.createDataFrame(inputx, mySchema)
df.show
Error would be like this:
Name: org.apache.spark.SparkException Message: Job aborted due to
stage failure: Task 0 in stage 14.0 failed 1 times, most recent
failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor
driver): java.lang.RuntimeException: Error while encoding:
java.lang.ArrayIndexOutOfBoundsException: 2
I'm using:
Spark: 2.2.0
Scala: 2.11.8
And I ran the code in spark-shell.
Row.fromSeq, to which your schema is applied, throws the error you are getting: the third element of your list splits into only 2 fields, and it can't be turned into a Row of three elements unless you add a null (or placeholder) value for the missing field.
And when creating your DataFrame, Spark expects 3 elements per Row to apply the schema to, hence the error.
A quick and dirty solution is to use scala.util.Try to fetch the fields separately:
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
import scala.util.Try
val mySchema = StructType(Array(StructField("colA", StringType, true), StructField("colB", StringType, true), StructField("colC", StringType, true)))
val l = List("97573,Start,eee", "9713,END,Good", "Broken,Line,")
val rdd = sc.parallelize(l).map {
x => {
val fields = x.split(",").slice(0, mySchema.size)
val f1 = Try(fields(0)).getOrElse("")
val f2 = Try(fields(1)).getOrElse("")
val f3 = Try(fields(2)).getOrElse("")
Row(f1, f2, f3)
}
}
val df = spark.createDataFrame(rdd, mySchema)
df.show
// +------+-----+----+
// | colA| colB|colC|
// +------+-----+----+
// | 97573|Start| eee|
// | 9713| END|Good|
// |Broken| Line| |
// +------+-----+----+
I wouldn't say it's an elegant solution like you asked for; parsing strings by hand is never elegant! You ought to use the csv source to read the file correctly (or spark-csv for Spark < 2.x); a sketch follows.
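A sketch of that suggestion (shown in PySpark for brevity; the Scala DataFrameReader API has the same shape, and the path here is illustrative). In the default PERMISSIVE mode the csv source pads short rows with nulls instead of failing the job:

from pyspark.sql.types import StructType, StructField, StringType

my_schema = StructType([
    StructField("colA", StringType(), True),
    StructField("colB", StringType(), True),
    StructField("colC", StringType(), True)])

df = (spark.read
      .schema(my_schema)
      .option("mode", "PERMISSIVE")   # default: malformed/short rows become nulls
      .csv("/path/to/lines.csv"))

df.show()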