Different floating point precision from RDD and DataFrame - apache-spark

I converted an RDD to a DataFrame and compared the results with another DataFrame that I loaded using read.csv, but the floating point precision is not the same between the two approaches. I appreciate your help.
The data I am using is from here.
from pyspark.sql import Row
from pyspark.sql.types import *
RDD way
orders = sc.textFile("retail_db/orders")
order_items = sc.textFile('retail_db/order_items')
orders_comp = orders.filter(lambda line: ((line.split(',')[-1] == 'CLOSED') or (line.split(',')[-1] == 'COMPLETE')))
orders_compMap = orders_comp.map(lambda line: (int(line.split(',')[0]), line.split(',')[1]))
order_itemsMap = order_items.map(lambda line: (int(line.split(',')[1]),
                                               (int(line.split(',')[2]), float(line.split(',')[4]))))
joined = orders_compMap.join(order_itemsMap)
joined2 = joined.map(lambda line: ((line[1][0], line[1][1][0]), line[1][1][1]))
joined3 = joined2.reduceByKey(lambda a, b : a +b).sortByKey()
df1 = joined3.map(lambda x:Row(date = x[0][0], product_id = x[0][1], total = x[1])).toDF().select(['date','product_id', 'total'])
DataFrame
schema = StructType([StructField('order_id', IntegerType(), True),
                     StructField('date', StringType(), True),
                     StructField('customer_id', StringType(), True),
                     StructField('status', StringType(), True)])
orders2 = spark.read.csv("retail_db/orders",schema = schema)
schema = StructType([StructField('item_id', IntegerType(), True),
                     StructField('order_id', IntegerType(), True),
                     StructField('product_id', IntegerType(), True),
                     StructField('quantity', StringType(), True),
                     StructField('sub_total', FloatType(), True),
                     StructField('product_price', FloatType(), True)])
orders_items2 = spark.read.csv("retail_db/order_items", schema = schema)
orders2.registerTempTable("orders2t")
orders_items2.registerTempTable("orders_items2t")
df2 = spark.sql('select o.date, oi.product_id, sum(oi.sub_total) as total \
                 from orders2t as o inner join orders_items2t as oi \
                 on o.order_id = oi.order_id \
                 where o.status in ("CLOSED", "COMPLETE") \
                 group by o.date, oi.product_id \
                 order by o.date, oi.product_id')
Are they the same?
df1.registerTempTable("df1t")
df2.registerTempTable("df2t")
spark.sql("select d1.total - d2.total as difference from df1t as d1 inner
join df2t as d2 on d1.date = d2.date \
and d1.product_id =d2.product_id ").show(truncate = False)

Ignoring loss of precision in conversions, they are not the same.
Python
According to Python's Floating Point Arithmetic: Issues and Limitations, standard implementations use a 64-bit representation:
Almost all machines today (November 2000) use IEEE-754 floating point arithmetic, and almost all platforms map Python floats to IEEE-754 “double precision”. 754 doubles contain 53 bits of precision,
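For illustration (the value below is arbitrary, not taken from the data set), you can see the gap between the 32-bit and 64-bit representations from plain Python by round-tripping a number through a 32-bit float with the struct module:
import struct
value = 129.99                                          # a typical two-decimal amount
as_float32 = struct.unpack('f', struct.pack('f', value))[0]
print(value)        # 129.99
print(as_float32)   # ~129.99000549316406 - the nearest 32-bit float is not exactly 129.99
Summing thousands of values like this accumulates error differently in 32-bit and 64-bit arithmetic, which is why the two totals diverge.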
Spark SQL
In Spark SQL, FloatType uses a 32-bit representation:
FloatType: Represents 4-byte single-precision floating point numbers.
Using DoubleType might be closer:
DoubleType: Represents 8-byte double-precision floating point numbers.
but if predictable behavior is important you should use DecimalType with a well-defined precision.
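For example, here is a sketch of the order_items schema from the question with the floating point columns switched to DoubleType and DecimalType (the precision and scale passed to DecimalType are illustrative choices, not values taken from the data set):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType, DecimalType
schema = StructType([StructField('item_id', IntegerType(), True),
                     StructField('order_id', IntegerType(), True),
                     StructField('product_id', IntegerType(), True),
                     StructField('quantity', StringType(), True),
                     StructField('sub_total', DoubleType(), True),             # 8-byte double, matches Python floats
                     StructField('product_price', DecimalType(12, 2), True)])  # exact decimal with a fixed scale
orders_items2 = spark.read.csv("retail_db/order_items", schema = schema)
With sub_total read as DoubleType, the summed totals should be much closer to the RDD version, as noted above.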

Related

pyspark json not able to inferschema for empty

In PySpark, whenever I read a JSON file with an empty set element, the entire element is ignored in the resulting DataFrame.
Sample json :
{logs :[],pagination:{}}
And it only ignores the second element, i.e. pagination in the above example. Is there any way to read the JSON with a proper schema?
Yes, you can do this in two ways: with a schema and without a schema.
Reading JSON with a schema:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType,LongType
schema = StructType([StructField('email', StringType(), True),
                     StructField('first_name', StringType(), True),
                     StructField('gender', StringType(), True),
                     StructField('id', LongType(), True),
                     StructField('last_name', StringType(), True)])
df = spark.read.schema(schema).json(r'dbfs:/FileStore/MOCK_DATA__1_.json')
Reading JSON without a schema:
d1 = spark.read.json(r'dbfs:/FileStore/MOCK_DATA__1_.json')
d1.show()
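For the sample in the question itself, an explicit schema can also preserve the empty logs and pagination elements instead of dropping them. A minimal sketch, assuming logs is an array of strings and pagination is a string-to-string map (both element types are guesses, since the sample values are empty, and the file path is a placeholder):
from pyspark.sql.types import StructType, StructField, ArrayType, MapType, StringType
schema = StructType([StructField('logs', ArrayType(StringType()), True),
                     StructField('pagination', MapType(StringType(), StringType()), True)])
df = spark.read.schema(schema).json('/path/to/sample.json')
df.printSchema()   # both columns are present even when the JSON values are empty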

Why is Spark turning certain columns of a CSV file into nulls when there is data in those columns?

I am trying to read a JSON file and convert it to CSV in PySpark as below.
df = spark.read.json(inputdir)
I have the below schema which I am imposing on my dataframe.
mechanic_schema = StructType([
    StructField("name", StringType(), True),
    StructField("some_other_column", StringType(), True),
    StructField("url", StringType(), True),
    StructField("image", StringType(), True),
    StructField("startTime", StringType(), True),
    StructField("recipeYield", StringType(), True),
    StructField("datePublished", StringType(), True),
    StructField("endTime", StringType(), True),
    StructField("description", StringType(), True)
])
I am saving the dataframe: df in an output directory as below.
df.select(mechanic_schema.names).write.format('csv').option("header","true").save('/Users/bobby/Desktop/output/', header='true')
This is how the output of df.show() looks (screenshot omitted).
Now, in another script, I am reading the same CSV file that I saved to the output path of df, as below:
df = spark.read.format('csv').option('header', True).load('/Users/bobby/Desktop/output/')
df.show()
But strangely, many of the columns in the output come back as nulls.
So I checked my output CSV file and the data looks exactly fine there.
I have never come across this phenomenon until now and don't understand what I did wrong here.
Could anyone let me know what is causing this issue and how I can fix this problem?
Any help is appreciated.
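One thing worth checking (a guess, not something confirmed in the question) is whether free-text columns such as description contain commas or newlines; if the write and the read do not agree on quoting and escaping, rows shift and columns come back as nulls. A sketch of the same round trip with those standard Spark CSV options made explicit:
# Write with explicit quote/escape settings.
df.select(mechanic_schema.names).write \
    .format('csv') \
    .option('header', 'true') \
    .option('quote', '"') \
    .option('escape', '"') \
    .save('/Users/bobby/Desktop/output/')
# Read it back with the same settings; multiLine handles quoted values that span lines.
df_back = spark.read.format('csv') \
    .option('header', 'true') \
    .option('quote', '"') \
    .option('escape', '"') \
    .option('multiLine', 'true') \
    .load('/Users/bobby/Desktop/output/')
df_back.show()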

How does the Databricks Delta Lake `mergeSchema` option handle differing data types?

What does the Databricks Delta Lake mergeSchema option do if a pre-existing column is appended with a different data type?
For example, given a Delta Lake table with schema foo INT, bar INT, what would happen when trying to write-append new data with schema foo INT, bar DOUBLE when specifying the option mergeSchema = true?
The write fails. (as of Delta Lake 0.5.0 on Databricks 6.3)
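For reference, a sketch of the kind of append that triggers this (the table path and DataFrame name are placeholders, not from the question):
# Existing Delta table schema: foo INT, bar INT. New data: foo INT, bar DOUBLE.
new_data.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/delta/events")
# mergeSchema lets the append add new columns, but it does not change the type of an
# existing column, so this write fails as described above.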
I think this is what you are looking for.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.spark.sql.functions.input_file_name
val customSchema = StructType(Array(
  StructField("field1", StringType, true),
  StructField("field2", StringType, true),
  StructField("field3", StringType, true),
  StructField("field4", StringType, true),
  StructField("field5", StringType, true),
  StructField("field6", StringType, true),
  StructField("field7", StringType, true)))
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("sep", "|")
  .schema(customSchema)
  .load("mnt/rawdata/corp/ABC*.gz")
  .withColumn("file_name", input_file_name())
Just replace 'field1', 'field2', etc. with your actual field names. Also, 'ABC*.gz' does a wildcard search for files beginning with a specific string, like 'abc' or whatever, where the '*' character matches any combination of characters up to the '.gz', which indicates a gzipped file. Yours could be different, of course, so just change that convention to meet your specific needs.

How to create spark dataframe with column name which contains dot/period?

I have data in a list and want to convert it to a spark dataframe with one of the column names containing a "."
I wrote the below code which ran without any errors.
input_data = [('retail', '2017-01-03T13:21:00', 134),
              ('retail', '2017-01-03T13:21:00', 100)]
rdd_schema = StructType([StructField('business', StringType(), True),
                         StructField('date', StringType(), True),
                         StructField("`US.sales`", FloatType(), True)])
input_mock_df = spark.createDataFrame(input_mock_rdd_map, rdd_schema)
The below code returns the column names
input_mock_df.columns
But any operation on this dataframe gives an error, for example:
input_mock_df.count()
How do I make a valid spark dataframe which contains a "."?
Note:
If I don't give "." in the column name, the code works perfectly.
I want to solve it using native Spark and not use pandas etc.
I have run the below code
input_data = [('retail', '2017-01-03T13:21:00', 134),
              ('retail', '2017-01-03T13:21:00', 100)]
rdd_schema = StructType([StructField('business', StringType(), True),
                         StructField('date', StringType(), True),
                         StructField("US.sales", IntegerType(), True)])
input_mock_df = sqlContext.createDataFrame(input_data, rdd_schema)
input_mock_df.count()
and it works fine, returning the count as 2. Please try it and reply.
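One more note: whatever the column type, once a column name contains a dot you generally need to wrap it in backticks when you refer to it, otherwise Spark parses the dot as a struct field access. A small sketch against the DataFrame built above:
from pyspark.sql.functions import col
# Backticks tell Spark the dot is part of the name, not a nested field reference.
input_mock_df.select(col('`US.sales`')).show()
input_mock_df.filter(col('`US.sales`') > 100).show()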

Pyspark: Transforming PythonRDD to Dataframe

Could someone guide me on converting a PythonRDD to a DataFrame?
As per my understanding, reading a file should create a DataFrame, but in my case it has created a PythonRDD. I am finding it hard to convert the PythonRDD to a DataFrame, and could not find createDataFrame() or toDF().
Please find below my code to read a tab-separated text file:
rdd1 = sparkCxt.textFile(setting.REFRESH_HDFS_DIR + "/Refresh")
rdd2 = rdd1.map(lambda row: unicode(row).lower().strip()
                if type(row) == unicode else row)
Now, I would want to convert PythonRDD to a DF.
I wanted to convert to DF to map the schema, so that I could do further processing at column level.
Also, please suggest if you think there is a better approach.
Please reply if more details are required.
Thank you.
Spark DataFrames can be created directly from a text file, but you should use sqlContext instead of sc (SparkContext), since sqlContext is an entry point for working with DataFrames.
df = sqlContext.read.text('path/to/my/file')
This will create a DataFrame with a single column named value. You can use UDF functions to split it into required columns.
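For instance, here is a sketch that splits the single value column on tabs with the built-in split function instead of a UDF (the output column names are made up for illustration):
from pyspark.sql.functions import split
df = sqlContext.read.text('path/to/my/file')
parts = split(df['value'], '\t')
df2 = df.select(parts.getItem(0).alias('col1'),
                parts.getItem(1).alias('col2'),
                parts.getItem(2).alias('col3'))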
Another approach would be to read the text files to an RDD, split it into columns using map, reduce, filter and other operations, and then convert the final RDD to a DataFrame.
For example, let's say we have an RDD named my_rdd with the following structure:
[(1, 'Alice', 23), (2, 'Bob', 25)]
We can easily convert it to a DataFrame:
df = sqlContext.createDataFrame(my_rdd, ['id', 'name', 'age'])
where id, name and age are names for our columns.
You can try using toPandas(), although you should be cautious when doing so, since converting an RDD to a pandas DataFrame brings all the data into memory, which might cause an OOM error if your distributed data is large.
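A minimal sketch of that route, reusing the my_rdd example above; toPandas() is a DataFrame method, so the RDD is first turned into a DataFrame with toDF():
# Convert the RDD to a Spark DataFrame, then collect it to the driver as a pandas DataFrame.
spark_df = my_rdd.toDF(['id', 'name', 'age'])
pandas_df = spark_df.toPandas()   # materialises every row on the driver
print(pandas_df)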
I would use the Spark-csv package (Spark-csv Github) and import directly into a dataframe after defining the schema.
For example:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
customSchema = StructType([
    StructField("year", IntegerType(), True),
    StructField("make", StringType(), True),
    StructField("model", StringType(), True),
    StructField("comment", StringType(), True),
    StructField("blank", StringType(), True)])
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .load('cars.csv', schema = customSchema)
This defaults to a comma for the delimiter, but you can change that to a tab with something like:
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true', delimiter='\t') \
    .load('cars.csv', schema = customSchema)
Note that it is possible to infer the schema using another option, but this does require reading the entire file prior to loading the dataframe.
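For example, a sketch of the same read with schema inference turned on instead of the explicit customSchema (assuming spark-csv's inferschema option):
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true', delimiter='\t') \
    .load('cars.csv')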
