Converting an RDD into a DataFrame: Int vs Double - apache-spark

Why is it possible to convert an RDD[Int] into a DataFrame using the implicit method
import sqlContext.implicits._
//Concatenate rows
val rdd1 = sc.parallelize(Array(4,5,6)).toDF()
rdd1.show()
rdd1: org.apache.spark.sql.DataFrame = [_1: int]
+---+
| _1|
+---+
| 4|
| 5|
| 6|
+---+
but RDD[Double] throws an error:
val rdd2 = sc.parallelize(Array(1.1,2.34,3.4)).toDF()
error: value toDF is not a member of org.apache.spark.rdd.RDD[Double]

Spark 2.x
In Spark 2.x, toDF uses SparkSession.implicits, which provides rddToDatasetHolder and localSeqToDatasetHolder for any type with an Encoder, so with
val spark: SparkSession = ???
import spark.implicits._
both:
Seq(1.1, 2.34, 3.4).toDF()
and
sc.parallelize(Seq(1.1, 2.34, 3.4)).toDF()
are valid.
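As a small usage note, toDF also accepts explicit column names; the name value below is just an illustration:
sc.parallelize(Seq(1.1, 2.34, 3.4)).toDF("value").show()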
Spark 1.x
It is not possible. Excluding Product types, SQLContext provides implicit conversions only for RDD[Int] (intRddToDataFrameHolder), RDD[Long] (longRddToDataFrameHolder) and RDD[String] (stringRddToDataFrameHolder). You can always map to RDD[Tuple1[Double]] first:
sc.parallelize(Seq(1.1, 2.34, 3.4)).map(Tuple1(_)).toDF()
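For completeness, here is a minimal Spark 1.x sketch (my addition, not from the original answer) that avoids the tuple wrapper by building the DataFrame from an RDD[Row] and an explicit schema; the column name value is just an illustration:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
// wrap each Double in a Row and attach a one-column schema
val rowRdd = sc.parallelize(Seq(1.1, 2.34, 3.4)).map(Row(_))
val schema = StructType(Array(StructField("value", DoubleType, nullable = false)))
sqlContext.createDataFrame(rowRdd, schema).show()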

Related

Spark read CSV - Not showing corrupt records

Spark has a PERMISSIVE mode for reading CSV files which stores the corrupt records in a separate column named _corrupt_record.
permissive -
Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column
called _corrupt_record
However, when I try the following example, I don't see any column named _corrupt_record. The records which don't match the schema appear as null.
data.csv
data
10.00
11.00
$12.00
$13
gaurang
code
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}
val schema = new StructType(Array(
new StructField("value", DecimalType(25,10), false)
))
val df = spark.read.format("csv")
.option("header", "true")
.option("mode", "PERMISSIVE")
.schema(schema)
.load("../test.csv")
schema
scala> df.printSchema()
root
|-- value: decimal(25,10) (nullable = true)
scala> df.show()
+-------------+
| value|
+-------------+
|10.0000000000|
|11.0000000000|
| null|
| null|
| null|
+-------------+
If I change the mode to FAILFAST, I get an error when I try to view the data.
Adding the _corrupt_record column as suggested by Andrew and Prateek resolved the issue.
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}
val schema = new StructType(Array(
new StructField("value", DecimalType(25,10), true),
new StructField("_corrupt_record", StringType, true)
))
val df = spark.read.format("csv")
.option("header", "true")
.option("mode", "PERMISSIVE")
.schema(schema)
.load("../test.csv")
querying Data
scala> df.show()
+-------------+---------------+
| value|_corrupt_record|
+-------------+---------------+
|10.0000000000| null|
|11.0000000000| null|
| null| $12.00|
| null| $13|
| null| gaurang|
+-------------+---------------+
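As a short usage follow-up (my own sketch, not part of the original post): with the extra column in place you can isolate the rows that failed to parse. Caching first sidesteps the restriction in newer Spark versions on queries that reference only the internal corrupt-record column.
// cache the parsed result so _corrupt_record can be queried freely
val parsed = df.cache()
// keep only the rows that did not match the schema
parsed.filter(parsed("_corrupt_record").isNotNull).show(false)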

Error when running a query involving ROUND function in spark sql

I am trying, in pyspark, to obtain a new column by rounding one column of a table to the precision specified, in each row, by another column of the same table, e.g., from the following table:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
I should be able to obtain the following result:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
In particular, I have tried the following code:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import (
StructType, StructField, FloatType, LongType,
IntegerType
)
pdDF = pd.DataFrame(columns=["Data", "Rounding"], data=[[3.141592, 3],
[0.577215, 1]])
mySchema = StructType([ StructField("Data", FloatType(), True),
StructField("Rounding", IntegerType(), True)])
spark = (SparkSession.builder
.master("local")
.appName("column rounding")
.getOrCreate())
df = spark.createDataFrame(pdDF,schema=mySchema)
df.show()
df.createOrReplaceTempView("df_table")
df_rounded = spark.sql("SELECT Data, Rounding, ROUND(Data, Rounding) AS Rounded_Column FROM df_table")
df_rounded.show()
but I get the following error:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'round(df_table.`Data`, df_table.`Rounding`)' due to data type mismatch: Only foldable Expression is allowed for scale arguments; line 1 pos 23;\n'Project [Data#0, Rounding#1, round(Data#0, Rounding#1) AS Rounded_Column#12]\n+- SubqueryAlias df_table\n +- LogicalRDD [Data#0, Rounding#1], false\n"
Any help would be deeply appreciated :)
With Spark SQL, Catalyst throws the following error in your run - Only foldable Expression is allowed for scale arguments
i.e. @param scale: the new scale to round to; this should be a constant int at runtime.
ROUND only expects a literal for the scale, so you can write custom code instead of going the spark-sql way.
EDIT:
With UDF,
val df = Seq(
(3.141592,3),
(0.577215,1)).toDF("Data","Rounding")
df.show()
df.createOrReplaceTempView("df_table")
import org.apache.spark.sql.functions._
def RoundUDF(customvalue:Double, customscale:Int):Double = BigDecimal(customvalue).setScale(customscale, BigDecimal.RoundingMode.HALF_UP).toDouble
spark.udf.register("RoundUDF", RoundUDF(_:Double,_:Int):Double)
val df_rounded = spark.sql("select Data, Rounding, RoundUDF(Data, Rounding) as Rounded_Column from df_table")
df_rounded.show()
Input:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
Output:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
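For completeness, the same rounding can be expressed through the DataFrame API instead of registering the UDF for SQL (a sketch along the lines of the answer above; the roundTo name is just an illustration):
import org.apache.spark.sql.functions.{col, udf}
// wrap the BigDecimal-based rounding in a column-level UDF
val roundTo = udf((value: Double, scale: Int) =>
  BigDecimal(value).setScale(scale, BigDecimal.RoundingMode.HALF_UP).toDouble)
df.withColumn("Rounded_Column", roundTo(col("Data"), col("Rounding"))).show()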

Table or view not found with registerTempTable

So I run the following on pyspark shell:
>>> data = spark.read.csv("annotations_000", header=False, mode="DROPMALFORMED", schema=schema)
>>> data.show(3)
+----------+--------------------+--------------------+---------+---------+--------+-----------------+
| item_id| review_id| text| aspect|sentiment|comments| annotation_round|
+----------+--------------------+--------------------+---------+---------+--------+-----------------+
|9999900031|9999900031/custom...|Just came back to...|breakfast| 3| null|ASE_OpeNER_round2|
|9999900031|9999900031/custom...|Just came back to...| staff| 3| null|ASE_OpeNER_round2|
|9999900031|9999900031/custom...|The hotel was loc...| noise| 2| null|ASE_OpeNER_round2|
+----------+--------------------+--------------------+---------+---------+--------+-----------------+
>>> data.registerTempTable("temp")
>>> df = sqlContext.sql("select first(item_id), review_id, first(text), concat_ws(';', collect_list(aspect)) as aspect from temp group by review_id")
>>> df.show(3)
+---------------------+--------------------+--------------------+--------------------+
|first(item_id, false)| review_id| first(text, false)| aspect|
+---------------------+--------------------+--------------------+--------------------+
| 100012|100012/tripadviso...|We stayed here la...| staff;room|
| 100013|100013/tripadviso...|We stayed for two...| breakfast|
| 100031|100031/tripadviso...|We stayed two nig...|noise;breakfast;room|
+---------------------+--------------------+--------------------+--------------------+
and it works perfectly with the shell sqlContext variable.
When I write it as a script:
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
sc = SparkContext(appName="AspectDetector")
spark = SparkSession(sc)
sqlContext = SQLContext(sc)
data.registerTempTable("temp")
df = sqlContext.sql("select first(item_id), review_id, first(text), concat_ws(';', collect_list(aspect)) as aspect from temp group by review_id")
and run it I get the following:
pyspark.sql.utils.AnalysisException: u'Table or view not found: temp;
line 1 pos 99'
How is that possible? Am I doing something wrong in the instantiation of sqlContext?
First, you will want to initialize Spark with Hive support, for example:
spark = SparkSession.builder \
.master("yarn") \
.appName("AspectDetector") \
.enableHiveSupport() \
.getOrCreate()
sqlContext = SQLContext(spark)
But instead of using sqlContext.sql(), you will want to use spark.sql() to run your query.
I found this confusing as well, but I think it is because when you call data.registerTempTable("temp") you are actually working in the Spark session rather than in the sqlContext. If you want to query a Hive table, you should still use sqlContext.sql().

How to Convert DataFrame to RDD using sql Context

I have created a DataFrame to read a CSV file using sqlContext, from which I need to convert a column of the table to an RDD and then to a dense Vector to perform matrix multiplication.
I am finding it difficult to do so.
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("inferSchema","true")
.load("/home/project/SparkRead/train.csv")
val result1 = sqlContext.sql("SELECT Sales from train").rdd
How do I convert it to a dense Vector?
You can convert DataFrame columns to a Vector using VectorAssembler. Check out the code below:
val df = spark.read.
format("com.databricks.spark.csv").
option("header","true").
option("inferSchema","true").
load("/tmp/train.csv")
// assuming input
// a,b,c,d
// 1,2,3,4
// 1,1,2,3
// 1,3,4,5
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val assembler = new VectorAssembler().
setInputCols(Array("a", "b", "c", "d")).
setOutputCol("vect")
val output = assembler.transform(df)
// show the result
output.show()
// +---+---+---+---+-----------------+
// | a| b| c| d| vect|
// +---+---+---+---+-----------------+
// | 1| 2| 3| 4|[1.0,2.0,3.0,4.0]|
// | 1| 1| 2| 3|[1.0,1.0,2.0,3.0]|
// | 1| 3| 4| 5|[1.0,3.0,4.0,5.0]|
// +---+---+---+---+-----------------+
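Since the original question is about getting dense vectors for matrix multiplication, here is a small follow-up sketch (my addition, not part of the original answer) that pulls the assembled column out as an RDD of dense vectors:
import org.apache.spark.ml.linalg.Vector
// extract the assembled column as an RDD of DenseVector
val vectorRdd = output.select("vect").rdd.map(_.getAs[Vector]("vect").toDense)
vectorRdd.take(3).foreach(println)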

Get first non-null values in group by (Spark 1.6)

How can I get the first non-null values from a group by? I tried using first with coalesce F.first(F.coalesce("code")) but I don't get the desired behavior (I seem to get the first row).
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([
("a", None, None),
("a", "code1", None),
("a", "code2", "name2"),
], ["id", "code", "name"])
I tried:
(df
.groupby("id")
.agg(F.first(F.coalesce("code")),
F.first(F.coalesce("name")))
.collect())
DESIRED OUTPUT
[Row(id='a', code='code1', name='name2')]
For Spark 1.3 - 1.5, this could do the trick:
from pyspark.sql import functions as F
df.groupBy(df['id']).agg(F.first(df['code']), F.first(df['name'])).show()
+---+-----------+-----------+
| id|FIRST(code)|FIRST(name)|
+---+-----------+-----------+
| a| code1| name2|
+---+-----------+-----------+
Edit
Apparently, in version 1.6 they changed the way the first aggregate function is processed. Now, the underlying class First is constructed with a second ignoreNullsExpr parameter, which is not yet used by the first aggregate function (as can be seen here). However, in Spark 2.0 it will be possible to call agg(F.first(col, True)) to ignore nulls (as can be checked here).
Therefore, for Spark 1.6 the approach must be different and, unfortunately, a little less efficient. One idea is the following:
from pyspark.sql import functions as F
df1 = df.select('id', 'code').filter(df['code'].isNotNull()).groupBy(df['id']).agg(F.first(df['code']))
df2 = df.select('id', 'name').filter(df['name'].isNotNull()).groupBy(df['id']).agg(F.first(df['name']))
result = df1.join(df2, 'id')
result.show()
+---+-------------+-------------+
| id|first(code)()|first(name)()|
+---+-------------+-------------+
| a| code1| name2|
+---+-------------+-------------+
Maybe there is a better option. I'll edit the answer if I find one.
Because I only had one non-null value for every grouping, using min / max in 1.6 worked for my purposes:
(df
.groupby("id")
.agg(F.min("code"),
F.min("name"))
.show())
+---+---------+---------+
| id|min(code)|min(name)|
+---+---------+---------+
| a| code1| name2|
+---+---------+---------+
The first method accepts an argument ignorenulls, which can be set to true:
Python:
df.groupby("id").agg(first(col("code"), ignorenulls=True).alias("code"))
Scala:
df.groupBy("id").agg(first(col("code"), ignoreNulls = true).alias("code"))
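Putting it together for the example above (a sketch assuming Spark 2.0+, where first supports ignoring nulls, and an equivalent df in Scala):
import org.apache.spark.sql.functions.{col, first}
// take the first non-null value per column within each group
df.groupBy("id")
  .agg(
    first(col("code"), ignoreNulls = true).alias("code"),
    first(col("name"), ignoreNulls = true).alias("name"))
  .show()
// yields the desired result: id = a, code = code1, name = name2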
