Reading Hive Tables in Spark Dataframe without header - apache-spark

I have the following Hive table:
select* from employee;
OK
abc 19 da
xyz 25 sa
pqr 30 er
suv 45 dr
when I read this in spark(pyspark):
df = hiveCtx.sql('select* from spark_hive.employee')
df.show()
+----+----+-----+
|name| age| role|
+----+----+-----+
|name|null| role|
| abc| 19| da|
| xyz| 25| sa|
| pqr| 30| er|
| suv| 45| dr|
+----+----+-----+
I end up getting the headers in my spark DataFrame. Is there a simple way to remove it ?
Also, am I missing something while reading the table into the DataFrame (Ideally I shouldn't be getting the header right ?) ?

You have to remove header from the result. You can do like this:
scala> val df = sql("select * from employee")
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> df.show
+----+----+----+
| id|name| age|
+----+----+----+
|null|name|null|
| 1| abc| 19|
| 2| xyz| 25|
| 3| pqr| 30|
| 4| suv| 45|
+----+----+----+
scala> val header = df.first()
header: org.apache.spark.sql.Row = [null,name,null]
scala> val data = df.filter(row => row != header)
data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, name: string ... 1 more field]
scala> data.show
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 19|
| 2| xyz| 25|
| 3| pqr| 30|
| 4| suv| 45|
+---+----+---+
Thanks.

you can use skip.header.line.count for skip this header. You could also specify the same while creating the table. For example:
create external table testtable ( id int,name string, age int)
row format delimited .............
tblproperties ("skip.header.line.count"="1");
after that load the data and then check your query I hope you will get expected output.

Not the Most elegant way, but this worked with pyspark:
rddWithoutHeader = dfemp.rdd.filter(lambda line: line!=header)
dfnew = sqlContext.createDataFrame(rddWithoutHeader)

Related

how to execute many expressions in the selectExpr

it is possible to apply many expression in the same selectExpr,
for example If I have this DF:
+---+
| i|
+---+
| 10|
| 15|
| 11|
| 56|
+---+
how to multiply by 2 and rename the column as this :
df.selectExpr("i*2 as multiplication")
def selectExpr(exprs: String*): org.apache.spark.sql.DataFrame
If you have many expressions you have to pass them comma separated strings. Please check below code.
scala> val df = (1 to 10).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> df.selectExpr("id*2 as twotimes", "id * 3 as threetimes").show
+--------+----------+
|twotimes|threetimes|
+--------+----------+
| 2| 3|
| 4| 6|
| 6| 9|
| 8| 12|
| 10| 15|
| 12| 18|
| 14| 21|
| 16| 24|
| 18| 27|
| 20| 30|
+--------+----------+
Yes, you can pass multiple expressions inside the df.selectExpr. https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame
scala> case class Person(name: String, lanme: String)
scala> val personDS = Seq(Person("Max", 1), Person("Adam", 2), Person("Muller", 3)).toDS()
scala > personDs.show(false)
+------+---+
|name |age|
+------+---+
|Max |1 |
|Adam |2 |
|Muller|3 |
+------+---+
scala> personDS.selectExpr("age*2 as multiple","name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+
Or else you can also use withColumn to achieve the same results
scala> personDS.withColumn("multiple",$"age"*2).select($"multiple",$"name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+

How to get a string representation of DataFrame (as does Dataset.show)?

I need a useful string representation of a Spark dataframe. The one I get with df.show is great -- but I can't get that output as a string because the internal showString method called by show is private. Is there some way I can get a similar output without writing a method to duplicate this same functionality?
showString is simply private[sql] that means that the code to access it has to be in the same package, i.e. org.apache.spark.sql.
The trick is to create a helper object that does belong to the org.apache.spark.sql package, but the single method we're about to create is not private (at any level).
I usually mimic what an instance method does with the very first input parameter as the target and the input parameters to match the target method.
package org.apache.spark.sql
object AccessShowString {
def showString[T](df: Dataset[T],
_numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = {
df.showString(_numRows, truncate, vertical)
}
}
TIP Use paste -raw to copy and paste the code in spark-shell.
Let's use showString then.
import org.apache.spark.sql.AccessShowString.showString
val df = spark.range(10)
scala> println(showString(df, 10))
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
If you really set on reusing existing code, you can access showString by reflection
scala> val df = spark.range(10)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> val showString = classOf[org.apache.spark.sql.DataFrame].getDeclaredMethod("showString", classOf[Int], classOf[Int], classOf[Boolean])
showString: java.lang.reflect.Method = public java.lang.String org.apache.spark.sql.Dataset.showString(int,int,boolean)
scala> showString.setAccessible(true)
scala> showString.invoke(df, 10.asInstanceOf[Object], 20.asInstanceOf[Object], false.asInstanceOf[Object]).asInstanceOf[String]
res2: String =
"+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
"

How to delete the first few rows in dataframe Scala/sSark?

I hava a DataFrame and I want to delete first and the second row. What should I do?
This is my input:
+-----+
|value|
+-----+
| 1|
| 4|
| 3|
| 5|
| 4|
| 18|
-------
This is the excepted result:
+-----+
|value|
+-----+
| 3|
| 5|
| 4|
| 18|
-------
In my opinion it does not make sense to speak about a first or second record if you cannot define an ordering of your dataframe. The ordering of the records as a result of the show statement is "arbitrary" and depends on partitioning of your data.
Suppose you have a column over which you can order your records, you can use Window-functions. Starting with this dataframe:
+----+-----+
|year|value|
+----+-----+
|2007| 1|
|2008| 4|
|2009| 3|
|2010| 5|
|2011| 4|
|2012| 18|
+----+-----+
You can do
import org.apache.spark.sql.expressions.Window
df
.withColumn("rn",row_number().over(Window.orderBy($"year")))
.where($"rn">2).drop($"rn")
.show
The simple and easy way is to assign a id for each row and filter it
val df = Seq(1,2,3,5,4,18).toDF("value")
df.withColumn("id", monotonically_increasing_id()).filter($"id" > 1).drop("id")
Edit: Since the monotonically_increasing_id() does not grantee consecutive You can use zipWithUniqueId as below
val rows = df.rdd.zipWithUniqueId().map {
case (row, id) => Row.fromSeq(row.toSeq :+ id)
}
val df1 = spark.createDataFrame(rows, StructType(df.schema.fields :+ StructField("id", LongType, false)))
df1.filter($"id" > 1).drop("id")
Output:
+-----+
|value|
+-----+
| 3|
| 5|
| 4|
| 18|
+-----+
This will also help you to drop the nth row in dataframe.
Hope this helps!

How to convert RDD[List[Int]] to DataFrame?

I hava a RDD[List[Int]] ,I don not know the count of list[Int],I want to convert i Rdd[List[Int]] to DataFrame,How should I do?
this is my input:
val l1=Array(1,2,3,4)
val l2=Array(1,2,3,4)
val Lz=Seq(l1,l2)
val rdd1=sc.parallelize(Lz,2)
this is my expect result:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 2| 3| 4|
| 1| 2| 3| 4|
+---+---+---+---+
There might be some other and better functional way to do this, but this works too:
def getSchema(myArray : Array[Int]): StructType = {
var schemaArray = scala.collection.mutable.ArrayBuffer[StructField]()
for((el,idx) <- myArray.view.zipWithIndex){
schemaArray += StructField("col"+idx , IntegerType, true)
}
StructType(schemaArray)
}
val l1=Array(1,2,3,4)
val l2=Array(1,2,3,4)
val Lz=Seq(l1,l2)
val rdd1=sc.parallelize(Lz,2).map(Row.fromSeq(_))
val schema = getSchema(l1) //Since both arrays will be of same type and size
val df = sqlContext.createDataFrame(rdd1, schema)
df.show()
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
| 1| 2| 3| 4|
| 1| 2| 3| 4|
+----+----+----+----+
You can do the following :
val l1=Array(1,2,3,4)
val l2=Array(1,2,3,4)
val Lz=Seq(l1,l2)
val df = sc.parallelize(Lz,2).map{
case Array(val1, val2, val3, val4) => (val1, val2, val3, val4)
}.toDF
df.show
// +---+---+---+---+
// | _1| _2| _3| _4|
// +---+---+---+---+
// | 1| 2| 3| 4|
// | 1| 2| 3| 4|
// +---+---+---+---+
If you have lots of columns, you would need to proceed differently but you need to know the schema of your data otherwise you'll not be able to perform the following :
val sch = df.schema // I just took the schema from the old df but you can add one programmatically
val df2 = spark.createDataFrame(sc.parallelize(Lz,2).map{ Row.fromSeq(_) }, sch)
df2.show
// +---+---+---+---+
// | _1| _2| _3| _4|
// +---+---+---+---+
// | 1| 2| 3| 4|
// | 1| 2| 3| 4|
// +---+---+---+---+
Unless you provide a schema, you won't be able to do much except having an array column :
val df3 = sc.parallelize(Lz,2).toDF
// df3: org.apache.spark.sql.DataFrame = [value: array<int>]
df3.show
// +------------+
// | value|
// +------------+
// |[1, 2, 3, 4]|
// |[1, 2, 3, 4]|
// +------------+
df3.printSchema
//root
// |-- value: array (nullable = true)
// | |-- element: integer (containsNull = false)

Changing Nulls Ordering in Spark SQL

I need to be able to sort columns in ascending and descending order and also allow nulls to be first or nulls to be last. Using RDDs I could use the sortByKey method with a custom comparator. I was wondering if there is a corresponding approach using the Dataset API. I see how to to add desc/asc to columns but I have no clue on the nulls ordering.
You can also do it with the dataset API:
scala> val df = Seq("a", "b", null).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: string]
scala> df.select('*).orderBy('x.asc_nulls_last).show
+----+
| x|
+----+
| a|
| b|
|null|
+----+
scala> df.select('*).orderBy('x.asc_nulls_first).show
+----+
| x|
+----+
|null|
| a|
| b|
+----+
Same thing works with desc_nulls_last and desc_nulls_first.
As mentioned by Oleksandr, there was a pull request for this. Now you can optionally use "nulls first" or "nulls last"
scala> spark.sql("select * from spark_10747 order by col3 nulls last").show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 6| 7| 4|
| 6| 11| 4|
| 6| 15| 8|
| 6| 15| 8|
| 6| 7| 8|
| 6| 12| 10|
| 6| 9| 10|
| 6| 13|null|
| 6| 10|null|
+----+----+----+

Resources