Spark: printing Hbase data and converting it into Dataframe - apache-spark

I am having difficulty in playing with the data received from my Hbase table. I have a Hbase table EMP_META: COLUMN_NAME,SALARY,DESIGNATION,BONUS and I read it using below code:
def main(args: Array[String]): Unit = {
val sc = new SparkContext("local", "hbase-test")
println("Running Phoenix Context")
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "EMP_META")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("--------------: "+hBaseRDD.first())
}
However when I print it using the above print statement I get below output:
(65 6d 70 6c 6f 79 65 65,keyvalues={employee/0:COLUMN_NAME/1483975443911/Put/vlen=4/seqid=0, employee/0:DATA_TYPE/1483975443911/Put/vlen=7/seqid=0, employee/0:_0/1483975443911/Put/vlen=1/seqid=0})
Instead of simple data text row. I want to convert the output to a dataframe so that I can easily play with the data. Can someone please help me in this.
Thanks

If you want to convert hbaseRDD to DataFrame,you can use the follow code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
hBaseRDD.toDF
If you want to convert the result to String, you should convert the Array[Byte] to String.The data stored in HBase is Array[Byte].Try to use Bytes.toString(data) to convert it.

Related

To avoid manual files errors how to code dynamic datatype of a column check in spark/scala

We are getting lot of manual files which we need to validate the few datatypes before process the data-frame. Can someone please suggest how can I proceed on this requirement. Basically need to write one spark Generic/common program which should work for many files. if possible please send more detail on this email id as well pathirammi1#gmail.com.
Wondering if your files have records with delimiter seperated (like csv file). If yes, you could very well read it as a text file and split the records based and delimiter and process it.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
object RDDFromCSVFile {
def main(args:Array[String]): Unit ={
def splitString(row:String):Array[String]={
row.split(",")
}
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val sc = spark.sparkContext
val rdd = sc.textFile("randomfile.csv")
val rdd2:RDD = rdd.map(row=>{
val strArray = splitString(row)
val field1 = strArray(0)
val field2 = strArray(1)
val field3 = strArray(3)
val field4 = strArray(4)
// DO custom code here and return to create RDD
})
rdd2.foreach(a=>println(a.toString))
}
}
If you have non-structured data then you should use below code
import org.apache.spark.sql.SparkSession
object RDDFromWholeTextFile {
def main(args:Array[String]): Unit = {
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val sc = spark.sparkContext
val rdd = sc.wholeTextFiles("alice.txt")
rdd.foreach(a=>println(a._1+"---->"+a._2))
}
}
Hope this helps !!
Thanks,
Naveen

Write dataframe to csv with datatype map<string,bigint> in Spark

I have a file which is file1snappy.parquet. It is having a complex data structure like a map, array inside that.After processing that I got final result.while writing that results to csv I am getting some error saying
"Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type."
Code which I have used:
val conf=new SparkConf().setAppName("student-example").setMaster("local")
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
def sumaggr=udf((aggr: Map[String, collection.mutable.WrappedArray[Long]]) => if (aggr.keySet.contains("aggr")) aggr("aggr").sum else 0)
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
datadf.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
I tried converting datadf.toString() but still I am facing same issue.
How can write that result to CSV.
spark version 2.1.1
Spark CSV source supports only atomic types. You cannot store any columns that are non-atomic
I think best is to create a JSON for the column that has map<string,bigint> as a datatype and save it in csv as below.
import spark.implicits._
import org.apache.spark.sql.functions._
datadf.withColumn("column_name_with_map_type", to_json(struct($"column_name_with_map_type"))).write.csv("outputpath")
Hope this helps!
You are trying to save the output of
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
which I guess is a mistake as the udf function and all the aggregation done would go in vain if you do so
So I think you want to save the output of
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
So you need to save it in a new dataframe variable and use that variable to save.
val finalDF = datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0)
finalDF.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
And you should be fine.

Output not in readable format in Spark 2.2.0 Dataset

Following is the code which i am trying to execute with spark2.2.0 on intellij IDE. But the output i am getting is doesnt look in readble format.
val spark = SparkSession
.builder()
.appName("Spark SQL basic example").master("local[2]")
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
import scala.reflect.ClassTag
implicit def kryoEncoder[A](implicit ct: ClassTag[A]) =
org.apache.spark.sql.Encoders.kryo[A](ct)
case class Person(name: String, age: Long)
// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
Output shown :
+--------------------+
| value|
+--------------------+
|[01 00 44 61 74 6...|
+--------------------+
Can anyone explain if I am missing anything here?
Thanks
This is because you're using Kryo Encoder which is not designed to deserialize objects for show.
In general you should never use Kryo Encoder when more precise Encoders are available. It has poorer performance and less features. Instead use Product Encoder
spark.createDataset(Seq(Person("Andy", 32)))(Encoders.product[Person])

how can i add a timestamp as an extra column to my dataframe

*Hi all,
I have an easy question for you all.
I have an RDD, created from kafka streaming using createStream method.
Now i want to add a timestamp as a value to this rdd before converting in to dataframe.
I have tried doing to add a value to the dataframe using with withColumn() but returning this error*
val topicMaps = Map("topic" -> 1)
val now = java.util.Calendar.getInstance().getTime()
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd =>
{
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val dataframe = sqlContext.read.json(rdd.map(_._2))
val d =dataframe.withColumn("timeStamp_column",dataframe.col("now"))
val d =dataframe.withColumn("timeStamp_column",dataframe.col("now"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "now" among (action, device_os_ver, device_type, event_name,
item_name, lat, lon, memberid, productUpccd, tenantid);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:15
As i came to know that DataFrames cannot be altered as they are immutable, but RDDs are immutable as well.
Then what is the best way to do it.
How to a value to the RDD(adding timestamp to an RDD dynamically).
Try current_timestamp function.
import org.apache.spark.sql.functions.current_timestamp
df.withColumn("time_stamp", current_timestamp())
For add a new column with a constant like timestamp, you can use litfunction:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("timeStamp_column", lit(System.currentTimeMillis))
This works for me. I usually perform a write after this.
val d = dataframe.withColumn("SparkLoadedAt", current_timestamp())
In Scala/Databricks:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("Timestamp",current_timestamp())
See my output
I see in comments that some folks are having trouble getting the timestamp to string. Here is a way to do that using spark 3 datetime format
import org.apache.spark.sql.functions._
val d =dataframe.
.withColumn("timeStamp_column", date_format(current_timestamp(), "y-M-d'T'H:m:sX"))

Can I read a CSV represented as a string into Apache Spark using spark-csv

I know how to read a csv file into spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the csv file represented as a string and would like to convert this string directly to dataframe. Is this possible?
Update : Starting from Spark 2.2.x
there is finally a proper way to do it using Dataset.
import org.apache.spark.sql.{Dataset, SparkSession}
val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
import spark.implicits._
val csvData: Dataset[String] = spark.sparkContext.parallelize(
"""
|id, date, timedump
|1, "2014/01/01 23:00:01",1499959917383
|2, "2014/11/31 12:40:32",1198138008843
""".stripMargin.lines.toList).toDS()
val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
frame.show()
frame.printSchema()
Old spark versions
Actually you can, though it's using library internals and not widely advertised. Just create and use your own CsvParser instance.
Example that works for me on spark 1.6.0 and spark-csv_2.10-1.4.0 below
import com.databricks.spark.csv.CsvParser
val csvData = """
|userid,organizationid,userfirstname,usermiddlename,userlastname,usertitle
|1,1,user1,m1,l1,mr
|2,2,user2,m2,l2,mr
|3,3,user3,m3,l3,mr
|""".stripMargin
val rdd = sc.parallelize(csvData.lines.toList)
val csvParser = new CsvParser()
.withUseHeader(true)
.withInferSchema(true)
val csvDataFrame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
You can parse your string into a csv using, e.g. scala-csv:
val myCSVdata : Array[List[String]] =
myCSVString.split('\n').flatMap(CSVParser.parseLine(_))
Here you can do a bit more processing, data cleaning, verifying that every line parses well and has the same number of fields, etc ...
You can then make this an RDD of records:
val myCSVRDD : RDD[List[String]] = sparkContext.parallelize(msCSVdata)
Here you can massage your lists of Strings into a case class, to reflect the fields of your csv data better. You should get some inspiration from the creations of Persons in this example:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
I omit this step.
You can then convert to a DataFrame:
import spark.implicits._
myCSVDataframe = myCSVRDD.toDF()
The accepted answer wasn't working for me in spark 2.2.0 but lead me to what I needed with csvData.lines.toList
val fileUrl = getClass.getResource(s"/file_in_resources.csv")
val stream = fileUrl.getContent.asInstanceOf[InputStream]
val streamString = Source.fromInputStream(stream).mkString
val csvList = streamString.lines.toList
spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(csvList.toDS())
.as[SomeCaseClass]

Resources