Convert RDD to DataFrame in Spark 2.0

I am trying to convert an RDD to a DataFrame in Spark 2.0:
val conf=new SparkConf().setAppName("dataframes").setMaster("local")
val sc=new SparkContext(conf)
val sqlCon=new SQLContext(sc)
import sqlCon.implicits._
val rdd=sc.textFile("/home/cloudera/alpha.dat").persist()
val row=rdd.first()
val data=rdd.filter { x => !x.contains(row) }
data.foreach { x => println(x) }
case class person(name:String,age:Int,city:String)
val rdd2=data.map { x => x.split(",") }
val rdd3=rdd2.map { x => person(x(0),x(1).toInt,x(2)) }
val df=rdd3.toDF()
df.printSchema();
df.registerTempTable("alpha")
val df1=sqlCon.sql("select * from alpha")
df1.foreach { x => println(x) }
But I am getting the error below at toDF(), on the line "val df=rdd3.toDF()":
Multiple markers at this line:
- Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case
classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
- Implicit conversion found: rdd3 ⇒ rddToDatasetHolder(rdd3): (implicit evidence$4:
org.apache.spark.sql.Encoder[person])org.apache.spark.sql.DatasetHolder[person]
How can I convert the above to a DataFrame using toDF()?

Cloudera & Spark 2.0? hmmm, didn't think we supported that yet :)
Anyway, first of all you don't need to call .persist() on your RDD so you can remove that bit. Secondly, since Person is a case class you should capitalize its name.
Lastly, in Spark 2.0 you no longer import sqlContext.implicits._ to get the implicit DataFrame conversions; you import spark.implicits._ instead, where spark is your SparkSession. Your error message hints at this.
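A minimal sketch of that Spark 2.x form, assuming a local master and a couple of made-up in-memory lines in place of the input file. The names ToDfSketch and parsePerson are hypothetical and used only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Kept at top level, outside any method, so Spark can derive an encoder for it.
case class Person(name: String, age: Int, city: String)

object ToDfSketch {
  // Parses one "name,age,city" line; assumes exactly three comma-separated fields.
  def parsePerson(line: String): Person = {
    val f = line.split(",")
    Person(f(0).trim, f(1).trim.toInt, f(2).trim)
  }

  def main(args: Array[String]): Unit = {
    // SparkSession is the Spark 2.x entry point; it takes the place of SQLContext here.
    val spark = SparkSession.builder()
      .appName("dataframes")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings toDF() and the case-class encoders into scope

    // Made-up sample lines standing in for the contents of the input file.
    val lines = spark.sparkContext.parallelize(Seq("alice,30,Pune", "bob,25,Delhi"))
    val df = lines.map(parsePerson).toDF()

    df.printSchema()
    df.show()
    spark.stop()
  }
}
```

The RDD is obtained from spark.sparkContext, so no separate SparkContext or SQLContext needs to be constructed.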

The mistake was simple: I had defined the case class inside the main method. After moving it outside, I was able to convert the RDD to a DataFrame.
package sparksql
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
object asw {
  // The case class lives outside main so that toDF() can find an encoder for it.
  case class Person(name: String, age: Int, city: String)
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("Dataframe")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._
    val rdd1 = sc.textFile("/home/cloudera/alpha.dat")
    // Drop the first line (presumably a header) and any line containing it.
    val row = rdd1.first()
    val data = rdd1.filter { x => !x.contains(row) }
    val rdd2 = data.map { x => x.split(",") }
    val df = rdd2.map { x => Person(x(0), x(1).toInt, x(2)) }.toDF()
    df.createOrReplaceTempView("rdd21")
    spark.sql("select * from rdd21").show()
  }
}

Related

Save RDD as csv file using coalesce function

I am trying to stream Twitter data using Apache Spark in IntelliJ. However, when I use the coalesce function, it says that it cannot resolve the symbol coalesce. Here is my main code:
val spark = SparkSession.builder().appName("twitterStream").master("local[*]").getOrCreate()
import spark.implicits._
val sc: SparkContext = spark.sparkContext
val streamContext = new StreamingContext(sc, Seconds(5))
val filters = Array("Singapore")
val filtered = TwitterUtils.createStream(streamContext, None, filters)
val englishTweets = filtered.filter(_.getLang() == "en")
//englishTweets.print()
englishTweets.foreachRDD{rdd =>
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
val tweets = rdd.map( field =>
(
field.getId,
field.getUser.getScreenName,
field.getCreatedAt.toInstant.toString,
field.getText.toLowerCase.split(" ").filter(_.matches("^[a-zA-Z0-9 ]+$")).fold("")((a, b) => a + " " + b).trim,
sentiment(field.getText)
)
)
val tweetsdf = tweets.toDF("userID", "user", "createdAt", "text", "sentimentType")
tweetsdf.printSchema()
tweetsdf.show(false)
}.coalesce(1).write.csv("hdfs://localhost:9000/usr/sparkApp/test/testing.csv")

I tried this with my own dataset: I read it into a Dataset and applied the coalesce function while writing, and it produced the expected output. The code below may help.
import org.apache.spark.sql.SparkSession
import com.spark.Rdd.DriverProgram
import org.apache.log4j.{ Logger, Level }
import org.apache.spark.sql.SaveMode
import java.sql.Date
object JsonDataDF {
  System.setProperty("hadoop.home.dir", "C:\\hadoop") // system property used to locate winutils.exe
  Logger.getLogger("org").setLevel(Level.WARN) // suppresses log output below WARN
  case class AOK(appDate: Date, arr: String, base: String, Comments: String)
  val dp = new DriverProgram
  val spark = dp.getSparkSession()
  def main(args: Array[String]): Unit = {
    import spark.implicits._
    val jsonDf = spark.read.option("multiline", "true").json("C:\\Users\\34979\\Desktop\\Work\\Datasets\\JSONdata.txt").as[AOK]
    jsonDf.coalesce(1) // Refer Here
      .write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .format("csv")
      .save("C:\\Users\\34979\\Desktop\\Work\\Datasets\\JsonToCsv")
  }
}
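In the question's snippet, .coalesce(1) is chained onto the result of foreachRDD, which returns Unit, so there is no coalesce to resolve there. One possible fix is to do the write inside foreachRDD, once per batch. The following is a minimal sketch under assumptions: it uses a three-field tuple rather than the question's five fields, and TweetSink and saveBatch are hypothetical names.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

object TweetSink {
  // Hypothetical helper: writes one micro-batch to outputDir as a single CSV part file.
  def saveBatch(rdd: RDD[(Long, String, String)], outputDir: String): Unit = {
    // Skip empty batches so no empty output is written.
    if (!rdd.isEmpty()) {
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      rdd.toDF("userID", "user", "text")
        .coalesce(1)           // one partition, hence one part file per batch
        .write
        .mode(SaveMode.Append) // later batches add files instead of failing on an existing directory
        .csv(outputDir)
    }
  }
}
```

Inside the streaming job it could be called as englishTweets.map(t => (t.getId, t.getUser.getScreenName, t.getText)).foreachRDD(rdd => TweetSink.saveBatch(rdd, outputDir)), where outputDir is the target directory.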

Reading files dynamically from HDFS from within spark transformation functions

How can a file on HDFS be read inside a Spark function without using the SparkContext within that function?
Example:
val filedata_rdd = rdd.map { x => ReadFromHDFS(x.getFilePath) }
The question is how ReadFromHDFS can be implemented. Usually, to read from HDFS we could call sc.textFile, but in this case sc cannot be used inside the function.

You don't necessarily need the Spark context to interact with HDFS. You can simply broadcast the Hadoop configuration from the driver and use the broadcast value on the executors to construct a hadoop.fs.FileSystem. Then the world is yours. :)
Following is the code:
import java.io.StringWriter
import org.apache.commons.io.IOUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SerializableWritable, SparkConf, SparkContext}
// An object rather than a class, so main is a real entry point and the
// closure below does not capture a non-serializable instance.
object Test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[15]")
      .setAppName("TestJob")
    val sc = new SparkContext(conf)
    // Hadoop's Configuration is not Serializable, so wrap it before broadcasting.
    val confBroadcast = sc.broadcast(new SerializableWritable(sc.hadoopConfiguration))
    val rdd: RDD[String] = ??? // your existing RDD of HDFS paths
    val filedata_rdd = rdd.map { x => readFromHDFS(confBroadcast.value.value, x) }
  }
  // Runs on the executors: opens the file at path and returns its contents as a string.
  def readFromHDFS(configuration: Configuration, path: String): String = {
    val fs: FileSystem = FileSystem.get(configuration)
    val inputStream = fs.open(new Path(path))
    try {
      val writer = new StringWriter()
      IOUtils.copy(inputStream, writer, "UTF-8")
      writer.toString
    } finally {
      inputStream.close() // release the HDFS stream even if the copy fails
    }
  }
}

Convert a RDD into DataFrame after foreachRDD operation

I am processing logs using Spark Streaming. I parse each log line and convert it into a Java Map.
Now I want to convert this Map into a DataFrame.
Any suggestions on how to achieve this? The code follows.
val sparkConf = new SparkConf().setAppName("StreamingApp").setMaster("local[2]")
sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
sqlContext= new SQLContext(sc)
val lines = ssc.textFileStream("hdfs://localhost:9000/test")
process(lines)
def process(lines: DStream[String]) {
val maptorow = lines.foreachRDD(rdd=>{
rdd.map(line => getMap(line))
.map(p =>
Row(p.get("column1"),
p.get("column2"))
}) // how to get dataframe after this?
def getMap(logs: String): java.util.Map[String, Object] = {
val k : java.util.Map[String, String] = parseLog(logs)
}
}
Thanks

foreachRDD returns nothing (Unit), so there is no point in saving its result in maptorow. To convert the data, do the conversion inside foreachRDD and treat each RDD as a separate batch of data:
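import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val sqlContext = new SQLContext(sparkContext)
// toDF() is not available on an RDD[Row], so createDataFrame is used with an
// explicit schema instead (assumption: both columns hold strings).
val schema = StructType(Seq(
  StructField("column1", StringType),
  StructField("column2", StringType)))
lines.foreachRDD(rdd => {
  val newRDD = rdd.map(line => getMap(line))
    .map(p => Row(p.get("column1"), p.get("column2")))
  val myDataFrame = sqlContext.createDataFrame(newRDD, schema)
  // process myDataFrame as a DF
})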

rdd action will be suspended in DStream foreachRDD function

I have run into a problem: RDD actions are suspended inside the DStream foreachRDD function.
Please refer to the following code.
import _root_.kafka.common.TopicAndPartition
import _root_.kafka.message.MessageAndMetadata
import _root_.kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object StreamingTest {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topicOffset = Map(TopicAndPartition("test_log",0)->200000L)
val messageHandler = (mmd: MessageAndMetadata[String, String]) => mmd.message
val kafkaStream = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder,String](ssc,kafkaParams,topicOffset,messageHandler)
kafkaStream.foreachRDD(rdd=>{
println(rdd.count())
val collected = rdd.collect()
})
ssc.start()
ssc.awaitTermination()
}
}
Error:
The calls rdd.count() and rdd.collect() are suspended.
I am using Spark version 1.4.1.
Am I using it the wrong way?
Thanks in advance.

If maxRatePerPartition is not set for Kafka, the stream tries to read all the available data at once, so the job looks suspended when it is actually busy reading data.
After I set the following configuration:
spark.streaming.kafka.maxRatePerPartition=1000
it printed the log output.
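One way to apply this setting is in code, on the SparkConf, before the StreamingContext is created. A minimal sketch that reuses the master and app name from the question; the names RateLimitedConf and build are made up for illustration.

```scala
import org.apache.spark.SparkConf

object RateLimitedConf {
  // Builds a SparkConf that caps the direct Kafka stream at
  // 1000 records per second for each Kafka partition.
  def build(): SparkConf =
    new SparkConf()
      .setMaster("local[4]")
      .setAppName("NetworkWordCount")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
}
```

The same key can also be passed at submit time with --conf spark.streaming.kafka.maxRatePerPartition=1000, which avoids hard-coding the limit.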

Spark streaming for Azure Event hubs

I followed the given process (https://azure.microsoft.com/en-in/documentation/articles/hdinsight-apache-spark-eventhub-streaming/) step by step and only modified the Spark receiver code to suit my requirements. When I spark-submit the Spark Streaming consumer, it fetches the data from Event Hubs as a DStream[Array[Byte]], on which I call foreachRDD and convert each batch into an RDD[String]. The issue I am facing is that the statements below the streaming line are not executed until I stop the program by pressing Ctrl+C.
package com.onerm.spark
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.eventhubs.EventHubsUtils
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark._
import org.apache.spark.sql.hive.HiveContext
import java.util.concurrent.{Executors, ExecutorService}
object HiveEvents {
def b2s(a: Array[Byte]): String = new String(a)
def main(args: Array[String]): Unit = {
val ehParams = Map[String, String](
"eventhubs.policyname" -> "myreceivepolicy",
"eventhubs.policykey" -> "jgrH/5yjdMjajQ1WUAQsKAVGTu34=",
"eventhubs.namespace" -> "SparkeventHubTest-ns",
"eventhubs.name" -> "SparkeventHubTest",
"eventhubs.partition.count" -> "4",
"eventhubs.consumergroup" -> "$default",
"eventhubs.checkpoint.dir" -> "/EventCheckpoint_0.1",
"eventhubs.checkpoint.interval" -> "10"
)
val conf = new SparkConf().setAppName("Eventhubs Onerm")
val sc= new SparkContext(conf)
val hiveContext = new HiveContext(sc)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val pool:ExecutorService=Executors.newFixedThreadPool(5)
val ssc = new StreamingContext(sc, Seconds(120))
var dataString :RDD[String] =sc.emptyRDD
val stream=EventHubsUtils.createUnionStream(ssc, ehParams)
// lines below are not getting executed until I stop the execution
stream.print()
stream.foreachRDD {
rdd =>
if(rdd.isEmpty())
{
println("RDD IS EMPTY ")
}
else
{
dataString=rdd.map(line=>b2s(line))
println("COUNT" +dataString.count())
sqlContext.read.json(dataString).registerTempTable("jsoneventdata")
val filterData=sqlContext.sql("SELECT id,ClientProperties.PID,ClientProperties.Program,ClientProperties.Platform,ClientProperties.Version,ClientProperties.HWType,ClientProperties.OffVer,ContentID,Data,Locale,MappedSources,MarketingMessageContext.ActivityInstanceID,MarketingMessageContext.CampaignID,MarketingMessageContext.SegmentName,MarketingMessageContext.OneRMInstanceID,MarketingMessageContext.DateTimeSegmented,Source,Timestamp.Date,Timestamp.Epoch,TransactionID,UserAction,EventProcessedUtcTime,PartitionId,EventEnqueuedUtcTime from jsoneventdata")
filterData.show(10)
filterData.saveAsParquetFile("EventCheckpoint_0.1/ParquetEvent")
} }
ssc.start()
ssc.awaitTermination()
}
}
