I am trying to stream Twitter data using Apache Spark in IntelliJ, but when I use the function coalesce, I get "cannot resolve symbol coalesce". Here is my main code:
val spark = SparkSession.builder().appName("twitterStream").master("local[*]").getOrCreate()
import spark.implicits._
val sc: SparkContext = spark.sparkContext
val streamContext = new StreamingContext(sc, Seconds(5))
val filters = Array("Singapore")
val filtered = TwitterUtils.createStream(streamContext, None, filters)
val englishTweets = filtered.filter(_.getLang() == "en")
//englishTweets.print()
englishTweets.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  val tweets = rdd.map( field =>
    (
      field.getId,
      field.getUser.getScreenName,
      field.getCreatedAt.toInstant.toString,
      field.getText.toLowerCase.split(" ").filter(_.matches("^[a-zA-Z0-9 ]+$")).fold("")((a, b) => a + " " + b).trim,
      sentiment(field.getText)
    )
  )
  val tweetsdf = tweets.toDF("userID", "user", "createdAt", "text", "sentimentType")
  tweetsdf.printSchema()
  tweetsdf.show(false)
}.coalesce(1).write.csv("hdfs://localhost:9000/usr/sparkApp/test/testing.csv")
In your code, coalesce is being called on the result of foreachRDD, which returns Unit, so the symbol cannot be resolved; coalesce has to be applied to the DataFrame (or RDD) before the write. I have tried this with my own dataset: I read the data and applied coalesce while writing, and it gives the expected result. Please refer to this example, it may help you.
import org.apache.spark.sql.SparkSession
import com.spark.Rdd.DriverProgram
import org.apache.log4j.{ Logger, Level }
import org.apache.spark.sql.SaveMode
import java.sql.Date

object JsonDataDF {

  System.setProperty("hadoop.home.dir", "C:\\hadoop") // system property used to locate winutils.exe on Windows
  Logger.getLogger("org").setLevel(Level.WARN) // suppress verbose Spark logs

  case class AOK(appDate: Date, arr: String, base: String, Comments: String)

  val dp = new DriverProgram
  val spark = dp.getSparkSession()

  def main(args : Array[String]): Unit = {
    import spark.implicits._
    val jsonDf = spark.read.option("multiline", "true").json("C:\\Users\\34979\\Desktop\\Work\\Datasets\\JSONdata.txt").as[AOK]
    jsonDf.coalesce(1) // coalesce to a single partition so only one CSV file is written
      .write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .format("csv")
      .save("C:\\Users\\34979\\Desktop\\Work\\Datasets\\JsonToCsv")
  }
}
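For the streaming job in the question, coalesce has to be applied to the DataFrame inside foreachRDD rather than to the result of foreachRDD. A minimal sketch (the field mapping is abbreviated, SaveMode.Append per micro-batch is an assumption, and Spark will write a directory of part files at the given path):

import org.apache.spark.sql.SaveMode

englishTweets.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // same field mapping as in the question, shortened here for brevity
  val tweetsdf = rdd.map(field => (field.getId, field.getUser.getScreenName, field.getText))
    .toDF("userID", "user", "text")

  tweetsdf.coalesce(1)       // one partition, so each batch produces a single CSV part file
    .write
    .mode(SaveMode.Append)   // append each micro-batch under the output directory
    .csv("hdfs://localhost:9000/usr/sparkApp/test/testing.csv")
}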
I have only been learning Spark for a short while. I found the API saveAsNewAPIHadoopDataset when using HBase; my code is below. As far as I know, this code inserts one row at a time. How can I change it to a batch put? I am a rookie, please help. Thanks.
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkContext, SparkConf}
object HbaseTest2 {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("HBaseTest").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val tablename = "account"

    sc.hadoopConfiguration.set("hbase.zookeeper.quorum", "slave1,slave2,slave3")
    sc.hadoopConfiguration.set("hbase.zookeeper.property.clientPort", "2181")
    sc.hadoopConfiguration.set(TableOutputFormat.OUTPUT_TABLE, tablename)

    val job = Job.getInstance(sc.hadoopConfiguration)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[Result])
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    val indataRDD = sc.makeRDD(Array("1,jack,15", "2,Lily,16", "3,mike,16"))
    val rdd = indataRDD.map(_.split(',')).map { arr => {
      val put = new Put(Bytes.toBytes(arr(0)))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes(arr(1)))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("age"), Bytes.toBytes(arr(2).toInt))
      (new ImmutableBytesWritable, put)
    }}

    rdd.saveAsNewAPIHadoopDataset(job.getConfiguration())
    sc.stop()
  }
}
Actually you don't need to worry about this - under the hood, put(Put) and put(List<Put>) are identical. They both buffer messages and flush them in batches. There should be no noticeable performance difference.
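If you still prefer to hand the client an explicit list, a minimal sketch (assuming an already-opened HBase Connection from the standard client API; the table name comes from the question, and putBatch is a hypothetical helper) could look like this:

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Put}

// hypothetical helper: sends a whole batch of Puts with a single put(List<Put>) call
def putBatch(connection: Connection, puts: Seq[Put]): Unit = {
  val table = connection.getTable(TableName.valueOf("account"))
  try {
    table.put(puts.asJava)
  } finally {
    table.close()
  }
}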
I'm afraid the other answer is misguided.
saveAsNewAPIHadoopDataset performs a single put at a time.
To perform a bulk put to an HBase table, you can use the hbase-spark connector.
The connector executes bulkPutFunc2 within mapPartition(), so it is efficient.
Your source code would change as below:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object HBaseTest {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("HBaseTest").setMaster("local")
    val sc = new SparkContext(sparkConf)
    val tablename = "account"

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "slave1,slave2,slave3")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    hbaseConf.set("zookeeper.znode.parent", "/hbase")

    val hbaseContext = new HBaseContext(sc, hbaseConf)
    val indataRDD = sc.makeRDD(Array("1,jack,15", "2,Lily,16", "3,mike,16"))
    hbaseContext.bulkPut(indataRDD, TableName.valueOf(tablename), bulkPutFunc2)

    sc.stop()
  }

  def bulkPutFunc2(arrayRec: String): Put = {
    val rec = arrayRec.split(",")
    val put = new Put(Bytes.toBytes(rec(0).toInt))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes(rec(1)))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("age"), Bytes.toBytes(rec(2).toInt))
    put
  }
}
pom.xml would have the following entry:
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-spark</artifactId>
  <version>1.2.0-cdh5.12.1</version>
</dependency>
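If you cannot add the hbase-spark connector, another common pattern is to batch the puts yourself with foreachPartition and a BufferedMutator from the plain HBase 1.x client. This is only a sketch under that assumption; the table name and ZooKeeper quorum are taken from the question:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

indataRDD.foreachPartition { partition =>
  // one connection and one buffered mutator per partition, not per record
  val conf = HBaseConfiguration.create()
  conf.set("hbase.zookeeper.quorum", "slave1,slave2,slave3")
  val connection = ConnectionFactory.createConnection(conf)
  val mutator = connection.getBufferedMutator(TableName.valueOf("account"))
  try {
    partition.foreach { line =>
      val arr = line.split(',')
      val put = new Put(Bytes.toBytes(arr(0)))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes(arr(1)))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("age"), Bytes.toBytes(arr(2).toInt))
      mutator.mutate(put)   // buffered client-side, flushed in batches
    }
    mutator.flush()
  } finally {
    mutator.close()
    connection.close()
  }
}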
How can a file from HDFS be read inside a Spark function, without using the SparkContext within the function?
Example:
val filedata_rdd = rdd.map { x => ReadFromHDFS(x.getFilePath) }
The question is how ReadFromHDFS can be implemented. Usually, to read from HDFS we could do sc.textFile, but in this case sc cannot be used inside the function.
You don't necessarily need the SparkContext to interact with HDFS. You can simply broadcast the Hadoop configuration from the driver and use the broadcast value on the executors to construct a hadoop.fs.FileSystem. Since Configuration itself is not serializable, it is wrapped in a SerializableWritable before broadcasting. Then the world is yours. :)
Following is the code:
import java.io.StringWriter

import com.sachin.util.SparkIndexJobHelper._
import org.apache.commons.io.IOUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SerializableWritable, SparkConf}

object Test {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[15]")
      .setAppName("TestJob")
    val sc = createSparkContext(conf) // helper from SparkIndexJobHelper

    // Configuration is not Serializable, so wrap it before broadcasting
    val confBroadcast = sc.broadcast(new SerializableWritable(sc.hadoopConfiguration))

    val rdd: RDD[String] = ??? // your existing rdd of HDFS paths
    val filedata_rdd = rdd.map { x => readFromHDFS(confBroadcast.value.value, x) }
  }

  def readFromHDFS(configuration: Configuration, path: String): String = {
    val fs: FileSystem = FileSystem.get(configuration)
    val inputStream = fs.open(new Path(path))
    val writer = new StringWriter()
    IOUtils.copy(inputStream, writer, "UTF-8")
    writer.toString
  }
}
I am processing logs using Spark Streaming. I parse each log line and convert it into a Java Map. The code is below.
Now I want to convert this Map into a DataFrame.
Any suggestion on how to achieve this?
val sparkConf = new SparkConf().setAppName("StreamingApp").setMaster("local[2]")
sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
sqlContext= new SQLContext(sc)
val lines = ssc.textFileStream("hdfs://localhost:9000/test")
process(lines)
def process(lines: DStream[String]) {
  val maptorow = lines.foreachRDD(rdd => {
    rdd.map(line => getMap(line))
      .map(p =>
        Row(p.get("column1"),
          p.get("column2")))
  }) // how to get dataframe after this?

  def getMap(logs: String): java.util.Map[String, Object] = {
    val k: java.util.Map[String, String] = parseLog(logs)
  }
}
Thanks
foreachRDD returns Unit, so you shouldn't be saving maptorow. To convert it, you need to do the conversion inside foreachRDD and then handle each RDD by itself, as a separate set of data:
val sqlContext = new SQLContext(sparkContext)

lines.foreachRDD(rdd => {
  import sqlContext.implicits._
  // do the conversion inside foreachRDD: map each log line to a tuple,
  // because toDF needs a Product type (Row would need an explicit schema)
  val newRDD = rdd.map(line => getMap(line))
    .map(p => (String.valueOf(p.get("column1")), String.valueOf(p.get("column2"))))
  val myDataFrame = newRDD.toDF("column1", "column2")
  // process myDataFrame as a DF
})
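If you prefer to keep Row objects, an alternative is to supply an explicit schema to createDataFrame. This is a sketch reusing lines, getMap, and sqlContext from above, and it assumes both columns are strings; adjust the types to whatever parseLog actually produces:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

lines.foreachRDD(rdd => {
  val rowRDD = rdd.map(line => getMap(line))
    .map(p => Row(String.valueOf(p.get("column1")), String.valueOf(p.get("column2"))))

  val schema = StructType(Seq(
    StructField("column1", StringType, nullable = true),
    StructField("column2", StringType, nullable = true)))

  val myDataFrame = sqlContext.createDataFrame(rowRDD, schema)
  // process myDataFrame as a DF
})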
I have encountered an issue where an RDD action appears to be suspended inside the DStream foreachRDD function.
Please refer to the following code.
import _root_.kafka.common.TopicAndPartition
import _root_.kafka.message.MessageAndMetadata
import _root_.kafka.serializer.StringDecoder
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object StreamingTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topicOffset = Map(TopicAndPartition("test_log", 0) -> 200000L)
    val messageHandler = (mmd: MessageAndMetadata[String, String]) => mmd.message
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](ssc, kafkaParams, topicOffset, messageHandler)

    kafkaStream.foreachRDD(rdd => {
      println(rdd.count())
      val collected = rdd.collect()
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
Error:
The calls to rdd.count() and rdd.collect() appear to hang.
I am using Spark version 1.4.1.
Am I using it in the wrong way?
Thanks in advance.
If we don't set maxRatePerPartition for Kafka, the stream will try to read all the available data, so it looks suspended. In fact it is just busy reading data.
After I set the following configuration
spark.streaming.kafka.maxRatePerPartition=1000
it prints the logs.
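For reference, this property can be passed with --conf on spark-submit or set on the SparkConf before the StreamingContext is created. A minimal sketch based on the code above (the rate of 1000 is just the value used here, tune it for your workload):

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("NetworkWordCount")
  // cap how many records each Kafka partition contributes per second to a batch
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(5))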
I followed the given process (https://azure.microsoft.com/en-in/documentation/articles/hdinsight-apache-spark-eventhub-streaming/) step by step and only modified the Spark receiver code according to my requirements. When I spark-submit the streaming consumer, it fetches the data from Event Hub as a DStream[Array[Byte]], on which I do a foreachRDD and convert it into an RDD[String]. The issue I am facing is that the statements below the streaming line are not getting executed until I stop the program by pressing Ctrl+C.
package com.onerm.spark
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.eventhubs.EventHubsUtils
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark._
import org.apache.spark.sql.hive.HiveContext
import java.util.concurrent.{Executors, ExecutorService}
object HiveEvents {

  def b2s(a: Array[Byte]): String = new String(a)

  def main(args: Array[String]): Unit = {

    val ehParams = Map[String, String](
      "eventhubs.policyname" -> "myreceivepolicy",
      "eventhubs.policykey" -> "jgrH/5yjdMjajQ1WUAQsKAVGTu34=",
      "eventhubs.namespace" -> "SparkeventHubTest-ns",
      "eventhubs.name" -> "SparkeventHubTest",
      "eventhubs.partition.count" -> "4",
      "eventhubs.consumergroup" -> "$default",
      "eventhubs.checkpoint.dir" -> "/EventCheckpoint_0.1",
      "eventhubs.checkpoint.interval" -> "10"
    )

    val conf = new SparkConf().setAppName("Eventhubs Onerm")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val pool: ExecutorService = Executors.newFixedThreadPool(5)
    val ssc = new StreamingContext(sc, Seconds(120))
    var dataString: RDD[String] = sc.emptyRDD

    val stream = EventHubsUtils.createUnionStream(ssc, ehParams)

    // lines below are not getting executed until I stop the execution
    stream.print()

    stream.foreachRDD { rdd =>
      if (rdd.isEmpty()) {
        println("RDD IS EMPTY ")
      } else {
        dataString = rdd.map(line => b2s(line))
        println("COUNT" + dataString.count())
        sqlContext.read.json(dataString).registerTempTable("jsoneventdata")
        val filterData = sqlContext.sql("SELECT id,ClientProperties.PID,ClientProperties.Program,ClientProperties.Platform,ClientProperties.Version,ClientProperties.HWType,ClientProperties.OffVer,ContentID,Data,Locale,MappedSources,MarketingMessageContext.ActivityInstanceID,MarketingMessageContext.CampaignID,MarketingMessageContext.SegmentName,MarketingMessageContext.OneRMInstanceID,MarketingMessageContext.DateTimeSegmented,Source,Timestamp.Date,Timestamp.Epoch,TransactionID,UserAction,EventProcessedUtcTime,PartitionId,EventEnqueuedUtcTime from jsoneventdata")
        filterData.show(10)
        filterData.saveAsParquetFile("EventCheckpoint_0.1/ParquetEvent")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}