Spark Find the Occurrences of Matched Strings

How can I find the occurrences of the matched string, as per the code snippet below? I'm able to get the filtered strings as output, but not the occurrences.
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object WordCount {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("wordCount")
val sc = new SparkContext(conf)
// Load our input data.
val input = sc.textFile("file:///tmp/ganesh/*")
val matched_pattern = input.filter(line => line.contains("Title"))
// Split it up into words.
val words = matched_pattern.flatMap(line => line.split(" "))
// Transform into pairs and count.
val counts = words.map(word => (word, 1)).reduceByKey{case (x, y) => x + y}
// Save the word count back out to a text file, causing evaluation.
counts.saveAsTextFile("file:///tmp/sparkout")
}
}

Here is an example with broadcast variable usage. Note that stopWords here is in fact a list of include words (the words to keep).
val dfsFilename = "/FileStore/tables/7dxa9btd1477497663691/Text_File_01-880f5.txt"
val readFileRDD = spark.sparkContext.textFile(dfsFilename)
// res4: Array[String] = Array(The the is Is a A to To OK ok I) //stopWords
val stopWordsInput = spark.sparkContext.textFile("/FileStore/tables/filter_words.txt")
val stopWords = stopWordsInput.flatMap(x => x.split(" ")).map(_.trim).collect.toSet
val broadcasted = sc.broadcast(stopWords)
val wcounts1 = readFileRDD.map(x => (x.replaceAll("[^A-Za-z0-9]", " ")
.trim.toLowerCase))
.flatMap(line=>line.split(" "))
.filter(broadcasted.value.contains(_))
.map(word=>(word, 1))
.reduceByKey(_ + _)
wcounts1.collect
returns:
res2: Array[(String, Int)] = Array((The,1), (I,3), (to,1), (the,1))
You can embellish this with a broadcast of the stopWords, which is what I did.
I saw your XML input and a replaceAll; you can fiddle with that to your liking. I also added a clause to put it all in lower case.
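Going back to the original snippet: if by occurrences you mean how many times the matched term itself shows up, here is a minimal sketch reusing the question's input RDD, with "Title" as the match (adjust to your own data):
// Number of lines containing the match
val matchedLines = input.filter(line => line.contains("Title"))
println("Lines containing Title: " + matchedLines.count())
// Per-token counts, keeping only the tokens that contain the match
val titleCounts = matchedLines
.flatMap(_.split(" "))
.filter(_.contains("Title"))
.map(word => (word, 1))
.reduceByKey(_ + _)
titleCounts.collect().foreach(println)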

Related

How to generate a large word count file in Spark?

I want to generate a 10-million-line word count file for a performance test (each line has the same sentence), but I have no idea how to code it.
Could you give me example code that saves the file directly in HDFS?
You can try something like this.
Generate one column with values from 1 to 100k and another with values from 1 to 100, then explode both of them with explode(column).
You can't generate a single column with 10 million values because the Kryo buffer will throw an error.
I don't know if this is the best-performing way to do it, but it is the fastest way I can think of right now.
import org.apache.spark.sql.functions.{explode, lit, udf}
import spark.implicits._ // assumes a SparkSession named spark is in scope (e.g. spark-shell)
// UDF that builds the sequence 1..s; exploding it fans each row out into s rows
val generateList = udf((s: Int) => (1 to s).toArray)
val someDF = Seq(
("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
).toDF("sentence")
// Attach one 100k-element array and one 100-element array to the single row
val someDfWithMilColumn = someDF.withColumn("genColumn1", generateList(lit(100000)))
.withColumn("genColumn2", generateList(lit(100)))
// First explode: 1 row -> 100k rows, then drop the helper columns
val someDfWithMilColumn100k = someDfWithMilColumn
.withColumn("expl_val", explode($"genColumn1")).drop("expl_val", "genColumn1")
// Second explode: 100k rows -> 10 million rows
val someDfWithMilColumn10mil = someDfWithMilColumn100k
.withColumn("expl_val2", explode($"genColumn2")).drop("genColumn2", "expl_val2")
someDfWithMilColumn10mil.write.parquet(path)
You can do it by cross-joining the two DataFrames as below; the code explanation is inline.
import org.apache.spark.sql.SaveMode
object GenerateTenMils {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess // Constant.getSparkSess is the author's own helper that returns a SparkSession
spark.conf.set("spark.sql.crossJoin.enabled","true") // Enable cross join
import spark.implicits._
//Create a DF with your sentence
val df = List("each line has the same sentence").toDF
//Create another Dataset with 10000000 records
spark.range(10000000)
.join(df) // Cross Join the dataframes
.coalesce(1) // Output to a single file
.drop("id") // Drop the extra column
.write
.mode(SaveMode.Overwrite)
.text("src/main/resources/tenMils") // Write as text file
}
}
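A shorter variant of the same spark.range idea is sketched below; it skips the cross join by attaching the sentence as a literal column. The app name and output path are placeholders, not taken from the answers above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
val spark = SparkSession.builder().appName("TenMilsLit").getOrCreate()
spark.range(10000000L) // one row per line to generate
.select(lit("each line has the same sentence").as("value")) // single string column, as write.text expects
.coalesce(1) // output to a single file
.write
.mode("overwrite")
.text("src/main/resources/tenMilsLit") // placeholder path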
You could follow this approach: tail recursion to generate the list of sentences and the DataFrames, and union to build the big DataFrame.
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.annotation.tailrec
val spark = SparkSession
.builder()
.appName("TenMillionsRows")
.master("local[*]")
.config("spark.sql.shuffle.partitions","4") //Change to a more reasonable default number of partitions for our data
.config("spark.app.id","TenMillionsRows") // To silence Metrics warning
.getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
val sentence = "hope for the best but prepare for the worst"
/**
* Returns a List containing num copies of sentence
* @param sentence
* @param num
* @return
*/
def getList(sentence: String, num: Int): List[String] = {
@tailrec
def loop(st: String, n: Int, acc: List[String]): List[String] = {
n match {
case 0 => acc
case _ => loop(st, n - 1, st :: acc)
}
}
loop(sentence, num, List())
}
/**
* Returns a DataFrame that is the union of num DataFrames built from lst
* @param lst
* @param num
* @return
*/
def getDataFrame(lst: List[String], num: Int): DataFrame = {
@tailrec
def loop(ls: List[String], n: Int, acc: DataFrame): DataFrame = {
n match {
case 0 => acc
case _ => loop(ls, n - 1, acc.union(sc.parallelize(ls).toDF("sentence")))
}
}
loop(lst, num, sc.parallelize(List(sentence)).toDF("sentence"))
}
val lSentence = getList(sentence, 100000)
val dfs = getDataFrame(lSentence,100)
println(dfs.count())
// output: 10000001
dfs.write.orc("path_to_hdfs") // write dataframe to a orc file
// you can save the file as parquet, txt, json .......
// with dataframe.write
Hope this helps.

Add extra column for child data frame from parent data frame in nested XML in Spark

I am creating data after loading many XML files.
Each XML file has one unique field, fun:DataPartitionId.
I am creating many rows from one XML file.
Now I want to add this fun:DataPartitionId to each of the resulting rows from the XML.
For example, suppose the 1st XML produces 100 rows; then each of those 100 rows will have the same fun:DataPartitionId value.
So fun:DataPartitionId is a header field in each XML.
This is what I am doing:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{explode, udf}
val getDataPartition = udf { (DataPartition: String) =>
if (DataPartition=="1") "SelfSourcedPublic"
else if (DataPartition=="2") "Japan"
else if (DataPartition=="3") "SelfSourcedPrivate"
else "ThirdPartyPrivate"
}
val getFFActionParent = udf { (FFAction: String) =>
if (FFAction=="Insert") "I|!|"
else if (FFAction=="Overwrite") "I|!|"
else "D|!|"
}
val getFFActionChild = udf { (FFAction: String) =>
if (FFAction=="Insert") "I|!|"
else if (FFAction=="Overwrite") "O|!|"
else "D|!|"
}
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")
val dfDataPartition=getDataPartition(dfContentEnvelope("env:Header.fun:DataPartitionId"))
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
val df =dfContentItem.withColumn("DataPartition",dfDataPartition)
df.show()
When you read your XML file using
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")
the fun:DataPartitionId column is read as Long:
fun:DataPartitionId: long (nullable = true)
so you should change the udf function to
val getDataPartition = udf { (DataPartition: Long) =>
if (DataPartition== 1) "SelfSourcedPublic"
else if (DataPartition== 2) "Japan"
else if (DataPartition== 3) "SelfSourcedPrivate"
else "ThirdPartyPrivate"
}
If possible, you should use the when function instead of a udf function to improve processing speed and memory usage.
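For example, a sketch of the same mapping with when/otherwise (it assumes the Long column has been selected as DataPartitionId, as done further below; $ comes from the implicits already imported above):
import org.apache.spark.sql.functions.when
// Same mapping as the getDataPartition udf, expressed as a Column
val dataPartitionCol = when($"DataPartitionId" === 1, "SelfSourcedPublic")
.when($"DataPartitionId" === 2, "Japan")
.when($"DataPartitionId" === 3, "SelfSourcedPrivate")
.otherwise("ThirdPartyPrivate")
.as("DataPartition")
You could then use dataPartitionCol in the select in place of getDataPartition($"DataPartitionId").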
Now I want to add this fun:DataPartitionId to each of the resulting rows from the XML.
Your mistake is that you forgot to select that particular column, so the following code
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
should be
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select($"env:Header.fun:DataPartitionId".as("DataPartitionId"),$"column1.*")
Then you can apply the udf function
val df = dfContentItem.select(getDataPartition($"DataPartitionId"), $"env:Data.sr:Source.*", $"_action".as("FFAction|!|"))
So the working code as a whole should be:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{explode, udf}
val getDataPartition = udf { (DataPartition: Long) =>
if (DataPartition == 1) "SelfSourcedPublic"
else if (DataPartition == 2) "Japan"
else if (DataPartition == 3) "SelfSourcedPrivate"
else "ThirdPartyPrivate"
}
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select($"env:Header.fun:DataPartitionId".as("DataPartitionId"),$"column1.*")
val df = dfContentItem.select(getDataPartition($"DataPartitionId"), $"env:Data.sr:Source.*", $"_action".as("FFAction|!|"))
df.show(false)
And you can proceed with the rest of the code.

value toDF is not a member of org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]

I am getting a compilation error converting the pre-LDA transformation to a data frame using Scala in Spark 2.0. The specific code that is throwing the error is below:
val documents = PreLDAmodel.transform(mp_listing_lda_df)
.select("docId","features")
.rdd
.map{ case Row(row_num: Long, features: MLVector) => (row_num, features) }
.toDF()
The complete compilation error is:
Error:(132, 8) value toDF is not a member of org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Here is the complete code:
import java.io.FileInputStream
import java.sql.{DriverManager, ResultSet}
import java.util.Properties
import org.apache.spark.SparkConf
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{LDA => oldLDA}
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}
object MPClassificationLDA {
/*Start: Configuration variable initialization*/
val props = new Properties
val fileStream = new FileInputStream("U:\\JIRA\\MP_Classification\\target\\classes\\mpclassification.properties")
props.load(fileStream)
val mpExtract = props.getProperty("mpExtract").toString
val shard6_db_server_name = props.getProperty("shard6_db_server_name").toString
val shard6_db_user_id = props.getProperty("shard6_db_user_id").toString
val shard6_db_user_pwd = props.getProperty("shard6_db_user_pwd").toString
val mp_output_file = props.getProperty("mp_output_file").toString
val spark_warehouse_path = props.getProperty("spark_warehouse_path").toString
val rf_model_file_path = props.getProperty("rf_model_file_path").toString
val windows_hadoop_home = props.getProperty("windows_hadoop_home").toString
val lda_vocabulary_size = props.getProperty("lda_vocabulary_size").toInt
val pre_lda_model_file_path = props.getProperty("pre_lda_model_file_path").toString
val lda_model_file_path = props.getProperty("lda_model_file_path").toString
fileStream.close()
/*End: Configuration variable initialization*/
val conf = new SparkConf().set("spark.sql.warehouse.dir", spark_warehouse_path)
def main(arg: Array[String]): Unit = {
//SQL Query definition and parameter values as parameter upon executing the Object
val cont_id = "14211599"
val top = "100000"
val start_date = "2016-05-01"
val end_date = "2016-06-01"
val mp_spark = SparkSession
.builder()
.master("local[*]")
.appName("MPClassificationLoadLDA")
.config(conf)
.getOrCreate()
MPClassificationLDACalculation(mp_spark, cont_id, top, start_date, end_date)
mp_spark.stop()
}
private def MPClassificationLDACalculation
(mp_spark: SparkSession
,cont_id: String
,top: String
,start_date: String
,end_date: String
): Unit = {
//DB connection definition
def createConnection() = {
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver").newInstance();
DriverManager.getConnection("jdbc:sqlserver://" + shard6_db_server_name + ";user=" + shard6_db_user_id + ";password=" + shard6_db_user_pwd);
}
//DB Field Names definition
def extractvalues(r: ResultSet) = {
Row(r.getString(1),r.getString(2))
}
//Prepare SQL Statement with parameter value replacement
val query = """SELECT docId = audt_id, text = auction_title FROM brands6.dbo.uf_ds_marketplace_classification_listing(#cont_id, #top, '#start_date', '#end_date') WHERE ? < ? OPTION(RECOMPILE);"""
.replaceAll("#cont_id", cont_id)
.replaceAll("#top", top)
.replaceAll("#start_date", start_date)
.replaceAll("#end_date", end_date)
.stripMargin
//Connect to Source DB and execute the Prepared SQL Steatement
val mpDataRDD = new JdbcRDD(mp_spark.sparkContext
,createConnection
,query
,lowerBound = 0
,upperBound = 10000000
,numPartitions = 1
,mapRow = extractvalues)
val schema_string = "docId,text"
val fields = StructType(schema_string.split(",")
.map(fieldname => StructField(fieldname, StringType, true)))
//Create Data Frame using format identified through schema_string
val mpDF = mp_spark.createDataFrame(mpDataRDD, fields)
mpDF.collect()
val mp_listing_tmp = mpDF.selectExpr("cast(docId as long) docId", "text")
mp_listing_tmp.printSchema()
println(mp_listing_tmp.first)
val mp_listing_lda_df = mp_listing_tmp.withColumn("docId", mp_listing_tmp("docId"))
mp_listing_lda_df.printSchema()
val tokenizer = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("rawTokens")
.setMinTokenLength(2)
val stopWordsRemover = new StopWordsRemover()
.setInputCol("rawTokens")
.setOutputCol("tokens")
val vocabSize = 4000
val countVectorizer = new CountVectorizer()
.setVocabSize(vocabSize)
.setInputCol("tokens")
.setOutputCol("features")
val PreLDApipeline = new Pipeline()
.setStages(Array(tokenizer, stopWordsRemover, countVectorizer))
val PreLDAmodel = PreLDApipeline.fit(mp_listing_lda_df)
//comment out after saving it the first time
PreLDAmodel.write.overwrite().save(pre_lda_model_file_path)
val documents = PreLDAmodel.transform(mp_listing_lda_df)
.select("docId","features")
.rdd
.map{ case Row(row_num: Long, features: MLVector) => (row_num, features) }
.toDF()
//documents.printSchema()
val numTopics: Int = 20
val maxIterations: Int = 100
//note the FeaturesCol need to be set
val lda = new LDA()
.setOptimizer("em")
.setK(numTopics)
.setMaxIter(maxIterations)
.setFeaturesCol(("_2"))
val vocabArray = PreLDAmodel.stages(2).asInstanceOf[CountVectorizerModel].vocabulary
}
}
I am thinking that it is related to conflicts in the imports section of the code. I'd appreciate any help.
Two things needed to be done:
Import implicits: note that this should be done only after an instance of org.apache.spark.sql.SQLContext is created. It should be written as:
val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
Move the case class outside of the method: the case class, through which you define the schema of the DataFrame, should be defined outside of the method that needs it. You can read more about it here: https://issues.scala-lang.org/browse/SI-6649
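In Spark 2.0 the same import is usually taken from the SparkSession rather than a separate SQLContext. A minimal sketch against the question's code (toDocuments is a hypothetical helper name, not from the question):
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
def toDocuments(mp_spark: SparkSession, transformed: DataFrame): DataFrame = {
import mp_spark.implicits._ // brings .toDF for the RDD of (Long, Vector) tuples into scope
transformed
.select("docId", "features")
.rdd
.map { case Row(row_num: Long, features: MLVector) => (row_num, features) }
.toDF("docId", "features")
}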

Spark: How to merge shuffled RDDs efficiently?

I have 5 shuffled key-value RDDs, one big one (1,000,000 records) and 4 relatively small ones (100,000 records each). All RDDs were shuffled with the same number of partitions. I have two strategies to merge the five:
Merge the 5 RDDs together
Merge the 4 small RDDs together and then join the big one
I think strategy 2 would be more efficient, as it would not re-shuffle the big one, but the experiment result shows strategy 1 is more efficient. The code and output follow:
Code
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}
object MergeStrategy extends App {
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
val conf = new SparkConf().setMaster("local[4]").setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val bigRddSize = 1e6.toInt
val smallRddSize = 1e5.toInt
println(bigRddSize)
val bigRdd = sc.parallelize((0 until bigRddSize)
.map(x => (scala.util.Random.nextInt, 0))).repartition(100).cache
bigRdd.take(10).foreach(println)
val smallRddList = (0 until 4).map(i => {
val rst = sc.parallelize((0 until smallRddSize)
.map(x => (scala.util.Random.nextInt, 0))).repartition(100).cache
println(rst.count)
rst
}).toArray
// strategy 1
{
val begin = System.currentTimeMillis
val s1Rst = sc.union(Array(bigRdd) ++ smallRddList).distinct(100)
println(s1Rst.count)
val end = System.currentTimeMillis
val timeCost = (end - begin) / 1000d
println("S1 time count: %.1f s".format(timeCost))
}
// strategy 2
{
val begin = System.currentTimeMillis
val smallMerged = sc.union(smallRddList).distinct(100).cache
println(smallMerged.count)
val s2Rst = bigRdd.fullOuterJoin(smallMerged).flatMap({ case (key, (left, right)) => {
if (left.isDefined && right.isDefined) Array((key, left.get), (key, right.get)).distinct
else if (left.isDefined) Array((key, left.get))
else if (right.isDefined) Array((key, right.get))
else throw new Exception("Cannot happen")
}
})
println(s2Rst.count)
val end = System.currentTimeMillis
val timeCost = (end - begin) / 1000d
println("S2 time count: %.1f s".format(timeCost))
}
}
Output
1000000
(688282474,0)
(-255073127,0)
(872746474,0)
(-792516900,0)
(417252803,0)
(-1514224305,0)
(1586932811,0)
(1400718248,0)
(939155130,0)
(1475156418,0)
100000
100000
100000
100000
1399777
S1 time count: 39.7 s
399984
1399894
S2 time count: 49.8 s
Is my understanding of shuffled RDDs wrong? Can anybody give some advice?
Thanks!
I found a method to merge RDDs more efficiently; see the following 2 merging strategies:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.{HashPartitioner, SparkContext, SparkConf}
import scala.collection.mutable.ArrayBuffer
object MergeStrategy extends App {
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
val conf = new SparkConf().setMaster("local[4]").setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val rddCount = 20
val mergeCount = 5
val dataSize = 20000
val parts = 50
// generate data
scala.util.Random.setSeed(943343)
val testData = for (i <- 0 until rddCount)
yield sc.parallelize(scala.util.Random.shuffle((0 until dataSize).toList).map(x => (x, 0)))
.partitionBy(new HashPartitioner(parts))
.cache
testData.foreach(x => println(x.count))
// strategy 1: merge, then re-partition the merged result with HashPartitioner
{
val buff = ArrayBuffer[RDD[(Int, Int)]]()
val begin = System.currentTimeMillis
for (i <- 0 until rddCount) {
buff += testData(i)
if ((buff.size >= mergeCount || i == rddCount - 1) && buff.size > 1) {
val merged = sc.union(buff).distinct
.partitionBy(new HashPartitioner(parts)).cache
println(merged.count)
buff.foreach(_.unpersist(false))
buff.clear
buff += merged
}
}
val end = System.currentTimeMillis
val timeCost = (end - begin) / 1000d
println("Strategy 1 Time Cost: %.1f".format(timeCost))
assert(buff.size == 1)
println("Strategy 1 Complete, with merged Count %s".format(buff(0).count))
}
// strategy 2: merge without re-partitioning the merged result
{
val buff = ArrayBuffer[RDD[(Int, Int)]]()
val begin = System.currentTimeMillis
for (i <- 0 until rddCount) {
buff += testData(i)
if ((buff.size >= mergeCount || i == rddCount - 1) && buff.size > 1) {
val merged = sc.union(buff).distinct(parts).cache
println(merged.count)
buff.foreach(_.unpersist(false))
buff.clear
buff += merged
}
}
val end = System.currentTimeMillis
val timeCost = (end - begin) / 1000d
println("Strategy 2 Time Cost: %.1f".format(timeCost))
assert(buff.size == 1)
println("Strategy 2 Complete, with merged Count %s".format(buff(0).count))
}
}
The result shows that strategy 1 (time cost 20.8 seconds) is more efficient than strategy 2 (time cost 34.3 seconds). My PC runs Windows 8 with a 4-core 2.0 GHz CPU and 8 GB of memory.
The only difference is that strategy 1 partitions the merged result with a HashPartitioner while strategy 2 does not. As a result, strategy 1 produces a ShuffledRDD while strategy 2 produces a MapPartitionsRDD. I think the RDD.distinct function processes a ShuffledRDD more efficiently than a MapPartitionsRDD.
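One way to see what this answer is pointing at is to check whether the union keeps a partitioner. The sketch below runs against the testData RDDs from the code above; whether distinct can actually exploit the preserved partitioner depends on the Spark version.
// sc.union only keeps a partitioner when every input RDD has the same one,
// so equal keys stay co-located across the merged partitions.
val sameLayout = sc.union(testData.take(mergeCount))
println(sameLayout.partitioner) // Some(HashPartitioner) - keys co-located
val mixedLayout = sc.union(Seq(testData.head, testData(1).repartition(parts)))
println(mixedLayout.partitioner) // None - co-location lost, so equal keys may land in different partitions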

How to process a subset of input records in a batch, i.e. the first second in 3-sec batch time?

If I set Seconds(1) for the batch time in StreamingContext, like this:
val ssc = new StreamingContext(sc, Seconds(1))
Every 3 seconds this will receive 3 seconds of data, but I only need the first second of data; I can discard the next 2 seconds. So can I spend 3 seconds processing only the first second of data?
You can do this via updateStateByKey if you keep track of a counter, for example like below:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamEveryThirdApp {
def main(args: Array[String]) {
val sc = new SparkContext("local[*]", "Streaming Test")
implicit val ssc = new StreamingContext(sc, Seconds(1))
ssc.checkpoint("./checkpoint")
// generate stream
val inputDStream = createConstantStream
// increase seconds counter
val accStream = inputDStream.updateStateByKey(updateState)
// keep only 1st second records
val firstOfThree = accStream.filter { case (key, (value, counter)) => counter == 1}
firstOfThree.print()
ssc.start()
ssc.awaitTermination()
}
def updateState: (Seq[Int], Option[(Option[Int], Int)]) => Option[(Option[Int], Int)] = {
case(values, state) =>
state match {
// If no previous state, i.e. set first Second
case None => Some((Some(values.sum), 1))
// If this is 3rd second - remove state
case Some((prevValue, 3)) => None
// If this is not the first second - increase seconds counter, but don't calculate values
case Some((prevValue, counter)) => Some((None, counter + 1))
}
}
def createConstantStream(implicit ssc: StreamingContext): ConstantInputDStream[(String, Int)] = {
val seq = Seq(
("key1", 1),
("key2", 3),
("key1", 2),
("key1", 2)
)
val rdd = ssc.sparkContext.parallelize(seq)
val inputDStream = new ConstantInputDStream(ssc, rdd)
inputDStream
}
}
In case you have time information within your data, you could also use a 3-second window, stream.window(Seconds(3), Seconds(3)), and filter records by the time information from the data; quite often this is the preferred approach.
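A sketch of that windowed alternative is below; eventsWithTime is a hypothetical DStream of (key, eventTimeMs) pairs, where eventTimeMs is the record's own timestamp in epoch milliseconds (the constant stream above does not carry one).
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, Time}
import org.apache.spark.streaming.dstream.DStream
def keepFirstSecondOfWindow(eventsWithTime: DStream[(String, Long)]): DStream[(String, Long)] = {
eventsWithTime
.window(Seconds(3), Seconds(3)) // non-overlapping 3-second windows
.transform { (rdd: RDD[(String, Long)], batchTime: Time) =>
val windowStart = batchTime.milliseconds - 3000L // batchTime marks the end of the window
rdd.filter { case (_, eventTimeMs) => eventTimeMs < windowStart + 1000L } // keep only the 1st second
}
}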
