How to read multiple text files into a single RDD? - apache-spark

I want to read a bunch of text files from a hdfs location and perform mapping on it in an iteration using spark.
JavaRDD<String> records = ctx.textFile(args[1], 1); is capable of reading only one file at a time.
I want to read more than one file and process them as a single RDD. How?

You can specify whole directories, use wildcards and even CSV of directories and wildcards. E.g.:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
As Nick Chammas points out this is an exposure of Hadoop's FileInputFormat and therefore this also works with Hadoop (and Scalding).

Use union as follows:
val sc = new SparkContext(...)
val r1 = sc.textFile("xxx1")
val r2 = sc.textFile("xxx2")
...
val rdds = Seq(r1, r2, ...)
val bigRdd = sc.union(rdds)
Then the bigRdd is the RDD with all files.

You can use a single textFile call to read multiple files. Scala:
sc.textFile(','.join(files))

You can use this
First You can get a Buffer/List of S3 Paths :
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest
def listFiles(s3_bucket:String, base_prefix : String) = {
var files = new ArrayList[String]
//S3 Client and List Object Request
var s3Client = new AmazonS3Client();
var objectListing: ObjectListing = null;
var listObjectsRequest = new ListObjectsRequest();
//Your S3 Bucket
listObjectsRequest.setBucketName(s3_bucket)
//Your Folder path or Prefix
listObjectsRequest.setPrefix(base_prefix)
//Adding s3:// to the paths and adding to a list
do {
objectListing = s3Client.listObjects(listObjectsRequest);
for (objectSummary <- objectListing.getObjectSummaries().asScala) {
files.add("s3://" + s3_bucket + "/" + objectSummary.getKey());
}
listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());
//Removing Base Directory Name
files.remove(0)
//Creating a Scala List for same
files.asScala
}
Now Pass this List object to the following piece of code, note : sc is an object of SQLContext
var df: DataFrame = null;
for (file <- files) {
val fileDf= sc.textFile(file)
if (df!= null) {
df= df.unionAll(fileDf)
} else {
df= fileDf
}
}
Now you got a final Unified RDD i.e. df
Optional, And You can also repartition it in a single BigRDD
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D

In PySpark, I have found an additional useful way to parse files. Perhaps there is an equivalent in Scala, but I am not comfortable enough coming up with a working translation. It is, in effect, a textFile call with the addition of labels (in the below example the key = filename, value = 1 line from file).
"Labeled" textFile
input:
import glob
from pyspark import SparkContext
SparkContext.stop(sc)
sc = SparkContext("local","example") # if running locally
sqlContext = SQLContext(sc)
for filename in glob.glob(Data_File + "/*"):
Spark_Full += sc.textFile(filename).keyBy(lambda x: filename)
output: array with each entry containing a tuple using filename-as-key and with value = each line of file. (Technically, using this method you can also use a different key besides the actual filepath name- perhaps a hashing representation to save on memory). ie.
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
...]
You can also recombine either as a list of lines:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back to single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the filepathing.):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()

you can use
JavaRDD<String , String> records = sc.wholeTextFiles("path of your directory")
here you will get the path of your file and content of that file. so you can perform any action of a whole file at a time that saves the overhead

All answers are correct with sc.textFile
I was just wondering why not wholeTextFiles For example, in this case...
val minPartitions = 2
val path = "/pathtohdfs"
sc.wholeTextFiles(path,minPartitions)
.flatMap{case (path, text)
...
one limitation is that, we have to load small files otherwise performance will be bad and may lead to OOM.
Note :
The wholefile should fit in to memory
Good for file formats that are NOT splittable by line... such as XML files
Further reference to visit

There is a straight forward clean solution available. Use the wholeTextFiles() method. This will take a directory and forms a key value pair. The returned RDD will be a pair RDD.
Find below the description from Spark docs:
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file

TRY THIS
Interface used to write a DataFrame to external storage systems (e.g. file systems, key-value stores, etc). Use DataFrame.write() to access this.
New in version 1.4.
csv(path, mode=None, compression=None, sep=None, quote=None, escape=None, header=None, nullValue=None, escapeQuotes=None, quoteAll=None, dateFormat=None, timestampFormat=None)
Saves the content of the DataFrame in CSV format at the specified path.
Parameters:
path – the path in any Hadoop supported file system
mode –
specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error (default case): Throw an exception if data already exists.
compression – compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).
sep – sets the single character as a separator for each field and value. If None is set, it uses the default value, ,.
quote – sets the single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
escape – sets the single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, \
escapeQuotes – A flag indicating whether values containing quotes should always be enclosed in quotes. If None is set, it uses the default value true, escaping all values containing a quote character.
quoteAll – A flag indicating whether all values should always be enclosed in quotes. If None is set, it uses the default value false, only escaping values containing a quote character.
header – writes the names of columns as the first line. If None is set, it uses the default value, false.
nullValue – sets the string representation of a null value. If None is set, it uses the default value, empty string.
dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type. If None is set, it uses the default value value, yyyy-MM-dd.
timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type. If None is set, it uses the default value value, yyyy-MM-dd'T'HH:mm:ss.SSSZZ.

rdd = textFile('/data/{1.txt,2.txt}')

Related

How to identify or reroute bad xml's when reading xmls with spark

Using spark, I am trying to read a bunch of xmls from a path, one of the files is a dummy file which is not an xml.
I would like the spark to tell me that one particular file is not valid, in any way
Adding "badRecordsPath" otiton writes the bad data into specified location for JSON files, but the same is not working for xml, is there some other way?
df = (spark.read.format('json')
.option('badRecordsPath','/tmp/data/failed')
.load('/tmp/data/dummy.json')
As far as I know.... Unfortunately it wasnt available in xml package of spark till today in a declarative way... in the way you are expecting...
Json it was working since FailureSafeParser was implemented like below... in DataFrameReader
/**
* Loads a `Dataset[String]` storing JSON objects (<a href="http://jsonlines.org/">JSON Lines
* text format or newline-delimited JSON</a>) and returns the result as a `DataFrame`.
*
* Unless the schema is specified using `schema` function, this function goes through the
* input once to determine the input schema.
*
* #param jsonDataset input Dataset with one JSON object per record
* #since 2.2.0
*/
def json(jsonDataset: Dataset[String]): DataFrame = {
val parsedOptions = new JSONOptions(
extraOptions.toMap,
sparkSession.sessionState.conf.sessionLocalTimeZone,
sparkSession.sessionState.conf.columnNameOfCorruptRecord)
val schema = userSpecifiedSchema.getOrElse {
TextInputJsonDataSource.inferFromDataset(jsonDataset, parsedOptions)
}
ExprUtils.verifyColumnNameOfCorruptRecord(schema, parsedOptions.columnNameOfCorruptRecord)
val actualSchema =
StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))
val createParser = CreateJacksonParser.string _
val parsed = jsonDataset.rdd.mapPartitions { iter =>
val rawParser = new JacksonParser(actualSchema, parsedOptions, allowArrayAsStructs = true)
val parser = new FailureSafeParser[String](
input => rawParser.parse(input, createParser, UTF8String.fromString),
parsedOptions.parseMode,
schema,
parsedOptions.columnNameOfCorruptRecord)
iter.flatMap(parser.parse)
}
sparkSession.internalCreateDataFrame(parsed, schema, isStreaming = jsonDataset.isStreaming)
}
you can implement the feature programatic way.
read all the files in the folder using sc.textFile .
foreach file using xml parser parse the entries.
If its valid redirect to another path .
If its invalid, then write in to bad record path.

How do I write a standalone application in Spark to find 20 of most mentions in a text file filled with extracted tweets

I'm creating a standalone application in spark where I need to read in a text file that is filled with tweets. Every mention starts with the symbol, "#". The objective is to go through this file, and find the most 20 mentions. Punctuation should be stripped from all mentions and if the tweet has the same mention more than once, it should be counted only once. There can be multiple unique mentions in a single tweet. There are many tweets in the file.
I am new to scala and apache-spark. I was thinking of using the filter function and placing the results in a list. Then convert the list into a set where items are unique. But the syntax, regular expressions, and reading the file are a problem i face.
def main(args: Array[String]){
val locationTweetFile = args(0)
val spark = SparkSession.builder.appName("does this matter?").getOrCreate()
tweet file is huge, is this command below, safe?
val tweetsFile = spark.read.textFile(locationTweetFile).cache()
val mentionsExp = """([#])+""".r
}
If the tweet had said
"Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER."
Then the output should be something like, ((honda, 1),(customer,1))
Since there are multiple tweets, another tweet can say,
"#HoNdA I am the same #cuSTomER #STACKEXCHANGE."
Then the Final output will be something like
((honda,2),(customer,2),(stackexchange,1))
Let's go step-by step.
1) appName("does this matter?") in your case doesn't matter
2) spark.read.textFile(filename) is safe due to its laziness, file won't be loaded into your memory
Now, about implementation:
Spark is about transformation of data, so you need to think how to transform raw tweets to list of unique mentions in each tweet. Next you transform list of mentions to Map[Mention, Int], where Int is a total count of that mention in the RDD.
Tranformation is usually done via map(f: A => B) method where f is a function mapping A value to B.
def tweetToMentions(tweet: String): Seq[String] =
tweet.split(" ").collect {
case s if s.startsWith("#") => s.replaceAll("[,.;!?]", "").toLowerCase
}.distinct.Seq
val mentions = tweetToMentions("Hey #Honda, I am #customer I love #honda. I am favorite #CUSTOMER.")
// mentions: Seq("#honda", "#customer")
The next step is to apply this function to each element in our RDD:
val mentions = tweetsFile.flatMap(tweetToMentions)
Note that we use flatMap instead of map because tweetToMentions returns Seq[String] and we want our RDD to contain only mentions, flatMap will flatten the result.
To count occurences of each mention in the RDD we need to apply some magic:
First, we map our mentions to pairs of (Mention, 1)
mentions.map(mention => (mention, 1))
Then we use reduceByKey which will count how many times each mention occurs in our RDD. Lastly, we order the mentions by their counts and retreive result.
val result = mentions
.map(mention => (mention, 1))
.reduceByKey((a, b) => a + b)
.takeOrdered(20)(Ordering[Int].reverse.on(_.2))

Apache Spark write to multiple outputs [different parquet schemas] without caching

I want to transform my input data (XML files) and produce 3 different outputs.
Each output will be in parquet format and will have a different schema/number of columns.
Currently in my solution, the data is stored in RDD[Row], where each Row belongs to one of three types and has a different number of fields. What I'm doing now is caching the RDD, then filtering it (using the field telling me about the record type) and saving the data using the following method:
var resultDF_1 = sqlContext.createDataFrame(filtered_data_1, schema_1)
resultDF_1.write.parquet(output_path_1)
...
// the same for filtered_data_2 and filtered_data_3
Is there any way to do it better, for example do not cache entire data in memory?
In MapReduce we have MultipleOutputs class and we can do it this way:
MultipleOutputs.addNamedOutput(job, "data_type_1", DataType1OutputFormat.class, Void.class, Group.class);
MultipleOutputs.addNamedOutput(job, "data_type_2", DataType2OutputFormat.class, Void.class, Group.class);
MultipleOutputs.addNamedOutput(job, "data_type_3", DataType3OutputFormat.class, Void.class, Group.class);
...
MultipleOutputs<Void, Group> mos = new MultipleOutputs<>(context);
mos.write("data_type_1", null, myRecordGroup1, filePath1);
mos.write("data_type_2", null, myRecordGroup2, filePath2);
...
We had exactly this problem, to re-iterate: we read 1000s of datasets into one RDD, all of different schemas (we used a nested Map[String, Any]) and wanted to write those 1000s of datasets to different Parquet partitions in their respective schemas. All in a single embarrassingly parallel Spark Stage.
Our initial approach indeed did the hacky thing of caching, but this meant (a) 1000 passes of the cached data (b) hitting a lot of memory issues!
For a long time now I've wanted to bypass the Spark's provided .parquet methods and go to lower level underlying libraries, and wrap that in a nice functional signature. Finally recently we did exactly this!
The code is too much to copy and paste all of it here, so I will just paste the main crux of the code to explain how it works. We intend on making this code Open Source in the next year or two.
val successFiles: List[String] = successFilePaths(tableKeyToSchema, tableKeyToOutputKey, tableKeyToOutputKeyNprs)
// MUST happen first
info("Deleting success files")
successFiles.foreach(S3Utils.deleteObject(bucket, _))
if (saveMode == SaveMode.Overwrite) {
info("Deleting past files as in Overwrite mode")
parDeleteDirContents(bucket, allDirectories(tableKeyToOutputKey, tableKeyToOutputKeyNprs, partitions, continuallyRunStartTime))
} else {
info("Not deleting past files as in Append mode")
}
rdd.mapPartitionsWithIndex {
case (index, records) =>
records.toList.groupBy(_._1).mapValues(_.map(_._2)).foreach {
case (regularKey: RegularKey, data: List[NotProcessableRecord Either UntypedStruct]) =>
val (nprs: List[NotProcessableRecord], successes: List[UntypedStruct]) =
Foldable[List].partitionEither(data)(identity)
val filename = s"part-by-partition-index-$index.snappy.parquet"
Parquet.writeUntypedStruct(
data = successes,
schema = toMessageType(tableKeyToSchema(regularKey.tableKey)),
fsMode = fs,
path = s3 / bucket / tableKeyToOutputKey(regularKey.tableKey) / regularKey.partition.pathSuffix /?
continuallyRunStartTime.map(hourMinutePathSuffix) / filename
)
Parquet.writeNPRs(
nprs = nprs,
fsMode = fs,
path = s3 / bucket / tableKeyToOutputKeyNprs(regularKey.tableKey) / regularKey.partition.pathSuffix /?
continuallyRunStartTime.map(hourMinutePathSuffix) / filename
)
} pipe Iterator.single
}.count() // Just some action to force execution
info("Writing _SUCCESS files")
successFiles.foreach(S3Utils.uploadFileContent(bucket, "", _))
Of course this code cannot be copy and pasted as many methods and values are not provided. The key points are:
We hand crank the deleting of _SUCCESS files and previous files when overwriting
Each spark partition will result in one-or-many output files (many when multiple data schemas are in the same partition)
We hand crank the writing of _SUCCESS files
Notes:
UntypedStruct is our nested representation of arbitrary schema. It's a little bit like Row in Spark but much better, as it's based on Map[String, Any].
NotProcessableRecord are essentially just dead letters
Parquet.writeUntypedStruct is the crux of the logic of writing a parquet file, so we'll explain this in more detail. Firstly
val toMessageType: StructType => MessageType = new org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter().convert
Should be self explanatory. Next fsMode contains within it the com.amazonaws.auth.AWSCredentials, then inside writeUntypedStruct we use that to construct org.apache.hadoop.conf.Configuration setting fs.s3a.access.key and fs.s3a.secret.key.
writeUntypedStruct basically just calls out to:
def writeRaw(
data: List[UntypedStruct],
schema: MessageType,
config: Configuration,
path: Path,
compression: CompressionCodecName = CompressionCodecName.SNAPPY
): Unit =
Using.resource(
ExampleParquetWriter.builder(path)
.withType(schema)
.withConf(config)
.withCompressionCodec(compression)
.withValidation(true)
.build()
)(writer => data.foreach(data => writer.write(transpose(data, new SimpleGroup(schema)))))
where SimpleGroup comes from org.apache.parquet.example.data.simple, and ExampleParquetWriter extends ParquetWriter<Group>. The method transpose is a very tedious self writing recursion through the UntypedStruct populating a Group (some ugly Java mutable low level thing).
Credit must go to https://github.com/davidainslie for figuring out how these underlying libraries work, and labouring out the code, which like I said, we intend on making Open Source soon!
AFAIK, there is no way to split one RDD into multiple RDD per se. This is just how the way Spark's DAG works: only child RDDs pulling data from parent RDDs.
We can, however, have multiple child RDDs reading from the same parent RDD. To avoid recomputing the parent RDD, there is no other way but to cache it. I assume that you want to avoid caching because you're afraid of insufficient memory. We can avoid Out Of Memory (OOM) issue by persisting the RDD to MEMORY_AND_DISK so that large RDD will spill to disk if and when needed.
Let's begin with your original data:
val allDataRDD = sc.parallelize(Seq(Row(1,1,1),Row(2,2,2),Row(3,3,3)))
We can persist this in memory first, but allow it to spill over to disk in case of insufficient memory:
allDataRDD.persist(StorageLevel.MEMORY_AND_DISK)
We then create the 3 RDD outputs:
filtered_data_1 = allDataRDD.filter(_.get(1)==1) // //
filtered_data_2 = allDataRDD.filter(_.get(2)==1) // use your own filter funcs here
filtered_data_3 = allDataRDD.filter(_.get(3)==1) // //
We then write the outputs:
var resultDF_1 = sqlContext.createDataFrame(filtered_data_1, schema_1)
resultDF_1.write.parquet(output_path_1)
var resultDF_2 = sqlContext.createDataFrame(filtered_data_2, schema_2)
resultDF_2.write.parquet(output_path_2)
var resultDF_3 = sqlContext.createDataFrame(filtered_data_3, schema_3)
resultDF_3.write.parquet(output_path_3)
If you truly really want to avoid multiple passes, there is a workaround using a custom partitioner. You can repartition your data into 3 partitions and each partition will have its own task and hence its own output file/part. The caveat is that parallelism will be heavily reduced to 3 threads/tasks, and there's also the risk of >2GB of data stored in a single partition (Spark has a 2GB limit per partition). I am not providing detailed code for this method because I don't think it can write parquet files with different schema.

Spark Dataframe validating column names for parquet writes

I'm processing events using Dataframes converted from a stream of JSON events which eventually gets written out as Parquet format.
However, some of the JSON events contains spaces in the keys which I want to log and filter/drop such events from the data frame before converting it to Parquet because ;{}()\n\t= are considered special characters in Parquet schema (CatalystSchemaConverter) as listed in [1] below and thus should not be allowed in the column names.
How can I do such validations in Dataframe on the column names and drop such an event altogether without erroring out the Spark Streaming job.
[1]
Spark's CatalystSchemaConverter
def checkFieldName(name: String): Unit = {
// ,;{}()\n\t= and space are special characters in Parquet schema
checkConversionRequirement(
!name.matches(".*[ ,;{}()\n\t=].*"),
s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
|Please use alias to rename it.
""".stripMargin.split("\n").mkString(" ").trim
)
}
For everyone experiencing this in pyspark: this even happened to me after renaming the columns. One way I could get this to work after some iterations is this:
file = "/opt/myfile.parquet"
df = spark.read.parquet(file)
for c in df.columns:
df = df.withColumnRenamed(c, c.replace(" ", ""))
df = spark.read.schema(df.schema).parquet(file)
You can use a regex to replace all invalid characters with an underscore before you write into parquet. Additionally, strip accents from the column names too.
Here's a function normalize that do this for both Scala and Python :
Scala
/**
* Normalize column name by replacing invalid characters with underscore
* and strips accents
*
* #param columns dataframe column names list
* #return the list of normalized column names
*/
def normalize(columns: Seq[String]): Seq[String] = {
columns.map { c =>
org.apache.commons.lang3.StringUtils.stripAccents(c.replaceAll("[ ,;{}()\n\t=]+", "_"))
}
}
// using the function
val df2 = df.toDF(normalize(df.columns):_*)
Python
import unicodedata
import re
def normalize(column: str) -> str:
"""
Normalize column name by replacing invalid characters with underscore
strips accents and make lowercase
:param column: column name
:return: normalized column name
"""
n = re.sub(r"[ ,;{}()\n\t=]+", '_', column.lower())
return unicodedata.normalize('NFKD', n).encode('ASCII', 'ignore').decode()
# using the function
df = df.toDF(*map(normalize, df.columns))
This is my solution using Regex in order to rename all the dataframe's columns following the parquet convention:
df.columns.foldLeft(df){
case (currentDf, oldColumnName) => currentDf.withColumnRenamed(oldColumnName, oldColumnName.replaceAll("[ ,;{}()\n\t=]", ""))
}
I hope it helps,
I had the same problem with column names containing spaces.
The first part of the solution was to put the names in backquotes.
The second part of the solution was to replace the spaces with underscores.
Sorry but I have only the pyspark code ready:
from pyspark.sql import functions as F
df_tmp.select(*(F.col("`" + c+ "`").alias(c.replace(' ', '_')) for c in df_tmp.columns)
Using alias to change your field names without those special characters.
I have encounter this error "Error in SQL statement: AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://learn.microsoft.com/azure/databricks/delta/delta-column-mapping Or you can use alias to rename it."
The issue was because I used MAX(COLUM_NAME) when creating a table based on a parquet / Delta table, and the new name of the new table was "MAX(COLUM_NAME)" because forgot to use Aliases and parquet files doesn't support brackets '()'
Solved by using aliases (removing the brackets)
It was fixed in Spark 3.3.0 release at least for the parquet files (I tested), it might work with JSON as well.

Spark Streaming Desinging Questiion

I am new in spark. I wanted to do spark streaming setup to retrieve key value pairs of below format files:
file: info1
Note: Each info file will have around of 1000 of these records. And our system is continuously generating these info files. Through, spark streaming i wanted to do mapping of line numbers and info files and wanted to get aggregate result.
Can we give input to spark cluster these kind of files? I am interested in the "SF" and "DA" delimiters only, "SF" corresponds to source file and "DA" corresponds the ( line number, count).
As this input data is not the line format, so is this the good idea to use these files for the spark input or should i need to do some intermediary stage where i need to clean these files to generate new files which will have each record information in line instead of blocks?
Or can we achieve this in Spark itself? What should be the right approach?
What i wanted to achieve?
I wanted to get line level information. Means, to get line (As a key) and info files (as values)
Final output i wanted is like below:
line178 -> (info1, info2, info7.................)
line 2908 -> (info3, info90, ..., ... ,)
Do let me know if my explanation is not clear or if i am missing something.
Thanks & Regards,
Vinti
You could do something like this. Having your DStream stream:
// this gives you DA & FP lines, with the line number as the key
val validLines = stream.map(_.split(":")).
filter(line => Seq("DA", "FP").contains(line._1)).
map(_._2.split(","))
map(line => (line._1, line._2))
// now you should accumulate values
val state = validLines.updateStateByKey[Seq[String]](updateFunction _)
def updateFunction(newValues: Seq[Seq[String]], runningValues: Option[Seq[String]]): Option[Seq[String]] = {
// add the new values
val newVals = runnigValues match {
case Some(list) => list :: newValues
case _ => newValues
}
Some(newVals)
}
This should accumulate for each key a sequence with the values associated, storing it in state

Resources