save each element of rdd in text file hdfs - apache-spark

I am using a Spark application. Each element of my RDD contains a good amount of data, and I want to save each element of the RDD to its own HDFS file. I tried rdd.saveAsTextFile("foo.txt"), but that creates a single output for the whole RDD. My RDD has 10 elements and I want 10 files in HDFS. How can I achieve this?

If I understand your question correctly, you can create a custom output format like this:
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Drop the key from the output and use it as the file name instead
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
}
Then convert your RDD into a key/value one where the key is the file path, and use the saveAsHadoopFile function instead of saveAsTextFile, like this:
myRDD.saveAsHadoopFile(OUTPUT_PATH, classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])
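For the case in the question (10 elements, 10 files), a minimal sketch of that key/value conversion might look like the following; the output path and the per-element file naming are assumptions, so adapt them to your data:
// Hedged sketch: give each element its own file name, then write with the custom format above.
// OUTPUT_PATH and the "part-<index>" naming scheme are placeholders.
val OUTPUT_PATH = "hdfs://namenode:8020/user/foo/output"

val keyed = myRDD.zipWithIndex().map { case (element, idx) =>
  (s"part-$idx", element.toString) // key = file name, value = the element's data
}

keyed.saveAsHadoopFile(
  OUTPUT_PATH,
  classOf[String],
  classOf[String],
  classOf[RDDMultipleTextOutputFormat])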

Related

How can I save a single column of a pyspark dataframe in multiple json files?

I have a dataframe that looks a bit like this:
| key 1 | key 2 | key 3 | body |
I want to save this dataframe in 1 json-file per partition, where a partition is a unique combination of keys 1 to 3. I have the following requirements:
The paths of the files should be /key 1/key 2/key 3.json.gz
The files should be compressed
The contents of the files should be values of body (this column contains a json string), one json-string per line.
I've tried multiple things, but no luck.
Method 1: Using native dataframe.write
I've tried using the native write method to save the data. Something like this:
df.write \
    .partitionBy("key 1", "key 2", "key 3") \
    .mode('overwrite') \
    .format('json') \
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .save(
        path=path,
        compression="gzip"
    )
This solution doesn't store the files in the correct path and with the correct name, but this can be fixed by moving them afterwards. However, the biggest problem is that this is writing the complete dataframe, while I only want to write the values of the body column. But I need the other columns to partition the data.
Method 2: Using the Hadoop filesystem
It's possible to directly call the Hadoop filesystem java library using this: sc._gateway.jvm.org.apache.hadoop.fs.FileSystem. With access to this filesystem it's possible to create files myself, giving me more control over the path, the filename and the contents. However, in order to make this code scale I'm doing this per partition, so:
def save_partition(items):
    # Store the items of this partition here
    pass

df.foreachPartition(save_partition)
However, I can't get this to work because the save_partition function is executed on the workers, which doesn't have access to the SparkSession and the SparkContext (which is needed to reach the Hadoop Filesystem JVM libraries). I could solve this by pulling all the data to the driver using collect() and save it from there, but that won't scale.
So, quite a story, but I prefer to be complete here. What am I missing? Is it impossible to do what I want, or am I missing something obvious? Or is it difficult? Or maybe it's only possible from Scala/Java? I would love to get some help on this.
It may be slightly tricky to do in pure PySpark, and it is not recommended to create too many partitions. From what you have explained, I think you are using partitioning only to get one JSON body per file. You may need a bit of Scala here, but your Spark job can still remain a PySpark job.
Spark internally defines DataSource interfaces through which you can define how to read and write data. JSON is one such data source. You can try to extend the default JsonFileFormat class and create your own JsonFileFormatV2. You will also need to define a JsonOutputWriterV2 class extending the default JsonOutputWriter. The output writer has a write function that gives you access to individual rows and the path passed on from the Spark program. You can modify the write function to meet your needs.
Here is a sample of how I achieved customizing JSON writes for my use case of writing a fixed number of JSON entries per file. You can use it as a reference for implementing your own JSON writing strategy.
class JsonFileFormatV2 extends JsonFileFormat {
  override val shortName: String = "jsonV2"

  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = {
    val conf = job.getConfiguration
    // Custom option controlling how many JSON lines go into each output file
    val fileLineCount = options.get("filelinecount").map(_.toInt).getOrElse(1)
    val parsedOptions = new JSONOptions(
      options,
      sparkSession.sessionState.conf.sessionLocalTimeZone,
      sparkSession.sessionState.conf.columnNameOfCorruptRecord)
    parsedOptions.compressionCodec.foreach { codec =>
      CompressionCodecs.setCodecConfiguration(conf, codec)
    }

    new OutputWriterFactory {
      override def newInstance(
          path: String,
          dataSchema: StructType,
          context: TaskAttemptContext): OutputWriter = {
        new JsonOutputWriterV2(path, parsedOptions, dataSchema, context, fileLineCount)
      }

      override def getFileExtension(context: TaskAttemptContext): String = {
        ".json" + CodecStreams.getCompressionExtension(context)
      }
    }
  }
}

private[json] class JsonOutputWriterV2(
    path: String,
    options: JSONOptions,
    dataSchema: StructType,
    context: TaskAttemptContext,
    maxFileLineCount: Int) extends JsonOutputWriter(
      path,
      options,
      dataSchema,
      context) {

  private val encoding = options.encoding match {
    case Some(charsetName) => Charset.forName(charsetName)
    case None => StandardCharsets.UTF_8
  }

  var recordCounter = 0
  var filecounter = 0
  private val maxEntriesPerFile = maxFileLineCount
  private var writer = CodecStreams.createOutputStreamWriter(
    context, new Path(modifiedPath(path)), encoding)
  private[this] var gen = new JacksonGenerator(dataSchema, writer, options)

  // Append a file counter to the task's output path so each roll-over gets its own file
  private def modifiedPath(path: String): String = {
    val np = s"$path-filecount-$filecounter"
    np
  }

  override def write(row: InternalRow): Unit = {
    gen.write(row)
    gen.writeLineEnding()
    recordCounter += 1
    // Once the line limit is reached, close the current file and roll over to a new one
    if (recordCounter >= maxEntriesPerFile) {
      gen.close()
      writer.close()
      filecounter += 1
      recordCounter = 0
      writer = CodecStreams.createOutputStreamWriter(
        context, new Path(modifiedPath(path)), encoding)
      gen = new JacksonGenerator(dataSchema, writer, options)
    }
  }

  override def close(): Unit = {
    if (recordCounter < maxEntriesPerFile) {
      gen.close()
      writer.close()
    }
  }
}
You can add this new custom data source jar to the Spark classpath, and then in your PySpark job you can invoke it as follows:
df.write \
    .format("org.apache.spark.sql.execution.datasources.json.JsonFileFormatV2") \
    .option("filelinecount", "5") \
    .mode("overwrite") \
    .save("path-to-save")

How to read a whole folder's files into one RDD map in Spark?

I use binaryFiles to read files from HDFS, but each map call only gets one file.
sparkContext.binaryFiles("hdfs://name/a/b/id-*.zzz").map(x => {})
In the map phase I can only deal with one file. Can I put two or more files into one map and deal with them in parallel?
def wholeTextFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope
Returns a paired RDD where the key is the file path and the value is the file's content.
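As a rough usage sketch (the HDFS directory and the per-file processing are placeholders), wholeTextFiles lets a single map call handle a whole file at a time:
// Hedged sketch: read every file under the directory into one RDD of
// (filePath, fileContent) pairs, then process each whole file in a single map call.
val files = sc.wholeTextFiles("hdfs://name/a/b/", minPartitions = 4)

val lineCounts = files.map { case (path, content) =>
  (path, content.split("\n").length) // e.g. count the lines of each file
}

lineCounts.collect().foreach(println)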

Convert DStream of case class with joda.DateTime to Spark DataFrame

I want to save a DStream into HDFS using parquet format. The problem is that my case class use joda.DateTime while Spark SQL doesn't support this. For example:
case class Log (timestamp: DateTime, ...dozen of other fields here...)
But I get the error java.lang.UnsupportedOperationException: Schema for type org.joda.time.DateTime is not supported when trying to convert the RDD to a DataFrame:
def output(logdstream: DStream[Log]) {
  logdstream.foreachRDD(elem => {
    val df = elem.toDF()
    df.saveAsParquet(...)
  })
}
My models are complex and have a lot of fields, so I don't want to write different case classes just to get rid of joda.DateTime. Another option would be to save directly from JSON to Parquet, but that's not ideal. Is there an easy way to do an automatic conversion from joda.DateTime to sql.Timestamp so it can be used with Spark (i.e. converted to a Spark DataFrame)?
Thanks.
It's a little bit verbose, but you can try mapping Log to a Spark SQL Row:
logdstream.foreachRDD(rdd => {
  rdd.map(log => Row(
    log.timestamp.toDate,
    log.field2,
    ...
  )).toDF().saveAsParquet(...)
})
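Since toDF is typically not available for an RDD of Row objects, a fuller sketch of this approach pairs the mapped rows with an explicit schema via createDataFrame. The sqlContext value, the output path, and the two-field schema below are assumptions for illustration (Spark 1.4+ writer API):
import java.sql.Timestamp
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

// Hedged sketch: convert joda DateTime to java.sql.Timestamp by hand and supply a schema.
val schema = StructType(Seq(
  StructField("timestamp", TimestampType, nullable = false),
  StructField("field2", StringType, nullable = true)
))

logdstream.foreachRDD { rdd =>
  val rows = rdd.map(log => Row(new Timestamp(log.timestamp.getMillis), log.field2))
  // Append each micro-batch instead of overwriting the path
  sqlContext.createDataFrame(rows, schema).write.mode("append").parquet("hdfs://namenode:8020/logs")
}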

How to load data from saved file with Spark

Spark provides the method saveAsTextFile, which can store an RDD[T] to disk or HDFS easily.
T is an arbitrary serializable class.
I want to reverse the operation.
I wonder whether there is a loadFromTextFile that can easily load a file into an RDD[T]?
Let me make it clear:
class A extends Serializable {
  ...
}

val path: String = "hdfs..."
val d1: RDD[A] = create_A
d1.saveAsTextFile(path)

val d2: RDD[A] = a_load_function(path) // this is the function I want
// d2 should be the same as d1
Try using d1.saveAsObjectFile(path) to store and val d2 = sc.objectFile[A](path) to load.
I think you cannot saveAsTextFile and read it back as RDD[A] without a transformation from RDD[String].
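A minimal sketch of that round trip, assuming sc is an existing SparkContext and using a tiny serializable class in place of the question's A; the HDFS path is a placeholder:
import org.apache.spark.rdd.RDD

// Hedged sketch: save an RDD of serializable objects and load it back with the same type.
class A(val value: Int) extends Serializable

val d1: RDD[A] = sc.parallelize(1 to 10).map(new A(_))
d1.saveAsObjectFile("hdfs://namenode:8020/tmp/a-objects")

// The element type must be given explicitly when reading back.
val d2: RDD[A] = sc.objectFile[A]("hdfs://namenode:8020/tmp/a-objects")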
To create a file-based RDD, we can use the SparkContext.textFile API.
Below is an example:
val textFile = sc.textFile("input.txt")
We can specify the URI explicitly.
If the file is in HDFS:
sc.textFile("hdfs://host:port/filepath")
If the file is on the local filesystem:
sc.textFile("file:///path/to/the/file/")
If the file is in S3:
sc.textFile("s3n://mybucket/sample.txt")
To load the RDD into a specific type:
case class Person(name: String, age: Int)
val people = sc.textFile("employees.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
Here, people will be of type org.apache.spark.rdd.RDD[Person]

Apache Spark: Splitting Pair RDD into multiple RDDs by key to save values

I am using Spark 1.0.1 to process a large amount of data. Each row contains an ID number, some with duplicate IDs. I want to save all the rows with the same ID number in the same location, but I am having trouble doing it efficiently. I create an RDD[(String, String)] of (ID number, data row) pairs:
val mapRdd = rdd.map{ x=> (x.split("\\t+")(1), x)}
A way that works, but is not performant, is to collect the ID numbers, filter the RDD for each ID, and save the RDD of values with the same ID as a text file.
val ids = mapRdd.keys.distinct.collect
ids.foreach({ id =>
  val dataRows = mapRdd.filter(_._1 == id).values
  dataRows.saveAsTextFile(id)
})
I also tried groupByKey or reduceByKey so that each tuple in the RDD contains a unique ID number as the key and a string of the combined data rows, separated by newlines, for that ID number. I want to iterate through the RDD only once using foreach to save the data, but it can't give me the values as an RDD:
groupedRdd.foreach({ tup =>
  val data = sc.parallelize(List(tup._2)) // nested RDD does not work
  data.saveAsTextFile(tup._1)
})
Essentially, I want to split an RDD into multiple RDDs by an ID number and save the values for that ID number into their own location.
I think this problem is similar to
Write to multiple outputs by key Spark - one Spark job
Please refer to the answer there.
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

object Split {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Split" + args(1))
    val sc = new SparkContext(conf)
    sc.textFile("input/path")
      .map(a => (k, v)) // Your own implementation
      .partitionBy(new HashPartitioner(num))
      .saveAsHadoopFile("output/path", classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])
    sc.stop()
  }
}
Just saw a similar answer above, but actually we don't need custom partitioning. MultipleTextOutputFormat will create one file per key; it is fine for multiple records with the same key to fall into the same partition.
new HashPartitioner(num), where num is the number of partitions you want. If you have a large number of distinct keys, you can set num high; that way each partition will not open too many HDFS file handles.
You can directly call saveAsTextFile on the grouped RDD; it will save the data based on partitions. That is, if you have 4 distinct IDs and you set the grouped RDD's number of partitions to 4, Spark stores each partition's data in one file (so you can have only one file per ID), and you can even see the data as iterables per ID in the filesystem.
This will save the data per user ID:
rdd.map { x => (x.split("\\t+")(1), x) }
  .groupByKey(numPartitions)
  .saveAsObjectFile("file")
If you need to retrieve the data again based on a user ID, you can do something like:
val userIdLookupTable = sc.objectFile("file").cache() // could use persist() if the data is too big for memory
val data = userIdLookupTable.lookup(id) // note this returns a sequence; in this case you can just take the first one
Note that there is no particular reason to save to a file in this case; I just did it since the OP asked for it. That being said, saving to a file does allow you to load the RDD at any time after the initial grouping has been done.
One last thing: lookup is faster than a filter approach for accessing IDs, but if you're willing to work off a pull request to Spark, you can check out this answer for a faster approach.
