How can I save a single column of a pyspark dataframe in multiple json files? - apache-spark

I have a dataframe that looks a bit like this:
| key 1 | key 2 | key 3 | body |
I want to save this dataframe in 1 json-file per partition, where a partition is a unique combination of keys 1 to 3. I have the following requirements:
The paths of the files should be /key 1/key 2/key 3.json.gz
The files should be compressed
The contents of the files should be values of body (this column contains a json string), one json-string per line.
I've tried multiple things, but no look.
Method 1: Using native dataframe.write
I've tried using the native write method to save the data. Something like this:
df.write
.partitionBy("key 1", "key 2", "key 3") \
.mode('overwrite') \
.format('json') \
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
.save(
path=path,
compression="gzip"
)
This solution doesn't store the files in the correct path and with the correct name, but this can be fixed by moving them afterwards. However, the biggest problem is that this is writing the complete dataframe, while I only want to write the values of the body column. But I need the other columns to partition the data.
Method 2: Using the Hadoop filesystem
It's possible to directly call the Hadoop filesystem java library using this: sc._gateway.jvm.org.apache.hadoop.fs.FileSystem. With access to this filesystem it's possible to create files myself, giving me more control over the path, the filename and the contents. However, in order to make this code scale I'm doing this per partition, so:
df.foreachPartition(save_partition)
def save_partition(items):
# Store the items of this partition here
However, I can't get this to work because the save_partition function is executed on the workers, which doesn't have access to the SparkSession and the SparkContext (which is needed to reach the Hadoop Filesystem JVM libraries). I could solve this by pulling all the data to the driver using collect() and save it from there, but that won't scale.
So, quite a story, but I prefer to be complete here. What am I missing? Is it impossible to do what I want, or am I missing something obvious? Or is it difficult? Or maybe it's only possible from Scala/Java? I would love to get some help on this.

It may be slightly tricky to do in pure pyspark. It is not recommended to create too many partitions. From what you have explained I think you are using partition only to get one JSON body per file. You may need a bit of Scala here but your spark job can still remain to be a PySpark Job.
Spark Internally defines DataSources interfaces through which you can define how to read and write data. JSON is one such data source. You can try to extend the default JsonFileFormat class and create your own JsonFileFormatV2. You will also need to define a JsonOutputWriterV2 class extending the default JsonOutputWriter. The output writer has a write function that gives you access to individual rows and paths passed on from the spark program. You can modify the write function to meet your needs.
Here is a sample of how I achieved customizing JSON writes for my use case of writing a fixed number of JSON entries per file. You can use it as a reference for implementing your own JSON writing strategy.
class JsonFileFormatV2 extends JsonFileFormat {
override val shortName: String = "jsonV2"
override def prepareWrite(
sparkSession: SparkSession,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory = {
val conf = job.getConfiguration
val fileLineCount = options.get("filelinecount").map(_.toInt).getOrElse(1)
val parsedOptions = new JSONOptions(
options,
sparkSession.sessionState.conf.sessionLocalTimeZone,
sparkSession.sessionState.conf.columnNameOfCorruptRecord)
parsedOptions.compressionCodec.foreach { codec =>
CompressionCodecs.setCodecConfiguration(conf, codec)
}
new OutputWriterFactory {
override def newInstance(
path: String,
dataSchema: StructType,
context: TaskAttemptContext): OutputWriter = {
new JsonOutputWriterV2(path, parsedOptions, dataSchema, context, fileLineCount)
}
override def getFileExtension(context: TaskAttemptContext): String = {
".json" + CodecStreams.getCompressionExtension(context)
}
}
}
}
private[json] class JsonOutputWriterV2(
path: String,
options: JSONOptions,
dataSchema: StructType,
context: TaskAttemptContext,
maxFileLineCount: Int) extends JsonOutputWriter(
path,
options,
dataSchema,
context) {
private val encoding = options.encoding match {
case Some(charsetName) => Charset.forName(charsetName)
case None => StandardCharsets.UTF_8
}
var recordCounter = 0
var filecounter = 0
private val maxEntriesPerFile = maxFileLineCount
private var writer = CodecStreams.createOutputStreamWriter(
context, new Path(modifiedPath(path)), encoding)
private[this] var gen = new JacksonGenerator(dataSchema, writer, options)
private def modifiedPath(path:String): String = {
val np = s"$path-filecount-$filecounter"
np
}
override def write(row: InternalRow): Unit = {
gen.write(row)
gen.writeLineEnding()
recordCounter += 1
if(recordCounter >= maxEntriesPerFile){
gen.close()
writer.close()
filecounter+=1
recordCounter = 0
writer = CodecStreams.createOutputStreamWriter(
context, new Path(modifiedPath(path)), encoding)
gen = new JacksonGenerator(dataSchema, writer, options)
}
}
override def close(): Unit = {
if(recordCounter<maxEntriesPerFile){
gen.close()
writer.close()
}
}
}
You can add this new custom data source jar to spark classpath and then in your pyspark you can invoke it as follows.
df.write.format("org.apache.spark.sql.execution.datasources.json.JsonFileFormatV2").option("filelinecount","5").mode("overwrite").save("path-to-save")

Related

Nullability in Spark sql schemas is advisory by default. What is best way to strictly enforce it?

I am working on a simple ETL project which reads CSV files, performs
some modifications on each column, then writes the result out as JSON.
I would like downstream processes which read my results
to be confident that my output conforms to
an agreed schema, but my problem is that even if I define
my input schema with nullable=false for all fields, nulls can sneak
in and corrupt my output files, and there seems to be no (performant) way I can
make Spark enforce 'not null' for my input fields.
This seems to be a feature, as stated below in Spark, The Definitive Guide:
when you define a schema where all columns are declared to not have
null values , Spark will not enforce that and will happily let null
values into that column. The nullable signal is simply to help Spark
SQL optimize for handling that column. If you have null values in
columns that should not have null values, you can get an incorrect
result or see strange exceptions that can be hard to debug.
I have written a little check utility to go through each row of a dataframe and
raise an error if nulls are detected in any of the columns (at any level of
nesting, in the case of fields or subfields like map, struct, or array.)
I am wondering, specifically: DID I RE-INVENT THE WHEEL WITH THIS CHECK UTILITY ? Are there any existing libraries, or
Spark techniques that would do this for me (ideally in a better way than what I implemented) ?
The check utility and a simplified version of my pipeline appears below. As presented, the call to the
check utility is commented out. If you run without the check utility enabled, you would see this result in
/tmp/output.csv.
cat /tmp/output.json/*
(one + 1),(two + 1)
3,4
"",5
The second line after the header should be a number, but it is an empty string
(which is how spark writes out the null, I guess.) This output would be problematic for
downstream components that read my ETL job's output: these components just want integers.
Now, I can enable the check by un-commenting out the line
//checkNulls(inDf)
When I do this I get an exception that informs me of the invalid null value and prints
out the entirety of the offending row, like this:
java.lang.RuntimeException: found null column value in row: [null,4]
One Possible Alternate Approach Given in Spark/Definitive Guide
Spark, The Definitive Guide mentions the possibility of doing this:
<dataframe>.na.drop()
But this would (AFAIK) silently drop the bad records rather than flagging the bad ones.
I could then do a "set subtract" on the input before and after the drop, but that seems like
a heavy performance hit to find out what is null and what is not. At first glance, I'd
prefer my method.... But I am still wondering if there might be some better way out there.
The complete code is given below. Thanks !
package org
import java.io.PrintWriter
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// before running, do; rm -rf /tmp/out* /tmp/foo*
object SchemaCheckFailsToExcludeInvalidNullValue extends App {
import NullCheckMethods._
//val input = "2,3\n\"xxx\",4" // this will be dropped as malformed
val input = "2,3\n,4" // BUT.. this will be let through
new PrintWriter("/tmp/foo.csv") { write(input); close }
lazy val sparkConf = new SparkConf()
.setAppName("Learn Spark")
.setMaster("local[*]")
lazy val sparkSession = SparkSession
.builder()
.config(sparkConf)
.getOrCreate()
val spark = sparkSession
val schema = new StructType(
Array(
StructField("one", IntegerType, nullable = false),
StructField("two", IntegerType, nullable = false)
)
)
val inDf: DataFrame =
spark.
read.
option("header", "false").
option("mode", "dropMalformed").
schema(schema).
csv("/tmp/foo.csv")
//checkNulls(inDf)
val plusOneDf = inDf.selectExpr("one+1", "two+1")
plusOneDf.show()
plusOneDf.
write.
option("header", "true").
csv("/tmp/output.csv")
}
object NullCheckMethods extends Serializable {
def checkNull(columnValue: Any): Unit = {
if (columnValue == null)
throw new RuntimeException("got null")
columnValue match {
case item: Seq[_] =>
item.foreach(checkNull)
case item: Map[_, _] =>
item.values.foreach(checkNull)
case item: Row =>
item.toSeq.foreach {
checkNull
}
case default =>
println(
s"bad object [ $default ] of type: ${default.getClass.getName}")
}
}
def checkNulls(row: Row): Unit = {
try {
row.toSeq.foreach {
checkNull
}
} catch {
case err: Throwable =>
throw new RuntimeException(
s"found null column value in row: ${row}")
}
}
def checkNulls(df: DataFrame): Unit = {
df.foreach { row => checkNulls(row) }
}
}
You can use the built-in Row method anyNull to split the dataframe and process both splits differently:
val plusOneNoNulls = plusOneDf.filter(!_.anyNull)
val plusOneWithNulls = plusOneDf.filter(_.anyNull)
If you don't plan to have a manual null-handling process, using the builtin DataFrame.na methods is simpler since it already implements all the usual ways to automatically handle nulls (i.e drop or fill them out with default values).

save each element of rdd in text file hdfs

I am using spark application. In each element of rdd contains good amount of data. I want to save each element of rdd into multiple hdfs files respectively. I tried rdd.saveAsTextFile("foo.txt") But I will create a single file for whole rdd. rdd size is 10. I want 10 files in hdfs. How can I achieve this??
If I understand your question, you can create a custom output format like this
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
}
Then convert your RDD into a key/val one where the key is the file path, and you can use saveAsHadoopFile function insted of saveAsTextFile, like this:
myRDD.saveAsHadoopFile(OUTPUT_PATH, classOf[String], classOf[String],classOf[RDDMultipleTextOutputFormat])

How to send transformed data from partitions to S3?

I have an RDD which is to big to collect. I have applied a chain of transformations to the RDD and want to send its transformed data directly from its partitions on my slaves to S3. I am currently operating as follows:
val rdd:RDD = initializeRDD
val rdd2 = rdd.transform
rdd2.first // in order to force calculation of RDD
rdd2.foreachPartition sendDataToS3
Unfortunately, the data that gets sent to S3 is untransformed. The RDD looks exactly like it did in stage initializeRDD.
Here is the body of sendDataToS3:
implicit class WriteableRDD[T](rdd:RDD[T]){
def transform:RDD[String] = rdd map {_.toString}
....
def sendPartitionsToS3(prefix:String) = {
rdd.foreachPartition { p =>
val filename = prefix+new scala.util.Random().nextInt(1000000)
val pw = new PrintWriter(new File(filename))
p foreach pw.println
pw.close
s3.putObject(S3_BUCKET, filename, new File(filename))
}
this
}
}
This is called with rdd.transform.sendPartitionsToS3(prefix).
How do I make sure the data that gets sent in sendDataToS3 is the transformed data?
My guess is there is a bug in your code that is not included in the question.
I'm answering anyway just to make sure you are aware of RDD.saveAsTextFile. You can give it a path on S3 (s3n://bucket/directory) and it will write each partition into that path directly from the executors.
I can hardly imagine when you would need to implement your own sendPartitionsToS3 instead of using saveAsTextFile.

How to load data from saved file with Spark

Spark provide method saveAsTextFile which can store RDD[T] into disk or hdfs easily.
T is an arbitrary serializable class.
I want to reverse the operation.
I wonder whether there is a loadFromTextFile which can easily load a file into RDD[T]?
Let me make it clear:
class A extends Serializable {
...
}
val path:String = "hdfs..."
val d1:RDD[A] = create_A
d1.saveAsTextFile(path)
val d2:RDD[A] = a_load_function(path) // this is the function I want
//d2 should be the same as d1
Try to use d1.saveAsObjectFile(path) to store and val d2 = sc.objectFile[A](path) to load.
I think you cannot saveAsTextFile and read it out as RDD[A] without transformation from RDD[String]
To create file based RDD, We can use SparkContext.textFile API
Below is an example:
val textFile = sc.textFile("input.txt")
We can specify the URI explicitly.
If the file is in HDFS:
sc.textFile("hdfs://host:port/filepath")
If the file is in local:
sc.textFile("file:///path to the file/")
If the file is S3:
s3.textFile("s3n://mybucket/sample.txt");
To load RDD to Speicific type:
case class Person(name: String, age: Int)
val people = sc.textFile("employees.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
Here, people will be of type org.apache.spark.rdd.RDD[Person]

aparch spark, NotSerializableException: org.apache.hadoop.io.Text

here is my code:
val bg = imageBundleRDD.first() //bg:[Text, BundleWritable]
val res= imageBundleRDD.map(data => {
val desBundle = colorToGray(bg._2) //lineA:NotSerializableException: org.apache.hadoop.io.Text
//val desBundle = colorToGray(data._2) //lineB:everything is ok
(data._1, desBundle)
})
println(res.count)
lineB goes well but lineA shows that:org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.Text
I try to use use Kryo to solve my problem but it seems nothing has been changed:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
class MyRegistrator extends KryoRegistrator {
override def registerClasses(kryo: Kryo) {
kryo.register(classOf[Text])
kryo.register(classOf[BundleWritable])
}
}
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
Thanks!!!
I had a similar problem when my Java code was reading sequence files containing Text keys.
I found this post helpful:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-java-io-NotSerializableException-org-apache-hadoop-io-Text-td2650.html
In my case, I converted Text to a String using map:
JavaPairRDD<String, VideoRecording> mapped = videos.map(new PairFunction<Tuple2<Text,VideoRecording>,String,VideoRecording>() {
#Override
public Tuple2<String, VideoRecording> call(
Tuple2<Text, VideoRecording> kv) throws Exception {
// Necessary to copy value as Hadoop chooses to reuse objects
VideoRecording vr = new VideoRecording(kv._2);
return new Tuple2(kv._1.toString(), vr);
}
});
Be aware of this note in the API for sequenceFile method in JavaSparkContext:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD will create many references to the same object. If you plan to directly cache Hadoop writable objects, you should first copy them using a map function.
In Apache Spark while dealing with Sequence files, we have to follow these techniques:
-- Use Java equivalent Data Types in place of Hadoop data types.
-- Spark Automatically converts the Writables into Java equivalent Types.
Ex:- We have a sequence file "xyz", here key type is say Text and value
is LongWritable. When we use this file to create an RDD, we need use their
java equivalent data types i.e., String and Long respectively.
val mydata = = sc.sequenceFile[String, Long]("path/to/xyz")
mydata.collect
The reason your code has the serialization problem is that your Kryo setup, while close, isn't quite right:
change:
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
to:
val sparkConf = new SparkConf()
// ... set master, appname, etc, then:
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(sparkConf)

Resources