How can I make (Spark1.6) saveAsTextFile to append existing file? - apache-spark

In SparkSQL,I use DF.wirte.mode(SaveMode.Append).json(xxxx),but this method get these files like
the filename is too complex and random,I can't use api to get.So I want to use saveAstextfile ,beacuse filename is not complex and regular, but I don't know how to append file in same diretory?Appreciate for your time.

worked on Spark 1.5 , I think this is right usage..
dataframe.write().mode(SaveMode.Append).format(FILE_FORMAT).**partitionBy**("parameter1", "parameter2").save(path);

As spark uses HDFS, this is the typical output it produces. You can use the FileUtil to merge the files back into one. It is an efficient solution as it doesn't require spark to collect whole data into single memory by partitioning it into 1. This is the approach i follow.
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
val hdfs = FileSystem.get(hadoopConf)
val mergedPath = "merged-" + filePath + ".json"
val merged = new Path(mergedPath)
if (hdfs.exists(merged)) {
hdfs.delete(merged, true)
}
df.wirte.mode(SaveMode.Append).json(filePath)
FileUtil.copyMerge(hdfs, path, hdfs, merged, false, hadoopConf, null)
You can read the single file using mergedPath location. Hope it helps.

You can try this method which I find from somewhere.
Process Spark Streaming rdd and store to single HDFS file
import org.apache.hadoop.fs.{ FileSystem, FileUtil, Path }
def saveAsTextFileAndMerge[T](hdfsServer: String, fileName: String, rdd: RDD[T]) = {
val sourceFile = hdfsServer + "/tmp/"
rdd.saveAsTextFile(sourceFile)
val dstPath = hdfsServer + "/final/"
merge(sourceFile, dstPath, fileName)
}
def merge(srcPath: String, dstPath: String, fileName: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
val destinationPath = new Path(dstPath)
if (!hdfs.exists(destinationPath)) {
hdfs.mkdirs(destinationPath)
}
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath + "/" + fileName), false, hadoopConfig, null)
}

Related

Download several directories from blob azure storage in Kotlin and ZIP Them

I am trying to download 2 directory from blob azure storage as bytearray and compress them to zip but i got an error.
I saw in web that directory does not really exist in blob storage... I did not really understand how it works so...
Here is my code below :
val dstsData = "/projections/$projectionId/inputs/datasources"
val dstsAss = "/projections/$projectionId/inputs/assumptions"
val dstsDL = "${inputDispatchPath}inputs.zip"
val inputsDatasources = withContext(Dispatchers.IO) {
azureBatch.filesystem.downloadAsByteArray(dstsData)
}
val inputsAssumptions = withContext(Dispatchers.IO) {
azureBatch.filesystem.downloadAsByteArray(dstsAss)
}
val baos = ByteArrayOutputStream()
val zipOutputStream = ZipOutputStream(baos)
val entry = ZipEntry("inputs.zip")
entry.size = inputsDatasources.size.toLong() + inputsAssumptions.size.toShort()
withContext(Dispatchers.IO){
zipOutputStream.putNextEntry(entry)
zipOutputStream.write(inputsDatasources)
zipOutputStream.write(inputsAssumptions)
zipOutputStream.closeEntry()
zipOutputStream.close()
}
val file = ezEventBus.request(
ServiceIORequest.WriteURLFor(
user,
dstsDL,
baos.toByteArray().size.toLong()
),
correlationId
)
file .uploader.upload(
vertx,
webClient,
Buffer.buffer(baos.toByteArray()),
baos.toByteArray().size.toLong())
Do you have an idea?
Thanks a lot
Regards

Is it possible to build spark code on fly and execute?

I am trying to create a generic function to read a csv file using databriks CSV READER.But the option's are not mandatory it can differ based on the my input json configuration file.
Example1 :
"ReaderOption":{
"delimiter":";",
"header":"true",
"inferSchema":"true",
"schema":"""some custome schema.."""
},
Example2:
"ReaderOption":{
"delimiter":";",
"schema":"""some custome schema.."""
},
Is it possible to construct options or the entire read statement in runtime and run in spark ?
like below,
def readCsvWithOptions(): DataFrame=
{
val options:Map[String,String]= Map("inferSchema"->"true")
val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
.option(options)
.load(inputPath)
readDF
}
def readCsvWithOptions(): DataFrame=
{
val options:Map[String,String]= Map("inferSchema"->"true")
val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
.options(options)
.load(inputPath)
readDF
}
There is an options which takes key, value pair.

Trying to understand spark streaming flow

I have this piece of code:
val lines: org.apache.spark.streaming.dstream.InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
ssc.start()
ssc.awaitTermination()
The way I understand it is, foreachRDD is happening at the driver level? So basically all that block of code:
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
is happening at the driver level? The sparkStreamingService.run(df) method basically does some transformations on the current dataframe to yield a new dataframe and then calls another method (on another jar) which stores the dataframe to cassandra.
So if this is happening all at the driver level, we are not utilizing the spark executors and how can I make it so that the executors are being used in parallel to process each partition of the RDD in parallel
My spark streaming service run method:
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
The invocation of foreachRDD does happen on the driver node. But, since we're operating at the RDD level, any transformation on it will be distributed. In your example, rdd.map will cause each partition to be sent to a particular worker node for computation.
Since we don't know what your sparkStreamingService.run method is doing, we cant tell you about the locality of its execution.
The foreachRDD may run locally, but that just means the setup. The RDD itself is a distributed collection, so the actual work is distributed.
To comment directly on the code from the docs:
dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}
Notice that the part of the code that is NOT based around the RDD is executed at the driver. It's the code built up using RDD that is distributed to the workers.
Your code specifically is commented below:
//df.select will be distributed, but collect will pull it all back in
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
//Since collect created a local collection then this is done on the driver
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
//This will run locally, creating a distributed record
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
//This will redistribute the work
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
//again, setting this up locally, to be run distributed
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
Ultimately, you probably can rewrite this to not need the collect and keep it all distributed, but that is for you not StackOverflow

spark read wholeTextFiles with non UTF-8 encoding

I want to read whole text files in non UTF-8 encoding via
val df = spark.sparkContext.wholeTextFiles(path, 12).toDF
into spark. How can I change the encoding?
I would want to read ISO-8859 encoded text, but it is not CSV, it is something similar to xml:SGML.
edit
maybe a custom Hadoop file input format should be used?
https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma
http://henning.kropponline.de/2016/10/23/custom-matlab-inputformat-for-apache-spark/
You can read the files using SparkContext.binaryFiles() instead and build the String for the contents specifying the charset you need. E.g:
val df = spark.sparkContext.binaryFiles(path, 12)
.mapValues(content => new String(content.toArray(), StandardCharsets.ISO_8859_1))
.toDF
It's Simple.
Here is the source code,
import java.nio.charset.Charset
import org.apache.hadoop.io.{Text, LongWritable}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
object TextFile {
val DEFAULT_CHARSET = Charset.forName("UTF-8")
def withCharset(context: SparkContext, location: String, charset: String): RDD[String] = {
if (Charset.forName(charset) == DEFAULT_CHARSET) {
context.textFile(location)
} else {
// can't pass a Charset object here cause its not serializable
// TODO: maybe use mapPartitions instead?
context.hadoopFile[LongWritable, Text, TextInputFormat](location).map(
pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)
)
}
}
}
From here it's copied.
https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala
To Use it.
https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/util/TextFileSuite.scala
Edit:
If you need wholetext file,
Here is the actual source of the implementation.
def wholeTextFiles(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
assertNotStopped()
val job = NewHadoopJob.getInstance(hadoopConfiguration)
// Use setInputPaths so that wholeTextFiles aligns with hadoopFile/textFile in taking
// comma separated files as input. (see SPARK-7155)
NewFileInputFormat.setInputPaths(job, path)
val updateConf = job.getConfiguration
new WholeTextFileRDD(
this,
classOf[WholeTextFileInputFormat],
classOf[Text],
classOf[Text],
updateConf,
minPartitions).map(record => (record._1.toString, record._2.toString)).setName(path)
}
Try changing :
.map(record => (record._1.toString, record._2.toString))
to(probably):
.map(record => (record._1.toString, new String(record._2.getBytes, 0, record._2.getLength, "myCustomCharset")))

Spark - Read compressed files without file extension

I have a S3 bucket that is filled with Gz files that have no file extension. For example s3://mybucket/1234502827-34231
sc.textFile uses that file extension to select the decoder. I have found many blog post on handling custom file extensions but nothing about missing file extensions.
I think the solution may be sc.binaryFiles and unzipping the file manually.
Another possibility is to figure out how sc.textFile finds the file format. I'm not clear what these classOf[] calls work.
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
assertNotStopped()
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString).setName(path)
}
Can you try to combine the below solution for ZIP files, with gzipFileInputFormat library?
here - How to open/stream .zip files through Spark?
You can see how to do it using ZIP:
rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP", ZipFileInputFormat.class, Text.class, Text.class, new Job().getConfiguration());
gzipFileInputFormat:
https://github.com/bsankaran/internet_routing/blob/master/hadoop-tr/src/main/java/edu/usc/csci551/tools/GZipFileInputFormat.java
Some details about newAPIHadoopFile() can be found here:
http://spark.apache.org/docs/latest/api/python/pyspark.html
I found several examples out there that almost fit my needs. Here is the final code I used to parse a file compressed with GZ.
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.spark.input.PortableDataStream
import scala.util.Try
import java.nio.charset._
def extractBSM(ps: PortableDataStream, n: Int = 1024) = Try {
val gz = new GzipCompressorInputStream(ps.open)
Stream.continually {
// Read n bytes
val buffer = Array.fill[Byte](n)(-1)
val i = gz.read(buffer, 0, n)
(i, buffer.take(i))
}
// Take as long as we've read something
.takeWhile(_._1 > 0)
.map(_._2)
.flatten
.toArray
}
def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]) = new String(bytes, StandardCharsets.UTF_8)
val inputFile = "s3://my-bucket/157c96bd-fb21-4cc7-b340-0bd4b8e2b614"
val rdd = sc.binaryFiles(inputFile).flatMapValues(x => extractBSM(x).toOption).map( x => decode()(x._2) )
val rdd2 = rdd.flatMap { x => x.split("\n") }
rdd2.take(10).foreach(println)
You can create your own custom codec for decoding your file. You can start by extending GzipCodec and override getDefaultExtension method where you return empty string as an extension.
EDIT: That soultion will not work in all cases due to how CompressionCodecFactory is implemented. For example: By default codec for .lz4 is loaded. This means if name of a file that you want to load ends with 4, that codec will get picked instead of custom (w/o extension). As that codec does not match extension it will get later ditched and no codec will be used.
Java:
package com.customcodec;
import org.apache.hadoop.io.compress.GzipCodec;
public class GzipCodecNoExtension extends GzipCodec {
#Override
public String getDefaultExtension() {
return "";
}
}
In spark app you just register your codec:
SparkConf conf = new SparkConf()
.set("spark.hadoop.io.compression.codecs", "com.customcodec.GzipCodecNoExtension");
You can read binary file and do decompression using map function.
JavaRDD<Tuple2<String, PortableDataStream>> rawData = spark.sparkContext().binaryFiles(readLocation, 1).toJavaRDD();
JavaRDD<String> decompressedData = rawData.map((Function<Tuple2<String, PortableDataStream>, String>) stringPortableDataStreamTuple2 -> {
ByteArrayOutputStream out = new ByteArrayOutputStream();
GZIPInputStream s = new GZIPInputStream(new ByteArrayInputStream(stringPortableDataStreamTuple2._2.toArray()));
IOUtils.copy(s, out);
return new String(out.toByteArray());
});
In case of JSON content you can read that into Dataset using
Dataset co = spark.read().json(decompressedData);

Resources