Spark - Read compressed files without file extension - apache-spark

I have an S3 bucket that is filled with gzip files that have no file extension, for example s3://mybucket/1234502827-34231.
sc.textFile uses the file extension to select the decompression codec. I have found many blog posts on handling custom file extensions, but nothing about missing file extensions.
I think the solution may be sc.binaryFiles and unzipping the file manually.
Another possibility is to figure out how sc.textFile finds the file format. I'm not clear on how these classOf[] calls work.
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

Can you try combining the solution below for ZIP files with the gzipFileInputFormat library?
Here: How to open/stream .zip files through Spark?
You can see how to do it for ZIP:
rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP",
    ZipFileInputFormat.class, Text.class, Text.class, new Job().getConfiguration());
gzipFileInputFormat:
https://github.com/bsankaran/internet_routing/blob/master/hadoop-tr/src/main/java/edu/usc/csci551/tools/GZipFileInputFormat.java
Some details about newAPIHadoopFile() can be found here:
http://spark.apache.org/docs/latest/api/python/pyspark.html
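For reference, a rough Scala sketch of the same newAPIHadoopFile call. The input-format class and its key/value types are assumptions here and must match whatever the library you pick actually emits; the import of the input format depends on where you build it from.
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
// the ZipFileInputFormat / GZipFileInputFormat import comes from the external library you choose

val conf = Job.getInstance().getConfiguration
val rdd1 = sc.newAPIHadoopFile(
  "/Users/myname/data/compressed/target_file.ZIP",  // path from the Java example above
  classOf[ZipFileInputFormat],                       // or classOf[GZipFileInputFormat] for gzip data
  classOf[Text],                                     // assumed key type
  classOf[Text],                                     // assumed value type
  conf)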

I found several examples out there that almost fit my needs. Here is the final code I used to parse a file compressed with GZ.
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.spark.input.PortableDataStream
import scala.util.Try
import java.nio.charset._
def extractBSM(ps: PortableDataStream, n: Int = 1024) = Try {
  val gz = new GzipCompressorInputStream(ps.open)
  Stream.continually {
    // Read up to n bytes at a time
    val buffer = Array.fill[Byte](n)(-1)
    val i = gz.read(buffer, 0, n)
    (i, buffer.take(i))
  }
  // Keep going as long as we've read something
  .takeWhile(_._1 > 0)
  .map(_._2)
  .flatten
  .toArray
}

def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]) =
  new String(bytes, charset)

val inputFile = "s3://my-bucket/157c96bd-fb21-4cc7-b340-0bd4b8e2b614"
val rdd = sc.binaryFiles(inputFile)
  .flatMapValues(x => extractBSM(x).toOption)
  .map(x => decode()(x._2))
val rdd2 = rdd.flatMap { x => x.split("\n") }
rdd2.take(10).foreach(println)

You can create your own custom codec for decoding your files. You can start by extending GzipCodec and overriding the getDefaultExtension method so that it returns an empty string as the extension.
EDIT: That solution will not work in all cases due to how CompressionCodecFactory is implemented. For example, the codec for .lz4 is loaded by default. This means that if the name of a file you want to load ends with 4, that codec gets picked instead of the custom one (without an extension). Since that codec does not actually match the extension, it is later discarded and no codec is used at all.
Java:
package com.customcodec;

import org.apache.hadoop.io.compress.GzipCodec;

public class GzipCodecNoExtension extends GzipCodec {

    @Override
    public String getDefaultExtension() {
        return "";
    }
}
In the Spark app you just register your codec:
SparkConf conf = new SparkConf()
.set("spark.hadoop.io.compression.codecs", "com.customcodec.GzipCodecNoExtension");

You can read the binary files and do the decompression in a map function.
JavaRDD<Tuple2<String, PortableDataStream>> rawData =
    spark.sparkContext().binaryFiles(readLocation, 1).toJavaRDD();

JavaRDD<String> decompressedData = rawData.map(
    (Function<Tuple2<String, PortableDataStream>, String>) stringPortableDataStreamTuple2 -> {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GZIPInputStream s = new GZIPInputStream(
            new ByteArrayInputStream(stringPortableDataStreamTuple2._2.toArray()));
        IOUtils.copy(s, out);
        return new String(out.toByteArray());
    });
In the case of JSON content, you can read it into a Dataset using:
Dataset co = spark.read().json(decompressedData);
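A Scala sketch of the same approach, for comparison. It assumes a SparkSession named spark and Spark 2.2+ (so spark.read.json can take a Dataset[String]); the path is the extension-less object from the question.
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

import spark.implicits._

val jsonLines = spark.sparkContext
  .binaryFiles("s3://mybucket/1234502827-34231")
  .flatMap { case (_, stream) =>
    val gz = new GZIPInputStream(new ByteArrayInputStream(stream.toArray))
    Source.fromInputStream(gz, "UTF-8").getLines().toList  // decompress and split into lines
  }

val df = spark.read.json(spark.createDataset(jsonLines))    // assumes one JSON object per line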

Related

Download several directories from Azure Blob Storage in Kotlin and ZIP them

I am trying to download two directories from Azure Blob Storage as byte arrays and compress them into a ZIP, but I got an error.
I read on the web that directories do not really exist in Blob Storage... I did not really understand how that works.
Here is my code below:
val dstsData = "/projections/$projectionId/inputs/datasources"
val dstsAss = "/projections/$projectionId/inputs/assumptions"
val dstsDL = "${inputDispatchPath}inputs.zip"

val inputsDatasources = withContext(Dispatchers.IO) {
    azureBatch.filesystem.downloadAsByteArray(dstsData)
}
val inputsAssumptions = withContext(Dispatchers.IO) {
    azureBatch.filesystem.downloadAsByteArray(dstsAss)
}

val baos = ByteArrayOutputStream()
val zipOutputStream = ZipOutputStream(baos)
val entry = ZipEntry("inputs.zip")
entry.size = inputsDatasources.size.toLong() + inputsAssumptions.size.toShort()

withContext(Dispatchers.IO) {
    zipOutputStream.putNextEntry(entry)
    zipOutputStream.write(inputsDatasources)
    zipOutputStream.write(inputsAssumptions)
    zipOutputStream.closeEntry()
    zipOutputStream.close()
}

val file = ezEventBus.request(
    ServiceIORequest.WriteURLFor(
        user,
        dstsDL,
        baos.toByteArray().size.toLong()
    ),
    correlationId
)

file.uploader.upload(
    vertx,
    webClient,
    Buffer.buffer(baos.toByteArray()),
    baos.toByteArray().size.toLong()
)
Do you have an idea?
Thanks a lot
Regards

How to parse XML coming from a Kafka topic via Spark Streaming?

I want to parse XML coming from a Kafka topic using Spark Streaming.
com.databricks:spark-xml_2.10:0.4.1 is able to parse XML, but only from files in HDFS.
I already tried with the library com.databricks:spark-xml_2.10:0.4.1:
val df = spark.read.format("com.databricks.spark.xml").option("rowTag", "ServiceRequest").load("/tmp/sanal/gems/gem_opr.xml")
Desired results:
1) Take the stream in Spark
2) Parse the XML stream in the output
You can use the com.databricks.spark.xml.XmlReader.xmlRdd(spark: SparkSession, xmlRDD: RDD[String]): DataFrame method to read XML from an RDD<String>. For example:
import com.databricks.spark.xml.XmlReader;

// setting up sample data
List<ConsumerRecord<String, String>> recordsList = new ArrayList<>();
recordsList.add(new ConsumerRecord<String, String>("topic", 1, 0, "key",
    "<?xml version=\"1.0\"?><catalog><book id=\"bk101\"><genre>Computer</genre></book></catalog>"));
JavaRDD<ConsumerRecord<String, String>> rdd = jsc.parallelize(recordsList);  // jsc is a JavaSparkContext

// map the ConsumerRecord RDD to an RDD of the raw XML strings
JavaRDD<String> xmlRdd = rdd.map(r -> r.value());

// read the XML RDD into a DataFrame
Dataset<Row> df = new XmlReader().xmlRdd(spark, xmlRdd.rdd());
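To tie this back to the streaming part of the question, here is a hedged Scala sketch. It assumes spark-streaming-kafka-0-10 and spark-xml on the classpath, an existing StreamingContext ssc and SparkSession spark; the broker address, group id and topic name are placeholders.
import com.databricks.spark.xml.XmlReader
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "xml-consumer")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("topic"), kafkaParams))

stream.foreachRDD { rdd =>
  val xmlRdd = rdd.map(_.value())                  // RDD[String] of raw XML payloads
  val df = new XmlReader().xmlRdd(spark, xmlRdd)   // parse each record with spark-xml
  df.show()
}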

Is it possible to build Spark code on the fly and execute it?

I am trying to create a generic function to read a CSV file using the Databricks CSV reader. But the options are not mandatory; they can differ based on my input JSON configuration file.
Example 1:
"ReaderOption": {
  "delimiter": ";",
  "header": "true",
  "inferSchema": "true",
  "schema": """some custom schema.."""
},
Example 2:
"ReaderOption": {
  "delimiter": ";",
  "schema": """some custom schema.."""
},
Is it possible to construct the options, or the entire read statement, at runtime and run it in Spark?
Like below:
def readCsvWithOptions(): DataFrame = {
  val options: Map[String, String] = Map("inferSchema" -> "true")
  val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
    .option(options)
    .load(inputPath)
  readDF
}

def readCsvWithOptions(): DataFrame = {
  val options: Map[String, String] = Map("inferSchema" -> "true")
  val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
    .options(options)
    .load(inputPath)
  readDF
}
There is an options method which takes key-value pairs (a Map[String, String]).
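So a minimal sketch of the generic function could simply forward whatever map the configuration yields; here readCsvWithOptions and jobContext mirror the snippets above, the ReaderOption block is assumed to be already parsed into a Map[String, String], and inputPath is a placeholder.
import org.apache.spark.sql.DataFrame

def readCsvWithOptions(inputPath: String, readerOptions: Map[String, String]): DataFrame =
  jobContext.spark.read
    .format("com.databricks.spark.csv")
    .options(readerOptions)   // bulk-apply whatever keys the config supplied
    .load(inputPath)

// e.g. built from "Example 1" above:
val df = readCsvWithOptions(inputPath,
  Map("delimiter" -> ";", "header" -> "true", "inferSchema" -> "true"))
Note that a custom schema is applied via .schema(...) on the reader rather than as a string option.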

Spark: read wholeTextFiles with non-UTF-8 encoding

I want to read whole text files in a non-UTF-8 encoding via
val df = spark.sparkContext.wholeTextFiles(path, 12).toDF
into Spark. How can I change the encoding?
I want to read ISO-8859 encoded text; it is not CSV, it is something similar to XML: SGML.
Edit:
Maybe a custom Hadoop file input format should be used?
https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma
http://henning.kropponline.de/2016/10/23/custom-matlab-inputformat-for-apache-spark/
You can read the files using SparkContext.binaryFiles() instead and build the String for the contents specifying the charset you need. E.g:
val df = spark.sparkContext.binaryFiles(path, 12)
.mapValues(content => new String(content.toArray(), StandardCharsets.ISO_8859_1))
.toDF
It's simple. Here is the source code:
import java.nio.charset.Charset
import org.apache.hadoop.io.{Text, LongWritable}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object TextFile {
  val DEFAULT_CHARSET = Charset.forName("UTF-8")

  def withCharset(context: SparkContext, location: String, charset: String): RDD[String] = {
    if (Charset.forName(charset) == DEFAULT_CHARSET) {
      context.textFile(location)
    } else {
      // can't pass a Charset object here cause its not serializable
      // TODO: maybe use mapPartitions instead?
      context.hadoopFile[LongWritable, Text, TextInputFormat](location).map(
        pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)
      )
    }
  }
}
It is copied from here:
https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala
To use it, see:
https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/util/TextFileSuite.scala
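A short usage sketch of the helper above (the path is a placeholder):
val lines = TextFile.withCharset(spark.sparkContext, "/path/to/sgml-files", "ISO-8859-1")
lines.take(5).foreach(println)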
Edit:
If you need whole-text files, here is the actual source of the implementation:
def wholeTextFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
  assertNotStopped()
  val job = NewHadoopJob.getInstance(hadoopConfiguration)
  // Use setInputPaths so that wholeTextFiles aligns with hadoopFile/textFile in taking
  // comma separated files as input. (see SPARK-7155)
  NewFileInputFormat.setInputPaths(job, path)
  val updateConf = job.getConfiguration
  new WholeTextFileRDD(
    this,
    classOf[WholeTextFileInputFormat],
    classOf[Text],
    classOf[Text],
    updateConf,
    minPartitions).map(record => (record._1.toString, record._2.toString)).setName(path)
}
Try changing:
.map(record => (record._1.toString, record._2.toString))
to (probably):
.map(record => (record._1.toString, new String(record._2.getBytes, 0, record._2.getLength, "myCustomCharset")))

How can I make saveAsTextFile (Spark 1.6) append to an existing file?

In Spark SQL I use DF.write.mode(SaveMode.Append).json(xxxx), but this method produces files whose
names are too complex and random, so I can't use an API to get them. I want to use saveAsTextFile instead, because its file names are not complex and are regular, but I don't know how to append to a file in the same directory. Thanks for your time.
This worked on Spark 1.5; I think this is the right usage:
dataframe.write().mode(SaveMode.Append).format(FILE_FORMAT).partitionBy("parameter1", "parameter2").save(path);
As Spark uses HDFS, this is the typical output it produces. You can use FileUtil to merge the files back into one. It is an efficient solution, as it doesn't require Spark to collect the whole data into a single partition's memory by repartitioning to 1. This is the approach I follow.
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.SaveMode

val hadoopConf = sqlContext.sparkContext.hadoopConfiguration
val hdfs = FileSystem.get(hadoopConf)
val mergedPath = "merged-" + filePath + ".json"
val merged = new Path(mergedPath)
if (hdfs.exists(merged)) {
  hdfs.delete(merged, true)
}
df.write.mode(SaveMode.Append).json(filePath)
FileUtil.copyMerge(hdfs, new Path(filePath), hdfs, merged, false, hadoopConf, null)
You can read the single file using mergedPath location. Hope it helps.
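For example, reading the merged output back (Spark 1.6 SQLContext API, matching the snippet above):
val mergedDf = sqlContext.read.json(mergedPath)
mergedDf.show()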
You can try this method, which I found somewhere:
Process Spark Streaming rdd and store to single HDFS file
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

def saveAsTextFileAndMerge[T](hdfsServer: String, fileName: String, rdd: RDD[T]) = {
  val sourceFile = hdfsServer + "/tmp/"
  rdd.saveAsTextFile(sourceFile)
  val dstPath = hdfsServer + "/final/"
  merge(sourceFile, dstPath, fileName)
}

def merge(srcPath: String, dstPath: String, fileName: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  val destinationPath = new Path(dstPath)
  if (!hdfs.exists(destinationPath)) {
    hdfs.mkdirs(destinationPath)
  }
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath + "/" + fileName), false, hadoopConfig, null)
}
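A hypothetical call of the helper above; the HDFS URI and file name are placeholders. Note that saveAsTextFile will fail if the temporary /tmp/ directory already exists, so it needs to be cleaned up between runs.
saveAsTextFileAndMerge("hdfs://namenode:8020", "result.txt", myRdd)  // myRdd: any RDD[T]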
