I have a long XML string that I am converting to JSON for easier processing in Spark, but I am running into issues with automatic schema inference. What is the efficient way to convert a Dataset xmlStringData into a Dataset with a structure? Should I generate a schema using StructType and read it back as a Dataset of Spark Rows, as shown below:
Dataset<Row> jsonDataset = sparkSession.read().schema(schema).json(xmlStringData);
OR
Dataset<myClass> jsonDataset = xmlStringData.map(
        (MapFunction<Row, myClass>) xmlRow -> new myClass(xmlRow),
        myClassEncode);
What is the difference in later processing between the two routes?
All I need to do later is process the data and save to CSV.
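For reference, the schema in the first option could be built with StructType roughly like the sketch below (the field names are hypothetical placeholders for the real structure):

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical fields standing in for the actual XML/JSON structure.
StructType schema = new StructType()
        .add("id", DataTypes.StringType, true)
        .add("timestamp", DataTypes.TimestampType, true)
        .add("value", DataTypes.DoubleType, true);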
thank you
I am trying to use the CSV file I have downloaded from Google Drive as input, but I don't know how to convert it. The code below is for accessing the drive and downloading the file (autumnTainan1):
download.setOnClickListener {
    download("https://drive.google.com/drive/u/0/folders/1aBy0WtyaETkfD3YxZo73iQaFa2aGzwY8")
}

private fun download(url: String?) {
    val request = DownloadManager.Request(Uri.parse(url))
    request.setNotificationVisibility(DownloadManager.Request.VISIBILITY_VISIBLE_NOTIFY_COMPLETED)
    request.setDestinationInExternalFilesDir(this, Environment.DIRECTORY_DOCUMENTS, "autumnTainan1")
    request.allowScanningByMediaScanner()
    val downloadManager = getSystemService(DOWNLOAD_SERVICE) as DownloadManager
    downloadManager.enqueue(request)
}
This is the TFLite code in Kotlin:
val model = Irrad5autumn.newInstance(context)
// Creates inputs for reference.
val inputFeature0 = TensorBuffer.createFixedSize(intArrayOf(1, 24, 5), DataType.FLOAT32)
inputFeature0.loadBuffer(byteBuffer)
// Runs model inference and gets result.
val outputs = model.process(inputFeature0)
val outputFeature0 = outputs.outputFeature0AsTensorBuffer
// Releases model resources if no longer used.
model.close()
I know that I need to change the inputFeature0 value, but I don't know the process.
Spark version: 2.4.4
When writing a DataFrame, I want to put some metadata on certain fields. It's important for this metadata to be persisted when I write out the DataFrame and read it again later. If I save this DataFrame as Parquet and then read it back, I see the metadata is preserved. But saving as ORC, the metadata is lost when I read the files. Here is a bit of code to show how I'm doing this (in Java):
// set up schema and dataframe
Metadata myMeta = new MetadataBuilder().putString("myMetaData", "foo").build();
StructField field = DataTypes.createStructField("x", DataTypes.IntegerType, true, myMeta);
Dataset<Row> df = sparkSession.createDataFrame(rdd, /* a schema using this field */);
// write it
df.write().format("parquet").save("test");
// read it again
Dataset<Row> df2 = sparkSession.read().format("parquet").load("test");
// check the schema after reading files
df2.schema().prettyJson();
df2.schema().fields()[0].metadata();
Using the Parquet format, the metadata is deserialized as I expect. However, if I change the format to ORC, the metadata comes back as an empty map.
Is this a known bug in the Spark ORC implementation or am I missing something? Thanks.
I have trained a TensorFlow model and I want to apply it to a big dataset in HDFS, about a billion samples. The main point is that I need to write the model's predictions into an HDFS file. However, I can't find the relevant API in TensorFlow for saving data to an HDFS file; I can only find the API for reading from HDFS.
Until now, the way I have done it is to save the trained TF model to a local .pb file and then load that file using the Java API in Spark or MapReduce code. The problem with both Spark and MapReduce is that they run very slowly and fail with memory-exceeded errors.
Here is my demo:
public class TF_model implements Serializable {
    public Session session;

    public TF_model(String model_path) {
        try {
            Graph graph = new Graph();
            InputStream stream = this.getClass().getClassLoader().getResourceAsStream(model_path);
            byte[] graphBytes = IOUtils.toByteArray(stream);
            graph.importGraphDef(graphBytes);
            this.session = new Session(graph);
        } catch (Exception e) {
            System.out.println("failed to load tensorflow model");
        }
    }

    // this is the function to predict a sample from HDFS
    public double[][] predict(int[] token_id_array) {
        Tensor z = session.runner()
                .feed("words_ids_placeholder", Tensor.create(new int[][]{token_id_array}))
                .fetch("softmax_prediction").run().get(0);
        double[][][] softmax_prediction = new double[1][token_id_array.length][2];
        z.copyTo(softmax_prediction);
        return softmax_prediction[0];
    }
}
Below is my Spark code:
val rdd = spark.sparkContext.textFile(file_path)
val predict_result = rdd.mapPartitions { pa =>
  val tf_model = new TF_model("model.pb")
  pa.map { line =>
    val transformed = transform(line) // omitted the transform code
    tf_model.predict(transformed)
  }
}
I also tried TensorFlow deployed on Hadoop, but I can't find a way to write a big dataset to HDFS.
You can read the model file from HDFS once, then use sc.broadcast to ship the byte array of your graph to the partitions, and finally load the graph and predict there. This avoids reading the file from HDFS multiple times.
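To sketch that idea (the HDFS paths here are placeholders, and it assumes a hypothetical TF_model(byte[]) constructor that takes the raw graph bytes instead of a classpath resource):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("tf-batch-predict").getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

// Read the serialized graph from HDFS exactly once, on the driver.
byte[] graphBytes = IOUtils.toByteArray(
        FileSystem.get(jsc.hadoopConfiguration()).open(new Path("/models/model.pb")));

// Ship one copy of the graph bytes to every executor.
Broadcast<byte[]> graphBc = jsc.broadcast(graphBytes);

JavaRDD<String> predictions = jsc.textFile("/input/samples").mapPartitions(lines -> {
    // Build the graph and session once per partition from the broadcast bytes.
    TF_model model = new TF_model(graphBc.value()); // hypothetical byte[] constructor
    List<String> out = new ArrayList<>();
    while (lines.hasNext()) {
        double[][] p = model.predict(transform(lines.next())); // transform() as in the question
        out.add(Arrays.deepToString(p));
    }
    return out.iterator();
});

// Write the predictions back to HDFS as plain text files.
predictions.saveAsTextFile("/output/predictions");

Building the session once per partition (rather than once per record) and writing with saveAsTextFile keeps both the model loading and the output inside Spark, so nothing has to be collected back to the driver.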
I have some Spark code to process a CSV file. It does some transformations on it. I now want to save this RDD as a CSV file and add a header. Each line of this RDD is already formatted correctly.
I am not sure how to do it. I wanted to do a union of the header string and my RDD, but the header string is not an RDD, so it does not work.
You can make an RDD out of your header line and then union it, yes:
val rdd: RDD[String] = ...
val header: RDD[String] = sc.parallelize(Array("my,header,row"))
header.union(rdd).saveAsTextFile(...)
Then you end up with a bunch of part-xxxxx files that you merge.
The problem is that I don't think you're guaranteed that the header will be the first partition and therefore end up in part-00000 and at the top of your file. In practice, I'm pretty sure it will.
More reliable would be to use Hadoop commands like hdfs to merge the part-xxxxx files, and as part of the command, just throw in the header line from a file.
Some help on writing it without union (the header is supplied at the time of the merge):
val fileHeader = "This is header"
val fileHeaderStream: InputStream = new ByteArrayInputStream(fileHeader.getBytes(StandardCharsets.UTF_8))
// out is the output stream of the merged target file, e.g. fs.create(targetPath)
IOUtils.copyBytes(fileHeaderStream, out, conf, false) // write the header first
Now loop over your file parts to write the complete file using:
val in: DataInputStream = ... // data input stream from each part file
IOUtils.copyBytes(in, out, conf, false)
This made sure for me that the header always comes as the first line, even when you use coalesce/repartition for efficient writing.
def addHeaderToRdd(sparkCtx: SparkContext, lines: RDD[String], header: String): RDD[String] = {
  val headerRDD = sparkCtx.parallelize(List((-1L, header))) // index the header with -1 so the sort puts it on top
  val pairRDD = lines.zipWithIndex()
  val pairRDD2 = pairRDD.map(t => (t._2, t._1))
  val allRDD = pairRDD2.union(headerRDD)
  val allSortedRDD = allRDD.sortByKey()
  allSortedRDD.values
}
A slightly different approach with Spark SQL.
From the question: "I now want to save this RDD as a CSV file and add a header. Each line of this RDD is already formatted correctly."
With Spark 2.x you have several options to convert an RDD to a DataFrame:
val rdd = ... // assume rdd is properly formatted with a case class or tuple
val df = spark.createDataFrame(rdd).toDF("col1", "col2", ..., "coln")
df.write
  .format("csv")
  .option("header", "true") // adds header to file
  .save("hdfs://location/to/save/csv")
We can even use the Spark SQL DataFrame API to load, transform, and save the CSV file.
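For instance, a minimal load/transform/save round trip could look like the sketch below (shown with the Java API; the paths, column name, and transformation are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.upper;

Dataset<Row> df = spark.read()
        .option("header", "true")      // treat the first line as the header
        .option("inferSchema", "true")
        .csv("hdfs://location/of/input.csv");

// Placeholder transformation: upper-case one column.
Dataset<Row> transformed = df.withColumn("col1", upper(df.col("col1")));

transformed.write()
        .option("header", "true")      // write the header back out
        .csv("hdfs://location/to/save/csv");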
Another option is to prepend the header as an extra Row (built from the DataFrame's column names) and write the merged RDD out directly:
spark.sparkContext
  .parallelize(Seq(SqlHelper.getARow(temRet.columns, temRet.columns.length)))
  .union(temRet.rdd)
  .map(x => x.mkString("\u0001")) // \u0001 as the field delimiter
  .coalesce(1, true)
  .saveAsTextFile(retPath)

object SqlHelper {
  // create one row
  def getARow(x: Array[String], size: Int): Row = {
    val columnArray = new Array[String](size)
    for (i <- 0 until size) {
      columnArray(i) = x(i)
    }
    Row.fromSeq(columnArray)
  }
}