I am trying to input the csv file I have downloaded from google drive. But I don’t know how to change it. The code below is for accessing the drive and downloading the file (autumnTainan1):
download.setOnClickListener{
download("https://drive.google.com/drive/u/0/folders/1aBy0WtyaETkfD3YxZo73iQaFa2aGzwY8")
}
private fun download(url: String?) {
val request = DownloadManager.Request(Uri.parse(url))
request.setNotificationVisibility(DownloadManager.Request.VISIBILITY_VISIBLE_NOTIFY_COMPLETED)
request.setDestinationInExternalFilesDir(this, Environment.DIRECTORY_DOCUMENTS, "autumnTainan1")
request.allowScanningByMediaScanner()
val downloadManager = getSystemService(DOWNLOAD_SERVICE) as DownloadManager
downloadManager.enqueue(request)
}
This is the tflite code in kotlin:
val model = Irrad5autumn.newInstance(context)
// Creates inputs for reference.
val inputFeature0 = TensorBuffer.createFixedSize(intArrayOf(1, 24, 5), DataType.FLOAT32)
inputFeature0.loadBuffer(byteBuffer)
// Runs model inference and gets result.
val outputs = model.process(inputFeature0)
val outputFeature0 = outputs.outputFeature0AsTensorBuffer
// Releases model resources if no longer used.
model.close()
I know that I need to change the val InputFeature, but I don't know the process.
Related
I have a dataframe that looks a bit like this:
| key 1 | key 2 | key 3 | body |
I want to save this dataframe in 1 json-file per partition, where a partition is a unique combination of keys 1 to 3. I have the following requirements:
The paths of the files should be /key 1/key 2/key 3.json.gz
The files should be compressed
The contents of the files should be values of body (this column contains a json string), one json-string per line.
I've tried multiple things, but no look.
Method 1: Using native dataframe.write
I've tried using the native write method to save the data. Something like this:
df.write
.partitionBy("key 1", "key 2", "key 3") \
.mode('overwrite') \
.format('json') \
.option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
.save(
path=path,
compression="gzip"
)
This solution doesn't store the files in the correct path and with the correct name, but this can be fixed by moving them afterwards. However, the biggest problem is that this is writing the complete dataframe, while I only want to write the values of the body column. But I need the other columns to partition the data.
Method 2: Using the Hadoop filesystem
It's possible to directly call the Hadoop filesystem java library using this: sc._gateway.jvm.org.apache.hadoop.fs.FileSystem. With access to this filesystem it's possible to create files myself, giving me more control over the path, the filename and the contents. However, in order to make this code scale I'm doing this per partition, so:
df.foreachPartition(save_partition)
def save_partition(items):
# Store the items of this partition here
However, I can't get this to work because the save_partition function is executed on the workers, which doesn't have access to the SparkSession and the SparkContext (which is needed to reach the Hadoop Filesystem JVM libraries). I could solve this by pulling all the data to the driver using collect() and save it from there, but that won't scale.
So, quite a story, but I prefer to be complete here. What am I missing? Is it impossible to do what I want, or am I missing something obvious? Or is it difficult? Or maybe it's only possible from Scala/Java? I would love to get some help on this.
It may be slightly tricky to do in pure pyspark. It is not recommended to create too many partitions. From what you have explained I think you are using partition only to get one JSON body per file. You may need a bit of Scala here but your spark job can still remain to be a PySpark Job.
Spark Internally defines DataSources interfaces through which you can define how to read and write data. JSON is one such data source. You can try to extend the default JsonFileFormat class and create your own JsonFileFormatV2. You will also need to define a JsonOutputWriterV2 class extending the default JsonOutputWriter. The output writer has a write function that gives you access to individual rows and paths passed on from the spark program. You can modify the write function to meet your needs.
Here is a sample of how I achieved customizing JSON writes for my use case of writing a fixed number of JSON entries per file. You can use it as a reference for implementing your own JSON writing strategy.
class JsonFileFormatV2 extends JsonFileFormat {
override val shortName: String = "jsonV2"
override def prepareWrite(
sparkSession: SparkSession,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory = {
val conf = job.getConfiguration
val fileLineCount = options.get("filelinecount").map(_.toInt).getOrElse(1)
val parsedOptions = new JSONOptions(
options,
sparkSession.sessionState.conf.sessionLocalTimeZone,
sparkSession.sessionState.conf.columnNameOfCorruptRecord)
parsedOptions.compressionCodec.foreach { codec =>
CompressionCodecs.setCodecConfiguration(conf, codec)
}
new OutputWriterFactory {
override def newInstance(
path: String,
dataSchema: StructType,
context: TaskAttemptContext): OutputWriter = {
new JsonOutputWriterV2(path, parsedOptions, dataSchema, context, fileLineCount)
}
override def getFileExtension(context: TaskAttemptContext): String = {
".json" + CodecStreams.getCompressionExtension(context)
}
}
}
}
private[json] class JsonOutputWriterV2(
path: String,
options: JSONOptions,
dataSchema: StructType,
context: TaskAttemptContext,
maxFileLineCount: Int) extends JsonOutputWriter(
path,
options,
dataSchema,
context) {
private val encoding = options.encoding match {
case Some(charsetName) => Charset.forName(charsetName)
case None => StandardCharsets.UTF_8
}
var recordCounter = 0
var filecounter = 0
private val maxEntriesPerFile = maxFileLineCount
private var writer = CodecStreams.createOutputStreamWriter(
context, new Path(modifiedPath(path)), encoding)
private[this] var gen = new JacksonGenerator(dataSchema, writer, options)
private def modifiedPath(path:String): String = {
val np = s"$path-filecount-$filecounter"
np
}
override def write(row: InternalRow): Unit = {
gen.write(row)
gen.writeLineEnding()
recordCounter += 1
if(recordCounter >= maxEntriesPerFile){
gen.close()
writer.close()
filecounter+=1
recordCounter = 0
writer = CodecStreams.createOutputStreamWriter(
context, new Path(modifiedPath(path)), encoding)
gen = new JacksonGenerator(dataSchema, writer, options)
}
}
override def close(): Unit = {
if(recordCounter<maxEntriesPerFile){
gen.close()
writer.close()
}
}
}
You can add this new custom data source jar to spark classpath and then in your pyspark you can invoke it as follows.
df.write.format("org.apache.spark.sql.execution.datasources.json.JsonFileFormatV2").option("filelinecount","5").mode("overwrite").save("path-to-save")
I am running Pyspark scripts to write a dataframe to a csv in jupyter Notebook as below:
df.coalesce(1).write.csv('Data1.csv',header = 'true')
After an hour of runtime I am getting the below error.
Error: Invalid status code from http://.....session isn't active.
My config is like:
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("shuffle.service.enabled","true")
spark.conf.set("spark.dynamicAllocation.minExecutors",6)
spark.conf.set("spark.executor.heartbeatInterval","3600s")
spark.conf.set("spark.cores.max", "4")
spark.conf.set("spark.sql.tungsten.enabled", "true")
spark.conf.set("spark.eventLog.enabled", "true")
spark.conf.set("spark.app.id", "Logs")
spark.conf.set("spark.io.compression.codec", "snappy")
spark.conf.set("spark.rdd.compress", "true")
spark.conf.set("spark.executor.instances", "6")
spark.conf.set("spark.executor.memory", '20g')
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("spark.driver.allowMultipleContexts", "true")
spark.conf.set("spark.master", "yarn")
spark.conf.set("spark.driver.memory", "20G")
spark.conf.set("spark.executor.instances", "32")
spark.conf.set("spark.executor.memory", "32G")
spark.conf.set("spark.driver.maxResultSize", "40G")
spark.conf.set("spark.executor.cores", "5")
I have checked the container nodes and the error there is:
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed:container_e836_1556653519610_3661867_01_000005 on host: ylpd1205.kmdc.att.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Not able to figure out the issue.
Judging by the output, if your application is not finishing with a FAILED status, that sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), so even despite the Spark app succeeds your notebook will receive this error if the app takes longer than the Livy session's timeout.
If that's the case, here's how to address it:
edit the /etc/livy/conf/livy.conf file (in the cluster's master
node)
set the livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app)
restart Livy to update the setting: sudo restart livy-server in the cluster's master
test your code again
I am not well versed in pyspark but in scala the solution would involve something like this
First we need to create a method for creating a header file:
def createHeaderFile(headerFilePath: String, colNames: Array[String]) {
//format header file path
val fileName = "dfheader.csv"
val headerFileFullName = "%s/%s".format(headerFilePath, fileName)
//write file to hdfs one line after another
val hadoopConfig = new Configuration()
val fileSystem = FileSystem.get(hadoopConfig)
val output = fileSystem.create(new Path(headerFileFullName))
val writer = new PrintWriter(output)
for (h <- colNames) {
writer.write(h + ",")
}
writer.write("\n")
writer.close()
}
You will also need a method for calling hadoop to merge your part files which would be written by df.write method:
def mergeOutputFiles(sourcePaths: String, destLocation: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
// in case of array[String] use for loop to iterate over the muliple source paths if not use the code below
// for (sourcePath <- sourcePaths) {
//Get the path under destination where the partitioned files are temporarily stored
val pathText = sourcePaths.split("/")
val destPath = "%s/%s".format(destLocation, pathText.last)
//Merge files into 1
FileUtil.copyMerge(hdfs, new Path(sourcePath), hdfs, new Path(destPath), true, hadoopConfig, null)
// }
//delete the temp partitioned files post merge complete
val tempfilesPath = "%s%s".format(destLocation, tempOutputFolder)
hdfs.delete(new Path(tempfilesPath), true)
}
Here is a method for generating output files or your df.write method where you are passing your huge DF to be written out to hadoop HDFS:
def generateOutputFiles( processedDf: DataFrame, opPath: String, tempOutputFolder: String,
spark: SparkSession): String = {
import spark.implicits._
val fileName = "%s%sNameofyourCsvFile.csv".format(opPath, tempOutputFolder)
//write as csv to output directory and add file path to array to be sent for merging and create header file
processedDf.write.mode("overwrite").csv(fileName)
createHeaderFile(fileName, processedDf.columns)
//create an array of the partitioned file paths
outputFilePathList = fileName
// you can use array of string or string only depending on if the output needs to get divided in multiple file based on some parameter in that case chagne the signature ot Array[String] as output
// add below code
// outputFilePathList(counter) = fileName
// just use a loop in the above and increment it
//counter += 1
return outputFilePathList
}
With all the methods defined here is how you can implement them:
def processyourlogic( your parameters if any):Dataframe=
{
// your logic to do whatever needs to be done to your data
}
Assuming the above method returns a dataframe, here is how you can put everything together:
val yourbigD f = processyourlogic(your parameters) // returns DF
yourbigDf.cache // caching just in case you need it
val outputPathFinal = " location where you want your file to be saved"
val tempOutputFolderLocation = "temp/"
val partFiles = generateOutputFiles(yourbigDf, outputPathFinal, tempOutputFolderLocation, spark)
mergeOutputFiles(partFiles, outputPathFinal)
Let me know if you have any other question relating to that. If the answer you seek is different then the original question should be asked.
I have a trained a tf model and I want to apply it to big dataset in hdfs which is about billion of samples. The main point is I need to write the prediction of tf model into hdfs file. However I can't find the relative API in tensorflow about how to save data in hdfs file, only find the api about reading hdfs file
Until now the way I did it is to save the trained tf model into pb file in local and then load the pb file using Java api in spark or Mapreduce code. The problem of both spark or mapreduce is the running speed is very slow and failed with exceeds memory error.
Here is my demo:
public class TF_model implements Serializable{
public Session session;
public TF_model(String model_path){
try{
Graph graph = new Graph();
InputStream stream = this.getClass().getClassLoader().getResourceAsStream(model_path);
byte[] graphBytes = IOUtils.toByteArray(stream);
graph.importGraphDef(graphBytes);
this.session = new Session(graph);
}
catch (Exception e){
System.out.println("failed to load tensorflow model");
}
}
// this is the function to predict a sample in hdfs
public int[][] predict(int[] token_id_array){
Tensor z = session.runner()
.feed("words_ids_placeholder", Tensor.create(new int[][]{token_id_array}))
.fetch("softmax_prediction").run().get(0);
double[][][] softmax_prediction = new double[1][token_id_array.length][2];
z.copyTo(softmax_prediction);
return softmax_prediction[0];
}}
below is my spark code:
val rdd = spark.sparkContext.textFile(file_path)
val predct_result= rdd.mapPartitions(pa=>{
val tf_model = new TF_model("model.pb")
pa.map(line=>{
val transformed = transform(line) // omitted the transform code
val rs = tf_model .predict(transformed)
rs
})
})
I also tried tensorflow deployed in hadoop, but can't find a way to write big dataset into HDFS.
You may read model file from hdfs one time, then use sc.broadcast your bytes array of your graph to partitions. Finally, start load graph and predict. Just to avoid read file multiple time from hdfs.
I have an RDD which is to big to collect. I have applied a chain of transformations to the RDD and want to send its transformed data directly from its partitions on my slaves to S3. I am currently operating as follows:
val rdd:RDD = initializeRDD
val rdd2 = rdd.transform
rdd2.first // in order to force calculation of RDD
rdd2.foreachPartition sendDataToS3
Unfortunately, the data that gets sent to S3 is untransformed. The RDD looks exactly like it did in stage initializeRDD.
Here is the body of sendDataToS3:
implicit class WriteableRDD[T](rdd:RDD[T]){
def transform:RDD[String] = rdd map {_.toString}
....
def sendPartitionsToS3(prefix:String) = {
rdd.foreachPartition { p =>
val filename = prefix+new scala.util.Random().nextInt(1000000)
val pw = new PrintWriter(new File(filename))
p foreach pw.println
pw.close
s3.putObject(S3_BUCKET, filename, new File(filename))
}
this
}
}
This is called with rdd.transform.sendPartitionsToS3(prefix).
How do I make sure the data that gets sent in sendDataToS3 is the transformed data?
My guess is there is a bug in your code that is not included in the question.
I'm answering anyway just to make sure you are aware of RDD.saveAsTextFile. You can give it a path on S3 (s3n://bucket/directory) and it will write each partition into that path directly from the executors.
I can hardly imagine when you would need to implement your own sendPartitionsToS3 instead of using saveAsTextFile.
here is my code:
val bg = imageBundleRDD.first() //bg:[Text, BundleWritable]
val res= imageBundleRDD.map(data => {
val desBundle = colorToGray(bg._2) //lineA:NotSerializableException: org.apache.hadoop.io.Text
//val desBundle = colorToGray(data._2) //lineB:everything is ok
(data._1, desBundle)
})
println(res.count)
lineB goes well but lineA shows that:org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.Text
I try to use use Kryo to solve my problem but it seems nothing has been changed:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
class MyRegistrator extends KryoRegistrator {
override def registerClasses(kryo: Kryo) {
kryo.register(classOf[Text])
kryo.register(classOf[BundleWritable])
}
}
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
Thanks!!!
I had a similar problem when my Java code was reading sequence files containing Text keys.
I found this post helpful:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-java-io-NotSerializableException-org-apache-hadoop-io-Text-td2650.html
In my case, I converted Text to a String using map:
JavaPairRDD<String, VideoRecording> mapped = videos.map(new PairFunction<Tuple2<Text,VideoRecording>,String,VideoRecording>() {
#Override
public Tuple2<String, VideoRecording> call(
Tuple2<Text, VideoRecording> kv) throws Exception {
// Necessary to copy value as Hadoop chooses to reuse objects
VideoRecording vr = new VideoRecording(kv._2);
return new Tuple2(kv._1.toString(), vr);
}
});
Be aware of this note in the API for sequenceFile method in JavaSparkContext:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD will create many references to the same object. If you plan to directly cache Hadoop writable objects, you should first copy them using a map function.
In Apache Spark while dealing with Sequence files, we have to follow these techniques:
-- Use Java equivalent Data Types in place of Hadoop data types.
-- Spark Automatically converts the Writables into Java equivalent Types.
Ex:- We have a sequence file "xyz", here key type is say Text and value
is LongWritable. When we use this file to create an RDD, we need use their
java equivalent data types i.e., String and Long respectively.
val mydata = = sc.sequenceFile[String, Long]("path/to/xyz")
mydata.collect
The reason your code has the serialization problem is that your Kryo setup, while close, isn't quite right:
change:
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
to:
val sparkConf = new SparkConf()
// ... set master, appname, etc, then:
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(sparkConf)