Spark Streaming textFileStream not supporting wildcards - apache-spark

I set up a simple test to stream text files from S3 and got it working when I tried something like
val input = ssc.textFileStream("s3n://mybucket/2015/04/03/")
and as log files landed in that bucket, everything worked fine.
But if there was a subfolder, it would not find any files placed in that subfolder (and yes, I am aware that HDFS doesn't actually use a folder structure)
val input = ssc.textFileStream("s3n://mybucket/2015/04/")
So I tried simply using wildcards, as I have done before with a standard Spark application:
val input = ssc.textFileStream("s3n://mybucket/2015/04/*")
But when I try this, it throws an error:
java.io.FileNotFoundException: File s3n://mybucket/2015/04/* does not exist.
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1483)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1523)
at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:176)
at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:134)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
.....
I know for a fact that you can use wildcards when reading file input in a standard Spark application, but it appears that for streaming input it neither accepts wildcards nor automatically processes files in subfolders. Is there something I'm missing here?
Ultimately, what I need is a streaming job running 24/7 that monitors an S3 bucket where logs are placed by date.
So something like
s3n://mybucket/<YEAR>/<MONTH>/<DAY>/<LogfileName>
Is there any way to hand it the topmost folder and have it automatically read files that show up in any subfolder (because obviously the date will increase every day)?
EDIT
So upon digging into the documentation at http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources it states that nested directories are not supported.
Can anyone shed some light as to why this is the case?
Also, since my files will be nested based upon their date, what would be a good way of solving this problem in my streaming application? It's a little complicated since the logs take a few minutes to get written to S3 and so the last file being written for the day could be written in the previous day's folder even though we're a few minutes into the new day.

Some "ugly but working solution" can be created by extending FileInputDStream.
Writing sc.textFileStream(d) is equivalent to
new FileInputDStream[LongWritable, Text, TextInputFormat](streamingContext, d).map(_._2.toString)
You can create CustomFileInputDStream that will extend FileInputDStream. The custom class will copy the compute method from the FileInputDStream class and adjust the findNewFiles method to your needs.
changing findNewFiles method from:
private def findNewFiles(currentTime: Long): Array[String] = {
  try {
    lastNewFileFindingTime = clock.getTimeMillis()
    // Calculate ignore threshold
    val modTimeIgnoreThreshold = math.max(
      initialModTimeIgnoreThreshold,                  // initial threshold based on newFilesOnly setting
      currentTime - durationToRemember.milliseconds   // trailing end of the remember window
    )
    logDebug(s"Getting new files for time $currentTime, " +
      s"ignoring files older than $modTimeIgnoreThreshold")
    val filter = new PathFilter {
      def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    }
    val newFiles = fs.listStatus(directoryPath, filter).map(_.getPath.toString)
    val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
    logInfo("Finding new files took " + timeTaken + " ms")
    logDebug("# cached file times = " + fileToModTime.size)
    if (timeTaken > slideDuration.milliseconds) {
      logWarning(
        "Time taken to find new files exceeds the batch size. " +
        "Consider increasing the batch size or reducing the number of " +
        "files in the monitored directory."
      )
    }
    newFiles
  } catch {
    case e: Exception =>
      logWarning("Error finding new files", e)
      reset()
      Array.empty
  }
}
to:
private def findNewFiles(currentTime: Long): Array[String] = {
  try {
    lastNewFileFindingTime = clock.getTimeMillis()
    // Calculate ignore threshold
    val modTimeIgnoreThreshold = math.max(
      initialModTimeIgnoreThreshold,                  // initial threshold based on newFilesOnly setting
      currentTime - durationToRemember.milliseconds   // trailing end of the remember window
    )
    logDebug(s"Getting new files for time $currentTime, " +
      s"ignoring files older than $modTimeIgnoreThreshold")
    val filter = new PathFilter {
      def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    }
    val directories = fs.listStatus(directoryPath).filter(_.isDirectory)
    val newFiles = ArrayBuffer[FileStatus]()
    directories.foreach(directory => newFiles.append(fs.listStatus(directory.getPath, filter) : _*))
    val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
    logInfo("Finding new files took " + timeTaken + " ms")
    logDebug("# cached file times = " + fileToModTime.size)
    if (timeTaken > slideDuration.milliseconds) {
      logWarning(
        "Time taken to find new files exceeds the batch size. " +
        "Consider increasing the batch size or reducing the number of " +
        "files in the monitored directory."
      )
    }
    newFiles.map(_.getPath.toString).toArray
  } catch {
    case e: Exception =>
      logWarning("Error finding new files", e)
      reset()
      Array.empty
  }
}
will check for files in all first-degree subfolders; you can adjust it to use the batch timestamp in order to access only the relevant "subdirectories" (see the sketch below).
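For example, inside the custom class the list of directories to scan could be derived from the batch timestamp instead of listing every subdirectory. A minimal sketch, assuming the bucket layout from the question (the bucket name, date pattern and helper name are illustrative only); scanning the previous day as well covers the late log files mentioned in the question that land in yesterday's folder a few minutes after midnight:
import java.text.SimpleDateFormat
import java.util.Date

// Sketch: compute the s3n day folders relevant to a given batch time.
// currentTime is the batch timestamp in milliseconds, as passed to findNewFiles.
private def directoriesForBatch(currentTime: Long): Seq[String] = {
  val dayFormat = new SimpleDateFormat("yyyy/MM/dd")
  val oneDayMs = 24L * 60 * 60 * 1000
  // Include the previous day too, since late log files may still be written there.
  Seq(currentTime - oneDayMs, currentTime)
    .map(t => "s3n://mybucket/" + dayFormat.format(new Date(t)))
    .distinct
}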
I created the CustomFileInputDStream as I mentioned and activated it by calling:
new CustomFileInputDStream[LongWritable, Text, TextInputFormat](streamingContext, d).map(_._2.toString)
It seems to behave as expected.
When I write a solution like this, I must add some points for consideration:
You are breaking Spark's encapsulation and creating a custom class that you will have to support on your own as time passes.
I believe a solution like this is a last resort. If your use case can be implemented in a different way, it is usually better to avoid it.
If you have a lot of "subdirectories" on S3 and check each one of them, it will cost you.
It would be very interesting to understand whether Databricks doesn't support nested files just because of the possible performance penalty, or whether there is a deeper reason I haven't thought about.

We had the same problem. We joined the subfolder names with commas.
List<String> paths = new ArrayList<>();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy/MM/dd");
try {
    Date start = sdf.parse("2015/02/01");
    Date end = sdf.parse("2015/04/01");
    Calendar calendar = Calendar.getInstance();
    calendar.setTime(start);
    while (calendar.getTime().before(end)) {
        paths.add("s3n://mybucket/" + sdf.format(calendar.getTime()));
        calendar.add(Calendar.DATE, 1);
    }
} catch (ParseException e) {
    e.printStackTrace();
}
String joinedPaths = StringUtils.join(paths, ",");
JavaDStream<String> input = ssc.textFileStream(joinedPaths);
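Since the question's job is in Scala, the same idea might look roughly like the sketch below. It assumes Java 8's java.time is available and that ssc is the StreamingContext from the question; it relies on the same comma-separated path behaviour as the Java snippet above, and the date range is illustrative only.
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Build "s3n://mybucket/yyyy/MM/dd" paths for each day in the range and join them with commas.
val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")
val start = LocalDate.of(2015, 2, 1) // illustrative range, as in the Java example
val end = LocalDate.of(2015, 4, 1)
val joinedPaths = Iterator.iterate(start)(_.plusDays(1))
  .takeWhile(_.isBefore(end))
  .map(d => s"s3n://mybucket/${d.format(fmt)}")
  .mkString(",")
val input = ssc.textFileStream(joinedPaths)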
I hope that in this way your problem is solved.

Related

Application skipping frames while accessing sound files

I have this for statement triggered by a button that iterates through a MutableList of strings.
For each string it builds a file path and checks whether that path is valid. If it is, the file is passed to the MediaPlayer via a function and played as a sound file. It should play the sound for every file it can find, with a pause at certain stated points (2, 5, 7).
Unfortunately, when I test it, the button animation comes in delayed, followed by a Logcat message about 363 skipped frames and doing too much work on the application's main thread.
I tried commenting out individual lines of the function but was not able to identify the computationally intensive part of it. Could anybody tell me where the issue lies or how I could improve the function?
Here's the function itself:
btnStartReadingAloud.setOnClickListener {
    Toast.makeText(binding.root.context, "Reading the exercise out loud", Toast.LENGTH_SHORT).show()
    println("Reading the exercise out loud")
    for (i in 0 until newExercise.lastIndex) {
        val currentElement: Pair<String, Array<Any>> = newExercise[i]
        val currentDesignatedSoundFile: String = "R.id.trampolin_ansage_malte_" + replaceSpecialChars(currentElement.first)
        val path: Uri = Uri.parse(currentDesignatedSoundFile)
        val file = File(currentDesignatedSoundFile)
        if (doesFileExist(file)) {
            System.out.println("Playing file named$file")
            playSound(path)
            Toast.makeText(binding.root.context, "Playing sound for: %path", Toast.LENGTH_SHORT).show()
        }
        val pauseTimes = listOf<Int>(2, 5, 7)
        if (i in pauseTimes) {
            Thread.sleep(2000)
        }
    }
}
Here is the dedicated function to play the sound
fun playSound(soundFile: Uri) {
    if (mMediaPlayer == null) {
        mMediaPlayer = MediaPlayer.create(requireContext(), soundFile)
        mMediaPlayer!!.isLooping = false
        mMediaPlayer!!.start()
    } else mMediaPlayer!!.start()
}
Any hint is appreciated, thanks already for reading & brainstorming :)

Copy files from gen2 ADLS based on acknowledgement file

I am trying to copy data from gen2 ADLS into another ADLS using a Data Factory pipeline.
This pipeline runs daily and copies data only for that particular day, which is done by providing the start and end time in the copy activity.
Some days the files in the source ADLS are delayed, so the pipeline runs but no data is copied.
In order to track this, we plan to place an acknowledgement file in the source ADLS after the data copy, so that before copying we can check for the ack file and proceed with the data copy only if the ack file is present.
So the check should happen every 10 minutes: if the ack file is not present, the check should run again after 10 minutes, and this should continue for 2 hours.
Within these 2 hours, if the file appears, the data copy should proceed and the check task should also stop.
If there is no data after 2 hours, the job should fail.
I was trying the Validation activity in ADF, but one issue is the folder name, since my folder is named with the date and timestamp of creation (e.g. 2021-03-30-02-19-33).
I have to exclude the timestamp part of the folder while providing the folder name.
How is that possible? Is a wildcard path accepted by the Validation activity?
Any leads on how to implement this?
Is there any way to implement a repeating check every 10 minutes for 2 hours in the Get Metadata task? Can we implement the above scenario with the Get Metadata task?
If we do have a requirement for a wildcard path, we have to write a Scala/Python (or similar) script in a notebook and execute it from ADF.
I have used the Scala script below, which takes input parameters from ADF.
import java.io.File
import java.util.Calendar

dbutils.widgets.text("mainFolderPath", "", "")
dbutils.widgets.text("finalFolderStartName", "", "")
dbutils.widgets.text("fileName", "", "")
dbutils.widgets.text("noOfTry", "1", "")

val mainFolderPath = dbutils.widgets.get("mainFolderPath")
val finalFolderStartName = dbutils.widgets.get("finalFolderStartName")
val fileName = dbutils.widgets.get("fileName")
val noOfTry = (dbutils.widgets.get("noOfTry")).toInt

println("Main folder path : " + mainFolderPath)
println("Final folder start name : " + finalFolderStartName)
println("File name to be checked : " + fileName)
println("Number of tries with a gap of 1 mins : " + noOfTry)

if (mainFolderPath == "" || finalFolderStartName == "" || fileName == "") {
  dbutils.notebook.exit("Please pass input parameters and rerun!")
}

def getListOfSubDirectories(directoryName: String): Array[String] = {
  (new File(directoryName))
    .listFiles
    .filter(_.isDirectory)
    .map(_.getName)
}

var counter = 0
var folderFound = false
var fileFound = false

try {
  while (counter < noOfTry && !fileFound) {
    val folders = getListOfSubDirectories(mainFolderPath)
    if (folders.exists(firstName => firstName.startsWith(finalFolderStartName))) {
      folders.foreach(fol => {
        if (fol.startsWith(finalFolderStartName)) {
          val finalPath = mainFolderPath + "/" + fol + "/" + fileName
          println("Final file path : " + finalPath)
          folderFound = true
          if (new File(finalPath).exists) {
            fileFound = true
          } else {
            println("found the final folder but no file found!")
            println("waiting for 10 mins! " + Calendar.getInstance().getTime())
            counter = counter + 1
            Thread.sleep(1 * 60 * 1000) // note: this sleeps 1 minute; use 10 * 60 * 1000 for a 10-minute gap
          }
        }
      })
    } else {
      println("folder does not exists with name : " + mainFolderPath + "/" + finalFolderStartName + "*")
      println("waiting for 10 mins! " + Calendar.getInstance().getTime())
      counter = counter + 1
      Thread.sleep(1 * 60 * 1000) // note: this sleeps 1 minute; use 10 * 60 * 1000 for a 10-minute gap
    }
  }
} catch {
  case e: Throwable => throw e
}

if (folderFound && fileFound) {
  println("File Exists!")
} else {
  throw new Exception("File does not exists!")
}
As far as I know, the ADF Validation and Get Metadata activities do not support wildcard paths for the folder/file path.
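If the storage account is reachable through the Hadoop FileSystem API from the notebook (for example via a mount or configured credentials), a possible alternative to java.io.File is to glob for the timestamped folder directly. This is a sketch only, reusing the widget values from the script above and leaving out the retry loop:
import org.apache.hadoop.fs.Path

// Look for <mainFolderPath>/<finalFolderStartName>*/<fileName> using a wildcard pattern.
val ackPattern = new Path(mainFolderPath + "/" + finalFolderStartName + "*/" + fileName)
val fs = ackPattern.getFileSystem(spark.sparkContext.hadoopConfiguration)
// globStatus may return null when nothing matches, so wrap it in Option.
val ackFound = Option(fs.globStatus(ackPattern)).exists(_.nonEmpty)
if (!ackFound) {
  throw new Exception("Ack file not found yet for pattern: " + ackPattern)
}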

How to get artifact file's URI using Artifactory's checksum API where multiple artifacts have same SHA-1 / SHA-256 values aka file's content

Artifactory Version: 5.8.4
In Artifactory, files are stored in the internal storage keyed by the file's checksum (SHA-1), and for retrieval purposes SHA-256 is useful (for verifying that a file is intact).
Read this first: https://www.jfrog.com/confluence/display/RTF/Checksum-Based+Storage
Let's say there are 2 Jenkins jobs, which create a few artifacts/files (rpm/jar/etc.). In my case, I'll take a simple .txt file which stores a date in MM/DD/YYYY format and some other jobA/jobB-specific build result files (jars/rpms etc.).
If we focus only on the text file (as I mentioned above), then:
Jenkins_jobA > generates jobA.date_mm_dd_yy.txt
Jenkins_jobB > generates jobB.date_mm_dd_yy.txt
Jenkins jobA and jobB run multiple times per day in no given run order. Sometimes jobA runs first and sometimes jobB.
As the content of the file is mostly the same for both jobs (per day), the SHA-1 value of jobA's .txt file and jobB's .txt file will be the same, i.e. in Artifactory both files will be stored under the same first-two-character directory structure (as per the checksum-based storage mechanism).
Basically, running sha1sum and sha256sum on both files in Linux would return the exact same output.
Over time, these artifacts (.txt, etc.) get promoted from one repository to another (the promotion process, i.e. from snapshot -> stage -> release repo), so my current logic, written in Groovy, to find the URI of the artifact sitting behind a "VIRTUAL" repository (which contains a set of physical local repositories in some order) is listed below:
// Groovy code
import groovy.json.JsonSlurper
import groovy.json.JsonOutput

jsonSlurper = new JsonSlurper()

// The following function will take artifact.SHA_256 as its parameter to find the URI of the artifact
def checkSumBasedSearch(artifactSha) {
    virt_repo = "jar-repo" // this virtual may have many physical repos release/stage/snapshot for jar(maven) or it can be a YUM repo for (rpm) or generic repo for (.txt file)
    // Note: Virtual repos don't span different repo types (i.e. a virtual repository in Artifactory for "Maven" artifacts (jar/war/etc) can NOT see YUM/PyPi/Generic physical repos).
    // Run aqlCmd on Linux, requires "...", "..", "..." for every distinctive word / character in the cmd line.
    checkSum_URL = artifactoryURL + "/api/search/checksum?sha256="
    aqlCmd = ["curl", "-u", username + ":" + password, "${checkSum_URL}" + artifactSha + "&repos=" + virt_repo]

    def procedure = aqlCmd.execute()
    def standardOut = new StringBuilder(), standardErr = new StringBuilder()
    procedure.waitForProcessOutput(standardOut, standardErr)

    // Fail early
    if (standardErr) {
        println "\n\n-- checkSumBasedSearch() - standardErr exists ---\n" + standardErr + "\n\n-- Exiting with error 12!!!\n\n"
        System.exit(12)
    }

    def obj = jsonSlurper.parseText(standardOut.toString())
    def results = obj.results
    def uri = results[0].uri // This would work if a file's SHA-1/256 were always different, or if the file at least sat in a different repo.
    return uri

    // to get the URL, I can use:
    //aqlCmd = ["curl", "-u", username + ":" + password, "${uri}"]
    //def procedure = aqlCmd.execute()
    //def obj = jsonSlurper.parseText(standardOut.toString())
    //def url = obj.downloadUri
    //return url
    //aqlCmd = [ "curl", "-u", username + ":" + password, "${url}", "-o", somedirectory + "/" + variableContainingSomeArtifactFilenameThatIWant ]
    //
    //def procedure = aqlCmd.execute()
    //def standardOut = new StringBuilder(), standardErr = new StringBuilder()
    //procedure.waitForProcessOutput(standardOut, standardErr)
    // Now, I'll get the artifact downloaded in some directory as some filename.
}
My concern is: as both files (even though they have different names, or file-<versioned-timestamp>.txt) have the same content and are generated multiple times per day, how can I get a specific versioned file downloaded for jobA or jobB?
In Artifactory, the SHA-256 property for all files containing the same content will be the same! (Artifactory uses SHA-1 to store these files efficiently and save space; new uploads are just minimal database-level transactions, transparent to the user.)
Questions:
Will the above logic return jobA's .txt file, jobB's .txt file, or whichever job's .txt file was uploaded first or last (i.e. according to LastModified, aka last upload time)?
How can I get jobA's .txt file and jobB's .txt file downloaded for a given timestamp?
Do I need to add more properties to my REST API call?
If I were only concerned with the file content, then it wouldn't matter much (being SHA-1/256 dependent) whether it comes from jobA's .txt or jobB's .txt file, but in a more complex case the file name may contain meaningful information that one would like to know in order to find which file was downloaded (A or B)!
You can use AQL (Artifactory Query Language):
https://www.jfrog.com/confluence/display/RTF/Artifactory+Query+Language
curl -u<username>:<password> -XPOST https://repo.jfrog.io/artifactory/api/search/aql -H "Content-Type: text/plain" -T ./search
The content of the file named search is:
items.find(
    {
        "artifact.module.build.name": {"$eq": "<build name>"},
        "artifact.sha1": "<sha1>"
    }
)
The above logic (in the original question) will return one of them arbitrarily, since you are taking the first result returned and there is no guarantee on the order.
Since your text file contains the timestamp in its name, you can add the name to the AQL given above so that it also filters by name (see the example below).
The AQL search API is more flexible than the checksum search; use it and customise your query according to the parameters you need.
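For example, the search file above could be extended with a name filter; this is a sketch, where <timestamp> stands for the date fragment in your file name and the exact field names should be checked against the AQL domain reference:
items.find(
    {
        "artifact.module.build.name": {"$eq": "<build name>"},
        "artifact.sha1": "<sha1>",
        "name": {"$match": "*<timestamp>*"}
    }
)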
So, I ended up doing this instead of just returning the [0]th element of the array in every case.
// Do NOT return the [0]th element right away, as Artifactory uses SHA-1/256; return the [N]th uri whose artifact full name matches alongside the SHA-256
// def uri = results[0].uri
def nThIndex = 0
def foundFlag = 'false'
for (r in results) {
    println "> " + r.uri + " < " + r.uri.toString() + " artifact: " + artFullName
    if (r.uri.toString().contains(artFullName)) {
        foundFlag = 'true'
        println "- OK - Found artifact: " + artFullName + " at results[" + nThIndex + "] index."
        break; // i.e. a match for the artifact name with the SHA-256 we want has been found
    } else {
        nThIndex++;
    }
}
if (foundFlag == 'true') {
    def uri = results[nThIndex].uri
    return uri
} else {
    // Fail early if results were found for the SHA-256, but only for other filenames with the same SHA-256 and not for this artifact
    if (!standardErr) {
        println "\n\n\n\n-- [Cool] -- checkSum_Search() - SHA-256 unwanted situation occurred !!! -- results array was set with some values BUT it didn't contain the artifact (" + artFullName + ") that we were looking for \n\n\n-- !!! Artifact NOT FOUND in the results array during checkSum_Search() ---\n\n\n-- Exiting with error 17!!!\n\n\n\n"
        System.exit(17) // Nooka
    }
}

How to copy files in Groovy

I need to copy a file in Groovy and saw some ways to achieve it on the web:
1
new AntBuilder().copy( file:"$sourceFile.canonicalPath",
tofile:"$destFile.canonicalPath")
2
command = ["sh", "-c", "cp src/*.txt dst/"]
Runtime.getRuntime().exec((String[]) command.toArray())
3
destination.withDataOutputStream { os ->
    source.withDataInputStream { is ->
        os << is
    }
}
4
import java.nio.file.Files
import java.nio.file.Paths
Files.copy(Paths.get(a), Paths.get(b))
The 4th way seems cleanest to me, as I am not sure how good an idea it is to use AntBuilder and how heavy it is; I saw some people reporting issues when the Groovy version changes.
The 2nd way is OS-dependent, and the 3rd might not be efficient.
Is there something in Groovy to just copy files, like the 4th option, or should I just use Java for it?
If you have Java 7, I would definitely go with
Path source = ...
Path target = ...
Files.copy(source, target)
With the java.nio.file.Path class, it can work with symbolic and hard links. From java.nio.file.Files:
This class consists exclusively of static methods that operate on
files, directories, or other types of files. In most cases, the
methods defined here will delegate to the associated file system
provider to perform the file operations.
Just as references:
Copy files from one folder to another with Groovy
http://groovyconsole.appspot.com/view.groovy?id=8001
My second option would be the ant task with AntBuilder.
If you are doing this in code, just use something like:
new File('copy.bin').bytes = new File('orig.bin').bytes
If this is for build-related code, this would also work, or use the Ant builder.
Note, if you are sure the files are textual you can use .text rather than .bytes.
If it is a text file, I would go with:
def src = new File('src.txt')
def dst = new File('dst.txt')
dst << src.text
I prefer this way:
def file = new File("old.file")
def newFile = new File("new.file")
Files.copy(file.toPath(), newFile.toPath())
To append to existing file :
def src = new File('src.txt')
def dest = new File('dest.txt')
dest << src.text
To overwrite if file exists :
def src = new File('src.txt')
def dest = new File('dest.txt')
dest.write(src.text)
I'm using AntBuilder for such tasks. It's simple, consistent, 'battle-proven' and fun.
The 2nd approach is too OS-specific (Linux-only in your case).
The 3rd is too low-level and eats up more resources. It's useful if you need to transform the file on the way, for example to change the encoding.
The 4th looks overcomplicated to me... the NIO package is relatively new in the JDK.
At the end of the day, I'd go for the 1st option. There you can switch from the copy task to the scp task without redeveloping the script almost from scratch.
This is a way to do it with a platform-independent Groovy script. If anyone has questions, please ask in the comments.
def file = new File("java/jcifs-1.3.18.jar")
this.class.classLoader.rootLoader.addURL(file.toURI().toURL())
def auth_server = Class.forName("jcifs.smb.NtlmPasswordAuthentication").newInstance("domain", "username", "password")
def auth_local = Class.forName("jcifs.smb.NtlmPasswordAuthentication").newInstance(null, "local_username", "local_password")
def source_url = args[0]
def dest_url = args[1]
def auth = auth_server
//prepare source file
if(!source_url.startsWith("\\\\"))
{
source_url = "\\\\localhost\\"+ source_url.substring(0, 1) + "\$" + source_url.substring(1, source_url.length());
auth = auth_local
}
source_url = "smb:"+source_url.replace("\\","/");
println("Copying from Source -> " + source_url);
println("Connecting to Source..");
def source = Class.forName("jcifs.smb.SmbFile").newInstance(source_url,auth)
println(source.canRead());
// Reset the authentication to default
auth = auth_server
//prepare destination file
if(!dest_url.startsWith("\\\\"))
{
dest_url = "\\\\localhost\\"+ dest_url.substring(0, 1) + "\$" +dest_url.substring(2, dest_url.length());
auth = auth_local
}
def dest = null
dest_url = "smb:"+dest_url.replace("\\","/");
println("Copying To Destination-> " + dest_url);
println("Connecting to Destination..");
dest = Class.forName("jcifs.smb.SmbFile").newInstance(dest_url,auth)
println(dest.canWrite());
if (dest.exists()){
println("Destination folder already exists");
}
source.copyTo(dest);
For copying files in Jenkins Groovy
For Linux:
try {
    echo 'Copying the files to the required location'
    sh '''cd /install/opt/
cp /install/opt/ssl.ks /var/local/system/'''
    echo 'File is copied successfully'
}
catch (Exception e) {
    error 'Copying file was unsuccessful'
}
For Windows:
try {
    echo 'Copying the files to the required location'
    bat '''@echo off
copy "C:\\Program Files\\install\\opt\\ssl.ks" "C:\\ProgramData\\install\\opt"'''
    echo 'File is copied successfully'
}
catch (Exception e) {
    error 'Copying file was unsuccessful'
}

JAudioTagger Deleting First Few Seconds of Track

I've written a simple Groovy script (below) to set the values of four of the ID3v1 and ID3v2 tag fields in MP3 files using the JAudioTagger library. The script successfully makes the changes, but it also deletes the first 5 to 10 seconds of some of the files; other files are unaffected. It's not a big problem, but if anyone knows a simple fix, I would be grateful. All the files are from the same source and all have v1 and v2 tags; I can find no obvious difference in the source files to explain it.
import org.jaudiotagger.*

java.util.logging.Logger.getLogger("org.jaudiotagger").setLevel(java.util.logging.Level.OFF)

Integer trackNum = 0
Integer totalFiles = 0
Integer invalidFiles = 0
validMP3File = true

def dir = new File(/D:\Users\Jeremy\Music\Speech Radio\Unlistened\Z Temp Files to MP3 Tagged/)
dir.eachFile({ curFile ->
    totalFiles++
    validMP3File = true // reset per file, otherwise one unreadable file marks every later file as invalid
    try {
        mp3File = org.jaudiotagger.audio.AudioFileIO.read(curFile)
    } catch (org.jaudiotagger.audio.exceptions.CannotReadException e) {
        validMP3File = false
        invalidFiles++
    }
    // Get the file name excluding the extension
    baseFilename = org.jaudiotagger.audio.AudioFile.getBaseFilename(curFile)
    // Check that it is an MP3 file
    if (validMP3File) {
        if (mp3File.getAudioHeader().getEncodingType() != 'mp3') {
            validMP3File = false
            invalidFiles++
        }
    }
    if (validMP3File) {
        trackNum++
        if (mp3File.hasID3v1Tag()) {
            curTagv1 = mp3File.getID3v1Tag()
        } else {
            curTagv1 = new org.jaudiotagger.tag.id3.ID3v1Tag()
        }
        if (mp3File.hasID3v2Tag()) {
            curTagv2 = mp3File.getID3v2TagAsv24()
        } else {
            curTagv2 = new org.jaudiotagger.tag.id3.ID3v23Tag()
        }
        curTagv1.setField(org.jaudiotagger.tag.FieldKey.TITLE, baseFilename)
        curTagv2.setField(org.jaudiotagger.tag.FieldKey.TITLE, baseFilename)
        curTagv1.setField(org.jaudiotagger.tag.FieldKey.ARTIST, "BBC Radio")
        curTagv2.setField(org.jaudiotagger.tag.FieldKey.ARTIST, "BBC Radio")
        curTagv1.setField(org.jaudiotagger.tag.FieldKey.ALBUM, "BBC Radio - 20130205")
        curTagv2.setField(org.jaudiotagger.tag.FieldKey.ALBUM, "BBC Radio - 20130205")
        curTagv1.setField(org.jaudiotagger.tag.FieldKey.TRACK, trackNum.toString())
        curTagv2.setField(org.jaudiotagger.tag.FieldKey.TRACK, trackNum.toString())
        mp3File.setID3v1Tag(curTagv1)
        mp3File.setID3v2Tag(curTagv2)
        mp3File.save()
    }
})
println """$trackNum tracks created from $totalFiles files with $invalidFiles invalid files"""
I'm still investigating and it appears that there is no problem with JAudioTagger. Before setting the tags, I use Total Recorder to reduce the quality of the download from 128kbps, 44,100Hz to 56kbps, 22,050Hz. This reduces the file size to less than half and the quality is fine for speech radio.
If I run my script on the original files, none of the audio track is deleted. The deletion of the first part of the audio track only occurs with the files that have been processed by Total Recorder.
Looking at the JAudioTagger logging for these files, there does appear to be a problem with the header:
Checking further because the ID3 Tag ends at 0x23f9 but the mp3 audio doesnt start until 0x7a77
Confirmed audio starts at 0x7a77 whether searching from start or from end of ID3 tag
This check is not performed for files that have not been processed by Total Recorder.
The log of the header read operation also shows (for a 27 minute track):
trackLength:06:52
It looks as though I shall have to find a new MP3 file editor!
Instead of
mp3File.save()
could you try:
mp3File.commit()
No idea if it will help, but that seems to be the documented method?
