Spring Batch program to create a fixed-size zip file - zip

I have a design issue. I want to create 10 MB zip files using Spring Batch, but I am not sure what value to select for the chunk size, since the chunk size is a predetermined value. Let's say I decide that the chunk size is 100, so I read 100 files and try to create a zip file. But what if the zip file reaches 10 MB after including just 99 files? What will happen to the remaining file?
Regards,
Raj

I had a similar use case. What I did was create a service that builds zip files of a limited size (10 MB). So, for example, if I had 99 files that reached the 10 MB limit, the service would split the zip into two parts: file_to_zip.zip and file_to_zip.z01
// zip4j 2.x imports
import net.lingala.zip4j.ZipFile;
import net.lingala.zip4j.exception.ZipException;
import net.lingala.zip4j.model.ZipParameters;
import net.lingala.zip4j.model.enums.CompressionLevel;
import net.lingala.zip4j.model.enums.CompressionMethod;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public List<File> createSplitFile(List<String> resourcesToZip, String directory, Long maxFileSize, String zipFileName) {
    List<File> splitZipFiles = null;
    try {
        ZipFile zipFile = new ZipFile(directory + zipFileName);
        // Collect only the resources that actually exist on disk
        List<File> filesToAdd = new ArrayList<>();
        for (String resource : resourcesToZip) {
            File file = new File(resource);
            if (file.exists()) {
                filesToAdd.add(file);
            }
        }
        ZipParameters parameters = new ZipParameters();
        parameters.setCompressionMethod(CompressionMethod.DEFLATE);
        parameters.setCompressionLevel(CompressionLevel.NORMAL);
        // zip4j splits the archive into parts no larger than maxFileSize bytes
        zipFile.createSplitZipFile(filesToAdd, parameters, true, maxFileSize);
        splitZipFiles = zipFile.getSplitZipFiles();
    } catch (ZipException e) {
        LOG.error("Exception trying to compress statement file '{}'", e.getMessage());
    }
    return splitZipFiles;
}
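If it helps, here is a minimal sketch of how such a service might be called with the 10 MB cap. The zipService instance and the input paths are assumptions for illustration, not part of the original code:

// Hypothetical caller: pass the 10 MB limit in bytes as maxFileSize.
List<String> paths = Arrays.asList("/data/in/file1.txt", "/data/in/file2.txt");
long tenMb = 10L * 1024 * 1024;
List<File> parts = zipService.createSplitFile(paths, "/data/out/", tenMb, "file_to_zip.zip");
// parts will contain file_to_zip.zip plus file_to_zip.z01, z02, ... if the limit is exceeded.

With this approach the chunk size of the Spring Batch step no longer has to match the 10 MB boundary; the split into .zip/.z01 parts is delegated entirely to the zip library.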

Related

Download Images stored in Azure Blob Storage as Images using C#

I have stored a bunch of images in Azure Blob Storage. Now I want to retrieve them & resize them.
I have successfully managed to read a lot of information from the account, such as the filename, the date last modified, and the size, but how do I get the actual image? The examples I have seen show me how to download it to a file, but that is no use to me; I want to download it as an image so I can process it.
This is what I have so far:
BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient(containerName);
Console.WriteLine("Listing blobs...");

// build table to hold the info
DataTable table = new DataTable();
table.Columns.Add("ID", typeof(int));
table.Columns.Add("blobItemName", typeof(string));
table.Columns.Add("blobItemLastModified", typeof(DateTime));
table.Columns.Add("blobItemSizeKB", typeof(double));
table.Columns.Add("blobImage", typeof(Image));

// row counter for table
int intRowNo = 0;
// divider to convert Bytes to KB
double dblBytesToKB = 1024.00;

// List all blobs in the container
await foreach (BlobItem blobItem in containerClient.GetBlobsAsync())
{
    // increment row number
    intRowNo++;
    //Console.WriteLine("\t" + blobItem.Name);

    // length in bytes
    long? longContentLength = blobItem.Properties.ContentLength;
    double dblKb = 0;
    if (longContentLength.HasValue == true)
    {
        long longContentLengthValue = longContentLength.Value;
        // convert to double DataType
        double dblContentLength = Convert.ToDouble(longContentLengthValue);
        // Convert to KB
        dblKb = dblContentLength / dblBytesToKB;
    }

    // get the image
    // **** Image thisImage = what goes here ?? actual data from blobItem ****

    // Last modified date
    string date = blobItem.Properties.LastModified.ToString();
    try
    {
        DateTime dateTime = DateTime.Parse(date);
        //Console.WriteLine("The specified date is valid: " + dateTime);
        table.Rows.Add(intRowNo, blobItem.Name, dateTime, dblKb);
    }
    catch (FormatException)
    {
        Console.WriteLine("Unable to parse the specified date");
    }
}
You need to open a read stream for your image, and construct your .NET Image from this stream:
await foreach (BlobItem item in containerClient.GetBlobsAsync())
{
    var blobClient = containerClient.GetBlobClient(item.Name);
    using Stream stream = await blobClient.OpenReadAsync();
    Image myImage = Image.FromStream(stream);
    //...
}
The BlobClient class also exposes other helpful methods, such as downloading directly to a stream.

Uploading PDF file to Google Drive folder and overwriting existing file

I am developing an app that uploads PDF files to a specific Google Drive folder. The file name includes the current date. The following code is my DriveServiceHelper class, which is used to create a folder in Google Drive and then upload the PDF files into that folder using its folder ID:
public class DriveServiceHelper {
    Calendar c = Calendar.getInstance();
    Date d = c.getTime();
    SimpleDateFormat df = new SimpleDateFormat("dd-MM-yyyy");
    String currentDate = df.format(d);
    String ps_FolderKey;
    private final Executor mExecutor = Executors.newSingleThreadExecutor();
    private Drive mDriveService;

    public DriveServiceHelper(Drive mDriveService) {
        this.mDriveService = mDriveService;
    }

    public Task<String> createFolder() {
        return Tasks.call(mExecutor, () -> {
            File folderMetadata = new File();
            folderMetadata.setName("Covid Assessment Sheets");
            folderMetadata.setMimeType("application/vnd.google-apps.folder");
            File myFolder = null;
            try {
                myFolder = mDriveService.files().create(folderMetadata)
                        .setFields("id")
                        .execute();
                System.out.println("Folder ID: " + myFolder.getId());
            } catch (Exception e) {
                e.printStackTrace();
            }
            if (myFolder == null) {
                throw new IOException("Null result when requesting file creation");
            }
            ps_FolderKey = myFolder.getId();
            return ps_FolderKey;
        });
    }

    public Task<String> createFilePDF(String filePath, String folderId) {
        return Tasks.call(mExecutor, () -> {
            File fileMetaData = new File();
            fileMetaData.setName("Covid Assessment # " + currentDate);
            fileMetaData.setParents(Collections.singletonList(folderId));
            java.io.File file = new java.io.File(filePath);
            FileContent mediaContent = new FileContent("application/pdf", file);
            File myFile = null;
            try {
                myFile = mDriveService.files().create(fileMetaData, mediaContent).execute();
            } catch (Exception e) {
                e.printStackTrace();
            }
            if (myFile == null) {
                throw new IOException("Null result when requesting file creation");
            }
            return myFile.getId();
        });
    }
}
When uploading the same PDF to the Google Drive folder, I want to overwrite any file with the same name, but instead duplicate files are created in the folder, because the file ID assigned is different even when the file name is the same.
Please help me understand how I should go about this: automatically overwrite/replace a file that already exists with the same name (each file is differentiated by date), and create a new PDF file if it does not yet exist in the folder.
I understand that I might be using the deprecated Drive API, but I was unable to find other solutions online to implement what I need. I also came across solutions that use queries to search for existing Google Drive files, but I am not sure how to make them work for me.
Thank you
Google Drive supports multiple files with the same name.
Thus, by creating a file with an already existing name, you will not automatically overwrite the old file.
Instead you should do the following:
Use the method Files:list with the query name = 'Covid Assessment Sheets' to find the already existing file(s) with the same name
If desired, you can narrow down the results by also specifying the mimeType and the parent folder (parents)
Retrieve the id of the list result(s)
Use the method Files:delete to delete the existing file
Proceed to create a new file as you are already doing
In Java, this would look as follows:
FileList result = DriveService.files().list()
        .setQ("name = 'Covid Assessment Sheets'")
        .setFields("files(id)")
        .execute();
List<File> files = result.getFiles();
for (File file : files) {
    DriveService.files().delete(file.getId()).execute();
}
An alternative approach would be to update the contents of the already existing file instead of creating a new one.
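As a rough sketch of that alternative, assuming the ID of the existing file has already been retrieved with files().list() as shown above (existingFileId is a placeholder name; currentDate and filePath come from the question's code), the Drive v3 Java client can replace the content in place with files().update():

// Hypothetical sketch: keep the same file ID, replace metadata and content.
File updateMetadata = new File();
updateMetadata.setName("Covid Assessment # " + currentDate);
FileContent newContent = new FileContent("application/pdf", new java.io.File(filePath));
File updated = DriveService.files()
        .update(existingFileId, updateMetadata, newContent)
        .execute();
System.out.println("Updated file ID: " + updated.getId());

Updating in place keeps the file ID stable, so any links or permissions on the existing file continue to work.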

Spark lists all leaf node even in partitioned data

I have parquet data partitioned by date & hour, folder structure:
events_v3
  -- event_date=2015-01-01
    -- event_hour=2015-01-1
      -- part10000.parquet.gz
  -- event_date=2015-01-02
    -- event_hour=5
      -- part10000.parquet.gz
I have created a table raw_events via Spark, but when I query it, it scans all the directories for footers, which slows down the initial query even if I am only querying one day's worth of data.
query:
select * from raw_events where event_date='2016-01-01'
A similar problem: http://mail-archives.apache.org/mod_mbox/spark-user/201508.mbox/%3CCAAswR-7Qbd2tdLSsO76zyw9tvs-Njw2YVd36bRfCG3DKZrH0tw#mail.gmail.com%3E (but it's old)
Log:
App > 16/09/15 03:14:03 main INFO HadoopFsRelation: Listing leaf files and directories in parallel under: s3a://bucket/events_v3/
and then it spawns 350 tasks, since there are 350 days' worth of data.
I have disabled schemaMerge and have also specified the schema to read, so it should be able to go straight to the partition I am looking at. Why does it list all the leaf files?
Listing leaf files with 2 executors takes 10 minutes, while the actual query execution takes only 20 seconds.
code sample:
val sparkSession = org.apache.spark.sql.SparkSession.builder.getOrCreate()
val df = sparkSession.read.option("mergeSchema","false").format("parquet").load("s3a://bucket/events_v3")
df.createOrReplaceTempView("temp_events")
sparkSession.sql(
"""
|select verb,count(*) from temp_events where event_date = "2016-01-01" group by verb
""".stripMargin).show()
As soon as Spark is given a directory to read from, it issues a call to listLeafFiles (org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala). This in turn calls fs.listStatus, which makes an API call to get the list of files and directories. The method is then called again for each directory, recursively, until no directories are left. By design this works well on HDFS, but badly on S3, since each list is a remote RPC call. S3, on the other hand, supports fetching all files by prefix, which is exactly what we need.
So, for example, if we had the above directory structure with one year's worth of data, one directory per hour and 10 files per directory, we would make 365 * 24 * 10 = 87,600 API calls. This can be reduced to about 138 API calls, given that there are only ~137,000 files and each S3 list call returns up to 1,000 keys.
Code:
org/apache/hadoop/fs/s3a/S3AFileSystem.java
public FileStatus[] listStatusRecursively(Path f) throws FileNotFoundException,
        IOException {
    String key = pathToKey(f);
    if (LOG.isDebugEnabled()) {
        LOG.debug("List status for path: " + f);
    }

    final List<FileStatus> result = new ArrayList<FileStatus>();
    final FileStatus fileStatus = getFileStatus(f);

    if (fileStatus.isDirectory()) {
        if (!key.isEmpty()) {
            key = key + "/";
        }

        ListObjectsRequest request = new ListObjectsRequest();
        request.setBucketName(bucket);
        request.setPrefix(key);
        request.setMaxKeys(maxKeys);

        if (LOG.isDebugEnabled()) {
            LOG.debug("listStatus: doing listObjects for directory " + key);
        }

        ObjectListing objects = s3.listObjects(request);
        statistics.incrementReadOps(1);

        while (true) {
            for (S3ObjectSummary summary : objects.getObjectSummaries()) {
                Path keyPath = keyToPath(summary.getKey()).makeQualified(uri, workingDir);
                // Skip over keys that are ourselves and old S3N _$folder$ files
                if (keyPath.equals(f) || summary.getKey().endsWith(S3N_FOLDER_SUFFIX)) {
                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Ignoring: " + keyPath);
                    }
                    continue;
                }

                if (objectRepresentsDirectory(summary.getKey(), summary.getSize())) {
                    result.add(new S3AFileStatus(true, true, keyPath));
                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Adding: fd: " + keyPath);
                    }
                } else {
                    result.add(new S3AFileStatus(summary.getSize(),
                            dateToLong(summary.getLastModified()), keyPath,
                            getDefaultBlockSize(f.makeQualified(uri, workingDir))));
                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Adding: fi: " + keyPath);
                    }
                }
            }

            for (String prefix : objects.getCommonPrefixes()) {
                Path keyPath = keyToPath(prefix).makeQualified(uri, workingDir);
                if (keyPath.equals(f)) {
                    continue;
                }
                result.add(new S3AFileStatus(true, false, keyPath));
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Adding: rd: " + keyPath);
                }
            }

            if (objects.isTruncated()) {
                if (LOG.isDebugEnabled()) {
                    LOG.debug("listStatus: list truncated - getting next batch");
                }
                objects = s3.listNextBatchOfObjects(objects);
                statistics.incrementReadOps(1);
            } else {
                break;
            }
        }
    } else {
        if (LOG.isDebugEnabled()) {
            LOG.debug("Adding: rd (not a dir): " + f);
        }
        result.add(fileStatus);
    }

    return result.toArray(new FileStatus[result.size()]);
}
org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala
def listLeafFiles(fs: FileSystem, status: FileStatus, filter: PathFilter): Array[FileStatus] = {
  logTrace(s"Listing ${status.getPath}")
  val name = status.getPath.getName.toLowerCase
  if (shouldFilterOut(name)) {
    Array.empty[FileStatus]
  } else {
    val statuses = {
      val stats = if (fs.isInstanceOf[S3AFileSystem]) {
        logWarning("Using Monkey patched version of list status")
        println("Using Monkey patched version of list status")
        val a = fs.asInstanceOf[S3AFileSystem].listStatusRecursively(status.getPath)
        a
        // Array.empty[FileStatus]
      } else {
        val (dirs, files) = fs.listStatus(status.getPath).partition(_.isDirectory)
        files ++ dirs.flatMap(dir => listLeafFiles(fs, dir, filter))
      }
      if (filter != null) stats.filter(f => filter.accept(f.getPath)) else stats
    }
    // statuses do not have any dirs.
    statuses.filterNot(status => shouldFilterOut(status.getPath.getName)).map {
      case f: LocatedFileStatus => f

      // NOTE:
      //
      // - Although S3/S3A/S3N file system can be quite slow for remote file metadata
      //   operations, calling `getFileBlockLocations` does no harm here since these file system
      //   implementations don't actually issue RPC for this method.
      //
      // - Here we are calling `getFileBlockLocations` in a sequential manner, but it should not
      //   be a big deal since we always use to `listLeafFilesInParallel` when the number of
      //   paths exceeds threshold.
      case f => createLocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen))
    }
  }
}
To clarify Gaurav's answer, that code snippet is from Hadoop branch-2, so it probably won't surface until Hadoop 2.9 (see HADOOP-13208); and someone needs to update Spark to use that feature (which won't harm code using HDFS, it just won't show any speedup there).
One thing to consider is: what makes a good file layout for Object Stores.
Don't have deep directory trees with only a few files per directory
Do have shallow trees with many files
Consider using the first few characters of a file name for the most rapidly changing value (such as day/hour), rather than the last. Why? Some object stores appear to use the leading characters for their hashing, not the trailing ones; if you give your names more uniqueness up front, they get spread out over more servers, with better bandwidth and less risk of throttling. For example, a key like 23-01-2016-events.parquet puts the changing value first, whereas events-2016-01-23.parquet puts it last.
If you are using the Hadoop 2.7 libraries, switch to s3a:// over s3n://. It's already faster, and getting better every week, at least in the ASF source tree.
Finally, Apache Hadoop, Apache Spark and related projects are all open source. Contributions are welcome. That's not just the code, it's documentation, testing, and, for this performance stuff, testing against your actual datasets. Even giving us details about what causes problems (and your dataset layouts) is interesting.

PigStorage to read zipped file in Pig script

I have a program which dumps tab-separated data files, zipped, to S3.
I have a Pig script which loads data from the S3 bucket. I have specified the .zip extension in the file name so that Pig knows which compression is used.
The Pig script runs and dumps data back into S3.
The logs show that it is processing records, but the dumped files are all empty.
Here is an extract from the logs
Input(s):
Successfully read 375 records (435 bytes) from: "s3://<bucket-name>/<job-id>/test-folder/filename1.zip"
Successfully read 444 records (442 bytes) from: "s3://<bucket-name>/<job-id>/test-folder/filename2.zip"
Output(s):
Successfully stored 375 records (1605 bytes) in: "s3://<bucket-name>/<job-id>/test-folder/output/output1-folder"
Successfully stored 444 records (1814 bytes) in: "s3://<bucket-name>/<job-id>/test-folder/output/output2-folder"
Successfully stored 0 records in: "s3://<bucket-name>/<job-id>/test-folder/output/output3-folder"
The code to load and store data is:
data1 = load '$input1'
using PigStorage('\t') as
(field1:long,
field2:long,
field3:double
);
data2 = load '$input2'
using PigStorage('\t') as
(field1:long,
field2:long,
field3:double
);
store output1 into '$output1-folder'
using PigStorage('\t', '-schema');
store output2 into '$output2-folder'
using PigStorage('\t', '-schema');
store output3 into '$output3-folder'
using PigStorage('\t', '-schema');
The code to compress the file:
public static void compressFile(String originalArchive, String zipArchive) throws IOException {
    try (ZipOutputStream archive = new ZipOutputStream(new FileOutputStream(zipArchive));
         FileInputStream file = new FileInputStream(originalArchive)) {
        final int bufferSize = 100 * 1024;
        byte[] buffer = new byte[bufferSize];
        archive.putNextEntry(new ZipEntry(zipArchive));
        int count = 0;
        while ((count = file.read(buffer)) != -1) {
            archive.write(buffer, 0, count);
        }
        archive.closeEntry();
        // try-with-resources closes both streams, so no explicit close() calls are needed
    }
}
Any help is appreciated!
Thanks!

How to open files in JAR file with filename length greater than 255?

I have a JAR file with the following structure:
com
-- pack1
-- A.class
-- pack2
-- AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA.class
When I try to read, extract, or rename pack2/AA...AA.class (which has a 262-byte-long filename), both Linux and Windows say the filename is too long. Renaming inside the JAR file doesn't work either.
Any ideas on how to solve this issue and make the long class file readable?
This page lists the usual limits of file systems: http://en.wikipedia.org/wiki/Comparison_of_file_systems
As you can see in the "Limits" section, almost no file system allows more than 255 characters.
Your only chance is to write a program that extracts the files and shortens file names which are too long. Java at least should be able to open the archive (try jar -tvf to list the content; if that works, truncating should work as well).
java.util.jar can handle it:
try {
    JarFile jarFile = new JarFile("/path/to/target.jar");
    Enumeration<JarEntry> jarEntries = jarFile.entries();
    int i = 0;
    while (jarEntries.hasMoreElements()) {
        JarEntry jarEntry = jarEntries.nextElement();
        System.out.println("processing entry: " + jarEntry.getName());
        InputStream jarFileInputStream = jarFile.getInputStream(jarEntry);
        // give each entry a short temporary name so it fits within file system limits
        OutputStream jarOutputStream = new FileOutputStream(new File("/tmp/test/test" + (i++) + ".class"));
        // copy with a buffer; available() is not a reliable end-of-stream check
        byte[] buffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = jarFileInputStream.read(buffer)) != -1) {
            jarOutputStream.write(buffer, 0, bytesRead);
        }
        jarOutputStream.close();
        jarFileInputStream.close();
    }
} catch (IOException ex) {
    Logger.getLogger(JARExtractor.class.getName()).log(Level.SEVERE, null, ex);
}
The output will be test<n>.class for each class.
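Building on the two answers above, here is a minimal sketch that keeps the original entry names when they are short enough and only shortens the ones that would exceed the file-system limit. The 200-character threshold, output directory, and class name are assumptions for illustration:

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class LongNameExtractor {
    public static void main(String[] args) throws Exception {
        int maxNameLength = 200;           // assumed safe limit, below the usual 255
        File outDir = new File("/tmp/extracted");
        outDir.mkdirs();
        try (JarFile jarFile = new JarFile("/path/to/target.jar")) {
            Enumeration<JarEntry> entries = jarFile.entries();
            int i = 0;
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                // flatten the path and shorten names that would exceed the limit
                String name = entry.getName().replace('/', '_');
                if (name.length() > maxNameLength) {
                    name = "truncated" + (i++) + ".class";
                }
                try (InputStream in = jarFile.getInputStream(entry);
                     OutputStream out = new FileOutputStream(new File(outDir, name))) {
                    byte[] buffer = new byte[8192];
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }
}

This way only the problematic entries lose their original names, and everything else remains identifiable after extraction.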
