PigStorage to read zipped file in Pig script

I have a program that dumps a zipped, tab-separated data file to S3.
I have a Pig script that loads data from the S3 bucket. I have specified the .zip extension in the file name so that Pig knows which compression is used.
The Pig script runs and dumps data back into S3.
The logs show that it is processing records, but the dumped files are all empty.
Here is an extract from the logs:
Successfully read 375 records (435 bytes) from: "s3://<bucket-name>/<job-id>/test-folder/filename1.zip"
Successfully read 444 records (442 bytes) from: "s3://<bucket-name>/<job-id>/test-folder/filename2.zip"
Successfully stored 375 records (1605 bytes) in: "s3://<bucket-name>/<job-id>/test-folder/output/output1-folder"
Successfully stored 444 records (1814 bytes) in: "s3://<bucket-name>/<job-id>/test-folder/output/output2-folder"
Successfully stored 0 records in: "s3://<bucket-name>/<job-id>/test-folder/output/output3-folder"
The code to load and store data is:
data1 = load '$input1'
using PigStorage('\t') as
data2 = load '$input2'
using PigStorage('\t') as
store output1 into '$output1-folder'
using PigStorage('\t', '-schema');
store output2 into '$output2-folder'
using PigStorage('\t', '-schema');
store output3 into '$output3-folder'
using PigStorage('\t', '-schema');
Code to compress the file:
public static void compressFile(String originalArchive, String zipArchive) throws IOException {
    try (
        ZipOutputStream archive = new ZipOutputStream(new FileOutputStream(zipArchive));
        FileInputStream file = new FileInputStream(originalArchive);
    ) {
        final int bufferSize = 100 * 1024;
        byte[] buffer = new byte[bufferSize];
        archive.putNextEntry(new ZipEntry(zipArchive));
        int count = 0;
        while ((count = file.read(buffer)) != -1) {
            archive.write(buffer, 0, count);
        }
    }
}
Any help is appreciated!
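One thing that may be worth ruling out: the Pig documentation describes PigStorage as handling gzip (.gz) and bzip2 (.bz2) files based on the file extension, and ZIP archives are not mentioned there. Assuming the dump format can be changed, here is a minimal sketch of a gzip variant of the compression step. The class name GzipDump and its methods are hypothetical and not part of the original program, and whether this resolves the empty output has not been verified.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of the original program.
final class GzipDump {

    /** Copies everything from in to out, gzip-compressing it as a single stream. */
    static void gzip(InputStream in, OutputStream out) {
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            byte[] buffer = new byte[100 * 1024];
            int count;
            while ((count = in.read(buffer)) != -1) {
                gz.write(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** File-based variant; the target name should end in .gz. */
    static void gzipFile(String originalFile, String gzFile) {
        try (InputStream in = Files.newInputStream(Paths.get(originalFile));
             OutputStream out = Files.newOutputStream(Paths.get(gzFile))) {
            gzip(in, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Usage: java GzipDump.java data.tsv data.tsv.gz
        gzipFile(args[0], args[1]);
    }
}
```

Unlike a ZIP archive, a gzip file has no entry names inside it, so the loader only sees one compressed stream of tab-separated lines.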


Download Images stored in Azure Blob Storage as Images using C#

I have stored a bunch of images in Azure Blob Storage. Now I want to retrieve and resize them.
I have successfully managed to read a lot of information from the account, such as the filename, the last-modified date, and the size, but how do I get the actual image? The examples I have seen show how to download it to a file, but that is no use to me. I want to download it as an image so I can process it.
This is what I have so far:
BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient(containerName);
Console.WriteLine("Listing blobs...");
// build table to hold the info
DataTable table = new DataTable();
table.Columns.Add("ID", typeof(int));
table.Columns.Add("blobItemName", typeof(string));
table.Columns.Add("blobItemLastModified", typeof(DateTime));
table.Columns.Add("blobItemSizeKB", typeof(double));
table.Columns.Add("blobImage", typeof(Image));
// row counter for table
int intRowNo = 0;
// divider to convert Bytes to KB
double dblBytesToKB = 1024.00;
// List all blobs in the container
await foreach (BlobItem blobItem in containerClient.GetBlobsAsync())
{
    // increment row number
    intRowNo++;
    //Console.WriteLine("\t" + blobItem.Name);
    // length in bytes
    long? longContentLength = blobItem.Properties.ContentLength;
    double dblKb = 0;
    if (longContentLength.HasValue == true)
    {
        long longContentLengthValue = longContentLength.Value;
        // convert to double DataType
        double dblContentLength = Convert.ToDouble(longContentLengthValue);
        // Convert to KB
        dblKb = dblContentLength / dblBytesToKB;
    }
    // get the image
    // **** Image thisImage = what goes here ?? actual data from blobItem ****
    // Last modified date
    string date = blobItem.Properties.LastModified.ToString();
    try
    {
        DateTime dateTime = DateTime.Parse(date);
        //Console.WriteLine("The specified date is valid: " + dateTime);
        table.Rows.Add(intRowNo, blobItem.Name, dateTime, dblKb);
    }
    catch (FormatException)
    {
        Console.WriteLine("Unable to parse the specified date");
    }
}
You need to open a read stream for your image, and construct your .NET Image from this stream:
await foreach (BlobItem item in containerClient.GetBlobsAsync())
{
    var blobClient = containerClient.GetBlobClient(item.Name);
    using Stream stream = await blobClient.OpenReadAsync();
    // Image.FromStream needs the stream to stay open while the image is in use,
    // so process (e.g. resize) the image inside this scope
    Image myImage = Image.FromStream(stream);
}
The BlobClient class also exposes other helpful methods, such as downloading to a stream.

Spring Batch program to create the fixed size zip file

I have a design issue. I want to create 10 MB zip files using Spring Batch, but I am not sure what value to pick for the chunk size, since the chunk size is a predetermined value. Let's say I decide on a chunk size of 100, so I read 100 files and try to create a zip file. What if the zip file reaches 10 MB after including just 99 files? What happens to the remaining file?
I had a similar use case. What I did was create a service that produces zip files of a limited size (10 MB). For example, if 99 files reach the 10 MB limit, the service splits the zip into two parts: file_to_zip.zip and file_to_zip.z01
public List<File> createSplitFile(List<String> resourcesToZip, String directory, Long maxFileSize, String zipFileName) {
    List<File> splitZipFiles = null;
    try {
        ZipFile zipFile = new ZipFile(directory + zipFileName);
        Iterator<String> iterator = resourcesToZip.iterator();
        List<File> filesToAdd = new ArrayList<>();
        // collect only the resources that actually exist on disk
        while (iterator.hasNext()) {
            File file = new File(iterator.next());
            if (file.exists()) {
                filesToAdd.add(file);
            }
        }
        ZipParameters parameters = new ZipParameters();
        // split the archive into parts of at most maxFileSize bytes
        zipFile.createSplitZipFile(filesToAdd, parameters, true, maxFileSize);
        splitZipFiles = zipFile.getSplitZipFiles();
    } catch (ZipException e) {
        LOG.error("Exception trying to compress statement file '{}'", e.getMessage());
    }
    return splitZipFiles;
}
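A different way to look at the original worry (a fixed chunk of 100 items overshooting 10 MB) is to group by size instead of by item count. Here is a minimal sketch of that idea in plain Java, independent of Spring Batch and of the zip library used above. It greedily packs file sizes into groups that stay under a limit. It uses the uncompressed size as a rough stand-in for the final zip size, which is an assumption: the real archive size depends on compression and archive overhead. SizeBatcher and groupBySize are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
final class SizeBatcher {

    /**
     * Greedily groups sizes, in order, so that each group's total stays at or
     * below maxBytes. An item larger than maxBytes ends up alone in its own group.
     */
    static List<List<Long>> groupBySize(List<Long> sizes, long maxBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long total = 0;
        for (long size : sizes) {
            // start a new group when adding this item would exceed the limit
            if (!current.isEmpty() && total + size > maxBytes) {
                groups.add(current);
                current = new ArrayList<>();
                total = 0;
            }
            current.add(size);
            total += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long tenMb = 10L * 1024 * 1024;
        List<Long> sizes = Arrays.asList(4_000_000L, 4_000_000L, 4_000_000L, 1_000_000L);
        System.out.println(groupBySize(sizes, tenMb));
    }
}
```

With this kind of grouping no file is left over: whatever does not fit in the current group simply starts the next one.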

Spark - Read compressed files without file extension

I have an S3 bucket that is filled with gzip files that have no file extension. For example, s3://mybucket/1234502827-34231
sc.textFile uses the file extension to select the decoder. I have found many blog posts on handling custom file extensions, but nothing about missing file extensions.
I think the solution may be sc.binaryFiles and unzipping the file manually.
Another possibility is to figure out how sc.textFile finds the file format. I'm not clear on how these classOf[] calls work.
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
Can you try to combine the solution below for ZIP files with a gzipFileInputFormat library?
See: How to open/stream .zip files through Spark?
There you can see how to do it using ZIP:
rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP", ZipFileInputFormat.class, Text.class, Text.class, new Job().getConfiguration());
Some details about newAPIHadoopFile() can be found here:
I found several examples out there that almost fit my needs. Here is the final code I used to parse a file compressed with GZ.
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.spark.input.PortableDataStream
import scala.util.Try
import java.nio.charset._
def extractBSM(ps: PortableDataStream, n: Int = 1024) = Try {
  val gz = new GzipCompressorInputStream(ps.open)
  Stream.continually {
    // Read n bytes
    val buffer = Array.fill[Byte](n)(-1)
    val i = gz.read(buffer, 0, n)
    (i, buffer.take(i))
  }
  // Take as long as we've read something
  .takeWhile(_._1 > 0)
  // Keep only the bytes and concatenate them into one array
  .map(_._2)
  .flatten
  .toArray
}
// Decode the bytes with the requested charset (UTF-8 by default)
def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]) = new String(bytes, charset)
val inputFile = "s3://my-bucket/157c96bd-fb21-4cc7-b340-0bd4b8e2b614"
val rdd = sc.binaryFiles(inputFile).flatMapValues(x => extractBSM(x).toOption).map( x => decode()(x._2) )
val rdd2 = rdd.flatMap { x => x.split("\n") }
You can create your own custom codec for decoding your file. You can start by extending GzipCodec and overriding the getDefaultExtension method so that it returns an empty string as the extension.
EDIT: That solution will not work in all cases because of how CompressionCodecFactory is implemented. For example, the codec for .lz4 is loaded by default. This means that if the name of a file you want to load ends with 4, that codec gets picked instead of the custom one (without an extension). Because that codec does not match the extension, it is later discarded and no codec is used.
package com.customcodec;
import org.apache.hadoop.io.compress.GzipCodec;
public class GzipCodecNoExtension extends GzipCodec {
    @Override
    public String getDefaultExtension() {
        return "";
    }
}
In the Spark app you just register your codec:
SparkConf conf = new SparkConf()
.set("spark.hadoop.io.compression.codecs", "com.customcodec.GzipCodecNoExtension");
You can read the files as binary and do the decompression in a map function.
JavaRDD<Tuple2<String, PortableDataStream>> rawData = spark.sparkContext().binaryFiles(readLocation, 1).toJavaRDD();
JavaRDD<String> decompressedData = rawData.map((Function<Tuple2<String, PortableDataStream>, String>) stringPortableDataStreamTuple2 -> {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GZIPInputStream s = new GZIPInputStream(new ByteArrayInputStream(stringPortableDataStreamTuple2._2.toArray()));
    IOUtils.copy(s, out);
    return new String(out.toByteArray());
});
If the content is JSON, you can read it into a Dataset using:
Dataset co = spark.read().json(decompressedData);
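Since the file name carries no hint here, the manual route can also look at the content itself. Below is a small sketch, outside Spark, assuming the whole file fits in memory (as with the toArray() call above). Gzip streams start with the two bytes 0x1f 0x8b, so that header can be checked before decompressing. GzipSniffer and its methods are hypothetical names, and the gzip() helper exists only to produce sample data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only.
final class GzipSniffer {

    /** Returns true if the data starts with the gzip magic bytes 0x1f 0x8b. */
    static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    /** Decompresses gzip data held in memory and decodes it as UTF-8 text. */
    static String gunzipToString(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Produces gzip-compressed sample data; only needed for the demo. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] sample = gzip("line1\nline2\n");
        if (looksLikeGzip(sample)) {
            System.out.print(gunzipToString(sample));
        }
    }
}
```

Inside a map over binaryFiles, the same two methods could be applied to the byte array of each file, falling back to plain decoding when the header check fails.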

MSG_file_get_data in SimGrid

I open a file with the SimGrid framework:
msg_file_t file = MSG_file_open("/scratch/bin/tesh", NULL);
XBT_INFO("file size is %zd", MSG_file_get_size(file));
It's OK:
[carl:host:(1) 0.000000] [remote_io/INFO] file size is 356434
Then I want to attach some data to this file. First, I define a typedef'd structure:
typedef struct {
char* number_used;
}data, *dataPtr;
Then I set data with MSG_file_set_data to this file:
dataPtr data_1 = xbt_new(data, 1);
data_1->number_used = xbt_strdup("1");
MSG_file_set_data(file, data_1);
But after closing the file and opening it again, I can't get the value of data_1->number_used:
file = MSG_file_open("/scratch/bin/tesh", NULL);
dataPtr data_2 = MSG_file_get_data(file);
XBT_INFO("number used %s", data_2->number_used);
It gives a segmentation fault, and the value of data_2 is NULL. What did I do wrong?
A msg_file_t object only exists between the MSG_file_open and MSG_file_close calls. Calling MSG_file_open again on the same file name creates a new msg_file_t object (a new descriptor). So user data attached to a msg_file_t does not persist across multiple open/close cycles on the same file name.

How to open files in JAR file with filename length greater than 255?

I have a JAR file with following structure:
-- pack1
-- A.class
-- pack2
When I try to read, extract or rename pack2/AA...AA.class (which has a 262-byte-long filename), both Linux and Windows say the filename is too long. Renaming inside the JAR file doesn't work either.
Any ideas how to solve this issue and make the long class file readable?
This page lists the usual limits of file systems: http://en.wikipedia.org/wiki/Comparison_of_file_systems
As you can see in the "Limits" section, almost no file system allows more than 255 characters.
Your only chance is to write a program that extracts the files and shortens file names which are too long. Java at least should be able to open the archive (try jar -tvf to list the content; if that works, truncating should work as well).
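The shortening step could look roughly like the sketch below. It assumes a limit measured in UTF-8 bytes (255 is used in the demo) and a short extension, keeps the extension, and appends a hash of the full name so that two long names with the same prefix are unlikely to collide; the hash is String.hashCode(), so collisions are not impossible. NameShortener and shorten are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class NameShortener {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Returns fileName unchanged if it fits into maxBytes; otherwise truncates
     * the part before the extension and appends "~" plus an 8-digit hex hash.
     * Assumes the extension plus the hash tag is shorter than maxBytes.
     */
    static String shorten(String fileName, int maxBytes) {
        if (utf8Length(fileName) <= maxBytes) {
            return fileName;
        }
        int dot = fileName.lastIndexOf('.');
        String stem = dot > 0 ? fileName.substring(0, dot) : fileName;
        String ext = dot > 0 ? fileName.substring(dot) : "";
        String suffix = String.format("~%08x", fileName.hashCode()) + ext;
        // drop whole code points from the end of the stem until the name fits
        while (!stem.isEmpty() && utf8Length(stem) + utf8Length(suffix) > maxBytes) {
            stem = stem.substring(0, stem.offsetByCodePoints(stem.length(), -1));
        }
        return stem + suffix;
    }

    public static void main(String[] args) {
        StringBuilder longName = new StringBuilder();
        for (int i = 0; i < 256; i++) {
            longName.append('A');
        }
        longName.append(".class"); // 262 bytes, as in the question
        String shortened = shorten(longName.toString(), 255);
        System.out.println(utf8Length(shortened) + " bytes: " + shortened);
    }
}
```

Note that a renamed .class file no longer matches the class name stored inside it, so this only helps with getting the bytes out of the archive for inspection.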
java.util.jar can handle it:
try (JarFile jarFile = new JarFile("/path/to/target.jar")) {
    Enumeration<JarEntry> jarEntries = jarFile.entries();
    int i = 0;
    while (jarEntries.hasMoreElements()) {
        JarEntry jarEntry = jarEntries.nextElement();
        if (jarEntry.isDirectory()) {
            continue; // nothing to copy for directory entries
        }
        System.out.println("processing entry: " + jarEntry.getName());
        // give each extracted file a short temporary name
        File target = new File("/tmp/test/test" + (i++) + ".class");
        try (InputStream jarFileInputStream = jarFile.getInputStream(jarEntry);
             OutputStream jarOutputStream = new FileOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int count;
            // read until end of stream; available() is not a reliable end-of-data check
            while ((count = jarFileInputStream.read(buffer)) != -1) {
                jarOutputStream.write(buffer, 0, count);
            }
        }
    }
} catch (IOException ex) {
    Logger.getLogger(JARExtractor.class.getName()).log(Level.SEVERE, null, ex);
}
The output will be test<n>.class for each class.
