ZIP compressed input for Apache Flink

I need to read and process a specific file within a zip archive in Apache Flink.
In the documentation, I found that
Flink currently supports transparent decompression of input files if these are marked with an appropriate file extension.
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/batch/#read-compressed-files
Is it possible to process it while decompressing on the fly in Apache Flink?

The FileInputFormat will delegate reading of compressed files to GZIPInputStream, which returns partially decompressed data while decompressing.
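For extensions Flink recognizes (e.g. .gz), no extra code is needed at all; here is a minimal sketch of that case, not from the original post, with a placeholder path:
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// transparent decompression: the ".gz" suffix is enough, Flink picks the codec by extension
final DataSet<String> lines = env.readTextFile("hdfs:///data/input.xml.gz"); // placeholder path
lines.print();
Zip archives containing a specific inner file are not covered by this mechanism, which is what the custom InputFormat below works around.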

I want to share the solution I implemented in the meantime.
After creating my own InputFormat, I used the following code within the open() method:
@Override
public void open(final FileInputSplit ignored) throws IOException {
    ...
    final XMLInputFactory xmlif = XMLInputFactory.newInstance();
    final XMLStreamReader xmlr = xmlif.createXMLStreamReader(filePath.toString(),
            InputFormatUtil.readFileWithinZipArchive(filePath, nestedXmlFileName));
    while (xmlr.hasNext()) {
        ...
    }
}
where the implementation of readFileWithinZipArchive(...) is:
public static InputStream readFileWithinZipArchive(final Path zipPath, final String filename) throws IOException {
    // using org.apache.flink.core.fs.Path for getting the InputStream from the (remote) zip archive
    final InputStream zipInputStream = zipPath.getFileSystem().open(zipPath);
    // generating a temporary local copy of the zip file
    final File tmpFile = stream2file(zipInputStream);
    // then using java.util.zip.ZipFile for extracting the InputStream for the specific file within the zip archive
    final ZipFile zipFile = new ZipFile(tmpFile);
    return zipFile.getInputStream(zipFile.getEntry(filename));
}

Related

Multipart returns bytes or InputStream, not sure how to upload it to FTP server

Below is the code where we have the multipart object, which will have either bytes or an input stream:
Map<String, MultipartFile> multipartRequestParams = request.getFileMap();
MultipartFile multipartFile = multipartRequestParams.get("file");
multipartFile.getBytes() (or) multipartFile.getInputStream()
How do I define a gateway for this and send the file?
gateway.upload(multipartFile.getBytes(), multipartFile.getOriginalFilename(), remoteDirectory);
@MessagingGateway
public interface UploadGateway {
    @Gateway(requestChannel = "toSftpChannel")
    void upload(@Payload byte[] file, @Header("filename") String filename, @Header("path") String path);
}
@Bean
@ServiceActivator(inputChannel = "toSftpChannel")
public MessageHandler toHandler() {
    ....
    ....
}
I'm confused about which mechanism to use to send this file to the SFTP server.
multipartFile.getBytes() (or) multipartFile.getInputStream()
That's not quite right: the byte array in this case is really extracted from that InputStream, so according to your description both of them are present at the same time.
The rest of your code is OK. It is fully correct to have a byte array (or the file's InputStream) as the @Payload and other useful data as @Header values.
To transfer that data to SFTP you should use an SftpMessageHandler, which supports both byte[] and InputStream as the payload of the request message from that toSftpChannel. See more info in the docs: https://docs.spring.io/spring-integration/docs/current/reference/html/sftp.html#sftp-outbound.
The filename header can be consumed via the setFileNameGenerator(FileNameGenerator fileNameGenerator) option of that SftpMessageHandler: you just take this header from the message provided to the FileNameGenerator contract.
Something similar can be done for the path header, which essentially should land in the setRemoteDirectoryExpressionString(String remoteDirectoryExpression) option, like this: setRemoteDirectoryExpressionString("headers.path").
All that info is present in the mentioned docs.
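Putting those pieces together, a minimal sketch of the handler bean could look like the following; this is an illustration rather than the poster's actual configuration, and the sftpSessionFactory() bean and the toSftpHandler name are assumptions:
@Bean
@ServiceActivator(inputChannel = "toSftpChannel")
public MessageHandler toSftpHandler() {
    // assumes an sftpSessionFactory() bean defined elsewhere
    SftpMessageHandler handler = new SftpMessageHandler(sftpSessionFactory());
    // remote directory taken from the "path" header
    handler.setRemoteDirectoryExpressionString("headers.path");
    // file name taken from the "filename" header
    handler.setFileNameGenerator(message -> (String) message.getHeaders().get("filename"));
    return handler;
}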

Untar files containing multiple unrelated csv files programmatically with Hadoop

I have several compressed files (.tar.gz) containing unrelated tsv files (something like the list below) in my HDFS. I would like to untar those archives programmatically, potentially leveraging an MPP architecture (e.g. Hadoop or Spark), and save the contents back into HDFS.
- browser.tsv
- connection_type.tsv
- country.tsv
- color_depth.tsv
- javascript_version.tsv
- languages.tsv
- operating_systems.tsv
- plugins.tsv
- referrer_type.tsv
- resolution.tsv
- search_engine.tsv
So far I have only come up with a bash script that downloads each file from HDFS, untars it, and saves the folder back into HDFS. I could even parallelize the script, but I am not happy with that solution either.
Thank you :)
Edit:
It would be interesting to see a solution done with any of the below:
Spark 2.4.5
Hive 2.3.6
Pig 0.17.0
Hadoop 2.8.5
I finally found a solution to my problem, and it consists of a mapper-only Hadoop job. Each mapper gets an uncompressed file from within the tar archive and writes it to a specific path using Hadoop's MultipleOutputs utility.
Furthermore, I implemented a custom non-splittable Hadoop InputFormat, called TarballInputFormat, to handle the tarball extraction.
public class TarballInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit inputSplit,
                                                       TaskAttemptContext taskAttemptContext) {
        TarballRecordReader recordReader = new TarballRecordReader();
        recordReader.initialize(inputSplit, taskAttemptContext);
        return recordReader;
    }
}
The TarballRecordReader handles the extraction of all the files within the original tarball file.
public class TarballRecordReader extends RecordReader<Text, Text> {

    private static final Log log = LogFactory.getLog(TarballRecordReader.class);

    private TarInputStream tarInputStream;
    private Text key;
    private Text value;
    private boolean finished = false;
    private String folderName;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) {
        key = new Text();
        value = new Text();
        try {
            FileSplit split = (FileSplit) inputSplit;
            Configuration conf = taskAttemptContext.getConfiguration();

            Path tarballPath = split.getPath();
            folderName = tarballPath.getName().split("\\.")[0];

            FileSystem fs = tarballPath.getFileSystem(conf);
            FSDataInputStream fsInputStream = fs.open(tarballPath);

            // decompress the .gz layer first, then read the tar stream inside it
            CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
            CompressionCodec codec = compressionCodecs.getCodec(tarballPath);
            tarInputStream = new TarInputStream(codec.createInputStream(fsInputStream));
        }
        catch (IOException ex) {
            log.error(ex.getMessage());
        }
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // skip directory entries; stop once the archive is exhausted
        TarEntry tarEntry = tarInputStream.getNextEntry();
        while (tarEntry != null && tarEntry.isDirectory())
            tarEntry = tarInputStream.getNextEntry();

        finished = tarEntry == null;
        if (finished) {
            return false;
        }

        key.clear();
        value.clear();

        // read the whole entry into memory; the key is "<archive name>/<entry name>"
        long tarSize = tarEntry.getSize();
        int read;
        int offset = 0;
        int bufSize = (int) tarSize;
        byte[] buffer = new byte[bufSize];
        while (offset < bufSize
                && (read = tarInputStream.read(buffer, offset, bufSize - offset)) != -1) {
            offset += read;
        }

        value.set(buffer);
        key.set(folderName + "/" + tarEntry.getName());
        return true;
    }

    @Override
    public Text getCurrentKey() {
        return key;
    }

    @Override
    public Text getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() {
        return finished ? 1 : 0;
    }

    @Override
    public void close() throws IOException {
        if (tarInputStream != null) {
            tarInputStream.close();
        }
    }
}
Each tarball will be extracted keeping the original structure, by writing each file relative to its parent folder. In this solution, we have used the mapper to both read and write the extracted file all at once. This is obviously less performant, but it might be a good trade-off for those who need to save their extracted files in the original form (ordered output). An alternative approach could leverage the reducer to write each extracted file line to the file system, which should increase write throughput at the cost of consistency (unordered file content).
public class ExtractTarball extends Configured implements Tool {

    public static final Log log = LogFactory.getLog(ExtractTarball.class);
    private static final String LOOKUP_OUTPUT = "lookup";

    public static class MapClass extends Mapper<Text, Text, Text, Text> {

        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
            String filename = key.toString();
            int length = value.getBytes().length;
            System.out.printf("%s: %s%n", filename, length);

            mos.write(LOOKUP_OUTPUT, "", value, key.toString());
        }

        public void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "ExtractTarball");
        job.setJarByClass(this.getClass());
        job.setMapperClass(MapClass.class);

        job.setInputFormatClass(TarballInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setNumReduceTasks(0);

        MultipleOutputs.addNamedOutput(job, LOOKUP_OUTPUT, TextOutputFormat.class, Text.class, Text.class);

        log.isDebugEnabled();
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new ExtractTarball(), args);
        System.out.println(exitCode);
        System.exit(exitCode);
    }
}
This is what the output folder looks like:
- output
- lookup_data
- .browser.tsv-m-00000.crc
- .browser_type.tsv-m-00000.crc
- .color_depth.tsv-m-00000.crc
- .column_headers.tsv-m-00000.crc
- .connection_type.tsv-m-00000.crc
- .country.tsv-m-00000.crc
- .event.tsv-m-00000.crc
- .javascript_version.tsv-m-00000.crc
- .languages.tsv-m-00000.crc
- .operating_systems.tsv-m-00000.crc
- .plugins.tsv-m-00000.crc
- .referrer_type.tsv-m-00000.crc
- .resolution.tsv-m-00000.crc
- .search_engines.tsv-m-00000.crc
- browser.tsv-m-00000
- browser_type.tsv-m-00000
- color_depth.tsv-m-00000
- column_headers.tsv-m-00000
- connection_type.tsv-m-00000
- country.tsv-m-00000
- event.tsv-m-00000
- javascript_version.tsv-m-00000
- languages.tsv-m-00000
- operating_systems.tsv-m-00000
- plugins.tsv-m-00000
- referrer_type.tsv-m-00000
- resolution.tsv-m-00000
- search_engines.tsv-m-00000
- ._SUCCESS.crc
- .part-m-00000.crc
- _SUCCESS
- part-m-00000
The only approach I can see is iterating over each file, reading it with Spark for example, and then writing it back to HDFS uncompressed with Spark itself. Something like this (using PySpark):
for p in paths:
    df = spark.read.csv(p, sep=r'\t', header=True)
    df.write.csv(p, sep=r'\t', header=True)
NOTE: I haven't tested this code yet; it's complicated to reproduce with HDFS and tar files, and it might be necessary to add some extra parameters to parse the tar files, but I hope the idea is clear.
IMHO it is not possible to read all those files together in a single iteration because of the different structures they have (and the different data they represent).

wordCounts.dstream().saveAsTextFiles("LOCAL FILE SYSTEM PATH", "txt"); does not write to file

I am trying to write a JavaPairRDD to a file on the local file system. Code below:
JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
        new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });
wordCounts.dstream().saveAsTextFiles("/home/laxmikant/Desktop/teppppp", "txt");
I am trying to save the logs, or the word counts, to a file, but it is not able to save to a local file (NOT HDFS).
I also tried to save on HDFS using
saveAsHadoopFiles("hdfs://10.42.0.1:54310/stream","txt")
The above line does not write to a file either. Can anybody tell me the solution? Various solutions on Stack Overflow don't work.
Try writing the output to an absolute path with the file:// scheme:
saveAsTextFiles("file:///home/laxmikant/Desktop/teppppp", "txt");
rdd.saveAsTextFile("C:/Users/testUser/file.txt")
This will not save the data into file.txt; it will throw a FileAlreadyExists exception, because this method creates its own output files and saves the RDD into them.
Try using the following code to save the RDD to files:
rdd.saveAsTextFile("C:/Users/testUser")
It will create files under the testUser folder and save the RDD's contents into them.
The syntax seems to be correct:
saveAsHadoopFiles("hdfs://10.42.0.1:54310/stream","txt");
but the full syntax is
wordCounts.saveAsHadoopFiles("hdfs://10.42.0.1:54310/stream","txt"); // no dstream()
My guess is that the data is stuck somewhere in a system buffer and is not getting written. If you try to stream a lot more data using "nc", you may see a file with data being created. This is what happened in my case.

SXSSFWorkbook with Custom temp file

SXSSFWorkbook does what I want, but I would like to use a different type of temp file than what is provided and seemingly baked into the implementation.
In SheetDataWriter:
public File createTempFile() throws IOException {
    return TempFile.createTempFile("poi-sxssf-sheet", ".xml");
}
So... I can extend this by making a MySheetDataWriter and overriding createTempFile. However, there is no way for me to use MySheetDataWriter in the SXSSFWorkbook: if I try to extend it, the package-protected method below cannot be overridden, because it is not visible.
From SXSSFWorkbook:
SheetDataWriter createSheetDataWriter() throws IOException {
    if (_compressTmpFiles) {
        return new GZIPSheetDataWriter(_sharedStringSource);
    }
    return new SheetDataWriter(_sharedStringSource);
}
So the bottom line is that I can use the implementation almost exactly as is, but I need a different kind of temp file: not even just a different directory to put it in, but a completely different implementation. Any ideas on how to do this?
Starting with version 3.11, the createTempFile method you mention (from the TempFile class) uses a replaceable TempFileCreationStrategy that can be chosen with the setTempFileCreationStrategy method.
The following example extends the default strategy to log every temp file that is created, but you could change it to return a custom File instance.
TempFile.setTempFileCreationStrategy(new TempFile.DefaultTempFileCreationStrategy() {
    @Override
    public File createTempFile(String prefix, String suffix) throws IOException {
        File f = super.createTempFile(prefix, suffix);
        log.debug("Created temp file: " + f);
        return f;
    }
});

I am unable to fetch Excel data into Selenium code on Ubuntu OS

public class ReadAndWrite {

    public static void main(String[] args) throws InterruptedException, BiffException, IOException {
        System.out.println("hello");
        ReadAndWrite.login();
    }

    public static void login() throws BiffException, IOException, InterruptedException {
        WebDriver driver = new FirefoxDriver();
        driver.get("URL");
        System.out.println("hello");

        FileInputStream fi = new FileInputStream("/home/sagarpatra/Desktop/Xpath.ods");
        System.out.println("hiiiiiii");
        Workbook w = Workbook.getWorkbook(fi);
        Sheet sh = w.getSheet(1);
        // or w.getSheet(Sheetnumber)
        // String variable1 = s.getCell(column, row).getContents();

        for (int row = 1; row <= sh.getRows(); row++) {
            String username = sh.getCell(0, row).getContents();
            System.out.println("Username " + username);
            driver.get("URL");
            driver.findElement(By.name("Email")).sendKeys(username);

            String password = sh.getCell(1, row).getContents();
            System.out.println("Password " + password);
            driver.findElement(By.name("Passwd")).sendKeys(password);

            Thread.sleep(10000);
            driver.findElement(By.name("Login")).click();
            System.out.println("Waiting for page to load fully...");
            Thread.sleep(30000);
        }
        driver.quit();
    }
}
I don't know what is wrong with my code, or how to fix it. It outputs the following error:
Exception in thread "main" jxl.read.biff.BiffException: Unable to recognize OLE stream
at jxl.read.biff.CompoundFile.<init>(CompoundFile.java:116)
at jxl.read.biff.File.<init>(File.java:127)
at jxl.Workbook.getWorkbook(Workbook.java:221)
at jxl.Workbook.getWorkbook(Workbook.java:198)
at test.ReadTest.main(ReadTest.java:19)
I would try using Apache MetaModel instead. I have had better luck with that than with JXL. Here is an example project I wrote that reads from a .XLSX file. I use this library to run tests on a Linux Jenkins server from .XLS files generated on MS Windows.
It should also be noted that this library is perfect for making a parameterized DataProvider that queries a database over JDBC.
Using JXL, you limit yourself to one data type, either .XLS or .CSV. I believe MetaModel actually uses JXL under the hood and wraps it to make it easier to use, so it would also handle OpenOffice documents in the same fashion and suffer the same file compatibility issues.
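For reference, here is a rough sketch from memory of what reading a sheet with MetaModel could look like; the file path, sheet name, and column positions are placeholders, and the factory and query method names should be checked against the MetaModel documentation (the metamodel-excel module is assumed to be on the classpath):
// hypothetical sketch, not the linked example project
DataContext dataContext = DataContextFactory.createExcelDataContext(new File("/path/to/credentials.xlsx"));
try (DataSet dataSet = dataContext.query().from("Sheet1").selectAll().execute()) {
    while (dataSet.next()) {
        Row row = dataSet.getRow();
        String username = String.valueOf(row.getValue(0)); // placeholder column positions
        String password = String.valueOf(row.getValue(1));
        // feed username/password into the Selenium steps as before
    }
}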
