Apache Pig: Load a file that shows fine using hadoop fs -text

I have files that are named part-r-000[0-9][0-9] and that contain tab separated fields. I can view them using hadoop fs -text part-r-00000 but can't get them loaded using pig.
What I've tried:
x = load 'part-r-00000';
dump x;
x = load 'part-r-00000' using TextLoader();
dump x;
but that only gives me garbage. How can I view the file using pig?
What might be relevant is that my HDFS is still running CDH2 at the moment.
Furthermore, if I download the file locally and run file part-r-00000, it reports part-r-00000: data, and I don't know how to decompress it locally.

According to the HDFS documentation, hadoop fs -text <file> can be used on "zip and TextRecordInputStream" data, so your data may be in one of these formats.
If the file was compressed, Hadoop would normally add the extension when writing it out to HDFS, but if this is missing you could test by trying to unzip/gunzip/bunzip2 it locally. Pig should do this decompression automatically, but it may require the file extension to be present (e.g. part-r-00000.zip) -- more info.
I'm not too sure about TextRecordInputStream; it sounds like it would just be Pig's default, but I could be wrong. I didn't see any mention of LOADing this kind of data via Pig in a quick Google search.
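One quick way to check whether the file is actually a Hadoop SequenceFile is to look at its first bytes, which are the ASCII characters 'SEQ' followed by a version byte (e.g. hadoop fs -cat part-r-00000 | head -c 3 from the shell). Below is a minimal Scala sketch of the same check using the Hadoop FileSystem API; the path is the one from the question, and it assumes your Hadoop configuration is on the classpath:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CheckSequenceFileHeader {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())   // picks up core-site.xml / hdfs-site.xml
    val in = fs.open(new Path("part-r-00000"))     // path from the question
    val magic = new Array[Byte](3)
    in.readFully(magic)                            // SequenceFiles start with 'S', 'E', 'Q'
    in.close()
    if (magic.sameElements("SEQ".getBytes("US-ASCII")))
      println("Looks like a Hadoop SequenceFile")
    else
      println("Not a SequenceFile header: " + magic.map("%02x".format(_)).mkString(" "))
  }
}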
Update:
Since you've discovered it is a sequence file, here's how you can load it using PiggyBank:
-- using Cloudera directory structure:
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar
--REGISTER /home/hadoop/lib/pig/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
-- Sample job: grab counts of tweets by day
-- Hadoop's glob syntax supports * and {a,b} alternation but not {00..99} ranges, so use a wildcard
A = LOAD 'mydir/part-r-*'
    USING SequenceFileLoader AS (key:long, val:long); -- extend the schema as needed
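If you just want to eyeball the contents outside Pig, the same sequence file can also be read from the Spark shell. A sketch, assuming sc is the shell's SparkContext and that the key and value are LongWritables as in the Pig schema above (adjust the Writable classes to whatever your job actually wrote):

import org.apache.hadoop.io.LongWritable

// Read the sequence file as (LongWritable, LongWritable) pairs and print a few records;
// mapping to primitives first avoids Hadoop's Writable-object reuse problem
val pairs = sc.sequenceFile("mydir/part-r-00000", classOf[LongWritable], classOf[LongWritable])
pairs.map { case (k, v) => (k.get, v.get) }.take(5).foreach(println)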

If you want to manipulate (read/write) sequence files with Pig, you can also give Twitter's Elephant-Bird a try.
You can find examples here of how to read and write them.
If you use custom Writables in your sequence files, you can implement a custom converter by extending AbstractWritableConverter.
Note that Elephant-Bird needs Thrift installed on your machine.
Before building it, make sure it is using the same Thrift version you have installed, and provide the correct path to the Thrift executable in its pom.xml:
<plugin>
  <groupId>org.apache.thrift.tools</groupId>
  <artifactId>maven-thrift-plugin</artifactId>
  <version>0.1.10</version>
  <configuration>
    <thriftExecutable>/path_to_thrift/thrift</thriftExecutable>
  </configuration>
</plugin>

Related

How to reference the most current Physical Sequential (PS) file in JCL

I want to create a job where I need to use the latest available file as the input file.
The file name format is as below: FILE1.TEST.TYYMMDD
Is there any way to identify the latest file, based on the date present in the file name, via JCL?
P.S. GDG versions are not created in the existing process; only a PS file is created.
Thank you
I wanted to create a job where I need to consider the latest file available as input file. File [name] format is as below: FILE1.TEST.TYYMMDD is there any way to identify latest file based on date present in file name via JCL.
No.
You indicate that GDGs are not created in the existing process. GDGs would be the best way to accomplish your goal. Absent GDGs, you must write code.
You could accomplish your goal by writing (C, clist, COBOL, PL/I, Rexx) code using the LMDINIT and LMDLIST ISPF services. Then you would execute your code by running ISPF in batch. Many mainframe shops have a cataloged procedure to execute ISPF in batch.
Agree with @cschneid that there is no built-in platform way to handle this. However, I want to point out that GDGs are the platform's way of managing PS files for access in a relative form.
Your comment:
GDG versions are not created in existing process. Only PS file is created.
That statement didn't make sense to me. GDGs are not a file type like physical sequential (PS) or partitioned (PO); they are a convention that allows relative references to files created over time, which sounds like what you want. I've only seen GDGs used for PS files.
Putting the date in the file name can have its uses, but to z/OS it's only part of the file name, not metadata that the system operates on (like the G0000V00 generation numbers in GDGs).

Hdfs file access in spark

I am developing an application where I read a file from Hadoop, process it, and store the data back to Hadoop.
I am confused about what the proper HDFS file path format should be. When reading an HDFS file from the Spark shell like
val file=sc.textFile("hdfs:///datastore/events.txt")
it works fine and I am able to read it.
But when I submit the jar containing the same code to YARN, I get the error
org.apache.hadoop.HadoopIllegalArgumentException: Uri without authority: hdfs:/datastore/events.txt
When I add the name node address, as in hdfs://namenodeserver/datastore/events.txt, everything works.
I am a bit confused about this behaviour and would appreciate some guidance.
Note: I am using an AWS EMR setup and all the configurations are default.
If you want to use sc.textFile("hdfs://...") you need to give the full (absolute) path; in your example that would be "nn1home:8020/..".
If you want to make it simple, then just use sc.textFile("hdfs:/input/war-and-peace.txt") -- that's only one /.
I think it will work.
Problem solved. As I debugged further, the fs.defaultFS property from core-site.xml was not being used when I just passed the path as hdfs:///path/to/file, even though all the Hadoop config properties were loaded (I logged the sparkContext.hadoopConfiguration object).
As a workaround I manually read the property with sparkContext.hadoopConfiguration().get("fs.defaultFS") and prepended it to the path.
I don't know whether this is the correct way of doing it.
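For illustration, a minimal Scala sketch of that workaround; the fs.defaultFS property is the one mentioned above, while the app name, variable names and the /datastore/events.txt path are just taken from the question for the example:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("defaultFsExample"))

// Read the configured default filesystem, e.g. "hdfs://namenodeserver:8020"
val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS")

// Prepend it to the cluster-relative path so the URI has an authority
val fullPath = defaultFs.stripSuffix("/") + "/datastore/events.txt"
val events = sc.textFile(fullPath)
println(events.count())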

Junk Spark output file on S3 with dollar signs

I have a simple Spark job that reads a file from S3, takes five records and writes them back to S3.
What I see is that there is always an additional file in S3, next to my output "directory", which is called output_$folder$.
What is it? How I can prevent spark from creating it?
Here is some code to show what I am doing...
x = spark.sparkContext.textFile("s3n://.../0000_part_00")
five = x.take(5)
five = spark.sparkContext.parallelize(five)
five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions/output/")
After the job I have an S3 "directory" called output which contains the results, and another S3 object called output_$folder$ whose purpose I don't know.
Changing S3 paths in the application from s3:// to s3a:// seems to have done the trick for me. The $folder$ files are no longer getting created since I started using s3a://.
Ok, it seems I found out what it is.
It is some kind of marker file, probably used for determining if the S3 directory object exists or not.
How did I reach this conclusion? First, I found this link that shows the source of the
org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir
method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html
Then I searched other source repositories to see if I would find a different version of the method. I didn't.
In the end, I ran an experiment: I reran the same Spark job after removing the S3 output directory object but leaving the output_$folder$ file. The job failed, saying that the output directory already exists.
My conclusion: this is Hadoop's way of knowing whether a directory with a given name exists in S3, and I will have to live with that.
All of the above happens when I run the job from my local dev machine, i.e. my laptop. If I run the same job from an AWS Data Pipeline, output_$folder$ does not get created.
s3n:// and s3a:// don't generate a marker file like <output>_$folder$.
If you are using Hadoop with AWS EMR, I found moving from s3 to s3n straightforward since they both use the same file system implementation, whereas s3a involves AWS-credential-related code changes.
('fs.s3.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3n.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
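For reference, a quick way to see which file system implementation each scheme maps to on your own cluster is to print the corresponding Hadoop properties, as in this small Scala sketch (it assumes an existing SparkSession named spark, e.g. in spark-shell):

// Look up the FileSystem implementation registered for each S3 scheme, if any
val hadoopConf = spark.sparkContext.hadoopConfiguration
Seq("fs.s3.impl", "fs.s3n.impl", "fs.s3a.impl").foreach { key =>
  println(s"$key -> ${Option(hadoopConf.get(key)).getOrElse("<not set>")}")
}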

reading the running config file from a network device

Is there any way to read the running configuration from a network device (Cisco IOS / Juniper Junos) in a properly formatted type, say for example as an XML file?
Basically I need to get all the attributes and their values from a config file. I am using "expect" to read the config file, and I would have to write a parser to extract the attributes from it.
I was wondering if there is already an implementation of this that I can reuse.
Is there any SDK that can be used to parse the config file, or even better, directly interact with the device and get the data in a standard format?
Kindly guide.
Thanks
Sunil
For Juniper in configuration mode:
show | display xml
For Cisco IOS I've never done this myself, but you can try to use ODMSpec:
http://www.cisco.com/en/US/docs/ios-xml/ios/xmlpi/command/xmlpi-cr-book.pdf
http://www.cisco.com/en/US/docs/net_mgmt/enhanced_device_interface/2.2/developer/guide/progodm.html
I'm not sure that it works with running-config.
On IOS devices, it is
show run | format
This gives the result in XML format.
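Once you have the XML output (for example the result of 'show | display xml' saved to a file), you can walk it with any XML library rather than writing a parser by hand. A minimal Scala sketch, assuming a hypothetical saved file config.xml and the scala-xml module on the classpath:

import scala.xml.{Elem, XML}

// Load the device configuration previously captured to a local file
val config = XML.loadFile("config.xml")

// Dump every leaf element (no nested elements) as "label = value"
(config \\ "_").foreach { node =>
  val text = node.text.trim
  if (!node.child.exists(_.isInstanceOf[Elem]) && text.nonEmpty)
    println(s"${node.label} = $text")
}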

donut.csv example

I can't find a way to configure Mahout correctly. This is what happens when I try to run the "donut.csv" example from the "Mahout in Action" book:
Running on hadoop, using /home/myname/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/myname/mahout/mahout-examples-0.7-job.jar
Not a valid JAR: C:\home\myname\mahout\mahout-examples-0.7-job.jar
Where do I have to change the parameters?
This is coming from the Hadoop binary and not from Mahout. The source file is RunJar.java, where it tries to validate the existence of mahout-examples-0.7-job.jar and fails. Assuming that you are running it from Cygwin, the issue here is that the JAR path is being resolved as C:\home\... (as opposed to /home/...).
