Text Reader Classes in Hadoop - io

I have a directory OUTPUT containing the output files from a MapReduce job. The output files are text files written with a TextOutputFormat.
Now I want to read the key/value pairs from the output files. How can I do so using existing classes in Hadoop? One way I could do it is as follows:
FileSystem fs = FileSystem.get(conf);
FileStatus[] files = fs.globStatus(new Path(OUTPUT + "/part-*"));
for (FileStatus file : files) {
    if (file.getLen() > 0) {
        FSDataInputStream in = fs.open(file.getPath());
        BufferedReader bin = new BufferedReader(new InputStreamReader(in));
        String s = bin.readLine();
        while (s != null) {
            System.out.println(s);
            s = bin.readLine();
        }
        in.close();
    }
}
This approach would work but adds considerably to my task, as I now need to manually parse the key/value pairs out of each individual line. I am looking for something handier that reads the key and value directly into variables.

Are you forced to use TextOutputFormat as your output format in the previous job?
If not, consider using SequenceFileOutputFormat; then you can use a SequenceFile.Reader to read the file back in key/value pairs. You can also still 'view' the file using hadoop fs -text path/to/output/part-r-00000
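For illustration, a minimal sketch of reading the SequenceFile back, using ReflectionUtils so it works with whatever key/value classes the job emitted (the OUTPUT path is from the question; error handling omitted):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path(OUTPUT + "/part-r-00000");

SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
try {
    // instantiate the key/value types recorded in the file header
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
    }
} finally {
    reader.close();
}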
EDIT: You can also use the KeyValueLineRecordReader class; you'll just need to pass a FileSplit to the constructor.
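For example, with the old mapred API (a hedged sketch, reusing fs and conf from the question; it assumes the default tab separator between key and value on each line):

Path file = new Path(OUTPUT + "/part-00000");
FileSplit split = new FileSplit(file, 0, fs.getFileStatus(file).getLen(), (String[]) null);

KeyValueLineRecordReader reader = new KeyValueLineRecordReader(conf, split);
Text key = reader.createKey();     // this reader yields Text on both sides
Text value = reader.createValue();
while (reader.next(key, value)) {
    System.out.println(key + " => " + value);
}
reader.close();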

Related

File reading in Kotlin without piggybacking off a button

I am coding a data class that needs to read a CSV file to grab some information stored in the file. However, every way that I have tried to read the file has not worked.
Here is what I have tried so far:
data class Bird(val birdNumIn: Int) {
    private var birdNum = birdNumIn

    /**
     * A function that searches the bird-data.csv file, which is a list of birds, to find the bird
     * that was passed into the class constructor, and then adds the values of the bird to the
     * private variables.
     */
    fun updateValues() {
        var birdNumber = birdNum
        var birdInfo: MutableList<String> = mutableListOf()
        val minput = InputStreamReader(assets().open("bird-data.csv"), "UTF-8")
        val reader = BufferedReader(minput)
    }
}
However, the assets().open() does not work. It returns an error about trying to open a file that does not exist, but the file is in the assets folder, and the filename is spelt right.
I have tried many other methods of reading files, like using java.io.File with the path of the file.
If you would like to look at our whole project, please feel free to go to our GitHub.
What's the assets() function you're calling? This is just a bare data class; it has no connection to the Android environment it's running in, so unless you've injected an AssetManager instance (or a Context to pull it from) into your class, you can't access the app's assets.
You probably need to do this:
fun updateValues(context: Context) {
    val inputStream = context.assets.open("bird-data.csv")
    val minput = InputStreamReader(inputStream, "UTF-8")
    ...
}
which requires your caller to have access to a Context.
Honestly, from a quick look at your class, you might want to rework this. Instead of having a bunch of empty fields in your data class (which aren't part of the "data", by the way; only stuff in the constructor parameters is) and then having the data class update them later by doing some I/O, you might want to keep Bird as a basic store of data and create instances as you read from your assets file.
So something like:
// all fixed values, initialised during construction
// Also you won't need to override toString now (unless you want to)
data class Bird(
    val birdNum: Int,
    val nameOfBird: String,
    val birdFilePic: String,
    val birdFileSong: String,
    val alternativeName: String,
    val birdInfoFile: String
) { ... }
Then somewhere else
fun getBirbs(context: Context): List<Bird> {
    // open the CSV and read the lines
    val lines = context.assets.open("bird-data.csv").bufferedReader().readLines()
    // parse the data for each bird, use it to construct a Bird object
    // (assumes one comma-separated bird per line, fields in constructor order)
    return lines.map { line ->
        val f = line.split(",")
        Bird(f[0].toInt(), f[1], f[2], f[3], f[4], f[5])
    }
}
or whatever you need to do, e.g. loading certain birds by ID.
That way your Bird class is just data plus some functions/properties that work with it; it doesn't need a Context because it's not doing any I/O. Something else (that does have access to a Context) is responsible for loading your data and turning it into objects, i.e. deserialising it. And as soon as a Bird is created, it's ready, initialised and immutable; you don't have to call an update function to get it actually initialised.
And if you ever wanted to do that a different way (e.g. loading a file from the internet), the data class wouldn't need to change, just the thing that does the loading. You could even have different loading classes: one that loads local data, one that fetches from the internet. The point is the separation of concerns, which is only possible because that functionality isn't baked into a class that's really about something else.
Up to you, but just a thought! Especially if passing the context in like I suggested is a problem; that's a sign your design might need tweaking.

Can I append Avro serialized data to an existing Azure blob?

I am asking if I can, but I would also like to know if I should.
Here's my scenario:
I am receiving Avro serialized messages in small batches. I want to store them for later analysis using a Hive table with the Avro SerDe. I'm running in Azure, and I am storing the messages in a blob.
I am trying to avoid having lots of small blobs (because I believe this will have a negative impact on Hive). If the Avro header is already written to the blob, I believe that I can append Avro data blocks with CloudBlockBlob.PutBlockAsync() (as long as I know the sync marker).
However, I've examined two .NET libraries, and neither seems to support my approach (each requires writing the entire Avro container file at once):
http://www.nuget.org/packages/Apache.Avro/
http://www.nuget.org/packages/Microsoft.Hadoop.Avro/
Am I taking the correct approach?
Am I missing something in the libraries?
My question is similar (but different) to this one:
Can you append data to an existing Avro data file?
The short answer here is that I was trying to do the wrong thing.
First, we decided that Avro is not the appropriate format for the on-the-wire serialization, primarily because Avro expects the schema definition to be present in every Avro container file. This adds a lot of weight to what is transmitted. You could still use Avro, but that's not what it's designed for (it is designed for big files on HDFS).
Secondly, the existing .NET libraries only support appending to Avro files via a stream, which does not map well to Azure block blobs (you don't want to open a block blob as a stream).
Thirdly, even if those first two issues could be bypassed, all of the items in a single Avro file are expected to share the same schema. We had a set of heterogeneous items flowing in that we wanted to buffer, batch, and write to a blob. Trying to segregate the items by type/schema as we were writing them added a lot of complication. In the end, we opted to use JSON.
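A minimal sketch of that final approach, assuming Newtonsoft.Json and the CloudAppendBlob API shown in the other answer (the appendBlob and items names here are illustrative, not from our actual code):

// Serialize each heterogeneous item as one JSON line and append the batch
// to an append blob; JSON needs no shared schema, header, or sync marker.
var batch = new StringBuilder();
foreach (object item in items)
{
    batch.AppendLine(JsonConvert.SerializeObject(item));
}
appendBlob.AppendText(batch.ToString());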
It is possible to do.
First of all, you have to use CloudAppendBlob:
CloudAppendBlob appBlob = container.GetAppendBlobReference(
    string.Format("{0}{1}", date.ToString("yyyyMMdd"), ".log"));
appBlob.AppendText(
    string.Format(
        "{0} | Error: Something went wrong and we had to write to the log!!!\r\n",
        dateLogEntry.ToString("o")));
The second step is to tell the Avro library not to write the header on append and to share the same sync marker between appends:
var avroSerializer = AvroSerializer.Create<Object>();
using (var buffer = new MemoryStream())
{
    // First write: a normal container, but force a known (all-zero) sync marker
    // so later appends can reuse it. The header/syncMarker fields are private,
    // hence the reflection.
    using (var w = AvroContainer.CreateWriter<Object>(buffer, Codec.Deflate))
    {
        Console.WriteLine("Init Sample Data Set...");
        var headerField = w.GetType().GetField("header", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        var header = headerField.GetValue(w);
        var marker = header.GetType().GetField("syncMarker", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        marker.SetValue(header, new byte[16]);
        using (var writer = new SequentialWriter<Object>(w, 24))
        {
            // Serialize the data to the stream by using the sequential writer
            for (int i = 0; i < 10; i++)
            {
                writer.Write(new Object());
            }
        }
    }

    Console.WriteLine("Append Sample Data Set...");
    // Second write: mark the header as already written so only data blocks are
    // emitted, and set the same sync marker as the first write.
    using (var w = AvroContainer.CreateWriter<Object>(buffer, Codec.Deflate))
    {
        var isHeaderWritten = w.GetType().GetField("isHeaderWritten", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        isHeaderWritten.SetValue(w, true);
        var headerField = w.GetType().GetField("header", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        var header = headerField.GetValue(w);
        var marker = header.GetType().GetField("syncMarker", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        marker.SetValue(header, new byte[16]);
        using (var writer = new SequentialWriter<Object>(w, 24))
        {
            // Serialize the data to the stream by using the sequential writer
            for (int i = 10; i < 20; i++)
            {
                writer.Write(new Object());
            }
        }
    }
    Console.WriteLine("Deserializing Sample Data Set...");
}
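From there, the bytes accumulated in the MemoryStream can be pushed to the append blob from the first step. A minimal sketch, assuming the CloudAppendBlob API above (run it inside the outer using block, before the buffer is disposed):

// Append the Avro container bytes to the blob as a single append operation.
buffer.Position = 0;
appBlob.AppendBlock(buffer);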

How to write into XML file in Haxe?

I am using Haxe and OpenFL and I got a program that generates a Xml file.
However, I can't figure out how to save that file. I can create the Xml tree and check that it's valid, but for the life of me I can't figure out how to write the file.
So, simply put: how do I write (and create) a file in Haxe? I want to be able to save my newly created Xml files (they serve as settings and initialization files for my program) on the computer so that I can load them later.
Found the solution right after writing this question.
The solution is to use sys.io.File.saveContent(), which creates the file if needed and saves the data in one call. You can get the string from the Xml with its toString function. The ultimate solution looks like this:
if (!FileSystem.exists("maps"))
{
    FileSystem.createDirectory("maps");
}
var number:Int = 1;
if (FileSystem.exists("maps/" + filename + ".xml"))
{
    while (FileSystem.exists("maps/" + filename + number + ".xml"))
    {
        number++;
    }
    filename = filename + number;
}
// saveContent creates (or overwrites) the file and writes the string to it
File.saveContent("maps/" + filename + ".xml", root.toString());
This checks whether the directory exists and creates it if not; if the file already exists, it creates a new numbered file rather than overwriting it (for the moment, still working on the save feature).
This solution works on the C++ target; I haven't tested the others much yet, but it does not work on Flash (the sys APIs aren't available there).

Reading/Editing XLIFF using C#

I need to parse an XLIFF file using C#, but I'm having some trouble. These files are fairly complex, containing a huge number of nodes.
Basically, all I need to do is read the source node from each trans-unit node, do some processing on it, and insert the processed text into the corresponding target node (which will always be present, but empty).
An example of one of the nodes I need to parse (the whole file may contain hundreds of these):
<trans-unit id="0000000002" datatype="text" restype="string">
    <source>Windows Update is not installed</source>
    <target/>
    <iws:segment-metadata tm_score="0.00" ws_word_count="6" max_segment_length="0">
        <iws:status target_content="placeholders_only"/>
    </iws:segment-metadata>
    <iws:boundary-seg sequence="bs20721"/>
    <iws:markup-seg sequence="0000000001"/>
</trans-unit>
The trans-unit nodes can be buried deep in the files, and the header section contains a lot of data. I'd like to use LINQ to XML to read the data, but I'm not having any luck getting it to work. Here's my current code (just trying to read and output the source nodes from the file):
XDocument doc = XDocument.Load(path);
Console.WriteLine("Before loop");
foreach (var transUnitNode in doc.Descendants("trans-unit"))
{
    Console.WriteLine("In loop");
    XElement sourceNode = transUnitNode.Element("source");
    XElement targetNode = transUnitNode.Element("target");
    Console.WriteLine("Source: " + sourceNode.Value);
}
I never see 'In loop' and I don't know why. Can someone tell me what I'm doing wrong here, or suggest a better way to achieve what I'm trying to do?
Thanks.
Try

XNamespace df = doc.Root.Name.Namespace;
foreach (XElement transUnitNode in doc.Descendants(df + "trans-unit"))
{
    XElement sourceNode = transUnitNode.Element(df + "source");
    // and so on: use the df namespace object to qualify any element names
}
See also http://msdn.microsoft.com/en-us/library/bb387093.aspx.
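Putting it together, a minimal sketch of the full read-process-write round trip (the ToUpper call is just a stand-in for whatever processing you actually need):

XDocument doc = XDocument.Load(path);
XNamespace df = doc.Root.Name.Namespace;
foreach (XElement transUnit in doc.Descendants(df + "trans-unit"))
{
    XElement source = transUnit.Element(df + "source");
    XElement target = transUnit.Element(df + "target");
    if (source != null && target != null)
    {
        // placeholder processing: copy the transformed source into the target
        target.Value = source.Value.ToUpper();
    }
}
doc.Save(path);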

Dynamic data structures in C#

I have data in a database, and my code is accessing it using LINQ to Entities.
I am writing some software where I need to be able to create a dynamic script. Clients may write the scripts, but it is more likely that they will just modify them. A script will specify stuff like this:
Dataset data = GetDataset("table_name", "field = '1'");
if (data.Read())
{
    string field = data["field"];
    while (data.Read())
    {
        // do some other stuff
    }
}
So the script above is going to read data from the database table called 'table_name' into a list of some kind, based on the filter I have specified ("field = '1'"). It is going to read particular fields and perform normal comparisons and calculations.
The most important thing is that this has to be dynamic: I can specify any table in our database and any filter, and I then must be able to access any field.
I am using a script engine, which means the script I am writing has to be written in C#. Datasets are outdated, and I would rather keep away from them.
Just to reiterate, I am not really wanting to keep the above format, and I can define any method I want behind the scenes for my C# script to call. The above could end up like this, for instance:
var data = GetData("table_name", "field = '1'");
while (data.ReadNext())
{
var value = data.DynamicField;
}
Could I use reflection, for instance, or would that be too slow? Any ideas?
If you want to read fields dynamically from a DataReader, it's pretty easy:
ArrayList al = new ArrayList();
SqlDataReader dataReader = myCommand.ExecuteReader();
if (dataReader.HasRows)
{
    while (dataReader.Read())
    {
        // one string per column in the current row
        string[] fields = new string[dataReader.FieldCount];
        for (int i = 0; i < dataReader.FieldCount; ++i)
        {
            fields[i] = dataReader[i].ToString();
        }
        al.Add(fields);
    }
}
This returns an array list in which each entry is a string array sized to the number of fields the reader has.
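If you want the data.DynamicField syntax from the question, one option is to project each row into an ExpandoObject and let the DLR resolve member access at script time. A hedged sketch (GetData is the hypothetical method name from the question, not an existing API; needs System.Dynamic, System.Collections.Generic, and System.Data.SqlClient):

// Expose each row as a dynamic object keyed by column name.
// WARNING: the table/filter concatenation is for illustration only;
// validate the table name and parameterize the filter in real code.
public static IEnumerable<dynamic> GetData(SqlConnection conn, string table, string filter)
{
    using (var cmd = new SqlCommand("SELECT * FROM " + table + " WHERE " + filter, conn))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            IDictionary<string, object> row = new ExpandoObject();
            for (int i = 0; i < reader.FieldCount; i++)
            {
                row[reader.GetName(i)] = reader.GetValue(i);
            }
            yield return row; // as dynamic, row.field now works
        }
    }
}

A script could then write: foreach (dynamic row in GetData(conn, "table_name", "field = '1'")) { var value = row.field; }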
