Can I append Avro serialized data to an existing Azure blob? - azure

I am asking if I can, but I would also like to know if I should.
Here's my scenario:
I am receiving Avro serialized messages in small batches. I want to store them for later analysis using a Hive table with the Avro SerDe. I'm running in Azure, and I am storing the messages in a blob.
I am trying to avoid having lots of small blobs (because I believe this will have a negative impact on Hive). If the Avro header is already written to the blob, I believe I can append Avro data blocks with CloudBlockBlob.PutBlockAsync() (as long as I know the sync marker).
However, I've examined two .NET libraries, and neither seems to support my approach (each requires writing the entire Avro container file at once):
http://www.nuget.org/packages/Apache.Avro/
http://www.nuget.org/packages/Microsoft.Hadoop.Avro/
Am I taking the correct approach?
Am I missing something in the libraries?
My question is similar (but different) to this one:
Can you append data to an existing Avro data file?

The short answer here is that I was trying to do the wrong thing.
First, we decided that Avro is not the appropriate format for on-the-wire serialization, primarily because Avro expects the schema definition to be present in every Avro file. This adds a lot of weight to what is transmitted. You could still use Avro, but that's not what it's designed for (it is designed for big files on HDFS).
Secondly, the existing libraries (for .NET) only support appending to Avro files via a stream. This does not map well to Azure block blobs (you don't want to open a block blob as a stream).
Thirdly, even if the first two could be worked around, all of the items in a single Avro file are expected to share the same schema. We had a set of heterogeneous items flowing in that we wanted to buffer, batch, and write to blob. Trying to segregate the items by type/schema as we were writing them to blob added a lot of complication. In the end, we opted to use JSON.
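For illustration, here is a minimal sketch of what the newline-delimited JSON approach can look like against an append blob. The blob name and batching are hypothetical; this assumes the classic Microsoft.WindowsAzure.Storage client and Newtonsoft.Json.

using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.WindowsAzure.Storage.Blob;
using Newtonsoft.Json;

public static class JsonBatchWriter
{
    // Append a heterogeneous batch as newline-delimited JSON to one blob,
    // avoiding lots of small blobs. Blob name and batching are illustrative.
    public static void AppendBatch(CloudBlobContainer container, IEnumerable<object> items)
    {
        CloudAppendBlob blob = container.GetAppendBlobReference("messages-20150101.json");
        if (!blob.Exists())
        {
            blob.CreateOrReplace();
        }

        var sb = new StringBuilder();
        foreach (var item in items)
        {
            // One JSON document per line; each line can have a different schema.
            sb.AppendLine(JsonConvert.SerializeObject(item));
        }

        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(sb.ToString())))
        {
            blob.AppendBlock(stream);
        }
    }
}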

It is possible to do this.
First of all, you have to use CloudAppendBlob:
CloudAppendBlob appBlob = container.GetAppendBlobReference(
    string.Format("{0}{1}", date.ToString("yyyyMMdd"), ".log"));
appBlob.AppendText(
    string.Format(
        "{0} | Error: Something went wrong and we had to write to the log!!!\r\n",
        dateLogEntry.ToString("o")));
The second step is to tell the Avro library not to write the header on append, and to share the same sync marker between appends:
var avroSerializer = AvroSerializer.Create<Object>();

using (var buffer = new MemoryStream())
{
    using (var w = AvroContainer.CreateWriter<Object>(buffer, Codec.Deflate))
    {
        Console.WriteLine("Init Sample Data Set...");

        // Replace the randomly generated sync marker with a known, fixed one
        // so that later appends can produce matching blocks.
        var headerField = w.GetType().GetField("header", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        var header = headerField.GetValue(w);
        var marker = header.GetType().GetField("syncMarker", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        marker.SetValue(header, new byte[16]);

        using (var writer = new SequentialWriter<Object>(w, 24))
        {
            // Serialize the data to the stream with the sequential writer
            for (int i = 0; i < 10; i++)
            {
                writer.Write(new Object());
            }
        }
    }

    Console.WriteLine("Append Sample Data Set...");

    // Create a second writer over the same stream, mark the header as already
    // written, and reuse the same sync marker so the new blocks are appended.
    using (var w = AvroContainer.CreateWriter<Object>(buffer, Codec.Deflate))
    {
        var isHeaderWritten = w.GetType().GetField("isHeaderWritten", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        isHeaderWritten.SetValue(w, true);

        var headerField = w.GetType().GetField("header", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        var header = headerField.GetValue(w);
        var marker = header.GetType().GetField("syncMarker", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        marker.SetValue(header, new byte[16]);

        using (var writer = new SequentialWriter<Object>(w, 24))
        {
            // Serialize the data to the stream with the sequential writer
            for (int i = 10; i < 20; i++)
            {
                writer.Write(new Object());
            }
        }
    }

    Console.WriteLine("Deserializing Sample Data Set...");
}
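To get the result into storage, the buffered bytes can then be pushed to an append blob (do this while the buffer is still in scope, before it is disposed). AppendBlock is a real CloudAppendBlob method; the blob name and wiring here are illustrative:

// Sketch: run before `buffer` is disposed; assumes `container` is in scope.
buffer.Seek(0, SeekOrigin.Begin);

CloudAppendBlob avroBlob = container.GetAppendBlobReference("data.avro"); // illustrative name
if (!avroBlob.Exists())
{
    avroBlob.CreateOrReplace();
}
avroBlob.AppendBlock(buffer);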

Related

Is it possible to store Cytoscape.js layout data directly to a file format in app/web server and re-launch the same layout to minimize re-computation?

Some of the Cytoscape layouts are randomized, so node positions are not fixed every time we launch them. I understand from multiple Stack Overflow questions that we can save the layout data (including its position x and y) into browser local storage or session storage, so that we can display the same layout using the same data.
However, local storage and session storage only work per user. Imagine thousands of users using the same app: the server would undergo mass computation for each user to store the respective data in individual browsers. Can we save the data in a file directly on the app/web server, so that 1000 users will see the same layout? This would also reduce the computation across different data sets.
Thank you. I would like to know whether it is possible to convert the data into a file and store it on the web/app server.
Yes, you can store position data. There are actually two options that come to mind.
Use cy.json(). You can store the elements as JSON like JSON.stringify(cy.json().elements) and then save this JSON string. (cy.json().elements contains each element's data, position, and other state.)
You can restore this data easily like cy.json({elements: JSON.parse(jsonStr)});
As you can see, cy.json().elements is a fairly big object: in addition to position, it contains a lot of other data, while position data itself is just a small object like {x: 0, y: 0}. So if you only need to restore the positions, you can store them manually with code like the below, using the node.id() and node.position() functions.
function storePositions() {
    const nodes = cy.nodes();
    const nodePositions = {};
    for (let i = 0; i < nodes.length; i++) {
        nodePositions[nodes[i].id()] = nodes[i].position();
    }
    return nodePositions;
}
You can also restore node positions easily, using the cy.getElementById() and node.position() functions.
function restorePositions(nodePositions) {
    for (let k in nodePositions) {
        const node = cy.getElementById(k);
        if (node && node.length > 0) {
            node.position(nodePositions[k]);
        }
    }
}
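As for saving the data into a file on the app/web server: once you have the JSON string, any backend can persist it to a file and serve that same file to every user. A hedged sketch, assuming an ASP.NET Core backend (the route and file name are hypothetical):

// Program.cs in a minimal ASP.NET Core web project: persist and serve the layout.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

const string layoutFile = "layout.json"; // hypothetical file name

// One client (e.g. an admin) computes the layout once and posts it here.
app.MapPost("/layout", async (HttpRequest request) =>
{
    using var reader = new StreamReader(request.Body);
    var json = await reader.ReadToEndAsync();
    await File.WriteAllTextAsync(layoutFile, json);
    return Results.Ok();
});

// Every other user fetches the same precomputed layout.
app.MapGet("/layout", async () =>
    File.Exists(layoutFile)
        ? Results.Text(await File.ReadAllTextAsync(layoutFile), "application/json")
        : Results.NotFound());

app.Run();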

Azure FHIR: Get RawResource in Plain Text

I've just started my research on the "Azure FHIR SQL Server Version".
I had some issues trying to get the JSON resource in plain text, since it is stored compressed in the database (as shown in the following lines):
select r.RawResource from dbo.Resource r where r.IsHistory=0 and r.IsDeleted=0;
RAWRESOURCE
0x1F8B080000000000000A8492CB4EC2501086E7519AAE344122F7E04A4274618C90C8CEB8282D60136E2985A08477F79B39A71513D034E77466CECC3FFF5C0E124A2613D9C84AB64831F2483E65CD3F943BCE00EB4C22594A2A5FFC73FE2BB4502A9C5412EF579786B4898AA4234DEE1BB99531783152137B225DA431D6485A481DF408FB546AC8131FD9F0B839FA9E5BB10FDC1B64CDBD4572F966782C3999D9153F9463C949DF94E994A33E1AF366483BFCE7E014F5D561D4E2AB733A76B7398AF56E68111528D2CE47E4A069B4BE2D795D94487D7053EB538C3D300E01D608CEAABF4A0FCDD5A71C527B71CCFE8B0E3D1BAD74CE8999C1E2A4AA0D33D31E4DBC3564821F362765573953F71575D7E49A1C4D3EED429BBBEBBB781E53A50886F30B982F641C59C7356F1F2DB3ED5AC93DF32A62AB25FBCB99A6F8EEFFC8129CE409E4D17BFFCC2CE1737B5D7458F3B80E3B1CED790FEC2AF14F44ECFCA60432B49D4A2CCB83E7159035C352B50D69A10FE0A8B10DEB63319F18949C6A1CD36734ADDAD5B126DEEDF1DC7AA35BFADB2F00BDE3BDB5475BDBE2ACC41B1AC3ADAFF428DF000000FFFF
I tried different ways to get it; however, none were successful.
select cast(r.RawResource as varchar(max)) VarcharResource,
CONVERT(varchar(max), r.RawResource, 0) VarcharResource2
from dbo.Resource r where r.IsHistory=0 and r.IsDeleted=0;
VarcharResource2
‹ „’ËNÂP†çQš®4A""÷àJBtaŒÈθ(-`n)… „w÷›9§Ð4çtfÎÌ?ÿ\J&ÙÈJ¶H1òH>eÍ?”;Î ëL""YJ*_üsþ+´P*œTïW—†´‰Š¤#M1x1R{""]¤1ÖHZHôûTjÈÙð¸9úž[±ÜdͽErùfx,9™Ù?”cÉIߔ锣>ófH;üçàõÕaÔâ«s:v·9Šõnh(ÒÎGä i´¾-y]”H}pSëSŒ=0ÖΪ¿JÍÕ§R{qÌþ‹=­tΉ™Á⤪3ÓM¼5d‚6'eW9S÷u×äšM>íB›»ë»xS¥†ó˜/dYÇ5o-³íZÉ=ó*b«%ûË™¦øîÿÈœä äÑ{ÿÌ,ás{]tXó¸;íyì*ñODìü¦2´J,˃ç5ÃRµi¡à¨±ëc1Ÿ”œjÓg4­ÚÕ±&ÞíñÜz£[úÛ/ ½ã½µG[Ûâ¬Äí¯ô(ß ÿÿ
Does anybody know the correct way to get the JSON back in plain text?
Thanks
The resources are Gzipped, so something like:
string rawResource;

// rawResourceStream is a Stream over the RawResource bytes, and
// ResourceEncoding is the encoding the server uses (UTF-8 in practice).
using (rawResourceStream)
using (var gzipStream = new GZipStream(rawResourceStream, CompressionMode.Decompress))
using (var reader = new StreamReader(gzipStream, ResourceEncoding))
{
    rawResource = await reader.ReadToEndAsync();
}
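For a more self-contained illustration, here is a sketch that reads the varbinary column with SqlClient and decompresses it yourself; the connection string and query are illustrative, and the encoding is assumed to be UTF-8:

using System;
using System.Data.SqlClient;
using System.IO;
using System.IO.Compression;
using System.Text;

class Program
{
    static void Main()
    {
        // Hypothetical connection string; point it at your FHIR database.
        using (var conn = new SqlConnection("Server=.;Database=FHIR;Integrated Security=true"))
        using (var cmd = new SqlCommand(
            "select top 1 r.RawResource from dbo.Resource r where r.IsHistory = 0 and r.IsDeleted = 0", conn))
        {
            conn.Open();
            var compressed = (byte[])cmd.ExecuteScalar();

            // The 0x1F8B prefix visible in SSMS is the gzip magic number.
            using (var memory = new MemoryStream(compressed))
            using (var gzip = new GZipStream(memory, CompressionMode.Decompress))
            using (var reader = new StreamReader(gzip, Encoding.UTF8))
            {
                Console.WriteLine(reader.ReadToEnd()); // the JSON in plain text
            }
        }
    }
}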

Text Reader Classes in Hadoop

I have a directory OUTPUT where I have the output files from a Map Reduce job. The output files are Text files written with a TextOutputFormat.
Now I want to read the key-value pairs from the output files. How can I do this using existing classes in Hadoop? One way I could do it is as follows:
FileSystem fs = FileSystem.get(conf);
FileStatus[] files = fs.globStatus(new Path(OUTPUT + "/part-*"));
for (FileStatus file : files) {
    if (file.getLen() > 0) {
        FSDataInputStream in = fs.open(file.getPath());
        BufferedReader bin = new BufferedReader(new InputStreamReader(in));
        String s = bin.readLine();
        while (s != null) {
            System.out.println(s);
            s = bin.readLine();
        }
        in.close();
    }
}
This approach works, but it adds a great deal to my task, as I now need to manually parse the key-value pairs out of each individual line. I am looking for something more handy that directly lets me read the key and value into variables.
Are you forced to use TextOutputFormat as your output format in the previous job?
If not then consider using SequenceFileOutputFormat, then you can use a SequenceFile.Reader to read back the file in Key / Value pairs. You can also still 'view' the file using hadoop fs -text path/to/output/part-r-00000
EDIT: You can also use the KeyValueLineRecordReader class; you'll just need to pass a FileSplit to the constructor.

Reading/Editing XLIFF using C#

I need to parse an XLIFF file using C#, but I'm having some trouble. These files are fairly complex, containing a huge number of nodes.
Basically, all I need to do is read the source node from each trans-unit node, do some processing on it, and insert the processed text into the corresponding target node (which will always be present, but empty).
An example of one of the nodes I need to parse would be (the whole file may contain hundreds of these):
<trans-unit id="0000000002" datatype="text" restype="string">
<source>Windows Update is not installed</source>
<target/>
<iws:segment-metadata tm_score="0.00" ws_word_count="6" max_segment_length="0">
<iws:status target_content="placeholders_only"/>
</iws:segment-metadata>
<iws:boundary-seg sequence="bs20721"/>
<iws:markup-seg sequence="0000000001"/>
</trans-unit>
The trans-unit nodes can be buried deep in the files; the header section contains a lot of data. I'd like to use LINQ to XML to read the data, but I'm not having any luck getting it to work. Here's my current code (just trying to read and output the source nodes from the file):
XDocument doc = XDocument.Load(path);
Console.WriteLine("Before loop");
foreach (var transUnitNode in doc.Descendants("trans-unit"))
{
    Console.WriteLine("In loop");
    XElement sourceNode = transUnitNode.Element("source");
    XElement targetNode = transUnitNode.Element("target");
    Console.WriteLine("Source: " + sourceNode.Value);
}
I never see 'In loop', and I don't know why. Can someone tell me what I'm doing wrong here, or suggest a better way to achieve what I'm trying to do?
Thanks.
Try
XNamespace df = doc.Root.Name.Namespace;
foreach (XElement transUnitNode in doc.Descendants(df + "trans-unit"))
{
    XElement sourceNode = transUnitNode.Element(df + "source");
    // and so on; use the df namespace object to qualify any element names
}
See also http://msdn.microsoft.com/en-us/library/bb387093.aspx.
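Putting it together, here is a hedged sketch of the full read-process-write round trip. ProcessText is a hypothetical stand-in for whatever processing you need; the rest uses only standard LINQ to XML:

using System;
using System.Xml.Linq;

class XliffProcessor
{
    // Hypothetical stand-in for the real text processing.
    static string ProcessText(string source) => source.ToUpperInvariant();

    static void Main()
    {
        string path = "file.xlf"; // illustrative path
        XDocument doc = XDocument.Load(path);
        XNamespace df = doc.Root.Name.Namespace;

        foreach (XElement transUnit in doc.Descendants(df + "trans-unit"))
        {
            XElement source = transUnit.Element(df + "source");
            XElement target = transUnit.Element(df + "target");
            if (source != null && target != null)
            {
                // Insert the processed source text into the empty target node.
                target.Value = ProcessText(source.Value);
            }
        }

        doc.Save(path);
    }
}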

Dynamic data structures in C#

I have data in a database, and my code is accessing it using LINQ to Entities.
I am writing some software where I need to be able to create a dynamic script. Clients may write the scripts, but it is more likely that they will just modify them. The script will specify something like this:
Dataset data = GetDataset("table_name", "field = '1'");
if (data.Read())
{
    string field = data["field"];
    while (data.Read())
    {
        // do some other stuff
    }
}
So the script above is going to read data from the database table called 'table_name' into a list of some kind, based on the filter I have specified ("field = '1'"). It is going to read particular fields and perform normal comparisons and calculations.
The most important thing is that this has to be dynamic. I can specify any table in our database, any filter and I then must be able to access any field.
I am using a script engine, which means the script I am writing has to be written in C#. DataSets are outdated, and I would rather stay away from them.
Just to reiterate: I am not really wanting to keep the above format, and I can define any method I want behind the scenes for my C# script to call. The above could end up like this, for instance:
var data = GetData("table_name", "field = '1'");
while (data.ReadNext())
{
    var value = data.DynamicField;
}
Could I use reflection, for instance? Or perhaps that would be too slow. Any ideas?
If you want to read a DataReader dynamically, it's pretty easy:
ArrayList al = new ArrayList();
SqlDataReader dataReader = myCommand.ExecuteReader();
if (dataReader.HasRows)
{
    while (dataReader.Read())
    {
        string[] fields = new string[dataReader.FieldCount];
        for (int i = 0; i < dataReader.FieldCount; ++i)
        {
            fields[i] = dataReader[i].ToString();
        }
        al.Add(fields);
    }
}
This will populate the ArrayList with one string array per row, sized to the number of fields the reader returns.
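Since the question asks for data.DynamicField-style access, here is a hedged sketch of one way to get it with ExpandoObject. The GetData signature and the table/filter plumbing are illustrative; dynamic dispatch carries some overhead, though typically less than raw reflection per access:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Dynamic;

static class DynamicData
{
    // Reads all rows into dynamic objects whose members mirror the column names.
    // NOTE: concatenating the filter like this is illustrative only; real code
    // should use parameterized queries to avoid SQL injection.
    public static IEnumerable<dynamic> GetData(SqlConnection conn, string table, string filter)
    {
        // Assumes conn is already open.
        using (var cmd = new SqlCommand($"select * from {table} where {filter}", conn))
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                var row = new ExpandoObject() as IDictionary<string, object>;
                for (int i = 0; i < reader.FieldCount; i++)
                {
                    row[reader.GetName(i)] = reader.IsDBNull(i) ? null : reader.GetValue(i);
                }
                yield return row;
            }
        }
    }
}

// Usage in a script:
// foreach (dynamic row in DynamicData.GetData(conn, "table_name", "field = '1'"))
//     Console.WriteLine(row.field); // any column is available as a dynamic member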
