Text-first data serialization with separate metadata

I'm trying to find a format that solves a very particular problem:
- A text-first solution.
- The ability to specify complex objects in a single text line (properties, key/value pairs, lists, nested objects).
- The object's metadata structure kept separate from the data.
For example:
Metadata: Prop1:int|Prop2:string|PropList:int[,]
Data: 20|Something|10,20,30
That would mean:
Prop1 = 20
Prop2 = "Something"
PropList = [10,20,30]
Is there any existing serialization format resembling this?

I don't know of any format that supports the scheme from your example. If you really need this layout (a metadata section plus a data section), then you need to write your own parser, and that's easy.
But if you don't want to write your own parser, the most suitable mature format is still JSON.
As for specifying complex objects in a single text line: not YAML, not XML, not INI, not TOML.
Common formats are designed to be generic rather than tied to particular semantics or business needs.
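A minimal sketch of such a hand-rolled parser in Python (the function names and type-spec handling are illustrative assumptions, covering only the types from the question's example):

def parse_metadata(metadata):
    """Split 'Prop1:int|Prop2:string|PropList:int[,]' into (name, type spec) pairs."""
    return [tuple(part.split(":", 1)) for part in metadata.split("|")]

def convert(value, type_spec):
    """Convert one raw string according to a type spec such as 'int' or 'int[,]'."""
    if type_spec.endswith("[,]"):  # a comma-separated list of the element type
        return [convert(v, type_spec[:-3]) for v in value.split(",")]
    if type_spec == "int":
        return int(value)
    if type_spec == "string":
        return value
    raise ValueError(f"unknown type spec: {type_spec}")

def parse_line(metadata, data):
    """Pair a metadata line with a data line and build a dict."""
    fields = parse_metadata(metadata)
    return {name: convert(raw, spec)
            for (name, spec), raw in zip(fields, data.split("|"))}

record = parse_line("Prop1:int|Prop2:string|PropList:int[,]", "20|Something|10,20,30")
# -> {'Prop1': 20, 'Prop2': 'Something', 'PropList': [10, 20, 30]}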

Related

Produce a path to a field in protobuf in Python

I am working on a function that analyzes data (based on some domain-specific logic) in protobufs. When the function finds an issue, I want to include the path to the offending field, including the indexes for the repeated fields.
For example, given the protobuf below:
proto = ECS(
    service=[
        Service(),
        Service(
            capacity_provider_strategy=[
                ServiceCapacityProviderStrategyItem(base=1),
                ServiceCapacityProviderStrategyItem(base=2),
            ]
        ),
    ]
)
Let's assume that the offending field is field = proto.service[1].capacity_provider_strategy[0].
How would I, given only field, produce ecs.service[1].capacity_provider_strategy[0] in a general way?
Please note that I am looking for a way to produce the path mentioned above based solely on the supplied field, since the logic that produces the error message is decoupled from the analyzing logic. I realize that (in the analyzing logic) I could keep track of the indexes of the repeated fields, but this would put more overhead on the analyzing function.
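No answer is recorded in this thread, but purely as a hypothetical sketch: one way is to search from the root message for the target by object identity. This assumes the root message is reachable (which the question would rather avoid) and that indexing a repeated message field yields the identical object on each access, as the pure-Python protobuf runtime does; map fields are not handled here.

from google.protobuf.descriptor import FieldDescriptor

def find_path(message, target, prefix):
    """Depth-first search for `target` (a sub-message) inside `message`."""
    for fd, value in message.ListFields():  # ListFields() only yields set fields
        items = enumerate(value) if fd.label == FieldDescriptor.LABEL_REPEATED else [(None, value)]
        for index, item in items:
            path = f"{prefix}.{fd.name}" + ("" if index is None else f"[{index}]")
            if item is target:  # identity comparison, not equality
                return path
            if fd.type == FieldDescriptor.TYPE_MESSAGE:
                found = find_path(item, target, path)
                if found is not None:
                    return found
    return None

# find_path(proto, proto.service[1].capacity_provider_strategy[0], "ecs")
# -> 'ecs.service[1].capacity_provider_strategy[0]'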

Is there a way to create a data structure that can store integer, float, boolean and string data internally in a PDF document?

I am doing a project and need to store small datasets internally in PDF documents and retrieve them later. The data can be integers, floats, booleans, and strings, and I need to know whether it is possible to create an object that can store these types of data (something similar to an ArrayList in Java) and how I can retrieve the data afterwards (identifying these objects is what I have the most doubts about).
If you have any answer, I would be very grateful if you shared it!
If you want to work with PDF, you should have a look at its specification, ISO 32000.
Already Part 1, in section 7.3, defines among other PDF object types Boolean, Numeric (both integer and real), String, and Array objects.
Furthermore, Annex E says that a conforming writer may also add keys to any PDF object that is implemented as a dictionary, except the file trailer dictionary, and then describes a way to prevent key name collisions in such dictionaries.
So what you could do is add a custom key (with a prefix you have to register) to the PDF Catalog whose value is your array (or whatever structure you want there), e.g.
1 0 obj
<< /Type /Catalog
   /Pages 2 0 R
   /PageMode /UseOutlines
   /Outlines 3 0 R
   /MKLx_SO_Felipe [1.2 false 17 (A String)]
>>
endobj
To add such an entry and to retrieve it again later, you should use an existing, general purpose PDF library for your programming language and runtime. Writing such a thing oneself can turn out to be more complicated than desired.
Alternatively you can store your data in a file with a format of your choice (XML, JSON, ..., you name it) and embed that file in the PDF, either as a file attachment as proposed by Kevin Brown in a comment or as an arbitrary PDF stream referenced from a custom name in some dictionary.
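For example, a minimal sketch with pikepdf (one such general-purpose library; the key name reuses the answer's example, and its prefix would still need registering):

import pikepdf

# Write: attach a custom array to the document catalog (/Root).
with pikepdf.open("input.pdf") as pdf:
    pdf.Root["/MKLx_SO_Felipe"] = pikepdf.Array([1.2, False, 17, pikepdf.String("A String")])
    pdf.save("output.pdf")

# Read: fetch the same key back from the catalog later.
with pikepdf.open("output.pdf") as pdf:
    data = pdf.Root["/MKLx_SO_Felipe"]  # an array holding 1.2, false, 17, (A String)
    print(list(data))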

Converting ANTLR parse trees into string and then reverting it

I am new to ANTLR, and I am digging into it for a project. My work requires me to generate a parse tree from a source code file and convert the parse tree into a string that holds all the information about the parse tree in a somewhat "human-readable" form. Parts of this string (representing the parse tree) will then be modified, and the modified string will have to be converted back into modified source code.
I have found out that the .toStringTree(tree) method can be used in ANTLR to print out the tree in LISP format. Is there a better way to represent the parse tree as a string that holds all information?
Can the string-parse-tree be reverted back to the original source code (in the same language) using ANTLR? If no, are there any tools for this?
Can the string-parse-tree be reverted back to the original source code (in the same language) using ANTLR?
That string does not contain the token types, just the matched text. In other words, you cannot recreate a parse tree from the output of toStringTree. Besides, many ANTLR grammars have lexer rules that skip certain input (white spaces and line breaks, for example), so converting a parse tree back to the original input source is not always possible.
If no, are there any tools for this?
No doubt there are; I suggest you do a search on GitHub. But once you have the parse tree, it is trivial to walk it into a custom tree structure and convert that to JSON.
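For instance, a minimal sketch of that conversion with the ANTLR Python runtime (the helper and the JSON shape are my assumptions; parser and startRule stand in for your generated parser and entry rule):

import json
from antlr4.tree.Tree import TerminalNode

def tree_to_dict(node, rule_names):
    """Recursively map a parse tree into dicts that json.dumps can serialize."""
    if isinstance(node, TerminalNode):
        token = node.getSymbol()
        return {"token_type": token.type, "text": token.text}
    return {
        "rule": rule_names[node.getRuleIndex()],
        "children": [tree_to_dict(node.getChild(i), rule_names)
                     for i in range(node.getChildCount())],
    }

# tree = parser.startRule()
# print(json.dumps(tree_to_dict(tree, parser.ruleNames), indent=2))

Unlike toStringTree, this keeps the token types; tokens skipped by the lexer (white space, comments) are still lost, as noted above.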

How to save a formatted string?

I was wondering if it's possible to save a formatted string in Python in any way.
For example, could I create an arbitrary string like this:
s = f"This is my string. This is a {variable}"
and then save it in a CSV file or an SQL database for later use, given that variable would always be set before loading?
I have already tried this with a CSV document and in MySQL without much luck, so I concluded that it wasn't possible. Google hasn't given much of a result either.
My specific problem is that I have a large file containing hundreds of symptoms. Each symptom is a class that inherits from a parent class containing 7 base questions regarding the symptom. I would like to load a formatted string in the parent class for the subclass to use (as it inherits from the parent class). An example would be something along these lines:
In Parent class:
self.question = f"Do you have a {self.symptom}?"
In Headache class:
self.symptom = "headache"
would be parsed to the string: "Do you have a headache?" etc.
I would really like to load the questions from a database for maintenance purposes, since maintaining a large .py file with a large number of classes, each with a question in string format, would end up as a total nightmare.
Thanks!
I have a quick idea, but I wouldn't recommend it:
Store the code in a string, then execute it:
self.generatequestion = 'question1 = f"Do you have a {self.symptom}?"'
in Headache:
self.symptom = "headache"
And then
exec(self.generatequestion)
print(question1)
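A safer alternative sketch (my suggestion, not part of the original answer): store plain templates and substitute with str.format() at load time, so nothing is exec()ed:

question_template = "Do you have a {symptom}?"  # e.g. one row loaded from CSV/SQL

class Symptom:
    def __init__(self, symptom):
        self.symptom = symptom
        # Fill the stored template instead of executing generated code.
        self.question = question_template.format(symptom=symptom)

print(Symptom("headache").question)  # -> Do you have a headache?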

U-SQL Error - Change the identifier to use at least one lower case letter

I am fairly new to U-SQL and am trying to run a U-SQL script in Azure Data Lake Analytics to process a Parquet file using the Parquet extractor functionality. I am getting the error below and can't find a way around it.
Error - Change the identifier to use at least one lower case letter. If that is not possible, then escape that identifier (for example: '[ACTIVITY]'), or embed it in a CSHARP() block (e.g CSHARP(ACTIVITY)).
Unfortunately, all the fields generated in the Parquet file are capitalized, and I don't want to escape these identifiers. I tried wrapping the identifier in a CSHARP block, and that fails as well (E_CSC_USER_RESERVEDKEYWORDASIDENTIFIER: Reserved keyword CSHARP is used as an identifier). Is there any way I could extract the Parquet file? Thanks for your help!
Code Snippet:
SET @@FeaturePreviews = "EnableParquetUdos:on";

@var1 =
    EXTRACT ACTIVITY string,
            AUTHOR_NAME string,
            AFFLIATION string
    FROM "adl://xxx.azuredatalakestore.net/Abstracts/FY2018_028"
    USING Extractors.Parquet();

@var2 =
    SELECT *
    FROM @var1
    ORDER BY ACTIVITY ASC
    FETCH 5 ROWS;

OUTPUT @var2
TO "adl://xxx.azuredatalakestore.net/Results/AbstractsResults.csv"
USING Outputters.Csv();
Based on your description, you are trying to write
EXTRACT ALLCAPSNAME int FROM "/data.parquet" USING Extractors.Parquet();
In U-SQL, we reserve all-caps identifiers so that we can add new keywords in the future without invalidating old scripts.
To work around, you just have to quote the name (escape it) like in any other SQL dialect:
EXTRACT [ALLCAPSNAME] int FROM "/data.parquet" USING Extractors.Parquet();
Note that this is not changing the name of the field. It is just the syntactic way to address the field.
Also note that in most SQL communities it is considered a best practice to always quote identifiers to avoid reserved-keyword clashes.
If all the fields in the Parquet file are all caps, you will have to quote them all... In a future update you will be able to say EXTRACT * FROM … for Parquet (and ORC) files, but you will still need to quote the columns when you refer to them explicitly.
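Applied to the question's own script, the workaround would look like this (a sketch; the adl:// paths are the question's placeholders):

@var1 =
    EXTRACT [ACTIVITY] string,
            [AUTHOR_NAME] string,
            [AFFLIATION] string
    FROM "adl://xxx.azuredatalakestore.net/Abstracts/FY2018_028"
    USING Extractors.Parquet();

@var2 =
    SELECT *
    FROM @var1
    ORDER BY [ACTIVITY] ASC
    FETCH 5 ROWS;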
