Deserialize XMLDocument with encoded characters in attribute names

I'm trying to deserialize XML data into an object with C#. I have always done this using the .NET deserialize method, and that has worked well for most of what I have needed.
Now, though, I have XML that is created by SharePoint, and the attribute names of the data I need to deserialize contain encoded characters, namely:
space, º, ç, ã, : and a hyphen, encoded as
x0020, x00ba, x00e7, x00e3, x003a and x002d respectively.
I'm trying to figure out what to put in the AttributeName parameter of the XmlAttribute attribute on my properties.
x0020 converts to a space without problems, so, for instance, I can use
[XmlAttribute(AttributeName = "ows_Nome Completo")]
to read
ows_Nome_x0020_Completo="MARIA..."
On the other hand, neither
[XmlAttribute(AttributeName = "ows_Motiva_x00e7__x00e3_o_x003a_")]
nor
[XmlAttribute(AttributeName = "ows_Motivação_x003a_")]
nor
[XmlAttribute(AttributeName = "ows_Motivação:")]
allow me to read
ows_Motiva_x00e7__x00e3_o_x003a_="text to read..."
With the first two I get no value returned, and the third gives me a runtime error for invalid characters (the colon).
Is there any way to get this working with the .NET deserializer, or do I have to build a specific deserializer for this?
Thanks!

What you are looking at (the "cryptic" data) is an escaped XML name: characters that are not allowed in an XML name are written as _xHHHH_, where HHHH is the character's hexadecimal code. SharePoint uses this to keep field names valid as attribute names.
There are a few ways of dealing with this. The most elegant is to extract the list schema and match each element against it; the schema contains all the metadata about your list data. A polished example of a schema can be seen below or here http://www.bendsoft.com/documentation/camelot-php-tools/1_5/packets/schema-and-content-packets/schemas/example-list-view-schema/
If you don't want to walk that path you could start here http://msdn.microsoft.com/en-us/library/35577sxd.aspx
<Field Name="ContentType">
<ID>c042a256-787d-4a6f-8a8a-cf6ab767f12d</ID>
<DisplayName>Content Type</DisplayName>
<Type>Text</Type>
<Required>False</Required>
<ReadOnly>True</ReadOnly>
<PrimaryKey>False</PrimaryKey>
<Percentage>False</Percentage>
<RichText>False</RichText>
<VisibleInView>True</VisibleInView>
<AppendOnly>False</AppendOnly>
<FillInChoice>False</FillInChoice>
<HTMLEncode>False</HTMLEncode>
<Mult>False</Mult>
<Filterable>True</Filterable>
<Sortable>True</Sortable>
<Group>_Hidden</Group>
</Field>
<Field Name="Title">
<ID>fa564e0f-0c70-4ab9-b863-0177e6ddd247</ID>
<DisplayName>Title</DisplayName>
<Type>Text</Type>
<Required>True</Required>
<ReadOnly>False</ReadOnly>
<PrimaryKey>False</PrimaryKey>
<Percentage>False</Percentage>
<RichText>False</RichText>
<VisibleInView>True</VisibleInView>
<AppendOnly>False</AppendOnly>
<FillInChoice>False</FillInChoice>
<HTMLEncode>False</HTMLEncode>
<Mult>False</Mult>
<Filterable>True</Filterable>
<Sortable>True</Sortable>
</Field>
<Field>
...
</Field>

Well... I guess I kind of hacked a way around it, which works for now. I just replaced the _x***_ sequences with nothing and corrected the XmlAttribute names accordingly. The replacement is done by first loading the XML as a string, then replacing, then loading the "clean" text as XML.
But I would still like to know whether it is possible to use some XmlAttribute name for a more direct approach...

Try using System.Xml; XmlConvert.EncodeName and XmlConvert.DecodeName
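To make the encoding concrete, here is a minimal sketch of the decoding rule, written in JavaScript purely for illustration (in .NET, XmlConvert.DecodeName does the decoding for you). The function name decodeXmlName is hypothetical, and the sketch assumes that only the four-digit _xHHHH_ form appears in the input.

```javascript
// Hypothetical helper: decodes _xHHHH_ escapes (four hex digits) in a name.
// Assumption: only the four-digit form appears in the input.
function decodeXmlName(name) {
  return name.replace(/_x([0-9A-Fa-f]{4})_/g, (match, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

// The two attribute names from the question.
console.log(decodeXmlName("ows_Nome_x0020_Completo"));
console.log(decodeXmlName("ows_Motiva_x00e7__x00e3_o_x003a_"));
```

The second name decodes to a string ending in a colon, which is not a valid character in an unprefixed XML attribute name; that is consistent with the runtime error reported in the question.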

I use a simple function to get the column name:
private string getNameCol(string colName) {
    // Keep at most the first 20 characters before encoding.
    if (colName.Length > 20) colName = colName.Substring(0, 20);
    return System.Xml.XmlConvert.EncodeName(colName);
}
I'm still looking for a way to replace characters like á, é, í, ó, ú. EncodeName doesn't convert these characters, because they are valid in XML names.
You can use Replace:
.Replace("ó","_x00f3_").Replace("á","_x00e1_")
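Instead of chaining one Replace call per character, the same idea can be written as a single pass. The sketch below is in JavaScript for illustration only. The helper name encodeFieldName is hypothetical, and the set of characters it escapes (everything except ASCII letters, digits and the underscore) is an assumption, not a documented SharePoint rule. It also leaves out the 20-character truncation used above.

```javascript
// Hypothetical helper: escapes every character outside A-Z, a-z, 0-9 and _
// as _xHHHH_ (lowercase hex, padded to four digits).
// Assumption: the input contains only characters from the Basic Multilingual Plane.
function encodeFieldName(displayName) {
  return displayName.replace(/[^A-Za-z0-9_]/g, (ch) =>
    "_x" + ch.charCodeAt(0).toString(16).padStart(4, "0") + "_"
  );
}

console.log(encodeFieldName("Nome Completo"));
console.log(encodeFieldName("Motivação:"));
```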

Related

Concatenate long strings from multiple records into one string

I have a situation where I need to concatenate long strings from multiple records in an Oracle database into a single string. These long strings are portions of a larger XML string, and my ultimate goal is to be able to convert this XML into something resembling query results and pull out specific values.
The data would look something like this, with the MSG_LINE_TEXT field being VARCHAR2(4000). So if the total message is less than 4000 characters, then there'd only be one record. In theory, there could be an infinite number of records for each message, although the highest I've seen so far is 14 records, which means I need to be able to handle strings that are at least 56000 characters long.
MESSAGE_ID  MSG_LINE_NUMBER  MSG_LINE_TEXT
----------  ---------------  --------------------------------
17415414    1                Some XML snippet here
17415414    2                Some XML snippet here
17415414    3                Some XML snippet here
17415414    4                Some XML snippet here
The total XML for one MESSAGE_ID might look something like this. There could be many App_Advice_Error tags, although this specific example only contains one.
<tXML>
<Header>
<Source>MANH_prod_wmsweb</Source>
<Action_Type />
<Sequence_Number />
<Company_ID>1</Company_ID>
<Msg_Locale />
<Version />
<Internal_Reference_ID>17415414</Internal_Reference_ID>
<Internal_Date_Time_Stamp>2021-02-09 13:45:22</Internal_Date_Time_Stamp>
<External_Reference_ID />
<External_Date_Time_Stamp />
<User_ID>ESBUSER</User_ID>
<Message_Type>RESPONSE</Message_Type>
</Header>
<Response>
<Persistent_State>0</Persistent_State>
<Error_Type>2</Error_Type>
<Resp_Code>501</Resp_Code>
<Response_Details>
<Application_Advice>
<Shipper_ID />
<Imported_Object_Type>ASN</Imported_Object_Type>
<Response_Type>Error</Response_Type>
<Transaction_Date>2/9/21 13:45</Transaction_Date>
<Application_Ackg_Code>TE</Application_Ackg_Code>
<Business_Unit></Business_Unit>
<Tran_Set_Identifier_Code></Tran_Set_Identifier_Code>
<Transaction_Purpose_Code>11</Transaction_Purpose_Code>
<Imported_Message_Id></Imported_Message_Id>
<Imported_Object_Id>Reference Number Here</Imported_Object_Id>
<Additional_References>
<Additional_Reference_Info>
<Reference_Type>BusinessPartner</Reference_Type>
<Reference_ID></Reference_ID>
</Additional_Reference_Info>
</Additional_References>
<App_Advice_Errors>
<App_Advice_Error>
<App_Error_Text>Some error text here</App_Error_Text>
<Error_Message_Tokens>
<Error_Message_Token>Object that errored out</Error_Message_Token>
</Error_Message_Tokens>
<App_Err_Cond_Code>6100234</App_Err_Cond_Code>
</App_Advice_Error>
</App_Advice_Errors>
<Imported_Data></Imported_Data>
</Application_Advice>
</Response_Details>
</Response>
</tXML>
The values that I'm most interested in pulling out are the App_Err_Cond_Code, Error_Message_Token, and App_Error_Text tags. I had tried using something like this:
extractvalue(xmltype(msg_line_text), '//XPath of Tag')
This works beautifully for stuff where the entire XML is less than 4000 characters, i.e. the entire XML is stored in a single record. The problem comes when there are multiple records, because each individual snippet of XML isn't a valid XML string on its own, and so XMLTYPE throws an error, hence the reason I'm trying to concatenate them all into a single string, which I can then use with the above method.
I've tried a variety of ways to do this - LISTAGG, XMLAGG, SYS_CONNECT_BY_PATH, as well as writing a custom function something like this:
with
  function get_messages(pTranLogID number) return string
  is
    xml varchar2;
  begin
    xml := '';
    for msg in (
      select r.msg_line_text
      from tran_log_response_message r, tran_log t
      where
        t.message_id = r.message_id
        and t.tran_log_id = pTranLogID
      order by r.msg_line_number
    )
    loop
      xml := xml || msg.msg_line_text;
    end loop;
    return 'test';
  end;
select
  tran_log_id, get_messages(tran_log_id)
from
  tran_log
where
  tran_log_id = '20633610';
/
The problem is that every one of these methods complained that the string was too long. Does anyone have any other ideas? Or maybe a better approach to this problem?
Thanks.

NodeJS why is object[0] returning '{' instead of the first property from this json object?

So I have to go through a bunch of code to get some data from an iframe. The iframe has a lot of data, but in there is a property called '_name'. The first key of _name is 'extension_id' and its value is a big long string. The JSON object is enclosed in apostrophes. I have tried removing the apostrophes, but instead of the extension_id value I still get a single curly bracket. The JSON object looks something like this:
Frame {
...
...
_name: '{"extension_id":"a big huge string that I need"} "a bunch of other stuff":"this is a valid json object as confirmed by jsonlint", "globalOptions":{"crev":"1.2.50"}}}'
}
It's a whole big ugly paragraph, but I really just need the extension_id. This is the code I'm currently using, after attempt 100 or whatever:
var frames = await page.frames();
// I'm using puppeteer for this part but I don't think that's relevant overall.
var thing = frames[1]._name;
console.log(frames[1])
// console.log(thing)
thing.replace(/'/g, '"')
// this is to remove the apostrophes from the outside of the object. I thought that would change things before. it does not. still outputs a single {
JSON.parse(thing)
console.log(thing[0])
Instead of getting "a big huge string that I need", or whatever is written in extension_id, I get a {. That's it. I think that is because the whole object starts with a curly bracket. This is confirmed to me because console.log(thing[2]) prints e. So what's going on? jsonlint says this is a valid JSON object, but maybe it's just a big string and I should be doing some kind of split to grab what's between the first : and the first ,. I'm really not sure.

For two reasons:
object[0] doesn't return the value of an object's "first property"; it returns the value of the property named "0", if any (your object probably doesn't have one); and
because it's JSON, and when you're dealing with JSON in JavaScript code, you are by definition dealing with a string. Indexing a string with [0] returns its first character, which is why you see {. If you want to deal with the object that the JSON describes, parse it.
Here's an example of parsing it and getting the value of the extension_id property from it:
const parsed = JSON.parse(frames[1]._name);
console.log(parsed.extension_id); // The ID
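The difference between indexing the JSON text and indexing the parsed object can be seen in a small sketch. The sample string is made up for illustration and stands in for the real _name value.

```javascript
// Made-up stand-in for frames[1]._name: a string that contains JSON text.
const raw = '{"extension_id":"abc123","globalOptions":{"crev":"1.2.50"}}';

// Indexing a string returns single characters.
const firstChar = raw[0]; // "{"
const thirdChar = raw[2]; // "e"

// Parsing returns a new object; its properties are read by name.
const parsedSample = JSON.parse(raw);
const extensionId = parsedSample.extension_id;

console.log(firstChar, thirdChar, extensionId);
```

Both String.prototype.replace and JSON.parse return new values and leave the original string untouched, so their results have to be assigned to a variable before they can be used.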

Fails to parse Hebrew text from pdf using iText 7 with .net

I am trying to read a PDF file with several pages, using iText 7 on .NET Core 2.1.
The following is my code:
Rectangle rect = new Rectangle(0, 0, 1100, 1100);
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
inputStr = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
inputStr gets the following string:
"\u0011\v\u000e\u0012\u0011\v\f)(*).=*%'\f*).5?5.5*.\a \u0011\u0002\u001b\u0001!\u0016\u0012\u001a!\u0001\u0015\u001a \u0014\n\u0015\u0017\u0001(\u001b)\u0001)\u0016\u001c*\u0012\u0001\u001d\u001a \u0016* \u0015\u0001\u0017\u0016\u001b\u001a(\n,\u0002>&\u00...
and in the Text Visualizer, it looks like this:
)(*).=*%'*).5?5.5*. !!
())* * (
,>&2*06) 2.-=9 )=&,

2..*0.5<.?
.110
)<1,3
  2.3*1>?)10/6
 (& >(*,1=0>>*1?

  2.63)&*,..*0.5
  206)&13'?*9*<
  *-5=0>
?*&..,?)..*0.5
It looks like I am unable to resolve the encoding, or there is a specific, custom encoding at the PDF level that I cannot read/parse.
Looking at the Document Properties, under Fonts it says the following:
Any ideas how I can parse the document correctly?
Thank you
Yaniv

Analysis of the shared files
file1_copyPasteWorks.pdf
The font definitions here have an invalid ToUnicode entry:
/ToUnicode/Identity-H
The ToUnicode value is specified as
A stream containing a CMap file that maps character codes to Unicode values
(ISO 32000-2, Table 119 — Entries in a Type 0 font dictionary)
Identity-H is a name, not a stream.
Nonetheless, Adobe Reader interprets this name, and for apparently any name starting with Identity- assumes the text encoding for the font to be UCS-2 (essentially UTF-16). As this indeed is the case for the character codes used in the document, copy&paste works, even if for the wrong reasons. (Without this ToUnicode value, Adobe Reader also returns nonsense.)
iText 7, on the other hand, for mapping to Unicode first follows the Encoding value with unexpected results.
Thus, in this case Adobe Reader arrives at a better result by interpreting meaning into an invalid piece of data (and without that also returns nonsense).
file2_copyPasteFails.pdf
The font definitions here have valid but incomplete ToUnicode maps which only contain entries for the used Western European characters but not for Hebrew ones. They don't have Encoding entries.
Both Adobe Reader and iText 7 here trust the ToUnicode map and, therefore, cannot map the Hebrew glyphs.
How to parse
file1_copyPasteWorks.pdf
In case of this file the "problem" is that iText 7 applies the Encoding map. Thus, for decoding the text one can temporarily replace the Encoding map with an identity map:
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
{
    PdfPage page = pdfDocument.GetPage(i);
    PdfDictionary fontResources = page.GetResources().GetResource(PdfName.Font);
    foreach (PdfObject font in fontResources.Values(true))
    {
        if (font is PdfDictionary fontDict)
            fontDict.Put(PdfName.Encoding, PdfName.IdentityH);
    }
    string output = PdfTextExtractor.GetTextFromPage(page);
    // ... process output ...
}
This code shows the Hebrew characters for your file 1.
file2_copyPasteFails.pdf
Here I don't have a quick work-around. You may want to analyze multiple PDFs of that kind. If they all encode the Hebrew characters the same way, you can create your own ToUnicode map from that and inject it into the fonts like above.

Filter a String which holds a TimeStamp - Kotlin

I have written a function which generates a timestamp and converts it to a String using toString(). I want to remove the whitespace and other special characters from that string. Is there an efficient way to do it?
This is the function, which generates an ID from the timestamp, since the timestamp will be unique (note: only when IDs are generated at different milliseconds):
fun autoGenerateID() : String = Timestamp(java.util.Date().getTime()).toString()
When I call the function, it should return:
20190612121912463
But the produced result was:
2019-06-12 12:19:12.463

I would suggest dropping the use of the Timestamp class. It is outdated, and anything it provides can be achieved in easier ways.
For your use case you could just use the SimpleDateFormat. It would look like this:
SimpleDateFormat("yyyyMMddHHmmssSSS").format(Date())

Get real bool value from YamlStream

I use YamlDotNet to parse a YAML stream into a dictionary of string to object via YamlStream.
YamlMappingNode, YamlSequenceNode and YamlScalarNode are used to convert each value to a dictionary, a list or a string.
But I need to get a real boolean value instead of the string equivalent, and for that I use
bool.TryParse(value.ToString(), out valueBool)
value being a YamlNode.
Is there a better way to do that?
Perhaps another child type of YamlNode?
EDIT:
I don't know the content of the YAML file; I just want to get a dictionary with its values.

Instead of doing the parsing manually, you should use the Deserializer class, which will convert a YAML document into an object graph.
var deserializer = new Deserializer();
var parsed = deserializer.Deserialize<...>(input);
You can see a working example here
