Identify format of a string (Such as Base64) - text

I work with a tool that stores everything as XML inside the database.
Some reports stored in the database are loaded by a third-party tool, and the main data that configures the 'report' definition is stored in a format that is not human-readable.
I'd post it here, but it's some 130,000 bytes.
I have attempted to decode it using popular methods that I assumed it might have been encoded with, such as base64, base32, etc., but none of them have been able to decode the string.
Is there a way to identify what encoding a given string has, using a tool available online?
I don't have the benefit of access to the developer that built this functionality, the source code generating this string, or any documentation on it.
To give some context around what I'm trying to do - I need to reverse-engineer how a specific definition in a system is generated, so that it can be modified slightly (manually) in a text editor to support an operation that would otherwise require manually re-creating the report.
I apologize if this is the wrong exchange site for this question - I realize it's not specific to a 'programming' issue and I haven't tried to solve it using a programming language. If so, please redirect me to the appropriate place and I'll be happy to ask there instead.
Update: The text consists of strictly A-Z, 0-9 characters.

You can check against known encoding formats with this tool, but only if you are sure the data is not encrypted.
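If an online checker doesn't turn anything up, you can also try the common codecs locally. Here is a minimal sketch in Python that simply reports which decoders accept the string (the call at the bottom is a placeholder for your actual data):

```python
import base64
import binascii

def try_common_decodings(data: str) -> None:
    """Report which of the common codecs accept the string without error."""
    candidates = {
        "base64": lambda s: base64.b64decode(s, validate=True),
        "base32": base64.b32decode,
        "base16 (hex)": lambda s: base64.b16decode(s, casefold=True),
        "ascii85": base64.a85decode,
    }
    for name, decode in candidates.items():
        try:
            decoded = decode(data)
            print(f"{name}: accepted, {len(decoded)} decoded bytes")
        except (binascii.Error, ValueError) as exc:
            print(f"{name}: rejected ({exc})")

# Replace the placeholder with the actual string from the report definition
try_common_decodings("PASTE-THE-STRING-HERE")
```

Note that "accepted" only means the alphabet and padding fit; whether the decoded bytes are meaningful (they may well be compressed or encrypted) is a separate question.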

Related

TipTap: Should I use JSON or HTML for backend storage

The TipTap editor and its progeny quasar-tiptap can export user-created content in the browser in both HTML and JSON formats. If I plan on allowing round trips of the user data to my server, what are the pros and cons of using either format for storage?
I would assume HTML has a greater likelihood of XSS attacks, and indeed such a vulnerability has been found (and rectified) in the past. And using JSON would be easier for backend parsing should it ever be required.
Beyond this, are there any major benefits of using either format? Preserving fidelity of user input is important. Size is important (any differences in image storage?). Editor performance is important. Scripting attack vulnerability is extremely important.
Which to choose?
I am sure that, 12 months later, you have made your choice. The usual caveats apply that this question is not really answerable and can only really elicit opinions. However, this is the best I can offer (and I am definitely no expert):
User entered text is a risk (cross-site scripting, SQL injection attack, potentially creating a false alert dialog with disinformation for the user reading another user's content) regardless of how the rich text data is sent to the server and/or stored. This is because eventually the content will be a) very likely stored in a database b) sent to and interpreted by a browser in HTML format. Whether it goes through an intermediary JSON format is largely irrelevant. Stripping out all unwanted tags, attributes and SQL commands is going to be an extremely important server responsibility (known as sanitising), whatever the format. Many more well tested sanitising libraries exist for HTML in many more languages/server side technologies than for the relatively niche ProseMirror JSON format used by TipTap.
The TipTap documentation talks about the advantages of the 2 formats here
The ProseMirror JSON format is reasonably verbose, and inspecting both the JSON and HTML side by side, the JSON format seems to require more characters. Obviously it will depend on the rich text itself - how many formatting marks etc. - but I would not worry about this in terms of server storage. Given you are likely using Vue or React with TipTap, you can probably set up an editor and have 2 separate divs displaying the output side by side. You can see the 2 formats for the same text by scrolling to the bottom of this page.
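For a rough sense of the difference, this is approximately what the two serialisations of a single bold word look like (written out by hand here as a Python literal purely for illustration, not captured from a running editor):

```python
# Roughly the ProseMirror/TipTap document model for one bold word
prosemirror_json = {
    "type": "doc",
    "content": [
        {
            "type": "paragraph",
            "content": [
                {"type": "text", "marks": [{"type": "bold"}], "text": "Hello"}
            ],
        }
    ],
}

# The equivalent HTML export
html_output = "<p><strong>Hello</strong></p>"
```

The JSON form clearly needs more characters for the same content, but as noted above the difference is unlikely to matter for storage.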
Editor performance - if you store the data as HTML you can pass it straight to the client. If you store it as JSON you will have to parse it and create the HTML. If you do this client side, the client will have to download and execute a library to perform this task - on NPM this is currently 14kB, so not a big issue.
Both formats will almost certainly send image data to the server as Base64 strings, so no bandwidth will be saved with either format for image files. Base64 is less efficient for DB storage compared to saving binary objects, particularly if compression is used, so you could strip these out at the server, but this will have a cost in time spent developing and testing the backend, which may well be better spent on other things.

Automatically convert uuids to url safe strings using node-pg

I use uuid for just about every ID in my REST backend powered by node and postgres. I also plan to use validate.js to make sure the queries are formatted correctly.
In order to shorten the URLs for my application, I would like to convert all UUIDs used by my backend into URL-safe strings when exposed to the REST consumer.
The problem is that, as far as I can tell, there is no such setting within node-pg. And node-pg usually returns the query results as JSON objects using either strings or numbers. That makes it hard to automatically convert them.
I could of course just go through every single REST endpoint and add code that automatically converts all the types where I know a UUID would be. But that would violate DRY and also be a hotbed for bugs.
I also could try to automatically detect strings that look like UUIDs and then just convert them, but that also seems like it may introduce lots of bugs.
One ideal solution would be some sort of custom code injection into node-pg that automatically converts uuids. Or maybe just some pg function I could use to automatically convert the uuids within the pg-queries themselves (although that would be a bit tedious).
Another ideal solution might be some way to use validate.js to convert the outputs and inputs during the validation. But I don't know how I could do this.
So basically, what would be a good way to automatically convert UUIDs in node-pg to URL-safe (shorter) strings without having to add a bit of code to every single endpoint?
I think this is what I want: https://github.com/brianc/node-pg-types
It lets me set custom converters for each datatype. The input will probably still have to be converted manually.
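For what it's worth, the conversion itself is just a re-encoding of the UUID's 16 bytes. A sketch of the mapping (shown in Python only to illustrate the idea; a node-pg type converter would do the equivalent in JavaScript on the way out, and the input would, as noted, still need manual conversion):

```python
import base64
import uuid

def uuid_to_urlsafe(canonical: str) -> str:
    """Pack a 36-character UUID into a 22-character URL-safe token."""
    raw = uuid.UUID(canonical).bytes
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

def urlsafe_to_uuid(token: str) -> str:
    """Reverse the mapping back to the canonical UUID string."""
    raw = base64.urlsafe_b64decode(token + "==")
    return str(uuid.UUID(bytes=raw))

short = uuid_to_urlsafe("123e4567-e89b-12d3-a456-426614174000")
assert urlsafe_to_uuid(short) == "123e4567-e89b-12d3-a456-426614174000"
```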

Workflow for interpreting linked data in .ttl files with Python RDFLib

I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?
RDFlib is all about working with RDF data. If you have RDF data, my suggestion is to do as much RDF-native stuff as you can and then only export to CSV if you want to do something like print tabular results or load into Pandas DataFrames. Of course there is always more than one way to do things, so you could manipulate data in CSV, but RDF, by design, has far more information in it than a CSV file can hold, so when you're manipulating RDF data, you have more things to get hold of.
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! You should store the .ttl files you can get, and then you may indeed retrieve all the other data referred to by URI. Presumably that data is also in RDF form, so you should download it into the same graph you loaded the initial .ttl files into. Then you have the full graph, with links and literal values in it, at your disposal to manipulate with SPARQL queries.
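A minimal sketch of that workflow with RDFLib (the profession predicate here is made up; substitute whatever property the library's data actually uses):

```python
from rdflib import Graph, URIRef

PROFESSION = URIRef("http://example.org/hasProfession")  # hypothetical predicate

g = Graph()
g.parse("biographies.ttl", format="turtle")  # the files you already have

# Dereference the URIs the local data only points to; most linked-data
# services answer with RDF, so the fetched triples join the same graph.
for obj in set(g.objects(None, PROFESSION)):
    if isinstance(obj, URIRef):
        try:
            g.parse(str(obj))
        except Exception:
            pass  # some URIs may not resolve to parseable RDF

# Query the combined graph for human-readable labels.
results = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?person ?label WHERE {
        ?person <http://example.org/hasProfession> ?profession .
        ?profession rdfs:label ?label .
    }
""")
for person, label in results:
    print(person, label)
```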

XSS prevention: client or server-side?

What is the best way to prevent stored XSS?
Should every text field (even plain text) be sanitized server-side to prevent XSS, using something like the OWASP Java HTML Sanitizer Project?
Or should the client protect itself from XSS bugs by applying XSS prevention rules?
The problem with the first solution is that the data may be modified (character encoding, partial or total deletion...), which can alter the behavior of the application, especially where display is concerned.
You apply sanitisation if and only if your data needs to conform to a specific format/standard and you are sure you can safely discard data; e.g. you strip all non-numeric characters from a telephone or credit card number. You always apply escaping for the appropriate context, e.g. HTML-encode user-supplied data when putting it into HTML.
Most of the time you don't want to sanitise, because you want to explicitly allow freeform data input and disallowing certain characters simply makes little sense. One of the few exceptions I see here would be if you're accepting HTML input from your users, you will want to sanitise that HTML to filter out unwanted tags and attributes and ensure the syntax is valid; however, you'd probably want to store the raw, unsanitised version in the database and apply this sanitisation only on output.
The gold standard in security is: Validate your inputs, and encode, not sanitize, your outputs.
First, validate the input server side. A good example of this would be a phone number field on a user profile. Phone numbers should only consist of digits, dashes, and perhaps a +. So why allow users to submit letters, special characters, etc? It only increases attack surface. So validate that field as strictly as you can, and reject bad inputs.
Second, encode the output according to its output context. I'd recommend doing this step server side as well, but it's relatively safe to do client side as long as you are using a good, well tested front-end framework. The main problem with sanitization is that different contexts have different requirements for safety. To prevent XSS when you are injecting user data into an HTML attribute, directly into the page, or into a script tag, you need to do different things. So, you create specific output encoders based on the output context. Entity encode for the HTML context, JSON.stringify for the script context, etc.
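To make the two steps concrete, here is a small sketch (Python purely for illustration; the same pattern applies in any server-side stack, and the phone-number rule is only an assumption about what you would accept):

```python
import html
import json
import re

# Step 1: validate input strictly - digits, dashes and an optional leading +
PHONE_RE = re.compile(r"^\+?[0-9-]{7,20}$")

def validate_phone(value: str) -> bool:
    """Reject anything that isn't plausibly a phone number."""
    return PHONE_RE.fullmatch(value) is not None

# Step 2: encode per output context, at the point of output
def render_comment(comment: str) -> str:
    html_context = html.escape(comment, quote=True)  # element/attribute content
    script_context = json.dumps(comment)             # embedding in a <script> block
    return (
        f"<p>{html_context}</p>\n"
        f"<script>var comment = {script_context};</script>"
    )

print(validate_phone("+1-555-0100"))                     # True
print(render_comment('<img src=x onerror="alert(1)">'))  # markup arrives inert
```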
The client has to defend against this, for two reasons:
First, because this is where the vulnerability manifests. You might be calling a 3rd-party API and they haven't escaped / encoded everything. It is better not to trust anything.
Second, the same API could be written for an HTML page and for an Android app. So why should the server HTML-encode what some may consider HTML tags in a request, when on the way back out it may be going to the Android app?

How to test mysqli's real_escape_string()?

I'm fairly new to mysqli (not to mysql!) and I'm updating a current mysql function which secures a string (or a (recursive) array of strings) for basic sanitation.
The php.net mysqli::real_escape_string() manual has a very clear warning about that charset. I've implemented this.
How do I test this? I can't find the information I'm looking for. I guess I'm looking for certain input strings that result in an insecure result, and a result considered safe.
I don't mean "add slashes to ' <- those single quotes". I'm looking for some more advanced tricks or vulnerabilities.
I'm also not looking for prepared statements. Those are wonderful and I'd love to use them, but because I'm updating an old system as fast as possible, prepared statements are not an option at this point in time. I'll be adding them in the future.
Here is the code you are looking for, I believe. Just change mysql to mysqli.
Also please note that
this function is not there to "secure" strings but to format them. That means every string that goes into a query has to be processed, no matter whether you consider it "dangerous" or not.
this function has to be used to format SQL string literals only. It is utterly useless for all other query parts.
this function should not be used in application code, but only to support emulated prepared statements.
Anyway, if your database encoding is conventional UTF-8, there is no point in bothering with the encoding at all. That "clear warning" actually applies only to some marginal and extremely rarely used encodings.
