How do I consume files in Django 2.2 for yaml parsing? - python-3.x

I'm trying to upgrade my site from Django 1.11 to Django 2.2, and I'm having trouble with uploading and parsing yaml files.
The error message is:
ScannerError: mapping values are not allowed here
  in "<unicode string>", line 1, column 34:
    b'---\n recipeVersion: 9\n name: value\n'
                                     ^
I'm getting the file contents using a ModelForm with a widget defined as:
'source': AsTextFileInput()
... using ...
class AsTextFileInput(forms.widgets.FileInput):
    def value_from_datadict(self, data, files, name):
        return files.get(name).read()
... and then I get the source variable to parse with:
cleaned_data = super(RecipeForm, self).clean()
source = cleaned_data.get("source")
From that error message above, it looks like my newlines are being escaped, so yaml sees the text all on a single line. I tried logging the source of this file, and here's how it shows in my log file:
DEBUG b'---\n recipeVersion: 9\n name: value\n'
So, how can I get this file content without (what looks to me like) escaped newlines so I can parse it as yaml?
Edit: my code and yaml (simplified for this question) have not changed; the version upgrade is what broke the parsing.

Decoding the bytestring fixed it:
class AsTextFileInput(forms.widgets.FileInput):
    def value_from_datadict(self, data, files, name):
        return files.get(name).read().decode('utf-8')
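For reference, a minimal standalone sketch (my addition) of decode-then-parse with PyYAML on the payload from the question; safe_load is used since uploaded input is untrusted:

```
import yaml

raw = b'---\n recipeVersion: 9\n name: value\n'  # what the widget's read() returned
source = raw.decode('utf-8')                      # a str with real newlines
print(yaml.safe_load(source))                     # {'recipeVersion': 9, 'name': 'value'}
```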

Related

How to keep the format of OpenAI API response?

When I use GPT3's playground, I often get results that are formatted with numbered lists and paragraphs like below:
Here's what the above class is doing:
1. It creates a directory for the log file if it doesn't exist.
2. It checks that the log file is newline-terminated.
3. It writes a newline-terminated JSON object to the log file.
4. It reads the log file and returns a dictionary with the following
- list 1
- list 2
- list 3
- list 4
However, when I directly use their API and extract the response from json result, I get the crammed text version that is very hard to read, something like this:
Here's what the above class is doing:1. It creates a directory for the log file if it doesn't exist.2. It checks that the log file is newline-terminated.3. It writes a newline-terminated JSON object to the log file.4. It reads the log file and returns a dictionary with the following-list 1-list 2-list 3- list4
My question is, how do people keep the formats from GPT results so they are displayed in a neater, more readable way?
Option 1: Edits endpoint
If you run test.py, the OpenAI API will return the input reformatted as a numbered list:
test.py
import openai

openai.api_key = 'sk-xxxxxxxxxxxxxxxxxxxx'

response = openai.Edit.create(
    model='text-davinci-edit-001',
    input='I have three items:1. First item.2. Second item.3. Third item.',
    instruction='Make numbered list'
)

content = response['choices'][0]['text']
print(content)
Option 2: Processing
Process the completion you get from the Completions endpoint by yourself (i.e., write Python code).
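A minimal sketch of that processing (my addition; the sample string and regex are illustrative, not from the original answer): insert a line break before each numbered item that is glued to the preceding sentence:

```
import re

raw = ("Here's what the above class is doing:"
       "1. It creates a directory for the log file if it doesn't exist."
       "2. It checks that the log file is newline-terminated.")

# Break before "N. " whenever it directly follows a non-whitespace character.
formatted = re.sub(r"(?<=\S)(?=\d+\.\s)", "\n", raw)
print(formatted)
```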

Skip processing fenced code blocks when processing Markdown files line by line

I'm a very inexperienced Python coder, so it's quite possible that I'm approaching this particular problem in completely the wrong way, but I'd appreciate any suggestions/help.
I have a Python script that goes through a Markdown file line by line and rewrites [[wikilinks]] as standard Markdown [wikilink](wikilink) style links. I'm doing this using two regexes in one function as shown below:
import logging
import re

def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: List object containing modified text. Newlines will be returned as '\n' strings.
    """
    file = file_obj
    linelist = []
    logging.debug("Going to open file %s for processing now.", file)
    try:
        with open(file, encoding="utf8") as infile:
            for line in infile:
                linelist.append(re.sub(r"(\[\[)((?<=\[\[).*(?=\]\]))(\]\])(?!\()", r"[\2](\2.md)", line))
                # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar).
                # Capture group $2 returns just foo
        linelist_final = [re.sub(r"(\[\[)((?<=\[\[)\d+(?=\]\]))(\]\])(\()((?!=\().*(?=\)))(\))",
                                 r"[\2](\2 \5.md)", line) for line in linelist]
        # Finds only references in style [[foo]](bar). Capture group $2 returns foo and capture group $5
        # returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    logging.debug("Finished processing file %s", file)
    return linelist_final
This works fine for most Markdown files. However, I can occasionally get a Markdown file that has [[wikilinks]] within fenced code blocks such as the following:
# Reference
Here is a reference to “the Reactome Project” using smart quotes.
Here is an image: ![](./images/Screenshot.png)
[[201802150808]](Product discovery)
```
[[201802150808 Product Prioritization]]
def foo():
    print("bar")
```
In the above case I should skip processing the [[201802150808 Product Prioritization]] inside the fenced code block. I have a regex that identifies the fenced code block correctly namely:
(?<=```)(.*?)(?=```)
However, since the existing function is running line by line, I have not been able to figure out a way to skip the entire section in the for loop. How do I go about doing this?
You need to use a full Markdown parser to be able to cover all of the edge cases. Of course, most Markdown parsers convert Markdown directly to HTML. However, a few will use a two step process where step one converts the raw text to an Abstract Syntax Tree (AST) and step two renders the AST to the output format. It is not uncommon to find a Markdown renderer (outputs Markdown) which can replace the default HTML renderer.
You would simply need to modify either the parser step (using a plugin to add support for the wikilink syntax) or modify the AST directly. Then pass the AST to a Markdown renderer, which will give you a nicely formatted and normalized Markdown document. If you are looking for a Python solution, mistune or Pandoc filters might be a good place to start.
But why go through all that when a few well-crafted regular expressions can be run on the source text? Because Markdown parsing is complicated. I know, it seems easy at first. After all, Markdown is easy for a human to read (which was one of its defining design goals). However, parsing is actually very complicated, with parts of the parser reliant on previous steps.
For example, in addition to fenced code blocks, what about indented code blocks? But you can't just check for indentation at the beginning of a line, because a single line of a nested list could look identical to an indented code block. You want to skip the code block, but not the paragraph nested in a list. And what if your wikilink is broken across two lines? Generally when parsing inline markup, Markdown parsers will treat a single line break no different than a space. The point of all of this is that before you can start parsing inline elements, the entire document needs to first be parsed into its various block-level elements. Only then can you step through those and parse inline elements like links.
I'm sure there are other edge cases I haven't thought of. The only way to cover them all is to use a full-fledged Markdown parser.
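For example, a sketch of that idea (my addition, assuming the markdown-it-py parser; the answer itself names mistune and Pandoc filters): use the parser's block-level tokens to find which source lines belong to code blocks, and only rewrite the remaining lines:

```
import re

from markdown_it import MarkdownIt  # assumption: markdown-it-py is installed

def code_line_ranges(text):
    """Return 0-based [start, end) line ranges of fenced and indented code blocks."""
    tokens = MarkdownIt().parse(text)
    return [tuple(tok.map) for tok in tokens
            if tok.type in ("fence", "code_block") and tok.map]

def modify_links_outside_code(text):
    skip = set()
    for start, end in code_line_ranges(text):
        skip.update(range(start, end))
    out = []
    for i, line in enumerate(text.splitlines(keepends=True)):
        if i in skip:
            out.append(line)  # inside a code block: leave untouched
        else:
            out.append(re.sub(r"\[\[(.+?)\]\](?!\()", r"[\1](\1.md)", line))
    return "".join(out)
```

Note this only borrows the parser's block-level analysis; it does not renormalize the document the way a full Markdown renderer would, which is the broader point above.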
I was able to create a reasonably complete solution to this problem by making a few changes to my original function, namely:
- Replace the Python re built-in with the regex module available on PyPI.
- Change the function to read the entire file into a single variable instead of reading it line by line.
The revised function is as follows:
import logging
import regex

def modify_links(file_obj):
    """
    Function will parse file contents (opened in utf-8 mode) and modify standalone [[wikilinks]] and in-line
    [[wikilinks]](wikilinks) into traditional Markdown link syntax.

    :param file_obj: Path to file
    :return: String containing modified text. Newlines will be returned as '\\n' in the string.
    """
    file = file_obj
    try:
        with open(file, encoding="utf8") as infile:
            line = infile.read()
            # Read the entire file as a single string
        linelist = regex.sub(r"(?V1)"
                             r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                             # Ignore fenced & inline code blocks. V1 engine allows in-line flags so
                             # we enable newline matching only here.
                             r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                             # Ignore code blocks beginning with 4 spaces/1 tab
                             r"|(\[\[(.*)\]\](?!\s\(|\())", r"[\3](\3.md)", line)
        # Finds references that are in style [[foo]] only by excluding links in style [[foo]](bar) or
        # [[foo]] (bar). Capture group $3 returns just foo
        linelist_final = regex.sub(r"(?V1)"
                                   r"(?s)```.*?```(*SKIP)(*FAIL)(?-s)|(?s)`.*?`(*SKIP)(*FAIL)(?-s)"
                                   r"|(\ {4}|\t).*(*SKIP)(*FAIL)"
                                   # Refer comments above for this portion.
                                   r"|(\[\[(\d+)\]\](\s\(|\()(.*)(?=\))\))", r"[\3](\3 \5.md)", linelist)
        # Finds only references in style [[123]](bar) or [[123]] (bar). Capture group $3 returns 123 and
        # capture group $5 returns bar
    except EnvironmentError:
        logging.exception("Unable to open file %s for reading", file)
    return linelist_final
The above function handles [[wikilinks]] in inline code blocks, fenced code blocks and code blocks indented with 4 spaces. There is currently one false-positive scenario where it ignores a valid [[wikilink]]: when the link appears on the 3rd level or deeper of a Markdown list, i.e.:
* Level 1
  * Level 2
    * [[wikilink]] #Not recognized
      * [[wikilink]] #Not recognized.
However my documents do not have wikilinks nested at that level in lists so it's not a problem for me.
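For readers who want to keep a purely line-by-line loop, a minimal sketch (my addition; it only handles ``` fences, not inline code, indented blocks, or the other edge cases discussed in the first answer) is to toggle a flag on each fence line:

```
import re

def modify_links_line_based(path):
    """Rewrite [[wikilinks]] line by line, skipping fenced code blocks."""
    out = []
    in_fence = False
    with open(path, encoding="utf8") as infile:
        for line in infile:
            if line.lstrip().startswith("```"):
                in_fence = not in_fence  # entering or leaving a fenced block
                out.append(line)
            elif in_fence:
                out.append(line)  # code lines pass through untouched
            else:
                out.append(re.sub(r"\[\[(.+?)\]\](?!\()", r"[\1](\1.md)", line))
    return out
```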

How to recognize a third-party yaml dump object back without specifying redundant import statements

Imagine you have something similar to the following yaml:
model: !!python/object:Thirdpartyfoo.foo_module.foo_class
  some_attribute: value
In addition, assume you have already installed the Thirdpartyfoo package with pip.
Now you want to load the yaml back into a Python object, so you do:
import yaml

with open('foo.yaml') as f:
    dict = yaml.load(f, yaml.Loader)
But after you run it you get error like:
Except ImportError as exc:
raise ConstructorError("while constructing a Python object", mark,
"cannot find module %r (%s)" % (module_name, exc), mark)
if module_name not in sys.modules:
raise ConstructorError("while constructing a Python object", mark,
"module %r is not imported" % module_name, mark)
yaml.constructor.ConstructorError: while constructing a Python object
module 'Thirdpartyfoo.foo_module' is not imported
You end up with a very ugly solution for that:
import Thirdpartyfoo.foo_module.foo_class as dummy_import

with open('foo.yaml') as f:
    dict = yaml.load(f, yaml.Loader)
Note that if I don't explicitly reference dummy_import somewhere, flake8's lint check reports unused import at line ... :)
Any ideas?
Based on the YAML documentation, yaml.load() converts YAML documents to Python objects:
The function yaml.load() converts a YAML document to a Python object. yaml.load() accepts a byte string, a Unicode string, an open binary file object, or an open text file object.
The reason you're getting this error is that you're loading data from a YAML file and there is no corresponding class definition to store it in. As long as you do not import the library into the Python runtime environment, it's just a file like any other file on your computer and has no specific meaning to Python. So the only way to fix this issue is to import the proper class definition to store your data in, which you just did in the last part. Just importing Thirdpartyfoo would do the job as well, because it loads every definition into the Python runtime environment.
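As a small addition (not part of the original answer): to keep the import explicit without upsetting flake8, either suppress the warning with a noqa comment or import the module dynamically with importlib:

```
import importlib

import yaml

# Option 1: keep the static import but silence the unused-import warning.
# import Thirdpartyfoo.foo_module  # noqa: F401

# Option 2: import the module at runtime; it still ends up in sys.modules,
# which is exactly what PyYAML's constructor checks.
importlib.import_module("Thirdpartyfoo.foo_module")

with open("foo.yaml") as f:
    data = yaml.load(f, yaml.Loader)
```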

How to print full yaml or json object to file with node

I am trying to copy the definitions portion of a yaml file into a js doc for a codegen project. I tried using regular expressions (which worked great for copying methods out of swagger-generated js files), but apparently regexes do not handle info from yaml files very well. I managed to print MOST of what I want to the command line through console.log, but there are a few arrays that just say [Object], which is problematic; I would like to have their full contents printed. HOWEVER, this is not the main problem. When I try to write this output to a file instead of the console... it just says
"[object Object]
[object Object]"
for my 2 definitions. Has anyone done anything like this before? Here is a snippet of my code and what the console output looks like vs. the two lines above. TIA!
var doc = yaml.safeLoad(fs.readFileSync('path to my file\swagger.yaml', 'utf8'));
for (var d in doc['definitions']) {
    logit(doc['definitions'][d]); // logit writes to the console and a file
}
safeLoad suggests you are using the js-yaml library. It also provides the safeDump and dump methods.
yamlDef = yaml.safeDump(doc['definitions'][d]);
logit(yamlDef);
To convert a definition to JSON instead, stringify the original object rather than the dumped YAML text (stringifying the YAML string would just produce one quoted string):
var json = JSON.stringify(doc['definitions'][d], null, 2);

Which HTTP content type to use to return data and avoid my line feeds being collapsed into spaces

I am running a small web server in python (cherrypy). I wish to return data of the following format:
20100701,1.5127
20100702,1.5184
20100705,1.51075
So at the moment, to test the output, my Python looks like this:
return """20100701,1.5127
20100702,1.5184
20100705,1.51075
"""
When I request the URL from my other end (the one that is supposed to use the data and parse it line by line), I get the following output:
20100701,1.5127 20100702,1.5184 20100705,1.51075
Line feeds have been replaced by spaces... I guess this might be because my server marks the response as HTML, so the client collapses my line feeds...
Set the content type of the response to text/plain.
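A minimal CherryPy sketch of that fix (my addition; the handler and data are illustrative):

```
import cherrypy

class Rates:
    @cherrypy.expose
    def index(self):
        # text/plain tells the client not to render the body as HTML,
        # so the line feeds are preserved.
        cherrypy.response.headers['Content-Type'] = 'text/plain'
        return "20100701,1.5127\n20100702,1.5184\n20100705,1.51075\n"

if __name__ == '__main__':
    cherrypy.quickstart(Rates())
```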
