Caching parsed document - python-3.x

I have a set of YAML files. I would like to cache these files so that as much work as possible is re-used.
Each of these files contains two documents. The first document contains “static” information that will always be interpreted in the same way. The second document contains “dynamic” information that must be reinterpreted every time the file is used. Specifically, it uses a tag-based macro system, and the document must be constructed anew each time the file is used. However, the file itself will not change, so the results of parsing the entire file could be cached (at a considerable resource savings).
In ruamel.yaml, is there a simple way to parse an entire file into multiple parsed documents, then run construction on each document individually? This would allow me to cache the result of constructing the first “static” document and cache the parse of the second “dynamic” document for later construction.
Example file:
---
default_argument: name
...
%YAML 1.2
%TAG ! tag:yaml-macros:yamlmacros.lib.extend,yamlmacros.lib.arguments:
---
!merge
name: !argument name
The first document contains metadata that is used (along with other data from elsewhere) in the construction of the second document.

If you don't want to process all YAML documents in a stream completely, you'll have to split up the stream by hand, which is not entirely easy to do in a generic way.
What you need to know is what a YAML stream can consist of:
zero or more documents. Subsequent documents require some sort of separation marker line. If a document is not terminated by a document end marker line, then the following document must begin with a directives end marker line.
A document end marker line is a line that starts with ... followed by a space or newline, and a directives end marker line is a line that starts with --- followed by a space or newline.
The actual production rules are slightly more complicated and "starts with" should ignore the fact that you need to skip any mid-stream byte-order marks.
If you don't have any directives, byte order marks, or document end markers (and most multi-document YAML streams that I have seen do not), then you can just read the multi-document YAML as a string with data = Path("input.yaml").read_text(), split it using l = data.split('\n---'), and process only the appropriate element of the resulting list with YAML().load(l[N]).
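A minimal sketch of that shortcut, assuming a file input.yaml (a placeholder name) whose documents are separated only by --- lines:

from pathlib import Path
import ruamel.yaml

data = Path("input.yaml").read_text()
parts = data.split('\n---')    # crude split on the document separator
yaml = ruamel.yaml.YAML()
second = yaml.load(parts[1])   # construct only the document you need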
I am not sure the following properly handles all cases, but it does handle your multi-doc stream:
from pathlib import Path

import ruamel.yaml

docs = []
current = ""
state = "EOD"
for line in Path("example.yaml").open():
    if state in ["EOD", "DIR"]:
        if line.startswith("%"):
            state = "DIR"
        else:
            state = "BODY"
        current += line
        continue
    if line.startswith('...') and line[3].isspace():
        state = "EOD"
        docs.append(current)
        current = ""
        continue
    if state == "BODY" and current and line.startswith('---') and line[3].isspace():
        docs.append(current)
        current = ""
        continue
    current += line
if current:
    docs.append(current)

yaml = ruamel.yaml.YAML()
data = yaml.load(docs[1])
print(data['name'])
which gives:
name

Looks like you can indeed operate directly on the parser internals of ruamel.yaml; it just isn't documented. The following function will parse a YAML string into document nodes:
from ruamel.yaml import SafeLoader

def parse_documents(text):
    loader = SafeLoader(text)
    composer = loader.composer
    while composer.check_node():
        yield composer.get_node()
From there, the documents can be individually constructed. In order to solve my problem, something like the following should work:
def process_yaml(text):
    my_constructor = get_my_custom_constructor()
    parsed_documents = list(parse_documents(text))
    metadata = my_constructor.construct_document(parsed_documents[0])
    return (metadata, parsed_documents[1])
cache = {}

def do_the_thing(file_path):
    if file_path not in cache:
        cache[file_path] = process_yaml(Path(file_path).read_text())
    metadata, document = cache[file_path]
    my_constructor = get_my_custom_constructor(metadata)
    return my_constructor.construct_document(document)
This way, all of the file IO and parsing is cached, and only the last construction step need be performed each time.
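The same caching could also be hung on functools.lru_cache instead of a hand-rolled dict; a minimal sketch reusing the process_yaml and get_my_custom_constructor names from above:

import functools
from pathlib import Path

@functools.lru_cache(maxsize=None)
def load_cached(file_path):
    # file IO and the full parse happen only once per path
    return process_yaml(Path(file_path).read_text())

def do_the_thing(file_path):
    metadata, document = load_cached(file_path)
    my_constructor = get_my_custom_constructor(metadata)
    return my_constructor.construct_document(document)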

Related

Separating header from the rest of the dataset

I am reading in a csv file and then trying to separate the header from the rest of the file.
The hn variable is the read-in file without the first line.
hn_header is supposed to be the first row in the dataset.
If I define just one of these two variables, the code works. If I define both of them, then the one written later does not contain any data. How is that possible?
from csv import reader
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)[1:]  # this should contain all rows except the header
hn_header = list(read_file)[0]  # this should be the header
print(hn[:5])  # works
print(len(hn_header))  # empty list, does not contain the header
The CSV reader can only iterate through the file once, which it does the first time you convert it to a list. To avoid needing to iterate through multiple times, you can save the list to a variable.
hn_list = list(read_file)
hn = hn_list[1:]
hn_header = hn_list[0]
Or you can split up the file using extended iterable unpacking:
hn_header, *hn = list(read_file)
Just change the line below in your code; no additional steps needed: read_file = list(reader(opened_file)). I hope your code now runs perfectly.
The reader object is an iterator, and by definition iterator objects can only be used once. When they're done iterating you don't get any more out of them.
You can read more in the question Why can I only use a reader object once?, from which the block quote above is taken.
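A minimal demonstration of that one-shot behavior, using the same file:

from csv import reader

with open("hacker_news.csv") as f:
    rows = reader(f)
    print(len(list(rows)))  # consumes every row
    print(len(list(rows)))  # 0 -- the iterator is now exhausted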

Appending to YAML file

I can't figure out how to work with YAML files, I have a db.yaml file with this content
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
My program reads each genre name and its top-100 link from this file, then scrapes the web page for song names and adds them to a dictionary:
def load_yaml_file(self):
    with open(self.yaml_file, "r") as file_content:
        self.data = yaml.load(file_content)

def get_genres_and_links(self):
    for genre, link in self.data.get("beatport_links").items():
        self.beatport_links[genre] = link
Now I have a list with contents like this:
["Adam_Beyer_-_Rome_Future_(Original_Mix)", "Veerus_-_Wheel_(Original_Mix)"]
I would like my program to update db.yaml file with contents from this list (append to it), so in the end I would like db.yaml to look like this:
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
downloaded:
  Adam_Beyer_-_Rome_Future_(Original_Mix)
  Veerus_-Wheel(Original_Mix)
How can I do that?
You don't need your get_genres_and_links; you can directly update your self.data by doing:
self.data['downloaded'] = some_data
The problem is that in your expected output, the value for the key downloaded is a multiline plain scalar and not a list. Although some_data = ' '.join(["Adam_Beyer_-_Rome_Future_(Original_Mix)", "Veerus_-_Wheel_(Original_Mix)"]) will get you the string value, it is almost impossible to get PyYAML to output that plain scalar multiline and non-compact (reading it back is trivial). Instead I would look at dumping a literal block style scalar, joining the list using "\n".join(). The output would then look like:
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
downloaded: |-
  Adam_Beyer_-_Rome_Future_(Original_Mix)
  Veerus_-Wheel(Original_Mix)
(You can get rid of the dash after | by appending a newline after joining the list items.)
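A minimal sketch of that approach, using ruamel.yaml's LiteralScalarString wrapper (the file name db.yaml is taken from the question; with plain PyYAML you would need a custom representer instead):

from ruamel.yaml import YAML
from ruamel.yaml.scalarstring import LiteralScalarString

yaml = YAML()
with open('db.yaml') as fp:
    data = yaml.load(fp)
songs = ["Adam_Beyer_-_Rome_Future_(Original_Mix)", "Veerus_-_Wheel_(Original_Mix)"]
# the trailing newline turns the "|-" indicator into a plain "|"
data['downloaded'] = LiteralScalarString('\n'.join(songs) + '\n')
with open('db.yaml', 'w') as fp:
    yaml.dump(data, fp)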
If output looking like the following were acceptable to you:
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
downloaded:
- Adam_Beyer_-_Rome_Future_(Original_Mix)
- Veerus_-Wheel(Original_Mix)
then things are easier and doing:
self.data['downloaded'] = ["Adam_Beyer_-_Rome_Future_(Original_Mix)", "Veerus_-_Wheel_(Original_Mix)"]
with open('some_file', 'w') as fp:
    yaml.safe_dump(self.data, fp)
would be enough.
In any event, if you do this kind of load-modify-dump, then you should look seriously at ruamel.yaml (disclaimer: I am the author of that package). Not only does it implement the newer YAML 1.2, it also preserves comments, tags, special ids, and key order when doing this kind of round-tripping. It has built-in support for literal style block scalars, and apart from that its default .load() is safe.

convert data string to list

I'm having some troubles processing some input.
I am reading data from a log file and store the different values according to the name.
So my input string consists of ip, name, time and a data value.
A log line looks like this and it has \t spacing:
134.51.239.54 Steven 2015-01-01 06:09:01 5423
I'm reading in the values using this code:
loglines = file.splitlines()
data_fields = loglines[0] # IP NAME DATE DATA
for loglines in loglines[1:]:
items = loglines.split("\t")
ip = items[0]
name = items[1]
date = items[2]
data = items[3]
This works quite well, but I need to extract all names to a list and I haven't found a working solution. When I use print name I get:
Steven
Max
Paul
I do need a list of the names like this:
['Steven', 'Max', 'Paul',...]
There is probably a simple solution that I haven't figured out yet; can anybody help?
Thanks
Just create an empty list and add the names as you loop through the file.
Also note that if that file is very large, file.splitlines() is probably not the best idea, as it reads the entire file into memory -- and then you basically copy all of that by doing loglines[1:]. Better use the file object itself as an iterator. And don't use file as a variable name, as it shadows the type.
with open("some_file.log") as the_file:
    data_fields = next(the_file)  # consumes the first line
    all_the_names = []  # this will hold the names
    for line in the_file:  # loops over the rest
        items = line.split("\t")
        ip, name, date, data = items  # you can put all this in one line
        all_the_names.append(name)  # add the name to the list of names
Alternatively, you could use zip and map to put it all into one expression (using that loglines data), but you rather shouldn't do that... list(zip(*map(lambda s: s.split('\t'), loglines[1:])))[1] (the outer list() is needed on Python 3, where zip returns an iterator).
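A more readable one-liner for the same extraction, assuming the loglines list from the question:

# skip the header, split each line on tabs, and keep field 1 (the name)
all_the_names = [line.split("\t")[1] for line in loglines[1:]]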

(matlab) unique variable name from string

I have a simple script to import some spectroscopy data from files with some base filename (YYYYMMDD) and a header. My current method pushes the actual spectral intensities to some vector 'rawspectra' and I can call the data with rawspectra{m,n}.data(q,r).
In the script, I specify by hand the base filename and save it as a string 'filebase'.
I would like to prepend the filebase to the name of the rawspectra vector, so I could use the script to import files acquired on different dates into the same workspace without overwriting the rawspectra vector (and also allowing for easy understanding of which vectors are attached to which experimental conditions). I can easily do this by manually renaming a vector, but I'd rather make this automatic.
My importation script follows:
% for the importation of multiple sequential files, starting at startfile
% and ending at numfiles. All raw scans are subsequently plotted.
numfiles = input('How many spectra?');
startfile = input('What is the starting file number?');
numberspectra = numfiles - (startfile - 1);
filebase = strcat(num2str(input('what is the base file number?')), '_');
rawspectra = cell(startfile, numberspectra);
for k = startfile:numberspectra
    filename = strcat(filebase, sprintf('%.3d.txt', k));
    % eval(strcat(filebase,'rawspectra')){k} = importdata(filename); - this does not work
    rawspectra{k} = importdata(filename);
    figure;
    plot(rawspectra{1,k}.data(:,1), rawspectra{1,k}.data(:,2))
end
If any of you can help me out with what should be a seemingly simple task, I would be very appreciative. Basically, I want 'filebase' to go in front of 'rawspectra' and then increment that by k++ within the loop.
Thanks!
Why not just
rawspectra(k) = importdata(filename);
rawspectra(k).filebase = filebase;

python-docx insertion point

I am not sure if I've been missing anything obvious, but I have not found anything documented about how one would go about inserting Word elements (tables, for example) at some specific place in a document.
I am loading an existing MS Word .docx document by using:
my_document = Document('some/path/to/my/document.docx')
My use case would be to get the 'position' of a bookmark or section in the document and then proceed to insert tables below that point.
I'm thinking about an API that would allow me to do something along those lines:
insertion_point = my_document.bookmarks['bookmark_name'].position
my_document.add_table(rows=10, cols=3, position=insertion_point+1)
I saw that there are plans to implement something akin to the 'range' object of the MS Word API, this would effectively solve that problem. In the meantime, is there a way to instruct the document object methods where to insert the new elements?
Maybe I can glue some lxml code to find a node and pass that to these python-docx methods? Any help on this subject would be much appreciated! Thanks.
I remembered an old adage, "use the source, Luke!", and was able to figure it out. A post from the python-docx owner on its git project page also gave me a hint: https://github.com/python-openxml/python-docx/issues/7.
The full XML document model can be accessed by using its _document_part._element property. It behaves exactly like an lxml etree element. From there, everything is possible.
To solve my specific insertion point problem, I created a temp docx.Document object which I used to store my generated content.
import docx
from docx.oxml.shared import qn
tmp_doc = docx.Document()
# Generate content in tmp_doc document
tmp_doc.add_heading('New heading', 1)
# more content generation using docx API.
# ...
# Reference the tmp_doc XML content
tmp_doc_body = tmp_doc._document_part._element.body
# You could pretty print it by using:
#print(docx.oxml.xmlchemy.serialize_for_reading(tmp_doc_body))
I then loaded my docx template (containing a bookmark named 'insertion_point') into a second docx.Document object.
doc = docx.Document('/some/path/example.docx')
doc_body = doc._document_part._element.body
#print(docx.oxml.xmlchemy.serialize_for_reading(doc_body))
The next step is parsing the doc XML to find the index of the insertion point. I defined a small function for the task at hand, which returns a named bookmark parent paragraph element:
def get_bookmark_par_element(document, bookmark_name):
    """
    Return the named bookmark parent paragraph element. If no matching
    bookmark is found, the result is '1'. If an error is encountered, '2'
    is returned.
    """
    doc_element = document._document_part._element
    bookmarks_list = doc_element.findall('.//' + qn('w:bookmarkStart'))
    for bookmark in bookmarks_list:
        name = bookmark.get(qn('w:name'))
        if name == bookmark_name:
            par = bookmark.getparent()
            if not isinstance(par, docx.oxml.CT_P):
                return 2
            else:
                return par
    return 1
The newly defined function was used to get the parent paragraph of the bookmark 'insertion_point'. Error control is left to the reader.
bookmark_par = get_bookmark_par_element(doc, 'insertion_point')
We can now use bookmark_par's etree index to insert our tmp_doc generated content at the right place:
bookmark_par_parent = bookmark_par.getparent()
index = bookmark_par_parent.index(bookmark_par) + 1
# iterate over a copy: inserting a child elsewhere moves it out of tmp_doc_body
for child in list(tmp_doc_body):
    bookmark_par_parent.insert(index, child)
    index = index + 1
bookmark_par_parent.remove(bookmark_par)
The document is now finalized, the generated content having been inserted at the bookmark location of an existing Word document.
# Save result
# print(docx.oxml.xmlchemy.serialize_for_reading(doc_body))
doc.save('/some/path/generated_doc.docx')
I hope this can help someone, as the documentation regarding this is still yet to be written.
You put [image] as a token in your template document:
from docx.shared import Inches

for paragraph in document.paragraphs:
    if "[image]" in paragraph.text:
        paragraph.text = paragraph.text.strip().replace("[image]", "")
        run = paragraph.add_run()
        run.add_picture(image_path, width=Inches(3))
You can have a paragraph in a table cell as well; just find the cell and do as above.
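A sketch of that table-cell variant, walking every cell's paragraphs (image_path and the [image] token follow the convention above):

from docx.shared import Inches

for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if "[image]" in paragraph.text:
                    paragraph.text = paragraph.text.replace("[image]", "")
                    run = paragraph.add_run()
                    run.add_picture(image_path, width=Inches(3))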
The python-docx owner suggests how to insert a table into the middle of an existing document:
https://github.com/python-openxml/python-docx/issues/156
Here it is with some improvements:
import re
from docx import Document

def move_table_after(document, table, search_phrase):
    regexp = re.compile(search_phrase)
    for paragraph in document.paragraphs:
        if paragraph.text and regexp.search(paragraph.text):
            tbl, p = table._tbl, paragraph._p
            p.addnext(tbl)
            return paragraph

if __name__ == '__main__':
    document = Document('Existing_Document.docx')
    table = document.add_table(rows=..., cols=...)
    ...
    move_table_after(document, table, "your search phrase")
    document.save('Modified_Document.docx')
Have a look at python-docx-template, which allows jinja2-style template insertion points in a docx file rather than Word bookmarks:
https://pypi.org/project/docxtpl/
https://docxtpl.readthedocs.io/en/latest/
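A minimal docxtpl sketch (file names are placeholders; the template would contain jinja2 tags such as {{ title }}):

from docxtpl import DocxTemplate

doc = DocxTemplate("template.docx")       # template containing {{ title }}
doc.render({"title": "Generated report"})  # substitute the jinja2 tags
doc.save("generated.docx")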
Thanks a lot for taking the time to explain all of this. I was going through more or less the same issue. My specific problem was how to merge two or more docx documents at the end. It's not exactly a solution to your problem, but here is the function I came up with:
from docx import Document

def combinate_word(main_file, files, output):
    main_doc = Document(main_file)
    for file in files:
        sub_doc = Document(file)
        # iterate over a copy: appending an element moves it out of sub_doc
        for element in list(sub_doc._document_part.body._element):
            main_doc._document_part.body._element.append(element)
    main_doc.save(output)
Unfortunately, it's neither possible nor easy yet to copy images with python-docx. I fall back to win32com ...
