Appending to YAML file - python-3.x

I can't figure out how to work with YAML files. I have a db.yaml file with this content:
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
My program reads each genre name and its top-100 link from this file, then scrapes the web page for song names and adds them to a dictionary:
def load_yaml_file(self):
    with open(self.yaml_file, "r") as file_content:
        self.data = yaml.load(file_content)

def get_genres_and_links(self):
    for genre, link in self.data.get("beatport_links").items():
        self.beatport_links[genre] = link
Now I have a list with contents like this:
["Adam_Beyer_-_Rome_Future_(Original_Mix)", "Veerus_-_Wheel_(Original_Mix)"]
I would like my program to update the db.yaml file with the contents of this list (append to it), so that in the end db.yaml looks like this:
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
downloaded:
  Adam_Beyer_-_Rome_Future_(Original_Mix)
  Veerus_-_Wheel_(Original_Mix)
How can I do that?

You don't need your get_genres_and_links; you can directly update your self.data by doing:
self.data['downloaded'] = some_data
The problem is that in your expected output, the value for the key downloaded is a multi-line plain scalar, not a list. Although some_data = ' '.join(["Adam_Beyer_-_Rome_Future_(Original_Mix)", "Veerus_-_Wheel_(Original_Mix)"]) will get you a single string value, it is almost impossible to get PyYAML to output a plain scalar multi-line and non-compact (reading that back, however, is trivial). Instead I would look at dumping a literal block style scalar, joining the list items with "\n".join(). The output would then look like:
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
downloaded: |-
  Adam_Beyer_-_Rome_Future_(Original_Mix)
  Veerus_-_Wheel_(Original_Mix)
(You can get rid of the dash after | by appending a newline after joining the list items.)
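For what it's worth, here is a minimal sketch of forcing literal block style in plain PyYAML, using a str subclass as a marker and a custom representer (the LiteralStr name is my own, not part of PyYAML):

import yaml

class LiteralStr(str):
    """Marker type: dump strings of this type in literal block style."""

def literal_str_representer(dumper, data):
    # style='|' requests a literal block scalar
    return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')

yaml.add_representer(LiteralStr, literal_str_representer)

songs = ["Adam_Beyer_-_Rome_Future_(Original_Mix)", "Veerus_-_Wheel_(Original_Mix)"]
data = {'downloaded': LiteralStr("\n".join(songs))}
print(yaml.dump(data, default_flow_style=False))  # emits downloaded: |- followed by the two lines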
If, instead, output looking like this would be acceptable:
beatport_links:
  afro-house: "https://www.beatport.com/genre/afro-house/89/top-100"
  big-room: "https://www.beatport.com/genre/big-room/79/top-100"
  breaks: "https://www.beatport.com/genre/breaks/9/top-100"
downloaded:
- Adam_Beyer_-_Rome_Future_(Original_Mix)
- Veerus_-_Wheel_(Original_Mix)
then things are easier and doing:
self.data['downloaded'] = ["Adam_Beyer_-_Rome_Future_(Original_Mix)", "Veerus_-_Wheel_(Original_Mix)"]
with open('some_file', 'w') as fp:
    yaml.safe_dump(self.data, fp)
would be enough.
In any event, if you do this kind of loading, modifying, and dumping, you
should look seriously at ruamel.yaml (disclaimer: I am the author of that package). Not only does it implement the newer YAML 1.2, it also preserves comments, tags, special ids, and key order when doing this kind of round-tripping. It also has built-in support for literal-style block scalars. And apart from that, its default .load() is safe.
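A sketch of that round-trip with ruamel.yaml, assuming the db.yaml file and the list from the question:

from ruamel.yaml import YAML

yaml = YAML()  # round-trip mode by default: preserves comments, key order, quoting
with open('db.yaml') as fp:
    data = yaml.load(fp)

data['downloaded'] = ["Adam_Beyer_-_Rome_Future_(Original_Mix)",
                      "Veerus_-_Wheel_(Original_Mix)"]

with open('db.yaml', 'w') as fp:
    yaml.dump(data, fp)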

Related

Automating The Boring Stuff With Python - Chapter 8 - Exercise - Regex Search

I'm trying to complete the exercise for Chapter 8, which takes a user-supplied regular expression and uses it to search each string in each text file in a folder.
I keep getting the error:
AttributeError: 'NoneType' object has no attribute 'group'
The code is here:
import os, glob, re

os.chdir("C:\Automating The Boring Stuff With Python\Chapter 8 - \
Reading and Writing Files\Practice Projects\RegexSearchTextFiles")

userRegex = re.compile(input('Enter your Regex expression :'))

for textFile in glob.glob("*.txt"):
    currentFile = open(textFile)  # open the text file and assign it to a file object
    textCurrentFile = currentFile.read()  # read the contents of the text file and assign to a variable
    print(textCurrentFile)
    #print(type(textCurrentFile))
    searchedText = userRegex.search(textCurrentFile)
    searchedText.group()
When I try this individually in the IDLE shell it works:
>>> textCurrentFile = "What is life like for those left behind when the last foreign troops flew out of Afghanistan? Four people from cities and provinces around the country told the BBC they had lost basic freedoms and were struggling to survive."
>>> userRegex = re.compile(input('Enter your Regex expression :'))
Enter your Regex expression :troops
>>> searchedText = userRegex.search(textCurrentFile)
>>> searchedText.group()
'troops'
But I can't seem to make it work in the code when I run it. I'm really confused.
Thanks
Since you are just looping across all .txt files, there could be files that don't have the word "troops" in them. To prove this, don't call .group(); just perform:
print(textFile, textCurrentFile, searchedText)
If you see that searchedText is None, that means the contents of that textFile (which is textCurrentFile) don't contain the word "troops".
You could either:
- Add the word troops to all the .txt files.
- Only select the target .txt files, not all of them.
- Check first if the match is found before accessing .group():
  print(searchedText.group() if searchedText else None)
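Putting that guard into the question's loop might look like this (a sketch; the file handling is tightened with a with-block, everything else is as in the question):

import glob, re

userRegex = re.compile(input('Enter your Regex expression :'))

for textFile in glob.glob("*.txt"):
    with open(textFile) as currentFile:
        textCurrentFile = currentFile.read()
    searchedText = userRegex.search(textCurrentFile)
    if searchedText is not None:  # skip files without a match instead of crashing
        print(textFile, searchedText.group())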

Reading a list of tuples from a text file in python

I am reading a text file that contains a list of tuples. I want to add another tuple to the list in my program and write the extended list back to the text file.
Example in the file
[('john', 'abc')]
Want to write back to the file as
[('john', 'abc'), ('jack', 'def')]
However, whenever I write back to the file, the appended tuple ends up in double quotes along with square brackets. I just want it to appear as above.
You can write a reusable function which takes 2 parameters, file_path (the file you want to write to) and tup (the tuple you want to append), and put your logic inside it. Later you can supply the proper data to this function and it will do the job for you.
Note: the explanation is in the code comments.
tuples.txt (Before writing)
[('john', 'abc')]
Code
def add_tuple_to_file(file_path, tup):
    with open(file_path, 'r+') as f:
        content = f.read().strip()  # read content from file and strip surrounding whitespace
        tuples = eval(content)  # convert string-formatted list of tuples back to a Python object (not possible using json.loads())
        tuples.append(tup)  # append new tuple `tup` to the old list
        f.seek(0)  # after reading, the file pointer is at the end of the file, so move it back to the beginning
        f.truncate()  # truncate the file (erase the old content)
        f.write(str(tuples))  # write back the updated list

# Try
add_tuple_to_file("./tuples.txt", ('jack', 'def'))
tuples.txt (After writing back)
[('john', 'abc'), ('jack', 'def')]
References:
- https://www.geeksforgeeks.org/python-ways-to-convert-string-to-json-object/
- How to open a file for both reading and writing?
You can use ast.literal_eval to get the list object from the string.
import ast
s = "[('john', 'abc')]"
o = ast.literal_eval(s)
print(repr(o)==s)
o.append(('jack', 'def'))
newstr = repr(o)
print(newstr)
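Combining this with the read/write pattern from the first answer, a safer version of the helper might look like this (a sketch using ast.literal_eval instead of eval; file name as above):

import ast

def add_tuple_to_file(file_path, tup):
    # read the current list, append the new tuple, write the new repr back
    with open(file_path) as f:
        tuples = ast.literal_eval(f.read().strip())  # parses literals only, unlike eval
    tuples.append(tup)
    with open(file_path, 'w') as f:
        f.write(repr(tuples))  # writes e.g. [('john', 'abc'), ('jack', 'def')]

add_tuple_to_file('tuples.txt', ('jack', 'def'))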

Caching parsed document

I have a set of YAML files. I would like to cache these files so that as much work as possible is re-used.
Each of these files contains two documents. The first document contains “static” information that will always be interpreted in the same way. The second document contains “dynamic” information that must be reinterpreted every time the file is used. Specifically, it uses a tag-based macro system, and the document must be constructed anew each time the file is used. However, the file itself will not change, so the results of parsing the entire file could be cached (at a considerable resource savings).
In ruamel.yaml, is there a simple way to parse an entire file into multiple parsed documents, then run construction on each document individually? This would allow me to cache the result of constructing the first “static” document and cache the parse of the second “dynamic” document for later construction.
Example file:
---
default_argument: name
...
%YAML 1.2
%TAG ! tag:yaml-macros:yamlmacros.lib.extend,yamlmacros.lib.arguments:
---
!merge
name: !argument name
The first document contains metadata that is used (along with other data from elsewhere) in the construction of the second document.
If you don't want to process all YAML documents in a stream completely, you'll have to split up the stream by hand, which is not entirely easy to do in a generic way.
What you need to know is what a YAML stream can consist of: zero or more documents. Subsequent documents require some sort of separation marker line, and if a document is not terminated by a document end marker line, then the following document must begin with a directives end marker line.
A document end marker line is a line that starts with ... followed by space/newline, and a directives end marker line is --- followed by space/newline.
The actual production rules are slightly more complicated and "starts with" should ignore the fact that you need to skip any mid-stream byte-order marks.
If you don't have any directives, byte-order marks, or document end markers (and most multi-document YAML streams that I have seen do not have those), then you can just read the multi-document YAML as a single string (e.g. data = Path("example.yaml").read_text()), split it using l = data.split('\n---'), and process only the appropriate element of the resulting list with YAML().load(l[N]).
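A sketch of that shortcut, under the stated assumptions (no directives, byte-order marks, or ... markers, so it does not apply to the example file above; the file name is hypothetical):

from pathlib import Path
from ruamel.yaml import YAML

data = Path("simple_multidoc.yaml").read_text()  # hypothetical file without directives or "..."
parts = data.split('\n---')  # crude split on directives end marker lines
second = YAML().load(parts[1])  # parse only the document you need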
I am not sure the following properly handles all cases, but it does handle your multi-doc stream:
import sys
from pathlib import Path
import ruamel.yaml

docs = []
current = ""
state = "EOD"
for line in Path("example.yaml").open():
    if state in ["EOD", "DIR"]:
        if line.startswith("%"):
            state = "DIR"
        else:
            state = "BODY"
        current += line
        continue
    if line.startswith('...') and line[3].isspace():
        state = "EOD"
        docs.append(current)
        current = ""
        continue
    if state == "BODY" and current and line.startswith('---') and line[3].isspace():
        docs.append(current)
        current = ""
        continue
    current += line
if current:
    docs.append(current)

yaml = ruamel.yaml.YAML()
data = yaml.load(docs[1])
print(data['name'])
which gives:
name
It looks like you can indeed directly operate the parser internals of ruamel.yaml; it just isn't documented. The following function will parse a YAML string into document nodes:
from ruamel.yaml import SafeLoader

def parse_documents(text):
    loader = SafeLoader(text)
    composer = loader.composer
    while composer.check_node():
        yield composer.get_node()
From there, the documents can be individually constructed. In order to solve my problem, something like the following should work:
def process_yaml(text):
    my_constructor = get_my_custom_constructor()
    parsed_documents = list(parse_documents(text))
    metadata = my_constructor.construct_document(parsed_documents[0])
    return (metadata, parsed_documents[1])

cache = {}

def do_the_thing(file_path):
    if file_path not in cache:
        cache[file_path] = process_yaml(Path(file_path).read_text())
    metadata, document = cache[file_path]
    my_constructor = get_my_custom_constructor(metadata)
    return my_constructor.construct_document(document)
This way, all of the file IO and parsing is cached, and only the last construction step need be performed each time.

Parse multiple xml files in Python

I am stuck with a problem here. I want to parse multiple XML files that share the same structure. I was already able to get all the locations of the files and save them into three different lists, since there are three different types of XML structures. Now I want to create three functions (one for each list) which loop through the lists and parse the information I need. Somehow I am not able to do it. Could anybody give me a hint on how to do it?
import os
import glob
import xml.etree.ElementTree as ET
import fnmatch
import re
import sys

#### Get the location of each XML file and save them into a list ####
all_xml_list = []

def locate(pattern, root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)

for files in locate('*.xml', r'C:\Users\Lars\Documents\XML-Files'):
    all_xml_list.append(files)

#### Create lists by GameDay Events ####
xml_GameDay_Player = [x for x in all_xml_list if 'Player' in x]
xml_GameDay_Team = [x for x in all_xml_list if 'Team' in x]
xml_GameDay_Match = [x for x in all_xml_list if 'Match' in x]
The XML file looks like this:
<sports-content xmlns:imp="url">
  <sports-metadata date-time="20160912T000000+0200" doc-id="sports_event_" publisher="somepublisher" language="en_EN" document-class="player-statistics">
    <sports-title>player-statistics-165483</sports-title>
  </sports-metadata>
  <sports-event>
    <event-metadata id="E_165483" event-key="165483" event-status="post-event" start-date-time="20160827T183000+0200" start-weekday="saturday" heat-number="1" site-attendance="52183" />
    <team>
      <team-metadata id="O_17" team-key="17">
        <name full="TeamName" nickname="NicknameoftheTeam" imp:dfl-3-letter-code="NOT" official-3-letter-code="" />
      </team-metadata>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" full="Max Mustermann" nickname="Mäxchen" imp:extensive="Name" />
        </player-metadata>
        <player-stats stats-coverage="standard" date-coverage-type="event" minutes-played="90" score="0">
          <rating rating-type="standard" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="5.6" imp:rating-value-mid-fielder="5.8" imp:rating-value-forward="5.0" />
          <rating rating-type="grade" rating-value="2.2" />
          <rating rating-type="index" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="3.7" imp:rating-value-mid-fielder="2.5" imp:rating-value-forward="1.2" />
          <rating rating-type="bemeister" rating-value="16.04086" />
          <player-stats-soccer imp:duels-won="1" imp:duels-won-ground="0" imp:duels-won-header="1" imp:duels-lost-ground="0" imp:duels-lost-header="0" imp:duels-lost="0" imp:duels-won-percentage="100" imp:passes-completed="28" imp:passes-failed="4" imp:passes-completions-percentage="87.5" imp:passes-failed-percentage="12.5" imp:passes="32" imp:passes-short-total="22" imp:balls-touched="50" imp:tracking-distance="5579.80" imp:tracking-average-speed="3.41" imp:tracking-max-speed="23.49" imp:tracking-sprints="0" imp:tracking-sprints-distance="0.00" imp:tracking-fast-runs="3" imp:tracking-fast-runs-distance="37.08" imp:tracking-offensive-runs="0" imp:tracking-offensive-runs-distance="0.00" dfl-distance="5579.80" dfl-average-speed="3.41" dfl-max-speed="23.49">
            <stats-soccer-defensive saves="5" imp:catches-punches-crosses="3" imp:catches-punches-corners="0" goals-against-total="1" imp:penalty-saves="0" imp:clear-cut-chance="0" />
            <stats-soccer-offensive shots-total="0" shots-on-goal-total="0" imp:shots-off-post="0" offsides="0" corner-kicks="0" imp:crosses="0" assists-total="0" imp:shot-assists="0" imp:freekicks="3" imp:miss-chance="0" imp:throw-in="0" imp:punt="2" shots-penalty-shot-scored="0" shots-penalty-shot-missed="0" dfl-assists-total="0" imp:shots-total-outside-box="0" imp:shots-total-inside-box="0" imp:shots-foot-inside-box="0" imp:shots-foot-outside-box="0" imp:shots-total-header="0" />
            <stats-soccer-foul fouls-commited="0" fouls-suffered="0" imp:yellow-red-cards="0" imp:red-cards="0" imp:yellow-cards="0" penalty-caused="0" />
          </player-stats-soccer>
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>
I want to extract everything that is within the player-metadata, player-stats, and player-stats-soccer tags.
Improving on @Gnudiff's answer, here is a more resilient approach:
import os
from glob import glob
from lxml import etree

xml_GameDay = {
    'Player': [],
    'Team': [],
    'Match': [],
}

# sort all files into the right buckets
for filename in glob(r'C:\Users\Lars\Documents\XML-Files\*.xml'):
    for key in xml_GameDay.keys():
        if key in os.path.basename(filename):
            xml_GameDay[key].append(filename)
            break

def select_first(context, path):
    result = context.xpath(path)
    if len(result):
        return result[0]
    return None

# extract data from Player files
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)

    for player in tree.xpath('.//player'):
        player_data = {
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'firstname': select_first(player, './player-metadata/name/@first'),
            'nickname': select_first(player, './player-metadata/name/@nickname'),
        }
        print(player_data)
        # ...
XML files can come in a variety of byte encodings and are prefixed by the XML declaration, which declares the encoding of the rest of the file.
<?xml version="1.0" encoding="UTF-8"?>
UTF-8 is a common encoding for XML files (it also is the default), but in reality it can be anything. It's impossible to predict and it's very bad practice to hard-code your program to expect a certain encoding.
XML parsers are designed to deal with this peculiarity in a transparent way, so you don't really have to worry about it, unless you do it wrong.
This is a good example of doing it wrong:
# BAD CODE, DO NOT USE
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

tree = etree.XML(file_get_contents('some_filename.xml'))
What happens here is this:
- Python opens filename as a text file f
- f.read() returns a string
- etree.XML() parses that string and creates a DOM object tree
Doesn't sound so wrong, does it? But if the XML is like this:
<?xml version="1.0" encoding="UTF-8"?>
<Player nickname="Mäxchen">...</Player>
then the DOM you end up with may well be:
Player
    @nickname="MÃ¤xchen"
You have just destroyed the data. And unless the XML contained an "extended" character like ä, you would not even have noticed that this approach is borked. This can easily slip into production unnoticed.
There is exactly one correct way of opening an XML file (and it's also simpler than the code above): Give the file name to the parser.
tree = etree.parse('some_filename.xml')
This way the parser can figure out the file encoding before it reads the data and you don't have to care about those details.
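If you do have to parse from memory, a sketch of the safe variant is to hand the parser raw bytes rather than a decoded string, so it can still honor the encoding declaration (file name as above):

from lxml import etree

# read bytes, not text: the parser decodes them itself using the XML declaration
with open('some_filename.xml', 'rb') as f:
    tree = etree.XML(f.read())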
This won't be a complete solution for your particular case, because it is a bit of a task and I'm working from a tablet without a keyboard.
In general, you can do it several ways, depending on whether you really need all the data or just a specific subset, and whether you know all the possible structures in advance.
For example, one way:
from lxml import etree

Playerdata = []
for F in xml_GameDay_Player:
    tree = etree.XML(file_get_contents(F))
    for player in tree.xpath('.//player'):
        row = {}
        row['player'] = player.xpath('./player-metadata/name/@last')
        for plrdata in player.xpath('.//player-stats'):
            pass  # do stuff with player data
        Playerdata.append(row)
This is adapted from my existing script; however, it is tailored to extracting only a specific subset of the XML. If you need all the data, it would probably be better to use some XML tree walker.
file_get_contents is a small helper function :
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()
XPath is a powerful language for finding nodes within XML.
Note that depending on the XPath you use, the result may be either an XML node, as in the "for player in ..." statement, or a string, as in the "row['player'] = ..." statement.
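To illustrate that distinction, here is a small sketch against the sample document above (the file name is hypothetical):

from lxml import etree

tree = etree.parse('Player_example.xml')    # hypothetical file with the structure shown above
players = tree.xpath('.//player')           # list of Element nodes; each supports further .xpath() calls
last_names = tree.xpath('.//name/@last')    # list of plain strings (attribute values)
print(players, last_names)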
You can use the xml.etree.ElementTree library, which ships with Python's standard library, so no install is needed. Then follow the code structure below:
import xml.etree.ElementTree as ET
import os

my_dir = "your_directory"

for fn in os.listdir(my_dir):
    tree = ET.parse(os.path.join(my_dir, fn))
    root = tree.getroot()
    btf = root.find('tag_name')
    btf.text = new_value  # modify the value of the tag to new_value, whatever you want to put
    tree.write(os.path.join(my_dir, fn))
If you still need a detailed explanation, go through this tutorial:
https://www.datacamp.com/community/tutorials/python-xml-elementtree

How to extract a PDF's text using pdfrw

Can pdfrw extract the text out of a document?
I was thinking something along the lines of
from pdfrw import PdfReader

doc = PdfReader(pdf_path)
page_texts = []
for page_nr in doc.numPages:
    page_texts.append(doc.getPage(page_nr).parse_page())  # ..or something
In the docs they explain how to extract the text. However, it's just a bytestream; you could iterate over the pages and decode them individually.
from pdfrw import PdfReader

doc = PdfReader(pdf_path)
for page in doc.pages:
    bytestream = page.Contents.stream  # a string with bytes, not a bytestring
    string = ...  # somehow decode bytestream, maybe using zlib.decompress
    # do something with that text
Edit:
It may be worth noting that pdfrw does not yet support text decompression due to its complexity, according to the author.
It depends on which filters are applied to the page.Contents.stream. If it is only FlateDecode, you can use pdfrw.uncompress.uncompress([page.Contents]) to decode it.
Note: give the whole Contents object in a list to the function.
Note: this is not the same as pdfrw.PdfReader.uncompress().
Then you have to parse the string to find your text. It will be in blocks of lines between BT (begin text) and ET (end text) markers; the text itself sits inside round brackets on lines ending in either Tj or TJ.
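A rough sketch of that approach (assuming each page has a single FlateDecode-only Contents stream and the text operators use plain parenthesized strings; real PDFs can be far messier, and the file name is hypothetical):

import re
from pdfrw import PdfReader
from pdfrw.uncompress import uncompress

doc = PdfReader('some.pdf')
for page in doc.pages:
    uncompress([page.Contents])  # decodes in place when the filter is FlateDecode
    stream = page.Contents.stream
    # grab the (...) arguments of Tj/TJ text-showing operators
    print(re.findall(r'\((.*?)\)\s*T[jJ]', stream))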
Here's an example that may be useful (note that it uses a PyPDF2-style reader and writer rather than pdfrw):
for pg_num in range(number_of_pages):
    pg_obj = pdfreader.getPage(pg_num)
    print(pg_num)
    if re.search(r'CSE', pg_obj.extractText()):
        cse_count += 1
        pdfwriter.addPage(pg_obj)
Here extractText() would extract the text of the page containing the keyword CSE
