Strings containing " - " always break onto newline with ruamel.yaml - python-3.x

I'm fairly new to YAML, within a Python 3.7 project, and decided to use ruamel.yaml to get me started. I intend to use it to store metadata associated with some video files.
I am creating YAML files with the following code:
data[filename] = [{'video': video_path},
                  {'key_frame': frame_path},
                  {'processed': get_timestamp()}]
yaml.dump(data, file_handle)
The created YAML file looks like this:
video.mp4:
- video: /Users/xyz/video.mp4
- key_frame: /Users/xyz/imgOutput/frame
    - Trigger.jpg
- processed: '2018-07-26 17:09:06'
The issue is that the key_frame is a file called "frame - Trigger.jpg". However, the line always breaks at the " - " (i.e. space-dash-space) in the filename. The result looks very wrong for a file that is meant to be human-readable. In fact, it's processed correctly when it's read back in (using yaml.load), and treated as a single string filename as it should be. It's just the formatting in the YAML file that's wrong.
Any thoughts on the cause? Is this expected behaviour? I've tried many different ways of quoting the string in case that's it, which makes no difference: even quoted, it will split over the line. Fundamentally it does work from a code sense, but as YAML's big selling point is human-readable files, it'd be nice to understand what's causing it and how to fix it.

In YAML plain scalars (i.e. the ones without single or double quotes) can be wrapped to an indented newline on whitespace. That is what's happening.
To reproduce this is difficult as your question is quite incomplete, but some things can be easily seen from the output:
data is a dict
filename, video_path, and frame_path are defined as strings.
file_handle is probably some file stream opened for writing.
Others are less easily deduced:
get_timestamp() doesn't return a datetime.datetime() instance as one would expect from its name, but a string representation thereof. To prevent this string from being interpreted as a timestamp, it has to be quoted.
you are using the default YAML() instance (which equals typ='rt'), as the non-default ones would write the leaf mappings in flow style ( - {video: /Users/xyz/video.mp4}, etc.)
With that and the appropriate imports you can make a functioning program:
import datetime
import sys
import ruamel.yaml
yaml = ruamel.yaml.YAML(typ='rt')
def get_timestamp():
    return datetime.datetime(2018, 7, 26, 17, 9, 6).isoformat(sep=' ', timespec='seconds')
data = {}
filename = 'video.mp4'
video_path = '/Users/xyz/video.mp4'
frame_path = '/Users/xyz/imgOutput/frame - Trigger.jpg'
file_handle = sys.stdout
data[filename] = [{'video': video_path},
                  {'key_frame': frame_path},
                  {'processed': get_timestamp()}]
yaml.dump(data, file_handle)
and this outputs:
video.mp4:
- video: /Users/xyz/video.mp4
- key_frame: /Users/xyz/imgOutput/frame - Trigger.jpg
- processed: '2018-07-26 17:09:06'
So we forgot something and that is:
yaml.width = 24 # range from 24-38 inclusive
with that you get your output:
video.mp4:
- video: /Users/xyz/video.mp4
- key_frame: /Users/xyz/imgOutput/frame
    - Trigger.jpg
- processed: '2018-07-26 17:09:06'
so just remove the yaml.width = line and you should be all set.
Next time please provide a minimal, but complete, functioning program that actually produces the output.
My guess is that your frame_path is much longer than you show here, and that you don't have a user xyz. That causes you to get over the default width (defined in the emitter to be 80) and the plain scalar to wrap. Just set yaml.width = 4096 or whatever is necessary for your scalar length and nesting depth.
When in doubt whether the YAML output is correct, read it back in (using a YAML(typ='safe').load(input_stream)); it should produce the original data.

Can you try str(frame_path)?
data[filename] = [{'video': video_path},
                  {'key_frame': str(frame_path)},
                  {'processed': get_timestamp()}]

There is nothing special about the dash. If the string is longer than a certain threshold it will break at the first space after that. The examples you gave do not reproduce this behaviour for me, but longer strings do.
The generated YAML is valid. Any string, if quoted or not, can be broken up to several lines.
Maybe you can adjust the threshold in ruamel. I can't find anything in the documentation, though.
(See also my article Strings in YAML)

Related

Removing the first date and timestamp in each line of a log file using Python

I have a series of log files in text file format.
The document format is this:
[2021-12-11T10:21:30.370Z] Branch indexing
[2021-12-11T10:21:30.374Z] Starting the program with default pipeID
[2021-12-11T10:21:30.374Z] Running with durable level: max_survivbility will make this program crash if left running for 20 minutes
[2021-12-11T10:21:30.374Z] Starting the program with default pipeID
Each line in the document starts with:[2021-12-11T10:21:30.370Z]
I want to remove the first set of characters that represent date and timestamp and have a result something like this:
Branch indexing
Starting the program with default pipeID
Running with durable level: max_survivbility will make this program crash if left running for 20 minutes
Starting the program with default pipeID
Can anyone please help me explain how I can do this?
I tried to use this method but it doesn't work since I have '[]' in the date stamp.
import re
text = "[2021-12-11T10:21:30.370Z] Branch indexing"
re.sub("[.*?]", "", text)
This doesn't work for me.
If I try the same method on a text like text = "<2021-12-11T10:21:30.370Z> Branch indexing".
import re
text = "<2021-12-11T10:21:30.370Z> Branch indexing"
re.sub("<.*?>", "", text)
It removes <2021-12-11T10:21:30.370Z>. Why does this not work with [2021-12-11T10:21:30.370Z]?
I need help removing every instance of this format "[2021-12-11T10:21:30.370Z]" in all the log files.
Thank you so much.
I'd rather go with a simple solution for this case, pal. Split the string where the ] ends, then trim the second element of the resulting list, to remove all those extra spaces and then print it, bud. Hope this helps, cheers!
import re
text = "[2021-12-11T10:21:30.370Z] Branch indexing"
print(re.split("]", text)[1].strip())
Your current regex pattern is off because square brackets are regex metacharacters which need to be escaped. Also, you should be running the regex in multiline mode. And the timestamp pattern should be more generic.
text = re.sub(r'^\[.*?\]\s+', '', text, flags=re.M)
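A small sketch of the escaped-bracket approach applied to several lines at once. The unescaped pattern "[.*?]" is a character class that matches a single '.', '*' or '?' character, which is why it leaves the timestamp alone. The names TIMESTAMP_PREFIX and log_text are made up for this example, and it uses [ \t]* rather than \s+ after the bracket so that a newline is never consumed; that is a choice of this sketch, not part of the answer above.

```python
import re

# Matches a leading "[...]" prefix plus the spaces or tabs that follow it.
# The brackets are escaped because unescaped [ and ] delimit a character class.
TIMESTAMP_PREFIX = re.compile(r'^\[.*?\][ \t]*', flags=re.M)

log_text = (
    "[2021-12-11T10:21:30.370Z] Branch indexing\n"
    "[2021-12-11T10:21:30.374Z] Starting the program with default pipeID\n"
)

# re.M makes ^ match at the start of every line, not only at the start of the text.
cleaned = TIMESTAMP_PREFIX.sub('', log_text)
print(cleaned)
```

For a whole log file, the same substitution can be applied to the text returned by read() before writing the result back out.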

Reading values from a file and outputting each number, largest/smallest numbers, sum, and average of numbers from the file

The issue that I am having is that I am able to read the information from the files, but when I try to convert them from a string to an integer, I get an error. I also have issues where the min/max prints as the entire file's contents.
I have tried using if/then statements as well as using different variables for each line in the file.
file=input("Which file do you want to get the data from?")
f=open('data3.txt','r')
sent='-999'
line=f.readline().rstrip('\n')
while len(line)>0:
    lines=f.read().strip('\n')
    value=int(lines)
    if value>value:
        max=value
        print(max)
    else:
        min=value
        print(min)
total=sum(lines)
print(total)
I expect the code to find the min/max of the numbers in the file as well as the sum and average of the numbers in the file. The results from the file being processed in the code, then have to be written to a different file. My results have consisted in various errors reading that Python is unable to convert from a str to an int as well as printing the entire file's contents instead of the expected results.
Does the following work?
with open('fileToRead.txt') as f:  # the file is closed when the block ends
    lines = list(f)
intLines = [int(i) for i in lines]  # int() ignores the trailing newline
maxValue = max(intLines)
minvalue = min(intLines)
sumValue = sum(intLines)
print("MaxValue : {0}".format(maxValue))
print("MinValue : {0}".format(minvalue))
print("Sum : {0}".format(sumValue))
print("Average : {0}".format(sumValue/len(intLines)))
and this is how my fileToRead.txt is formatted (just a simple one, in fact)
10
20
30
40
5
1
I am reading the file contents into a list. Then I create a new list of ints from it (this can be joined with the previous step as part of some code refactoring). Once I have the list of ints, it's easier to calculate the max and min on it.
Note that some of the variables are not named properly. Also, reading the whole file in one go (like what I have done here) might be a bad idea if the file is too large. In that case, read it line by line, parse the ints, and add them to a list of ints. Once you are done reading the file, close it. You can then start your calculations based on the list of ints that you have now obtained.
Please let me know if this resolves your query.
Thanks
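The line-by-line variant mentioned above could look roughly like this. It is only a sketch: summarize is a hypothetical helper name, an in-memory io.StringIO stands in for the real file, and blank lines are skipped on the assumption that they should not count as values.

```python
import io


def summarize(lines):
    """Return (values, minimum, maximum, total, average) for an iterable of text lines."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing in int()
        values.append(int(line))
    total = sum(values)
    return values, min(values), max(values), total, total / len(values)


# Stand-in for an open file; an object returned by open() can be passed the same way.
sample = io.StringIO("10\n20\n30\n40\n5\n1\n")
values, smallest, largest, total, average = summarize(sample)
print(values, smallest, largest, total, average)
```

With a real file, wrapping the call in a with open(...) block closes the file as soon as the numbers have been read. An input without any numbers would make min() raise, so that case needs its own check.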

Attempting to append all content into file, last iteration is the only one filling text document

I'm trying to create a file and append all the content being calculated into that file, but when I run the script only the very last iteration is written to the file and nothing else.
My code is on pastebin, it's too long, and I feel like you would have to see exactly how the iteration is happening.
To summarize: go through an array of model numbers; if the model number matches, call the function that calculates that MAC_ADDRESS; when done calculating, store all the content inside the file.
I have tried two possible routes and both have failed, giving the same result. There is no error in the code (it runs), but it just doesn't store the content into the file properly: there should be 97 different APs and it's storing only 1.
The difference between the first and second attempt,
1 attempt) I open/create file in the beginning of the script and close at the very end.
2 attempt) I open/create file and close per-iteration.
First Attempt:
https://pastebin.com/jCpLGMCK
#Beginning of code
File = open("All_Possibilities.txt", "a+")
#End of code
File.close()
Second Attempt:
https://pastebin.com/cVrXQaAT
#Per function
File = open("All_Possibilities.txt", "a+")
#per function
File.close()
If I'm not supposed to reference other websites, please let me know and I'll just paste the code in this post.
Rather than close(), please use with:
with open('All_Possibilities.txt', 'a') as file_out:
    file_out.write('some text\n')
The documentation explains that you don't need + to append writes to a file.
You may want to add some debugging console print() statements, or use a debugger like pdb, to verify that the write() statement actually ran, and that the variable you were writing actually contained the text you thought it did.
You have several loops that could be a one-liner using readlines().
Please do this:
$ pip install flake8
$ flake8 *.py
That is, please run the flake8 lint utility against your source code,
and follow the advice that it offers you.
In particular, it would be much better to name your identifier file than to name it File.
The initial capital letter means something to humans reading your code -- it is
used when naming classes, rather than local variables. Good luck!
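To illustrate the with-based approach, here is a small sketch that opens the file once in append mode and writes on every iteration. The results list is made-up stand-in data, not the output of the pastebin scripts, and the file is created in a temporary directory so the example does not touch an existing All_Possibilities.txt.

```python
import os
import tempfile

# Hypothetical stand-ins for the values calculated in the question.
results = ['AP-01 00:11:22:33:44:55', 'AP-02 00:11:22:33:44:56', 'AP-03 00:11:22:33:44:57']

path = os.path.join(tempfile.mkdtemp(), 'All_Possibilities.txt')

# Open once in append mode; every write() inside the loop adds a new line.
with open(path, 'a') as file_out:
    for result in results:
        file_out.write(result + '\n')

# Read the file back to confirm that every iteration was stored.
with open(path) as file_in:
    written = file_in.read().splitlines()
print(len(written))
```

If the real script still stores only one entry with this structure, the write() call is probably reached only once, which the print() debugging suggested above should reveal.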

Python3 - How to write a number to a file using a variable and sum it with the current number in the file

Suppose I have a file named test.txt and it currently has the number 6 inside of it. I want to use a variable such as x=4 then write to the file and add the two numbers together and save the result in the file.
var1 = 4.0
f=open(test.txt)
balancedata = f.read()
newbalance = float(balancedata) + float(var1)
f.write(newbalance)
print(newbalance)
f.close()
It's probably simpler than you're trying to make it:
variable = 4.0
with open('test.txt') as input_handle:
    balance = float(input_handle.read()) + variable
with open('test.txt', 'w') as output_handle:
    print(balance, file=output_handle)
Make sure 'test.txt' exists before you run this code and has a number in it, e.g. 0.0 -- you can also modify the code to deal with creating the file in the first place if it's not already there.
Files only read and write strings (or bytes for files opened in binary mode). You need to convert your float to a string before you can write it to your file.
Probably str(newbalance) is what you want, though you could customize how it appears using format if you want. For instance, you could round the number to two decimal places using format(newbalance, '.2f').
Also note that you can't write to a file opened only for reading, so you probably need to either use mode 'r+' (which allows both reading and writing) combined with an f.seek(0) call (and maybe f.truncate() if the length of the new numeric string might be shorter than the old length), or close the file and reopen it in 'w' mode (which will truncate the file for you).
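A sketch of the 'r+' route with seek and truncate, using the names from the question where possible. The temporary directory and the starting value of 6 are assumptions made only so that the example runs on its own.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Create the file with a starting balance so that the example is self-contained.
with open(path, 'w') as f:
    f.write('6')

var1 = 4.0
with open(path, 'r+') as f:
    new_balance = float(f.read()) + var1
    f.seek(0)                  # go back to the start before writing
    f.write(str(new_balance))  # write() needs a string, not a float
    f.truncate()               # drop leftover characters from a longer old value

with open(path) as f:
    stored = f.read()
print(stored)
```

Replacing str(new_balance) with format(new_balance, '.2f') would store the value rounded to two decimal places instead.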

Allocating matrix / structure data instead of string name to variable

I have a script that opens a folder and does some processing on the data present. Say, there's a file "XYZ.tif".
Inside this tif file, there are two groups of datasets, which show up in the workspace as
data.ch1eXYZ
and
data.ch3eXYZ
If I want to continue with the 2nd set, I can use
A=data.ch3eXYZ
However, XYZ usually is much longer and varies per file, whereas data.ch3e is consistent.
Therefore I tried
A=strcat('data.ch3e','origfilename');
where origfilename of course is XYZ, which has (automatically) been extracted before.
However, that gives me a string A (since I practically typed
A='data.ch3eXYZ'
instead of the matrix that data.ch3eXYZ actually is).
I think it's just a problem with ()'s, []'s, or {}'s but I can't seem to figure it out.
Thanks in advance!
If you know the string, dynamic field references should help you here and are far better than eval
Slightly modified example from the linked blog post:
fldnm = 'fred';
s.fred = 18;
y = s.(fldnm)
Returns:
y =
18
So for your case:
test = data.(['ch3e' origfilename]);
Should be sufficient
Edit: Link to the documentation
