ConfigObj - indentation missing for multiline string value in dict while having set the indentation option - python-3.x

How do I set the indentation for a string value spanning multiple lines in ConfigObj, so that the second and following lines are indented past the delimiter?
For example:
about = {'Info' : {'Purpose': 'blabla continues for fixed chars ...\
\n and another line of bla ... etc.'}} # here nicely aligned under "b" from bla.
# 11 white-spaces.
config = ConfigObj(indent_type= 3*' ', interpolation=True, encoding='utf8')
config.filename = 'lol.ini'
config['About'] = about
config.write()
This shows up in the ini file as:
[About]
[[Info]]
Purpose = '''blabla continues for fixed chars ...
and another line of bla ... etc.''' # here the indentation goes sub-optimal/wrong.
# 11 white spaces but missing the indentations (6 white-spaces)
For two levels, the indentation shift would be 6 white-spaces to add (3 each for "About" and for "Info"). Apparently, "interpolation=True" is not what does the trick. Any suggestions?
Configobj ver. = 5.0.6
Py = 3.9

Split a big text file into multiple smaller one on set parameter of regex

I have a large text file looking like:
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
......
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
.....
xxxx
.......
asdfghjkl
I want to split the text file into multiple smaller text files, saved as .txt on my system, at each occurrence of ..... [multiple period markers], like this:
group1_sdsdsd.txt
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
group1_ddss.txt
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
and
group1_xxxx.txt
.....
xxxx
.......
asdfghjkl
I figured that a regex along the following lines could do it:
txt =re.sub(r'(([^\w\s])\2+)', r' ', txt).strip() #for letters more than 2 times
but I have not been able to work it out completely.
The saved text files should be named group1_sdsdsd.txt, group1_ddss.txt and group1_xxxx.txt (group1 being the identifier for the specific big text file: I have multiple big text files, need to do the same on all of them, and want to know which big text file each piece came from).
If the runs of dots always sit on a line of their own, you can get the separate parts with a pattern like:
^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*
Explanation
^ Start of the line (the pattern is used with re.MULTILINE)
\.{3,}\n Match 3 or more dots and a newline
(\S+)\n Capture 1+ non whitespace chars in group 1 for the filename and match a newline
\.{3,} Match 3 or more dots
(?: Non capture group to repeat as a whole part
\n Match a newline
(?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
.* Match the whole line
)* Close the non capture group and optionally repeat it
Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.
See a regex demo and a Python demo with the separate parts.
Example code
import re
pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"
s = ("....your data here")
matches = re.finditer(pattern, s, re.MULTILINE)
your_path = "/your/path/"
for match in matches:
    # group 1 is the name between the two dotted lines
    with open(your_path + "group1_{}.txt".format(match.group(1)), 'w') as f:
        f.write(match.group())
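As a rough, self-contained sketch of the same idea, the snippet below runs the pattern over a small in-memory sample before anything is written to disk. The function names split_sections and write_sections, the prefix argument and the sample text are illustrative assumptions, not part of the original answer.

```python
import re
from pathlib import Path

# Same pattern as above, compiled with MULTILINE so ^ matches at each line start.
PATTERN = re.compile(
    r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*",
    re.MULTILINE,
)


def split_sections(text):
    """Map each section name (group 1) to the full text of its section."""
    return {m.group(1): m.group() for m in PATTERN.finditer(text)}


def write_sections(text, out_dir, prefix="group1"):
    """Write every section to <out_dir>/<prefix>_<name>.txt (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, body in split_sections(text).items():
        path = out / f"{prefix}_{name}.txt"
        path.write_text(body, encoding="utf-8")
        paths.append(path)
    return paths


# Made-up sample in the same shape as the question's file.
sample = (
    "....\nsdsdsd\n..........\nline a\nline b\n"
    "......\nddss\n................\nline c\n"
    ".....\nxxxx\n.......\nasdfghjkl\n"
)
sections = split_sections(sample)
print(list(sections))
```

Checking the dictionary first makes it easy to confirm that the names and boundaries are what you expect before calling write_sections on the real file.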

How to delete last n characters of .txt file without having to re-write all the other characters [duplicate]

After looking all over the Internet, I've come to this.
Let's say I have already made a text file that reads:
Hello World
Well, I want to remove the very last character (in this case d) from this text file.
So now the text file should look like this: Hello Worl
But I have no idea how to do this.
All I want, more or less, is a single backspace function for text files on my HDD.
This needs to work on Linux as that's what I'm using.
Use fileobject.seek() to seek 1 position from the end, then use file.truncate() to remove the remainder of the file:
import os
with open(filename, 'rb+') as filehandle:
    filehandle.seek(-1, os.SEEK_END)
    filehandle.truncate()
This works fine for single-byte encodings. If you have a multi-byte encoding (such as UTF-16 or UTF-32) you need to seek back enough bytes from the end to account for a single codepoint.
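For a fixed-width encoding the adjustment is just a matter of arithmetic. The sketch below assumes a UTF-32 file written without a byte-order mark (the "utf-32-le" codec), where every code point takes exactly 4 bytes; the function name and the temporary demo file are illustrative only.

```python
import os
import tempfile


def drop_last_utf32_char(path):
    """Cut the last 4 bytes, i.e. one UTF-32 code point, off the file."""
    with open(path, "rb+") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        if size >= 4:
            fh.truncate(size - 4)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    # utf-32-le writes no BOM, so the file is exactly 4 bytes per character.
    with open(path, "w", encoding="utf-32-le") as fh:
        fh.write("Hello World")
    drop_last_utf32_char(path)
    with open(path, encoding="utf-32-le") as fh:
        result = fh.read()

print(result)  # Hello Worl
```

UTF-16 is not strictly fixed-width (characters outside the Basic Multilingual Plane use two 2-byte units), so a fixed 2-byte cut is only safe there if you know the last character is not part of a surrogate pair.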
For variable-byte encodings, it depends on the codec if you can use this technique at all. For UTF-8, you need to find the first byte (from the end) where bytevalue & 0xC0 != 0x80 is true, and truncate from that point on. That ensures you don't truncate in the middle of a multi-byte UTF-8 codepoint:
with open(filename, 'rb+') as filehandle:
    # move to the last byte, then scan backwards until a non-continuation byte is found
    filehandle.seek(-1, os.SEEK_END)
    # read(1) returns a bytes object in Python 3; index it to get the integer value
    while filehandle.read(1)[0] & 0xC0 == 0x80:
        # we just read 1 byte, which moved the file position forward,
        # skip back 2 bytes to move to the byte before the current.
        filehandle.seek(-2, os.SEEK_CUR)
    # last read byte is our truncation point, move back to it.
    filehandle.seek(-1, os.SEEK_CUR)
    filehandle.truncate()
Note that UTF-8 is a superset of ASCII, so the above works for ASCII-encoded files too.
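To see the continuation-byte test in isolation, here is a small in-memory sketch that applies the same bit mask to a bytes object instead of a file. The helper name last_char_start and the sample strings are made up for illustration.

```python
def last_char_start(data: bytes) -> int:
    """Index of the first byte of the last UTF-8 encoded character in data."""
    i = max(len(data) - 1, 0)
    # Continuation bytes look like 0b10xxxxxx, so (byte & 0xC0) == 0x80.
    while i > 0 and data[i] & 0xC0 == 0x80:
        i -= 1
    return i


encoded = "caf\u00e9".encode("utf-8")      # the final character takes two bytes
cut = last_char_start(encoded)
trimmed = encoded[:cut].decode("utf-8")
print(trimmed)  # caf

ascii_cut = last_char_start(b"Hello World")  # ASCII: every byte is a start byte
print(ascii_cut)  # 10
```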
The accepted answer by Martijn is simple and mostly works, but it does not account for text files with:
UTF-8 encoding containing non-English characters (UTF-8 is typically the default encoding for text files in Python 3 on Linux)
one newline character at the end of the file (which is the default in Linux editors like vim or gedit)
If the text file contains non-English characters, neither of the answers provided so far would work.
What follows is an example that solves both problems and also allows removing more than one character from the end of the file:
import os
def truncate_utf8_chars(filename, count, ignore_newlines=True):
    """
    Truncates last `count` characters of a text file encoded in UTF-8.
    :param filename: The path to the text file to read
    :param count: Number of UTF-8 characters to remove from the end of the file
    :param ignore_newlines: Set to true, if the newline character at the end of the file should be ignored
    """
    with open(filename, 'rb+') as f:
        last_char = None
        size = os.fstat(f.fileno()).st_size
        offset = 1
        chars = 0
        while offset <= size:
            f.seek(-offset, os.SEEK_END)
            b = ord(f.read(1))
            if ignore_newlines:
                if b == 0x0D or b == 0x0A:
                    offset += 1
                    continue
            if b & 0b10000000 == 0 or b & 0b11000000 == 0b11000000:
                # This is the first byte of a UTF8 character
                chars += 1
                if chars == count:
                    # When `count` number of characters have been found, move current position back
                    # with one byte (to include the byte just checked) and truncate the file
                    f.seek(-1, os.SEEK_CUR)
                    f.truncate()
                    return
            offset += 1
How it works:
Reads only the last few bytes of a UTF-8 encoded text file in binary mode
Iterates the bytes backwards, looking for the start of a UTF-8 character
Once `count` characters (not counting trailing newlines) have been found, truncates the file at the first byte of the last character found
Sample text file - bg.txt:
Здравей свят
How to use:
filename = 'bg.txt'
print('Before truncate:', open(filename).read())
truncate_utf8_chars(filename, 1)
print('After truncate:', open(filename).read())
Outputs:
Before truncate: Здравей свят
After truncate: Здравей свя
This works with both UTF-8 and ASCII encoded files.
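If you are working in text mode rather than binary mode, I can suggest the following.
f.seek(f.tell() - 1, os.SEEK_SET)
f.truncate()
In the code above, f.seek() is given an offset derived from f.tell() because, without 'b' access, seeking relative to the end of the file is not allowed. That sets the cursor to the start of the last element, and f.truncate() then removes everything from the cursor onwards. Writing an empty string at that position would leave the file unchanged. Note that the file has to be opened in a mode that keeps its contents, such as 'r+'; plain 'w' empties the file on open.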
with open(urfile, 'rb+') as f:
    f.seek(0,2) # end of file
    size=f.tell() # the size...
    f.truncate(size-1) # truncate at that size - how ever many characters
Be sure to use binary mode on Windows, since line-ending translation in text mode may give an illegal or incorrect character count.
with open('file.txt', 'r+') as f: # 'r+' keeps the contents; 'w' would empty the file on open
    f.seek(0, 2) # seek to end of file; f.seek(0, os.SEEK_END) is legal
    f.seek(f.tell() - 2, 0) # seek to the second last char of file; f.seek(f.tell()-2, os.SEEK_SET) is legal
    f.truncate()
How many characters to step back depends on what the last character of the file is; it could be a newline (\n) or anything else.
This may not be optimal, but if the above approaches don't work out, you could do:
with open('myfile.txt', 'r') as file:
    data = file.read()[:-1]
with open('myfile.txt', 'w') as file:
    file.write(data)
The code first opens the file, and then copies its content (with the exception of the last character) to the string data. Afterwards, the file is truncated to zero length (i.e. emptied), and the content of data is saved to the file, with the same name.
This is basically the same as vins ms's answer, except that it doesn't use the os package and it uses the safer 'with open' syntax. It may not be advisable if the text file is huge. (I wrote this since none of the above approaches worked out too well for me in Python 3.8.)
Here is a dirty way (erase and recreate).
I don't advise using this, but it is possible to do it like this:
import os
x = open("file").read()
os.remove("file")
open("file", "w").write(x[:-1]) # reopen for writing; the default mode is read-only
On a Linux system (or under Cygwin on Windows) you can use the standard truncate command. You can reduce or increase the size of your file with this command.
In order to reduce a file by 1G the command would be truncate -s -1G filename. In the following example I reduce a file called update.iso by 1G.
Note that this operation took less than five seconds.
chris#SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 30802968576 Blocks: 30081024 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:39:00.572940600 -0400
Modify: 2020-06-12 07:39:00.572940600 -0400
Change: 2020-06-12 07:39:00.572940600 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
chris#SR-ENG-P18 /cygdrive/c/Projects
$ truncate -s -1G update.iso
chris#SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 29729226752 Blocks: 29032448 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:42:38.335782800 -0400
Modify: 2020-06-12 07:42:38.335782800 -0400
Change: 2020-06-12 07:42:38.335782800 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
The stat command tells you lots of info about a file including its size.
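If you would rather stay in Python, the standard library offers the same size-based operation through os.truncate. The following is only a sketch of a "shrink by N bytes" helper; the name shrink_file and the temporary demo file are assumptions for illustration, and the count is in bytes, not characters.

```python
import os
import tempfile


def shrink_file(path, nbytes):
    """Reduce the file at path by nbytes (never below zero); returns the new size."""
    new_size = max(os.path.getsize(path) - nbytes, 0)
    os.truncate(path, new_size)
    return new_size


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as fh:
        fh.write(b"Hello World")
    new_size = shrink_file(path, 1)
    with open(path, "rb") as fh:
        result = fh.read()

print(new_size, result)  # 10 b'Hello Worl'
```

Like the truncate command, this only changes the file length, so the remaining bytes are not rewritten.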

extract substring from large string

I have a string as:
string="(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"
I want to extract only the conversation text (the text between a name and the next timestamp). Expected output:
comments=['Good Morning How're you?','Hi, I'm Good.','we got the update on work.It will get complete by next week.']
What I have tried is:
comments=re.findall(r'---\s*\n(.(?:\n(?!(?:(\s\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*)\w+\s*\n)?---).))',string)
You could use a single capture group:
^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)
The pattern matches:
^ Start of the line (the pattern is used with re.M)
---\s*\n Match --- optional whitespace chars and a newline
(?!.* has (?:joined|left) the conversation|\* \* \*) Assert that the next line does not contain a has joined or has left the conversation part, and does not start with * * *
\S.* Match at least a non whitespace char at the start of the line and the rest of the line (this is the name line)
( Capture group 1 (this will be returned by re.findall)
(?:\n(?!\(\d|---).*)* Match all following lines that do not start with ( and a digit, or with ---
) Close group 1
See a regex demo and a Python demo.
Example
import re
pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"
# `string` is the text from the question
result = [m.strip() for m in re.findall(pattern, string, re.M) if m]
print(result)
Output
["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work. \nIt will get complete by next week.']
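The expected output in the question has each comment on a single line, whereas the matches above keep the embedded newlines. One possible post-processing step, sketched here under the assumption that collapsing every run of whitespace into one space is acceptable, is to split and re-join each match. The variable names are illustrative.

```python
import re

string = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"""

pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"

# " ".join(m.split()) strips the ends and collapses inner newlines/spaces.
comments = [" ".join(m.split()) for m in re.findall(pattern, string, re.M) if m]
print(comments)
```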
I've assumed:
The text of interest begins after a block of three lines: a line containing a timestamp, followed by the line "---", which may be padded to the right with spaces, followed by a line comprised of a string of letters containing one period which is neither at the beginning nor end of that string and that string may be padded on the right with spaces.
The block of text of interest may contain blank lines, a blank line being a string that contains nothing other than spaces and a line terminator.
The last line of the block of text of interest cannot be a blank line.
I believe the following regular expression (with multiline (m) and case-insensitive (i) flags set) meets these requirements.
^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n-{3} *\r?\n[a-z]+\.[a-z]+ *\r?\n((?:.*[^ (\n].*\r?\n| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)
The blocks of lines of interest are contained in capture group 1.
Start your engine!
The elements of the expression are as follows.
^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n # match timestamp line
-{3} *\r?\n # match 3-hyphen line
[a-z]+\.[a-z]+ *\r?\n # match name
( # begin capture group 1
(?: # begin non-capture group (a)
.*[^ (\n].*\r?\n # match a non-blank line
| # or
\ *\r?\n # match a blank line
(?= # begin a positive lookahead
(?: # begin non-capture group (b)
\ *\r?\n # match a blank line
)* # end non-capture group b and execute 0+ times
(?! # begin a negative lookahead
\(\d{4}\-\d{2}\-\d{2} .*\) # match timestamp line
) # end negative lookahead
.*[^ (\n] # match a non-blank line
) # end positive lookahead
)* # end non-capture group a and execute 0+ times
) # end capture group 1
Here is a self-documenting regex that will strip leading and trailing whitespace:
(?x)(?m)(?s) # re.X, re.M, re.S (DOTALL)
(?: # start of non capturing group
^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n # date and time
(?!---\s*\r?\nad\.ft has) # next lines are not the ---\n\ad.ft etc.
---\s*\r?\n # --- line
[\w.]+\s*\r?\n # name line
\s* # skip leading whitespace
) # end of non-capture group
# The following is capture group 1. Match characters until you get to the next date-time:
((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*)# skip trailing whitespace
See Regex Demo
See Python Demo
import re
string = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation
(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?
(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.
(2021-07-02 01:08:01 AM BST)
---
perter.derrek
we got the update on work.
It will get complete by next week.
(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation
---
* * *"""
regex = r'''(?x)(?m)(?s) # re.X, re.M, re.S (DOTALL)
(?: # start of non capturing group
^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n # date and time
(?!---\s*\r?\nad\.ft has) # next lines are not the ---\n\ad.ft etc.
---\s*\r?\n # --- line
[\w.]+\s*\r?\n # name line
\s* # skip leading whitespace
) # end of non-capture group
# The following is capture group 1. Match characters until you get to the next date-time:
((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*)# skip trailing whitespace
'''
matches = re.findall(regex, string)
print(matches)
Prints:
["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work.\nIt will get complete by next week.']

How to write Chinese characters to file based on unicode code point in Python3

I am trying to write Chinese characters to a CSV file based on their Unicode code points found in a text file in unicode.org/Public/zipped/13.0.0/Unihan.zip. For instance, one example character is U+9109.
In the example below I can get the correct output by hard coding the value (line 8), but keep getting it wrong with every permutation I've tried at generating the bytes from the code point (lines 14-16).
I'm running this in Python 3.8.3 on a Debian-based Linux distro.
Minimal working (broken) example:
 1 #!/usr/bin/env python3
 2
 3 def main():
 4
 5     output = open("test.csv", "wb")
 6
 7     # Hardcoded values work just fine
 8     output.write('\u9109'.encode("utf-8"))
 9
10     # Comma separation
11     output.write(','.encode("utf-8"))
12
13     # Problem is here
14     codepoint = '9109'
15     u_str = '\\' + 'u' + codepoint
16     output.write(u_str.encode("utf-8"))
17
18     # End with newline
19     output.write('\n'.encode("utf-8"))
20
21     output.close()
22
23 if __name__ == "__main__":
24     main()
Executing and viewing results:
example $
example $./test.py
example $
example $cat test.csv
鄉,\u9109
example $
The expected output would look like this (Chinese character occurring on both sides of the comma):
example $
example $./test.py
example $cat test.csv
鄉,鄉
example $
chr is used to convert integers to code points in Python 3. Your code could use:
output.write(chr(0x9109).encode("utf-8"))
But if you specify the encoding in the open instead of using binary mode you don't have to manually encode everything. print to a file handles newlines for you as well.
with open("test.txt",'w',encoding='utf-8') as output:
    for i in range(0x4e00,0x4e10):
        print(f'U+{i:04X} {chr(i)}',file=output)
Output:
U+4E00 一
U+4E01 丁
U+4E02 丂
U+4E03 七
U+4E04 丄
U+4E05 丅
U+4E06 丆
U+4E07 万
U+4E08 丈
U+4E09 三
U+4E0A 上
U+4E0B 下
U+4E0C 丌
U+4E0D 不
U+4E0E 与
U+4E0F 丏
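In the question the code point arrives as a hexadecimal string such as '9109' rather than as an integer, so it has to go through int(..., 16) before chr. A minimal sketch of that step follows; the helper name char_from_codepoint and the in-memory CSV buffer are illustrative assumptions, and a real script would open the output file with encoding='utf-8' and newline='' instead.

```python
import csv
import io


def char_from_codepoint(hex_str):
    """Turn '9109' or 'U+9109' into the corresponding one-character string."""
    digits = hex_str[2:] if hex_str.upper().startswith("U+") else hex_str
    return chr(int(digits, 16))


buffer = io.StringIO()            # stands in for the opened CSV file
writer = csv.writer(buffer)
writer.writerow(["\u9109", char_from_codepoint("9109")])

line = buffer.getvalue()
print(line, end="")
```

Both columns of the row hold the same character, which is the result the question was after.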

awk-insert row with specific text within specific position

I have a file where the first couple of rows start with a # mark, followed by a classical netlist, which can also contain rows beginning with a # mark. I need to insert one row with the text protect between the block of first rows beginning with # and the first row of the classical netlist. At the end of the file I need to insert a row with the word unprotect. It would be good to save the modified text to a new file with a specific name, because the original file is protected.
Sample file:
// Generated for: spectre
// Design library name: Kovi
// Design cell name: T_Line
// Design view name: schematic
simulator lang=spectre
global 0
parameters frequency=3.8G Zo=250
// Library name: Kovi
// Cell name: T_Line
// View name: schematic
T8 (7 0 6 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T7 (net034 0 net062 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T5 (net021 0 4 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T4 (net019 0 2 0) tline z0=Zo f=3.8G nl=0.5 vel=1
How about sed
sed -e '/^#/,/^#/!iprotect'$'\n''$aunprotect'$'\n' input_file > new_file
Inserts 'protect' on a line by itself after the first block of commented lines, then adds 'unprotect' at the end.
Note: Because I use $'\n' in place of literal newline bash is assumed as the shell.
Since you awk'd the post
awk 'BEGIN{ protected=""} { if($0 !~ /#/ && !protected){ protected="1"; print "protect";} print $0}END{print "unprotect";}' input_file > output_file
As soon as a row that does not contain a # is detected, it will output a line with protect. At the end it will output a line with unprotect.
Test file
#
#
#
#Preceded by a tab
begin protect
#
before unprotect
Result
#
#
#
#Preceded by a tab
protect
begin protect
#
before unprotect
unprotect
Edit:
Removed the [:space:]*, since /#/ matches a # anywhere in the line and so leading whitespace needs no special handling.
Support //
If you wanted to support both # and // in the same script, the regex portion would change to /#|\//. The special character / has to be escaped by using \.
This would check for at least one /.
Adding a quantifier {2} will match // exactly: /#|\/{2}/
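For comparison, here is a rough Python sketch of the same insertion logic. It is not a translation of the awk one-liner: it tests whether the first non-whitespace characters of a line are a comment marker, which is an assumption about the intended rule, whereas the awk test /#/ matches a # anywhere in the line. The function name and the comment_prefixes parameter are made up for illustration.

```python
def add_protect_markers(lines, comment_prefixes=("#", "//")):
    """Insert 'protect' before the first non-comment line and append 'unprotect'."""
    out = []
    inserted = False
    for line in lines:
        # A line counts as a comment if, after leading whitespace,
        # it starts with one of the given prefixes.
        if not inserted and not line.lstrip().startswith(comment_prefixes):
            out.append("protect")
            inserted = True
        out.append(line)
    out.append("unprotect")
    return out


test_lines = [
    "#",
    "#",
    "#",
    "\t#Preceded by a tab",
    "begin protect",
    "#",
    "before unprotect",
]
result = add_protect_markers(test_lines)
print("\n".join(result))
```

To work on files, read the input with something like open(path).read().splitlines() and write the joined result to the new file name; as in the awk version, no protect line is emitted if every line is a comment.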