Splitting File contents by regex - python-3.x

As the title states, I need to split a file by regex in Python.
The layout of the .txt file is as follows:
[text1]
file contents I need
[some text2]
more file contents I need
[more text 3]
last bit of file contents I need
I originally tried splitting the files like so:
re.split('\[[A-Za-z]+\]\n', data)
The problem with this approach was that it didn't match the headers that have spaces in the text within the brackets.
I then tried using a wildcard: re.split('\[(.*?)\]\n', data)
The problem I ran into with this was that it seemed to split the file contents as well. What's the best way to get the following result:
['file contents I need','more file contents I need','last bit of file contents I need']?
Thanks in advance.

Instead of using re.split, you could use a capturing group with re.findall which will return the group 1 values.
In the group, match all the lines that do not start with the [.....] pattern
^\[[^][]*]\r?\n\s*(.*(?:\r?\n(?!\[[^][]*]).*)*)
In parts
^ Start of line
\[[^][]*] Match [, then 0+ chars other than [ or ], then ]
\r?\n\s* Match a newline and optional whitespace chars
( Capture group 1
.* Match any char except a newline 0+ times
(?: Non capture group
\r?\n(?!\[[^][]*]).* Match the line if it does not start with the [...] pattern, using a negative lookahead (?!
)* Close group and repeat 0+ times to get all the lines
) Close group
Example code
import re
regex = r"^\[[^][]*]\r?\n\s*(.*(?:\r?\n(?!\[[^][]*]).*)*)"
data = ("[text1]\n\n"
"file contents I need\n\n"
"[some text2]\n\n"
"more file contents I need\n\n"
"[more text 3]\n\n"
"last bit of file contents I need\n"
"last bit of file contents I need")
matches = re.findall(regex, data, re.MULTILINE)
print(matches)
Output
['file contents I need\n', 'more file contents I need\n', 'last bit of file contents I need\nlast bit of file contents I need']

Given:
txt='''\
[text1]
file contents I need
[some text2]
more file contents I need
multi line at that
[more text 3]
last bit of file contents I need'''
(Which could be from a file...)
You can do:
>>> [e.strip() for e in re.findall(r'(?<=\])([\s\S]*?)(?=\[|\s*\Z)', txt)]
['file contents I need', 'more file contents I need\nmulti line at that', 'last bit of file contents I need']
You can also use re.finditer to locate each block of interest:
with open(ur_file) as f:
    for i, block in enumerate(re.finditer(r'^\s*\[[^]]*\]([\s\S]*?)(?=^\s*\[[^]]*\]|\Z)', f.read(), flags=re.M)):
        print(i, block.group(1))
Each block's leading and trailing whitespace can be dealt with as desired...
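For comparison, re.split itself can also work here. When the pattern contains a capturing group, re.split returns the captured text in the result list as well, which would explain why the second attempt in the question seemed to return more than the contents. A minimal sketch without a capturing group, assuming each header sits on its own line (the names data, header and blocks are only illustrative):

```python
import re

# Sample data in the layout described in the question.
data = (
    "[text1]\n"
    "file contents I need\n"
    "[some text2]\n"
    "more file contents I need\n"
    "[more text 3]\n"
    "last bit of file contents I need"
)

# No capturing group in the pattern, so re.split drops the headers
# instead of returning them alongside the contents.
header = r"^\[[^\]\n]*\]\r?\n"
parts = re.split(header, data, flags=re.MULTILINE)

# The piece before the first header is empty; strip and keep the rest.
blocks = [p.strip() for p in parts if p.strip()]
print(blocks)
# ['file contents I need', 'more file contents I need', 'last bit of file contents I need']
```

The filter at the end is needed because the text before the first header is an empty string.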

Related

Editing data in a text file in python for a given condition

I have a text file with the following contents:
1 a 20
2 b 30
3 c 40
I need to check if the first character of a particular line is 2 and edit its final two characters to 12, and rewrite the data into the file. New file should look something like this:
1 a 20
2 b 12
3 c 40
Need help doing this in python 3.
Couldn't figure it out. Help.
To modify contents of a file with python you will need to open the file in read mode to extract the contents of the file. You can then make changes on the extracted contents. To make your changes permanent, you have to write the contents back to the file.
The whole process looks something like this:
from pathlib import Path

# Define path to your file
your_file = Path("your_file.txt")

# Read the data in your file
with your_file.open('r') as f:
    lines = f.readlines()

# Edit lines that start with a "2"
for i in range(len(lines)):
    if lines[i].startswith("2"):
        lines[i] = lines[i][:-3] + "12\n"

# Write data back to file
with your_file.open('w') as f:
    f.writelines(lines)
Note that in order to change the last two characters of a string, you actually need to change the two characters before the last. This is because of the newline character, which indicates that the line has ended and new characters should be put on the line below. The \n you see after 12 is the newline character. If you don't put this in your replacement string, what originally was the next string will be put directly behind your replacement.
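If you would rather not count characters around the newline, a variation is to split each line into fields and replace the last one. This is only a sketch under the assumption that the fields are separated by whitespace; the function name replace_last_field is made up for illustration, and note that it rewrites edited lines with single spaces between fields and ends the file with a newline.

```python
import tempfile
from pathlib import Path


def replace_last_field(path, first_field, new_value):
    """Replace the last field on every line whose first field matches."""
    lines = Path(path).read_text().splitlines()
    updated = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == first_field:
            fields[-1] = new_value
            line = " ".join(fields)
        updated.append(line)
    # Write everything back, one line per entry.
    Path(path).write_text("\n".join(updated) + "\n")


# Demonstration on a temporary copy of the sample data.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "your_file.txt"
    demo.write_text("1 a 20\n2 b 30\n3 c 40")
    replace_last_field(demo, "2", "12")
    print(demo.read_text())
```

Because the lines are split with splitlines(), this also works when the last line of the file has no trailing newline.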

How to read a text file and insert the data into next line on getting \n character

I have a text file where the data is comma-delimited with a literal \n character in between. I would like to start a new line each time a \n character is encountered.
text file sample:
'what,is,your,name\n','my,name,is,david.hough\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,eric.knot\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,fisher.cold\n','i,am,a,software,prof\n',..
expected:
I need the output in the below form.
'what,is,your,name',
'my,name,is,david.hough',
'i,am,a,software,prof',
Tried:
file1 = open("test.text", "r")
Lines = file1.readlines()
for line in Lines:
    print(line)
result:
'what,is,your,name\n','my,name,is,david.hough\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,eric.knot\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,fisher.cold\n','i,am,a,software,prof\n',..
Well, my comment does exactly what you asked: break lines at \n. Your data is structured quite oddly, but if you really want the expected result you can use a regex:
import re

with open("test.text", "r") as file1:
    # Drop the literal \n sequences, then pull out each quoted item
    Lines = re.findall(r'\'.*?\',', file1.read().replace("\\n", ""))

for line in Lines:
    print(line)
Well, you don't need to push the data onto the next line manually. The \n does that work when you run the code.
I guess the problem is that you used quotes too often. Try using a single pair of quotes, with \n after each sentence and no whitespace:
'what,is,your,name\nmy,name,is,david.hough\ni,am,a,software,prof'
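If the file has to stay in its current quoted form, another option is to capture the text between each opening quote and the literal \n that closes it. A small sketch, assuming the file really contains a backslash followed by n (as in the sample) rather than actual line breaks; the sample string here is a shortened copy of the question's data:

```python
import re

# Raw string: each \n below is a backslash followed by the letter n.
sample = r"'what,is,your,name\n','my,name,is,david.hough\n','i,am,a,software,prof\n'"

# Quote, lazily captured content, literal backslash + n, closing quote.
records = re.findall(r"'(.*?)\\n'", sample)

for record in records:
    print(f"'{record}',")
```

This prints each record on its own line in the quoted form shown under "expected".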

Count the number of characters in a file

The question:
Write a function file_size(filename) that returns a count of the number of characters in the file whose name is given as a parameter. You may assume that when being tested in this CodeRunner question your function will never be called with a non-existent filename.
For example, if data.txt is a file containing just the following line: Hi there!
A call to file_size('data.txt') should return the value 10. This includes the newline character that will be added to the line when you're creating the file (be sure to hit the 'Enter' key at the end of each line).
What I have tried:
def file_size(data):
    """Count the number of characters in a file"""
    infile = open('data.txt')
    data = infile.read()
    infile.close()
    return len(data)

print(file_size('data.txt'))
# data.txt contains 'Hi there!' followed by a new line character.
I get the correct answer for this file, but I fail a test that uses a larger file, which should have a character count of 81; I still get 10. I am trying to get the code to count the correct size of any file.
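The posted function ignores its parameter and always opens 'data.txt', which would explain getting 10 no matter which file the test passes in. A minimal sketch that opens the file named by the argument instead (the temporary file in the demonstration is only there to make the example self-contained):

```python
import os
import tempfile


def file_size(filename):
    """Return the number of characters in the named file."""
    with open(filename) as infile:
        return len(infile.read())


# Demonstration with a file containing 'Hi there!' plus a newline.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w") as out:
        out.write("Hi there!\n")
    print(file_size(path))  # 10
```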

Reading from file returns 2 dictionaries

data = [line.strip('\n') for line in file3]
# print(data)
data2 = [line.split(',') for line in data]
data_dictionary = {t[0]:t[1] for t in data2}
print(data_dictionary)
So I'm reading content from a file, under the assumption that there is no whitespace at the beginning of each line and no blank lines anywhere.
When I read this file, I first strip the newline character and then split the data on ',' because that is the separator used in the file. But when I build the dictionary, I get two dictionaries instead of one, and the same happens with other files where I use this procedure. How do I fix this?

printing sentence from a word search

As an exercise in the code below, I've copied and saved Rice's Tarzan novel into a text file (named tarzan.txt) and within it, I've searched for "row" and printed out the corresponding lines.
Is it difficult to modify this code so that it searches for the word "row" rather than instances of these letters appearing inside another word, AND so that it prints the sentence that contains this word rather than simply the line it appears in? Thanks.
a="tarzan.txt"
with open (a) as f_obj:
    contents=f_obj.readlines()
for line in contents:
    if "row" in line:
        print(line)
import re
a="tarzan.txt"
with open (a) as f_obj:
    contents=f_obj.readlines()
for number, line in enumerate(contents, start=1):
    if re.search(r'\brow\b',line): ####### search for the word 'row' in line
        print(number) ####### print the (1-based) line number
Here \b matches a word boundary.
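To print whole sentences rather than lines, one option is to read the text as a single string and split it into sentences before searching. The sketch below is only a rough approach: it assumes a sentence ends at ., ! or ? followed by whitespace, which will misfire on abbreviations such as "Mr.", and the function name sentences_with_word is made up for illustration.

```python
import re


def sentences_with_word(text, word):
    """Return the sentences of text that contain word as a whole word."""
    # Join wrapped lines so a sentence spanning several lines stays whole.
    flat = " ".join(text.split())
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return [s for s in sentences if pattern.search(s)]


sample = "They row\nto shore. A crow called. The arrow flew! Row after row of trees."
print(sentences_with_word(sample, "row"))
# ['They row to shore.', 'Row after row of trees.']
```

For the novel, the text argument would be the result of f_obj.read() instead of the sample string. The search is case-sensitive; passing re.IGNORECASE to re.compile would also match "Row".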

Resources