Insert a new line or an empty line between two sentences - python-3.x

i have a text file as the output of another program. I want to insert a white space above the line that starts with 'Sentence #'
this is what i currently have:
Sentence #26024 (5 tokens):
Today is a good day
[Text=Today CharacterOffsetBegin=1607176 CharacterOffsetEnd=1607178 PartOfSpeech=IN Lemma=if]
[Text=is CharacterOffsetBegin=1607179 CharacterOffsetEnd=1607181
PartOfSpeech=NN Lemma=yo]
[Text=a CharacterOffsetBegin=1607182 CharacterOffsetEnd=1607186 PartOfSpeech=NN Lemma=girl]
[Text=good CharacterOffsetBegin=1607187 CharacterOffsetEnd=1607193 PartOfSpeech=JJ Lemma=doesnt]
[Text=day CharacterOffsetBegin=1607202 CharacterOffsetEnd=1607205
root(ROOT-0, today-1)
root(today-1, is-2)
dobj(a-2, good-3)
amod(day-3, good-4)
Sentence #26025 (4 tokens):
if you can help
[Text=if CharacterOffsetBegin=1607223 CharacterOffsetEnd=1607225 PartOfSpeech=IN Lemma=if]
[Text=you CharacterOffsetBegin=1607226 CharacterOffsetEnd=1607229 PartOfSpeech=PRP Lemma=you]
[Text=can CharacterOffsetBegin=1607230 CharacterOffsetEnd=1607233 PartOfSpeech=MD Lemma=can
mark(help-4, if-1)
nsubj(help-4, you-2)
aux(help-4, can-3)
This is what i want it to look like:
Sentence #26024 (5 tokens):
Today is a good day
[Text=Today CharacterOffsetBegin=1607176 CharacterOffsetEnd=1607178 PartOfSpeech=IN Lemma=if]
[Text=is CharacterOffsetBegin=1607179 CharacterOffsetEnd=1607181
PartOfSpeech=NN Lemma=yo]
[Text=a CharacterOffsetBegin=1607182 CharacterOffsetEnd=1607186 PartOfSpeech=NN Lemma=girl]
[Text=good CharacterOffsetBegin=1607187 CharacterOffsetEnd=1607193 PartOfSpeech=JJ Lemma=doesnt]
[Text=day CharacterOffsetBegin=1607202 CharacterOffsetEnd=1607205
root(ROOT-0, today-1)
root(today-1, is-2)
dobj(a-2, good-3)
amod(day-3, good-4)
Sentence #26025 (4 tokens):
if you can help
[Text=if CharacterOffsetBegin=1607223 CharacterOffsetEnd=1607225 PartOfSpeech=IN Lemma=if]
[Text=you CharacterOffsetBegin=1607226 CharacterOffsetEnd=1607229 PartOfSpeech=PRP Lemma=you]
[Text=can CharacterOffsetBegin=1607230 CharacterOffsetEnd=1607233 PartOfSpeech=MD Lemma=can
mark(help-4, if-1)
nsubj(help-4, you-2)
aux(help-4, can-3)
Can anyone please provide pointers. Thanks
I can't do it manually because it's a large file that will need thoussand of spaces inserted.

This is what i did, in case anyone else has a similar problem.
file = open("nameofoldfile.txt", 'r')
filelines = file.readlines()
for lines in filelines:
lines = lines.strip()
if lines.startswith('Sentence #'):
print('\n')
print(lines)
else:
print(lines)
then i saved the file to a new text file by running on the command prompt
python nameoffile.py > nameoftextfile.txt

Related

Split a big text file into multiple smaller one on set parameter of regex

I have a large text file looking like:
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
......
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
.....
xxxx
.......
asdfghjkl
I want to split the text files into multiple small text files and save them as .txt in my system on occurences of ..... [multiple period markers] saved like
group1_sdsdsd.txt
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
group1_ddss.txt
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
and
group1_xxxx.txt
.....
xxxx
.......
asdfghjkl
I have figured that by usinf regex of sort of following can be done
txt =re.sub(r'(([^\w\s])\2+)', r' ', txt).strip() #for letters more than 2 times
but not able to figure out completely.
The saved text files should be named as group1_sdsdsd.txt , group1_ddss.txt and group1_xxxx.txt [group1 being identifier for the specific big text file as I have multiple bigger text files and need to do same on all to know which big text file i am splitting.
If you want to get the parts with multiple dots only on the same line, you can use and get the separate parts, you might use a pattern like:
^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*
Explanation
^ Start of string
\.{3,}\n Match 3 or more dots and a newline
(\S+)\n Capture 1+ non whitespace chars in group 1 for the filename and match a newline
\.{3,} Match 3 or more dots
(?: Non capture group to repeat as a whole part
\n Match a newline
(?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
.* Match the whole line
)* Close the non capture group and optionally repeat it
Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.
See a regex demo and a Python demo with the separate parts.
Example code
import re
pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"
s = ("....your data here")
matches = re.finditer(pattern, s, re.MULTILINE)
your_path = "/your/path/"
for matchNum, match in enumerate(matches, start=1):
f = open(your_path + "group1_{}".format(match.group(1)), 'w')
f.write(match.group())
f.close()

How to delete last n characters of .txt file without having to re-write all the other characters [duplicate]

After looking all over the Internet, I've come to this.
Let's say I have already made a text file that reads:
Hello World
Well, I want to remove the very last character (in this case d) from this text file.
So now the text file should look like this: Hello Worl
But I have no idea how to do this.
All I want, more or less, is a single backspace function for text files on my HDD.
This needs to work on Linux as that's what I'm using.
Use fileobject.seek() to seek 1 position from the end, then use file.truncate() to remove the remainder of the file:
import os
with open(filename, 'rb+') as filehandle:
filehandle.seek(-1, os.SEEK_END)
filehandle.truncate()
This works fine for single-byte encodings. If you have a multi-byte encoding (such as UTF-16 or UTF-32) you need to seek back enough bytes from the end to account for a single codepoint.
For variable-byte encodings, it depends on the codec if you can use this technique at all. For UTF-8, you need to find the first byte (from the end) where bytevalue & 0xC0 != 0x80 is true, and truncate from that point on. That ensures you don't truncate in the middle of a multi-byte UTF-8 codepoint:
with open(filename, 'rb+') as filehandle:
# move to end, then scan forward until a non-continuation byte is found
filehandle.seek(-1, os.SEEK_END)
while filehandle.read(1) & 0xC0 == 0x80:
# we just read 1 byte, which moved the file position forward,
# skip back 2 bytes to move to the byte before the current.
filehandle.seek(-2, os.SEEK_CUR)
# last read byte is our truncation point, move back to it.
filehandle.seek(-1, os.SEEK_CUR)
filehandle.truncate()
Note that UTF-8 is a superset of ASCII, so the above works for ASCII-encoded files too.
Accepted answer of Martijn is simple and kind of works, but does not account for text files with:
UTF-8 encoding containing non-English characters (which is the default encoding for text files in Python 3)
one newline character at the end of the file (which is the default in Linux editors like vim or gedit)
If the text file contains non-English characters, neither of the answers provided so far would work.
What follows is an example, that solves both problems, which also allows removing more than one character from the end of the file:
import os
def truncate_utf8_chars(filename, count, ignore_newlines=True):
"""
Truncates last `count` characters of a text file encoded in UTF-8.
:param filename: The path to the text file to read
:param count: Number of UTF-8 characters to remove from the end of the file
:param ignore_newlines: Set to true, if the newline character at the end of the file should be ignored
"""
with open(filename, 'rb+') as f:
last_char = None
size = os.fstat(f.fileno()).st_size
offset = 1
chars = 0
while offset <= size:
f.seek(-offset, os.SEEK_END)
b = ord(f.read(1))
if ignore_newlines:
if b == 0x0D or b == 0x0A:
offset += 1
continue
if b & 0b10000000 == 0 or b & 0b11000000 == 0b11000000:
# This is the first byte of a UTF8 character
chars += 1
if chars == count:
# When `count` number of characters have been found, move current position back
# with one byte (to include the byte just checked) and truncate the file
f.seek(-1, os.SEEK_CUR)
f.truncate()
return
offset += 1
How it works:
Reads only the last few bytes of a UTF-8 encoded text file in binary mode
Iterates the bytes backwards, looking for the start of a UTF-8 character
Once a character (different from a newline) is found, return that as the last character in the text file
Sample text file - bg.txt:
Здравей свят
How to use:
filename = 'bg.txt'
print('Before truncate:', open(filename).read())
truncate_utf8_chars(filename, 1)
print('After truncate:', open(filename).read())
Outputs:
Before truncate: Здравей свят
After truncate: Здравей свя
This works with both UTF-8 and ASCII encoded files.
In case you are not reading the file in binary mode, where you have only 'w' permissions, I can suggest the following.
f.seek(f.tell() - 1, os.SEEK_SET)
f.write('')
In this code above, f.seek() will only accept f.tell() b/c you do not have 'b' access. then you can set the cursor to the starting of the last element. Then you can delete the last element by an empty string.
with open(urfile, 'rb+') as f:
f.seek(0,2) # end of file
size=f.tell() # the size...
f.truncate(size-1) # truncate at that size - how ever many characters
Be sure to use binary mode on windows since Unix file line ending many return an illegal or incorrect character count.
with open('file.txt', 'w') as f:
f.seek(0, 2) # seek to end of file; f.seek(0, os.SEEK_END) is legal
f.seek(f.tell() - 2, 0) # seek to the second last char of file; f.seek(f.tell()-2, os.SEEK_SET) is legal
f.truncate()
subject to what last character of the file is, could be newline (\n) or anything else.
This may not be optimal, but if the above approaches don't work out, you could do:
with open('myfile.txt', 'r') as file:
data = file.read()[:-1]
with open('myfile.txt', 'w') as file:
file.write(data)
The code first opens the file, and then copies its content (with the exception of the last character) to the string data. Afterwards, the file is truncated to zero length (i.e. emptied), and the content of data is saved to the file, with the same name.
This is basically the same as vins ms's answer, except that it doesn't use the os package, and that is used the safer 'with open' syntax. This may not be recommended if the text file is huge. (I wrote this since none of the above approaches worked out too well for me in python 3.8).
here is a dirty way (erase & recreate)...
i don't advice to use this, but, it's possible to do like this ..
x = open("file").read()
os.remove("file")
open("file").write(x[:-1])
On a Linux system or (Cygwin under Windows). You can use the standard truncate command. You can reduce or increase the size of your file with this command.
In order to reduce a file by 1G the command would be truncate -s 1G filename. In the following example I reduce a file called update.iso by 1G.
Note that this operation took less than five seconds.
chris#SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 30802968576 Blocks: 30081024 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:39:00.572940600 -0400
Modify: 2020-06-12 07:39:00.572940600 -0400
Change: 2020-06-12 07:39:00.572940600 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
chris#SR-ENG-P18 /cygdrive/c/Projects
$ truncate -s -1G update.iso
chris#SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 29729226752 Blocks: 29032448 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:42:38.335782800 -0400
Modify: 2020-06-12 07:42:38.335782800 -0400
Change: 2020-06-12 07:42:38.335782800 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
The stat command tells you lots of info about a file including its size.

parsing dates from strings

I have a list of strings in python like this
['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
I want to parse only the date and time (for example, 2016-08-05 15:10:00 )from these strings.
So far I used a for loop like the one below but it's very time consuming, is there a better way to do this?
for files in glob.glob("AM_B0_*.flac.h5"):
if files[11]=='_':
year=files[12:16]
month=files[17:19]
day= files[20:22]
hour=files[23:25]
minute=files[25:27]
second=files[27:29]
tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year),int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
else:
year=files[11:15]
month=files[16:18]
day= files[19:21]
hour=files[22:24]
minute=files[24:26]
second=files[26:28]
tindex=pd.date_range(start= '%d-%02d-%02d %02d:%02d:%02d' %(int(year), int(month), int(day), int(hour), int(minute), int(second)), periods=60, freq='10S')
Try this (based on the 2nd last '-', no need of if-else case):
filesall = ['AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5',
'AM_B0_D0.0_2016-04-01T010000.flac.h5',
'AM_B0_D3.7_2016-04-13T215000.flac.h5',
'AM_B0_D10.3_2017-03-17T110000.flac.h5',
'AM_B0_D0.7_2016-10-21T104000.flac.h5',
'AM_B0_D4.4_2016-08-05T151000.flac.h5']
def find_second_last(text, pattern):
return text.rfind(pattern, 0, text.rfind(pattern))
for files in filesall:
start = find_second_last(files,'-') - 4 # from yyyy- part
timepart = (files[start:start+17]).replace("T"," ")
#insert 2 ':'s
timepart = timepart[:13] + ':' + timepart[13:15] + ':' +timepart[15:]
# print(timepart)
tindex=pd.date_range(start= timepart, periods=60, freq='10S')
In Place of using file[11] as hard coded go for last or 2nd last index of _ then use your code then you don't have to write 2 times same code. Or use regex to parse the string.

Unable to print Dependent vowels

I am reading the text file consisting of bengali words. But I am unable to print the dependent vowels like KA,KI etc...
Here is my sample code and output
import unicodedata
bengali_phoneme_maplist={u'অ':'A',u'আ':'AA',u'ই':'I',u'ঈ':'II',u'উ':'U',u'ঊ ':'UU',u'ঋ ':'R',u'ঌ ':'L',u'এ ':'E',u'ঐ ':'AI',u'ও ':'O',u'ঔ ':'AU',u'ক':'KA',u'খ ':'KHA',u'গ ':'GA',u'ঘ':'GHA',u'ঙ ':'NGA',u'চ ':'CA',u'ছ':'CHA',u'জ ':'JA',u'ঝ':'JHA',u'ঞ':'NYA',u'ট ':'TTA',u'ঠ':'TTHA',u'ড ':'DDA',u'ঢ':'DDHA',u'ণ ':'NNA',u'ত ':'TA',u'ত ':'THA',u'দ':'DA',u'ধ':'DHA',u'ন':'NA',u'প':'PA',u'ফ':'PHA',u'ব':'BA',u'ভ':'BHA',u'ম ':'MA',u'য ':'YA',u'র':'RA',u'ল ':'LA',u'শ ':'SHA',u'ষ':'SSA',u'স ':'SA',u'হ':'ha',u' া ':'AAV',u' ি':'IV',u'ী':'IIV',u'ু':'UV',u'ূ':'UUV',u'ৃ':'RRV',u'ৄ ':'RR',u'ৄ':'EV',u' ৈ':'EV',u'়':'NUKTHA',u'ঽ':'AVAGRAHA'}
bengali_phoneme_maplist_normalise={unicodedata.normalize('NFKD',k):v
for k,v in bengali_phoneme_maplist.items()}
with open('bengali.txt','r')as infile:
lines=infile.readlines()
for index,line in enumerate(lines):
print('Phonemes in line{0}.total{1} symbols'.format(index,len(line)))
unknown=[]
words=line.split()
for word in words:
print(word,':',sep=' ', end='')
for character in word:
c=unicodedata.normalize('NFKD',character).casefold()
try:
print(bengali_phoneme_maplist_normalise[c],sep='',end='')
except KeyError:
print('_',sep='',end='')
if c not in unknown:
unknown.append(c)
print()
if unknown:
print('Unrecognised symbols:{0},total {1} symbols'.format(','.join(unknown),len(unknown)))
Sample input:
শিল্পাঞ্চলে ঢোকার মুখে, স্ন্যাক্সবারে খাবার কিনছিলেন, বহুজাতিক তথ্যপ্রযুক্তি সংস্থার কর্মী, শুভময় বন্দ্যোপাধ্যায়
Sample output:
Phonemes in line0.total129 symbols
text_000002 :___________
"শিল্পাঞ্চলে :_____PA_NYA____
ঢোকার :DDHA_KA_RA
মুখে, :_UV___
স্ন্যাক্সবারে :__NA___KA__BA_RA_
খাবার :__BA_RA
কিনছিলেন, :KA_NACHA___NA_
Unrecognisedsymbols:t,e,x,_,0,2,",শ,ি,ল,্,া,চ,ে,ো,ম,খ,,,স,য,জ,ত,থ,ং,য়,),
(Note that I know nohting about Bengali. :)
There are a few problems in your code:
There are many extra SPACE chars in the bengali_phoneme_maplist definition. For example, u'ঊ ' should be u'ঊ'. And it seems like it's not easy to input chars like u'া' in an text editor so I suggest you directly use unicode in the code, like '\u09be':'AAV'. (Actually I'd suggest you use '\uxxxx' for all chars and write the real chars in comments.)
u'ত':'TA',u'ত':'THA' should change to u'ত':'TA',u'থ':'THA'.
The chars in bengali_phoneme_maplist are not complete. For example there's no ো , ৌ , ্ and ং
After fixing these errors you will get the correct result.

entering text in a file at specific locations by identifying the number being integer or real in linux

I have an input like below
46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02
Each segment with the 2nd entry like 1 being integer is like thousands of lines and then starts the segment with the 2nd entry being real like 3.58077402e+01
Before anything beings I have to input a text like
*Revolved
*Gripped
*Crippled
46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
*Cracked
*Crippled
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02
so I need to enter specific texts at those locations. It is worth mentioning that the file is space delimited and not tabs delimited and that the text starting with * has to be at the very left of the line without spacing. The format of the rest of the file should be kept too.
Any suggestions with sed or awk would be highly appreaciated!
The text in the beginning could entered directly so that is not a prime problem since that is the start of the file, problematic is the second bunch of line so identify that the second entry has turned to real.
An awk with fixed strings:
awk 'BEGIN{print "*Revolved\n*Gripped\n*Crippled"}
match($2,"\+")&&!pr{print "*Cracked\n*Crippled";pr=1}1' yourfile
match($2,"\+")&&!pr : When + char is found at $2 field(real number) and pr flag is null.

Resources