text segmentation by the number of words in python - python-3.x

Can anyone tell me what the problem is with my code?
I want to segment a big text into smaller texts by word count, so that each segment contains 60 words.
file = r'C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\economy2.txt'
openFile = open(file, 'r', encoding='utf-8-sig')
words = openFile.read().split()
# print(words)

i = 0
for idx, w in enumerate(words, start=0):
    textNum = 1
    while textNum <= 20:
        wordAsText = []
        print("word list before:", wordAsText)
        while i < idx + 60:
            wordAsText.append(words[i])
            i += 1
        print("word list after:", wordAsText)
        textSeg = ' '.join(wordAsText)
        print(textNum, textSeg)
        files = open(r"C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\datasetEco\Eco" + str(textNum) + ".txt", "w", encoding='utf-8-sig')
        files.write(textSeg)
        files.close()
        idx += 60
        if textNum != 20:
            continue
        textNum += 1
My big file (economy2) contains more than 12K words.
EDIT:
Thanks for all the responses. I tried what I found here and it achieved what I needed.
Edited code:
file = r'C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\economy2.txt'
openFile = open(file, 'r', encoding='utf-8-sig')
words = openFile.read().split()
# print(words)

n = 60
segments = [' '.join(words[i:i+n]) for i in range(0, len(words), n)]  # from link

i = 1
for s in segments:
    seg = open(r"C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\datasetEco\Eco" + str(i) + ".txt", "w", encoding='utf-8-sig')
    seg.write(s)
    seg.close()
    i += 1
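If it helps, the same idea can be wrapped in a small function that closes each output file automatically. This is just a sketch; the function name, the prefix parameter, and the use of os.path.join are my own assumptions rather than part of the original code:

import os

def segment_file(in_path, out_dir, n=60, prefix="Eco"):
    # Read all words, split them into n-word chunks, and write each chunk to its own file.
    with open(in_path, 'r', encoding='utf-8-sig') as f:
        words = f.read().split()
    segments = [' '.join(words[i:i+n]) for i in range(0, len(words), n)]
    for num, seg in enumerate(segments, start=1):
        out_path = os.path.join(out_dir, prefix + str(num) + ".txt")
        with open(out_path, 'w', encoding='utf-8-sig') as out:
            out.write(seg)

segment_file(file, r"C:\Users\Nujou\Desktop\Master\thesis\steganalysis\dataset\datasetEco")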

Related

how to speed up looping over 4GB tab delimited text file

It took me over 3 minutes to loop over a 4 GB text file, counting the number of lines and the number of words and chars per line as I go. Is there a faster way to do this?
This is my code:
import time
import csv
import sys

csv.field_size_limit(sys.maxsize)

i = 0
countwords = {}
countchars = {}
start = time.time()
with open("filename.txt", "r", encoding="utf-8") as file:
    for line in csv.reader(file, delimiter="\t"):
        i += 1
        countwords[i] = len(str(line).split())
        countchars[i] = len(str(line))
        if i % 10000 == 0:
            print(i)
end = time.time()
if i > 0:
    print(i)
    print(sum(countwords.values())/i)
    print(sum(countchars.values())/i)
    print(end-start)
From my limited testing (on a Unix dictionary) I get only a minor speedup using numpy, but any win is a win. I'm not sure whether csv.reader is a good way to parse tab-delimited text, and I have not checked whether it gives better speed.
import time
import numpy

# Holds count of words and letters per line of input
countwords = numpy.array([])
countchars = numpy.array([])
# Holds total count of words and letters for the whole file
word_sum = 0
char_sum = 0

start = time.time()
file_in = open("filename.txt", "rt", encoding="utf-8")
for line in file_in:
    # clean up the line, split it into fields by TAB character
    line = line.strip()
    fields = line.split('\t')
    # Count the fields, and the letters of each field's content
    field_count = len(fields)
    char_count = len(line) - field_count  # don't count the '\t' chars too
    # keep a separate count of the fields and letters by line
    # (numpy.append returns a new array, so the result must be re-assigned)
    countwords = numpy.append(countwords, field_count)
    countchars = numpy.append(countchars, char_count)
    # Keep a running total to save summation at the end
    word_sum += field_count
    char_sum += char_count
file_in.close()
end = time.time()

print("Total Words:   %3d" % (word_sum))
print("Total Letters: %3d" % (char_sum))
print("Elapsed Time:  %.2f" % (end - start))
You can avoid allocating extra data and keep running totals instead of per-line dictionaries:
import time
import csv
import sys
import itertools

csv.field_size_limit(sys.maxsize)

countwords = 0
countchars = 0
start = time.time()
with open("filename.txt", "r", encoding="utf-8") as file:
    for i, line in enumerate(csv.reader(file, delimiter="\t")):
        words = str(line).split()  # we allocate just 1 extra string
        wordsLen = len(words)
        countwords += wordsLen
        # To avoid a possible allocation we iterate through the chars of the
        # words we already have, then add the spaces in between, which is
        # wordsLen - 1
        countchars += sum(1 for _ in itertools.chain.from_iterable(words)) + wordsLen - 1
        if i % 10000 == 0:
            print(i)
end = time.time()
if i > 0:
    print(i)
    print(countwords/i)
    print(countchars/i)
    print(end-start)
I managed to write another fast version (using an idea I saw in a different thread), but it currently has a disadvantage compared to Kingsley's numpy code: it does not save data per line, only aggregate data. In any case, here it is:
import time

start = time.time()
f = open("filename.txt", 'rb')
lines = 0
charcount = 0
wordcount = 0
# i = 10000
buf_size = 1024 * 1024
read_f = f.raw.read
buf = read_f(buf_size)
while buf:
    lines += buf.count(b'\n')  # count newlines, not tabs, to get the line total
    '''while lines/i > 1:
        print(i)
        i += 10000'''
    charcount += len(buf.strip())
    wordcount += len(buf.strip().split())
    buf = read_f(buf_size)
end = time.time()

print(end - start)
print(lines)
print(charcount/lines)
print(wordcount/lines)
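If the per-line figures are needed as well, the same buffered read can be combined with a newline split. This is a rough sketch of that idea (my own adaptation, not part of the original answer), carrying any incomplete trailing line over to the next chunk:

import time

start = time.time()
countwords = []   # words per line
countchars = []   # chars per line
buf_size = 1024 * 1024
remainder = b''
with open("filename.txt", 'rb') as f:
    buf = f.read(buf_size)
    while buf:
        chunk = remainder + buf
        pieces = chunk.split(b'\n')
        remainder = pieces.pop()          # last piece may be an unfinished line
        for line in pieces:
            countwords.append(len(line.split()))
            countchars.append(len(line))
        buf = f.read(buf_size)
if remainder:                             # final line without a trailing newline
    countwords.append(len(remainder.split()))
    countchars.append(len(remainder))
end = time.time()

print(end - start)
print(len(countwords))
print(sum(countwords) / len(countwords))
print(sum(countchars) / len(countchars))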

Unable to save the file correctly

I have a text file that contains a story, and I want to find the word "like", get the next word after it, and call a function that finds synonyms for that word. Here is my code:
file = 'File1.txt'
with open(file, 'r') as open_file:
    read_file = open_file.readlines()

output_lines = []
for line in read_file:
    words = line.split()
    for u, word in enumerate(words):
        if 'like' == word:
            next_word = words[u + 1]
            find_synonymous(next_word)
    output_lines.append(' '.join(words))
    with open(file, 'w') as open_file:
        open_file.write(' '.join(words))
I think my only problem is in the text itself, because when I write one sentence including the word "like" it works (for example, 'I like movies'). But when the file contains a lot of sentences and I run the code, it deletes all the text. Does anyone know where the problem could be?
You have a couple of problems. find_synonymous(next_word) doesn't replace the word in the list, so at best you will get the original text back. You do open(file, 'w') inside the for loop, so the file is overwritten for each line. next_word = words[u + 1] will raise an IndexError if 'like' happens to be the last word on the line, and you don't handle the case where the thing that is liked continues on the next line.
In this example, I track an "is_liked" state. If a word is in the like state, it is converted. That way you can handle sentences that are split across lines and don't have to worry about index errors. The list is written to the file outside the loop.
file = 'File1.txt'
with open(file, 'r') as open_file:
    read_file = open_file.readlines()

output_lines = []
is_liked = False
for line in read_file:
    words = line.split()
    for u, word in enumerate(words):
        if is_liked:
            words[u] = find_synonymous(word)
            is_liked = False
        else:
            is_liked = 'like' == word
    output_lines.append(' '.join(words) + '\n')

with open(file, 'w') as open_file:
    open_file.writelines(output_lines)
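find_synonymous isn't shown in the question, so to try the snippet end to end you could stub it with a placeholder (purely hypothetical, to be replaced by the real synonym lookup):

def find_synonymous(word):
    # Placeholder: just marks the word so the replacement is visible in the output.
    return word.upper()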

get words before and after a specific word in text files

I have a folder containing some other folders, and each of them contains a lot of text files, about 32214 files in total. I want to print 5 words before and after a specific word, and my code should read all of these files. The code below works, but it takes about 8 hours to read all of the files and extract the sentences. How can I change the code so that it reads and prints the sentences in just a few minutes? (The language is Persian.)
.
.
.
def extact_sentence():
    f = open("پاکت", "w", encoding="utf-8")
    y = "پاکت"
    text = normal_text(folder_path)  # the first function to normalize the files
    for i in text:
        for line in i:
            split_line = line.split()
            if y in split_line:
                index = split_line.index(y)
                d = ' '.join(split_line[max(0, index-5):min(index+6, len(split_line))])
                f.write(d + "\n")
    f.close()
Use os.walk to access all the files. Then use a rolling window over each file, and check the middle word of each window:
import os

def getRollingWindow(seq, w):
    win = [next(seq) for _ in range(w)]   # 'w' is the window size
    yield win
    for e in seq:
        win[:-1] = win[1:]
        win[-1] = e
        yield win

def extractSentences(rootDir, searchWord):
    with open("پاکت", "w", encoding="utf-8") as outfile:
        for root, _dirs, fnames in os.walk(rootDir):
            for fname in fnames:
                print("Looking in", os.path.join(root, fname))
                with open(os.path.join(root, fname), encoding="utf-8") as infile:
                    words = (word for line in infile for word in line.split())
                    for window in getRollingWindow(words, 11):
                        if window[5] != searchWord:
                            continue
                        outfile.write(' '.join(window) + "\n")
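Hypothetical usage, assuming folder_path is the top-level folder from the question:

extractSentences(folder_path, "پاکت")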

Python edit specific row and column of csv file

I have some Python code here that aims to take user input, target a specific row of a CSV, and overwrite it if a certain letter matches.
import csv

line_count = 0
marked_item = int(input("Enter the item number:"))

with open("items.csv", 'r') as f:
    reader = csv.reader(f, delimiter=',')
    title = next(reader)
    print(title)
    print(title[3])
    lines = []
    for line in reader:
        if title[3] == 'a':
            line_count += 1
            if marked_item == line_count:
                title[3] = 'b'
        lines.append(line)

with open("items.csv", 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(title)
    writer.writerows(lines)
This code works almost the way I want it to, but it can only edit the first row. An example of the output of this code is:
red,12.95,2,b    # note: this changed from 'a' to 'b'
blue,42.5,3,a    # how can I target this row and change it?
green,40.0,1,a
The problem I then have is targeting another row, such as 'blue,42.5,3,a'. My code is unable to target that row and change its value 'a' to 'b'.
You're iterating on line but you change title. Do this:
for line in reader:
    if len(line) > 3 and line[3] == 'a':
        line_count += 1
        if marked_item == line_count:
            line[3] = 'b'
    lines.append(line)
and drop the title = next(reader) since you don't have a title.
Full fixed code for input CSVs that don't have a title line:
import csv

line_count = 0
marked_item = int(input("Enter the item number:"))

with open("items.csv", 'r') as f:
    reader = csv.reader(f, delimiter=',')
    lines = []
    for line in reader:
        if len(line) > 3 and line[3] == 'a':
            line_count += 1
            if marked_item == line_count:
                line[3] = 'b'
        lines.append(line)

with open("items.csv", 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(lines)

How can I simplify and format this function?

So I have this messy code where I wanted to get every word from frankenstein.txt, sort them alphabetically, eliminate one- and two-letter words, and write them into a new file.
def Dictionary():
    d = []
    count = 0
    bad_char = '~!##$%^&*()_+{}|:"<>?\`1234567890-=[]\;\',./ '
    replace = ' ' * len(bad_char)
    table = str.maketrans(bad_char, replace)

    infile = open('frankenstein.txt', 'r')
    for line in infile:
        line = line.translate(table)
        for word in line.split():
            if len(word) > 2:
                d.append(word)
                count += 1
    infile.close()

    file = open('dictionary.txt', 'w')
    file.write(str(set(d)))
    file.close()

Dictionary()
How can I simplify it and make it more readable, and also how can I make the words write vertically into the new file (it currently writes a horizontal list)? I want:
abbey
abhorred
about
etc....
A few improvements below:
from string import digits, punctuation

def create_dictionary():
    words = set()

    bad_char = digits + punctuation + '...'  # may need more characters
    replace = ' ' * len(bad_char)
    table = str.maketrans(bad_char, replace)

    with open('frankenstein.txt') as infile:
        for line in infile:
            line = line.strip().translate(table)
            for word in line.split():
                if len(word) > 2:
                    words.add(word)

    with open('dictionary.txt', 'w') as outfile:
        outfile.writelines(word + '\n' for word in sorted(words))  # one word per line
A few notes:
follow the style guide
string contains constants you can use to provide the "bad characters";
you never used count (which was just len(d) anyway);
use the with context manager for file handling; and
using a set from the start prevents duplicates, but they aren't ordered (hence sorted).
Using the re module:
import re

words = set()
with open('frankenstein.txt') as infile:
    for line in infile:
        # sets have no extend(); update() adds every matching word
        words.update(x for x in re.split(r'[^A-Za-z]+', line) if len(x) > 2)

with open('dictionary.txt', 'w') as outfile:
    outfile.writelines(word + '\n' for word in sorted(words))
In r'[^A-Za-z]+' in re.split, replace A-Za-z with the characters you want to keep in dictionary.txt.
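For example, to also keep apostrophes inside words (an assumed variation, not from the answer):

import re

line = "the monster's rage didn't end"   # sample text, just for illustration
pattern = r"[^A-Za-z']+"                 # keep letters and apostrophes
print([x for x in re.split(pattern, line) if len(x) > 2])
# ['the', "monster's", 'rage', "didn't", 'end']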
