Change in some character encoding when reading a MS Word file using the python-docx module and saving it back - python-3.x

I am reading a Word file (it contains just one huge table), inserting a blank row after each row in the table, and saving it back. After saving, some characters in the new file have changed. I suspect a change in the encoding is happening.
Here is my code for reading and saving it:
import copy
import os
import docx

def insert_row_in_table(table):
    empty_row = get_empty_row(table)  # this function will return an empty row
    for row in table.rows:
        tr = row._tr
        tr.addnext(copy.deepcopy(empty_row))

def convert(file: str):
    doc = docx.Document(file)
    row_c = 0
    for table in doc.tables:
        insert_row_in_table(table)
    # save file
    file_name = os.path.splitext(file)
    new_name = file_name[0] + '_updated' + file_name[1]
    doc.save(new_name)
This is how it looks when I compare the two files (left side: original file, right side: updated file).
How can I preserve the character encoding or avoid this issue?

Related

How to write .xlsx data to a .txt file ensuring each column has its own text file, then each row is a new line?

I believe I am close to cracking this, but I can't add multiple lines of text to the .txt files. Each column should map to its own .txt file.
import openpyxl
from pathlib import Path
# create workbook
wb = openpyxl.Workbook()
sheet = wb.active
listOfTextFiles = []
# Create a workbook 5x5 with dummy text
for row in range(1, 6):
    for col in range(1, 6):
        file = sheet.cell(row=row, column=col).value = f'text row:{row}, col:{col}'
        listOfTextFiles.append(file)
print(listOfTextFiles) # for testing
wb.save('testSS.xlsx')
for i in range(row): # create 5 text files
    textFile = open(f'ssToTextFile{i}.txt', 'w')
    textFile.write(listOfTextFiles[i])
The output for each text file is below. I know it has something to do with the line 'textFile.write(listOfTextFiles[i])', and I've tried many variations, such as replacing [i] with [j] or [file]. I think I am overwriting the text on each pass through the loop.
Current output:
ssToTextFile.txt -> text row:1, col:1
What I want the output to be in each .txt file:
ssToTextFile.txt -> text row:1, col:1
text row:2, col:1
text row:3, col:1
text row:4, col:1
text row:5, col:1
Then, the next .txt file to be:
text row:1, col:2
text row:2, col:2 etc
I would appreciate any feedback and an explanation of the logic behind it.
Solved. By using sheet.columns in the outer loop, I could use [x-1] as the index.
for x in range(sheet.min_row, sheet.max_row + 1):
    # x counts rows but indexes columns, which works here because the sheet is square (5x5)
    with open(f'ssToTextFile{x-1}.txt', 'w') as textFile:
        for y in list(sheet.columns)[x-1]:
            textFile.write(str(y.value) + '\n')
            print(y.value)
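The column-per-file idea can also be sketched without relying on the sheet being square. This is only a minimal sketch, assuming the cell values have already been read into a list of rows (with openpyxl that could be list(sheet.iter_rows(values_only=True))). The function name write_columns_to_files and its prefix parameter are made up for illustration and are not part of the original solution.

```python
from pathlib import Path
import tempfile


def write_columns_to_files(rows, out_dir, prefix="ssToTextFile"):
    """Write each column of `rows` to its own text file, one cell per line."""
    out_dir = Path(out_dir)
    paths = []
    # zip(*rows) transposes the grid: it yields one tuple per column.
    for index, column in enumerate(zip(*rows)):
        path = out_dir / f"{prefix}{index}.txt"
        path.write_text("".join(f"{value}\n" for value in column))
        paths.append(path)
    return paths


# Hypothetical 5x5 grid matching the dummy text used in the question.
rows = [[f"text row:{r}, col:{c}" for c in range(1, 6)] for r in range(1, 6)]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_columns_to_files(rows, tmp)
    first_file = paths[0].read_text()
    file_count = len(paths)
```

Because the file is opened once per column and every cell is written before it is closed, nothing is overwritten between iterations.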

How do I convert multiple multiline txt files to Excel, ensuring each file is its own column and each line of text is its own row? Python3

Using openpyxl and Path I aim to:
Create multiple multiline .txt files,
then insert .txt content into a .xlsx file ensuring file 1 is in column 1 and each line has its own row.
My idea was to create a nested list and then loop through it to insert the text. I cannot figure out how to ensure that every string in the nested list is displayed. This is what I have so far. It nearly does what I want, but it just repeats the first line of text.
from pathlib import Path
import openpyxl
listOfText = []
wb = openpyxl.Workbook() # Create a new workbook to insert the text files
sheet = wb.active
for txtFile in range(5): # create 5 text files
    createTextFile = Path('textFile' + str(txtFile) + '.txt')
    createTextFile.write_text(f'''Hello, this is a multiple line text file.
My Name is x.
This is text file {txtFile}.''')
    readTxtFile = open(createTextFile)
    listOfText.append(readTxtFile.readlines()) # nest the list from each text file into a parent list
    textFileList = len(listOfText[txtFile]) # get the number of lines of text from the file. They are all 3 as made above
# Each column displays text from each text file
for row in range(1, txtFile + 1):
    for col in range(1, textFileList + 1):
        sheet.cell(row=row, column=col).value = listOfText[txtFile][0]
wb.save('importedTextFiles.xlsx')
The output is 4 columns/4 rows. All of which say the same 'Hello, this is a multiple line text file.'
Appreciate any help with this!
The problem is in the for loop that writes the cells. Change the line sheet.cell(row=row, column=col).value = listOfText[txtFile][0] to sheet.cell(row=col, column=row).value = listOfText[row-1][col-1] and it will work.
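The index swap in that fix can be checked on its own, without a workbook. The following is a minimal sketch, assuming each file's lines are already held in a nested list. cell_assignments is a hypothetical helper that only yields the (row, column, value) triples one would pass to sheet.cell.

```python
def cell_assignments(list_of_text):
    """Yield (row, column, value): file i goes to column i, line j to row j."""
    for col, lines in enumerate(list_of_text, start=1):
        for row, line in enumerate(lines, start=1):
            # rstrip drops the trailing newline that readlines() keeps
            yield row, col, line.rstrip("\n")


# Two hypothetical files with the same three lines as in the question.
list_of_text = [
    [
        "Hello, this is a multiple line text file.\n",
        "My Name is x.\n",
        f"This is text file {n}.",
    ]
    for n in range(2)
]
cells = list(cell_assignments(list_of_text))
```

Each triple would then be applied as sheet.cell(row=row, column=col).value = value, so the file index selects the column and the line index selects the row.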

Truncating cells in csv file

I'm currently trying to create a piece of software that finds and truncates cells containing more than a set number of characters in .csv files.
Here's where I'm at:
import csv
with open('test.csv', 'r', newline = '', encoding = "UTF-8") as csv_file, \
     open('output.csv', 'x',newline='',encoding="UTF-8") as output_file:
    dialect = csv.Sniffer().sniff(csv_file.read(2048))
    dialect.escapechar = '\\'
    csv_file.seek(0)
    writer = csv.writer(output_file, dialect)
    for row in csv.reader(csv_file, dialect) :
        copy = row
        for col in copy :
            #truncate the file to desired length
            col = col[:253] + (col[:253] and '..')
        writer.writerow(copy)
The problem here is that the new file is created but not changed.
Thanks for your consideration.
The problem is that you rebind the name col. Assigning to col inside the loop does not change the list, so the list still holds the old values. It is best to rebuild the list, which is most easily done with a list comprehension:
copy = [col[:253] + (col[:253] and '..') for col in copy]
What's more, giving the altered value the same name as the loop variable does not help. Naming it col does not mean that the item the loop variable came from (the value in the list copy) is replaced.
That's also why you don't need copy = row. You can just use row.
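Putting the list comprehension into a complete loop might look like the sketch below. It is not the asker's final code: truncate_cells, max_len and marker are names chosen here for illustration. One assumption is made explicit: the expression col[:253] and '..' is truthy for every non-empty cell, so it would append '..' to short cells too, and the sketch therefore uses an explicit length check instead.

```python
import csv
import io


def truncate_cells(reader, writer, max_len=253, marker=".."):
    """Copy rows from reader to writer, shortening cells longer than max_len."""
    for row in reader:
        # Build a new list; rebinding the loop variable would not change the row.
        writer.writerow(
            [col[:max_len] + marker if len(col) > max_len else col for col in row]
        )


# In-memory stand-ins for test.csv and output.csv.
source = io.StringIO("short,{}\r\n".format("x" * 300), newline="")
target = io.StringIO(newline="")
truncate_cells(csv.reader(source), csv.writer(target), max_len=10)
result = target.getvalue()
```

With real files, the reader and writer would be built from the opened file objects and the sniffed dialect exactly as in the question.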

Issue saving a list of strings to a text file

I am trying to save and read the strings which are saved in a text file.
a = [['str1','str2','str3'],['str4','str5','str6'],['str7','str8','str9']]
file = 'D:\\Trails\\test.txt'
# writing list to txt file
thefile = open(file,'w')
for item in a:
    thefile.write("%s\n" % item)
thefile.close()
#reading list from txt file
readfile = open(file,'r')
data = readfile.readlines()#
print(a[0][0])
print(data[0][1]) # display data read
the output:
str1
'
Both a[0][0] and data[0][0] should have the same value, but reading back what I saved does not return the string I stored. What is the mistake in saving the file?
Update:
The 'a' array holds strings of different lengths. What changes can I make when saving the file so that the output is the same?
Update:
I have changed the code to save the file as CSV instead of text, using this link. How can I save the data in the case of a text file?
You can write the list directly to the file and use the eval function to turn the saved text back into a list. This is not recommended (eval executes whatever the file contains), but the following code works.
a = [['str1','str2','str3'],['str4','str5','str6'],['str7','str8','str9']]
file = 'test.txt'
# writing list to txt file
thefile = open(file,'w')
thefile.write("%s" % a)
thefile.close()
#reading list from txt file
readfile = open(file,'r')
data = eval(readfile.readline())
print(data)
print(a[0][0])
print(data[0][1]) # display data read
print(a)
print(data)
a and data will not have the same value, because a is a list of three lists,
whereas data is a list of three strings.
readfile.readlines() or list(readfile) reads all lines into a list.
So, when you run data = readfile.readlines(), Python treats ['str1', 'str2', 'str3']\n as a single string and not as a list.
To get your desired output, you can use the following print statement:
print(data[0][2:6])
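Neither eval nor string slicing is needed if the list is stored in a structured text format. The following is a minimal sketch using the standard json module, which none of the answers above used, so treat it as one possible alternative. The temporary directory is only there to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path

a = [['str1', 'str2', 'str3'], ['str4', 'str5', 'str6'], ['str7', 'str8', 'str9']]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'test.txt'  # hypothetical location for the text file
    # json.dumps turns the nested list into text; json.loads parses it back.
    path.write_text(json.dumps(a))
    data = json.loads(path.read_text())

print(a[0][0])
print(data[0][0])
```

Because the file is parsed rather than sliced, strings of different lengths come back unchanged.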

python csv format all rows to one line

I have a CSV file in which I would like to get each row onto one line. I've tried importing it into MS Excel and formatting it with Notepad++. However, each time, a piece of data is treated as a new row.
How can I format the file with Python's csv module so that it removes the string "BRAS" and corrects the format? Each row is enclosed in double quotes (") and the delimiter is a pipe (|).
Update:
"aa|bb|cc|dd|
ee|ff"
"ba|bc|bd|be|
bf"
"ca|cb|cd|
ce|cf"
The above is supposed to be 3 rows, but my editors see it as 5 or 6 rows, and so forth.
import csv
import fileinput
with open('ventoya.csv') as f, open('ventoya2.csv', 'w') as w:
    for line in f:
        if 'BRAS' not in line:
            w.write(line)
N.B. I get a Unicode error when I try to run this in Python:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 18: character maps to <undefined>
This is a quick hack for small input files (the content is read to memory).
#!python2
fnameIn = 'ventoya.csv'
fnameOut = 'ventoya2.csv'
with open(fnameIn) as fin, open(fnameOut, 'w') as fout:
    data = fin.read() # content of the input file
    data = data.replace('\n', '') # make it one line
    data = data.replace('""', '|') # split char instead of doubled ""
    data = data.replace('"', '') # remove the first and last "
    print data
    for x in data.split('|'): # split by bar
        fout.write(x + '\n') # write to separate lines
Or if the goal is only to fix the extra (unwanted) newline to form a single-column CSV file, the file can be fixed first, and then read through the csv module:
#!python2
import csv
fnameIn = 'ventoya.csv'
fnameFixed = 'ventoyaFixed.csv'
fnameOut = 'ventoya2.csv'
# Fix the input file.
with open(fnameIn) as fin, open(fnameFixed, 'w') as fout:
    data = fin.read() # content of the file
    data = data.replace('\n', '') # remove the newlines
    data = data.replace('""', '"\n"') # add the newlines back between the cells
    fout.write(data)
# It is an overkill, but now the fixed file can be read using
# the csv module.
with open(fnameFixed, 'rb') as fin, open(fnameOut, 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        writer.writerow(row)
To solve this you do not even need code.
1: Just open the file in Notepad++
2: In the first line, select from the | symbol to the start of the next line
3: Go to Replace and replace the selected text with |
The search mode can be Normal or Extended :)
Well, since the line breaks are consistent, you could go in and do find/replace as suggested, but you could also do a quick conversion with your python script:
import csv
import fileinput
linecount = 0
with open('ventoya.csv') as f, open('ventoya2.csv', 'w') as w:
    for line in f:
        line = line.rstrip()
        # remove unwanted breaks by concatenating pairs of rows
        if linecount%2 == 0:
            line1 = line
        else:
            full_line = line1 + line
            full_line = full_line.replace(' ','')
            # remove spaces from front of 2nd half of line
            # if you want comma delimiters, uncomment next line:
            # full_line = full_line.replace('|',',')
            if 'BRAS' not in full_line:
                w.write(full_line + '\n')
        linecount += 1
This works for me with the test data, and you can change the delimiters while writing to the file if you want. The nice things about doing it with code are: 1. you can do it with code (always fun), and 2. you can remove the line breaks and filter the content written to the file at the same time.
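Since every row in the sample is wrapped in double quotes, the csv module itself can keep the embedded line break inside the field, which avoids pairing lines by hand. The sketch below is a Python 3 variant under that assumption and is not taken from any of the answers above. read_pipe_rows is a hypothetical helper name.

```python
import csv
import io


def read_pipe_rows(stream):
    """Yield one list of cells per quoted record, ignoring embedded newlines."""
    # Each record is one quoted field, so csv.reader keeps the line break
    # inside the field instead of starting a new row.
    for record in csv.reader(stream, delimiter='|', quotechar='"'):
        if not record:
            continue  # skip blank lines
        text = '|'.join(record).replace('\r', '').replace('\n', '')
        yield text.split('|')


# The three sample rows from the question, as an in-memory stand-in for the file.
sample = '"aa|bb|cc|dd|\nee|ff"\n"ba|bc|bd|be|\nbf"\n"ca|cb|cd|\nce|cf"\n'
rows = list(read_pipe_rows(io.StringIO(sample, newline='')))
for row in rows:
    print(row)
```

With a real file, it would be opened with newline='' and an explicit encoding argument. The UnicodeDecodeError in the question suggests the default codec does not match the file, but the correct encoding cannot be determined from the question.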
