I want to do text analytics on some text data. The issue is that so far I have worked with a CSV file or just one file, but here I have multiple text files. So my approach is to combine them all into one file and then use nltk for text pre-processing and further steps.
I downloaded the gutenberg package from nltk, and I am not getting any error in the code. But I am not able to see the content of the 1st text file in the 1st cell, the 2nd text file in the 2nd cell, and so on. Kindly help.
import nltk

filenames = [
    "246.txt",
    "276.txt",
    "286.txt",
    "344.txt",
    "372.txt",
    "383.txt",
    "388.txt",
    "392.txt",
    "556.txt",
    "665.txt"
]

with open("result.csv", "w") as f:
    for filename in filenames:
        f.write(nltk.corpus.gutenberg.raw(filename))
Expected result - I should get 1 csv file with the contents of these 10 text files listed in 10 different rows.
import nltk

filenames = [
    "246.txt",
    "276.txt",
    "286.txt",
    "344.txt",
    "372.txt",
    "383.txt",
    "388.txt",
    "392.txt",
    "556.txt",
    "665.txt"
]

with open("result.csv", "w") as f:
    for index, filename in enumerate(filenames):
        f.write(nltk.corpus.gutenberg.raw(filename))
        # Append a comma to the file content when
        # filename is not the last file in the list.
        if index != (len(filenames) - 1):
            f.write(",")
Output:
this,is,a,sentence,spread,over,multiple,files,and,the end
Code and .txt files available at https://github.com/michaelhochleitner/stackoverflow.com-questions-57081411 .
Using Python 2.7.15+ and nltk 3.4.4. I had to move the .txt files to /home/mh/nltk_data/corpora/gutenberg.
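For the record, writing each file's text as its own CSV row (rather than joining contents with bare commas) is easier with the csv module, which quotes embedded commas and newlines for you. A minimal self-contained sketch, with made-up sample files standing in for the Gutenberg texts:

```python
import csv

# Hypothetical sample files standing in for "246.txt", "276.txt", etc.
filenames = ["sample1.txt", "sample2.txt"]
for name, text in zip(filenames, ["first file text", "second file text"]):
    with open(name, "w") as f:
        f.write(text)

# newline="" lets the csv module control line endings;
# each file's raw text becomes one quoted cell on its own row.
with open("result.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for name in filenames:
        with open(name) as f:
            writer.writerow([f.read()])
```

Reading result.csv back with csv.reader then yields one row per source file, which is what the question's "10 different rows" expectation describes.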
Related
I made an SVG in Inkscape. It's a Hiragana text.
Is there a way to batch-export SVG files from a list of Hiragana characters?
That would be 46 Hiragana SVG files.
id="tspan849">あ</tspan></text>
I have written a Python script to accomplish the task.
I hope this script can help people who need it.
import os

# Create the output folder if it is missing.
path = 'output'
if not os.path.isdir(path):
    os.mkdir(path)

# Read the list of Hiragana characters into a list.
ListFileName = 'HiraganaList.txt'
with open(ListFileName, mode="r", encoding="utf-8") as f:
    lines = f.readlines()

# Read the sample (template) file contents into a list.
SampleFileName = 'Hiragana_01.svg'
with open(SampleFileName, mode="r", encoding="utf-8") as s:
    sLines = s.readlines()

# Output directory and file-name variables.
OutFileFolder = path
OutPixFileName = 'Hiragana_'

# The placeholder string to find in the template.
sTokenString = 'あ'
iNum = 1

# Cycle through the list from first line to last line.
for line in lines:
    # Collect this output file's lines.
    OutputContext = []
    # Zero-padded two-digit numbering.
    sNum = '0' + str(iNum)
    # Output file name + path.
    OutFileName = os.path.join(OutFileFolder, OutPixFileName + sNum[-2:] + '.svg')
    # Save a new file.
    with open(OutFileName, mode="w", encoding="utf-8") as w:
        # Cycle through the template contents.
        for sLine in sLines:
            # find() returns -1 when absent, so test >= 0
            # (a plain > 0 would miss a match at column 0).
            if sLine.find(sTokenString) >= 0:
                print('old->' + sLine)
                # Replace the placeholder with the new character.
                sNew = sLine.replace(sTokenString, line.replace('\n', ''))
                print('New->' + sNew)
                OutputContext.append(sNew)
            else:
                OutputContext.append(sLine)
        print(line)
        w.writelines(OutputContext)
    iNum += 1
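The template-substitution idea above can also be sketched more compactly with an f-string for the zero-padded numbering and os.path.join for the path. The one-line template string here is a made-up stand-in for the real Hiragana_01.svg, and the glyph list is a small sample of the 46 kana:

```python
import os

os.makedirs("output", exist_ok=True)

# Hypothetical one-line template standing in for Hiragana_01.svg.
template = '<text><tspan id="tspan849">あ</tspan></text>'
glyphs = ["あ", "い", "う"]  # a small sample of the 46 kana

for i, glyph in enumerate(glyphs, start=1):
    # {i:02d} zero-pads the counter to two digits: 01, 02, ...
    out_name = os.path.join("output", f"Hiragana_{i:02d}.svg")
    with open(out_name, "w", encoding="utf-8") as w:
        w.write(template.replace("あ", glyph))
```

This produces output/Hiragana_01.svg through output/Hiragana_03.svg, each with the placeholder glyph replaced.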
Given test data from this link:
I would like to read an Excel file and loop over all the URLs, then download the PDF files and name each with a combination of city-type-year-quarter.pdf, i.e. for the first file it would be guangzhou-retail-2021-q2.pdf.
How could I do that based on the code below? Thanks.
Updated code:
import pandas as pd
import requests

df = pd.read_excel('test1.xlsx')
urls = df['url'].tolist()
# df.columns

for index, row in df.iterrows():
    with open(f"{}_{}_{}_{}.pdf".format(row.city, row.type, row.year, row.quarter), "wb") as f:
        f.write(requests.get(row['url']).content)
Out:
SyntaxError: f-string: empty expression not allowed
Reference link:
Loop url from dataframe and download pdf files in Python
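In case it helps later readers: the SyntaxError comes from mixing f-string braces with .format(). Use one style or the other, with the expressions inside the braces for the f-string form. A minimal sketch of just the naming fix (the row values here are a plain dict made up to mirror the question's first row):

```python
# A plain dict standing in for one DataFrame row from the question.
row = {"city": "guangzhou", "type": "retail", "year": 2021, "quarter": "q2"}

# f-string: the expressions go inside the braces ...
fname1 = f"{row['city']}-{row['type']}-{row['year']}-{row['quarter']}.pdf"

# ... or str.format with empty braces, but not both at once.
fname2 = "{}-{}-{}-{}.pdf".format(row["city"], row["type"], row["year"], row["quarter"])

print(fname1)  # guangzhou-retail-2021-q2.pdf
```

With the real DataFrame, either form would replace the broken open(...) argument in the loop unchanged.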
I am a little bit confused about how to read all lines in many files, where the file names range from "datalog.txt.98" to "datalog.txt.120".
This is my code:
import json

file = "datalog.txt."
i = 97
for line in file:
    i += 1
    f = open(line + str(i), 'r')
    for row in f:
        print(row)
Here, you will find an example of one line in one of those files:
I really need your help.
I suggest using a loop to open the multiple files with their different numeric suffixes.
To better understand this project I would recommend researching the following topics
for loops,
String manipulation,
Opening a file and reading its content,
List manipulation,
String parsing.
This is one of my favourite beginner guides.
To control the integers at the end of the file name, I would look into Python for loops.
I think this is what you are trying to do:
# create a list to store all your file content
files_content = []

# the prefix is of type string
filename_prefix = "datalog.txt."

# loop over the suffixes 98 to 120 (range's end is exclusive)
for i in range(98, 121):
    # build the filename from the prefix and
    # the integer i, which needs converting to a string
    filename = filename_prefix + str(i)
    # open the file and read all the lines into a variable
    with open(filename) as f:
        content = f.readlines()
    # append the file content to the files_content list
    files_content.append(content)
To get rid of whitespace from the file parsing, add the missing line before the append:
    content = [x.strip() for x in content]
    files_content.append(content)
Here's an example of printing out files_content:
for file in files_content:
print(file)
I have tens of text files in my local directory, named something like test1, test2, test3, and so on. I would like to read all these files, search for a few strings in each, replace them with other strings, and finally save them back into my directory so that they are named newtest1, newtest2, newtest3, and so on.
For instance, if there were a single file, I would have done the following:
# Read the file
with open('H:\\Yugeen\\TestFiles\\test1.txt', 'r') as file:
    filedata = file.read()

# Replace the target string
filedata = filedata.replace('32-83 Days', '32-60 Days')

# Write the file out again
with open('H:\\Yugeen\\TestFiles\\newtest1.txt', 'w') as file:
    file.write(filedata)
Is there any way that I can achieve this in python?
If you use Python 3, you can use scandir from the os library.
Python 3 docs: os.scandir
With that you can get the directory entries.
with os.scandir('H:\\Yugeen\\TestFiles') as it:
Then loop over these entries; your code could look something like this.
Notice I changed the path in your code to the entry object's path.
import os

# Get the directory entries
with os.scandir('H:\\Yugeen\\TestFiles') as it:
    # Iterate over directory entries
    for entry in it:
        # If not a file, continue to the next iteration.
        # Not needed if you are 100% sure the directory contains only files.
        if not entry.is_file():
            continue
        # Read the file
        with open(entry.path, 'r') as file:
            filedata = file.read()
        # Replace the target string
        filedata = filedata.replace('32-83 Days', '32-60 Days')
        # Write the result to a new file prefixed with "new",
        # so test1.txt becomes newtest1.txt
        new_path = os.path.join(os.path.dirname(entry.path), 'new' + entry.name)
        with open(new_path, 'w') as file:
            file.write(filedata)
If you use Python 2, you can use listdir (also applicable to Python 3).
Python 2 docs: os.listdir
In this case the code structure is the same, but you also need to build the full path to each file, since listdir only returns the filename.
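For the listdir route, a sketch under the same assumptions: a relative TestFiles directory stands in for H:\\Yugeen\\TestFiles, and a sample file is created first so the snippet runs on its own. The key difference from scandir is joining the bare filename back onto the directory:

```python
import os

directory = "TestFiles"  # stand-in for the H:\\Yugeen\\TestFiles path
os.makedirs(directory, exist_ok=True)
with open(os.path.join(directory, "test1.txt"), "w") as f:
    f.write("32-83 Days")

# listdir returns bare names, so join them back onto the directory
for name in os.listdir(directory):
    path = os.path.join(directory, name)
    if not os.path.isfile(path):
        continue
    with open(path, "r") as file:
        filedata = file.read()
    filedata = filedata.replace("32-83 Days", "32-60 Days")
    # save under a "new" prefix, as the question asks
    with open(os.path.join(directory, "new" + name), "w") as file:
        file.write(filedata)
```

After running, TestFiles contains both test1.txt and a newtest1.txt holding the replaced text.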
I have a folder with a couple thousand images named 10000.jpg, 10001.jpg, etc., and a CSV file with two columns: id and name.
The CSV id matches the images in the folder.
I need to rename the images as per the name column in the CSV (e.g. from 10000.jpg to name1.jpg).
I've been trying the os.rename() inside a for loop as per below.
import csv
import os

with open('train_labels.csv') as f:
    lines = csv.reader(f)
    for line in lines:
        os.rename(line[0], line[1])
This gives me an encoding error inside the loop.
Any idea what I'm missing in the logic?
Also tried another strategy (below), but got the error: IndexError: list index out of range.
with open('train_labels.csv', 'rb') as csvfile:
    lines = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for line in lines:
        os.rename(line[0], line[1])
I also got the same error. When I opened the CSV file in Notepad, I found that there was no comma between the ID and the name, so please check that first. Otherwise, you can see the solutions in Renaming images in folder.
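If the CSV does turn out to be comma-separated, a minimal sketch of the rename loop. The one-row CSV and placeholder image are made up here so the snippet runs standalone:

```python
import csv
import os

# Build a tiny sample matching the described layout: id,name per row.
with open("train_labels.csv", "w", newline="") as f:
    csv.writer(f).writerow(["10000.jpg", "name1.jpg"])
open("10000.jpg", "w").close()  # placeholder for the real image

# The default delimiter is already a comma, so no extra arguments are needed.
with open("train_labels.csv", newline="") as f:
    for old_name, new_name in csv.reader(f):
        os.rename(old_name, new_name)

print(os.path.exists("name1.jpg"))  # True
```

Note that open(..., newline="") with a text-mode file is the Python 3 idiom; the 'rb' mode in the second attempt is a Python 2 pattern and makes csv.reader fail on Python 3.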