I have many text files. I tried to convert the txt files into a single CSV file, but it is taking a very long time. I left the code running overnight; by morning it had processed only 4,500 files and was still running.
Is there any way to convert the text files to CSV faster?
Here is my code:
import pandas as pd
import os
import glob
from tqdm import tqdm
# create empty dataframe
csvout = pd.DataFrame(columns =["ID","Delivery_person_ID" ,"Delivery_person_Age" ,"Delivery_person_Ratings","Restaurant_latitude","Restaurant_longitude","Delivery_location_latitude","Delivery_location_longitude","Order_Date","Time_Orderd","Time_Order_picked","Weather conditions","Road_traffic_density","Vehicle_condition","Type_of_order","Type_of_vehicle", "multiple_deliveries","Festival","City","Time_taken (min)"])
# get list of files
file_list = glob.glob(os.path.join(os.getcwd(), "train/", "*.txt"))
for filename in tqdm(file_list):
    # next file/record
    mydict = {}
    with open(filename) as datafile:
        # read each line and split on the first " " space
        for line in tqdm(datafile):
            # Note: partition returns 3 string parts: "key", " ", "value";
            # the slice [::2] (step=+2) keeps only the 1st and 3rd items
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    # put dictionary in dataframe
    csvout = csvout.append(mydict, ignore_index=True)
# write to csv
csvout.to_csv("train.csv", sep=";", index=False)
Here is my example text file.
ID 0xb379
Delivery_person_ID BANGRES18DEL02
Delivery_person_Age 34.000000
Delivery_person_Ratings 4.500000
Restaurant_latitude 12.913041
Restaurant_longitude 77.683237
Delivery_location_latitude 13.043041
Delivery_location_longitude 77.813237
Order_Date 25-03-2022
Time_Orderd 19:45
Time_Order_picked 19:50
Weather conditions Stormy
Road_traffic_density Jam
Vehicle_condition 2
Type_of_order Snack
Type_of_vehicle scooter
multiple_deliveries 1.000000
Festival No
City Metropolitian
Time_taken (min) 33.000000
CSV is a very simple data format: just text and separators, so you don't need any sophisticated tools to handle it.
In your hopefully simple case there is no need for pandas or dictionaries.
The exception is if your data files are corrupt, missing some columns or carrying extra ones to skip. But even then you can handle such issues better in your own code, where you have more control over them, and still get results within seconds.
Assuming your data files are not corrupt, with all columns present and in the right order (so you can rely on their formatting), just try this code:
from time import perf_counter as T
import glob, os

sT = T()
filesProcessed = 0
columns = ["ID","Delivery_person_ID","Delivery_person_Age","Delivery_person_Ratings","Restaurant_latitude","Restaurant_longitude","Delivery_location_latitude","Delivery_location_longitude","Order_Date","Time_Orderd","Time_Order_picked","Weather conditions","Road_traffic_density","Vehicle_condition","Type_of_order","Type_of_vehicle","multiple_deliveries","Festival","City","Time_taken (min)"]
file_list = glob.glob(os.path.join(os.getcwd(), "train/", "*.txt"))
csv_lines = []
csv_line_counter = 0
for filename in file_list:
    filesProcessed += 1
    with open(filename) as datafile:
        csv_line = ""
        for line in datafile.read().splitlines():
            # keep only the part after the first space
            # (caveat: with multi-word keys like "Weather conditions",
            # the tail of the key ends up in the value)
            var = line.partition(" ")[-1]
            csv_line += var.strip() + ';'
        csv_lines.append(str(csv_line_counter) + ';' + csv_line[:-1])
        csv_line_counter += 1
with open("train.csv", "w") as csvfile:
    csvfile.write(';' + ';'.join(columns) + '\n')
    csvfile.write('\n'.join(csv_lines))
eT = T()
print(f'> {filesProcessed=}, {(eT-sT)=:8.6f}')
I guess you will get the result at a speed beyond your expectations (in seconds, not minutes or hours).
On my computer, extrapolating from the processing time for 100 files, 50,000 files should take about 3 seconds.
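For the corrupt-file case mentioned above, here is a sketch that tolerates missing or extra columns. It reuses the columns list from the code above; splitting on the last space is an assumption that only holds if the values themselves contain no spaces, and it keeps multi-word keys like "Weather conditions" intact:

import csv
import glob, os

def file_to_record(filename):
    # parse one key/value-per-line file into a dict; split on the LAST
    # space so multi-word keys like "Weather conditions" stay whole
    # (assumes the values themselves contain no spaces)
    record = {}
    with open(filename) as datafile:
        for line in datafile:
            key, _, value = line.rstrip("\n").rpartition(" ")
            record[key.strip()] = value.strip()
    return record

with open("train.csv", "w", newline="") as csvfile:
    # restval fills missing columns, extrasaction skips unknown ones
    writer = csv.DictWriter(csvfile, fieldnames=columns, delimiter=";",
                            restval="", extrasaction="ignore")
    writer.writeheader()
    for filename in glob.glob(os.path.join(os.getcwd(), "train/", "*.txt")):
        writer.writerow(file_to_record(filename))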
I could not replicate the problem. I took the example data file and created 5000 copies of it. Then I ran your code with tqdm and without it. Below is the version without:
import time
import csv
import os
import glob
import pandas as pd
from tqdm import tqdm
csvout = pd.DataFrame(columns=["ID","Delivery_person_ID","Delivery_person_Age","Delivery_person_Ratings","Restaurant_latitude","Restaurant_longitude","Delivery_location_latitude","Delivery_location_longitude","Order_Date","Time_Orderd","Time_Order_picked","Weather conditions","Road_traffic_density","Vehicle_condition","Type_of_order","Type_of_vehicle","multiple_deliveries","Festival","City","Time_taken (min)"])
file_list = glob.glob(os.path.join(os.getcwd(), "sample_files/", "*.txt"))
t1 = time.time()
for filename in file_list:
    # next file/record
    mydict = {}
    with open(filename) as datafile:
        # read each line and split on the first " " space
        for line in datafile:
            # Note: partition returns 3 string parts: "key", " ", "value";
            # the slice [::2] keeps only the 1st and 3rd items
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    # put dictionary in dataframe
    csvout = csvout.append(mydict, ignore_index=True)
# write to csv
csvout.to_csv("train.csv", sep=";", index=False)
t2 = time.time()
print(t2 - t1)
The times I got were:
tqdm: 33 seconds
no tqdm: 34 seconds
Then I ran a version using the csv module:
t1 = time.time()
with open('output.csv', 'a', newline='') as csv_file:
    columns = ["ID","Delivery_person_ID","Delivery_person_Age","Delivery_person_Ratings","Restaurant_latitude","Restaurant_longitude","Delivery_location_latitude","Delivery_location_longitude","Order_Date","Time_Orderd","Time_Order_picked","Weather conditions","Road_traffic_density","Vehicle_condition","Type_of_order","Type_of_vehicle","multiple_deliveries","Festival","City","Time_taken (min)"]
    # extrasaction='ignore' keeps writerow() from raising on keys that are
    # not in fieldnames (partitioning on the first space truncates the
    # multi-word keys "Weather conditions" and "Time_taken (min)")
    d_Writer = csv.DictWriter(csv_file, fieldnames=columns, delimiter=',',
                              extrasaction='ignore')
    d_Writer.writeheader()
    for filename in file_list:
        mydict = {}  # reset per file so values never carry over between rows
        with open(filename) as datafile:
            for line in datafile:
                name, var = line.partition(" ")[::2]
                mydict[name.strip()] = var.strip()
        d_Writer.writerow(mydict)
t2 = time.time()
print(t2 - t1)
The time for this was:
csv: 0.32 seconds
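Most of that difference comes from csvout.append(...) inside the loop: each call copies the entire DataFrame, so the run time grows quadratically with the number of files (DataFrame.append was deprecated and then removed in pandas 2.0). If you want to stay in pandas, a sketch of the usual pattern, reusing file_list and the columns list from above, is to collect the dicts and build the frame once:

records = []
for filename in file_list:
    mydict = {}
    with open(filename) as datafile:
        for line in datafile:
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    records.append(mydict)
# one DataFrame construction instead of one copy per file
csvout = pd.DataFrame(records, columns=columns)
csvout.to_csv("train.csv", sep=";", index=False)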
Try it like this.
import glob

with open('my_file.csv', 'a') as csv_file:
    # append the raw contents of every .txt file to one output file
    for path in glob.glob('./*.txt'):
        with open(path) as txt_file:
            txt = txt_file.read() + '\n'
            csv_file.write(txt)
I have a 10000*3 data array that I need to save as a CSV file. Below is my pseudo code, but I don't know how to achieve this with only numpy (not pandas). Does anyone know how to do it?
import numpy as np
import time

arr = np.random.randn(10000, 3)
# Need arr[0, 0] = time.time() and arr[i, 0] = time.time() + i
arr[:, 0] = time.time() + 1
# After that, column 1 needs to become a datetime string (like "2021-02-12 12:12:12"):
# arr[:, 0] needs time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(item)) applied to it.
# The data types are then column 1: str, column 2: float, column 3: float,
# and the result should be saved as a CSV file.
This can be done with the csv module (part of Python's standard library). See this tutorial for examples: https://www.pythontutorial.net/python-basics/python-write-csv-file/ Here is a version filled out against the question's arr (the header names are placeholder assumptions):
import csv
import time

header = ['datetime', 'value1', 'value2']  # example column names
with open('countries.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)
    # write the header
    writer.writerow(header)
    # write the data, formatting column 1 as a datetime string
    for row in arr:
        writer.writerow([time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(row[0])),
                         row[1], row[2]])
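Since the question asked for numpy without pandas, np.savetxt over a string array is another option. A sketch, with assumed header names and output file name:

import numpy as np
import time

arr = np.random.randn(10000, 3)
arr[:, 0] = time.time() + np.arange(len(arr))  # arr[i, 0] = time.time() + i

# format column 1 as datetime strings, keep the floats as text
times = [time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(v)) for v in arr[:, 0]]
out = np.column_stack([times, arr[:, 1].astype(str), arr[:, 2].astype(str)])
np.savetxt('array.csv', out, fmt='%s', delimiter=',',
           header='datetime,value1,value2', comments='')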
I've been self-learning Python for the past month (zero coding experience; Python is my first language) and have finally written my first usable work-related code. I am trying to refine this code for repeated use: it converts comment-based xlsx data to a 'string type' txt file and finally to a word cloud. You can find the working code below.
How the code works:
step 1. xlsx file = 4-column Excel worksheet
step 2. Python extracts all of column 'B'
step 3. converts it into 'str' format, removes spaces & converts it into a txt file
step 4. the word cloud removes words using STOPWORDS
step 5. generates the word cloud according to the format
I would like to refine it in two ways:
changing the file directory in a single step instead of copy-pasting the directory name in multiple places (skip manually changing all file paths)
creating the txt file's name from the xlsx file's name (so I don't have to key it in manually every time)
If anyone has a better way of refining this, please let me know. I am very new to this, so if you need any other information to clarify anything, let me know.
Any help would be greatly appreciated. Thank you all in advance.
import openpyxl as xl
import wordcloud
from wordcloud import WordCloud, STOPWORDS
from matplotlib.pyplot import imread
import jieba
import pandas as pd

# opening the source excel file (repeated steps needed for every different document)
filename = "C:\\Users\\shakesmilk\\Desktop\\staub\\staub天猫商品评论.xlsx"
wb1 = xl.load_workbook(filename)
ws1 = wb1.worksheets[0]

# opening the destination excel file (repeated steps needed for every different document)
filename1 = "C:\\Users\\shakesmilk\\Desktop\\staub\\staub天猫商品评论.xlsx"
wb2 = xl.load_workbook(filename1)
wb2.create_sheet('Sheet2')
ws2 = wb2.worksheets[1]

# calculate total number of rows and columns in the source excel file
mr = ws1.max_row
mc = ws1.max_column
minr = ws2.min_row

# copying the cell values from the source excel file to the destination excel file
for i in range(1, mr + 1):
    for j in range(0, mc + 1):
        # reading cell value from source excel file
        c = ws1.cell(row=i + 1, column=2)
        # writing the read value to destination excel file
        ws2.cell(row=i + 1, column=2).value = c.value

# deleting the first empty column
ws2.delete_cols(1)
# saving the destination excel file
wb2.save(str(filename1))

# converting sheet 2 with pandas to a txt file
df = pd.read_excel(filename, sheet_name=1)
with open("C:\\Users\\shakesmilk\\Desktop\\staub\\file.txt", mode='w', encoding='utf-8') as outfile:
    df.to_string(outfile, header=None, index=None)

# open, read & remove spaces from the txt file
commentfiletxt = "C:\\Users\\shakesmilk\\Desktop\\staub\\file.txt"
with open(commentfiletxt, 'r', encoding='utf-8') as f:
    lines = f.readlines()
# remove spaces
lines = [line.replace(' ', '') for line in lines]
# finally, write the lines back to the file
with open(commentfiletxt, 'w', encoding='utf-8') as f:
    f.writelines(lines)

# txt file generated > next, create the wordcloud
# words to remove from the wordcloud
stopwords = set(STOPWORDS)
stopwords.update(['此用户没有填写评论', 'hellip', 'zwj', '其他特色', '还没用', '非常喜欢', '产品功能', '没有用'])
mask = imread('moon.jpg')
with open(commentfiletxt, 'r', encoding='utf-8') as file:
    text = file.read()
words = jieba.lcut(text)   # precise word segmentation
newtxt = ' '.join(words)   # join with spaces
wd = wordcloud.WordCloud(stopwords=stopwords,
                         font_path="MSYH.TTC",
                         background_color="white",
                         width=800,
                         height=300,
                         max_words=500,
                         max_font_size=200,
                         mask=mask,
                         ).generate(text)
# save the picture
wd.to_file('staub2.png')
I continued reading "Learning Python, 5th Edition", and apparently functions are a good way to make code reusable. I guess no one is interested in noob code, but I'm sure there are a lot of beginners out there trying to refine their code, so I'm answering my own question as I progress through the book; hopefully this helps those in need. P.S. I'm currently reading about classes, so I'm guessing I could transform this code further, but for now, for those in need, this is my 'def' example:
import openpyxl as xl
import wordcloud
from wordcloud import WordCloud, STOPWORDS
from matplotlib.pyplot import imread
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
import jieba
import pandas as pd

def open_excel(filename):
    global wb1, ws1, filenametxt
    wb1 = xl.load_workbook(filename)
    ws1 = wb1.worksheets[0]
    filenametxt = filename
    print('Loading WorkBook Completed')

def create_sheet(filename1):
    global wb2, ws2
    wb2 = xl.load_workbook(filename1)
    wb2.create_sheet('Sheet2')
    ws2 = wb2.worksheets[1]
    print('Sheet 2 Created')
    mr = ws1.max_row
    mc = ws1.max_column
    minr = ws2.min_row
    for i in range(1, mr + 1):
        for j in range(0, mc + 1):
            # reading cell value from source excel file
            c = ws1.cell(row=i + 1, column=2)
            ws2.cell(row=i + 1, column=2).value = c.value
    wb2.save(filename1)
    print("Data Extracted To 'Column B'")
    ws2.delete_cols(1)
    print('Empty Space in Column 1 Deleted')
    wb2.save(filename1)

def create_txtf(tfile):
    global df
    df = pd.read_excel(filenametxt, sheet_name=1)
    with open(tfile, mode='w', encoding='utf-8') as outfile:
        df.to_string(outfile, header=None, index=None)
    print('txt file created as file.txt')

def convert_remove(tfile1):
    # open, read & remove spaces from the txt file
    global commentfiletxt
    commentfiletxt = tfile1
    with open(commentfiletxt, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    # remove spaces
    lines = [line.replace(' ', '') for line in lines]
    print('Empty spaces are removed')
    # finally, write the lines back to the file
    with open(commentfiletxt, 'w', encoding='utf-8') as f:
        f.writelines(lines)
    print('Data is written correctly without spaces')
    return tfile1

def wordcloudpic(picname, maskpathn):
    stopwords = set(STOPWORDS)
    stopwords.update(['此用户没有填写评论', 'hellip', 'zwj', '其他特色', '还没用', '非常喜欢', '产品功能', '没有用',
                      '东西收到了', 'S', 'sode', 'c', 's左右', 'u', 'middot', 'u', 'theta', 'rdquo', 'ldquo',
                      'ec', 'ok', '好评', '不错', '很好', '满意', '好用', '老板大气', '好', 'nbsp'])
    mask = imread(maskpathn)
    mask = mask.astype(np.uint8)
    with open(commentfiletxt, 'r', encoding='utf-8') as file:
        text = file.read()
    words = jieba.lcut(text)   # precise word segmentation
    newtxt = ' '.join(words)   # join with spaces
    wd = wordcloud.WordCloud(stopwords=stopwords,
                             font_path="MSYH.TTC",
                             background_color="white",
                             width=800,
                             height=300,
                             max_words=500,
                             max_font_size=200,
                             mask=mask,
                             ).generate(text)
    # save the picture
    wd.to_file(picname)

if __name__ == "__main__":
    open_excel("C:\\Users\\shakesmilk\\Desktop\\testtest\\test天猫商品评.xlsx")
    create_sheet("C:\\Users\\shakesmilk\\Desktop\\testtest\\test天猫商品评.xlsx")
    create_txtf("C:\\Users\\shakesmilk\\Desktop\\testtest\\file.txt")
    convert_remove("C:\\Users\\shakesmilk\\Desktop\\testtest\\file.txt")
    wordcloudpic('test.png', 'bubble.jpg')
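For the two refinements asked about in the question (changing the directory in one place and deriving the txt file name from the xlsx name), here is a possible sketch using pathlib; it reuses the functions defined above, and the path is just an example:

from pathlib import Path

def derive_paths(xlsx_path):
    # derive the txt path from the xlsx path, e.g.
    # ...\test天猫商品评.xlsx -> ...\test天猫商品评.txt
    xlsx = Path(xlsx_path)
    return str(xlsx), str(xlsx.with_suffix('.txt'))

if __name__ == "__main__":
    # the directory/file name now appears only once
    xlsx_file, txt_file = derive_paths("C:\\Users\\shakesmilk\\Desktop\\testtest\\test天猫商品评.xlsx")
    open_excel(xlsx_file)
    create_sheet(xlsx_file)
    create_txtf(txt_file)
    convert_remove(txt_file)
    wordcloudpic('test.png', 'bubble.jpg')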
I'm writing a script to track my orders from a website. I want to import the order numbers from a txt file, and the script should repeat itself as long as there are order numbers. I wrote code where the script imports this txt file and chooses a random order number, but the script puts all the order numbers together and doesn't separate them. How can I fix this?
This is my code:
import random

f = open("Order#.txt", "r")
OrderNR = f.read()
words = OrderNR.split()
Repeat = len(words)
for i in range(Repeat):
    randomlist = OrderNR
    Orderrandom = random.choice(randomlist)
    Mainlink = 'https://footlocker.narvar.com/footlocker/tracking/startrack?order_number=' + Orderrandom
Instead of using f.read(), try using f.readlines().
# Using readlines(): each line in the file becomes its own list entry
with open('myfile.txt', 'r') as file1:
    Lines = file1.readlines()
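A sketch of how that might slot into the loop from the question (the file name and URL are taken from the question; note that each line still needs strip() to drop its trailing newline):

import random

with open("Order#.txt", "r") as f:
    order_numbers = [line.strip() for line in f.readlines()]

# one random order number per iteration, as in the original loop
for _ in range(len(order_numbers)):
    order = random.choice(order_numbers)
    Mainlink = ('https://footlocker.narvar.com/footlocker/tracking/startrack'
                '?order_number=' + order)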
Try pandas:
import pandas as pd

df = pd.read_csv('Order#.txt', delimiter='\t')
print(df)
That way you can see the TXT file in table format.
I'm using pandas to open a CSV file that contains data from Spotify. Meanwhile, I have a txt file that contains various artist names from that CSV file. What I'm trying to do is get the value from each row of the txt file and automatically search for it with the function I've written.
import pandas as pd
import time

df = pd.read_csv("data.csv")
df = df[['artists', 'name', 'year']]

def buscarA():
    start = time.time()
    newdf = (df.loc[df['artists'].str.contains(art)])
    stop = time.time()
    tempo = (stop - start)
    print(newdf)
    e = ('{:.2f}'.format(tempo))
    print(e)

with open("teste3.txt", "r") as f:
    for row in f:
        art = row
        buscarA()
but the output is always the same:
Empty DataFrame
Columns: [artists, name, year]
Index: []
The problem here is that when Python reads the lines of your file, each row keeps its trailing line break, so you have to strip it off.
Suppose the first line of your teste3.txt file is "James Brown". It is actually read as "James Brown\n" and is therefore not matched in the search.
Changing the last chunk of your code to:
with open("teste3.txt", "r") as f:
for row in f:
art = row.strip()
buscarA()
should work.
I am new to programming, and there is probably an answer to my question somewhere like here, the closest I found after searching for days. Most of the info deals with existing CSVs or hardcoded data. I am trying to make the program create data every time it runs and then work on that data, so I am a little stumped.
The problem:
I can't seem to get Python to attach serial numbers to each entry when I run the program I am making to log my study blocks. It has various fields; the following are two of them:
Date Time
12-03-2018 11:30
Here is the code snippet:
d = ''
while d == '':
    d = input('Date:')
    try:
        valid_date = dt.strptime(d, '%Y-%m-%d')
    except ValueError:
        d = ''
        print('Please input date in YYYY-MM-DD format.')
t = ''
while t == '':
    t = input('Time:')
    try:
        valid_time = dt.strptime(t, '%H:%M')
    except ValueError:
        t = ''
        print('Please input time in HH:MM format.')
header = csv.DictWriter(outfile, fieldnames=['UID', 'Date', 'Time', 'Topic', 'Objective', 'Why', 'Summary'], delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
header.writeheader()
log_input = csv.writer(outfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
log_input.writerow([d, t, topic, objective, why, summary])
outfile.close()
df = pd.read_csv('E:\Coursera\HSU\python\pom_blocks_log.csv')
df = pd.read_csv('E:\pom_blocks_log.csv')
df = df.reset_index()
df.columns[0] = 'UID'
df['UID'] = df.index
print(df)
I get the following error when I run the program with the df block:
TypeError: Index does not support mutable operations
I am new to Python and don't really know how to work with data structures, so I am building small programs to learn. Any help is highly appreciated, and apologies if this is a duplicate; please point me in the right direction.
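For reference, the error occurs because pandas Index objects are immutable, so item assignment like df.columns[0] = 'UID' is rejected. A minimal sketch of the usual fix, using the path from the snippet above:

import pandas as pd

df = pd.read_csv(r'E:\pom_blocks_log.csv')
# Index objects are immutable, so rename through the DataFrame instead:
df = df.rename(columns={df.columns[0]: 'UID'})
# or simply fill the column with a serial number from the row index:
df['UID'] = df.index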
So, I figured it out. Here is the process I followed:
I save the CSV file using the csv module.
I load the CSV file in pandas as a DataFrame.
This lets me append user entries to the CSV every time the program runs, then load it as a DataFrame and use pandas to manipulate the data. I then added a generator to clean the lines of the delimiter character ',', so the file can still be loaded as a DataFrame even though ',' is accepted as valid input for the string columns. Maybe this is a roundabout approach, but it works.
Here is the code:
import csv
from csv import reader
from datetime import datetime
import pandas as pd
import numpy as np

with open(r'E:\Coursera\HSU\08_programming\trLog_df.csv', 'a', encoding='utf-8') as csvfile:
    # Date
    d = ''  # input("Date:")
    while d == '':
        d = input('Date: ')
        try:
            valid_date = datetime.strptime(d, '%Y-%m-%d')
        except ValueError:
            d = ''
            print("Incorrect data format, should be YYYY-MM-DD")
    # Time
    t = ''  # input("Time:")
    while t == '':
        t = input('Time: ')
        try:
            valid_date = datetime.strptime(t, '%H:%M')
        except ValueError:
            t = ''
            print("Incorrect data format, should be HH:MM")
    log_input = csv.writer(csvfile, delimiter=',',
                           quotechar='|', quoting=csv.QUOTE_MINIMAL)
    log_input.writerow([d, t])

# Function to clean the lines of the delimiter ','
def merge_last(file_name, merge_after_col=7, skip_lines=0):
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            if i < 2:
                continue
            spl = line.strip().split(',')
            yield (*spl[:merge_after_col], ','.join(spl[merge_after_col:2]))

# Generator to clean the lines
gen = merge_last(r'E:\Coursera\HSU\08_programming\trLog_df.csv', 1)
# get the column names
header = next(gen)
# create the data frame
df = pd.DataFrame(gen, columns=header)
df.head()
print(df)
If anybody has a better solution, it would be enlightening to know how to do this with more efficiency and elegance.
Thank you for reading.
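One simpler route, sketched here and untested against the exact file: the csv module above already quotes comma-containing fields with '|', so telling pandas about that quote character removes the need for the cleanup generator entirely:

import pandas as pd

# read the log written above; quotechar='|' matches the csv.writer settings
df = pd.read_csv(r'E:\Coursera\HSU\08_programming\trLog_df.csv', quotechar='|')
df.insert(0, 'UID', range(len(df)))   # attach a serial number to each entry
print(df.head())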