txt file to a specific format - excel

I have a .txt file with some data I would like to convert to xls. The txt file has this format:
1325 2016-09-08 13:42:35
1325 2016-09-08 21:52:24
1325 2016-09-10 13:00:26
1325 2016-09-10 20:47:39
and more data. What I would like is a .xls file that contains, in the first column, the first number from the .txt file; in the second column, the date of the process; in the third column, the time of the first process; and in the fourth column, the last time the process was run. I do this manually because I don't know a lot of programming. The only thing I could do was convert the file to .xls, but it was converted with no changes at all. I found the code I used on the internet. How can I do this?
The code I used is:
from os import listdir
from os.path import isfile, join
import xlwt
import xlrd

mypath = input("Please enter the directory path for the input files: ")
textfiles = [join(mypath, f) for f in listdir(mypath) if isfile(join(mypath, f)) and '.txt' in f]

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

style = xlwt.XFStyle()
style.num_format_str = '#,###0.00'

for textfile in textfiles:
    f = open(textfile, 'r+')
    row_list = []
    for row in f:
        row_list.append(row.split('|'))
    column_list = zip(*row_list)
    workbook = xlwt.Workbook()
    worksheet = workbook.add_sheet('Sheet1')
    i = 0
    for column in column_list:
        for item in range(len(column)):
            value = column[item].strip()
            if is_number(value):
                worksheet.write(item, i, float(value), style=style)
            else:
                worksheet.write(item, i, value)
        i += 1
    workbook.save(textfile.replace('.txt', '.xls'))
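For reference, a minimal sketch of the grouping described above, assuming every line has the form "<number> <date> <time>", that "first" and "last" mean the earliest and latest time recorded for each date, and using placeholder file names:

import xlwt

rows = {}  # (number, date) -> [earliest time, latest time]
with open('input.txt') as f:               # 'input.txt' is a placeholder name
    for line in f:
        parts = line.split()
        if len(parts) != 3:
            continue                       # skip malformed lines
        number, date, time = parts
        key = (number, date)
        if key not in rows:
            rows[key] = [time, time]
        else:
            rows[key][0] = min(rows[key][0], time)   # earliest time that day
            rows[key][1] = max(rows[key][1], time)   # latest time that day

workbook = xlwt.Workbook()
sheet = workbook.add_sheet('Sheet1')
for r, ((number, date), (first, last)) in enumerate(sorted(rows.items())):
    sheet.write(r, 0, number)
    sheet.write(r, 1, date)
    sheet.write(r, 2, first)
    sheet.write(r, 3, last)
workbook.save('output.xls')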

Related

How to convert 50,000 txt files into csv

I have many text files. I tried to convert the txt files into a single CSV file, but it is taking a huge amount of time. I left the code running overnight; it had processed only 4,500 files and was still running in the morning.
Is there any way to convert the text files into CSV faster?
Here is my code:
import pandas as pd
import os
import glob
from tqdm import tqdm

# create empty dataframe
csvout = pd.DataFrame(columns=["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"])

# get list of files
file_list = glob.glob(os.path.join(os.getcwd(), "train/", "*.txt"))

for filename in tqdm(file_list):
    # next file/record
    mydict = {}
    with open(filename) as datafile:
        # read each line and split on " " space
        for line in tqdm(datafile):
            # Note: partition results in 3 string parts: "key", " ", "value"
            # array slice third parameter [::2] means step=+2
            # so only take the 1st and 3rd items
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    # put dictionary in dataframe
    csvout = csvout.append(mydict, ignore_index=True)

# write to csv
csvout.to_csv("train.csv", sep=";", index=False)
Here is my example text file.
ID 0xb379
Delivery_person_ID BANGRES18DEL02
Delivery_person_Age 34.000000
Delivery_person_Ratings 4.500000
Restaurant_latitude 12.913041
Restaurant_longitude 77.683237
Delivery_location_latitude 13.043041
Delivery_location_longitude 77.813237
Order_Date 25-03-2022
Time_Orderd 19:45
Time_Order_picked 19:50
Weather conditions Stormy
Road_traffic_density Jam
Vehicle_condition 2
Type_of_order Snack
Type_of_vehicle scooter
multiple_deliveries 1.000000
Festival No
City Metropolitian
Time_taken (min) 33.000000
CSV is a very simple data format that you don't need any sophisticated tools to handle: just text and separators.
In your hopefully simple case there is no need to use pandas and dictionaries.
The exception is if your data files are corrupt, with some columns missing or some additional columns to skip. But even in that case you can handle such issues better within your own code, so you have more control over it and can get results within seconds.
Assuming your data files are not corrupt and have all columns in the right order, with none missing and none added (so you can rely on their formatting), just try this code:
from time import perf_counter as T
sT = T()
filesProcessed = 0
columns = ["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"]
import glob, os
file_list = glob.glob(os.path.join(os.getcwd(), "train/", "*.txt"))
csv_lines = []
csv_line_counter = 0
for filename in file_list:
    filesProcessed += 1
    with open(filename) as datafile:
        csv_line = ""
        for line in datafile.read().splitlines():
            # print(line)
            var = line.partition(" ")[-1]
            csv_line += var.strip() + ';'
        csv_lines.append(str(csv_line_counter) + ';' + csv_line[:-1])
        csv_line_counter += 1
with open("train.csv", "w") as csvfile:
    csvfile.write(';' + ';'.join(columns) + '\n')
    csvfile.write('\n'.join(csv_lines))
eT = T()
print(f'> {filesProcessed=}, {(eT-sT)=:8.6f}')
I guess you will get the result at a speed beyond your expectations (seconds, not minutes or hours).
On my computer, extrapolating from the processing time for 100 files, 50,000 files should take about 3 seconds.
I could not replicate this. I took the example data file and created 5,000 copies of it. Then I ran your code with tqdm and without. The version below is without tqdm:
import time
import csv
import os
import glob
import pandas as pd
from tqdm import tqdm

csvout = pd.DataFrame(columns=["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"])
file_list = glob.glob(os.path.join(os.getcwd(), "sample_files/", "*.txt"))

t1 = time.time()
for filename in file_list:
    # next file/record
    mydict = {}
    with open(filename) as datafile:
        # read each line and split on " " space
        for line in datafile:
            # Note: partition results in 3 string parts: "key", " ", "value"
            # array slice third parameter [::2] means step=+2
            # so only take the 1st and 3rd items
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    # put dictionary in dataframe
    csvout = csvout.append(mydict, ignore_index=True)
# write to csv
csvout.to_csv("train.csv", sep=";", index=False)
t2 = time.time()
print(t2 - t1)
The times I got were:
tqdm 33 seconds
no tqdm 34 seconds
Then I ran using the csv module:
t1 = time.time()
with open('output.csv', 'a', newline='') as csv_file:
    columns = ["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"]
    mydict = {}
    d_Writer = csv.DictWriter(csv_file, fieldnames=columns, delimiter=',')
    d_Writer.writeheader()
    for filename in file_list:
        with open(filename) as datafile:
            for line in datafile:
                name, var = line.partition(" ")[::2]
                mydict[name.strip()] = var.strip()
        d_Writer.writerow(mydict)
t2 = time.time()
print(t2 - t1)
The time for this was:
csv 0.32231569290161133 seconds.
Try it like this.
import glob

with open('my_file.csv', 'a') as csv_file:
    for path in glob.glob('./*.txt'):
        with open(path) as txt_file:
            txt = txt_file.read() + '\n'
            csv_file.write(txt)

Loop over excel files' paths under a directory and pass them to a data manipulation function in Python

I need to check the excel files under the directory /Users/x/Documents/test/ with the DataCheck function from data_check.py, so I can do data manipulation on many excel files. data_check.py has the following code structure:
import pandas as pd

def DataCheck(filePath):
    df = pd.read_excel(filePath)
    try:
        df = df.dropna(subset=['building', 'floor', 'room'], how='all')
        ...
        ...
        ...
        df.to_excel(writer, 'Sheet1', index=False)

if __name__ == '__main__':
    status = True
    while status:
        rawPath = input(r"")
        filePath = rawPath.strip('\"')
        if filePath.strip() == "":
            status = False
        DataCheck(filePath)
In order to loop over all the excel files' paths under a directory, I use:
import os

directory = '/Users/x/Documents/test/'
for filename in os.listdir(directory):
    if filename.endswith(".xlsx") or filename.endswith(".xls"):
        print(os.path.join(directory, filename))
    else:
        pass
Out:
/Users/x/Documents/test/test 3.xlsx
/Users/x/Documents/test/test 2.xlsx
/Users/x/Documents/test/test 4.xlsx
/Users/x/Documents/test/test.xlsx
But I don't know how to combine the code above so that the excel files' paths are passed to DataCheck(filePath).
Thanks for your kind help in advance.
Call the function with the names instead of printing them:
import os

directory = '/Users/x/Documents/test/'
for filename in os.listdir(directory):
    if filename.endswith(".xlsx") or filename.endswith(".xls"):
        fullname = os.path.join(directory, filename)
        DataCheck(fullname)

what is wrong with this Pandas and txt file code

I'm using pandas to open a CSV file that contains data from Spotify. Meanwhile, I have a txt file that contains various artist names from that CSV file. What I'm trying to do is take the value from each row of the txt file and automatically search for it with the function I've written.
import pandas as pd
import time

df = pd.read_csv("data.csv")
df = df[['artists', 'name', 'year']]

def buscarA():
    start = time.time()
    newdf = (df.loc[df['artists'].str.contains(art)])
    stop = time.time()
    tempo = (stop - start)
    print(newdf)
    e = ('{:.2f}'.format(tempo))
    print(e)

with open("teste3.txt", "r") as f:
    for row in f:
        art = row
        buscarA()
but the output is always the same:
Empty DataFrame
Columns: [artists, name, year]
Index: []
The problem here is that when you read the lines of your file in Python, each row also includes the line break, so you have to strip it off.
Suppose the first line of your teste3.txt file is "James Brown". It would be read as "James Brown\n" and not be recognized in the search.
Changing the last chunk of your code to:
with open("teste3.txt", "r") as f:
for row in f:
art = row.strip()
buscarA()
should work.
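As an optional refinement (not needed for the fix itself), the artist name can be passed as a parameter instead of being read from a global variable, which keeps the function self-contained; a small sketch of that variant:

def buscarA(art):
    start = time.time()
    newdf = df.loc[df['artists'].str.contains(art)]
    print(newdf)
    print('{:.2f}'.format(time.time() - start))   # elapsed search time

with open("teste3.txt", "r") as f:
    for row in f:
        buscarA(row.strip())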

Storing output data in CSV using Python

I have extracted data from different excel sheets spread across different folders. I have organized the folders numerically from 2015 to 2019, and each folder has twelve subfolders (1 to 12). Here's my code:
import os
from os import walk
import pandas as pd

path = r'C:\Users\Sarah\Desktop\IOMTest'
my_files = []
for (dirpath, dirnames, filenames) in walk(path):
    my_files.extend([os.path.join(dirpath, fname) for fname in filenames])

all_sheets = []
for file_name in my_files:
    # Display sheet names using pandas
    pd.set_option('display.width', 300)
    mosul_file = file_name
    xl = pd.ExcelFile(mosul_file)
    mosul_df = xl.parse(0, header=[1], index_col=[0, 1, 2])
    # Read Excel and select columns
    mosul_file = pd.read_excel(file_name, sheet_name=0,
                               index_col=None, na_values=['NA'], usecols="A, E, G, H, L, M")
    # Remove NaN values
    data_mosul_df = mosul_file.apply(pd.to_numeric, errors='coerce')
    data_mosul_df = mosul_file.dropna()
    print(data_mosul_df)
Then I saved the extracted columns in a CSV file:
def save_frames(frames, output_path):
    for frame in frames:
        frame.to_csv(output_path, mode='a+', header=False)

if __name__ == '__main__':
    frames = [pd.DataFrame(data_mosul_df)]
    save_frames(frames, r'C:\Users\Sarah\Desktop\tt\c.csv')
My problem is that when I open the CSV file, it seems that it doesn't store all the data but only the last excel sheet that was read, or sometimes the last two sheets. However, when I print my data in the console (in Spyder), I see that all the data is processed:
data_mosul_df = mosul_file.apply(pd.to_numeric, errors='coerce')
data_mosul_df = mosul_file.dropna()
print(data_mosul_df)
The picture below shows the CSV that gets created. I am wondering if it is because the information is the same from column A to column E, and that is why it gets overwritten?
I would like to know how to modify the code so that it extracts and stores the data chronologically from the folders (2015 to 2019), taking into account the subfolders (1 to 12) in each one, and how to create a CSV that stores all the data. Thank you.
Rewrite your loop:
for file_name in my_files:
    # Display sheet names using pandas
    pd.set_option('display.width', 300)
    mosul_file = file_name
    xl = pd.ExcelFile(mosul_file)
    mosul_df = xl.parse(0, header=[1], index_col=[0, 1, 2])
    # Read Excel and select columns
    mosul_file = pd.read_excel(file_name, sheet_name=0,
                               index_col=None, na_values=['NA'], usecols="A, E, G, H, L, M")
    # Remove NaN values
    data_mosul_df = mosul_file.apply(pd.to_numeric, errors='coerce')
    data_mosul_df = mosul_file.dropna()
    # Make a list of df's
    all_sheets.append(data_mosul_df)
Rewrite your save_frames:
def save_frames(frames, output_path):
    frames.to_csv(output_path, mode='a+', header=False)
Rewrite your main:
if __name__ == '__main__':
    frames = pd.concat(all_sheets)
    save_frames(frames, r'C:\Users\Sarah\Desktop\tt\c.csv')
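The question also asks for chronological order (2015 to 2019, subfolders 1 to 12). One way, sketched here under the assumption that the year and month appear as purely numeric folder names in each path, is to sort my_files before the for file_name in my_files loop:

import os

def year_month_key(file_path):
    # collect the numeric folder names in the path, e.g. [2015, 3]
    parts = os.path.normpath(file_path).split(os.sep)
    return [int(p) for p in parts if p.isdigit()]

my_files.sort(key=year_month_key)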

MySQL date range pull using Python: TypeError: unhashable type: 'bytearray'

I am trying to figure out how to pull data from a table that has a column called 'sent_time' where the datetime falls between two datetimes. I finally figured out how to use the dateutil parser to take the two dates for the range as input. My problem now is that I'm getting this error:
Traceback (most recent call last):
  File "C:\Python34\timerange.py", line 75, in <module>
    worksheet.write(r,0,row[0])
  File "C:\Python34\lib\site-packages\xlsxwriter\worksheet.py", line 64, in cell_wrapper
    return method(self, *args, **kwargs)
  File "C:\Python34\lib\site-packages\xlsxwriter\worksheet.py", line 436, in write
    return self.write_string(row, col, *args)
  File "C:\Python34\lib\site-packages\xlsxwriter\worksheet.py", line 64, in cell_wrapper
    return method(self, *args, **kwargs)
  File "C:\Python34\lib\site-packages\xlsxwriter\worksheet.py", line 470, in write_string
    string_index = self.str_table._get_shared_string_index(string)
  File "C:\Python34\lib\site-packages\xlsxwriter\sharedstrings.py", line 128, in _get_shared_string_index
    if string not in self.string_table:
TypeError: unhashable type: 'bytearray'
It's the bytearray that has me puzzled. Could you tell me what I'm doing wrong and how I can fix it?
I want to give you all the information I have, with all the other files and what I'm aiming for, so you can try to replicate it and see whether the problem is just my system or some configuration of mine.
I have a database with one table. Let's call it 'table1'. The table is broken down into columns like this:
sent_time | delivered_time | id1_active | id2_active | id3_active | id1_inactive | id2_inactive | id3_inactive | location_active | location_inactive ... lots more
Let's say these are two or more customers delivering goods to and from each other. Each customer has three IDs.
I created a config.ini file to make my life a bit easier:
[mysql]
host = localhost
database = db_name
user = root
password = blahblah
I created a python_mysql_dbconfig.py:
from configparser import ConfigParser

def read_db_config(filename='config.ini', section='mysql'):
    """ Read database configuration file and return a dictionary object
    :param filename: name of the configuration file
    :param section: section of database configuration
    :return: a dictionary of database parameters
    """
    # create parser and read ini configuration file
    parser = ConfigParser()
    parser.read(filename)
    # get section, default to mysql
    db = {}
    if parser.has_section(section):
        items = parser.items(section)
        for item in items:
            db[item[0]] = item[1]
    else:
        raise Exception('{0} not found in the {1} file'.format(section, filename))
    return db
This is the code that I'm working on right now. Could you take a look?
# Establish a MySQL connection
from mysql.connector import MySQLConnection, Error
from python_mysql_dbconfig import read_db_config
db_config = read_db_config()
conn = MySQLConnection(**db_config)
cursor = conn.cursor(raw=True)
#to export to excel
import xlsxwriter
from xlsxwriter.workbook import Workbook
#to get the csv converter functions
import os
import subprocess
import glob
#to get the datetime functions
import datetime
from datetime import datetime
import dateutil.parser
#creates the path needed for output files
path = 'C:/Python34/output_files/'
#creates the workbook
output_filename = input('output filename:')
workbook = xlsxwriter.Workbook(path + output_filename + '.xlsx')
worksheet = workbook.add_worksheet()
#formatting definitions
bold = workbook.add_format({'bold': True})
date_format = workbook.add_format({'num_format': 'yyyy-mm-dd hh:mm:ss'})
timeShape = '%Y-%m-%d %H:%M:%S'
#actual query
query = (
    "SELECT sent_time, delivered_time, OBJ, id1_active, id2_active, id3_active, id1_inactive, id2_inactive, id3_inactive, location_active, location_inactive FROM table1 "
    "WHERE sent_time BETWEEN %s AND %s"
)
userIn = dateutil.parser.parse(input('start date:'))
userEnd = dateutil.parser.parse(input('end date:'))
# Execute sql Query
cursor.execute(query,(userIn, userEnd))
result = cursor.fetchall()
#sets up the header row
worksheet.write('A1','sent_time',bold)
worksheet.write('B1', 'delivered_time',bold)
worksheet.write('C1', 'customer_name',bold)
worksheet.write('D1', 'id1_active',bold)
worksheet.write('E1', 'id2_active',bold)
worksheet.write('F1', 'id3_active',bold)
worksheet.write('G1', 'id1_inactive',bold)
worksheet.write('H1', 'id2_inactive',bold)
worksheet.write('I1', 'id3_inactive',bold)
worksheet.write('J1', 'location_active',bold)
worksheet.write('K1', 'location_inactive',bold)
worksheet.autofilter('A1:K1') #dropdown menu created for filtering
#print into client to see that you have results
print(" sent_time ", " delivered_time ", "OBJ", "\t id1_active ", " id2_active ", " id3_active ", "\t", " id1_inactive ", " id2_inactive ", " id3_inactive ", "\tlocation_active", "\tlocation_inactive")
for row in result:
    print(*row, sep='\t')
# Create a For loop to iterate through each row in the XLS file, starting at row 2 to skip the headers
for r, row in enumerate(result, start=1):  # where you want to start printing results inside the workbook
    for c, col in enumerate(row):
        worksheet.write_datetime(r, 0, row[0], date_format)
        worksheet.write_datetime(r, 1, row[1], date_format)
        worksheet.write(r, 2, row[2])
        worksheet.write(r, 3, row[3])
        worksheet.write(r, 4, row[4])
        worksheet.write(r, 5, row[5])
        worksheet.write(r, 6, row[6])
        worksheet.write(r, 7, row[7])
        worksheet.write(r, 8, row[8])
        worksheet.write(r, 9, row[9])
        worksheet.write(r, 10, row[10])
#close out everything and save
cursor.close()
workbook.close()
conn.close()
#print number of rows and bye-bye message
print ("- - - - - - - - - - - - -")
rows = len(result)
print ("I just imported "+ str(rows) + " rows from MySQL!")
print ("")
print ("Good to Go!!!")
print ("")
#CONVERTS JUST CREATED FILE TO CSV
# set path to folder containing xlsx files
out_path ='C:/Python34/csv_files'
os.chdir(path)
# find the file with extension .xlsx
xlsx = glob.glob(output_filename + '.xlsx')
# create output filenames with extension .csv
csvs = [x.replace('.xlsx','.csv') for x in xlsx]
# zip into a list of tuples
in_out = zip(xlsx,csvs)
# loop through each file, calling the in2csv utility from subprocess
for xl, csv in in_out:
    out = open(csv, 'w')
    command = 'c:/python34/scripts/in2csv %s\\%s' % (path, xl)
    proc = subprocess.Popen(command, stdout=out)
    proc.wait()
    out.close()
print('XLSX and CSV files named ' + output_filename + ' were created')
You've disabled type conversion in cursor = conn.cursor(raw=True). Remove the raw=True so the driver stops giving you straight bytearrays for all types.
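In other words, the cursor creation simply becomes (assuming the rest of the script stays the same):

# a non-raw cursor lets the connector convert DATETIME columns to
# datetime.datetime objects, which worksheet.write_datetime() accepts
cursor = conn.cursor()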
