I have many text files. I tried to convert the txt files into a single CSV file, but it is taking a very long time. I started the script at night before going to sleep, and by morning it had processed only 4500 files and was still running.
Is there a faster way to convert the text files into a CSV?
Here is my code:
import pandas as pd
import os
import glob
from tqdm import tqdm

# create empty dataframe
csvout = pd.DataFrame(columns=["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"])

# get list of files
file_list = glob.glob(os.path.join(os.getcwd(), "train/", "*.txt"))

for filename in tqdm(file_list):
    # next file/record
    mydict = {}
    with open(filename) as datafile:
        # read each line and split on the first " " space
        for line in tqdm(datafile):
            # Note: partition returns 3 string parts: "key", " ", "value";
            # the slice [::2] (step=2) keeps only the 1st and 3rd items
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    # put dictionary in dataframe
    csvout = csvout.append(mydict, ignore_index=True)

# write to csv
csvout.to_csv("train.csv", sep=";", index=False)
Here is my example text file.
ID 0xb379
Delivery_person_ID BANGRES18DEL02
Delivery_person_Age 34.000000
Delivery_person_Ratings 4.500000
Restaurant_latitude 12.913041
Restaurant_longitude 77.683237
Delivery_location_latitude 13.043041
Delivery_location_longitude 77.813237
Order_Date 25-03-2022
Time_Orderd 19:45
Time_Order_picked 19:50
Weather conditions Stormy
Road_traffic_density Jam
Vehicle_condition 2
Type_of_order Snack
Type_of_vehicle scooter
multiple_deliveries 1.000000
Festival No
City Metropolitian
Time_taken (min) 33.000000
CSV is a very simple data format that you don't need any sophisticated tools to handle: just text and separators.
In your (hopefully simple) case there is no need for pandas or dictionaries at all.
The exception would be if your data files are corrupt, i.e. missing some columns or containing additional ones to skip. But even then you can handle such issues better in your own code, where you have more control, and still get results within seconds.
Assuming your data files are not corrupt, i.e. all columns are present in the right order with nothing missing and nothing extra (so you can rely on their formatting), just try this code:
from time import perf_counter as T

sT = T()
filesProcessed = 0

columns = ["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"]

import glob, os
file_list = glob.glob(os.path.join(os.getcwd(), "train/", "*.txt"))

csv_lines = []
csv_line_counter = 0
for filename in file_list:
    filesProcessed += 1
    with open(filename) as datafile:
        csv_line = ""
        for line in datafile.read().splitlines():
            # print(line)
            var = line.partition(" ")[-1]
            csv_line += var.strip() + ';'
        csv_lines.append(str(csv_line_counter) + ';' + csv_line[:-1])
        csv_line_counter += 1

with open("train.csv", "w") as csvfile:
    csvfile.write(';' + ';'.join(columns) + '\n')
    csvfile.write('\n'.join(csv_lines))

eT = T()
print(f'> {filesProcessed=}, {(eT-sT)=:8.6f}')
I guess you will get the result at a speed beyond your expectations (seconds, not minutes or hours).
On my computer, extrapolating from the processing time of 100 files, the time required for 50,000 files will be about 3 seconds.
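A note on why the original version is so slow: csvout.append(...) returns a new copy of the whole DataFrame on every call, so the total work grows roughly quadratically with the number of files. If you would rather stay within pandas, a minimal sketch along these lines should also finish in seconds: collect one dict per file and build the DataFrame once at the end. (Here each line is split on its last space so the two column names containing spaces stay intact, which assumes the values themselves never contain spaces.)

import glob
import os

import pandas as pd

columns = ["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"]

rows = []
for filename in glob.glob(os.path.join(os.getcwd(), "train", "*.txt")):
    record = {}
    with open(filename) as datafile:
        for line in datafile:
            # split on the LAST space so keys with spaces stay intact
            key, _, value = line.strip().rpartition(" ")
            record[key] = value
    rows.append(record)

# build the DataFrame once instead of appending row by row
pd.DataFrame(rows, columns=columns).to_csv("train.csv", sep=";", index=False)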
I could not replicate this. I took the example data file and created 5000 copies of it. Then I ran your code with tqdm and without it. The version below is without tqdm:
import time
import csv
import os
import glob
import pandas as pd
from tqdm import tqdm

csvout = pd.DataFrame(columns=["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"])

file_list = glob.glob(os.path.join(os.getcwd(), "sample_files/", "*.txt"))

t1 = time.time()
for filename in file_list:
    # next file/record
    mydict = {}
    with open(filename) as datafile:
        # read each line and split on the first " " space
        for line in datafile:
            # Note: partition returns 3 string parts: "key", " ", "value";
            # the slice [::2] (step=2) keeps only the 1st and 3rd items
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    # put dictionary in dataframe
    csvout = csvout.append(mydict, ignore_index=True)

# write to csv
csvout.to_csv("train.csv", sep=";", index=False)
t2 = time.time()
print(t2 - t1)
The times I got were:
tqdm 33 seconds
no tqdm 34 seconds
Then I ran using the csv module:
t1 = time.time()
with open('output.csv', 'a', newline='') as csv_file:
    columns = ["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"]
    mydict = {}
    d_Writer = csv.DictWriter(csv_file, fieldnames=columns, delimiter=',')
    d_Writer.writeheader()
    for filename in file_list:
        with open(filename) as datafile:
            for line in datafile:
                name, var = line.partition(" ")[::2]
                mydict[name.strip()] = var.strip()
        d_Writer.writerow(mydict)
t2 = time.time()
print(t2 - t1)
The time for this was:
csv 0.32231569290161133 seconds.
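One caveat on csv.DictWriter: by default it raises a ValueError if a row dict contains keys that are not in fieldnames, and it writes restval for columns that are missing. Splitting on the first space turns the keys "Weather conditions" and "Time_taken (min)" from the example file into "Weather" and "Time_taken", so with files like that you may need to parse on the last space instead or relax the writer. A minimal, self-contained sketch of the relevant options (shortened column list, made-up row):

import csv

columns = ['ID', 'Delivery_person_ID', 'Weather conditions', 'Time_taken (min)']  # shortened for the example

with open('output.csv', 'w', newline='') as csv_file:
    writer = csv.DictWriter(
        csv_file,
        fieldnames=columns,
        restval='',             # written for any column missing from the row dict
        extrasaction='ignore',  # silently drop dict keys that are not in fieldnames
    )
    writer.writeheader()
    # 'Weather' is not a fieldname, so it is dropped; the missing columns become ''
    writer.writerow({'ID': '0xb379', 'Weather': 'Stormy'})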
Try it like this.
import glob

with open('my_file.csv', 'a') as csv_file:
    for path in glob.glob('./*.txt'):
        with open(path) as txt_file:
            txt = txt_file.read() + '\n'
            csv_file.write(txt)
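Note that this copies each text file verbatim, so the output is not one row per file. If you want one semicolon-separated row per file in the same streaming style, a minimal variation (assuming each file holds exactly the key/value lines shown above and that values contain no spaces) could look like this:

import glob

with open('my_file.csv', 'w') as csv_file:
    for path in glob.glob('./*.txt'):
        with open(path) as txt_file:
            # keep only the value after the last space of each line
            values = [line.rsplit(' ', 1)[-1] for line in txt_file.read().splitlines()]
        csv_file.write(';'.join(values) + '\n')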
I have to read a CSV file N lines at a time.

csv_reader = csv.reader(csv_file, delimiter=',')
line_count = 0
for row in csv_reader:
    print(row)

I know I can loop N times, build a list of lists, and process it that way.
But is there a simpler way of using csv_reader so that I read N lines at a time?
Hi, I don't think you'll be able to do that without a loop using the csv package.
You should use pandas (pip install --user pandas) instead:

import pandas

df = pandas.read_csv('myfile.csv')

step = 2  # your 'N'
for i in range(0, len(df), step):
    print(df[i:i+step])
Pandas has a chunksize option on its read_csv() method, and I would probably explore that option first.
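For reference, a minimal sketch of the chunksize route (myfile.csv is just a placeholder name): read_csv then returns an iterator of DataFrames, each holding up to N rows, instead of one big frame.

import pandas as pd

N = 5  # rows per batch

# with chunksize set, read_csv yields DataFrame chunks instead of one DataFrame
for chunk in pd.read_csv('myfile.csv', chunksize=N):
    print(chunk)  # each chunk has up to N rows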
If I were going to do it by hand, I would probably do something like:
import csv

def process_batch(rows):
    print(rows)

def get_batch(reader, batch_size):
    # take up to batch_size rows; rows requested past EOF come back as None and are filtered out
    return [row for _ in range(batch_size) if (row := next(reader, None))]

with open("data.csv", "r") as file_in:
    reader = csv.reader(file_in)
    while batch := get_batch(reader, 5):
        process_batch(batch)
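As a variation on the helper above, itertools.islice expresses the same "up to N rows" batching directly (this sketch reuses process_batch from the previous snippet):

import csv
from itertools import islice

with open("data.csv", "r") as file_in:
    reader = csv.reader(file_in)
    # islice(reader, 5) takes at most 5 rows and simply yields fewer at EOF
    while batch := list(islice(reader, 5)):
        process_batch(batch)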
I have code which iterates through the text and tells me the maximum number of times each DNA STR is found. The only step missing before I can match these values against the CSV file is to store them in a list, but I am not able to do so. When I run the code, the maximum values are printed independently for each STR sequence.
I have tried to "append" the values into a list, but I was not successful, so I cannot match them with the DNA sequences of the CSV (neither the large nor the small one).
Any help or advice is greatly appreciated!
Here is my code, and the results I get using "text 1" and the "small" csv:
import cs50
import sys
import csv
import os

if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")

csv_db = sys.argv[1]
file_seq = sys.argv[2]

with open(csv_db, newline='') as csvfile:
    csv_reader = csv.reader(csvfile, delimiter=',')
    header = next(csv_reader)

    i = 1
    while i < len(header):
        STR = header[i]
        len_STR = len(STR)
        with open(file_seq, 'r') as my_file:
            file_reader = my_file.read()
        counter = 0
        a = 0
        b = len_STR
        list = []
        # sliding comparison: advance by 1 on a mismatch, jump by len_STR on a match
        for text in file_reader:
            if file_reader[a:b] != STR:
                a += 1
                b += 1
            else:
                counter += 1
                a += len_STR
                b += len_STR
        list.append(counter)
        print(list)
        i += 1
The problem is the place where the variable "list" is declared. Every time you iterate over the STRs in "header" you declare:
list = []
Thus, you create a brand-new list that ends up holding only the count for the current STR. To collect the counts for all STRs in one list, declare "list" before the while loop and move the print after the while loop:
list = []
while i < len(header):
    <your loop code>
print(list)
This should solve your problem.
P.S. Avoid using "list" as a variable name. list is a Python built-in type that is always available; if you shadow it with your own variable, you will no longer be able to call list() in your code.
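Applied to the loop in the question, a minimal sketch of that fix could look like this (the counting logic is kept unchanged, header and file_reader are assumed to be defined as in the question, and the variable is renamed to counts so the built-in is not shadowed):

counts = []                      # declared once, before the while loop
i = 1
while i < len(header):
    STR = header[i]
    len_STR = len(STR)
    counter = 0
    a = 0
    b = len_STR
    for _ in file_reader:        # same sliding comparison as in the question
        if file_reader[a:b] != STR:
            a += 1
            b += 1
        else:
            counter += 1
            a += len_STR
            b += len_STR
    counts.append(counter)       # one count per STR
    i += 1
print(counts)                    # all counts together, one per STR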
So I have created a script that reads lines from a file (1500 lines) and writes them out 10 per line, producing every possible combination with product (a b c d a, a b c d b, etc.).
The thing is, the moment I run the script my computer freezes completely (because it writes so much data).
So I thought: is it possible to have the script save to a file every 100 MB, and also save its current state, so that when I run it again it actually resumes from where it stopped (the last line of the 100 MB file)?
Or if you have another solution I would love to hear it :P
Here's the script:
from itertools import product

with open('file.txt', 'r') as f:
    content = f.readlines()

comb = product(content, repeat=10)
new_content = [elem for elem in list(comb)]

with open('log.txt', 'w') as f:
    for line in new_content:
        f.write(str(line) + '\n')
The line
new_content = [elem for elem in list(comb)]
takes the product iterator and materializes it as a list in memory, twice. The result is the same as just doing
new_content = list(comb)
Your computer freezes up because this will use all of the available RAM.
Since you use new_content only for iterating over it, you could just iterate over the initial generator directly instead:
from itertools import product

with open('file.txt', 'r') as f:
    content = f.readlines()

comb = product(content, repeat=10)

with open('log.txt', 'w') as f:
    for line in comb:
        f.write(str(line) + '\n')
But now this will fill up your hard disk, since with an input of 1500 lines it will produce 1500**10 = 57,665,039,062,500,000,000,000,000,000,000 (about 5.8e31) lines of output.
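To put that in perspective, assuming (as a rough guess) about 100 bytes per output line, that is roughly 1500**10 * 100 ≈ 5.8e33 bytes, which no disk can hold, so saving state every 100 MB and resuming later cannot make writing the full output feasible.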
I would open the file in a separate function and yield a line at a time - that way you're never going to blow your memory.
def read_file(filename):
    with open(filename, "r") as f:
        for line in f:
            yield line

Then you can use this in your code, for example to copy its lines into another open file f:

for line in read_file("log.txt"):
    f.write(line)
Using Tweepy, I am writing to a CSV file with Python, and the header repeats every other row.
x = 0
x += 1
with open('NAME' + str(x) + '.csv', 'w', newline='') as f:
    for user in tweepy.Cursor(api.followers, screen_name="Name").items(5):
        thewriter = csv.writer(f)
        thewriter.writerow(['Username', 'location'])
        thewriter = csv.writer(f)
        thewriter.writerow([user.screen_name, user.location])
Your script should change to this:
x = 0
x += 1
with open('NAME' + str(x) + '.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    thewriter.writerow(['Username', 'location'])
    for user in tweepy.Cursor(api.followers, screen_name="Name").items(5):
        thewriter.writerow([user.screen_name, user.location])
You only need to create the writer object once, and of course you only want to write the header row once, not every other row as you saw. Moving those two steps out of the for loop that iterates over the rows accomplishes that.
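For what it's worth, csv.DictWriter gives you the same "header once, then rows" pattern with named fields. A minimal sketch, assuming the same authenticated api object and screen name as in the question:

import csv
import tweepy

# assumes `api` is an authenticated tweepy API object, as in the question
with open('NAME1.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['Username', 'location'])
    writer.writeheader()  # header row written exactly once
    for user in tweepy.Cursor(api.followers, screen_name="Name").items(5):
        writer.writerow({'Username': user.screen_name, 'location': user.location})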