Import a file with a specific format into Python line by line - python-3.x

How can I import this file, which contains plain text with numbers?
It's difficult to import because the first line contains 7 numbers and the second line contains 8 numbers...
In general:
LINE 1: 7 numbers.
LINE 2: 8 numbers.
LINE 3: 7 numbers.
LINE 4: 8 numbers.
... and so on
I have tried to read it line by line but cannot import it. I need to save the data in a NumPy array.
filepath = 'CHALLENGE.001'
with open(filepath) as fp:
    line = fp.readline()
    cnt = 1
    while line:
        print("Line {}: {}".format(cnt, line.strip()))
        line = fp.readline()
        cnt += 1
LINK TO DATA
This file contains information for each frequency, as explained below:

You'll have to skip the blank lines when reading as well.
Just check if the first line is blank. If it isn't, read 3 more lines.
Rinse and repeat.
Here's an example of both a numpy array and a pandas dataframe.
import pandas as pd
import numpy as np

filepath = 'CHALLENGE.001'
data = []
headers = ['frequency in Hz',
           'ExHy coherency',
           'ExHy scalar apparent resistivity',
           'ExHy scalar phase',
           'EyHz coherency',
           'EyHx scalar apparent resistivity',
           'EyHx scalar phase',
           're Zxx/√(µo)',
           'im Zxx/√(µo)',
           're Zxy/√(µo)',
           'im Zxy/√(µo)',
           're Zyx/√(µo)',
           'im Zyx/√(µo)',
           're Zyy/√(µo)',
           'im Zyy/√(µo)',
           ]

with open(filepath) as fp:
    while True:
        line = fp.readline()   # 7-number line
        if not len(line):
            break
        fp.readline()          # skip blank line
        line2 = fp.readline()  # 8-number line
        fp.readline()          # skip blank line
        combined = line.strip().split() + line2.strip().split()
        data.append(combined)

df = pd.DataFrame(data, columns=headers).astype('float')
array = np.array(data).astype(float)  # np.float is deprecated; use the builtin float

# example of type
print(type(df['frequency in Hz'][0]))
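A more compact variant of the same idea, if it helps (a sketch assuming the same repeating pattern of a 7-number line, a blank line, an 8-number line, a blank line):

import numpy as np

with open('CHALLENGE.001') as fp:
    rows = [line.split() for line in fp if line.strip()]  # drop the blank lines

# pair each 7-number line with the 8-number line that follows it
records = [first + second for first, second in zip(rows[::2], rows[1::2])]
array = np.array(records, dtype=float)  # shape: (number of records, 15)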

Related

How to convert 50000 txt files into a csv

I have many text files. I tried to convert the txt files into a single CSV file, but it is taking a huge amount of time. I started the code at night and went to sleep; by morning it had processed only 4500 files and was still running.
Is there any faster way to convert the text files into csv?
Here is my code:
import pandas as pd
import os
import glob
from tqdm import tqdm

# create empty dataframe
csvout = pd.DataFrame(columns =["ID","Delivery_person_ID" ,"Delivery_person_Age" ,"Delivery_person_Ratings","Restaurant_latitude","Restaurant_longitude","Delivery_location_latitude","Delivery_location_longitude","Order_Date","Time_Orderd","Time_Order_picked","Weather conditions","Road_traffic_density","Vehicle_condition","Type_of_order","Type_of_vehicle", "multiple_deliveries","Festival","City","Time_taken (min)"])

# get list of files
file_list = glob.glob(os.path.join(os.getcwd(), "train/", "*.txt"))

for filename in tqdm(file_list):
    # next file/record
    mydict = {}
    with open(filename) as datafile:
        # read each line and split on " " space
        for line in tqdm(datafile):
            # Note: partition results in 3 string parts: "key", " ", "value";
            # the slice's third parameter [::2] means step=+2,
            # so only the 1st and 3rd items are kept
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    # put dictionary in dataframe
    csvout = csvout.append(mydict, ignore_index=True)

# write to csv
csvout.to_csv("train.csv", sep=";", index=False)
Here is my example text file.
ID 0xb379
Delivery_person_ID BANGRES18DEL02
Delivery_person_Age 34.000000
Delivery_person_Ratings 4.500000
Restaurant_latitude 12.913041
Restaurant_longitude 77.683237
Delivery_location_latitude 13.043041
Delivery_location_longitude 77.813237
Order_Date 25-03-2022
Time_Orderd 19:45
Time_Order_picked 19:50
Weather conditions Stormy
Road_traffic_density Jam
Vehicle_condition 2
Type_of_order Snack
Type_of_vehicle scooter
multiple_deliveries 1.000000
Festival No
City Metropolitian
Time_taken (min) 33.000000
CSV is a very simple data format for which you don't need any sophisticated tools. It is just text and separators.
In your hopefully simple case there is no need to use pandas and dictionaries, unless your data files are corrupt, missing some columns, or have additional columns to skip. Even in that case you can handle such issues better within your own code, so you have more control over it and are able to get results within seconds.
Assuming your data files are not corrupt and have all columns in the right order, with no missing or additional columns (so you can rely on their proper formatting), just try this code:
from time import perf_counter as T
sT = T()

filesProcessed = 0
columns = ["ID","Delivery_person_ID" ,"Delivery_person_Age" ,"Delivery_person_Ratings","Restaurant_latitude","Restaurant_longitude","Delivery_location_latitude","Delivery_location_longitude","Order_Date","Time_Orderd","Time_Order_picked","Weather conditions","Road_traffic_density","Vehicle_condition","Type_of_order","Type_of_vehicle", "multiple_deliveries","Festival","City","Time_taken (min)"]

import glob, os
file_list = glob.glob(os.path.join(os.getcwd(), "train/", "*.txt"))

csv_lines = []
csv_line_counter = 0
for filename in file_list:
    filesProcessed += 1
    with open(filename) as datafile:
        csv_line = ""
        for line in datafile.read().splitlines():
            # print(line)
            var = line.partition(" ")[-1]
            csv_line += var.strip() + ';'
        csv_lines.append(str(csv_line_counter) + ';' + csv_line[:-1])
        csv_line_counter += 1

with open("train.csv", "w") as csvfile:
    csvfile.write(';' + ';'.join(columns) + '\n')
    csvfile.write('\n'.join(csv_lines))

eT = T()
print(f'> {filesProcessed=}, {(eT-sT)=:8.6f}')
I guess you will get the result at a speed beyond your expectations (in seconds, not minutes or hours).
On my computer, extrapolating from the processing time for 100 files, the time required for 50,000 files will be about 3 seconds.
I could not replicate the slowness. I took the example data file and created 5000 copies of it, then ran your code with tqdm and without. Below is the version without:
import time
import csv
import os
import glob
import pandas as pd
from tqdm import tqdm

csvout = pd.DataFrame(columns =["ID","Delivery_person_ID" ,"Delivery_person_Age" ,"Delivery_person_Ratings","Restaurant_latitude","Restaurant_longitude","Delivery_location_latitude","Delivery_location_longitude","Order_Date","Time_Orderd","Time_Order_picked","Weather conditions","Road_traffic_density","Vehicle_condition","Type_of_order","Type_of_vehicle", "multiple_deliveries","Festival","City","Time_taken (min)"])

file_list = glob.glob(os.path.join(os.getcwd(), "sample_files/", "*.txt"))

t1 = time.time()
for filename in file_list:
    # next file/record
    mydict = {}
    with open(filename) as datafile:
        # read each line and split on " " space
        for line in datafile:
            # Note: partition results in 3 string parts: "key", " ", "value";
            # the slice's third parameter [::2] means step=+2,
            # so only the 1st and 3rd items are kept
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    # put dictionary in dataframe
    csvout = csvout.append(mydict, ignore_index=True)

# write to csv
csvout.to_csv("train.csv", sep=";", index=False)
t2 = time.time()
print(t2 - t1)
The times I got were:
tqdm: 33 seconds
no tqdm: 34 seconds
Then I ran using the csv module:
t1 = time.time()
with open('output.csv', 'a', newline='') as csv_file:
    columns = ["ID","Delivery_person_ID" ,"Delivery_person_Age" ,"Delivery_person_Ratings","Restaurant_latitude","Restaurant_longitude","Delivery_location_latitude","Delivery_location_longitude","Order_Date","Time_Orderd","Time_Order_picked","Weather conditions","Road_traffic_density","Vehicle_condition","Type_of_order","Type_of_vehicle", "multiple_deliveries","Festival","City","Time_taken (min)"]
    mydict = {}
    d_Writer = csv.DictWriter(csv_file, fieldnames=columns, delimiter=',')
    d_Writer.writeheader()
    for filename in file_list:
        with open(filename) as datafile:
            for line in datafile:
                name, var = line.partition(" ")[::2]
                mydict[name.strip()] = var.strip()
        d_Writer.writerow(mydict)
t2 = time.time()
print(t2 - t1)
The time for this was:
csv: 0.32231569290161133 seconds
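For completeness: if you want to stay in pandas, the usual fix is to avoid DataFrame.append inside the loop (it copies the whole frame on every call) and to build the frame once at the end. A sketch, reusing the same parsing as above:

import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join(os.getcwd(), "sample_files/", "*.txt"))
records = []
for filename in file_list:
    mydict = {}
    with open(filename) as datafile:
        for line in datafile:
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    records.append(mydict)      # cheap list append, no frame copying

csvout = pd.DataFrame(records)  # one DataFrame construction at the end
csvout.to_csv("train.csv", sep=";", index=False)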
Try it like this.
import glob

with open('my_file.csv', 'a') as csv_file:
    for path in glob.glob('./*.txt'):
        with open(path) as txt_file:
            txt = txt_file.read() + '\n'
            csv_file.write(txt)

File operation using numpy

I am trying to delete a phrase from a text file using numpy. I have tried:
num = [] with num.append(num1)
'a' instead of 'w' to write the file back.
Append mode doesn't delete the phrase, while write mode's first run deletes the phrase, the second run deletes the second line (which is not the phrase), and the third run empties the file.
import numpy as np

phrase = 'the dog barked'
num = 0
with open("yourfile.txt") as myFile:
    for num1, line in enumerate(myFile, 1):
        if phrase in line:
            num += num1
        else:
            break

a = np.genfromtxt("yourfile.txt", dtype=None, delimiter="\n", encoding=None)
with open('yourfile.txt', 'w') as f:
    for el in np.delete(a, (num), axis=0):
        f.write(str(el) + '\n')

'''
the bird flew
the dog barked
the cat meowed
'''
I think you can still use nums.append(num1) with 'w' mode. The issue is that you enumerated myFile's lines starting at index 1 instead of index 0, which is what the numpy array expects. Changing enumerate(myFile, 1) to enumerate(myFile, 0) seems to fix the issue:
import numpy as np

phrase = 'the dog barked'
nums = []
with open("yourfile.txt") as myFile:
    for num1, line in enumerate(myFile, 0):
        if phrase in line:
            nums.append(num1)

a = np.genfromtxt("yourfile.txt", dtype=None, delimiter="\n", encoding=None)
with open('yourfile.txt', 'w') as f:
    for el in np.delete(a, nums, axis=0):
        f.write(str(el) + '\n')
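A quick sanity check (a sketch assuming yourfile.txt holds the three sample lines quoted in the question):

import numpy as np

a = np.genfromtxt("yourfile.txt", dtype=None, delimiter="\n", encoding=None)
print(a)                          # ['the bird flew' 'the dog barked' 'the cat meowed']
print(np.delete(a, [1], axis=0))  # ['the bird flew' 'the cat meowed']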

How to add numbers in Text file with Strings [duplicate]

import csv

csv_file = 'Annual Budget.csv'
txt_file = 'annual_budget.txt'
with open(txt_file, 'w') as my_output_file:
    with open(csv_file, 'r') as my_input_file:
        reader = csv.reader(my_input_file)
        for row in reader:
            my_output_file.write(" ".join(row) + '\n')
            data = []

with open(r'annual_budget.txt', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    for line in reader:
        rowdata = map(float, line)
        data.extend(rowdata)

print(sum(data)/len(data))
Trying to add the numbers in a text file with strings, but an error is continually thrown.
Output:
data.extend(rowdata)
ValueError: could not convert string to float:
Data Set: [1]: https://i.stack.imgur.com/xON30.png
TL;DR: You're treating space-delimited text as CSV, which the csv module does not parse correctly.
At the time I worked this problem out for you, you had not provided the original csv data, so for this problem I assumed your csv file contained the following data, based on your screenshot of the txt file:
Annual Budget,Q2,Q4
100,450,20
600,765,50
500,380,79
800,480,455
1100,65,4320
Now, about the code.
You're defining data = [] in a place where it is not only
not used, but also causing it to be reset to an empty
list with every loop through the file you're converting.
If we add a print statement directly under it we get this for output:
showing added print statement:
with open(txt_file, 'w') as my_output_file:
    with open(csv_file, 'r') as my_input_file:
        reader = csv.reader(my_input_file)
        for row in reader:
            my_output_file.write(" ".join(row) + '\n')
            data = []
            print(data)
Output:
[]
[]
[]
[]
[]
[]
Moving data = [] to the top of the file prevents that.
Now, in the second with statement and loop, you're treating the txt file you just created as a csv file. csv data is comma-delimited, not space-delimited, so the csv reader isn't parsing the rows correctly. If we add a print loop to check what is coming out of the map function, we can see it's not doing what you expect, which is converting each row to a list of floats.
Relevant code:
for line in reader:
    rowdata = map(float, line)
    for element in rowdata:
        print(element)
output:
Traceback (most recent call last):
File "test.py", line 17, in <module>
for element in rowdata:
ValueError: could not convert string to float: '100 450 20'
There are multiple ways to solve the problem, but I think the best is to simply skip the whole first loop where you convert the data to a space-delimited file. That way we just let the csv module do its job.
example code:
import csv

data = []
with open('Annual Budget.csv', 'r') as f:
    reader = csv.reader(f)  # get the reader
    header = next(reader)   # advance the reader past the header
    for line in reader:
        rowdata = map(float, line)
        for element in rowdata:
            print(element)
Output:
100.0
450.0
20.0
600.0
765.0
50.0
500.0
380.0
79.0
800.0
480.0
455.0
1100.0
65.0
4320.0
Now we'll add your last couple lines of code back in and remove the test code:
import csv

data = []
with open('Annual Budget.csv', 'r') as f:
    reader = csv.reader(f)  # get the reader
    header = next(reader)   # advance the reader past the header
    for line in reader:
        rowdata = map(float, line)
        data.extend(rowdata)

print(sum(data)/len(data))
Which now outputs:
677.6
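As a cross-check, pandas gives the same overall mean in a couple of lines (a sketch assuming the same Annual Budget.csv):

import pandas as pd

df = pd.read_csv('Annual Budget.csv')
print(df.to_numpy().mean())  # mean over all 15 numeric cells: 677.6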

How to fix the code about appending a number in the line?

I create a new column (name: Account) in the csv, then try to make a sequence (c = float(a) + float(b)) and, for each number in the sequence, append it to the original line in the csv as the value of the new column. Here is my code:
# -*- coding: utf-8 -*-
import csv

with open('./tradedate/2007date.csv') as inf:
    reader = csv.reader(inf)
    all = []
    row = next(reader)
    row.append('Amount')
    all.append(row)
    a = 50
    for i, line in enumerate(inf):
        if i != 0:
            size = sum(1 for _ in inf)  # count the line number
            for b in range(1, size+1):
                c = float(a) + float(b)  # create the sequence: 1st line add 1, 2nd line add 2, 3rd line add 3, etc.
                line.append(c)  # this raises: AttributeError: 'str' object has no attribute 'append'
                all.append(line)

with open('main_test.csv', 'w', newline='') as new_csv:
    csv_writer = csv.writer(new_csv)
    csv_writer.writerows(all)
The csv is like this:
日期,成交股數,成交金額,成交筆數,發行量加權股價指數,漲跌點數,Account
96/01/02,"5,738,692,838","141,743,085,172","1,093,711","7,920.80",97.08,51
96/01/03,"5,974,259,385","160,945,755,016","1,160,347","7,917.30",-3.50,52
96/01/04,"5,747,756,529","158,857,947,106","1,131,747","7,934.51",17.21,53
96/01/05,"5,202,769,867","143,781,214,318","1,046,480","7,835.57",-98.94,54
96/01/08,"4,314,344,739","115,425,522,734","888,324","7,736.71",-98.86,55
96/01/09,"4,533,381,664","120,582,511,893","905,970","7,790.01",53.30,56
The Error message is:
Traceback (most recent call last):
File "main.py", line 21, in <module>
line.append(c)
AttributeError: 'str' object has no attribute 'append'
Thanks very much for any help!
I'm a little confused about why you're structuring your code this way, but the simplest fix would be to change the append (since you can't append to a string) to += a string version of c, i.e.
line += str(c)
or
line += ',{}'.format(c)
(I'm not clear, based on how you've written this, whether you need the comma or not.)
The biggest problem is that you're not using your csv reader; below is a better implementation. With the csv reader it's cleaner to do the append you want than to use the file object directly.
import csv

with open('./tradedate/2007date.csv') as old_csv:
    with open('main_test.csv', 'w') as new_csv:
        writer = csv.writer(new_csv, lineterminator='\n')
        reader = csv.reader(old_csv)
        all = []
        row = next(reader)
        row.append('Line Number')
        all.append(row)
        line_number = 51
        for row in reader:
            row.append(line_number)
            all.append(row)
            line_number += 1
        writer.writerows(all)
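The manual counter also works, but enumerate can carry it for you (a sketch of the same loop):

import csv

with open('./tradedate/2007date.csv') as old_csv:
    with open('main_test.csv', 'w', newline='') as new_csv:
        reader = csv.reader(old_csv)
        writer = csv.writer(new_csv)
        header = next(reader)
        writer.writerow(header + ['Account'])
        for n, row in enumerate(reader, start=51):  # the Account column starts at 51
            writer.writerow(row + [n])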

ValueError: could not convert string to float: left_column_pixel

I can't read pixel values from pandas into img() in OpenCV. Here are my code and the reported error.
import cv2
import numpy as np
import csv
import os
import pandas as pd

path_csv = '/home/'
npa = pd.read_csv(path_csv+"char.csv", usecols=[2,3,4,5], header=None)
nb_charac = npa.shape[0]-1

# stock the actual letters of your csv in an array
characs = []
cpt = 0
# take characters
f = open(path_csv+"char.csv", 'rt')
reader = csv.reader(f)
for row in reader:
    if cpt >= 1:  # skip header
        characs.append(str(row[1]))
    cpt += 1

# open your image
path_image = '/home/'
img = cv2.imread(os.path.join(path_image, 'image1.png'))
path_save = '/home/2/'
i = 0

# for every line in your csv,
for i in range(nb_charac):
    # get coordinates
    # coords=npa[i,:]
    coords = npa.iloc[[i]]
    charac = characs[i]
    # actual cropping of the image (easy with numpy)
    img_charac = img[int(coords[2]):int(coords[4]), int(coords[3]):int(coords[5])]
    img_charac = cv2.resize(img_charac, (32, 32), interpolation=cv2.INTER_NEAREST)
    i += 1
    # charac=charac.strip('"\'')
    # x=switch(charac)
    # saving the image
    cv2.imwrite(path_save+str(charac)+"_"+str(i)+"_"+str(img_charac.shape)+".png", img_charac)
    img_charac2 = 255 - img_charac
    cv2.imwrite(path_save+str(charac)+"_switched"+str(i)+"_"+str(img_charac2.shape)+".png", img_charac2)
    print(i)
I got the following error:
img_charac=img[int(coords[2]):int(coords[3]),int(coords[0]):int(coords[1])]
File "/usr/lib/python2.7/dist-packages/pandas/core/series.py", line 79, in wrapper
return converter(self.iloc[0])
ValueError: invalid literal for int() with base 10: 'left_column_pixel'
The error is related to this line of code:
img_charac=img[int(coords[2]):int(coords[4]),int(coords[3]):int(coords[5])]
where my variable coords is as follows:
>>> coords=npa.iloc[[1]]
>>> coords
2 3 4 5
1 38 104 2456 2492
and the values of columns 2, 3, 4, 5 needed in img_charac are:
>>> coords[2]
1 38
Name: 2, dtype: object
>>> coords[3]
1 104
Name: 3, dtype: object
>>> coords[4]
1 2456
Name: 4, dtype: object
>>> coords[5]
1 2492
Name: 5, dtype: object
I updated the img_charac line as follows:
img_charac = img[int(float(coords[2].values[0])):int(float(coords[4].values[0])), int(float(coords[3].values[0])):int(float(coords[5].values[0]))]
I no longer get
ValueError: invalid literal for int() with base 10: 'left_column_pixel'
but I get the following error:
ValueError: could not convert string to float: left_column_pixel
I noticed that img_charac works outside the loop.
I think the ValueError occurs because you are reading the header row of your csv file in the first iteration of your for loop. The header contains string labels which can't be converted to integers:
for i in range(nb_charac) will start with i having 0 as its first value.
Then, coords=npa.iloc[[i]] will return the first row (row 0) of your csv file.
Since you've set header=None in npa=pd.read_csv(path_csv+"char.csv", usecols=[2,3,4,5], header=None), that first row is your header row, so you iterate over the strings within it.
So either set header=0 or use for i in range(1, nb_charac).
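Either fix is a one-line change (a sketch):

import pandas as pd

path_csv = '/home/'

# Option 1: let pandas consume the first row as the header,
# so npa contains data rows only (note that nb_charac should then
# be npa.shape[0] rather than npa.shape[0]-1)
npa = pd.read_csv(path_csv + "char.csv", usecols=[2, 3, 4, 5], header=0)

# Option 2: keep header=None but start the loop past the header row
# for i in range(1, nb_charac):
#     coords = npa.iloc[[i]]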
