Import txt file and filter with space - python-3.x

I'm writing a script to track my orders from a website. I want to import the order numbers from a txt file, and the script should repeat itself as long as there are order numbers. I wrote code that imports the txt file and chooses a random order number, but the script treats all the order numbers as one string and doesn't separate them. How can I fix this?
This is my code:
import random

f = open("Order#.txt", "r")
OrderNR = f.read()
words = OrderNR.split()
Repeat = len(words)
for i in range(Repeat):
    randomlist = OrderNR
    Orderrandom = random.choice(randomlist)
    Mainlink = 'https://footlocker.narvar.com/footlocker/tracking/startrack?order_number=' + Orderrandom

Instead of using f.read(), try using f.readlines().
# Using readlines()
file1 = open('myfile.txt', 'r')
Lines = file1.readlines()
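A minimal sketch of how this could slot into the original script, assuming Order#.txt holds one order number per line (the file name and URL are taken from the question):

import random

# one order number per line; strip() drops the trailing newline
with open("Order#.txt", "r") as f:
    order_numbers = [line.strip() for line in f.readlines()]

for _ in range(len(order_numbers)):
    # random.choice on a list picks a whole order number, not a single character
    Orderrandom = random.choice(order_numbers)
    Mainlink = 'https://footlocker.narvar.com/footlocker/tracking/startrack?order_number=' + Orderrandom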

Try pandas:
import pandas as pd
df = pd.read_csv('Order#.txt', delimiter='\t')
print(df)
You can then see the TXT file in table format.
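A possible follow-up on this approach, assuming the file really is one order number per line (header=None is an assumption, since such a file has no header row):

import random
import pandas as pd

# single column of order numbers; no header row in the file
df = pd.read_csv('Order#.txt', header=None, names=['order_number'])
# pick one full order number from the column
Orderrandom = random.choice(df['order_number'].astype(str).tolist())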

Related

How to convert 50000 txt files into a single csv

I have many text files. I tried to convert the txt files into a single CSV file, but it is taking a huge amount of time. I left the code running overnight, and by morning it had processed only 4500 files and was still running.
Is there any faster way to convert the text files into a CSV?
Here is my code:
import pandas as pd
import os
import glob
from tqdm import tqdm

# create empty dataframe
csvout = pd.DataFrame(columns=["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"])

# get list of files
file_list = glob.glob(os.path.join(os.getcwd(), "train/", "*.txt"))

for filename in tqdm(file_list):
    # next file/record
    mydict = {}
    with open(filename) as datafile:
        # read each line and split on " " space
        for line in tqdm(datafile):
            # Note: partition results in 3 string parts: "key", " ", "value";
            # the slice's third parameter [::2] means step=+2,
            # so only the 1st and 3rd items are taken
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    # put dictionary in dataframe
    csvout = csvout.append(mydict, ignore_index=True)

# write to csv
csvout.to_csv("train.csv", sep=";", index=False)
Here is my example text file.
ID 0xb379
Delivery_person_ID BANGRES18DEL02
Delivery_person_Age 34.000000
Delivery_person_Ratings 4.500000
Restaurant_latitude 12.913041
Restaurant_longitude 77.683237
Delivery_location_latitude 13.043041
Delivery_location_longitude 77.813237
Order_Date 25-03-2022
Time_Orderd 19:45
Time_Order_picked 19:50
Weather conditions Stormy
Road_traffic_density Jam
Vehicle_condition 2
Type_of_order Snack
Type_of_vehicle scooter
multiple_deliveries 1.000000
Festival No
City Metropolitian
Time_taken (min) 33.000000
CSV is a very simple data format that doesn't need any sophisticated tools to handle: just text and separators.
In your (hopefully simple) case there is no need for pandas and dictionaries, unless your data files are corrupt, missing some columns, or carrying additional columns to skip. Even in that case you can handle such issues better in your own code, where you have more control over it and can get results within seconds.
Assuming your data files are not corrupt and have all columns in the right order, with none missing and none extra (so you can rely on their proper formatting), just try this code:
from time import perf_counter as T

sT = T()
filesProcessed = 0
columns = ["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"]

import glob, os
file_list = glob.glob(os.path.join(os.getcwd(), "train/", "*.txt"))

csv_lines = []
csv_line_counter = 0
for filename in file_list:
    filesProcessed += 1
    with open(filename) as datafile:
        csv_line = ""
        for line in datafile.read().splitlines():
            # print(line)
            var = line.partition(" ")[-1]
            csv_line += var.strip() + ';'
        csv_lines.append(str(csv_line_counter) + ';' + csv_line[:-1])
        csv_line_counter += 1

with open("train.csv", "w") as csvfile:
    csvfile.write(';' + ';'.join(columns) + '\n')
    csvfile.write('\n'.join(csv_lines))

eT = T()
print(f'> {filesProcessed=}, {(eT-sT)=:8.6f}')
I guess you will get the result faster than you expect (in seconds, not minutes or hours).
On my computer, extrapolating from the processing time for 100 files, 50,000 files should take about 3 seconds.
I could not replicate this. I took the example data file and created 5000 copies of it. Then I ran your code with tqdm and without. Below is the version without:
import time
import csv
import os
import glob
import pandas as pd
from tqdm import tqdm

csvout = pd.DataFrame(columns=["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"])

file_list = glob.glob(os.path.join(os.getcwd(), "sample_files/", "*.txt"))

t1 = time.time()
for filename in file_list:
    # next file/record
    mydict = {}
    with open(filename) as datafile:
        # read each line and split on " " space
        for line in datafile:
            # Note: partition results in 3 string parts: "key", " ", "value";
            # the slice's third parameter [::2] means step=+2,
            # so only the 1st and 3rd items are taken
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    # put dictionary in dataframe
    csvout = csvout.append(mydict, ignore_index=True)
# write to csv
csvout.to_csv("train.csv", sep=";", index=False)
t2 = time.time()
print(t2 - t1)
The times I got were:
tqdm: 33 seconds
no tqdm: 34 seconds
Then I ran using the csv module:
t1 = time.time()
with open('output.csv', 'a', newline='') as csv_file:
    columns = ["ID", "Delivery_person_ID", "Delivery_person_Age", "Delivery_person_Ratings", "Restaurant_latitude", "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Time_Orderd", "Time_Order_picked", "Weather conditions", "Road_traffic_density", "Vehicle_condition", "Type_of_order", "Type_of_vehicle", "multiple_deliveries", "Festival", "City", "Time_taken (min)"]
    mydict = {}
    d_Writer = csv.DictWriter(csv_file, fieldnames=columns, delimiter=',')
    d_Writer.writeheader()
    for filename in file_list:
        with open(filename) as datafile:
            for line in datafile:
                name, var = line.partition(" ")[::2]
                mydict[name.strip()] = var.strip()
        d_Writer.writerow(mydict)
t2 = time.time()
print(t2 - t1)
The time for this was:
csv 0.32231569290161133 seconds.
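If you prefer to stay with pandas, note that most of the original runtime comes from csvout.append, which copies the entire DataFrame on every call (DataFrame.append was deprecated and then removed in pandas 2.0). A sketch of the usual fix, collecting dicts in a list and building the frame once at the end:

import glob
import os
import pandas as pd

records = []
for filename in glob.glob(os.path.join(os.getcwd(), "train/", "*.txt")):
    mydict = {}
    with open(filename) as datafile:
        for line in datafile:
            name, var = line.partition(" ")[::2]
            mydict[name.strip()] = var.strip()
    records.append(mydict)

# build the DataFrame once instead of growing it row by row
csvout = pd.DataFrame(records)
csvout.to_csv("train.csv", sep=";", index=False)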
Try it like this.
import glob

with open('my_file.csv', 'a') as csv_file:
    for path in glob.glob('./*.txt'):
        with open(path) as txt_file:
            txt = txt_file.read() + '\n'
            csv_file.write(txt)

Refining Python code for repeated use (xlsx to txt file and to WordCloud)

I've been self-learning Python for the past month (zero coding experience; Python is my first language) and have finally written my first usable work-related code. I'm trying to refine this code for repeated use. It converts comment-based xlsx data to a 'string type' txt file and finally to a word cloud; you can find the workable code below.
How the code works:
step 1. xlsx file = 4-column Excel worksheet
step 2. Python extracts all of column 'B'
step 3. converts it into 'str' format, removes spaces, and writes it to a txt file
step 4. the word cloud removes words using STOPWORDS
step 5. generates the word cloud according to the format
I would like to refine it so that:
the file directory can be changed in one simple step instead of copy-pasting the directory name in multiple places (skipping the manual changing of all file paths)
the txt file's name is created from the xlsx file's name (so I don't have to key it in manually every time)
If anyone has a better way of refining this, please let me know. I'm very new to this, so if you need any other information to clarify anything, let me know.
Any help would be greatly appreciated. Thank you all in advance.
import openpyxl as xl
import wordcloud
from wordcloud import WordCloud, STOPWORDS
from matplotlib.pyplot import imread
import jieba
import pandas as pd

# opening the source excel file (repeated steps needed for every different document)
filename = "C:\\Users\\shakesmilk\\Desktop\\staub\\staub天猫商品评论.xlsx"
wb1 = xl.load_workbook(filename)
ws1 = wb1.worksheets[0]

# opening the destination excel file (repeated steps needed for every different document)
filename1 = "C:\\Users\\shakesmilk\\Desktop\\staub\\staub天猫商品评论.xlsx"
wb2 = xl.load_workbook(filename1)
wb2.create_sheet('Sheet2')
ws2 = wb2.worksheets[1]

# calculate total number of rows and columns in source excel file
mr = ws1.max_row
mc = ws1.max_column
minr = ws2.min_row

# copying the cell values from source excel file to destination excel file
for i in range(1, mr + 1):
    for j in range(0, mc + 1):
        # reading cell value from source excel file
        c = ws1.cell(row=i + 1, column=2)
        # writing the read value to destination excel file
        ws2.cell(row=i + 1, column=2).value = c.value

# deleting the first empty column
ws2.delete_cols(1)
# saving the destination excel file
wb2.save(str(filename1))

# converting sheet 2 with pandas to a txt file
df = pd.read_excel(filename, sheet_name=1)
with open("C:\\Users\\shakesmilk\\Desktop\\staub\\file.txt", mode='w', encoding='utf-8') as outfile:
    df.to_string(outfile, header=None, index=None)

# open, read & remove spaces from the txt file
commentfiletxt = "C:\\Users\\shakesmilk\\Desktop\\staub\\file.txt"
with open(commentfiletxt, 'r', encoding='utf-8') as f:
    lines = f.readlines()
# remove spaces
lines = [line.replace(' ', '') for line in lines]
# finally, write the lines back to the file
with open(commentfiletxt, 'w', encoding='utf-8') as f:
    f.writelines(lines)

# txt file generated > next, create the wordcloud
# remove words from the wordcloud
stopwords = set(STOPWORDS)
stopwords.update(['此用户没有填写评论', 'hellip', 'zwj', '其他特色', '还没用', '非常喜欢', '产品功能', '没有用'])
mask = imread('moon.jpg')
with open(commentfiletxt, 'r', encoding='utf-8') as file:
    text = file.read()
words = jieba.lcut(text)  # precise-mode word segmentation
newtxt = ' '.join(words)  # join the words with spaces
wd = wordcloud.WordCloud(stopwords=stopwords,
                         font_path="MSYH.TTC",
                         background_color="white",
                         width=800,
                         height=300,
                         max_words=500,
                         max_font_size=200,
                         mask=mask,
                         ).generate(text)
txt = open(commentfiletxt, mode='r', encoding='utf-8')
# save picture
wd.to_file('staub2.png')
I continued reading "Learning Python, 5th Edition", and apparently functions are a good way to make code reusable. I guess no one is interested in noob code, but I'm sure there are a lot of beginners out there trying to refine their code, so I'm answering my own question as I progress through the book; hopefully this helps those in need. P.S. I'm currently reading about classes, and I'm guessing I could transform this code further, but for now, here is my 'def' example:
import openpyxl as xl
import wordcloud
from wordcloud import WordCloud, STOPWORDS
from matplotlib.pyplot import imread
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
import jieba
import pandas as pd

def open_excel(filename):
    global wb1, ws1, filenametxt
    wb1 = xl.load_workbook(filename)
    ws1 = wb1.worksheets[0]
    filenametxt = filename
    print('Loading WorkBook Completed')

def create_sheet(filename1):
    global wb2, ws2
    wb2 = xl.load_workbook(filename1)
    wb2.create_sheet('Sheet2')
    ws2 = wb2.worksheets[1]
    print('Sheet 2 Created')
    mr = ws1.max_row
    mc = ws1.max_column
    minr = ws2.min_row
    for i in range(1, mr + 1):
        for j in range(0, mc + 1):
            # reading cell value from source excel file
            c = ws1.cell(row=i + 1, column=2)
            ws2.cell(row=i + 1, column=2).value = c.value
    wb2.save(filename1)
    print("Data Extracted To 'Column B'")
    ws2.delete_cols(1)
    print('Empty Space in Column 1 Deleted')
    wb2.save(filename1)

def create_txtf(tfile):
    global df
    df = pd.read_excel(filenametxt, sheet_name=1)
    with open(tfile, mode='w', encoding='utf-8') as outfile:
        df.to_string(outfile, header=None, index=None)
    print('txt file created as file.txt')

def convert_remove(tfile1):
    # open, read & remove spaces from the txt file
    global commentfiletxt
    commentfiletxt = tfile1
    with open(commentfiletxt, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    # remove spaces
    lines = [line.replace(' ', '') for line in lines]
    print('Empty spaces are removed')
    # finally, write the lines back to the file
    with open(commentfiletxt, 'w', encoding='utf-8') as f:
        f.writelines(lines)
    print('Data is written correctly without spaces')
    return tfile1

def wordcloudpic(picname, maskpathn):
    stopwords = set(STOPWORDS)
    stopwords.update(['此用户没有填写评论', 'hellip', 'zwj', '其他特色', '还没用', '非常喜欢', '产品功能', '没有用',
                      '东西收到了', 'S', 'sode', 'c', 's左右', 'u', 'middot', 'u', 'theta', 'rdquo', 'ldquo',
                      'ec', 'ok', '好评', '不错', '很好', '满意', '好用', '老板大气', '好', 'nbsp'])
    mask = imread(maskpathn)
    mask = mask.astype(np.uint8)
    with open(commentfiletxt, 'r', encoding='utf-8') as file:
        text = file.read()
    words = jieba.lcut(text)  # precise-mode word segmentation
    newtxt = ' '.join(words)  # join the words with spaces
    wd = wordcloud.WordCloud(stopwords=stopwords,
                             font_path="MSYH.TTC",
                             background_color="white",
                             width=800,
                             height=300,
                             max_words=500,
                             max_font_size=200,
                             mask=mask,
                             ).generate(text)
    txt = open(commentfiletxt, mode='r', encoding='utf-8')
    # save picture
    wd.to_file(picname)

if __name__ == "__main__":
    open_excel("C:\\Users\\shakesmilk\\Desktop\\testtest\\test天猫商品评.xlsx")
    create_sheet("C:\\Users\\shakesmilk\\Desktop\\testtest\\test天猫商品评.xlsx")
    create_txtf("C:\\Users\\shakesmilk\\Desktop\\testtest\\file.txt")
    convert_remove("C:\\Users\\shakesmilk\\Desktop\\testtest\\file.txt")
    wordcloudpic('test.png', 'bubble.jpg')
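Toward the two stated goals (one place for the directory, txt name derived from the xlsx name), a minimal sketch using pathlib; the paths are the ones from the example above:

from pathlib import Path

# single source of truth for the workbook location
xlsx_path = Path(r"C:\Users\shakesmilk\Desktop\testtest\test天猫商品评.xlsx")
# derive the txt file name from the xlsx file name
txt_path = xlsx_path.with_suffix('.txt')

open_excel(str(xlsx_path))
create_sheet(str(xlsx_path))
create_txtf(str(txt_path))
convert_remove(str(txt_path))
wordcloudpic('test.png', 'bubble.jpg')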

What is wrong with this pandas and txt file code?

I'm using pandas to open a CSV file that contains data from Spotify; meanwhile, I have a txt file that contains various artist names from that CSV file. What I'm trying to do is get the value from each row of the txt file and automatically search for it with the function I've written.
import pandas as pd
import time

df = pd.read_csv("data.csv")
df = df[['artists', 'name', 'year']]

def buscarA():
    start = time.time()
    newdf = df.loc[df['artists'].str.contains(art)]
    stop = time.time()
    tempo = stop - start
    print(newdf)
    e = '{:.2f}'.format(tempo)
    print(e)

with open("teste3.txt", "r") as f:
    for row in f:
        art = row
        buscarA()
but the output is always the same:
Empty DataFrame
Columns: [artists, name, year]
Index: []
The problem here is that when you read the lines of your file in Python, each line also includes its trailing line break, so you have to strip it off.
Let's suppose that the first line of your teste3.txt file is "James Brown". It'd be read as "James Brown\n" and not recognized in the search.
Changing the last chunk of your code to:
with open("teste3.txt", "r") as f:
for row in f:
art = row.strip()
buscarA()
should work.
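One extra caveat (my note, not part of the answer above): str.contains treats its argument as a regular expression by default, so an artist name containing characters such as ( or + could raise an error or match the wrong rows. Passing regex=False makes it a plain substring test:

# inside buscarA(): match the artist name literally, not as a regex
newdf = df.loc[df['artists'].str.contains(art, regex=False)]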

How do I remove the first column in a csv file?

I have a CSV file where the first row in the first column is blank, with some numbers in the second and third rows. This whole column is useless and I need to remove it so I can convert the data into a JSON file. I just need to know how to remove the first column of data so I can parse it. Any help is greatly appreciated!
My script is as follows:
#!/usr/bin/python3
import pandas as pd
import csv, json

xls = pd.ExcelFile(r'C:\Users\Andy-\Desktop\Lab2Data.xlsx')
df = xls.parse(sheetname="Sheet1", index_col=None, na_values=['NA'])
df.to_csv('file.csv')

file = open('file.csv', 'r')
lines = file.readlines()
file.close()

data = {}
with open('file.csv') as csvFile:
    csvReader = csv.DictReader(csvFile)
    for rows in csvReader:
        id = rows['Id']
        data[id] = rows

with open('Lab2.json', 'w') as jsonFile:
    jsonFile.write(json.dumps(data, indent=4))
I don't know much about JSON files, but this will remove the first column from your csv file.
with open('new_file.csv', 'w') as out_file:
    with open('file.csv') as in_file:
        for line in in_file:
            test_string = line.strip('\n').split(',')
            out_file.write(','.join(test_string[1:]) + '\n')
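A side note (my reading of the script above, not part of the original answer): the blank-headed first column is almost certainly the DataFrame index that pandas writes by default, so it can be suppressed at the source instead of stripped afterwards:

# write the CSV without the index column in the first place
df.to_csv('file.csv', index=False)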

How to merge big data csv files column-wise into a single csv file using Pandas?

I have lots of big data csv files, one per country, and I want to merge their columns into a single csv file. Each file has 'Year' as an index, and all the files have the same length and the same years. Below is a given example of a Japan.csv file.
If anyone can help me, please let me know. Thank you!!
Try using:
import pandas as pd
import glob

l = []
path = 'path/to/directory/'
csvs = glob.glob(path + "/*.csv")
for i in csvs:
    df = pd.read_csv(i, index_col=None, header=0)
    l.append(df)
df = pd.concat(l, ignore_index=True)
This should work. It goes over each file name, reads it, and combines everything into one df. You can export this df to csv or do whatever else with it. Good luck.
import pandas as pd

def combine_csvs_into_one_df(names_of_files):
    one_big_df = pd.DataFrame()
    for file in names_of_files:
        try:
            content = pd.read_csv(file)
        except PermissionError:
            print(file, "could not be read")
            continue
        one_big_df = pd.concat([one_big_df, content])
        print(file, "added!")
    print("------")
    print("Finished")
    return one_big_df
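Since the question asks for a column-wise merge keyed on 'Year', here is a sketch of that variant; it assumes each file has a 'Year' column plus that country's data columns:

import glob
import pandas as pd

frames = []
for path in glob.glob('path/to/directory/*.csv'):
    # use 'Year' as the index so the files align row by row
    frames.append(pd.read_csv(path).set_index('Year'))

# axis=1 puts each file's columns side by side instead of stacking rows
merged = pd.concat(frames, axis=1)
merged.to_csv('merged.csv')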
