Merge Multiple txt files and removing duplicates from resulting big file - python-3.x

I've been trying to get this to work, without success, because:
The files to be merged are large (up to 20 MB each);
Duplicate lines appear across separate files, which is why they need to be removed from the resulting merged file;
Right now the code runs, but nothing shows up: it basically merges the files without dealing with duplicates.
import os
import io
import pandas as pd

merged_df = pd.DataFrame()
for file in os.listdir(r"C:\Users\username\Desktop\txt"):
    if file.endswith(".txt"):
        file_path = os.path.join(r"C:\Users\username\Desktop\txt", file)
        bytes = open(file_path, 'rb').read()
        merged_df = merged_df.append(pd.read_csv(io.StringIO(
            bytes.decode('latin-1')), sep=";", parse_dates=['Data']))

SellOutCombined = open('test.txt', 'a')
SellOutCombined.write(merged_df.to_string())
SellOutCombined.close()
print(len(merged_df))
Any help is appreciated.
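For reference, a minimal sketch of one way to do this with pandas, keeping the folder, separator, encoding and 'Data' date column from the code above; the output format and everything else are my own assumptions:

import glob
import os
import pandas as pd

folder = r"C:\Users\username\Desktop\txt"   # folder from the question
paths = glob.glob(os.path.join(folder, "*.txt"))

frames = [pd.read_csv(p, sep=";", encoding="latin-1", parse_dates=["Data"]) for p in paths]

merged_df = pd.concat(frames, ignore_index=True)
merged_df = merged_df.drop_duplicates()   # drop rows that appear in more than one file

# write a semicolon-separated file rather than to_string(), so the result stays machine-readable
merged_df.to_csv("test.txt", sep=";", index=False, encoding="latin-1")
print(len(merged_df))

drop_duplicates() compares whole rows, so a line that occurs in several input files ends up in the output only once.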

Related

Combining CSV files using Pandas is appending additional columns rather than right below?

I'm not exactly sure of the best way to describe this problem, but the two screenshots in the original post make it pretty clear: the first shows the current output (the files end up side by side as extra columns) and the second shows the desired output (the rows stacked right below one another).
Here is the code I'm using to combine these files:
import os
import glob
import pandas as pd
os.chdir("mydir")
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames], axis=1)
# export to csv
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')
I've also used this with no luck:
import pandas as pd
import glob
import os
# merging the files
joined_files = os.path.join("mydir", "clean_csv*.csv")
# A list of all joined files is returned
joined_list = glob.glob(joined_files)
# Finally, the files are joined
df = pd.concat(map(pd.read_csv, joined_list), ignore_index=True)
df.to_csv('output-test.csv', index=True, encoding='utf-8-sig', header=None)
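No answer is quoted here, but a sketch of the usual fix: pd.concat stacks frames side by side when axis=1 is passed; leaving it at the default axis=0 appends the rows below one another instead, assuming the files share the same column names:

import glob
import os
import pandas as pd

os.chdir("mydir")  # same working directory as in the question
all_filenames = glob.glob('*.csv')

# axis=0 (the default) stacks rows; ignore_index renumbers them 0..n-1
combined_csv = pd.concat((pd.read_csv(f) for f in all_filenames), ignore_index=True)
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')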

Python CSV merge issue

I'm new to Python and am currently trying to merge CSV files using Python 3.7.
import pandas as pd
import os

newdir = 'C:\\xxxx\\xxxx\\xxxx\\xxxx'
list = os.listdir(newdir)
writer = pd.ExcelWriter('test.xlsx')
for i in range(0, len(list)):
    data = pd.read_csv(list[i], encoding="gbk", index_col=0)
    data.to_excel(writer, sheet_name=list[i])
writer.save()
When I run it I get the error below:
FileNotFoundError: [Errno 2] File b'a.csv' does not exist: b'a.csv'
The problem is that none of the CSV files are merged into one xlsx file. Please let me know the solution.
os.listdir only returns the filenames. You'll need to prepend the folder name to the filename.
import pandas as pd
import os

newdir = 'C:\\xxxx\\xxxx\\xxxx\\xxxx'
names = os.listdir(newdir)
writer = pd.ExcelWriter('test.xlsx')
for name in names:
    path = os.path.join(newdir, name)
    data = pd.read_csv(path, encoding="gbk", index_col=0)
    data.to_excel(writer, sheet_name=name)
writer.save()
Note that I did not bother to check the rest of your code.
Oh, and please avoid shadowing builtins such as list when naming your variables.

Problem with processing large(>1 GB) CSV file

I have a large CSV file and I have to sort and write the sorted data to another csv file. The CSV file has 10 columns. Here is my code for sorting.
data = [x.strip().split(',') for x in open(filename + '.csv', 'r').readlines() if x[0] != 'I']
data = sorted(data, key=lambda x: (x[6], x[7], x[8], int(x[2])))
with open(filename + '_sorted.csv', 'w') as fout:
    for x in data:
        print(','.join(x), file=fout)
It works fine with file size below 500 Megabytes but cannot process files with a size greater than 1 GB. Is there any way I can make this process memory efficient? I am running this code on Google Colab.
Here is a link to a blog about using pandas for large datasets. In the examples from the link, they analyze datasets of roughly 1 GB in size.
Simply use the following to import your CSV data into Python:
import pandas as pd
gl = pd.read_csv('game_logs.csv', sep = ',')
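If a plain read_csv still exhausts memory on Colab, another option (not part of the answer above, just a sketch under the same column layout as the question) is to sort the file in chunks and stream-merge the pre-sorted chunks with heapq.merge, which only keeps one row per chunk in memory. The chunk size is an assumption to tune, and the original filtering of lines starting with 'I' is omitted here:

import csv
import heapq
import tempfile
import pandas as pd

filename = 'data'          # assumed base name, as in the question
chunk_files = []

# 1) Sort manageable chunks in memory and spill each one to a temporary csv file.
for chunk in pd.read_csv(filename + '.csv', header=None, dtype=str, chunksize=500_000):
    chunk['_key'] = chunk[2].astype(int)                        # column 2 is numeric in the original sort key
    chunk = chunk.sort_values(by=[6, 7, 8, '_key']).drop(columns='_key')
    tmp = tempfile.NamedTemporaryFile(suffix='.csv', delete=False)
    tmp.close()
    chunk.to_csv(tmp.name, index=False, header=False)
    chunk_files.append(tmp.name)

# 2) Merge the sorted chunks lazily, row by row, and write the result.
def rows(path):
    with open(path, newline='') as f:
        yield from csv.reader(f)

key = lambda r: (r[6], r[7], r[8], int(r[2]))
with open(filename + '_sorted.csv', 'w', newline='') as fout:
    csv.writer(fout).writerows(heapq.merge(*(rows(p) for p in chunk_files), key=key))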

How to Read multiple files in Python for Pandas separate dataframes

I am trying to read 6 files into 7 different data frames, but I am unable to figure out how to do it. The file names can be completely random, i.e. I know the files, but they are not named like data1.csv, data2.csv.
I tried using something like this:
import sys
import os
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
f1='Norway.csv'
f='Canada.csv'
f='Chile.csv'
Norway = pd.read_csv(Norway.csv)
Canada = pd.read_csv(Canada.csv)
Chile = pd.read_csv(Chile.csv )
I need to read multiple files into different dataframes. It works fine when I do it with one file, like
file = 'Norway.csv'
Norway = pd.read_csv(file)
And I am getting this error:
NameError: name 'norway' is not defined
You can read all the .csv files into one single dataframe:
import glob
import pandas as pd
all_files = glob.glob('*.csv')  # every csv in the working directory
dfs = []
for file_ in all_files:
    df = pd.read_csv(file_, index_col=None, header=0)
    dfs.append(df)
# concatenate all dfs into one
big_df = pd.concat(dfs, ignore_index=True)
and then split the large dataframe into multiple frames (in your case, 7). For example:
import numpy as np
num_chunks = 3
df1,df2,df3 = np.array_split(big_df,num_chunks)
Hope this helps.
After googling for a while looking for an answer, I decided to combine answers from different questions into a solution to this one. This solution will not work for all possible cases; you will have to tweak it to fit yours.
Check out the solution to this question:
# import libraries
import pandas as pd
import numpy as np
import glob
import os

# Declare a function for extracting a string between two characters
def find_between(s, first, last):
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end]
    except ValueError:
        return ""

path = '/path/to/folder/containing/your/data/sets'  # use your path
all_files = glob.glob(path + "/*.csv")
list_of_dfs = [pd.read_csv(filename, encoding="ISO-8859-1") for filename in all_files]
list_of_filenames = [find_between(filename, 'sets/', '.csv') for filename in all_files]  # 'sets' is the last word in your path
# Create a dictionary with table names as the keys and data frames as the values
dfnames_and_dfvalues = dict(zip(list_of_filenames, list_of_dfs))
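With the dictionary in place, each frame can be pulled back out by its file stem; a small usage sketch, where 'Norway' is just an assumed file name taken from the question:

norway_df = dfnames_and_dfvalues['Norway']   # hypothetical key, one of your file stems
print(norway_df.head())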

Combine multiple csv files into a single xls workbook Python 3

We are in the transition at work from Python 2.7 to Python 3.5. It's a company-wide change and most of our current scripts were written in 2.7 with no additional libraries. I've taken advantage of the Anaconda distro we are using and have already changed most of our scripts over, using the 2to3 module or completely rewriting them. I am stuck on one piece of code, though, which I did not write, and the original author is not here. He also did not supply comments, so I can only guess at the whole of the script. 95% of the script works correctly until the end where, after it creates 7 csv files with different parsed information, it has a custom function to combine the csv files into an xls workbook with each csv as a new tab.
import csv
import os
import xlwt
import glob
import openpyxl
from openpyxl import Workbook

# directory, wb, oshort and timestr are defined earlier in the full script
Parsefiles = glob.glob(directory + '/' + "Parsed*.csv")

def xlsmaker():
    for f in Parsefiles:
        (path, name) = os.path.split(f)
        (short_name, extension) = os.path.splitext(name)
        ws = wb.add_sheet(short_name)
        xreader = csv.reader(open(f, 'rb'))
        newdata = [line for line in xreader]
        for rowx, row in enumerate(newdata):
            for colx, value in enumerate(row):
                if value.isdigit():
                    ws.write(rowx, colx, value)

xlsmaker()

for f in Parsefiles:
    os.remove(f)

wb.save(directory + '/' + "Finished" + '' + oshort + '' + timestr + ".xls")
This was written all in python 2.7 and still works correctly if I run it in python 2.7. The issue is that it throws an error when running in python 3.5.
File "parsetool.py", line 521, in (module)
xlsmaker()
File "parsetool.py", line 511, in xlsmaker
ws = wb.add_sheet(short_name)
File "c:\pythonscripts\workbook.py", line 168 in add_sheet
raise TypeError("The paramete you have given is not of the type '%s'"% self._worksheet_class.__name__)
TypeError: The parameter you have given is not of the type "Worksheet"
Any ideas about what should be done to fix the above error? I've tried multiple rewrites, but I get similar errors or new ones. I'm considering just figuring out a whole new method to create the xls, possibly with pandas instead.
Not sure why it errors. It is worth the effort to rewrite the code and use pandas instead. Pandas can read each csv file into a separate dataframe and save all dataframes as separate sheets in an xls(x) file. This can be done by using the ExcelWriter of pandas. E.g.
import pandas as pd
writer = pd.ExcelWriter('yourfile.xlsx', engine='xlsxwriter')
df = pd.read_csv('originalfile.csv')
df.to_excel(writer, sheet_name='sheetname')
writer.save()
Since you have multiple csv files, you would probably want to read all csv files and store them as a df in a dict. Then write each df to Excel with a new sheet name.
Multi-csv Example:
import pandas as pd
import sys
import os
writer = pd.ExcelWriter('default.xlsx') # Arbitrary output name
for csvfilename in sys.argv[1:]:
    df = pd.read_csv(csvfilename)
    df.to_excel(writer, sheet_name=os.path.splitext(csvfilename)[0])
writer.save()
(Note that it may be necessary to pip install openpyxl to resolve errors about the xlsxwriter/openpyxl import being missing.)
You can use the code below, to read multiple .csv files into one big .xlsx Excel file.
I also added the code for replacing ',' by '.' (or vice versa) for improved compatibility on Windows environments, depending on your locale settings.
import pandas as pd
import sys
import os
import glob
from pathlib import Path
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
writer = pd.ExcelWriter('fc15.xlsx') # Arbitrary output name
for csvfilename in all_filenames:
    txt = Path(csvfilename).read_text()
    txt = txt.replace(',', '.')
    text_file = open(csvfilename, "w")
    text_file.write(txt)
    text_file.close()
    print("Loading " + csvfilename)
    df = pd.read_csv(csvfilename, sep=';', encoding='utf-8')
    df.to_excel(writer, sheet_name=os.path.splitext(csvfilename)[0])
    print("done")
writer.save()
print("task completed")
Here's a slight extension to the accepted answer. Pandas 1.5 complains about the call to writer.save(). The fix is to use the writer as a context manager.
import sys
from pathlib import Path
import pandas as pd
with pd.ExcelWriter("default.xlsx") as writer:
    for csvfilename in sys.argv[1:]:
        p = Path(csvfilename)
        sheet_name = p.stem[:31]
        df = pd.read_csv(p)
        df.to_excel(writer, sheet_name=sheet_name)
This version also trims the sheet name down to fit in Excel's maximum sheet name length, which is 31 characters.
If your csv file is in Chinese with gbk encoding, you can use the following code
import pandas as pd
import glob
import datetime
from pathlib import Path
now = datetime.datetime.now()
extension = "csv"
all_filenames = [i for i in glob.glob(f"*.{extension}")]
with pd.ExcelWriter(f"{now:%Y%m%d}.xlsx") as writer:
    for csvfilename in all_filenames:
        print("Loading " + csvfilename)
        df = pd.read_csv(csvfilename, encoding="gb18030")
        df.to_excel(writer, index=False, sheet_name=Path(csvfilename).stem)
        print("done")
print("task completed")
