How to iterate through a folder of csv files - python-3.x

I'm really new to python, so please bear with me!
I have a folder on my desktop that contains a few csv files named "File 1.csv", "File 2.csv", and so on. In each file, there is a table that looks like:
Animal  Level
Cat     1
Dog     2
Bird    3
Snake   4
But each one of the files has a few differences in the "Animal" column. I wrote the following code that compares only two files at a time and returns the animals that match:
import pandas as pd

def matchlist(file1, file2):
    new_df = pd.DataFrame()
    file_one = pd.read_csv(file1)
    file_two = pd.read_csv(file2)
    for i in file_one["Animal"]:
        df_temp = file_two[file_two["Animal"] == i]
        new_df = new_df.append(df_temp)
        df_temp = pd.DataFrame()
    return new_df
But that only compares two files at a time. Is there a way that will iterate through all the files in that single folder and return all the ones that match to the new_df above?
For example, new_df compares file 1 and file 2. Then, I am looking for code that compares new_df to file 3, file 4, file 5, and so on.
Thanks!

I'm not sure if this is really what you want (I can't comment on questions yet), so:
This function returns a dataframe with the animals that can be found in all of the csv files (which can be very small). It uses the animal name as the key, so the level value is not considered.
import pandas as pd
import os, sys

def matchlist_iter(folder_path):
    # get the list of filenames in the folder and throw away all non-csv files
    files = [file_path for file_path in os.listdir(folder_path) if file_path.endswith('.csv')]
    # init the result df with the first csv
    df = pd.read_csv(os.path.join(folder_path, files[0]))
    for file_path in files[1:]:
        print('compare: {}'.format(file_path))
        df_other = pd.read_csv(os.path.join(folder_path, file_path))
        # only keep the animals that are in both frames
        df = df[df['Animal'].isin(df_other['Animal'])]
    return df

if __name__ == '__main__':
    matched = matchlist_iter(sys.argv[1])
    print(matched)
I've found a similar question with more answers regarding the matching here: Compare Python Pandas DataFrames for matching rows
EDIT: added csv and sample output
csv1.csv
Animal, Level
Cat, 1
Dog, 2
Bird, 3
Snake, 4
csv2.csv
Animal, Level
Cat, 1
Parrot, 2
Bird, 3
Horse, 4
output
compare: csv2.csv
  Animal  Level
0    Cat      1
2   Bird      3
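(For reference, the same intersection can also be written with an inner merge, as in the linked question. A minimal sketch, assuming the folder layout above; matchlist_merge and the use of functools.reduce are my own additions, not part of the original answer:)

import os
from functools import reduce
import pandas as pd

def matchlist_merge(folder_path):
    files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]
    frames = [pd.read_csv(os.path.join(folder_path, f)) for f in files]
    # an inner merge on 'Animal' keeps only the animals present in every file;
    # merging on just the key column ignores differing 'Level' values
    return reduce(lambda acc, other: acc.merge(other[['Animal']], on='Animal', how='inner'), frames)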

I constructed a set of six files in which only the first columns of File 1.csv and File 6.csv are identical.
Only the first column of each csv is needed for comparison purposes, so I arrange to extract just that column from each file.
>>> import pandas as pd
>>> from pathlib import Path
>>> column_1 = pd.read_csv('File 1.csv', sep=r'\s+')['Animal'].tolist()
>>> column_1
['Cat', 'Dog', 'Bird', 'Snake']
>>> for filename in Path('.').glob('*.csv'):
...     if filename.name == 'File 1.csv':
...         continue
...     next_column = pd.read_csv(filename.name, sep=r'\s+')['Animal'].tolist()
...     if column_1 == next_column:
...         print(filename.name)
...
File 6.csv
As expected, File 6.csv is the only file found to be identical (in the first column) to File 1.csv.

Related

pandas read_csv with data and headers in alternate columns

I have a generated CSV file that:
- doesn't have headers
- has header and data occurring alternately in every row (the headers do not change from row to row).
E.g.:
imageId,0,feat1,30,feat2,34,feat,90
imageId,1,feat1,0,feat2,4,feat,89
imageId,2,feat1,3,feat2,3,feat,80
IMO, this format is redundant and cumbersome (I don't see why anyone would generate files in this format). The saner/normal CSV version of the same data (which I could read directly using pd.read_csv()) would be:
imageId,feat1,feat2,feat
0,30,34,90
1,0,4,89
2,3,3,80
My question is: how do I read the original data into a pandas dataframe? For now, I do a read_csv and then drop all alternate columns:
df = pd.read_csv(file, header=None)
df = df[range(1, len(df.columns), 2)]
The problem with this is that I don't get the headers unless I make it a point to specify them.
Is there a simpler way of telling pandas that the format has data and headers in every row?
Select the data columns by position with DataFrame.iloc and set the new column names from the header values in the first row (assuming the header columns hold the same values in every row, as in the sample data):
# default headers
import pandas as pd

df = pd.read_csv(file, header=None)
df1 = df.iloc[:, 1::2]
df1.columns = df.iloc[0, ::2].tolist()
print(df1)

   imageId  feat1  feat2  feat
0        0     30     34    90
1        1      0      4    89
2        2      3      3    80
I didn't measure it, but I would expect that reading the entire file (redundant headers plus actual data) before filtering out the interesting columns could be a problem. So I tried to exploit the optional parameters nrows and usecols to (hopefully) limit the amount of memory needed to process the CSV input file.
# --- Utilities for generating test data ---
import random as rd

def write_csv(file, line_count=100):
    with open(file, 'w') as f:
        r = lambda: rd.randrange(100)
        for i in range(line_count):
            line = f"imageId,{i},feat1,{r()},feat2,{r()},feat,{r()}\n"
            f.write(line)

file = 'text.csv'
# Generate a small CSV test file
write_csv(file, 10)

# --- Actual answer ---
import pandas as pd

# Read the columns of the first row
dfi = pd.read_csv(file, header=None, nrows=1)
ncols = dfi.size
# Read the data columns only
dfd = pd.read_csv(file, header=None, usecols=range(1, ncols, 2))
dfd.columns = dfi.iloc[0, ::2].to_list()
print(dfd)

How to clean CSV file for a coordinate system using pandas?

I wanted to create a program to convert CSV files to DXF (AutoCAD), but the CSV file sometimes comes with a header and sometimes without one, and there are cells that cannot be empty, such as the coordinates. I also noticed that after excluding some of the inputs the value is nan or NaN, and it was necessary to get rid of them. So I offer you my answer; please share your opinions so a better method can be implemented.
(Sample input and output were shown as images in the original post.)
solution
import string
import pandas

def pandas_clean_csv(csv_file):
    """
    Function pandas_clean_csv Documentation
    - I got help from this site; it may help you as well
      ("Get the row with the largest number of missing data" for more documentation):
      https://moonbooks.org/Articles/How-to-filter-missing-data-NAN-or-NULL-values-in-a-pandas-DataFrame-/
    """
    try:
        if not csv_file.endswith('.csv'):
            raise TypeError("Be sure you select a .csv file")
        # get the punctuation marks as a list: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
        punctuations_list = [mark for mark in string.punctuation]
        # import the csv file and read it with pandas
        data_frame = pandas.read_csv(
            filepath_or_buffer=csv_file,
            header=None,
            skip_blank_lines=True,
            error_bad_lines=True,
            encoding='utf8',
            na_values=punctuations_list
        )
        # if the elevation column is NaN, convert it to 0
        data_frame[3] = data_frame.iloc[:, 3].fillna(0)
        # if the Description column is NaN, convert it to '-'
        data_frame[4] = data_frame.iloc[:, 4].fillna('-')
        # select the coordinate columns
        coord_columns = data_frame.iloc[:, [1, 2]]
        # convert the coordinate columns to a numeric type
        coord_columns = coord_columns.apply(pandas.to_numeric, errors='coerce', axis=1)
        # find the rows with missing data
        index_with_nan = coord_columns.index[coord_columns.isnull().any(axis=1)]
        # remove the rows with missing data
        data_frame.drop(index_with_nan, axis=0, inplace=True)
        # iterate over the data frame as tuples
        output_clean_csv = data_frame.itertuples(index=False)
        return output_clean_csv
    except Exception as E:
        print(f"Error: {E}")
        exit(1)

out_data = pandas_clean_csv('csv_files/version2_bad_headers.csv')
for i in out_data:
    print(i[0], i[1], i[2], i[3], i[4])
Here you can download my test CSV files.

create dataframe from lists

I want to create a DataFrame from existing lists (each row of the file should be written to a row of the DataFrame).
with open(filename, mode='r', encoding='cp1252') as f:
    lines = f.readlines()
liste1 = str(lines[0])
df1 = pd.DataFrame(liste1)
Who can help me, please?
Below are the first 3 rows of file f1:
['x1', 'major', '1198', 'TCP']
['x1', 'minor', '1198', 'UDP']
['x2', 'major', '1198', 'UDP']
If I understand this properly, you want each row in the DataFrame to be a string read from a line in the file?
Note that liste1 in your case is a string, so I am not sure what you are going for.
This approach should work anyway.
import pandas as pd

df1 = pd.DataFrame()
with open(filename, mode='r', encoding='cp1252') as f:
    lines = f.readlines()
liste1 = str(lines[0])
df1 = df1.append(pd.Series(liste1), ignore_index=True)
So if liste1 has the form
"This is a string"
then your DataFrame will look like this:
df1.head()
                  0
0  This is a string
If liste1 has the form
["This", "is", "a", "list"]
then your DataFrame will look like this:
df1.head()
      0   1  2     3
0  This  is  a  list
You can then call this append() routine as many times as you want inside a loop.
However, I suspect there is a function, such as pd.read_table(), that can do this all for you automatically (as @jezrael suggested in the comments on your question).
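As a rough sketch of that idea: if each line of the file really is a bracketed, comma-separated list like the sample above, you could parse the lines yourself and hand the result to the DataFrame constructor. The filename and the column names below are assumptions for illustration, not from the original question:

import pandas as pd

rows = []
with open('f1.txt', mode='r', encoding='cp1252') as f:  # hypothetical filename
    for line in f:
        # strip the surrounding brackets, split on commas, and remove
        # quote characters (straight or curly) around each field
        fields = [part.strip(" '\u2018\u2019") for part in line.strip().strip('[]').split(',')]
        rows.append(fields)

# the column names are assumed for illustration
df = pd.DataFrame(rows, columns=['host', 'severity', 'port', 'protocol'])
print(df)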

How to merge big data of csv files column wise into a single csv file using Pandas?

I have lots of big CSV data files, one per country, and I want to merge their columns into a single CSV file. Each file has 'Year' as its index, and all files have the same length and the same index values. Below is an example of a Japan.csv file (shown as an image in the original post).
If anyone can help me, please let me know. Thank you!!
Try using:
import pandas as pd
import glob

l = []
path = 'path/to/directory/'
csvs = glob.glob(path + "/*.csv")
for i in csvs:
    df = pd.read_csv(i, index_col=None, header=0)
    l.append(df)
df = pd.concat(l, ignore_index=True)
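(Note that pd.concat with ignore_index=True stacks the frames row-wise. Since the question asks for a column-wise merge on a shared 'Year' index, a variant along these lines may be closer to what is wanted; reading with index_col='Year' is an assumption about the files' layout:)

import glob
import pandas as pd

frames = [pd.read_csv(f, index_col='Year') for f in glob.glob('path/to/directory/*.csv')]
# axis=1 aligns the frames on the shared 'Year' index and places each
# file's columns side by side instead of stacking the rows
merged = pd.concat(frames, axis=1)
merged.to_csv('merged.csv')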
This should work, too. It goes over each file name, reads it, and combines everything into one df. You can export this df to csv or do whatever you like with it. Good luck.
import pandas as pd

def combine_csvs_into_one_df(names_of_files):
    one_big_df = pd.DataFrame()
    for file in names_of_files:
        try:
            content = pd.read_csv(file)
        except (FileNotFoundError, PermissionError):
            print(file, "could not be read")
            continue
        one_big_df = pd.concat([one_big_df, content])
        print(file, "added!")
    print("------")
    print("Finished")
    return one_big_df
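A quick usage sketch (the country filenames here are made up for illustration):

big_df = combine_csvs_into_one_df(['Japan.csv', 'Germany.csv', 'France.csv'])
big_df.to_csv('all_countries.csv', index=False)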

Creating multiple dataframes with a loop

This undoubtedly reflects a lack of knowledge on my part, but I can't find anything online to help. I am very new to programming. I want to load 6 csvs and do a few things to them before combining them later. The following code iterates over each file but only creates one dataframe, called df.
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
for df, file in zip(dfs, files):
    df = pd.read_csv(file)
    print(df.shape)
    print(df.dtypes)
    print(list(df))
Use a dictionary to store your DataFrames and access them by name:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs_names = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
dfs = {}
for dfn, file in zip(dfs_names, files):
    dfs[dfn] = pd.read_csv(file)
    print(dfs[dfn].shape)
    print(dfs[dfn].dtypes)
print(dfs['df3'])
Or use a list to store your DataFrames and access them by index:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = []
for file in files:
    dfs.append(pd.read_csv(file))
    print(dfs[-1].shape)
    print(dfs[-1].dtypes)
print(dfs[2])
Or do not store the intermediate DataFrames at all; just process each one and append it to a resulting DataFrame:
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
df = pd.DataFrame()
for file in files:
    df_n = pd.read_csv(file)
    print(df_n.shape)
    print(df_n.dtypes)
    # do whatever you want to do with df_n here
    df = df.append(df_n)
print(df)
If you will process each file differently, then you do not need a general structure to store them; simply handle each one independently:
df = pd.DataFrame()

def do_general_stuff(d):  # here we do the things common to every DataFrame
    print(d.shape, d.dtypes)

df1 = pd.read_csv("data1.csv")
# do whatever you want to do with df1 here
do_general_stuff(df1)
df = df.append(df1)
del df1

df2 = pd.read_csv("data2.csv")
# do whatever you want to do with df2 here
do_general_stuff(df2)
df = df.append(df2)
del df2

df3 = pd.read_csv("data3.csv")
# do whatever you want to do with df3 here
do_general_stuff(df3)
df = df.append(df3)
del df3

# ... and so on
And here is one geeky way, but don't ask how it works. :)
from collections import namedtuple

files = ['data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv']
df = namedtuple('Cdfs',
                ['df1', 'df2', 'df3', 'df4', 'df5', 'df6']
                )(*[pd.read_csv(file) for file in files])

for df_n in df._fields:
    print(getattr(df, df_n).shape, getattr(df, df_n).dtypes)
print(df.df3)
I think you think your code is doing something that it is not actually doing.
Specifically, this line: df = pd.read_csv(file)
You might think that in each iteration through the for loop this line is executed with df replaced by a string from dfs and file replaced by a filename from files. While the latter is true, the former is not.
Each iteration through the for loop reads a csv file and stores it in the variable df, effectively overwriting the dataframe that was read in during the previous iteration. In other words, df in your for loop is not being replaced with the variable names you defined in dfs.
The key takeaway here is that strings (e.g., 'df1', 'df2', etc.) cannot be substituted and used as variable names when executing code.
One way to achieve the result you want is to store each csv file read by pd.read_csv() in a dictionary, where the key is the name of the dataframe (e.g., 'df1', 'df2', etc.) and the value is the dataframe returned by pd.read_csv().
list_of_dfs = {}
for df, file in zip(dfs, files):
    list_of_dfs[df] = pd.read_csv(file)
    print(list_of_dfs[df].shape)
    print(list_of_dfs[df].dtypes)
    print(list(list_of_dfs[df]))
You can then reference each of your dataframes like this:
print(list_of_dfs['df1'])
print(list_of_dfs['df2'])
You can learn more about dictionaries here:
https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries
A dictionary can store them, too:
import pandas as pd
from pprint import pprint

files = ('doms_stats201610051.csv', 'doms_stats201610052.csv')
dfsdic = {}
dfs = ('df1', 'df2')
for df, file in zip(dfs, files):
    dfsdic[df] = pd.read_csv(file)
    print(dfsdic[df].shape)
    print(dfsdic[df].dtypes)
    print(list(dfsdic[df]))
print(dfsdic['df1'].shape)
print(dfsdic['df2'].shape)
