Consolidating CSV files into one master file - python-3.x

I am facing the following challenge:
I have approximately 400 files that I have to consolidate into one master file, but the files have different headers, so when I try to consolidate them, data ends up in different rows and columns depending on each file's header.
Example: let's say I have two files, C1 and C2.
file C1.csv
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
and file C2.csv
name,last-name,phone-no,add-line1,add-line2,add-line3
jorge,aggarwal,65465464654,line1,line2,line3
brad,smit,456446546454,line1,line2,line3
joy,kennedy,65654644646,line1,line2,line3
From these two files, I want the consolidated output to look like this:
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
jorge aggarwal,65465464654,line1-line2-line3
brad smit,456446546454,line1-line2-line3
joy kennedy,65654644646,line1-line2-line3
For consolidation, I am using the following code:
import glob
import pandas as pd

directory = 'C:/Test'  # specify the directory containing the 400 files
filelist = sorted(glob.glob(directory + '/*.csv'))  # read all 400 file paths in the directory into a list

consolidated = pd.DataFrame()  # create a new empty dataframe for consolidation
for file in filelist:  # iterate through each of the 400 files
    df1 = pd.read_csv(file)  # create df from the current file
    df1col = list(df1.columns)  # save its columns to a list
    df2 = consolidated  # set the consolidated result as df2
    df2col = list(df2.columns)  # save columns from the consolidated result as a list
    commoncol = [i for i in df1col for j in df2col if i == j]  # check both lists for common column names
    if commoncol == []:  # in the first iteration consolidated is empty, so there are no common columns
        consolidated = pd.concat([df1, df2], axis=1).fillna(value=0)  # concatenate with no common columns, replacing null values with 0
    else:
        consolidated = df1.merge(df2, how='outer', on=commoncol).fillna(value=0)  # merge both dfs on the common columns and replace null values with 0
    # print(consolidated)  # optionally, check the consolidated df at each iteration

# write the consolidated df to another CSV
consolidated.to_csv('C:/<filepath>/consolidated.csv', header=True, index=False)
but it can't merge columns that hold the same data into a single column, as in the output shown earlier.

From your two-file example, you know the final (lowest-common-denominator) header for the output, and you know what one of the bigger headers looks like.
My take on that is to think of every "other" kind of header as needing a mapping to the final header, like concatenating add-lines 1-3 into a single address field. We can use the csv module to read and write row-by-row and send the rows to the appropriate consolidator (mapping) based on the header of the input file.
The csv module provides a DictReader and a DictWriter, which make dealing with fields you know by name very handy. In particular, the DictWriter() constructor has the extrasaction="ignore" option, which means that if you tell the writer your fields are:
Col1, Col2, Col3
and you pass a dict like:
{"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"}
it will just ignore Col4 and only write Cols 1-3:
import csv
import sys

writer = csv.DictWriter(sys.stdout, fieldnames=["Col1", "Col2", "Col3"], extrasaction="ignore")
writer.writeheader()
writer.writerow({"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"})
# Col1,Col2,Col3
# val1,val2,val3
import csv

def consolidate_add_lines_1_to_3(row):
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row

# Add other consolidators here...
# ...

Final_header = ["name", "phone-no", "address"]

f_out = open("output.csv", "w", newline="")
writer = csv.DictWriter(f_out, fieldnames=Final_header, extrasaction="ignore")
writer.writeheader()

for fname in ["file1.csv", "file2.csv"]:
    f_in = open(fname, newline="")
    reader = csv.DictReader(f_in)
    for row in reader:
        if "add-line1" in row and "add-line2" in row and "add-line3" in row:
            row = consolidate_add_lines_1_to_3(row)
        # Add conditions for other consolidators here...
        # ...
        writer.writerow(row)
    f_in.close()

f_out.close()
If there is more than one kind of "other" header, you'll need to seek those out, figure out the extra consolidators to write, and add the conditions that trigger them in the for row in reader loop.
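For instance, here is a minimal sketch of that dispatch, keyed on a frozenset of each file's fieldnames. It assumes the two header signatures from the example are the only ones; CONSOLIDATORS and consolidate_c2_style are illustrative names, and this version also folds last-name into name, as the expected output does:

import csv

def consolidate_c2_style(row):
    # assumed mapping for C2-style headers: fold last-name into name,
    # and join add-line1..3 into a single address field
    row["name"] = row["name"] + " " + row["last-name"]
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row

# Map each known header signature to its consolidator; files already in the
# final shape pass through unchanged. Extend this dict as new headers turn up.
CONSOLIDATORS = {
    frozenset(["name", "phone-no", "address"]): lambda row: row,
    frozenset(["name", "last-name", "phone-no",
               "add-line1", "add-line2", "add-line3"]): consolidate_c2_style,
}

with open("output.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=["name", "phone-no", "address"], extrasaction="ignore")
    writer.writeheader()
    for fname in ["C1.csv", "C2.csv"]:  # in practice: sorted(glob.glob('C:/Test/*.csv'))
        with open(fname, newline="") as f_in:
            reader = csv.DictReader(f_in)
            consolidate = CONSOLIDATORS[frozenset(reader.fieldnames)]  # a KeyError here flags an unmapped header
            for row in reader:
                writer.writerow(consolidate(row))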

Remove columns starting with same special character in a csv file using Python

My csv file has below columns:
AFM_reversal_indicator,Alert_Message,axiom_key,_timediff,player,__mv_Splunk_Alert_Id,__mv_nbr_plastic,__mv_code_acct_stat_demo.
I want to remove columns starting with "__mv".
I saw some posts where pandas is used to filter out columns.
Is it possible to do it using the csv module in Python? If yes, how?
Also, with pandas, what regex should I give:
df.filter(regex='')
df.to_csv(output_file_path)
P.S. I am using Python 3.8.
You mean with standard Python? You can use a list comprehension, e.g.
import csv

with open( 'data.csv', 'r' ) as f:
    DataGenerator = csv.reader( f )
    Header = next( DataGenerator )
    Header = [ Col.strip() for Col in Header ]
    Data = list( DataGenerator )
    if Data[-1] == []: del( Data[-1] )
    Data = [ [ Row[i] for i in range( len( Header ) ) if not Header[i].startswith( "__mv" ) ] for Row in Data ]
    Header = [ Col for Col in Header if not Col.startswith( "__mv" ) ]
However, this is just a simple example. You'll probably have further things to consider, e.g. what types your csv columns have, whether you want to read all the data at once like I do here or one-by-one from the generator to save memory, etc.
You could also use the builtin filter command instead of the inner list comprehension.
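A minimal sketch of that variant (reusing Header and Data from the snippet above):

# pick the indices of the columns to keep with filter(), then rebuild Data and Header
keep = list(filter(lambda i: not Header[i].startswith("__mv"), range(len(Header))))
Data = [[Row[i] for i in keep] for Row in Data]
Header = [Header[i] for i in keep]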
Also, if you have numpy installed and you want something more 'numerical', you can always use "structured numpy arrays" (https://numpy.org/doc/stable/user/basics.rec.html). They're quite nice. (Personally, I prefer them to pandas anyway.) Numpy also has its own csv-reading functions (see: https://www.geeksforgeeks.org/how-to-read-csv-files-with-numpy/)
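A minimal sketch of that route (assuming genfromtxt's default name sanitizing leaves the "__mv" prefixes intact):

import numpy as np

# read the csv into a structured array; names=True takes the field names from the header row
arr = np.genfromtxt('data.csv', delimiter=',', names=True, dtype=None, encoding=None)

# keep only the fields whose names don't start with "__mv"
keep = [n for n in arr.dtype.names if not n.startswith('__mv')]
cleaned = arr[keep]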
You don't need to use .filter for that. You can just find out which columns those are and then drop them from the DataFrame:
import pandas as pd

# Load the dataframe (in our case, create a dummy one with your columns)
df = pd.DataFrame(columns=["AFM_reversal_indicator", "Alert_Message", "axiom_key",
                           "_timediff", "player", "__mv_Splunk_Alert_Id",
                           "__mv_nbr_plastic", "__mv_code_acct_stat_demo"])

# Get a list of all column names starting with "__mv"
mv_columns = [col for col in df.columns if col.startswith("__mv")]

# Drop the columns
df = df.drop(columns=mv_columns)

# Save the updated dataframe to a CSV file
df.to_csv("cleaned_data.csv", index=False)
The mv_columns comprehension iterates through the columns in your DataFrame and picks those that start with "__mv"; .drop then removes them.
If for some reason you want to use the csv package only, the solution might not be as elegant as with pandas, but here is a suggestion:
import csv

with open("original_data.csv", "r") as input_file, open("cleaned_data.csv", "w", newline="") as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)

    header_row = next(reader)
    mv_columns = [col for col in header_row if col.startswith("__mv")]
    mv_column_indices = [header_row.index(col) for col in mv_columns]

    new_header_row = [col for col in header_row if col not in mv_columns]
    writer.writerow(new_header_row)

    for row in reader:
        new_row = [row[i] for i in range(len(row)) if i not in mv_column_indices]
        writer.writerow(new_row)
So, first you read the header row. With similar logic as before, you find the columns that start with "__mv" and get their indices. You write the new header row, keeping only the columns that are not in the "__mv" set. Then you iterate through the rest of the CSV, removing those columns as you go.
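One caveat worth adding (my note, not the original answer's): header_row.index(col) returns only the first match, so duplicated column names would all map to one index. Collecting the indices with enumerate avoids that:

# collect the "__mv" indices directly, so duplicate column names are handled too
mv_column_indices = [i for i, col in enumerate(header_row) if col.startswith("__mv")]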

comparing two csv files in python that have different data sets

Using Python, I want to compare two csv files, but only compare field 2 of each row in the first csv against field 0 of each row in the second csv, and write to a new csv file only the lines where there is no match for the compared fields.
Example:
currentstudents.csv contains the following information
Susan,Smith,susan.smith#mydomain.com,8
John,Doe,john.doe#mydomain.com,9
Cool,Guy,cool.guy#mydomain.com,3
Test,User,test.user#mydomain.com,5
previousstudents.csv contains the following information
susan.smith#mydomain.com
john.doe#mydomain.com
test.user#mydomain.com
After comparing the two csv files, a new csv called NewStudents.csv should be written with the following information:
Cool,Guy,cool.guy#mydomain.com,3
Here is what I have, but it fails to produce what I need. The code below works if I omit all data except the email address from the original currentstudents.csv file, but then I don't end up with the needed data in the final csv file.
import fileinput

def newusers():
    # lowercase the current students file in place
    for line in fileinput.input(r'C:\work\currentstudents.csv', inplace=1):
        print(line.lower(), end='')
    with open(r'C:\work\previousstudents.csv', 'r') as t1, open(r'C:\work\currentstudents.csv', 'r') as t2:
        fileone = t1.readlines()
        filetwo = t2.readlines()
    with open(r'C:\work\NewStudents.csv', 'w') as outFile:
        for (line[0]) in filetwo:
            if (line[0]) not in fileone:
                outFile.write(line)
Thanks in advance!
This script writes NewStudents.csv:
import csv

with open('currentstudents.csv', newline='') as csvfile1, \
     open('previousstudents.csv', newline='') as csvfile2, \
     open('NewStudents.csv', 'w', newline='') as csvfile3:

    reader1 = csv.reader(csvfile1)
    reader2 = csv.reader(csvfile2)
    csvwriter = csv.writer(csvfile3)

    # collect all previous-student emails (field 0) into a set for fast lookup
    emails = set(row[0] for row in reader2)

    # write only the current students whose email (field 2) was not seen before
    for row in reader1:
        if row[2] not in emails:
            csvwriter.writerow(row)
The content of NewStudents.csv:
Cool,Guy,cool.guy#mydomain.com,3
A pandas option
For small files it's not going to matter, but for larger files, the vectorized operations of pandas will be significantly faster than iterating through emails (multiple times) with csv.
Read the data with pd.read_csv
Merge the data with pandas.DataFrame.merge
The columns do not have names in the question, so columns are selected by column index.
Select the desired new students with Boolean indexing with [all_students._merge == 'left_only'].
.iloc[:, :-2] selects all rows, and all but last two columns.
import pandas as pd
# read the two csv files
cs = pd.read_csv('currentstudents.csv', header=None)
ps = pd.read_csv('previousstudents.csv', header=None)
# merge the data
all_students = cs.merge(ps, left_on=2, right_on=0, how='left', indicator=True)
# select only data from left_only
new_students = all_students.iloc[:, :-2][all_students._merge == 'left_only']
# save the data without the index or header
new_students.to_csv('NewStudents.csv', header=False, index=False)
# NewStudents.csv
Cool,Guy,cool.guy#mydomain.com,3

Merge duplicate rows in a text file using python based on a key column

I have a csv file and I need to merge rows based on a key column, Name.
a.csv
Name|Acc#|ID|Age
Suresh|2345|a-b2|24
Mahesh|234|a-vf|34
Mahesh|4554|a-bg|45
Keren|344|s-bg|45
yankie|999|z-bg|34
yankie|3453|g-bgbbg|45
Expected output: records merged by Name, so the values from both rows for Mahesh and yankie are combined:
Name|Acc#|ID|Age
Suresh|2345|a-b2|24
Mahesh|[234,4554]|[a-vf,a-bg]|[34,45]
Keren|344|s-bg|45
yankie|[999,3453]|[z-bg,g-bgbbg]|[34,45]
Can someone help me with this in Python?
import pandas as pd

df = pd.read_csv("a.csv", sep="|", dtype=str)

# group by Name; for groups with more than one row, collect each column's
# unique values into a list, otherwise keep the single value as-is
new_df = df.groupby('Name', as_index=False).aggregate(lambda tdf: tdf.unique().tolist() if tdf.shape[0] > 1 else tdf)

new_df.to_csv("data.csv", index=False, sep="|")
Output:
Name|Acc#|ID|Age
Keren|344|s-bg|45
Mahesh|['234', '4554']|['a-vf', 'a-bg']|['34', '45']
Suresh|2345|a-b2|24
yankie|['999', '3453']|['z-bg', 'g-bgbbg']|['34', '45']

Searching for specific information in one CSV file derived from two other CSV files (python)

I have a nutrition database spread across 3 CSV files. After cleaning, the first file contains two columns: nutrient id and nutrient name; the second file contains two columns: food id and food description (name); and the third file contains three columns: nutrient id, food id, and amount (of the nutrient in that food).
Since there are several million lines, I can't open each file separately every time and check which id corresponds to which nutrient or food. So I am trying to write code that reads all three files, searches for matches on the id columns for both nutrient (from the first file) and food (from the second file), replaces each id with a name, and returns 3 columns: nutrient_name, food_name, amount.
Now, there is a complication: in files 1 and 2 the lines are sorted by id, while in the third file (with the amounts) the lines are sorted by nutrient_id (meaning the food id column is chaotic). So I can't just merge the three files, or replace the id columns with name columns in the third file...
Here is an example of my code, which does not return what I need. I am quite stuck because I can't find the answer on the internet. Thanks!
# -*- coding: utf-8 -*-
"""
Created on Fri Nov 8 17:38:45 2019
@author: user
"""
import pandas as pd

#%% reading csv files
# read the first csv file with nutrient_name, nutrient_id
df1 = pd.read_csv('nutrient.csv', low_memory=False)
print(df1)

# read specific columns from the first csv file
df1 = pd.read_csv('nutrient.csv', usecols=['id', 'name'], low_memory=False)
df1.rename(columns={'name': 'nut_name'}, inplace=True)
print(df1)

# read the second csv file with food_id and food_name, keeping specific columns
df2 = pd.read_csv('food.csv', usecols=['fdc_id', 'description'], low_memory=False)
print(df2)

# read the third csv file with food_id, nutrient_id and nutrient amount
df3 = pd.read_csv('food_nutrient.csv', usecols=['fdc_id', 'nutrient_id', 'amount'], low_memory=False)
print(df3)

#%% create a list of rows from each csv file
# Create empty lists for file 1
Id_list = []
Name_list = []
# Iterate over each row in the first csv file
for index, rows in df1.iterrows():
    # append the values to the lists
    Id_list.append(rows.id)
    Name_list.append(rows.nut_name)
# Print the lists
print(Id_list[:10])
print(Name_list[:10])

# Create empty lists for file 2
Food_id_list = []
Food_name_list = []
# Iterate over each row in the second csv file
for index, rows in df2.iterrows():
    # append the values to the lists
    Food_id_list.append(rows.fdc_id)
    Food_name_list.append(rows.description)
print(Food_id_list[:10])
print(Food_name_list[:10])

# Create empty lists for file 3
Amount_list = []
Name_list1 = []
Food_name1 = []
# Iterate over each row in the third csv file
for index, rows in df3.iterrows():
    # append the values to the lists
    Amount_list.append(rows.amount)
    Name_list1.append(rows.nutrient_id)
    Food_name1.append(rows.fdc_id)
# Print the lists
print(Amount_list[:10])
print(Name_list1[:10])
print(Food_name1[:10])

#%% search the third csv for rows where the amount of a certain nutrient in a certain food is not empty
for i in Name_list:
    # for j in Food_name_list:
    if i in df3['nutrient_id']:
        print(df3.loc[i, 'amount'])
Thanks in advance!
This is exactly what SQL was created for: SQL's join command joins multiple tables.
Having played around with pandas a lot, I would highly suggest picking up a simple SQL course, or first going through a simple SQL join tutorial, as this is very much an introductory problem in SQL.
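That said, the same joins can be sketched directly in pandas, which the question already uses. A minimal sketch (the column names come from the question's code; the output names nutrient_name/food_name are my assumptions):

import pandas as pd

nutrients = pd.read_csv('nutrient.csv', usecols=['id', 'name'], low_memory=False)
foods = pd.read_csv('food.csv', usecols=['fdc_id', 'description'], low_memory=False)
amounts = pd.read_csv('food_nutrient.csv', usecols=['fdc_id', 'nutrient_id', 'amount'], low_memory=False)

# join the amounts to the nutrient names on nutrient_id == id,
# then to the food names on fdc_id -- the equivalent of two SQL joins
merged = (amounts
          .merge(nutrients, left_on='nutrient_id', right_on='id')
          .merge(foods, on='fdc_id'))

result = (merged[['name', 'description', 'amount']]
          .rename(columns={'name': 'nutrient_name', 'description': 'food_name'}))
print(result.head())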

Pandas - Adding dummy header column in csv

I am trying to concatenate several csv files by customer group using the code below:
import glob
import pandas as pd

files = glob.glob(file_from + "/*.csv")  # path where the csv files reside
df_v0 = pd.concat([pd.read_csv(f) for f in files])  # dataframe concatenating all csv files found above
The problem is that the number of columns in the csv varies by customer, and the files do not have a header row.
I am trying to see if I could add a dummy header with labels such as col_1, col_2, ..., depending on the number of columns in that csv.
Could anyone guide me as to how I could get this done? Thanks.
Update on trying to search for a specific string in the Dataframe:
Sample Dataframe
col_1,col_2,col_3
fruit,grape,green
fruit,watermelon,red
fruit,orange,orange
fruit,apple,red
Trying to filter rows containing the word red, expecting it to return rows 2 and 4.
Tried the below code:
df[~df.apply(lambda x: x.astype(str).str.contains('red')).any(axis=1)]
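Worth noting (my observation, not part of the original post): the leading ~ negates the mask, so the snippet above keeps the rows that do not contain 'red'. Dropping the ~ keeps the matching rows:

# keep rows where any column, cast to string, contains 'red'
mask = df.apply(lambda col: col.astype(str).str.contains('red')).any(axis=1)
print(df[mask])  # the watermelon and apple rows from the sample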
Use the parameter header=None to get default range column names 0, 1, 2, ..., plus skiprows=1 if you also need to drop an existing header row:
df_v0 = pd.concat([pd.read_csv(f, header=None, skiprows=1) for f in files])
If you also want to change the column names, add rename:
dfs = [pd.read_csv(f, header=None, skiprows=1).rename(columns=lambda x: f'col_{x + 1}')
       for f in files]
df_v0 = pd.concat(dfs)
