Pandas - Adding dummy header column in csv - python-3.x

I am trying to concatenate several csv files by customer group using the code below:
files = glob.glob(file_from + "/*.csv")  # path where the csv files reside
df_v0 = pd.concat([pd.read_csv(f) for f in files])  # dataframe that concatenates all csv files found above
The problem is that the number of columns in the csv files varies by customer, and the files do not have a header row.
I am trying to see if I could add a dummy header row with labels such as col_1, col_2 ... depending on the number of columns in each csv.
Could anyone guide me on how to get this done? Thanks.
Update on trying to search for a specific string in the Dataframe:
Sample Dataframe
col_1,col_2,col_3
fruit,grape,green
fruit,watermelon,red
fruit,orange,orange
fruit,apple,red
I am trying to select the rows containing the word red and expect rows 2 and 4 to be returned.
I tried the code below:
df[~df.apply(lambda x: x.astype(str).str.contains('red')).any(axis=1)]
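For the update above, one possible reading is that the rows containing red should be kept. In that case the leading ~ is the likely culprit, since it inverts the boolean mask and returns the rows that do not contain red. A minimal sketch, assuming the sample data shown above (the names csv_text, mask and red_rows are only for illustration):

```python
import io

import pandas as pd

# Sample data from the question, loaded from a string instead of a file.
csv_text = """col_1,col_2,col_3
fruit,grape,green
fruit,watermelon,red
fruit,orange,orange
fruit,apple,red
"""
df = pd.read_csv(io.StringIO(csv_text))

# True for every row where at least one cell contains the substring "red".
mask = df.apply(
    lambda col: col.astype(str).str.contains("red", regex=False)
).any(axis=1)

# No "~" here: keep the matching rows instead of dropping them.
red_rows = df[mask]
print(red_rows)
```

Note that this is a substring match, so a cell such as "bored" would also match. Use df[~mask] if the goal is really to drop those rows.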

Use the parameter header=None so the columns get the default range names 0, 1, 2, and add skiprows=1 only if a file has an original header row that must be removed:
df_v0 = pd.concat([pd.read_csv(f, header=None, skiprows=1) for f in files])
If you also want to change the column names, add rename:
dfs = [pd.read_csv(f, header=None, skiprows=1).rename(columns = lambda x: f'col_{x + 1}')
       for f in files]
df_v0 = pd.concat(dfs)
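To see how this behaves when the files have different numbers of columns, here is a small self-contained sketch. It assumes the files really have no header row (so skiprows is left out) and uses two in-memory strings, file_a and file_b, as hypothetical stand-ins for the csv files:

```python
import io

import pandas as pd

# Hypothetical stand-ins for two headerless csv files with different widths.
file_a = "a,1,x\nb,2,y\n"
file_b = "c,3\nd,4\n"

frames = []
for text in (file_a, file_b):
    # header=None keeps the first line as data and names columns 0, 1, 2, ...
    frame = pd.read_csv(io.StringIO(text), header=None)
    # Relabel to col_1, col_2, ... based on this file's own column count.
    frame.columns = [f"col_{i + 1}" for i in range(frame.shape[1])]
    frames.append(frame)

# Columns are aligned by name; files with fewer columns get NaN in the rest.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

The narrower file ends up with NaN in col_3, which is usually what you want when stacking files of unequal width.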

Related

Remove columns starting with same special character in a csv file using Python

My csv file has below columns:
AFM_reversal_indicator,Alert_Message,axiom_key,_timediff,player,__mv_Splunk_Alert_Id,__mv_nbr_plastic,__mv_code_acct_stat_demo.
I want to remove columns starting with "__mv".
I saw some posts where pandas is used to filter out columns.
Is it possible to do it using the csv module in Python? If yes, how?
Also, with pandas, what regex should I pass:
df.filter(regex='')
df.to_csv(output_file_path)
P.S I am using python3.8
You mean with standard python? You can use a list comprehension, e.g.
import csv
with open( 'data.csv', 'r' ) as f:
    DataGenerator = csv.reader( f )
    Header = next( DataGenerator )
    Header = [ Col.strip() for Col in Header ]
    Data = list( DataGenerator )
    if Data[-1] == []: del( Data[-1] )
Data = [ [Row[i] for i in range( len( Header ) ) if not Header[i].startswith( "__mv" ) ] for Row in Data ]
Header = [ Col for Col in Header if not Col.startswith( "__mv" ) ]
However, this is just a simple example. You'll probably have further things to consider, e.g. what type your csv columns have, whether you want to read all the data at once like I do here, or one-by-one from the generator to save on memory, etc.
You could also use the built-in filter function instead of the inner list comprehension.
Also, if you have numpy installed and you wanted something more 'numerical', you can always use "structured numpy arrays" (https://numpy.org/doc/stable/user/basics.rec.html). They're quite nice. (personally I prefer them to pandas anyway). Numpy also has its own csv-reading functions (see: https://www.geeksforgeeks.org/how-to-read-csv-files-with-numpy/)
You don't need to use .filter for that. You can just find out which are those columns and then drop them from the DataFrame
import pandas as pd
# Load the dataframe (In our case create a dummy one with your columns)
df = pd.DataFrame(columns = ["AFM_reversal_indicator", "Alert_Message", "axiom_key", "_timediff", "player", "__mv_Splunk_Alert_Id", "__mv_nbr_plastic", "__mv_code_acct_stat_demo"])
# Get a list of all column names starting with "__mv"
mv_columns = [col for col in df.columns if col.startswith("__mv")]
# Drop the columns
df = df.drop(columns=mv_columns)
# Save the updated dataframe to a CSV file
df.to_csv("cleaned_data.csv", index=False)
The list comprehension for mv_columns iterates through the columns of your DataFrame and picks those that start with "__mv". Then .drop simply removes them.
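If you do want to stay with df.filter, as asked in the question, one regex that should work is a negative lookahead anchored at the start of the column name. A minimal sketch on an empty frame with the question's columns (the name kept is only for illustration):

```python
import pandas as pd

# Empty frame that only carries the column names from the question.
df = pd.DataFrame(columns=[
    "AFM_reversal_indicator", "Alert_Message", "axiom_key", "_timediff",
    "player", "__mv_Splunk_Alert_Id", "__mv_nbr_plastic",
    "__mv_code_acct_stat_demo",
])

# "^(?!__mv)" matches at the start of a name only when "__mv" does not follow,
# so every column that starts with "__mv" is left out.
kept = df.filter(regex=r"^(?!__mv)")
print(list(kept.columns))
```

Columns with a single leading underscore, such as _timediff, are kept because the lookahead only rejects the exact prefix "__mv".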
If for some reason you want to use csv package only, then the solution might not be as elegant as with pandas. But here is a suggestion:
import csv
with open("original_data.csv", "r") as input_file, open("cleaned_data.csv", "w", newline="") as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)
    header_row = next(reader)
    mv_columns = [col for col in header_row if col.startswith("__mv")]
    mv_column_indices = [header_row.index(col) for col in mv_columns]
    new_header_row = [col for col in header_row if col not in mv_columns]
    writer.writerow(new_header_row)
    for row in reader:
        new_row = [row[i] for i in range(len(row)) if i not in mv_column_indices]
        writer.writerow(new_row)
So, first, you read the first row, which is supposed to contain your headers. With similar logic as before, you find the columns that start with "__mv" and get their indices. You then write the new header to your output file, keeping only the columns that are not among the "__mv" columns. Finally, you iterate through the rest of the CSV and drop the values at those indices as you go.

consolidating csv file into one master file

I am facing the following challenges
I have approximately 400 files which I have to consolidate into one master file. The problem is that the files have different headers, and when I try to consolidate them, the data is put into different rows depending on the columns.
Example:
Let's say I have two files, C1 and C2.
file C1.csv
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
and file C2.csv
name,last-name,phone-no,add-line1,add-line2,add-line3
jorge,aggarwal,65465464654,line1,line2,line3
brad,smit,456446546454,line1,line2,line3
joy,kennedy,65654644646,line1,line2,line3
When I consolidate these two files, I want the output to look like this:
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
Jorge aggarwal,65465464654,line1-line2-line3
brad smith,456446546454,line1-line2-line3
joy kennedy,65654644646,line1-line2-line3
For the consolidation I am using the following code:
import glob
import pandas as pd
directory = 'C:/Test' # specify the directory containing the 300 files
filelist = sorted (glob.glob(directory + '/*.csv')) # reads all 300 files in the directory and stores as a list
consolidated = pd.DataFrame() # Create a new empty dataframe for consolidation
for file in filelist: # Iterate through each of the 300 files
    df1 = pd.read_csv(file) # create df using the file
    df1col = list (df1.columns) # save columns to a list
    df2 = consolidated # set the consolidated as your df2
    df2col = list (df2.columns) # save columns from consolidated result as list
    commoncol = [i for i in df1col for j in df2col if i==j] # Check both lists for common column name
    # print (commoncol)
    if commoncol == []: # In first iteration, consolidated file is empty, which will return in a blank df
        consolidated = pd.concat([df1, df2], axis=1).fillna(value=0) # concatenate (outer join) with no common columns replacing null values with 0
    else:
        consolidated = df1.merge(df2,how='outer', on=commoncol).fillna(value=0) # merge both df specifying the common column and replace null values with 0
    # print (consolidated) << Optionally, check the consolidated df at each iteration
# writing consolidated df to another CSV
consolidated.to_csv('C:/<filepath>/consolidated.csv', header=True, index=False)
but it can't merge the columns having same data like the output shown earlier.
From your two-file example, you know the final (least common) header for the output, and you know what one of the bigger headers looks like.
My take on that is to think of every "other" kind of header as needing a mapping to the final header, like concatenating add-lines 1-3 into a single address field. We can use the csv module to read and write row-by-row and send the rows to the appropriate consolidator (mapping) based on the header of the input file.
The csv module provides a DictReader and DictWriter which makes dealing with fields you know by name very handy; especially, the DictWriter() constructor has the extrasaction="ignore" option which means that if you tell the writer your fields are:
Col1, Col2, Col3
and you pass a dict like:
{"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"}
it will just ignore Col4 and only write Cols 1-3:
import csv
import sys
writer = csv.DictWriter(sys.stdout, fieldnames=["Col1", "Col2", "Col3"], extrasaction="ignore")
writer.writeheader()
writer.writerow({"Col1": "val1", "Col2": "val2", "Col3": "val3", "Col4": "val4"})
# Col1,Col2,Col3
# val1,val2,val3
import csv
def consolidate_add_lines_1_to_3(row):
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row
# Add other consolidators here...
# ...
Final_header = ["name", "phone-no", "address"]
f_out = open("output.csv", "w", newline="")
writer = csv.DictWriter(f_out, fieldnames=Final_header, extrasaction="ignore")
writer.writeheader()
for fname in ["file1.csv", "file2.csv"]:
    f_in = open(fname, newline="")
    reader = csv.DictReader(f_in)
    for row in reader:
        if "add-line1" in row and "add-line2" in row and "add-line3" in row:
            row = consolidate_add_lines_1_to_3(row)
        # Add conditions for other consolidators here...
        # ...
        writer.writerow(row)
    f_in.close()
f_out.close()
If there is more than one kind of other header, you'll need to seek those out, figure out the extra consolidators to write, and add the conditions that trigger them in the for row in reader loop.
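As an illustration of adding one more consolidator, the sketch below also joins name and last-name, which the expected output in the question seems to want. The helper consolidate_name and the in-memory strings c1 and c2 are hypothetical; with real files you would open them as shown above instead of using io.StringIO:

```python
import csv
import io

FINAL_HEADER = ["name", "phone-no", "address"]


def consolidate_name(row):
    # Hypothetical extra consolidator: "name" + "last-name" -> "name".
    row["name"] = " ".join([row["name"], row["last-name"]])
    return row


def consolidate_add_lines_1_to_3(row):
    row["address"] = "-".join([row["add-line1"], row["add-line2"], row["add-line3"]])
    return row


# In-memory stand-ins for C1.csv and C2.csv (one data row each).
c1 = "name,phone-no,address\nzach,6564654654,line1\n"
c2 = ("name,last-name,phone-no,add-line1,add-line2,add-line3\n"
      "jorge,aggarwal,65465464654,line1,line2,line3\n")

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FINAL_HEADER,
                        extrasaction="ignore", lineterminator="\n")
writer.writeheader()
for text in (c1, c2):
    for row in csv.DictReader(io.StringIO(text)):
        # Each condition triggers the consolidator for one kind of header.
        if "last-name" in row:
            row = consolidate_name(row)
        if all(key in row for key in ("add-line1", "add-line2", "add-line3")):
            row = consolidate_add_lines_1_to_3(row)
        writer.writerow(row)

result = out.getvalue()
print(result)
```

Because extrasaction="ignore" is set, the leftover last-name and add-line fields are dropped silently when each row is written.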

Merge duplicate rows in a text file using python based on a key column

I have a csv file and I need to merge the rows that share the same value in a key column (Name).
a.csv
Name|Acc#|ID|Age
Suresh|2345|a-b2|24
Mahesh|234|a-vf|34
Mahesh|4554|a-bg|45
Keren|344|s-bg|45
yankie|999|z-bg|34
yankie|3453|g-bgbbg|45
Expected output: records are merged by name, so the values from both rows for Mahesh and for yankie are combined:
Name|Acc#|ID|Age
Suresh|2345|a-b2|24
Mahesh|[234,4554]|[a-vf,a-bg]|[34,45]
Keren|344|s-bg|45
yankie|[999,3453]|[z-bg,g-bgbbg]|[34,45]
Can someone help me with this in Python?
import pandas as pd
df = pd.read_csv("a.csv", sep="|", dtype=str)
new_df = df.groupby('Name',as_index=False).aggregate(lambda tdf: tdf.unique().tolist() if tdf.shape[0] > 1 else tdf)
new_df.to_csv("data.csv", index=False, sep="|")
Output:
Name|Acc#|ID|Age
Keren|344|s-bg|45
Mahesh|['234', '4554']|['a-vf', 'a-bg']|['34', '45']
Suresh|2345|a-b2|24
yankie|['999', '3453']|['z-bg', 'g-bgbbg']|['34', '45']
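The output above is sorted by Name and shows Python list syntax with quotes. If the original row order and the unquoted [a,b] style from the question matter, a possible variation is sketched below. It assumes the same a.csv content, loaded here from a string, and the helper merge_values is a hypothetical name:

```python
import io

import pandas as pd

raw = """Name|Acc#|ID|Age
Suresh|2345|a-b2|24
Mahesh|234|a-vf|34
Mahesh|4554|a-bg|45
Keren|344|s-bg|45
yankie|999|z-bg|34
yankie|3453|g-bgbbg|45
"""
df = pd.read_csv(io.StringIO(raw), sep="|", dtype=str)


def merge_values(values):
    # Keep unique values in first-seen order.
    unique = list(dict.fromkeys(values))
    if len(unique) == 1:
        return unique[0]
    # Several distinct values: format them as [a,b] without quotes.
    return "[" + ",".join(unique) + "]"


# sort=False keeps the groups in order of first appearance.
merged = df.groupby("Name", sort=False, as_index=False).agg(merge_values)
print(merged.to_csv(index=False, sep="|"))
```

Since every merged cell is a plain string here, the written file contains [234,4554] rather than ['234', '4554'].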

Choose a specific column of an imported text file

I am trying to import a text file into Python. The first column is date and others are integers. After importing the text file I want to extract each column, name them and plot each variable vs date (the first column). How can I extract columns? And how can I choose the 2nd column onwards? I tried two different methods for importing the file:
btcv = np.genfromtxt('example_Feb.388.btcv.txt', dtype=None);
and
btcv = pd.read_csv('example_Feb.388.btcv.txt', header = None)
The text file looks like:
"2015-06-17 00:00" -6.830000 -5.642747 -5.642747 -4.057440 -3.867922 -4.377454
"2015-06-18 00:00" -6.830000 -5.630413 -5.630413 -4.045107 -3.855588 -4.365120
"2015-06-19 00:00" -5.245973 -5.627623 -5.627623 -3.967911 -3.836147 -4.309624
"2015-06-20 00:00" -4.568952 -5.620628 -5.620628 -3.871517 -3.837915 -4.238232
"2015-06-21 00:00" -4.620864 -5.615302 -5.615302 -3.980928 -4.001598 -4.272657
"2015-06-22 00:00" -4.673435 -5.622433 -5.622433 -4.025599 -4.071035 -4.285809
With 1000 rows and 188 columns.
I tried
btcv.date = btcv[:,0]
and it did not work! and btcv[0] returns the full array.
Thanks.
Using pandas, you can read it as a csv and set the delimiter to whitespace:
df = pd.read_csv('example.csv', sep=r'\s+', header=None)
This will read the file into a pandas dataframe. You can then name your columns. For example
df.columns = ['date', 'first', 'second']
Then you can access each column by name, e.g.
date = df.date
Make the date the frame index:
df.index = df.date
and then plot the data frame with a plotting tool.
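Putting these steps together for a file with many columns, a minimal sketch could look like this. It assumes the first three lines of the sample from the question (truncated to four columns), and the labels var_1, var_2, ... are made-up placeholder names:

```python
import io

import pandas as pd

# A few lines in the same layout as the question's file.
sample = '''"2015-06-17 00:00" -6.830000 -5.642747 -5.642747
"2015-06-18 00:00" -6.830000 -5.630413 -5.630413
"2015-06-19 00:00" -5.245973 -5.627623 -5.627623
'''

# The quoted date stays one field even though it contains a space.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

# Name the first column "date" and generate placeholder names for the rest.
df.columns = ["date"] + [f"var_{i}" for i in range(1, df.shape[1])]

# Parse the dates and use them as the index; what remains in df is
# exactly "the 2nd column onwards".
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
print(df)

# With matplotlib installed, df.plot() would draw each variable against date.
```

A single variable is then available as df["var_1"], and df.iloc[:, :10] would select the first ten variables by position.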

Transfer cell values from different columns and sheets from multiple excel files with same structure into a single dataframe

I have a reporting sheet in excel that contains a set of datapoints that I want to compile from multiple files with the same format into a master dataset.
The initial step I undertook was to extract the data points I need from multiple sheets into one pandas dataframe. See the steps below.
I initially imported the excel file and parsed it:
import pandas as pd
xl = pd.ExcelFile(r"C:\Users\Nicola\Desktop\ISP 2016-20 Ops-Technical Form.xlsm")
df = xl.parse("FSL, WASH, DRM") #name of sheet #1
Then I located the data points needed for synthesis
a=df.iloc[5:20,3:5]
a1=df.iloc[6:9,10:12]
b=df.iloc[31:35,3:5]
b1=df.iloc[31:35,10:12]
Then I concatenated and equalised columns positioning to maintain the whole list of values within the same column:
dfcon=pd.concat(([a,b]))
dfcon2=pd.concat(([a1,b1]))
new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
dfcont2=pd.concat([dfcon2, dfcon.rename(columns=new_cols)])
And lastly created a dataframe with the string of values I need
master=pd.DataFrame(dfcont2)
finalmaster=master.transpose()
The next two steps I wish to pursue are:
1) Replicate the same code for 50 excel files
2) Compile all string of values from this set of excel files into one single pandas dataframe without running this code over again and compile manually by exporting it into excel.
Any support would be greatly appreciated. Thanks
I believe you need to loop over the file names returned by glob and concat everything together at the end (assuming all files have the same structure):
import glob
import pandas as pd
dfs = []
for f in glob.glob('*.xlsm'):
    # sheet_name=1 selects the second sheet (0-based); pass the sheet name instead if needed
    df = pd.read_excel(io=f, sheet_name=1)
    a=df.iloc[5:20,3:5]
    a1=df.iloc[6:9,10:12]
    b=df.iloc[31:35,3:5]
    b1=df.iloc[31:35,10:12]
    dfcon=pd.concat(([a,b]))
    dfcon2=pd.concat(([a1,b1]))
    new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    dfcont2=pd.concat([dfcon2, dfcon.rename(columns=new_cols)])
    dfs.append(dfcont2.T)
out = pd.concat(dfs, ignore_index=True)
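The same pattern can be tried without any Excel files. The sketch below is only an illustration under assumptions: extract_row is a hypothetical helper, the block positions are made up, and the two synthetic frames stand in for sheets read with pd.read_excel:

```python
import numpy as np
import pandas as pd


def extract_row(df, blocks):
    """Stack several iloc blocks into the same columns and transpose them."""
    parts = []
    for rows, cols in blocks:
        part = df.iloc[rows, cols].copy()
        # Positional column names so every block lands in the same columns.
        part.columns = range(part.shape[1])
        parts.append(part)
    return pd.concat(parts, ignore_index=True).T


# Made-up block positions: (row slice, column slice).
blocks = [(slice(0, 2), slice(0, 2)), (slice(4, 6), slice(3, 5))]

# Two synthetic "sheets" with the same structure.
sheet_1 = pd.DataFrame(np.arange(48).reshape(8, 6))
sheet_2 = sheet_1 + 100

dfs = [extract_row(sheet, blocks) for sheet in (sheet_1, sheet_2)]
out = pd.concat(dfs, ignore_index=True)
print(out)
```

Each sheet contributes two rows here, one per column of the extracted blocks.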
Found the solution that works for me, thank you for the input, jezrael.
To further explain:
1) Imported the files with same structure from my Desktop directory, parsed and selected the Excel sheet from which data can be extracted from different locations (iloc)
import glob
import pandas as pd
dfs = []
for f in glob.glob('C:/Users/Nicola/Desktop/OPS Form/*.xlsm'):
    # pd.ExcelFile only takes the path here; the sheet is chosen in parse()
    xl = pd.ExcelFile(f)
    df = xl.parse("FSL, WASH, DRM")
    a=df.iloc[5:20,3:5]
    a1=df.iloc[7:9,10:12]
    b=df.iloc[31:35,3:5]
    b1=df.iloc[31:35,10:12]
    c=df.iloc[50:56,3:5]
    c1=df.iloc[38:39,10:12]
    d=df.iloc[57:61,3:5]
    e=df.iloc[63:71,3:5]
2) Concatenated and repositioned the columns to compose the first version of the dataframe (output); these lines are still inside the for loop:
    dfcon=pd.concat(([a,b,c,d,e]))
    dfcon2=pd.concat(([a1,b1,c1]))
    new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    dfcont2=pd.concat([dfcon2, dfcon.rename(columns=new_cols)])
    dfs.append(dfcont2.T)
3) The output contained the same string of values repeated twice per file [one row of labels and one row of form-specific entries], because each file contributes both rows from the iloc locations.
output = pd.concat(dfs, ignore_index=True)
4) This last snippet simply allowed me to extract the labels only once and to select all entries located at odd row positions. With the last concatenation, I generated the dataframe I sought, ready to be processed analytically.
a=output[2:3]
b=output[1::2]
pd.concat([a,b], axis=0, ignore_index=True)

Resources