How do I set an unnamed column as the index? - python-3.x

In all the examples I have found, a column name is required to set it as the index.
Instead of going into Excel to add a column header, I was wondering if it's possible to set the column with the empty header as the index. The column has all the values I want included, but lacks a column name:
My script is currently:
import pandas as pd
data = pd.read_csv('file.csv')
data

You could also just select the column by position with iloc:
data = data.set_index(data.iloc[:, 0])
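Note that passing the column values like this leaves the original unnamed column in the frame as well; if you want it gone, you can drop it afterwards. A small sketch, assuming it is still the first column:
data = data.set_index(data.iloc[:, 0])
data = data.drop(columns=data.columns[0])  # remove the now-redundant first column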
Or when you call pd.read_csv(), specify index_col:
data = pd.read_csv('path.csv', index_col=0)

You don't need to rename the first column in Excel; it's just as easy in pandas:
# copy the column names, rename the first one, and assign them back
new_columns = data.columns.tolist()
new_columns[0] = 'Month'
data.columns = new_columns
(Using tolist() gives a plain list, so you are not mutating the index's underlying array in place.)
Afterwards, you can set the index:
data = data.set_index('Month')

You can do as follows:
import pandas as pd
data = pd.read_csv('file.csv', index_col=0)
data

When I have encountered columns missing names, pandas always names them 'Unnamed: n', where n = ColumnNumber - 1, i.e. 'Unnamed: 0' for the first column, 'Unnamed: 1' for the second, etc. So I think that in your case the following code should be useful:
# set your column as the dataframe index
data.index = data['Unnamed: 0']
# now delete the column
data.drop('Unnamed: 0', axis=1, inplace=True)
# also clear the index name, which is 'Unnamed: 0'
data.index.name = None
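If you prefer a one-liner, the same thing can be written by chaining set_index and rename_axis (a sketch equivalent to the steps above):
data = data.set_index('Unnamed: 0').rename_axis(None)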

Related

Remove columns starting with same special character in a csv file using Python

My csv file has below columns:
AFM_reversal_indicator,Alert_Message,axiom_key,_timediff,player,__mv_Splunk_Alert_Id,__mv_nbr_plastic,__mv_code_acct_stat_demo.
I want to remove columns starting with "__mv".
I saw some posts where pandas is used to filter out columns.
Is it possible to do it using the csv module in Python, and if yes, how?
Also, with Pandas what regex should I give:
df.filter(regex='')
df.to_csv(output_file_path)
P.S. I am using Python 3.8.
You mean with standard python? You can use a list comprehension, e.g.
import csv
with open('data.csv', 'r') as f:
    DataGenerator = csv.reader(f)
    Header = next(DataGenerator)
    Header = [Col.strip() for Col in Header]
    Data = list(DataGenerator)
if Data[-1] == []:
    del Data[-1]
Data = [[Row[i] for i in range(len(Header)) if not Header[i].startswith("__mv")] for Row in Data]
Header = [Col for Col in Header if not Col.startswith("__mv")]
However, this is just a simple example. You'll probably have further things to consider, e.g. what type your csv columns have, whether you want to read all the data at once like I do here, or one-by-one from the generator to save on memory, etc.
You could also use the builtin filter command instead of the inner list comprehension.
Also, if you have numpy installed and you wanted something more 'numerical', you can always use "structured numpy arrays" (https://numpy.org/doc/stable/user/basics.rec.html). They're quite nice. (personally I prefer them to pandas anyway). Numpy also has its own csv-reading functions (see: https://www.geeksforgeeks.org/how-to-read-csv-files-with-numpy/)
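For completeness, a rough sketch of that structured-array route using numpy's own CSV reader (np.genfromtxt with names=True builds a structured array from the header row; selecting a subset of fields by name assumes a reasonably recent numpy):
import numpy as np
# read the csv into a structured array; names=True takes field names from the header row
arr = np.genfromtxt('data.csv', delimiter=',', names=True, dtype=None, encoding='utf-8')
# keep only the fields that do not start with "__mv"
keep = [name for name in arr.dtype.names if not name.startswith('__mv')]
arr = arr[keep]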
You don't need to use .filter for that. You can just find out which those columns are and then drop them from the DataFrame:
import pandas as pd
# Load the dataframe (In our case create a dummy one with your columns)
df = pd.DataFrame(columns=["AFM_reversal_indicator", "Alert_Message", "axiom_key", "_timediff", "player", "__mv_Splunk_Alert_Id", "__mv_nbr_plastic", "__mv_code_acct_stat_demo"])
# Get a list of all column names starting with "__mv"
mv_columns = [col for col in df.columns if col.startswith("__mv")]
# Drop the columns
df = df.drop(columns=mv_columns)
# Save the updated dataframe to a CSV file
df.to_csv("cleaned_data.csv", index=False)
The mv_columns list comprehension iterates through the columns in your DataFrame and picks those that start with "__mv". Then .drop simply removes them.
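As for the regex part of the question: df.filter keeps the columns whose names match the pattern, so a negative lookahead that excludes names starting with "__mv" should also work (a short sketch, using the output_file_path from the question):
df = df.filter(regex=r'^(?!__mv)')
df.to_csv(output_file_path)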
If for some reason you want to use csv package only, then the solution might not be as elegant as with pandas. But here is a suggestion:
import csv
with open("original_data.csv", "r") as input_file, open("cleaned_data.csv", "w", newline="") as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)
    header_row = next(reader)
    mv_columns = [col for col in header_row if col.startswith("__mv")]
    mv_column_indices = [header_row.index(col) for col in mv_columns]
    new_header_row = [col for col in header_row if col not in mv_columns]
    writer.writerow(new_header_row)
    for row in reader:
        new_row = [row[i] for i in range(len(row)) if i not in mv_column_indices]
        writer.writerow(new_row)
So, first you read the first row, which is supposed to hold your headers. With the same logic as before, you find the columns that start with "__mv" and get their indices. You write the new header to the output file, keeping only the columns that are not "__mv" columns. Then you iterate through the rest of the CSV and remove those columns as you go.

Any optimized way to iterate Excel and provide data into pd.read_sql() as a string one by one

#here I have to apply the loop which can provide me the queries from excel for respective reports:
df1 = pd.read_sql(SQLqueryB2, con=con1)
df2 = pd.read_sql(ORCqueryC2, con=con2)
if (df1.equals(df2)):
    print(Report2 + " : is Pass")
Can we achieve the above by doing something like this (by iterating the ndarray)?
df = pd.read_excel(path)
for col, item in df.iteritems():
Or is the only option left to read the Excel file with the "openpyxl" library, iterate over rows and columns, and then provide the values? I hope the question is clear; if there is any doubt, please comment.
You are trying to loop through an excel file, run the 2 queries, see if they match and output the result, correct?
import pandas as pd
from sqlalchemy import create_engine
# add user, pass, database name
con = create_engine(f"mysql+pymysql://{USER}:{PWD}@{HOST}/{DB}")
file = pd.read_excel('excel_file.xlsx')
file['Result'] = ''  # placeholder
for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], con)
    df2 = pd.read_sql(row['Oracle Queries'], con)
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'
file.to_excel('results.xlsx', index=False)
This will save a file named results.xlsx that mirrors the original data but adds a column named Result that will be Pass or Fail.
Example results.xlsx:

How to make a function with dynamic variables to add rows in pandas?

I'm trying to make a table from a list of data using pandas.
Originally I wanted to make a function where I can pass dynamic variables so I could continuously add new rows from the data list.
It works up until the point where the row-adding part begins. The column headers are added, but the data are not: it either keeps the value of only the last column or adds nothing.
My scratch was:
for title in titles:
    for x in data:
        table = {
            title: data[x]
        }
df.DataFrame(table, columns=titles, index[0]
columns list:
titles = ['timestamp', 'source', 'tracepoint']
data list:
data = ['first', 'second', 'third',
'first', 'second', 'third',
'first', 'second', 'third']
How can I make something like this?
timestamp, source, tracepoint
first, second, third
first, second, third
first, second, third
If you just want to initialize a pandas DataFrame, you can use the DataFrame constructor.
You can also append a row using a dict.
Pandas provides other useful functions, such as concatenation between data frames and inserting/deleting columns. If you need them, please check the pandas docs.
import pandas as pd
# initialization by dataframe’s constructor
titles = ['timestamp', 'source', 'tracepoint']
data = [['first', 'second', 'third'],
['first', 'second', 'third'],
['first', 'second', 'third']]
df = pd.DataFrame(data, columns=titles)
print('---initialization---')
print(df)
# append row
new_row = {
'timestamp': '2020/11/01',
'source': 'xxx',
'tracepoint': 'yyy'
}
df = df.append(new_row, ignore_index=True)
print('---append result---')
print(df)
output
---initialization---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
---append result---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
3 2020/11/01 xxx yyy
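One caveat: DataFrame.append was deprecated and has been removed in pandas 2.0, so on current pandas the append step above can be replaced with pd.concat (a sketch using the df and new_row from the example above):
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)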

Merge duplicate rows in a text file using python based on a key column

I have a csv file and I need to merge the records of rows that share the same value in a key column (Name).
a.csv
Name|Acc#|ID|Age
Suresh|2345|a-b2|24
Mahesh|234|a-vf|34
Mahesh|4554|a-bg|45
Keren|344|s-bg|45
yankie|999|z-bg|34
yankie|3453|g-bgbbg|45
Expected output: records are merged based on Name, so the values from both rows for Mahesh and yankie are combined:
Name|Acc#|ID|Age
Suresh|2345|a-b2|24
Mahesh|[234,4554]|[a-vf,a-bg]|[34,45]
Keren|344|s-bg|45
yankie|[999,3453]|[z-bg,g-bgbbg]|[34,45]
can someone help me with this in python?
import pandas as pd
df = pd.read_csv("a.csv", sep="|", dtype=str)
# collapse each group: columns with more than one row become a list of unique values, single rows keep their value
new_df = df.groupby('Name', as_index=False).aggregate(lambda tdf: tdf.unique().tolist() if tdf.shape[0] > 1 else tdf)
new_df.to_csv("data.csv", index=False, sep="|")
Output:
Name|Acc#|ID|Age
Keren|344|s-bg|45
Mahesh|['234', '4554']|['a-vf', 'a-bg']|['34', '45']
Suresh|2345|a-b2|24
yankie|['999', '3453']|['z-bg', 'g-bgbbg']|['34', '45']
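Note that when this CSV is read back later, the merged cells come in as plain strings such as "['234', '4554']". If you need real Python lists again, ast.literal_eval can convert them; a rough sketch, assuming the merged columns are Acc#, ID and Age:
import ast
import pandas as pd

def maybe_list(value):
    # turn "[...]" strings back into lists, leave plain values untouched
    return ast.literal_eval(value) if isinstance(value, str) and value.startswith('[') else value

df = pd.read_csv("data.csv", sep="|", dtype=str)
for col in ['Acc#', 'ID', 'Age']:
    df[col] = df[col].map(maybe_list)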

Unquoted date in first column of CSV for Python/Pandas read_csv

Incoming CSV from an American Express download looks like below. (I would prefer each field to have quotes around it, but it doesn't.) Pandas is treating the quoted long number in the second CSV column as the first column of the data frame, i.e. 320193480240275508 as my "Date" column:
12/13/19,'320193480240275508',Alamo Rent A Car,John Doe,-12345,178.62,Travel-Vehicle Rental,DEBIT,
colnames = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5', 'Amount', 'AmexCategory', 'DebitCredit']
df = pd.read_csv(filenameIn, names=colnames, header=0, delimiter=",")
delimiter=",")
pd.set_option('display.max_rows', 15)
pd.set_option('display.width', 200)
print (df)
print (df.values)
Start
     Date  ...  DebitCredit
12/13/19  '320193480240275508'  ...  NaN
I have a routine to reformat the date ( to handle things like 1/3/19, and to add the century). It is called like this:
df['Date'][j] = reformatAmexDate2(df['Date'][j])
That routine prints the incoming value, which shows the problem:
def reformatAmexDate2(oldDate):
    print("oldDate=" + oldDate)
The print output is:
oldDate='320193480240275508'
I saw this post which recommended dayfirst=True and added that, but got the same result. I never even told Pandas that column 1 is a date, so it should treat it as text, I believe.
IIUC, the problem seems to be names=colnames; it sets new names for the columns being read from the csv file. Since you are trying to read specific columns from the csv file, you can use usecols:
df = pd.read_csv(filenameIn, usecols=colnames, header=0, delimiter=",")
Looking at the data, I hadn't noticed the trailing comma after the last column value, i.e. the comma after "DEBIT":
12/13/19,'320193480240275508',Alamo Rent A Car,John Doe,-12345,178.62,Travel-Vehicle Rental,DEBIT,
I just added another column at the end of my columns array:
colnames = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5', 'Amount', 'AmexCategory', 'DebitCredit','NotUsed9']
and life is wonderful.
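For reference, a minimal sketch of the fixed read (the extra NotUsed9 name absorbs the empty field created by the trailing comma; filenameIn and header=0 are kept from the original call, so adjust them if your download differs):
import pandas as pd

colnames = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5',
            'Amount', 'AmexCategory', 'DebitCredit', 'NotUsed9']
df = pd.read_csv(filenameIn, names=colnames, header=0, delimiter=",")
print(df['Date'])  # 12/13/19 now lands in the Date column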
