Python 3 appending to an empty DataFrame

Fairly new to coding. I have looked at a couple of other similar questions about appending DataFrames in Python but could not solve the problem.
I have the data below (CSV) in an Excel .xlsx file:
Venue Name,Cost ,Restriction,Capacity
Cinema,5,over 13,50
Bar,10,over 18,50
Restaurant,15,no restriction,25
Hotel,7,no restriction,100
I am using the code below to try to "filter" out rows which have "no restriction" under the Restriction column. The code seems to work right through to the last line, i.e. both print statements give me what I would expect.
import pandas as pd
import numpy as np

my_file = pd.ExcelFile("venue data.xlsx")
mydata = my_file.parse(0, index_col=None, na_values=["NA"])
my_new_file = pd.DataFrame()
for index in mydata.index:
    if "no restriction" in mydata.Restriction[index]:
        print(mydata.Restriction[index])
        print(mydata.loc[index:index])
        my_new_file.append(mydata.loc[index:index], ignore_index=True)

Don't loop through DataFrames. It's almost never necessary. (The immediate bug in your code, incidentally, is that DataFrame.append returns a new DataFrame rather than modifying my_new_file in place, so the result of your last line is silently discarded; append was deprecated and later removed in pandas 2.0.)
Use:
df2 = df[df['Restriction'] != 'no restriction']
Or
df2 = df.query("Restriction != 'no restriction'")
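Put together with the Excel read from the question, that might look like the following (a minimal sketch; the file name and the Restriction column are taken from the question):

import pandas as pd

# read the first sheet of the workbook
mydata = pd.read_excel("venue data.xlsx")

# boolean indexing: keep only rows whose Restriction is not "no restriction"
my_new_file = mydata[mydata["Restriction"] != "no restriction"]
print(my_new_file)
# with the sample data, this keeps the Cinema and Bar rows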

Related

Remove columns starting with same special character in a csv file using Python

My CSV file has the columns below:
AFM_reversal_indicator,Alert_Message,axiom_key,_timediff,player,__mv_Splunk_Alert_Id,__mv_nbr_plastic,__mv_code_acct_stat_demo.
I want to remove columns starting with "__mv".
I saw some posts where pandas is used to filter out columns.
Is it possible to do it using the csv module in Python? If yes, how?
Also, with pandas, what regex should I give:
df.filter(regex='')
df.to_csv(output_file_path)
P.S. I am using Python 3.8
You mean with standard Python? You can use a list comprehension, e.g.
import csv

with open('data.csv', 'r') as f:
    DataGenerator = csv.reader(f)
    Header = next(DataGenerator)
    Header = [Col.strip() for Col in Header]
    Data = list(DataGenerator)

# drop a trailing empty row, if the file ends with a blank line
if Data and Data[-1] == []:
    del Data[-1]

# keep only the values whose column header does not start with "__mv"
Data = [[Row[i] for i in range(len(Header)) if not Header[i].startswith("__mv")]
        for Row in Data]
Header = [Col for Col in Header if not Col.startswith("__mv")]
However, this is just a simple example. You'll probably have further things to consider, e.g. what type your csv columns have, whether you want to read all the data at once like I do here, or one-by-one from the generator to save on memory, etc.
You could also use the builtin filter function instead of the inner list comprehension, e.g. as in the sketch below.
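For instance, pairing each value with its header via zip (just a sketch, equivalent to the comprehension above):

def keep(pair):
    # pair is (header, value); keep it unless the header starts with "__mv"
    return not pair[0].startswith("__mv")

Data = [[val for _, val in filter(keep, zip(Header, Row))] for Row in Data]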
Also, if you have numpy installed and you want something more 'numerical', you can always use "structured numpy arrays" (https://numpy.org/doc/stable/user/basics.rec.html). They're quite nice (personally I prefer them to pandas anyway), and numpy also has its own CSV-reading functions (see: https://www.geeksforgeeks.org/how-to-read-csv-files-with-numpy/).
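As a minimal sketch of that route (np.genfromtxt with names=True reads the header into a structured dtype; note its name validator may alter unusual column names):

import numpy as np

# dtype=None lets numpy infer a type per column
arr = np.genfromtxt("data.csv", delimiter=",", names=True, dtype=None, encoding="utf-8")

# select only the fields whose name does not start with "__mv"
keep_fields = [name for name in arr.dtype.names if not name.startswith("__mv")]
cleaned = arr[keep_fields]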
You don't need to use .filter for that. You can just find out which those columns are and then drop them from the DataFrame:
import pandas as pd
# Load the dataframe (In our case create a dummy one with your columns)
df = pd.DataFrame(columns=["AFM_reversal_indicator", "Alert_Message", "axiom_key",
                           "_timediff", "player", "__mv_Splunk_Alert_Id",
                           "__mv_nbr_plastic", "__mv_code_acct_stat_demo"])
# Get a list of all column names starting with "__mv"
mv_columns = [col for col in df.columns if col.startswith("__mv")]
# Drop the columns
df = df.drop(columns=mv_columns)
# Save the updated dataframe to a CSV file
df.to_csv("cleaned_data.csv", index=False)
The mv_columns comprehension iterates through the columns in your DataFrame and picks those that start with "__mv"; .drop then removes them from the DataFrame.
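As for the regex the question asked about: df.filter keeps the columns whose label matches, so a negative lookahead that rejects names starting with "__mv" does the same job in one step (a sketch; pandas matches labels with re.search):

# keep only the columns whose name does not start with "__mv"
df_clean = df.filter(regex=r"^(?!__mv)")
df_clean.to_csv("cleaned_data.csv", index=False)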
If for some reason you want to use the csv package only, the solution might not be as elegant as with pandas, but here is a suggestion:
import csv

with open("original_data.csv", "r") as input_file, open("cleaned_data.csv", "w", newline="") as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)
    header_row = next(reader)
    mv_columns = [col for col in header_row if col.startswith("__mv")]
    mv_column_indices = [header_row.index(col) for col in mv_columns]
    new_header_row = [col for col in header_row if col not in mv_columns]
    writer.writerow(new_header_row)
    for row in reader:
        new_row = [row[i] for i in range(len(row)) if i not in mv_column_indices]
        writer.writerow(new_row)
So, first you read the first row, which is supposed to contain your headers. With similar logic as before, you find the columns that start with "__mv" and get their indices. You write out a new header containing only the columns that are not "__mv" columns, then iterate through the rest of the CSV and drop those columns from each row as you go.
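If you would rather avoid the index bookkeeping, csv.DictReader and csv.DictWriter can express the same thing; extrasaction="ignore" makes the writer silently drop the unwanted keys (a sketch under the same file-name assumptions):

import csv

with open("original_data.csv", "r", newline="") as input_file, \
     open("cleaned_data.csv", "w", newline="") as output_file:
    reader = csv.DictReader(input_file)
    kept = [col for col in reader.fieldnames if not col.startswith("__mv")]
    # extrasaction="ignore" drops the "__mv" keys from each row dict
    writer = csv.DictWriter(output_file, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(reader)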

Any optimized way to iterate Excel and provide data into pd.read_sql() as a string, one by one

# here I have to apply the loop which can provide me the queries from excel for respective reports:
df1 = pd.read_sql(SQLqueryB2, con=con1)
df2 = pd.read_sql(ORCqueryC2, con=con2)
if df1.equals(df2):
    print(Report2 + " : is Pass")
Can we achieve the above by doing something like this (iterating the ndarray)?
df = pd.read_excel(path)
for col, item in df.iteritems():
    ...
Or is the only option left to read the Excel file with the "openpyxl" library, iterate over rows and columns, and then provide the values? Hope I am clear with the question; if any doubt, please comment.
You are trying to loop through an Excel file, run the two queries, see if they match, and output the result, correct?
import pandas as pd
from sqlalchemy import create_engine

# add user, pass, host, database name
con = create_engine(f"mysql+pymysql://{USER}:{PWD}@{HOST}/{DB}")

file = pd.read_excel('excel_file.xlsx')
file['Result'] = ''  # placeholder
for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], con)
    df2 = pd.read_sql(row['Oracle Queries'], con)
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'
file.to_excel('results.xlsx', index=False)
This will save a file named results.xlsx that mirrors the original data but adds a column named Result that will be Pass or Fail.
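Note that the question runs the two queries against different connections (con1 for SQL, con2 for Oracle), so in practice you would likely create one engine per database and pass each to the matching read_sql call; hypothetically, something like this (the Oracle URL is a placeholder, adjust driver and credentials to your setup):

# hypothetical second engine for the Oracle side
con_oracle = create_engine(f"oracle+cx_oracle://{USER}:{PWD}@{HOST}:1521/?service_name={SERVICE}")

for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], con)               # MySQL side
    df2 = pd.read_sql(row['Oracle Queries'], con_oracle)  # Oracle side
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'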

Formatting xlwings view output

I am using xlwings to write a dataframe to an Excel sheet. Nothing special, and all works perfectly.
xw.view(
    dataframe,
    abook.sheets.add(after=abook.sheets[-1]),
    table=True
)
My issue is that the output Excel sheet has filters in the top two rows, which I have to manually disable (by selecting the rows and clearing contents).
Thanks to https://github.com/xlwings/xlwings/issues/679#issuecomment-369138719, I changed my code to the following:
abook = xw.books.active
xw.view(
    dataframe,
    abook.sheets.add(after=abook.sheets[-1]),
    table=True
)
sheetname = abook.sheets.active.name
if abook.sheets[sheetname].api.AutoFilterMode == True:
    abook.sheets[sheetname].api.AutoFilterMode = False
which looked promising, but it didn't resolve my issue.
I would appreciate any pointers on how I can have the filters turned off by default. I am using the latest xlwings on Windows 10/11.
Thanks
The solution was to add the table=False parameter to the xw.view(df) call. According to the docs:
table (bool, default True) – If your object is a pandas DataFrame, by default it is formatted as an Excel Table
Now to write a dataframe df, I call:
import xlwings as xw
import pandas as pd
df = pd.DataFrame(...)
xw.view(df, table=False)
Updated on 14 January 2023:
Just for completeness: using the argument table=True in view adds a table with a filter. If you would like to keep the table but remove the filter, you can do so with ws.tables[0].show_autofilter = False:
import xlwings as xw
import pandas as pd
df = pd._testing.makeDataFrame()
xw.view(df, table=True)
ws = xw.sheets.active
ws.tables[0].show_autofilter = False
Or with api.AutoFilter(Field=[...], VisibleDropDown=False), where Field is a list of integers giving the column numbers concerned:
import xlwings as xw
import pandas as pd
df = pd._testing.makeDataFrame()
xw.view(df, table=True)
ws = xw.sheets.active
ws.used_range.api.AutoFilter(Field=list(range(1, ws.used_range[-1].column + 1)), VisibleDropDown=False)

Error when using pandas read_excel(header=[0,1])

I'm trying to use pandas read_excel to work with a file. The file has two header rows, so I'm trying to use the MultiIndex feature via the header keyword argument.
import pandas as pd, os
"""data in 2015 MOR Folder"""
filename = 'MOR-JANUARY 2015.xlsx'
print(os.path.isfile(filename))
df1 = pd.read_excel(filename, header=[0,1], sheetname='MOR')
print(df1)
The error I get is ValueError: Length of new names must be 1, got 2. The file is in this Google Drive folder: https://drive.google.com/drive/folders/0B0ynKIVAlSgidFFySWJoeFByMDQ?usp=sharing
I'm trying to follow the solution posted here
Read excel sheet with multiple header using Pandas
I could be mistaken, but I don't think pandas handles parsing Excel rows where there are merged cells. So in that first row, the merged cells get parsed as mostly empty cells, whereas you'd need the labels nicely repeated for the MultiIndex to come out right; that is what motivates the ffill below. If you could control the Excel workbook ahead of time, you might be able to use the code you have.
My solution:
It's not pretty, but it'll get it done.
filename = 'MOR-JANUARY 2015.xlsx'
df1 = pd.read_excel(filename, sheetname='MOR', header=None)  # newer pandas spells this sheet_name

# forward-fill the merged header cells across the first two rows and
# build the MultiIndex from them, skipping the first column (row labels)
mux = pd.MultiIndex.from_arrays(df1.ffill(axis=1).values[:2, 1:], names=[None, 'DATE'])
df1 = pd.DataFrame(df1.values[2:, 1:], index=df1.values[2:, 0], columns=mux)

Splitting the rows of a Dataframe using pyspark

Code:
import os.path

file_name = os.path.join('databricks-datasets', 'cs190', 'data-001', 'millionsong.txt')
raw_data_df = sqlContext.read.load(file_name, 'text')
sample_points = raw_data_df.take(5)
print(sample_points)
Example output:
[Row(value='1,2,3'), Row(value='4,5,6')]
From this output, I wanted to parse each row of the DataFrame into individual elements using Spark's select and split methods; for example, split "1,2,3" into ['1','2','3'].
Code:
raw_data_df.select((explode(split(raw_data_df.value,"\s+"))))
But the code doesn't seem to work as expected; any suggestions would be helpful.
Try this (map over the underlying RDD and split each line on the comma; the posted l[0].select(...) is not valid, since a plain string has no select method):
raw_data_df.rdd.map(lambda l: l[0].split(","))
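If you would rather stay in the DataFrame API, the question's own select/split attempt works once the delimiter matches the data: "\s+" splits on whitespace, but the values are comma-separated. A sketch (assuming the value column that sqlContext.read.load(..., 'text') produces):

from pyspark.sql.functions import split, explode

# "1,2,3" -> ["1", "2", "3"]: one array column per input line
split_df = raw_data_df.select(split(raw_data_df.value, ",").alias("elements"))
split_df.show(5, truncate=False)

# or, with explode, one element per output row
exploded_df = raw_data_df.select(explode(split(raw_data_df.value, ",")).alias("element"))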
