Read last sheet of spreadsheet using pandas dataframe - excel

I am trying to use pandas DataFrames to read only the last sheet of a spreadsheet, since I don't need the rest. How do I tell Python to take just the last one? I cannot find a flag in the documentation that does this. I can specify a sheet with the sheet_name flag, but that does not work for me since I don't know how many sheets I have:
raw_excel = pd.read_excel(path, sheet_name=0)

You can use the ExcelFile class:
xl = pd.ExcelFile(path)
# See all sheet names
sheet_names = xl.sheet_names
# Last sheet name
last_sheet = sheet_names[-1]
# Read the last sheet into a DataFrame
df = xl.parse(last_sheet)
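Alternatively, read_excel with sheet_name=None returns a dict of all sheets keyed by sheet name, in workbook order, so the last key is the last sheet. Note this loads every sheet, so it can be slower for large workbooks. A minimal self-contained sketch (the file and sheet names here are made up):

```python
import pandas as pd

# Build a small two-sheet workbook so the example is self-contained
with pd.ExcelWriter("demo.xlsx") as writer:
    pd.DataFrame({"a": [1]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"b": [2]}).to_excel(writer, sheet_name="last", index=False)

# sheet_name=None loads every sheet into a dict keyed by sheet name
all_sheets = pd.read_excel("demo.xlsx", sheet_name=None)
last_name = list(all_sheets)[-1]
last_df = all_sheets[last_name]
# last_name -> "last"
```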

Related

How to collect cell values in excel and make them into one column

I'm brand new to coding and to this forum, so please accept my apologies in advance for being a newbie and probably not understanding what I'm supposed to say!
I was asked a question which I didn't know how to approach earlier. The user was trying to collect cell values in multiple rows from Excel (split out by a delimiter) and then create one complete column of single values in rows. Example in picture1 below. Source file is how the data is received and output is what the user is trying to do with it:
I hope I have explained that correctly. I'm looking for some python code that will automate it. There could be thousands of values that need putting into rows
Thanks in advance!
Andy
Have a look at the openpyxl package:
https://openpyxl.readthedocs.io/en/stable/index.html
This allows you to directly access cells in your Excel sheet from within Python.
As some of your cells seem to contain multiple values separated by semicolons, you could read the cells as strings and use
splitstring = somelongstring.split(';')
to separate the values. This results in a list containing the separated values.
Basic manipulations using this package are described in this tutorial:
https://openpyxl.readthedocs.io/en/stable/tutorial.html
Edit:
An example iterating over all columns in a worksheet would be:
from openpyxl import load_workbook

wb = load_workbook('test.xlsx')
ws = wb.active  # iter_cols lives on the worksheet, not the workbook
for col in ws.iter_cols(values_only=True):
    for value in col:
        do_something(value)
I was able to find some code online and butcher it to get what I needed. Here is the code I ended up with:
import pandas as pd
import numpy as np
from itertools import chain

iris = pd.read_csv('iris.csv')

# return a flat list from a series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate the number of values in each cell
lens = iris['Order No'].str.split(',').map(len)

# create a new dataframe: one row per value, index repeated to match
res = pd.DataFrame({'Order No': chainer(iris['Order No'])},
                   index=np.repeat(iris.index, lens))
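In newer pandas (0.25+) the same reshaping can be done without itertools, using Series.str.split plus DataFrame.explode. A minimal sketch with stand-in data (the column name and values are illustrative):

```python
import pandas as pd

# Stand-in for a spreadsheet column of delimiter-separated values
df = pd.DataFrame({"Order No": ["1;2;3", "4", "5;6"]})

# split each cell into a list, then explode one list element per row
df["Order No"] = df["Order No"].str.split(";")
res = df.explode("Order No").reset_index(drop=True)
# res["Order No"].tolist() -> ["1", "2", "3", "4", "5", "6"]
```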

Making a vector out of excel columns using python

everyone...
I just started on python a couple of days ago because I require to handle some excel data in order to automatically update the data of certain cells from one file into another.
However, I'm kind of stuck since I have barely programmed before, and it's my first time using python as well, but my job required me to find a solution and I'm trying to make it work even though it's not my field of expertise.
I used the xlrd library, imported my file and managed to print the columns I need. However, I can't find a way to put those columns into a matrix so I can handle the data like this:
Matrix =[DataColumnA DataColumnG DataColumnH] in the size [nrows x 3]
As for now, I have 3 different outputs for the 3 different columns I need, but I'm trying to join them together into one big matrix.
So far my code looks like this:
import xlrd

workbook = xlrd.open_workbook("190219_serviciosWRAmanualV5.xls")
worksheet = workbook.sheet_by_name("ServiciosDWDM")
workbook2 = xlrd.open_workbook("Potencia2.xlsx")
worksheet2 = workbook2.sheet_by_name("Hoja1")
filas = worksheet.nrows
filas2 = worksheet2.nrows
columnas = worksheet.ncols
for row in range(2, filas):
    Equipo_A = worksheet.cell(row, 12).value
    Client_A = worksheet.cell(row, 13).value
    Line_A = worksheet.cell(row, 14).value
    print(Equipo_A, Line_A, Client_A)
So I have only gotten, as mentioned above, the data in the columns, which is what the print statement shows.
What I'm trying to do, or the main thing I need to do is to read the cell of the first row in Column A and look for it in the other excel file... if the names match, I would have to validate that for the same row (in file 1) the data in both the ColumnG and ColumnH is the same as the data in the second file.
If they match I would have to update Column J in the first file with the data from the second file.
My other approach is to retrieve the value of the cell in Column A and look for it in column A of the second file; then I would use an if conditional to see if Columns G and H are equal to Column C of the 2nd file, and so on...
The thing here is, I have no idea how to pin point the position of the cell and extract the data to make the conditional for this second approach.
I'm not sure if by making that matrix my approach is okay or if the second way is better, so any suggestion would be absolutely appreciated.
Thank you in advance!
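One way to sketch the matrix-and-lookup approach, using hypothetical in-memory lists in place of the three xlrd columns (all names and values here are made up for illustration):

```python
# Hypothetical stand-ins for the three columns read via worksheet.cell(...)
equipo = ["EQ1", "EQ2", "EQ3"]
client = ["C1", "C2", "C3"]
line = ["L1", "L2", "L3"]

# Combine the three columns into an nrows x 3 matrix
matrix = [list(row) for row in zip(equipo, client, line)]

# Index the second file's rows by equipment name so each lookup is O(1)
file2 = {"EQ1": ("C1", "L1"), "EQ2": ("CX", "L2")}

to_update = []
for name, cl, ln in matrix:
    if file2.get(name) == (cl, ln):
        to_update.append(name)  # both columns match: update column J in file 1
# to_update -> ["EQ1"]
```

Indexing the second file into a dict avoids re-scanning it for every row of the first file.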

Python Pandas check cells for a range of numbers copy or skip if not there

I would use the pandas isin or iloc functions, but the Excel format is complex: there are sometimes data followed by columns with no info, and the main pool of entries are columns with up to 3 pieces of data in a cell, separated only by a '|'. Some of the cells are missing a number; I want to skip those and copy the ones that have it.
Below is my current code. I have a giant Excel file with thousands of entries and, worse, the columns/rows are not neat. There are several pieces of data in each column cell per row. What I've noticed is that a number called 'tail #' is missing in some of them. What I want to do is search for that number; if a cell has it, copy that cell, and if not, move on to the next column in the row. Then repeat that for all cells. There is a giant header, but when I transformed the file into CSV I removed that with formatting. This is also why I am looking for a number: there are several headers, for example years such as 2010 followed by several empty columns until the next one maybe 10 columns later. Also please note that under this header of years are several columns of data per row, separated by two columns with no info. The info in a column looks like this: '13|something something|some more words'. If it has a number, as you see, I want to copy it. The numbers seem to range from 0 to no greater than 30. Lastly, I'm trying to write this using pandas, but I may need a more manual way to do things because isin and iloc were not working.
import pandas as pd
import os.path as op

def extract_export_columns(df, list_of_columns, file_path):
    column_df = df[list_of_columns]
    column_df.to_csv(file_path, index=False, sep="|")

# Original file
input_base_path = 'C:/Users/somedoc input'
main_df_data_file = pd.read_csv(op.join(input_base_path, 'som_excel_doc.csv'))

# Filter for tail numbers
tail_numbers = main_df_data_file['abcde'] <= 30
main_df_data_file[tail_numbers]

# iterate over list to check the condition
#number_filter = main_df_data_file.Updated.isin(["15"])
#main_df_data_file[number_filter]
#print(number_filter)
#for row in main_df_data_file.values:
#    for value in row:
#        print(value)
#    print(row)

# Product of code
output_base_path = r'C:\Users\some_doc output'
extract_export_columns(main_df_data_file,
                       ['Updated 28 Feb 18 Tail #'],
                       op.join(output_base_path, 'UBC_example3.txt'))
The code I have loads the CSV and successfully creates a text file. I want to build the body of the function to scan an Excel/CSV file and copy to a text file only the data that contains a number.
https://drive.google.com/file/d/1stXxgqBeo_sGksVYL9HHdn2IflFL_bb8/view?usp=sharing
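A sketch of how the '13|…' cells could be filtered with pandas string methods; the column name and sample values here are made up, and this assumes the tail number, when present, leads the cell:

```python
import pandas as pd

# Hypothetical stand-in for the messy column of '|'-separated cells
df = pd.DataFrame({"col": ["13|something|more words", "|no number|x", "7|y|z"]})

# Pull a leading integer (the 'tail #') off each cell, if present
num = df["col"].str.extract(r"^(\d+)\|", expand=False)

# Keep only cells whose number is present and in range 0..30
keep = num.notna() & (num.astype(float) <= 30)
kept = df.loc[keep, "col"]
# kept.tolist() -> ["13|something|more words", "7|y|z"]
```

Cells without a leading number produce NaN in the extract, so the comparison drops them automatically.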

Python: How to read multiple spreadsheets into a new format in a CSV?

I (a newcomer) am trying to read several tables from an Excel document and write them in a new format to a single CSV.
In the CSV I need the following fields: year (from a global variable), month (from a global variable), outlet (name of the sheet); rowvalue [a] (string describing the row), columnvalue [1] (string describing the column), cellvalue (float).
The corresponding values must then be entered in these fields.
From the respective tables, only RowNum 6 to 89 need to be read.
#BWA-Reader
#read the excel spreadsheet with all sheets
#Python 3.6

# Imports
import openpyxl
import csv

# year value of the Business analysis
year = "2018"
# month value of the Business analysis
month = "11"
# .xlsx path
wb = openpyxl.load_workbook("BWA Zusammenfassung 18-11.xlsx")
print("Found your Spreadsheet")
# List of sheets
sheets = wb.sheetnames
# remove unnecessary sheets
list_to_remove = ("P", 'APn', 'AP')
sheets_clean = list(set(sheets).difference(set(list_to_remove)))
print("sheets to load: " + str(sheets_clean))
# for loop for every sheet based on sheets_clean
for sheet in sheets_clean:
    # build a list of row and cell values for this sheet
    all_rows = []
    for row in wb[sheet].rows:
        current_row = []
        for cell in row:
            current_row.append(cell.value)
        all_rows.append(current_row)
    print(all_rows)
# I'm stuck here
I expect an output like:
2018;11;Oldenburg;total_sales;monthly;145840.00
all sheets in one csv
Thank you so much for every idea how to solve my project!
The complete answer to this question is very dependent on the actual dataset.
I would recommend looking into pandas' read_excel() function. This will make it so much easier to extract the needed rows/columns/cells, all without looping through all of the sheets.
You might need some tutorials on pandas in order to get there, but judging by what you are trying to do, pandas might be a useful skill to have in the future!
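As a sketch of the pandas route: sheet_name=None loads every sheet into a dict, which can then be flattened into the desired rows. Everything here is illustrative; it assumes each sheet's relevant rows have already been reduced to label/period/value columns, which in practice would come from pd.read_excel("BWA Zusammenfassung 18-11.xlsx", sheet_name=None) plus slicing rows 6 to 89:

```python
import pandas as pd

def sheets_to_rows(sheets, year, month, skip=("P", "APn", "AP")):
    """Flatten a {sheet_name: DataFrame} dict into csv-style rows."""
    rows = []
    for name, df in sheets.items():
        if name in skip:
            continue  # drop the unneeded sheets
        for _, r in df.iterrows():
            rows.append([year, month, name, r["label"], r["period"], r["value"]])
    return rows

# Tiny in-memory stand-in for the loaded workbook
demo = {"Oldenburg": pd.DataFrame(
    {"label": ["total_sales"], "period": ["monthly"], "value": [145840.00]})}
rows = sheets_to_rows(demo, "2018", "11")
# rows[0] -> ["2018", "11", "Oldenburg", "total_sales", "monthly", 145840.0]
```

Writing the rows out with csv.writer using delimiter=";" would then produce lines like the expected `2018;11;Oldenburg;total_sales;monthly;145840.00`.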

Openpyxl to check for keywords, then modify next to cells to contain those keywords and total found

I'm using python 3.x and openpyxl to parse an excel .xlsx file.
For each row, I check a column (C) to see if any of those keywords match.
If so, I add them to a separate list variable and also determine how many keywords were matched.
I then want to add the actual keywords into the next cell, and the total of keywords into the cell after. This is where I am having trouble, actually writing the results.
Contents of the keywords.txt and results.xlsx files are here.
import openpyxl

# Here I read a keywords.txt file and input them into a keywords variable
# I throw away the first line to prevent a mismatch due to the unicode BOM
with open("keywords.txt") as f:
    f.readline()
    keywords = [line.rstrip("\n") for line in f]

# Load the workbook
wb = openpyxl.load_workbook("results.xlsx")
ws = wb.get_sheet_by_name("Sheet")

# Iterate through every row, only looking in column C for the keyword match.
for row in ws.iter_rows("C{}:E{}".format(ws.min_row, ws.max_row)):
    # if there's a match, add to the keywords_found list
    keywords_found = [key for key in keywords if key in row[0].value]
    # if any keywords found, enter the keywords in column D
    # and how many keywords into column E
    if len(keywords_found):
        row[1].value = keywords_found
        row[2].value = len(keywords_found)
Now, I understand where I'm going wrong, in that ws.iter_rows(..) returns a tuple, which can't be modified. I figure I could use two for loops, one for the rows and another for the columns in each row, but this test is a small example of a real-world scenario where the number of rows is in the tens of thousands.
I'm not quite sure which is the best way to go about this. Thankyou in advance for any help that you can provide.
Use the ws['C'] and then the offset() method of the relevant cell.
Thanks Charlie for the offset() tip. I modified the code slightly and now it works a treat.
for row in ws.iter_rows("C{}:C{}"...):
    for cell in row:
        ....
        if len(keywords_found):
            cell.offset(0, 1).value = str(keywords_found)
            cell.offset(0, 2).value = str(len(keywords_found))
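For reference, newer openpyxl versions replace the string-range form of iter_rows with keyword bounds (min_col=3 is column C). A self-contained sketch of the same keyword-and-offset pattern, with made-up keywords and cell contents:

```python
from openpyxl import Workbook

keywords = ["alpha", "beta"]

# In-memory workbook standing in for results.xlsx
wb = Workbook()
ws = wb.active
ws["C1"] = "alpha and beta here"
ws["C2"] = "nothing relevant"

# Walk column C only; each yielded row is a 1-tuple of cells
for (cell,) in ws.iter_rows(min_col=3, max_col=3, min_row=1, max_row=ws.max_row):
    found = [k for k in keywords if k in (cell.value or "")]
    if found:
        cell.offset(0, 1).value = ", ".join(found)  # column D
        cell.offset(0, 2).value = len(found)        # column E

# ws["D1"].value -> "alpha, beta"; ws["E1"].value -> 2; row 2 stays untouched
```

Guarding with `cell.value or ""` avoids a TypeError on empty cells, which the original code would hit.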
