I have a problem detecting which rows are hidden when I open a workbook in read-only mode.
It works flawlessly when I set the read_only parameter to False while loading the workbook, because then I can iterate over row_dimensions to check which rows are hidden - but opening the workbook in normal (read-write) mode takes much longer (~2 mins vs ~20 secs in read-only mode) and consumes over 1 GB of RAM.
Unfortunately, read-only worksheets don't have a row_dimensions attribute.
Any help is welcome.
The underlying issue is that the parser is used once and discarded after iterating over all the rows. This is how read-only mode keeps memory allocation low and generates rows on request. Interestingly enough, the parser itself still builds row_dimensions with the row attributes in it!
There are a couple of workarounds you could attempt. In lieu of forking and creating an official fix that exposes the ReadOnlyWorksheet parser, I went with monkey patching:
from openpyxl.worksheet._read_only import ReadOnlyWorksheet, WorkSheetParser, EMPTY_CELL

# The override:
class MyReadOnlyWorksheet(ReadOnlyWorksheet):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.parser = None

    def row_is_hidden(self, row_index):
        str_row_index = str(row_index)
        if self.parser and str_row_index in self.parser.row_dimensions:
            return self.parser.row_dimensions[str_row_index].get('hidden') == '1'
        if self.parser is None or row_index > self.parser.row_counter:
            raise RuntimeError('Must generate the row before calling')
        return False

    def _cells_by_row(self, min_col, min_row, max_col, max_row, values_only=False):
        """
        The source worksheet file may have columns or rows missing.
        Missing cells will be created.
        Logically the same, but saves the parser to "self" during row iteration.
        """
        filler = EMPTY_CELL
        if values_only:
            filler = None

        max_col = max_col or self.max_column
        max_row = max_row or self.max_row
        empty_row = []
        if max_col is not None:
            empty_row = (filler,) * (max_col + 1 - min_col)

        counter = min_row
        idx = 1
        src = self._get_source()
        parser = WorkSheetParser(src, self._shared_strings,
                                 data_only=self.parent.data_only, epoch=self.parent.epoch,
                                 date_formats=self.parent._date_formats)
        ### Cache the parser in order to check generated row attrs ###
        self.parser = parser

        for idx, row in parser.parse():
            if max_row is not None and idx > max_row:
                break

            # some rows are missing
            for _ in range(counter, idx):
                counter += 1
                yield empty_row

            # return cells from a row
            if counter <= idx:
                row = self._get_row(row, min_col, max_col, values_only)
                counter += 1
                yield row

        if max_row is not None and max_row < idx:
            for _ in range(counter, max_row + 1):
                yield empty_row

        src.close()
# the monkey patch:
import openpyxl.reader.excel
openpyxl.reader.excel.ReadOnlyWorksheet = MyReadOnlyWorksheet

# the test drive:
from openpyxl import load_workbook

file_location = ''  # load your file
workbook = load_workbook(file_location, data_only=True, keep_links=False, read_only=True)
for worksheet in workbook.worksheets:
    row_gen = worksheet.rows
    for i, row in enumerate(row_gen, start=1):
        if worksheet.row_is_hidden(i):
            continue  # do not process hidden rows
This does what you need, but beware! I would add sufficient test coverage before using it in production (think of things like a future version re-keying the row_dimensions dict, or removing row_dimensions from read-only parsing entirely). You can similarly add your own accessors to the worksheet that expose other row attrs (or return the entire dict), as sketched below.
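For example, a minimal sketch of such an accessor (row_attributes is a hypothetical name, added to MyReadOnlyWorksheet above; it relies on the same cached parser):

def row_attributes(self, row_index):
    '''Return the raw attribute dict for an already-generated row ({} if absent).'''
    if self.parser is None:
        raise RuntimeError('Must generate the row before calling')
    return self.parser.row_dimensions.get(str(row_index), {})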
Happy coding!
Related
I have a CSV file with around 58 million cells of numerical data. I want to extract blocks of 16 cells (4 rows × 4 columns), with each block starting 49 rows below the previous one.
Let me describe it clearly.
[Image: the data I need to extract]
The above image shows the first set of data to be extracted (rows 23 to 26, columns 92 to 95). This data has to be written to another CSV file (preferably as one row).
Then I move down 49 rows (to row 72) and extract another 4 rows × 4 columns, as shown in the image below.
[Image: next set of data]
Similarly, I need to keep going until I reach the end of the file.
[Image: third set]
The next set is the one shown above.
I have to keep going like this until I reach the end of the file, extracting thousands of such blocks.
I wrote code for this, but it's not working, and I don't know where the mistake is. I'll attach it here:
import pandas as pd
import numpy

df = pd.read_csv('TS_trace31.csv')
# print(numpy.shape(df))

arrY = []
ex = 0
for i in range(len(df)):
    if i == 0:
        for j in range(4):
            l = (df.iloc[j + 21 + i * 49, 91:95]).tolist()
            arrY.append(l)
    else:
        for j in range(4):
            if j + 22 + i * 49 >= len(df):
                ex = 1
                break
            # print(j)
            l = (df.iloc[j + 21 + i * 49, 91:95]).tolist()
            arrY.append(l)
        if ex == 1:
            break

# print(arrY)
a = []
for i in range(len(arrY) - 3):
    p = arrY[i] + arrY[i + 1] + arrY[i + 2] + arrY[i + 3]
    a.append(p)

print(numpy.shape(a))
numpy.savetxt('myfile.csv', a, delimiter=',')
Using the above code, I didn't get the result I wanted.
Please help me find and correct where I have gone wrong.
I couldn't attach my CSV file here, so please use any sample sheet you have or create a simple one.
Thanks in advance! Have a great day.
I don't know exactly what you are doing in your code, but I wrote my own:
import csv
from itertools import chain

CSV_PATH = 'TS_trace31.csv'
new_data = []

with open(CSV_PATH, 'r') as csvfile:
    reader = csv.reader(csvfile)
    # row_num for storing big jumps e.g. 23, 72, 121 ...
    row_num = 23
    # n for storing the group number 0 - 3
    # with n we can find the 23, 24, 25, 26
    n = 0
    # row_group for storing every group of 4 rows
    row_group = []
    # looping over every row in the main file
    for row in reader:
        if reader.line_num == row_num + n:
            # for the first time this is going to be 23 + 0
            # then we add one to n
            # so the next cycle will be 24 and so on
            n += 1
            print(reader.line_num)
            # add each row to its group
            row_group.append(row[91:95])
            # check if we are at the end of the group e.g. 26
            if n == 4:
                # reset the group number
                n = 0
                # add the jump to the main row number
                row_num += 49
                # combine the whole row_group into a single row
                new_data.append(list(chain(*row_group)))
                # clear the row_group for the next set of rows
                row_group.clear()
                print('=' * 50)
        else:
            continue

# and finally write all the rows to a new file
with open('myfile.csv', 'w') as new_csvfile:
    writer = csv.writer(new_csvfile)
    writer.writerows(new_data)
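If you would rather stay in pandas, here is a minimal alternative sketch of the same stride logic. It assumes row 23 means the 23rd line of the file (header-less read; adjust the starting offset if your file has a header row):

import pandas as pd

df = pd.read_csv('TS_trace31.csv', header=None)

blocks = []
start = 22                # 0-based index of 1-based row 23
while start + 4 <= len(df):
    # take the 4x4 block (columns 92-95 are 0-based 91:95) and flatten it to one row
    block = df.iloc[start:start + 4, 91:95].to_numpy().ravel()
    blocks.append(block)
    start += 49           # jump down to the next set

pd.DataFrame(blocks).to_csv('myfile.csv', index=False, header=False)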
Edit 12/07/19: The problem was not in fact with the pd.rename function, but with the fact that I did not return the pandas DataFrame from the function; as a result the column change did not exist when printing, i.e.:

def change_column_names(as_pandas, old_name, new_name):
    as_pandas.rename(columns={old_name: new_name}, inplace=True)
    return as_pandas  # <- This was missing

Please see the user comment below to upvote them for finding this error for me.
Alternatively, you can continue reading.
The data can be downloaded from this link, but I have also added a sample dataset below. The file is not formatted as a typical CSV, and I believe it may have been an assessment piece related to the Hidden Decision Tree article. I have included the portion of the code that works around the formatting of the text file, as mentioned above, and allows the user to rename a column.
The problem occurred when I tried to create a renaming function:

def change_column_names(as_pandas, old_name, new_name):
    as_pandas.rename(columns={old_name: new_name}, inplace=True)
However, it seemed to work when I set the column names inside the rename function:

def change_column_names(as_pandas):
    as_pandas.rename(columns={'Unique Pageviews': 'Page_Views'}, inplace=True)
    return as_pandas
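As a side note on the underlying behaviour: with inplace=True, rename mutates the frame and returns None, while without inplace it returns a new frame that must be assigned. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'Unique Pageviews': [5608, 360]})

# inplace=True mutates df and returns None -- do not assign the result
df.rename(columns={'Unique Pageviews': 'Page_Views'}, inplace=True)

# without inplace, rename returns a new frame that must be kept
df2 = df.rename(columns={'Page_Views': 'Views'})
print(df.columns.tolist())   # ['Page_Views']
print(df2.columns.tolist())  # ['Views']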
Sample Dataset
Title URL Date Unique Pageviews
oupUrl=tutorials 18-Apr-15 5608
"An Exclusive Interview with Data Expert, John Bottega" http://www.datasciencecentral.com/forum/topics/an-exclusive-interview-with-data-expert-john-bottega?groupUrl=announcements 10-Jun-14 360
Announcing Composable Analytics http://www.datasciencecentral.com/forum/topics/announcing-composable-analytics 15-Jun-14 367
Announcing the release of Spark 1.5 http://www.datasciencecentral.com/forum/topics/announcing-the-release-of-spark-1-5 12-Sep-15 156
Are Extreme Weather Events More Frequent? The Data Science Answer http://www.datasciencecentral.com/forum/topics/are-extreme-weather-events-more-frequent-the-data-science-answer 5-Oct-15 204
Are you interested in joining the University of California for an empiricalstudy on 'Big Data'? http://www.datasciencecentral.com/forum/topics/are-you-interested-in-joining-the-university-of-california-for-an 7-Feb-13 204
Are you smart enough to work at Google? http://www.datasciencecentral.com/forum/topics/are-you-smart-enough-to-work-at-google 11-Oct-15 3625
"As a software engineer, what's the best skill set to have for the next 5-10years?" http://www.datasciencecentral.com/forum/topics/as-a-software-engineer-what-s-the-best-skill-set-to-have-for-the- 12-Feb-16 2815
A Statistician's View on Big Data and Data Science (Updated) http://www.datasciencecentral.com/forum/topics/a-statistician-s-view-on-big-data-and-data-science-updated-1 21-May-14 163
A synthetic variance designed for Hadoop and big data http://www.datasciencecentral.com/forum/topics/a-synthetic-variance-designed-for-hadoop-and-big-data?groupUrl=research 26-May-14 575
A Tough Calculus Question http://www.datasciencecentral.com/forum/topics/a-tough-calculus-question 10-Feb-16 937
Attribution Modeling: Key Analytical Strategy to Boost Marketing ROI http://www.datasciencecentral.com/forum/topics/attribution-modeling-key-concept 24-Oct-15 937
Audience expansion http://www.datasciencecentral.com/forum/topics/audience-expansion 6-May-13 223
Automatic use of insights http://www.datasciencecentral.com/forum/topics/automatic-use-of-insights 27-Aug-15 122
Average length of dissertations by higher education discipline. http://www.datasciencecentral.com/forum/topics/average-length-of-dissertations-by-higher-education-discipline 4-Jun-15 1303
This is the full code that produces the KeyError:

def change_column_names(as_pandas):
    as_pandas.rename(columns={'Unique Pageviews': 'Page_Views'}, inplace=True)

def change_column_names(as_pandas, old_name, new_name):
    as_pandas.rename(columns={old_name: new_name}, inplace=True)

def open_as_dataframe(file_name_in):
    reader = pd.read_csv(file_name_in, encoding='windows-1251')
    return reader

# Get each column of data including the heading and separate each element
# i.e. Title, URL, Date, Page Views
# and save to string_of_rows with comma separator for storage as a csv
# file.
def get_columns_of_data(*args):
    # Function that accepts variable length arguments
    string_of_rows = str()
    num_cols = len(args)
    try:
        if num_cols > 0:
            for number, element in enumerate(args):
                if number == (num_cols - 1):
                    string_of_rows = string_of_rows + element + '\n'
                else:
                    string_of_rows = string_of_rows + element + ','
    except UnboundLocalError:
        print('Empty file \'or\' No arguments received, cannot be zero')
    return string_of_rows

def open_file(file_name):
    try:
        with open(file_name) as csv_file_in, open('HDT_data5.txt', 'w') as csv_file_out:
            csv_read = csv.reader(csv_file_in, delimiter='\t')
            for row in csv_read:
                try:
                    row[0] = row[0].replace(',', '')
                    csv_file_out.write(get_columns_of_data(*row))
                except TypeError:
                    continue
        print("The file name '{}' was successfully opened and read".format(file_name))
    except IOError:
        print('File not found \'OR\' Not in current directory\n')

# All acronyms used in variable naming correspond to the function at the time
# of return from the function.
# csv_list being a list of the csv file contents; the remainder i.e. 'st' of
# csv_list_st = split_title().
def main():
    open_file('HDTdata3.txt')
    multi_sets = open_as_dataframe('HDT_data5.txt')
    # change_column_names(multi_sets)
    change_column_names(multi_set, 'Old_Name', 'New_Name')
    print(multi_sets)

main()
I cleaned up your code so it would run. You were changing the column names but not returning the result. Try the following:
import pandas as pd
import numpy as np
import math

def set_new_columns(as_pandas):
    titles_list = ['Year > 2014', 'Forum', 'Blog', 'Python', 'R',
                   'Machine_Learning', 'Data_Science', 'Data',
                   'Analytics']
    for number, word in enumerate(titles_list):
        as_pandas.insert(len(as_pandas.columns), titles_list[number], 0)

def title_length(as_pandas):
    # Insert new column header then count the number of letters in 'Title'
    as_pandas.insert(len(as_pandas.columns), 'Title_Length', 0)
    as_pandas['Title_Length'] = as_pandas['Title'].map(str).apply(len)

# Although it is a log, the difference logX1 - logX2 is a relative comparison,
# so you can think of it as the percentage change in Page Views.
# map allows the function to be applied to every row in column 'Page_Views'.
def log_page_view(as_pandas):
    # Insert new column header
    as_pandas.insert(len(as_pandas.columns), 'Log_Page_Views', 0)
    as_pandas['Log_Page_Views'] = as_pandas['Page_Views'].map(lambda x: math.log(1 + float(x)))

def change_to_numeric(as_pandas):
    # Check for missing values then convert the column to numeric.
    # (Assign back to the column; rebinding the local name as_pandas
    # would not affect the caller's frame.)
    as_pandas['Page_Views'] = as_pandas['Page_Views'].replace(r'^\s*$', np.nan, regex=True)
    as_pandas['Page_Views'] = pd.to_numeric(as_pandas['Page_Views'],
                                            errors='coerce')

def change_column_names(as_pandas):
    as_pandas.rename(columns={'Unique Pageviews': 'Page_Views'}, inplace=True)
    return as_pandas

def open_as_dataframe(file_name_in):
    reader = pd.read_csv(file_name_in, encoding='windows-1251')
    return reader

# Get each column of data including the heading and separate each element
# i.e. Title, URL, Date, Page Views
# and save to string_of_rows with comma separator for storage as a csv
# file.
def get_columns_of_data(*args):
    # Function that accepts variable length arguments
    string_of_rows = str()
    num_cols = len(args)
    try:
        if num_cols > 0:
            for number, element in enumerate(args):
                if number == (num_cols - 1):
                    string_of_rows = string_of_rows + element + '\n'
                else:
                    string_of_rows = string_of_rows + element + ','
    except UnboundLocalError:
        print('Empty file \'or\' No arguments received, cannot be zero')
    return string_of_rows

def open_file(file_name):
    import csv
    try:
        with open(file_name) as csv_file_in, open('HDT_data5.txt', 'w') as csv_file_out:
            csv_read = csv.reader(csv_file_in, delimiter='\t')
            for row in csv_read:
                try:
                    row[0] = row[0].replace(',', '')
                    csv_file_out.write(get_columns_of_data(*row))
                except TypeError:
                    continue
        print("The file name '{}' was successfully opened and read".format(file_name))
    except IOError:
        print('File not found \'OR\' Not in current directory\n')

# All acronyms used in variable naming correspond to the function at the time
# of return from the function.
# csv_list being a list of the csv file contents; the remainder i.e. 'st' of
# csv_list_st = split_title().
def main():
    open_file('HDTdata3.txt')
    multi_sets = open_as_dataframe('HDT_data5.txt')
    multi_sets = change_column_names(multi_sets)
    change_to_numeric(multi_sets)
    log_page_view(multi_sets)
    title_length(multi_sets)
    set_new_columns(multi_sets)
    print(multi_sets)

main()
I have been working on code that takes rows from a csv file and turns them into lists of integers for further mathematical operations. However, if a row turns out to be empty, it causes problems. Also, the user will not know which row is empty, so the solution should be general rather than pointing at a particular row and removing it. Here is the code:
import csv
import statistics as st

def RepresentsInt(i):
    try:
        int(i)
        return True
    except ValueError:
        return False

l = []
with open('Test.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        l.append([int(r) if RepresentsInt(r) else 0 for r in row])

for row in l:
    row = [x for x in row if x != 0]
    row.sort()
    print(row)
I've tried l = [row for row in l if row != []] and ...

if row == []:
    l.remove(row)

... but both do nothing, and there is no error for either. Here is my csv file:
1,2,5,4
2,3
43,65,34,56,7
0,5
7,8,9,6,5
33,45,65,4
If I run the code, I get [] for rows 4 and 6 (which are empty).
This worked on my machine:
import csv

def RepresentsInt(i):
    try:
        int(i)
        return True
    except ValueError:
        return False

l = []
with open('Test.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        l.append([int(r) for r in row if RepresentsInt(r)])

rows = [row for row in l if row]
for row in rows:
    print(row)
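The key difference from your version: assigning to the loop variable (row = [...]) only rebinds a local name and never changes l itself, and calling l.remove(row) while iterating over l skips elements. A minimal sketch of the distinction, with made-up data:

l = [[1, 2], [], [3], []]

for row in l:
    row = [x for x in row if x != 0]  # rebinds the local name; l is untouched

l = [row for row in l if row]  # builds a new list and rebinds l
print(l)  # [[1, 2], [3]]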
It is unclear what you are doing with the statistics module, but the following program should do what you asked for. The pprint module is imported to make the generated table easier to read when displayed. If this answer solves the problem presented in your question but you are having difficulty somewhere else, make sure you open another question targeted at the new problem.
#! /usr/bin/env python3
import csv
import pprint

def main():
    table = []
    # Add rows to table.
    with open('Test.csv', newline='') as file:
        table.extend(csv.reader(file))
    # Convert table cells to numbers.
    for index, row in enumerate(table):
        table[index] = [int(value) if value.isdigit() else 0 for value in row]
    # Remove zeros from the rows.
    for index, row in enumerate(table):
        table[index] = [value for value in row if value]
    # Remove empty rows and display the table.
    table = [row for row in table if row]
    pprint.pprint(table)

if __name__ == '__main__':
    main()
I have an excel spreadsheet that I am trying to parse with xlrd. The spreadsheet itself makes extensive use of named ranges.
If I use:
for name in book.name_map:
    print(name)
I can see all of the names are there.
However, I can't make any of the methods work (the cell method and area2d). Can anyone give me an example of the syntax to read the cell range that a name points to, given the name?
The Excel file is an XLSM file with lots of Visual Basic that also operates on these named ranges.
I think the named-range support in xlrd is broken for XLSM files, but I found an answer by switching to openpyxl. This has a function get_named_ranges() which returns all of the named ranges. The support after that is a bit thin, so I wrote my own class to turn the named ranges in my spreadsheet into a structure through which I can access the same information using the same names.
# -*- coding: utf-8 -*-
"""
Created on Wed Sep 14 09:42:09 2016

@author: ellwood
"""
from openpyxl import load_workbook

class NamedArray(object):
    ''' Named range object
    '''
    C_CAPS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

    def __init__(self, workbook, named_range_raw):
        ''' Initialise a NamedArray object from the named_range_raw information
            in the given workbook
        '''
        self.wb = workbook
        sheet_str, cellrange_str = str(named_range_raw).split('!')
        self.sheet = sheet_str.split('"')[1]
        self.loc = self.wb[self.sheet]
        if ':' in cellrange_str:
            self.has_range = True
            self.has_value = False
            lo, hi = cellrange_str.split(':')
            self.ad_lo = lo.replace('$', '')
            self.ad_hi = hi.replace('$', '')
        else:
            self.has_range = False
            self.has_value = True
            self.ad_lo = cellrange_str.replace('$', '')
            self.ad_hi = self.ad_lo
        self.min_row = self.get_row(self.ad_lo)
        self.max_row = self.get_row(self.ad_hi)
        self.rows = self.max_row - self.min_row + 1
        self.min_col = self.col_to_n(self.ad_lo)
        self.max_col = self.col_to_n(self.ad_hi)
        self.cols = self.max_col - self.min_col + 1

    def size_of(self):
        ''' Returns two dimensional size of named space
        '''
        return self.cols, self.rows

    def cols(self):
        ''' Returns number of cols in named space
            (note: shadowed by the self.cols attribute set in __init__)
        '''
        return self.cols

    def rows(self):
        ''' Returns number of rows in named space
            (note: shadowed by the self.rows attribute set in __init__)
        '''
        return self.rows

    def value(self, r=1, c=1):
        ''' Returns the value at row r, column c (both 1-based)
        '''
        if self.has_value:
            return self.loc[self.ad_lo].value
        assert r <= self.rows
        assert c <= self.cols
        return self.loc[self.n_to_col(self.min_col + c - 1) + str(self.min_row + r - 1)].value

    def is_range(self):
        ''' if true then the name defines a table of more than 1 cell
        '''
        return self.has_range

    def is_value(self):
        ''' if true then the name defines the location of a single value
        '''
        return self.has_value

    def __str__(self):
        ''' printed description of named space
        '''
        locs = 's ' + self.ad_lo + ':' + self.ad_hi if self.has_range else ' ' + self.ad_lo
        return 'named range' + str(self.size_of()) + ' in sheet ' + self.sheet + ' # location' + locs

    @classmethod
    def get_row(cls, ad):
        ''' get row number from cell string
            Cell string is assumed to be in excel format i.e. "ABC123" where row is 123
        '''
        row = 0
        for l in ad:
            if l in '1234567890':
                row = row * 10 + int(l)
        return row

    @classmethod
    def col_to_n(cls, ad):
        ''' find column number from xl address
            Cell string is assumed to be in excel format i.e. "ABC123" where column is ABC
            column number is its integer representation i.e. (A-A+1)*26*26 + (B-A+1)*26 + (C-A+1)
        '''
        n = 0
        for l in ad:
            if l in cls.C_CAPS:
                n = n * 26 + cls.C_CAPS.find(l) + 1
        return n

    @classmethod
    def n_to_col(cls, n):
        ''' make xl column address from column number
        '''
        ad = ''
        while n > 0:
            n, rem = divmod(n - 1, 26)  # divmod handles exact multiples of 26 (Z, AZ, ...)
            ad = cls.C_CAPS[rem] + ad
        return ad


class Struct(object):
    ''' class which turns a dictionary into a structure
    '''
    def __init__(self, **entries):
        self.__dict__.update(entries)

    def __repr__(self):
        return '<%s>' % '\n '.join('%s : %s' % (k, repr(v)) for (k, v) in self.__dict__.items())


def get_names(workbook):
    ''' Get a structure containing all of the names in the workbook
    '''
    named_ranges = workbook.get_named_ranges()
    name_list = {}
    for named_range in named_ranges:
        name = named_range.name
        if name[0:2] == 'n_':
            # only store the names beginning with 'n_'
            name_list[name[2:]] = NamedArray(workbook, str(named_range))
    for item in name_list:
        print(item, '=', name_list[item])
    return Struct(**name_list)

# ------------------
# program example
# ------------------
wb = load_workbook('test.xlsm', data_only=True)
n = get_names(wb)
print(n.my_name.value())
One small optimisation: I prefixed all of the names I was interested in importing with 'n_', so I can ignore any built-in Excel names. I hope this is useful to someone.
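A note for newer openpyxl versions (this is an assumption about your installed version): get_named_ranges() was removed in openpyxl 3.x, and workbook-level names are exposed through wb.defined_names instead, where each DefinedName yields (sheet, coordinate) pairs from its destinations generator. A minimal sketch:

from openpyxl import load_workbook

wb = load_workbook('test.xlsm', data_only=True)
# wb.defined_names behaves like a dict of name -> DefinedName (openpyxl 3.1+; an assumption)
for name, defn in wb.defined_names.items():
    for sheet_title, coord in defn.destinations:
        ws = wb[sheet_title]
        print(name, ws[coord.replace('$', '')])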
I am trying to learn how to use multiprocessing and have managed to get the code below to work. The goal is to work through every combination of the variables within CostlyFunction by setting n equal to some number (right now it is 100, so the first 100 combinations are tested). I was hoping I could manipulate w as each process returned its list (CostlyFunction returns a list of 7 values) and only keep the results in a given range. Right now, w holds all 100 lists and then lets me manipulate those lists, but when I use n = 10 million, w becomes huge and costly to hold in memory. Is there a way to evaluate CostlyFunction's output as the workers return values and then throw out the values I don't need?
if __name__ == "__main__":
    import csv
    from time import time            # imports needed by this snippet
    from multiprocessing import Pool

    csvFile = open('C:\\Users\\bryan.j.weiner\\Desktop\\test.csv', 'w', newline='')
    #width = -36000000/1000
    #fronteir = [None]*1000
    currtime = time()
    n = 100

    po = Pool()
    res = po.map_async(CostlyFunction, ((i,) for i in range(n)))
    w = res.get()
    spamwriter = csv.writer(csvFile, delimiter=',')
    spamwriter.writerows(w)
    print(('2: parallel: time elapsed:', time() - currtime))
    csvFile.close()
Unfortunately, Pool doesn't have a 'filter' method; otherwise, you might've been able to prune your results before they're returned. Pool.imap is probably the best solution you'll find for dealing with your memory issue: it returns an iterator over the results from CostlyFunction.
For sorting through the results, I made a simple list-based class called TopList that stores a fixed number of items. All of its items are the highest-ranked according to a key function.
from collections import UserList

def keyfunc(a):
    return a[5]  # This would be the sixth item in a result from CostlyFunction

class TopList(UserList):
    def __init__(self, key, *args, cap=10):  # cap is the largest number of results
        super().__init__(*args)              # you want to store
        self.cap = cap
        self.key = key

    def add(self, item):
        self.data.append(item)
        self.data.sort(key=self.key, reverse=True)
        if len(self.data) > self.cap:  # only drop the lowest-ranked item once the cap is exceeded
            self.data.pop()
Here's how your code might look:
if __name__ == "__main__":
    import csv
    from time import time            # imports needed by this snippet
    from multiprocessing import Pool

    csvFile = open('C:\\Users\\bryan.j.weiner\\Desktop\\test.csv', 'w', newline='')
    n = 100
    currtime = time()

    po = Pool()
    best = TopList(keyfunc)
    result_iter = po.imap(CostlyFunction, ((i,) for i in range(n)))
    for result in result_iter:
        best.add(result)

    spamwriter = csv.writer(csvFile, delimiter=',')
    spamwriter.writerows(best)  # write the kept results, not the full list w
    print(('2: parallel: time elapsed:', time() - currtime))
    csvFile.close()
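An aside on the design: re-sorting the whole list on every add costs O(cap·log cap) per result. For a large cap, the standard-library heapq module does the same top-N selection more cheaply; here is a minimal sketch of that alternative (keyfunc as above; the heap stores (key, counter, item) tuples so the items themselves are never compared):

import heapq

class TopHeap:
    '''Keep the cap highest-keyed items seen so far, using a min-heap.'''
    def __init__(self, key, cap=10):
        self.key = key
        self.cap = cap
        self._heap = []   # smallest key sits at index 0
        self._count = 0   # unique tie-breaker so items never need comparing

    def add(self, item):
        entry = (self.key(item), self._count, item)
        self._count += 1
        if len(self._heap) < self.cap:
            heapq.heappush(self._heap, entry)
        else:
            # push the new entry and drop the smallest, in O(log cap)
            heapq.heappushpop(self._heap, entry)

    def sorted_items(self):
        return [item for _, _, item in sorted(self._heap, reverse=True)]

Dropping it in place of TopList only changes the write at the end: spamwriter.writerows(best.sorted_items()).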