Read int values from a column in an Excel sheet using XLRD - python-3.x

I have a cell in an Excel workbook with comma-separated values.
The cell can hold either a single value such as 0 or 123, or comma-separated values such as 123, 345.
I want to extract them as a list of integers using XLRD or pandas.read_excel.
I have tried xlrd with the following snippet:
from xlrd import open_workbook

book = open_workbook(args.path)
dep_cms = book.sheet_by_index(1)
excelList = []
for row_index in range(1, dep_cms.nrows):
    excelList.extend([x.strip() for x in dep_cms.cell(row_index, 8).value.split(',')])
I have also tried pandas:
excel_frame = read_excel(args.path, sheet_name=2, skiprows=1, verbose=True, na_filter=False)
data_need = excel_frame['Dependent CMS IDS'].tolist()
print(data_need)
But I got an IndexError: list index out of range:
Reading sheet 2
Traceback (most recent call last):
File "ExcelCellCSVRead.py", line 25, in <module>
excel_frame = read_excel(args.path, sheet_name=2, skiprows=1, verbose=True, na_filter=False)
File "C:\Users\Kris\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\excel\_base.py", line 311, in read_excel
return io.parse(
File "C:\Users\Kris\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\excel\_base.py", line 868, in parse
return self._reader.parse(
File "C:\Users\Kris\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\excel\_base.py", line 441, in parse
sheet = self.get_sheet_by_index(asheetname)
File "C:\Users\Kris\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\excel\_xlrd.py", line 46, in get_sheet_by_index
return self.book.sheet_by_index(index)
File "C:\Users\Kris\AppData\Local\Programs\Python\Python38-32\lib\site-packages\xlrd\book.py", line 466, in sheet_by_index
return self._sheet_list[sheetx] or self.get_sheet(sheetx)
IndexError: list index out of range
It does not work with a single value in a cell (for example, just 0 or a value like 123); it raises AttributeError: 'float' object has no attribute 'split'.
It only works if the cell has comma-separated values, and then it converts them into a list of strings like ['123', '345']. I guess the split call is the culprit.
How can I extract the values of this cell to a list of integers using XLRD or pandas?
Regards

A comma-separated values (CSV) file is not the same as an Excel file, so it cannot be imported the same way.
Instead of read_excel you can use read_csv.
Below is a snippet showing how your code would look with read_csv:
import pandas as pd

df = pd.read_csv("your file name.csv")
data_need = df["Column_name"].tolist()
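If the file really is an Excel workbook, though, the original xlrd approach can be made to work by branching on the cell value's type: xlrd returns numeric cells as float and text cells as str. A minimal sketch, assuming the same sheet and column indices as the question:
from xlrd import open_workbook

book = open_workbook(args.path)
dep_cms = book.sheet_by_index(1)

excelList = []
for row_index in range(1, dep_cms.nrows):
    value = dep_cms.cell(row_index, 8).value
    if isinstance(value, float):
        # a lone number such as 0 or 123 arrives as a float
        excelList.append(int(value))
    else:
        # comma-separated text such as "123, 345"
        excelList.extend(int(x.strip()) for x in str(value).split(',') if x.strip())
print(excelList)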

Related

How to filter a particular column with python pandas?

I have an Excel file with two columns: 'Name' and 'size'. The 'Name' column contains multiple file types, namely ".apk, .dat, .vdex, .ttc", etc., but I only want to keep the files whose extension is .apk. I do not want any other file type in the new Excel file.
I have written the below code:
import pandas as pd
import json

def json_to_excel():
    with open('installed-files.json') as jf:
        data = json.load(jf)
    df = pd.DataFrame(data)
    new_df = df[df.columns.difference(['SHA256'])]
    new_xl = new_df.to_excel('abc.xlsx')
    return new_xl

def filter_apk():  # MODIFIED CODE
    old_xl = json_to_excel()
    data = pd.read_excel(old_xl)
    a = data[data["Name"].str.contains("\.apk")]
    a.to_excel('zybg.xlsx')
The above program does the following:
json_to_excel() takes a JSON file, converts it to .xlsx format, and saves it.
filter_apk() is supposed to create multiple Excel files based on the file extensions present in the "Name" column.
The first function does what I intend.
The second function is not doing anything, nor is it throwing any error. I have followed this weblink.
Below are a few samples of the "Name" column:
/system/product/<Path_to>/abc.apk
/system/fonts/wwwr.ttc
/system/framework/framework.jar
/system/<Path_to>/icu.dat
/system/<Path_to>/Normal.apk
/system/<Path_to>/Tv.apk
How to get that working? Or is there a better way to achieve the objective?
Please suggest.
ERROR
raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'NoneType'>
Note:
I have all the files at the same location.
modified code:
import pandas as pd
import json

def json_to_excel():
    with open('installed-files.json') as jf:
        data = json.load(jf)
    df = pd.DataFrame(data)
    new_df = df[df.columns.difference(['SHA256'])]
    new_df.to_excel('abc.xlsx')

def filter_apk():
    json_to_excel()
    old_xl = pd.read_excel('abc.xlsx')
    data = pd.read_excel(old_xl)
    a = data[data["Name"].str.contains("\.apk")]
    a.to_excel('zybg.xlsx')

t = filter_apk()
print(t)
New error:
Traceback (most recent call last):
File "C:/Users/amitesh.sahay/PycharmProjects/work_allocation/TASKS/Jenkins.py", line 89, in <module>
t = filter_apk()
File "C:/Users/amitesh.sahay/PycharmProjects/work_allocation/TASKS/Jenkins.py", line 84, in filter_apk
data = pd.read_excel(old_xl)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 867, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_xlrd.py", line 22, in __init__
super().__init__(filepath_or_buffer)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel\_base.py", line 344, in __init__
filepath_or_buffer, _, _, _ = get_filepath_or_buffer(filepath_or_buffer)
File "C:\Users\amitesh.sahay\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\common.py", line 243, in get_filepath_or_buffer
raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'pandas.core.frame.DataFrame'>
There is a difference between your use-case and the one shown in the weblink. You want to apply a single filter (apk files), whereas the example you saw applied multiple filters one after another (multiple species).
This will do the trick:
def filter_apk():
    old_xl = json_to_excel()
    data = pd.read_excel(old_xl)
    a = data[data["Name"].str.contains("\.apk")]
    a.to_excel("<path_to_new_excel>\\new_excel_name.xlsx")
Regarding your updated question: I guess your first function is not working the way you think it is.
new_xl = new_df.to_excel('abc.xlsx')
This writes an Excel file, as you expect, and that part works. However, assigning the result to new_xl does nothing, because to_excel returns None. So when you return new_xl from json_to_excel, you actually return None, and in your second function old_xl = json_to_excel() makes old_xl equal to None. That explains the ValueError about <class 'NoneType'>.
So your functions should look something like this:
def json_to_excel():
    with open('installed-files.json') as jf:
        data = json.load(jf)
    df = pd.DataFrame(data)
    new_df = df[df.columns.difference(['SHA256'])]
    new_df.to_excel('abc.xlsx')

def filter_apk():
    json_to_excel()
    data = pd.read_excel('abc.xlsx')
    a = data[data["Name"].str.contains("\.apk")]
    a.to_excel('zybg.xlsx')
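If you do eventually want one output workbook per file extension, as the question mentions, a sketch along these lines might work; the extension list is an assumption based on the sample paths:
import pandas as pd

data = pd.read_excel('abc.xlsx')
# one output file per extension; the list here is an assumption, adjust to your data
for ext in ['.apk', '.dat', '.vdex', '.ttc']:
    subset = data[data['Name'].str.endswith(ext)]
    subset.to_excel('files_{}.xlsx'.format(ext.lstrip('.')))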

I am trying to read a CSV file with pandas and then search for a string in the first column, to use the whole row for calculations

I am reading a CSV file with pandas, and then I try to find a word like "Net income" in the first column. Then I want to use the whole row, which has the structure string/number/number/number/..., to do some calculations with the numbers.
The problem is that find is not working:
data = pd.read_csv(name)
data.str.find('Net income')
I am using CSV files from here: Income Statement for Deutsche Lufthansa AG (DLAKF) from Morningstar.com
I found this: Python | Pandas Series.str.find() - GeeksforGeeks
Traceback (most recent call last):
File "C:\Users\thoma\Desktop\python programme\manage.py", line 16, in <module>
data.str.find('Net income')
File "C:\Users\thoma\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 5067, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'str'
So, it works now. But I still have a question. After using the describe function with pandas I get this:
<bound method NDFrame.describe of 2014-12 615
2015-12 612
2016-12 636
2017-12 713
2018-12 736
Name: Goodwill, dtype: object>
I have trouble using this data. How can I, for example, use the second column here? I tried to build a new table:
new_Table['Goodwill'] = data1['Goodwill'].describe
but this does not work.
I would also like to add more "second" columns to new_Table.
Hi, you should select the column first, like df['col name'].str.find(x); .str requires a Series, not a DataFrame.
I recommend setting your header row if pandas isn't recognizing named rows in your CSV file.
Something like:
new_header = data.iloc[0]  # grab the first row for the header
data = data[1:]            # take the data less the header row
data.columns = new_header
From there you can summarize each column by name:
data['Net Income'].describe()
Edit: I looked at the CSV file; I recommend reshaping the data before analyzing the columns. Something like:
data = data.transpose()
So in summation:
data = pd.read_csv(name)
data = data.transpose()        # flip the columns/rows
new_header = data.iloc[0]      # grab the first row for the header
data = data[1:]                # take the data less the header row
data.columns = new_header
data['Net Income'].describe()  # analyze
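Note that the describe output above shows dtype: object, i.e. the values are still strings. A short sketch for coercing a column to numbers before calculating, assuming the Morningstar-style layout from the question and a placeholder filename:
import pandas as pd

data = pd.read_csv('income_statement.csv')  # placeholder filename
data = data.transpose()
new_header = data.iloc[0]
data = data[1:]
data.columns = new_header

# values parse as strings; coerce them before doing any arithmetic
goodwill = pd.to_numeric(data['Goodwill'], errors='coerce')
print(goodwill.mean())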

Iterating over text files in a folder and creating a dataframe with one file per row in Python

I have a corpus of 14K text files that I want to read in to a dataframe. I want each file to be a unique row in said dataframe. Here's what I have so far:
import glob
import os
import pandas as pd

os.chdir("/Users/Wintermute/Desktop/senior_thesis/topic_models/corpus/")
content = pd.DataFrame()
i = 0
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        i += 1
        print(i)
        content[i,] = f.readlines()

df = pd.DataFrame(content)
df.to_csv("corpus_article_by_line.csv")
When I run the program it acts as I would expect for the first 5 text files, but then I get a ValueError: Length of values does not match length of index.
Full error message:
Traceback (most recent call last):
File "/Users/Wintermute/PycharmProjects/cs4/test.py", line 13, in
content[i,] = f.readlines()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2419, in setitem
self._set_item(key, value)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2485, in _set_item
6
value = self._sanitize_column(key, value)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2656, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 2800, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
Perhaps you have a non-uniform number of lines in your text files? Assigning content[i,] = f.readlines() sets a DataFrame column, and a column's length must match the frame's existing index, so the first file whose line count differs raises the ValueError. A way around it is sketched below.
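This sketch collects one row per file and builds the DataFrame only at the end, so no length matching is needed; the column names are my own choice:
import glob
import pandas as pd

rows = []
for path in glob.glob("*.txt"):
    with open(path, 'r') as f:
        # one file -> one row; read() keeps the whole file as a single string
        rows.append({"filename": path, "text": f.read()})

df = pd.DataFrame(rows)
df.to_csv("corpus_article_by_line.csv", index=False)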

AttributeError when writing back from IPython to Excel with pandas

System:
Windows 7
Anaconda -> Spyder with Python 2.7.12
I got this AttributeError:
File "<ipython-input-4-d258b656588d>", line 1, in <module>
runfile('C:/xxx/.spyder/pandas excel.py', wdir='C:/xxx/.spyder')
File "C:\xxx\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\xxx\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/xxx/.spyder/pandas excel.py", line 33, in <module>
moving_avg.to_excel(writer, sheet_name='Methodentest', startcol=12, startrow=38)
File "C:\xxx\Anaconda2\lib\site-packages\pandas\core\generic.py", line 2672, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'to_excel'
This is my code:
import pandas as pd

# Adjusting the data for the date -- not working?
# parsen = lambda x: pd.datetime.strptime(x, '%Y-%m')

# Open a new file object
xl = pd.ExcelFile('C:\xxx\Desktop\Beisspieldatensatz.xlsx')
# parse_dates={'Zeit': ['Jahr', 'Monat']}, index_col=0, date_parser=parsen)

# Link to the specific sheet
df = xl.parse('Methodentest')

# Narrow the data input
df2 = df[['Jahr', 'Monat', 'Umsatzmenge']]

# Keep only values below the year 2015
df3 = df2[(df2['Jahr'] < 2015)]

# Compute a moving average over a 36-month history (36 rows)
moving_avg = pd.rolling_mean(df3["Umsatzmenge"], 36)
print(moving_avg.head())

# Create a pandas Excel writer
writer = pd.ExcelWriter(r'C:\xxx\Desktop\Beisspieldatensatz.xlsx', engine='xlsxwriter')

# Convert the dataframe to an XlsxWriter Excel object.
moving_avg.to_excel(writer, sheet_name='Methodentest', startcol=12, startrow=38)

# Close the Pandas Excel writer and output the Excel file.
writer.save()
I want to read a data set into IPython from Excel. In the next step I want to parse my data, but this is not working (that's why I commented that part out). After that I want to apply a mathematical method, here a moving average for the next 18 months, and store this information in moving_avg.
My data set starts monthly from 01.2012. Then the code must write the new figures back to Excel in a specific row and column -> this is where the error occurred.
I think you need to convert your Series back to a DataFrame before saving. Try:
moving_avg.to_frame().to_excel(writer, sheet_name='Methodentest', startcol=12, startrow=38)
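As a side note, pd.rolling_mean was removed in later pandas versions; on current pandas the same computation would use .rolling(). A minimal sketch, reusing df3 and writer from the question's script:
# df3 and writer as defined in the question's script
moving_avg = df3['Umsatzmenge'].rolling(window=36).mean()
moving_avg.to_frame().to_excel(writer, sheet_name='Methodentest', startcol=12, startrow=38)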

python pandas merging excel sheets not working

I'm trying to merge two Excel sheets on the common field Serial, but it throws an error. My program is below:
(user1_env)root#ubuntu:~/user1/test/compare_files# cat compare.py
import pandas as pd
source1_df = pd.read_excel('a.xlsx', sheetname='source1')
source2_df = pd.read_excel('a.xlsx', sheetname='source2')
joined_df = source1_df.join(source2_df, on='Serial')
joined_df.to_excel('/root/user1/test/compare_files/result.xlsx')
I get the error below:
(user1_env)root#ubuntu:~/user1/test/compare_files# python3.5 compare.py
Traceback (most recent call last):
File "compare.py", line 5, in <module>
joined_df = source1_df.join(source2_df, on='Serial')
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/core/frame.py", line 4385, in join
rsuffix=rsuffix, sort=sort)
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/core/frame.py", line 4399, in _join_compat
suffixes=(lsuffix, rsuffix), sort=sort)
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/tools/merge.py", line 223, in get_result
rdata.items, rsuf)
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/core/internals.py", line 4445, in items_overlap_with_suffix
to_rename)
ValueError: columns overlap but no suffix specified: Index(['Serial'], dtype='object')
I'm referring to the SO link below for the issue:
python compare two excel sheet and append correct record
A small modification worked for me:
import pandas as pd
source1_df = pd.read_excel('a.xlsx', sheetname='source1')
source2_df = pd.read_excel('a.xlsx', sheetname='source2')
joined_df = pd.merge(source1_df,source2_df,on='Serial',how='outer')
joined_df.to_excel('/home/gk/test/result.xlsx')
It is because of the overlapping column names after the join. You can either set your index to Serial before joining, or pass an rsuffix= or lsuffix= value to your join call so that the suffix is appended to the common column names.
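For reference, a sketch of the two alternatives just described; note that current pandas spells the keyword sheet_name, while the question's older version used sheetname:
import pandas as pd

source1_df = pd.read_excel('a.xlsx', sheet_name='source1')
source2_df = pd.read_excel('a.xlsx', sheet_name='source2')

# option 1: move Serial into the index on both sides, then join on the index
joined = source1_df.set_index('Serial').join(source2_df.set_index('Serial'))

# option 2: keep Serial as a column and disambiguate the overlap with a suffix
joined2 = source1_df.join(source2_df, on='Serial', rsuffix='_right')

joined.to_excel('result.xlsx')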
