python pandas merging excel sheets not working

I'm trying to merge two excel sheets using the common field Serial, but I'm getting an error. My program is as below:
(user1_env)root@ubuntu:~/user1/test/compare_files# cat compare.py
import pandas as pd
source1_df = pd.read_excel('a.xlsx', sheetname='source1')
source2_df = pd.read_excel('a.xlsx', sheetname='source2')
joined_df = source1_df.join(source2_df, on='Serial')
joined_df.to_excel('/root/user1/test/compare_files/result.xlsx')
I'm getting the error below:
(user1_env)root@ubuntu:~/user1/test/compare_files# python3.5 compare.py
Traceback (most recent call last):
File "compare.py", line 5, in <module>
joined_df = source1_df.join(source2_df, on='Serial')
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/core/frame.py", line 4385, in join
rsuffix=rsuffix, sort=sort)
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/core/frame.py", line 4399, in _join_compat
suffixes=(lsuffix, rsuffix), sort=sort)
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/tools/merge.py", line 223, in get_result
rdata.items, rsuf)
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/core/internals.py", line 4445, in items_overlap_with_suffix
to_rename)
ValueError: columns overlap but no suffix specified: Index(['Serial'], dtype='object')
I'm referring to the SO link below for the issue:
python compare two excel sheet and append correct record

A small modification worked for me:
import pandas as pd
source1_df = pd.read_excel('a.xlsx', sheetname='source1')
source2_df = pd.read_excel('a.xlsx', sheetname='source2')
joined_df = pd.merge(source1_df,source2_df,on='Serial',how='outer')
joined_df.to_excel('/home/gk/test/result.xlsx')

It is because of the overlapping column names after the join. You can either set your index to Serial before joining, or pass an rsuffix= or lsuffix= value to your join call so that the suffix is appended to the overlapping column names.
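A minimal sketch of those fixes, using hypothetical two-row stand-ins for the sheets (the column names besides Serial are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for the two sheets.
source1_df = pd.DataFrame({'Serial': [1, 2], 'a': ['x', 'y']})
source2_df = pd.DataFrame({'Serial': [1, 2], 'b': ['p', 'q']})

# Fix 1: merge on the shared column; no suffix is needed for 'Serial'.
merged = pd.merge(source1_df, source2_df, on='Serial', how='outer')

# Fix 2: make 'Serial' the index on both sides, then join on the index.
joined = source1_df.set_index('Serial').join(source2_df.set_index('Serial'))

# Fix 3: keep join(on=...) but give overlapping columns a suffix. Note
# that join matches the left 'Serial' values against the RIGHT frame's
# index, so fixes 1 and 2 are usually what you actually want here.
suffixed = source1_df.join(source2_df, on='Serial', rsuffix='_2')
```

With the suffix, the overlapping column simply comes out as Serial_2 instead of raising the ValueError.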


Read int values from a column in excel sheet using XLRD

I have a cell in an excel workbook with comma separated values.
This cell can have values with following pattern.
0 or 123 or 123, 345.
I want to extract them as list of integers using XLRD or pandas.read_excel.
I have tried using xlrd with the following snippet.
from xlrd import open_workbook

book = open_workbook(args.path)
dep_cms = book.sheet_by_index(1)
excelList = []
for row_index in range(1, dep_cms.nrows):
    excelList.extend([x.strip() for x in dep_cms.cell(row_index, 8).value.split(',')])
I have even tried pandas
excel_frame = read_excel(args.path, sheet_name=2, skiprows=1, verbose=True, na_filter=False)
data_need = excel_frame['Dependent CMS IDS'].tolist()
print(data_need)
But I got a "list index out of range" error:
Reading sheet 2
Traceback (most recent call last):
File "ExcelCellCSVRead.py", line 25, in <module>
excel_frame = read_excel(args.path, sheet_name=2, skiprows=1, verbose=True, na_filter=False)
File "C:\Users\Kris\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\excel\_base.py", line 311, in read_excel
return io.parse(
File "C:\Users\Kris\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\excel\_base.py", line 868, in parse
return self._reader.parse(
File "C:\Users\Kris\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\excel\_base.py", line 441, in parse
sheet = self.get_sheet_by_index(asheetname)
File "C:\Users\Kris\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\excel\_xlrd.py", line 46, in get_sheet_by_index
return self.book.sheet_by_index(index)
File "C:\Users\Kris\AppData\Local\Programs\Python\Python38-32\lib\site-packages\xlrd\book.py", line 466, in sheet_by_index
return self._sheet_list[sheetx] or self.get_sheet(sheetx)
IndexError: list index out of range
It does not work with a single numeric value in a cell (for example, just 0 or 123): it raises AttributeError: 'float' object has no attribute 'split'.
It only works if the cell has comma-separated values, and then it produces a list of strings like ['123', '345']. I guess the split call is the culprit.
How to extract the values of this cell using XLRD or pandas to a list of integers?
Regards
A comma-separated values (CSV) file cannot be imported the same way as an Excel file.
Instead of read_excel you can use read_csv.
Below is a snippet showing what your code looks like after switching to read_csv:
import pandas as pd
df = pd.read_csv("your file name.csv")
data_need = df["Column_name"].tolist()
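For the original goal of turning one cell into a list of integers, a small helper works around the float issue (cell_to_ints is a made-up name; xlrd hands numeric cells back as floats, which is why .split blew up on them):

```python
def cell_to_ints(value):
    """Convert a cell value like 0, 123, or '123, 345' to a list of ints.

    xlrd returns numeric cells as floats, so str.split only works on
    text cells; handling the float case separately covers both.
    """
    if isinstance(value, float):  # numeric cell, e.g. 123.0
        return [int(value)]
    return [int(part) for part in str(value).split(',') if part.strip()]
```

This way a single-value cell and a comma-separated cell both come back as a list of integers.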

How can I remove Attribute error from my Csv program showing? [duplicate]

So I copied and pasted a demo program from the book I am using to learn Python:
#!/usr/bin/env python
import csv
total = 0
priciest = ('',0,0,0)
r = csv.reader(open('purchases.csv'))
for row in r:
    cost = float(row[1]) * float(row[2])
    total += cost
    if cost == priciest[3]:
        priciest = row + [cost]
print("You spent", total)
print("Your priciest purchase was", priciest[1], priciest[0], "at a total cost of", priciest[3])
And I get the Error:
Traceback (most recent call last):
File "purchases.py", line 2, in <module>
import csv
File "/Users/Solomon/Desktop/Python/csv.py", line 5, in <module>
r = csv.read(open('purchases.csv'))
AttributeError: 'module' object has no attribute 'read'
Why is this happening? How do I fix it?
Update:
Fixed All The Errors
Now I'm getting:
Traceback (most recent call last):
File "purchases.py", line 6, in <module>
for row in r:
_csv.Error: line contains NULL byte
What was happening in terms of the CSV.py:
I had a file with the same code named csv.py, saved in the same directory. I thought that the fact that it was named csv.py was screwing it up, so I started a new file called purchases.py, but forgot to delete csv.py.
Don't name your file csv.py.
When you do, Python will look in your file for the csv code instead of the standard library csv module.
Edit, to include the important note from the comments: if there's a csv.pyc file left over in that directory, you'll have to delete that too. That is Python bytecode, which would be used in place of re-running your csv.py file.
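A quick way to check which file a name actually resolves to is the module's __file__ attribute:

```python
import csv

# If this prints a path inside your own project directory instead of
# the Python standard library, a local csv.py (or a leftover csv.pyc)
# is shadowing the real module.
print(csv.__file__)
```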
There is a discrepancy between the code in the traceback of your error:
r = csv.read(open('purchases.csv'))
And the code you posted:
r = csv.reader(open('purchases.csv'))
So which are you using?
At any rate, fix that indentation error in line 2:
#!/usr/bin/env python
import csv
total = 0
And create your csv reader object with a context handler, so as not to leave the file handle open:
with open('purchases.csv') as f:
    r = csv.reader(f)

How to fix error 'Writing 0 cols but got 4 aliases' in python using pandas

I'm trying to save some data scraped using selenium to a CSV file using pandas, but I'm getting this error and I don't know what the problem is.
I tried using header=False instead of header=['','','',''];
it runs without error but gives me an empty CSV file.
Error:
full_datas = list(zip(d_links, d_title, d_price, d_num))
df = pd.DataFrame(data=full_datas)
df.to_csv('cc.csv', encoding='utf-8-sig', index=False, header=['Links','Title,','Price','PNum'])
Empty csv:
full_datas = list(zip(d_links, d_title, d_price, d_num))
df = pd.DataFrame(data=full_datas)
df.to_csv('cc.csv', encoding='utf-8-sig', index=False, header=False)
I'm expecting a CSV file with my data in it.
Traceback (most recent call last):
File "E:/data/some/urls.py", line 90, in <module>
df.to_csv('cc.csv', encoding='utf-8-sig', index=False, header=['Links','Title,','Price','PNum'])
File "C:\Users\Iwillsolo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\core\generic.py", line 3228, in to_csv
formatter.save()
File "C:\Users\Iwillsolo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\formats\csvs.py", line 202, in save
self._save()
File "C:\Users\Iwillsolo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\formats\csvs.py", line 310, in _save
self._save_header()
File "C:\Users\Iwillsolo\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\formats\csvs.py", line 242, in _save_header
"aliases".format(ncols=len(cols), nalias=len(header))
ValueError: Writing 0 cols but got 4 aliases
Sometimes one of the lists might be empty, so I managed to fix it by using itertools.zip_longest instead of zip.
Thank you guys for trying to help :)
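A minimal sketch of that fix, with made-up scraped lists (one deliberately empty): zip() stops at the shortest input, so a single empty list yields zero rows and hence a 0-column DataFrame, while zip_longest pads the gaps.

```python
from itertools import zip_longest

# Made-up scraped lists; d_num came back empty for this page.
d_links = ['link1', 'link2']
d_title = ['title1', 'title2']
d_price = ['9.99', '19.99']
d_num = []

rows_zip = list(zip(d_links, d_title, d_price, d_num))  # empty!
full_datas = list(zip_longest(d_links, d_title, d_price, d_num,
                              fillvalue=''))            # 2 padded rows
```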

I am trying to read a csv file with pandas and then to search a string in the first column, to use total row for calculations

I am reading a CSV file with pandas, and then I try to find a word like "Net income" in the first column. Then I want to use the whole row which has this structure: string/number/number/number/... to do some calculations with the numbers.
The problem is that find is not working.
data = pd.read_csv(name)
data.str.find('Net income')
I am using CSV files from here: Income Statement for Deutsche Lufthansa AG (DLAKF) from Morningstar.com
I found this: Python | Pandas Series.str.find() - GeeksforGeeks
Traceback (most recent call last):
File "C:\Users\thoma\Desktop\python programme\manage.py", line 16, in <module>
data.str.find('Net income')
File "C:\Users\thoma\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 5067, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'str'
So, it works now. But I still have a question. After using the describe function with pandas I get this:
<bound method NDFrame.describe of 2014-12 615
2015-12 612
2016-12 636
2017-12 713
2018-12 736
Name: Goodwill, dtype: object>
I have problems using the data. How can I, for example, use the second column here? I tried to create a new table:
new_Table['Goodwill'] = data1['Goodwill'].describe
but this does not work.
I also would like to add more "second" columns to new_Table.
Hi, you should select the column first, like df['col name'].str.find(x); this requires a Series, not a DataFrame.
I recommend setting your header row if pandas isn't recognizing named rows in your CSV file.
Something like:
new_header = data.iloc[0] #grab the first row for the header
data = data[1:] #take the data less the header row
data.columns = new_header
From there you can summarize each column by name:
data['Net Income'].describe()
Edit: I looked at the csv file; I recommend reshaping the data first before analyzing the columns. Something like...
data = data.transpose()
So in summation:
data = pd.read_csv(name)
data = data.transpose()        #flip the columns/rows
new_header = data.iloc[0]      #grab the first row for the header
data = data[1:]                #take the data less the header row
data.columns = new_header
data['Net Income'].describe()  #analyze
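The `<bound method NDFrame.describe of ...>` output in the question comes from referencing describe without calling it; a minimal sketch of the difference, using a made-up Goodwill series:

```python
import pandas as pd

goodwill = pd.Series([615, 612, 636, 713, 736], name='Goodwill')

bound = goodwill.describe     # no parentheses: a reference to the method
stats = goodwill.describe()   # parentheses: actually computes the summary

print(bound)          # <bound method ...describe of ...>
print(stats['mean'])  # a usable number
```

With the parentheses in place, stats is an ordinary Series you can index by 'mean', 'std', and so on.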

AttributeError when writing back from IPython to Excel with pandas

System:
Windows 7
Anaconda -> Spyder with 2.7.12 Python
I got this AttributeError:
File "<ipython-input-4-d258b656588d>", line 1, in <module>
runfile('C:/xxx/.spyder/pandas excel.py', wdir='C:/xxx/.spyder')
File "C:\xxx\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\xxx\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/xxx/.spyder/pandas excel.py", line 33, in <module>
moving_avg.to_excel(writer, sheet_name='Methodentest', startcol=12, startrow=38)
File "C:\xxx\Anaconda2\lib\site-packages\pandas\core\generic.py", line 2672, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'to_excel'
This is my code:
import pandas as pd
#Adjustment of the data for the date - does not work?
#parsen = lambda x: pd.datetime.strptime(x, '%Y-%m')
#Open new file object
xl = pd.ExcelFile('C:\xxx\Desktop\Beisspieldatensatz.xlsx')
#parse_dates={'Zeit': ['Jahr', 'Monat']}, index_col = 0, date_parser=parsen)
#Link to specific sheet
df = xl.parse('Methodentest')
#Narrow the data input
df2 = df[['Jahr', 'Monat', 'Umsatzmenge']]
#Select only values with year before 2015
df3 = df2[(df2['Jahr']<2015)]
#Compute the moving average (gleitender Mittelwert) over a 36-month history, i.e. 36 rows
moving_avg = pd.rolling_mean(df3["Umsatzmenge"],36)
print (moving_avg.head())
#Create a pandas excel writer
writer = pd.ExcelWriter(r'C:\xxx\Desktop\Beisspieldatensatz.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
moving_avg.to_excel(writer, sheet_name='Methodentest', startcol=12, startrow=38)
# Close the Pandas Excel writer and output the Excel file.
writer.save()
I want to read a data set from Excel into IPython. In the next step I want to parse my data, but this is not working (that's why I commented that part out with hashtags). After that I want to apply a mathematical method, here a moving average over the next 18 months, and store the result in moving_avg.
My data set starts monthly from 01.2012. Then the code must write the new figures back to Excel in a specific row and column; that is where the error occurred.
I think you need to convert your Series back to a DataFrame before saving. Try:
moving_avg.to_frame().to_excel(writer, sheet_name='Methodentest', startcol=12, startrow=38)
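A minimal sketch of the conversion, with a made-up monthly series standing in for df3["Umsatzmenge"]. Note that pd.rolling_mean was removed in later pandas versions; Series.rolling(...).mean() is the current spelling of the same computation, so I use it here as an assumption about your pandas version:

```python
import pandas as pd

# Hypothetical monthly values standing in for df3['Umsatzmenge'].
s = pd.Series(range(1, 41), name='Umsatzmenge')

# Current equivalent of pd.rolling_mean(s, 36): the first 35 entries
# are NaN because a full 36-row window is not yet available.
moving_avg = s.rolling(36).mean()

# A Series converts to a one-column DataFrame, which has to_excel
# (the DataFrame can then be written with an ExcelWriter as above).
frame = moving_avg.to_frame()
```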
