Create dictionary using cell values from two columns of excel - python-3.x

I have an excel sheet with columns ISIN and URL as header as shown below:
*ISIN URL*
ISIN1 https://mylink3.pdf
ISIN2 https://mylink2.pdf
I need to create a dictionary of the values of this sheet and have used the below code:
import pandas as pd
my_dic = pd.read_excel('PDFDwn.xlsx', index_col=0).to_dict()
print(my_dic)
The output that I receive is as below.
{'URL': {'ISIN1': 'https://mylink3.pdf', 'ISIN2': 'https://mylink2.pdf'}}
whereas expected output should be as below without URL piece.
{'ISIN1': 'https://mylink3.pdf', 'ISIN2': 'https://mylink2.pdf'}

Try this, where df is the DataFrame read from the sheet without index_col:
print(df.set_index('ISIN')['URL'].to_dict())
Output:
{'ISIN2': 'https://mylink2.pdf', 'ISIN1': 'https://mylink3.pdf'}
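As per the user's sample (read the file without index_col=0, otherwise ISIN is already the index and set_index('ISIN') raises a KeyError):
my_dic = pd.read_excel('PDFDwn.xlsx').set_index('ISIN')['URL'].to_dict()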

Simple solution: index the nested dictionary you already have by the column name:
my_dic['URL']
Output:
{'ISIN1': 'https://mylink3.pdf', 'ISIN2': 'https://mylink2.pdf'}
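A minimal sketch of the approaches above. It builds the two-column table in memory instead of reading PDFDwn.xlsx, so the DataFrame here is only an assumed stand-in for the sheet. The dict(zip(...)) variant is an extra alternative that was not in the answers.

```python
import pandas as pd

# Stand-in for pd.read_excel('PDFDwn.xlsx'); built in memory for illustration.
df = pd.DataFrame({
    'ISIN': ['ISIN1', 'ISIN2'],
    'URL': ['https://mylink3.pdf', 'https://mylink2.pdf'],
})

# Selecting a single column gives a Series, so to_dict() maps index -> value
# instead of nesting everything under the column name.
my_dic = df.set_index('ISIN')['URL'].to_dict()

# Alternative without touching the index: pair the two columns directly.
alt_dic = dict(zip(df['ISIN'], df['URL']))

print(my_dic)
```

Both variants assume the ISIN values are unique; a duplicated ISIN would keep only the last URL seen.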

Related

How to get a similar sheet name in pandas

I am trying to find a similar sheet name in an excel using pandas.
Currently I am using below code to get dataframe of a sheet in pandas.
excel= pd.ExcelFile(excel)
tab_name = 'Employee'
emp_df= excel.parse(tab_name)
But this code will fail if the sheet name in excel contains any space or some other extra characters.
Is there any easy way to do this ?
I used a similarity API (fuzzywuzzy) to find a similar sheet, but only when a sheet-not-found error is thrown by excel.parse(tab_name):
from fuzzywuzzy import fuzz
import xlrd
try:
    tab_df = excel.parse(tab_name)
except xlrd.biffh.XLRDError:
    # Fall back to the most similar sheet name if it is similar enough.
    sheet_names = excel.sheet_names
    ratios = [fuzz.ratio(tab_name, tbname) for tbname in sheet_names]
    if max(ratios) > 50:
        tab_name = sheet_names[ratios.index(max(ratios))]
        tab_df = excel.parse(tab_name)
    else:
        logger.error(tab_name + " not found")
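The same idea can be sketched with only the standard library. This swaps fuzzywuzzy for difflib.get_close_matches, which is a different similarity measure, so the cutoff of 0.6 (difflib's default) is not equivalent to the threshold of 50 used above. The helper name find_sheet and the sample sheet names are hypothetical.

```python
import difflib

def find_sheet(wanted, sheet_names, cutoff=0.6):
    """Return the sheet name closest to `wanted`, or None if nothing is close enough."""
    # Prefer an exact match when one exists.
    if wanted in sheet_names:
        return wanted
    # Otherwise ask difflib for the single best candidate above the cutoff.
    matches = difflib.get_close_matches(wanted, sheet_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical sheet names; in real use pass excel.sheet_names.
print(find_sheet('Employee', ['Summary', ' Employee_1', 'Payroll']))
```

Resolving the name before calling excel.parse avoids relying on a specific exception type, which may differ depending on the Excel engine in use.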

Pandas read_excel for multiple sheets returns same nb_rows on every sheet

I've got an Excel file with multiple sheets with same structure.
Number of rows varies on every sheet, but pd.read_excel() returns df with nb_rows == nb_rows on the first sheet.
I've checked the Excel sheets with CTRL+down - there are no empty lines in the middle of any sheet.
How can I fix the problem?
The example code is as follows:
import pandas as pd
xls_sheets = ['01', '02', '03']
fname = 'C:\\data\\data.xlsx'
xls = pd.ExcelFile(fname)
for sheet in xls_sheets:
    df = pd.read_excel(io=xls, sheet_name=sheet)
    print(len(df))
Output:
>> 4043 #Actual nb_rows = 4043
>> 4043 #Actual nb_rows = 11015
>> 4043 #Actual nb_rows = 5622
python 3.5, pandas 0.20.1
Check that the sheet names in your xls_sheets list are correct. If they are, install the xlrd library (pip install xlrd) and run the code again; it works fine for me. Hope this helps!
Given the limited information on the question and assuming you want to read all of the sheets in the Excel file, I would suggest that you use the following:
data=pd.read_excel('excelfile.xlsx', sheet_name=None)
data is a dictionary where the keys are the sheet names and the values are the DataFrames holding the data of each sheet. Please try this method; it may solve your problem.
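A small sketch of how the per-sheet row counts could be checked once all sheets are loaded as a dictionary. The helper row_counts is hypothetical, and the two in-memory DataFrames are an assumed stand-in for the result of read_excel with sheet_name=None.

```python
import pandas as pd

def row_counts(sheets):
    """Map each sheet name to the number of rows in its DataFrame."""
    return {name: len(frame) for name, frame in sheets.items()}

# In real use: sheets = pd.read_excel('excelfile.xlsx', sheet_name=None)
# Here two small frames stand in for the sheets (illustrative data only).
sheets = {
    '01': pd.DataFrame({'x': range(3)}),
    '02': pd.DataFrame({'x': range(5)}),
}

print(row_counts(sheets))
```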

Creating a list from the values that were copied in excel

I am trying to create a list from values that I copied from MS Excel cells. For example, I copied the first 5 rows of the first column and want to make a list like this:
a_list=[2503531709, 4789009637, 8171670652, 8434851938, 9629960060]
I see that pyperclip returns the values like this:
'2503531709\r\n4789009637\r\n8171670652\r\n8434851938\r\n9629960060\r\n'
I wrote the following code. I used [i:i+9] just for this case, but the length of the values may be more than 10.
import pyperclip
isbn=pyperclip.paste()
a_list=[]
for i in range(len(isbn)):
    if ('\r') or ('\n') not in isbn[i:i+9]:
        a_list.append(isbn[i:i+9])
print(a_list)
The code did not work as I expected. How can I differentiate the values and add to the list?
Just use str.split (note that the trailing '\r\n' leaves an empty string as the last element):
a_list = isbn.split('\r\n')
If you want the values to be integers:
a_list = [int(val) for val in isbn.split('\r\n') if val]
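As a hedged variation on the answer above, str.splitlines handles '\r\n', '\n', and '\r' alike, which may help if the clipboard text comes from a platform that does not use Windows line endings. The helper name parse_copied_column is hypothetical, and the sample string is the one shown in the question rather than a live pyperclip.paste() call.

```python
def parse_copied_column(text):
    """Turn clipboard text with one value per line into a list of ints."""
    # splitlines() drops the line terminators; blank lines are skipped.
    return [int(line) for line in text.splitlines() if line.strip()]

# Sample text in the shape pyperclip.paste() returned in the question.
copied = '2503531709\r\n4789009637\r\n8171670652\r\n8434851938\r\n9629960060\r\n'
a_list = parse_copied_column(copied)
print(a_list)
```

Because the values are split on line boundaries, their length no longer matters.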

How to use Pandas to display specific columns from csv file?

I have a csv file with a number of columns in it. It is for students. I want to display only male students and their names. I used 1 for male students and 0 for female students. My code is:
import pandas as pd
data = pd.read_csv('normalizedDataset.csv')
results = pd.concat([data['name'], ['students']==1])
print results
I have got this error:
TypeError: cannot concatenate a non-NDFrame object
Can anyone help please. Thanks.
You can specify to read only certain column names of your data when you load your csv. Then use loc to locate all values where students equals 1.
data = pd.read_csv('normalizedDataset.csv', usecols=['name', 'students'])
data = data.loc[data.students == 1, :]
BTW, your original error occurs because you are trying to concatenate a pandas object with the boolean False: ['students']==1 compares a plain list with 1.
>>> ['students']==1
False
No need to concat; you're stripping things away, not building.
Try:
data[data['students']==1]['name']
To provide clarity on why you were getting the error:
The second thing you were trying to concat was:
['students']==1
That is a plain boolean, not an NDFrame object. You'd want to replace it with:
data[data['students']==1]['students']
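A minimal sketch of the filtering step on a small in-memory table. The names and values are made up for illustration; only the column names name and students come from the question.

```python
import pandas as pd

# Illustrative data only; the real file is normalizedDataset.csv.
data = pd.DataFrame({
    'name': ['Ali', 'Sara', 'Omar'],
    'students': [1, 0, 1],
})

# Boolean mask selects the rows, the column label selects what to show.
male_names = data.loc[data['students'] == 1, 'name']
print(male_names.tolist())
```

Using a single .loc call with a row mask and a column label avoids the chained indexing of data[...][...].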

using split() to split values in an entire column in a python dataframe

I am trying to clean a list of URLs that have garbage at the end, as shown:
/gradoffice/index.aspx(
/gradoffice/index.aspx-
/gradoffice/index.aspxjavascript$
/gradoffice/index.aspx~
I have a csv file with over 190k records of different URLs. I loaded the csv into a pandas dataframe and took the entire column of URLs using the statement
str = df['csuristem']
That clearly gave me all the values in the column. When I use the following code, it prints only about 40k records and starts somewhere in the middle. I don't know where I am going wrong. The program runs without errors but shows only part of the results. Any help would be much appreciated.
import pandas
table = pandas.read_csv("SS3.csv", dtype=object)
df = pandas.DataFrame(table)
str = df['csuristem']
for s in str:
    s = s.split(".")[0]
    print s
I am looking to get an output like this
/gradoffice/index.
/gradoffice/index.
/gradoffice/index.
/gradoffice/index.
Thank you,
Santhosh.
Call .str.split on the column, then use .str[0] to take the first part of each split string:
In [6]:
df['csuristem'].str.split('.').str[0]
Out[6]:
0 /gradoffice/index
1 /gradoffice/index
2 /gradoffice/index
3 /gradoffice/index
Name: csuristem, dtype: object
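A runnable sketch of the vectorised split, using the four sample URLs from the question in place of SS3.csv. The last step, which adds the trailing dot back to match the output requested in the question, is an assumption about what was wanted.

```python
import pandas as pd

# The four sample values from the question stand in for the CSV column.
df = pd.DataFrame({'csuristem': [
    '/gradoffice/index.aspx(',
    '/gradoffice/index.aspx-',
    '/gradoffice/index.aspxjavascript$',
    '/gradoffice/index.aspx~',
]})

# Split every value on '.' and keep the part before the first dot.
cleaned = df['csuristem'].str.split('.').str[0]

# Optional: restore the trailing dot shown in the desired output.
with_dot = cleaned + '.'

print(cleaned.tolist())
```

Assigning the result back, for example df['csuristem'] = cleaned, keeps all rows in the DataFrame instead of relying on printed output.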
