Probably a simple solution, but I couldn't find a fix scrolling through previous questions, so I thought I would ask.
I'm reading in a csv using pd.read_csv(). One column is giving me issues:
0 ['Bupa', 'O2', 'EE', 'Thomas Cook', 'YO! Sushi...
1 ['Marriott', 'Evans']
2 ['Toni & Guy', 'Holland & Barrett']
3 []
4 ['Royal Mail', 'Royal Mail']
It looks fine here, but when I reference the first value in the column I get:
df['brand_list'][0]
Out : '[\'Bupa\', \'O2\', \'EE\', \'Thomas Cook\', \'YO! Sushi\', \'Costa\', \'Starbucks\', \'Apple Store\', \'HMV\', \'Marks & Spencer\', "Sainsbury\'s", \'Superdrug\', \'HSBC UK\', \'Boots\', \'3 Store\', \'Vodafone\', \'Marks & Spencer\', \'Clarks\', \'Carphone Warehouse\', \'Lloyds Bank\', \'Pret A Manger\', \'Sports Direct\', \'Currys PC World\', \'Warrens Bakery\', \'Primark\', "McDonald\'s", \'HSBC UK\', \'Aldi\', \'Premier Inn\', \'Starbucks\', \'Pizza Hut\', \'Ladbrokes\', \'Metro Bank\', \'Cotswold Outdoor\', \'Pret A Manger\', \'Wetherspoon\', \'Halfords\', \'John Lewis\', \'Waitrose\', \'Jessops\', \'Costa\', \'Lush\', \'Holland & Barrett\']'
Which is obviously a string, not a list as expected. How can I retain the list type when I read in this data?
I've tried the import ast method I've seen in other posts: df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x)), which didn't work.
I've also tried to replicate with dummy dataframes:
df1 = pd.DataFrame({'a' : [['test','test1','test3'], ['test59'], ['test'], ['rhg','wreg']],
                    'b' : [['erg','retbn','ert','eb'], ['g','eg','egr'], ['erg'], ['eg']]})
df1['a'][0]
Out: ['test', 'test1', 'test3']
Which works as I would expect; this suggests to me that the solution lies in how I am importing the data.
Apologies, I was being stupid. The following should work:
import ast
df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x))
df['brand_list_new'][0]
Out: ['Bupa','O2','EE','Thomas Cook','YO! Sushi',...]
As desired.
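Alternatively, the parsing can happen at read time via read_csv's converters parameter. A minimal sketch, assuming the file is named data.csv and the column is called brand_list (and that every cell holds a valid list literal):
import ast
import pandas as pd

# Run ast.literal_eval on each cell as the file is parsed,
# so the column arrives as Python lists instead of strings
df = pd.read_csv('data.csv', converters={'brand_list': ast.literal_eval})
print(type(df['brand_list'][0]))  # <class 'list'>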
I am extracting a dataframe from the Yahoo API for a list of companies, but I want to do it in parallel (i.e., the dataframes for all the companies should be extracted simultaneously), so that it reduces my run time.
My code:
import pandas_datareader.data as web
import pandas as pd
import datetime
end_date = datetime.datetime.now().strftime('%d/%m/%Y')
temp = datetime.datetime.now() - datetime.timedelta(6*365/12)  # roughly six months back
start_date = temp.strftime('%d/%m/%Y')
f = web.DataReader('ACC.NS', 'yahoo', start_date, end_date)
print(f)
The output of this code is a dataframe.
This I have done for a single company.
I want to do it for a list of companies, which is:
Company_Names = ['ACC', 'ADANIENT', 'ADANIPORTS', 'ADANIPOWER', 'AJANTPHARM', 'ALBK', 'AMARAJABAT', 'AMBUJACEM', 'APOLLOHOSP', 'APOLLOTYRE', 'ARVIND', 'ASHOKLEY', 'ASIANPAINT', 'AUROPHARMA', 'AXISBANK',
'BAJAJ-AUTO', 'BAJFINANCE', 'BAJAJFINSV', 'BALKRISIND', 'BANKBARODA', 'BANKINDIA', 'BATAINDIA', 'BEML', 'BERGEPAINT', 'BEL', 'BHARATFIN', 'BHARATFORG', 'BPCL', 'BHARTIARTL', 'INFRATEL', 'BHEL', 'BIOCON', 'BOSCHLTD', 'BRITANNIA',
'CADILAHC', 'CANFINHOME', 'CANBK', 'CAPF', 'CASTROLIND', 'CEATLTD', 'CENTURYTEX', 'CESC', 'CGPOWER', 'CHENNPETRO', 'CHOLAFIN', 'CIPLA', 'COALINDIA', 'COLPAL', 'CONCOR', 'CUMMINSIND', 'DABUR', 'DCBBANK',
'DHFL', 'DISHTV', 'DIVISLAB', 'DLF', 'DRREDDY', 'EICHERMOT', 'ENGINERSIN', 'EQUITAS', 'ESCORTS', 'EXIDEIND',
'FEDERALBNK', 'GAIL', 'GLENMARK', 'GMRINFRA', 'GODFRYPHLP', 'GODREJCP', 'GODREJIND', 'GRANULES', 'GRASIM', 'GSFC', 'HAVELLS', 'HCLTECH', 'HDFCBANK', 'HDFC', 'HEROMOTOCO', 'HEXAWARE', 'HINDALCO', 'HCC', 'HINDPETRO', 'HINDUNILVR',
'HINDZINC', 'ICICIBANK', 'ICICIPRULI', 'IDBI', 'IDEA', 'IDFCBANK', 'IDFC', 'IFCI', 'IBULHSGFIN', 'INDIANB', 'IOC', 'IGL', 'INDUSINDBK', 'INFIBEAM', 'INFY', 'INDIGO', 'IRB', 'ITC', 'JISLJALEQS', 'JPASSOCIAT', 'JETAIRWAYS', 'JINDALSTEL',
'JSWSTEEL', 'JUBLFOOD', 'JUSTDIAL', 'KAJARIACER', 'KTKBANK', 'KSCL', 'KOTAKBANK', 'KPIT', 'L&TFH', 'LT', 'LICHSGFIN', 'LUPIN', 'M&MFIN', 'MGL', 'M&M', 'MANAPPURAM', 'MRPL', 'MARICO', 'MARUTI', 'MFSL', 'MINDTREE', 'MOTHERSUMI', 'MRF', 'MCX',
'MUTHOOTFIN', 'NATIONALUM', 'NBCC', 'NCC', 'NESTLEIND', 'NHPC', 'NIITTECH', 'NMDC', 'NTPC', 'ONGC', 'OIL', 'OFSS', 'ORIENTBANK', 'PAGEIND', 'PCJEWELLER', 'PETRONET', 'PIDILITIND', 'PEL', 'PFC', 'POWERGRID', 'PTC', 'PNB', 'PVR', 'RAYMOND',
'RBLBANK', 'RELCAPITAL', 'RCOM', 'RELIANCE', 'RELINFRA', 'RPOWER', 'REPCOHOME', 'RECLTD', 'SHREECEM', 'SRTRANSFIN', 'SIEMENS', 'SREINFRA', 'SRF', 'SBIN', 'SAIL', 'STAR', 'SUNPHARMA', 'SUNTV', 'SUZLON', 'SYNDIBANK', 'TATACHEM', 'TATACOMM', 'TCS',
'TATAELXSI', 'TATAGLOBAL', 'TATAMTRDVR', 'TATAMOTORS', 'TATAPOWER', 'TATASTEEL', 'TECHM', 'INDIACEM', 'RAMCOCEM', 'SOUTHBANK', 'TITAN', 'TORNTPHARM', 'TORNTPOWER', 'TV18BRDCST', 'TVSMOTOR', 'UJJIVAN', 'ULTRACEMCO', 'UNIONBANK', 'UBL', 'UPL',
'VEDL', 'VGUARD', 'VOLTAS', 'WIPRO', 'WOCKPHARMA', 'YESBANK', 'ZEEL']
For all these companies the dataframe f should be extracted in parallel in order to save run time. Can anyone help me solve this?
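One approach that should cut the run time, assuming the 'yahoo' source still responds: the downloads are network-bound rather than CPU-bound, so a thread pool can issue the requests concurrently. A minimal sketch (symbol list shortened for the example; substitute Company_Names for the full run):
import datetime
from concurrent.futures import ThreadPoolExecutor
import pandas_datareader.data as web

end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=183)  # roughly six months back

def fetch(symbol):
    # Append the NSE suffix and pull one company's dataframe
    return symbol, web.DataReader(symbol + '.NS', 'yahoo', start_date, end_date)

symbols = ['ACC', 'ADANIENT', 'ADANIPORTS']  # shortened for the example

# Threads overlap the network waits, so the downloads run concurrently
with ThreadPoolExecutor(max_workers=8) as pool:
    frames = dict(pool.map(fetch, symbols))

print(frames['ACC'])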
I am trying to use SEC (U.S. Securities and Exchange Commission) data. The SEC provides useful data in a txt format. I am using
Financial Statement Data Sets for the second quarter of 2017. You can find the data I use here.
I tried to read the txt files into a pandas dataframe in the following ways:
sub = pd.read_fwf('sub.txt')
sub_1 = pd.read_csv('sub.txt')
I get no error with using Pandas' read_fwf function - but the output is utter rubbish. Here is the head of the dataframe:
adsh cik name sic countryba stprba cityba zipba bas1 bas2 baph countryma stprma cityma zipma mas1 mas2 countryinc stprinc ein former changed afs wksi fye form period fy fp filed accepted prevrpt detail instance nciks aciks Unnamed: 1
0 0000002178-17-000038\t2178\tADAMS RESOURCES & ... NaN
1 0000002488-17-000107\t2488\tADVANCED MICRO DEV... NaN
I do get an error when using read_csv: Error tokenizing data. C error: Expected 2 fields in line 7, saw 3
Any ideas on how to read the data into a pandas dataframe?
It looks like the files are tab separated - that's why you're seeing \t in the results. pandas' read_csv defaults to comma-separated values, so you have to change the separator; this is controlled by the sep parameter. In addition, you will need to provide the proper encoding (errors are thrown when trying to read the num, pre, and tag files). Generally ISO-8859-1 is a good choice.
#import pandas
import pandas as pd
#read in the .txt file and choose a separator and encoding standard
df = pd.read_csv('sub.txt', sep='\t', encoding='ISO-8859-1')
#output the results
print(df)
adsh cik name \
0 0000002178-17-000038 2178 ADAMS RESOURCES & ENERGY, INC.
1 0000002488-17-000107 2488 ADVANCED MICRO DEVICES INC
2 0000002969-17-000019 2969 AIR PRODUCTS & CHEMICALS INC /DE/
3 0000002969-17-000024 2969 AIR PRODUCTS & CHEMICALS INC /DE/
4 0000003499-17-000010 3499 ALEXANDERS INC
5 0000003545-17-000043 3545 ALICO INC
6 0000003570-17-000073 3570 CHENIERE ENERGY INC
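The same separator and encoding should cover the other files mentioned above (num, pre, and tag; file names assumed from the data set's layout):
import pandas as pd

# num.txt, tag.txt and pre.txt share the tab-separated, ISO-8859-1 layout
num = pd.read_csv('num.txt', sep='\t', encoding='ISO-8859-1')
tag = pd.read_csv('tag.txt', sep='\t', encoding='ISO-8859-1')
pre = pd.read_csv('pre.txt', sep='\t', encoding='ISO-8859-1')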
I'm struggling with getting a simple correlation done. I've tried all that was suggested under similar questions.
Here are the relevant parts of the code, the various attempts I've made, and their results.
import numpy as np
import pandas as pd
try01 = data[['ESA Index_close_px', 'CCMP Index_close_px' ]].corr(method='pearson')
print (try01)
Out:
Empty DataFrame
Columns: []
Index: []
try04 = data['ESA Index_close_px'][5:50].corr(data['CCMP Index_close_px'][5:50])
print (try04)
Out:
AttributeError: 'float' object has no attribute 'sqrt'
Using numpy:
try05 = np.corrcoef(data['ESA Index_close_px'],data['CCMP Index_close_px'])
print (try05)
Out:
AttributeError: 'float' object has no attribute 'sqrt'
Converting the columns to lists:
ESA_Index_close_px_list = list()
start_value = 1
end_value = len (data['ESA Index_close_px']) +1
for items in data['ESA Index_close_px']:
    ESA_Index_close_px_list.append(items)
    start_value = start_value + 1
    if start_value == end_value:
        break
    else:
        continue
CCMP_Index_close_px_list = list()
start_value = 1
end_value = len (data['CCMP Index_close_px']) +1
for items in data['CCMP Index_close_px']:
    CCMP_Index_close_px_list.append(items)
    start_value = start_value + 1
    if start_value == end_value:
        break
    else:
        continue
try06 = np.corrcoef(['ESA_Index_close_px_list','CCMP_Index_close_px_list'])
print (try06)
Out:
TypeError: cannot perform reduce with flexible type
I also tried .astype, but it made no difference:
data['ESA Index_close_px'].astype(float)
data['CCMP Index_close_px'].astype(float)
Using Python 3.5, pandas 0.18.1 and numpy 1.11.1
Would really appreciate any suggestion.
edit1:
The data is coming from an Excel spreadsheet:
data = pd.read_excel('C:\\Users\\Ako\\Desktop\\ako_files\\for_corr_tool.xlsx')
Prior to the correlation attempts, there are only column renames and
data = data.drop(data.index[0])
to get rid of a line.
Regarding the types:
print (type (data['ESA Index_close_px']))
print (type (data['ESA Index_close_px'][1]))
Out:
edit2:
Parts of the data:
print (data['ESA Index_close_px'][1:10])
print (data['CCMP Index_close_px'][1:10])
Out:
2 2137
3 2138
4 2132
5 2123
6 2127
7 2126.25
8 2131.5
9 2134.5
10 2159
Name: ESA Index_close_px, dtype: object
2 5241.83
3 5246.41
4 5243.84
5 5199.82
6 5214.16
7 5213.33
8 5239.02
9 5246.79
10 5328.67
Name: CCMP Index_close_px, dtype: object
Well, I encountered the same problem today.
Try using .astype('float64') to cast the values to the correct type:
data['ESA Index_close_px'][5:50].astype('float64').corr(data['CCMP Index_close_px'][5:50].astype('float64'))
This works well for me. Hope it can help you as well. Note that .astype returns a new Series rather than changing the column in place, which is why the bare .astype(float) calls in the question appeared to do nothing: the result was never assigned back.
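Equivalently, the columns can be converted once up front. A small sketch, assuming any unparseable entries should become NaN rather than raise:
import pandas as pd

data = pd.read_excel('C:\\Users\\Ako\\Desktop\\ako_files\\for_corr_tool.xlsx')  # as in the question

# Coerce both columns to numeric; anything that fails to parse becomes NaN
data['ESA Index_close_px'] = pd.to_numeric(data['ESA Index_close_px'], errors='coerce')
data['CCMP Index_close_px'] = pd.to_numeric(data['CCMP Index_close_px'], errors='coerce')
print(data['ESA Index_close_px'].corr(data['CCMP Index_close_px']))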
You can try the following:
Top15['Citable docs per capita'] = Top15['Citable docs per capita'] * 100000
Top15['Citable docs per capita'].astype('int').corr(Top15['Energy Supply per Capita'].astype('int'))
It worked for me.