trouble with transpose in pd.read_csv - python-3.x

I have a data in a CSV file structured like this
Installation Manufacturing Sales & Distribution Project Development Other
43,934 24,916 11,744 - 12,908
52,503 24,064 17,722 - 5,948
57,177 29,742 16,005 7,988 8,105
69,658 29,851 19,771 12,169 11,248
97,031 32,490 20,185 15,112 8,989
119,931 30,282 24,377 22,452 11,816
137,133 38,121 32,147 34,400 18,274
154,175 40,434 39,387 34,227 18,111
I want to skip the header and transpose the list like this
43934 52503 57177 69658 97031 119931 137133 154175
24916 24064 29742 29851 32490 30282 38121 40434
11744 17722 16005 19771 20185 24377 32147 39387
0 0 7988 12169 15112 22452 34400 34227
12908 5948 8105 11248 8989 11816 18274 18111
Here is my code
import pandas as pd
import csv
FileName = "C:/Users/kesid/Documents/Pthon/Pthon.csv"
data = pd.read_csv(FileName, header= None)
data = list(map(list, zip(*data)))
print(data)
I am getting the error "TypeError: zip argument #1 must support iteration". Any help much appreciated.

You can read_csv in "normal" mode:
df = pd.read_csv('input.csv')
(colum names will be dropper later).
Start processing from replacing NaN with 0:
df.fillna(0, inplace=True)
Then use either df.values.T or df.values.T.tolist(), whatever
better suits your needs.

You should use skiprows=[0] to skip reading the first row and use .T to transpose
df = pd.read_csv(filename, skiprows=[0], header=None).T

Related

How to read in pandas column as column of lists?

Probably a simple solution but I couldn't find a fix scrolling through previous questions so thought I would ask.
I'm reading in a csv using pd.read_csv() One column is giving me issues:
0 ['Bupa', 'O2', 'EE', 'Thomas Cook', 'YO! Sushi...
1 ['Marriott', 'Evans']
2 ['Toni & Guy', 'Holland & Barrett']
3 []
4 ['Royal Mail', 'Royal Mail']
It looks fine here but when I reference the first value in the column i get:
df['brand_list'][0]
Out : '[\'Bupa\', \'O2\', \'EE\', \'Thomas Cook\', \'YO! Sushi\', \'Costa\', \'Starbucks\', \'Apple Store\', \'HMV\', \'Marks & Spencer\', "Sainsbury\'s", \'Superdrug\', \'HSBC UK\', \'Boots\', \'3 Store\', \'Vodafone\', \'Marks & Spencer\', \'Clarks\', \'Carphone Warehouse\', \'Lloyds Bank\', \'Pret A Manger\', \'Sports Direct\', \'Currys PC World\', \'Warrens Bakery\', \'Primark\', "McDonald\'s", \'HSBC UK\', \'Aldi\', \'Premier Inn\', \'Starbucks\', \'Pizza Hut\', \'Ladbrokes\', \'Metro Bank\', \'Cotswold Outdoor\', \'Pret A Manger\', \'Wetherspoon\', \'Halfords\', \'John Lewis\', \'Waitrose\', \'Jessops\', \'Costa\', \'Lush\', \'Holland & Barrett\']'
Which is obviously a string not a list as expected. How can I retain the list type when I read in this data?
I've tried the import ast method I've seen in other posts: df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x)) Which didn't work.
I've also tried to replicate with dummy dataframes:
df1 = pd.DataFrame({'a' : [['test','test1','test3'], ['test59'], ['test'], ['rhg','wreg']],
'b' : [['erg','retbn','ert','eb'], ['g','eg','egr'], ['erg'], 'eg']})
df1['a'][0]
Out: ['test', 'test1', 'test3']
Which works as I would expect - this suggests to me that the solution lies in how I am importing the data
Apologies, I was being stupid. The following should work:
import ast
df['brand_list_new'] = df['brand_list'].apply(lambda x: ast.literal_eval(x))
df['brand_list_new'][0]
Out: ['Bupa','O2','EE','Thomas Cook','YO! Sushi',...]
As desired

load csv and delete \r\n python

I can not load my csv with pandas correctly, I have some matrices and vectors but I get a (\ r \ n) for example i tried this:
test = pd.read_csv('test.csv',sep='\t',index_col=0)
test.head()
and i get this:
test['image_gray'][0]
'[[0.4297102 0.4297102 0.42578863 ... 0.37573176 0.34549804 0.30628235]\r\n [0.41794549 0.41402392 0.40618078 ... 0.37573176 0.34549804 0.30628235]\r\n [0.39833765 0.39441608 0.38151255 ... 0.3718102 0.34549804 0.30628235]\r\n ...\r\n [0.03164039 0.01987569 0.01763569 ... 0.55161137 0.55553294 0.55496745]\r\n [0.03385765 0.02771882 0.01763569 ... 0.55945451 0.56281059 0.56281059]\r\n [0.03777922 0.02771882 0.01763569 ... 0.56281059 0.56673216 0.57065373]]'
i don't want (\r\n) i have the same proble in others columns of my dataframe.
¿What could i do ?
If you're using windows. then try passing the os.linesep as the lineterminator to read_csv
pd.read_csv('test.csv', sep='\t', sep='\t', lineterminator=os.linesep)
But the file you have might be malformed with \r\n\n. Check that.

how to perform calculation for all elements of list parellely or simultaneously to reduce the run time?

I am extracting a dataframe from yahoo API for a list of companies but I want to do it paralelly(i.e, the dataframes for all the companies should be extracted simultaneously).So, it reduces my run time....
My code:
import pandas_datareader.data as web
import pandas as pd
import datetime
end_date = datetime.datetime.now().strftime('%d/%m/%Y')
temp = datetime.datetime.now() - datetime.timedelta(6*365/12)
start_date = temp.strftime('%d/%m/%Y')
f = web.DataReader('ACC.NS', 'yahoo', start_date, end_date)
print(f)
The output of this code is a dataframe shown below:
This i have done for single company....
I want to do it for a list of companies which is:
Company_Names = ['ACC', 'ADANIENT', 'ADANIPORTS', 'ADANIPOWER', 'AJANTPHARM', 'ALBK', 'AMARAJABAT', 'AMBUJACEM', 'APOLLOHOSP', 'APOLLOTYRE', 'ARVIND', 'ASHOKLEY', 'ASIANPAINT', 'AUROPHARMA', 'AXISBANK',
'BAJAJ-AUTO', 'BAJFINANCE', 'BAJAJFINSV', 'BALKRISIND', 'BANKBARODA', 'BANKINDIA', 'BATAINDIA', 'BEML', 'BERGEPAINT', 'BEL', 'BHARATFIN', 'BHARATFORG', 'BPCL', 'BHARTIARTL', 'INFRATEL', 'BHEL', 'BIOCON', 'BOSCHLTD', 'BRITANNIA',
'CADILAHC', 'CANFINHOME', 'CANBK', 'CAPF', 'CASTROLIND', 'CEATLTD', 'CENTURYTEX', 'CESC', 'CGPOWER', 'CHENNPETRO', 'CHOLAFIN', 'CIPLA', 'COALINDIA', 'COLPAL', 'CONCOR', 'CUMMINSIND', 'DABUR', 'DCBBANK',
'DHFL', 'DISHTV', 'DIVISLAB', 'DLF', 'DRREDDY', 'EICHERMOT', 'ENGINERSIN', 'EQUITAS', 'ESCORTS', 'EXIDEIND',
'FEDERALBNK', 'GAIL', 'GLENMARK', 'GMRINFRA', 'GODFRYPHLP', 'GODREJCP', 'GODREJIND', 'GRANULES', 'GRASIM', 'GSFC', 'HAVELLS', 'HCLTECH', 'HDFCBANK', 'HDFC', 'HEROMOTOCO', 'HEXAWARE', 'HINDALCO', 'HCC', 'HINDPETRO', 'HINDUNILVR',
'HINDZINC', 'ICICIBANK', 'ICICIPRULI', 'IDBI', 'IDEA', 'IDFCBANK', 'IDFC', 'IFCI', 'IBULHSGFIN', 'INDIANB', 'IOC', 'IGL', 'INDUSINDBK', 'INFIBEAM', 'INFY', 'INDIGO', 'IRB', 'ITC', 'JISLJALEQS', 'JPASSOCIAT', 'JETAIRWAYS', 'JINDALSTEL',
'JSWSTEEL', 'JUBLFOOD', 'JUSTDIAL', 'KAJARIACER', 'KTKBANK', 'KSCL', 'KOTAKBANK', 'KPIT', 'L&TFH', 'LT', 'LICHSGFIN', 'LUPIN', 'M&MFIN', 'MGL', 'M&M', 'MANAPPURAM', 'MRPL', 'MARICO', 'MARUTI', 'MFSL', 'MINDTREE', 'MOTHERSUMI', 'MRF', 'MCX',
'MUTHOOTFIN', 'NATIONALUM', 'NBCC', 'NCC', 'NESTLEIND', 'NHPC', 'NIITTECH', 'NMDC', 'NTPC', 'ONGC', 'OIL', 'OFSS', 'ORIENTBANK', 'PAGEIND', 'PCJEWELLER', 'PETRONET', 'PIDILITIND', 'PEL', 'PFC', 'POWERGRID', 'PTC', 'PNB', 'PVR', 'RAYMOND',
'RBLBANK', 'RELCAPITAL', 'RCOM', 'RELIANCE', 'RELINFRA', 'RPOWER', 'REPCOHOME', 'RECLTD', 'SHREECEM', 'SRTRANSFIN', 'SIEMENS', 'SREINFRA', 'SRF', 'SBIN', 'SAIL', 'STAR', 'SUNPHARMA', 'SUNTV', 'SUZLON', 'SYNDIBANK', 'TATACHEM', 'TATACOMM', 'TCS',
'TATAELXSI', 'TATAGLOBAL', 'TATAMTRDVR', 'TATAMOTORS', 'TATAPOWER', 'TATASTEEL', 'TECHM', 'INDIACEM', 'RAMCOCEM', 'SOUTHBANK', 'TITAN', 'TORNTPHARM', 'TORNTPOWER', 'TV18BRDCST', 'TVSMOTOR', 'UJJIVAN', 'ULTRACEMCO', 'UNIONBANK', 'UBL', 'UPL',
'VEDL', 'VGUARD', 'VOLTAS', 'WIPRO', 'WOCKPHARMA', 'YESBANK', 'ZEEL']
For all this companies the dataframe 'f' should be extracted parallely in order to save run time. Can anyone help me to solve this?

How do read a SEC txt-file into a pandas dataframe?

I am trying to use SEC (U.S. Security and Exchange Commision data). The SEC provides useful data in a txtformat. I am using
Financial Statement Data Sets for the second quarter of 2017. You can find the data I use here.
I try to read the txtfiles into a pandas dataframe. I tried it the following ways:
sub = pd.read_fwf('sub.txt')
sub_1 = pd.read_csv('sub.txt')
I get no error with using Pandas' read_fwf function - but the output is utter rubbish. Here is the head of the dataframe:
adsh cik name sic countryba stprba cityba zipba bas1 bas2 baph countryma stprma cityma zipma mas1 mas2 countryinc stprinc ein former changed afs wksi fye form period fy fp filed accepted prevrpt detail instance nciks aciks Unnamed: 1
0 0000002178-17-000038\t2178\tADAMS RESOURCES & ... NaN
1 0000002488-17-000107\t2488\tADVANCED MICRO DEV... NaN
I do get an error when using read_csv: Error tokenizing data. C error: Expected 2 fields in line 7, saw 3
Any ideas on how tor read the data into a pandas dataframe?
It looks like the files are tab separated - that's why you're seeing \t in the results. pandas read_csv defaults to comma separated values, so you have to change the separator. This is controlled by the sep parameter. In addition, you will need to provide the proper encoding (errors are thrown when trying to read the num, pre, and tag files). Generally ISO-8859-1 is a good choice.
#import pandas
import pandas as pd
#read in the .txt file and choose a separator and encoding standard
df = pd.read_csv('sub.txt', sep='\t', encoding='ISO-8859-1')
#output the results
print(df)
adsh cik name \
0 0000002178-17-000038 2178 ADAMS RESOURCES & ENERGY, INC.
1 0000002488-17-000107 2488 ADVANCED MICRO DEVICES INC
2 0000002969-17-000019 2969 AIR PRODUCTS & CHEMICALS INC /DE/
3 0000002969-17-000024 2969 AIR PRODUCTS & CHEMICALS INC /DE/
4 0000003499-17-000010 3499 ALEXANDERS INC
5 0000003545-17-000043 3545 ALICO INC
6 0000003570-17-000073 3570 CHENIERE ENERGY INC

python - cannot make corr work

I'm struggling with getting a simple correlation done. I've tried all that was suggested under similar questions.
Here are the relevant parts of the code, the various attempts I've made and their results.
import numpy as np
import pandas as pd
try01 = data[['ESA Index_close_px', 'CCMP Index_close_px' ]].corr(method='pearson')
print (try01)
Out:
Empty DataFrame
Columns: []
Index: []
try04 = data['ESA Index_close_px'][5:50].corr(data['CCMP Index_close_px'][5:50])
print (try04)
Out:
**AttributeError: 'float' object has no attribute 'sqrt'**
using numpy
try05 = np.corrcoef(data['ESA Index_close_px'],data['CCMP Index_close_px'])
print (try05)
Out:
AttributeError: 'float' object has no attribute 'sqrt'
converting the columns to lists
ESA_Index_close_px_list = list()
start_value = 1
end_value = len (data['ESA Index_close_px']) +1
for items in data['ESA Index_close_px']:
ESA_Index_close_px_list.append(items)
start_value = start_value+1
if start_value == end_value:
break
else:
continue
CCMP_Index_close_px_list = list()
start_value = 1
end_value = len (data['CCMP Index_close_px']) +1
for items in data['CCMP Index_close_px']:
CCMP_Index_close_px_list.append(items)
start_value = start_value+1
if start_value == end_value:
break
else:
continue
try06 = np.corrcoef(['ESA_Index_close_px_list','CCMP_Index_close_px_list'])
print (try06)
Out:
****TypeError: cannot perform reduce with flexible type****
Also tried .astype but not made any difference.
data['ESA Index_close_px'].astype(float)
data['CCMP Index_close_px'].astype(float)
Using Python 3.5, pandas 0.18.1 and numpy 1.11.1
Would really appreciate any suggestion.
**edit1:*
Data is coming from an excel spreadsheet
data = pd.read_excel('C:\\Users\\Ako\\Desktop\\ako_files\\for_corr_‌​tool.xlsx') prior to the correlation attempts, there are only column renames and
data = data.drop(data.index[0])
to get rid of a line
regarding the types:
print (type (data['ESA Index_close_px']))
print (type (data['ESA Index_close_px'][1]))
Out:
**edit2*
parts of the data:
print (data['ESA Index_close_px'][1:10])
print (data['CCMP Index_close_px'][1:10])
Out:
2 2137
3 2138
4 2132
5 2123
6 2127
7 2126.25
8 2131.5
9 2134.5
10 2159
Name: ESA Index_close_px, dtype: object
2 5241.83
3 5246.41
4 5243.84
5 5199.82
6 5214.16
7 5213.33
8 5239.02
9 5246.79
10 5328.67
Name: CCMP Index_close_px, dtype: object
Well, I've encountered the same problem today.
try use .astype('float64') to help make the type correct.
data['ESA Index_close_px'][5:50].astype('float64').corr(data['CCMP Index_close_px'][5:50].astype('float64'))
This works well for me. Hope it can help you as well.
You can try as following:
Top15['Citable docs per capita']=(Top15['Citable docs per capita']*100000)
Top15['Citable docs per capita'].astype('int').corr(Top15['Energy Supply per Capita'].astype('int'))
It worked for me.

Resources