Pandas HDF limiting number of rows of CSV file - python-3.x

I have a 3 GB CSV file. I'm trying to save it in HDF format with Pandas so I can load it faster.
import pandas as pd
import traceback

df_all = pd.read_csv('file_csv.csv', iterator=True, chunksize=20000)
for _i, df in enumerate(df_all):
    try:
        print('Saving %d chunk...' % _i, end='')
        df.to_hdf('file_csv.hdf',
                  'file_csv',
                  format='table',
                  data_columns=True)
        print('Done!')
    except:
        traceback.print_exc()
        print(df)
        print(df.info())
del df_all
The original CSV file has about 3 million rows, which is consistent with the output of this piece of code. The last line of output is: Saving 167 chunk...Done!
That means: 167 * 20,000 = 3,340,000 rows
My issue is:
df_hdf = pd.read_hdf('file_csv.hdf')
df_hdf.count()
=> 4613 rows
And:
item_info = pd.read_hdf('ItemInfo_train.hdf', where="item=1")
Returns nothing, even though I'm sure the "item" column has an entry equal to 1 in the original file.
What could be wrong?

Use append=True to tell to_hdf to append new chunks to the same file.
df.to_hdf('file_csv.hdf', ..., append=True)
Otherwise, each call overwrites the previous contents and only the last chunk remains saved in file_csv.hdf.
import os
import numpy as np
import pandas as pd

np.random.seed(2016)
df = pd.DataFrame(np.random.randint(10, size=(100, 2)), columns=list('AB'))
df.to_csv('file_csv.csv')

if os.path.exists('file_csv.hdf'):
    os.unlink('file_csv.hdf')

for i, df in enumerate(pd.read_csv('file_csv.csv', chunksize=50)):
    print('Saving {} chunk...'.format(i), end='')
    df.to_hdf('file_csv.hdf',
              'file_csv',
              format='table',
              data_columns=True,
              append=True)
    print('Done!')
    print(df.loc[df['A'] == 1])
    print('-'*80)

df_hdf = pd.read_hdf('file_csv.hdf', where="A=1")
print(df_hdf)
prints

    Unnamed: 0  A  B
22          22  1  7
30          30  1  7
41          41  1  9
44          44  1  0
19          69  1  3
29          79  1  1
31          81  1  5
34          84  1  6

Related

Appending DataFrame to empty DataFrame in {Key: Empty DataFrame (with columns)}

I am struggling to understand this one.
I have a regular df (same columns as the empty df in dict) and an empty df which is a value in a dictionary (the keys in the dict are variable based on certain inputs, so can be just one key/value pair or multiple key/value pairs - think this might be relevant). The dict structure is essentially:
{key: [[Empty DataFrame
        Columns: [list of columns]
        Index: []]]}
I am using the following code to try and add the data:
dict[key].append(df, ignore_index=True)
The error I get is:
temp_dict[product_match].append(regular_df, ignore_index=True)
TypeError: append() takes no keyword arguments
Is this error due to me mis-specifying the value I am attempting to append the df to (like am I trying to append the df to the key instead here) or something else?
Your dictionary contains a list of lists at the key; we can see this in the output shown:
{key: [[Empty DataFrame Columns: [list of columns] Index: []]]}
#     ^^ list starts                                       ^^ list ends
For this reason dict[key].append is calling list.append, as mentioned by @nandoquintana.
To append to the DataFrame access the specific element in the list:
temp_dict[product_match][0][0].append(df, ignore_index=True)
Notice there is no inplace version of append. append always produces a new DataFrame:
Sample Program:
import numpy as np
import pandas as pd

temp_dict = {
    'key': [[pd.DataFrame()]]
}
product_match = 'key'

np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 100, (5, 4)))

temp_dict[product_match][0][0].append(df, ignore_index=True)
print(temp_dict)
Output (temp_dict was not updated):
{'key': [[Empty DataFrame
          Columns: []
          Index: []]]}
The new DataFrame will need to be assigned to the correct location.
Either a new variable:
some_new_variable = temp_dict[product_match][0][0].append(df, ignore_index=True)
some_new_variable

    0   1   2   3
0  99  78  61  16
1  73   8  62  27
2  30  80   7  76
3  15  53  80  27
4  44  77  75  65
Or back to the list:
temp_dict[product_match][0][0] = (
    temp_dict[product_match][0][0].append(df, ignore_index=True)
)
temp_dict

{'key': [[    0   1   2   3
          0  99  78  61  16
          1  73   8  62  27
          2  30  80   7  76
          3  15  53  80  27
          4  44  77  75  65]]}
Assuming the DataFrame is actually an empty DataFrame, append is unnecessary; simply updating the value at the key to be that DataFrame works:
temp_dict[product_match] = df
temp_dict

{'key':     0   1   2   3
        0  99  78  61  16
        1  73   8  62  27
        2  30  80   7  76
        3  15  53  80  27
        4  44  77  75  65}
Or if list of list is needed:
temp_dict[product_match] = [[df]]
temp_dict

{'key': [[    0   1   2   3
          0  99  78  61  16
          1  73   8  62  27
          2  30  80   7  76
          3  15  53  80  27
          4  44  77  75  65]]}
Maybe you have an empty list at dict[key]?
Remember that the list "append" method (unlike the pandas DataFrame one) only takes one parameter:
https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
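Note also that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is its replacement. A minimal sketch of the same fix using concat (the names here just mirror the question's nested-dict layout, and the column names are made up):

```python
import pandas as pd

# The question's structure: a dict whose value is a list of lists holding a DataFrame
temp_dict = {'key': [[pd.DataFrame(columns=['a', 'b'])]]}
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# concat, like append, returns a new DataFrame, so assign it back into the nested list
temp_dict['key'][0][0] = pd.concat([temp_dict['key'][0][0], df], ignore_index=True)
print(temp_dict['key'][0][0].shape)  # (2, 2)
```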

Splitting time formatted object doesn't work with python and pandas

I have the simple line of code:
print(df['Duration'])
df['Duration'].str.split(':')
print(df['Duration'])
Here are the value I have for each print
0    00:58:59
1    00:27:41
2    00:27:56
Name: Duration, dtype: object
Why is the split not working here? What am I missing?
str.split doesn't modify the column in place, so you need to assign the result back:
import pandas as pd
df = pd.DataFrame({'Duration':['00:58:59', '00:27:41', '00:27:56'], 'other':[10, 20, 30]})
df['Duration'] = df['Duration'].str.split(':')
print(df)
Prints:
       Duration  other
0  [00, 58, 59]     10
1  [00, 27, 41]     20
2  [00, 27, 56]     30
If you want to expand the columns of DataFrame by splitting, you can try:
import pandas as pd
df = pd.DataFrame({'Duration':['00:58:59', '00:27:41', '00:27:56'], 'other':[10, 20, 30]})
df[['hours', 'minutes', 'seconds']] = df['Duration'].str.split(':', expand=True)
print(df)
Prints:
   Duration  other hours minutes seconds
0  00:58:59     10    00      58      59
1  00:27:41     20    00      27      41
2  00:27:56     30    00      27      56
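If the goal is to do arithmetic with the durations rather than keep them as strings, pd.to_timedelta may be a better fit than str.split. A sketch using the question's sample values:

```python
import pandas as pd

df = pd.DataFrame({'Duration': ['00:58:59', '00:27:41', '00:27:56']})
# Parse HH:MM:SS strings into timedeltas, then extract total seconds
df['seconds'] = pd.to_timedelta(df['Duration']).dt.total_seconds()
print(df['seconds'].tolist())  # [3539.0, 1661.0, 1676.0]
```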

Remove index from dataframe using Python

I am trying to create a Pandas Dataframe from a string using the following code -
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)
I am getting the following result -
      0     1     2
0     A     B     C
1     0    34    88
2     2    45   200
3     3    47    65
4     4    32   140
5        None  None
But I need something like the following -
A B C
0 34 88
2 45 200
3 47 65
4 32 140
I added "index = False" while creating the dataframe like -
df = pd.DataFrame([x.split(';') for x in data.split('\n')],index = False)
But, it gives me an error -
TypeError: Index(...) must be called with a collection of some kind, False
was passed
How is this achievable?
Use read_csv with StringIO and the index_col parameter to set the first column as the index:
import pandas as pd
from io import StringIO

input_string = """A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""

# pd.compat.StringIO was removed in modern pandas; io.StringIO works the same way
df = pd.read_csv(StringIO(input_string), sep=';', index_col=0)
print (df)

    B    C
A
0  34   88
2  45  200
3  47   65
4  32  140
Your solution can be fixed by splitting on the default separator (any whitespace), passing all the lists except the first to DataFrame together with the columns parameter, and, if the first column should become the index, adding DataFrame.set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index('A')
print (df)

    B    C
A
0  34   88
2  45  200
3  47   65
4  32  140
For a general solution, use the first value of the first list in set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index(L[0][0])
EDIT:
You can set the A value as the columns' name instead of the index name:
df = df.rename_axis(df.index.name, axis=1).rename_axis(None)
print (df)

A   B    C
0  34   88
2  45  200
3  47   65
4  32  140
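The rename_axis chain is easier to see on a toy frame built directly: the first call moves the index's name onto the columns axis, the second clears the index name. A sketch independent of the CSV above:

```python
import pandas as pd

df = pd.DataFrame({'B': [34, 45], 'C': [88, 200]},
                  index=pd.Index([0, 2], name='A'))
# Name the columns axis with the index's name, then remove the index name
out = df.rename_axis(df.index.name, axis=1).rename_axis(None)
print(out.columns.name)  # 'A'
print(out.index.name)    # None
```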
import pandas as pd

input_string = """A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split()])
df.columns = df.iloc[0]
df = df.iloc[1:].rename_axis(None, axis=1)
df.set_index('A', inplace=True)
df
output

    B    C
A
0  34   88
2  45  200
3  47   65
4  32  140

Output columns do not match with the data

I am trying to make a dataframe with Historical data of daily No. of stock Advancing and declining with their respective volumes of Nifty 50 index.
Being new to Python, I am having trouble handling pandas DataFrames and conditional logic.
Below is the code that I wrote, but the output's columns are wrong:
import datetime
from datetime import date, timedelta
import nsepy as ns
from nsepy.derivatives import get_expiry_date
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# setting default dates
end_date = date.today()
start_date = end_date - timedelta(365)

# Deriving the names of the 50 stocks in the Nifty 50 Index
nifty_50 = pd.read_html('https://en.wikipedia.org/wiki/NIFTY_50')
nifty50_symbols = nifty_50[1][1]

results = []
for x in nifty50_symbols:
    data = ns.get_history(symbol=x, start=start_date, end=end_date)
    results.append(data)
df = pd.concat(results)

output = []
for x in df.index:
    Dates = df[df.index == x]
    adv = 0
    dec = 0
    net = 0
    advol = 0
    devol = 0
    netvol = 0
    for s in Dates['Symbol']:
        y = Dates[Dates['Symbol'] == s]
        #print(y.loc[x,'Close'])
        cclose = y.loc[x, 'Close']
        #print(cclose)
        copen = y.loc[x, 'Open']
        #print(copen)
        cvol = y.loc[x, 'Volume']
        if cclose > copen:
            adv = adv + 1
            advol = advol + cvol
        elif copen > cclose:
            dec = dec + 1
            devol = devol + cvol
        else:
            net = net + 1
            netvol = netvol + cvol
    data = [x, adv, dec, advol, devol]
    output.append(data)

final = pd.DataFrame(output, columns = {'Date','Advance','Decline','Adv_Volume','Dec_Volume'})
print(final)
Output:
Dec_Volume Adv_Volume Date Decline Advance
0 2017-02-06 27 23 88546029 70663663
1 2017-02-07 15 35 53775268 127004815
2 2017-02-08 27 23 76150502 96895043
3 2017-02-09 20 30 48815099 121956144
4 2017-02-10 19 31 47713187 156262469
5 2017-02-13 23 27 78460358 86575050
6 2017-02-14 15 35 65543372 100474945
7 2017-02-15 13 37 35055563 160091302
8 2017-02-16 35 15 114283658 73082870
9 2017-02-17 22 28 91383781 193246678
10 2017-02-20 34 16 100148171 54036281
11 2017-02-21 29 21 87434834 75182662
12 2017-02-22 13 37 77086733 148499613
13 2017-02-23 20 29 104469151 192787014
14 2017-02-27 13 37 41823692 140518994
15 2017-02-28 21 29 76949655 142799485
As you can see from the output, the column names do not match the data under them. Why is this happening, and how do I fix it?
If I print the output list after the loops finish, the data looks exactly the way I want it (as far as a novice like me can tell). The problem happens when I convert the output list into a DataFrame.
I think the solution is simply to pass your column names as a Python list (using []), which has a well-defined element order, rather than as a set ({}) of unordered elements:
final = pd.DataFrame(output, columns = ['Date','Advance','Decline','Adv_Volume','Dec_Volume'])
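A quick way to see why: a Python set has no defined element order, while a list preserves it, so pandas can only pair names with data reliably when given an ordered collection. A small sketch with one made-up row from the output above:

```python
import pandas as pd

row = [['2017-02-06', 27, 23, 88546029, 70663663]]
cols = ['Date', 'Advance', 'Decline', 'Adv_Volume', 'Dec_Volume']

# A list has a defined order, so names and data line up deterministically
df = pd.DataFrame(row, columns=cols)
print(list(df.columns) == cols)  # True

# A set does not: iteration order is arbitrary. Older pandas silently used
# whatever order the set happened to iterate in (the bug in the question);
# newer versions reject sets outright with "TypeError: Set type is unordered".
print(sorted({'Date', 'Advance'}) == ['Advance', 'Date'])  # True
```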

Scraping an html table with beautiful soup into pandas

I'm trying to scrape an html table using beautiful soup and import it into pandas -- http://www.baseball-reference.com/teams/NYM/2017.shtml -- the "Team Batting" table.
Finding the table is no problem:
table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
Finding the rows of data isn't a problem either:
for i in table.findAll('tr')[2]:  # increase to 3 to get next row in table...
    print(i.get_text())
And I can even find the header names:
table_head = table.find('thead')
for i in table_head.findAll('th'):
    print(i.get_text())
Now I'm having trouble putting everything together into a data frame. Here's what I have so far:
header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)

row = []
for tr in table.findAll('tr')[2]:
    value = tr.get_text()
    row.append(value)

od = OrderedDict(zip(header, row))
df = pd.DataFrame(od, index=[0])
This only works for one row at a time. My question is how can I do this for every row in the table at the same time?
I have tested that the below works for your purposes. Basically you need to create a list, loop over the players, and use that list to populate a DataFrame. It is advisable not to create the DataFrame row by row, as that will probably be significantly slower.
import collections as co
import pandas as pd
from bs4 import BeautifulSoup

with open('team_batting.html', 'r') as fin:
    soup = BeautifulSoup(fin.read(), 'lxml')

table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
table_head = table.find('thead')

header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)

# loop over table to find number of rows with '' in first column
endrows = 0
for tr in table.findAll('tr'):
    if tr.findAll('th')[0].get_text() in (''):
        endrows += 1

rows = len(table.findAll('tr'))
rows -= endrows + 1  # there is a pernicious final row that begins with 'Rk'

list_of_dicts = []
for row in range(rows):
    the_row = []
    try:
        table_row = table.findAll('tr')[row]
        for tr in table_row:
            value = tr.get_text()
            the_row.append(value)
        od = co.OrderedDict(zip(header, the_row))
        list_of_dicts.append(od)
    except AttributeError:
        continue

df = pd.DataFrame(list_of_dicts)
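The core of that answer is the zip-header-with-cells pattern; stripped of the scraping, it can be sketched with hypothetical data like this:

```python
import pandas as pd

# Hypothetical header and cell values standing in for the scraped table
header = ['Rk', 'Pos', 'Name']
rows = [
    ['1', 'C', "Travis d'Arnaud"],
    ['2', '1B', 'Lucas Duda'],
]

# One dict per row, keys taken from the header, then build the frame in one call
records = [dict(zip(header, row)) for row in rows]
df = pd.DataFrame(records)
print(df.shape)  # (2, 3)
```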
This solution uses only pandas, but it cheats a little by knowing in advance that the team batting table is the tenth table. With that knowledge, the following uses pandas's read_html function, grabbing the tenth DataFrame from the list of returned objects. What remains after that is just some data cleaning:
import pandas as pd

url = 'http://www.baseball-reference.com/teams/NYM/2017.shtml'
# Take 10th dataframe
team_batting = pd.read_html(url)[9]
# Take out columns whose names contain "Unnamed"
team_batting.drop([x for x in team_batting.columns if 'Unnamed' in x], axis=1, inplace=True)
# Remove the rows that are just a copy of the headers/columns
# (.ix was removed from pandas; .loc does the same selection here)
team_batting = team_batting.loc[team_batting.apply(lambda x: x != team_batting.columns, axis=1).all(axis=1), :]
# Take out the Totals rows
team_batting = team_batting.loc[~team_batting.Rk.isnull(), :]
# Get a glimpse of the data
print(team_batting.head(5))
# Rk Pos Name Age G PA AB R H 2B ... OBP SLG OPS OPS+ TB GDP HBP SH SF IBB
# 0 1 C Travis d'Arnaud 28 12 42 37 6 10 2 ... .357 .541 .898 144 20 1 1 0 0 1
# 1 2 1B Lucas Duda* 31 13 50 42 4 10 2 ... .360 .571 .931 153 24 1 0 0 0 2
# 2 3 2B Neil Walker# 31 14 62 54 5 12 3 ... .306 .278 .584 64 15 2 0 0 1 0
# 3 4 SS Asdrubal Cabrera# 31 15 67 63 10 17 2 ... .313 .397 .710 96 25 0 0 0 0 0
# 4 5 3B Jose Reyes# 34 15 59 53 3 5 2 ... .186 .132 .319 -9 7 0 0 0 0 0
I hope this helps.
