Displaying Pandas Series with Asian text in index - python-3.x

If the index of a Pandas Series is in one of the Asian languages and has variable lengths, then the print-out would not be aligned correctly.
import pandas as pd
from IPython.display import display
df = pd.Series( range(2), index = [ 'ミートボールスパゲッティ', 'ご飯' ] )
display(df)
print(df)
df
Note that this only happens with a Series; with a DataFrame, display renders the content nicely.
How can I fix the output here?

Convert the Series to a DataFrame for display. Currently, Series have no to_html() method. Therefore, they cannot be displayed in this format directly.
import pandas as pd
from IPython.display import display
df = pd.Series( range(2), index = [ 'ミートボールスパゲッティ', 'ご飯' ] )
df.to_frame()
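If the plain-text output of print() also needs to line up, one option worth trying is the pandas display setting display.unicode.east_asian_width, which makes pandas count wide East Asian characters as two columns when padding. The sketch below assumes a pandas version that supports this option and a terminal or notebook with a monospaced CJK font; the variable names are only illustrative.

```python
import pandas as pd

# Same data as in the question.
series = pd.Series(range(2), index=['ミートボールスパゲッティ', 'ご飯'])

# Enable wide-character-aware padding only inside this block,
# so the global pandas settings stay untouched.
with pd.option_context('display.unicode.east_asian_width', True):
    aligned = repr(series)

print(aligned)
```

The option makes rendering somewhat slower, which is why it is switched on here only for the duration of the with block.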

Related

Pandas plotting graph with timestamp

pandas 0.23.4
python 3.5.3
I have some code that looks like below
import pandas as pd
from datetime import datetime
from matplotlib import pyplot
def dateparse():
    return datetime.strptime("2019-05-28T00:06:20,927", '%Y-%m-%dT%H:%M:%S,%f')
series = pd.read_csv('sample.csv', delimiter=";", parse_dates=True,
                     date_parser=dateparse, header=None)
series.plot()
pyplot.show()
The CSV file looks like below
2019-05-28T00:06:20,167;2070
2019-05-28T00:06:20,426;147
2019-05-28T00:06:20,927;453
2019-05-28T00:06:22,688;2464
2019-05-28T00:06:27,260;216
As you can see 2019-05-28T00:06:20,167 is the timestamp with milliseconds and 2070 is the value that I want plotted.
When I run this, the graph is drawn, but the x-axis shows plain numbers, which is a bit odd. I was expecting to see actual timestamps (as in MS Excel). Can someone tell me what I am doing wrong?
You did not set the datetime column as the index. Also, you don't need a date parser; just pass the columns you want to parse:
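from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt
dfstr = '''2019-05-28T00:06:20,167;2070
2019-05-28T00:06:20,426;147
2019-05-28T00:06:20,927;453
2019-05-28T00:06:22,688;2464
2019-05-28T00:06:27,260;216'''
# io.StringIO works on every pandas version; pd.compat.StringIO was removed in later releases
df = pd.read_csv(StringIO(dfstr), sep=';',
                 header=None, parse_dates=[0])
plt.plot(df[0], df[1])
plt.show()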
Output:
Or:
df.set_index(0)[1].plot()
which gives a slightly better plot.
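If automatic date parsing does not recognise the comma before the milliseconds, a variant is to read the first column as text and convert it with an explicit format string. This is a minimal sketch under that assumption; the column names timestamp and value are made up for illustration, since the file has no header.

```python
from io import StringIO

import pandas as pd

csv_text = """2019-05-28T00:06:20,167;2070
2019-05-28T00:06:20,426;147
2019-05-28T00:06:20,927;453
2019-05-28T00:06:22,688;2464
2019-05-28T00:06:27,260;216
"""

# Read both columns as they are; the names are illustrative only.
df = pd.read_csv(StringIO(csv_text), sep=";", header=None,
                 names=["timestamp", "value"])

# Convert the text column using the same format string as in the question.
df["timestamp"] = pd.to_datetime(df["timestamp"],
                                 format="%Y-%m-%dT%H:%M:%S,%f")

# With the timestamps as the index, series.plot() labels the x-axis with dates.
series = df.set_index("timestamp")["value"]
print(series)
```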

Empty plot on Bokeh Tutorial Exercise

I'm following the bokeh tutorial and in the basic plotting section, I can't manage to show a plot. I only get the axis. What am I missing?
Here is the code:
df = pd.DataFrame.from_dict(AAPL)
weekapple = df.loc["2000-03-01":"2000-04-01"]
p = figure(x_axis_type="datetime", title="AAPL", plot_height=350, plot_width=800)
p.xgrid.grid_line_color=None
p.ygrid.grid_line_alpha=0.5
p.xaxis.axis_label = 'Time'
p.yaxis.axis_label = 'Value'
p.line(weekapple.date, weekapple.close)
show(p)
I get this (image "My result" not shown):
I'm trying to complete the exercise here (10th Code cell - Exercise with AAPL data) I was able to follow all previous code up to that point correctly.
Thanks in advance!
In case this is still relevant, this is how you should do your selection:
df = pd.DataFrame.from_dict(AAPL)
# Convert date column in df from strings to the proper datetime format
date_format="%Y-%m-%d"
df["date"] = pd.to_datetime(df['date'], format=date_format)
# Use the same conversion for selected dates
weekapple = df[(df.date>=dt.strptime("2000-03-01", date_format)) &
               (df.date<=dt.strptime("2000-04-01", date_format))]
p = figure(x_axis_type="datetime", title="AAPL", plot_height=350, plot_width=800)
p.xgrid.grid_line_color=None
p.ygrid.grid_line_alpha=0.5
p.xaxis.axis_label = 'Time'
p.yaxis.axis_label = 'Value'
p.line(weekapple.date, weekapple.close)
show(p)
To make this work, before this code, I have (in my Jupyter notebook):
import numpy as np
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
import bokeh.sampledata
import pandas as pd
from datetime import datetime as dt
bokeh.sampledata.download()
from bokeh.sampledata.stocks import AAPL
output_notebook()
As described at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html, .loc selects by index labels (or boolean lists); date is not the index of your dataframe (it is a regular column).
I hope this helps.
Your dataframe sub-view is empty:
In [3]: import pandas as pd
...: from bokeh.sampledata.stocks import AAPL
...: df = pd.DataFrame.from_dict(AAPL)
...: weekapple = df.loc["2000-03-01":"2000-04-01"]
In [4]: weekapple
Out[4]:
Empty DataFrame
Columns: [date, open, high, low, close, volume, adj_close]
Index: []
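Another way to make the original .loc slice work is to turn the date column into a DatetimeIndex first. The sketch below uses a small made-up table instead of the Bokeh AAPL sample data, so the dates and prices are purely hypothetical; only the column names date and close follow the question.

```python
import pandas as pd

# Hypothetical stand-in for the AAPL sample data: dates stored as strings.
df = pd.DataFrame({
    "date": ["2000-02-28", "2000-03-01", "2000-03-15",
             "2000-04-01", "2000-04-03"],
    "close": [100.0, 101.5, 99.0, 102.25, 103.0],
})

# Convert the strings to datetimes and use them as a sorted index.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
df = df.set_index("date").sort_index()

# Label-based slicing now selects by date; both end points are included.
weekapple = df.loc["2000-03-01":"2000-04-01"]
print(weekapple)
```

After this, the dates are available as weekapple.index rather than weekapple.date when passing them to a plotting call.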

How do I map df column values to hex color in one go?

I have a pandas dataframe with two columns. One of the columns values needs to be mapped to colors in hex. Another graphing process takes over from there.
This is what I have tried so far. Part of the toy code is taken from here.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# # Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mapper.to_rgba(x))
df
Which outputs:
How do I convert 'some_value' df column values to hex in one go?
Ideally using the sns.cubehelix_palette(light=1)
I am not opposed to using something other than matplotlib
Thanks in advance.
You may use matplotlib.colors.to_hex() to convert a color to hexadecimal representation.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# # Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
df
Efficiency
The above method is easy to use, but may not be very efficient. In the following, let's compare some alternatives.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
def create_df(n=10):
    # Create dataframe
    df = pd.DataFrame(np.random.randint(0,21,size=(n, 2)),
                      columns=['some_value', 'another_value'])
    # Add a nan to handle realworld
    df.iloc[-1] = np.nan
    return df
The following is the solution from above. It applies the conversion to the dataframe row by row. This is quite inefficient.
def apply1(df):
    # map values to colors in hex via
    # matplotlib to_hex by pandas apply
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
    df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
    return df
That's why we might choose to calculate the values into a numpy array first and just assign this array as the newly created column.
def apply2(df):
    # map values to colors in hex via
    # matplotlib to_hex by assigning numpy array as column
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
    a = mapper.to_rgba(df['some_value'])
    df['some_value_color'] = np.apply_along_axis(mcolors.to_hex, 1, a)
    return df
Finally, we may use a lookup table (LUT) which is created from the matplotlib colormap, and index the LUT by the normalized data. Because this solution needs to create the LUT first, it is rather inefficient for dataframes with fewer entries than the LUT has colors, but it will pay off for large dataframes.
def apply3(df):
    # map values to colors in hex via
    # creating a hex lookup table and applying the normalized data to it
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    lut = plt.cm.viridis(np.linspace(0,1,256))
    lut = np.apply_along_axis(mcolors.to_hex, 1, lut)
    a = (norm(df['some_value'].values)*255).astype(np.int16)
    df['some_value_color'] = lut[a]
    return df
Compare the timings
Let's take a dataframe with 10000 rows.
df = create_df(10000)
Original solution (apply1)
%timeit apply1(df)
2.66 s per loop
Array solution (apply2)
%timeit apply2(df)
240 ms per loop
LUT solution (apply3)
%timeit apply3(df)
7.64 ms per loop
In this case the LUT solution is faster than the original solution by roughly a factor of 350 (2.66 s versus 7.64 ms).
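The lookup-table idea can also be written so that missing values are handled explicitly instead of being cast to an integer index. The following is only a sketch of that variant: the function name to_hex_lut, its parameters and the fallback colour for NaN are hypothetical choices, not part of the answer above.

```python
import numpy as np
from matplotlib import cm
from matplotlib import colors as mcolors


def to_hex_lut(values, cmap=cm.viridis, n_colors=256, nan_color="#000000"):
    """Map numbers to hex colour strings through a lookup table.

    nan_color is a hypothetical fallback used for missing values.
    """
    values = np.asarray(values, dtype=float)
    # Build the table once: one hex string per colormap sample.
    lut = np.array([mcolors.to_hex(c)
                    for c in cmap(np.linspace(0, 1, n_colors))])
    vmin, vmax = np.nanmin(values), np.nanmax(values)
    span = vmax - vmin if vmax > vmin else 1.0  # avoid division by zero
    valid = ~np.isnan(values)
    # Scale the valid entries to integer positions 0 .. n_colors - 1.
    idx = np.round((values[valid] - vmin) / span * (n_colors - 1)).astype(np.intp)
    out = np.full(values.shape, nan_color, dtype=object)
    out[valid] = lut[idx]
    return out


hex_colors = to_hex_lut([0, 10, 20, np.nan])
print(hex_colors)
```

The result can be assigned directly as a column, for example df['some_value_color'] = to_hex_lut(df['some_value']).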

Removing the .0's off data when Python reads data from a csv file

I have successfully gotten the volumes to add up correctly, but it is returning the volume as a decimal. All volumes in the CSV file are whole numbers. I would like to have them without the decimal part.
Code is below.
import pandas as pd
datagrid = pd.read_csv("Daily Receipts.csv")
daily_vols = datagrid.groupby("Txn")["Scan Volume"].sum()
print(daily_vols)
The sums come back as float, most likely because the column itself was read as float (for example, when it contains missing values).
Use astype(int) to convert the result:
import pandas as pd
datagrid = pd.read_csv("Daily Receipts.csv")
daily_vols = datagrid.groupby("Txn")["Scan Volume"].sum().astype(int)
print(daily_vols)
This will also do it (note that the result is a plain list, so the Txn labels are lost):
import pandas as pd
datagrid = pd.read_csv("Daily Receipts.csv")
daily_vols = datagrid.groupby("Txn")["Scan Volume"].sum()
daily_vols = [ int(x) for x in daily_vols ]
print(daily_vols)
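To see where the decimals can come from, here is a small self-contained sketch. It assumes the cause is an empty cell in the Scan Volume column, which the question does not confirm; the CSV content is invented for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for "Daily Receipts.csv" with one empty volume cell.
csv_text = """Txn,Scan Volume
2019-01-01,10
2019-01-01,
2019-01-02,7
2019-01-02,5
"""

datagrid = pd.read_csv(StringIO(csv_text))
# The empty cell becomes NaN, which forces the whole column to float64.
print(datagrid["Scan Volume"].dtype)

# sum() skips NaN, so the totals are whole numbers stored as floats.
daily_vols = datagrid.groupby("Txn")["Scan Volume"].sum()

# Casting the summed result is safe because it no longer contains NaN.
daily_int = daily_vols.astype(int)
print(daily_int)
```

Casting after the sum matters in this situation: calling astype(int) on the raw column would fail while it still contains NaN.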

Parse Years in Python 3.4 Pandas and Bokeh from counter dictionary

I'm struggling with creating a Bokeh time series graph from the output of the counter function from collections.
import pandas as pd
from bokeh.plotting import figure, output_file, show
import collections
plotyears = []
counter = collections.Counter(plotyears)
output_file("years.html")
p = figure(width=800, height=250, x_axis_type="datetime")
for number in sorted(counter):
    yearvalue = number, counter[number]
    p.line(yearvalue, color='navy', alpha=0.5)
show(p)
The output of yearvalue when printed is:
(2013, 132)
(2014, 188)
(2015, 233)
How can I make Bokeh use the years as the x-axis and the counts as the y-axis? I have tried to follow the Time series tutorial, but I can't use the pd.read_csv and parse_dates=['Date'] functionality since I'm not reading a CSV file.
The simple way is to convert your data into a pandas DataFrame (with pd.DataFrame) and then create a datetime column from your year column.
Simple example:
import pandas as pd
from bokeh.plotting import figure, output_notebook, show
output_notebook()
years = [2012,2013,2014,2015]
val = [230,120,200,340]
# Convert your data into a pandas DataFrame
data=pd.DataFrame({'year':years, 'value':val})
# Create a new column (yearDate) equal to the year Column but with a datetime format
data['yearDate']=pd.to_datetime(data['year'],format='%Y')
# Create a line graph with datetime x axis and use datetime column(yearDate) for this axis
p = figure(width=800, height=250, x_axis_type="datetime")
p.line(x=data['yearDate'],y=data['value'])
show(p)
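Starting from the Counter itself rather than from two separate lists, the conversion could look roughly like this. The contents of plotyears are invented for the sketch; the column names follow the example above.

```python
import collections

import pandas as pd

# Hypothetical input: one entry per record, as in the question's plotyears list.
plotyears = [2013] * 3 + [2014] * 5 + [2015] * 2
counter = collections.Counter(plotyears)

# Each (year, count) pair becomes one row, ordered by year.
data = pd.DataFrame(sorted(counter.items()), columns=["year", "value"])

# Turn the integer years into datetimes (1 January of each year).
data["yearDate"] = pd.to_datetime(data["year"].astype(str), format="%Y")
print(data)
```

The yearDate and value columns can then be passed to p.line(x=..., y=...) exactly as in the example above.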
