I am new to Python. I have a list with a lot of numbers, and I would like to use np.histogram and pandas to generate a histogram-like CSV file:
import numpy as np
import pandas as pd
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
print(bin_edges)
And I get the following output:
[27 97 24 27 11]
[-19.12 -8.406 2.308 13.022 23.736 34.45 ]
Then I tried to write the data into a CSV file:
bins = 5
bin_min = np.delete(bin_edges, bins)   # drop the last edge -> left edges of the bins
bin_max = np.delete(bin_edges, 0)      # drop the first edge -> right edges of the bins
df = pd.DataFrame({'Min': bin_min, 'Max': bin_max, 'Count': counts})
df.to_csv('data.csv', index=False, sep='\t')
However, I got the following file:
Min Max Count
-19.12 -8.405999999999999 27
-8.405999999999999 2.3080000000000034 97
2.3080000000000034 13.02200000000001 24
13.02200000000001 23.736000000000008 27
23.736000000000008 34.45 11
Is there any way to restrict the number of decimal places?
Many thanks,
You can use the float_format parameter of the to_csv() function.
df.to_csv(
'data.csv',
index=False,
sep='\t',
float_format='%.3f')
In the example above, the output is written to 3 decimal places. See the pandas docs and the Python docs for more info.
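For reference, here is a minimal end-to-end sketch (with placeholder data, since the original list isn't shown) that builds the same table using slicing instead of np.delete and writes it with float_format:
import numpy as np
import pandas as pd
# placeholder data purely for illustration
data = np.random.uniform(-20, 35, 200)
counts, bin_edges = np.histogram(data, bins=5)
df = pd.DataFrame({'Min': bin_edges[:-1],   # left edge of each bin
                   'Max': bin_edges[1:],    # right edge of each bin
                   'Count': counts})
# float_format controls how floats are rendered in the CSV text
df.to_csv('data.csv', index=False, sep='\t', float_format='%.3f')
An alternative is to round in the DataFrame itself before writing, e.g. df.round(3).to_csv(...), which changes the stored values rather than just the text output.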
I've got a DataFrame with more than 30000 rows and almost 40 columns, exported from a CSV file.
Most of it mixes str with int features:
- integers are int
- floats and numbers in scientific notation (e+) are str
It looks like this:
Id A B
1 2.5220019e+008 1742087
2 1.7766118e+008 2223964.5
3 3.3750285e+008 2705867.8
4 97782360 2.5220019e+008
I've tried the following code:
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point, LineString, shape
df = pd.read_csv('mycsvfile.csv').astype(float)
Which yields this error message:
ValueError: could not convert string to float: '-1.#IND'
I guess it has to do with the scientific notation (e+) that the Python libraries aren't able to convert.
Is there a way to fix it?
From my conversation with QuangHoang, I should apply the function:
pd.to_numeric(df['column'], errors='coerce')
Since almost the whole DataFrame consists of str objects, I ran the following line:
df2 = df.apply(lambda x : pd.to_numeric(x, errors='coerce'))
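As a small sanity check, here is a toy sketch with made-up values showing what errors='coerce' does:
import pandas as pd
df = pd.DataFrame({'A': ['2.5220019e+008', '97782360', '-1.#IND'],
                   'B': ['1742087', '2223964.5', '2.5220019e+008']})
# strings in scientific notation are parsed to floats; unparseable tokens
# such as '-1.#IND' become NaN instead of raising ValueError
df2 = df.apply(pd.to_numeric, errors='coerce')
print(df2.dtypes)
Passing errors='coerce' through apply like this is equivalent to the lambda version above.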
I am using a pandas DataFrame with datetime indexing. I know from the Xarray documentation that datetime indexing can be done as ds['date.year'], with ds being the xarray DataArray, date the date index, and year the year component of the dates. Xarray points to its datetime components page, which in turn leads to DatetimeIndex in the pandas documentation. So I thought of doing the same with pandas, as I really like this feature.
However, it is not working for me. Here is what I did so far:
# Import required modules
import pandas as pd
import numpy as np
# Create DataFrame (name: df)
df=pd.DataFrame({'Date': ['2017-04-01','2017-04-01',
'2017-04-02','2017-04-02'],
'Time': ['06:00:00','18:00:00',
'06:00:00','18:00:00'],
'Active': [True,False,False,True],
'Value': np.random.rand(4)})
# Combine str() information of Date and Time and format to datetime
df['Date']=pd.to_datetime(df['Date'] + ' ' + df['Time'],format = '%Y-%m-%d %H:%M:%S')
# Make the combined data the index
df = df.set_index(df['Date'])
# Erase the rest, as it is not required anymore
df = df.drop(['Time','Date'], axis=1)
# Show me the first day
df['2017-04-01']
Ok, so this shows me only the first entries. So far, so good.
However
df['Date.year']
results in KeyError: 'Date.year'
I would expect an output like
array([2017,2017,2017,2017])
What am I doing wrong?
EDIT:
I have a workaround that I can go on with, but I am still not satisfied, as it doesn't answer my question. I did not use a pandas DataFrame but an xarray Dataset, and now this works:
# Load modules
import pandas as pd
import numpy as np
import xarray as xr
# Prepare time array
Date = ['2017-04-01','2017-04-01', '2017-04-02','2017-04-02']
Time = ['06:00:00','18:00:00', '06:00:00','18:00:00']
time = [Date[i] + ' ' + Time[i] for i in range(len(Date))]
time = pd.to_datetime(time,format = '%Y-%m-%d %H:%M:%S')
# Create Dataset (name: ds)
ds=xr.Dataset({'time': time,
'Active': [True,False,False,True],
'Value': np.random.rand(4)})
ds['time.year']
which gives:
<xarray.DataArray 'year' (time: 4)>
array([2017, 2017, 2017, 2017])
Coordinates:
* time (time) datetime64[ns] 2017-04-01T06:00:00 ... 2017-04-02T18:00:00
Just in terms of what you're doing wrong, you are:
a) trying to call an index as a series
b) chaining commands within a string: df['Date'] is a single column, while df['Date.year'] looks for a column called 'Date.year'
If your datetime is the index, then use .year on the index, or .dt.year if it's a series.
df.index.year
# or, assuming your dtype is a proper datetime (your code indicates it is)
df.Date.dt.year
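For example, applied to the DataFrame built in the question (a quick sketch reusing that construction, where the datetime ends up as the index):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': ['2017-04-01', '2017-04-01', '2017-04-02', '2017-04-02'],
                   'Time': ['06:00:00', '18:00:00', '06:00:00', '18:00:00'],
                   'Active': [True, False, False, True],
                   'Value': np.random.rand(4)})
df['Date'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%Y-%m-%d %H:%M:%S')
df = df.set_index('Date').drop(columns='Time')
# the year component lives on the DatetimeIndex itself
print(df.index.year)   # e.g. [2017, 2017, 2017, 2017]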
hope that helps bud.
I have a dataframe with 3 columns of temperature values, as follows:
T1 at 1000 m
T2 at 2000 m
T3 at 3000 m
And I have a list with different elevations ranging from 1000 to 3000.
For each elevation I want to create the interpolated temperature.
The main issue is that I can't make my code automatically select the correct columns. For example, if my target elevation is 1500, I want to interpolate between the 1000 and 2000 columns. I am aiming for simple linear interpolation. I tried the method suggested in Pandas: Make a new column by linearly interpolating between existing columns
But I kept getting TypeError: 'zip' object is not subscriptable
Can you help me solve this problem?
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
data = np.random.randint(1, high=50, size=len(days))
Elevation= np.random.randint(1000, high=3000, size=len(days))
df = pd.DataFrame({'Time': days, 'T1000':data,'T2000':data,'T3000':data} )
df = df.set_index('Time')
print(df)
You can simply interpolate for each row of the DataFrame:
df['Alti'] = Elevation
df['Val'] = df.apply(lambda x: np.interp(x.Alti, [1000, 2000, 3000], x['T1000':'T3000']),
axis=1)
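As a toy check of what np.interp does for a single row (with made-up temperatures of 10, 6 and 2 degrees at 1000, 2000 and 3000 m):
import numpy as np
# halfway between 1000 m (10 degrees) and 2000 m (6 degrees)
print(np.interp(1500, [1000, 2000, 3000], [10, 6, 2]))   # -> 8.0
Note that np.interp expects the x-coordinates (here the elevations) to be increasing, which they are in this setup.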
Good morning,
I'm using Python 3.6. I'm trying to name my index (see the last line in the code below) because I plan on joining to another DataFrame. The DataFrame should be multi-indexed: the index is the first two columns ('currency' and 'rtdate'), and the data looks like this:
rate
AUD 2010-01-01 0.897274
2010-02-01 0.896608
2010-03-01 0.895943
2010-04-01 0.895277
2010-05-01 0.894612
This is the code that I'm running:
import pandas as pd
import numpy as np
import datetime as dt
df=pd.read_csv('file.csv',index_col=0)
df.index = pd.to_datetime(df.index)
new_index = pd.date_range(df.index.min(),df.index.max(),freq='MS')
df=df.reindex(new_index)
df=df.interpolate().unstack()
rate = pd.DataFrame(df)
rate.columns = ['rate']
rate.set_index(['currency','rtdate'],drop=False)
Running this throws an error message:
KeyError: 'currency'
What am I missing?
Thanks for the assistance
I think you need to set the names of the MultiIndex levels using rename_axis first, and then use reset_index to turn the levels into columns:
So you'd end up with this:
rate = df.interpolate().unstack().rename_axis(['currency','rtdate']).reset_index()
instead of this:
df=df.interpolate().unstack()
rate = pd.DataFrame(df)
rate.columns = ['rate']
rate.set_index(['currency','rtdate'],drop=False)
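For a self-contained illustration, here is a minimal sketch with made-up rates showing the same chain; the extra .rename('rate') step (not in the original code) also names the value column so reset_index produces a 'rate' column:
import pandas as pd
df = pd.DataFrame({'AUD': [0.897274, 0.896608], 'EUR': [1.35, 1.36]},
                  index=pd.to_datetime(['2010-01-01', '2010-02-01']))
rate = (df.interpolate()
          .unstack()                             # Series with a 2-level MultiIndex
          .rename_axis(['currency', 'rtdate'])   # name the index levels
          .rename('rate')                        # name the values
          .reset_index())                        # levels become ordinary columns
print(rate)   # columns: currency, rtdate, rate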
I want to read a pickle file in Python 3.5. I am using the following code.
The following is my output; I want to load it as a pandas DataFrame.
When I try to convert it into a DataFrame using df = pd.DataFrame(df), I get the error below.
ValueError: arrays must all be same length
link to data- https://drive.google.com/file/d/1lSFBPLbUCluWfPjzolUZKmD98yelTSXt/view?usp=sharing
I think you need a dict comprehension with concat:
import pandas as pd
import pickle
from pandas.io.json import json_normalize
fh = open("imdbnames40.pkl", 'rb')
d = pickle.load(fh)
df = pd.concat({k:json_normalize(v, 'scores', ['best']) for k,v in d.items()})
print (df.head())
ethnicity score best
'Aina Rapoza 0 Asian 0.89 Asian
1 GreaterAfrican 0.05 Asian
2 GreaterEuropean 0.06 Asian
3 IndianSubContinent 0.11 GreaterEastAsian
4 GreaterEastAsian 0.89 GreaterEastAsian
Then, if you need a column from the first level of the MultiIndex:
df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
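To illustrate what json_normalize(v, 'scores', ['best']) produces, here is a toy record shaped like the output above (the values are made up):
from pandas.io.json import json_normalize
v = {'best': 'Asian',
     'scores': [{'ethnicity': 'Asian', 'score': 0.89},
                {'ethnicity': 'GreaterAfrican', 'score': 0.05}]}
# 'scores' is the record path that gets flattened into rows;
# 'best' is carried over as metadata onto every row
print(json_normalize(v, 'scores', ['best']))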