The data comes in 3 columns after orderbook = pd.DataFrame(orderbook_data):
timestamp bids asks
UNIX timestamp [bidprice, bidvolume] [askprice, askvolume]
Each list holds 100 [price, volume] pairs, and the timestamp is the same for all of them.
The problem is that I don't know how to access/index the [price, volume] values inside the list in each row of each column.
I know that by running ---> bids = orderbook["bids"]
I get the list of 100 lists ---> [bidprice, bidvolume]
I'm looking to avoid writing a loop; there has to be a way to just plot the data.
I hope someone can understand my problem. I just want to plot price on x and volume on y. The goal is to make it live.
As you didn't present your input file, I prepared it on my own:
timestamp;bids
1579082401;[123.12, 300]
1579082461;[135.40, 220]
1579082736;[130.76, 20]
1579082801;[123.12, 180]
To read it I used:
orderbook = pd.read_csv('Input.csv', sep=';')
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')
Its content is:
timestamp bids
0 2020-01-15 10:00:01 [123.12, 300]
1 2020-01-15 10:01:01 [135.40, 220]
2 2020-01-15 10:05:36 [130.76, 20]
3 2020-01-15 10:06:41 [123.12, 180]
Now:
timestamp has been converted to the native pandas datetime type,
but bids is of object type (actually, a string),
and I suppose this is the same when read from your input file.
And now the main task: the first step is to extract both numbers from bids,
convert them to float and int, and save them in respective columns:
orderbook = orderbook.join(orderbook.bids.str.extract(
    r'\[(?P<bidprice>\d+\.\d+), (?P<bidvolume>\d+)]'))
orderbook.bidprice = orderbook.bidprice.astype(float)
orderbook.bidvolume = orderbook.bidvolume.astype(int)
Now orderbook contains:
timestamp bids bidprice bidvolume
0 2020-01-15 10:00:01 [123.12, 300] 123.12 300
1 2020-01-15 10:01:01 [135.40, 220] 135.40 220
2 2020-01-15 10:05:36 [130.76, 20] 130.76 20
3 2020-01-15 10:06:41 [123.12, 180] 123.12 180
and you can generate e.g. a scatter plot, calling:
orderbook.plot.scatter('bidprice', 'bidvolume');
or other plotting function.
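As a side note, the same split can be done without a regular expression, e.g. by stripping the brackets and splitting on the comma; a small sketch assuming the same string format as above:
parts = orderbook.bids.str.strip('[]').str.split(', ', expand=True)
orderbook['bidprice'] = parts[0].astype(float)
orderbook['bidvolume'] = parts[1].astype(int)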
Another possibility
Or maybe your orderbook_data is a dictionary? Something like:
orderbook_data = {
    'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
    'bids': [[123.12, 300], [135.40, 220], [130.76, 20], [123.12, 180]]}
In this case, when you create a DataFrame from it, the column types
are initially:
timestamp - int64,
bids - also object, but this time each cell contains a plain
pythonic list.
Then you can also convert timestamp column to datetime just like
above.
But to split bids (a column of lists) into 2 separate columns,
you should run:
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(orderbook.bids.tolist())
Then you have 2 new columns with the respective components of the
source column, and you can create your graphics just like above.
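Since the stated goal is a live plot, here is a minimal end-to-end sketch for the dictionary case. Note that fetch_orderbook() is just a hypothetical placeholder for whatever call returns a fresh orderbook_data dict; the split-and-plot part is exactly the technique above, repeated in a loop with plt.pause to redraw:
import matplotlib.pyplot as plt
import pandas as pd

def fetch_orderbook():
    # hypothetical placeholder: return the latest orderbook_data dict from your source
    return {'timestamp': [1579082401, 1579082461],
            'bids': [[123.12, 300], [135.40, 220]]}

plt.ion()                      # interactive mode, so plt.pause redraws without blocking
fig, ax = plt.subplots()

while True:
    orderbook = pd.DataFrame(fetch_orderbook())
    orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')
    # split the [price, volume] lists into two numeric columns, as above
    orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(orderbook.bids.tolist())
    ax.clear()
    ax.scatter(orderbook.bidprice, orderbook.bidvolume)
    ax.set_xlabel('bidprice')
    ax.set_ylabel('bidvolume')
    plt.pause(1.0)             # refresh roughly once per second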
I've got a dataset where:
customer id represents a unique customer
each customer has multiple invoices
each invoice is marked by a unique identifier (Invoice)
each invoice has multiple items (rows)
I want to determine the time difference between invoices for a customer, in other words, the time between one invoice and the next. Is this possible, and how should I do it with DiffDatetime?
Here is how I am setting up the entities:
es = ft.EntitySet(id="data")
es = es.add_dataframe(
    dataframe=df,
    dataframe_name="items",
    index="items",
    make_index=True,
    time_index="InvoiceDate",
)
es.normalize_dataframe(
    base_dataframe_name="items",
    new_dataframe_name="invoices",
    index="Invoice",
    copy_columns=["Customer ID"],
)
es.normalize_dataframe(
    base_dataframe_name="invoices",
    new_dataframe_name="customers",
    index="Customer ID",
)
I tried:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="invoices",
    agg_primitives=[],
    trans_primitives=["diff_datetime"],
    verbose=True,
)
I also tried changing the target dataframe to invoices or customers, but none of those worked.
The df that I am trying to work on looks like this:
es["invoices"].head()
And what I want can be done with pandas like this:
es["invoices"].groupby("Customer ID")["first_items_time"].diff()
which returns:
489434 NaT
489435 0 days 00:01:00
489436 NaT
489437 NaT
489438 NaT
...
581582 0 days 00:01:00
581583 8 days 01:05:00
581584 0 days 00:02:00
581585 10 days 20:41:00
581586 14 days 02:27:00
Name: first_items_time, Length: 40505, dtype: timedelta64[ns]
Thank you for your question.
You can use the groupby_trans_primitives argument in the call to dfs.
Here is an example:
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="invoices",
    agg_primitives=[],
    groupby_trans_primitives=["diff_datetime"],
    return_types="all",
    verbose=True,
)
The return_types argument is required since DiffDatetime returns a Feature with Timedelta logical type. Without specifying return_types="all", DeepFeatureSynthesis will only return Features with numeric, categorical, and boolean data types.
I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but the underlying data in the resulting DataFrame is all NaN.
thanks in advance!
Two things to notice: you want set_axis rather than reindex (reindex aligns on the existing labels, and since the flat column names don't match the new tuples, every value becomes NaN, which is exactly what you saw; set_axis simply relabels the columns), and sorting the tuples by the original column order ensures the correct label is assigned to the correct column (this is done in the sorted(..., key=...) bit).
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1 cities countries planets
level2 nyc london canada chile earth
0 0.028033 0.540977 -0.056096 1.675698 -0.328630
1 1.170465 -1.003825 0.882126 0.453294 -1.127752
2 -0.187466 -0.192546 0.269802 -1.225172 -0.548491
3 2.272900 -0.085427 0.029242 -2.258696 1.034485
4 -1.243871 -1.660432 -0.051674 2.098602 -2.098941
5 -0.820820 -0.289754 0.019348 0.176778 0.395959
6 1.346459 -0.260583 0.212008 -1.071501 0.945545
7 0.673351 1.133616 1.117379 -0.531403 1.467604
8 0.332187 -3.541103 -0.222365 1.035739 -0.485742
9 -0.605965 -1.442371 -1.628210 -0.711887 -2.104755
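An equivalent way to get the alignment, if you prefer to avoid the sort, is to build the tuple for each existing column through a lookup dict; a small sketch assuming the same df and coltuples as above:
lookup = {tup[1]: tup for tup in coltuples}        # e.g. 'nyc' -> ('cities', 'nyc')
multi_index = pd.MultiIndex.from_tuples([lookup[c] for c in df.columns],
                                        names=['level1', 'level2'])
df = df.set_axis(multi_index, axis=1)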
Is it possible to have an xarray object with multiple columns (data variables) all having the same coordinates? In the following example I create an xarray object and then I want to extract time series data at different locations. However, to do this I have to create a numpy array to store this data and its coordinates.
#Sample from the data in the netCDF file
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset()   # in the real code, ds is read from the netCDF file
ds['temp'] = xr.DataArray(data=np.random.rand(2,3,4), dims=['time','lat','lon'],
                          coords=dict(time=pd.date_range('1900-1-1',periods=2,freq='D'),
                                      lat=[25.,26.,27.], lon=[-85.,-84.,-83.,-82.]))
display(ds)

#lat and lon locations to extract temp values
locations = np.array([[25.6, -84.7], [26, -83], [26.5, -84.1]])

#Extract time series at different locations
temp = np.empty([ds.temp.shape[0], len(locations)])
lat_lon = np.empty([len(locations), 2])
for n in range(locations.shape[0]):
    nearest = ds.temp.sel(lat=locations[n,0], lon=locations[n,1], method='nearest')
    lat_lon[n,0] = nearest.coords['lat'].values
    lat_lon[n,1] = nearest.coords['lon'].values
    temp[:,n] = nearest
print(temp)
print(lat_lon)

#Find maximum temp for all locations:
temp = temp.max(1)
The output of this code is:
array([[[0.67465371, 0.0710136 , 0.03263631, 0.41050204],
        [0.26447469, 0.46503577, 0.5739435 , 0.33725726],
        [0.20353832, 0.01441925, 0.26728572, 0.70531547]],

       [[0.75418953, 0.20321738, 0.41129902, 0.96464691],
        [0.53046103, 0.88559914, 0.20876142, 0.98030988],
        [0.48009467, 0.7906767 , 0.09548439, 0.61088112]]])
Coordinates:
    time     (time) datetime64[ns] 1900-01-01 1900-01-02
    lat      (lat) float64 25.0 26.0 27.0
    lon      (lon) float64 -85.0 -84.0 -83.0 -82.0
Data variables:
    temp     (time, lat, lon) float64 0.09061 0.6634 ... 0.5696 0.4438
Attributes: (0)
[[0.26447469 0.5739435  0.01441925]
 [0.53046103 0.20876142 0.7906767 ]]
[[ 26. -85.]
 [ 26. -83.]
 [ 27. -84.]]
More simply, is there a way to find the maximum temp across all locations for every timestamp without creating the intermediate temp array?
When you create the sample data, you specify 3 values of latitude and 4 values of longitude. That means 12 values in total, on a 2D grid (3D if we add time).
When you want to query values for 3 specific points, you have to query each point individually. As far as I know, there are two ways to do that:
Write a loop and store the result on an intermediate array (your solution)
Stack dimensions and query longitude and latitude simultaneously.
First, you have to express your locations as a list/array of tuples:
locations=np.array([[25.6, -84.7], [26, -83], [26.5, -84.1]])
coords=[(coord[0], coord[1]) for coord in locations]
print(coords)
[(25.6, -84.7), (26.0, -83.0), (26.5, -84.1)]
Then you interpolate your data at the specified locations, stack latitude and longitude into a new dimension coord, select your points, and take the maximum along coord.
(ds
 .interp(lon=locations[:,1], lat=locations[:,0], method='linear')  # interpolate on the grid
 .stack(coord=['lat','lon'])   # from 3x3 grid to a list of 9 points
 .sel(coord=coords)            # select your three points
 .temp.max(dim='coord')        # get the largest temp value along the coord dimension
)
array([0.81316195, 0.56967184]) # your largest values at both timestamps
The downside is that xarray doesn't support interpolation on an unlabeled multi-index, which is why you first need to interpolate the grid onto your set of latitudes and longitudes (NOT simply find the nearest neighbor).
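For completeness, and as an alternative to the stack approach above (not part of it): xarray also supports pointwise selection with DataArray indexers, which avoids both the intermediate numpy arrays and the stacking. A minimal sketch, assuming ds and locations as defined in the question and a nearest-neighbor lookup rather than interpolation:
import xarray as xr

# one indexer per axis, sharing a new 'points' dimension
lat_pts = xr.DataArray(locations[:, 0], dims='points')
lon_pts = xr.DataArray(locations[:, 1], dims='points')

nearest = ds.temp.sel(lat=lat_pts, lon=lon_pts, method='nearest')  # dims: (time, points)
max_per_time = nearest.max(dim='points')                           # max over the 3 locations, per timestamp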
I am downloading data in json format in Python 3.7 and trying to display it as an Excel spreadsheet. I tried using pandas, but I think that there is a problem in getting the data into a dataframe. I also tried a csv approach but that was not successful.
I am new to learning python, so maybe there is something obvious I am missing.
import json, requests, urllib.request, csv
import pandas as pd
url = urllib.request.urlopen('https://library.brown.edu/search/solr_pub/iip/?start=0&rows=100&indent=on&wt=json&q=*')
str_response = url.read().decode('utf-8')
jsondata=json.loads(str_response)
df = pd.DataFrame(jsondata)
I was hoping to get a number of rows for each item, e.g., zoor0353, with the columns for each of the keys associated with it (e.g., region, date_desc, etc. -there are quite a few). Instead, it seemed only to take the first section, returning:
                                             responseHeader                                           response
QTime                                                     1                                                NaN
docs                                                    NaN  [{'inscription_id': 'zoor0353', 'metadata': ['...
numFound                                                NaN                                               4356
params    {'q': '*', 'indent': 'on', 'start': '0', 'rows...                                                NaN
start                                                   NaN                                                  0
status                                                    0                                                NaN
I also tried the json_normalize method, but did no better. Ultimately, I would like to use a dataset built by appending many calls to this API, and I am also wondering whether, and how, I will need to manipulate the data to get it to work with pandas.
Since I'm not sure what exactly you need to get, I'll show you:
how to use pandas.io.json.json_normalize
how to extract specific keys/values
json_normalize
import requests
import pandas as pd
from pandas.io.json import json_normalize
url = 'https://library.brown.edu/search/solr_pub/iip/?start=0&rows=100&indent=on&wt=json&q=*'
r = requests.get(url).json()
df = json_normalize(r['response']['docs']).T
This is a sample output; each column is a separate item of r['response']['docs'] containing values, and the rows are the keys for those values.
The data is all messed up, and there are also some keys you probably don't want to use. That's why I think it's better to extract specific keys and values.
df = pd.DataFrame()
query = r['response']['docs']
for i in range(len(query)):
    records = [
        query[i]['inscription_id'], query[i]['city'], query[i]['_version_'],
        query[i]['type'][0], query[i]['language_display'][0], query[i]['physical_type'][0]
    ]
    data = pd.DataFrame.from_records([records], columns=['inscription_id', 'city', '_version_', 'type', 'language_display', 'physical_type'])
    df = df.append(data)
# sample data
inscription_id city _version_ type language_display physical_type
0 unkn0103 Unknown 1613822149755142144 prayer Hebrew other_object
0 zoor0391 Zoora 1613822173978296320 funerary.epitaph Greek tombstone
0 zoor0369 Zoora 1613822168079007744 funerary.epitaph Greek tomb
0 zoor0378 Zoora 1613822170509606912 funerary.epitaph Greek tombstone
0 zoor0393 Zoora 1613822174648336384 funerary.epitaph Greek tombstone
edit
query=r[][] - what is it and how can I learn more about it?
r = requests.get(url).json()
By calling .json() on the requests.get response, I converted the response object (the data from the API) to a dict. Hence, r is now a dictionary containing these keys:
r.keys()
# dict_keys(['responseHeader', 'response'])
to access values of a specific key, I use a format: dictionary[key][key_inside_of_another_key]:
r['response'].keys()
# dict_keys(['numFound', 'start', 'docs'])
r['response']['docs'][0]
# sample output
{'inscription_id': 'zoor0353',
'metadata': ['zoor0353',
'Negev',
'Zoora',
'Zoora, Negev. cemetery in An Naq. \nNegev.\n Zoora. Found by local inhabitants in the northwest corner of the Bronze\n Age, Byzantine and Islamic cemetery in the An Naq neighborhood south of\n the Wadi al-Hasa, probably in secondary use in later graves. \n',
'funerary.epitaph',
...
...
...
# [0] here is an item inside of a key - in the case of your data,
# there were 100 such items: [0, 1, .. 99] inside of r['response']['docs']
# to access item #4, you'd use ['response']['docs'][4]
That's how you navigate through a dictionary. Now, to access a specific key, say inscription_id or _version_:
r['response']['docs'][0]['inscription_id']
# 'zoor0353'
r['response']['docs'][99]['inscription_id']
# 'mger0001'
r['response']['docs'][33]['_version_']
# 1613822151126679552
Lastly, to iterate through all rows (items of data), I used a for loop: i here is a substitute for the range of numbers representing each item of your data, from r['response']['docs'][0] to r['response']['docs'][99].
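As a side note, the same extraction can be written without appending one-row DataFrames inside the loop, by collecting plain dicts first; a small sketch assuming the same r and the same keys as above:
scalar_keys = ['inscription_id', 'city', '_version_']
list_keys = ['type', 'language_display', 'physical_type']   # take the first element of these

rows = []
for doc in r['response']['docs']:
    row = {k: doc.get(k) for k in scalar_keys}
    row.update({k: doc[k][0] for k in list_keys if doc.get(k)})
    rows.append(row)

df = pd.DataFrame(rows)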
This is by far the most difficult problem I have faced. I am trying to create plots indexed on ratetype. For example, I want to efficiently build a matrix of each unique ratetype against the average customer number for that ratetype. A lambda expression that gets the rows where the value equals each individual ratetype, takes the average customer number for that type, and then builds a series from these two equal-length lists is way over my head in pandas.
The number of different ratetypes can be in the hundreds, so reading them into a list via a lambda would logically be a better choice than hard-coding each possibility, as the list will only grow in size and variability.
""" a section of the data for example use. Working with column "Ratetype"
column "NumberofCustomers" to work towards getting something like
list1 = unique occurs of ratetypes
list2 = avg number of customers for each ratetype
rt =['fixed','variable',..]
avg_cust_numbers = [45.3,23.1,...]
**basically for each ratetype: get mean of all row data for custno column**
ratetype,numberofcustomers
fixed,1232
variable, 1100
vec, 199
ind, 1211
alg, 123
bfd, 788
csv, 129
ggg, 1100
aaa, 566
acc, 439
"""
df[['ratetype', 'numberofcustomers']]
fixed = df.loc[df['ratetype'] == 'fixed']
avg_fixed_custno = fixed.numberofcustomers.mean()
rt_counts = df.ratetype.value_counts()
rt_uniques = df.ratetype.unique()
# rt_uniques would be the same size vector as avg_cust_nos, has to be anyway
avg_cust_nos = [avg_fixed_custno, avg_variable_custno]
My goal is to create and plot these subplots using matplotlib.pyplot.
data = {'ratetypes': pd.Series(rt_counts, index=rt_uniques),
        'Avg_cust_numbers': pd.Series(avg_cust_nos, index=rt_uniques),
        }
df = pd.DataFrame(data)
df = df.sort_values(by=['ratetypes'], ascending=False)

fig, axes = plt.subplots(nrows=2, ncols=1)
for i, c in enumerate(df.columns):
    df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)

plt.savefig('custno_byrate.png', bbox_inches='tight')
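For what it's worth, the per-ratetype averages described above don't need a lambda or a hand-written list at all; groupby does the "for each ratetype, take the mean of the customer column" step in one call. A minimal sketch, assuming the columns are named ratetype and numberofcustomers as in the sample data, reusing the plotting code from the question:
import matplotlib.pyplot as plt
import pandas as pd

summary = pd.DataFrame({
    'ratetypes': df.ratetype.value_counts(),                              # rows per ratetype
    'Avg_cust_numbers': df.groupby('ratetype').numberofcustomers.mean(),  # mean customers per ratetype
})
summary = summary.sort_values(by=['ratetypes'], ascending=False)

fig, axes = plt.subplots(nrows=2, ncols=1)
for i, c in enumerate(summary.columns):
    summary[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)

plt.savefig('custno_byrate.png', bbox_inches='tight')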