From list of dict to pandas DataFrame - python-3.x

I'm retrieving real-time financial data.
Every 1 second, I pull the following list:
[{'symbol': 'ETHBTC', 'price': '0.03381600'}, {'symbol': 'LTCBTC', 'price': '0.01848300'}...]
The goal is to put this list into an already-existing pandas DataFrame.
What I've done so far is convert this list of dictionaries to a pandas DataFrame. My problem is that the symbols and prices end up in two columns. I would like to have the symbols as the DataFrame header and to add a new row every second containing the price values.
marketInformation = [{'symbol': 'ETHBTC', 'price': '0.03381600'}, {'symbol': 'LTCBTC', 'price': '0.01848300'}...]
data = pd.DataFrame(marketInformation)
header = data['symbol'].values
newData = pd.DataFrame(columns=header)
while True:
    realTimeData = ...  # get a new marketInformation list of dicts
    newData.append(pd.DataFrame(realTimeData)['price'])
    print(newData)
Unfortunately, the printed DataFrame is always empty. I would like to have a new row added every second with new prices for each symbol with the current time.
I printed the below part:
pd.DataFrame(realTimeData)['price']
and it gives me a pandas.core.series.Series object with a length equal to the number of symbols.
What's wrong?

`DataFrame.append` returns a new frame rather than modifying `newData` in place, which is why your printed DataFrame stays empty. After you create newData, just do:
newData.loc[len(newData), :] = [item['price'] for item in realTimeData]
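Putting it together, a minimal runnable sketch of the polling loop (here `get_market_information` is a stand-in for the real fetch, and the loop runs three times instead of forever):

```python
import pandas as pd

marketInformation = [{'symbol': 'ETHBTC', 'price': '0.03381600'},
                     {'symbol': 'LTCBTC', 'price': '0.01848300'}]

def get_market_information():
    # placeholder for the real-time API call
    return marketInformation

header = [item['symbol'] for item in marketInformation]
newData = pd.DataFrame(columns=header)

for _ in range(3):  # stand-in for `while True`
    realTimeData = get_market_information()
    # assigning to a new .loc label grows the frame in place
    newData.loc[pd.Timestamp.now()] = [float(item['price']) for item in realTimeData]

print(newData)
```

Using the current timestamp as the row label gives you the "current time per row" the question asks for without a separate time column.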

You just need to set_index() and then transpose the df:
newData = pd.DataFrame(marketInformation).set_index('symbol').T
#In [245]: newData
#Out[245]:
#symbol      ETHBTC      LTCBTC
#price   0.03381600  0.01848300
# then change the index to whatever makes sense for your data
newdata_time = pd.Timestamp.now()
newData = newData.rename(index={'price': newdata_time})
#Out[246]:
#symbol                          ETHBTC      LTCBTC
#2019-04-03 17:08:51.389359  0.03381600  0.01848300
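To keep collecting rows over time with this transpose approach, one option (a sketch; the two-tick loop stands in for the real polling) is to build a one-row frame per tick and concatenate at the end:

```python
import pandas as pd

marketInformation = [{'symbol': 'ETHBTC', 'price': '0.03381600'},
                     {'symbol': 'LTCBTC', 'price': '0.01848300'}]

rows = []
for _ in range(2):  # stand-in for the polling loop
    # one-row frame per tick, indexed by the time it was pulled
    tick = (pd.DataFrame(marketInformation)
              .set_index('symbol')
              .T
              .rename(index={'price': pd.Timestamp.now()}))
    rows.append(tick)

newData = pd.concat(rows)
print(newData)
```

Concatenating once at the end is also cheaper than growing the frame row by row inside the loop.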

Related

Creating a pandas DataFrame from json object and appending a column to it

From an API, I get information about who has filled in a particular form and when they did it, as a JSON object. I get the data below from 2 forms, formid = ["61438732", "48247759dc"]. The JSON object is stored in the r_json_online variable.
r_json_online = {
    'results': [
        {'submittedAt': 1669963478503,
         'values': [{'email': 'brownsvilleselect@gmail.com'}]},
        {'submittedAt': 1669963259737,
         'values': [{'email': 'brewsterdani33@gmail.com'}]},
        {'submittedAt': 1669963165956,
         'values': [{'email': 'thesource95@valpo.edu'}]}
    ]
}
I have used the json_normalize function to de-nest the json object and insert the values into a DataFrame called form_submissions. This is the code I have used
import pandas as pd
from pandas import json_normalize

submissions = []
formid = ["61438732", "48247759dc"]
for i in range(0, len(formid)):
    submissions.extend(r_json_online["results"])
    form_submissions = pd.DataFrame()
    for j in submissions:
        form_submissions = form_submissions.append(json_normalize(j["values"]))
        form_submissions = form_submissions.append({'createdOn': j["submittedAt"]}, ignore_index=True)
        form_submissions = form_submissions.append({'formid': formid[i]}, ignore_index=True)
form_submissions['createdOn'] = form_submissions['createdOn'].fillna(method='bfill')
form_submissions['formid'] = form_submissions['formid'].fillna(method='bfill')
form_submissions = form_submissions.dropna(subset='email')
Code explanation:
I have created an empty list called submissions
For each value in the formid list, I'm running the for loop.
In the for loop:
a. I have added data to the submissions list
b. Created an empty DataFrame, normalized the json object and appended the values to the DataFrame from each element in the submissions list
Expected Output:
I wanted the first 3 rows to have formid = '61438732'
The next 3 rows should have the formid = '48247759dc'
Actual Output:
The formid is the same for all the rows
The problem is that you have the line `form_submissions = pd.DataFrame()` inside the loop, which resets your dataframe on each iteration.
This can easily be attained by converting into two dataframes and doing a cartesian product (cross merge) between them.
formids = ["61438732", "48247759dc"]
form_submissions_df = json_normalize(r_json_online['results'], record_path=['values'], meta=['submittedAt'])
# converting form_ids list to dataframe
form_ids_df = pd.DataFrame(formids, columns=['form_id'])
# cross merge for cartesian product result
form_submissions_df.merge(form_ids_df, how="cross")

How to make a function with dynamic variables to add rows in pandas?

I'm trying to make a table from a list of data using pandas.
Originally I wanted to make a function where I can pass dynamic variables so I could continuously add new rows from data list.
It works up until the point where the row-adding part begins. The column headers are added, but the data is not: it either keeps a value only in the last column or adds nothing.
My first attempt was:
for title in titles:
    for x in data:
        table = {
            title: data[x]
        }
df.DataFrame(table, columns=titles, index[0]
columns list:
titles = ['timestamp', 'source', 'tracepoint']
data list:
data = ['first', 'second', 'third',
        'first', 'second', 'third',
        'first', 'second', 'third']
How can I make something like this?
timestamp, source, tracepoint
first, second, third
first, second, third
first, second, third
If you just want to initialize a pandas DataFrame, you can use the DataFrame constructor.
You can also append a row using a dict.
Pandas provides other useful functions, such as concatenation between data frames and inserting/deleting columns. If you need them, please check pandas's docs.
import pandas as pd

# initialization by dataframe's constructor
titles = ['timestamp', 'source', 'tracepoint']
data = [['first', 'second', 'third'],
        ['first', 'second', 'third'],
        ['first', 'second', 'third']]
df = pd.DataFrame(data, columns=titles)
print('---initialization---')
print(df)

# append row
new_row = {
    'timestamp': '2020/11/01',
    'source': 'xxx',
    'tracepoint': 'yyy'
}
df = df.append(new_row, ignore_index=True)
print('---append result---')
print(df)
output
---initialization---
    timestamp  source tracepoint
0       first  second      third
1       first  second      third
2       first  second      third
---append result---
    timestamp  source tracepoint
0       first  second      third
1       first  second      third
2       first  second      third
3  2020/11/01     xxx        yyy
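One caveat: `DataFrame.append` was removed in pandas 2.0, so on current versions the same row-append can be written with `pd.concat` (a sketch):

```python
import pandas as pd

titles = ['timestamp', 'source', 'tracepoint']
data = [['first', 'second', 'third']] * 3
df = pd.DataFrame(data, columns=titles)

new_row = {'timestamp': '2020/11/01', 'source': 'xxx', 'tracepoint': 'yyy'}
# wrap the dict in a one-row DataFrame, then concatenate
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(df)
```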

Convert huge number of lists to pandas dataframe

User-defined function: my_fun(x), which returns a list.
XYZ = a file with LOTS of lines.
pandas_frame = pd.DataFrame()  # created empty data frame
for index in range(0, len(XYZ)):
    pandas_frame = pandas_frame.append(pd.DataFrame(my_fun(XYZ[index])).transpose(), ignore_index=True)
This code takes a very long time to run, as in days. How do I speed it up?
I think you need to apply the function to each row, building a new list with a list comprehension, and then call the DataFrame constructor only once:
L = [my_fun(i) for i in range(len(XYZ))]
df = pd.DataFrame(L)
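A runnable illustration of the pattern, with a stand-in `my_fun` (assumed here to return a fixed-length list per line; the real function comes from the question):

```python
import pandas as pd

XYZ = ['line %d' % n for n in range(1000)]  # stand-in for the file's lines

def my_fun(x):
    # placeholder: a real implementation would parse the line
    return [x, len(x)]

# build the whole list first, then call the constructor exactly once
L = [my_fun(line) for line in XYZ]
df = pd.DataFrame(L)
print(df.shape)  # (1000, 2)
```

The speed-up comes from avoiding the quadratic copying of repeated `append`: each `append` call copies the entire frame built so far.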

List iterations and regex, what is the better way to remove the text I don't need?

We handle data from volunteers; that data is entered into a form using ODK. When the data is downloaded, the header (column names) row contains a lot of 'stuff' we don't need. The pattern is as follows:
'Group1/most_common/G27'
I want to replace the column names (there can be up to 200) or create a copy of the DataFrame with column names that just contain the G-code (Gxxx). I think I got it.
What is the faster or better way to do this?
Is the output reliable in terms of sort order? As of now it appears that the results list is in the same order as the original list.
import re

y = ['Group1/most common/G95', 'Group1/most common/G24', 'Group3/plastics/G132']
r = []
for x in y:
    m = re.findall(r'G\d+', x)
    r.append(m)
# the comprehension below is to flatten it
# (r.append(m) gives me a list of lists, each with one item)
results = [q for t in r for q in t]
print(results)
['G95', 'G24', 'G132']
The idea would be to iterate through the column names in the DataFrame (or a copy), delete what I don't need and replace (inplace=True).
Thanks for your input.
You can use str.extract:
df = pd.DataFrame(columns=['Group1/most common/G95',
                           'Group1/most common/G24',
                           'Group3/plastics/G132'])
print (df)
Empty DataFrame
Columns: [Group1/most common/G95, Group1/most common/G24, Group3/plastics/G132]
Index: []
df.columns = df.columns.str.extract(r'(G\d+)', expand=False)
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
Another solution with rsplit and select last values with [-1]:
df.columns = df.columns.str.rsplit('/').str[-1]
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
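If you would rather keep the `re`-based approach from the question and rename the columns directly, a callable passed to `rename` does the same job (a sketch, assuming every column name contains exactly one G-code):

```python
import re
import pandas as pd

df = pd.DataFrame(columns=['Group1/most common/G95',
                           'Group1/most common/G24',
                           'Group3/plastics/G132'])

# rename preserves column order, so the original sort order is kept
df = df.rename(columns=lambda c: re.search(r'G\d+', c).group(0))
print(list(df.columns))  # ['G95', 'G24', 'G132']
```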

assigning a list of tokens as a row to dataframe

I am attempting to create a dataframe where the first column is a list of tokens and where additional columns of information can be added. However, pandas will not allow a list of tokens to be added as one column.
So the code looks as below:
array1 = ['two', 'sample', 'statistical', 'inferences', 'includes']
array2 = ['references', 'please', 'see', 'next', 'page', 'the','material', 'of', 'these']
array3 = ['time', 'student', 'interest', 'and', 'lecturer', 'preference', 'other', 'topics']
## initialise list
list = []
list.append(array1)
list.append(array2)
list.append(array3)
## create dataFrame
numberOfRows = len(list)
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns = ('data', 'diversity'))
df.iloc[0] = list[0]
the error message reads
ValueError: cannot copy sequence with size 6 to array axis with dimension 2
Any insight into how I can better achieve creating a dataframe and updating columns would be appreciated.
Thanks
OK, so the answer was fairly simple; posting it for posterity.
When adding lists as rows I needed to include the column name and position, so the code looks like below:
df.data[0] = array1
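An alternative that avoids row-by-row assignment is to hand the token lists straight to the constructor, so each list becomes one cell in the 'data' column (a sketch; the `None` diversity values are placeholders):

```python
import pandas as pd

array1 = ['two', 'sample', 'statistical', 'inferences', 'includes']
array2 = ['references', 'please', 'see', 'next', 'page', 'the', 'material', 'of', 'these']
array3 = ['time', 'student', 'interest', 'and', 'lecturer', 'preference', 'other', 'topics']

# each inner list is stored whole in a single 'data' cell
df = pd.DataFrame({'data': [array1, array2, array3],
                   'diversity': [None, None, None]})
print(df['data'][0])
```

Building the frame in one shot also sidesteps the shape mismatch that caused the ValueError in the question.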
