assigning a list of tokens as a row to dataframe - python-3.x

I am attempting to create a dataframe where the first column is a list of tokens and where additional columns of information can be added. However pandas will not allow a list of tokens to be added as one column.
So code looks as below
array1 = ['two', 'sample', 'statistical', 'inferences', 'includes']
array2 = ['references', 'please', 'see', 'next', 'page', 'the','material', 'of', 'these']
array3 = ['time', 'student', 'interest', 'and', 'lecturer', 'preference', 'other', 'topics']
## initialise list
list = []
list.append(array1)
list.append(array2)
list.append(array3)
## create dataFrame
numberOfRows = len(list)
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns = ('data', 'diversity'))
df.iloc[0] = list[0]
the error message reads
ValueError: cannot copy sequence with size 6 to array axis with dimension 2
Any insight into how I can better achieve creating a dataframe and updating columns would be appreciated.
Thanks

ok so the answer was fairly simple posting it for prosperity.
When adding lists as rows I needed to include the column name and position..
so the code looks like below.
df.data[0] = array1

Related

Adding new column to a pandas dataframe and inserting data corresponding to the existing data

I have a pandas dataframe which has columns like this.
Columns: [jobId, trainingInput, createTime, startTime, endTime, state, trainingOutput, etag, errorMessage, labels].
The trainingInput is a dictionary where I am stripping part of the data and creating a new column 'serviceAccount' to the dataframe. I am getting random data in the new column.
Dataframe sample:
trainingInput = {'scaleTier': 'CUSTOM', 'masterType': 'n2-highmem-4', 'packageUris': ['gs:/marketprice_aitp-0.0.0.tar.gz'] 'pythonVersion': '3.7', 'serviceAccount': 'projects/kk1'}
the serviceAccount should be - 'projects/kk1' and should be inserted to the corresponding trainingInput.
**Expected output:**
trainingInput serviceAccount
{'scaleTier': 'CUSTOM', 'masterType': projects/kk1
'n2-highmem-4', 'packageUris':
['gs:/marketprice_aitp-0.0.0.tar.gz']
'pythonVersion': '3.7',
'serviceAccount': 'projects/kk1'}
But what I am getting is random data for serviceAccount based on the code below.
k_1 = df['trainingInput'].values
temp =[]
for i in k_1:
m= json.dumps(i)
k = json.loads(m)
temp.append(k['serviceAccount'])
try:
df5 =pd.DataFrame(temp)
# df.merge(pd.DataFrame(data=[df5.values] * len(df), columns=df5.index, #index=df.index), left_index=True, right_index=True)
df['serviceaccount'] =k['serviceAccount']
# df['region'] = k['region']
except KeyError:
df['serviceaccount'] = 'None'
I tried the merge commented here as well. But I get the error "ValueError: Buffer has wrong number of dimensions (expected 1, got 2)". Please let me know your thoughts. Thanks.

How to make a function with dynamic variables to add rows in pandas?

I'm trying to make a table from a list of data using pandas.
Originally I wanted to make a function where I can pass dynamic variables so I could continuously add new rows from data list.
It works up until a point where adding rows part begun. Column headers are adding, but the data - no. It either keeps value at only last col or adds nothing.
My scrath was:
for title in titles:
for x in data:
table = {
title: data[x]
}
df.DataFrame(table, columns=titles, index[0]
columns list:
titles = ['timestamp', 'source', 'tracepoint']
data list:
data = ['first', 'second', 'third',
'first', 'second', 'third',
'first', 'second', 'third']
How can I make something like this?
timestamp, source, tracepoint
first, second, third
first, second, third
first, second, third
If you just want to initialize pandas dataframe, you can use dataframe’s constructor.
And you can also append row by using a dict.
Pandas provides other useful functions,
such as concatenation between data frames, insert/delete columns. If you need, please check pandas’s doc.
import pandas as pd
# initialization by dataframe’s constructor
titles = ['timestamp', 'source', 'tracepoint']
data = [['first', 'second', 'third'],
['first', 'second', 'third'],
['first', 'second', 'third']]
df = pd.DataFrame(data, columns=titles)
print('---initialization---')
print(df)
# append row
new_row = {
'timestamp': '2020/11/01',
'source': 'xxx',
'tracepoint': 'yyy'
}
df = df.append(new_row, ignore_index=True)
print('---append result---')
print(df)
output
---initialization---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
---append result---
timestamp source tracepoint
0 first second third
1 first second third
2 first second third
3 2020/11/01 xxx yyy

Splitting Multiple values inside a Pandas Column into Separate Columns

I have a dataframe with column which contains two different column values and their name as follows:
How Do I transform it into separate columns?
So far, I tried Following:
use df[col].apply(pd.Series) - It didn't work since data in column is not in dictionary format.
Tried separating columns by a semi-colon (";") sign but It is not a good idea since the given dataframe might have n number of column based on response.
EDIT:
Data in plain text format:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
How about:
df2 = (df["ClusterName"]
.str.replace("Date:", "")
.str.replace("Bucket:", "")
.str.split(";", expand=True))
df2.columns = ["Date", "Bucket"]
EDIT:
Without hardcoding the variable names, here's a quick hack. You can clean it up (and make less silly variable names):
df_temp = df.ClusterName.str.split(";", expand=True)
cols = []
for col in df_temp:
df_temptemp = df_temp[col].str.split(":", expand=True)
df_temp[col] = df_temptemp[1]
cols.append(df_temptemp.iloc[0, 0])
df_temp.columns = cols
So .. maybe like this ...
Setup the data frame
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
df = pd.DataFrame(data=d)
df
Parse over the dataframe breaking apart by colon and semi-colon
ls = []
for index, row in df.iterrows():
splits = row['ClusterName'].split(';')
print(splits[0].split(':')[1],splits[1].split(':')[1])
ls.append([splits[0].split(':')[1],splits[1].split(':')[1]])
df = pd.DataFrame(ls, columns =['Date', 'Bucket'])

From list of dict to pandas DataFrame

I'm retrieving real-time financial data.
Every 1 second, I pull the following list:
[{'symbol': 'ETHBTC', 'price': '0.03381600'}, {'symbol': 'LTCBTC', 'price': >'0.01848300'}...]
The goal is to put this list into an already-existing pandas DataFrame.
What I've done so far is converting this list of a dictionary to a pandas DataFrame. My problem is that symbols and prices are in two columns. I would like to have symbols as the DataFrame header and add a new row every 1 second containing price's values.
marketInformation = [{'symbol': 'ETHBTC', 'price': '0.03381600'}, {'symbol': 'LTCBTC', 'price': >'0.01848300'}...]
data = pd.DataFrame(marketInformation)
header = data['symbol'].values
newData = pd.DataFrame(columns=header)
while True:
realTimeData = ... // get a new marketInformation list of dict
newData.append(pd.DataFrame(realTimeData)['price'])
print(newData)
Unfortunately, the printed DataFrame is always empty. I would like to have a new row added every second with new prices for each symbol with the current time.
I printed the below part:
pd.DataFrame(realTimeData)['price']
and it gives me a pandas.core.series.Series object with a length equals to the number of symbol.
What's wrong?
After you create newData, just do:
newData.loc[len(newData), :] = [item['price'] for item in realTimeData]
You just need to set_index() and then transpose the df:
newData = pd.DataFrame(marketInformation).set_index('symbol').T
#In [245]: newData
#Out[245]:
#symbol ETHBTC LTC
#price 0.03381600 0.01848300
# then change the index to whatever makes sense to your data
newdata_time = pd.Timestamp.now()
newData.rename(index={'price':newdata_time})
#Out[246]:
#symbol ETHBTC LTC
#2019-04-03 17:08:51.389359 0.03381600 0.01848300

List iterations and regex, what is the better way to remove the text I don' t need?

We handle data from volunteers, that data is entered in to a form using ODK. When the data is downloaded the header (column names) row contains a lot of 'stuff' we don' t need. The pattern is as follows:
'Group1/most_common/G27'
I want to replace the column names (there can be up to 200) or create a copy of the DataFrame with column names that just contain the G-code (Gxxx). I think I got it.
What is the faster or better way to do this?
IS the output reliable in terms of sort order? As of now it appears that the results list is in the same order as the original list.
y = ['Group1/most common/G95', 'Group1/most common/G24', 'Group3/plastics/G132']
import re
r = []
for x in y:
m = re.findall(r'G\d+', x)
r.append(m)
# the comprehension below is to flatten it
# append.m gives me a list of lists (each list has one item)
results = [q for t in r for q in t]
print(results)
['G95', 'G24', 'G132']
The idea would be to iterate through the column names in the DataFrame (or a copy), delete what I don't need and replace (inplace=True).
Thanks for your input.
You can use str.extract:
df = pd.DataFrame(columns=['Group1/most common/G95',
'Group1/most common/G24',
'Group3/plastics/G132'])
print (df)
Empty DataFrame
Columns: [Group1/most common/G95, Group1/most common/G24, Group3/plastics/G132]
Index: []
df.columns = df.columns.str.extract('(G\d+)', expand=False)
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
Another solution with rsplit and select last values with [-1]:
df.columns = df.columns.str.rsplit('/').str[-1]
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []

Resources