I have the following Python pandas dataframe:
There are more EventName values than shown for this date.
Each will have Race_Number = 'Race 1', 'Race 2', etc.
After a while the date increments.
I'm trying to create a dataframe that looks like this:
Each race has different numbers of runners.
Is there a way to do this in pandas?
Thanks
I assumed output would be another DataFrame.
import pandas as pd
import numpy as np
from nltk import flatten
import copy
df = pd.DataFrame({'EventName': ['sydney', 'sydney', 'sydney', 'sydney', 'sydney', 'sydney'],
                   'Date': ['2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01'],
                   'Race_Number': ['Race1', 'Race1', 'Race1', 'Race2', 'Race2', 'Race3'],
                   'Number': [4, 7, 2, 9, 5, 10]
                   })
print(df)
# collect the runner numbers of each race under its Race_Number
dic = {}
for rows in df.itertuples():
    if rows.Race_Number in dic:
        dic[rows.Race_Number] = flatten([dic[rows.Race_Number], rows.Number])
    else:
        dic[rows.Race_Number] = rows.Number
# rename the keys 'Race1', 'Race2', ... to 0, 1, ...
copy_dic = copy.deepcopy(dic)
seq = np.arange(0, len(dic.keys()))
for key, n_key in zip(copy_dic, seq):
    dic[n_key] = dic.pop(key)
# one row, one column per race
df = pd.DataFrame([dic])
print(df)
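If the goal is simply one list of runner numbers per race, a groupby can replace the explicit loop. This is a minimal sketch under that assumption (the desired layout is not shown in the question); the name `runners` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'EventName': ['sydney'] * 6,
    'Date': ['2019-01.01'] * 6,
    'Race_Number': ['Race1', 'Race1', 'Race1', 'Race2', 'Race2', 'Race3'],
    'Number': [4, 7, 2, 9, 5, 10],
})

# Collect the runner numbers of each race into a list. Races can have
# different numbers of runners, so each cell holds a variable-length list.
runners = (
    df.groupby(['EventName', 'Date', 'Race_Number'], sort=False)['Number']
      .agg(list)
      .reset_index()
)
print(runners)
```

Grouping on EventName and Date as well keeps races from different events or days apart once the frame holds more than one date.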
import pandas as pd
import numpy as np
I do have 3 dataframes df1, df2 and df3.
df1=
data = {'Period': ['2024-04-01', '2024-07-01', '2024-10-01', '2025-01-01', '2025-04-01', '2025-07-01', '2025-10-01', '2026-01-01', '2026-04-01', '2026-07-01', '2026-10-01', '2027-01-01', '2027-04-01', '2027-07-01', '2027-10-01', '2028-01-01', '2028-04-01', '2028-07-01', '2028-10-01'],
        'Price': ['NaN','NaN','NaN','NaN', 'NaN','NaN','NaN','NaN', 'NaN','NaN','NaN','NaN',
                  'NaN','NaN','NaN','NaN', 'NaN','NaN','NaN'],
        'years': [2024,2024,2024,2025,2025,2025,2025,2026,2026,2026,2026,2027,2027,2027,2027,2028,
                  2028,2028,2028],
        'quarters':[2,3,4, 1,2,3,4, 1,2,3,4, 1,2,3,4, 1,2,3,4]
       }
df1 = pd.DataFrame(data=data)
df2=
data = {'price': [473.26, 244, 204, 185, 152, 157],
        'year': [2023, 2024, 2025, 2026, 2027, 2028]
       }
df2 = pd.DataFrame(data=data)
df3=
data = {'quarters': [1, 2, 3, 4],
        'weights': [1.22, 0.81, 0.83, 1.12]
       }
df3 = pd.DataFrame(data=data)
My aim is to compute the Price column of df1. For each row of df1, check a condition and carry out the calculation accordingly. For example, for the first row, check whether df1['years'] == 2024 and df1['quarters'] == 2. Then df1['Price'] = df2.loc[df2['year'] == 2024, 'price'] * df3.loc[df3['quarters'] == 2, 'weights'].
===>>> df1['Price'][0] = 473.26*0.81.
df1['Price'][1] = 473.26*0.83.
...
...
...
and so on.
I could have used this method, but I want to write the code in a more efficient way. I would like to use the following code structure.
for i in range(len(df1)):
    if (df1.loc[i, 'years'] == 2024) and (df1.loc[i, 'quarters'] == 2):
        df1.loc[i, 'Price'] = df2.loc[df2['year'] == 2024, 'price'].iloc[0] * df3.loc[df3['quarters'] == 2, 'weights'].iloc[0]
    elif (df1.loc[i, 'years'] == 2024) and (df1.loc[i, 'quarters'] == 3):
        df1.loc[i, 'Price'] = df2.loc[df2['year'] == 2024, 'price'].iloc[0] * df3.loc[df3['quarters'] == 3, 'weights'].iloc[0]
    elif (df1.loc[i, 'years'] == 2024) and (df1.loc[i, 'quarters'] == 4):
        df1.loc[i, 'Price'] = df2.loc[df2['year'] == 2024, 'price'].iloc[0] * df3.loc[df3['quarters'] == 4, 'weights'].iloc[0]
...
...
...
Thanks!!!
If I understand correctly, you can use merge to bring these fields together first and then multiply the two columns. Column names are case-sensitive, so df2's 'price' does not clash with df1's 'Price'.
df1 = df1.merge(df2, how='left', left_on='years', right_on='year')
df1 = df1.merge(df3, how='left', on='quarters')
df1['Price'] = df1['price'] * df1['weights']
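A variant of the same lookup idea that avoids adding helper columns to df1 is to map each key through a Series indexed by that key. This is a sketch on a shortened copy of the question's data; the names `year_price` and `quarter_weight` are illustrative, not from the original post.

```python
import pandas as pd

df1 = pd.DataFrame({'years': [2024, 2024, 2025], 'quarters': [2, 3, 1]})
df2 = pd.DataFrame({'price': [473.26, 244, 204], 'year': [2023, 2024, 2025]})
df3 = pd.DataFrame({'quarters': [1, 2, 3, 4],
                    'weights': [1.22, 0.81, 0.83, 1.12]})

# Turn each lookup table into a Series keyed by year / quarter.
year_price = df2.set_index('year')['price']
quarter_weight = df3.set_index('quarters')['weights']

# Look up both factors row by row and multiply them, with no explicit loop.
# The first row is the 2024 price (244) times the quarter-2 weight (0.81).
df1['Price'] = df1['years'].map(year_price) * df1['quarters'].map(quarter_weight)
print(df1)
```

Rows whose year or quarter is missing from the lookup tables end up as NaN, which matches the behaviour of a left merge.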
import yaml
import pandas as pd
data = ['apple','car','smash','market','washing']
bata = ['natural','artificail','intelligence','outstanding','brain']
df = pd.DataFrame(zip(data,bata),columns=['Data','Bata'])
for columns in df:
    for list in df[columns]:
        text = yaml.dump_all(list)
        print(text)
I used the code above, but each letter gets printed separately. How can I get properly formatted YAML? Thank you.
You can use yaml.dump to get the text in YAML format:
>>> import yaml
>>> import pandas as pd
>>> data = ['apple','car','smash','market','washing']
>>> bata = ['natural','artificail','intelligence','outstanding','brain']
>>> df = pd.DataFrame(zip(data,bata),columns=['Data','Bata'])
>>> text = yaml.dump(
...     df.reset_index().to_dict(orient='records'),
...     sort_keys=False, width=72, indent=4,
...     default_flow_style=None)
>>> text
'- {index: 0, Data: apple, Bata: natural}\n- {index: 1, Data: car, Bata: artificail}\n- {index: 2, Data: smash, Bata: intelligence}\n- {index: 3, Data: market, Bata: outstanding}\n- {index: 4, Data: washing, Bata: brain}\n'
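The per-letter output in the question comes from passing a single string to yaml.dump_all, which treats its argument as an iterable of separate documents. The following sketch illustrates the difference on one word; the variable names are illustrative.

```python
import yaml

# dump_all expects an iterable of documents. A string is itself an
# iterable of characters, so each letter becomes its own YAML document.
per_letter = yaml.dump_all("car")
letters = list(yaml.safe_load_all(per_letter))

# dump serializes its argument as one single document instead.
whole = yaml.safe_load(yaml.dump("car"))

print(letters)
print(whole)
```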
import yaml
import pandas as pd
data = ['apple','car','smash','market','washing']
bata = ['natural','artificail','intelligence','outstanding','brain']
df = pd.DataFrame(zip(data,bata),columns=['Data','Bata'])
text = yaml.dump(df.to_dict(orient='records'), default_flow_style=None)
To save your df to a file:
with open('test_df_to_yaml.yaml', 'w') as file:
    yaml.dump({'result': df.to_dict(orient='records')}, file, default_flow_style=False)
To load the saved file back into a DataFrame:
with open('test_df_to_yaml.yaml', mode="rt", encoding="utf-8") as test_df_to_yaml:
    df_merged = pd.DataFrame(yaml.full_load(test_df_to_yaml)['result'])
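The save and load steps can be checked without touching the disk by round-tripping through a string. This is a small sketch on a two-row frame; it uses safe_dump and safe_load, which are enough here because the records contain only plain strings.

```python
import pandas as pd
import yaml

df = pd.DataFrame({'Data': ['apple', 'car'], 'Bata': ['natural', 'brain']})

# Serialize the records under a 'result' key, keeping the column order.
text = yaml.safe_dump({'result': df.to_dict(orient='records')}, sort_keys=False)

# Parse the YAML text and rebuild the frame from the list of records.
restored = pd.DataFrame(yaml.safe_load(text)['result'])
print(restored.equals(df))
```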
Here is my dataframe:
import pandas as pd
data = {'from':['Frida', 'Frida', 'Frida', 'Pablo','Pablo'], 'to':['Vincent','Pablo','Andy','Vincent','Andy'],
'score':[2, 2, 1, 1, 1]}
df = pd.DataFrame(data)
df
I want to swap the values in columns 'from' and 'to' and add the swapped rows to the original, because these scores work both ways. Here is what I have tried:
df_copy = df.copy()
df_copy.rename(columns={"from":"to","to":"from"}, inplace=True)
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here
df_final = pd.concat([df, df_copy])
This works, but is there a shorter way to do the same?
One line could be:
df_final = pd.concat([df, df.rename(columns={"from":"to","to":"from"})])
You are on the right track. Note that df.copy() already makes a deep copy by default, so renaming the columns of the copy leaves df untouched; passing deep=True only makes that explicit.
df_copy = df.copy(deep=True)
df_copy.rename(columns={"from":"to","to":"from"}, inplace=True)
df_final = pd.concat([df, df_copy])
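Swapping only the column labels is enough because pd.concat aligns columns by name, not by position. A minimal sketch of that behaviour on the question's data, with ignore_index=True added (an assumption about the desired result) so the combined frame gets a fresh 0..n-1 index:

```python
import pandas as pd

df = pd.DataFrame({
    'from': ['Frida', 'Frida', 'Frida', 'Pablo', 'Pablo'],
    'to': ['Vincent', 'Pablo', 'Andy', 'Vincent', 'Andy'],
    'score': [2, 2, 1, 1, 1],
})

# rename returns a new frame; the data stay in place, only the labels swap.
swapped = df.rename(columns={'from': 'to', 'to': 'from'})

# concat matches columns by label, so the old 'to' values land under 'from'.
df_final = pd.concat([df, swapped], ignore_index=True)
print(df_final)
```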
I am working on a csv file that has multiple columns.
The file looks something like this...
A,B,C
1,'x;y;z','e;f;g'
2,'w;x;y','r;s;t'
3,'','p;q;r'
Each cell in the file has a string that is separated by ";".
I want to create a single list by reading each cell and splitting each cell based on the separator.
I have been able to do this but there are performance issues.
The csv file is huge and so I am looking for an optimized version.
The column names are known up front. My code is given below.
My current solution is:
1. Make a list by reading all the rows from each column.
2. Flatten the list.
3. Split each item in the list if it is a string, and append the pieces to a new list.
4. Remove duplicates from the list.
import operator
from functools import reduce

import pandas as pd

csv_path = 'my_dir'
# load the data with pd.read_csv
dataDF = pd.read_csv(csv_path)
dataDF = dataDF.fillna(" ")
result = []
cols = ['A', 'B', 'C']
for i in cols:
    result.append(dataDF[i].tolist())
result = reduce(operator.concat, result)
print(result)
my_list = []
for token in result:
    if isinstance(token, str):
        my_list.append(token.split(";"))
my_list = reduce(operator.concat, my_list)
my_list = list(set(my_list))
If you have many repeated values, this will probably go faster.
from itertools import chain
import pandas as pd
# sample data; in practice, load it with pd.read_csv
dataDF = pd.DataFrame({'A': [1, 2, 3], 'B': ['x;y;z', 'w;x;y', ''], 'C': ['e;f;g', 'r;s;t', 'p;q;r']})
dataDF.fillna(" ", inplace=True)
results_set = set()
for i in dataDF.columns:
    try:
        results_set.update(chain(*dataDF[i].str.split(';').values))
    except AttributeError:
        # non-string columns (such as A) have no .str accessor
        pass
print(results_set)
Try this one:
from itertools import chain
import pandas as pd
# sample data; in practice, load it with pd.read_csv
dataDF = pd.DataFrame({'A': [1, 2, 3], 'B': ['x;y;z', 'w;x;y', ''], 'C': ['e;f;g', 'r;s;t', 'p;q;r']})
dataDF.fillna(" ", inplace=True)
list_of_lists = []
for i in dataDF.columns:
    try:
        list_of_lists.extend(dataDF[i].str.split(';').values)
    except AttributeError:
        # non-string columns (such as A) have no .str accessor
        pass
print(set(chain(*list_of_lists)))
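Another option is to let pandas do the flattening with Series.explode. This is a sketch that assumes the string columns are known up front (here B and C) and that empty tokens should be dropped, which differs slightly from the loops above; `tokens` and `unique_tokens` are illustrative names.

```python
import pandas as pd

dataDF = pd.DataFrame({'A': [1, 2, 3],
                       'B': ['x;y;z', 'w;x;y', ''],
                       'C': ['e;f;g', 'r;s;t', 'p;q;r']})
cols = ['B', 'C']  # assumed: only these columns hold ';'-separated strings

# Stack the chosen columns into one Series, split each cell into a list,
# then explode so that every token gets its own row.
tokens = (
    pd.concat([dataDF[c] for c in cols], ignore_index=True)
      .str.split(';')
      .explode()
)

# Drop missing values and empty strings, then deduplicate.
unique_tokens = set(tokens[tokens != ''].dropna())
print(unique_tokens)
```

Whether this beats the set-based loop depends on the data, so it is worth timing both on a sample of the real file.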
I'm trying to loop through a list (y) and append a row to a dataframe for each item.
y=[datetime.datetime(2017, 3, 29), datetime.datetime(2017, 3, 30), datetime.datetime(2017, 3, 31)]
Desired Output:
Index       Mean  Last
2017-03-29  1.5   .76
2017-03-30  2.3   .4
2017-03-31  1.2   1
Here is the first and last part of the code I currently have:
import pandas as pd
import datetime
df5=pd.DataFrame(columns=['Mean','Last'],index=index)
for item0 in y:
    .........
    .........
    df=df.rename(columns = {0:'Mean'})
    df4=pd.concat([df, df3], axis=1)
    print (df4)
    df5.append(df4)
print (df5)
My code only puts one row into the dataframe, as opposed to a row for each item in y:
Index       Mean  Last
2017-03-29  1.5   .76
Try:
import pandas as pd
from datetime import datetime
y = [datetime(2017, 3, 29), datetime(2017, 3, 30), datetime(2017, 3, 31)]
m = [1.5, 2.3, 1.2]
l = [0.76, .4, 1]
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
rows = []
for y0, m0, l0 in zip(y, m, l):
    rows.append({'time': y0, 'mean': m0, 'last': l0})
df = pd.DataFrame(rows, columns=['time', 'mean', 'last'])
And if you want y to be the index:
df.index = df.time
There are a few ways to skin this, and it's hard to know which approach makes the most sense with the limited info given. But one way is to start with a dataframe that has only the index, iterate through the dataframe by row and populate the values from some other process. Here's an example of that approach:
import datetime
import numpy as np
import pandas as pd
y=[datetime.datetime(2017, 3, 29), datetime.datetime(2017, 3, 30), datetime.datetime(2017, 3, 31)]
main_df = pd.DataFrame(y, columns=['Index'])
#pop in the additional columns you want, but leave them blank
main_df['Mean'] = None
main_df['Last'] = None
#set the index
main_df.set_index(['Index'], inplace=True)
That gives us the following:
            Mean  Last
Index
2017-03-29  None  None
2017-03-30  None  None
2017-03-31  None  None
Now let's loop and plug in some made up random values:
## loop through main_df and add values
## (.ix was removed in pandas 1.0; .loc with a row and a column label assigns in place)
for (index, row) in main_df.iterrows():
    main_df.loc[index, 'Mean'] = np.random.rand()
    main_df.loc[index, 'Last'] = np.random.rand()
This results in the following dataframe, which has the None values filled:
                Mean      Last
Index
2017-03-29  0.174714  0.718738
2017-03-30  0.983188  0.648549
2017-03-31   0.07809   0.47031
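When the per-date values can be computed independently, it is often simpler to compute every row first and build the frame once at the end. A sketch of that pattern follows; `summarize` is a hypothetical stand-in for the calculation elided in the question, and its return values are made up so the example is deterministic.

```python
import datetime
import pandas as pd

y = [datetime.datetime(2017, 3, 29),
     datetime.datetime(2017, 3, 30),
     datetime.datetime(2017, 3, 31)]

def summarize(day):
    # Hypothetical placeholder: replace with the real Mean/Last computation.
    return {'Mean': float(day.day), 'Last': day.day / 100}

# Compute one record per date, then construct the DataFrame in a single call
# with the dates as a named index.
records = [summarize(day) for day in y]
result = pd.DataFrame(records, index=pd.DatetimeIndex(y, name='Index'))
print(result)
```

Building the frame once avoids growing it row by row, which copies the data on every step.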