How to iteratively add rows to an initially empty pandas DataFrame? - python-3.x

I have to iteratively add rows to a pandas DataFrame and find this quite hard to achieve. I'm also not sure this is the best approach performance-wise.
From time to time I get data from a server, and each new dataset from the server becomes a new row in my pandas DataFrame.
import pandas as pd
import datetime
df = pd.DataFrame([], columns=['Timestamp', 'Value'])
# as this df will grow over time, is this a costly copy (df = df.append), or does pandas do some optimization there, or is there a better way to achieve this?
# ignore_index, as I want the index to automatically increment
df = df.append({'Timestamp': datetime.datetime.now()}, ignore_index=True)
print(df)
After one day the DataFrame is deleted, but during that time roughly 100k new rows with data will be added.
The goal is still to achieve this efficiently, runtime-wise (memory doesn't matter too much, as enough RAM is present).

I tried this to compare the speed of 'append' with 'loc':
import timeit
code = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B' : range(0,6)})
df = df.append({'A': 3, 'B': 4}, ignore_index=True)
"""
code2 = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B' : range(0,6)})
df.loc[df.index.max()+1, :] = [3, 4]
"""
elapsed_time1 = timeit.timeit(code, number = 1000)/1000
elapsed_time2 = timeit.timeit(code2, number = 1000)/1000
print('With "append" :',elapsed_time1)
print('With "loc" :' , elapsed_time2)
On my machine, I obtained these results :
With "append" : 0.001502693824000744
With "loc" : 0.0010836279180002747
Using "loc" seems to be faster.
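Both append and loc grow the frame in place or by copy, so each insert costs roughly O(n) and 100k inserts become quadratic overall; DataFrame.append was also deprecated in pandas 1.4 and removed in pandas 2.0. A common alternative is to accumulate plain dicts in a Python list and build the frame once at the end. A minimal sketch (the Value of 1.0 is a dummy stand-in for real server data):

```python
import datetime

import pandas as pd

rows = []  # plain Python list; appending here is O(1) per row
for _ in range(5):
    # each server response becomes one dict
    rows.append({'Timestamp': datetime.datetime.now(), 'Value': 1.0})

# build the DataFrame once, at the end
df = pd.DataFrame(rows, columns=['Timestamp', 'Value'])
```

With ~100k rows per day, the list accumulation keeps the hot path cheap and pays the DataFrame construction cost only once.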

Related

Passing data from a for loop to a dataframe

I have just started learning Python and I am stuck on this.
import yfinance as yf
import pandas as pd
import yahoo_fin.stock_info as si
ticker = ['20MICRONS.NS', '21STCENMGM.NS', '3IINFOTECH.NS', '3MINDIA.NS', '3PLAND.NS']
for i in ticker:
    try:
        quote = si.get_quote_table(i)
        df = pd.DataFrame.from_dict(quote.items())
        df = df.append(quote.items(), ignore_index=True)
    except (ValueError, IndexError, TypeError):
        continue
print(df)
Just for example: ticker has more than 4 entries, and when the loop exits, the data for each of them should have been appended to the dataframe.
But for some reason the dataframe does not keep these values.
Thanks in advance
You defined df within the loop, which means a new dataframe is initialised in df at each iteration, overwriting the previous one. Define the dataframe before the loop and append to it inside the loop.
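A corrected sketch of that pattern, collecting one dict per ticker and building the frame once after the loop. The network call is stubbed out here with a placeholder function so the structure is runnable; in the real code it would be si.get_quote_table:

```python
import pandas as pd

def get_quote_table(ticker):
    # stand-in for si.get_quote_table(ticker), which needs network access;
    # it returns a dict of quote fields for one ticker
    return {'Ticker': ticker, 'Previous Close': 100.0}

ticker = ['20MICRONS.NS', '21STCENMGM.NS', '3IINFOTECH.NS']
rows = []                      # collect one dict per ticker
for i in ticker:
    try:
        rows.append(get_quote_table(i))
    except (ValueError, IndexError, TypeError):
        continue
df = pd.DataFrame(rows)        # build the frame once, after the loop
print(df)
```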

Iterating over columns from two dataframes to estimate correlation and p-value

I am trying to estimate Pearson's correlation coefficient and p-value from the corresponding columns of two dataframes. I managed to write this code so far, but it only gives me the result for the last column. Need some help with this code. I also want to save the outputs in a new dataframe.
import os
import pandas as pd
import scipy as sp
import scipy.stats as stats
df_1 = pd.DataFrame(pd.read_excel('15_Oct_Yield_A.xlsx'))
df_2= pd.DataFrame(pd.read_excel('Oct_Z_index.xlsx'))
for column in df_1.columns[1:]:
    for column in df_2.columns[1:]:
        x = (df_1[column])
        y = (df_2[column])
    correl = stats.pearsonr(x, y)
Your looping setup is incorrect on a couple of counts... You are using the same variable name in both for-loops, which is going to cause problems. Also, you are computing correl outside of your inner loop... etc.
What you want to do is loop over the columns with 1 loop, assuming that both data frames have the same column names. If they do not, you will need to take extra steps to find the common column names and then iterate over them.
Something like this should work:
import os
import pandas as pd
import scipy as sp
import scipy.stats as stats
df_1 = pd.DataFrame({'A': ['dog', 'pig', 'cat'],
                     'B': [0.25, 0.50, 0.75],
                     'C': [0.30, 0.40, 0.90]})
df_2 = pd.DataFrame({'A': ['bird', 'monkey', 'rat'],
                     'B': [0.20, 0.60, 0.90],
                     'C': [0.80, 0.50, 0.10]})
results = dict()
for column in df_1.columns[1:]:
    correl = stats.pearsonr(df_1[column], df_2[column])
    results[column] = correl
print(results)
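For the case where the two frames do not share all column names, the "extra steps" can be an intersection of the two column indexes. A sketch with hypothetical frames that share only column 'B' besides the label column 'A':

```python
import pandas as pd
import scipy.stats as stats

# hypothetical frames that only share numeric column 'B'
df_1 = pd.DataFrame({'A': ['dog', 'pig', 'cat'],
                     'B': [0.25, 0.50, 0.75]})
df_2 = pd.DataFrame({'A': ['bird', 'monkey', 'rat'],
                     'B': [0.20, 0.60, 0.90],
                     'D': [0.80, 0.50, 0.10]})

common = df_1.columns.intersection(df_2.columns).drop('A')  # shared data columns
results = {col: stats.pearsonr(df_1[col], df_2[col]) for col in common}
```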

Dask Dataframe groupby and aggregate for column

I had a pd.DataFrame that I converted to Dask.DataFrame for faster computations.
My requirement is that I have to find out the 'Total Views' of a channel.
In pandas it would be df.groupby(['ChannelTitle'])['VideoViewCount'].sum(), but in dask the column's dtype is object, and groupby treats these values as strings, not ints (see image 2).
To handle the above issue, I added two columns separating the figure (115) and the multiplier (6 for M, 3 for K) of the views, hoping to do an operation like ddf['new_views_f'] * (10**ddf['new_views_m']), but now I cannot find how to multiply two columns in dask.
Either I am missing something or complicating the requirement.
It does sound like you are complicating the requirement. For column multiplication, the regular pandas syntax will work (df['c'] = df['a'] * df['b']). In your case, it's possible to use pd.eval to get the actual numeric value for views:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import random
df = pd.DataFrame(15*np.random.rand(15), columns=['views'])
df['views'] = df['views'].round(2).astype('str') + [random.choice(['K views', 'M views']) for _ in range(len(df))]
df['group'] = [random.choice([1,2,3]) for _ in range(len(df))]
ddf = dd.from_pandas(df, npartitions=2)
ddf['views_digits'] = ddf['views'].replace({'K views': '*1e3', 'M views': '*1e6'}, regex=True).map(pd.eval, meta=ddf['group'])
aggregate_df = ddf.groupby(['group']).agg({'views_digits': 'sum'}).compute()
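The figure/multiplier split can also be done without pd.eval by extracting both parts with a vectorized string method. Shown here in plain pandas on a hypothetical series in the same 'number + K/M views' format (dask series expose the same .str accessor):

```python
import pandas as pd

views = pd.Series(['1.25K views', '2.5M views', '0.5K views'])
parts = views.str.extract(r'([\d.]+)([KM])')   # column 0: figure, column 1: suffix
numeric = parts[0].astype(float) * parts[1].map({'K': 1e3, 'M': 1e6})
```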

Is there a way to list the rows and columns in a pandas DataFrame that are empty strings?

I have a 1650x40 dataframe that is a matrix of people who worked on projects each day. It looks like this:
import pandas as pd
df = pd.DataFrame([['bob','11/1/19','X','','',''], ['pete','11/1/19','X','','',''],
                   ['wendy','11/1/19','','','X',''], ['sam','11/1/19','','','',''],
                   ['cara','11/1/19','','','X','']],
                  columns=['person', 'date', 'project1','project2','project3','project4'])
I am trying to sanity check the data by:
- listing any columns that do not have an X in them (in this case 'project2' and 'project4')
- listing any rows that do not have an X in them (in this case 'sam')
Desired outcome:
Something like df.show_empty(columns) returns ['project2','project4'] and df.show_empty(rows) returns ['sam']
Obviously this method would need some way to tell it that the first two columns are not expected to be empty and should be ignored.
My desired outcome above would return lists of column headings (or row indexes) so that I could go back and check my data and application to find out why there's no entry in the relevant cell (I am guessing there's a good chance that more than one row or column are affected). This seems like it should be trivial but I'm really stuck with figuring this out.
Thanks for any help offered!
For me, it is easier to use apply to accomplish this task. The working code is shown below:
import pandas as pd
df = pd.DataFrame([['bob','11/1/19','X','','',''], ['pete','11/1/19','X','','',''],
                   ['wendy','11/1/19','','','X',''], ['sam','11/1/19','','','',''],
                   ['cara','11/1/19','','','X','']],
                  columns=['person', 'date', 'project1','project2','project3','project4'])
import numpy as np
df = df.replace('', np.NaN)
colmns = df.apply(lambda x: x.count()==0, axis=0)
df[colmns.index[colmns]]
df[df.apply(lambda x: x[2:].count()==0, axis=1)]
df = df.replace('', np.NaN) will replace the '' with NaN, so that we can use count() function.
colmns = df.apply(lambda x: x.count()==0, axis=0): this will find the columns that are all NaN.
df[df.apply(lambda x: x[2:].count()==0, axis=1)]: this will ignore the first two columns.
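A boolean mask built with eq('X') avoids the NaN round-trip entirely; a sketch on the same sample frame:

```python
import pandas as pd

df = pd.DataFrame([['bob','11/1/19','X','','',''],
                   ['pete','11/1/19','X','','',''],
                   ['wendy','11/1/19','','','X',''],
                   ['sam','11/1/19','','','',''],
                   ['cara','11/1/19','','','X','']],
                  columns=['person','date','project1','project2','project3','project4'])

mask = df.iloc[:, 2:].eq('X')                              # skip the first two columns
empty_cols = mask.columns[~mask.any(axis=0)].tolist()      # columns with no X
empty_rows = df.loc[~mask.any(axis=1), 'person'].tolist()  # people with no X
```

This returns the column names and the person labels directly, matching the desired outcome above.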

log transformation of whole dataframe using numpy

I have a dataframe in python which is made using the following code:
import numpy
import pandas as pd
df = pd.read_csv('myfile.txt', sep="\t")
df1 = df.iloc[:, 3:]
Now df1 has 24 columns. I would like to transform the values to log2 values and make a new dataframe in which there are 24 columns with the log values of the original dataframe. To do so I used numpy.log like this:
df2 = (numpy.log(df1))
This code does not return what I would like to get. Do you know how to fix it?
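Note that numpy.log is the natural logarithm; for base-2 values, as the question asks, numpy.log2 applies element-wise to the whole frame. A sketch with a small hypothetical numeric frame standing in for the 24 columns:

```python
import numpy as np
import pandas as pd

# hypothetical numeric frame; real code would use df.iloc[:, 3:]
df1 = pd.DataFrame({'a': [1.0, 2.0, 4.0], 'b': [8.0, 16.0, 32.0]})
df2 = np.log2(df1)   # element-wise log base 2, returns a DataFrame of the same shape
```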
