more pythonic way to concatenate pandas data frames - python-3.x

So I've been having to write programs which do something to an existing pandas data frame and then append that data frame to the end of a big data frame inside a for loop.
I've found a way to do this by setting the first data frame to be the end data frame on the first iteration, and then concatenating each new data frame onto that end data frame in later iterations, but it doesn't seem like the most efficient approach to me.
I've been using Python for a while but have only recently started using pandas, so I don't know whether there is an easier way to do this. I've attached a simple code sample which hopefully demonstrates what I'm doing, and I was wondering whether it can be done more pythonically.
import pandas
df = pandas.DataFrame([0, 1, 2, 3])
for i in range(3):
    if i == 0:
        end_df = df
    else:
        end_df = pandas.concat([end_df, df], ignore_index=True)

If you want to have just one variable, you can simplify your code:
df = pd.DataFrame([0, 1, 2, 3])
for i in range(3):
    df = pd.concat([df, some_op_modifying_df(df)], ignore_index=True)
where some_op_modifying_df is whatever function generates the new version of df.
Having said that, it would be much easier to come up with a sensible solution if you provided more detail about your problem.
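A common alternative pattern (a minimal sketch, with build_frame standing in as a hypothetical placeholder for whatever produces each iteration's frame) is to collect the pieces in a list and call pd.concat once at the end, since concatenating inside the loop re-copies the accumulated frame on every iteration:
import pandas as pd

def build_frame(i):
    # hypothetical stand-in for the per-iteration work
    return pd.DataFrame([0, 1, 2, 3])

frames = [build_frame(i) for i in range(3)]
end_df = pd.concat(frames, ignore_index=True)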

Related

fast date based replacement of rows in Pandas

I am on a quest to find the fastest index-based replacement method in Pandas.
I want to set all rows selected by index (a DatetimeIndex) to np.nan.
I tested various types of selection, but obviously, the bottleneck is setting the rows equal to a value (np.nan in my case).
Naively, I want to do this:
df['2017-01-01':'2018-01-01'] = np.nan
I tried and tested a performance of various other methods, such as
df.loc['2017-01-01':'2018-01-01'] = np.nan
And also creating a mask with NumPy to speed it up
df['DateTime'] = df.index
st = pd.to_datetime('2017-01-01', format='%Y-%m-%d').to_datetime64()
en = pd.to_datetime('2018-01-01', format='%Y-%m-%d').to_datetime64()
ge_start = df['DateTime'] >= st
le_end = df['DateTime'] <= en
mask = (ge_start & le_end )
and then
df[mask] = np.nan
#or
df.where(~mask)
But with no big success. I have a DataFrame (that I unfortunately cannot share) of shape roughly (200, 1500000), so fairly big, and the operation takes on the order of seconds of CPU time, which is far too much in my opinion.
Would appreciate any ideas!
Edit: after going through
Modifying a subset of rows in a pandas dataframe and Why dataframe.values is very slow, and unifying the datatypes for the operation, the problem is solved with roughly a 20x speedup.
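A rough sketch of what the dtype unification can look like (the frame below is made up, since the real one cannot be shared): with a sorted DatetimeIndex and a single float dtype across all columns, the plain .loc slice assignment can operate on one homogeneous block instead of working column by column:
import numpy as np
import pandas as pd

# hypothetical stand-in for the real frame
idx = pd.date_range('2015-01-01', '2020-01-01', freq='h')
df = pd.DataFrame(np.random.rand(len(idx), 200), index=idx)

df = df.astype('float64')                   # unify dtypes up front
df.loc['2017-01-01':'2018-01-01'] = np.nan  # partial-string slice on the DatetimeIndex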

AWS Glue dynamic frames not getting updated

We are currently facing an issue where we cannot insert more than 600K records into an Oracle DB using AWS Glue. We are getting a connection reset error and the DBAs are currently looking into it. As a temporary solution we thought of adding the data in chunks, by splitting the dataframe into multiple dataframes and looping over this list of dataframes to add the data. We are sure that the splitting algorithm works fine, and here is the code we use:
def split_by_row_index(df, num_partitions=10):
    # Let's assume you don't have a row_id column that has the row order
    t = df.withColumn('_row_id', monotonically_increasing_id())
    # Using ntile() because monotonically_increasing_id is discontinuous across partitions
    t = t.withColumn('_partition', ntile(num_partitions).over(Window.orderBy(t._row_id)))
    return [t.filter(t._partition == i + 1) for i in range(num_partitions)]
Here each DF has unique data, but somehow when we convert these DFs to dynamic frames inside the loop we get common data in each dynamic frame. Here is a small snippet for this example:
df_trns_details_list = split_by_row_index(df_trns_details, int(df_trns_details.count() / 100000))
trnsDetails1 = DynamicFrame.fromDF(df_trns_details_list[0], glueContext, "trnsDetails1")
trnsDetails2 = DynamicFrame.fromDF(df_trns_details_list[1], glueContext, "trnsDetails2")
print(df_trns_details_list[0].count())# counts are same
print(trnsDetails1.count())
print('-------------------------------')
print(df_trns_details_list[1].count()) # counts are same
print(trnsDetails2.count())
print('-------------------------------')
subDf1 = trnsDetails1.toDF().select(col("id"), col("details_id"))
subDf2 = trnsDetails2.toDF().select(col("id"), col("details_id"))
common = subDf1.intersect(subDf2)
# ------------------ common data exists----------------
print(common.count())
subDf3 = df_trns_details_list[0].select(col("id"), col("details_id"))
subDf4 = df_trns_details_list[1].select(col("id"), col("details_id"))
#------------------0 common data----------------
common1 = subDf3.intersect(subDf4)
print(common1.count())
Here the id and details_id combination should be unique.
We used this logic in multiple places where it worked; we are not sure why this is happening here.
We are also quite new to Python and AWS Glue, so any suggestion to improve this is also welcome. Thanks.
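One possible direction (a sketch only, under the assumption that lazy re-evaluation of monotonically_increasing_id is the culprit; this is not a confirmed diagnosis): Spark plans are lazy, so the generated _row_id values can be recomputed with different results each time a split DataFrame is materialised, e.g. when DynamicFrame.fromDF() evaluates it. Persisting the frame that carries the generated ids before filtering pins them down:
from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile

def split_by_row_index(df, num_partitions=10):
    t = df.withColumn('_row_id', monotonically_increasing_id())
    t = t.withColumn('_partition',
                     ntile(num_partitions).over(Window.orderBy(t._row_id)))
    t = t.persist()  # materialise the generated ids once and reuse them
    t.count()        # force evaluation before the filters run
    return [t.filter(t._partition == i + 1) for i in range(num_partitions)]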

which is the efficient way to iterate in python?

I have to iterate one by one over 1 million records, which are stored in a list. Each value is present in a Pandas dataframe: I first have to find the value in the dataframe, then perform some arithmetic operation on it and store the result in another Pandas dataframe. But it takes too much time to complete. I have stored the values in a tuple and the performance improved a bit, but not as much as expected. Is there any way to optimize this?
Below is sample code I have done.
c2 = ['Fruits', 'animals', ...]
list1 = []
for j in c2:
    data2 = dataframe.loc[(dataframe['value'] == j)]
    data3 = data2.describe()
    range1 = data3.loc['max'] - data3.loc['min']
The most efficient way is to use vectorized operations. Typing this blind:
c2 = ['Fruits', 'animals', ...]
tmp = (dataframe[dataframe['value'].isin(c2)]
       .groupby('value')
       .agg(['min', 'max']))
# .agg(['min', 'max']) yields a column MultiIndex of (column, statistic),
# so select the statistic level before taking the difference
df_range = tmp.xs('max', axis=1, level=1) - tmp.xs('min', axis=1, level=1)
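A quick usage sketch with made-up data ('price' is an assumed column name) to show the shape of the result:
import pandas as pd

dataframe = pd.DataFrame({
    'value': ['Fruits', 'Fruits', 'animals', 'animals', 'other'],
    'price': [1.0, 5.0, 2.0, 9.0, 4.0],
})
c2 = ['Fruits', 'animals']

tmp = dataframe[dataframe['value'].isin(c2)].groupby('value').agg(['min', 'max'])
df_range = tmp.xs('max', axis=1, level=1) - tmp.xs('min', axis=1, level=1)
print(df_range)
#          price
# value
# Fruits     4.0
# animals    7.0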

Make operations on several pandas data frames in for loop and return one concatenated data frame

I have many similar data frames which have to be modified and then concatenated into one data frame. I was wondering if there is a way to do everything in a for loop instead of importing and operating on one data frame at a time?
This is what I was thinking:
c = '/disc/data/'
files = [c+'frames_A1.csv', c+'frames_A2.csv', c+'frames_A3.csv', c+'frames_B1.csv', c+'frames_B2.csv', c+'frames_B3.csv',
         c+'frames_A1_2.csv', c+'frames_A2_2.csv', c+'frames_A3_2.csv', c+'frames_B1_2.csv', c+'frames_B2_2.csv', c+'frames_B3_2.csv',
         c+'frames_B_96.csv', c+'frames_C_96.csv', c+'frames_D_96.csv', c+'frames_E_96.csv', c+'frames_F_96.csv', c+'frames_G_96.csv']
data_tot = []
for i in files:
    df = pd.read_csv(i, sep=';', encoding='unicode_escape')
    df1 = df[['a', 'b', 'c', 'd']]
    df2 = df1[df1['a'].str.contains(r'\btake\b')]
    data_tot.append(df2)
I believe I should not append to a list but I cannot figure out how to do otherwise.
you could then do
total_df = pd.concat(data_tot, ignore_index=True)
(An extra .reset_index() is not needed here; ignore_index=True already produces a fresh 0..n-1 index.)
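If you prefer, the loop and the final concat can also be collapsed into a single pass (a sketch that simply mirrors the steps above; load_one is a hypothetical helper wrapping the loop body, and files is the same list of paths):
import pandas as pd

def load_one(path):
    # mirrors the per-file steps from the question
    df = pd.read_csv(path, sep=';', encoding='unicode_escape')
    df = df[['a', 'b', 'c', 'd']]
    return df[df['a'].str.contains(r'\btake\b')]

total_df = pd.concat((load_one(f) for f in files), ignore_index=True)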

Difference between elements when reading from multiple files

I am trying to get the difference between elements after reading multiple csv files. Each csv file has 13 rows and 128 columns, and I am trying to get the column-wise difference.
I read the files using
data = [pd.read_csv(f, index_col=None, header=None) for f in _temp]
I get a list of all samples.
According to this, I have to use .diff() to get the difference, which goes something like this:
data.diff()
This works, but instead of getting the difference between rows within the same sample, I get the difference between rows of one sample and the next sample.
Is there a way to separate this and let the difference happen within each sample?
Edit
Ok I am able to get the difference between the data elements by doing this
_local = pd.DataFrame(data)
_list = []
_a = _local.index
for _aa in _a:
    _list.append(_local[0][_aa].diff())
flow = pd.DataFrame(_list, index=_a)
I am creating too many DataFrames, is there a better way to do this?
Here is a relatively efficient way to read your dataframes one at a time and calculate their differences, which are collected in the list df_diff.
df_diff = []
df_old = pd.read_csv(_temp[0], index_col=None)
for f in _temp[1:]:
    df = pd.read_csv(f, index_col=None)
    df_diff.append(df_old - df)
    df_old = df
Since your code works, you should really post it on https://codereview.stackexchange.com/
(PS: the leading "_" is not really Pythonic. Please avoid it; it makes your code harder to read.)
_local = pd.DataFrame(data)
_list = [_local[0][_aa].diff() for _aa in _local.index]
flow = pd.DataFrame(_list, index=_local.index)
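If the goal is simply the row-to-row difference within each file, another sketch (assuming data is the list of per-file frames read with pd.read_csv in the question) is to diff each frame independently and only combine afterwards:
import pandas as pd

# `data` is the list of 13x128 frames read in the question
diffs = [df.diff() for df in data]               # difference within each sample
flow = pd.concat(diffs, keys=range(len(diffs)))  # optional: stack with a sample-level index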
