Difference between elements when reading from multiple files - python-3.x

I am trying to get the difference between elements after reading multiple CSV files. Each CSV file has 13 rows and 128 columns, and I am trying to get the column-wise difference.
I read the files using
data = [pd.read_csv(f, index_col=None, header=None) for f in _temp]
I get a list of all samples.
According to this, I have to use .diff() to get the difference, which goes something like this:
data.diff()
This works but instead of getting the difference between each row in the same sample, I get the difference between each row of one sample to another sample.
Is there a way to separate this and let the difference happen within each sample?
Edit
OK, I am able to get the difference between the data elements by doing this:
_local = pd.DataFrame(data)
_list = []
_a = _local.index
for _aa in _a:
    _list.append(_local[0][_aa].diff())
flow = pd.DataFrame(_list, index=_a)
I am creating too many DataFrames; is there a better way to do this?

Here is a relatively efficient way to read your dataframes one at a time and calculate their differences, which are stored in the list df_diff.
df_diff = []
df_old = pd.read_csv(_temp[0], index_col=None)
for f in _temp[1:]:
    df = pd.read_csv(f, index_col=None)
    df_diff.append(df_old - df)
    df_old = df

Since your code works, you should really post it on https://codereview.stackexchange.com/
(PS: the leading "_" is not really Pythonic. Please avoid it; it makes your code harder to read.)
_local = pd.DataFrame(data)
_list = [_local[0][_aa].diff() for _aa in _local.index]
flow = pd.DataFrame(_list, index=_local.index)
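If the goal is the row-to-row difference within each sample, the data can also be kept as the list of DataFrames from the question and .diff() applied to each one separately. A minimal sketch, assuming _temp is the same list of file paths used above:
import pandas as pd

# Read every sample, then diff each one independently (row-to-row, within a file)
data = [pd.read_csv(f, index_col=None, header=None) for f in _temp]
diffs = [df.diff() for df in data]   # one 13x128 frame of within-sample differences per file
# diffs[0] holds the differences for the first sample only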

Related

Reading large volume data from Teradata using Dask cluster/Teradatasql and sqlalchemy

I need to read a large volume of data (approx. 800M records) from Teradata. My code works fine for a million records, but for larger sets it takes a long time to build the metadata. Could someone please suggest how to make it faster? Below is the code snippet I am using in my application.
def get_partitions(num_partitions):
    list_range = []
    initial_start = 0
    for i in range(num_partitions):
        amp_range = 3240 // num_partitions
        start = (i * amp_range + 1) * initial_start
        end = (i + 1) * amp_range
        list_range.append((start, end))
        initial_start = 1
    return list_range

@delayed
def load(query, start, end, connString):
    df = pd.read_sql(query.format(start, end), connString)
    engine.dispose()
    return df

connString = "teradatasql://{user}:{password}@{hostname}/?logmech={logmech}&encryptdata=true"
results = from_delayed([load(query, start, end, connString) for start, end in get_partitions(num_partitions)])
The build time is probably spent finding out the metadata of your table. This is done by fetching the whole of the first partition and analysing it.
You would be better off either specifying it explicitly, if you know the dtypes upfront, e.g., {col: dtype, ...} for all the columns, or generating it from a separate query that you limit to just as many rows as it takes to be sure you have the right types:
meta, = dask.compute(load(query, 0, 10, connString))  # compute returns a tuple; unpack the frame
results = from_delayed(
    [
        load(query, start, end, connString)
        for start, end in get_partitions(num_partitions)
    ],
    meta=meta.iloc[:0, :]  # zero-length version of the table
)
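If the column names and dtypes are already known, a sketch of the explicit-meta route could look like this (the column names and dtypes below are made up for illustration; substitute the real table schema):
from dask.dataframe import from_delayed

# Hypothetical schema: replace with the real column names and dtypes of the Teradata table
meta = {'customer_id': 'int64', 'amount': 'float64', 'created_at': 'datetime64[ns]'}

results = from_delayed(
    [load(query, start, end, connString) for start, end in get_partitions(num_partitions)],
    meta=meta,  # no sample query needed, so nothing is fetched up front
)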

How to split pandas dataframe into multiple dataframes based on unique string value without aggregating

I have a df with multiple country codes in a column (US, CA, MX, AU...) and want to split this one df into multiple ones based on these country code values, but without aggregating it.
I've tried a for loop but was only able to get one df and it was aggregated with groupby().
I gave up trying to figure it out so I split them based on str.match and wrote one line for each country code. Is there a nice for loop that could achieve the same as below code? If it would write a csv file for each new df that would be fantastic.
us = df[df['country_code'].str.match("US")]
mx = df[df['country_code'].str.match("MX")]
ca = df[df['country_code'].str.match("CA")]
au = df[df['country_code'].str.match("AU")]
...
We can write a for loop which takes each code and uses query to get the correct part of the data. Then we write it to a CSV with to_csv, using an f-string for the filename:
codes = ['US', 'MX', 'CA', 'AU']
for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    temp.to_csv(f'df_{code}.csv')
Note: f-strings only work on Python >= 3.6.
To keep the dataframes:
codes = ['US', 'MX', 'CA', 'AU']
dfs = []
for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    dfs.append(temp)
    temp.to_csv(f'df_{code}.csv')
Then you can access them by index, for example: print(dfs[0]) or print(dfs[1]).
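A groupby-based variant can also split the frame without aggregating anything: iterating over the groups yields the untouched rows for each code, so the list of codes does not have to be hard-coded. A sketch, assuming the same df and country_code column:
# Each group is the plain subset of rows for one country code; no aggregation involved
for code, group in df.groupby('country_code'):
    group.to_csv(f'df_{code}.csv', index=False)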

Make operations on several pandas data frames in for loop and return one concatenated data frame

I have many similar data frames which have to be modified and then concatenated into one data frame. I was wondering if there is a way to do everything with a for loop instead of importing and operating on one data frame at a time.
This is what I had in mind:
c = '/disc/data/'
files = [c+'frames_A1.csv', c+'frames_A2.csv', c+'frames_A3.csv', c+'frames_B1.csv', c+'frames_B2.csv', c+'frames_B3.csv',
         c+'frames_A1_2.csv', c+'frames_A2_2.csv', c+'frames_A3_2.csv', c+'frames_B1_2.csv', c+'frames_B2_2.csv', c+'frames_B3_2.csv',
         c+'frames_B_96.csv', c+'frames_C_96.csv', c+'frames_D_96.csv', c+'frames_E_96.csv', c+'frames_F_96.csv', c+'frames_G_96.csv']
data_tot = []
for i in files:
    df = pd.read_csv(i, sep=';', encoding='unicode_escape')
    df1 = df[['a','b','c','d']]
    df2 = df1[df1['a'].str.contains(r'\btake\b')]
    data_tot.append(df2)
I believe I should not append to a list but I cannot figure out how to do otherwise.
You could then do:
total_df = pd.concat(data_tot, ignore_index = True).reset_index()
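If the files all sit under /disc/data/ and follow the frames_*.csv naming pattern (an assumption; the question lists the paths explicitly), the path list and the loop can be folded together. A sketch:
import glob
import pandas as pd

frames = []
for path in sorted(glob.glob('/disc/data/frames_*.csv')):   # assumed naming pattern
    df = pd.read_csv(path, sep=';', encoding='unicode_escape')[['a', 'b', 'c', 'd']]
    frames.append(df[df['a'].str.contains(r'\btake\b')])

total_df = pd.concat(frames, ignore_index=True)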

more pythonic way to concatenate pandas data frames

So I've been having to write programs which do something to an existing pandas data frame and then add that data frame to the end of a big data frame in a for loop.
I've found a way to do this by setting the first data frame as the end data frame in the first iteration and then concatenating data frames onto this end data frame in later iterations, but it doesn't seem like the most efficient approach to me.
I've been using Python for a while but have only recently started using pandas, so I don't know if there is an easier way to do this. I've attached a simple sample which hopefully demonstrates what I'm doing, and I was wondering whether it can be done more pythonically.
df = pandas.DataFrame([0, 1, 2, 3])
for i in range(3):
    if i == 0:
        end_df = df
    else:
        end_df = pandas.concat([end_df, df], ignore_index=True)
If you want to have just one variable, you can simplify your code:
df = pd.DataFrame([0, 1, 2, 3])
for i in range(3):
    df = pd.concat([df, some_op_modifying_df(df)], ignore_index=True)
Here, some_op_modifying_df is a function that generates a new version of df.
Having said that, it would be much easier to come up with a sensible solution if you provided more detail about your problem.
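A common idiom is to collect the pieces in a list and concatenate once after the loop, which avoids growing a DataFrame inside the loop. A minimal sketch, with some_op_modifying_df as a hypothetical placeholder:
import pandas as pd

def some_op_modifying_df(frame):
    # Hypothetical placeholder: any operation that produces a new frame from the old one
    return frame + 1

df = pd.DataFrame([0, 1, 2, 3])
pieces = [df]
for _ in range(3):
    pieces.append(some_op_modifying_df(pieces[-1]))

end_df = pd.concat(pieces, ignore_index=True)  # single concat at the end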

Convert huge number of lists to pandas dataframe

User-defined function: my_fun(x) returns a list.
XYZ = file with LOTS of lines
pandas_frame = pd.DataFrame()  # created empty data frame
for index in range(0, len(XYZ)):
    pandas_frame = pandas_frame.append(pd.DataFrame(my_fun(XYZ[index])).transpose(), ignore_index=True)
This code is taking a very long time to run (days). How do I speed it up?
I think you need to apply the function to each row, building a new list with a list comprehension, and then call the DataFrame constructor only once:
L = [my_fun(XYZ[i]) for i in range(len(XYZ))]
df = pd.DataFrame(L)
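For illustration, a self-contained toy version (my_fun and XYZ here are hypothetical stand-ins for the real function and file) shows the single-constructor pattern end to end:
import pandas as pd

def my_fun(line):
    # Hypothetical stand-in: split one comma-separated line into fields
    return line.strip().split(',')

XYZ = ['1,2,3', '4,5,6', '7,8,9']   # stands in for the lines of the large file

L = [my_fun(XYZ[i]) for i in range(len(XYZ))]
df = pd.DataFrame(L)                # one constructor call instead of repeated .append
print(df)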
