We have measurements for n points (say 22 points) over a period of time, stored in a real-time store. We are now looking to understand trends for these points. To that end we read the measurements into a pandas DataFrame (Python), where the points are columns and the rows are the respective measurement times.
We would like to extend the DataFrame with new 'mean' and 'std' columns for each existing column (each column being a particular measurement point), i.e. two new columns per each of the 22 measurement points.
The question is whether this is best achieved by adding the new mean and std columns while iterating over the existing columns, or whether there is a more effective built-in DataFrame operation or trick.
Our understanding is that updating a DataFrame in a for loop would be by far the worst practice.
Thanks for any comment or proposal.
From the comments, I guess this is what you are looking for -
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.normal(size=(1000, 22)))  # created an example dataframe
row_mean = df.mean(axis=1)  # row-wise mean across the 22 point columns
row_std = df.std(axis=1)    # row-wise standard deviation across the 22 point columns
df.loc[:, 'means'] = row_mean
df.loc[:, 'std'] = row_std
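A more compact variant (a sketch, keeping the same row-wise interpretation as above) computes both columns in a single assign call, which guarantees both statistics are taken over the original 22 columns only:

import numpy as np
import pandas as pd

# a minimal sketch, assuming the same row-wise interpretation as the answer above
df = pd.DataFrame(np.random.normal(size=(1000, 22)))
df = df.assign(means=df.mean(axis=1), std=df.std(axis=1))  # both computed from the 22 original columns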
I have a pandas DataFrame of 10,000 rows. Each of these rows contains a 1D np.array. I want to compute the Euclidean distance (and a few other measures) between each of these arrays (rows) and every other array (row). I can do this in 3 ways, but all of these approaches require a lot of computing time. Is there an optimised way to achieve this? I'm sharing brief code of what I did:
df is the DataFrame and vector is the name of the column holding the arrays.
#Method-1: nested Python loops
for i in range(df.shape[0]):
    for j in range(df.shape[0]):
        get_distance(df['vector'][i], df['vector'][j])

#Method-2: apply nested inside apply
def func(n, df):
    return df.apply(lambda x: get_distance(n, x))

df.vector.apply(lambda x: func(x, df.vector))

#Method-3: itertools.product over all pairs
from itertools import product

features = list(product(df.vector.values, df.vector.values))
data = [get_distance(i[0], i[1]) for i in features]
Is there an optimised way to achieve this?
Thank you!
P.S. I'm not concerned about the output data type. The output can be a list (like in Methods 1 and 3), a DataFrame (like in Method 2), an array, a matrix, etc. I'm only trying to improve the efficiency.
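For the Euclidean part specifically, a commonly suggested optimisation is to avoid Python-level pairwise calls entirely and compute all distances at once with scipy. A minimal sketch (the toy DataFrame below stands in for the real 10,000-row one, and plain Euclidean distance stands in for get_distance):

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# toy stand-in for the 10,000-row DataFrame described above
df = pd.DataFrame({'vector': [np.random.rand(16) for _ in range(1000)]})

X = np.vstack(df['vector'].values)             # shape (n_rows, vector_length)
dist_matrix = cdist(X, X, metric='euclidean')  # (n_rows, n_rows) matrix of all pairwise distances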
In a pandas DataFrame df, one can extract a subset of rows and store it in another pandas DataFrame, for example df1 = df[10:20]. Can we do something similar in a Spark DataFrame?
Since we're talking about Spark, we're considering large datasets for which pandas (and Python) are still catching up. I'm trying to stress that the reason you may have considered PySpark a better fit for your data processing problem is exactly the amount of data: too large for pandas to handle nicely.
With that said, you simply cannot think of such a huge dataset as something to "rank", as no single computer could handle it (because of either a lack of RAM or a lack of time).
In order to answer your question:
one can extract a subset of rows and store it in another pandas data frame.
think of filter or where, which you use to filter out the rows you don't want to include in the result dataset.
That could look as follows (using the Scala API):
val cdf: DataFrame = ...
val result: DataFrame = cdf.where("here comes your filter expression")
Use the result DataFrame however you wish; that's what you wanted to work with, and it's now available. That's sort of the "Spark way".
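The same idea in PySpark, as a minimal self-contained sketch (the range DataFrame is just a stand-in for your data; note the filter works on a column value, not on a positional index):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
cdf = spark.range(100)                      # stand-in DataFrame with an 'id' column
result = cdf.where("id >= 10 AND id < 20")  # filter expression instead of positional slicing
result.show()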
@chlebek, since your answer works for me, I corrected a typo and am posting it here as an answer.
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

b = cdf.withColumn("id", row_number().over(Window.orderBy("INTERVAL_END_DATETIME")))
b = b.where(b.id >= 10)
b = b.where(b.id <= 20)
You could try to use row_number; it will add an increasing row-number column. The data will be sorted by the column used in the .orderBy clause. Then you can just select the rows you need.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._  // for the 'columnName symbol syntax (spark is your SparkSession)
val new_df = df.withColumn("id", row_number.over(Window.orderBy('someColumnFromDf))).where('id >= 10 and 'id <= 20)
I am trying to go through an Excel spreadsheet that contains lots of data and sort through it. I imported the Excel sheet into a DataFrame. What I need to do is split up the data by data point name into different DataFrames.
The data points go from 1066 to 1070 in increments of 1, and I need to split these into different DataFrames so that there is one DataFrame for each. Any help would be appreciated. I have already imported the sheet into a DataFrame which I called test_df_new; I just need to know how to go further.
Thank you
Since you asked in the comments to sort by a column, you can use sort_values like this:
import pandas as pd
_data = {'Toolset': ['Toolset', 'Toolset', 'Toolset', 'Toolset'],
         'Data Point Name': ['EN1068', 'EN1067', 'EN1068QR', 'EN1068QR'],
         'Toolset Start Date': [0, 0, 0, 0],
         'ToolsetCount': [1674, 1160, 977, 977],
         'ToolsetCap': [0, 0, 0, 0],
         'Toolset Cap Start Date': [0, 0, 0, 0],
         'Cap Count': [0, 0, 51, 42]}
df = pd.DataFrame(data=_data)
df.sort_values(by=['Data Point Name'])
Note: since the names start with EN, pandas will sort them alphabetically. I have attached an image; is this what you are looking for?
and you can also use the copy() function to copy df to a new DataFrame like this:
new_df = df.copy()
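If the goal is literally one DataFrame per data point name, a dict comprehension over groupby is a common pattern; a sketch reusing the example df built above:

# one sub-DataFrame per 'Data Point Name', reusing the example df from above
frames = {name: group.copy() for name, group in df.groupby('Data Point Name')}
frames['EN1068QR']  # e.g. the rows for data point EN1068QR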
At the moment I have 9 functions which do specific calculations on a data frame, including average balance per month, rolling P&L, period start balances, and ratio calculations.
Each of those functions produces the following:
the first columns are the group-by columns which the function accepts, and the final column is the calculated statistic.
I.e.
Each of those functions produces a Spark data frame that has the same group-by variables as its first columns (1 column if there is only one group-by variable, 2 columns if there are two, etc.) and 1 column whose values are the specific calculation, examples of which I listed at the beginning.
Because each of those functions does a different calculation, I need to produce a data frame for each one and then join them to produce a report.
I join them on the group-by variables, because those are common to all of them (each individual statistic report).
But doing 7-8 or even more joins is very slow.
Is there a way to add those columns together without using join?
Thank you.
I can think of multiple approaches. But this looks like a good use case for the new pandas UDF Spark API.
You can define one grouped UDF. The UDF will receive each group as a pandas DataFrame. You apply the 9 aggregate functions to the group and return a pandas DataFrame with 9 additional aggregated columns. Spark will combine each returned pandas DataFrame into one large Spark DataFrame.
e.g.
from pyspark.sql.functions import pandas_udf, PandasUDFType

# given you want to aggregate average and ratio
@pandas_udf("month long, avg double, ratio double", PandasUDFType.GROUPED_MAP)
def compute(pdf):
    # pdf is a pandas.DataFrame holding one group
    pdf['avg'] = compute_avg(pdf)
    pdf['ratio'] = compute_ratio(pdf)
    # note: the returned columns must match the schema declared in @pandas_udf
    return pdf

df.groupby("month").apply(compute).show()
See Pandas-UDF#Grouped-Map
If your cluster is on a lower Spark version, you have 2 options:
Stick to the DataFrame API and write custom aggregate functions. See this answer. They have a horrible API, but usage would look like this:
df.groupBy(df.month).agg(
    my_avg_func(df.amount).alias('avg'),
    my_ratio_func(df.amount).alias('ratio'),
)
Fall back to good ol' rdd map reduce api
# pseudocode
def _create_tuple_key(record):
    return (record.month, record)

def _compute_stats(acc, record):
    acc['count'] += 1
    acc['avg'] = _accumulate_avg(acc['count'], record)
    acc['ratio'] = _accumulate_ratio(acc['count'], record)
    return acc

df.rdd.map(_create_tuple_key).reduceByKey(_compute_stats)
df = pd.DataFrame({"ID":['A','B','C','D','E','F'],
"IPaddress":['12.345.678.01','12.345.678.02','12.345.678.01','12.345.678.18','12.345.678.02','12.345.678.01'],
"score":[8,9,5,10,3,7]})
I'm using Python and the pandas library. For rows with duplicate IP addresses, I want to keep only the one row with the highest score (scores range from 0 to 10) and drop all other duplicates.
I'm having a difficult time turning this logic into a Python function.
Step 1: Using the groupby function of pandas, split the df into groups by IPaddress.
df.groupby('IPaddress')
This creates a groupby object; if you check its type, it will be pandas.core.groupby.groupby.DataFrameGroupBy.
Step 2: With the pandas groupby object created in step 1, calling .idxmax() on the score column returns a pandas Series containing, for each IPaddress, the index of the row with the maximum score.
df.groupby('IPaddress').score.idxmax()
(Optional) Step 3: If you want to turn the above Series into a DataFrame of the selected rows, you can do the following:
df.loc[df.groupby('IPaddress').score.idxmax(),['IPaddress','score']]
Here you are selecting all the rows with the maximum scores and showing only the IPaddress and score columns.
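If you want to keep every column (including ID) rather than just IPaddress and score, the same idxmax result can be passed straight to .loc; a sketch reusing the df from the question:

# keep the entire highest-scoring row per IP address (all columns, including ID)
best_rows = df.loc[df.groupby('IPaddress').score.idxmax()]
print(best_rows)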
Useful references:
1. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
2. https://pandas.pydata.org/pandas-docs/version/0.22/groupby.html
3. https://www.geeksforgeeks.org/python-pandas-dataframe-idxmax/