Dataframe optimize groupby & calculated fields - python-3.x

I have a dataframe with the following structure:
import pandas as pd
import numpy as np
names = ['PersonA', 'PersonB', 'PersonC', 'PersonD','PersonE','PersonF']
team = ['Team1','Team2']
dates = pd.date_range(start = '2020-05-28', end = '2021-11-22')
df = pd.DataFrame({'runtime': np.repeat(dates, len(names)*len(team))})
df['name'] = len(dates)*len(team)*names
df['team'] = len(dates)*len(names)*team
df['A'] = 40 + 20*np.random.random(len(df))
df['B'] = .1 * np.random.random(len(df))
df['C'] = 1 +.5 * np.random.random(len(df))
I would like to create a dataframe that displays calculated mean values for the runtime in periods such as the previous Week, Month, Year, and All-Time, such that it looks like this:
name | team | A_w | B_w | C_w | A_m | B_m | C_m | A_y | B_y | C_y | A_at | B_at | C_at
I have successfully added a calculated column for the mean value using the lambda method described here:
How do I create a new column from the output of pandas groupby().sum()?
e.g.:
df = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_at=lambda gdf: gdf['A'].mean()))
My output gives me an additional column:
runtime name team A B C A_at
0 2020-05-28 PersonA Team1 55.608186 0.027767 1.311662 49.957820
1 2020-05-28 PersonB Team2 43.481041 0.038685 1.144240 50.057015
2 2020-05-28 PersonC Team1 47.277667 0.012190 1.047263 50.151846
3 2020-05-28 PersonD Team2 41.995354 0.040623 1.087151 50.412061
4 2020-05-28 PersonE Team1 49.824062 0.036805 1.416110 50.073381
... ... ... ... ... ... ... ...
6523 2021-11-22 PersonB Team2 46.799963 0.069523 1.322076 50.057015
6524 2021-11-22 PersonC Team1 48.851620 0.007291 1.473467 50.151846
6525 2021-11-22 PersonD Team2 49.711142 0.051443 1.044063 50.412061
6526 2021-11-22 PersonE Team1 57.074027 0.095908 1.464404 50.073381
6527 2021-11-22 PersonF Team2 41.372381 0.059240 1.132346 50.094965
[6528 rows x 7 columns]
But this is where it gets messy...
I don't need the runtime column, and I am unsure how to clean this up so that it only lists the 'name' & 'team' columns. Additionally, the way I have been producing my source dataframe(s) is by recreating the entire dataframe in a for loop for each time period with:
for pt in runtimes[:d]:
    <insert dataframe creation for d# of runtimes>
    if d==7:
        dfw = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_w=lambda gdf: gdf['A'].mean()))
    if d==30:
        dfm = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_m=lambda gdf: gdf['A'].mean()))
I am then attempting to concatenate the outputs like so:
dfs = pd.concat([dfw, dfm])
This works "OK" when d < 30, but when I'm looking at 90-100 days, it creates a dataframe with 50000+ rows and concats it with each other dataframe. Is there a way to perform this operation for x# of previous runtime values in-place?
Any tips on how to make this more efficient would be greatly appreciated.

An update...
I have been able to formulate a decent output by doing the following:
dfs = pd.DataFrame(columns=['name','team'])
for pt in runtimes[:d]:
    if d == 7:
        df = <insert dataframe creation for d# of runtimes>
        dfw = df.groupby(['name', 'team'], as_index=True).apply(lambda gdf: gdf.assign(A_w=lambda gdf: gdf['A'].mean()))
        ...
        dfw = dfw[['name', 'A_w','B_w','C_w','team']]
        dfs = pd.merge(dfs, dfw, how='inner', on=['name', 'team'])
    if d == 30:
        df = <insert dataframe creation for d# of runtimes>
        dfm = df.groupby(['name', 'team'], as_index=True).apply(lambda gdf: gdf.assign(A_m=lambda gdf: gdf['A'].mean()))
        ...
        dfm = dfm[['name', 'A_m','B_m','C_m','team']]
        dfs = pd.merge(dfs, dfm, how='inner', on=['name', 'team'])
This gives me the output that I am expecting.
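One way to avoid rebuilding the source dataframe for every period is to keep the full history in a single frame, filter it once per window, aggregate with groupby().mean(), and concatenate the results column-wise. A rough sketch under the assumption that df holds the complete history with a datetime runtime column (the helper name period_means is made up for illustration):
import pandas as pd

# hypothetical helper: mean of A, B, C per name/team over the trailing `days` days,
# or over all rows when days is None
def period_means(df, days, suffix):
    sub = df if days is None else df[df['runtime'] > df['runtime'].max() - pd.Timedelta(days=days)]
    return sub.groupby(['name', 'team'])[['A', 'B', 'C']].mean().add_suffix(f'_{suffix}')

dfs = pd.concat(
    [period_means(df, 7, 'w'),
     period_means(df, 30, 'm'),
     period_means(df, 365, 'y'),
     period_means(df, None, 'at')],
    axis=1,
).reset_index()
Because every partial result shares the same (name, team) index, pd.concat(axis=1) lines the period columns up without any further merging.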

Related

How to change float to date type in python? (ValueError: day is out of range for month)

I have the following column:
0 3012022.0
1 3012022.0
2 3012022.0
3 3012022.0
4 3012022.0
...
351 24032022.0
352 24032022.0
df.Data = df.Data.astype('str')
I converted the float to string and I'm trying to transform them into dates:
df['data'] = pd.to_datetime(df['data'], format='%d%m%Y'+'.0').dt.strftime('%d-%m-%Y')
output:
ValueError: day is out of range for month
the code is:
import glob
import os

import pandas as pd

os.chdir('/home/carol/upload')
for file in glob.glob("*.xlsx"):
    xls = pd.ExcelFile('/home/carol/upload/%s' % (file))
    if len(xls.sheet_names) > 1:
        list_sheets = []
        for i in xls.sheet_names:
            df = pd.read_excel(xls, i)
            list_sheets.append(df)
        df = pd.concat(list_sheets)
    else:
        df = pd.read_excel(xls)
    df = df[['Data','Frota','Placa','ValorFrete', 'ValorFaturado','CodFilial','NomeFilial']]
    df = df.dropna()
    df = df.apply(lambda x: x.astype(str).str.lower())
    df.columns = df.columns.str.lower()
    df.columns = df.columns.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
    df = df.apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
Convert to string, remove the decimal point and coerce to datetime. Code as follows:
df['data'] = pd.to_datetime(df['data'].astype(str).str.split('\.').str[0], format='%d%m%Y')
data
0 2022-01-30
1 2022-01-30
2 2022-01-30
3 2022-01-30
4 2022-01-30
351 2022-03-24
352 2022-03-24
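If the error persists for some rows, a quick way to locate the offending values (a sketch, assuming df['data'] still holds the raw float/string values) is to parse with errors='coerce' and inspect what comes back as NaT:
cleaned = df['data'].astype(str).str.split('.').str[0]
parsed = pd.to_datetime(cleaned, format='%d%m%Y', errors='coerce')
print(cleaned[parsed.isna()].unique())  # raw values that do not match %d%m%Y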

Groupby year-month and find top N smallest standard deviation values columns in Python

With the sample data and code below, I'm trying to group by year-month and find the top K columns with the smallest std values among all the columns ending with _values:
import pandas as pd
import numpy as np
from statistics import stdev
np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates, columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])
k = 3 # set k as 3
value_cols = df.columns[df.columns.str.endswith('_values')]
def find_topK_smallest_std(group):
    std = stdev(group[value_cols])
    cols = std.nsmallest(k).index
    out_cols = [f'std_{i+1}' for i in range(k)]
    rv = group.loc[:, cols]
    rv.columns = out_cols
    return rv
df.groupby(pd.Grouper(freq='M'), dropna=False).apply(find_topK_smallest_std)
But it raises a TypeError; how could I fix this issue? Sincere thanks in advance.
Out:
TypeError: can't convert type 'str' to numerator/denominator
Reference link:
Groupby year-month and find top N smallest values columns in Python
In your solution, add DataFrame.apply to run stdev per column; if you need it per row, add axis=1:
def find_topK_smallest_std(group):
    # processing per column
    std = group[value_cols].apply(stdev)
    cols = std.nsmallest(k).index
    out_cols = [f'std_{i+1}' for i in range(k)]
    rv = group.loc[:, cols]
    rv.columns = out_cols
    return rv

df = df.groupby(pd.Grouper(freq='M'), dropna=False).apply(find_topK_smallest_std)
print(df)
std_1 std_2 std_3
2013-02-26 7.333694 3.126731 1.389472
2013-02-27 7.529254 7.843101 6.621605
2013-02-28 6.165574 5.612724 0.866300
2013-03-01 5.693051 3.711608 4.521452
2013-03-02 7.322250 4.763135 5.178144
... ... ...
2013-05-22 8.795736 3.864723 6.316478
2013-05-23 7.959282 5.140268 1.839659
2013-05-24 5.412016 5.890717 9.081583
2013-05-25 1.088414 1.610210 9.016004
2013-05-26 4.930571 6.893207 2.338785
[90 rows x 3 columns]
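As a side note, pandas' own DataFrame.std() computes the sample standard deviation (ddof=1), the same as statistics.stdev, so the per-column apply can be skipped entirely; a minimal sketch of the same function:
def find_topK_smallest_std(group):
    # sample standard deviation per column, equivalent to applying stdev
    std = group[value_cols].std()
    cols = std.nsmallest(k).index
    rv = group.loc[:, cols]
    rv.columns = [f'std_{i+1}' for i in range(k)]
    return rv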

Is there a pythonic way to merge a dataframe on the datetime with datapairs with an irregular datetimestamp

I have several data series where every data point is saved with a timestamp at [ms] accuracy. I want to merge these series onto one timeline, with all timestamps resampled to [s] accuracy. In the end there should be one DataFrame whose first column is the datetime holding all the distinct timestamps from the series, and all the other columns merged on that datetime.
My code is working, but fails on large data because of memory.
Data looks like this:
a_data; a_Timestamp; b_data; b_Timestamp; c_data ; c_Timestamp
1; 2019-07-24 12:00:00.123; 2 ; 2019-07-24 12:00:00.234; 3 ; 2019-07-24 12:00:00.345;
2; 2019-07-24 12:00:03.123; 3 ; 2019-07-24 12:00:02.234; 4 ; 2019-07-24 12:00:03.645;
My code is below:
import numpy as np
import pandas as pd
import datetime as dt
def prepareData(df):
    dfm = None
    df = df.dropna(axis='columns', how='all')
    df = df.sort_index()
    for col in df:
        if "Timestamp" not in col:
            series = pd.DataFrame({'DateTime': pd.to_datetime(df[col + '_Timestamp']).astype('datetime64[s]'), col: df[col]})
            if dfm is not None:
                dfm = dfm.merge(series, on='DateTime', how='outer').sort_values('DateTime')
            else:
                dfm = series
    dfm = dfm.loc[~dfm.DateTime.duplicated(keep='first')]
    dfm = dfm.sort_index()
    dfm = dfm.fillna(method='ffill')
    dfm = dfm.fillna(method='bfill')
    dfm = dfm.fillna(0)
    return dfm.reset_index()
df = pd.read_csv('file.csv', sep = ";", na_values="n/a" ,low_memory=False)
prepareData(df).to_csv( 'file_sampled.csv', sep = ';')
result should be
DateTime; a_data; b_data ; c_data
2019-07-24 12:00:00; 1;2;3
2019-07-24 12:00:02; 1;3;3
2019-07-24 12:00:03; 2;3;3
2019-07-24 12:00:04; 2;3;4
I got this result, but the memory it takes is too much for my pc. I guess there is a better way to do this.
First we select every data and every timestamp column and put them side by side:
x = pd.concat([pd.melt(df.iloc[:,::2], value_name='data'), pd.melt(df.iloc[:,1::2], value_name='DateTime').iloc[:,-1]], axis=1)
Convert the date time strings to DateTime, round to full seconds and set as index:
x['DateTime'] = pd.to_datetime(x.DateTime).dt.round('s')
x = x.set_index('DateTime')
Finally we pivot the data:
x.pivot(columns='variable', values='data')
Result:
variable a_data b_data c_data
DateTime
2019-07-24 12:00:00 1.0 2.0 3.0
2019-07-24 12:00:02 NaN 3.0 NaN
2019-07-24 12:00:03 2.0 NaN NaN
2019-07-24 12:00:04 NaN NaN 4.0
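If the remaining NaN values should be filled the way the question's own prepareData does (forward- then backward-fill), one possible follow-up on the pivoted frame is:
result = x.pivot(columns='variable', values='data').ffill().bfill().reset_index()
This approximates the filled table shown in the question, with DateTime as the first column.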

How to merge two dataframes and return data from another column in new column only if there is match?

I have a two df that look like this:
df1:
id
1
2
df2:
id value
2 a
3 b
How do I merge these two dataframes and only return the data from value column in a new column if there is a match?
new_merged_df
id value new_value
1
2 a a
3 b
You can try this using @JJFord3's setup:
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']},index=[2,3])
#Use isin to create new_value
df2['new_value'] = df2['value'].where(df2.index.isin(df1.index))
#Use reindex with union to rebuild dataframe with both indexes
df2.reindex(df1.index.union(df2.index))
Output:
value new_value
1 NaN NaN
2 a a
3 b NaN
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']},index=[2,3])
new_merged_df_outer = df1.merge(df2,how='outer',left_index=True,right_index=True)
new_merged_df_inner = df1.merge(df2,how='inner',left_index=True,right_index=True)
new_merged_df_inner = new_merged_df_inner.rename(columns={'value':'new_value'})
new_merged_df = new_merged_df_outer.merge(new_merged_df_inner,how='left',left_index=True,right_index=True)
First, create an outer merge to keep all indexes.
Then create an inner merge to only get the overlap.
Then merge the inner merge back to the outer merge to get the desired column setup.
You can use a full outer join.
Let's model your data with case classes:
case class MyClass1(id: String)
case class MyClass2(id: String, value: String)
// this one for the result type
case class MyClass3(id: String, value: Option[String] = None, value2: Option[String] = None)
Creating some inputs:
val input1: Dataset[MyClass1] = ...
val input2: Dataset[MyClass2] = ...
Joining your data:
import spark.implicits._  // spark is the SparkSession

val joined = input1.as("1").joinWith(input2.as("2"), $"1.id" === $"2.id", "full_outer")
joined map {
  case (left, null) if left != null => MyClass3(left.id)
  case (null, right) if right != null => MyClass3(right.id, Some(right.value))
  case (left, right) => MyClass3(left.id, Some(right.value), Some(right.value))
}
DataFrame.merge has an indicator parameter which:
If True, adds a column to output DataFrame called "_merge" with information on the source of each row.
This can be used to check if there is a match:
import pandas as pd
df1 = pd.DataFrame(index=[1,2])
df2 = pd.DataFrame({'value' : ['a','b']},index=[2,3])
# creates a new column `_merge` with values `right_only`, `left_only` or `both`
merged = df1.merge(df2, how='outer', right_index=True, left_index=True, indicator=True)
merged['new_value'] = merged.loc[(merged['_merge'] == 'both'), 'value']
merged = merged.drop('_merge', axis=1)
Use merge and isin:
df = df1.merge(df2,on='id',how='outer')
id_value = df2.loc[df2['id'].isin(df1.id.tolist()),'id'].unique()
mask = df['id'].isin(id_value)
df.loc[mask,'new_value'] = df.loc[mask,'value']
# alternative df['new_value'] = np.where(mask, df['value'], np.nan)
print(df)
id value new_value
0 1 NaN NaN
1 2 a a
2 3 b NaN
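Under the same column layout, the merge and the isin check can also be folded into a compact two-liner (a sketch, not a drop-in tested answer):
df = df1.merge(df2, on='id', how='outer')
df['new_value'] = df['value'].where(df['id'].isin(df1['id']))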

Spark: Merge 2 dataframes by adding row index/number on both dataframes

Q: Is there is any way to merge two dataframes or copy a column of a dataframe to another in PySpark?
For example, I have two Dataframes:
DF1
C1 C2
23397414 20875.7353
5213970 20497.5582
41323308 20935.7956
123276113 18884.0477
76456078 18389.9269
the second dataframe
DF2
C3 C4
2008-02-04 262.00
2008-02-05 257.25
2008-02-06 262.75
2008-02-07 237.00
2008-02-08 231.00
Then I want to add C3 of DF2 to DF1 like this:
New DF
C1 C2 C3
23397414 20875.7353 2008-02-04
5213970 20497.5582 2008-02-05
41323308 20935.7956 2008-02-06
123276113 18884.0477 2008-02-07
76456078 18389.9269 2008-02-08
I hope this example was clear.
row_number + window function (i.e. solution 1) or zipWithIndex.map (i.e. solution 2) should help in this case.
Solution 1: You can use window functions to get this kind of row numbering.
Then I would suggest you add the row number as an additional column (columnindex) to each DataFrame, say df1:
DF1
C1 C2 columnindex
23397414 20875.7353 1
5213970 20497.5582 2
41323308 20935.7956 3
123276113 18884.0477 4
76456078 18389.9269 5
the second dataframe
DF2
C3 C4 columnindex
2008-02-04 262.00 1
2008-02-05 257.25 2
2008-02-06 262.75 3
2008-02-07 237.00 4
2008-02-08 231.00 5
Now do an inner join of df1 and df2, that's all... you will get the output below.
The code looks something like this:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

# row_number needs an ordering; ordering by monotonically_increasing_id keeps the original row order
w = Window.orderBy(monotonically_increasing_id())
df1 = ....  # as shown above: df1
df2 = ....  # as shown above: df2
df11 = df1.withColumn("columnindex", row_number().over(w))
df22 = df2.withColumn("columnindex", row_number().over(w))
newDF = df11.join(df22, df11.columnindex == df22.columnindex, 'inner').drop(df22.columnindex)
newDF.show()
New DF
C1 C2 C3
23397414 20875.7353 2008-02-04
5213970 20497.5582 2008-02-05
41323308 20935.7956 2008-02-06
123276113 18884.0477 2008-02-07
76456078 18389.9269 2008-02-08
Solution 2: Another good way (probably the best :)) is in Scala, which you can translate to PySpark:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

/**
 * Add Column Index to dataframe
 */
def addColumnIndex(df: DataFrame) = sqlContext.createDataFrame(
  // Add Column index
  df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
  // Create schema
  StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
)

// Add index now...
val df1WithIndex = addColumnIndex(df1)
val df2WithIndex = addColumnIndex(df2)

// Now time to join ...
val newone = df1WithIndex
  .join(df2WithIndex, Seq("columnindex"))
  .drop("columnindex")
I thought I would share the Python (PySpark) translation of answer #2 above from @Ram Ghadiyaram:
from pyspark.sql.functions import col

def addColumnIndex(df):
    # Create new column names
    oldColumns = df.schema.names
    newColumns = oldColumns + ["columnindex"]
    # Add Column index
    df_indexed = df.rdd.zipWithIndex().map(lambda (row, columnindex): \
                                           row + (columnindex,)).toDF()
    # Rename all the columns
    new_df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx],
                    newColumns[idx]), xrange(len(oldColumns)), df_indexed)
    return new_df

# Add index now...
df1WithIndex = addColumnIndex(df1)
df2WithIndex = addColumnIndex(df2)

# Now time to join ...
newone = df1WithIndex.join(df2WithIndex, col("columnindex"),
                           'inner').drop("columnindex")
For the Python 3 version:
from pyspark.sql.types import StructType, StructField, LongType

def with_column_index(sdf):
    new_schema = StructType(sdf.schema.fields + [StructField("ColumnIndex", LongType(), False),])
    return sdf.rdd.zipWithIndex().map(lambda row: row[0] + (row[1],)).toDF(schema=new_schema)

df1_ci = with_column_index(df1)
df2_ci = with_column_index(df2)
join_on_index = df1_ci.join(df2_ci, df1_ci.ColumnIndex == df2_ci.ColumnIndex, 'inner').drop("ColumnIndex")
I referred to his (@Jed) answer:
from pyspark.sql.functions import col

def addColumnIndex(df):
    # Get old column names and add a column "columnindex"
    oldColumns = df.columns
    newColumns = oldColumns + ["columnindex"]
    # Add Column index
    df_indexed = df.rdd.zipWithIndex().map(lambda (row, columnindex): \
                                           row + (columnindex,)).toDF()
    # Rename all the columns
    oldColumns = df_indexed.columns
    new_df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx],
                    newColumns[idx]), xrange(len(oldColumns)), df_indexed)
    return new_df

# Add index now...
df1WithIndex = addColumnIndex(df1)
df2WithIndex = addColumnIndex(df2)

# Now time to join ...
newone = df1WithIndex.join(df2WithIndex, col("columnindex"),
                           'inner').drop("columnindex")
This answer solved it for me:
import pyspark.sql.functions as sparkf
# This will return a new DF with all the columns + id
res = df.withColumn('id', sparkf.monotonically_increasing_id())
Credit to Arkadi T
Here is a simple example that can help you even if you have already solved the issue.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// create the first Dataframe
val df1 = spark.sparkContext.parallelize(Seq(1, 2, 1)).toDF("lavel1")

// create the second Dataframe
val df2 = spark.sparkContext.parallelize(Seq((1.0, 12.1), (12.1, 1.3), (1.1, 0.3))).toDF("f1", "f2")

// combine both dataframes
val combinedRow = df1.rdd.zip(df2.rdd).map({
  // convert both rows to Seq, join them and return as a single Row
  case (df1Data, df2Data) => Row.fromSeq(df1Data.toSeq ++ df2Data.toSeq)
})

// create a new schema from both dataframes' schemas
val combinedschema = StructType(df1.schema.fields ++ df2.schema.fields)

// create a new dataframe from the new rows and new schema
val finalDF = spark.sqlContext.createDataFrame(combinedRow, combinedschema)
finalDF.show
Expanding on Jed's answer, in response to Ajinkya's comment:
To get the same old column names, you need to replace "old_cols" with a column list of the newly named indexed columns. See my modified version of the function below
def add_column_index(df):
    new_cols = df.schema.names + ['ix']
    ix_df = df.rdd.zipWithIndex().map(lambda (row, ix): row + (ix,)).toDF()
    tmp_cols = ix_df.schema.names
    return reduce(lambda data, idx: data.withColumnRenamed(tmp_cols[idx], new_cols[idx]), xrange(len(tmp_cols)), ix_df)
Not the best way performance-wise:
df3 = df1.crossJoin(df2)
df3.show(3)
To merge columns from two different dataframes you first have to create a column index and then join the two dataframes. Indeed, two dataframes are similar to two SQL tables; to make a connection you have to join them.
If you don't care about the final order of the rows you can generate the index column with monotonically_increasing_id().
Using the following code you can check that monotonically_increasing_id generates the same index column in both dataframes (at least up to a billion rows), so you won't have any error in the merged dataframe.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

sample_size = 10**9
sdf1 = spark.range(1, sample_size).select(F.col("id").alias("id1"))
sdf2 = spark.range(1, sample_size).select(F.col("id").alias("id2"))
sdf1 = sdf1.withColumn("idx", F.monotonically_increasing_id())
sdf2 = sdf2.withColumn("idx", F.monotonically_increasing_id())
sdf3 = sdf1.join(sdf2, 'idx', 'inner')
sdf3 = sdf3.withColumn("diff", F.col("id1") - F.col("id2")).select("diff")
sdf3.filter(F.col("diff") != 0).show()
You can use a combination of monotonically_increasing_id (guaranteed to always be increasing) and row_number (guaranteed to always give the same sequence). You cannot use row_number alone because it needs to be ordered by something. So here we order by monotonically_increasing_id. I am using Spark 2.3.1 and Python 2.7.13.
from pandas import DataFrame
from pyspark.sql.functions import (
    monotonically_increasing_id,
    row_number)
from pyspark.sql import Window

DF1 = spark.createDataFrame(DataFrame({
    'C1': [23397414, 5213970, 41323308, 123276113, 76456078],
    'C2': [20875.7353, 20497.5582, 20935.7956, 18884.0477, 18389.9269]}))

DF2 = spark.createDataFrame(DataFrame({
    'C3': ['2008-02-04', '2008-02-05', '2008-02-06', '2008-02-07', '2008-02-08']}))

DF1_idx = (
    DF1
    .withColumn('id', monotonically_increasing_id())
    .withColumn('columnindex', row_number().over(Window.orderBy('id')))
    .select('columnindex', 'C1', 'C2'))

DF2_idx = (
    DF2
    .withColumn('id', monotonically_increasing_id())
    .withColumn('columnindex', row_number().over(Window.orderBy('id')))
    .select('columnindex', 'C3'))

DF_complete = (
    DF1_idx
    .join(
        other=DF2_idx,
        on=['columnindex'],
        how='inner')
    .select('C1', 'C2', 'C3'))

DF_complete.show()
+---------+----------+----------+
| C1| C2| C3|
+---------+----------+----------+
| 23397414|20875.7353|2008-02-04|
| 5213970|20497.5582|2008-02-05|
| 41323308|20935.7956|2008-02-06|
|123276113|18884.0477|2008-02-07|
| 76456078|18389.9269|2008-02-08|
+---------+----------+----------+
