Incremental append updated rows (based on some columns) in PySpark Palantir Foundry - apache-spark

I would like to create a historical dataset to which I append all NEW records of an input dataset.
By NEW records I mean both brand-new records and modified ones: any row that does not match an existing row on all columns except the 'reference_date' one.
Below is the piece of code that lets me do this comparison on all columns, but I can't figure out how to exclude one column from the comparison.
Inputs:
historical (previous):
ID | A   | B        | dt_run
1  | abc | football | 2022-02-14 21:00:00
2  | dba | volley   | 2022-02-14 21:00:00
3  | wxy | tennis   | 2022-02-14 21:00:00
input_df (new data):
ID | A   | B
1  | abc | football
2  | dba | football
3  | wxy | tennis
7  | abc | tennis
DESIRED OUTPUT (new records marked with *)
ID | A   | B        | dt_run
1  | abc | football | 2022-02-14 21:00:00
2  | dba | volley   | 2022-02-15 21:00:00
3  | wxy | tennis   | 2022-02-01 21:00:00
2  | dba | football | 2022-03-15 14:00:00  *
7  | abc | tennis   | 2022-03-15 14:00:00  *
My code which doesn't work:
@incremental(snapshot_inputs=['input_df'])
@transform(historical=Output("...."), input_df=Input("...."))
def append(input_df, historical):
    input_df = input_df.dataframe().withColumn('dt_run', F.to_timestamp(F.lit(datetime.now())))
    historical.write_dataframe(input_df.distinct()
                               .subtract(historical.dataframe('previous', schema=input_df.schema)))

I've tested the following script and it works. In this example you don't need to drop or select columns: with withColumn you create the missing column in input_df and also overwrite the values of the existing column in historical, so you can safely run subtract on the whole dataframes. Since the rows are then appended, the old historical rows stay intact with their old timestamps.
from transforms.api import transform, Input, Output, incremental
from pyspark.sql import functions as F
from datetime import datetime
@incremental(snapshot_inputs=['input_df'])
@transform(
    historical=Output("...."),
    input_df=Input("....")
)
def append(input_df, historical):
    now = datetime.now()
    df_inp = input_df.dataframe().withColumn('dt_run', F.to_timestamp(F.lit(now)))
    df_hist = historical.dataframe('previous', df_inp.schema).withColumn('dt_run', F.to_timestamp(F.lit(now)))
    historical.write_dataframe(df_inp.subtract(df_hist))
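To sanity-check the subtract logic outside Foundry, here is a minimal local sketch (my addition, not part of the answer) using the column names from the question. Because dt_run is stamped with the same value on both sides, only rows that changed in the other columns survive the subtract:
from datetime import datetime
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
now = datetime.now()
hist = spark.createDataFrame(
    [(1, 'abc', 'football'), (2, 'dba', 'volley'), (3, 'wxy', 'tennis')],
    ['ID', 'A', 'B']
).withColumn('dt_run', F.to_timestamp(F.lit(now)))
inp = spark.createDataFrame(
    [(1, 'abc', 'football'), (2, 'dba', 'football'), (3, 'wxy', 'tennis'), (7, 'abc', 'tennis')],
    ['ID', 'A', 'B']
).withColumn('dt_run', F.to_timestamp(F.lit(now)))
# Only (2, dba, football) and (7, abc, tennis) remain: the new/changed rows
inp.subtract(hist).show()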

You can use code similar to what is found here.
Once you have combined the previous output with the new input, you just need to use PySpark to determine which is the newest row and keep only that row (in place of line 19 of that example).
A possible implementation for this could be using F.row_number e.g.
import pyspark.sql.window as W
import pyspark.sql.functions as F
from datetime import datetime
from transforms.api import transform, Input, Output, incremental

@incremental()
@transform(
    input_df=Input('/examples/input_df'),
    output_df=Output('/examples/output_df')
)
def incremental_group_by(input_df, output_df):
    # Get new rows
    new_input_df = input_df.dataframe().withColumn('dt_run', F.to_timestamp(F.lit(datetime.now())))
    # Union with the old rows
    out_schema = new_input_df.schema
    both_df = new_input_df.union(
        output_df.dataframe('previous', schema=out_schema)
    )
    partition_cols = ["A", "B"]
    # Keep only the most recent row per partition
    both_df = both_df.withColumn(
        "row_number",
        F.row_number().over(W.Window.partitionBy(*partition_cols).orderBy(F.desc("dt_run")))
    ).where(F.col("row_number") == 1).drop("row_number")
    # To fully replace the output, we always set the output mode to 'replace'.
    # Checkpoint the combined dataframe before changing the output mode.
    both_df = both_df.localCheckpoint(eager=True)
    output_df.set_mode('replace')
    output_df.write_dataframe(both_df.select(out_schema.fieldNames()))
Edit: The main difference between my answer and the one above is whether you want to keep multiple rows in the output where 'A' and 'B' are the same. Which one is better depends on your use case!

I have used the unionByName() function along with dropDuplicates().
from datetime import datetime
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pyspark.sql.functions as fx
def append(df_input, df_hist):
    df_union = df_hist.unionByName(df_input, allowMissingColumns=True).dropDuplicates(['ID', 'A', 'B'])
    historical = df_union.withColumn('dt_run', fx.coalesce('dt_run', fx.to_timestamp(fx.lit(datetime.now()))))
    return historical
df_hist = spark.createDataFrame(
    [(1, 'abc', 'football', '2022-02-14 21:00:00'),
     (2, 'dba', 'volley', '2022-02-14 21:00:00'),
     (3, 'wxy', 'tennis', '2022-02-14 21:00:00')],
    schema=['ID', 'A', 'B', 'dt_run'])
df_hist = df_hist.withColumn('dt_run', fx.col('dt_run').cast('timestamp'))
df_input = spark.createDataFrame(
    [(1, 'abc', 'football'), (2, 'dba', 'football'), (3, 'wxy', 'tennis'), (7, 'abc', 'tennis')],
    schema=['ID', 'A', 'B'])
df_historical = append(df_input, df_hist)
df_historical.show(truncate=False)

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but can't seem to make it work. I have tried to transpose it and pivot it, but can't really get it the way I want it.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point per hour from 00:00 to 20:00 for each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
From the details you have provided, I think you are dealing with time series data, with points from different dates acquired at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we calculate the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from your date string because they complicate parsing. We convert each string to a datetime object, take the date part, and keep the unique values.
Now we create a new empty df in desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
after this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # this creates a row of 3 elements, which will be inserted into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # data for 02:00 on that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # data for 03:00 the same day
    new_df.loc[len(new_df)] = new_row_data  # insert the row at the last position
this should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
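If the real data spans every hour from 00:00 to 20:00, a pivot avoids building the columns in a loop. This is only a sketch of an alternative (not from the original answer) and reuses the df built above:
import pandas as pd

df["date"] = pd.to_datetime(df["date"].str[:-3])  # trim the trailing zeros, as above
wide = (
    df.assign(day=df["date"].dt.date, hour=df["date"].dt.strftime("%H:%M"))
      .pivot(index="day", columns="hour", values="value")
      .reset_index()
)
print(wide)  # one row per day, one column per acquisition time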

how to extract different tables in excel sheet using python

In one Excel file, on sheet 1, there are 4 tables at different locations in the sheet. How do I read those 4 tables? For reference, I have added a picture snapshot from Google. Is there any way to extract the tables without using indexes?
I assume your tables are formatted as "Excel Tables".
You can create an Excel Table by selecting a range and inserting a table (Insert > Table).
There is a good guide from Samuel Oranyeli on how to import Excel Tables with Python. I have used his code and show it with examples.
I have used the following data in excel, where each color represents a table.
Remarks about code:
The following part can be used to check which tables exist in the worksheet that we are working with:
# check what tables that exist in the worksheet
print({key : value for key, value in ws.tables.items()})
In our example this code will give:
{'Table2': 'A1:C18', 'Table3': 'D1:F18', 'Table4': 'G1:I18', 'Table5': 'J1:K18'}
Here you set the dataframe names. Be cautious: if the number of dataframes mismatches the number of tables, you will get an error.
# Extract all the tables to individually dataframes from the dictionary
Table2, Table3, Table4, Table5 = mapping.values()
# Print each dataframe
print(Table2.head(3)) # Print first 3 rows from df
print(Table2.head(3)) gives:
Index first_name last_name address
0 Aleshia Tomkiewicz 14 Taylor St
1 Evan Zigomalas 5 Binney St
2 France Andrade 8 Moor Place
Full code:
#import libraries
from openpyxl import load_workbook
import pandas as pd
# read file
wb = load_workbook("G:/Till/Tables.xlsx") # Set the filepath + filename
# select the sheet where tables are located
ws = wb["Tables"]
# check what tables that exist in the worksheet
print({key : value for key, value in ws.tables.items()})
mapping = {}
# loop through all the tables and add to a dictionary
for entry, data_boundary in ws.tables.items():
    # parse the data within the ref boundary
    data = ws[data_boundary]
    ### extract the data ###
    # the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent]
               for ent in data]
    header = content[0]
    # the contents, excluding the header
    rest = content[1:]
    # create a dataframe with the column names
    # and pair the table name with the dataframe
    df = pd.DataFrame(rest, columns=header)
    mapping[entry] = df
# print(mapping)
# Extract all the tables to individually dataframes from the dictionary
Table2, Table3, Table4, Table5 = mapping.values()
# Print each dataframe
print(Table2)
print(Table3)
print(Table4)
print(Table5)
Example data, example file:
first_name  | last_name  | address              | city                           | county                | postal
Aleshia     | Tomkiewicz | 14 Taylor St         | St. Stephens Ward              | Kent                  | CT2 7PP
Evan        | Zigomalas  | 5 Binney St          | Abbey Ward                     | Buckinghamshire       | HP11 2AX
France      | Andrade    | 8 Moor Place         | East Southbourne and Tuckton W | Bournemouth           | BH6 3BE
Ulysses     | Mcwalters  | 505 Exeter Rd        | Hawerby cum Beesby             | Lincolnshire          | DN36 5RP
Tyisha      | Veness     | 5396 Forth Street    | Greets Green and Lyng Ward     | West Midlands         | B70 9DT
Eric        | Rampy      | 9472 Lind St         | Desborough                     | Northamptonshire      | NN14 2GH
Marg        | Grasmick   | 7457 Cowl St #70     | Bargate Ward                   | Southampton           | SO14 3TY
Laquita     | Hisaw      | 20 Gloucester Pl #96 | Chirton Ward                   | Tyne & Wear           | NE29 7AD
Lura        | Manzella   | 929 Augustine St     | Staple Hill Ward               | South Gloucestershire | BS16 4LL
Yuette      | Klapec     | 45 Bradfield St #166 | Parwich                        | Derbyshire            | DE6 1QN
Fernanda    | Writer     | 620 Northampton St   | Wilmington                     | Kent                  | DA2 7PP
Charlesetta | Erm        | 5 Hygeia St          | Loundsley Green Ward           | Derbyshire            | S40 4LY
Corrinne    | Jaret      | 2150 Morley St       | Dee Ward                       | Dumfries and Galloway | DG8 7DE
Niesha      | Bruch      | 24 Bolton St         | Broxburn, Uphall and Winchburg | West Lothian          | EH52 5TL
Rueben      | Gastellum  | 4 Forrest St         | Weston-Super-Mare              | North Somerset        | BS23 3HG
Michell     | Throssell  | 89 Noon St           | Carbrooke                      | Norfolk               | IP25 6JQ
Edgar       | Kanne      | 99 Guthrie St        | New Milton                     | Hampshire             | BH25 5DF
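If you would rather work with one combined DataFrame instead of unpacking the dictionary into separate names, a possible sketch (my addition, not from the guide) is to concatenate everything in mapping, keyed by table name:
# The dict keys become the outer level of a MultiIndex; columns that differ
# between tables are union-ed and filled with NaN.
all_tables = pd.concat(mapping, names=["table", "row"])
print(all_tables.head())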
You may convert your Excel sheet to a CSV file and then use the csv module (or pandas) to grab rows.
import pandas as pd
read_file = pd.read_excel("Test.xlsx")
read_file.to_csv("Test.csv", index=None, header=True)
df = pd.DataFrame(pd.read_csv("Test.csv"))
print(df)
For a better approach, please provide us a sample Excel file.
You need two things:
Access OpenXML data via python: https://github.com/python-openxml/python-xlsx
Find the tables in the file, via what is called a DefinedName: https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.spreadsheet.definedname?view=openxml-2.8.1
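As a rough sketch only (using openpyxl instead of the python-xlsx package linked above, and assuming openpyxl 3.1+ and a workbook that actually contains defined names), resolving each DefinedName to a DataFrame could look roughly like this:
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook("Tables.xlsx")  # hypothetical file name
for name, defined_name in wb.defined_names.items():
    # each destination is a (sheet title, cell range) pair
    for sheet_title, coord in defined_name.destinations:
        ws = wb[sheet_title]
        rows = [[cell.value for cell in row] for row in ws[coord]]
        df = pd.DataFrame(rows[1:], columns=rows[0])
        print(name, df.shape)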

Add new rows to dataframe using existing rows from previous year

I'm creating a Pandas dataframe from an existing file and it ends up essentially like this.
import pandas as pd
import datetime
data = [[i, i+1] for i in range(14)]
index = pd.date_range(start=datetime.date(2019,1,1), end=datetime.date(2020,2,1), freq='MS')
columns = ['col1', 'col2']
df = pd.DataFrame(data, index, columns)
Notice that this doesn't go all the way up to the present -- often the file I'm pulling from is a month or two behind. What I then need to do is add on any missing months and fill them with the same value as the previous year.
So in this case I need to add another row that is
2020-03-01 2 3
It could be anywhere from 0-2 rows that need to be added to the end of the dataframe at a given point in time. What's the best way to do this?
Note: The data here is not real so please don't take advantage of the simple pattern of entries I gave above. It was just a quick way to fill two columns of a table as an example.
If I understand your problem, the following should help. It does assume, however, that you always have data from 12 months earlier. You can define a new DataFrame which includes the months up to the most recent date.
# First create the new index. Get the most recent date and add an offset.
start, end = df.index[-1] + pd.DateOffset(), pd.Timestamp.now()
index_new = pd.date_range(start, end, freq='MS')
Then create your new DataFrame:
# Get the data from the previous year.
data = df.loc[index_new - pd.DateOffset(years=1)].values
df_new = pd.DataFrame(data, index = index_new, columns=df.columns)
which looks like
col1 col2
2020-03-01 2 3
then just use:
pd.concat([df, df_new], axis=0)
Which gives
col1 col2
2019-01-01 0 1
2019-02-01 1 2
2019-03-01 2 3
... ... ...
2020-02-01 13 14
2020-03-01 2 3
Note
This also works for cases where the number of months missing is greater than 1.
Edit
Slightly different variation
# Create series with missing months added.
# Get the corresponding data 12 months prior.
s = pd.date_range(df.index[0], pd.Timestamp.now(), freq='MS')
fill = df.loc[s[~s.isin(df.index)] - pd.DateOffset(years=1)]
# Reindex the original dataframe
df = df.reindex(s)
# Find the dates to fill and replace with lagged data
df.iloc[-1 * fill.shape[0]:] = fill.values

Merging sheets of excel using python

I am trying to take the data from two sheets and compare them with each other; where the names match, I want to append a column. Let me explain what I am doing and what output I am trying to get, using Python.
This is my sheet1 from excel.xlsx:
it contains four columns: name, class, age and group.
This is my sheet2 from excel.xlsx:
it contains a default column, and a name column with extra names in it.
So now I am trying to match the names in sheet2 with sheet1: if a name in sheet1 matches one in sheet2, I want to add the default value corresponding to that name from sheet2.
This is the output I need:
As you can see, only Ravi and Neha have a default in sheet2, and those names match sheet1 names. Suhas and Aish don't have any default value, so nothing appears for them.
This is the code I tried:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)
and I get an output Excel like this:
I'm not getting the default against Ravi.
Please help me get the expected output using Python.
Assuming you read each sheet into a dataframe (df = sheet1, df2 = sheet2)
it's quite easy and there are a few options (ranked in order of speed, from fastest to slowest):
# .merge
df = df.merge(df2, how='left', on='Name')
# pd.concat
df = pd.concat([df.set_index('Name'), df2.set_index('Name').Default], axis=1, sort='Name', join='inner')
# .join
df = df.set_index('Name').join(df2.set_index('Name'))
# .map
df.Default = df.Name.map(df2.set_index('Name')['Default'].to_dict())
All of them will have the following output:
Name Default Class Age Group
0 NaN NaN 4 2 tig
1 Ravi 2.0 5 5 rose
2 NaN NaN 3 3 lily
3 Suhas NaN 5 5 rose
4 NaN NaN 2 2 sun
5 Neha 3.0 5 5 rose
6 NaN NaN 5 2 sun
7 Aish NaN 5 5 rose
Then you overwrite the original sheet using df.to_excel.
EDIT
The code you shared has 3 problems, one of which seems to be a language barrier: you only need one of the options I gave you. Secondly, there is a missing ' when reading the first sheet into df1. And lastly, you're inconsistent with the dataframe names: you defined df1 and df2 but used just df in the code, which doesn't work.
So the correct code would be as follows:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1') #Here the ' was missing
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
## Now you chose one of the options, I used map here, but you can pick any one of them
df1.DEFAULT = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)

How can I loop over rows in my DataFrame, calculate a value and put that value in a new column with this lambda function

./test.csv looks like:
price datetime
1 100 2019-10-10
2 150 2019-11-10
...
import pandas as pd
import datetime as date
import datetime as time
from datetime import datetime
from datetime import timedelta
csv_df = pd.read_csv('./test.csv')
today = datetime.today()
csv_df['datetime'] = csv_df['expiration_date'].apply(lambda x: pd.to_datetime(x)) #convert `expiration_date` to datetime Series
def days_until_exp(expiration_date, today):
    diff = (expiration_date - today)
    return [diff]

csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: days_until_exp(csv_df['datetime'], today))
I am trying to iterate over a specific column in my DataFrame labeled csv_df['datetime'], which in each cell has just one value (a date), and do a calculation defined by diff.
Then I want the single value diff to be put into the new Series csv_df['days_until_expiration'].
The problem is, it's calculating values for every row (673 rows) and putting all those values in a list in each row of csv_df['days_until_expiration']. I realize it may be due to the brackets around [diff], but without them I get an error.
In Excel, I would just do something like =SUM(datetime - price) and click and drag down the rows to populate a new column. However, I want to do this in Pandas as it's part of a bigger application.
csv_df['datetime'] is a Series, so the x in apply is each cell of the Series. You call apply with a lambda and days_until_exp(), but you aren't passing x to it, so the result is wrong.
Anyway, without your sample data, I guess that you want the sum of csv_df['datetime'] - today(). To do this you don't need apply; just do a direct vectorized operation on the Series and sum it.
I made a two-column dataframe as a sample:
csv_df:
datetime days_until_expiration
0 2019-09-01 NaN
1 2019-09-02 NaN
2 2019-09-03 NaN
The following returns a Series of deltas between csv_df['datetime'] and today(), which I guess is what you want:
td = datetime.datetime.today()
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
csv_df:
datetime days_until_expiration
0 2019-09-01 115
1 2019-09-02 116
2 2019-09-03 117
OR:
To find the sum of all deltas and assign the same sum value to every row of csv_df['days_until_expiration']:
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days.sum()
csv_df:
datetime days_until_expiration
0 2019-09-01 348
1 2019-09-02 348
2 2019-09-03 348
