Does PySpark run operations out of sequence due to optimization?

I'm confused about the result my code is giving me. Here is the code I wrote:
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

# read_from_cassandra_dev, create_empty_df, write_on_cassandra_dev and display
# are helpers defined elsewhere in the notebook.
def update_cassandra(df: DataFrame, aggr: str):
    aggr_map_dict = {
        'Giornaliera': 'day',
        'Settimanale': 'week',
        'Bi-Settimanale': 'bi_week',
        'Mensile': 'month'
    }
    # min and max dates present in the csv
    max_min_dates = df.agg(F.max(df['data']), F.min(df['data'])).collect()[0]
    upper_date = max_min_dates[0]
    lower_date = max_min_dates[1]
    df = df.select('data', 'punto_di_interesse', 'id_telco', 'presenze', 'presenze_uniche',
                   'presenze_00_06', 'presenze_06_08', 'presenze_08_10', 'presenze_10_12',
                   'presenze_12_14', 'presenze_14_16', 'presenze_16_18', 'presenze_18_20',
                   'presenze_20_22', 'presenze_22_24')
    print('contenuto del csv')  # content of the csv
    display(df.where(F.col('punto_di_interesse') == 'CC - Neapolis'))
    # read the existing aggregate from Cassandra for the same date range
    telco_day_aggr = read_from_cassandra_dev(f'telco_{aggr_map_dict[aggr]}_aggr') \
        .where(F.col('data').between(lower_date, upper_date))
    if telco_day_aggr.count() == 0:
        telco_day_aggr = create_empty_df()
    print('telco_day_aggr as is')
    display(telco_day_aggr.where(F.col('punto_di_interesse') == 'CC - Neapolis'))
    # concatenate the csv data with the as-is data and re-aggregate
    union_df = df.union(telco_day_aggr)
    print('unione del AS-IS e del csv')  # union of the AS-IS and the csv
    display(union_df.where(F.col('punto_di_interesse') == 'CC - Neapolis'))
    output_df = (union_df.groupBy('data', 'punto_di_interesse', 'id_telco')
                 .agg(
                     F.sum('presenze').alias('presenze'),
                     F.sum('presenze_uniche').alias('presenze_uniche'),
                     F.sum('presenze_00_06').alias('presenze_00_06'),
                     F.sum('presenze_06_08').alias('presenze_06_08'),
                     F.sum('presenze_08_10').alias('presenze_08_10'),
                     F.sum('presenze_10_12').alias('presenze_10_12'),
                     F.sum('presenze_12_14').alias('presenze_12_14'),
                     F.sum('presenze_14_16').alias('presenze_14_16'),
                     F.sum('presenze_16_18').alias('presenze_16_18'),
                     F.sum('presenze_18_20').alias('presenze_18_20'),
                     F.sum('presenze_20_22').alias('presenze_20_22'),
                     F.sum('presenze_22_24').alias('presenze_22_24')
                 ))
    return output_df

aggregate_df = update_cassandra(df_daily, 'Giornaliera')
write_on_cassandra_dev(aggregate_df, 'telco_day_aggr')
What I expect to achieve is a sort of update for Cassandra, because of the Cassandra drivers. So the operations, in my head, are like this:
read the csv from blob storage and store it in a DataFrame (the df variable, the input of the method)
using the max and min dates of this csv file, query the table in Cassandra and save the result in another variable
concatenate the two DataFrames
sum everything up with the groupBy
write the new DataFrame to Cassandra, overwriting the existing rows with the new ones
It seems to me that, somehow, what is in the DataFrame "df" gets written before I read "telco_day_aggr", and that the union and groupBy parts have no effect. In other words, my Cassandra table ends up containing only the content of df.
I can provide additional information if needed.
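
One way to check whether lazy evaluation is what reorders things would be to force the Cassandra read to materialize before the union and the final write. The sketch below reuses the variables and helper names from the snippet above (read_from_cassandra_dev and friends are assumed to exist in the notebook); it is only a diagnostic idea, not a confirmed fix:

# Diagnostic sketch: cache and count telco_day_aggr so the read from Cassandra
# is executed here, before anything is written back to the same table.
telco_day_aggr = (read_from_cassandra_dev(f'telco_{aggr_map_dict[aggr]}_aggr')
                  .where(F.col('data').between(lower_date, upper_date))
                  .cache())
print(telco_day_aggr.count())  # an action: triggers the read now and keeps the data in memory

union_df = df.union(telco_day_aggr)
# ... same groupBy / agg as above, then write_on_cassandra_dev(output_df, 'telco_day_aggr')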

Related

Any optimized way to iterate Excel and provide data to pd.read_sql() as a string, one by one?

# here I have to apply the loop which provides the queries from Excel for the respective reports:
df1 = pd.read_sql(SQLqueryB2, con=con1)
df2 = pd.read_sql(ORCqueryC2, con=con2)
if df1.equals(df2):
    print(Report2 + " : is Pass")

Can we achieve the above by doing something like this (iterating the ndarray)?

df = pd.read_excel(path)
for col, item in df.iteritems():
    ...

Or is the only option left to read the Excel file with the "openpyxl" library, iterate over the rows and columns, and then provide the values? I hope the question is clear; if there is any doubt, please comment.
You are trying to loop through an excel file, run the 2 queries, see if they match and output the result, correct?
import pandas as pd
from sqlalchemy import create_engine

# add user, pass, host and database name
con = create_engine(f"mysql+pymysql://{USER}:{PWD}@{HOST}/{DB}")

file = pd.read_excel('excel_file.xlsx')
file['Result'] = ''  # placeholder column

for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], con)
    df2 = pd.read_sql(row['Oracle Queries'], con)
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'

file.to_excel('results.xlsx', index=False)
This will save a file named results.xlsx that mirrors the original data but adds a column named Result that will be Pass or Fail.
Example results.xlsx:

How to use a PySpark window function on new, unprocessed data?

I have developed window functions on a PySpark DataFrame to calculate the total transaction amount made by each customer on a monthly basis, per transaction.
For example:
The input table holds the raw transactions, and the window function processes the data and inserts it into a processed table.
Now, if I get new transactions today, I want to write code that loads only the last month of transactions into a Spark DataFrame, runs the window function on the new rows, and saves them into the processed table. The current window function processes all the rows, after which the already-inserted records have to be excluded manually so that only new records are inserted. This uses a lot of resources and memory, and it will only get worse once the window function covers a whole year.
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when
from pyspark.sql.window import Window

# Function to apply the window function to credit (C) transactions
def cumulative_total_CR(df, from_column, to_column, window_function):
    intermediate_column = from_column + "_temp"
    df = df.withColumn(from_column, df[from_column].cast("double"))
    df = df.withColumn(intermediate_column, when(col("Flow") == 'C', df[from_column]).otherwise(0))
    df = df.withColumn(to_column, F.sum(intermediate_column).over(window_function))
    return df

# Function to apply the window function to debit (D) transactions
def cumulative_total_DR(df, from_column, to_column, window_function):
    intermediate_column = from_column + "_temp"
    df = df.withColumn(from_column, df[from_column].cast("double"))
    df = df.withColumn(intermediate_column, when(col("Flow") == 'D', df[from_column]).otherwise(0))
    df = df.withColumn(to_column, F.sum(intermediate_column).over(window_function))
    return df

# Window definition (the range is expressed in the units of the orderBy column, i.e. seconds after the cast)
window_function_30_days = (Window.partitionBy("CUSNO")
                           .orderBy(F.col("TxnDateTime").cast('long'))
                           .rangeBetween(-30, 0))

df = ...  # load data from Hive
# append TxnDate and TxnTime into a new column TxnDateTime, cast as timestamp with format 'yyyy-MM-dd HH:mm:ss.SSS'
df = cumulative_total_CR(df, "TXNAMT", "Total_Cr_Monthly_Amt", window_function_30_days)
df = cumulative_total_DR(df, "TXNAMT", "Total_Dr_Monthly_Amt", window_function_30_days)
# save the new records to disk
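
A possible direction for the incremental part, sketched only as an idea: keep a processed table, work out the newest timestamp already processed, load just the trailing month of history (old rows give the window its context, new rows still need results), apply the same functions, and append only the genuinely new rows. The table names processed_txns and raw_txns below are made-up placeholders, and spark is assumed to be an active SparkSession:

from pyspark.sql import functions as F

# Hypothetical cutoff: newest timestamp already present in the processed table.
last_processed = spark.table("processed_txns").agg(F.max("TxnDateTime")).collect()[0][0]

# Load only the last ~30 days of history: old rows give the window its context,
# new rows are the ones that still need results.
history = (spark.table("raw_txns")
           .where(F.col("TxnDateTime") >= F.date_sub(F.lit(last_processed), 30)))

scored = cumulative_total_CR(history, "TXNAMT", "Total_Cr_Monthly_Amt", window_function_30_days)
scored = cumulative_total_DR(scored, "TXNAMT", "Total_Dr_Monthly_Amt", window_function_30_days)

# Keep only rows newer than the cutoff before appending to the processed table.
new_rows = scored.where(F.col("TxnDateTime") > F.lit(last_processed))
new_rows.write.mode("append").saveAsTable("processed_txns")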

Return a dataframe to a dataframe

I am trying to take a list of IDs from my DataFrame dfrep and pass the ID column into a function I created, in order to feed the values into a query and return the results back to dfrep.
My function returns a DataFrame, but the result includes the header, and when I print dfrep there are two lines. I also cannot write the DataFrame to Excel using xlwings because I get TypeError: must be a pywintypes time object (got DataFrame).
def overrides(id):
    sql = f"select name from sales..rep where id in ({id})"
    mydf = pd.read_sql(sql, conn)
    return mydf

overrides = np.vectorize(overrides)
dfrep['name'] = overrides(dfrep['ID'])

wsData.range('A1').options(pd.DataFrame, index=False).value = dfrep
My goal is to load the column(s) from my function's DataFrame into my main DataFrame dfrep, and then write it to Excel via xlwings. Any help is appreciated.
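
For reference, a common alternative, sketched here assuming conn is an open connection, the IDs are numeric, and the sales..rep table is as in the snippet above, is to run a single IN (...) query for all the IDs and merge the names back, instead of vectorizing a per-row query:

import pandas as pd

# Build one IN (...) list from the unique IDs instead of querying per row.
ids = ",".join(str(i) for i in dfrep['ID'].unique())
names = pd.read_sql(f"select id, name from sales..rep where id in ({ids})", conn)

# Merge the returned names back onto the main DataFrame.
dfrep = dfrep.merge(names, left_on='ID', right_on='id', how='left').drop(columns='id')

# dfrep now has a 'name' column and can be written with xlwings as before:
# wsData.range('A1').options(pd.DataFrame, index=False).value = dfrep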

How to convert specific rows in a column into a separate column using pyspark and enumerate each row with an increasing numerical index? [duplicate]

I'm trying to read in retrosheet event file into spark. The event file is structured as such.
id,TEX201403310
version,2
info,visteam,PHI
info,hometeam,TEX
info,site,ARL02
info,date,2014/03/31
info,number,0
info,starttime,1:07PM
info,daynight,day
info,usedh,true
info,umphome,joycj901
info,attendance,49031
start,reveb001,"Ben Revere",0,1,8
start,rollj001,"Jimmy Rollins",0,2,6
start,utlec001,"Chase Utley",0,3,4
start,howar001,"Ryan Howard",0,4,3
start,byrdm001,"Marlon Byrd",0,5,9
id,TEX201404010
version,2
info,visteam,PHI
info,hometeam,TEX
As you can see, for each game the events loop back around.
I've read the file into an RDD and then, via a second for loop, added a key for each game, which appears to work. But I was hoping to get some feedback on whether there is a cleaner way to do this using Spark methods.
logFile = '2014TEX.EVA'
event_data = (sc
              .textFile(logFile)
              .collect())

idKey = 0
newevent_list = []
for line in event_data:
    if line.startswith('id'):
        idKey += 1
        newevent_list.append((idKey, line))
    else:
        newevent_list.append((idKey, line))

event_data = sc.parallelize(newevent_list)
PySpark has supported Hadoop input formats since version 1.1. You can use the textinputformat.record.delimiter option to set a custom record delimiter, as below:
from operator import itemgetter

retrosheet = sc.newAPIHadoopFile(
    '/path/to/retrosheet/file',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\nid,'}
)

(retrosheet
 .filter(itemgetter(1))
 .values()
 .filter(lambda x: x)
 .map(lambda v: (v if v.startswith('id') else 'id,{0}'.format(v)).splitlines()))
Since Spark 2.4 you can also read the data into a DataFrame using the text reader:
spark.read.option("lineSep", '\nid,').text('/path/to/retrosheet/file')
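
As a small follow-up sketch, assuming the lineSep option splits the file the same way the Hadoop delimiter above does (so each row of the value column holds one game with the leading 'id,' stripped from all but the first record), the records can be normalized and broken back into their lines like this:

from pyspark.sql import functions as F

games = spark.read.option("lineSep", "\nid,").text("/path/to/retrosheet/file")

# Restore the stripped "id," prefix where needed, then split each game record into its lines.
games = (games
         .withColumn("value",
                     F.when(F.col("value").startswith("id,"), F.col("value"))
                      .otherwise(F.concat(F.lit("id,"), F.col("value"))))
         .withColumn("lines", F.split(F.col("value"), "\n")))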

Accessing a global lookup in Apache Spark

I have a list of csv files, each with a bunch of category names as header columns. Each row is a user with a boolean value (0, 1) indicating whether they belong to that category or not. The csv files do not all share the same set of header categories.
I want to create a composite csv across all the files which has the following output:
Header is a union of all the headers
Each row is a unique user with a boolean value corresponding to the category column
The way I wanted to tackle this is to create a tuple of a user_id and a unique category_id for each cell with a '1'. Then reduce all these columns for each user to get the final output.
How do I create the tuple to begin with? Can I have a global lookup for all the categories?
Example Data:
File 1
user_id,cat1,cat2,cat3
21321,,,1,
21322,1,1,1,
21323,1,,,
File 2
user_id,cat4,cat5
21321,1,,,
21323,,1,,
Output
user_id,cat1,cat2,cat3,cat4,cat5
21321,,1,1,,,
21322,1,1,1,,,
21323,1,1,,,,
The title of the question is probably misleading in the sense that it conveys a certain implementation choice: there's no need for a global lookup in order to solve the problem at hand.
In big data, there's a basic principle guiding most solutions: divide and conquer. In this case, the input CSV files can be divided into tuples of (user, category).
Any number of CSV files containing an arbitrary number of categories can be transformed into this simple format. The resulting CSV is then produced by taking the union of those tuples, extracting the total number of categories present, and applying some data transformation to get it into the desired format.
In code this algorithm would look like this:
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

val file1 = """user_id,cat1,cat2,cat3|21321,,,1|21322,1,1,1|21323,1,,""".split("\\|")
val file2 = """user_id,cat4,cat5|21321,1,|21323,,1""".split("\\|")
val csv1 = sparkContext.parallelize(file1)
val csv2 = sparkContext.parallelize(file2)

def toTuples(csv: RDD[String]): RDD[(String, String)] = {
  val headerLine = csv.first
  val header = headerLine.split(",")
  val data = csv.filter(_ != headerLine).map(line => line.split(","))
  data.flatMap { elem =>
    val merged = elem.zip(header)
    val id = elem.head
    merged.tail.collect { case (v, cat) if v == "1" => (id, cat) }
  }
}

val data1 = toTuples(csv1)
val data2 = toTuples(csv2)
val union = data1.union(data2)

val categories = union.map { case (id, cat) => cat }.distinct.collect.sorted // sorted category names
val categoriesByUser = union.groupByKey.mapValues(v => v.toSet)
val numericCategoriesByUser = categoriesByUser.mapValues(catSet => categories.map(cat => if (catSet(cat)) "1" else ""))
val asCsv = numericCategoriesByUser.collect.map { case (id, cats) => id + "," + cats.mkString(",") }
Results in:
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
(Generating the header is simple and left as an exercise for the reader)
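
For comparison, the same melt-the-categories-into-tuples idea can be sketched with the PySpark DataFrame API; the file paths below are illustrative placeholders and this is only an outline, not the answer's code:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Melt each per-category CSV into (user_id, category) pairs for the cells equal to 1,
# union all the files, then pivot back into one wide table.
paths = ["file1.csv", "file2.csv"]  # illustrative paths
pairs = None
for p in paths:
    df = spark.read.option("header", True).csv(p)
    cats = [c for c in df.columns if c != "user_id"]
    melted = (df.select(
                  "user_id",
                  F.explode(F.array([F.when(F.col(c) == "1", F.lit(c)) for c in cats])).alias("category"))
                .where(F.col("category").isNotNull()))
    pairs = melted if pairs is None else pairs.union(melted)

# One row per user, one column per category, 1 where the (user, category) pair exists.
wide = pairs.groupBy("user_id").pivot("category").agg(F.first(F.lit(1)))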
You don't need to do this as a two-step process if all you need is the resulting values.
A possible design:
1/ Parse your CSV. You don't mention whether your data is on a distributed FS, so I'll assume it is not.
2/ Enter your (K,V) pairs into a mutable, parallelized (to take advantage of Spark) map.
pseudo-code:

val directory = ..                      // location of the csv files
val map = new mutable.ParHashMap[String, String]()
while (files(i) != null) {
  val file = sc.textFile("/myfile...")
  val cols = file.map(_.split(","))
  map.put(cols(0), cols(i))             // key: user_id, value: the i-th category cell
  i += 1
}
and then you can access your (K/V) tuples by way of an iterator on the map.
