Why the query result() difference when using BigQuery Python Client Library - python-3.x

I am trying to figure out the difference between executing a query in the 2 ways below:
job1 = client.query(query).result()
vs
job2 = client.query(query)
job2.result()
The code is below:
from google.cloud import bigquery

bq_project = '<project_name>'
client = bigquery.Client(project=bq_project)
m_query = "SELECT * FROM <dataset.tbl>"
## NOTE: This query result has just 1 row.

x = client.query(m_query)
job1 = x.result()
for row in job1:
    val1 = row

job2 = client.query(m_query).result()
for row in job2:
    val2 = row

print(job1 == job2)  # This prints False
print(val1 == val2)  # This prints True
I understand that the final output from both query executions will be the same.
But I am not able to understand why job1 is not equal to job2.
Is the internal working different for job1 and job2?
Note: I have gone through this link but the question there is different.

You are comparing two iterator objects stored at different locations in memory: job1 and job2 are distinct result iterators, so == falls back to identity comparison and returns False. There is no point in comparing the objects themselves. If you want to compare their contents, that's doable:
A = [each for each in job1]
B = [each for each in job2]
print(A == B)
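For example, a minimal sketch (assuming the same client and m_query from the question) that materializes both result sets into plain Python structures before comparing:
# Sketch only: run the query twice and compare the materialized rows.
# `client` and `m_query` are assumed to be defined as in the question.
rows_a = [dict(row) for row in client.query(m_query).result()]
rows_b = [dict(row) for row in client.query(m_query).result()]
# True when both queries return the same rows in the same order.
print(rows_a == rows_b)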

Related

Spark window function and taking first and last values per column per partition (aggregation over window)

Imagine I have a huge dataset which I partitionBy('id'). Assume that id is unique to a person, so there could be n number of rows per id and the goal is to reduce it to one.
Basically, aggregating to make id distinct.
import sys
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window().partitionBy('id').rowsBetween(-sys.maxsize, sys.maxsize)
test1 = {
    key: F.first(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test1'
}
test2 = {
    key: F.last(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test2'
}
Assume that some_dict has values of either 'test1' or 'test2', and based on the value I take either the first or the last, as shown above.
How do I actually call aggregate and reduce this?
cols = {**test1, **test2}
cols = list(cols.values())
df.select(*cols).groupBy('id').agg(*cols)  # Doesn't work
The above clearly doesn't work. Any ideas?
The goal here: I have 5 unique IDs and 25 rows, with each ID having 5 rows. I want to reduce the 25 rows to 5.
Let's assume your dataframe is named df and contains duplicates; use the method below:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

window = Window.partitionBy(df['id']).orderBy(df['id'])
final = df.withColumn("row_id", row_number().over(window)).filter("row_id = 1")
final.show(10, False)
Change the orderBy condition if there are specific criteria, so that the desired record ends up at the top of each partition.
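If the goal is specifically the first/last-per-column aggregation from the question, a minimal sketch (assuming df, some_dict, test1 and test2 as defined above) is to select the windowed columns and then keep one row per id; since the window spans the whole partition, every row of a partition carries the same values:
# Sketch only: apply the windowed first/last expressions, then deduplicate per id.
cols = list({**test1, **test2}.values())
result = df.select('id', *cols).dropDuplicates(['id'])
result.show()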

How to optimize the following code so that it runs faster?

First of all, I have a pandas dataframe with the following columns:
"YEAR","1DIGIT","2DIGITS","3DIGITS","SIZE","CODE","VALUE" with 1.8 million rows.
Here is my code to correct the data I have:
for year in list(data.YEAR.unique()):
    data1 = data[data.YEAR == year]
    for dig in list(data1["3DIGITS"].unique()):
        data2 = data1[data1["3DIGITS"] == dig]
        for size in list(data2.SIZE.unique()):
            data3 = data2[data2.SIZE == size]
            data.loc[(data.YEAR == year) & (data["3DIGITS"] == dig) & (data.SIZE == size) & (data.CODE == 9122), "VALUE"] = data3[data3.CODE.isin([9001, 9057])].VALUE.sum()
As you can see, I want to sum the values of codes 9001 and 9057 and assign the result to the value of code 9122. This works but is really slow; it takes almost an hour and a half. Is there anything we can do to make it faster?
Try using the groupby function of pandas.
This would look something like:
def add_col(df):
    df.loc[df.CODE == 9122, "VALUE"] = df[df.CODE.isin([9001, 9057])].VALUE.sum()
    return df

data = data.groupby(['YEAR', '3DIGITS', 'SIZE'], group_keys=False).apply(add_col)
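A fully vectorized alternative, as a sketch assuming the column names from the question, avoids apply entirely by masking the source rows and using a grouped transform:
# Sketch: sum VALUE over CODEs 9001/9057 within each (YEAR, 3DIGITS, SIZE) group,
# then write that per-group sum onto the rows with CODE == 9122.
source_vals = data["VALUE"].where(data["CODE"].isin([9001, 9057]))
group_sum = source_vals.groupby(
    [data["YEAR"], data["3DIGITS"], data["SIZE"]]
).transform("sum")
data.loc[data["CODE"] == 9122, "VALUE"] = group_sum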

Compare Spark Dataframes Where Not Equal With List of Comparison Columns

I'm currently trying to compare two data frames to see which fields don't match in pyspark. I have been able to write it manually, but I want to be able to pass a list of fields to check that the frames do not match on those fields. The two data frames have identical structures.
The code I have thus far is:
key_cols = ['team_link_uuid', 'team_sat_hash']
temp_team_sat = orig.select(*key_cols)
temp_team_sat_incremental = delta.select(*key_cols)
hash_field = ['team_sat_hash']
test_update_list = temp_team_sat.join(temp_team_sat_incremental, (temp_team_sat.team_link_uuid == temp_team_sat_incremental.team_link_uuid) & (temp_team_sat.team_sat_hash != temp_team_sat_incremental.team_sat_hash))
But now I need to be able to take my list (hash_field) and ensure that the one or more fields are not equal to each other.
Assuming fields_to_compare_list is a list of the fields you want to compare:
from functools import reduce

comparison_query = reduce(
    lambda a, b: (a | b),
    [temp_team_sat[col] != temp_team_sat_incremental[col]
     for col in fields_to_compare_list]
)
test_update_list = temp_team_sat.join(
    temp_team_sat_incremental,
    on=(temp_team_sat.team_link_uuid == temp_team_sat_incremental.team_link_uuid)
       & comparison_query
)
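As a usage note, the same combined predicate can be built with operator.or_ instead of a lambda, and for the question's data fields_to_compare_list would simply be the existing hash_field list:
import operator
from functools import reduce

# Equivalent predicate built with operator.or_; hash_field is taken from the question.
fields_to_compare_list = hash_field  # ['team_sat_hash']
comparison_query = reduce(
    operator.or_,
    [temp_team_sat[col] != temp_team_sat_incremental[col]
     for col in fields_to_compare_list]
)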

Improving the speed of cross-referencing rows in the same DataFrame in pandas

I'm trying to apply a complex function to a pandas DataFrame, and I'm wondering if there's a faster way to do it. A simplified version of my data looks like this:
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
What I want to do is, for each combination of UID and UID2, check whether there is both a row with EventType = A and a row with EventType = B, then calculate the time difference and add it back as a new column. So the new dataset would be:
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
This is the current implementation: I group the records by UID and UID2, so there is only a small subset of rows to search when checking whether both EventTypes exist. I can't figure out a faster approach, and profiling in PyCharm hasn't helped uncover the bottleneck.
for (uid, uid2), group in df.groupby(["UID", "UID2"]):
    # if there is a row for both A and B for a UID, UID2 combo
    if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "B"]) > 0:
        time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
        time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
        timediff = time_b - time_a
        timediff_min = timediff.components.minutes
        df.loc[(df["UID"] == uid) & (df["UID2"] == uid2), "TimeDiff"] = timediff_min
First, I need to make sure the Time column is a timedelta:
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)
After that, I create a helper dataframe:
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
df1
Finally, I take the diff and merge it back to df:
df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
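Put together, a minimal end-to-end sketch using the sample data from the question (and converting the difference to whole minutes to match the desired output) would look like this:
import pandas as pd
from io import StringIO

# Sample data from the question.
csv = """UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A"""
df = pd.read_csv(StringIO(csv))

# Convert Time to a timedelta since midnight.
t = pd.to_datetime(df.Time)
df.Time = t - pd.to_datetime(t.dt.date)

# Pivot EventType into columns, take B - A in minutes, and merge back.
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
diff = (df1.B - df1.A).dt.total_seconds().div(60).rename('TimeDiff').reset_index()
print(df.merge(diff, how='left'))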

What is the easiest way to check for the equality of two DB rows using Groovy SQL

Is there a simple and concise way to check that two rows of a given Table contains the same data in all columns?
I haven't tested this, but it seems the most obvious solution:
import groovy.sql.GroovyRowResult
import groovy.sql.Sql

// get an Sql instance
def db = [url: 'jdbc:hsqldb:mem:testDB', user: 'sa', password: '',
          driver: 'org.hsqldb.jdbcDriver']
def sql = Sql.newInstance(db.url, db.user, db.password, db.driver)

// Get 2 rows
GroovyRowResult row1 = sql.firstRow("select * from user where id = 4")
GroovyRowResult row2 = sql.firstRow("select * from user where email = 'me@example.org'")

// compare them
boolean identical = row1.equals(row2)
Not especially Groovy, but I'd make SQL do the lifting, a la something like:
sql.firstRow("SELECT COUNT(DISTINCT CONCAT(city,state,zip)) FROM Candidates WHERE id IN (1,2)")[0] == 1
