Python conditional vlookup? - python-3.x

I'm using pandas DataFrames to work with two CSV files.
I need a vlookup, but I want to apply a second vlookup if the first one returns a null string. Any ideas?
One DataFrame is called data and the other is called data2.
This vlookup (working code) finds the row where data['ID A'] == data2['Person_ID'] and brings in data2['Status_job'] from that row:
Code:
data['STATUS X'] = data['ID A'].map(data2[['Person_ID', 'Status_job']].set_index('Person_ID')['Status_job'].to_dict())
But I want another vlookup in case 'Status_job' returns a null string (same code, but with Program_ID instead of Person_ID).
Working code 2:
data['STATUS X'] = data['ID A'].map(data2[['Program_ID', 'Status_job']].set_index('Program_ID')['Status_job'].to_dict())
How can I merge these two snippets into one conditional? I tried .loc and lambda x, but I'm not sure how to make it work without an error. I'd appreciate any help.
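A minimal sketch of one way to chain the two lookups, assuming the column names above and that a "null string" means either NaN or an empty string: run the Person_ID map first, then fill whatever is still missing from the Program_ID map.
import numpy as np

# first lookup: ID A -> Status_job via Person_ID
person_map = data2.set_index('Person_ID')['Status_job'].to_dict()
# fallback lookup: ID A -> Status_job via Program_ID
program_map = data2.set_index('Program_ID')['Status_job'].to_dict()

first = data['ID A'].map(person_map)
# treat empty strings as missing so they also fall through to the fallback
first = first.replace('', np.nan)
data['STATUS X'] = first.fillna(data['ID A'].map(program_map))
This keeps the original .map() approach and only adds a fillna over a second map, so no .loc or lambda is needed.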

Related

Why am I getting a 'Column' object is not callable error in pyspark?

I am reading simple parquet files and running a query to find the unmatched rows from the left table. Please see the code snippet below.
argTestData = '<path to parquet file>'
tst_DF = spark.read.option('header', True).parquet(argTestData)
argrefData = '<path to parquet file>'
refDF = spark.read.option('header', True).parquet(argrefData)
cond = ["col1", "col2", "col3"]
fi = tst_DF.join(refDF, cond, "left_anti")
So far things are working. However, as a requirement, I need to get the list of elements if the above gives a count > 0, i.e. if fi.count() > 0, then I need the element names. So I tried the code below, but it is throwing an error.
if fi.filter(col("col1").count() > 0).collect():
    fi.show()
error
TypeError: 'Column' object is not callable
Note:
I have 3 columns as the joining condition, which are in a list assigned to the variable cond, and I need to get the unmatched records for those 3 columns, so the if condition has to accommodate them. Of course there are many other columns due to the join.
Please suggest where I am making a mistake.
Thank you
If I understand correctly, that's simply:
fi.select(cond).collect()
The left_anti join already gets the records that do not match (exist in tst_DF but not in refDF).
You can add a distinct before the collect to remove duplicates.
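A minimal sketch of how the pieces could fit together, assuming the fi and cond variables from the question:
# fi already holds only the unmatched rows thanks to the left_anti join
if fi.count() > 0:
    # keep just the join columns, drop duplicates, and pull them to the driver
    unmatched = fi.select(cond).distinct().collect()
    for row in unmatched:
        print(row["col1"], row["col2"], row["col3"])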
Did you import the column function?
from pyspark.sql import functions as F
...
if fi.filter(F.col("col1").isNotNull()).count() > 0:
    fi.show()

How to add a column name to the dataframe storing the result of the correlation of two columns in pyspark?

I have read a CSV file and need to find the correlation between two columns.
I am using df.stat.corr('Age', 'Exp') and the result is 0.7924058156930612.
But I want to have this result stored in another dataframe with the header "correlation".
correlation
0.7924058156930612
Following up on what @gupta_hemant commented.
You can create a new column as
df.withColumn("correlation", df.stat.corr("Age", "Exp").collect()[0].correlation)
(I am guessing the exact syntax here, but it should be something like this)
After reviewing the code, the syntax should be:
import pyspark.sql.functions as F
df.withColumn("correlation", F.lit(df.stat.corr("Age", "Exp")))
Try this and let me know.
corrValue = df.stat.corr("Age", "Exp")
newDF = spark.createDataFrame(
    [
        (corrValue,)  # trailing comma so the single value is a one-element row (tuple)
    ],
    ["corr"]
)

Get count of data from particular Excel cell using python

I am reading an Excel file as below using pandas and writing the results to a dataframe.
I want to get the count of rows present in the "Expected Result" column for each test case. I used the len function, but it was throwing a "TypeError: object of type 'numpy.int64' has no len()" error. Is there a way to capture the row count from Excel for each test in Python?
Here is my code:
df = pd.read_excel("input_test_2.xlsx")
testcases = df['Test'].values
expected_result = df['Expected Result'].values
for i in range(0, len(df)):
    testcase_nm = testcases[i]
    _expected = expected_result[i]
    print("Count of Expected Result:", len(_expected))
This is the output I am looking for:
Testcase-1 , Count of Expected Result: 1
Testcase-2 , Count of Expected Result: 3
Without seeing the dataframe data, it's tough to say if this will work, as it's not clear how pandas handles merged Excel data.
In general, though:
df_counts = df.groupby('Test').count().reset_index()  # won't give you the text, but a new dataframe
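A small sketch of how the grouped counts could then be printed in the format shown above, assuming the 'Test' and 'Expected Result' columns from the question:
import pandas as pd

df = pd.read_excel("input_test_2.xlsx")
# count non-empty 'Expected Result' cells per test case
counts = df.groupby('Test')['Expected Result'].count()
for testcase_nm, n in counts.items():
    print(f"{testcase_nm} , Count of Expected Result: {n}")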

Getting NONE in the last row of dataframe when using pd.read_sql_query

I am trying to create a database using sqlite3. I created methods to read, write, delete, and show a table. However, in order to view the table in a proper format on the command line, I decided to use pandas (pd.read_sql_query). But when I do that, I get None in the last row of the first column.
I tried writing the table to a CSV and there was no None value there.
def show_table():
    df = pd.read_sql_query("SELECT * FROM ticket_info", SQLITEDB.conn, index_col='resource_id')
    print(df)
    df.to_csv('hahaha.csv')

def fetch_from_db(query):
    df = pd.read_sql_query('SELECT * FROM ticket_info WHERE {}'.format(query), SQLITEDB.conn, index_col='resource_id')
    print(df)
Here's the output as a picture: [output image]
Everything is correct except the last None value. Where is it coming from, and how do I get rid of it?
You are passing query in as a variable. You might have a query that doesn't return any data from your table.
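One way to narrow this down (a sketch, assuming the ticket_info table and SQLITEDB.conn from the question) is to check whether the missing value is actually in the database rows themselves rather than introduced by the query:
import pandas as pd

df = pd.read_sql_query("SELECT * FROM ticket_info", SQLITEDB.conn, index_col='resource_id')
# show only the rows that contain a missing value in any column
print(df[df.isnull().any(axis=1)])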

How to multiply two columns in a spark dataframe

Say I have a dataframe named "orderitems" with the below schema:
DataFrame[order_item_id: int, order_item_order_id: int, order_item_product_id: int, order_item_quantity: int, order_item_subtotal: float, order_item_product_price: float]
As part of checking the data quality, I need to ensure that all rows satisfy the formula: order_item_subtotal = (order_item_quantity * order_item_product_price).
For this I need to add a separate column named "valid" which should have 'Y' as the value for all rows that satisfy the above formula, and 'N' for all other rows.
I decided to use when() and otherwise() along with the withColumn() method, as below.
orderitems.withColumn("valid",when(orderitems.order_item_subtotal != (orderitems.order_item_product_price * orderitems.order_item_quantity),'N').otherwise("Y"))
But it returns the below error:
TypeError: 'Column' object is not callable
I know this happened because I tried to multiply two Column objects. But I am not sure how to resolve this, since I am still in the learning process with Spark.
I would like to know how to fix this. I am using Spark 2.3.0 with Python.
Try something like this:
from pyspark.sql.functions import col, when

orderitems.withColumn("valid",
    when(col("order_item_subtotal") != (col("order_item_product_price") * col("order_item_quantity")), "N")
    .otherwise("Y")).show()
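A small follow-up sketch, assuming the orderitems DataFrame from the question: once the valid column exists, the data-quality check reduces to counting the 'N' rows. Because the subtotal is a float, an exact comparison can flag rows that differ only by rounding, so a tolerance-based variant is shown here (the 1e-6 threshold is an arbitrary choice for illustration).
from pyspark.sql.functions import abs as sql_abs, col, when

checked = orderitems.withColumn(
    "valid",
    when(sql_abs(col("order_item_subtotal")
                 - col("order_item_product_price") * col("order_item_quantity")) > 1e-6, "N")
    .otherwise("Y"))

# number of rows that fail the data-quality rule
invalid_count = checked.filter(col("valid") == "N").count()
print("Invalid rows:", invalid_count)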
This can also be implemented through a Spark UDF, which applies a Python function row by row.
Before running this code, make sure the values you are comparing have the same datatype.
from pyspark.sql.functions import udf

def check(subtotal, item_quantity, item_product_price):
    if subtotal == (item_quantity * item_product_price):
        return "Y"
    else:
        return "N"

validate = udf(check)
orderitems = orderitems.withColumn("valid", validate("order_item_subtotal", "order_item_quantity", "order_item_product_price"))
