Palantir Foundry spark.sql query - apache-spark

When I attempt to query my input table as a view, I get the error com.palantir.foundry.spark.api.errors.DatasetPathNotFoundException. My code is as follows:
def Median_Product_Revenue_Temp2(Merchant_Segments):
    Merchant_Segments.createOrReplaceTempView('Merchant_Segments_View')
    df = spark.sql('select * from Merchant_Segments_View limit 5')
    return df
I need to dynamically query this table, since I am trying to calculate the median using percentile_approx across numerous fields, and I'm not sure how to do this without using spark.sql.
If I try to avoid using spark.sql to calculate median across numerous fields using something like the below code, it results in the error Missing Transform Attribute: A module object does not have an attribute percentile_approx. Please check the spelling and/or the datatype of the object.
import pyspark.sql.functions as F
exprs = {x: percentile_approx("x", 0.5) for x in df.columns if x is not exclustion_list}
df = df.groupBy(['BANK_NAME','BUS_SEGMENT']).agg(exprs)

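Try createGlobalTempView. It worked for me.
E.g.:
df.createGlobalTempView("people")
In Spark, global temporary views are registered under the global_temp database, so the view is then queried as global_temp.people.
(I don't know the root cause of why the local temp view does not work.)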

I managed to avoid using dynamic sql for calculating median across columns using the following code:
df_result = df.groupBy(group_list).agg(
    *[F.expr('percentile_approx(nullif(' + col + ', 0), 0.5)').alias(col)
      for col in df.columns if col not in exclusion_list]
)
Embedding percentile_approx in an F.expr bypassed the issue I was encountering in the second half of my post.
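A minimal sketch of the same idea, split so that the expression strings can be inspected without a Spark session. The helper names median_expressions and aggregate_medians, and the column names in the usage note, are hypothetical and not part of the original post. The sketch also assumes the grouping columns should be skipped, and it uses a membership test (not in); the earlier attempt used "is not", which compares identity rather than membership.

```python
def median_expressions(columns, exclusion_list):
    """Build one percentile_approx SQL string per column not in exclusion_list.

    Zeros are turned into NULL with nullif so they are ignored by the median,
    mirroring the expression used in the answer above.
    """
    excluded = set(exclusion_list)
    return {
        col: "percentile_approx(nullif(`{0}`, 0), 0.5)".format(col)
        for col in columns
        if col not in excluded
    }


def aggregate_medians(df, group_list, exclusion_list):
    """Apply the expressions to a Spark DataFrame (requires pyspark)."""
    # Imported lazily so median_expressions can be used without Spark installed.
    from pyspark.sql import functions as F

    # Assumption: grouping columns are excluded in addition to exclusion_list.
    skip = list(exclusion_list) + list(group_list)
    exprs = median_expressions(df.columns, skip)
    return df.groupBy(group_list).agg(
        *[F.expr(sql).alias(col) for col, sql in exprs.items()]
    )
```

For example, median_expressions(['BANK_NAME', 'REV_A', 'REV_B'], ['BANK_NAME']) yields expressions for REV_A and REV_B only (REV_A and REV_B are made-up column names).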

Related

Pyspark - Looking to create a normalized version of a Double column

As the title states, I'd like to create a normalized version of an existing Double column.
As I'm quite new to pyspark, this was my attempt at solving this:
df2 = df.groupBy('id').count().toDF(*['id','count_trans'])
df2 = df2.withColumn('count_trans_norm', F.col('count_trans') / (F.max(F.col('count_trans'))))
When I do this, I get the following error:
"grouping expressions sequence is empty, and '`movie_id`' is not an aggregate function.
Any help would be much appreciated.
You need to specify an empty window if you want to get the maximum of count_trans in df2:
df2 = df.groupBy('id').count().toDF(*['id','count_trans'])
df3 = df2.selectExpr('*', 'count_trans / max(count_trans) over () as count_trans_norm')
Or if you prefer pyspark syntax:
from pyspark.sql import functions as F, Window
df3 = df2.withColumn('count_trans_norm', F.col('count_trans') / F.max(F.col('count_trans')).over(Window.orderBy()))

How to use a lambda function to select larger values from two Python dataframes whilst comparing them by date?

I want to map through the rows of df1 and compare them with the values of df2, by month and day, across every year in df2, keeping only the values in df1 that are larger than those in df2 and adding them to a new column, 'New'. df1 and df2 are the same size and are indexed by 'Month' and 'Day'. What would be the best way to do this?
df1=pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-04','2005-01-05'],'Values':[-5.6,-5.6,0,3.9,9.4]})
df1.Date=pd.to_datetime(df1.Date)
df1['Day']=pd.DatetimeIndex(df1['Date']).day
df1['Month']=pd.DatetimeIndex(df1['Date']).month
df1.set_index(['Month','Day'],inplace=True)
df1
df2 = pd.DataFrame({'Date':['2005-01-01','2005-01-02','2005-01-03','2005-01-04','2005-01-05'],'Values':[-13.3,-12.2,6.7,8.8,15.5]})
df2.Date=pd.to_datetime(df1.Date)
df2['Day']=pd.DatetimeIndex(df2['Date']).day
df2['Month']=pd.DatetimeIndex(df2['Date']).month
df2.set_index(['Month','Day'],inplace=True)
df2
df1 and df2
df2['New']=df2[df2['Values']<df1['Values']]
gives
ValueError: Can only compare identically-labeled Series objects
I have also tried
df2['New']=df2[df2['Values'].apply(lambda x: x < df1['Values'].values)]
The best way to handle your problem is to use NumPy. NumPy has a function called "where" that helps a lot in cases like this.
This is how the statement works:
df1['new column that will contain the comparison results'] = np.where(condition, 'value if true', 'value if false')
First import the library:
import numpy as np
Using the condition provided by you:
df2['New'] = np.where(df2['Values'] > df1['Values'], df2['Values'],'')
So, I think that solves your problem. You can change the value passed for the false case to anything you want; this is only an example.
Tell us if it worked!
Let's try two possible solutions:
The first solution is to sort the index first.
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
This may still raise an error; if that happens, try this correction instead:
df1.sort_index(inplace=True, axis=1)
df2.sort_index(inplace=True, axis=1)
The second solution is to drop the indexes and reset them:
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
Perform a simple test to see if it works!
df1 == df2
See if it works and tell us the result.
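As a rough sketch of the label-alignment idea behind the suggestions above: the comparison only works when both Series carry the same (Month, Day) labels, so one option is to reindex one side onto the other before comparing. The helper month_day_frame is a hypothetical name introduced here, the data is taken from the question, and writing the result into df1 is an assumption about what the asker wants.

```python
import pandas as pd


def month_day_frame(dates, values):
    """Build a frame indexed by (Month, Day), derived from its own Date column."""
    df = pd.DataFrame({'Date': pd.to_datetime(dates), 'Values': values})
    df['Month'] = df['Date'].dt.month
    df['Day'] = df['Date'].dt.day
    return df.set_index(['Month', 'Day']).sort_index()


df1 = month_day_frame(
    ['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2005-01-05'],
    [-5.6, -5.6, 0, 3.9, 9.4],
)
df2 = month_day_frame(
    ['2005-01-01', '2005-01-02', '2005-01-03', '2005-01-04', '2005-01-05'],
    [-13.3, -12.2, 6.7, 8.8, 15.5],
)

# Align df2's values to df1's (Month, Day) labels, then keep only the df1
# values that are larger; every other row becomes NaN.
other = df2['Values'].reindex(df1.index)
df1['New'] = df1['Values'].where(df1['Values'] > other)
```

Series.where keeps the column numeric, with NaN for rows that fail the condition, instead of mixing floats with empty strings.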

Getting NONE in the last row of dataframe when using pd.read_sql_query

I am trying to create a database using sqlite3. I created methods to read, write, delete, and show the table. To view the table in a proper format on the command line, I decided to use pandas (pd.read_sql_query). However, when I do that, I get None in the last row of the first column.
I tried writing the table to a CSV file and there was no None value there.
def show_table():
    df = pd.read_sql_query("SELECT * FROM ticket_info", SQLITEDB.conn, index_col='resource_id')
    print(df)
    df.to_csv('hahaha.csv')
def fetch_from_db(query):
    df = pd.read_sql_query('SELECT * FROM ticket_info WHERE {}'.format(query), SQLITEDB.conn, index_col='resource_id')
    print(df)
Here's the output as a picture: output image
Everything is correct except the last None value. Where is it coming from, and how do I get rid of it?
You are adding query as a variable. You might have a query that doesn't return any data from your table.
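To make the "query returns no data" case visible, here is a small sketch using only the standard sqlite3 module and an in-memory table. It passes the filter value as a bound parameter rather than formatting it into the SQL string. The status column, the sample row, and the function name fetch_rows are assumptions made for illustration; only ticket_info and resource_id come from the question.

```python
import sqlite3


def fetch_rows(conn, resource_id):
    """Return all rows matching resource_id; an empty list means no match."""
    cur = conn.execute(
        "SELECT resource_id, status FROM ticket_info WHERE resource_id = ?",
        (resource_id,),  # bound parameter instead of str.format
    )
    return cur.fetchall()


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE ticket_info (resource_id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO ticket_info VALUES ('r1', 'open')")

found = fetch_rows(conn, 'r1')    # one matching row
missing = fetch_rows(conn, 'r2')  # no matching row, so an empty list
```

Checking whether the result is empty before printing makes it easier to tell an empty result apart from a formatting problem.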

CreateDataFrame or SaveAsTable intuitively encode in pyspark 1.6

I am trying to save a table in Spark 1.6 using pyspark. All of the table's columns are saved as text, and I'm wondering if I can change this:
product = sc.textFile('s3://path/product.txt')
product = m3product.map(lambda x: x.split("\t"))
product = sqlContext.createDataFrame(product, ['productid', 'marketID', 'productname', 'prod'])
product.saveAsTable("product", mode='overwrite')
Is there something in the last 2 commands that could automatically recognize productid and marketid as numerics? I have a lot of files and a lot of fields to upload, so ideally it would be automatic.
"Is there something in the last 2 commands that could automatically recognize productid and marketid as numerics"
If you pass int or float (depending on what you need) pyspark will convert the data type for you.
In your case, change the lambda function in
product = m3product.map(lambda x: x.split("\t"))
product = sqlContext.createDataFrame(product, ['productid', 'marketID', 'productname', 'prod'])
to
from pyspark.sql.types import Row
def split_product_line(line):
    fields = line.split('\t')
    return Row(
        productid=int(fields[0]),
        marketID=int(fields[1]),
        ...
    )
product = m3product.map(split_product_line).toDF()
A named function makes it much easier to control data types and to add error/exception checks.
Try to avoid lambda functions where possible :)
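A hedged sketch of what such error checks might look like, written in plain Python so it can be tried without Spark. The function name parse_product_line is hypothetical. Treating productname and prod as plain strings, and expecting exactly four fields, are assumptions, since the answer above elides the remaining fields.

```python
def parse_product_line(line):
    """Split one tab-separated line and convert the id fields to int.

    Returns None for malformed lines so they can be filtered out later.
    """
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 4:  # assumption: exactly four columns per line
        return None
    try:
        return {
            'productid': int(fields[0]),
            'marketID': int(fields[1]),
            'productname': fields[2],  # assumption: kept as text
            'prod': fields[3],         # assumption: kept as text
        }
    except ValueError:
        # A non-numeric id ends up here instead of failing the whole job.
        return None
```

In a Spark job, a function like this could be mapped over the RDD and followed by a filter that drops the None results before the rows are converted to a DataFrame.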

How to remove rows in DataFrame on column based on another DataFrame?

I'm trying to use DataFrame.subtract() in Spark 1.6.1 to remove rows from a dataframe based on a column from another dataframe. Let's use an example:
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([
    Row(name='Alice', age=2),
    Row(name='Bob', age=1),
]).alias('df1')
df2 = sqlContext.createDataFrame([
    Row(name='Bob'),
])
df1_with_df2 = df1.join(df2, 'name').select('df1.*')
df1_without_df2 = df1.subtract(df1_with_df2)
Since I want all rows from df1 which don't include name='Bob' I expect Row(age=2, name='Alice'). But I also retrieve Bob:
print(df1_without_df2.collect())
# [Row(age='1', name='Bob'), Row(age='2', name='Alice')]
After various experiments to get down to this MCVE, I found out that the issue is with the age key. If I omit it:
df1_noage = sqlContext.createDataFrame([
    Row(name='Alice'),
    Row(name='Bob'),
]).alias('df1_noage')
df1_noage_with_df2 = df1_noage.join(df2, 'name').select('df1_noage.*')
df1_noage_without_df2 = df1_noage.subtract(df1_noage_with_df2)
print(df1_noage_without_df2.collect())
# [Row(name='Alice')]
Then I only get Alice as expected. The weirdest observation I made is that it's possible to add keys, as long as they're after (in the lexicographical order sense) the key I use in the join:
df1_zage = sqlContext.createDataFrame([
    Row(zage=2, name='Alice'),
    Row(zage=1, name='Bob'),
]).alias('df1_zage')
df1_zage_with_df2 = df1_zage.join(df2, 'name').select('df1_zage.*')
df1_zage_without_df2 = df1_zage.subtract(df1_zage_with_df2)
print(df1_zage_without_df2.collect())
# [Row(name='Alice', zage=2)]
I correctly get Alice (with her zage)! In my real examples, I'm interested in all columns, not only the ones that are after name.
Well, there are some bugs here (the first issue looks related to the same problem as SPARK-6231) and opening a JIRA looks like a good idea, but SUBTRACT / EXCEPT is not the right choice for partial matches.
Instead, as of Spark 2.0, you can use anti-join:
df1.join(df1_with_df2, ["name"], "leftanti").show()
In 1.6 you can do pretty much the same thing with standard outer join:
import pyspark.sql.functions as F
ref = df1_with_df2.select("name").alias("ref")
(df1
    .join(ref, ref.name == df1.name, "leftouter")
    .filter(F.isnull("ref.name"))
    .drop(F.col("ref.name")))
