The "table" in sqlContext.table - apache-spark

I'm working my way through a book on Spark and I'm on a section dealing with the join method for dataframes. In this example, the "trips" table is being joined with the "stations" table:
trips = sqlContext.table("trips")
stations = sqlContext.table("stations")
joined = trips.join(stations, trips.start_terminal == stations.station_id)
joined.printSchema()
The data is supposed to come from two spreadsheets, trips.csv and stations.csv, but I don't know how Spark is supposed to figure that out. It seems to me that there should be a line indicating where "trips" and "stations" are supposed to come from.
If I try something like
trips = sqlContext.table('/home/l_preamble/Documents/trips.csv')
it doesn't like it: pyspark.sql.utils.ParseException: u"\nextraneous input '/' expecting {'SELECT', 'FROM', 'ADD'..."
So how can I point it in the direction of the data? Any help would be appreciated.

In order to join two dataframes in PySpark, you should try something like this (registerTempTable is called on a DataFrame you have already loaded, not on the sqlContext, and the join type must be a string):
df1.registerTempTable("trips")
df2.registerTempTable("stations")
df2.join(df1, ['column_name'], 'outer')

I think you may need something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('MyApp').getOrCreate()
# Add header=True (and optionally inferSchema=True) if your CSV has a header row.
df_trips = spark.read.load(path='/home/l_preamble/Documents/trips.csv', format='csv', sep=',')
df_trips.createOrReplaceTempView('trips')
result = spark.sql("""select * from trips""")
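If you register stations.csv the same way, the join from the book should then work against the two views; a minimal sketch, assuming stations.csv sits in the same directory and that both files are read with header=True so the column names used in the book (start_terminal, station_id) exist:
df_stations = spark.read.csv('/home/l_preamble/Documents/stations.csv', header=True, inferSchema=True)
df_stations.createOrReplaceTempView('stations')

# Now spark.table (or sqlContext.table) can resolve the registered views by name.
trips = spark.table('trips')
stations = spark.table('stations')
joined = trips.join(stations, trips.start_terminal == stations.station_id)
joined.printSchema()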

Related

passing array into isin() function in Databricks

I have a requirement where I have to filter records from a df if they are present in an array. So I have an array of the distinct values from another df's column, like below.
dist_eventCodes = Event_code.select('Value').distinct().collect()
Now I am passing this dist_eventCodes in a filter, like below.
ADT_df_select = ADT_df.filter(ADT_df.eventTypeCode.isin(dist_eventCodes))
When I run this code I get the below error message:
"AttributeError: 'DataFrame' object has no attribute '_get_object_id'"
Can somebody please help me understand what I am doing wrong?
Thanks in advance
If I understood correctly, you want to retain only those rows whose eventTypeCode appears in the Value column of the Event_code dataframe.
Let me know if this is not the case.
This can be achieved with a simple left-semi join in Spark. That way you don't need to collect the dataframe, which is the right approach in a distributed environment.
from pyspark.sql import functions as F

ADT_df.alias("df1").join(
    Event_code.select("value").distinct().alias("df2"),
    [F.col("df1.eventTypeCode") == F.col("df2.value")],
    'leftsemi'
)
Or if there is a specific need to use isin, this would work (collect_set will take care of distinct):
dist_eventCodes = Event_code.select("value").groupBy(F.lit("dummy")).agg(F.collect_set("value").alias("value")).first().asDict()
ADT_df_select = ADT_df.filter(ADT_df["eventTypeCode"].isin(dist_eventCodes["value"]))

Case sensitive join in Spark

I am dealing with a scenario in which I need to write a case-sensitive join condition. For that, I found there is a Spark config property spark.sql.caseSensitive that can be altered. However, there is no impact on the final result set whether I set this property to True or False.
Either way, I am not getting results for language=java from the below sample PySpark code. Can anyone please help with how to handle this scenario?
spark.conf.set("spark.sql.caseSensitive", False)
columns1 = ["language","users_count"]
data1 = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
columns2 = ["language","note"]
data2 = [("java", "JVM based"), ("Python", "Indentation is imp"), ("Scala", "Derived from Java")]
df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)
#df1.createOrReplaceTempView("df1")
#df2.createOrReplaceTempView("df2")
df = df1.join(df2, on="language", how="inner")
display(df)
My understanding of spark.sql.caseSensitive is that it affects the case sensitivity of SQL identifiers (column and table names), not of the data itself.
As for the join itself, if you do not want to lowercase or uppercase your data, which I can understand, you can create a key column which is the lowercase version of the value you want to join on. In a more complex situation, your key column could even be an md5() of one or more columns. Just make sure everything stays consistently lowercase (or uppercase) so the comparison works.
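For example, a minimal sketch of that key-column approach on the sample dataframes above (lang_key is a made-up helper column name):
from pyspark.sql import functions as F

# Add a lowercased copy of the join key on each side.
df1_keyed = df1.withColumn("lang_key", F.lower(F.col("language")))
df2_keyed = df2.withColumn("lang_key", F.lower(F.col("language"))).withColumnRenamed("language", "language_2")

# Join on the normalized key, so "Java" now matches "java", then drop the helper columns.
df = df1_keyed.join(df2_keyed, on="lang_key", how="inner").drop("lang_key", "language_2")
df.show()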

How do I calculate percentages over groups in spark?

I have data in the form:
FUND|BROKER|QTY
F1|B1|10
F1|B1|50
F1|B2|20
F1|B3|20
When I group it by FUND, and BROKER, I would like to calculate QTY as a percentage of the total at the group level. Like so,
FUND|BROKER|QTY %|QTY EXPLANATION
F1|B1|60%|(10+50)/(10+50+20+20)
F1|B2|20%|(20)/(10+50+20+20)
F1|B3|20%|(20)/(10+50+20+20)
Or when I group by just FUND, like so
FUND|BROKER|QTY %|QTY EXPLANATION
F1|B1|16.66|(10)/(10 + 50)
F1|B1|83.33|(50)/(10 + 50)
F1|B2|100|(20)/(20)
F1|B3|100|(20)/(20)
I would like to achieve this using spark-sql if possible or through dataframe functions.
I think I have to use Windowing functions, so I can get access to the total of the grouped dataset, but I've not had much luck using them the right way.
Dataset<Row> result = sparkSession.sql("SELECT fund_short_name, broker_short_name,first(quantity)/ sum(quantity) as new_col FROM margin_summary group by fund_short_name, broker_short_name" );
PySpark SQL solution.
This can be done using sum as a window function over two windows - one partitioned by fund and broker, and the other partitioned only by fund.
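The snippets below assume df already holds the sample data with columns fund, broker and qty; a minimal sketch to build it (names assumed from the code that follows):
# Build the sample dataframe from the question.
df = spark.createDataFrame(
    [("F1", "B1", 10), ("F1", "B1", 50), ("F1", "B2", 20), ("F1", "B3", 20)],
    ["fund", "broker", "qty"],
)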
from pyspark.sql import Window
from pyspark.sql.functions import sum
w1 = Window.partitionBy(df.fund,df.broker)
w2 = Window.partitionBy(df.fund)
res = df.withColumn('qty_pct',sum(df.qty).over(w1)/sum(df.qty).over(w2))
res.select(res.fund,res.broker,res.qty_pct).distinct().show()
Edit: Result 2 is simpler.
res2 = df.withColumn('qty_pct',df.qty/sum(df.qty).over(w1))
res2.show()
SQL solution would be
select distinct fund,broker,100*sum(qty) over(partition by fund,broker)/sum(qty) over(partition by fund)
from tbl
Yes. You are right when you say that you need to use windowed analytical functions.
Please find below the solutions to your queries.
Hope it helps!
spark.read.option("header", "true").option("delimiter", "|").csv("****")
  .withColumn("fundTotal", sum("QTY").over(Window.partitionBy("FUND")))
  .withColumn("QTY%", sum("QTY").over(Window.partitionBy("BROKER")))
  .select('FUND, 'BROKER, (($"QTY%" * 100) / 'fundTotal).as("QTY%"))
  .distinct.show
And the second!
spark.read.option("header", "true").option("delimiter", "|").csv("/vihit/data.csv")
  .withColumn("QTY%", sum("QTY").over(Window.partitionBy("BROKER")))
  .select('FUND, 'BROKER, (('QTY * 100) / $"QTY%").as("QTY%"))
  .distinct.show

spark save taking lot of time

I have 2 dataframes and I want to find the records where all columns are equal except 2 (surrogate_key, current).
And then I want to save those records with new surrogate_key value.
Following is my code :
val seq = csvDataFrame.columns.toSeq
var exceptDF = csvDataFrame.except(csvDataFrame.as('a).join(table.as('b),seq).drop("surrogate_key","current"))
exceptDF.show()
exceptDF = exceptDF.withColumn("surrogate_key", makeSurrogate(csvDataFrame("name"), lit("ecc")))
exceptDF = exceptDF.withColumn("current", lit("Y"))
exceptDF.show()
exceptDF.write.option("driver","org.postgresql.Driver").mode(SaveMode.Append).jdbc(postgreSQLProp.getProperty("url"), tableName, postgreSQLProp)
This code gives correct results, but gets stuck while writing those results to Postgres.
Not sure what the issue is. Also, is there a better approach for this?
Regards,
Sorabh
By default Spark SQL uses 200 shuffle partitions, which means that when you try to save the dataframe it will be written as 200 partitions (200 output files, or in your case 200 parallel JDBC writes). You can reduce the number of partitions for a DataFrame using the techniques below.
At the application level, set the parameter "spark.sql.shuffle.partitions" as follows:
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
Or reduce the number of partitions for a particular DataFrame as follows:
df.coalesce(10).write.save(...)
Using var for dataframes is not recommended; you should always use val and create a new DataFrame after performing a transformation. Please remove all the vars and replace them with vals.
Hope this helps!

How to use monotonically_increasing_id to join two pyspark dataframes having no common column?

I have two pyspark dataframes with the same number of rows, but they don't have any common column. So I am adding a new column to both of them using monotonically_increasing_id() as
from pyspark.sql.functions import monotonically_increasing_id as mi
id=mi()
df1 = df1.withColumn("match_id", id)
cont_data = cont_data.withColumn("match_id", id)
cont_data = cont_data.join(df1,df1.match_id==cont_data.match_id, 'inner').drop(df1.match_id)
But after the join the resulting data frame has fewer rows.
What am I missing here? Thanks
You just don't. This is not an applicable use case for monotonically_increasing_id, which is by definition non-deterministic. Instead (see the sketch after this list):
convert to RDD
zipWithIndex
convert back to DataFrame
join
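A minimal PySpark sketch of that recipe, assuming both dataframes have the same number of rows and that their current row order is the pairing you want (with_row_index and row_idx are made-up helper names):
def with_row_index(df, name="row_idx"):
    # zipWithIndex on the underlying RDD assigns a stable 0..n-1 index,
    # which we re-attach to each row as an extra column.
    return (df.rdd.zipWithIndex()
              .map(lambda pair: tuple(pair[0]) + (pair[1],))
              .toDF(df.columns + [name]))

df1_indexed = with_row_index(df1)
cont_indexed = with_row_index(cont_data)

# Join on the generated index; every row of cont_data pairs with exactly one row of df1.
cont_data = cont_indexed.join(df1_indexed, on="row_idx", how="inner").drop("row_idx")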
You can generate the IDs with monotonically_increasing_id, save the file to disk, read it back in, and only then do whatever joining process you need. I would only suggest this approach if you just need to generate the IDs once. At that point they can be used for joining, but for the reasons mentioned above, this is hacky and not a good solution for anything that runs regularly.
If you want to get an incremental number on both dataframes and then join, you can generate a consecutive number with monotonically_increasing_id and a window, using the following code:
from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number, col as scol

df1 = df1.withColumn("monotonically_increasing_id", monotonically_increasing_id())
window = Window.orderBy(scol('monotonically_increasing_id'))
df1 = df1.withColumn("match_id", row_number().over(window))
df1 = df1.drop("monotonically_increasing_id")

cont_data = cont_data.withColumn("monotonically_increasing_id", monotonically_increasing_id())
window = Window.orderBy(scol('monotonically_increasing_id'))
cont_data = cont_data.withColumn("match_id", row_number().over(window))
cont_data = cont_data.drop("monotonically_increasing_id")

cont_data = cont_data.join(df1, df1.match_id == cont_data.match_id, 'inner').drop(df1.match_id)
Warning: it may move the data to a single partition! So it may be better to keep the match_id in a separate small dataframe together with the monotonically_increasing_id, generate the consecutive incremental number there, and then join it back with the data.
