Is there a way to get column names using hiveContext? - apache-spark

I have an "iplRDD" which is a json, and I do below steps and query through hivecontext. I get the results but without columns headers. Is there is a way to get the columns names along with the values?
val teamRDD = hiveContext.jsonRDD(iplRDD)
teamRDD.registerTempTable("teams")
hiveContext.cacheTable("teams")
val result = hiveContext.sql("select * from teams where team_name = 'KKR'")
result.collect.foreach(println)
Any thoughts please ?

teamRDD.schema.fieldNames should contain the header names.

You can get it by using:
result.schema.fields

You can save your dataframe 'result' as a CSV file with a header like this:
result.write.format("com.databricks.spark.csv").option("header", "true").save(outputPath)

Related

passing array into isin() function in Databricks

I have a requirement where I have to filter records from a df if they are present in an array. So I have an array of the distinct values from another df's column, like below:
dist_eventCodes = Event_code.select('Value').distinct().collect()
Now I am passing this dist_eventCodes in a filter like below:
ADT_df_select = ADT_df.filter(ADT_df.eventTypeCode.isin(dist_eventCodes))
When I run this code I get the below error message:
"AttributeError: 'DataFrame' object has no attribute '_get_object_id'"
Can somebody please help me understand what I am doing wrong?
Thanks in advance
If I understood correctly, you want to retain only those rows where eventTypeCode is among the values from the Event_code dataframe.
Let me know if this is not the case.
This can be achieved with a simple left-semi join in Spark. This way you don't need to collect the dataframe, which is the right approach in a distributed environment.
ADT_df.alias("df1").join(Event_code.select("value").distinct().alias("df2"), [F.col("df1.eventTypeCode")=F.col("df2.value")], 'leftsemi')
Or if there is a specific need to use isin, this would work (collect_set will take care of distinct):
dist_eventCodes = Event_code.select("value").groupBy(F.lit("dummy")).agg(F.collect_set("value").alias("value")).first().asDict()
ADT_df_select = ADT_df.filter(ADT_df["eventTypeCode"].isin(dist_eventCodes["value"]))

How to add a column name to the dataframe storing the result of the correlation of two columns in pyspark?

I have read a CSV file and need to find the correlation between two columns.
I am using df.stat.corr('Age','Exp') and the result is 0.7924058156930612.
But I want to have this result stored in another dataframe with the header "correlation":
correlation
0.7924058156930612
Following up on what #gupta_hemant commented.
You can create a new column as
df.withColumn("correlation", df.stat.corr("Age", "Exp").collect()[0].correlation)
(I am guessing the exact syntax here, but it should be something like this)
After reviewing the code, the syntax should be
import pyspark.sql.functions as F
df.withColumn("correlation", F.lit(df.stat.corr("Age", "Exp")))
Try this and let me know.
corrValue = df.stat.corr("Age", "Exp")
newDF = spark.createDataFrame(
    [(corrValue,)],   # note the trailing comma: each row must be a tuple
    ["corr"]
)
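Putting it together, a minimal end-to-end sketch on made-up data (only the Age and Exp column names come from the question; the values are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data; only the Age/Exp column names come from the question.
df = spark.createDataFrame([(25, 2.0), (30, 5.0), (45, 20.0)], ["Age", "Exp"])

corrValue = df.stat.corr("Age", "Exp")                        # a plain Python float
newDF = spark.createDataFrame([(corrValue,)], ["correlation"])
newDF.show()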

How can I convert AJAX to a dictionary in Python?

I have the following AJAX submission data entry and I want to organize it in report format, so I have to convert it to table format.
The table should include the field_name as the header and the field_value as rows.
Can anyone help me?
[{"field_name":"patientno","field_value":"1"},
{"field_name":"patient_unhcr_id","field_value":"1"},
{"field_name":"jps_file","field_value":"1"},
{"field_name":"patient_individual_id","field_value":"1"},
{"field_name":"name","field_value":"1"},
{"field_name":"name_in_arabic","field_value":"1"},
{"field_name":"age","field_value":"1"},
{"field_name":"age_category","field_value":"U5"},
{"field_name":"gender","field_value":"F"},
{"field_name":"coo","field_value":"Syria"},
{"field_name":"phone_number","field_value":"1"},
{"field_name":"governorate","field_value":"Mafraq"},
{"field_name":"bank_branch","field_value":"\u0641\u0631\u0639 \u0636\u0627\u062d\u064a\u0629 \u0627\u0644\u064a\u0627\u0633\u0645\u064a\u0646"},
{"field_name":"treatment_site","field_value":"Ramtha Governmental Hospital"},
{"field_name":"case_category","field_value":"CS"},
{"field_name":"description","field_value":"a"},
{"field_name":"eligibilities_","field_value":"Eligible Level 2"},{"field_name":"approved_amount_before_rounding","field_value":"21.5"},
{"field_name":"approved_amount","field_value":"20"},
{"field_name":"radio_buttons","field_value":"Yes"},
{"field_name":"recipient_name","field_value":"a"},
{"field_name":"recipient__dob","field_value":"02\/24\/2020"},
{"field_name":"gender_of_recpient_","field_value":"F"},
{"field_name":"recipient_unhcr_id_number","field_value":"1"},
{"field_name":"recipient_individual_id","field_value":"1"},
{"field_name":"relationship_to_patient","field_value":"Daughter-in-law"},
{"field_name":"recepient_phone_no","field_value":"1"},
{"field_name":"date_request_send_to_unhcr","field_value":"02\/17\/2020"},
{"field_name":"approval_date","field_value":"02\/24\/2020"},
{"field_name":"closure_date","field_value":"02\/11\/2020"},
{"field_name":"comment","field_value":"a"},
{"field_name":"attatchment","field_value":"http:\/\/192.168.1.52:9999\/wordpress\/wp-content\/uploads\/2020\/02\/IC-Weekly-Task-List-Template-8624.xlsx"}]
Here is my crude approach, hope it helps
import pandas as pd
ajax = [{"field_name":"patientno","field_value":"1"},{"field_name":"patient_unhcr_id","field_value":"1"},{"field_name":"jps_file","field_value":"1"},{"field_name":"patient_individual_id","field_value":"1"},{"field_name":"name","field_value":"1"},{"field_name":"name_in_arabic","field_value":"1"},{"field_name":"age","field_value":"1"},{"field_name":"age_category","field_value":"U5"},{"field_name":"gender","field_value":"F"},{"field_name":"coo","field_value":"Syria"},{"field_name":"phone_number","field_value":"1"},{"field_name":"governorate","field_value":"Mafraq"},{"field_name":"bank_branch","field_value":"\u0641\u0631\u0639 \u0636\u0627\u062d\u064a\u0629 \u0627\u0644\u064a\u0627\u0633\u0645\u064a\u0646"},{"field_name":"treatment_site","field_value":"Ramtha Governmental Hospital"},{"field_name":"case_category","field_value":"CS"},{"field_name":"description","field_value":"a"},{"field_name":"eligibilities_","field_value":"Eligible Level 2"},{"field_name":"approved_amount_before_rounding","field_value":"21.5"},{"field_name":"approved_amount","field_value":"20"},{"field_name":"radio_buttons","field_value":"Yes"},{"field_name":"recipient_name","field_value":"a"},{"field_name":"recipient__dob","field_value":"02/24/2020"},{"field_name":"gender_of_recpient_","field_value":"F"},{"field_name":"recipient_unhcr_id_number","field_value":"1"},{"field_name":"recipient_individual_id","field_value":"1"},{"field_name":"relationship_to_patient","field_value":"Daughter-in-law"},{"field_name":"recepient_phone_no","field_value":"1"},{"field_name":"date_request_send_to_unhcr","field_value":"02/17/2020"},{"field_name":"approval_date","field_value":"02/24/2020"},{"field_name":"closure_date","field_value":"02/11/2020"},{"field_name":"comment","field_value":"a"},{"field_name":"attatchment","field_value":"http://192.168.1.52:9999/wordpress/wp-content/uploads/2020/02/IC-Weekly-Task-List-Template-8624.xlsx"}]
df = pd.DataFrame(data=ajax).T
df.columns = df.iloc[0]
df = df.drop(df.index[0])
Basically you use the ajax list as data to create a dataframe, transpose it, set the first row as the headers, and drop that row afterwards.
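If you literally want a Python dictionary first, a dict comprehension over the list does the same job, and the one-row dataframe falls out of it; a minimal sketch using only a few of the entries above:
import pandas as pd

ajax = [{"field_name": "patientno", "field_value": "1"},
        {"field_name": "gender", "field_value": "F"},
        {"field_name": "coo", "field_value": "Syria"}]

# Collapse the list of {field_name, field_value} pairs into one dictionary.
record = {item["field_name"]: item["field_value"] for item in ajax}

# The field names become the column headers, the values become the single row.
df = pd.DataFrame([record])
print(df)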

Getting NONE in the last row of dataframe when using pd.read_sql_query

I am trying to create a db using sqlite3. I created methods to read, write, delete and show a table. However, in order to view the table in a proper format on the command line, I decided to use pandas (pd.read_sql_query). However, when I do that I get None in the last row of the first column.
I tried writing the table to a CSV and there was no None value there.
def show_table():
    df = pd.read_sql_query("SELECT * FROM ticket_info", SQLITEDB.conn, index_col='resource_id')
    print(df)
    df.to_csv('hahaha.csv')

def fetch_from_db(query):
    df = pd.read_sql_query('SELECT * FROM ticket_info WHERE {}'.format(query), SQLITEDB.conn, index_col='resource_id')
    print(df)
Here's the output as a picture: output image
Everything is correct except the last None value. Where is it coming from, and how do I get rid of it?
You are passing query in as a variable. You might have a query that doesn't return any data from your table.

Spark save taking a lot of time

I have 2 dataframes and I want to find the records with all columns equal except 2 (surrogate_key, current).
Then I want to save those records with a new surrogate_key value.
Following is my code:
val seq = csvDataFrame.columns.toSeq
var exceptDF = csvDataFrame.except(csvDataFrame.as('a).join(table.as('b),seq).drop("surrogate_key","current"))
exceptDF.show()
exceptDF = exceptDF.withColumn("surrogate_key", makeSurrogate(csvDataFrame("name"), lit("ecc")))
exceptDF = exceptDF.withColumn("current", lit("Y"))
exceptDF.show()
exceptDF.write.option("driver","org.postgresql.Driver").mode(SaveMode.Append).jdbc(postgreSQLProp.getProperty("url"), tableName, postgreSQLProp)
This code gives correct results, but gets stuck while writing those results to Postgres.
Not sure what the issue is. Also, is there any better approach for this?
Regards,
Sorabh
By default spark-sql creates 200 shuffle partitions, which means that when you try to save the dataframe it will be written out in 200 partitions (here, 200 separate JDBC write tasks). You can reduce the number of partitions for a DataFrame using the techniques below.
At the application level, set the parameter "spark.sql.shuffle.partitions" as follows:
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
Reduce the number of partitions for a particular DataFrame as follows:
df.coalesce(10).write.save(...)
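For instance, applying that here before the JDBC write (a PySpark sketch of the same call chain; the Scala API is analogous, and the URL, table name and driver settings below are placeholders, not taken from the post):
# PySpark sketch; exceptDF stands for the dataframe computed above, and all
# connection details here are illustrative placeholders.
(exceptDF.coalesce(10)
    .write
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "my_table")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save())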
Using var for dataframes is not recommended; you should always use val and create a new DataFrame after performing a transformation on a dataframe.
Please remove all the vars and replace them with val.
Hope this helps!
