Need help extracting an object from nested JSON in PySpark - python-3.x

My JSON column values are as below
[{"item":"54509485","id":"1234","rule":"9383","issue_type":[],"rule_message":"this is json data.","sample_attributes":["shicode","measurement"],"impacted":[["Child"],[]],"type_of_blocker":[]}]
I want to get only the objects "item", "rule", and "sample_attributes" using PySpark dataframe code.

If you have a pyspark.sql.dataframe.DataFrame, you can do it with:
data.select(data.column_name.item, data.column_name.rule, data.column_name.sample_attributes)
where data is your dataframe and column_name is the name of your column.
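If the column contains raw JSON strings rather than an already-parsed struct, a minimal sketch could look like this (assuming the column is named json_col and spark is an active SparkSession; adjust the names to your data):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructType, StructField

# Schema covering only the fields we want to keep
schema = ArrayType(StructType([
    StructField("item", StringType()),
    StructField("rule", StringType()),
    StructField("sample_attributes", ArrayType(StringType())),
]))

# Parse the JSON string, explode the array so each object becomes a row, then pick the fields
parsed = data.withColumn("parsed", F.from_json("json_col", schema))
result = parsed.withColumn("obj", F.explode("parsed")).select("obj.item", "obj.rule", "obj.sample_attributes")
result.show(truncate=False)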

Related

passing array into isin() function in Databricks

I have a requirement where I have to filter records from a df if the value is present in an array. So I have an array that holds the distinct values from another df's column, like below.
dist_eventCodes = Event_code.select('Value').distinct().collect()
Now I am passing this dist_eventCodes in a filter like below.
ADT_df_select = ADT_df.filter(ADT_df.eventTypeCode.isin(dist_eventCodes))
When I run this code I get the below error message:
"AttributeError: 'DataFrame' object has no attribute '_get_object_id'"
Can somebody please help me understand what I am doing wrong?
Thanks in advance
If I understood correctly, you want to retain only those rows where eventTypeCode is among the eventTypeCode values from the Event_code dataframe. Let me know if this is not the case.
This can be achieved with a simple left-semi join in Spark. This way you don't need to collect the dataframe, which makes it the right approach in a distributed environment.
from pyspark.sql import functions as F
ADT_df.alias("df1").join(Event_code.select("value").distinct().alias("df2"), [F.col("df1.eventTypeCode") == F.col("df2.value")], 'leftsemi')
Or if there is a specific need to use isin, this would work (collect_set will take care of distinct):
dist_eventCodes = Event_code.select("value").groupBy(F.lit("dummy")).agg(F.collect_set("value").alias("value")).first().asDict()
ADT_df_select = ADT_df.filter(ADT_df["eventTypeCode"].isin(dist_eventCodes["value"]))
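As a side note on the original approach: collect() returns a list of Row objects rather than plain values, so if you do want to keep using collect(), one option (a sketch, untested against your data) is to extract the values into a Python list first:
dist_eventCodes = [row["Value"] for row in Event_code.select("Value").distinct().collect()]
ADT_df_select = ADT_df.filter(ADT_df.eventTypeCode.isin(dist_eventCodes))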

Loading data from csv to pandas dataframe gives NaNs

I am using the following code to read a csv file into pandas, but when I move it into a dataframe I am only getting NaNs. I need to put it into a dataframe to work on loading it into SQL Server.
I am using the following code to load the data from the csv file:
for file in z.namelist():
    df1 = pd.read_csv(z.open(file), sep='\t', skiprows=[1, 2])
    print(df1)
This gives me the intended results:
But when I try to put the data into a dataframe, I am getting only NaNs. This is the code that I am using to load the data into the dataframe after the step above:
df1 = pd.DataFrame(df1,columns=['ResponseID','ResponseSet','IPAddress','StartDate','EndDate',
'RecipientLastName','RecipientFirstName','RecipientEmail','ExternalDataReference','Finished',
'Status','EmbeddedData','License Type','Organization','Reference ID','Q16','Q3#1_1_1_TEXT',
'Q3#1_1_2_TEXT','Q3#1_1_3_TEXT','Q3#1_2_1_TEXT','Q3#1_2_3_TEXT','Q3#1_3_1_TEXT','Q3#1_3_2_TEXT',
'Q3#1_3_3_TEXT','Q3#1_4_1_TEXT','Q3#1_4_2_TEXT','Q3#1_4_3_TEXT','Q3#1_5_1_TEXT','Q3#1_5_2_TEXT',
'Q3#1_5_3_TEXT','Q3#1_6_1_TEXT','Q3#1_6_2_TEXT','Q3#1_6_3_TEXT','Q4#1_5_1_TEXT','Q18','Q19#1_1_1_TEXT',
'Q19#1_2_1_TEXT','Q19#1_3_1_TEXT','Q19#1_4_1_TEXT','Q19#1_6_1_TEXT','Q14#1_4_1_TEXT','Q14#1_5_1_TEXT',
'Q14#1_8_1_TEXT','Q20','Q29','Q21','Q22','Q23','Q24','LocationLatitude','LocationLongitude','LocationAccuracy'])
print(df1)
I am getting only NaNs for this.
What should I do to get the data from the csv into my dataframe, and what is wrong with my code?
I was able to resolve this by using "," as a separator for my read_csv.
df1=pd.read_csv(z.open(file),sep=',',skiprows=[1,2])
I got rid of the NaNs by using the following:
df1 = df1.replace({np.nan: None})
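Putting both pieces together, a minimal sketch of the full read (assuming z is the ZipFile from the question):
import numpy as np
import pandas as pd

for file in z.namelist():
    df1 = pd.read_csv(z.open(file), sep=',', skiprows=[1, 2])
    df1 = df1.replace({np.nan: None})  # clear NaNs before loading into SQL Server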

HIVE Parquet error

I'm trying to insert the content of a dataframe into a partitioned, parquet-formatted Hive table using
df.write.mode(SaveMode.Append).insertInto(myTable)
with hive.exec.dynamic.partition = 'true' and hive.exec.dynamic.partition.mode = 'nonstrict'.
I keep getting a parquet.io.ParquetEncodingException saying that "empty fields are illegal, the field should be ommited completely instead".
The schema includes arrays (array<struct<int, string>>), and the df does contain some empty entries for these fields.
However, when I insert the df content into a non-partitioned table, I do not get an error.
How do I fix this issue? I have attached the error.
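For reference, a rough PySpark equivalent of the setup described above (a sketch only; assuming spark is a Hive-enabled SparkSession and my_table is the partitioned target table):
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

# Append the dataframe's rows into the partitioned Hive table
df.write.mode("append").insertInto("my_table")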

pyspark dataframe column name

What is the limitation for PySpark dataframe column names? I have an issue with the following code:
%livy.pyspark
df_context_spark.agg({'spatialElementLabel.value': 'count'})
It gives ...
u'Cannot resolve column name "spatialElementLabel.value" among (lightFixtureID.value, spatialElementLabel.value);'
The column name is evidently typed correctly. I got the dataframe by a transformation from a pandas dataframe. Is there any issue with the dot in the column name string?
Dots are used for nested fields inside a structure type. So if you had a column called "address" of type StructType, and inside that you had street1, street2, etc., you would access the individual fields like this:
df.select("address.street1", "address.street2", ...)
Because of that, if you want to use a dot in your field name you need to quote the field with backticks whenever you refer to it. For example:
from pyspark.sql.types import *
schema = StructType([StructField("my.field", StringType())])
rdd = sc.parallelize([('hello',), ('world',)])
df = sqlContext.createDataFrame(rdd, schema)
# Using backticks to quote the field name
df.select("`my.field`").show()

cloudant-spark connector creates duplicate column name with nested JSON schema

I'm using the following JSON Schema in my cloudant database:
{
  ...
  "departureWeather": {
    "temp": 30,
    "otherfields": "xyz"
  },
  "arrivalWeather": {
    "temp": 45,
    "otherfields": "abc"
  },
  ...
}
I'm then loading the data into a dataframe using the cloudant-spark connector. If I try to select fields like so:
df.select("departureWeather.temp", "arrivalWeather.temp")
I end up with a dataframe that has 2 columns with the same name, e.g. temp. It looks like the Spark datasource framework flattens the name using only the last part.
Is there an easy way to deduplicate the column names?
You can use aliases:
from pyspark.sql.functions import col

df.select(
    col("departureWeather.temp").alias("departure_temp"),
    col("arrivalWeather.temp").alias("arrival_temp")
)