Unable to read and a 2GB nested json with dataframe in pyspark

Unable to read and a 2GB nested json with dataframe in pyspark - apache-spark

I am trying to read a JSON in pyspark in databricks. it is a nested JSON with multiple levels. it has one array which holds the entire data in it. I am able to process if the size of the JSOn is small like 5mb. but same code is not working for 2GB or bigger file. The structure of the json is as below. The whole data is coming into in_network column. it will be appreciated if can some one tell how to load this into data frame and then to a table. The structure of JSON
I tried with explode the column. it is able to print the schema of the object. when I am trying to save it to table or file or if I use display or show on the data frame, it is hanging and the job is getting failed.
below is the code I am using to read the json and explode
from pyspark.sql.functions import explode,col
df = spark.read.option('multiline',True).json("/FileStore/shared_uploads/konda/SampleDatafilefrombigfile.json")
df=df.withColumn('in_network',explode('in_network'))\
.withColumn('negotiation_arrangement',col('in_network.negotiation_arrangement'))\
.withColumn('name',col('in_network.name'))\
.withColumn('billing_code_type',col('in_network.billing_code_type'))\
.withColumn('billing_code_type_version',col('in_network.billing_code_type_version'))\
.withColumn('billing_code',col('in_network.billing_code'))\
.withColumn('description',col('in_network.description'))\
.withColumn('explode_negotiated_rates',explode('in_network.negotiated_rates'))\
.withColumn('exp_neg_prices',explode(col('explode_negotiated_rates.negotiated_prices')))\
.withColumn('exp_neg_pg',explode(col('explode_negotiated_rates.provider_groups')))\
.withColumn('exp_npi',explode(col('exp_neg_pg.npi')))\
.withColumn('billing_class',col('exp_neg_prices.billing_class'))\
.withColumn('expiration_date',col('exp_neg_prices.expiration_date'))\
.withColumn('negotiated_rate',col('exp_neg_prices.negotiated_rate'))\
.withColumn('negotiated_type',col('exp_neg_prices.negotiated_type'))\
.withColumn('service_code',explode(col('exp_neg_prices.service_code')))\
.withColumn('tin_type',col('exp_neg_pg.tin.type'))\
.withColumn('tin_val',col('exp_neg_pg.tin.value'))\
.drop('in_network')\
.drop('explode_negotiated_rates')\
.drop('exp_neg_prices')\
.drop("exp_neg_pg")
df.printSchema()

Related

Databricks Delta Live Tables - Apply Changes from delta table

I am working with Databricks Delta Live Tables, but have some problems with upserting some tables upstream. I know it is quite a long text below, but I tried to describe my problem as clear as possible. Let me know if some parts are not clear.
I have the following tables and flow:
Landing_zone -> This is a folder in which JSON files are added that contain data of inserted or updated records.
Raw_table -> This is the data in the JSON files but in table format. This table is in delta format. No transformations are done, except from transforming the JSON structure into a tabular structure (I did an explode and then creating columns from the JSON keys).
Intermediate_table -> This is the raw_table, but with some extra columns (depending on other column values).
To go from my landing zone to the raw table I have the following Pyspark code:
cloudfile = {"cloudFiles.format":"JSON",
"cloudFiles.schemaLocation": sourceschemalocation,
"cloudFiles.inferColumnTypes": True}
#dlt.view('landing_view')
def inc_view():
df = (spark
.readStream
.format('cloudFiles')
.options(**cloudFilesOptions)
.load(filpath_to_landing)
<Some transformations to go from JSON to tabular (explode, ...)>
return df
dlt.create_target_table('raw_table',
table_properties = {'delta.enableChangeDataFeed': 'true'})
dlt.apply_changes(target='raw_table',
source='landing_view',
keys=['id'],
sequence_by='updated_at')
This code works as expected. I run it, add a changes.JSON file to the landing zone, rerun the pipeline and the upserts are correctly applied to the 'raw_table'
(However, each time a new parquet file with all the data is created in the delta folder, I would expect that only a parquet file with the inserted and updated rows was added? And that some information about the current version was kept in the delta logs? Not sure if this is relevant for my problem. I already changed the table_properties of the 'raw_table' to enableChangeDataFeed = true. The readStream for 'intermediate_table' then has option(readChangeFeed, 'true')).
Then I have the following code to go from my 'raw_table' to my 'intermediate_table':
#dlt.table(name='V_raw_table', table_properties={delta.enableChangeDataFeed': 'True'})
def raw_table():
df = (spark.readStream
.format('delta')
.option('readChangeFeed', 'true')
.table('LIVE.raw_table'))
df = df.withColumn('ExtraCol', <Transformation>)
return df
ezeg
dlt.create_target_table('intermediate_table')
dlt.apply_changes(target='intermediate_table',
source='V_raw_table',
keys=['id'],
sequence_by='updated_at')
Unfortunately, when I run this, I get the error:
'Detected a data update (for example part-00000-7127bd29-6820-406c-a5a1-e76fc7126150-c000.snappy.parquet) in the source table at version 2. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.'
I checked in the 'ignoreChanges', but don't think this is what I want. I would expect that the autoloader would be able to detect the changes in the delta table and pass them through the flow.
I am aware that readStream only works with append, but that is why I would expect that after the 'raw_table' is updated, a new parquet file would be added to the delta folder with only the inserts and updates. This added parquet file is then detected by autoloader and could be used to apply the changes to the 'intermediate_table'.
Am I doing this the wrong way? Or am I overlooking something? Thanks in advance!

As readStream only works with appends, any change in the the source file will create issues downstream. The assumption that an update on "raw_table" will only insert a new parquet file is incorrect. Based on the settings like "optimized writes" or even without it, apply_changes can add or remove files. You can find this information in your "raw_table/_delta_log/xxx.json" under "numTargetFilesAdded" and "numTargetFilesRemoved".
Basically, "Databricks recommends you use Auto Loader to ingest only immutable files".

When you changed the settings to include the option '.option('readChangeFeed', 'true')', you should start with a full refresh(there is dropdown near start). Doing this will resolve the error 'Detected data update xxx', and your code should work for the incremental update.

Azure Data Factory - Google BigQuery Copy Data activity not returning nested column names

I have a copy activity in Azure Data Factory with a Google BigQuery source.
I need to import the whole table (which contains nested fields - Records in BigQuery).
Nested fields get imported as follows (a string containing only data values):
"{\"v\":{\"f\":[{\"v\":\"1\"},{\"v\":\"1\"},{\"v\":\"1\"},{\"v\":null},{\"v\":\"1\"},{\"v\":null},{\"v\":null},{\"v\":\"1\"},{\"v\":null},{\"v\":null},{\"v\":null},{\"v\":null},{\"v\":\"0\"}]}}"
Expected output would be something like:
{"nestedColName" : [{"subNestedColName": 1}, {"subNestedColName": 1}, {"subNestedColName": 1}, {"subNestedColName": null}, ...] }
I think this is a connector issue from Data Factory's side but am not sure how to proceed.
Have considered using Databricks to import data from GBQ directly and then saving the DataFrame to sink.
Have also considered querying for a subset of columns and using UNNEST where required but would rather not do this as Parquet handles both Array and Map types.
Anyone encountered this before / what did you do?

Solution used:
Databricks (Spark) connector for Google BigQuery:
https://docs.databricks.com/data/data-sources/google/bigquery.html
This preserves schemas and nested field names.

Preferring the simpler setup of ADF BigQuery connector to Databricks's BigQuery support, I opted for a solution where I extract the data in JSON and 'massage' it into Parquet using Databricks:
Use a Copy activity to get data from BigQuery with all the data packed into a single JSON string field. Output format can be Parquet or JSON (I'm using Parquet). Use a BigQuery query like this:
select TO_JSON_STRING(t) as value from `<your BigQuery table>` as t
NOTE: The name of the field must be value. The df.write.text() text file writer writes the contents of value column into each row of the text file, which is a JSON string in this case.
Run a Databrick notebook activity with code like this:
# Read data and write it out as text file to get the JSON. (Compression is optional).
dfInput=spark.read.parquet(inputpath)
dfInput.write.mode("overwrite").option("compression","gzip").text(tmppath)
# Read back as JSON to extract the correct schema.
dfTemp=spark.read.json(tmppath)
dfTemp.write.mode("overwrite").parquet(outputpath)
Use the output as is, or use a Copy activity to copy it to where you like.

How to include SQLAlchemy Data types in Python dictionary

I've written an application using Python 3.6, pandas and sqlalchemy to automate the bulk loading of data into various back-end databases and the script works well.
As a brief summary, the script reads data from various excel and csv source files, one at a time, into a pandas dataframe and then uses the df.to_sql() method to write the data to a database. For maximum flexibility, I use a JSON file to provide all the configuration information including the names and types of source files, the database engine connection strings, the column titles for the source file and the column titles in the destination table.
When my script runs, it reads the JSON configuration, imports the specified source data into a dataframe, renames source columns to match the destination columns, drops any columns from the dataframe that are not required and then writes the dataframe contents to the database table using a call similar to:
df.to_sql(strTablename, con=engine, if_exists="append", index=False, chunksize=5000, schema="dbo")
The problem I have is that I would like to also specify the data types in the df.to_sql method for columns and provide them as inputs from the JSON configuration file however, this doesn't appear to be possible as all the strings in the JSON file need to be be enclosed in quotes and they don't then translate when read by my code. This is how the df.to_sql call should look:
df.to_sql(strTablename, con=engine, if_exists="append", dtype=dictDatatypes, index=False, chunksize=5000, schema="dbo")
The entries that form the dtype dictionary from my JSON file look like this:
"Data Types": {
"EmployeeNumber": "sqlalchemy.types.NVARCHAR(length=255)",
"Services": "sqlalchemy.types.INT()",
"UploadActivities": "sqlalchemy.types.INT()",
......
and there a many more, one for each column.
However, when the above is read in as a dictionary, which I pass to the df.to_sql method, it doesn't work as the alchemy datatypes shouldn't be enclosed in quotes but, I can't get around this in my JSON file. The dictionary values therefore aren't recognised by pandas. They look like this:
{'EmployeeNumber': 'sqlalchemy.types.INT()', ....}
And they really need to look like this:
{'EmployeeNumber': sqlalchemy.types.INT(), ....}
Does anyone have experience of this to suggest how I might be able to have the sqlalchemy datatypes in my configuration file?

You could use eval() to convert the string names to objects of that type:
import sqlalchemy as sa
dict_datatypes = {"EmployeeNumber": "sa.INT", "EmployeeName": "sa.String(50)"}
pprint(dict_datatypes)
"""console output:
{'EmployeeName': 'sa.String(50)', 'EmployeeNumber': 'sa.INT'}
"""
for key in dict_datatypes:
dict_datatypes[key] = eval(dict_datatypes[key])
pprint(dict_datatypes)
"""console output:
{'EmployeeName': String(length=50),
'EmployeeNumber': <class 'sqlalchemy.sql.sqltypes.INTEGER'>}
"""
Just be sure that you do not pass untrusted input values to functions like eval() and exec().

Is there a way to split a DF using column name comparison?

I am extremely new to Python. I've created a DataFrame using a csv file. My file is a complex nested json file having header values at the lowest granular level.
[Example] df.columns = [ID1, fullID2, total.count, total.value, seedValue.id, seedValue.value1, seedValue.value2, seedValue.largeFile.id, seedValue.largeFile.value1, seedValue.largeFile.value2......]
Requirement: I have to create multiple smaller csvs using each of the columns that are granular and ID1, fullID2.
My approach that I figured is: save the smaller slices by splitting on the header value.
Problem 1: Not able to split the value correctly or traverse to the first location for comparison.
[Example]
I'm using df.columns.str.split('.').tolist(). Suppose I get the below listed value, I want to compare seedValue of id with seedValue of value1 and pull out this entire part as a new df.
{['seedValue','id'],['seedValue'.'value1'], ['seedValue'.'value2']}
Problem 2: Adding ID1 and fullID2 to this new df.
Any help or direction to achieve this would be super helpful !
[Final output]
df.columns = [ID1, fullID2, total.count, total.value, seedValue.id, seedValue.value1, seedValue.value2, seedValue.largeFile.id, seedValue.largeFile.value1, seedValue.largeFile.value2......]
post-processing the file -
seedValue.columns = ID1,fullID2,id,value1,value2
total.columns = ID1,fullID2,count,value
seedValue.largeFile.columns = ID1,fullID2,id,value1,value2

While I do not possess your complex data to provide a more particular solution. I was able to reproduce a similar case with a .csv sample data, which will exemplify how to achieve what you aim with your data.
In order to save in each ID in a different file, we need to loop through the ID's. Also, assuming there might be more duplicate ID's, the script will save each group of ID's into a .csv file. Below is the script, already with sample data:
import pandas as pd
import csv
my_dict = { 'ids' : [11,11,33,55,55],
'info' : ["comment_1","comment_2", "comment_3", "comment_4", "comment_5"],
'other_column': ["something", "something", "something", "", "something"]}
#Creating a dataframe from the .csv file
df = pd.DataFrame(my_dict)
#sorting the value
df = df.sort_values('ids')
#g=df.groupby('ids')
df
#looping through each group of ids and saving them into a file
for id,g in df.groupby('ids'):
g.to_csv('id_{}.csv'.format(id),index=False)#, header=True, index_label=False)
And the output,
id_11.csv
id_33.csv
id_55.csv
For instance, within id_11.csv:
ids info other_column
11 comment_1 something
11 comment_2 something
Notice that, we use the field ids in the name of each file. Moreover, index=False which means that a new column with indexes for each line of data won't be created.
ADDITIONAL INFO: I have used the Notebook in AI Platform within GCP to execute and test the code.

Compared to the more widely known pd.read_csv, pandas offers more granular json support through pd.json_normalize, which allows you to specify how to unnest the data, which meta-data to use, etc.
Apart from this, reading nested fields from a csv into a two-dimensional dataframe might not be the ideal solution here, and having nested objects inside a dataframe can often be tricky to work with.
Try to read the file as a pure dictionary or list of dictionaries. You can then loop through the keys and create a custom logic to check how many more levels you want to go down, how to return the values and so on. Once you are on a lower level and prefer to have this inside of a dataframe, create a new temporary dataframe, then append these parts together inside the loop.

What is the proper way to validate datatype of csv data in spark?

We have a JSON file as input to the spark program(which describe schema definition and constraints which we want to check on each column) and I want to perform some data quality checks such as (Not NULL, UNIQUE) and datatype validations as well(Wants to check whether csv file contains the data according to json schema or not?).
JSON File:
{
"id":"1",
"name":"employee",
"source":"local",
"file_type":"text",
"sub_file_type":"csv",
"delimeter":",",
"path":"/user/all/dqdata/data/emp.txt",
"columns":[
{"column_name":"empid","datatype":"integer","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"empname","datatype":"string","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"salary","datatype":"double","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"doj","datatype":"date","constraints":["not null","unique"],"values_permitted":["1","2"]},
{"column_name":"location","string":"number","constraints":["not null","unique"],"values_permitted":["1","2"]}
]
}
Sample CSV input :
empId,empname,salar,dob,location
1,a,10000,11-03-2019,pune
2,b,10020,14-03-2019,pune
3,a,10010,15-03-2019,pune
a,1,10010,15-03-2019,pune
Keep in mind that,
1) intentionally I have put the invalid data for empId and name field(check last record).
2) The number of column in the json file is not fixed?
Question:
How can I ensure that an input data file contains all the records as per the given datatype(in JSON) file or not?
We have tried below things:
1) If we try to load the data from the CSV file using a data frame by applying external schema, then the spark program immediately throws some cast exception(NumberFormatException, etc) and it abnormally terminates the program. But I want to continue the execution flow and log the specific error as "Datatype mismatch error for column empID".
Above scenario works only when we call some RDD action on data frame which I felt a weird way to validate schema.
Please guide me, How we can achieve it in spark?

I don't think there is a free lunch here you have to write this process yourself but the process you can do is...
Read the csv file as a Dataset of Strings so that every row is good
Parse the Dataset using the map function to check for Null or datatype problems per column
Add an extra two columns, a boolean called like validRow and a String called like message or description
With the parser mentioned in '2.', do some sort of try/catch or a Try/Success/Failure for each value in each column and catch the exception and set the validRow and the description column accordingly
Do a filter and write one DataFrame/DataSet that is successful (validRow flag is set to True) to a success place, and write the error DataFrame/DataSet to an error place

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Unable to read and a 2GB nested json with dataframe in pyspark - apache-spark

Related

Databricks Delta Live Tables - Apply Changes from delta table

Azure Data Factory - Google BigQuery Copy Data activity not returning nested column names

How to include SQLAlchemy Data types in Python dictionary

Is there a way to split a DF using column name comparison?

What is the proper way to validate datatype of csv data in spark?

Categories

Resources