CSV upload to PySpark data frame counting linebreaks as new rows - apache-spark

I am trying to load a CSV into a Spark data frame using the usual instructions; however, the CSV is loading incorrectly. Below are the header and a problematic record, as viewed in Vim, which shows the ^M carriage returns.
symbol,tweet_id,text,created_at,retweets,likes,geo,place,coordinates,location^M
AIG,1423790670557351945, "Next Monday, US stock market may edge up. The market is more likely to be mixed.
$MNST $AIG $WFC $GS $MET $BAC $JPM▒<80>▒",2021-08-06 23:38:43,1,0,,,,"Toronto, Ontario, Canada"^M
Here is the command I'm using to load the CSV into Spark data frame:
df = spark.read.load("symbols_tweets.csv",
                     format="csv", sep=",", header="true")
The issue is that spark.read.load ends up treating $MNST as the start of a new row since it appears on a new line. Is there any way I can have Spark treat the carriage return ^M (i.e., the CRLF line ending) as the record delimiter so that the rows load as intended? As a workaround, I tried converting the CSV to a Pandas data frame and then to a Spark data frame, but that resulted in more complex datatype issues; I would rather solve this more directly.
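One avenue worth trying (a sketch only, assuming the embedded newlines sit inside quoted fields as in the sample above) is Spark's multiLine CSV option, which keeps quoted linebreaks inside a single record:

# Sketch: multiLine keeps quoted linebreaks inside one record;
# escape='"' matches RFC 4180-style double-quote escaping.
df = (spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .option("escape", '"')
      .csv("symbols_tweets.csv"))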

Related

Unable to read a 2GB nested JSON with dataframe in pyspark

I am trying to read a JSON file in pyspark on Databricks. It is a nested JSON with multiple levels, and it has one array which holds the entire data. I am able to process it if the JSON is small, around 5 MB, but the same code does not work for a 2 GB or bigger file. The whole data ends up in the in_network column. It would be appreciated if someone could tell me how to load this into a data frame and then into a table.
I tried exploding the column. It is able to print the schema of the object, but when I try to save it to a table or a file, or use display or show on the data frame, it hangs and the job fails.
Below is the code I am using to read the JSON and explode it:
from pyspark.sql.functions import explode,col
df = spark.read.option('multiline',True).json("/FileStore/shared_uploads/konda/SampleDatafilefrombigfile.json")
df = (df.withColumn('in_network', explode('in_network'))
        .withColumn('negotiation_arrangement', col('in_network.negotiation_arrangement'))
        .withColumn('name', col('in_network.name'))
        .withColumn('billing_code_type', col('in_network.billing_code_type'))
        .withColumn('billing_code_type_version', col('in_network.billing_code_type_version'))
        .withColumn('billing_code', col('in_network.billing_code'))
        .withColumn('description', col('in_network.description'))
        .withColumn('explode_negotiated_rates', explode('in_network.negotiated_rates'))
        .withColumn('exp_neg_prices', explode(col('explode_negotiated_rates.negotiated_prices')))
        .withColumn('exp_neg_pg', explode(col('explode_negotiated_rates.provider_groups')))
        .withColumn('exp_npi', explode(col('exp_neg_pg.npi')))
        .withColumn('billing_class', col('exp_neg_prices.billing_class'))
        .withColumn('expiration_date', col('exp_neg_prices.expiration_date'))
        .withColumn('negotiated_rate', col('exp_neg_prices.negotiated_rate'))
        .withColumn('negotiated_type', col('exp_neg_prices.negotiated_type'))
        .withColumn('service_code', explode(col('exp_neg_prices.service_code')))
        .withColumn('tin_type', col('exp_neg_pg.tin.type'))
        .withColumn('tin_val', col('exp_neg_pg.tin.value'))
        .drop('in_network')
        .drop('explode_negotiated_rates')
        .drop('exp_neg_prices')
        .drop("exp_neg_pg"))
df.printSchema()
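As an aside, the same flattening can be sanity-checked on a small slice first; the sketch below reuses only field names that appear in the code above, and whether it behaves any better on the full 2 GB file is not established here:

from pyspark.sql.functions import explode, col

# Sketch: flatten a handful of in_network elements to confirm the logic
# before running against the full 2 GB file.
df = spark.read.option('multiline', True).json(
    "/FileStore/shared_uploads/konda/SampleDatafilefrombigfile.json")

sample = (df.select(explode('in_network').alias('n'))
            .limit(10)                                    # small slice for testing
            .select('n.name', 'n.billing_code',
                    explode('n.negotiated_rates').alias('nr'))
            .withColumn('np', explode('nr.negotiated_prices'))
            .withColumn('negotiated_rate', col('np.negotiated_rate'))
            .drop('nr', 'np'))
sample.show()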

Read only non-merged files in pyspark

I have N deltas in N folders (e.g. /user/deltas/1/delta1.csv, /user/deltas/2/delta2.csv, ..., /user/deltas/n/deltaN.csv).
All deltas have the same columns; only the information in the columns differs.
I have code for reading my CSV files from the "deltas" folder:
dfTable = spark.read.format("csv").option("recursiveFileLookup", "true")\
                    .option("header", "true").load("/home/user/deltas/")
I am going to use deltaTable.merge to merge and update information from the deltas and write the updated information to a table (main_table.csv).
For example, tomorrow I will have a new delta with other updated information, and I will run my code again to refresh the data in my main_table.csv.
How can I avoid re-reading deltas that deltaTable.merge has already applied to main_table.csv?
Is it maybe possible to change the file type after a delta has been processed, for example to parquet, so that it is not re-used again? (I am reading CSV files, not parquet.) Or something like log files, etc.?
I think a time path filter might work well for your use case. If you are running your code daily (either manually or with a job), then you could use the modifiedAfter parameter to only load files that were modified after one day ago (or however often you rerun this code).
from datetime import datetime, timedelta

timestamp_last_run = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%S")
dfTable = spark.read.format("csv").option("recursiveFileLookup", "true")\
                    .option("header", "true").load("/home/user/deltas/", modifiedAfter=timestamp_last_run)
## ...perform merge operation and save data in main_table.csv

Load data in pyspark where csv file contains bullets

Hi, I have to load data into Hive and I am using pyspark for data processing.
I am getting a CSV file from upstream which has bullet characters in it. When I do dataframe.show(), I don't see the bullets; I see ??? instead.
Can someone suggest how I can load the data? I am OK with replacing the bullets with a space or blank.
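One angle worth trying, sketched below with assumed names: ??? in show() output often points to an encoding mismatch, so specifying the file's actual charset when reading and then replacing the bullet character (U+2022) may be enough. The path, the column name, and the cp1252 encoding are assumptions, not details from the question.

from pyspark.sql import functions as F

# Sketch: read with the file's real encoding (cp1252 is an assumption; adjust
# as needed), then replace the bullet character U+2022 with a space.
df = (spark.read
      .option("header", "true")
      .option("encoding", "cp1252")
      .csv("/path/to/upstream_file.csv"))      # hypothetical path

df = df.withColumn("description",              # hypothetical column name
                   F.regexp_replace("description", "\u2022", " "))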

Databricks Delta Live Tables - Apply Changes from delta table

I am working with Databricks Delta Live Tables, but have some problems with upserting some tables upstream. I know it is quite a long text below, but I tried to describe my problem as clear as possible. Let me know if some parts are not clear.
I have the following tables and flow:
Landing_zone -> This is a folder in which JSON files are added that contain data of inserted or updated records.
Raw_table -> This is the data from the JSON files but in table format. This table is in delta format. No transformations are done, except for transforming the JSON structure into a tabular structure (I did an explode and then created columns from the JSON keys).
Intermediate_table -> This is the raw_table, but with some extra columns (depending on other column values).
To go from my landing zone to the raw table I have the following Pyspark code:
cloudfile = {"cloudFiles.format":"JSON",
"cloudFiles.schemaLocation": sourceschemalocation,
"cloudFiles.inferColumnTypes": True}
#dlt.view('landing_view')
def inc_view():
df = (spark
.readStream
.format('cloudFiles')
.options(**cloudFilesOptions)
.load(filpath_to_landing)
<Some transformations to go from JSON to tabular (explode, ...)>
return df
dlt.create_target_table('raw_table',
table_properties = {'delta.enableChangeDataFeed': 'true'})
dlt.apply_changes(target='raw_table',
source='landing_view',
keys=['id'],
sequence_by='updated_at')
This code works as expected: I run it, add a changes.JSON file to the landing zone, rerun the pipeline, and the upserts are correctly applied to the 'raw_table'.
(However, each time a new parquet file containing all the data is created in the delta folder. I would expect that only a parquet file with the inserted and updated rows would be added, and that some information about the current version would be kept in the delta logs. Not sure if this is relevant for my problem. I already changed the table_properties of the 'raw_table' to enableChangeDataFeed = true, and the readStream for 'intermediate_table' then has option(readChangeFeed, 'true').)
Then I have the following code to go from my 'raw_table' to my 'intermediate_table':
@dlt.table(name='V_raw_table',
           table_properties={'delta.enableChangeDataFeed': 'true'})
def raw_table():
    df = (spark.readStream
          .format('delta')
          .option('readChangeFeed', 'true')
          .table('LIVE.raw_table'))
    df = df.withColumn('ExtraCol', <Transformation>)
    return df
dlt.create_target_table('intermediate_table')

dlt.apply_changes(target='intermediate_table',
                  source='V_raw_table',
                  keys=['id'],
                  sequence_by='updated_at')
Unfortunately, when I run this, I get the error:
'Detected a data update (for example part-00000-7127bd29-6820-406c-a5a1-e76fc7126150-c000.snappy.parquet) in the source table at version 2. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.'
I checked the 'ignoreChanges' option, but I don't think this is what I want. I would expect the autoloader to be able to detect the changes in the delta table and pass them through the flow.
I am aware that readStream only works with append, but that is why I would expect that after the 'raw_table' is updated, a new parquet file would be added to the delta folder with only the inserts and updates. This added parquet file is then detected by autoloader and could be used to apply the changes to the 'intermediate_table'.
Am I doing this the wrong way? Or am I overlooking something? Thanks in advance!
As readStream only works with appends, any change in the source table will create issues downstream. The assumption that an update on "raw_table" will only insert a new parquet file is incorrect. Depending on settings like "optimized writes" (or even without them), apply_changes can add or remove files. You can find this information in "raw_table/_delta_log/xxx.json" under "numTargetFilesAdded" and "numTargetFilesRemoved".
Basically, "Databricks recommends you use Auto Loader to ingest only immutable files".
When you change the settings to include the option .option('readChangeFeed', 'true'), you should start with a full refresh (there is a dropdown near Start). Doing this will resolve the 'Detected a data update xxx' error, and your code should work for the incremental updates.

Read CSV with linebreaks in pyspark

I want to read with pyspark a "legal" CSV (it follows RFC 4180) that has linebreaks (CRLF) inside some of its quoted fields; a Notepad++ screenshot (not reproduced here) showed the embedded line endings.
I tried to read it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.
Is there any command or alternative way of reading the CSV that allows me to read these two lines as a single row?
You can use "csv" instead of Databricks CSV - the last one redirects now to default Spark reader. But, it's only a hint :)
In Spark 2.2 there was added new option - wholeFile. If you write this:
spark.read.option("wholeFile", "true").csv("file.csv")
it will read the whole file and handle multiline CSV. There is no such option in Spark 2.1. You can read the file using sparkContext.wholeTextFiles or just use a newer version.
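For completeness, a rough sketch of that wholeTextFiles fallback on Spark 2.1, assuming the CSV fits comfortably in memory on a single executor and follows RFC 4180 quoting (file name as in the snippet above):

import csv
import io

# Sketch: read the whole file as one string so quoted CRLFs survive, then let
# Python's csv module split it into records.
raw = spark.sparkContext.wholeTextFiles("file.csv")          # RDD of (path, content)
rows = raw.flatMap(lambda kv: list(csv.reader(io.StringIO(kv[1]))))
header = rows.first()
df = rows.filter(lambda r: r != header).toDF(header)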
wholeFile does not exist (anymore?) in the Spark API documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
This solution will work:
spark.read.option("multiLine", "true").csv("file.csv")
From the api documentation:
multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false
