Read multiple text files into a single DataFrame - python-3.x

I'm trying to read multiple text files into a single DataFrame in PySpark and then call show(), but I get an error that references the second file path.
BUYERS10_m1 = spark.read.text(Buyers_F1_path,Buyers_F2_path)
BUYERS10_m1.show()
Py4JJavaError: An error occurred while calling o245.showString.
: java.lang.IllegalArgumentException: For input string: "s3a://testing/Buyers/File2.TXT"
Does anyone have any idea why I'm getting this error and how to resolve it?

The following should work:
spark.read.text("s3a://testing/Buyers/File{1,2}.TXT")
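For completeness, spark.read.text() also accepts a list of paths. Passing the two paths as separate positional arguments sends the second one into the wholetext parameter, which is why Spark ends up trying to parse the path string as a boolean and fails. A minimal sketch, assuming Buyers_F1_path and Buyers_F2_path hold the two s3a URIs from the question:
# Pass both paths as one list so they both land in the `paths` argument
BUYERS10_m1 = spark.read.text([Buyers_F1_path, Buyers_F2_path])
BUYERS10_m1.show()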

Related

Reading Parquet file with Pyspark returns java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths

I am trying to load parquet files in the following directories:
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-2
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-3
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-4
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-5
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-6
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-7
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-8
This is what I wrote in PySpark:
s3_bucket_location_of_data = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/"
df = spark.read.parquet(s3_bucket_location_of_data)
but I received the following error:
Py4JJavaError: An error occurred while calling o109.parquet.
: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-2
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-3
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-4
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-5
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-6
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-7
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-8
After reading other StackOverflow posts like this, I tried the following:
base_path="s3://dir1/" # I have tried to set this to "s3://dir1/model=m1/version=newest/versionnumber=3/scores/" as well, but it didn't work
s3_bucket_location_of_data = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/"
df = spark.read.option("basePath", base_path).parquet(s3_bucket_location_of_data)
but that returned a similar error message to the one above. I am new to Spark/PySpark and I don't know what I could be doing wrong here. Thank you in advance for your answers!
You don't need to specify the detailed path. Just load the files from the base_path.
df = spark.read.parquet("s3://dir1")
df = df.filter("model = 'm1' and version = 'newest' and versionnumber = 3")
The directory structure is already partitioned by three columns: model, version, and versionnumber. Read from the base path and filter on those partition columns; Spark will then read only the Parquet files under the matching partition directories.
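For reference, the same approach written out with column expressions, as a sketch assuming the layout above; because model, version, and versionnumber are partition columns, Spark prunes the non-matching directories instead of scanning the whole bucket:
from pyspark.sql.functions import col

# Read from the base path; the partition columns show up as regular columns in df
df = spark.read.parquet("s3://dir1")
# Filtering on partition columns triggers partition pruning at read time
df = df.filter((col("model") == "m1") & (col("version") == "newest") & (col("versionnumber") == 3))
df.show()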

AzureML TabularDatasetFactory.from_parquet_files() error handling column types

I'm reading in a folder of parquet files using azureml's TabularDatasetFactory method:
dataset = TabularDatasetFactory.from_parquet_files(path=[(datastore_instance, "path/to/files/*.parquet")])
but I am running into the issue that one of the columns is typed 'List' in the parquet files, and it seems TabularDatasetFactory.from_parquet_files() can't handle that type:
ExecutionError:
Error Code: ScriptExecution.StreamAccess.Validation
Validation Error Code: NotSupported
Validation Target: ParquetType
Failed Step: xxxxxx
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by ValidationException.
No conversion exists for column: '[REDACTED COLUMN NAME]', from Parquet SchemaType: 'List' to DataPrep ValueKind
So I'm wondering if there's a way to tell TabularDatasetFactory.from_parquet_files() specifically which columns to pull in, or a way to tell it to fall back to object/string for any unsupported column types. Or maybe there's a workaround: first reading in the files as a FileDataset, then selecting which columns in the files to use?
I do see the set_column_types parameter, but I don't know the columns until I read the data into a dataset, since I'm using datasets to explore what data is available in the folder paths in the first place.
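For what it's worth, this is roughly how set_column_types is passed once the offending column name is known; the column name below is purely hypothetical, and I can't confirm that coercing it to string actually gets around the unsupported Parquet 'List' type:
from azureml.data.dataset_factory import TabularDatasetFactory, DataType

dataset = TabularDatasetFactory.from_parquet_files(
    path=[(datastore_instance, "path/to/files/*.parquet")],
    # "list_column" is a placeholder; the real name is only known after inspecting the files
    set_column_types={"list_column": DataType.to_string()},
)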

Error:'str' object has no attribute 'write' when converting Parquet to CSV

I have the following parquet files listed in my lake and I would like to convert them to CSV.
I have attempted to carry out the conversion using suggestions on SO, but I keep getting an AttributeError:
AttributeError: 'str' object has no attribute 'write'
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<command-507817377983169> in <module>
----> 1 df.write.format("csv").save("/mnt/lake/RAW/export/")
AttributeError: 'str' object has no attribute 'write'
I have created 'df' pointing to the location where the parquet files reside, which gives the following output:
Out[71]: '/mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal'
When I attempt to write / convert the parquets to CSV using either of the following I get the above error:
df.write.format("csv").save("/mnt/lake/RAW/export/")
df.write.csv(path)
I'm entering the following to read: df = spark.read.parquet("/mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal/"), but I'm getting the following error message:
A transaction log for Databricks Delta was found at /mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal/_delta_log, but you are trying to read from /mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal/ using format("parquet"). You must use 'format("delta")' when reading and writing to a delta table. To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
The data you have stored is in Delta format, so read it with the following command:
df = spark.read.format("delta").load(path_to_data)
Once loaded, call display(df) first to make sure that it loaded properly.
If the output is as expected, then you can write it as CSV to your desired location.
The type of the df variable is a string, and its value is /mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal; it is not a DataFrame.
You need to read the data first and make sure df is a PySpark DataFrame before calling df.write.
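Putting the two answers together, a minimal sketch using the paths from the question (the header option is just a common convenience, not something from the original post):
# Read the Delta table into an actual DataFrame
df = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal/")
display(df)  # sanity check in Databricks that the data loaded as expected
# Now df.write exists, and the CSV export works
df.write.format("csv").option("header", "true").save("/mnt/lake/RAW/export/")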

How to use variables in pyspark functions like months_between

I am a newbie in PySpark and am having difficulty using a variable in PySpark functions. PySpark is treating the variable as a column name and throwing an exception.
var_date_to = '2020-06-01'
months_between(col("date_to"), var_date_to)
Exception Thrown:
pyspark.sql.utils.AnalysisException: "cannot resolve '2020-06-01' given input columns: [......
I tried formatting the input string but got the same exception.
months_between(col("date_to"),'{0}'.format(var_date_to))
Please help
You have to convert it to a Column first using lit():
months_between(col("date_to"), lit(var_date_to))
and it will work.
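A self-contained sketch of that fix; the sample DataFrame is illustrative:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, months_between

spark = SparkSession.builder.getOrCreate()
var_date_to = '2020-06-01'
df = spark.createDataFrame([("2020-01-01",)], ["date_to"])
# lit() wraps the Python string in a Column, so it is no longer resolved as a column name
df = df.withColumn("months_to", months_between(col("date_to"), lit(var_date_to)))
df.show()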

Error while overwriting Cassandra table from PySpark

I am attempting to OVERWRITE data in Cassandra with a PySpark DataFrame and I get this error: keyword can't be an expression
I am able to append the data with:
df.write.format("org.apache.spark.sql.cassandra").options(keyspace="ks",table="testtable").mode("append").save()
However, overwriting throws an error:
df.write.format("org.apache.spark.sql.cassandra").options(keyspace="ks",table="testtable", confirm.truncate="true").mode("overwrite").save()
Error: keyword can't be an expression
I found the solution. Python keyword arguments can't contain a dot, which is why passing confirm.truncate="true" inside options() raises "keyword can't be an expression"; pass it through .option() instead:
df.write.format("org.apache.spark.sql.cassandra") \
    .mode("overwrite").option("confirm.truncate", "true") \
    .options(keyspace="ks", table="testtable") \
    .save()
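For context, the Spark Cassandra Connector requires confirm.truncate=true in overwrite mode because overwriting truncates the target table before inserting the new rows; append mode does not need it.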
