Apache Spark Dataframes Not Being Created with Databricks - apache-spark

When I read data from SQL Server into a notebook, a Spark DataFrame is created, but when I read the same data from a different notebook I don't see a DataFrame.
When I run the following in one notebook I get a DataFrame:
jdbcUrl = f"jdbc:sqlserver://{DBServer}.database.windows.net:1433;database={DBDatabase};user={DBUser};password={DBPword}"
my_sales = spark.read.jdbc(jdbcUrl, 'AZ_FH_ELLIPSE.AZ_FND_MSF620')
I get the expected DataFrame output.
However, when I run the same code in a different notebook I only see how long the code took to run, but no DataFrame.
Any thoughts?
I should mention that the DataFrame isn't appearing on the Community Edition of Databricks. However, I don't think that should be the reason why I'm not seeing a DataFrame or schema appear.
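For what it's worth, spark.read.jdbc is lazy, so the read itself produces no visible output; the DataFrame only renders when the notebook calls an action on it. A minimal sketch of forcing that output explicitly, reusing the placeholder connection settings from the question:

jdbcUrl = f"jdbc:sqlserver://{DBServer}.database.windows.net:1433;database={DBDatabase};user={DBUser};password={DBPword}"
my_sales = spark.read.jdbc(jdbcUrl, 'AZ_FH_ELLIPSE.AZ_FND_MSF620')

# spark.read.jdbc only defines the DataFrame; an action is needed to see rows.
my_sales.printSchema()   # print the inferred schema
my_sales.show(10)        # trigger the read and display the first rows
# In a Databricks notebook, display(my_sales) renders a tabular preview.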

Related

PySpark pandas converting Excel to Delta Table Failed

I am using the PySpark.pandas read_excel function to import data and saving the result to the metastore using to_table. It works fine if format='parquet'. However, the job hangs if format='delta'. The cluster idles after creating the parquet files and does not proceed to write _delta_log (at least that's what it looks like).
Have you any clue what might be happening?
I'm using Databricks 11.3, Spark 3.3.
I have also tried importing the Excel file using regular pandas, converting the pandas DF to a Spark DF using spark.createDataFrame, and then write.saveAsTable, without success when the format is delta.
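For context, the flow described above looks roughly like the sketch below; the file path and table names are made up for illustration, and ps.read_excel needs openpyxl available on the cluster:

import pyspark.pandas as ps

# Read the workbook with the pandas-on-Spark API.
psdf = ps.read_excel('/dbfs/FileStore/uploads/workbook.xlsx')

# Writing as parquet reportedly works...
psdf.to_table('my_db.my_table_parquet', format='parquet', mode='overwrite')

# ...while the same call with format='delta' is where the job hangs.
psdf.to_table('my_db.my_table_delta', format='delta', mode='overwrite')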

Read multiple text files into a spark dataframe

I am trying to read multiple text files into a single Spark DataFrame. I have used the following code for a single file:
df = spark.read.text('C:/User/Alex/Directory/Subdirectory/Filename.txt.pgp.decr')
df.count()
and I get the correct result. I then try to read all of the files in that directory as follows:
df = spark.read.text('C:/User/Alex/Directory/Subdirectory/*')
df.count()
and the notebook just hangs and produces no result. I have also tried reading the data into an RDD using the SparkContext with textFile and wholeTextFiles, but that didn't work either. Please can you help?
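One way to narrow this down (a sketch, not something from the question) is to pass spark.read.text an explicit list of files instead of the '*' glob, which makes it easier to spot whether one particular file is the culprit:

from pathlib import Path

# Build an explicit list of files rather than relying on the glob.
files = [str(p) for p in Path('C:/User/Alex/Directory/Subdirectory').glob('*')]

# spark.read.text accepts a list of paths as well as a single path or glob.
df = spark.read.text(files)
df.count()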

DropDuplicates in PySpark gives stackoverflowerror

I have a PySpark program which reads a JSON file of around 350-400 MB and creates a DataFrame out of it.
In my next step, I create a Spark SQL query using createOrReplaceTempView and select a few columns as required.
Once this is done, I filter my DataFrame with some conditions. It was working fine until this point.
Now I needed to remove some duplicate values using a column, so I introduced dropDuplicates in the next step and it suddenly started giving me a StackOverflowError.
Below is the sample code:
def create_some_df(initial_df):
    initial_df.createOrReplaceTempView('data')
    original_df = spark.sql('select c1,c2,c3,c4 from data')
    ## Filter out some events
    original_df = original_df.filter(filter1condition)
    original_df = original_df.filter(filter2condition)
    original_df = original_df.dropDuplicates(['c1'])
    return original_df
It worked fine until I added the dropDuplicates method.
I am using a 3-node AWS EMR cluster (c5.2xlarge).
I am running PySpark using the spark-submit command in YARN client mode.
What I have tried
I tried adding persist and cache before calling filter, but it didn't help.
EDIT - Some more details
I realise that the error appears when I invoke my write function after multiple transformations, i.e. the first action.
If I have dropDuplicates in my transformations before I write, it fails with the error.
If I do not have dropDuplicates in my transformations, the write works fine.
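One commonly suggested workaround for a StackOverflowError caused by a very deep lineage (not something confirmed in the question) is to truncate the plan with a checkpoint before the write; a minimal sketch, with a hypothetical checkpoint directory:

# Checkpointing materialises the DataFrame and cuts the lineage,
# so the plan handed to dropDuplicates and the write stays shallow.
spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoints')  # hypothetical path

filtered_df = original_df.filter(filter1condition).filter(filter2condition)
checkpointed_df = filtered_df.checkpoint()   # eager by default
deduped_df = checkpointed_df.dropDuplicates(['c1'])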

Dropping temporary columns in Spark

I'm creating a new column in a DataFrame and using it in subsequent transformations. Later, when I try to drop the new column, it breaks the execution. When I look into the execution plan, Spark optimises the plan by removing the whole flow because I'm dropping the column in a later stage. How can I drop a temporary column without affecting the execution plan? I'm using PySpark.
df = df.withColumn('COLUMN_1', "some transformation returns value").withColumn('COLUMN_2',"some transformation returns value")
df = df.withColumn('RESULT',when("some condition", col('COLUMN_1')).otherwise(col('COLUMN_2'))).drop('COLUMN_1','COLUMN_2')
I have tried this in spark-shell (using Scala) and it works as expected.
I'm using Spark 2.4.4 and Scala 2.11.12.
I have tried the same in PySpark; see the attachment. Let me know if this answer helps.
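For reference, a runnable PySpark version of the same pattern, with made-up transformations ('amount', 'flag') standing in for the placeholders in the question:

from pyspark.sql import functions as F

# Hypothetical stand-ins for "some transformation returns value".
df = df.withColumn('COLUMN_1', F.col('amount') * 2) \
       .withColumn('COLUMN_2', F.col('amount') * 3)

# Pick one of the temporary columns, then drop both.
df = df.withColumn(
    'RESULT',
    F.when(F.col('flag') == 1, F.col('COLUMN_1')).otherwise(F.col('COLUMN_2'))
).drop('COLUMN_1', 'COLUMN_2')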

Spark temp tables not found

I'm trying to run a pySpark job with custom inputs, for testing purposes.
The job has three sets of input, each read from a table in a different metastore database.
The data is read in spark with: hiveContext.table('myDb.myTable')
The test inputs are three files. In an attempt not to change any of the original code, I read all three inputs into DataFrames and attempt to register a temp table with myDF.registerTempTable('myDb.myTable').
The problem is that spark fails with org.apache.spark.sql.catalyst.analysis.NoSuchTableException.
I've also tried:
hiveContext.sql('create database if not exists myDb')
hiveContext.sql('use myDb')
myDF.registerTempTable('myTable')
But that fails as well.
Any idea why the table cannot be found?
Using Spark 1.6
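For context, temp tables registered with registerTempTable in Spark 1.6 live in a session-local namespace rather than in a metastore database, so the register/look-up pair usually uses an unqualified name; a sketch reusing the names from the question:

# Register the test input under a plain, unqualified name.
myDF.registerTempTable('myTable')

# Look it up without the database prefix...
testDF = hiveContext.table('myTable')

# ...or through SQL.
testDF = hiveContext.sql('SELECT * FROM myTable')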
