size of dataframe/rdd in spark 3.2 - apache-spark

I was using
spark_session._jsparkSession.sessionState().executePlan(
df._jdf.queryExecution().logical()).optimizedPlan().stats().sizeInBytes()
with PySpark versions < 3.2 in order to get the size of my DataFrame (in bytes), but in 3.2 the signature of executePlan seems to have changed and I get the following error:
py4j.Py4JException: Method executePlan([class org.apache.spark.sql.catalyst.plans.logical.Filter]) does not exist
Is there any way to make this work? I tried adding
spark_session._jsparkSession.CommandExecutionMode
to the function call, but it produced the following error:
{AttributeError}'JavaMember' object has no attribute '_get_object_id'

I am not sure about the approach you are trying.
Here is an alternative approach to your problem, which counts the rows in each partition:
df.rdd.mapPartitionsWithIndex(lambda x,it: [(x,sum(1 for _ in it))]).collect()
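For what it's worth, a workaround that has been reported for Spark 3.2+ is to pass the execution mode of the existing QueryExecution as the second argument, since executePlan now also expects a CommandExecutionMode. This is only a sketch against private APIs, and it assumes the py4j wrapper exposes QueryExecution.mode():
qe = df._jdf.queryExecution()
size_in_bytes = (
    spark_session._jsparkSession.sessionState()
    .executePlan(qe.logical(), qe.mode())  # mode() supplies the CommandExecutionMode argument
    .optimizedPlan()
    .stats()
    .sizeInBytes()
)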

Related

Spark wrongly casting integers as `struct<int:int,long:bigint>`

In a spark job, I am using
.withColumn("year", year(to_timestamp(lit(col("timestamp")))))
This code used to work, but now I get the error:
"cannot resolve 'CAST(`timestamp` AS TIMESTAMP)' due to data type mismatch: cannot cast struct<int:int,long:bigint> to timestamp;"
It looks like Spark is reading my timestamp column as a struct<int:int,long:bigint> instead of an int.
How can I prevent that?
Context: the initial data is in JSON Lines. I read it using AWS Glue's glueContext.create_dynamic_frame.from_catalog. In the Glue catalog the timestamp column is typed int.
Finally, I solved it this way:
GF_resolved = ResolveChoice.apply(
frame=GF_raw,
specs=[("timestamp", "cast:int")],
transformation_ctx="resolve timestamp type",
)
ResolveChoice is a method available on the AWS Glue DynamicFrame.
The short answer is that you cannot prevent it when creating a dynamic frame from the catalog because, as the name suggests, the schema is dynamic. See this SO answer for more information.
An alternative approach that is a little more compact is:
gf_resolved = gf_raw.resolveChoice(specs = [('timestamp','cast:int')])
Official documentation for the resolveChoice class can be found here: AWS Resolve Choice.
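As a rough continuation of either answer (variable and column names taken from the snippets above, with the redundant lit() wrapper dropped), once the choice type is resolved to int the original year derivation should cast cleanly again:
from pyspark.sql.functions import col, to_timestamp, year

# Convert the resolved DynamicFrame to a Spark DataFrame and derive the year.
df = gf_resolved.toDF()
df = df.withColumn("year", year(to_timestamp(col("timestamp"))))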

Invalid date:Error while import CSV to Cassandra using pySpark

I'm using Jupyter Notebook to run PySpark code to import a CSV file into Cassandra v3.11.3. I'm getting the error below.
... 1 more
[screenshot of the full error message]
The PySpark code I have attached as a picture:
[screenshot: pyspark_code]
Any inputs...
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the py4j wrapper method, and we really would need to see the underlying Java exception.
From what I can tell, it looks like you are also attempting to use some options on the C* write that are unsupported. For example, "MODE" = "DROPMALFORMED" is not a valid C* connector option. DataFrameWriter and DataFrameReader options are source-specific, so you are unfortunately unable to mix and match them.
This makes me think that the data being written actually has a malformed date string or two, and this code is dying when attempting to write the broken record. One way around this would be to do the date casting on the CSV read, which I believe does support DROPMALFORMED-style parsing options.
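A rough sketch of that idea is below. The file path, schema, and keyspace/table names are made up (the original code is only available as a screenshot), and it assumes the spark-cassandra-connector is on the classpath:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DateType, StringType

spark = SparkSession.builder.appName("csv-to-cassandra").getOrCreate()

# Parse the date while reading the CSV; DROPMALFORMED discards rows that do not
# match the declared schema, including rows with unparseable dates.
schema = StructType([
    StructField("id", IntegerType()),
    StructField("event_date", DateType()),
    StructField("payload", StringType()),
])
df = (spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .option("dateFormat", "yyyy-MM-dd")
      .schema(schema)
      .csv("/path/to/input.csv"))

# Write with only options the Cassandra connector understands.
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(table="my_table", keyspace="my_keyspace")
   .mode("append")
   .save())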

Spark (PySpark) File Already Exists Exception

I am trying to save a data frame as a text file; however, I am getting a File Already Exists exception. I tried adding the mode to the code, but to no avail. Furthermore, the file does not actually exist. Would anyone have an idea how I can solve this problem? I am using PySpark.
This is the code:
distFile = sc.textFile("/Users/jeremy/Downloads/sample2.nq")
mapper = distFile.map(lambda q: __q2v(q))
reducer = mapper.reduceByKey(lambda a, b: a + os.linesep + b)
data_frame = reducer.toDF(["context", "triples"])
data_frame.coalesce(1).write.partitionBy("context").text("/Users/jeremy/Desktop/so")
May I add that the exception is being raised after some time and that some data is actually stored in temporary files (which are obviously deleted).
Thanks!
Edit: Exception can be found here: https://gist.github.com/jerdeb/c30f65dc632fb997af289dac4d40c743
You can use overwrite or append mode to replace the file or to add the data to the same file:
data_frame.coalesce(1).write.mode('overwrite').partitionBy("context").text("/Users/jeremy/Desktop/so")
or
data_frame.coalesce(1).write.mode('append').partitionBy("context").text("/Users/jeremy/Desktop/so")
I had the same problem and was able to get around it with this:
outputDir = "/FileStore/tables/my_result/"
dbutils.fs.rm(outputDir , True)
Just change the outputDir variable to whatever directory you are writing to.
You should check your executors and look at the logs of the ones that are failing.
In my case, I had a coalesce(1) on a large DF. Four of my executors failed; three of them had the same error: org.apache.hadoop.fs.FileAlreadyExistsException: File already exists.
However, 1 of them had a different exception: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 262144 bytes of memory, got 148328
I was able to fix it by increasing the executor memory so that the coalesce did not cause an out of memory error.
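For anyone hitting the same memory-related failure, the fix amounts to giving executors more memory before the coalesce. A minimal sketch, with placeholder values to tune for your own cluster:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("coalesce-write")
         .config("spark.executor.memory", "8g")           # larger executor heap
         .config("spark.executor.memoryOverhead", "2g")   # extra off-heap headroom
         .getOrCreate())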

spark connecting to Phoenix NoSuchMethod Exception

I am trying to connect to Phoenix through Spark/Scala to read and write data as a DataFrame. I am following the example on GitHub; however, when I try the very first example, Load as a DataFrame using the Data Source API, I get the exception below.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
There are a couple of things in those examples that are driving me crazy:
1) The import statement import org.apache.phoenix.spark._ gives me the error below in my code:
cannot resolve symbol phoenix
I have included the jars below in my sbt build:
"org.apache.phoenix" % "phoenix-spark" % "4.4.0.2.4.3.0-227" % Provided,
"org.apache.phoenix" % "phoenix-core" % "4.4.0.2.4.3.0-227" % Provided,
2) I get a deprecation warning for the symbol load.
I googled that warning but didn't find any reference, and I was not able to find any example of the suggested method. I am not able to find any other good resource that explains how to connect to Phoenix. Thanks for your time.
Please use .read instead of load, as shown below:
val df = sparkSession.sqlContext.read
.format("org.apache.phoenix.spark")
.option("zkUrl", "localhost:2181")
.option("table", "TABLE1").load()
It's late to answer, but here's what I did to solve a similar problem (a different method-not-found error and the deprecation warning):
1) About the NoSuchMethodError: I took all the jars from the HBase installation lib folder and added them to the project, along with the phoenix-spark jars. Make sure to use compatible versions of Spark and phoenix-spark; Spark 2.0+ is compatible with phoenix-spark 4.10+ (maven-central-link). This resolved the NoSuchMethodError.
2) About load: the load method has long since been deprecated. Use sqlContext.phoenixTableAsDataFrame instead. For reference, see Load as a DataFrame directly using a Configuration object.

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to perform SparkSQL on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though the field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.
From the Spark DataFrame documentation:
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field may be a reserved word; try:
records.filter(records['field_i'] == 3)
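A small illustration of why the bracket form is safer, using hypothetical column names: a column whose name collides with an existing DataFrame attribute, such as count, cannot be reached by attribute access, while indexing still works:
df = sqlContext.createDataFrame([(1, 3), (2, 5)], ["id", "count"])

df.filter(df["count"] == 3)   # works: df["count"] is the Column
# df.filter(df.count == 3)    # breaks: df.count is the count() method, not the Column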
What I did was upgrade Spark from 1.3.0 to 1.4.0 in Cloudera QuickStart CDH-5.4.0, and the second filtering form now works, although I still can't explain why 1.3.0 has problems with it.
