Read CSV file on Spark - apache-spark

I have started working with Spark and have run into a problem.
I tried reading a CSV file using the code below:
df = spark.read.csv("/home/oybek/Serverspace/Serverspace/Athletes.csv")
df.show(5)
Error:
Py4JJavaError: An error occurred while calling o38.csv.
: java.lang.OutOfMemoryError: Java heap space
I am working on Ubuntu Linux in VirtualBox, at ~/Serverspace.

You can try increasing the driver memory when creating the Spark session, like below:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "4g") \
    .appName('read-csv') \
    .getOrCreate()
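With the larger driver heap, the original read should go through. A minimal sketch of retrying it (the header and inferSchema options are assumptions about the file, not part of the original question):
# Retry the read with the new session; options below are assumptions about the CSV.
df = spark.read.csv(
    "/home/oybek/Serverspace/Serverspace/Athletes.csv",
    header=True,        # assumption: the file has a header row
    inferSchema=True,   # let Spark infer column types while reading
)
df.show(5)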

Related

Set up jupyter on EMR to read from cassandra using cql?

When I try to set the spark context in jupyter with
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages datastax:spark-cassandra-connector:2.4.0-s_2.11 --conf spark.cassandra.connection.host=x.x.x.x pyspark-shell'
or
spark = SparkSession.builder \
    .appName('SparkCassandraApp') \
    .config('spark.cassandra.connection.host', 'x.x.x.x') \
    .config('spark.cassandra.connection.port', 'xxxx') \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .master('local[2]') \
    .getOrCreate()
I still cannot make a connection to the Cassandra cluster with the code
dataFrame = spark.read.format("org.apache.spark.sql.cassandra").option("keyspace", "keyspace").option("table", "table").load()
dataFrame = dataFrame.limit(100)
dataFrame.show()
Comes up with error:
An error was encountered:
An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.
Please find packages at http://spark.apache.org/third-party-projects.html
A similar question was asked here: modify jupyter kernel to add cassandra connection in spark,
but I do not see a valid answer there.
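No answer is quoted for this question on this page. One direction consistent with the --packages advice given for the connector question further down is to let Spark resolve the connector from Maven when the session is built; a sketch, assuming the Maven coordinate that corresponds to the spark-packages one above:
from pyspark.sql import SparkSession

# Sketch (assumption, not from the original thread): resolve the connector via
# spark.jars.packages so org.apache.spark.sql.cassandra ends up on the classpath.
spark = SparkSession.builder \
    .appName('SparkCassandraApp') \
    .config('spark.jars.packages',
            'com.datastax.spark:spark-cassandra-connector_2.11:2.4.0') \
    .config('spark.cassandra.connection.host', 'x.x.x.x') \
    .master('local[2]') \
    .getOrCreate()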

Cannot run spark-nlp due to Exception: Java gateway process exited before sending its port number

I have a working PySpark installation running through Jupyter on an Ubuntu VM.
There is only one Java version (openjdk version "1.8.0_265"), and I can run a local Spark (v2.4.4) session like this without problems:
import pyspark
from pyspark.sql import SparkSession
memory_gb = 24
conf = (
    pyspark.SparkConf()
    .setMaster('local[*]')
    .set('spark.driver.memory', '{}g'.format(memory_gb))
)
spark = SparkSession \
    .builder \
    .appName("My Name") \
    .config(conf=conf) \
    .getOrCreate()
Now I want to use spark-nlp. I've installed it with pip install spark-nlp in the same virtual environment my PySpark is in.
However, when I try to use it, I get the error Exception: Java gateway process exited before sending its port number.
I've tried to follow the instructions in the documentation here, but without success.
So doing
spark = SparkSession \
    .builder \
    .appName("RevDNS Stats") \
    .config(conf=conf) \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.5") \
    .getOrCreate()
only results in the error mentioned above.
How do I fix this?
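No answer is recorded for this question on this page. One workaround that is often suggested (an assumption here, not taken from the thread) is to let the library build the session itself, which starts the Java gateway with a matching spark-nlp jar on the classpath:
import sparknlp

# Builds a SparkSession with a spark-nlp package that matches the pip-installed
# version, so no manual spark.jars.packages coordinate is needed.
spark = sparknlp.start()
print(spark.version)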

Spark Cassandra Connector Error: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef

Spark version: 3.0.0
Scala: 2.12
Cassandra: 3.11.4
spark-cassandra-connector_2.12-3.0.0-alpha2.jar
I am not using DSE. Below is my test code to write the dataframe into my Cassandra database.
spark = SparkSession \
    .builder \
    .config("spark.jars", "spark-streaming-kafka-0-10_2.12-3.0.0.jar,spark-sql-kafka-0-10_2.12-3.0.0.jar,kafka-clients-2.5.0.jar,commons-pool2-2.8.0.jar,spark-token-provider-kafka-0-10_2.12-3.0.0.jar,spark-cassandra-connector_2.12-3.0.0-alpha2.jar") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config('spark.cassandra.output.consistency.level', 'ONE') \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
streamingInputDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "192.168.56.1:9092") \
    .option("subscribe", "def") \
    .load()
## Dataset operations
def write_to_cassandra(streaming_df, E):
    streaming_df \
        .write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="a", keyspace="abc") \
        .save()
q1 = sites_flat.writeStream \
    .outputMode('update') \
    .foreachBatch(write_to_cassandra) \
    .start()
q1.awaitTermination()
I am able to do some operations on the dataframe and print it to the console, but I am not able to save to, or even read from, my Cassandra database. The error I am getting is:
File "C:\opt\spark-3.0.0-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o70.load.
: java.lang.NoClassDefFoundError: com/datastax/spark/connector/TableRef
at org.apache.spark.sql.cassandra.DefaultSource$.TableRefAndOptions(DefaultSource.scala:142)
at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:56)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
I have tried another Cassandra connector version (2.5) but get the same error.
Please help!
The problem is that you're using the spark.jars option, which puts only the listed jars onto the classpath. But the TableRef case class is in the spark-cassandra-connector-driver package, which is a dependency of spark-cassandra-connector. To fix this, it's better to start pyspark or spark-submit with --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 (and likewise for the Kafka support) - in that case Spark will fetch all necessary dependencies and put them onto the classpath.
P.S. With the alpha2 release you may get problems fetching some dependencies, such as ffi, groovy, etc. - this is a known bug (mostly in Spark): SPARKC-599, which is already fixed; we'll hopefully get a beta drop very soon.
Update (14.03.2021): It's better to use the assembly version of SCC, which includes all necessary dependencies.
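The same effect can be had from inside a PySpark script by switching from spark.jars to spark.jars.packages, so Spark resolves the connector and its transitive dependencies (including spark-cassandra-connector-driver) from Maven. A sketch reusing the connector coordinate above; the Kafka coordinate is an assumption for the streaming source:
from pyspark.sql import SparkSession

# Sketch: Maven coordinates instead of local jar paths; Spark pulls transitive deps.
spark = SparkSession \
    .builder \
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2,"
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()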
P.P.S. For writing to Cassandra from Spark Structured Streaming, don't use foreachBatch; just use it as a normal data sink:
val query = streamingCountsDF.writeStream
.outputMode(OutputMode.Update)
.format("org.apache.spark.sql.cassandra")
.option("checkpointLocation", "webhdfs://192.168.0.10:5598/checkpoint")
.option("keyspace", "test")
.option("table", "sttest_tweets")
.start()
I ran into the same problem; try this:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.3</version>
</dependency>
Version compatibility is presumed to be the cause.

Unable to load data into Hive using PySpark

I am unable to write data into Hive using PySpark through a Jupyter notebook.
It gives me the error below:
Py4JJavaError: An error occurred while calling o99.saveAsTable.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Note, these steps have already been tried:
copied hdfs-site.xml and core-site.xml to Hive's conf/ directory
removed metastore_db and created it again using the command below
$HIVE_HOME/bin/schematool -initSchema -dbType derby
Did you use spark-submit to run your script?
Also, you should add .enableHiveSupport(), like this:
spark = SparkSession.builder \
    .appName("yourapp") \
    .enableHiveSupport() \
    .getOrCreate()
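With Hive support enabled, saveAsTable goes through the Hive metastore. A minimal usage sketch (the table and column names are placeholders, not from the question):
# Hypothetical example: write a small dataframe as a Hive-managed table and read it back.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").saveAsTable("default.example_table")
spark.sql("SELECT * FROM default.example_table").show()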

Loading data from GCS using Spark Local

I am trying to read data from GCS buckets on my local machine, for testing purposes. I would like to sample some of the data in the cloud.
I have downloaded the GCS Hadoop Connector JAR
and set up the SparkConf as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf() \
    .setMaster("local[8]") \
    .setAppName("Test") \
    .set("spark.jars", "path/gcs-connector-hadoop2-latest.jar") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "path/to/keyfile")
sc = SparkContext(conf=conf)
spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()
spark.read.json("gs://gcs-bucket")
I have also tried to set the conf like so:
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.json.keyfile", "path/to/keyfile")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable", "true")
I am using PySpark installed via pip and running the code using the unittest module from IntelliJ.
py4j.protocol.Py4JJavaError: An error occurred while calling o128.json.
: java.io.IOException: No FileSystem for scheme: gs
What should I do?
Thanks!
To solve this issue, you need to set the fs.gs.impl property in addition to the properties you have already configured:
sc._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
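Equivalently, the same property can go on the SparkConf used above with the spark.hadoop. prefix, which Spark forwards into the Hadoop configuration - a sketch:
# Same setting applied on the SparkConf, before the SparkContext is created.
conf = conf.set("spark.hadoop.fs.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")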
