Open parquet from GCS using local Pyspark - apache-spark

I have a folder on Google Cloud Storage with several Parquet files. I installed PySpark on my VM and now I want to read the Parquet files. Here's my code:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .config("spark.driver.maxResultSize", "40g") \
    .config('spark.sql.shuffle.partitions', '2001') \
    .config("spark.jars", "~/spark/spark-2.4.4-bin-hadoop2.7/jars/gcs-connector-hadoop2-latest.jar") \
    .getOrCreate()
sc = spark.sparkContext
# using SQLContext to read parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
# to read parquet file
filename = "gs://path/to/parquet"
df = sqlContext.read.parquet(filename)
print(df.head())
When I run it, it gives me the following error:
WARN FileStreamSink: Error while looking for metadata directory.
To install PySpark, I followed this tutorial: https://towardsdatascience.com/how-to-get-started-with-pyspark-1adc142456ec

Have you tried reading from GCS like this and then passing along the data that you read? I do not think that you can read directly with PySpark.
I've been reading about the error, and in some instances it is raised when the file is not reachable or the path is incorrect. I think that might be the case here.
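One path that can be checked locally before Spark starts is the connector JAR passed to spark.jars in the question. The value there begins with ~, which a shell would expand but which, as far as I know, Spark receives literally when it is set from Python. A minimal sketch, assuming the JAR lives on the local filesystem; resolve_local_jar is a hypothetical helper name, not part of PySpark:

```python
import os


def resolve_local_jar(path):
    """Expand '~' and return an absolute path, failing early if the JAR is missing.

    Hypothetical helper for illustration; not a PySpark API.
    """
    expanded = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(expanded):
        raise FileNotFoundError("connector JAR not found: " + expanded)
    return expanded
```

The returned string could then be used as the spark.jars value, for example .config("spark.jars", resolve_local_jar("~/spark/spark-2.4.4-bin-hadoop2.7/jars/gcs-connector-hadoop2-latest.jar")). This only rules out a bad JAR path; it says nothing about whether the gs:// path itself is reachable.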

Related

Connecting to Cassandra on remote client using Spark

I have two PCs: one runs Ubuntu and hosts Cassandra, and the other runs Windows.
I installed the same versions of Java, Spark, Python and Scala on both PCs. My goal is to read data in a Jupyter Notebook, using Spark, from the Cassandra instance on the other PC.
On the PC that hosts Cassandra, I was able to connect to Cassandra with Spark and read data. But when I try to connect to that Cassandra instance from the remote client using Spark, the connection fails with an error.
Representation of the system
Commands run on the Ubuntu PC that hosts Cassandra:
~/spark/bin ./pyspark --master spark://10.0.0.10:7077 --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 --conf spark.driver.extraJavaOptions=-Xss512m --conf spark.executer.extraJavaOptions=-Xss512m
from pyspark.sql.functions import col
hosts = {"spark.cassandra.connection.host":'10.0.0.10,10.0.0.11,10.0.0.12',"table":"table_one","keyspace":"log_keyspace"}
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra").options(**hosts).load()
a = data_frame.filter(col("col_1")<100000).select("col_1","col_2","col_3","col_4","col_5").toPandas()
Running the code above returns the data from Cassandra, and it can be displayed.
Commands used to try to read the data by connecting to Cassandra from the other PC:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ' --master spark://10.0.0.10:7077 --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 --conf spark.driver.extraJavaOptions=-Xss512m --conf spark.executer.extraJavaOptions=-Xss512m spark.cassandra.connection.host=10.0.0.10 pyspark '
import findspark
findspark.init()
findspark.find()
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql import SQLContext
conf = SparkConf().setAppName('example')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
hosts ={"spark.cassandra.connection.host":'10.0.0.10',"table":"table_one","keyspace":"log_keyspace"}
sqlContext = SQLContext(sc)
data_frame = sqlContext.read.format("org.apache.spark.sql.cassandra").options(**hosts).load()
Running the code above raises the error "java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at http://spark.apache.org/third-party-projects.html".
What can I do to fix this error?
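One thing that may be worth checking is the shape of PYSPARK_SUBMIT_ARGS. spark-submit expects each configuration property behind its own --conf flag, and PySpark's own default value for this variable is pyspark-shell, so the string conventionally ends with that token. In the remote-client snippet, spark.cassandra.connection.host=10.0.0.10 has no --conf in front of it and the string ends with pyspark. The sketch below only builds the string; build_submit_args is a hypothetical helper, the extraJavaOptions settings are left out for brevity, and whether this resolves the ClassNotFoundException on this particular setup is an assumption.

```python
import os


def build_submit_args(master, packages, conf):
    """Assemble a PYSPARK_SUBMIT_ARGS string (hypothetical helper, not a PySpark API)."""
    parts = ["--master", master]
    if packages:
        # --packages takes comma-separated Maven coordinates
        parts += ["--packages", ",".join(packages)]
    for key, value in conf.items():
        # every property gets its own --conf flag
        parts += ["--conf", f"{key}={value}"]
    # PySpark's default for this variable is "pyspark-shell"
    parts.append("pyspark-shell")
    return " ".join(parts)


# Values copied from the question.
submit_args = build_submit_args(
    master="spark://10.0.0.10:7077",
    packages=["com.datastax.spark:spark-cassandra-connector_2.12:3.1.0"],
    conf={"spark.cassandra.connection.host": "10.0.0.10"},
)
# Needs to be set before the first SparkContext is created in the process.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
```

Building the string from a list keeps the flags and their values paired, which makes a missing --conf easier to spot than in a hand-written one-liner.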

Read CSV file on Spark

I have started working with Spark and ran into a problem.
I tried reading a CSV file using the code below:
df = spark.read.csv("/home/oybek/Serverspace/Serverspace/Athletes.csv")
df.show(5)
Error:
Py4JJavaError: An error occurred while calling o38.csv.
: java.lang.OutOfMemoryError: Java heap space
I am working on Ubuntu Linux (VirtualBox:~/Serverspace).
You can try increasing the driver memory when creating the Spark session, like below:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "4g") \
    .appName('read-csv') \
    .getOrCreate()

Py4JJavaError: An error occurred while calling o253.load. : java.lang.ClassNotFoundException: Failed to find data source: bigquery

I am trying to read data from BigQuery into a Jupyter notebook with the PySpark libraries. Apache Spark and Java have both been downloaded to my C: drive. I have read and watched tutorials, but none of them seem to work. I am looking for guidance.
Code:
import pyspark
import findspark
from pyspark import SparkContext,SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import (window, col, year, month, aggregate, date_add,
                                   timestamp_seconds, rank, split)
from pyspark.sql.types import (StructField, StructType, StringType, BooleanType, DoubleType,
                               IntegerType, FloatType)
#import com.google.cloud.spark.bigquery
#this creates spark UI - check current spark session
spark = SparkSession.builder.master('local[*]').appName('conversions').enableHiveSupport().getOrCreate()
df = spark.read.format('bigquery').load('table')
df.show()
error:
Py4JJavaError: An error occurred while calling o253.load.
: java.lang.ClassNotFoundException:
Failed to find data source: bigquery. Please find packages at
http://spark.apache.org/third-party-projects.html
Please change the SparkSession creation to:
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('conversions') \
    .enableHiveSupport() \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2') \
    .getOrCreate()
Also, please make sure you are using a Python notebook rather than a PySpark notebook; otherwise Jupyter will create the SparkSession for you and no additional packages can be added.
See more documentation in the connector's repo.
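The spark.jars.packages value is a comma-separated list of Maven coordinates in the form groupId:artifactId:version, and the _2.12 suffix in the artifact name is, by convention, the Scala version the connector was built for, which should match the Scala version of the Spark build. A small sketch for checking such strings before they are handed to Spark; both function names are hypothetical helpers, not part of PySpark or the connector:

```python
def join_maven_coordinates(coordinates):
    """Validate 'groupId:artifactId:version' strings and join them with commas.

    Hypothetical helper; the result is meant as a value for spark.jars.packages.
    """
    for coordinate in coordinates:
        parts = coordinate.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError("expected groupId:artifactId:version, got " + repr(coordinate))
    return ",".join(coordinates)


def scala_suffix(coordinate):
    """Return the text after the last '_' in the artifact name, or None if there is none.

    By convention this is the Scala binary version, e.g. '2.12'.
    """
    artifact = coordinate.split(":")[1]
    _, separator, suffix = artifact.rpartition("_")
    return suffix if separator else None
```

For the coordinate in the answer, scala_suffix gives the string to compare against the Scala version reported by the local Spark installation.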

Loading data from GCS using Spark Local

I am trying to read data from GCS buckets on my local machine, for testing purposes. I would like to sample some of the data in the cloud.
I have downloaded the GCS Hadoop Connector JAR.
And set up the SparkConf as follows:
conf = SparkConf() \
    .setMaster("local[8]") \
    .setAppName("Test") \
    .set("spark.jars", "path/gcs-connector-hadoop2-latest.jar") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "path/to/keyfile")
sc = SparkContext(conf=conf)
spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()
spark.read.json("gs://gcs-bucket")
I have also tried to set the conf like so:
sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.json.keyfile", "path/to/keyfile")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable", "true")
I am using PySpark installed via pip and running the code using the unit test module from IntelliJ. I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o128.json.
: java.io.IOException: No FileSystem for scheme: gs
What should I do?
Thanks!
To solve this issue, you need to set the fs.gs.impl property in addition to the properties that you have already configured:
sc._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
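The same Hadoop properties can, as far as I know, also be supplied when the session is built by prefixing each key with spark.hadoop., which is the form the question already uses for the google.cloud.auth.* keys. A sketch, assuming the connector JAR is already on the classpath; gcs_spark_conf and apply_conf are hypothetical helpers, and "path/to/keyfile" is the placeholder from the question:

```python
def gcs_spark_conf(keyfile):
    """Return Spark properties that forward GCS settings to the Hadoop configuration.

    Hypothetical helper; Spark copies every 'spark.hadoop.<key>' property into
    the Hadoop configuration as '<key>'.
    """
    hadoop_props = {
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        "fs.gs.auth.service.account.enable": "true",
        "fs.gs.auth.service.account.json.keyfile": keyfile,
    }
    return {"spark.hadoop." + key: value for key, value in hadoop_props.items()}


def apply_conf(builder, props):
    """Call builder.config(key, value) for each property and return the builder."""
    for key, value in props.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, apply_conf(SparkSession.builder, gcs_spark_conf("path/to/keyfile")).getOrCreate() would create the session with all four properties in place, so nothing has to be set through sc._jsc afterwards.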

How to run python spark script with specific jars

I have to run a Python script on an EMR instance using PySpark to query DynamoDB. I am able to query DynamoDB from the pyspark shell when it is started with the required jars, using the following command:
`pyspark --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar`
I ran the following Python 3 script to query the data using the pyspark Python module:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
start_time = time.time()
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://nn1:9083")
sparkSession = (SparkSession
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    .enableHiveSupport()
    .getOrCreate())
df_load = sparkSession.sql("SELECT * FROM example")
df_load.show()
print(time.time() - start_time)
This caused the following runtime exception because of the missing jars:
java.lang.ClassNotFoundException Class org.apache.hadoop.hive.dynamodb.DynamoDBSerDe not found
How do I convert the pyspark --jars.. command to a Pythonic equivalent?
So far I have tried copying the jars from /usr/share/... to $SPARK_HOME/libs/jars and adding that path to the external class path in spark-defaults.conf, but that had no effect.
Use the spark-submit command to execute your Python script. Example:
spark-submit --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar script.py
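If the command has to be assembled from Python, for instance by a wrapper script, it can be built as an argument list and passed to subprocess. A minimal sketch using the jar paths and script name from above; spark_submit_command is a hypothetical helper, and it assumes spark-submit is on the PATH:

```python
import shlex


def spark_submit_command(script, jars):
    """Build the argv list for spark-submit (hypothetical helper)."""
    # --jars takes a single comma-separated value
    return ["spark-submit", "--jars", ",".join(jars), script]


jars = [
    "/usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar",
    "/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar",
]
command = spark_submit_command("script.py", jars)
# Print the shell form for inspection.
print(shlex.join(command))
# To launch it from Python: subprocess.run(command, check=True)
```

Passing a list rather than one string avoids shell quoting problems if a jar path ever contains spaces.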
