How to read multiple gzipped files from S3 into dataframe - apache-spark

My JSON data is gzipped and the files are stored on S3. I want to read that data into a DataFrame, so I tried the streaming approach below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, TimestampType}
import sys.process._
val tSchema = new StructType().add("log_type", StringType)
val tDF = spark.readStream.option("compression","gzip").schema(tSchema).load("s3a://S3_dir/")
tDF.writeStream.outputMode("Append").format("console").start()
This throws the following exception:
s3a://S3_dir/file_name is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-17, 20, 3, 0]
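The "expected magic number" message suggests Spark is reading the files as Parquet, which is the default source when readStream.load() is called without a format. Below is a minimal PySpark sketch of the same read with the format stated explicitly (the equivalent calls exist in the Scala API); the bucket path and schema are the placeholders from the question, so adjust them to your setup:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("read_gzipped_json").getOrCreate()

# Same single-column schema as in the question.
t_schema = StructType().add("log_type", StringType())

# Spark picks the gzip codec from the .gz file extension, so no compression
# option is required; the important part is declaring the json format so the
# source does not default to Parquet.
t_df = (spark.readStream
        .schema(t_schema)
        .format("json")
        .load("s3a://S3_dir/"))  # placeholder path from the question

query = (t_df.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()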

Related

TypeError: 'JavaPackage' object is not callable & Spark Streaming's Kafka libraries not found in class path

I am using PySpark Streaming to read Kafka data, but it fails with the error below:
import os
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
from pyspark import SparkContext
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'
sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
kafkaStream.map(lambda x: x.split(" ")).pprint()
ssc.start()
ssc.awaitTermination()
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.3.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
________________________________________________________________________________________________
Traceback (most recent call last):
File "/home/docs/dp_model/dp_algo_platform/dp_algo_core/test/test.py", line 29, in <module>
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
TypeError: 'JavaPackage' object is not callable
My Spark version is 2.4.3 and my Kafka version is 2.1.0. Replacing os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell' with os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 pyspark-shell' does not work either. How can I fix this?
I think you should reorder your imports so that the environment variable is set before you import and initialize the Spark classes.
You also need to use the same package version as your Spark version:
import os
sparkVersion = '2.4.3' # update this accordingly
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:{} pyspark-shell'.format(sparkVersion)
# import Spark core
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
# import extra packages
from pyspark.streaming.kafka import KafkaUtils
# begin application
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
Note: Kafka 0.8 support is deprecated as of Spark 2.3.0
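With the environment variable set before the imports, the rest of the original snippet should find the Kafka classes. Below is a rough sketch of the remaining steps, continuing from the sc created above and reusing the ZooKeeper address, group id, and topic from the question:
# Continue from the SparkContext created above.
ssc = StreamingContext(sc, 60)  # 60-second batches, as in the question
# Same ZooKeeper quorum, consumer group, and topic map as the original code.
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
# Each record is a (key, message) pair, so split the message rather than the tuple.
kafkaStream.map(lambda kv: kv[1].split(" ")).pprint()
ssc.start()
ssc.awaitTermination()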

pyspark 'SparkSession' object has no attribute '_jssc'

I'm using:
Hadoop 2.6.0-cdh5.14.2
SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101
And I'm getting this error when starting the directStream from KafkaUtils:
File "/home/ale/amazon_fuse_ds/bin/hdp_amazon_fuse_aggreagation.py", line 91, in setupContexts
kafka_stream = KafkaUtils.createDirectStream( self.spark_streaming_context, [ self.kafka_topicin ], kafka_configuration )
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 145, in createDirectStream
AttributeError: 'SparkSession' object has no attribute '_jssc'
I can see that SparkSession has a _jsc attribute, but not _jssc.
The object you are passing is a SparkSession, whereas you need to pass a StreamingContext:
from pyspark.streaming import StreamingContext
ssc = StreamingContext(self.spark_streaming_context.sparkContext, batchDuration)
KafkaUtils.createDirectStream(ssc, ...)
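For completeness, here is a minimal sketch of the whole setup; the app name, batch duration, broker address, and topic are placeholders standing in for the values elided in the question:
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Requires the spark-streaming-kafka-0-8 package on the classpath,
# as discussed in the previous question.
spark = SparkSession.builder.appName("kafka_direct_stream").getOrCreate()

# Build the StreamingContext from the underlying SparkContext,
# not from the SparkSession itself.
batch_duration = 60  # placeholder batch interval in seconds
ssc = StreamingContext(spark.sparkContext, batch_duration)

kafka_configuration = {"metadata.broker.list": "broker-host:9092"}  # placeholder
kafka_stream = KafkaUtils.createDirectStream(ssc, ["topic-in"], kafka_configuration)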

Pyspark Py4JJavaError

I am getting a Py4JJavaError on the sc = pyspark.SparkContext(appName="Pi") line; the full stack trace was attached as a screenshot and is not available here.
import findspark
findspark.init()
import pyspark
import random
sc = pyspark.SparkContext(appName="Pi")
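Without the screenshot the Java-side cause cannot be determined, but a Py4JJavaError always wraps a Java exception whose stack trace can be printed as plain text. A minimal sketch for capturing it, assuming the same findspark setup as above:
import findspark
findspark.init()

import pyspark
from py4j.protocol import Py4JJavaError

try:
    sc = pyspark.SparkContext(appName="Pi")
except Py4JJavaError as e:
    # Print the underlying Java exception so the real error message is visible.
    print(e.java_exception)
    raise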

Spark 2.0.2 PySpark failing to Import collect_list

I have a DataFrame of the form:
+--------------+------------+----+
|             s|variant_hash|call|
+--------------+------------+----+
|C1046::HG02024|    83779208|   0|
|C1046::HG02025|    83779208|   1|
|C1046::HG02026|    83779208|   0|
|C1047::HG00731|    83779208|   0|
|C1047::HG00732|    83779208|   1|
...
I was hoping to leverage collect_list() to transform it into:
+--------------+---------------------------------+
|             s|                   feature_vector|
+--------------+---------------------------------+
|C1046::HG02024|[(83779208, 0), (68471259, 2)...]|
+--------------+---------------------------------+
Where the feature vector column is a list of tuples of the form (variant_hash, call). I was planning on leveraging groupBy and agg(collect_list()) to accomplish this result, but am receiving the following error:
Traceback (most recent call last):
File "/tmp/ba6a891c-529b-4c75-a76f-8ab20f4377ba/ml_on_vds.py", line 43, in <module>
vector_df = svc_df.groupBy('s').agg(func.collect_list(('variant_hash', 'call')))
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 39, in _
File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace:
py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
The below code shows my imports. I didn't think it was necessary to import HiveContext and enableHiveSupport in 2.0.2, but I had hoped doing so would resolve the issue. Sadly, no luck. Does anyone have any recommendations to resolve this import issue?
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext, HiveContext
from pyspark.sql.functions import udf, hash, collect_list
from pyspark.sql.types import *
from hail import *
# Initialize the SparkSession
spark = (SparkSession.builder.appName("PopulationGenomics")
.config("spark.sql.files.openCostInBytes", "1099511627776")
.config("spark.sql.files.maxPartitionBytes", "1099511627776")
.config("spark.hadoop.io.compression.codecs", "org.apache.hadoop.io.compress.DefaultCodec,is.hail.io.compress.BGzipCodec,org.apache.hadoop.io.compress.GzipCodec")
.enableHiveSupport()
.getOrCreate())
I am attempting to run this code on a Google Cloud Dataproc cluster.
The error is thrown at this line:
vector_df = svc_df.groupBy('s').agg(func.collect_list(('variant_hash', 'call')))
You are calling collect_list as func.collect_list, but you are importing the individual functions as:
from pyspark.sql.functions import udf, hash, collect_list
Maybe you meant to import the functions module as func, like:
from pyspark.sql import functions as func
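As a side note, collect_list takes a single column, which is why passing the Python tuple ('variant_hash', 'call') surfaces as "Method collect_list([class java.util.ArrayList]) does not exist". One way to build the (variant_hash, call) pairs is to wrap the two columns in a struct first; a minimal sketch, assuming the svc_df DataFrame from the question:
from pyspark.sql import functions as func

# Combine the two columns into a single struct column, then collect those
# structs into a list per sample.
vector_df = (svc_df
             .groupBy('s')
             .agg(func.collect_list(func.struct('variant_hash', 'call'))
                  .alias('feature_vector')))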

jupyter notebook NameError: name 'sc' is not defined

I am using PySpark from a Jupyter notebook, and my first command was:
rdd = sc.parallelize([2, 3, 4])
It then showed:
NameError Traceback (most recent call last)
<ipython-input-1-c540c4a1d203> in <module>()
----> 1 rdd = sc.parallelize([2, 3, 4])
NameError: name 'sc' is not defined.
How can I fix this 'sc' is not defined error?
Have you initialized the SparkContext?
You could try this:
# Initialize PySpark
from pyspark import SparkContext, SparkConf
# Spark config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)
Try this:
import findspark
findspark.init()
import pyspark  # only run after findspark.init()
from pyspark import SparkContext, SparkConf
# Spark config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)
myrdd = sc.parallelize([('roze', 60), ('Mary', 80), ('stella', 34)])
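Alternatively, in current PySpark the usual entry point is a SparkSession rather than a bare SparkContext; a minimal sketch of the same idea using that API:
from pyspark.sql import SparkSession

# Create (or reuse) a session; the underlying SparkContext is exposed as .sparkContext.
spark = SparkSession.builder.appName("sample_app").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([2, 3, 4])
print(rdd.collect())  # [2, 3, 4]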
