I am trying to display some streaming data (Twitter feeds) on screen.
This is being done so I can follow better what is going on in Spark (debugging to a certain extent), but I am not getting any output.
Writing to a CSV file works fine for the same query, but nothing comes out on the console.
I am using JupyterLab.
The query is:
tweets_query = tweets \
    .selectExpr("cast(value as string)") \
    .select(f.from_json(f.col("value").cast("string"), schema).alias("tweets")) \
    .select("tweets.id", "tweets.text", "tweets.createdOnDate", "tweets.lang", "tweets.loc")
The part that writes to the screen:
query = tweets_query \
    .writeStream \
    .format("console") \
    .outputMode("append") \
    .option("truncate", "false") \
    .start()
What am I missing?
You are missing the await. Add the following line after you start the query:
sparkSession.streams.awaitAnyTermination()
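Putting it together, a minimal sketch (assuming spark is your active SparkSession and tweets_query is the streaming DataFrame built above):

# start() returns immediately; the streaming query runs in the background.
query = tweets_query \
    .writeStream \
    .format("console") \
    .outputMode("append") \
    .option("truncate", "false") \
    .start()

# Block the driver until a streaming query terminates, so the console sink
# keeps producing micro-batch output.
spark.streams.awaitAnyTermination()
# Or, to wait on just this one query:
# query.awaitTermination()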
I am a beginner in coding. I am currently trying to read a file (which was imported into HDFS using Sqoop) with PySpark. The Spark job is not progressing and my Jupyter PySpark kernel seems stuck. I am not sure whether I imported the file into HDFS correctly, or whether the code I used to read it with Spark is correct.
The Sqoop import command I used is as follows:
sqoop import --connect jdbc:mysql://upgraddetest.cyaielc9bmnf.us-east-1.rds.amazonaws.com/testdatabase --table SRC_ATM_TRANS --username student --password STUDENT123 --target-dir /user/root/Spar_Nord -m 1
The PySpark code I used is:
df = spark.read.csv("/user/root/Spar_Nord/part-m-00000", header = False, inferSchema = True)
Also, please advise how we can tell the file type of what we imported with Sqoop. I just assumed .csv and wrote the PySpark code accordingly.
I'd appreciate a quick reply.
When pulling data into HDFS via Sqoop, the default delimiter is the tab character. Sqoop creates a generic delimited text file based on the parameters passed to the sqoop command. To make the output use a comma delimiter and match a generic CSV format, you should add:
--fields-terminated-by <char>
So your Sqoop command would look like:
sqoop import --connect jdbc:mysql://upgraddetest.cyaielc9bmnf.us-east-1.rds.amazonaws.com/testdatabase --table SRC_ATM_TRANS --username student --password STUDENT123 --fields-terminated-by ',' --target-dir /user/root/Spar_Nord -m 1
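As a hedged follow-up to the file-type question: one way to check what actually landed in HDFS, and then read it with an explicit separator, might look like this (the path is the one from the question; the sep value is whatever delimiter the file really uses):

# Peek at a few raw lines to see which delimiter Sqoop actually wrote.
spark.read.text("/user/root/Spar_Nord/part-m-00000").show(5, truncate=False)

# Then read with an explicit separator instead of relying on the CSV default (",").
df = spark.read.csv("/user/root/Spar_Nord/part-m-00000",
                    sep=",",              # or "\t" if the file turns out to be tab-delimited
                    header=False,
                    inferSchema=True)
df.printSchema()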
I'm running a simple test in EMR on a JSON file that has been compressed with Snappy.
I'm getting this error:
java.lang.InternalError: Could not decompress data. Buffer length is too small.
I'm running:
df = oSpark.session.read.options(mode='FAILFAST',
                                 primitivesAsString='true',
                                 multiLine='true',
                                 compression='snappy',
                                 encoding='UTF-8') \
    .json(file)
df.printSchema()
print(df.head(1))
df.show(truncate=False)
I've tried playing around with:
spark.buffer.size, spark.kryoserializer.buffer.max, and io.file.buffer.size, but I'm not getting any improvement.
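For reference, a sketch of how those configuration keys could be set when building the session (the values here are illustrative assumptions, not a verified fix; io.file.buffer.size is a Hadoop setting, so it is typically passed with the spark.hadoop. prefix):

from pyspark.sql import SparkSession

# Illustrative values only -- tuning these buffers is an assumption, not a known fix.
spark = (SparkSession.builder
         .appName("snappy-json-test")
         .config("spark.buffer.size", "131072")                 # Spark I/O buffer
         .config("spark.kryoserializer.buffer.max", "512m")     # Kryo serializer buffer ceiling
         .config("spark.hadoop.io.file.buffer.size", "131072")  # Hadoop I/O buffer, passed through
         .getOrCreate())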
I have an issue where the Dataflow job actually runs fine, but it does not produce any output until the job is manually drained.
With the following code I was assuming it would produce windowed output, effectively triggering after each window.
lines = (
    p
    | "read" >> source
    | "decode" >> beam.Map(decode_message)
    | "Parse" >> beam.Map(parse_json)
    | beam.WindowInto(
        beam.window.FixedWindows(5 * 60),
        trigger=beam.trigger.Repeatedly(beam.trigger.AfterProcessingTime(5 * 60)),
        accumulation_mode=beam.trigger.AccumulationMode.DISCARDING)
    | "write" >> sink
)
What I want is that if it has received events in a window, it should produce output after the window in any case. The source is a Cloud PubSub, with approximately 100 events/minute.
These are the parameters I use to start the job:
python main.py \
--region $(REGION) \
--service_account_email $(SERVICE_ACCOUNT_EMAIL_TEST) \
--staging_location gs://$(BUCKET_NAME_TEST)/beam_stage/ \
--project $(TEST_PROJECT_ID) \
--inputTopic $(TOPIC_TEST) \
--outputLocation gs://$(BUCKET_NAME_TEST)/beam_output/ \
--streaming \
--runner DataflowRunner \
--temp_location gs://$(BUCKET_NAME_TEST)/beam_temp/ \
--experiments=allow_non_updatable_job \
--disk_size_gb=200 \
--machine_type=n1-standard-2 \
--job_name $(DATAFLOW_JOB_NAME)
Any ideas on how to fix this? I'm using the apache-beam 2.22 SDK with Python 3.7.
Excuse me if you actually meant 2.22, because "apache-beam 1.22" seems too old; especially since you are using Python 3.7, you might want to try a newer SDK version such as 2.22.0.
What I want is that if it has received events in a window, it should produce output after the window in any case. The source is a Cloud PubSub, with approximately 100 events/minute.
If you just need one pane fired per window and fixed windows every 5 mins, you can simply go with
beam.WindowInto(beam.window.FixedWindows(5*60))
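Dropped into the pipeline from the question, that would look roughly like this (a sketch reusing the question's source, sink, decode_message and parse_json; the default trigger fires one pane per window once the watermark passes the end of the window):

lines = (
    p
    | "read" >> source
    | "decode" >> beam.Map(decode_message)
    | "Parse" >> beam.Map(parse_json)
    # Default (AfterWatermark) trigger: one pane per 5-minute window.
    | beam.WindowInto(beam.window.FixedWindows(5 * 60))
    | "write" >> sink
)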
If you want to customize triggers, you can take a look at this document streaming-102.
Here is a streaming example with visualization of windowed outputs.
from datetime import timedelta

import apache_beam as beam
from apache_beam.runners.interactive import interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

# Capture 30 seconds of data from the unbounded source; drop any earlier capture.
ib.options.capture_duration = timedelta(seconds=30)
ib.evict_captured_data()

# `options` (streaming PipelineOptions) and `topic` are defined elsewhere.
pstreaming = beam.Pipeline(InteractiveRunner(), options=options)
words = (pstreaming
         | 'Read' >> beam.io.ReadFromPubSub(topic=topic)
         | 'Window' >> beam.WindowInto(beam.window.FixedWindows(5)))
ib.show(words, visualize_data=True, include_window_info=True)
If you run this code in a notebook environment such as JupyterLab, you get to debug streaming pipelines with outputs like this. Note that the windows are visualized: for a capture period of 30 seconds we get 6 windows, since the fixed window size is set to 5 seconds. You can bin data by window to see which data arrived in which window.
You can set up your own notebook runtime by following the instructions, or you can use the hosted solution provided by Google Dataflow Notebooks.
I have a Spark job which reads some CSV files on S3, processes them, and saves the result as Parquet files. These CSVs contain Japanese text.
When I run the job locally, reading the S3 CSV file and writing Parquet files to a local folder, the Japanese characters look fine.
But when I run it on my Spark cluster, reading the same S3 CSV file and writing Parquet to HDFS, all the Japanese characters are garbled.
Run on the Spark cluster (data is garbled):
spark-submit --master spark://spark-master-stg:7077 \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap= -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=hdfs://nameservice1/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar
Run locally (data looks fine):
spark-submit --master local \
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions="-Ddatabase=dev_mall -Dtable=table_base_TEST -DtimestampColumn=time_stamp -DpartitionColumns= -Dyear=-1 -Dmonth=-1 -DcolRenameMap= -DpartitionByYearMonth=true -DaddSpdbCols=false -DconvertTimeDateCols=true -Ds3AccessKey=xxxxx -Ds3SecretKey=yyyy -Ds3BasePath=s3a://bucket/export/e2e-test -Ds3Endpoint=http://s3.url -DhdfsBasePath=/tmp/encoding-test -DaddSpdbCols=false" \
--name Teradata_export_test_ash \
--class com.mycompany.data.spark.job.TeradataNormalTableJob \
--deploy-mode client \
https://artifactory.maven-it.com/spdb-mvn-release/com.mycompany.data/teradata-spark_2.11/0.1/teradata-spark_2.11-0.1-assembly.jar
As can be seen above, both spark-submit jobs point to the same S3 file; the only difference is that when running on the Spark cluster, the result is written to HDFS.
Reading CSV:
def readTeradataCSV(schema: StructType, path: String): DataFrame = {
  dataFrameReader.option("delimiter", "\u0001")
    .option("header", "false")
    .option("inferSchema", "false")
    .option("multiLine", "true")
    .option("encoding", "UTF-8")
    .option("charset", "UTF-8")
    .schema(schema)
    .csv(path)
}
This is how I write to parquet:
finalDf.write
  .format("parquet")
  .mode(SaveMode.Append)
  .option("path", hdfsTablePath)
  .option("encoding", "UTF-8")
  .option("charset", "UTF-8")
  .partitionBy(parCols: _*)
  .save()
On HDFS, the Japanese characters show up as garbled symbols.
Any tips on how to fix this?
Does the input CSV file have to be in UTF-8 encoding?
**Update**
Found out it's not related to Parquet but rather to CSV loading. Asked a separate question here:
Spark CSV reader: garbled Japanese text and handling multilines
The Parquet format has no option for encoding or charset; cf. https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
Hence these options in your code have no effect:
finalDf.write
.format("parquet")
.option("encoding", "UTF-8")
.option("charset", "UTF-8")
(...)
These options apply only to CSV; you should set them (or rather one of them, since they are synonyms) when reading the source file.
This assumes you are using the Spark DataFrame API to read the CSV; otherwise you are on your own.
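As an illustration of that advice in PySpark (rather than the Scala in the question; schema, path, par_cols and hdfs_table_path stand in for the question's values):

# Charset belongs on the CSV read, not on the Parquet write.
df = (spark.read
      .option("delimiter", "\u0001")
      .option("header", "false")
      .option("multiLine", "true")
      .option("encoding", "UTF-8")   # "charset" is a synonym; set only one of them
      .schema(schema)
      .csv(path))

# No encoding/charset options here -- the Parquet writer ignores them anyway.
df.write.mode("append").partitionBy(*par_cols).parquet(hdfs_table_path)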
I am trying to read CSV files from a specific folder and write the same contents to another CSV file in a different location on the local PC, for learning purposes. I can read the files and show their contents on the console. However, when I write them to another CSV file in the specified output directory, I only get a folder named "_spark_metadata" which contains nothing.
I paste the whole code here step by step.
Creating the Spark session:
spark = SparkSession \
    .builder \
    .appName('csv01') \
    .master('local[*]') \
    .getOrCreate()
spark.conf.set("spark.sql.streaming.checkpointLocation", <String path to checkpoint location directory> )
userSchema = StructType().add("name", "string").add("age", "integer")
Reading from the CSV file:
df = spark \
    .readStream \
    .schema(userSchema) \
    .option("sep", ",") \
    .csv(<String path to local input directory containing CSV file>)
Writing to the CSV file:
df.writeStream \
    .format("csv") \
    .option("path", <String path to local output directory containing CSV file>) \
    .start()
In "String path to local output directory containing CSV file" I only get a folder _spark_metadata which contains no CSV file.
Any help on this is highly appreciated
You don't use readStream to read static data; you use it to read from a directory into which new files are continuously added.
You only need spark.read.csv.
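For example, a minimal batch sketch of that suggestion (paths and schema as in the question; no checkpoint location or streaming machinery required):

# Plain batch read and write -- no readStream/writeStream for static files.
df = (spark.read
      .schema(userSchema)
      .option("sep", ",")
      .csv(<String path to local input directory containing CSV file>))

(df.write
   .mode("overwrite")
   .option("header", "false")
   .csv(<String path to local output directory>))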