PySpark jobs on Dataproc using documents from Firestore - apache-spark

I need to run some simple PySpark jobs on Big Data stored in Google's Firestore.
The dataset contains 42 million documents describing Instagram posts. I want to do some simple aggregations, like summing the number of likes per country (location).
However, I am new to Big Data processing and have no idea how to import the data into the Dataproc cluster to do the processing.
Should I export all the data into a GCS bucket and then load it onto the VMs in the cluster?
Or should I connect the VMs to Firestore when I need to do the processing?
Also, since Spark distributes the data into RDDs, is it possible to split (parallelize) the data directly from Firestore so that each worker gets a chunk, without having to load all 42 million documents first and then do the split?
The data should be around 15 GB, so I must consider which is the cheaper option as well.
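For reference, if you go the GCS route, a common pattern is to export the documents yourself as newline-delimited JSON into a bucket and read that with Spark on Dataproc. Below is a minimal sketch of the likes-per-country aggregation; the bucket path and the likes/country field names are assumptions for illustration, not part of the question:

# Sketch: aggregate likes per country from a hypothetical JSON export in GCS
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("likes-per-country").getOrCreate()

# Hypothetical export location and field names -- adjust to the real export layout
posts = spark.read.json("gs://my-bucket/instagram_posts/*.json")

likes_per_country = (posts
    .groupBy("country")
    .agg(F.sum("likes").alias("total_likes")))

likes_per_country.write.csv("gs://my-bucket/output/likes_per_country", header=True)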

Related

Spark Streaming join with GreenPlum/Postgres Data. Approach

What I have
I have a Spark Streaming application (reading from Kafka) on a Hadoop cluster that aggregates, every 5 minutes, users' clicks and some actions done on a web site and converts them into metrics.
Also, I have a table in GreenPlum (on its own cluster) with user data that may get updated. This table is filled using Logical Log Streaming Replication via Kafka. The table holds 100 million users.
What I want
I want to join the Spark streams with the static data from GreenPlum every 1 or 5 minutes and then aggregate the data using, e.g., user age from the static table.
Notes
I definitely don't need to read all records from the users table. There is a rather stable core segment plus a number of new users registering each minute.
Currently I use PySpark 2.1.0.
My solutions
1. Copy data from the GreenPlum cluster to the Hadoop cluster and save it as ORC/Parquet files. Every 5 minutes add new files for new users. Once a day reload all files.
2. Create a new DB on Hadoop and set up log replication via Kafka, as is done for GreenPlum. Read data from that DB and use the built-in Spark Streaming joins.
3. Read data from GreenPlum into a Spark cache and join the stream data with the cache (see the sketch after this list).
4. Every 5 minutes save/append new user data to a file and ignore old user data. Store an extra column, e.g. last_action, to truncate this file if a user wasn't active on the web site during the last 2 weeks. Then join this file with the stream.
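A minimal sketch of option 3, assuming GreenPlum is reachable over plain JDBC and that periodically re-reading and re-caching the lookup table is acceptable; the host, database, table, credentials, and column names below are placeholders, not recommendations:

# Sketch: load the GreenPlum users table over JDBC, cache it, and join it with each micro-batch
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

# Hypothetical connection details -- replace with your own (needs the PostgreSQL JDBC driver)
users = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://greenplum-master:5432/analytics")
    .option("dbtable", "public.users")
    .option("user", "spark")
    .option("password", "secret")
    .load()
    .cache())

# Called from foreachRDD after converting the micro-batch RDD to a DataFrame of click metrics
def process_batch(clicks_df):
    enriched = clicks_df.join(users, on="user_id", how="left")
    enriched.groupBy("age").count().show()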
Questions
Which of these solutions is more suitable for an MVP? For production?
Are there any better solutions/best practices for this sort of problem? Any literature?
Spark Streaming reading data from a cache like Apache Geode makes this better. I used this approach in a real-time fraud use case. In a nutshell, I had features generated on Greenplum Database using historical data. The feature data and some decision-making lookup data were pushed into Geode. The features were periodically recomputed (at a 10 minute interval) and then refreshed in Geode. The Spark Streaming scoring job constantly scores the transactions as they come in, without reading from Greenplum. The Spark Streaming job also puts the scores into Geode, which is synced back to Greenplum by a different thread. I had Spark Streaming running on Cloud Foundry using Kubernetes. This is very high level, but it should give you an idea.
You might want to check out the GPDB Spark Connector --
http://greenplum-spark-connector.readthedocs.io/en/latest/
https://greenplum-spark.docs.pivotal.io/130/index.html
With it you can load data directly from the segments into Spark.
Currently, if you want to write back to GPDB, you need to use a standard JDBC connection to the master.
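A minimal sketch of that JDBC write-back to the GPDB master; the host, database, table, credentials, and the example DataFrame are placeholder assumptions:

# Sketch: write an aggregated DataFrame back to GreenPlum via plain JDBC on the master
# (requires the PostgreSQL JDBC driver on the Spark classpath)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gpdb-writeback").getOrCreate()
aggregated_df = spark.createDataFrame([(25, 1200), (34, 860)], ["age", "clicks"])  # stand-in result

(aggregated_df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://gpdb-master:5432/analytics")  # hypothetical master host/db
    .option("dbtable", "public.user_metrics")                       # hypothetical target table
    .option("user", "spark")
    .option("password", "secret")
    .mode("append")
    .save())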

What should be the Hadoop configuration for analyzing 100 GB of CSV files in Spark

I have around 100 GB of data in CSV format on which I intend to do some transformations like aggregation and data splitting, and after that do some clustering using the ML package of Apache Spark.
I have tried uploading the data to MySQL and automating the process in Python, but it's taking too much time to build any solution.
What configuration do I need to set up, and how should I get started with Spark?
I am new to Spark. I am planning to use cloud services.
I'm going to recommend you learn to use Spark locally with a small subset of the data; you can run it standalone with a few tens to hundreds of MB. It's limited, but you can learn the tooling without paying. Your first Spark DataFrame query could be sampling the source data and saving it into a more efficient query format.
CSV isn't a great format for big data; Spark likes Parquet (and, for 2.3+, ORC). Embrace them for better performance.
Play with "notebooks"; Apache Zeppelin is one you can install and run locally.
Like I say, learn to play with small amounts of data. Spark is very interactive, and working with small datasets is an easy way to learn fast.
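A minimal local sketch of that first query, sampling the CSV and persisting the sample as Parquet; the file paths and the sample fraction are assumptions for illustration:

# Sketch: sample a large CSV locally and store the sample in a more efficient format
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local[*]")            # run Spark standalone on the local machine
    .appName("csv-sampling")
    .getOrCreate())

raw = spark.read.csv("data/raw/events.csv", header=True, inferSchema=True)  # hypothetical path
sample = raw.sample(fraction=0.01, seed=42)                                 # roughly a 1% sample
sample.write.mode("overwrite").parquet("data/sample/events_parquet")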
There are many ways to do that, but it depends on your case. As far as I know, HDFS with the default configuration (without any specific tuning) works fine. The majority of Hadoop tuning guides focus on the YARN side. So, let me lay out a plan like the one below:
Generally speaking, you can put your (raw) data in HDFS, load it in Apache Spark, and save it as Parquet/ORC, like below:
from pyspark.sql.types import StructType, StructField, StringType

# Explicit schema avoids inferring types over 100 GB of CSV
myschema = StructType([StructField("FirstName", StringType(), True), StructField("LastName", StringType(), True)])

# Read the raw CSV from HDFS, then count to materialize it
mydf = spark.read.format("com.databricks.spark.csv").option("header", "true").schema(myschema).option("delimiter", ",").load("hdfs://hadoopmaster:9000/user/hduser/mydata.csv")
mydf.count()

# Repartition, save as Parquet, and read it back for comparison
mydf.repartition(6).write.format("parquet").save("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf = spark.read.parquet("hdfs://hadoopmaster:9000/user/hduser/DataInParquet")
newdf.count()
Finally, compare mydf.count() with newdf.count(). The Parquet read will run faster than the raw CSV format. In addition, your data size will decrease from 100 GB to roughly 24 GB.
If you are new to Hadoop and Spark and interested in setting up a Hadoop environment in the cloud, I would suggest you go with Elastic MapReduce (EMR) powered by AWS. You can create an on-demand Spark cluster with a user-defined configuration to process a wide range of data sets.
https://aws.amazon.com/emr/
https://aws.amazon.com/emr/details/spark/
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-launch.html
Or
You can set up a Hadoop cluster on top of EC2 instances, or on any cloud platform, with the required number of nodes and sufficient RAM and CPU. Storage-optimized instances are preferred here for analyzing a large data set.
We do not need to bother about storage cost: for storage-optimized instances, AWS offers free ephemeral storage data disks of 1 - 2 TB, depending on instance size.
Note: Data in the ephemeral storage will be lost when the VM is rebooted. We can persist the processed data in S3 at the cheapest cost.
When it comes to cluster configuration, here is a list of things to check (a config sketch follows the Spark Config Reference link below):
Spark on YARN is preferred.
Set the minimum and maximum cores and memory in the YARN NodeManager container settings for your Spark executors.
Enable dynamic allocation in Spark.
Set the container size to the maximum and the Spark memory fraction to the maximum to avoid repeated shuffling, frequent spilling, and cached-data eviction.
Use Kryo serialization to get high performance.
Enable compression for map outputs before shuffling.
Enable the Spark web UI to track your application's tasks and stages.
Apache Spark Config Reference: https://spark.apache.org/docs/2.1.0/configuration.html
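A minimal sketch of how some of these settings might be expressed when building the SparkSession; the sizing values are illustrative assumptions, not tuned recommendations:

# Sketch: expressing the listed settings in a SparkSession builder (illustrative values)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("csv-analysis-on-yarn")
    .config("spark.dynamicAllocation.enabled", "true")                         # dynamic executor allocation
    .config("spark.shuffle.service.enabled", "true")                           # required for dynamic allocation on YARN
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  # Kryo serialization
    .config("spark.shuffle.compress", "true")                                  # compress map outputs before shuffling
    .config("spark.executor.memory", "8g")                                     # illustrative executor sizing
    .config("spark.executor.cores", "4")
    .getOrCreate())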

Spark Streaming: in-memory aggregation - correct usage

I have a Spark 2.2 Structured Streaming flow from an on-premise system into a containerized cloud Spark cluster, where Kafka receives the data and Spark Structured Streaming (SSS) maintains a number of queries that flush to disk every ten seconds. A query's console sink is not accessible to external sessions outside the streaming context (hence the CSV flush); the monitoring dashboard runs Spark SQL from another context to get metrics.
Right now I am only aggregating the data that has come in since streaming was last started. Now I need to aggregate all historical data together with the incoming streaming data to provide (near-) realtime views. This will mean running a bunch of GROUP BYs on billions of records, maintaining several million aggregate rows in memory.
My question is about how Spark streaming queries can scale like this: how efficient is the memory usage (I'll probably use 32 worker containers), and is this the correct way to manage a (near-) realtime view of incoming data using Kafka and SSS?
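For reference, a minimal sketch of the kind of query being described: a Kafka-sourced Structured Streaming aggregation triggered every ten seconds. The broker, topic, JSON fields, and the in-memory sink are assumptions for illustration, not the asker's actual setup:

# Sketch: running aggregate over a Kafka stream with a 10 second trigger
# (requires the spark-sql-kafka package on the classpath)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sss-aggregation").getOrCreate()

events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(F.get_json_object("json", "$.country").alias("country"),
            F.get_json_object("json", "$.clicks").cast("long").alias("clicks")))

query = (events.groupBy("country").agg(F.sum("clicks").alias("total_clicks"))
    .writeStream
    .outputMode("complete")                 # running aggregate over all data seen so far
    .format("memory")                       # queryable as a table from this session
    .queryName("clicks_by_country")
    .trigger(processingTime="10 seconds")
    .start())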

Kafka to Spark batch processing

I'm looking for an optimal data architecture.
I'm dealing with time series data that is flushed from a Redis database to an OpenTSDB database each week.
OpenTSDB stores its data in HBase, which runs on a Hadoop cluster.
Then, the time series data available in OpenTSDB has to be batch processed (at a 1-6 month interval).
Knowing that OpenTSDB data is stored in binary large object format on HBase, I can't currently tackle the HBase HTTP API.
Since Spark cannot directly access the OpenTSDB API (while Kafka seems to be okay with an HTTP API)... I'm facing architecture issues that can be expressed as follows: would it be more convenient to:
Use Apache Kafka to extract the batch data (terabytes) and use it as a pipeline to ingest the data and analyze it as Spark DataFrames?
Flush the Redis data directly into HBase and then use Spark directly on it?
That said, I want to be sure that Spark can handle terabyte-scale batch analytics and that Kafka can handle that volume before loading it as a Spark RDD.
Any suggestions or help will be welcome. Thanks.
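For what it's worth, a minimal sketch of the Spark side of the first option, reading a bounded slice of a Kafka topic as a batch DataFrame rather than a stream; the broker, topic, offsets, and output path are placeholder assumptions:

# Sketch: batch-read a Kafka topic into a DataFrame and persist it for analysis
# (requires the spark-sql-kafka package on the classpath)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-batch-ingest").getOrCreate()

batch = (spark.read.format("kafka")                       # batch (non-streaming) Kafka source
    .option("kafka.bootstrap.servers", "kafka:9092")      # hypothetical broker
    .option("subscribe", "opentsdb-export")               # hypothetical topic carrying the TS data
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load())

values = batch.selectExpr("CAST(value AS STRING) AS payload")
values.write.mode("overwrite").parquet("hdfs:///data/tsdb_batch_parquet")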

Spark for realtime OLAP queries

I have IoT data streaming in via Kafka, and I would like to use Spark SQL to analyze it. I was planning on persisting the data to S3 using Sector, but there will be a delay of a few minutes while each batch of data gets collected before being written to S3.
How can I make Spark query both the streaming data and the historical data on S3? Do I run two queries, one with Spark Streaming and one with Spark SQL, and try to combine the results?
Or do I need to use an OLTP database for this type of functionality? I wanted to scale compute and storage independently, which is why I went with Spark + S3.
