I have log files in different servers(5 servers connect through LAN) and I need to process and get the result
Each node has 4TB log files and I'm using HDFS to load all log files into Spark
Every time, when request come, the Spark load all files (5 * 4TB) then query with Spark SQL
What if I load all the log files into caseesndra and then query (it can be preloaded ) ? which is the fast way ..?
HDFS and Cassandra has each their own advantages.
If you need to process all log files entirely, HDFS is a better choice because it is a file system and has been designed to store a massive amount of data and process them by batch.
Now, if you only need to process a portion of the log files, a datastore like Cassandra is a better choice because you can filter data by primary key and have faster access and skip scanning through all the files.
Cassandra is designed for OLTP workload whereas HDFS and the kind are designed for OLAP workload
Related
Is distributed file storage(HDFS/Cassandra/S3 etc.) mandatory for spark to run in clustered mode? if yes, why?
Spark is distributed data processing engine used for computing huge volumes of data. Let's say I have huge volume of data stored in mysql which I want to perform processing on. Spark reads the data from mysql and perform in-memory (or disk) computation on the cluster nodes itself. I am still not able to understand why is distributed file storage needed to run spark in a clustered mode?
is distributed file storage(HDFS/Cassandra/S3 etc.) mandatory for spark to run in clustered mode?
Pretty Much
if yes, why?
Because the spark workers take input from a shared table, distribute the computation amongst themselves, then are choreographed by the spark driver to write their data back to another shared table.
If you are trying to work exclusively with mysql you might be able to use the local filesystem ("file://) as the cluster FS. However, if any RDD or stage in a spark query does try to use a shared filesystem as a way of committing work, the output isn't going to propagate from the workers (which will have written to their local filesystem) and the spark driver (which can only read its local filesystem)
What I have?
I have Spark Streaming Application (on Kafka Streams) on Hadoop Cluster that aggregates each 5 minutes users' clicks and some actions done on a web site and
converts them into metrics.
Also I have a table in GreenPlum (on its own cluster) with users data that may get updated. This table is filled using Logical Log Streaming Replication via Kafka. Table size is 100 mln users.
What I want?
I want to join Spark Streams with static data from GreenPlum every 1 or 5 minutes and then aggregate data already using e.g. user age from static table.
Notes
Definitely, I don't need to read all records from users table. There are rather stable core segment + number of new users registering each minute.
Currently I use PySpark 2.1.0
My solutions
Copy data from GreenPlum cluster to Hadoop cluster and save it as
orc/parquet files. Each 5 minute add new files for new users. Once a
day reload all files.
Create new DB on Hadoop and Setup Log replication via Kafka as it is
done for GreenPlum. Read data from DB and use built in Spark
Streaming joins.
Read data from GreenPlum on Spark in cache. Join stream data with
cache.
For each 5 minute save/append new user data in a file, ignore old
user data. Store extra column e.g. last_action to truncate this
file if a user wasn't active on web site during last 2 weeks. Thus,
join this file with stream.
Questions
What of these solutions are more suitable for MVP? for Production?
Are there any better solutions/best practices for such sorts of
problem. Some literature)
Spark streaming reading data from a cache like Apache geode make this better. used this approach in real-time fraud use case. In a nut shell I have features generated on Greenplum Database using historical data. The feature data and some decision making lookup data is pushed in to geode. Features are periodically refreshed (10 min interval) and then refreshed in geode. Spark scoring streaming job constantly scoring the transactions as the come in w/o reading from Greenplum. Also spark streaming job puts the score in geode, which is synced to Greenplum using different thread. I had spark streaming running on cloud foundry using k8. This is a very high level but should give you an idea.
You might want to check out the GPDB Spark Connector --
http://greenplum-spark-connector.readthedocs.io/en/latest/
https://greenplum-spark.docs.pivotal.io/130/index.html
You can load data directly from the segments into Spark.
Currently, if you want to write back to GPDB, you need to use a standard JDBC to the master.
I have millions of streaming parquet files being written . I want to support running ad hoc interactive queries for debugging and analytics purpose ( added bonus if i can run streaming queries for some real time monitoring of key metrics as well).
What is a scalable solution for supporting this.
The two ways I have observed is running spark sql interactively over millions of parquet files (not too familiar with spark ecosystem but does this mean running a spark job for every sql user submits or do i need to run some streaming job and submit queries somehow) and second being using a presto sql engine on top of parquet (not exactly sure how presto ingests new incoming parquet files).
Any recommendations or pros and cons of either approach . Any better solutions considering i have > ~10Tb data produced every day .
Let me address your use cases :
Support running ad hoc interactive queries for debugging and analytics purpose
I would recommend building a presto cluster if you care about minimizing the latency of your queries and are willing to invest in many machines with a large amount of memory.
Reason: Presto would run fully in-memory without touching disk (in most cases)
A Spark Cluster can also do the job, however, it won't be as fast as Presto. The advantage of Spark over presto is its fault tolerance capabilities and its ability to fail over to disk in case of out of memory conditions which may be important for you given that you have too much data.
Run streaming queries for some real-time monitoring of key metrics as well
As long as you have basic queries, you can build dashboards on top of Presto which could run these queries every x minutes.
Having a considerable amount of processing may be a good reason to look at Spark streaming if real-time monitoring is important.
If it isn't then you could build an ETL (using Spark) for calculating your metrics, storing the data as a new hive table and then expose for querying via Presto/SparkSQL again.
How presto ingests new incoming parquet files?
I'm now aware of your architecture, but in any case, you need to provide Presto with a Hive connection (Hive Metastore to be precise).
Hive provides Presto with few schemas attached to the directories where you ingest your data. Presto dynamically sees the new data by default. Spark is not different by the way.
Presto has nothing to do with data ingestion. It only starts its job once the data is there.
While exploring various tools like [Nifi, Gobblin etc.], I have observed that Databricks is now promoting for using Spark for data ingestion/on-boarding.
We have a spark[scala] based application running on YARN. So far we are working on a hadoop and spark cluster where we manually place required data files in HDFS first and then run our spark jobs later.
Now when we are planning to make our application available for the client we are expecting any type and number of files [mainly csv, jason, xml etc.] from any data source [ftp, sftp, any relational and nosql database] of huge size [ranging from GB to PB].
Keeping this in mind we are looking for options which could be used for data on-boarding and data sanity before pushing data into HDFS.
Options which we are looking for based on priority:
1) Spark for data ingestion and sanity: As our application is written and is running on spark cluster, we are planning to use the same for data ingestion and sanity task as well.
We are bit worried about Spark's support for many datasources/file types/etc. Also, we are not sure if we try to copy data from let's say any FTP/SFTP then will all workers will write data on HDFS in parallel? Is there any limitation while using it? Is there any Audit trail maintained by Spark while this data copy?
2) Nifi in clustered mode: How good Nifi would be for this purpose? Can it be used for any datasource and for any size of file? Will be maintain the Audit trail? Would Nifi we able to handle such large files? How large cluster would be required in case we try to copy GB - PB of data and perform certain sanity on top of that data before pushing it to HDFS?
3) Gobblin in clustered mode: Would like to hear similar answers as that for Nifi?
4) If at all there is any other good option available for this purpose with lesser infra/cost involved and better performance?
Any guidance/pointers/comparisions for above mentioned tools and technologies would be appreciated.
Best Regards,
Bhupesh
After doing certain R&D and considering the fact that using NIFI or goblin will demand for more infrastructure cost. I have started testing Spark for data on-boarding.
SO far I have tried using Spark job for importing data [present at a remote staging area/node] into my HDFS and I am able to do that by mounting that remote location with all my spark cluster worker nodes. Doing this made that location local to those workers, hence spark job ran properly and data is on-boarded to my HDFS.
Since my whole project is going to be on Spark, hence keeping data on-boarding part on spark would not cost anything extra to me. So far I am going good. Hence I would suggest to others as well, if you already have spark cluster and hadoop cluster up and running then instead of adding extra cost [where cost could be a major constraint] go for spark job for data on-boarding.
I am new to Apache Spark, I would like to know is it possible to store data using Apache Spark. Or is it only a processing tool?
Thanks for spending your time,
Satya
Spark is not a database so it cannot "store data". It processes data and stores it temporarily in memory, but that's not presistent storage.
In real life use-case you usually have database, or data repository frome where you access data from spark.
Spark can access data that's in:
SQL Databases (Anything that can be connected using JDBC driver)
Local files
Cloud storage (eg. Amazon S3)
NoSQL databases.
Hadoop File System (HDFS)
and many more...
Detailed description can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#sql
Apache Spark is primarily processing engine. It works with underlying file systems such as HDFS, s3 and other supported file systems. It has capabilities to read the data from relational databases as well. But primarily it is in memory distributed processing tool.
As you can read in Wikipedia, Apache Spark is defined as:
is an open source cluster computing framework
When we refer about computing, it's related to a processing tool, in essence it allows to work as a pipeline scheme (or somehow ETL), you read the dataset, you process the data, and then you store the data processed, or models that describe the data.
If your main objective is to distribute your data, there are some good alternatives like HDFS (Hadoop File System), and others.