Library to process .rrd (round robin database) files using Spark

I have a large amount of time series data in .rrd (round robin database) format stored in S3. I am planning to use Apache Spark to run analysis on it and derive various performance metrics.
Currently I download each .rrd file from S3 and process it with the rrd4j library. I want to analyse longer periods, a year or more, which involves processing hundreds of thousands of .rrd files. I want the Spark nodes to fetch the files directly from S3 and run the analysis.
How can I make Spark use rrd4j to read the .rrd files? Is there a library that helps with that?
Is there any support in Spark for processing this kind of data?

The Spark part is rather easy: use binaryFiles on the SparkContext (see docs), which returns each file as a (path, content) pair. wholeTextFiles has the same shape but decodes the content as text, so it is not a good fit for a binary format like .rrd. According to the documentation, rrd4j usually wants a path to construct an RRD, but with the RrdByteArrayBackend you could load the data in there. That might be a problem, though, because most of that API is protected. You'll have to figure out a way to load an Array[Byte] into rrd4j.
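
As a rough illustration of the Spark side only, here is a minimal PySpark sketch. It assumes a hypothetical per-file function, `summarize_rrd`, standing in for whatever RRD parsing you end up with; it only reports the size of each file, since getting the bytes into rrd4j is the open part. The bucket path in the usage comment is made up. In Scala or Java the shape is the same, except that binaryFiles yields a PortableDataStream whose toArray() gives the Array[Byte].

```python
from typing import List, Tuple


def summarize_rrd(record: Tuple[str, bytes]) -> Tuple[str, int]:
    """Hypothetical placeholder for real RRD parsing: returns (path, size in bytes)."""
    path, content = record
    return path, len(content)


def summarize_all(spark_context, glob_path: str) -> List[Tuple[str, int]]:
    """Read every matching file as (path, bytes) on the executors and summarize it."""
    # binaryFiles keeps each file whole, so one file is handled by one task.
    return spark_context.binaryFiles(glob_path).map(summarize_rrd).collect()


# Usage (assumes an existing SparkContext `sc` and a made-up bucket name):
# summarize_all(sc, "s3a://my-bucket/rrd/*.rrd")
```

Because each file arrives whole, this pattern suits many small files better than a few very large ones.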

Related

Dilemma about Spark partitions

I am working on a project where I have to read S3 files (each about 3 MB zipped) using boto3. I have a small PySpark script that runs every hour to process the file and generate two types of output data, which are written back to S3. The script uses the 'xmltodict' Python library to read some static data into a dictionary object needed for file processing. I have a small Amazon EMR cluster (v5.28) running with 1 Master and 1 Core node. This might be excessive but is not my main concern right now.
Questions:
1. How do I know IF I should partition the data? I have read articles on how many partitions to create, etc., but couldn't find anything on IF and WHEN. What criteria drive partitioning: the number of rows or columns, the data types, the actions taken in the script, etc.? I read the source file into an RDD, convert it to a DataFrame and perform various operations such as adding columns, grouping and counting. How does Spark handle partitioning behind the scenes?
2. Currently, I manually execute the pyspark script as follows:
spark-submit --master spark://x.x.x.x:7077 --deploy-mode client test.py
on the master node, as I have decided to stick with the Standalone CM. 'xmltodict' is installed on this node but not on the Core node. It doesn't seem like it needs to be installed, or even Python 3 configured, on the Core node, since I am not seeing any errors. Is that correct, and can somebody shed some light on this? I tried to install the Python libraries via a shell script as a bootstrap action when I created the cluster, but it failed and, quite frankly, after trying a few times I gave up.
3. Regarding partitioning, I am slightly confused about whether to use coalesce() or collect(). Again, the question is when to use them and when not to.
Sorry for so many questions. Now that I have the PySpark script written, I am trying to work on its efficiency.
Thanks
Partitioning is the mechanism by which data is divided into suitably sized chunks; Spark then runs one task per chunk, each processing its own piece of the data. This is the core of Spark's parallelism, and without it there is little benefit in using Spark (or any big data processing framework). Most file formats are splittable, and some, such as Avro, Parquet and ORC, remain splittable when compressed. Others, such as zip and gzip, are not splittable once compressed. Based on the size of the files being processed and whether they can be split, Spark automatically creates multiple partitions and processes them in parallel. In your case the data is zipped, so one file becomes one partition and no more than one CPU core can work on it at a time. If the zip is small that is fine, but if it is big, processing will be slow.
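
To make the splittable versus non-splittable distinction concrete, here is a deliberately simplified Python model of how many partitions a single file might produce. `estimate_partitions` is a hypothetical helper, not a Spark API. It assumes the DataFrame file reader with the default spark.sql.files.maxPartitionBytes of 128 MB and ignores the other settings Spark takes into account, such as file open cost and the number of available cores.

```python
import math

# Default value of spark.sql.files.maxPartitionBytes (128 MB).
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_partitions(file_size_bytes: int, splittable: bool,
                        max_partition_bytes: int = DEFAULT_MAX_PARTITION_BYTES) -> int:
    """Rough partition count for one file; a simplified model, not Spark's exact rule."""
    if file_size_bytes <= 0:
        return 0
    if not splittable:
        # A gzip or zip file has to be read from start to finish by a single task.
        return 1
    return math.ceil(file_size_bytes / max_partition_bytes)


one_gib = 1024 ** 3
print(estimate_partitions(one_gib, splittable=True))    # uncompressed or splittable format
print(estimate_partitions(one_gib, splittable=False))   # same size, gzip-compressed
```

On a real DataFrame, `df.rdd.getNumPartitions()` reports the actual number of partitions.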

Writing a Spark DataFrame to a database (Exasol) using JDBC is slow

I am reading from AWS (S3) and writing into a database (Exasol), and it takes too much time. Even setting batchsize does not affect performance.
Writing 6.18 million rows (around 3.5 GB) takes 17 minutes.
The job runs in cluster mode on a 20-node cluster.
How can I make it faster?
Dataset<Row> ds = session.read().parquet(s3Path);
ds.write().format("jdbc").option("user", username).option("password", password).option("driver", Conf.DRIVER).option("url", dbURL).option("dbtable", exasolTableName).option("batchsize", 50000).mode(SaveMode.Append).save();
OK, it's an interesting question.
I have not checked the implementation details of the recently released Spark connector, but you may go with some of the previously existing methods:
Save the Spark job results as CSV files in Hadoop, then run a standard parallel IMPORT from all created files via WebHDFS HTTP calls.
The official UDF script is capable of importing directly from Parquet, as far as I know.
You may implement your own Java UDF script to read Parquet in the way you want. For example, this is how it works for ORC files.
Generally speaking, the best way to achieve real performance is to bypass Spark altogether.
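
For reference, the figures quoted in the question work out to a fairly low rate. The short Python calculation below uses only those numbers (6.18 million rows, about 3.5 GB, 17 minutes, 20 nodes). The even per-node split is an assumption made purely for illustration.

```python
ROWS = 6_180_000
SIZE_GB = 3.5
ELAPSED_SECONDS = 17 * 60
NODES = 20

rows_per_second = ROWS / ELAPSED_SECONDS
megabytes_per_second = SIZE_GB * 1024 / ELAPSED_SECONDS
# Assumes the work is spread evenly across all nodes.
rows_per_second_per_node = rows_per_second / NODES

print(f"{rows_per_second:.0f} rows/s, {megabytes_per_second:.2f} MB/s, "
      f"{rows_per_second_per_node:.0f} rows/s per node")
```

That is roughly 6,000 rows per second overall, or about 300 per node, which gives a baseline for judging any of the alternatives above.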

How to use Flink and Spark together, with Spark used just for transformation?

Let's say there is a collection "goods" in MongoDB like this:
{name:"A",attr:["location":"us"],"eventTime":"2018-01-01"}
{name:"B",attr:["brand":"nike"],"eventTime":"2018-01-01"}
In the past, I used Spark to flatten it and save it to Hive:
goodsDF.select($"name",explode($"attribute"))
But now we need to handle incremental data.
For example, a new item appears as the third line the next day:
{name:"A",attr:["location":"us"],"eventTime":"2018-01-01"}
{name:"B",attr:["brand":"nike"],"eventTime":"2018-01-01"}
{name:"C",attr:["location":"uk"],"eventTime":"2018-02-01"}
Some of our team think Flink is better at streaming, because Flink offers event-driven applications, streaming pipelines and batch, while Spark is just micro-batch.
So we switched to Flink, but a lot of code has already been written for Spark, for example the "explode" above. So my question is:
Is it possible to use Flink to fetch from the source and save to the sink, but use Spark in the middle to transform the dataset?
If that is not possible, how about saving to a temporary sink, say some JSON files, and then having Spark read the files, transform them and save to Hive? But I am afraid this makes no sense, because for Spark it is also incremental data. Using Flink and then Spark is the same as using Spark Structured Streaming directly.
No. Apache Spark code cannot be used in Flink without changes. They are two different processing frameworks, and their APIs and syntax differ from each other. The choice of framework should really be driven by the use case and not by generic statements like "Flink is better than Spark". A framework may work great for one use case and perform poorly in another. By the way, Spark is not just micro-batch: it has batch, streaming, graph, ML and other components. Since the complete use case is not described in the question, it is hard to suggest which one is better for this scenario. But if your use case can afford sub-second latency, I would not spend time moving to another framework.
Also, if things are dynamic and the processing framework may change in the future, it would be better to use something like Apache Beam, which provides an abstraction over most of the processing engines. Using the Apache Beam processing APIs gives you the flexibility to change the underlying engine at any time. You can read more about Beam here: https://beam.apache.org/.
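
Since the "explode" step is the concrete piece of Spark code the question worries about, it may help to see how little logic it contains. The plain-Python sketch below mimics what Spark's explode does to an array field: one output row per element, with rows whose array is empty or missing dropped. The `explode_rows` helper and the dict-based records are assumptions for illustration, with attr written as a list of small dicts. It is neither Flink nor Spark code.

```python
from typing import Any, Dict, Iterable, Iterator


def explode_rows(rows: Iterable[Dict[str, Any]], array_field: str) -> Iterator[Dict[str, Any]]:
    """Yield one copy of each row per element of `array_field`, like Spark's explode."""
    for row in rows:
        # Rows with a missing, null or empty array produce no output rows.
        for element in row.get(array_field) or []:
            flat = dict(row)
            flat[array_field] = element
            yield flat


goods = [
    {"name": "A", "attr": [{"location": "us"}], "eventTime": "2018-01-01"},
    # The second attribute of "B" is invented here to show a multi-element array.
    {"name": "B", "attr": [{"brand": "nike"}, {"color": "red"}], "eventTime": "2018-01-01"},
    {"name": "C", "attr": [], "eventTime": "2018-02-01"},
]

flattened = list(explode_rows(goods, "attr"))
for row in flattened:
    print(row)
```

A transformation this small is usually cheaper to rewrite in the target framework than to bridge across two engines.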

Is there any benefit in my case to using Hive as a data warehouse?

Currently, I am trying to adopt big data technology to replace my current data analysis platform. My current platform is pretty simple: my system receives a lot of structured CSV feed files from various upstream systems, and we load them as Java objects (i.e. in memory) for aggregation.
I am looking to use Spark to replace my Java object layer for the aggregation process.
I understand that Spark supports loading files from HDFS or the local filesystem, so Hive as a data warehouse does not seem to be a must. However, I could still load my CSV files into Hive first and then use Spark to load the data from Hive.
My question is: in my situation, what are the benefits of introducing a Hive layer rather than loading the CSV files directly into a Spark DataFrame?
Thanks.
You can always browse and inspect the data through the tables.
Ad hoc queries and aggregations can be performed using HiveQL.
When accessing that data through Spark, you do not need to specify the schema separately.

Cassandra Loading Options

I have deployed a 9-node DataStax cluster in Google Cloud. I am new to Cassandra and not sure how people generally push data into Cassandra.
My requirement is to read data from flat files and RDBMS tables and load it into Cassandra, which is deployed in Google Cloud.
These are the options I see.
1. Use Spark and Kafka
2. SSTables
3. COPY command
4. Java batch
5. Dataflow (Google product)
Are there any other options, and which one is best?
Thanks,
For flat files, the two most effective options are:
Use Spark: it will load data in parallel, but requires some coding.
Use DSBulk for batch loading of data from the command line. It supports loading from CSV and JSON and is very efficient. The DataStax Academy blog has just started a series of posts on DSBulk, and the first post gives you enough information to get started. Also, if you have big files, consider splitting them into smaller ones, as that allows DSBulk to load them in parallel using all available threads.
For loading data from an RDBMS, it depends on what you want to do: load the data once, or keep it updated as it changes in the DB. For the first option you can use Spark with a JDBC source (though it has some limitations too) and then save the data into DSE. For the second, you may need something like Debezium, which supports streaming change data from some databases into Kafka. From Kafka you can then use the DataStax Kafka Connector to submit the data into DSE.
CQLSH's COPY command isn't as effective or flexible as DSBulk, so I wouldn't recommend using it.
And never use CQL batches for data loading unless you know how they work. They are very different from batches in the RDBMS world, and if used incorrectly they will make loading less efficient than executing separate statements asynchronously. (DSBulk uses batches under the hood, but that's a different story.)
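
The advice to split big files can be scripted. The Python sketch below is one possible way to do it and is not part of DSBulk: `split_csv` is a hypothetical helper that cuts a CSV file with a header row into chunks of a fixed number of rows and repeats the header in every chunk, so that each chunk can be loaded on its own.

```python
import csv
from pathlib import Path
from typing import List


def split_csv(source: Path, out_dir: Path, rows_per_chunk: int) -> List[Path]:
    """Split `source` into CSV chunks of at most `rows_per_chunk` data rows each."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    out = None
    writer = None
    with source.open(newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return chunks
        try:
            for index, row in enumerate(reader):
                if index % rows_per_chunk == 0:
                    # Close the previous chunk and start a new one with the header.
                    if out is not None:
                        out.close()
                    path = out_dir / f"{source.stem}_{len(chunks):04d}.csv"
                    out = path.open("w", newline="")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    chunks.append(path)
                writer.writerow(row)
        finally:
            if out is not None:
                out.close()
    return chunks
```

Using the csv module rather than counting raw lines keeps quoted fields that contain newlines inside a single chunk.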
