PySpark rdd of pandas data frames - apache-spark

I'm extracting information of different source files. Each source file corresponds to a given snapshot time of some measurement data. I have a preprocessing function that takes one of these files and outputs a pandas data frame. So I did a spark sc.wholeTextFiles call, which gave me a list of all input files, and then I called map on it, which provided me with an rdd where each element is a pandas data frame. What would now be the best approach to "reshape" this structure such that I have only one resulting data frame consisting of the concatenated smaller data frames?

You can create spark dataframe. Assuming that these files situated in one location and are delimted you can use spark to create a new dataframe having data from all files.
spark.read.option("header", "true").csv("../location/*")
After that you can use lot of transformations available in spark. They are a lot similar to pandas, and works on bigdata and even faster than RDD.

Related

Extract and analyze data from JSON - Hadoop vs Spark

I'm trying to learn the whole open source big data stack, and I've started with HDFS, Hadoop MapReduce and Spark. I'm more or less limited with MapReduce and Spark (SQL?) for "ETL", HDFS for storage, and no other limitation for other things.
I have a situation like this:
My Data Sources
Data Source 1 (DS1): Lots of data - totaling to around 1TB. I have IDs (let's call them ID1) inside each row - used as a key. Format: 1000s of JSON files.
Data Source 2 (DS2): Additional "metadata" for data source 1. I have IDs (let's call them ID2) inside each row - used as a key. Format: Single TXT file
Data Source 3 (DS3): Mapping between Data Source 1 and 2. Only pairs of ID1, ID2 in CSV files.
My workspace
I currently have a VM with enough data space, about 128GB of RAM and 16 CPUs to handle my problem (the whole project is a research for, not a production-use-thing). I have CentOS 7 and Cloudera 6.x installed. Currently, I'm using HDFS, MapReduce and Spark.
The task
I need only some attributes (ID and a few strings) from Data Source 1. My guess is that it comes to less than 10% in data size.
I need to connect ID1s from DS3 (pairs: ID1, ID2) to IDs in DS1 and ID2s from DS3 (pairs: ID1, ID2) to IDs in DS2.
I need to add attributes from DS2 (using "mapping" from the previous bullet) to my extracted attributes from DS1
I need to make some "queries", like:
Find the most used words by years
Find the most common words, used by a certain author
Find the most common words, used by a certain author, on a yearly basi
etc.
I need to visualize data (i.e. wordclouds, histograms, etc.) at the end.
My questions:
Which tool to use to extract data from JSON files the most efficient way? MapReduce or Spark (SQL?)?
I have arrays inside JSON. I know the explode function in Spark can transpose my data. But what is the best way to go here? Is it the best way to
extract IDs from DS1 and put exploded data next to them, and write them to new files? Or is it better to combine everything? How to achieve this - Hadoop, Spark?
My current idea was to create something like this:
Extract attributes needed (except arrays) from DS1 with Spark and write them to CSV files.
Extract attributes needed (exploded arrays only + IDs) from DS1 with Spark and write them to CSV files - each exploded attribute to own file(s).
This means I have extracted all the data I need, and I can easily connect them with only one ID. I then wanted to make queries for specific questions and run MapReduce jobs.
The question: Is this a good idea? If not, what can I do better? Should I insert data into a database? If yes, which one?
Thanks in advance!
Thanks for asking!! Being a BigData developer for last 1.5 years and having experience with both MR and Spark, I think I may guide you to the correct direction.
The final goals which you want to achieve can be obtained using both MapReduce and Spark. For visualization purpose you can use Apache Zeppelin, which can run on top of your final data.
Spark jobs are memory expensive jobs, i.e, the whole computation for spark jobs run on memory, i.e, RAM. Only the final result is written to the HDFS. On the other hand, MapReduce uses less amount of memory and used HDFS for writing intermittent stage results, thus making more I/O operations and more time consuming.
You can use Spark's Dataframe feature. You can directly load data to Dataframe from a structured data (it can be plaintext file also) which will help you to get the required data in a tabular format. You can write the Dataframe to a plaintext file, or you can store to a hive table from where you can visualize data. On the other hand, using MapReduce you will have to first store in Hive table, then write hive operations to manipulate data, and store final data to another hive table. Writing native MapReduce jobs can be very hectic so I would suggest to refrain from choosing that option.
At the end, I would suggest to use Spark as processing engine (128GB and 16 cores is enough for spark) to get your final result as soon as possible.

How to convert multiple parquet files into TFrecord files using SPARK?

I would like to produce stratified TFrecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in SPARK, but this apparently does not work together with a write.partitionBy() operation. Therefore, I have not found another way than to try to work in two steps:
Repartion the dataframe according to my condition, using partitionBy() and write the resulting partitions to parquet files.
Read those parquet files to convert them into TFrecord files with the tensorflow-connector plugin.
It is the second step that I'm unable to do efficiently. My idea was to read in the individual parquet files on the executors and immediately write them into TFrecord files. But this needs access to the SQLContext which can only be done in the Driver (discussed here) so not in parallel. I would like to do something like this:
# List all parquet files to be converted
import glob, os
files = glob.glob('/path/*.parquet'))
sc = SparkSession.builder.getOrCreate()
sc.parallelize(files, 2).foreach(lambda parquetFile: convert_parquet_to_tfrecord(parquetFile))
Could I construct the function convert_parquet_to_tfrecord that would be able to do this on the executors?
I've also tried just using the wildcard when reading all the parquet files:
SQLContext(sc).read.parquet('/path/*.parquet')
This indeed reads all parquet files, but unfortunately not into individual partitions. It appears that the original structure gets lost, so it doesn't help me if I want the exact contents of the individual parquet files converted into TFrecord files.
Any other suggestions?
Try spark-tfrecord.
Spark-TFRecord is a tool similar to spark-tensorflow-connector but it does partitionBy. The following example shows how to partition a dataset.
import org.apache.spark.sql.SaveMode
// create a dataframe
val df = Seq((8, "bat"),(8, "abc"), (1, "xyz"), (2, "aaa")).toDF("number", "word")
val tf_output_dir = "/tmp/tfrecord-test"
// dump the tfrecords to files.
df.repartition(3, col("number")).write.mode(SaveMode.Overwrite).partitionBy("number").format("tfrecord").option("recordType", "Example").save(tf_output_dir)
More information can be found at
Github repo:
https://github.com/linkedin/spark-tfrecord
If I understood your question correctly, you want to write the partitions locally on the workers' disk.
If that is the case then I would recommend looking at spark-tensorflow-connector's instructions on how to do so.
This is the code that you are looking for (as stated in the documentation linked above):
myDataFrame.write.format("tfrecords").option("writeLocality", "local").save("/path")
On a side note, if you are worried about efficiency why are you using pyspark? It would be better to use scala instead.

How can I accumulate Dataframes in Spark Streaming?

I know Spark Streaming produces batches of RDDs, but I'd like to accumulate one big Dataframe that updates with each batch (by appending new dataframe to the end).
Is there a way to access all historical Stream data like this?
I've seen mapWithState() but I haven't seen it accumulate Dataframes specifically.
While Dataframes are implemented as batches of RDDs under the hood, a Dataframe is presented to the application as an non-discrete infinite stream of rows. There are no "batches of dataframes" as there are "batches of RDDs".
It's not clear what historical data you would like.

Read from Python to Spark in a distributed way [duplicate]

I am learning Spark now, and it seems to be the big data solution for Pandas Dataframe, but I have this question which makes me unsure.
Currently I am storing Pandas dataframes that are larger than memory using HDF5. HDF5 is a great tool which allows me to do chunking on the pandas dataframe. So when I need to do processing on large Pandas dataframe, I will do it in chunks. But Pandas does not support distributed processing and HDF5 is only for a single PC environment.
Using Spark dataframe may be solution, but my understanding of Spark is the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing.
Is my understanding correct? If this is the case, then how does Spark handle a dataframe that is larger than the memory? Does it support chunking, like HDF5?
the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing.
This is true only if you're trying to load your data on a driver and then parallelize. In a typical scenario you store data in a format which can be read in parallel. It means your data:
has to be accessible on each worker, for example using distributed file system
file format has to support splitting (the simplest examples is plain old csv)
In situation like this each worker reads only its own part of the dataset without any need to store data in a driver memory. All logic related to computing splits is handled transparently by the applicable Hadoop Input Format.
Regarding HDF5 files you have two options:
read data in chunks on a driver, build Spark DataFrame from each chunk, and union results. This is inefficient but easy to implement
distribute HDF5 file / files and read data directly on workers. This generally speaking harder to implement and requires a smart data distribution strategy

Spark - Saving data to Parquet file in case of dynamic schema

I have a JavaPairRDD of the following typing:
Tuple2<String, Iterable<Tuple2<String, Iterable<Tuple2<String, String>>>>>
that denotes the following object:
(Table_name, Iterable(Tuple_ID, Iterable(Column_name, Column_value)))
This means each record in the RDD will create one Parquet file.
The idea is, as you may have guessed, to save each object as a new Parquet table called Table_name. In this table, there is one column called ID that stores the value Tuple_ID, and each column Column_name stores the value Column_value.
The challenge I'm facing is that the table's columns (the schema) are collected on the fly on runtime, AND, as it is not possible to create nested RDDs in Spark, I can't create an RDD within the previous RDD (for each record) and save it finally to a Parquet file --after converting it to a DataFrame of course.
And I can't just convert the previous RDD to a DataFrame, for the obvious reason (need to iterate to get column/value).
As a temporarily workaround, I flattened the RDD into a list of the same typing as the RDD using collect(), but this is not the proper way as the data could be larger than the available disk space on the driver machine, causing an out of memory.
Any advice on how to achieve this? please let me know if the question is not clear enough.
Take a look at answer for this [question][1]
[1]: Writing RDD partitions to individual parquet files in its own directory. I used this answer to create separate (one or more) parquet file for each partition. This technique I believe you can use the same to create separate file each with different schema if you like.

Resources