How to transfer binary file into rdd in spark? - apache-spark

I am trying to load seg-Y type files into spark, and transfer them into rdd for mapreduce operation.
But I failed to transfer them into rdd. Does anyone who can offer help?

You could use binaryRecords() pySpark call to convert binary file's content into an RDD
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords
binaryRecords(path, recordLength)
Load data from a
flat binary file, assuming each record is a set of numbers with the
specified numerical format (see ByteBuffer), and the number of bytes
per record is constant.
Parameters: path – Directory to the input data files recordLength –
The length at which to split the records
Then you could map() that RDD into a structure by using, for example, struct.unpack()
https://docs.python.org/2/library/struct.html
We use this approach to ingest propitiatory fixed-width records binary files. There is a bit of Python code that generates Format string (1st argument to struct.unpack), but if your files layout is static, it's fairly simple to do manually one time.
Similarly is possible to do using pure Scala:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext#binaryRecords(path:String,recordLength:Int,conf:org.apache.hadoop.conf.Configuration):org.apache.spark.rdd.RDD[Array[Byte]]

You've not really given much detail, but you can start with using the SparkContextbinaryFiles() API
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext

Related

Continous appending of data on existing tabular data file (CSV, parquet) using PySpark

For a project I need to append frequently but on a non-periodic way about one thousand or more data files (tabular data) on one existing CSV or parquet file with same schema in Hadoop/HDFS (master=yarn). At the end, I need to be able to do some filtering on the result file - to extract subset of data.
One dummy file may look like this (very simple example):
id,uuid,price
1,16c533c3-c191-470c-97d9-e1e01ccc3080,46159
2,6bb0917b-2414-4b24-85ca-ae2c2713c9a0,50222
3,7b1fa3f9-2db2-4d93-a09d-ca6609cfc834,74591
4,e3a3f874-380f-4c89-8b3e-635296a70d76,91026
5,616dd6e8-5d05-4b07-b8f2-7197b579a058,73425
6,23e77a21-702d-4c87-a69c-b7ace0626616,34874
7,339e9a7f-efb1-4183-ac32-d365e89537bb,63317
8,fee09e5f-6e16-4d4f-abd1-ecedb1b6829c,6642
9,2e344444-35ee-47d9-a06a-5a8bc01d9eab,55931
10,d5cba8d6-f0e1-49c8-88e9-2cd62cde9737,51792
Number of rows may vary between 10 and about 100000
On user request, all input files copied in a source folder should be ingested by an ETL pipeline and appended at the end of one single CSV/parquet file or any other appropriate file format (no DB). Data from a single input file may be spread over one, two or more partitions.
Because the input data files may all have different number of rows, I am concerned about getting partitions with different sizes in the resulting CSV/parquet file. Sometimes all the data may be append in one new file. Sometimes the data is so big that several files are appended.
And because input files may be appended a lot of time from different users and different sources, I am also concerned that the result CSV/parquet may contains too much part-files for the namenode to handle them.
I have done some small test appending data on existing CSV / parquet files and noticed that for each appending, a new file was generated - for example:
df.write.mode('append').csv('/user/applepy/pyspark_partition/uuid.csv')
will append the new data as a new file in the file 'uuid.csv' (which is actually a directory generated by pyspark containing all pieces of appended data).
Doing some load tests based on real conditions, I quickly realized that I was generating A LOT of files (several 10-thousands). At some point I got so much files that PySpark was unable to simple count the number of rows (NameNode memory overflow).
So I wonder how to solve this problem. What would be the best practice here? Read the whole file, append the data chunk, same the data in a new file doesn't seems to be very efficient here.
NameNode memory overflow
Then increase the heapsize of the namenode
quickly realized that I was generating A LOT of files
HDFS write operations almost never append to single files. They append "into a directory", and create new files, yes.
From Spark, you can use coalesce and repartition to create larger writer batches.
As you'd mentioned, you wanted parquet, so write that then. That'll cause you to have even smaller file sizes in HDFS.
or any other appropriate file format (no DB)
HDFS is not really the appropriate tool for this. Clickhouse, Druid, and Pinot are the current real time ingest / ETL tools being used, especially when data is streamed in "non periodically" from Kafka

Overused the capacity memory when trying to process the CSV file when using Pyspark and Python

I dont know which part of the code I should share since what I do is basically as below(I will share a simple code algorithm instead for reference):
Task: I need to search for file A and then match the values in file A with column values in File B(It has more than 100 csv files, with each contained more than 1millions rows in CSV), then after matched, combined the results into a single CSV.
Extract column values for File A and then make it into list of values.
Load File B in pyspark and then use .isin to match with File A list of values.
Concatenate the results into single csv file.
"""
first = pd.read_excel("fileA.xlsx")
list_values = first[first["columnA"].apply(isinstance,args=(int,))]["columnA"].values.tolist()
combine = []
for file in glob.glob("directory/"): #here will loop at least 100 times.
second = spark.read.csv("fileB")
second = second["columnB"].isin(list_values) # More than hundreds thousands rows will be expected to match.
combine.append(second)
total = pd.concat(combine)
Error after 30hours of running time:
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
Is there a way to better perform such task? currently, to complete the process it takes more than 30hours to just run the code but it ended with failure with above error. Something like parallel programming or which I could speed up the process or to clear the above error? ?
Also, when I test it with running only 2 CSV files, it took less than a minute to complete but when I try to loop the whole folder with 100 files, it takes more than 30hours.
There are several things that I think you can try to optimize given that your configuration and resource unchanged:
Repartition when you read your CSV. Didn't study the source code on how spark read the csv, but based on my experience / case in SO, when you use spark to read the csv, all the data will be in single partition, which might cause you the Java OOM error and also it's not fully utilize your resource. Try to check the partitioning of the data and make sure that there is no data skewness before you do any transformation and action.
Rethink on how to do the filtering based on another dataframe column value. From your code, your current approach is to use a python list to collect and store the reference, and then use .isin() to search if the main dataframe column contain value which is in this reference list. If the length of your reference list is very large, the searching operation of EACH ROW to go through the whole reference list is definitely a high cost. Instead, you can try to use the leftsemi .join() operation to achieve the same goal. Even if the dataset is small and you want to prevent the data shuffling, you can use the broadcast to copy your reference dataframe to every single node.
If you can achieve in Spark SQL, don't do it by pandas. In your last step, you're trying to concat all the data after the filtering. In fact, you can achieve the same goal with .unionAll() or .unionByName(). Even you do the pd.concat() in the spark session, all the pandas operation will be done in the driver node but not distributed. Therefore, it might cause Java OOM error and degrade the performance too.

Spark output JSON vs Parquet file size discrepancy

new Spark user here. i wasn't able to find any information about filesize comparison between JSON and parquet output of the same dataFrame via Spark.
testing with a very small data set for now, doing a df.toJSON().collect() and then writing to disk creates a 15kb file. but doing a df.write.parquet creates 105 files at around 1.1kb each. why is the total file size so much larger with parquet in this case than with JSON?
thanks in advance
what you're doing with df.toJSON.collect is you get a single JSON from all your data (15kb in your case) and you save that to disk - this is not something scalable for situations you'd want to use Spark in any way.
For saving parquet you are using spark built-in function and it seems that for some reason you have 105 partitions (probably the result of the manipulation you did) so you get 105 files. Each of these files has the overhead of the file structure and probably stores 0,1 or 2 records. if you want to save a single file you should coalesce(1) before you save (again this just for the toy example you have) so you'd get 1 file. Note that it still might be larger due to the file format overhead (i.e. the overhead might still be larger than the compression benefit)
Conan, it is very hard to answer your question precisely without knowing the nature of the data (you don't even tell amount of row in your DataFrame). But let me speculate.
First. Text files containing JSON usually take more space on disk then parquet. At least when one store millions-billions rows. The reason for that is parquet is highly optimized column based storage format which uses a binary encoding to store your data
Second. I would guess that you have a very small dataframe with 105 partitions (and probably 105 rows). When you store something that small the disk footprint should not bother you but if it does you need to be aware that each parquet file has a relatively sizeable header describing the data you store.

Converting JSON to Parquet in Amazon EMR

I need to achieve the following, and am having difficulty coming up with an approach to accomplish it due to my inexperience with Spark:
Read data from .json.gz files stored in S3.
Each file includes a partial day of Google Analytics data with the schema as specified in https://support.google.com/analytics/answer/3437719?hl=en.
File names are in the pattern ga_sessions_20170101_Part000000000000_TX.json.gz where 20170101 is a YYYYMMDD date specification and 000000000000 is an incremental counter when there are multiple files for a single day (which is usually the case).
An entire day of data is therefore composed of multiple files with incremental "part numbers".
There are generally 3 to 5 files per day.
All fields in the JSON files are stored with qoute (") delimiters, regardless of the data type specified in the aforementioned schema documentation. The data frame which results from reading the files (via sqlContext.read.json) therefore has every field typed as string, even though some are actually integer, boolean, or other data types.
Convert the all-string data frame to a properly typed data frame according to the schema specification.
My goal is to have the data frame typed properly so that when it is saved in Parquet format the data types are correct.
Not all fields in the schema specification are present in every input file, or even every day's worth of input files (the schema may have changed over time). The conversion will therefore need to be dynamic, converting the data types of only the fields actually present in the data frame.
Write the data in the properly typed data frame back to S3 in Parquet format.
The data should be partitioned by day, with each partition stored in a separate folder named "partition_date=YYYYMMDD" where "YYYYMMDD" is the actual date associated with the data (from the original input file names).
I don't think the number of files per day matters. The goal is simply to have partitioned Parquet format data that I can point Spectrum at.
I have been able to read and write the data successfully, but have been unsuccessful with several aspects of the overall task:
I don't know how to approach the problem to ensure that I'm effectively utilizing the AWS EMR cluster to its full potential for parallel/distributed processing, either in reading, converting, or writing the data. I would like to size up the cluster as needed to accomplish the task within whatever time frame I choose (within reason).
I don't know how to best accomplish the data type conversion. Not knowing which fields will or will not be present in any particular batch of input files requires dynamic code to retype the data frame. I also want to make sure this task is distributed effectively and isn't done inefficiently (I'm concerned about creating a new data frame as each field is retyped).
I don't understand how to manage partitioning of the data appropriately.
Any help working through an overall approach would be greatly appreciated!
If your input JSONs have fixed schema you can specify DF schema manually, stating fields as optional. Refer to the official guide.
If you have all values inside "", you can read them as strings and later cast to required type.
I don't know how to approach the problem to ensure that I'm effectively...
Use Dataframe API to read input, most likely defaults will be good for this task. If you hit performance issue, attach Spark Job Timeline.
I don't know how to best accomplish the data type conversion...
Use cast column.cast(DataType) method.
For example, you have 2 JSONs:
{"foo":"firstVal"}{"foo":"val","bar" : "1"}
And you want to read 'foo' as String and bar as integer you can write something like this:
val schema = StructType(
StructField("foo", StringType, true) ::
StructField("bar", StringType, true) :: Nil
)
val df = session.read
.format("json").option("path", s"${yourPath}")
.schema(schema)
.load()
val withCast = df.select('foo, 'bar cast IntegerType)
withCast.show()
withCast.write.format("parquet").save(s"${outputPath}")

Spark transformations and ordering

I am working on parsing different types of files (text,xml,csv etc.) into a specific text file format using spark java API. This output file maintains the order of file header, start tag, data header, data and end tag. All of these element are extracted from input file at some point.
I tried to achieve this in below 2 ways:
Read file to RDD using sparks textFile and perform parsing by using map or mapPartions which returns new RDD.
Read file using sparks textFile , reduce to 1 partition using coalesce and perform parsing by using mapPartions which returns new RDD.
While I am not concerned about sequencing of actual data, with first approach I am not able to keep the required order of File Header, Start Tag, Data Header and End Tag.
The latter works for me, but I know it is not efficient way and may cause problem in case of BIG files.
Is there any efficient way to achieve this?
You are correct in you assumptions. The second choice simply cancels the distributional aspect of your application, so it's not scalable. For the order issue, as the concept is asynchronous, we cannot keep track of order when the data reside in different nodes. What you could do is some preprocessing that would cancel the need for order. Meaning, merge lines up to the point where the line order does not matter and only then distribute your file. Unless you can make assumptions about the file structure, such as number of lines that belong together, I would go with the above.

Resources