Converting JSON to Parquet in Amazon EMR - apache-spark

I need to achieve the following, and am having difficulty coming up with an approach to accomplish it due to my inexperience with Spark:
Read data from .json.gz files stored in S3.
Each file includes a partial day of Google Analytics data with the schema as specified in https://support.google.com/analytics/answer/3437719?hl=en.
File names are in the pattern ga_sessions_20170101_Part000000000000_TX.json.gz where 20170101 is a YYYYMMDD date specification and 000000000000 is an incremental counter when there are multiple files for a single day (which is usually the case).
An entire day of data is therefore composed of multiple files with incremental "part numbers".
There are generally 3 to 5 files per day.
All fields in the JSON files are stored with quote (") delimiters, regardless of the data type specified in the aforementioned schema documentation. The data frame which results from reading the files (via sqlContext.read.json) therefore has every field typed as string, even though some are actually integer, boolean, or other data types.
Convert the all-string data frame to a properly typed data frame according to the schema specification.
My goal is to have the data frame typed properly so that when it is saved in Parquet format the data types are correct.
Not all fields in the schema specification are present in every input file, or even every day's worth of input files (the schema may have changed over time). The conversion will therefore need to be dynamic, converting the data types of only the fields actually present in the data frame.
Write the data in the properly typed data frame back to S3 in Parquet format.
The data should be partitioned by day, with each partition stored in a separate folder named "partition_date=YYYYMMDD" where "YYYYMMDD" is the actual date associated with the data (from the original input file names).
I don't think the number of files per day matters. The goal is simply to have partitioned Parquet format data that I can point Spectrum at.
I have been able to read and write the data successfully, but have been unsuccessful with several aspects of the overall task:
I don't know how to approach the problem to ensure that I'm effectively utilizing the AWS EMR cluster to its full potential for parallel/distributed processing, either in reading, converting, or writing the data. I would like to size up the cluster as needed to accomplish the task within whatever time frame I choose (within reason).
I don't know how to best accomplish the data type conversion. Not knowing which fields will or will not be present in any particular batch of input files requires dynamic code to retype the data frame. I also want to make sure this task is distributed effectively and isn't done inefficiently (I'm concerned about creating a new data frame as each field is retyped).
I don't understand how to manage partitioning of the data appropriately.
Any help working through an overall approach would be greatly appreciated!

If your input JSONs have a fixed schema, you can specify the DataFrame schema manually, declaring the fields as optional (nullable). Refer to the official guide.
Since all values are wrapped in quotes ("), you can read them as strings and cast them to the required types later.
I don't know how to approach the problem to ensure that I'm effectively...
Use the DataFrame API to read the input; the defaults will most likely be fine for this task. If you hit a performance issue, attach the Spark job timeline.
I don't know how to best accomplish the data type conversion...
Use the column.cast(DataType) method.
For example, suppose you have two JSON records:
{"foo":"firstVal"}{"foo":"val","bar" : "1"}
If you want to read foo as a string and bar as an integer, you can write something like this:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import session.implicits._  // needed for the 'foo symbol-to-column syntax

// Read both fields as strings first; cast afterwards.
val schema = StructType(
  StructField("foo", StringType, true) ::
  StructField("bar", StringType, true) :: Nil
)

val df = session.read
  .format("json").option("path", s"${yourPath}")
  .schema(schema)
  .load()

// Cast 'bar' to the integer type it should actually have.
val withCast = df.select('foo, 'bar cast IntegerType)
withCast.show()
withCast.write.format("parquet").save(s"${outputPath}")
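To cover the dynamic-typing and partitioning requirements from the question, here is a minimal sketch of one possible approach. It assumes a hypothetical targetTypes map built from the Google Analytics schema documentation, that df is the all-string DataFrame read from the .json.gz files, and that a partition_date column has already been added from the input file names; the field names shown are only illustrative:

import org.apache.spark.sql.types.{DataType, IntegerType, LongType}
import org.apache.spark.sql.functions.col

// Hypothetical mapping from known top-level field names to their target types.
val targetTypes: Map[String, DataType] = Map(
  "visitNumber"    -> LongType,
  "visitStartTime" -> LongType,
  "newVisits"      -> IntegerType
)

// Cast only the columns that are actually present in this batch of files;
// anything not listed keeps its original string type.
val typed = df.columns.foldLeft(df) { (acc, name) =>
  targetTypes.get(name)
    .map(t => acc.withColumn(name, col(name).cast(t)))
    .getOrElse(acc)
}

// One folder per day: .../partition_date=YYYYMMDD/
typed.write
  .partitionBy("partition_date")
  .mode("append")
  .parquet(s"${outputPath}")

The repeated withColumn calls only add projections to the lazy logical plan; Catalyst collapses them into a single projection at execution time, so the fold does not materialize a new dataset per retyped field.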

Related

How can I extract information from parquet files with Spark/PySpark?

I have to read in N parquet files, sort all the data by a particular column, and then write out the sorted data in N parquet files. While I'm processing this data, I also have to produce an index that will later be used to optimize the access to the data in these files. The index will also be written as a parquet file.
For the sake of example, let's say that the data represents grocery store transactions and we want to create an index by product to transaction so that we can quickly know which transactions have cottage cheese, for example, without having to scan all N parquet files.
I'm pretty sure I know how to do the first part, but I'm struggling with how to extract and tally the data for the index while reading in the N parquet files.
For the moment, I'm using PySpark locally on my box, but this solution will eventually run on AWS, probably in AWS Glue.
Any suggestions on how to create the index would be greatly appreciated.
This is already built into Spark SQL. In SQL, use "distribute by", or in PySpark use partitionBy before writing, and it will group the data as you wish on your behalf. Even if you don't use a partitioning strategy, Parquet has predicate pushdown that does lower-level filtering. (Actually, if you are using AWS, you likely don't want to use partitioning and should stick with large files that use predicate pushdown, specifically because S3 scanning of directories is slow and should be avoided.)
Basically, great idea, but this is already in place.
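As a hedged illustration of both options (Scala for consistency with the other snippets in this thread; the transactions DataFrame, column names, and S3 paths are made up):

import org.apache.spark.sql.functions.col

// Option 1: partition on the lookup key so each product lands in its own folder.
transactions.write
  .partitionBy("product_id")
  .parquet("s3://bucket/transactions_by_product/")

// Option 2: keep large unpartitioned files and rely on Parquet predicate pushdown;
// the filter below is pushed into the Parquet reader and skips non-matching row groups.
val cottageCheese = spark.read
  .parquet("s3://bucket/transactions/")
  .filter(col("product_name") === "cottage cheese")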

Best practice for inferring schema from a CSV file in a raw ingestion layer of a data lake?

Is there a best practice for inferring schema in a raw ingestion layer of a data lake (not schema validation, just infer data types and column names)?
I am using Azure and want to design a way to validate the schema downstream from the ingestion layer, so therefore want a way to infer it from a CSV in order to do the validation.
So far I have tried reading a CSV with integers using Azure Data Factory and writing to AVRO (because of the schema in the header), and it stored everything as strings. I also tried scanning the files (CSV and AVRO) with Purview, but it still returned all strings.
File format: NAICS Company Number, NAICS Company Name, a column for each state (with a value of 1 or 0)
I think the obvious answer may be to use Spark (Databricks) but I want to make sure I go with a simple / necessary rationale for needing to suggest this.
Edit: We need to do this dynamically as we will be running it daily and for a pipeline that ingests many csvs (not just one file).
I am not sure if I understand correctly, but you could do something like this, which will result in a StructType that could be used to validate your file.
val df = spark.read.format("csv")
.option("header","true")
.option("inferSchema","true")
.load("/FileStore/tables/retail-data/by-day/2010_12_01.csv")
val scheme = df.schema
results in:
scheme: org.apache.spark.sql.types.StructType = StructType(StructField(InvoiceNo,StringType,true), StructField(StockCode,StringType,true), StructField(Description,StringType,true), StructField(Quantity,IntegerType,true), StructField(InvoiceDate,StringType,true), StructField(UnitPrice,DoubleType,true), StructField(CustomerID,DoubleType,true), StructField(Country,StringType,true))
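A possible follow-up, sketched under the assumption that the inferred schema of a known-good reference file is the contract you want later files to satisfy (the second file path is illustrative):

// Keep the inferred StructType from the reference file...
val expectedSchema = scheme

// ...then infer again for each new daily file and compare.
val newDf = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/retail-data/by-day/2010_12_02.csv")

if (newDf.schema != expectedSchema) {
  println(s"Schema drift detected.\nExpected: $expectedSchema\nActual: ${newDf.schema}")
}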

Is there a way to denormalize semi structured data from a hadoop InputSource to a csv file via Apache Spark?

Let's say I have an ordering dataset that looks like the following:
{
  "orderId": "349b494a-cd3b-41b9-96be-3e37372da84b",
  "cart": {
    "319ec5de-8c77-45bd-bf46-f87623c898e8": 1
  }
},
{
  "orderId": "f8e2de64-47ef-4446-bbd0-e55463fbeaa3",
  "cart": {
    "319ec5de-8c77-45bd-bf46-f87623c898e8": 1,
    "86bb7e54-3f21-42b6-adbb-0ef29507d4b4": 2,
    "a5605396-ae19-4a81-b79f-dec69d8fc2af": 1
  }
},
I don't know all the items in the catalog (set of items that could be in the cart) up front
The value in the cart "map" is the purchase quantity.
How do I transform this dataset into a .csv file that looks like this:
"orderId","cart.319ec5de-8c77-45bd-bf46-f87623c898e8","cart.86bb7e54-3f21-42b6-adbb-0ef29507d4b4","cart.a5605396-ae19-4a81-b79f-dec69d8fc2af"
"349b494a-cd3b-41b9-96be-3e37372da84b",1,,
"f8e2de64-47ef-4446-bbd0-e55463fbeaa3",1,2,1
In reality:
This is a proprietary file format. I have written the parser and have wrapped it with a Hadoop InputFormat. Before this, I tried transforming the proprietary file format to JSON, but exporting it as a CSV file in Hadoop gave me a "CSV data source does not support struct" exception. Ideally I would not use JSON as an intermediate state, for space reasons.
This is not ordering data, but I thought this example would make more sense. The only thing that is different is that the quantity field is a string, not a number.
I've written a java program (not Spark) that takes two passes across the data set. The first pass figures out the "cart" headers and builds a static string to column index mapper. The second pass writes the values to a csv file using the proper column offset. Unfortunately this is all single threaded, and I wanted to leverage Spark's distributed parallel processing. I also wanted to use Spark SQL on the result.
Can someone point me to some documentation how to denormalize semi-structured data in Spark?
Thanks
EDIT: The JSON example was only to show the structure of the data. My data is in a proprietary format and I would prefer not to store an intermediate JSON version as the dataset is multi-terabyte (expensive).
I was thinking I could:
frame the data using a Hadoop InputFormat into a Text structure <- this works
denormalize the Text structure <- I can't figure out how to do this
export .csv file <- this works when I use fixed schema
I'm struggling to figure out how to denormalize the dataset as I don't know all the columns I want to add before I process the data.
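For what it's worth, here is a hedged sketch of one way Spark can do this without knowing the columns up front, assuming the records framed by the custom InputFormat can first be parsed into a DataFrame named orders with an orderId string column and a cart map column (names taken from the JSON example above; outputPath is a placeholder):

import org.apache.spark.sql.functions.{col, concat, explode, first, lit}

// Flatten the map into one row per (orderId, itemId, quantity).
val exploded = orders
  .select(col("orderId"), explode(col("cart")))   // explode on a map yields key/value columns
  .select(
    col("orderId"),
    concat(lit("cart."), col("key")).as("itemId"),
    col("value").as("quantity")
  )

// Pivot the item ids into columns; Spark discovers the distinct keys itself,
// so the full catalog does not need to be known in advance.
val wide = exploded
  .groupBy("orderId")
  .pivot("itemId")
  .agg(first("quantity"))

wide.write.option("header", "true").csv(outputPath)

Note that pivot without an explicit value list runs an extra job to collect the distinct keys, and a very wide catalog produces a very wide schema, so for a multi-terabyte input it may be worth computing the key list once and passing it to pivot explicitly.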

Question about using parquet for time-series data

I'm exploring ways to store a high volume of data from sensors (time series data), in a way that's scalable and cost-effective.
Currently, I'm writing a CSV file for each sensor, partitioned by date, so my filesystem hierarchy looks like this:
client_id/sensor_id/year/month/day.csv
My goal is to be able to perform SQL queries on this data (typically fetching time ranges for a specific client/sensor, performing aggregations, etc.). I've tried loading it into Postgres and TimescaleDB, but the volume is just too large and the queries are unreasonably slow.
I am now experimenting with using Spark and Parquet files to perform these queries, but I have some questions I haven't been able to answer from my research on this topic, namely:
I am converting this data to parquet files, so I now have something like this:
client_id/sensor_id/year/month/day.parquet
But my concern is that when Spark loads the top folder containing the many Parquet files, the metadata for the rowgroup information is not as optimized as if I used one single parquet file containing all the data, partitioned by client/sensor/year/month/day. Is this true? Or is it the same to have many parquet files or a single partitioned Parquet file? I know that internally the parquet file is stored in a folder hierarchy like the one I am using, but I'm not clear on how that affects the metadata for the file.
The reason I am not able to do this is that I am continuously receiving new data, and from my understanding, I cannot append to a Parquet file due to the way the footer metadata works. Is this correct? Right now, I simply convert the previous day's data to Parquet and create a new file for each sensor of each client.
Thank you.
You can use Structured Streaming with Kafka (as you are already using it) for real-time processing of your data and store the data in Parquet format. And yes, you can append data to Parquet files; use SaveMode.Append for that, such as:
df.write.mode('append').parquet(path)
You can even partition your data on an hourly basis:
client/sensor/year/month/day/hour, which will give you a further performance improvement when querying.
You can create the hour partition based on the system time or on a timestamp column, depending on the type of query you want to run on your data.
You can use watermarking to handle late records if you choose to partition based on a timestamp column.
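A minimal sketch of that partitioned append, in Scala for consistency with the other examples in this thread; the readings DataFrame, the event_time column, and the S3 path are assumptions:

import org.apache.spark.sql.functions.{col, year, month, dayofmonth, hour}

// Derive the partition columns from the reading's timestamp.
val partitioned = readings
  .withColumn("year", year(col("event_time")))
  .withColumn("month", month(col("event_time")))
  .withColumn("day", dayofmonth(col("event_time")))
  .withColumn("hour", hour(col("event_time")))

// Append the new batch into the layout
// .../client_id=.../sensor_id=.../year=.../month=.../day=.../hour=...
partitioned.write
  .mode("append")
  .partitionBy("client_id", "sensor_id", "year", "month", "day", "hour")
  .parquet("s3://bucket/sensor-data/")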
Hope this helps!
I can share my experience and the technology stack being used at AppsFlyer.
We have a lot of data, about 70 billion events per day.
Our time-series data for near-real-time analytics is stored in Druid and ClickHouse. ClickHouse is used to hold real-time data for the last two days; Druid (0.9) wasn't able to manage it. Druid holds the rest of our data, which is populated daily via Hadoop.
Druid is the right candidate if you don't need row-level data but pre-aggregated data, on a daily or hourly basis.
I would suggest you give ClickHouse a chance; it lacks documentation and examples but is robust and fast.
Also, you might take a look at Apache Hudi.

Spark: How to collect a large amount of data without running out of memory

I have the following issue:
I do a sql query over a set of parquet files on HDFS and then I do a collect in order to get the result.
The problem is that when there are many rows I get an out of memory error.
This query requires shuffling so I can not do the query on each file.
One solution could be to iterate over the values of a column and save the result on disk:
df = sql('original query goes here')
// data = collect(df) <- out of memory
createOrReplaceTempView(df, 't')
for each c in cities:
    x = collect(sql("select * from t where city = c"))
    append x to file
As far as I know it will result in the program taking too much time because the query will be executed for each city.
What is the best way of doing this?
If it is running out of memory, that means the output data is really very large, so
you can write the results to a file instead, for example a Parquet file.
If you want to perform further operations on this collected data, you can read it back from that file.
For large datasets we should not use collect(); instead, you may use take(100) or take(some_integer) to check that some values are correct.
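A small sketch of that suggestion (Scala here for consistency with the rest of this thread; the paths are placeholders):

// Let the executors write the full result set straight to storage
// instead of pulling it onto the driver with collect().
val result = spark.sql("original query goes here")
result.write.mode("overwrite").parquet("hdfs:///tmp/query_result/")

// Bring only a small sample back to the driver for a sanity check.
result.take(100).foreach(println)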
As #cricket_007 said, I would not collect() your data from Spark to append it to a file in R.
Additionally, it doesn't make sense to iterate over a list of SparkR::distinct() cities and then select everything from those tables just to append them to some output dataset. The only time you would want to do that is if you are trying to do another operation within each group based upon some sort of conditional logic or apply an operation to each group using a function that is NOT available in SparkR.
I think you are trying to get a data frame (either Spark or R) with observations grouped in a way so that when you look at them, everything is pretty. To do that, add a GROUP BY city clause to your first SQL query. From there, just write the data back out to HDFS or some other output directory. From what I understand about your question, maybe doing something like this will help:
sdf <- SparkR::sql('SELECT SOME GREAT QUERY FROM TABLE GROUP BY city')
SparkR::write.parquet(sdf, path="path/to/desired/output/location", mode="append")
This will give you all your data in one file, and it should be grouped by city, which is what I think you are trying to get with your second query in your question.
You can confirm the output is what you want via:
newsdf<- SparkR::read.parquet(x="path/to/first/output/location/")
View(head(newsdf, num=200))
Good luck, hopefully this helps.
Since your data is huge, it is no longer possible to collect() it. Instead, you can sample the data and learn from the sampled data.
import numpy as np
arr = np.array(sdf.select("col_name").sample(False, 0.5, seed=42).collect())
Here you are sampling 50% of the data and just a single column.
