How to read a huge number of .gz S3 files into an RDD? - apache-spark

aws s3api list-objects-v2 --bucket cw-milenko-tests | grep 'tick_c'
The output shows:
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-50-22.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-52-59.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-55-08.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-57-30.json.gz",
"Key": "Json_gzips/tick_calculated_3_2020-05-27T11-59-59.json.gz",
"Key": "Json_gzips/tick_calculated_4_2020-05-27T09-14-28.json.gz",
"Key": "Json_gzips/tick_calculated_4_2020-05-27T11-35-38.json.gz",
Counting them with wc -l:
aws s3api list-objects-v2 --bucket cw-milenko-tests | grep 'tick_c' | wc -l
457
I can read one file into a DataFrame.
val path = "tick_calculated_2_2020-05-27T00-01-21.json"
scala> val tick1DF = spark.read.json(path)
tick1DF: org.apache.spark.sql.DataFrame = [aml_barcode_canc: string, aml_barcode_payoff: string ... 70 more fields]
I was surprised to see negative votes.
What I want to know is how to load all 457 files into an RDD. I saw this SO question.
Is it possible at all? What are the limitations?
This is what I have tried so far:
val rdd1 = sc.textFile("s3://cw-milenko-tests/Json_gzips/tick_calculated*.gz")
If I go for s3a
val rdd1 = sc.textFile("s3a://cw-milenko-tests/Json_gzips/tick_calculated*.gz")
rdd1: org.apache.spark.rdd.RDD[String] = s3a://cw-milenko-tests/Json_gzips/tick_calculated*.gz MapPartitionsRDD[3] at textFile at <console>:27
That doesn't work either.
Trying to inspect my RDD:
scala> rdd1.take(1)
java.io.IOException: No FileSystem for scheme: s3
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
The FileSystem was not recognized.
My GOAL:
s3://json.gz -> rdd -> parquet

Try this-
/**
* /Json_gzips
* |- spark-test-data1.json.gz
* --------------------
* {"id":1,"name":"abc1"}
* {"id":2,"name":"abc2"}
* {"id":3,"name":"abc3"}
*/
/**
* /Json_gzips
* |- spark-test-data2.json.gz
* --------------------
* {"id":1,"name":"abc1"}
* {"id":2,"name":"abc2"}
* {"id":3,"name":"abc3"}
*/
val path = getClass.getResource("/Json_gzips").getPath
// path to the root directory which contains all the .gz files
spark.read.json(path).show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |abc1|
* |2 |abc2|
* |3 |abc3|
* |1 |abc1|
* |2 |abc2|
* |3 |abc3|
* +---+----+
*/
You can convert this DataFrame to an RDD if required.
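For the original goal (s3://json.gz -> rdd -> parquet), a minimal Scala sketch could look like the following, assuming the S3A connector is configured; the bucket and prefix are taken from the question, while the output path is only a placeholder:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Read every matching gzipped JSON file in one go; gzip is decompressed transparently,
// but note that .gz files are not splittable, so each file becomes a single partition.
val tickDF: DataFrame = spark.read.json("s3a://cw-milenko-tests/Json_gzips/tick_calculated*.gz")

// An RDD view is available if it is really needed ...
val tickRDD: RDD[Row] = tickDF.rdd

// ... but for writing parquet, staying with the DataFrame is simpler.
tickDF.write.mode("overwrite").parquet("s3a://cw-milenko-tests/parquet_out/") // output path is hypothetical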

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

# Read all files under the Json_gzips key in S3
df = spark.read.json("s3a://cw-milenko-tests/Json_gzips/tick_calculated*.gz")
df.show()
rdd = df.rdd  # to convert it to an RDD
Use s3a instead of s3 (see: why s3a over s3?).
Also add the dependency for hadoop-aws 2.7.3 and the AWS SDK, i.e. the JARs that provide S3 support.
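If the "No FileSystem for scheme" errors persist, the S3A connector and credentials usually need to be wired up explicitly. A hedged sketch, assuming a Hadoop 2.7.x build of Spark (the exact package versions may differ for your distribution):
// Launch with the connector on the classpath, e.g.:
//   spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3
// hadoop-aws pulls in a matching AWS SDK transitively.

// Credentials can come from the usual AWS sources; setting them on the Hadoop
// configuration is one explicit option (the environment variable names are the standard AWS ones).
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))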

Related

How to properly separate columns

I'm having trouble with Spark SQL. I tried to import a CSV file into a Spark database. My columns are separated by semicolons. I tried to separate the columns by using the sep option, but to my dismay the columns are not separated properly.
Is this how Spark SQL works, or is there a difference between conventional Spark SQL and the one in Databricks? I am new to Spark SQL, a whole new environment compared to the original SQL language, so pardon my limited knowledge of Spark SQL.
USE CarSalesP1935727;
CREATE TABLE IF NOT EXISTS Products
USING CSV
OPTIONS (path "/FileStore/tables/Products.csv", header "true", inferSchema
"true", sep ";");
SELECT * FROM Products LIMIT 10
Not sure about the problem; it is working well for me.
Please note that the environment is not Databricks.
val path = getClass.getResource("/csv/test2.txt").getPath
println(path)
/**
* file data
* -----------
* id;sequence1;sequence2
* 1;657985;657985
* 2;689654;685485
*/
spark.sql(
s"""
|CREATE TABLE IF NOT EXISTS Products
|USING CSV
|OPTIONS (path "$path", header "true", inferSchema
|"true", sep ";")
""".stripMargin)
spark.sql("select * from Products").show(false)
/**
* +---+---------+---------+
* |id |sequence1|sequence2|
* +---+---------+---------+
* |1 |657985 |657985 |
* |2 |689654 |685485 |
* +---+---------+---------+
*/
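As an alternative to the SQL DDL, the same options can be passed through the DataFrameReader directly. A small Scala sketch (the path is the one from the question and may need adjusting for your environment):
// Read the semicolon-separated file with a header and an inferred schema.
val products = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("sep", ";")
  .csv("/FileStore/tables/Products.csv")

products.show(10, false)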

Complex file parsing in Spark 2.4

Spark 2.4 with Scala.
My source data looks like as given below.
Salesperson_21: Customer_575,Customer_2703,Customer_2682,Customer_2615
Salesperson_11: Customer_454,Customer_158,Customer_1859,Customer_2605
Salesperson_10: Customer_1760,Customer_613,Customer_3008,Customer_1265
Salesperson_4: Customer_1545,Customer_1312,Customer_861,Customer_2178
Code used to flatten the file:
val SalespersontextDF = spark.read.text("D:/prints/sales.txt")
val stringCol = SalespersontextDF.columns.map(c => s"'$c', cast(`$c` as string)").mkString(", ")
val processedDF = SalespersontextDF.selectExpr(s"stack(${df1.columns.length}, $stringCol) as (Salesperson, Customer)")
Unfortunately it is not populating the salesperson in the correct field; instead of the salesperson number it populates the hardcoded value "value", and the salesperson number shifts to another field.
I appreciate your help very much.
The below approach might solve your problem.
import org.apache.spark.sql.functions._
val SalespersontextDF = spark.read.text("/home/sathya/Desktop/stackoverflo/data/sales.txt")
val stringCol = SalespersontextDF.columns.map(c => s"'$c', cast(`$c` as string)").mkString(", ")
val processedDF = SalespersontextDF.selectExpr(s"stack(${SalespersontextDF.columns.length}, $stringCol) as (Salesperson, Customer)")
processedDF.show(false)
/*
+-----------+----------------------------------------------------------------------+
|Salesperson|Customer |
+-----------+----------------------------------------------------------------------+
|value |Salesperson_21: Customer_575,Customer_2703,Customer_2682,Customer_2615|
|value |Salesperson_11: Customer_454,Customer_158,Customer_1859,Customer_2605 |
|value |Salesperson_10: Customer_1760,Customer_613,Customer_3008,Customer_1265|
|value |Salesperson_4: Customer_1545,Customer_1312,Customer_861,Customer_2178 |
+-----------+----------------------------------------------------------------------+
*/
processedDF
  .withColumn("Salesperson", split($"Customer", ":").getItem(0))
  .withColumn("Customer", split($"Customer", ":").getItem(1))
  .show(false)
/*
+--------------+-------------------------------------------------------+
|Salesperson |Customer |
+--------------+-------------------------------------------------------+
|Salesperson_21| Customer_575,Customer_2703,Customer_2682,Customer_2615|
|Salesperson_11| Customer_454,Customer_158,Customer_1859,Customer_2605 |
|Salesperson_10| Customer_1760,Customer_613,Customer_3008,Customer_1265|
|Salesperson_4 | Customer_1545,Customer_1312,Customer_861,Customer_2178|
+--------------+-------------------------------------------------------+
*/
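If the comma-separated customer list also needs to be broken out into one row per customer, a possible follow-up on processedDF (a sketch, not part of the original question) is to trim and explode it:
import org.apache.spark.sql.functions.{col, explode, split, trim}

processedDF
  .withColumn("Salesperson", split(col("Customer"), ":").getItem(0))
  .withColumn("Customer", explode(split(trim(split(col("Customer"), ":").getItem(1)), ",")))
  .show(false)
// Expected shape: one row per (Salesperson, Customer_xxx) pair.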
Alternatively, try this-
spark.read
.schema("Salesperson STRING, Customer STRING")
.option("sep", ":")
.csv("D:/prints/sales.txt")

From the following code, how to convert a JavaRDD<Integer> to a DataFrame or Dataset

public static void main(String[] args) {
SparkSession sessn = SparkSession.builder().appName("RDD2DF").master("local").getOrCreate();
List<Integer> lst = Arrays.asList(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
Dataset<Integer> DF = sessn.createDataset(lst, Encoders.INT());
System.out.println(DF.javaRDD().getNumPartitions());
JavaRDD<Integer> mappartRdd = DF.repartition(3).javaRDD().mapPartitions(it-> Arrays.asList(JavaConversions.asScalaIterator(it).length()).iterator());
}
From the above code, I am unable to convert the JavaRDD (mappartRdd) to a DataFrame in Java Spark.
I am using the below to convert the JavaRDD to a DataFrame/Dataset:
sessn.createDataFrame(mappartRdd, beanClass);
I tried multiple options and different overloaded versions of createDataFrame, but I am facing issues converting it to a DataFrame. What is the bean class I need to provide for the code to work?
Unlike Scala, there is no function like toDF() to convert the RDD to a DataFrame in Java. Can someone assist with converting it as per my requirement?
Note: I am able to create a Dataset directly by modifying the above code as below:
Dataset<Integer> mappartDS = DF.repartition(3).mapPartitions(it-> Arrays.asList(JavaConversions.asScalaIterator(it).length()).iterator(), Encoders.INT());
But I want to know why my JavaRDD is not getting converted to a DataFrame/Dataset if I use createDataFrame. Any help will be greatly appreciated.
This seems to be a follow-up to this SO question.
I think you are in the learning stage of Spark. I would suggest getting familiar with the Java APIs provided here: https://spark.apache.org/docs/latest/api/java/index.html
Regarding your question, if you check the createDataFrame API, it is as follows-
def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
...
}
As you can see, it takes a JavaRDD[Row] and a related StructType schema as arguments. Hence, to create a DataFrame, which is equivalent to Dataset<Row>, use the below snippet-
JavaRDD<Integer> mappartRdd = DF.repartition(3).javaRDD().mapPartitions(it-> Arrays.asList(JavaConversions.asScalaIterator(it).length()).iterator());
StructType schema = new StructType()
.add(new StructField("value", DataTypes.IntegerType, true, Metadata.empty()));
Dataset<Row> df = spark.createDataFrame(mappartRdd.map(RowFactory::create), schema);
df.show(false);
df.printSchema();
/**
* +-----+
* |value|
* +-----+
* |6 |
* |8 |
* |6 |
* +-----+
*
* root
* |-- value: integer (nullable = true)
*/

Get source files for directory of parquet tables in spark

I have some code where I read in many parquet tables via a directory and wildcard, like this:
df = sqlContext.read.load("some_dir/*")
Is there some way I can get the source file for each row in the resulting DataFrame, df?
Let's create some dummy data and save it in parquet format.
spark.range(1,1000).write.save("./foo/bar")
spark.range(1,2000).write.save("./foo/bar2")
spark.range(1,3000).write.save("./foo/bar3")
Now we can read the data as desired:
import org.apache.spark.sql.functions.input_file_name
spark.read.load("./foo/*")
.select(input_file_name(), $"id")
.show(3,false)
// +---------------------------------------------------------------------------------------+---+
// |INPUT_FILE_NAME() |id |
// +---------------------------------------------------------------------------------------+---+
// |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|500|
// |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|501|
// |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|502|
// +---------------------------------------------------------------------------------------+---+
Since Spark 1.6, you can combine the parquet data source and the input_file_name function as shown above.
This seems to be buggy with PySpark before Spark 2.x, but this is how it's done:
from pyspark.sql.functions import input_file_name
spark.read.load("./foo/*") \
.select(input_file_name(), "id") \
.show(3,truncate=False)
# +---------------------------------------------------------------------------------------+---+
# |INPUT_FILE_NAME() |id |
# +---------------------------------------------------------------------------------------+---+
# |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|500|
# |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|501|
# |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|502|
# +---------------------------------------------------------------------------------------+---+
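To keep the source file available for later filtering or grouping, it can also be materialised as a regular column; a short Scala sketch (the column name is arbitrary):
import org.apache.spark.sql.functions.input_file_name

// Persist the originating file path as an ordinary column so it survives joins, filters, etc.
val withSource = spark.read.load("./foo/*")
  .withColumn("source_file", input_file_name())

// Example: count rows per source file.
withSource.groupBy("source_file").count().show(false)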

Spark DataFrame losing string data in yarn-client mode

For some reason, if I add a new column, append a string to existing data/columns, or create a new DataFrame from code, it misinterprets string data, so show() doesn't work properly, and filters (such as withColumn, where, when, etc.) don't work either.
Here is example code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object MissingValue {
  def hex(str: String): String =
    str.getBytes("UTF-8").map(f => Integer.toHexString(f & 0xFF).toUpperCase).mkString("-")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MissingValue")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val list = List((101, "ABC"), (102, "BCD"), (103, "CDE"))
    val rdd = sc.parallelize(list).map(f => Row(f._1, f._2))
    val schema = StructType(StructField("COL1", IntegerType, true) :: StructField("COL2", StringType, true) :: Nil)
    val df = sqlContext.createDataFrame(rdd, schema)
    df.show()

    val str = df.first().getString(1)
    println(s"${str} == ${hex(str)}")
    sc.stop()
  }
}
If I run it in local mode then everything works as expected:
+----+----+
|COL1|COL2|
+----+----+
| 101| ABC|
| 102| BCD|
| 103| CDE|
+----+----+
ABC == 41-42-43
But when I run the same code in yarn-client mode it produces:
+----+----+
|COL1|COL2|
+----+----+
| 101| ^E^#^#|
| 102| ^E^#^#|
| 103| ^E^#^#|
+----+----+
^E^#^# == 5-0-0
This problem exists only for string values, so the first column (Integer) is fine.
Also, if I create an RDD from the DataFrame then everything is fine, i.e. df.rdd.take(1).apply(0).getString(1).
I'm using Spark 1.5.0 from CDH 5.5.2.
EDIT:
It seems that this happens when the difference between driver memory and executor memory is too high (--driver-memory xxG --executor-memory yyG), i.e. when I decrease the executor memory or increase the driver memory the problem disappears.
This is a bug related to executor memory and the JVM's oops (ordinary object pointer) size:
https://issues.apache.org/jira/browse/SPARK-9725
https://issues.apache.org/jira/browse/SPARK-10914
https://issues.apache.org/jira/browse/SPARK-17706
It is fixed in Spark version 1.5.2
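If upgrading is not possible, one commonly mentioned workaround (stated here as an assumption, not verified against the linked JIRAs) is to keep the driver and executors in the same oops mode, either by keeping both heaps on the same side of the ~32 GB compressed-oops threshold or by pinning the mode explicitly, e.g.:
import org.apache.spark.SparkConf

// Hypothetical workaround sketch: force executors into the same oops mode as a large-heap driver
// (the JVM disables compressed oops automatically for heaps above ~32 GB).
// The corresponding driver-side flag would have to go on the spark-submit command line instead.
val conf = new SparkConf()
  .setAppName("MissingValue")
  .set("spark.executor.extraJavaOptions", "-XX:-UseCompressedOops")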
