How to save text file to orc in spark - apache-spark

I am new to Spark. I am trying to save my text file as ORC using spark-shell. Is there any way to do that?
val data = sc.textFile("/yyy/yyy/yyy")
data.saveAsOrcFile("/yyy/yyy/yyy")

You can convert the RDD to a DataFrame and then save it:
data.toDF().write.format("orc").save("/path/to/save/file")
To read it back, use the SQL context:
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val data = sqlContext.read.format("orc").load("/path/to/file/*")
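Note that data.toDF() on an RDD[String] produces a DataFrame with a single value column. If the text file has delimited fields, a minimal sketch (assuming comma-separated name/age fields, which are not part of the original question) is to map each line into a case class before writing ORC:
import org.apache.spark.sql.hive.HiveContext

// Hypothetical record layout: one "name,age" pair per line
case class Person(name: String, age: Int)

val sqlContext = new HiveContext(sc)
import sqlContext.implicits._            // required for .toDF()

val people = sc.textFile("/yyy/yyy/yyy") // path from the question
val peopleDF = people.map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

peopleDF.write.format("orc").save("/path/to/save/file")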

Related

how to move large table from PSQL to parquet on gcloud via Apache Spark?

I have a large table (around 300 GB), about 50 GB of RAM, and 8 CPUs.
I want to move my PSQL table into Google Cloud Storage using Spark and a JDBC connection, very similar to: How to convert an 500GB SQL table into Apache Parquet?.
I know my connections work, because I was able to move a small table. But with the large table I get memory issues. How can I optimize it?
import pyspark
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql import DataFrameReader

conf = pyspark.SparkConf().setAll([
    ("spark.driver.extraClassPath",
     "/usr/local/bin/postgresql-42.2.5.jar:/usr/local/jar/gcs-connector-hadoop2-latest.jar"),
    ("spark.executor.instances", "8"),
    ("spark.executor.cores", "4"),
    ("spark.executor.memory", "1g"),
    ("spark.driver.memory", "6g"),
    ("spark.memory.offHeap.enabled", "true"),
    ("spark.memory.offHeap.size", "40g")])
sc = pyspark.SparkContext(conf=conf)
sc.getConf().getAll()
sc._jsc.hadoopConfiguration().set(
    "google.cloud.auth.service.account.json.keyfile",
    "/home/user/analytics/gcloud_key_name.json")
sqlContext = SQLContext(sc)

url = 'postgresql://address:port/db_name'
properties = {
    'user': 'user',
    'password': 'password'}
df_users = sqlContext.read.jdbc(
    url='jdbc:%s' % url, table='users', properties=properties
)

gcloud_path = "gs://BUCKET/users"
df_users.write.mode('overwrite').parquet(gcloud_path)
Bonus question:
Can I partition it now, or should I first save it as Parquet, then read it back and repartition it?
Bonus question 2:
If the answer to bonus question 1 is yes, can I sort it now, or should I first save it as Parquet, then read it back and sort it?

Write dataframe to csv with datatype map<string,bigint> in Spark

I have a file, file1.snappy.parquet, with a complex data structure (a map with an array inside it). After processing it I got the final result, but while writing that result to CSV I get an error saying
"Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type."
Code which I have used:
val conf=new SparkConf().setAppName("student-example").setMaster("local")
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
def sumaggr=udf((aggr: Map[String, collection.mutable.WrappedArray[Long]]) => if (aggr.keySet.contains("aggr")) aggr("aggr").sum else 0)
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
datadf.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
I tried converting with datadf.toString(), but I am still facing the same issue.
How can I write that result to CSV?
Spark version: 2.1.1
The Spark CSV source supports only atomic types; you cannot store any non-atomic columns.
I think the best option is to convert the column with the map<string,bigint> datatype to JSON and save that in the CSV, as below:
import spark.implicits._
import org.apache.spark.sql.functions._
datadf.withColumn("column_name_with_map_type", to_json(struct($"column_name_with_map_type"))).write.csv("outputpath")
Hope this helps!
You are trying to save the output of
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
which I guess is a mistake, as the udf function and all the aggregation you did would go to waste if you do so.
I think what you want to save is the output of
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
So you need to save it in a new dataframe variable and use that variable to save it:
val finalDF = datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0)
finalDF.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
And you should be fine.
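As a side note, since this is Spark 2.1, the built-in CSV writer can be used instead of the external com.databricks.spark.csv package; a minimal sketch, reusing finalDF from the answer above:
// Built-in CSV source (Spark 2.0+); the path will be a directory of part files
finalDF.write
  .option("header", "true")
  .csv("C:\\myfile.csv")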

how to use a whole hive database in spark and read sql queries from external files?

I am using the Hortonworks sandbox on Azure with Spark 1.6.
I have a Hive database populated with TPC-DS sample data. I want to read some SQL queries from external files and run them against the Hive dataset in Spark.
I followed this topic, Using hive database in spark, but it only uses a single table from my dataset and it also writes the SQL query in Spark again; I need to define the whole dataset as my source to query against. I think I should use dataframes, but I am not sure and do not know how.
I also want to import the SQL query from an external .sql file rather than writing the query again.
Would you please guide me on how I can do this?
Thank you very much!
Spark can read data directly from Hive tables. You can create and drop Hive tables with Spark, and you can run all Hive HQL operations through Spark. For this you need to use the Spark HiveContext.
From the Spark documentation:
The Spark HiveContext provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup.
For more information you can visit Spark Documentation
To avoid writing SQL in code, you can use a properties file where you put all your Hive queries, and then refer to each query by its key in your code.
Please see below an implementation of Spark HiveContext and the use of a properties file in Spark Scala.
package com.spark.hive.poc

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
import org.apache.hadoop.fs.{ FileSystem, Path }
// Import Row.
import org.apache.spark.sql.Row
// Import Spark SQL data types
import org.apache.spark.sql.types.{ StructType, StructField, StringType }

object ReadPropertyFiles extends Serializable {

  val conf = new SparkConf().setAppName("read local file")
  conf.set("spark.executor.memory", "100M")
  conf.setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new HiveContext(sc)

  def main(args: Array[String]): Unit = {
    // Load the properties file (HDFS path passed as the first argument)
    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    val fileSystem = FileSystem.get(hadoopConf)
    val inputStream = fileSystem.open(new Path(args(0)))
    val props = new java.util.Properties
    props.load(inputStream)

    // Create an RDD
    val people = sc.textFile("/user/User1/spark_hive_poc/input/")
    // The schema is encoded in a string
    val schemaString = "name address"
    // Generate the schema based on the schema string
    val schema = StructType(
      schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    // Convert records of the RDD (people) to Rows
    val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
    // Apply the schema to the RDD
    val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    peopleDataFrame.printSchema()
    peopleDataFrame.registerTempTable("tbl_temp")

    val data = sqlContext.sql(props.getProperty("temp_table"))
    // Drop Hive table
    sqlContext.sql(props.getProperty("drop_hive_table"))
    // Create Hive table
    sqlContext.sql(props.getProperty("create_hive_table"))
    // Insert data into Hive table
    sqlContext.sql(props.getProperty("insert_into_hive_table"))
    // Select data from Hive table
    sqlContext.sql(props.getProperty("select_from_hive")).show()

    sc.stop
  }
}
Entry in Properties File :
temp_table=select * from tbl_temp
drop_hive_table=DROP TABLE IF EXISTS default.test_hive_tbl
create_hive_table=CREATE TABLE IF NOT EXISTS default.test_hive_tbl(name string, city string) STORED AS ORC
insert_into_hive_table=insert overwrite table default.test_hive_tbl select * from tbl_temp
select_from_hive=select * from default.test_hive_tbl
Spark submit Command to run this job:
[User1#hadoopdev ~]$ spark-submit --num-executors 1 \
--executor-memory 100M --total-executor-cores 2 --master local \
--class com.spark.hive.poc.ReadPropertyFiles Hive-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
/user/User1/spark_hive_poc/properties/sql.properties
Note: the properties file location should be an HDFS location.
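If you would rather keep each query in its own .sql file instead of a properties file, a minimal sketch (the file path here is a hypothetical example, not from the question) is to read the file contents from HDFS and pass the resulting string to sqlContext.sql:
// Sketch: read a whole .sql file from HDFS and run it (Spark 1.6 style)
// The file should contain a single statement without a trailing ';'
val queryText = sc.wholeTextFiles("/user/User1/spark_hive_poc/queries/query1.sql")
  .map { case (_, content) => content } // keep only the file contents
  .first()
  .trim

sqlContext.sql(queryText).show()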

Can I read a CSV represented as a string into Apache Spark using spark-csv

I know how to read a CSV file into Spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the CSV file represented as a string and would like to convert this string directly to a dataframe. Is this possible?
Update: starting from Spark 2.2.x, there is finally a proper way to do it, using a Dataset.
import org.apache.spark.sql.{Dataset, SparkSession}
val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
import spark.implicits._
val csvData: Dataset[String] = spark.sparkContext.parallelize(
"""
|id, date, timedump
|1, "2014/01/01 23:00:01",1499959917383
|2, "2014/11/31 12:40:32",1198138008843
""".stripMargin.lines.toList).toDS()
val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
frame.show()
frame.printSchema()
Old spark versions
Actually you can, though it's using library internals and not widely advertised. Just create and use your own CsvParser instance.
Below is an example that works for me on Spark 1.6.0 with spark-csv_2.10-1.4.0:
import com.databricks.spark.csv.CsvParser
val csvData = """
|userid,organizationid,userfirstname,usermiddlename,userlastname,usertitle
|1,1,user1,m1,l1,mr
|2,2,user2,m2,l2,mr
|3,3,user3,m3,l3,mr
|""".stripMargin
val rdd = sc.parallelize(csvData.lines.toList)
val csvParser = new CsvParser()
.withUseHeader(true)
.withInferSchema(true)
val csvDataFrame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
You can parse your string as CSV using, e.g., scala-csv:
val myCSVdata: Array[List[String]] =
  myCSVString.split('\n').flatMap(CSVParser.parseLine(_))
Here you can do a bit more processing, data cleaning, verifying that every line parses well and has the same number of fields, etc ...
You can then make this an RDD of records:
val myCSVRDD: RDD[List[String]] = sparkContext.parallelize(myCSVdata)
Here you can massage your lists of Strings into a case class, to better reflect the fields of your CSV data. You can take some inspiration from the creation of Persons in this example:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
I omit this step.
You can then convert to a DataFrame:
import spark.implicits._
val myCSVDataframe = myCSVRDD.toDF()
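For completeness, a minimal sketch of the omitted case-class step, which would replace the direct toDF() call above (the Person fields below are assumptions, not from the question):
// Hypothetical record type for a two-column CSV
case class Person(name: String, age: Int)

val peopleDF = myCSVRDD
  .filter(_.length == 2)                                      // drop malformed rows
  .map { case List(name, age) => Person(name, age.trim.toInt) }
  .toDF()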
The accepted answer wasn't working for me in Spark 2.2.0, but it led me to what I needed with csvData.lines.toList:
import java.io.InputStream
import scala.io.Source
import spark.implicits._ // required for .toDS() and .as[SomeCaseClass]

val fileUrl = getClass.getResource(s"/file_in_resources.csv")
val stream = fileUrl.getContent.asInstanceOf[InputStream]
val streamString = Source.fromInputStream(stream).mkString
val csvList = streamString.lines.toList

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvList.toDS())
  .as[SomeCaseClass]

Merge parquet file on standalone spark

Is there a simple way to save a DataFrame into a single Parquet file, or to merge the directory (containing metadata and the part files) produced by sqlContext.saveAsParquetFile() into a single file stored on NFS, without using HDFS and Hadoop?
To save only one file, rather than many, you can call coalesce(1) / repartition(1) on the RDD/Dataframe before the data is saved.
If you already have a directory with small files, you could create a Compacter process which would read in the existing files and save them to one new file. E.g.
val rows = parquetFile(...).coalesce(1)
rows.saveAsParquetFile(...)
You can store to a local file system using saveAsParquetFile. e.g.
rows.saveAsParquetFile("/tmp/onefile/")
I was able to use this method to compress parquet files using snappy format with Spark 1.6.1. I used overwrite so that I could repeat the process if needed. Here is the code.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode

object CompressApp {
  val serverPort = "hdfs://myserver:8020/"
  val inputUri = serverPort + "input"
  val outputUri = serverPort + "output"

  val config = new SparkConf()
    .setAppName("compress-app")
    .setMaster("local[*]")
  val sc = SparkContext.getOrCreate(config)
  val sqlContext = SQLContext.getOrCreate(sc)
  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
  import sqlContext.implicits._

  def main(args: Array[String]) {
    println("Compressing Parquet...")
    val df = sqlContext.read.parquet(inputUri).coalesce(1)
    df.write.mode(SaveMode.Overwrite).parquet(outputUri)
    println("Done.")
  }
}
coalesce(N) has saved me so far. If your table is partitioned, then use repartition("partition key") as well.
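For example, a minimal sketch of that pattern on top of the CompressApp code above (the "date" partition column is an assumption, and df/outputUri are reused from that example):
// Rows with the same date land in the same task, so each date
// directory in the output ends up with a single file.
df.repartition($"date")
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("date")
  .parquet(outputUri)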
