Loading file from HDFS in spark - apache-spark

I'm trying to run this Spark program against a file stored in HDFS, because my PC doesn't have enough memory to handle it when I run it locally. Can someone show me how to load the CSV file from HDFS instead of from the local filesystem? Here is my code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class VideoGamesSale {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("Video Games Spark")
            .config("spark.master", "local")
            .getOrCreate();

You can use the code below to create a Dataset/DataFrame from a CSV file.
Dataset<Row> csvDS = spark.read().csv("/path/of/csv/file.csv");
If you want to read multiple files or directories, you can pass several paths like this (Arrays here is java.util.Arrays):
Seq<String> paths = scala.collection.JavaConversions.asScalaBuffer(Arrays.asList("path1", "path2"));
Dataset<Row> csvsDS = spark.read().csv(paths);
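Since the question is specifically about reading from HDFS, here is a minimal sketch of pointing the same reader at an HDFS path; the namenode host, port, header option, and file location below are placeholders and assumptions, not values from the question:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class VideoGamesSaleFromHdfs {
    public static void main(String[] args) {
        // If the data set is too large for one machine, point spark.master at a real
        // cluster (e.g. yarn or spark://...) instead of "local".
        SparkSession spark = SparkSession
            .builder()
            .appName("Video Games Spark")
            .getOrCreate();

        // "namenode-host:8020" and the path are placeholders; use your cluster's
        // fs.defaultFS value and the actual location of the CSV in HDFS.
        Dataset<Row> csvDS = spark.read()
            .option("header", "true") // assumption: the CSV has a header row
            .csv("hdfs://namenode-host:8020/user/me/video_games_sales.csv");

        csvDS.show();
        spark.stop();
    }
}
If the Hadoop configuration on the classpath already sets fs.defaultFS to your namenode, a plain path such as /user/me/video_games_sales.csv (without the hdfs:// prefix) also resolves to HDFS rather than the local filesystem.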

Related

Not able to import Spark SQL in Maven

I am trying to import Spark SQL, but the import fails and I am not sure what mistake I am making. I am just starting to learn.
package MySource

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.sql.SparkSession
import java.util.Properties

object MyCalc {
  def main(args: Array[String]): Unit = {
    println("This is my first Spark")
    //val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val spark = SparkSession
      .builder()
      .appName("SparkSQL")
      //.master("YARN")
      .master("local[*]")
      //.enableHiveSupport()
      //.config("spark.sql.warehouse.dir","file:///c:/temp")
      .getOrCreate()
    import spark.sqlContext.implicits._
  }
}
Error:(3, 8) object SparkSession is not a member of package org.apache.spark.sql
import org.apache.spark.sql.SparkSession
Error:(15, 17) not found: value SparkSession
val spark = SparkSession

How To save DataFrame Into Cassandra table using Spark Java API

I want to save a DataFrame into a Cassandra table using the Spark Java API.
I want to add the saving step to the following code:
I want to save the people DataFrame into a Cassandra table and then run queries against that Cassandra table.
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import com.datastax.spark.connector.cql.CassandraConnector;
import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;

public class SimpleApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        conf.setMaster("local");
        conf.set("spark.cassandra.connection.host", "localhost");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
        DataFrame people = sqlContext.read().json("/root/people.json");
        people.printSchema();
        people.registerTempTable("people");
        // I want to save this temp table or the people DataFrame into a Cassandra table
        // and run the teenagers SQL query against that Cassandra table.
        DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
        teenagers.show();
    }
}
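The original post does not include an answer; as one possible sketch (not from the post), the DataFrame could be written through the DataStax Spark Cassandra connector's data source. This assumes a keyspace "test" and a table "people" with columns matching the DataFrame schema already exist in Cassandra:
import org.apache.spark.sql.SaveMode;

// ...continuing inside main(), after people.registerTempTable("people"):

// Assumption: keyspace "test" and table "people" have already been created in Cassandra.
people.write()
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "test")
    .option("table", "people")
    .mode(SaveMode.Append)
    .save();

// The Cassandra table can then be loaded back and queried with Spark SQL:
DataFrame fromCassandra = sqlContext.read()
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "test")
    .option("table", "people")
    .load();
fromCassandra.registerTempTable("people_cassandra");
DataFrame teenagersFromCassandra =
    sqlContext.sql("SELECT name FROM people_cassandra WHERE age >= 13 AND age <= 19");
teenagersFromCassandra.show();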

How to save a Dataset<Row> into MySQL in Spark?

I am using a Spark standalone cluster in my scenario. I want to read a JSON file from Azure Data Lake, run some queries over it with Spark SQL, and save the result into a MySQL database. I don't know how to do it; any help would be great.
package com.biz.Read_from_ADL;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class App {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example").getOrCreate();
        Dataset<Row> df = spark.read().json("adl://pare.azuredatalakestore.net/EXCHANGE_DATA/BITFINEX/ETHBTC/MIDPOINT/BITFINEX_ETHBTC_MIDPOINT_2017-06-25.json");
        //df.show();
        df.createOrReplaceTempView("trade");
        Dataset<Row> sqlDF = spark.sql("SELECT * FROM trade");
        sqlDF.show();
    }
}
You need to first define the connection properties and the JDBC URL.
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "USER_NAME")
connectionProperties.put("password", "PASSWORD")

val jdbc_url = ... // <- use mysql url

import org.apache.spark.sql.SaveMode

spark.sql("select * from diamonds limit 10")
  .withColumnRenamed("table", "table_number")
  .write
  .mode(SaveMode.Append) // <--- Append to the existing table
  .jdbc(jdbc_url, "diamonds_mysql", connectionProperties)
Refer here for more detail.
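The snippet above is Scala; since the question's code is Java, here is a minimal Java sketch of the same idea, continuing from the sqlDF Dataset in the question. The JDBC URL, credentials, and target table name are placeholders, and the MySQL JDBC driver (mysql-connector-java) has to be on the Spark classpath:
import java.util.Properties;
import org.apache.spark.sql.SaveMode;

// Placeholders: replace with your MySQL host, database, user, and password.
String jdbcUrl = "jdbc:mysql://mysql-host:3306/mydb";
Properties connectionProperties = new Properties();
connectionProperties.put("user", "USER_NAME");
connectionProperties.put("password", "PASSWORD");
connectionProperties.put("driver", "com.mysql.jdbc.Driver");

sqlDF.write()
    .mode(SaveMode.Append) // append to the table if it already exists
    .jdbc(jdbcUrl, "trade_result", connectionProperties);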

Using Py4J to invoke a method that takes a JavaSparkContext and return a JavaRDD<Integer>

I am looking for help or example code that shows PySpark calling user-written Java code (outside of Spark itself) that takes a Spark context passed from Python and then returns an RDD built in Java.
For completeness, I'm using Py4J 0.81, Java 8, Python 2.7, and spark 1.3.1
Here is what I am using for the Python half:
import pyspark
sc = pyspark.SparkContext(master='local[4]', appName='HelloWorld')
print "version", sc._jsc.version()
from py4j.java_gateway import JavaGateway
gateway = JavaGateway()
print gateway.entry_point.getRDDFromSC(sc._jsc)
The Java portion is:
import java.util.Map;
import java.util.List;
import java.util.ArrayList;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import py4j.GatewayServer;

public class HelloWorld
{
    public JavaRDD<Integer> getRDDFromSC(JavaSparkContext jsc)
    {
        JavaRDD<Integer> result = null;
        if (jsc == null)
        {
            System.out.println("XXX Bad mojo XXX");
            return result;
        }
        int n = 10;
        List<Integer> l = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++)
        {
            l.add(i);
        }
        result = jsc.parallelize(l);
        return result;
    }

    public static void main(String[] args)
    {
        HelloWorld app = new HelloWorld();
        GatewayServer server = new GatewayServer(app);
        server.start();
    }
}
Running produces on the Python side:
$ spark-1.3.1-bin-hadoop1/bin/spark-submit main.py
version 1.3.1
sc._jsc <class 'py4j.java_gateway.JavaObject'>
org.apache.spark.api.java.JavaSparkContext#50418105
None
The Java side reports:
$ spark-1.3.1-bin-hadoop1/bin/spark-submit --class "HelloWorld" --master local[4] target/hello-world-1.0.jar
XXX Bad mojo XXX
The problem appears to be that I am not correctly passing the JavaSparkContext from Python to Java. The same failure (the JavaRDD coming back null) occurs when I pass sc._scj.sc() from Python.
What is the correct way to invoke user defined Java code that uses spark from Python?
So I've got an example of this in a branch that I'm working on for Sparkling Pandas. The branch lives at https://github.com/holdenk/sparklingpandas/tree/add-kurtosis-support and the PR is at https://github.com/sparklingpandas/sparklingpandas/pull/90 .
As it stands, it looks like you have two different gateway servers, which seems like it might cause some problems. Instead, you can just use the existing gateway server and do something like:
sc._jvm.what.ever.your.class.package.is.HelloWorld.getRDDFromSC(sc._jsc)
assuming you make that a static method as well.
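For reference, a minimal sketch of the Java side with getRDDFromSC made static so it can be called through sc._jvm as shown above; this assumes the compiled class is on the driver's classpath (for example via spark-submit --jars or --driver-class-path):
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HelloWorld
{
    // Static, so Python can call it directly through the driver JVM
    // (sc._jvm.HelloWorld.getRDDFromSC(sc._jsc)) without starting a
    // separate Py4J GatewayServer.
    public static JavaRDD<Integer> getRDDFromSC(JavaSparkContext jsc)
    {
        int n = 10;
        List<Integer> l = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++)
        {
            l.add(i);
        }
        return jsc.parallelize(l);
    }
}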

Apache Spark java.lang.ClassNotFoundException

The Spark standalone cluster looks like it's running without a problem:
http://i.stack.imgur.com/gF1fN.png
I followed this tutorial.
I have built a fat jar to run this Java app on the cluster. Before running mvn package, the project layout is:
find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
content of SimpleApp.java is :
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;

public class SimpleApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setMaster("spark://10.35.23.13:7077")
            .setAppName("My app")
            .set("spark.executor.memory", "1g");
        JavaSparkContext sc = new JavaSparkContext(conf);
        String logFile = "/home/ubuntu/spark-0.9.1/test_data";
        JavaRDD<String> logData = sc.textFile(logFile).cache();
        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();
        System.out.println("Lines with a: " + numAs);
    }
}
This program only works when the master is set with setMaster("local"). Otherwise I get this error:
$java -cp path_to_file/simple-project-1.0-allinone.jar SimpleApp
http://i.stack.imgur.com/doRSn.png
There's an anonymous class (that extends Function) in the SimpleApp.java file. This class is compiled to SimpleApp$1, which has to be shipped to each worker in the Spark cluster.
The simplest way to do that is to add the jar explicitly to the Spark context. Add something like sparkContext.addJar("path_to_file/simple-project-1.0-allinone.jar") after creating the JavaSparkContext, and rebuild your jar file. Then the main Spark program (called the driver program) will automatically deliver your application code to the cluster.
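As a sketch, the relevant part of SimpleApp.main() would then look like this; the jar path is the one from the question and has to match where your fat jar is actually built:
SparkConf conf = new SparkConf()
    .setMaster("spark://10.35.23.13:7077")
    .setAppName("My app")
    .set("spark.executor.memory", "1g");
JavaSparkContext sc = new JavaSparkContext(conf);

// Ship the fat jar (which contains the compiled SimpleApp$1) to the executors
// so they can deserialize the anonymous Function used in the filter below.
sc.addJar("path_to_file/simple-project-1.0-allinone.jar");

JavaRDD<String> logData = sc.textFile("/home/ubuntu/spark-0.9.1/test_data").cache();
long numAs = logData.filter(new Function<String, Boolean>() {
    public Boolean call(String s) { return s.contains("a"); }
}).count();
System.out.println("Lines with a: " + numAs);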
