How to access a file in windows using Spark and winutils? - apache-spark

I am running Spark on Windows using winutils.
In the Spark shell I am trying to load a CSV file, but it says the path does not exist, even though I have a file at E:/data.csv.
I am executing:
scala> val df = spark.read.option("header","true").csv("E:\\data.csv")
Error:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/E:/data.csv;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
I can't figure out why it is appending "/E:" when it should have been only E:.
How should I access the file?

In my case I am able to read the file as below:
val input = spark.sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ";")
  .option("quoteAll", "true")
  .option("inferSchema", "false")
  .load("C:/Work/test.csv")
  .toDF()

Related

How to read a local csv file in a data frame using Spark in a virtual environment ? (cdhserver : labuser) [duplicate]

I'm following the great spark tutorial
and at 46m:00s I'm trying to load the README.md, but it fails. What I'm doing is this:
$ sudo docker run -i -t -h sandbox sequenceiq/spark:1.1.0 /etc/bootstrap.sh -bash
bash-4.1# cd /usr/local/spark-1.1.0-bin-hadoop2.4
bash-4.1# ls README.md
README.md
bash-4.1# ./bin/spark-shell
scala> val f = sc.textFile("README.md")
14/12/04 12:11:14 INFO storage.MemoryStore: ensureFreeSpace(164073) called with curMem=0, maxMem=278302556
14/12/04 12:11:14 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 160.2 KB, free 265.3 MB)
f: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12
scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox:9000/user/root/README.md
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
how can I load that README.md?
Try explicitly specifying the scheme: sc.textFile("file:///path/to/the/file"). The error occurs when a Hadoop environment is configured.
SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.FileSystem.getDefaultUri if the scheme is absent. This method reads the "fs.defaultFS" parameter of the Hadoop conf. If you set the HADOOP_CONF_DIR environment variable, the parameter is usually set to "hdfs://..."; otherwise "file://".
gonbe's answer is excellent. But I still want to mention that file:/// points at the filesystem root (/), not at $SPARK_HOME. Hope this saves some time for newbs like me.
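To see which default filesystem your shell actually resolved, a quick check from the Scala shell (a sketch, not from the answers above):
// Prints something like "hdfs://sandbox:9000" when a Hadoop conf is picked up,
// or "file:///" when unqualified paths resolve locally.
sc.hadoopConfiguration.get("fs.defaultFS")
// An explicit scheme always bypasses the default:
val f = sc.textFile("file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md")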
If the file is located in your Spark master node (e.g., in case of using AWS EMR), then launch the spark-shell in local mode first.
$ spark-shell --master=local
scala> val df = spark.read.json("file:///usr/lib/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
Alternatively, you can first copy the file to HDFS from the local file system and then launch Spark in its default mode (e.g., YARN in case of using AWS EMR) to read the file directly.
$ hdfs dfs -mkdir -p /hdfs/spark/examples
$ hadoop fs -put /usr/lib/spark/examples/src/main/resources/people.json /hdfs/spark/examples
$ hadoop fs -ls /hdfs/spark/examples
Found 1 items
-rw-r--r-- 1 hadoop hadoop 73 2017-05-01 00:49 /hdfs/spark/examples/people.json
$ spark-shell
scala> val df = spark.read.json("/hdfs/spark/examples/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
While Spark supports loading files from the local filesystem, it requires that the files are available at the same path on all nodes in your cluster.
Some network filesystems, like NFS, AFS, and MapR’s NFS layer, are exposed to the user as a regular filesystem.
If your data is already in one of these systems, then you can use it as an input just by specifying a file:// path; Spark will handle it as long as the filesystem is mounted at the same path on each node. Every node needs to have the same path.
rdd = sc.textFile("file:///path/to/file")
If your file isn't already on all nodes in the cluster, you can load it locally on the driver without going through Spark, and then call parallelize to distribute the contents to the workers, as in the sketch below.
Take care to put file:// in front, and to use "/" or "\" according to your OS.
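A minimal Scala sketch of that driver-side pattern (the /path/to/file placeholder is from the line above):
import scala.io.Source
// Read the file on the driver only, then distribute the lines to the executors.
val lines = Source.fromFile("/path/to/file").getLines().toList
val rdd = sc.parallelize(lines)
rdd.count()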
Attention:
Make sure that you run Spark in local mode when you load data from the local filesystem (sc.textFile("file:///path to the file/")), or you will get an error like Caused by: java.io.FileNotFoundException: File file:/data/sparkjob/config2.properties does not exist.
This is because executors which run on different workers will not find this file in their local path.
You just need to specify the path of the file as "file:///directory/file".
example:
val textFile = sc.textFile("file:///usr/local/spark/README.md")
I have a file called NewsArticle.txt on my Desktop.
In Spark, I typed:
val textFile = sc.textFile("file:///C:/Users/582767/Desktop/NewsArticle.txt")
I needed to change all the \ characters to / in the file path.
To test if it worked, I typed:
textFile.foreach(println)
I'm running Windows 7 and I don't have Hadoop installed.
This has happened to me with Spark 2.3 with Hadoop also installed under the common "hadoop" user home directory. Since both Spark and Hadoop were installed under the same common directory, Spark by default considers the scheme to be hdfs, and starts looking for the input files under HDFS as specified by fs.defaultFS in Hadoop's core-site.xml. In such cases, we need to explicitly specify the scheme as file:///<absolute path to file>.
This has been discussed on the Spark mailing list; please refer to this mail.
You should use hadoop fs -put <localsrc> ... <dst> to copy the file into HDFS:
${HADOOP_COMMON_HOME}/bin/hadoop fs -put /path/to/README.md README.md
I tried the following and it worked from my local file system. Basically, Spark can read from local, HDFS, and AWS S3 paths.
listrdd = sc.textFile("file:///home/cloudera/Downloads/master-data/retail_db/products")
This is the solution for the error I was getting on a Spark cluster hosted in Azure on a Windows cluster:
Load the raw HVAC.csv file and parse it:
data = sc.textFile("wasb:///HdiSamples/SensorSampleData/hvac/HVAC.csv")
We use wasb:/// to allow Hadoop to access the Azure Blob Storage file; the three slashes are a relative reference to the running node's container folder.
For example: If the path for your file in File Explorer in Spark cluster dashboard is:
sflcc1\sflccspark1\HdiSamples\SensorSampleData\hvac
To describe the path: sflcc1 is the name of the storage account, and sflccspark is the cluster node name.
So we refer to the current cluster node name with the relative three slashes.
Hope this helps.
If you're trying to read a file from HDFS, try setting the path in SparkConf:
val conf = new SparkConf().setMaster("local[*]").setAppName("HDFSFileReader")
conf.set("fs.defaultFS", "hdfs://hostname:9000")
You do not have to use sc.textFile(...) to convert local files into DataFrames. One option is to read the local file line by line and then transform it into a Spark Dataset. Here is an example for a Windows machine in Java:
import java.io.*;
import java.util.*;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.*;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

import static org.apache.spark.sql.types.DataTypes.StringType;
import static org.apache.spark.sql.types.DataTypes.createStructField;

// Schema for the CSV columns.
StructType schemata = DataTypes.createStructType(
    new StructField[]{
        createStructField("COL1", StringType, false),
        createStructField("COL2", StringType, false),
        // ... further columns as needed
    }
);

String separator = ";";
String filePath = "C:\\work\\myProj\\myFile.csv";
SparkContext sparkContext = new SparkContext(new SparkConf().setAppName("MyApp").setMaster("local"));
JavaSparkContext jsc = new JavaSparkContext(sparkContext);
SQLContext sqlContext = SQLContext.getOrCreate(sparkContext);

// Read and split the file line by line on the driver.
List<String[]> result = new ArrayList<>();
try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] vals = line.split(separator);
        result.add(vals);
    }
} catch (Exception ex) {
    System.out.println(ex.getMessage());
    throw new RuntimeException(ex);
}

// Distribute the parsed rows and build a DataFrame from them.
JavaRDD<String[]> jRdd = jsc.parallelize(result);
JavaRDD<Row> jRowRdd = jRdd.map(RowFactory::create);
Dataset<Row> data = sqlContext.createDataFrame(jRowRdd, schemata);
Now you can use the DataFrame data in your code.
Reading a local file in Apache Spark. This worked for me:
var a = sc.textFile("/home/omkar/Documents/text_input").flatMap(line => line.split(" ")).map(word => (word, 1));
try
val f = sc.textFile("./README.md")

Reading file from s3 in pyspark using org.apache.hadoop:hadoop-aws

I am trying to read files from S3 using hadoop-aws. The command used to run the code is shown below.
Please help me resolve this and understand what I am doing wrong.
# run using command
# time spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.1 connect_s3_using_keys.py
from pyspark import SparkContext, SparkConf
import ConfigParser
import pyspark
# create Spark context with Spark configuration
conf = SparkConf().setAppName("Deepak_1ST_job")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
hadoop_conf = sc._jsc.hadoopConfiguration()
config = ConfigParser.ConfigParser()
config.read("/home/deepak/Desktop/secure/awsCred.cnf")
accessKeyId = config.get("aws_keys", "access_key")
secretAccessKey = config.get("aws_keys", "secret_key")
hadoop_conf.set(
"fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs3a.access.key", accessKeyId)
hadoop_conf.set("s3a.secret.key", secretAccessKey)
sqlContext = pyspark.SQLContext(sc)
df = sqlContext.read.json("s3a://bucket_name/logs/20191117log.json")
df.show()
EDIT 1:
As I am new to PySpark I am unaware of these dependencies, and the error is not easy to understand.
I am getting the error:
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 98, in deco
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.json.
: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:816)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:792)
at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:747)
at org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.
I had the same issue with spark 3.0.0 / hadoop 3.2.
What worked for me was to replace the hadoop-aws-3.2.1.jar in spark-3.0.0-bin-hadoop3.2/jars with hadoop-aws-3.2.0.jar found here: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.2.0
Check your Spark Guava jar version. If you download Spark from Amazon like me, from the link (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz) in their documentation, you can see that the included Guava version is guava-14.0.1.jar while their container is using guava-21.0.jar.
I have reported the issue to them and they will repack their Spark to include the correct version. If you are interested in the bug itself, here is the link: https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#ClassNotFoundException:_org.apache.hadoop.fs.s3a.S3AFileSystem
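For reference, the standard S3A property names are fs.s3a.access.key and fs.s3a.secret.key; the question's fs3a.access.key and s3a.secret.key keys would simply be ignored. A minimal sketch of that configuration, shown in the Scala shell for brevity (the bucket path is the question's own; credentials pulled from environment variables are a hypothetical placeholder):
// Hypothetical placeholders: read credentials from the environment.
val accessKeyId = sys.env("AWS_ACCESS_KEY_ID")
val secretAccessKey = sys.env("AWS_SECRET_ACCESS_KEY")
// Standard Hadoop S3A credential properties.
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretAccessKey)
val df = spark.read.json("s3a://bucket_name/logs/20191117log.json")
df.show()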

Cannot get a hello world running in Apache Spark on Ubuntu

I am trying to setup Apache Spark on Ubuntu and output hello world.
I added these two lines to .bashrc
export SPARK_HOME=/home/james/spark
export PATH=$PATH:$SPARK_HOME/bin
(Where spark is a symbolic link to the current spark version I have).
I run
spark-shell
From the command line and the shell starts up with hundreds of errors, for example:
Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test
connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app)
Caused by: ERROR XJ041: Failed to create database 'metastore_db', see the next exception for details.
Caused by: ERROR XBM0A: The database directory '/home/james/metastore_db' exists. However, it does not contain the expected 'service.properties' file. Perhaps Derby was brought down in the middle of creating this database. You may want to delete this directory and try creating the database again.
Caused by: java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
Caused by: org.apache.derby.iapi.error.StandardException: Failed to create database 'metastore_db', see the next exception for details.
My Java version:
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
I try to open a simple text file with scala:
scala> val textFile = sc.TextFile("test/README.md")
<console>:17: error: not found: value sc
I read I need to create a new SparkContext. So I try this:
scala> sc = SparkContext(appName = "foo")
<console>:19: error: not found: value sc
val $ires1 = sc
^
<console>:17: error: not found: value sc
sc = SparkContext(appName = "foo")
And this:
scala> val sc = new SparkContext(conf)
<console>:17: error: not found: type SparkContext
val sc = new SparkContext(conf)
^
<console>:17: error: not found: value conf
val sc = new SparkContext(conf)
I have no idea what is going on. All I need is the correct set up then there should be no problems.
Thank you in advance.
Try to delete the metastore_db folder and the derby.log file. I had the same problem and deleting them fixed it. I'd assume there were some corrupted files in the db folder due to Spark not closing properly.
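Separately, once the shell starts cleanly, sc is created for you. If you are writing a standalone program instead, a minimal Scala sketch of creating the context yourself (the Python-style sc = SparkContext(appName = "foo") from the question is not valid Scala):
import org.apache.spark.{SparkConf, SparkContext}
// Scala syntax: build a SparkConf, then pass it to the SparkContext constructor.
val conf = new SparkConf().setAppName("foo").setMaster("local[*]")
val sc = new SparkContext(conf)
val textFile = sc.textFile("test/README.md")   // note: textFile, not TextFile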

SaveAsTextFile() results in Mkdirs failed - Apache Spark

Hello, I currently have a Spark cluster with 3 worker nodes. I also have an NFS server mounted on /var/nfs with 777 permissions for testing. I'm trying to run the following code to count the words in a text:
root#master:/home/usuario# MASTER="spark://10.0.0.1:7077" spark-shell
val inputFile = sc.textFile("/var/nfs/texto.txt")
val counts = inputFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.toDebugString
counts.cache()
counts.count()
counts.saveAsTextFile("/home/usuario/output");
But spark gives me the following error:
Caused by: java.io.IOException: Mkdirs failed to create
file:/var/nfs/output-4/_temporary/0/_temporary/attempt_20170614094558_0007_m_000000_20
(exists=false, cwd=file:/opt/spark/work/app-20170614093824-0005/2)
I have searched for many websites but I can not find the solution for my case. All help is grateful.
When you start spark-shell with MASTER set to a valid application-master URL (and not local[*]), Spark treats unqualified paths as HDFS paths and performs I/O only on the underlying HDFS, not on the local filesystem.
You have mounted the locations in the local filesystem, and those paths do not exist in HDFS.
That's why the error says exists=false.
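A hedged illustration of making the target filesystem explicit (whether the file:// variant works still depends on /var/nfs being mounted, with write permission, at the same path on every worker):
// Explicit local-file scheme for the NFS mount:
counts.saveAsTextFile("file:///var/nfs/output")
// ...or write to HDFS instead (hypothetical HDFS path):
counts.saveAsTextFile("hdfs:///user/usuario/output")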
Same issue with me. Check ownership of your directory again.
sudo chown -R owner-user:owner-group directory
