Unable to read file in spark program from local directory - apache-spark

I am unable to read a local CSV file in my Spark program. I am using the PyCharm IDE. I can read the file when I pass its path as a positional argument (sys.argv), but not when I hard-code the file location. Can someone please help?
Code:
# Processing logic here...
flightTimeCsvDF = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load("data/flight*.csv")
    # .load(sys.argv[1])  # passing the path as a positional argument works
Error:
Exception in thread "globPath-ForkJoinPool-1-worker-1" java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1218)
at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1423)
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
at org.apache.hadoop.fs.Globber.listStatus(Globber.java:128)

Please use the absolute path. From the image attached, I believe using the following will help solve the issue.
.load("C:\\Users\\psultania\\Anaconda3\\envs\\04-SparkSchemaDemo\\data\\flight*.csv")
If you are using different directories for input CSVs, please change the directory definition accordingly.
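If you prefer not to hard-code the machine-specific path, here is a minimal sketch (assuming the code runs as a script, the data folder sits next to it, and spark is your existing SparkSession) that resolves the directory to an absolute path before handing it to Spark:

from pathlib import Path

# Build an absolute path to the data directory relative to this script's location
data_dir = (Path(__file__).parent / "data").resolve()

flightTimeCsvDF = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load((data_dir / "flight*.csv").as_posix())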

Yes, it works using the absolute path.

Related

unable to read configfile using Configparser in Databricks

I want to read some values as parameters using configparser in Databricks.
I can import the configparser module in Databricks, but I am unable to read the parameters from the config file; it fails with a KeyError.
The problem is that your file is located on DBFS (the /FileStore/... path), and this file system isn't understood by configparser, which works with the "local" file system. To get this working, you need to prepend the /dbfs prefix to the file path: /dbfs/FileStore/....
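As a rough sketch (the file name and the section/key names here are made up for illustration), reading a DBFS-hosted config file through the /dbfs mount looks like this:

import configparser

config = configparser.ConfigParser()
# The /dbfs prefix exposes the DBFS path to ordinary local-file APIs
config.read("/dbfs/FileStore/my_config.ini")
value = config["my_section"]["my_key"]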
P.S. It may not work on Community Edition with DBR 7.x. In that case, just copy the config file to the local file system before reading it, using dbutils.fs.cp, like this:
dbutils.fs.cp("/FileStore/...", "file:///tmp/config.ini")
config.read("/tmp/config.ini")

spark-submit overriding default application.conf not working

I am building a jar which has application.conf under the src/main/resources folder. However, I am trying to override it when doing spark-submit, and it's not working.
The following is my command:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--files application.conf \
$sandbox_jar flagFile/test.FLG \
--conf "spark.executor.extraClassPath=-Dconfig.file=application.conf"
application.conf is located in the same directory as my jar file.
-Dconfig.file=path/to/config-file may not work due to the internal cache in ConfigFactory. The documentation suggests running ConfigFactory.invalidateCaches().
Another way is the following, which merges the supplied properties with the existing ones:
ConfigFactory.invalidateCaches()
val c = ConfigFactory.parseFile(new File(path-to-file + "/" + "application.conf"))
val config : Config = c.withFallback(ConfigFactory.load()).resolve
I think the best way to override the properties is to supply them using -D. Typesafe Config gives the highest priority to system properties, so -D will override both reference.conf and application.conf.
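For example, a sketch based on the command from the question (the shell variables are the asker's; adjust as needed), passing the -D system property to both driver and executors through the extraJavaOptions settings:

$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--files application.conf \
--conf "spark.driver.extraJavaOptions=-Dconfig.file=application.conf" \
--conf "spark.executor.extraJavaOptions=-Dconfig.file=application.conf" \
$sandbox_jar flagFile/test.FLG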
Considering that application.conf is a properties file, there is another option that can serve the same purpose.
Packaging the properties file inside the jar may not provide much flexibility; keeping it separate from the jar does, because whenever a property changes you just replace the properties file instead of building and deploying the whole jar.
This can be achieved by keeping your properties in a properties file and prefixing each property key with "spark.", for example:
spark.inputpath /input/path
spark.outputpath /output/path
The spark-submit command would then look like:
$spark_submit $spark_params $hbase_params \
--class com.abc.xyz.MYClass \
--properties-file application.conf \
$sandbox_jar flagFile/test.FLG
You can then read the properties in code like this:
sc.getConf.get("spark.inputpath") // /input/path
sc.getConf.get("spark.outputpath") // /output/path
This may not necessarily solve your problem, but it offers another approach that could work.

How to get the SparkSession to find added python files

After running pip install BigDL==0.8.0, running from bigdl.util.common import * in plain Python completed without issue.
However, with either of the following SparkSessions:
spark = (SparkSession.builder.master('yarn')
    .appName('test')
    .config("spark.jars", "/BigDL/spark/dl/target/bigdl-0.8.0-jar-with-dependencies-and-spark.jar")
    .config('spark.submit.pyFiles', '/BigDL/pyspark/bigdl/util.zip')
    .getOrCreate()
)
or
spark = (SparkSession.builder.master('local')
    .appName('test')
    .config("spark.jars", "/BigDL/spark/dl/target/bigdl-0.8.0-jar-with-dependencies-and-spark.jar")
    .config('spark.submit.pyFiles', '/BigDL/pyspark/bigdl/util.zip')
    .getOrCreate()
)
I get the following error.
ImportError: ('No module named bigdl.util.common', <function subimport at 0x7fd442a36aa0>, ('bigdl.util.common',))
In addition to the spark.submit.pyFiles config above, after the SparkSession successfully starts I have tried spark.sparkContext.addPyFile("util.zip"), where "util.zip" contains all of the Python files in https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl/util .
I have also zipped all of the contents of https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl (branch-0.8) and pointed to that file in .config('spark.submit.pyFiles', '/path/to/bigdl.zip'), but this also does not work.
How do I get the SparkSession to see these files?
Figured it out. The only thing that worked was spark.sparkContext.addPyFile("bigdl.zip") after the SparkSession has started, where "bigdl.zip" contains all of the files in https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl (branch-0.8).
Not sure why .config('spark.submit.pyFiles', 'bigdl.zip') would not work.
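As a minimal sketch of the approach that worked (the path to bigdl.zip is hypothetical; point it at your own archive of the pyspark/bigdl directory):

from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
    .appName('test')
    .getOrCreate()
)

# addPyFile ships the archive to the executors and also makes it importable
# on the driver, so the import below resolves after this call.
spark.sparkContext.addPyFile("/path/to/bigdl.zip")

from bigdl.util.common import *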

pySpark local mode - loading text file with file:/// vs relative path

I am just getting started with spark and I am trying out examples in local mode...
I noticed that in some examples, when creating the RDD, the relative path to the file is used, and in others the path starts with "file:///". The second option did not work for me at all: "Input path does not exist".
Can anyone explain the difference between using the plain file path and putting "file:///" in front of it?
I am using Spark 2.2 on a Mac, running in local mode.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("test")
sc = SparkContext(conf = conf)
# This works when providing the relative path
lines = sc.textFile("code/test.csv")
# This does not work
lines = sc.textFile("file:///code/test.csv")
sc.textFile("code/test.csv") means test.csv in /<hive.metastore.warehouse.dir>/code/test.csv on HDFS.
sc.textFile("hdfs:///<hive.metastore.warehouse.dir>/code/test.csv") is equal to above.
sc.textFile("file:///code/test.csv") means test.csv in /code/test.csv on local file system.

MLlib not saving the model data in Spark 2.1

We have a machine learning model that looks roughly like this:
sc = SparkContext(appName = "MLModel")
sqlCtx = SQLContext(sc)
df = sqlCtx.createDataFrame(data_res_promo)
# where data_res_promo comes from a pandas dataframe
indexer = StringIndexer(inputCol="Fecha_Code", outputCol="Fecha_Index")
train_indexer = indexer.fit(df)
train_indexer.save('ALSIndexer') #This saves the indexer architecture
In my machine, when I run it as a local, it generates a folder ALSIndexer/ that has the parquet and all the information on the model.
When I run it on our Azure Spark cluster, it does not generate the folder on the main node (nor on the workers). However, if we try to rewrite it, it says:
cannot overwrite folder
Which means it is somewhere, but we can't find it.
Would you have any pointers?
Spark will by default save files to the distributed file system (probably HDFS). The files will therefore not be visible on the nodes themselves, but since they do exist, you get the "cannot overwrite folder" error message.
You can easily access the files through HDFS and copy them to the main node. This can be done on the command line with either of these commands:
1. hadoop fs -get <HDFS file path> <Local system directory path>
2. hadoop fs -copyToLocal <HDFS file path> <Local system directory path>
It can also be done programmatically by importing org.apache.hadoop.fs.FileSystem and using the methods available there.
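Following up on that last point, a rough PySpark sketch (sc._jvm and sc._jsc are PySpark internals, and the destination path is just an example) that uses the Hadoop FileSystem API to pull the saved ALSIndexer folder down to the driver's local disk:

from pyspark import SparkContext

sc = SparkContext(appName="MLModel")

# Reach the Hadoop FileSystem API through PySpark's JVM gateway and copy the
# model folder from the default filesystem (HDFS) to the driver's local disk.
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.copyToLocalFile(Path("ALSIndexer"), Path("/tmp/ALSIndexer"))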
