MLlib not saving the model data in Spark 2.1 - apache-spark

We have a machine learning model that looks roughly like this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import StringIndexer

sc = SparkContext(appName="MLModel")
sqlCtx = SQLContext(sc)
df = sqlCtx.createDataFrame(data_res_promo)
# where data_res_promo comes from a pandas dataframe
indexer = StringIndexer(inputCol="Fecha_Code", outputCol="Fecha_Index")
train_indexer = indexer.fit(df)
train_indexer.save('ALSIndexer')  # This saves the indexer architecture
On my machine, when I run it in local mode, it generates a folder ALSIndexer/ that contains the parquet files and all the information on the model.
When I run it on our Azure Spark cluster, it does not generate the folder on the main node (nor on the slaves). However, if we try to rewrite it, it says:
cannot overwrite folder
Which means it is somewhere, but we can't find it.
Would you have any pointers?

Spark saves files to the distributed filesystem by default (probably HDFS in your case). The files are therefore not visible on the nodes themselves, but since they do exist there, you get the "cannot overwrite folder" error message.
You can easily access the files through HDFS and copy them to the main node. This can be done on the command line with either of these commands:
1. hadoop fs -get <HDFS file path> <Local system directory path>
2. hadoop fs -copyToLocal <HDFS file path> <Local system directory path>
It can also be done by importing org.apache.hadoop.fs.FileSystem and using the methods available there.
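For example, a minimal PySpark sketch of that FileSystem approach, reusing the SparkContext sc from the question and assuming /tmp/ALSIndexer is an acceptable local destination:
hadoop_fs = sc._jvm.org.apache.hadoop.fs
fs = hadoop_fs.FileSystem.get(sc._jsc.hadoopConfiguration())
# copyToLocalFile mirrors `hadoop fs -copyToLocal`: HDFS source, local destination
fs.copyToLocalFile(hadoop_fs.Path("ALSIndexer"), hadoop_fs.Path("file:///tmp/ALSIndexer"))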

Related

unable to read configfile using Configparser in Databricks

I want to read some values as parameters using configparser in Databricks.
I can import the configparser module in Databricks, but I am unable to read the parameters from the config file; it fails with a KeyError.
Please check the screenshot below (showing the error and the config file).
The problem is that your file is located on DBFS (the /FileStore/...), and this file system isn't understood by configparser, which works with the "local" file system. To get this working, you need to prepend the /dbfs prefix to the file path: /dbfs/FileStore/....
P.S. This may not work on Community Edition with DBR 7.x. In that case, just copy the config file to the local file system before reading it, using dbutils.fs.cp, like this:
dbutils.fs.cp("/FileStore/...", "file:///tmp/config.ini")
config.read("/tmp/config.ini")

How to handle reading non-existing file in spark

I am trying to read some files from HDFS using Spark's sc.wholeTextFiles. I pass a list of the required files, yet the job keeps throwing
py4j.protocol.Py4JJavaError: An error occurred while calling o98.showString.
: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist:
if one of the files doesn't exist.
How can I skip the files that are not found and read only the ones that exist?
To know if a file exists (and delete it, in my case) I do the following:
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
if (fs.exists(new Path(fullPath))) {
  println("Output directory already exists. Deleting it...")
  fs.delete(new Path(fullPath), true)
}
Use the JVM Hadoop FileSystem from the Spark context to check whether a file exists:
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/file.csv"))
For example:
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("test.csv"))
True
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("fail_test.csv"))
False
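To address the original question directly, one option (a sketch with made-up paths) is to filter the list of paths down to the ones that exist before calling wholeTextFiles:
paths = ["hdfs:///data/a.txt", "hdfs:///data/b.txt", "hdfs:///data/missing.txt"]

jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
existing = [p for p in paths if fs.exists(jvm.org.apache.hadoop.fs.Path(p))]

# wholeTextFiles accepts a comma-separated list of paths
rdd = sc.wholeTextFiles(",".join(existing))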

Cassandra COPY FROM file pattern gives error

My Cassandra version: 2.0.17.
I am following this post https://www.datastax.com/dev/blog/new-features-in-cqlsh-copy to copy all of my CSV files placed in a folder to a Cassandra table, but it shows me an error saying No such file or directory.
When I try to copy individual files using the command below, it works very well:
COPY table FROM '/home/folder1/a.csv' WITH DELIMITER=',' AND HEADER=FALSE;
There are multiple CSV files in the /home/folder1 location, so I tried to copy all of the CSV files in a single go using the command below:
COPY table FROM '/home/folder1/*.csv' WITH DELIMITER=',' AND HEADER=FALSE;
When I run the above command it gives me the error below:
Can't open '/home/folder1/*.csv' for reading: [Errno 2] No such file or directory: '/home/folder1/*.csv'
Please help me solve this issue.
The blog post says:
We will review these new features in this post; they will be available in the following cassandra releases: 2.1.13, 2.2.5, 3.0.3 and 3.2.
So 2.0.17 doesn't have this functionality. If you want to load all .csv files from a directory, just use:
for i in /home/folder1/*.csv ; do
echo "COPY table FROM '$i' WITH DELIMITER=',' AND HEADER=FALSE;"|cqlsh -f -
done

pySpark local mode - loading text file with file:/// vs relative path

I am just getting started with Spark and I am trying out examples in local mode.
I noticed that in some examples, when creating the RDD, the relative path to the file is used, while in others the path starts with "file:///". The second option did not work for me at all: "Input path does not exist".
Can anyone explain the difference between using the plain file path and putting 'file:///' in front of it?
I am using Spark 2.2 on a Mac, running in local mode.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("test")
sc = SparkContext(conf = conf)
#This will work providing the relative path
lines = sc.textFile("code/test.csv")
#This will not work
lines = sc.textFile("file:///code/test.csv")
sc.textFile("code/test.csv") means test.csv in /<hive.metastore.warehouse.dir>/code/test.csv on HDFS.
sc.textFile("hdfs:///<hive.metastore.warehouse.dir>/code/test.csv") is equal to above.
sc.textFile("file:///code/test.csv") means test.csv in /code/test.csv on local file system.

Pyspark - FileInputDStream: Error finding new files

Hi, I'm new to Python Spark and I'm trying out this example from the Spark GitHub repository in order to count words in new text files created in the given directory:
from __future__ import print_function  # needed for print(..., file=...) on Python 2

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: hdfs_wordcount.py <directory>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingHDFSWordCount")
    ssc = StreamingContext(sc, 1)

    lines = ssc.textFileStream("hdfs:///home/my-logs/")
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda x: (x, 1))\
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
And this is what I get:
a warning message saying: WARN FileInputDStream: Error finding new files
and I get empty results even though I'm adding files to this directory :/
Any suggested solution for this?
Thanks.
The issue is that Spark Streaming will not read old files from the directory, since all of the log files existed before your streaming job started.
So what you need to do is: once you have started your streaming job, put/copy the input files into the HDFS directory, either manually or with a script, as in the sketch below.
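A sketch of doing that copy from a PySpark driver using the Hadoop FileSystem API (the hdfs:///home/my-logs/ target comes from the question; the local file name is made up):
hadoop_fs = sc._jvm.org.apache.hadoop.fs
fs = hadoop_fs.FileSystem.get(sc._jsc.hadoopConfiguration())
# copyFromLocalFile mirrors `hdfs dfs -put`: local source, HDFS destination
fs.copyFromLocalFile(hadoop_fs.Path("file:///tmp/new_log.txt"), hadoop_fs.Path("hdfs:///home/my-logs/new_log.txt"))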
I think you are referring to this example. Are you able to run it without modifying it? I see you are setting the directory to "hdfs:///" in the program. You can run the example like below.
For example, if Spark is at /opt/spark-2.0.2-bin-hadoop2.7, you can run hdfs_wordcount.py from the examples directory as below. We use /tmp as the directory to pass as the argument to the program.
user1@user1:/opt/spark-2.0.2-bin-hadoop2.7$ bin/spark-submit examples/src/main/python/streaming/hdfs_wordcount.py /tmp
Now, while this program is running, open another terminal and copy some file to the /tmp folder:
user1@user1:~$ cp test.txt /tmp
You will see the word count in the first terminal.
Solved!
The issue was the build. I used to build with Maven like this, following the README file from GitHub:
build/mvn -DskipTests clean package
Instead, I built it this way, following their documentation:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Does anyone know what those parameters are?
