Installation of graphframes package in an offline Spark cluster - apache-spark

I have an offline pyspark cluster (no internet access) where I need to install the graphframes library.
I have manually downloaded the jar from here, added it to $SPARK_HOME/jars/, and when I try to use it I get the following error:
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access term typesafe in package com,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access term scalalogging in value com.typesafe,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.typesafe.
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access type LazyLogging in value com.slf4j,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.slf4j.
What is the correct way to install it offline with all its dependencies?

I managed to install the graphframes library. First of all, I found the graphframes dependencies, which were:
scala-logging-api_xx-xx.jar
scala-logging-slf4j_xx-xx.jar
where xx stands for the proper Scala and jar versions. Then I installed them in the proper path. Because I work on a Cloudera machine, the proper path is:
/opt/cloudera/parcels/SPARK2/lib/spark2/jars/
If you cannot place them in this directory on your cluster (because you have no root rights and your admin is super lazy), you can simply add them to your spark-submit/spark-shell invocation:
spark-submit ..... --driver-class-path /path-for-jar/ \
--jars /../graphframes-0.5.0-spark2.1-s_2.11.jar,/../scala-logging-slf4j_2.10-2.1.2.jar,/../scala-logging-api_2.10-2.1.2.jar
This works for Scala. In order to use graphframes for Python, you need to
download the graphframes jar and then, from a shell:
#Extract JAR content
jar xf graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar
#Enter the folder
cd graphframes
#Zip the contents
zip graphframes.zip -r *
And then add the zipped file to your Python path in spark-env.sh or your bash_profile
with
export PYTHONPATH=$PYTHONPATH:/..proper path/graphframes.zip:.
Then, when opening the shell or submitting (again with the same arguments as with Scala), importing graphframes works normally.
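As a quick sanity check, a minimal sketch of what you could run in the pyspark shell once the jars are on the classpath and graphframes.zip is on PYTHONPATH (the vertex/edge data here is made up for illustration):

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()

# tiny vertex and edge DataFrames just to confirm the library loads and runs
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.edges.show()

If the import and the GraphFrame construction succeed without the Logging.class errors above, the dependencies are wired up correctly.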
This link was extremely useful for this solution.

Related

How to change the kedro configuration environment in jupyter notebook?

I want to run a Kedro pipeline in the base env using a Jupyter notebook. I do this in the following way:
%reload_kedro --env=base
session.run(pipeline_name='dpfm1')
Doing this, the %reload_kedro command raises the following error:
RuntimeError: Could not find the project configuration file 'pyproject.toml' in --env=base. If you have created
your project with Kedro version <0.17.0, make sure to update your project template. See
https://github.com/kedro-org/kedro/blob/main/RELEASE.md#migration-guide-from-kedro-016-to-kedro-0170 for how to
migrate your Kedro project.
However, I have installed kedro version 0.18.2:
>>>!kedro --version
kedro, version 0.18.2
What's the matter here?
@ilja This is mentioned in the RELEASE.md: if you have an old Kedro project, i.e. 0.16.x, there is no pyproject.toml file.
You may have Kedro 0.18.2 installed, but if it is an old project, there are some migration steps you need to take, which are included in the RELEASE.md.
If it is a new project, it's likely that you are not providing the right path argument; Kedro needs to find the pyproject.toml for certain metadata and to determine where the project root is.
P.S. %reload_kedro path --env --extra_params is only supported since 0.18.3; previously it did not support any argument other than path, so you may need to upgrade your Kedro version.
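For illustration, on Kedro >= 0.18.3 the notebook cells could look like this (the project path is a hypothetical placeholder; point it at the directory that contains pyproject.toml):

# hypothetical project root; replace with the directory that holds pyproject.toml
%reload_kedro /path/to/my-kedro-project --env=base

# then run the pipeline through the session created by the magic
session.run(pipeline_name='dpfm1')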

Why does "spark-shell --jars" with GraphFrames jar give "error: missing or invalid dependency detected while loading class file 'Logging.class'"?

I ran the command spark-shell --jars /home/krishnamahi/graphframes-0.4.0-spark2.1-s_2.11.jar and it threw an error:
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access term typesafe in package com,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access term scalalogging in value com.typesafe,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.typesafe.
error: missing or invalid dependency detected while loading class file 'Logging.class'.
Could not access type LazyLogging in value com.slf4j,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
A full rebuild may help if 'Logging.class' was compiled against an incompatible version of com.slf4j.
I am using Spark version 2.1.1, Scala version 2.11.8, JDK version 1.8.0_131, CentOS 7 64-bit, Hadoop 2.8.0. Can anyone please tell me what additional command I should give to run the program properly? Thanks in advance.
If you want to play with GraphFrames, use the --packages command-line option of spark-shell instead.
--packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version.
For graphframes-0.4.0-spark2.1-s_2.11.jar that'd be as follows:
$SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.4.0-spark2.1-s_2.11
which I copied verbatim from the How-To section of the GraphFrames project.
That way you don't have to search for all the (transitive) dependencies of the GraphFrames library, as Spark will do it for you automatically.
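If you launch from Python rather than spark-shell, the same coordinates can go through the spark.jars.packages setting. A minimal sketch, assuming internet access (or an already-populated local Ivy/Maven cache) so Spark can resolve the package:

from pyspark.sql import SparkSession

# Spark resolves the package and its transitive dependencies when the session starts
spark = (SparkSession.builder
         .config("spark.jars.packages", "graphframes:graphframes:0.4.0-spark2.1-s_2.11")
         .getOrCreate())

from graphframes import GraphFrame  # only importable once the jars are on the classpath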
I installed raw Hadoop, with all components (Hive, Pig, Spark) at their latest versions, and it then worked for me. I used CentOS 7. The order for installing Hadoop with its components is:
Anaconda3/Python3 (Since Spark 2.x doesn't support Python 2)
Hadoop
Pig
Hive
Hbase
Spark
All the components should be installed in a single go, in the same terminal. After the Spark installation, restart the system.

unresolved dependency: com.eed3si9n#sbt-assembly;0.13.0: not found

Did lots of searching, saw many people having a similar issue, and tried various suggested solutions. None worked.
Can someone help me?
resolvers += Resolver.url("bintray-sbt-plugins", url("http://dl.bintray.com/sbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
The file is inside the project folder.
Instead of version 0.13.0, I used version 0.14.0.
I fixed this by adding the POM file, which I downloaded from
https://dl.bintray.com/sbt/sbt-plugin-releases/com.eed3si9n/sbt-assembly/scala_2.10/sbt_0.13/0.14.4/ivys/
to my local ivy folder under .ivy/local (if not present, create the local folder).
Once it was there, I ran the build and it downloaded the jar.
You need to add a [root_dir]/project/plugins.sbt file with the following content:
// packager
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
Even better: don't use sbt-assembly at all! Flat jars cause conflicts during merging, which need to be resolved with assemblyMergeStrategy.
Use the binary distribution format plugin that sbt offers, which enables you to distribute as a binary script, dmg, msi and tar.gz.
Check out sbt-native-packager.

Unresolved dependency when assembly Spark 1.2.0

I'm trying to build Spark 1.2.0 on Ubuntu, but I'm getting dependency issues.
I basically download the files, extract the folder and run sbt/sbt assembly
sbt = 0.13.6
scala = 2.10.4
sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.2.0: configuration not public in org.apache.spark#spark-network-common_2.10;1.2.0: 'test'. It was required from org.apache.spark#spark-network-shuffle_2.10;1.2.0 test
This sbt issue seems to explain it: this would be a consequence of trying to get a test->test dependency when the same version has been resolved out of a public Maven repository.
A workaround would be using git SHA versioning or SNAPSHOT for non-final builds of that test dependency, but we won't know more unless we get an idea of how you got into a 'bad' ivy cache state.
TL;DR: try clearing your cache of Spark artefacts before building.
Edit: this is fixed in sbt 0.13.10-RC1 (https://github.com/sbt/sbt/pull/2345). Please update.

Adding dependencies from a single file, without composer.json

I am struggling with a wrong usage of Composer, for sure.
I set up this repository: https://github.com/alle/assets-merger
I forked the project and was just trying to make it a kohana-module, including all the dependencies.
Since it would need the YUI Compressor JAR, I was trying to make just that JAR file a dependency, and I ended up declaring it in the composer.json file (please look at this).
Once I need to add my new package to a project, I add it in the require section as follows:
...
"alle/assets-merger": "dev-master",
...
But the (latest) composer update command says:
Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.
Problem 1
- Installation request for alle/assets-merger dev-develop -> satisfiable by alle/assets-merger[dev-develop].
- alle/assets-merger dev-develop requires yui/yuicompressor 2.4.8 -> no matching package found.
Potential causes:
- A typo in the package name
- The package is not available in a stable-enough version according to your minimum-stability setting see <https://groups.google.com/d/topic/composer-dev/_g3ASeIFlrc/discussion> for more details.
And my story ends here.
How should I configure my composer.json in the https://github.com/alle/assets-merger repository, in order to include it as a fully satisfied kohana-module in other projects?
Some things I notice in your composer.json:
There is a version of that CSS minifier available on Packagist which says it is just a copy of the original Google-Code-hosted files, but with Composer: natxet/cssmin. It is version 3.0.2, but I think that shouldn't make a difference.
mrclay/minify is included twice in the packages with the same version. It is also available on Packagist. You will probably already use that (version 2.2.0 is registered, and because you didn't turn off Packagist access, it will be generally available for install unless a version requirement or conflict prevents it).
You are trying to download a JAR file (which is a Java executable without any PHP), but try to get PHP classmaps out of it. That will fail for sure.
You did miss the big note in the Composer documentation saying that Composer cannot resolve repositories mentioned in sub-packages, only in the root package. That means that whatever you mention in your alle/assets-merger package will not be used if you use that package anywhere else. You'd have to duplicate these repositories in every package, in addition to adding the package name itself as "required".
What this means is that you probably avoided missing mrclay/minify because it is available on Packagist, and you might as well have added the cssmin by accident, but you definitely did not add YUICompressor.
But you shouldn't add this in the first place, because it is not PHP software. You can, however, add post-install commands to your projects. All your Composer integration does is download the JAR file; you can do that with a post-install or post-update command. See the documentation here.