I need to run Spark on my local machine and access Azure wasb and adl URLs, but I can't get it to work. I have a stripped-down example here:
maven pom.xml (Brand-new pom, only the dependencies have been set):
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.8.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure-datalake</artifactId>
        <version>3.1.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-storage</artifactId>
        <version>6.0.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-data-lake-store-sdk</artifactId>
        <version>2.2.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure</artifactId>
        <version>3.1.0</version>
    </dependency>
    <dependency>
        <groupId>com.microsoft.azure</groupId>
        <artifactId>azure-storage</artifactId>
        <version>7.0.0</version>
    </dependency>
Java code (it doesn't need to be Java; it could be Scala):
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SparkSession;
public class App {
    public static void main(String[] args) {
        SparkConf config = new SparkConf();
        config.setMaster("local");
        config.setAppName("app");
        SparkSession spark = new SparkSession(new SparkContext(config));
        spark.read().parquet("wasb://container@host/path");
        spark.read().parquet("adl://host/path");
    }
}
No matter what I try I get:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: wasb
Same for adl. Every doc I can find on this either just says to add the azure-storage dependency, which I have done, or says to use HDInsight.
Any thoughts?
I figured this out and decided to post a working project since that is always what I look for. It is hosted here:
azure-spark-local-sample
The crux of it, though, is as @Shankar Koirala suggested:
For WASB, set the property to allow the url scheme to be recognized:
config.set("spark.hadoop.fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
Then set the property which authorizes access to the account. You will need one of these for each account you need to access. These are generated through the Azure Portal under the Access Keys section of the Storage Account blade.
config.set("fs.azure.account.key.[storage-account-name].blob.core.windows.net", "[access-key]");
Now for adl, assign the fs scheme as with WASB:
config.set("spark.hadoop.fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem");
// I don't know why this would be needed, but I saw it
// on an otherwise very helpful page . . .
config.set("spark.fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl");
. . . and finally, set the client access keys in these properties, again for each different account you need to access:
config.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential");
/* Client ID is generally the application ID from the azure portal app registrations*/
config.set("fs.adl.oauth2.client.id", "[client-id]");
/*The client secret is the key generated through the portal*/
config.set("fs.adl.oauth2.credential", "[client-secret]");
/*This is the OAUTH 2.0 TOKEN ENDPOINT under the ENDPOINTS section of the app registrations under Azure Active Directory*/
config.set("fs.adl.oauth2.refresh.url", "[oauth-2.0-token-endpoint]");
I hope this is helpful, and I wish I could give credit to Shankar for the answer, but I also wanted to get the exact details out there.
I am not sure about adl since I haven't tested it, but for wasb you need to define the file system to be used in the underlying Hadoop configuration.
Since you are using Spark 2.3 you can use SparkSession to create an entry point:
val spark = SparkSession.builder().appName("read from azure storage").master("local[*]").getOrCreate()
Now define the file system
spark.sparkContext.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey ")
Now read the parquet file as
val baseDir = "wasb[s]://BlobStorageContainer@yourUser.blob.core.windows.net/"
val dfParquet = spark.read.parquet(baseDir + "pathToParquetFile")
Hope this helps!
Related
Does anyone have a working example of Stork DNS service discovery in Quarkus? I see lots of examples using Consul but not DNS, and I cannot get the properties or injected beans to work when using DNS.
@ApplicationScoped
@RegisterRestClient(baseUri = "stork://auth-service/")
@RegisterProvider(OidcClientRequestFilter.class)
public interface KeycloakService {
I placed the following in my properties file but I get a compiler warning saying it is being ignored.
quarkus.stork.auth-service.service-discovery.type=dns
and my POM
<dependency>
    <groupId>io.smallrye.stork</groupId>
    <artifactId>stork-service-discovery-dns</artifactId>
    <version>1.2.0</version>
</dependency>
The error I am getting is
Caused by: java.lang.IllegalArgumentException: The value of URL was invalid stork://auth-service
Hi, I'm using the azure-spring-boot-sample-active-directory 03-resource-server example to validate an access token (coming from a Vue.js application) in a Spring Boot application.
But I'm getting a 401 response every time when using Postman, with no body in the response.
What might be the issue? I've been stuck on this for the last few days. Please help.
Configuration:
@EnableWebSecurity
@Order(SecurityProperties.BASIC_AUTH_ORDER)
public class WebSecurityConfiguration extends WebSecurityConfigurerAdapter {
    @Override
    protected void configure(HttpSecurity http) throws Exception {
        // @formatter:off
        http.authorizeRequests()
                .anyRequest().authenticated()
                .and()
            .oauth2ResourceServer()
                .jwt()
                .and();
    }
}
Controller :
@RestController
@RequestMapping("/api")
public class HomeController {
    @GetMapping("/asd")
    @ResponseBody
    public String home() {
        return "Hello, this is resource server 1.";
    }
}
application.yml
spring:
  security:
    oauth2:
      resourceserver:
        jwt:
          jwk-set-uri: https://login.microsoftonline.com/{tenant-id}/discovery/keys
          issuer-uri: https://login.microsoftonline.com/{tenant-id}/v2.0
pom.xml
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-oauth2-resource-server</artifactId>
    </dependency>
Your Java code looks pretty correct. I would start by adding extra logging to your application properties file to see if that tells you anything, e.g.:
logging:
  level:
    org:
      springframework:
        security: DEBUG
Next I would use a JWT viewer to see if there is a nonce field in the JWT header of the access token. If so then it will fail validation - see this recent answer of mine for more info on exposing an API scope in Azure AD.
Finally, you could try another library temporarily and it may give you a better explanation of the cause. See this jose4j code for an example of library based verification. You could paste that into a small Java console app and run it with an Azure access token.
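A minimal jose4j sketch along those lines (not the exact code from the linked example; the tenant ID, expected audience, and token are placeholders you would substitute) might look like this:
import org.jose4j.jwk.HttpsJwks;
import org.jose4j.jwt.JwtClaims;
import org.jose4j.jwt.consumer.JwtConsumer;
import org.jose4j.jwt.consumer.JwtConsumerBuilder;
import org.jose4j.keys.resolvers.HttpsJwksVerificationKeyResolver;

public class TokenCheck {
    public static void main(String[] args) throws Exception {
        // Paste an Azure AD access token here (or pass it as an argument).
        String accessToken = args[0];

        // Resolve signing keys from the Azure AD JWKS endpoint (placeholder tenant ID).
        HttpsJwks jwks = new HttpsJwks("https://login.microsoftonline.com/{tenant-id}/discovery/v2.0/keys");
        HttpsJwksVerificationKeyResolver keyResolver = new HttpsJwksVerificationKeyResolver(jwks);

        JwtConsumer consumer = new JwtConsumerBuilder()
                .setVerificationKeyResolver(keyResolver)
                .setExpectedIssuer("https://login.microsoftonline.com/{tenant-id}/v2.0")
                .setExpectedAudience("{expected-audience}") // placeholder: the audience your API expects
                .build();

        // Throws InvalidJwtException with a detailed reason if validation fails.
        JwtClaims claims = consumer.processToClaims(accessToken);
        System.out.println("Token is valid. Claims: " + claims);
    }
}
If the token fails here, the exception message usually states exactly which check (signature, issuer, audience, expiry) did not pass, which is often more informative than a bare 401.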
I am using the Simba Cassandra JDBC driver to connect to Cassandra. My JDBC URL is: jdbc:cassandra://127.0.0.1:9042?ssl=true. How do I disable SSL validation? In Postgres we can do sslfactory=org.postgresql.ssl.NonValidatingFactory; I am looking for a similar thing for Cassandra. Any pointers would help.
<dependency>
    <groupId>com.simba.cassandra.jdbc42.Driver</groupId>
    <artifactId>jdbc42</artifactId>
    <version>1.0</version>
    <scope>system</scope>
    <systemPath>${project.basedir}/jar/CassandraJDBC42.jar</systemPath>
</dependency>
I use jooq-codegen-maven version 3.10.5.
It works with PostgreSQL 10.4, but with PostgreSQL 9.4.6 it doesn't work and gives this warning:
[WARNING] No schemata were loaded : Please check your connection settings, and whether your database (and your database version!) is really supported by jOOQ. Also, check the case-sensitivity in your configured <inputSchema/> elements : {=[schema_name]}
Is there a compatibility table for jOOQ (the code generator) and database versions?
My plugin configuration is:
<plugin>
    <groupId>org.jooq</groupId>
    <artifactId>jooq-codegen-maven</artifactId>
    <version>3.10.5</version>
    <executions>
        <execution>
            <goals>
                <goal>generate</goal>
            </goals>
        </execution>
    </executions>
    <dependencies>
        <dependency>
            <groupId>postgresql</groupId>
            <artifactId>postgresql</artifactId>
            <version>9.1-901.jdbc3</version>
        </dependency>
    </dependencies>
    <configuration>
        <!-- JDBC connection parameters -->
        <jdbc>
            <driver>org.postgresql.Driver</driver>
            <url>jdbc:postgresql://X.X.X.106:5432/postgres</url>
            <user>xxxx</user>
            <password>xxxx</password>
        </jdbc>
        <!-- Generator parameters -->
        <generator>
            <name>org.jooq.util.DefaultGenerator</name>
            <database>
                <name>org.jooq.util.postgres.PostgresDatabase</name>
                <includes>.*</includes>
                <inputSchema>somedata</inputSchema>
                <excludes></excludes>
            </database>
            <target>
                <packageName>com.xxxx.xxxx.jooq.generated</packageName>
                <directory>target/generated-sources/jooq</directory>
            </target>
        </generator>
    </configuration>
</plugin>
I got the same error when I copy-pasted the pom.xml plugin configuration from the jOOQ website. Later I changed inputSchema in the pom.xml (it was 'public' before) to the name of the database I created, and it generated the code.
I changed
<inputSchema>public</inputSchema>
to
<inputSchema>library</inputSchema>
where 'library' is the name of the database I created
In my case the database part of the JDBC URL (jdbc:postgresql://X.X.X.106:5432/postgres) was wrong for my new database, so changing it to the right one solved my issue.
I am facing a problem loading J2K image files (jp2, jp2000) in my Java application. What is strange is that the application runs correctly (the file is successfully read from disk) when it runs as a standalone Java application (or in tests).
After deployment on a Tomcat application server, the ImageIO.read(..) method returns null every time.
Any help is appreciated.
Shimon
Update: After reviewing the comment from @haraldK, the solution is well described at https://github.com/haraldk/TwelveMonkeys (section "Deploying the plugins in a web app").
You need to define a listener in your web.xml:
<web-app ...>
    ...
    <listener>
        <display-name>ImageIO service provider loader/unloader</display-name>
        <listener-class>com.twelvemonkeys.servlet.image.IIOProviderContextListener</listener-class>
    </listener>
    ...
</web-app>
You also need to add this Maven dependency to your project:
<dependency>
    <groupId>com.twelvemonkeys.servlet</groupId>
    <artifactId>servlet</artifactId>
    <version>3.0.2</version>
</dependency>
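With the listener and dependency in place, a quick check like the sketch below (using only the standard ImageIO registry API; the class name is just for illustration) can confirm whether a JPEG 2000 reader is actually visible at runtime:
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

public class ReaderCheck {
    public static void main(String[] args) {
        // Re-scan the classpath for ImageIO plugins; inside a container this can
        // matter because the context class loader differs from the system one.
        ImageIO.scanForPlugins();

        // List every reader registered for the .jp2 suffix.
        Iterator<ImageReader> readers = ImageIO.getImageReadersBySuffix("jp2");
        if (!readers.hasNext()) {
            System.out.println("No JPEG 2000 readers registered");
        }
        while (readers.hasNext()) {
            System.out.println("Found reader: " + readers.next().getClass().getName());
        }
    }
}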
The other, less preferred solution (this was my first solution mentioned here):
After doing some googling I found this page - https://blog.idrsolutions.com/2013/03/getting-jai-jpeg2000-to-run-on-glassfish-server-without-a-npe/ - which describes the problem of resolving the J2K ImageIO service provider when using an application server such as GlassFish or Tomcat.
According to this article, the solution is simple: just use the reader directly:
public BufferedImage getJPEG2000Image(byte[] data) {
    ImageInputStream iis = null;
    BufferedImage image = null;
    try {
        iis = ImageIO.createImageInputStream(new ByteArrayInputStream(data));
        com.sun.media.imageioimpl.plugins.jpeg2000.J2KImageReaderSpi j2kImageReaderSpi = new com.sun.media.imageioimpl.plugins.jpeg2000.J2KImageReaderSpi();
        com.sun.media.imageioimpl.plugins.jpeg2000.J2KImageReader j2kReader = new com.sun.media.imageioimpl.plugins.jpeg2000.J2KImageReader(j2kImageReaderSpi);
        j2kReader.setInput(iis, true);
        image = j2kReader.read(0, new com.sun.media.imageio.plugins.jpeg2000.J2KImageReadParam());
    } catch (Exception e) {
        e.printStackTrace();
    }
    return image;
}
This Maven dependency is needed as well:
<dependency>
    <groupId>com.sun.media</groupId>
    <artifactId>jai_imageio</artifactId>
    <version>1.1</version>
</dependency>
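For completeness, a minimal, hypothetical usage of the getJPEG2000Image method above might look like this (the file path is a placeholder, and "Jp2Loader" is just an assumed name for the class holding that method):
import java.awt.image.BufferedImage;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Jp2Demo {
    public static void main(String[] args) throws Exception {
        // Placeholder path; point this at a real .jp2 file.
        byte[] data = Files.readAllBytes(Paths.get("/path/to/sample.jp2"));

        // getJPEG2000Image is the method from the answer above,
        // assumed here to live in a class called Jp2Loader.
        BufferedImage image = new Jp2Loader().getJPEG2000Image(data);

        if (image != null) {
            System.out.println("Decoded image: " + image.getWidth() + "x" + image.getHeight());
        } else {
            System.out.println("Image could not be decoded");
        }
    }
}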