How to read remote HDFS parquets from my local PySpark? - apache-spark

I am trying to load data from a remote HDFS file system to my local PySpark session on my local Mac machine:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
path = "/xx/yy/order_info_20220413/partn_date=20220511/part-00085-dd.gz.parquet"
host = "host"
port = 1234
orders = spark.read.parquet(
    f"hdfs://{host}:{port}{path}"
)
Here is the error:
Py4JJavaError: An error occurred while calling o55.parquet.
: org.apache.hadoop.ipc.RpcException: RPC response exceeds maximum data length
at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1936)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1238)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1134)
I tried to understand what "RPC response exceeds maximum data length" means. I did NOT find anything similar to the code block below in my core-site.xml, as shown in https://stackoverflow.com/a/60701948/6693221:
<property>
    <name>fs.default.name</name>
    <value>hdfs://host:port</value>
</property>
However, when I typed telnet host port in my macOS terminal, I was CONNECTED. What is the solution?

You should configure your file system before creating the Spark session. You can do that in the core-site.xml file or directly in your session config. Then, to read the Parquet file, you only need to provide the path, since your session is already configured to use the remote HDFS cluster as its file system:
from pyspark.sql import SparkSession
path = "/xx/yy/order_info_20220413/partn_date=20220511/part-00085-dd.gz.parquet"
host = "host"
port = 1234
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.default.name", f"hdfs://{host}:{port}")
    .config("spark.hadoop.fs.defaultFS", f"hdfs://{host}:{port}")
    .getOrCreate()
)
orders = spark.read.parquet(path)
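As a quick sanity check, you can confirm that the session actually picked up the remote file system. This is a minimal sketch; spark.sparkContext._jsc is an internal handle, but it is a common way to inspect the effective Hadoop configuration:
# Inspect the Hadoop configuration the session is actually using
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.defaultFS"))  # expected: hdfs://<host>:<port>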

Related

How to read a local csv file in a data frame using Spark in a virtual environment ? (cdhserver : labuser) [duplicate]

I'm following the great Spark tutorial, and at 46:00 I'm trying to load the README.md, but it fails. What I'm doing is this:
$ sudo docker run -i -t -h sandbox sequenceiq/spark:1.1.0 /etc/bootstrap.sh -bash
bash-4.1# cd /usr/local/spark-1.1.0-bin-hadoop2.4
bash-4.1# ls README.md
README.md
bash-4.1# ./bin/spark-shell
scala> val f = sc.textFile("README.md")
14/12/04 12:11:14 INFO storage.MemoryStore: ensureFreeSpace(164073) called with curMem=0, maxMem=278302556
14/12/04 12:11:14 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 160.2 KB, free 265.3 MB)
f: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12
scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox:9000/user/root/README.md
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
how can I load that README.md?
Try explicitly specifying sc.textFile("file:///path/to/the/file"). The error occurs when a Hadoop environment is set.
SparkContext.textFile internally calls org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn uses org.apache.hadoop.fs.FileSystem.getDefaultUri if the scheme is absent. This method reads the "fs.defaultFS" parameter of the Hadoop conf. If you set the HADOOP_CONF_DIR environment variable, the parameter is usually set to "hdfs://..."; otherwise "file://".
gonbe's answer is excellent. But I still want to mention that file:/// refers to the filesystem root (/), not $SPARK_HOME. Hope this saves some time for newbies like me.
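To make the scheme resolution concrete, here is a minimal PySpark sketch (the paths are illustrative only; the question uses spark-shell, but the behaviour is the same):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# No scheme: resolved against fs.defaultFS, e.g. hdfs://sandbox:9000/user/root/README.md
default_fs_rdd = sc.textFile("README.md")

# Explicit scheme: always reads from the local filesystem, regardless of fs.defaultFS
local_rdd = sc.textFile("file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md")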
If the file is located in your Spark master node (e.g., in case of using AWS EMR), then launch the spark-shell in local mode first.
$ spark-shell --master=local
scala> val df = spark.read.json("file:///usr/lib/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
Alternatively, you can first copy the file to HDFS from the local file system and then launch Spark in its default mode (e.g., YARN in case of using AWS EMR) to read the file directly.
$ hdfs dfs -mkdir -p /hdfs/spark/examples
$ hadoop fs -put /usr/lib/spark/examples/src/main/resources/people.json /hdfs/spark/examples
$ hadoop fs -ls /hdfs/spark/examples
Found 1 items
-rw-r--r-- 1 hadoop hadoop 73 2017-05-01 00:49 /hdfs/spark/examples/people.json
$ spark-shell
scala> val df = spark.read.json("/hdfs/spark/examples/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
While Spark supports loading files from the local filesystem, it requires that the files are available at the same path on all nodes in your cluster.
Some network filesystems, like NFS, AFS, and MapR’s NFS layer, are exposed to the user as a regular filesystem.
If your data is already in one of these systems, you can use it as an input by just specifying a file:// path; Spark will handle it as long as the filesystem is mounted at the same path on every node.
rdd = sc.textFile("file:///path/to/file")
If your file isn't already on all nodes in the cluster, you can load it locally on the driver without going through Spark and then call parallelize to distribute the contents to the workers.
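A minimal PySpark sketch of that pattern (the local path is hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Read the file on the driver with plain Python, then distribute its lines as an RDD
with open("/path/on/the/driver/data.txt") as f:  # hypothetical driver-local path
    lines = f.read().splitlines()

rdd = sc.parallelize(lines)
print(rdd.count())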
In either case, take care to put file:// in front and to use "/" or "\" according to your OS.
Attention:
Make sure that you run Spark in local mode when you load data from a local path (sc.textFile("file:///path/to/the/file")), or you will get an error like Caused by: java.io.FileNotFoundException: File file:/data/sparkjob/config2.properties does not exist, because executors running on different workers will not find this file on their local paths.
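A minimal sketch of forcing local mode so that a file:// path only has to exist on the driver machine (the app name is arbitrary):
from pyspark.sql import SparkSession

# local[*] keeps all tasks in the driver JVM, so only the driver needs the file
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("read-local-file")  # arbitrary name
    .getOrCreate()
)

rdd = spark.sparkContext.textFile("file:///data/sparkjob/config2.properties")
print(rdd.count())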
You just need to specify the path of the file as "file:///directory/file".
example:
val textFile = sc.textFile("file:///usr/local/spark/README.md")
I have a file called NewsArticle.txt on my Desktop.
In Spark, I typed:
val textFile = sc.textFile("file:///C:/Users/582767/Desktop/NewsArticle.txt")
I needed to change all the \ characters to / in the file path.
To test if it worked, I typed:
textFile.foreach(println)
I'm running Windows 7 and I don't have Hadoop installed.
This happened to me with Spark 2.3, with Hadoop also installed under the common "hadoop" user home directory. Since both Spark and Hadoop were installed under the same common directory, Spark by default treats the scheme as hdfs and starts looking for the input files under HDFS, as specified by fs.defaultFS in Hadoop's core-site.xml. In such cases, we need to explicitly specify the scheme as file:///<absolute path to file>.
This has been discussed on the Spark mailing list; please refer to this mail.
You should use hadoop fs -put <localsrc> ... <dst> to copy the file into HDFS:
${HADOOP_COMMON_HOME}/bin/hadoop fs -put /path/to/README.md README.md
I tried the following and it worked from my local file system. Basically, Spark can read from local, HDFS, and AWS S3 paths:
listrdd = sc.textFile("file:///home/cloudera/Downloads/master-data/retail_db/products")
This is the solution for this error, which I was getting on a Spark cluster hosted in Azure on a Windows cluster:
Load the raw HVAC.csv file, parse it using the function
data = sc.textFile("wasb:///HdiSamples/SensorSampleData/hvac/HVAC.csv")
We use (wasb:///) to allow Hadoop to access the Azure Blob Storage file, and the three slashes are a relative reference to the running node's container folder.
For example: If the path for your file in File Explorer in Spark cluster dashboard is:
sflcc1\sflccspark1\HdiSamples\SensorSampleData\hvac
To describe the path: sflcc1 is the name of the storage account, and sflccspark1 is the cluster node name.
So we refer to the current cluster node name with the relative three slashes.
Hope this helps.
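For reference, the same file can also be addressed with a fully qualified WASB URI instead of the relative three-slash form; a sketch with placeholder container and storage-account names:
# wasb://<container>@<storage-account>.blob.core.windows.net/<path-inside-container>
data = sc.textFile(
    "wasb://mycontainer@mystorageaccount.blob.core.windows.net/HdiSamples/SensorSampleData/hvac/HVAC.csv"
)
print(data.count())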
If you're trying to read a file from HDFS, try setting the path in the SparkConf (note the spark.hadoop. prefix, so the setting reaches the Hadoop configuration):
val conf = new SparkConf().setMaster("local[*]").setAppName("HDFSFileReader")
conf.set("spark.hadoop.fs.defaultFS", "hdfs://hostname:9000")
You do not have to use sc.textFile(...) to convert local files into DataFrames. One option is to read a local file line by line and then transform it into a Spark Dataset. Here is an example for a Windows machine in Java:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.types.DataTypes.StringType;
import static org.apache.spark.sql.types.DataTypes.createStructField;

// Schema of the CSV file
StructType schemata = DataTypes.createStructType(
        new StructField[]{
                createStructField("COL1", StringType, false),
                createStructField("COL2", StringType, false),
                ...
        }
);
String separator = ";";
String filePath = "C:\\work\\myProj\\myFile.csv";
SparkContext sparkContext = new SparkContext(new SparkConf().setAppName("MyApp").setMaster("local"));
JavaSparkContext jsc = new JavaSparkContext(sparkContext);
SQLContext sqlContext = SQLContext.getOrCreate(sparkContext);

// Read the file line by line with plain Java IO
List<String[]> result = new ArrayList<>();
try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] vals = line.split(separator);
        result.add(vals);
    }
} catch (Exception ex) {
    System.out.println(ex.getMessage());
    throw new RuntimeException(ex);
}

// Turn the collected rows into a Spark DataFrame
JavaRDD<String[]> jRdd = jsc.parallelize(result);
JavaRDD<Row> jRowRdd = jRdd.map(RowFactory::create);
Dataset<Row> data = sqlContext.createDataFrame(jRowRdd, schemata);
Now you can use dataframe data in your code.
Reading a local file in Apache Spark. This worked for me:
var a = sc.textFile("/home/omkar/Documents/text_input").flatMap(line => line.split(" ")).map(word => (word, 1));
try
val f = sc.textFile("./README.md")

Using spark on AWS EMR; The AWS Access Key Id you provided does not exist in our records. but boto3 calls work just fine

I am trying to read a file in Spark on EMR, for which I have been given TEMPORARY credentials by a different system (Illumina ICA).
When trying to read the file using spark.read.csv, using the S3 URI, it gives me the error:
Py4JJavaError: An error occurred while calling o65.csv.
: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
But when I try the same credentials using a BOTO3 call, it works just fine, so the credentials (in the environment) are just fine.
Here's my test code (from a notebook)
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('s3://stratus-gds-use1/241dd164-decb-48f6-eba1-08d881d902b2/dummy.vcf.gz', sep='\t')
#... Py4JJavaError: An error occurred while calling o65.csv. ##: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
import os

access_key_id = os.environ['AWS_ACCESS_KEY_ID']
secret_access_key = os.environ['AWS_SECRET_ACCESS_KEY']
region = os.environ['AWS_DEFAULT_REGION']
session_token = os.environ['AWS_SESSION_TOKEN']
bucket_name='stratus-gds-use1'
key_prefix='241dd164-decb-48f6-eba1-08d881d902b2/dummy.vcf.gz'
import boto3
s3_session = boto3.session.Session(
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    aws_session_token=session_token,
    region_name=region,
)
s3_client = s3_session.client('s3')
%ls -l dummy.vcf.gz
#-=> ls: cannot access dummy.vcf.gz: No such file or directory
r = s3_client.download_file(
    Filename='dummy.vcf.gz',
    Bucket=bucket_name,
    Key=key_prefix,
)
%ls -l dummy.vcf.gz
#-=> -rw-rw-r-- 1 hadoop hadoop 2535 Apr 6 18:45 dummy.vcf.gz
Any ideas why spark on AWS EMR cannot access the file with the provided S3 URI?
I have tested other S3 URIs like that and they work fine, so the java classes work fine.
I finally figured out the solution. I needed to provide the temporary AWS credentials in the Spark configuration, use the special credentials provider class org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, and pass the session_token as well.
So this is the procedure to ensure Spark can read an S3 bucket with temporary credentials:
import pyspark
from pyspark.sql import SparkSession
conf = (
    pyspark.SparkConf()
    .set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
    .set('spark.hadoop.fs.s3a.access.key', access_key_id)
    .set('spark.hadoop.fs.s3a.secret.key', secret_access_key)
    .set('spark.hadoop.fs.s3a.session.token', session_token)
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
df = spark.read.csv(f's3a://{BUCKET}/{KEY_PREFIX}', sep='\t')

403 Error while accessing s3a using Spark

ISSUE:
I am able to successfully download the file using the AWS CLI as well as boto3.
However, while using the S3A connector of Hadoop/Spark, I receive the below error:
py4j.protocol.Py4JJavaError: An error occurred while calling o24.parquet.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: BCFFD14CB2939D68, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: MfT8J6ZPlJccgHBXX+tX1fpX47V7dWCP3Dq+W9+IBUfUhsD4Nx+DcyqsbgbKsPn8NZzjc2U
Configuration:
Running this on my Local machine
Spark Version 2.4.4
Hadoop Version 2.7
Jars added:
hadoop-aws-2.7.3.jar
aws-java-sdk-1.7.4.jar
Hadoop Config:
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.session.token", session_key)
hadoop_conf.set("fs.s3a.endpoint", "s3-us-west-2.amazonaws.com") # yes, I am using central eu server.
hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true")
Code to Read the file:
from pyspark import SparkConf, SparkContext, SQLContext
sc = SparkContext.getOrCreate()
hadoop_conf=sc._jsc.hadoopConfiguration()
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet(path)
print(df.head())
Set AWS credentials provider to profile credentials:
hadoopConf.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
I faced the same issue as above. It turns out that I needed to add the session token to my configuration. See the documentation.
configuration.set("fs.s3a.session.token", System.getenv("AWS_SESSION_TOKEN"))

How to access a file in windows using Spark and winutils?

I am running Spark on Windows using winutils.
In the spark shell I am trying to load a CSV file, but it says Path does not exist, even though the file exists at E:/data.csv.
I am executing:
scala> val df = spark.read.option("header","true").csv("E:\\data.csv")
Error:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/E:/data.csv;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
I can't figure out why it is prefixing the path with "/E:", whereas it should have been only E:.
How should I access the file?
In my case I am able to read the file as below
val input = spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ";")
  .option("quoteAll", "true")
  .option("inferSchema", "false")
  .load("C:/Work/test.csv")
  .toDF()
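For completeness, a hedged sketch of the same read with an explicit file:/// scheme and forward slashes, which usually avoids the path mangling seen in the question (shown in PySpark; the same URI form works from spark-shell):
df = (
    spark.read
    .option("header", "true")
    .csv("file:///E:/data.csv")  # explicit scheme + forward slashes on Windows
)
df.show()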

By default in which file system does spark look for reading file? [duplicate]

