When I use PySpark to read data (a 4 GB DAT file) from my own computer everything is fine, but when I use PySpark to read the same data from another computer in my company (connected over the LAN) the following error occurs:
Py4JJavaError: An error occurred while calling o304.csv.
: java.io.IOException: No FileSystem for scheme: null
If I use pandas.read_csv to read the file from the other computer, everything works fine (the problem only occurs with PySpark). Please help with this case. Thanks!
My code to read data on my own computer (no problem occurs):
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
path='V04R-V04R-SQLData.dat'
df = spark.read.option("delimiter", "\t").csv(path)
My code to read data from the other computer (problem occurs):
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
path='//8LWK8X1/Data/Subfolder1/V04R-V04R-SQLData.dat'
df = spark.read.option("delimiter", "\t").csv(path)
Note:
8LWK8X1 is the hostname of the other computer on the LAN
Read with pandas and convert that to a PySpark DataFrame - easy workaround :)
import pandas as pd

# Load into a pandas DataFrame (path is the same UNC path from the question;
# the .dat file is tab-delimited)
gam_charge_item_df = pd.read_csv(path, sep="\t")

# Create a PySpark DataFrame from the pandas DataFrame
spark_df = spark.createDataFrame(gam_charge_item_df)
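Assuming the conversion succeeds, a quick sanity check might look like this (a sketch, reusing the spark_df name from above):
# Confirm the schema and peek at the first rows of the converted DataFrame
spark_df.printSchema()
spark_df.show(5, truncate=False)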
Related
I am using the pyspark.pandas read_excel function to import data and saving the result to the metastore using to_table. It works fine if format='parquet'. However, the job hangs if format='delta'. The cluster idles after creating the Parquet files and does not proceed to write the _delta_log (at least that's how it seems).
Do you have any clue what might be happening?
I'm using Databricks 11.3, Spark 3.3.
I have also tried importing the Excel file using regular pandas, converting the pandas DF to a Spark DF using spark.createDataFrame, and then write.saveAsTable, but without success when the format is delta.
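For reference, a minimal sketch of the flow described above (the path and table names here are placeholders, not from the original setup):
import pyspark.pandas as ps

# Read the Excel file with pyspark.pandas (path is a placeholder)
psdf = ps.read_excel("/dbfs/tmp/input.xlsx")

# Writing as parquet completes normally
psdf.to_table("my_schema.my_table_parquet", format="parquet")

# Writing as delta is where the job reportedly hangs before _delta_log appears
psdf.to_table("my_schema.my_table_delta", format="delta")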
I am trying to read multiple text files into a single Spark DataFrame. I have used the following code for a single file:
df = spark.read.text('C:/User/Alex/Directory/Subdirectory/Filename.txt.pgp.decr')
df.count()
and I get the correct result. Then I try to read in all of the files in that directory as follows:
df = spark.read.text('C:/User/Alex/Directory/Subdirectory/*')
df.count()
and the notebook just hangs and produces no result. I have also tried reading the data into an RDD using the SparkContext with textFile and wholeTextFiles, but that didn't work either. Please can you help?
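For reference, a rough sketch of those RDD attempts (the exact calls aren't shown in the original, so the paths and variable names here are assumptions):
# Line-oriented read: one RDD element per line across all files
rdd_lines = spark.sparkContext.textFile('C:/User/Alex/Directory/Subdirectory/*')

# Whole-file read: one (filename, content) pair per file
rdd_files = spark.sparkContext.wholeTextFiles('C:/User/Alex/Directory/Subdirectory/*')

print(rdd_lines.count(), rdd_files.count())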
I'm using PySpark 2.3 and trying to read a CSV file that looks like this:
0,0.000476517230863068,0.0008178378961061477
1,0.0008506156837329876,0.0008467260987257776
But it doesn't work:
from pyspark import sql, SparkConf, SparkContext
print (sc.applicationId)
>> <property at 0x7f47583a5548>
data_rdd = spark.textFile(name=tsv_data_path).filter(x.split(",")[0] != 1)
And I get an error:
AttributeError: 'SparkSession' object has no attribute 'textFile'
Any idea how I should read it in pySpark 2.3?
First, textFile exists on the SparkContext (called sc in the repl), not on the SparkSession object (called spark in the repl).
Second, for CSV data, I would recommend using the CSV DataFrame loading code, like this:
df = spark.read.format("csv").load("file:///path/to/file.csv")
You mentioned in comments needing the data as an RDD. You are going to have significantly better performance if you can keep all of your operations on DataFrames instead of RDDs. However, if you need to fall back to RDDs for some reason you can do it like the following:
rdd = df.rdd.map(lambda row: row.asDict())
This approach is better than trying to load the file with textFile and parsing the CSV data yourself. If you use the DataFrame CSV loader, it will properly handle all the CSV edge cases for you, such as quoted fields. Also, if you only need some of the columns, you can filter on the DataFrame before converting it to an RDD, to avoid bringing all that extra data over into the Python interpreter.
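Putting those pieces together for the sample data above (the column names and file path are illustrative assumptions; the filter mirrors the check on the first column from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Load the CSV as a DataFrame; the file has no header, so name the columns explicitly
df = (spark.read
          .format("csv")
          .option("inferSchema", "true")
          .load("file:///path/to/file.csv")
          .toDF("id", "col1", "col2"))

# Do as much as possible on the DataFrame, e.g. drop rows whose first column is 1
filtered = df.where(df["id"] != 1)

# Only fall back to an RDD if you really need one
rdd = filtered.rdd.map(lambda row: row.asDict())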
I have existing Hive data stored in Avro format. For whatever reason, reading this data by executing SELECT is very slow; I haven't figured out why yet. The data is partitioned and my WHERE clause always follows the partition columns, so I decided to read the data directly by navigating to the partition path and using the Spark SQLContext. This works much faster. However, the problem I have is reading the DOUBLE values: Avro stores them in a binary format.
When I execute the following query in Hive:
select myDoubleValue from myTable;
I'm getting the correct expected values
841.79
4435.13
.....
but the following Spark code:
val path="PathToMyPartition"
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.avro(path)
df.select("myDoubleValue").rdd.map(x => x.getAs[Double](0))
gives me this exception
java.lang.ClassCastException : [B cannot be cast to java.lang.Double
What would be the right way either to provide a schema or convert the value that is stored in a binary format into a double format?
I found a partial solution for converting the Avro schema to a Spark SQL StructType. Databricks provides com.databricks.spark.avro.SchemaConverters, but its toSqlType(avroSchema: Schema) method has a bug in handling Avro logical data types: it incorrectly converts the logicalType
{"name":"MyDecimalField","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":18}],"doc":"","default":null}
into
StructField("MyDecimalField",BinaryType,true)
I fixed this bug in my local version of the code and now it is converting into
StructField("MyDecimalField",DecimalType(38,18),true)
Now, the following code reads the Avro file and creates a Dataframe:
val avroSchema = new Schema.Parser().parse(QueryProvider.getQueryString(pathSchema))
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.schema(MyAvroSchemaConverter.toSqlType(avroSchema).dataType.asInstanceOf[StructType]).avro(path)
However, when I select the field that I expect to be decimal with
df.select("MyDecimalField")
I'm getting the following exception:
scala.MatchError: [B#3e6e0d8f (of class [B)
This is where I'm stuck at the moment, and I would appreciate it if anyone could suggest what to do next or any other workaround.
I have an R script similar to the example one, where you load some data from HDFS and then store it somehow, in this case as a Parquet file.
library(SparkR)
# Initialize SparkContext and SQLContext
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Create a DataFrame from a JSON file
peopleDF <- jsonFile(sqlContext, file.path("/people.json"))
# Register this DataFrame as a table.
registerTempTable(peopleDF, "people")
# SQL statements can be run by using the sql methods provided by sqlContext
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
# Store the teenagers in a table
saveAsParquetFile(teenagers, file.path("/teenagers"))
# Stop the SparkContext now
sparkR.stop()
How exactly do I retrieve the data from the cluster into another Spark application? I'm currently considering connecting to the HDFS master and retrieving the files according to this example, except replacing sbt-thrift with scrooge.
Is there a more idiomatic way to retrieve the data without a direct connection to the Hadoop cluster? I considered copying the data out of HDFS, but from what I've understood Parquet can only be read from Hadoop.
Start a SparkContext with master local and use SparkSQL to retrieve the data.
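For example, a minimal PySpark sketch of that answer: a separate application with a local master reading the Parquet output saved above (the HDFS namenode host and port here are assumptions, not from the original post):
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Separate application with a local master, as the answer suggests
sc = SparkContext(master="local[*]", appName="read-teenagers")
sqlContext = SQLContext(sc)

# Read the Parquet data written by the SparkR script; the HDFS URL is an assumption
teenagers = sqlContext.read.parquet("hdfs://namenode:8020/teenagers")
teenagers.registerTempTable("teenagers")
sqlContext.sql("SELECT name FROM teenagers").show()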