How to read a csv file from unix server in pyspark - apache-spark

I need to create spark dataframe from csv file which is located in my UNIX server.
I tried like below,
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("demo").getOrCreate()
df = spark.read.format('csv').option('header','True'). \
load("ftp://USER:PASSWORD#UNIX_IP/home/user/sample.csv")
df.show(10)
But its throwing the error as,
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Illegal character in user info at index 32
Could anyone help me to resolve this. How we need to refer the ftp location in pyspark? Do we need to include any other library for this?

You need to use the addFile method like this:
import org.apache.spark.SparkFiles
sc.addFile("ftp://user:pwd#host:port/home/user/sample.csv")
spark.read.csv(SparkFiles.get("sample.csv")).show()
To test it, you could use a public ftp like:
sc.addFile("ftp://anonymous:anonymous#ftp.gnu.org/README")
spark.read.csv(SparkFiles.get("README")).show(2)
+--------------------+--------------------+
| _c0| _c1|
+--------------------+--------------------+
| This is ftp.gnu.org| the FTP server o...|
|NOTICE (Updated O...| null|
+--------------------+--------------------+
In python:
from pyspark import SparkFiles
sc.addFile('ftp://user:pwd#host:port/home/user/sample.csv')
spark.read.csv(SparkFiles.get('sample.csv')).show()

Related

Py4JJavaError: An error occurred while calling o389.csv

I'm new to pyspark. I'm running pyspark using databricks. My data is stored in Azure Data Lake Service.I'm trying to read csv file from ADLS to pyspark data frame. So I wrote following code
import pyspark
from pyspark import SparkContext
from pyspark import SparkFiles
df = sqlContext.read.csv(SparkFiles.get("dbfs:mycsv path in ADSL/Data.csv"),
header=True, inferSchema= True)
But I'm getting error message
Py4JJavaError: An error occurred while calling o389.csv.
Can you suggest me to rectify this error?
The SparkFiles class is intended for accessing the files shipped as part of the Spark job. If you just need access to the CSV file available on ADLS, then you just need to use spark.read.csv, like:
df = spark.read.csv("dbfs:mycsv path in ADSL/Data.csv",
header=True, inferSchema=True)
it's better not to use sqlContext, it's kept for compatibility reasons.

Pulling log file directory name into the Pyspark dataframe

I have a bit of a strange one. I have loads of logs that I need to trawl. I have done that successfully in Spark & I am happy with it.
However, I need to add one more field to the dataframe, which is the data center.
The only place that the datacenter name can be derived is from the directory path.
For example:
/feedname/date/datacenter/another/logfile.txt
What would be the way to extract the log file path and inject it into the dataframe? From there, I can do some string splits & extract the bit I need.
My current code:
mpe_data = my_spark.read\
.option("header","false")\
.option("delimiter", "\t")\
.withColumn("Datacenter", input_file_name())\
.csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)
mpe_data.printSchema()
mpe_data.createOrReplaceTempView("mpe")
You can get the file path using the _input_file_name_ in Spark 2.0+
from pyspark.sql.functions import input_file_name
df.withColumn("Datacenter", input_file_name())
Adding your piece of code as example, once you have read your file use the withcolumn to get the file_name.
mpe_data = my_spark.read\
.option("header","false")\
.option("delimiter", "\t")\
.csv('hdfs://nameservice/data/feed/mpe/dt=20191013/*/*/*', final_structure)
mpe_data.withColumn("Datacenter", input_file_name())
mpe_data.printSchema()

Is it possible to read Excel file from Apache Zeppellin to PySpark or to a Pandas Dataframe?

I have got a file in HDFS (/user/username/Project/data/file.xlsx) that I want to read into a DataFrame. (I do not care if it is a PySpark DataFrame or Pandas, but Pandas is preferred.)
I am using a Zeppelin Notebook to do my code.
Is it possible to get data from this file?
I have already tried the following commands, but none of them worked:
df = pd.read_excel("/user/username/Project/data/file.xlsx")
df = pd.read_excel("hdfs:///user/username/Project/data/file.xlsx")
df = pd.read_excel("hdfs://user/username/Project/data/file.xlsx")
I don't think you can read files stored in hdfs directly with pandas.
You probably have to either :
load the file into spark then use toPandas()
df = spark.read.format("excel").load("hdfs:xxx").toPandas()
use some alternative to enable pandas to read directly, as described here
It seems export and import commands in Python Interpreter in Apache Zeppellin can be only realised through "pd.read_csv" and "to_csv" modules.

Spark and Hive table schema out of sync after external overwrite

I'm am having issues with the schema for Hive tables being out of sync between Spark and Hive on a Mapr cluster with Spark 2.1.0 and Hive 2.1.1.
I need to try to resolve this problem specifically for managed tables, but the issue can be reproduced with unmanaged/external tables.
Overview of Steps
Use saveAsTable to save a dataframe to a given table.
Use mode("overwrite").parquet("path/to/table") to overwrite the data for the previously saved table. I am actually modifying the data through a process external to Spark and Hive, but this reproduces the same issue.
Use spark.catalog.refreshTable(...) to refresh metadata
Query the table with spark.table(...).show(). Any columns that were the same between the original dataframe and the overwriting one will show the new data correctly, but any columns that were only in the new table will not be displayed.
Example
db_name = "test_39d3ec9"
table_name = "overwrite_existing"
table_location = "<spark.sql.warehouse.dir>/{}.db/{}".format(db_name, table_name)
qualified_table = "{}.{}".format(db_name, table_name)
spark.sql("CREATE DATABASE IF NOT EXISTS {}".format(db_name))
Save as a managed table
existing_df = spark.createDataFrame([(1, 2)])
existing_df.write.mode("overwrite").saveAsTable(table_name)
Note that saving as an unmanaged table with the following will produce the same issue:
existing_df.write.mode("overwrite") \
.option("path", table_location) \
.saveAsTable(qualified_table)
View the contents of the table
spark.table(table_name).show()
+---+---+
| _1| _2|
+---+---+
| 1| 2|
+---+---+
Overwrite the parquet files directly
new_df = spark.createDataFrame([(3, 4, 5, 6)], ["_4", "_3", "_2", "_1"])
new_df.write.mode("overwrite").parquet(table_location)
View the contents with the parquet reader, the contents show correctly
spark.read.parquet(table_location).show()
+---+---+---+---+
| _4| _3| _2| _1|
+---+---+---+---+
| 3| 4| 5| 6|
+---+---+---+---+
Refresh spark's metadata for the table and read in again as a table. The data will be updated for the columns that were the same, but the additional columns do not display.
spark.catalog.refreshTable(qualified_table)
spark.table(qualified_table).show()
+---+---+
| _1| _2|
+---+---+
| 6| 5|
+---+---+
I have also tried updating the schema in hive before calling spark.catalog.refreshTable with the below command in the hive shell:
ALTER TABLE test_39d3ec9.overwrite_existing REPLACE COLUMNS (`_1` bigint, `_2` bigint, `_3` bigint, `_4` bigint);
After running the ALTER command I then run describe and it shows correctly in hive
DESCRIBE test_39d3ec9.overwrite_existing
OK
_1 bigint
_2 bigint
_3 bigint
_4 bigint
Before running the alter command it only shows the original columns as expected
DESCRIBE test_39d3ec9.overwrite_existing
OK
_1 bigint
_2 bigint
I then ran spark.catalog.refreshTable but it didn't effect spark's view of the data.
Additional Notes
From the spark side, I did most of my testing with PySpark, but also tested in a spark-shell (scala) and a sparksql shell. While in the spark shell I also tried using a HiveContext but didn't work.
import org.apache.spark.sql.hive.HiveContext
import spark.sqlContext.implicits._
val hiveObj = new HiveContext(sc)
hiveObj.refreshTable("test_39d3ec9.overwrite_existing")
After performing the ALTER command in the hive shell, I verified in Hue that the schema also changed there.
I also tried running the ALTER command with spark.sql("ALTER ...") but the version of Spark we are on (2.1.0) does not allow it, and looks like it won't be available until Spark 2.2.0 based on this issue: https://issues.apache.org/jira/browse/SPARK-19261
I have also read through the spark docs again, specifically this section: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#hive-metastore-parquet-table-conversion
Based on those docs, spark.catalog.refreshTable should work. The configuration for spark.sql.hive.convertMetastoreParquet is typically false, but I switched it to true for testing and it didn't seem to effect anything.
Any help would be appreciated, thank you!
I faced a similar issue while using spark 2.2.0 in CDH 5.11.x package.
After spark.write.mode("overwrite").saveAsTable() when I issue spark.read.table().show no data will be displayed.
On checking I found it was a known issue with CDH spark 2.2.0 version. Workaround for that was to run the below command after the saveAsTable command was executed.
spark.sql("ALTER TABLE qualified_table set SERDEPROPERTIES ('path'='hdfs://{hdfs_host_name}/{table_path}')")
spark.catalog.refreshTable("qualified_table")
eg: If your table LOCATION
is like hdfs://hdfsHA/user/warehouse/example.db/qualified_table
then assign 'path'='hdfs://hdfsHA/user/warehouse/example.db/qualified_table'
This worked for me. Give it a try. I assume by now your issue would have been resolved. If not you can try this method.
workaround source: https://www.cloudera.com/documentation/spark2/2-2-x/topics/spark2_known_issues.html

Pyspark reads csv - NameError: name 'spark' is not defined

I am trying to run the following code in databricks in order to call a spark session and use it to open a csv file:
spark
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
And I get the following error:
NameError:name 'spark' is not defined
Any idea what might be wrong?
I have also tried to run:
from pyspark.sql import SparkSession
But got the following in response:
ImportError: cannot import name SparkSession
If it helps, I am trying to follow the following example (you will understand better if you watch it from from 17:30 on):
https://www.youtube.com/watch?v=K14plpZgy_c&list=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX
I got it worked by using the following imports:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
I got the idea by looking into the pyspark code as I found read csv was working in the interactive shell.
Please note the example code your are using is for Spark version 2.x
"spark" and "SparkSession" are not available on Spark 1.x. The error messages you are getting point to a possible version issue (Spark 1.x).
Check the Spark version you are using.

Resources