Defining Schema for a large dataframe in spark - apache-spark

I am trying to load a large dataset from a txt file (1000 columns, > 1M rows) into the Spark environment. My dataset has no headers, and as a consequence I have run into this error:
TypeError: Can not infer schema for type:
The challenge: looking at the examples given in the documentation, both approaches (inferring the schema via reflection and specifying it programmatically) demonstrate the idea with only a few (two) columns that can easily be typed out. No special column names are needed since the data represents a matrix.
How would I go about defining a schema for a much larger set of columns, hopefully without typing them all out? Or can the data be loaded in an alternative way that does not require these definitions?
PS: Spark newbie and using pyspark
EDITED (Added information)
dataset = "./data.txt"
conf = (SparkConf()
.setAppName("myApp")
.setMaster("host")
.set("spark.cores.max", "15")
.set("spark.rdd.compress", "true")
.set("spark.broadcast.compress", "true"))
sc = SparkContext(conf=conf)
spark = SparkSession \
.builder \
.appName("myApp") \
.config(conf=SparkConf()) \
.getOrCreate()
data = sc.textFile(dataset)
df = spark.createDataFrame(data)
data.txt contains 1M rows and 1000 columns, similar to what would be generated, for example, by the following code:
np.random.randint(20, size=(1000000, 1000))
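For illustration, here is a minimal sketch of one way to avoid typing the schema out, by generating it programmatically (this assumes whitespace-delimited integer data and uses hypothetical column names c0..c999):

from pyspark.sql.types import StructType, StructField, IntegerType

# Hypothetical column names c0..c999, one IntegerType field per column
schema = StructType([StructField("c{}".format(i), IntegerType(), True)
                     for i in range(1000)])

# Split each text line and cast the tokens to int before applying the schema
rows = sc.textFile(dataset).map(lambda line: [int(x) for x in line.split()])
df = spark.createDataFrame(rows, schema=schema)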

Related

pyspark parse filename on load

I'm quite new to Spark and there is one thing that I don't understand: how to manipulate column content.
I have a set of csv files as follows:
Each dsX is a table and I would like to load the data at once for each table.
So far no problems:
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*")
But one piece of information is missing: the client_id, and this client id is the first part of the csv file name: clientId_table_category.csv
So I tried to do this:
def extract_path(patht):
    print(patht)
    return patht
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*") \
.withColumn("clientId", fn.lit(extract_path(fn.input_file_name())))
But the print returns:
Column<b'input_file_name()'>
And I can't do much with this.
I'm quite stuck here, how do you manipulate data in this configuration?
Another solution for me is to load each csv one by one and parse the clientId from the file name manually, but I was wondering if there wouldn't be a more powerful solution with spark.
You are overcomplicating this a little:
df = spark.read.csv(
table+"/*",
header=True,
sep='\\'
).withColumn("clientId", fn.input_file_name())
This will create a column with the full path. Then you just need some extra string manipulation, which is easy with a UDF. You can also do it with built-in functions, but it is trickier.
from pyspark.sql import functions as fn
from pyspark.sql.types import StringType

@fn.udf(StringType())
def get_id(in_string):
    return in_string.split("/")[-1].split("_")[0]

df = df.withColumn(
    "clientId",
    get_id(fn.col("clientId"))
)
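For reference, a sketch of the built-in-function alternative mentioned above (this assumes Spark 2.4+ for element_at; no UDF needed):

# Split the full path on "/", take the file name, then take the part before the first "_"
df = df.withColumn(
    "clientId",
    fn.split(
        fn.element_at(fn.split(fn.col("clientId"), "/"), -1),
        "_"
    ).getItem(0)
)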

How to read xlsx or xls files as spark dataframe

Can anyone let me know how to read xlsx or xls files as a Spark dataframe without converting them first?
I have already tried reading with pandas and then converting to a Spark dataframe, but I got the following error:
Error:
Cannot merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
Code:
import pandas
import os
df = pandas.read_excel('/dbfs/FileStore/tables/BSE.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)
I will try to give a general, updated version (as of April 2021) based on the answers of @matkurek and @Peter Pan.
SPARK
You should install on your databricks cluster the following 2 libraries:
Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5
Clusters -> select your cluster -> Libraries -> Install New -> PyPI-> in Package: xlrd
Then, you will be able to read your excel file as follows:
sparkDF = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'NameOfYourExcelSheet'!A1") \
.load(filePath)
PANDAS
You should install on your databricks cluster the following 2 libraries:
Clusters -> select your cluster -> Libraries -> Install New -> PyPI-> in Package: xlrd
Clusters -> select your cluster -> Libraries -> Install New -> PyPI-> in Package: openpyxl
Then, you will be able to read your excel as follows:
import pandas as pd
pandasDF = pd.read_excel(io = filePath, engine='openpyxl', sheet_name = 'NameOfYourExcelSheet')
Note that you will have two different objects, in the first scenario a Spark Dataframe, in the second a Pandas Dataframe.
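If you then need a Spark DataFrame from the pandas one, a minimal sketch (the astype(str) cast sidesteps the mixed-type merge error from the question, at the cost of every column becoming a string):

sparkDF = spark.createDataFrame(pandasDF.astype(str))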
As mentioned by @matkurek, you can read it from excel directly. Indeed, this is better practice than going through pandas, since otherwise the benefit of Spark is lost.
You can run the same code sample as defined above, just adding the required class to the configuration of your SparkSession.
spark = SparkSession.builder \
.master("local") \
.appName("Word Count") \
.config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
.getOrCreate()
Then, you can read your excel file.
df = spark.read.format("com.crealytics.spark.excel") \
.option("useHeader", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'NameOfYourExcelSheet'!A1") \
.load("your_file"))
You did not show any data from your excel file in your post, but I reproduced the same issue as yours.
Here is the data of my sample excel file test.xlsx, as below.
You can see there are different data types in my column B: a double value 2.2 and a string value C.
So if I run the code below,
import pandas
df = pandas.read_excel('test.xlsx', sheet_name='Sheet1',inferSchema='')
sdf = spark.createDataFrame(df)
it will return the same error as yours:
TypeError: field B: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
If we inspect the dtypes of the df columns via df.dtypes, we will see.
The dtype of column B is object; the spark.createDataFrame function cannot infer the real data type for column B from the data. So to fix it, the solution is to pass a schema that helps the data type inference for column B, as in the code below.
from pyspark.sql.types import StructType, StructField, DoubleType, StringType
schema = StructType([StructField("A", DoubleType(), True), StructField("B", StringType(), True)])
sdf = spark.createDataFrame(df, schema=schema)
This forces column B to StringType and resolves the data type conflict.
You can read an excel file through Spark's read function. That requires a Spark plugin; to install it on Databricks go to:
clusters > your cluster > libraries > install new > select Maven and in 'Coordinates' paste com.crealytics:spark-excel_2.12:0.13.5
After that, this is how you can read the file:
df = spark.read.format("com.crealytics.spark.excel") \
.option("useHeader", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'NameOfYourExcelSheet'!A1") \
.load(filePath)
Just open the xlsx or xls file in pandas, and after that convert it in Spark:
import pandas as pd
df = pd.read_excel('file.xlsx', engine='openpyxl')
df = spark_session.createDataFrame(df.astype(str))
The configuration and code below work for me to read an excel file into a pyspark dataframe. Prerequisites before executing the python code:
Install Maven library on your databricks cluster.
Maven library name & version: com.crealytics:spark-excel_2.12:0.13.5
Databricks Runtime: 9.0 (includes Apache Spark 3.1.2, Scala 2.12)
Execute the code below in your python notebook to load the excel file into a pyspark dataframe:
sheetAddress = "'<enter sheetname>'!A1"
filePath = "<enter excel file full path>"
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("dataAddress", sheetAddress) \
.option("treatEmptyValuesAsNulls", "false") \
.option("inferSchema", "true") \
.load(filePath)
Steps to read .xls / .xlsx files from Azure Blob storage into a Spark DF
You can read the excel files located in Azure blob storage into a pyspark dataframe with the help of a library called spark-excel (also referred to as com.crealytics.spark.excel).
Install the library either using the UI or the Databricks CLI (Cluster settings page > Libraries > Install new option; make sure to choose Maven).
Once the library is installed, you need proper credentials to access Azure blob storage. You can provide the access key in Cluster settings page > Advanced options > Spark config.
Example:
spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net <access key>
Note: If you're the cluster owner you can provide it as a secret instead of giving the access key as plain text, as mentioned in the docs.
Restart the cluster. You can then use the code below to read the excel files located in blob storage:
filePath = "wasbs://<container-name>#<storage-account>.blob.core.windows.net/MyFile1.xls"
DF = spark.read.format("excel").option("header", "true").option("inferSchema", "true").load(filePath)
display(DF)
PS: spark.read.format("excel") is the V2 approach, while spark.read.format("com.crealytics.spark.excel") is the V1; you can read more here.

How to read only n rows of large CSV file on HDFS using spark-csv package?

I have a big distributed file on HDFS, and each time I use sqlContext with the spark-csv package, it first loads the entire file, which takes quite some time.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path")
Now, as I just want to do a quick check at times, all I need is a few / any n rows of the entire file.
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").take(n)
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").head(n)
But all of these run after the file load is done. Can't I just restrict the number of rows while reading the file itself? I am referring to the nrows equivalent of pandas in spark-csv, like:
pd_df = pandas.read_csv("file_path", nrows=20)
Or it might be the case that spark does not actually load the file in the first step, but in that case, why is my file load step taking so much time?
I want
df.count()
to give me only n and not all rows; is that possible?
You can use limit(n).
sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true').load("file_path").limit(20)
This will just load 20 rows.
My understanding is that reading just a few lines is not supported by the spark-csv module directly; as a workaround you could just read the file as a text file, take as many lines as you want, and save them to some temporary location. With the lines saved, you could use spark-csv to read them, including the inferSchema option (which you may want to use given you are in exploration mode).
val numberOfLines = ...
spark.
read.
text("myfile.csv").
limit(numberOfLines).
write.
text(s"myfile-$numberOfLines.csv")
val justFewLines = spark.
read.
option("inferSchema", true). // <-- you are in exploration mode, aren't you?
csv(s"myfile-$numberOfLines.csv")
Not inferring schema and using limit(n) worked for me, in all aspects.
from pyspark.sql.types import StructType, StructField, LongType, IntegerType, DoubleType

f_schema = StructType([
    StructField("col1", LongType(), True),
    StructField("col2", IntegerType(), True),
    StructField("col3", DoubleType(), True)
    ...
])
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true').schema(f_schema).load(data_path).limit(10)
Note: If we use inferschema='true', it takes the same time again, and hence probably behaves the same way as before.
But if we have no idea of the schema, Jacek Laskowski's solution works well too. :)
The solution given by Jacek Laskowski works well. Presenting an in-memory variation below.
I recently ran into this problem. I was using databricks and had a huge csv directory (200 files of 200MB each)
I originally had
val df = spark.read.format("csv")
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.load("dbfs:/huge/csv/files/in/this/directory/")
display(df)
which took a lot of time (10+ minutes), but then I changed it to the following and it ran instantly (2 seconds):
val lines = spark.read.text("dbfs:/huge/csv/files/in/this/directory/").as[String].take(1000)
val df = spark.read
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.csv(spark.createDataset(lines))
display(df)
Inferring a schema for text formats is hard, and it can be done this way for the csv and json formats (but not for multi-line json).
Since PySpark 2.3 you can simply load data as text, limit, and apply csv reader on the result:
(spark
.read
.options(inferSchema="true", header="true")
.csv(
spark.read.text("/path/to/file")
.limit(20) # Apply limit
.rdd.flatMap(lambda x: x))) # Convert to RDD[str]
The Scala counterpart has been available since Spark 2.2:
spark
.read
.options(Map("inferSchema" -> "true", "header" -> "true"))
.csv(spark.read.text("/path/to/file").limit(20).as[String])
In Spark 3.0.0 or later one can also apply limit and use the from_csv function, but it requires a schema, so it probably won't fit your requirements.
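For completeness, a rough sketch of the from_csv route (Spark 3.0+; the DDL schema string and path are hypothetical, and a header line, if present, would need to be filtered out):

from pyspark.sql import functions as F

lines = spark.read.text("/path/to/file").limit(20)
parsed = (lines
          .select(F.from_csv(F.col("value"), "name STRING, age INT").alias("row"))
          .select("row.*"))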
Since I didn't see this solution in the other answers, a pure SQL approach works for me:
df = spark.sql("SELECT * FROM csv.`/path/to/file` LIMIT 10000")
If there is no header the columns will be named _c0, _c1, etc. No schema required.
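If you want meaningful names without defining a schema, one option (a sketch; the column aliases are hypothetical) is to alias the generated columns directly in the query:

df = spark.sql("SELECT _c0 AS id, _c1 AS value FROM csv.`/path/to/file` LIMIT 10000")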
Maybe this will be helpful to anyone working in Java.
Applying limit will not help to reduce the time; you have to collect the n rows from the file.
DataFrameReader frameReader = spark
.read()
.format("csv")
.option("inferSchema", "true");
//set framereader options, delimiters etc
List<String> dataset = spark.read().textFile(filePath).limit(MAX_FILE_READ_SIZE).collectAsList();
return frameReader.csv(spark.createDataset(dataset, Encoders.STRING()));

PySpark How to read CSV into Dataframe, and manipulate it

I'm quite new to pyspark and am trying to use it to process a large dataset which is saved as a csv file.
I'd like to read CSV file into spark dataframe, drop some columns, and add new columns.
How should I do that?
I am having trouble getting this data into a dataframe. This is a stripped down version of what I have so far:
def make_dataframe(data_portion, schema, sql):
    fields = data_portion.split(",")
    return sql.createDataFrame([(fields[0], fields[1])], schema=schema)
if __name__ == "__main__":
    sc = SparkContext(appName="Test")
    sql = SQLContext(sc)
    ...
    big_frame = data.flatMap(lambda line: make_dataframe(line, schema, sql)) \
        .reduce(lambda a, b: a.union(b))

    big_frame.write \
        .format("com.databricks.spark.redshift") \
        .option("url", "jdbc:redshift://<...>") \
        .option("dbtable", "my_table_copy") \
        .option("tempdir", "s3n://path/for/temp/data") \
        .mode("append") \
        .save()

    sc.stop()
This produces an error TypeError: 'JavaPackage' object is not callable at the reduce step.
Is it possible to do this? The idea with reducing to a dataframe is to be able to write the resulting data to a database (Redshift, using the spark-redshift package).
I have also tried using unionAll(), and map() with partial() but can't get it to work.
I am running this on Amazon's EMR, with spark-redshift_2.10:2.0.0, and Amazon's JDBC driver RedshiftJDBC41-1.1.17.1017.jar.
Update - answering also your question in comments:
Read data from CSV to dataframe:
It seems that you just want to read a CSV file into a spark dataframe.
If so, my answer here covers this: https://stackoverflow.com/a/37640154/5088142
The following code should read the CSV into a spark dataframe:
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sql = SQLContext(sc)
df = (sql.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load("/path/to_csv.csv"))
# These lines are equivalent in Spark 2.0, using SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
spark.read.format("csv").option("header", "true").load("/path/to_csv.csv")
spark.read.option("header", "true").csv("/path/to_csv.csv")
drop column
You can drop a column using "drop(col)":
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
drop(col)
Returns a new DataFrame that drops the specified column.
Parameters: col – a string name of the column to drop, or a Column to drop.
>>> df.drop('age').collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.drop(df.age).collect()
[Row(name=u'Alice'), Row(name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df.name).collect()
[Row(age=5, height=85, name=u'Bob')]
>>> df.join(df2, df.name == df2.name, 'inner').drop(df2.name).collect()
[Row(age=5, name=u'Bob', height=85)]
add column
You can use "withColumn"
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
withColumn(colName, col)
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
Parameters:
colName – string, name of the new column.
col – a Column expression for the new column.
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]
Note: spark has a lot of other functions which can be used (e.g. you can use "select" instead of "drop")
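Putting the pieces together, a minimal sketch of read + drop + add using the Spark 2.0 API (the file path and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv_example").getOrCreate()

# Read the CSV, drop an unwanted column, add a derived column
df = spark.read.option("header", "true").csv("/path/to_csv.csv")
df = df.drop("unwanted_col") \
       .withColumn("age2", F.col("age").cast("int") + 2)
df.show()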

Get CSV to Spark dataframe

I'm using python on Spark and would like to get a csv into a dataframe.
The documentation for Spark SQL strangely does not provide explanations for CSV as a source.
I have found Spark-CSV, however I have issues with two parts of the documentation:
"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument every time I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in python rather than redownloading it each time?
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "cars.csv") Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?
With more recent versions of Spark (as of, I believe, 1.4) this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:
df = sqlContext.read.csv("/path/to/your.csv")
Note that you can also indicate that the csv file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available, and described in the link above.
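For example, a minimal call with the header option might look like this (Spark 2.0+; the path is a placeholder):

df = sqlContext.read.csv("/path/to/your.csv", header=True)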
from pyspark.sql.types import StringType
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
Employee_rdd = sc.textFile("\..\Employee.csv") \
    .map(lambda line: line.split(","))
Employee_df = Employee_rdd.toDF(['Employee_ID', 'Employee_name'])
Employee_df.show()
For Pyspark, assuming that the first row of the csv file contains a header:
spark = SparkSession.builder.appName('chosenName').getOrCreate()
df=spark.read.csv('fileNameWithPath', mode="DROPMALFORMED",inferSchema=True, header = True)
Read the csv file into an RDD and then generate a RowRDD from the original RDD.
Create the schema represented by a StructType matching the structure of the Rows in the RDD created in step 1.
Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
source: SPARK PROGRAMMING GUIDE
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1', 'column 2'])
Following Spark 2.0, it is recommended to use a Spark Session:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession \
.builder \
.appName("basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
def mapper(line):
    fields = line.split(',')
    return Row(ID=int(fields[0]), field1=str(fields[1].encode("utf-8")), field2=int(fields[2]), field3=int(fields[3]))
lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)
# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")
I ran into a similar problem. The solution is to add an environment variable named "PYSPARK_SUBMIT_ARGS" and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed. With Scala 2.11, it is spark-csv_2.11 and with Scala 2.10 or 2.10.5 it is spark-csv_2.10.
Hope it works.
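For illustration, one way to set this variable from Python before the SparkContext is created (the package coordinates are taken from the answer above):

import os

# Must be set before pyspark creates the SparkContext
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
)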
Based on the answer by Aravind, but much shorter, e.g.:
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])
With the current implementation (Spark 2.x) you don't need to add the packages argument; you can use the built-in csv implementation.
Additionally, unlike the accepted answer, you don't need to create an RDD and then enforce a schema, which has one potential problem:
when you read the csv as text, all fields are marked as string, and when you enforce a schema with an integer column you will get an exception.
A better way to do the above would be:
spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()
