More convenient way to reproduce pyspark sample

Most questions about Spark show their data as the output of show(), without the code that generates the dataframe, like this:
df.show()
+-------+--------+----------+
|USER_ID|location| timestamp|
+-------+--------+----------+
| 1| 1001|1265397099|
| 1| 6022|1275846679|
| 1| 1041|1265368299|
+-------+--------+----------+
How can I reproduce this data in my programming environment without retyping it manually? Does pyspark have an equivalent of read_clipboard in pandas?
Edit
The lack of a function to import such data into my environment is a big obstacle when I try to help others with pyspark on Stack Overflow.
So my question is:
What is the most convenient way to load data pasted on Stack Overflow from a show command into my environment?

You can always use the following function:
from pyspark.sql.functions import col, when
def read_spark_output(file_path):
    # read the saved show() output as a pipe-delimited CSV;
    # border lines start with "+" and are skipped as comments
    step1 = spark.read \
        .option("header","true") \
        .option("inferSchema","true") \
        .option("delimiter","|") \
        .option("parserLib","UNIVOCITY") \
        .option("ignoreLeadingWhiteSpace","true") \
        .option("ignoreTrailingWhiteSpace","true") \
        .option("comment","+") \
        .csv("file://{}".format(file_path))
    # select not-null columns
    step2 = step1.select([c for c in step1.columns if not c.startswith("_")])
    # deal with 'null' string in column
    return step2.select(*[when(~col(col_name).eqNullSafe("null"), col(col_name)).alias(col_name) for col_name in step2.columns])
This is one of the suggestions given in the question How to make good reproducible Apache Spark examples.
Note 1: There are special cases where this function does not apply and produces errors, e.g. the data in Group by column "grp" and compress DataFrame - (take last not null value for each column ordering by column "ord").
So please use it with caution!
Note 2: (Disclaimer) I'm not the original author of the code. Thanks to @MaxU for the code; I only made some modifications to it.

Late answer, but I often face the same issue so wrote a small utility for this https://github.com/ollik1/spark-clipboard
It basically allows copy-pasting data frame show strings into Spark. To install it, add the jcenter dependency com.github.ollik1:spark-clipboard_2.12:0.1 and the Spark config .config("fs.clipboard.impl", "com.github.ollik1.clipboard.ClipboardFileSystem"). After this, data frames can be read directly from the system clipboard:
val df = spark.read
.format("com.github.ollik1.clipboard")
.load("clipboard:///*")
or alternatively from files if you prefer. Installation details and usage are described in the readme file.

You can always read the data in pandas as a pandas dataframe and then convert it to a Spark dataframe. No, unlike pandas, pyspark has no direct equivalent of read_clipboard.
The reason is that pandas dataframes are mostly flat structures, whereas Spark dataframes can have complex structures such as structs and arrays. Because Spark has a wide variety of data types that do not appear in the console output, it is not possible to recreate the dataframe exactly from that output.

You can combine pandas read_clipboard with a conversion to a pyspark dataframe:
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, LongType
pdDF = pd.read_clipboard(sep=',',
                         index_col=0,
                         names=['USER_ID',
                                'location',
                                'timestamp',
                                ])
mySchema = StructType([StructField("USER_ID", StringType(), True),
                       StructField("location", LongType(), True),
                       StructField("timestamp", LongType(), True)])
# note: True means the field is nullable
df = spark.createDataFrame(pdDF, schema=mySchema)
Update:
What @terry really wants is to copy an ASCII table into Python; the following is an example. Once the data is parsed in Python, you can convert it to anything.
def parse(ascii_table):
    header = []
    data = []
    for line in filter(None, ascii_table.split('\n')):
        if '-+-' in line:
            continue
        # build a list: in Python 3 filter() is lazy and has no len()
        cells = [x for x in line.split() if x != '|']
        if not header:
            header = cells
            continue
        data.append([''] * len(header))
        for i in range(len(cells)):
            data[-1][i] = cells[i]
    return header, data
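The parser above splits on whitespace, so it expects spaces around every pipe. The output of show() usually has none (for example |USER_ID|location| timestamp|), so a variant that splits on the pipe character may be more convenient. This is only a minimal sketch: the function name parse_show_output is made up for illustration, every value comes back as a string, and cell values that themselves contain a pipe are not handled.

```python
def parse_show_output(text):
    """Parse the text printed by DataFrame.show() into (header, rows)."""
    header, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the +-----+-----+ border lines.
        if not line or line.startswith("+"):
            continue
        # Drop the outer pipes, then split on the inner ones.
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            # show() prints missing values as the literal string "null".
            rows.append([None if cell == "null" else cell for cell in cells])
    return header, rows


sample = """
+-------+--------+----------+
|USER_ID|location| timestamp|
+-------+--------+----------+
|      1|    1001|1265397099|
|      1|    6022|1275846679|
|      1|    null|1265368299|
+-------+--------+----------+
"""
header, rows = parse_show_output(sample)
```

Assuming a SparkSession named spark is available, the result could then be passed to spark.createDataFrame(rows, header) and the columns cast to the intended types afterwards.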

Related

Read csv that contains array of string in pyspark

I'm trying to read a csv that has the following data:
name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
Using inferSchema results in the stops field spilling over into the next columns and messing up the dataframe.
If I give my own schema like:
schema = StructType([
    StructField('name', StringType()),
    StructField('date', TimestampType()),
    StructField('win', BooleanType()),
    StructField('stops', ArrayType(StringType())),
    StructField('cost', DoubleType())])
this results in the following exception:
pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.
So how would I properly read the csv without this failure?

Since CSV doesn't support arrays, you need to first read the column as a string, then convert it.
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
# You need to set escape option to ", since it is not the default escape character (\).
df = spark.read.csv('file.csv', header=True, escape='"')
df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))
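To see why the two steps are needed, here is the same idea sketched with Python's standard csv and json modules rather than Spark (so it is an analogy, not the Spark code path). Inside a quoted CSV field, a doubled quote stands for one literal quote, so after CSV parsing the stops cell is an ordinary JSON array string that still has to be decoded. The sample rows are taken from the question.

```python
import csv
import io
import json

raw = '''name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
'''

# Step 1: parse the CSV; the csv module treats "" inside quotes as one quote.
rows = list(csv.DictReader(io.StringIO(raw)))
first_stops = rows[0]["stops"]

# Step 2: decode the JSON text into a real list; empty cells become None.
parsed = [json.loads(r["stops"]) if r["stops"] else None for r in rows]
```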

I guess this is what you are looking for:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")
dataframe.printSchema()
Let me know if it helps

pyspark parse filename on load

I'm quite new to spark and there is one thing that I don't understand: how to manipulate column content.
I have a set of csv files as follows:
each dsX is a table and I would like to load the data at once for each table.
So far no problems:
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*")
But one piece of information is missing: the client_id, which is the first part of the csv file name: clientId_table_category.csv
So I tried to do this:
def extract_path(patht):
    print(patht)
    return patht
df = spark.read.format('csv') \
.option("header", "true") \
.option("escape", "\"") \
.load(table+"/*") \
.withColumn("clientId", fn.lit(extract_path(fn.input_file_name())))
But the print returns:
Column<b'input_file_name()'>
And I can't do much with this.
I'm quite stuck here, how do you manipulate data in this configuration?
Another solution for me is to load each csv one by one and parse the clientId from the file name manually, but I was wondering if there wouldn't be a more powerful solution with spark.

you are going a little too far away :
df = spark.read.csv(
table+"/*",
header=True,
sep='\\'
).withColumn("clientId", fn.input_file_name())
This will create a column with the full path. Then you just need some extra string manipulation, which is easy with a UDF. You can also do that with built-in functions, but it is trickier.
from pyspark.sql import functions as fn
from pyspark.sql.types import StringType
@fn.udf(StringType())
def get_id(in_string):
    # keep the file name, then take the part before the first underscore
    return in_string.split("/")[-1].split("_")[0]
df = df.withColumn(
    "clientId",
    get_id(fn.col("clientId"))
)
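The built-in route mentioned above usually comes down to a regular expression. Below is a minimal sketch that checks such a pattern locally with Python's re module. The pattern, the helper name client_id_from_path and the sample paths are assumptions based on the clientId_table_category.csv naming in the question.

```python
import re

# Hypothetical pattern: within the last path segment, capture the text
# before the first underscore.
CLIENT_ID_PATTERN = r"([^/_]+)_[^/]*$"


def client_id_from_path(path):
    """Return the client id encoded in a file path, or None if absent."""
    match = re.search(CLIENT_ID_PATTERN, path)
    return match.group(1) if match else None


# Made-up paths in the shape input_file_name() would typically produce.
paths = [
    "hdfs://namenode/data/ds1/42_ds1_sales.csv",
    "file:///tmp/my_dir/acme_ds2_hr.csv",
    "/tmp/ds2/noid.csv",
]
ids = [client_id_from_path(p) for p in paths]
```

The same pattern could then be tried with fn.regexp_extract(fn.input_file_name(), CLIENT_ID_PATTERN, 1), which avoids the UDF. That variant has not been run here, so treat it as a starting point.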

Appending data to an empty dataframe

I am creating an empty dataframe and later trying to append another dataframe to it. In fact, I want to append many dataframes to the initially empty dataframe dynamically, depending on the number of RDDs coming in.
The union() function works fine if I assign the result to a third dataframe.
val df3=df1.union(df2)
But I want to keep appending to the initial (empty) dataframe I created, because I want to store all the RDDs in one dataframe. The code below, however, does not show the right counts. It seems that it simply did not append.
df1.union(df2)
df1.count() // this shows 0 although df2 has some data, which is shown if I assign to a third dataframe.
If I do the below, I get a reassignment error since df1 is a val. And if I change it to var type, I get kafka multithreading not safe error.
df1=df1.union(df2)
Any idea how to add all the dynamically created dataframes to one initially created dataframe?

Not sure if this is what you are looking for!
# Import pyspark functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Define your schema
field = [StructField("Col1",StringType(), True), StructField("Col2", IntegerType(), True)]
schema = StructType(field)
# Your empty data frame
df = spark.createDataFrame(sc.emptyRDD(), schema)
l = []
for i in range(5):
    # Build and append to the list dynamically
    l = l + [([str(i), i])]
    # Create a temporary data frame similar to your original schema
    temp_df = spark.createDataFrame(l, schema)
    # Do the union with the original data frame
    df = df.union(temp_df)
df.show()

DataFrames and other distributed data structures are immutable, so methods that operate on them always return a new object. There is no appending, no modification in place, and no ALTER TABLE equivalent.
And if I change it to var type, I get kafka multithreading not safe error.
Without the actual code it is impossible to give you a definitive answer, but it is unlikely to be related to the union code.
There are a number of known Spark bugs caused by incorrect internal implementation (SPARK-19185 and SPARK-23623, to name just a few).
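Because union() returns a new object, one common pattern is to collect the pieces in a list and fold them together at the end instead of mutating a single variable. The sketch below is an assumption-laden illustration: union_all is a made-up helper, and frozenset is used as a stand-in because it is also immutable and has a union() method, so the example runs without a cluster. Note that frozenset.union removes duplicates, whereas DataFrame.union keeps them.

```python
from functools import reduce


def union_all(frames):
    """Fold a non-empty list of objects into one by repeated .union() calls."""
    return reduce(lambda acc, frame: acc.union(frame), frames)


# Stand-in batches; with Spark these would be DataFrames sharing one schema.
batch1 = frozenset({("a", 1)})
batch2 = frozenset({("b", 2)})
batch3 = frozenset({("c", 3)})

batch1.union(batch2)  # result discarded: batch1 itself is unchanged
combined = union_all([batch1, batch2, batch3])
```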

How to read only n rows of large CSV file on HDFS using spark-csv package?

I have a big distributed file on HDFS and each time I use sqlContext with spark-csv package, it first loads the entire file which takes quite some time.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path")
Now, as I just want to do a quick check at times, all I need is a few (any n) rows of the entire file.
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").take(n)
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").head(n)
but all these run after the file load is done. Can't I just restrict the number of rows while reading the file itself ? I am referring to n_rows equivalent of pandas in spark-csv, like:
pd_df = pandas.read_csv("file_path", nrows=20)
Or it might be the case that spark does not actually load the file, the first step, but in this case, why is my file load step taking too much time then?
I want
df.count()
to give me only n and not all rows, is it possible ?

You can use limit(n).
sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true').load("file_path").limit(20)
This will return just 20 rows.

My understanding is that reading just a few lines is not supported by spark-csv module directly, and as a workaround you could just read the file as a text file, take as many lines as you want and save it to some temporary location. With the lines saved, you could use spark-csv to read the lines, including inferSchema option (that you may want to use given you are in exploration mode).
val numberOfLines = ...
spark.
read.
text("myfile.csv").
limit(numberOfLines).
write.
text(s"myfile-$numberOfLines.csv")
val justFewLines = spark.
read.
option("inferSchema", true). // <-- you are in exploration mode, aren't you?
csv(s"myfile-$numberOfLines.csv")
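The two-step idea (take the first lines as plain text, then run the CSV parser over only those lines) can be illustrated outside Spark with Python's standard library. This is a local analogy rather than Spark code: read_first_rows is a made-up helper, an in-memory stream stands in for the large file, and quoted fields that span several lines would not be counted correctly.

```python
import csv
import io
from itertools import islice


def read_first_rows(lines, n):
    """Parse the header plus at most n data rows from an iterable of CSV lines."""
    head = list(islice(lines, n + 1))  # header line + n data lines
    return list(csv.DictReader(head))


# Stand-in for a large file: a text stream with 1000 data rows.
big_file = io.StringIO(
    "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(1000))
)
sample = read_first_rows(big_file, 3)
```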

Not inferring the schema and using limit(n) worked for me, in all respects.
f_schema = StructType([
    StructField("col1",LongType(),True),
    StructField("col2",IntegerType(),True),
    StructField("col3",DoubleType(),True)
    ...
])
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true').schema(f_schema).load(data_path).limit(10)
Note: If we use inferschema='true', it takes the same amount of time again, presumably because the whole file is still read.
But if we don't know the schema, Jacek Laskowski's solution works well too. :)

The solution given by Jacek Laskowski works well. Presenting an in-memory variation below.
I recently ran into this problem. I was using databricks and had a huge csv directory (200 files of 200MB each)
I originally had
val df = spark.read.format("csv")
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.load("dbfs:/huge/csv/files/in/this/directory/")
display(df)
which took a lot of time (10+ minutes), but then I changed it to the code below and it ran almost instantly (2 seconds)
val lines = spark.read.text("dbfs:/huge/csv/files/in/this/directory/").as[String].take(1000)
val df = spark.read
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.csv(spark.createDataset(lines))
display(df)
Inferring schema for text formats is hard and it can be done this way for the csv and json (but not if it's a multi-line json) formats.

Since PySpark 2.3 you can simply load data as text, limit, and apply csv reader on the result:
(spark
.read
.options(inferSchema="true", header="true")
.csv(
spark.read.text("/path/to/file")
.limit(20) # Apply limit
.rdd.flatMap(lambda x: x))) # Convert to RDD[str]
Scala counterpart is available since Spark 2.2:
spark
.read
.options(Map("inferSchema" -> "true", "header" -> "true"))
.csv(spark.read.text("/path/to/file").limit(20).as[String])
In Spark 3.0.0 or later one can also apply limit and use from_csv function, but it requires a schema, so it probably won't fit your requirements.

Since I didn't see that solution in the answers, the pure SQL-approach is working for me:
df = spark.sql("SELECT * FROM csv.`/path/to/file` LIMIT 10000")
If there is no header the columns will be named _c0, _c1, etc. No schema required.

Maybe this will be helpful for those working in Java.
Applying limit alone will not help to reduce the time. You have to collect the n rows from the file.
DataFrameReader frameReader = spark
.read()
.format("csv")
.option("inferSchema", "true");
//set framereader options, delimiters etc
List<String> dataset = spark.read().textFile(filePath).limit(MAX_FILE_READ_SIZE).collectAsList();
return frameReader.csv(spark.createDataset(dataset, Encoders.STRING()));

Get CSV to Spark dataframe

I'm using python on Spark and would like to get a csv into a dataframe.
The documentation for Spark SQL strangely does not provide explanations for CSV as a source.
I have found Spark-CSV, however I have issues with two parts of the documentation:
"This package can be added to Spark using the --jars command line option. For example, to include it when starting the spark shell: $ bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3"
Do I really need to add this argument every time I launch pyspark or spark-submit? It seems very inelegant. Isn't there a way to import it in python rather than redownloading it each time?
df = sqlContext.load(source="com.databricks.spark.csv", header="true", path = "cars.csv")
Even if I do the above, this won't work. What does the "source" argument stand for in this line of code? How do I simply load a local file on linux, say "/Spark_Hadoop/spark-1.3.1-bin-cdh4/cars.csv"?

With more recent versions of Spark this has become a lot easier. The expression sqlContext.read (available since 1.4) gives you a DataFrameReader instance, which has a built-in .csv() method as of Spark 2.0:
df = sqlContext.read.csv("/path/to/your.csv")
Note that you can also indicate that the csv file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available, and described in the DataFrameReader documentation.

from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)
# parentheses let the chained call continue on the next line
Employee_rdd = (sc.textFile(r"\..\Employee.csv")
                .map(lambda line: line.split(",")))
Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])
Employee_df.show()

For PySpark, assuming that the first row of the csv file contains a header:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('chosenName').getOrCreate()
df = spark.read.csv('fileNameWithPath', mode="DROPMALFORMED", inferSchema=True, header=True)

1. Read the csv file into an RDD and then generate a Row RDD from the original RDD.
2. Create the schema, represented by a StructType matching the structure of the Rows in the RDD created in Step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
from pyspark.sql.types import StructType, StructField, StringType
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = "name age"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
source: SPARK PROGRAMMING GUIDE

If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])

Following Spark 2.0, it is recommended to use a Spark Session:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession \
.builder \
.appName("basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
def mapper(line):
    fields = line.split(',')
    return Row(ID=int(fields[0]), field1=str(fields[1].encode("utf-8")), field2=int(fields[2]), field3=int(fields[3]))
lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)
# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")

I ran into similar problem. The solution is to add an environment variable named as "PYSPARK_SUBMIT_ARGS" and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed. With Scala 2.11, it is spark-csv_2.11 and with Scala 2.10 or 2.10.5 it is spark-csv_2.10.
Hope it works.

Based on the answer by Aravind, but much shorter, e.g. :
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])

With the current implementation (Spark 2.x) you don't need to add the packages argument; you can use the built-in csv implementation.
Additionally, unlike the accepted answer, you don't need to create an RDD and then enforce a schema, which has one potential problem:
when you read the csv as an RDD, all the fields are marked as strings, and when you enforce a schema with an integer column you will get an exception.
A better way to do the above would be:
spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()
