Pyspark dataframe: load from csv and then remove the first line - python-3.x

I am able to load csv file from Azure datalake into pyspark dataframe.
How to remove the first line and make the second line as my header?
I have seen some RDD solutions, but I am not able to load the file that way; the following code fails with an "RDD is empty" error:
items = sc.textFile(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/tmp/items.csv")
firstRow = items.first()
Hence I prefer to load it with the standard Spark reader, as below. I can display the dataframe contents, but I still need to drop the first line and make the 2nd row the header. Thanks.
items = spark.read.format("csv").load(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/tmp/items.csv", header=True)

Try this:
It's not an optimized solution, but it will meet the requirement.
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 2, 3), (4, 5, 6), (7, 8, 9)], ['a', 'b', 'c'])
df.show()

# Attach a row index, keep everything after the first row, then unpack the struct column.
df1 = df.rdd.zipWithIndex().toDF().where(F.col('_2') > 0).drop('_2')
for each_col in df.columns:
    df1 = df1.withColumn(each_col, F.col('_1.' + each_col))
df1.drop('_1').show()
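A hedged sketch of the same idea applied to the file from the question: read without a header so both the junk first line and the real header arrive as rows, drop the first physical row with zipWithIndex, then promote the now-first row to the header. The abfss path and variable names are placeholders from the question, not verified against your environment.
from pyspark.sql import functions as F

path = f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/tmp/items.csv"

# Read with header=False so the first two physical lines come in as ordinary rows.
raw = spark.read.format("csv").load(path, header=False)

# Attach an index, drop the first physical row, and strip the index again.
rows = (raw.rdd.zipWithIndex()
           .filter(lambda pair: pair[1] > 0)   # pair = (Row, index)
           .map(lambda pair: pair[0]))

# The first remaining row holds the real column names; use it and filter it out of the data.
header = rows.first()
items = rows.filter(lambda row: row != header).toDF(list(header))
items.show()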

Related

Using PySpark to read in datalake table and can't parse timestamp column in Synapse Analytics

I can read in the datalake table and print the schema, but if I try to display the data I get the following error. I am working within Synapse Analytics using a PySpark notebook and an Apache Spark pool.
See error message:
You may get a different result due to the upgrading of Spark 3.0: Fail to parse '10/27/2022 1:14:31 PM' in the new parser.
You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
I don't want to use the LEGACY version.
I've tried converting using the following code:
df = df.withColumn("SinkCreatedOn",to_date(col("SinkCreatedOn"),"M/dd/yyyy h:m:s"))
df = df.withColumn("SinkModifiedOn",to_date(col("SinkModifiedOn"),"M/dd/yyyy h:m:s"))
I've also tried converting the suspect columns to StringType() or DateType() but no luck.
Any help appreciated
Thank you
Try the script with the date format below:
df = df.withColumn("SinkCreatedOn", to_date(col("SinkCreatedOn"), "MM/dd/yyyy h:mm:s a"))
I repro'd the same with sample input. Below is the approach.
Code:
from pyspark.sql.functions import *

df1 = spark.createDataFrame(
    data=[("1", "Arpit", "10/27/2022 1:14:31 PM"),
          ("2", "Anand", "10/28/2022 1:14:31 PM"),
          ("3", "Mike", "10/29/2022 1:14:31 PM")],
    schema=["id", "Name", "SinkCreatedOn"])
df1.printSchema()

df_output = df1.withColumn("SinkCreatedOn", to_date(col("SinkCreatedOn"), "MM/dd/yyyy h:mm:s a"))
df1.show()
df_output.show()
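If the time of day needs to be preserved rather than truncated to a date, to_timestamp accepts the same style of pattern. A minimal sketch, assuming the same sample dataframe df1 as above (spelling seconds as ss is my assumption, not taken from the original answer):
from pyspark.sql.functions import to_timestamp, col

# Keeps the full timestamp instead of truncating it to a date.
df_ts = df1.withColumn("SinkCreatedOn", to_timestamp(col("SinkCreatedOn"), "MM/dd/yyyy h:mm:ss a"))
df_ts.printSchema()   # SinkCreatedOn should now be a timestamp column
df_ts.show(truncate=False)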

Pyspark: How to convert a spark dataframe to json and save it as json file?

I am trying to convert my PySpark SQL dataframe to JSON and then save it as a file.
df_final = df_final.union(join_df)
df_final contains values as shown below.
I tried something like this, but it created invalid JSON:
df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)
{"Variable":"Col1","Min":"20","Max":"30"}
{"Variable":"Col2","Min":"25,"Max":"40"}
My expected file should have data as below:
[
{"Variable":"Col1",
"Min":"20",
"Max":"30"},
{"Variable":"Col2",
"Min":"25,
"Max":"40"}]
For PySpark you can store your dataframe directly as a JSON file; there is no need to convert the dataframe to JSON yourself.
df_final.coalesce(1).write.format('json').save('/path/file_name.json')
If you still want to convert your dataframe to JSON strings, you can use
df_final.toJSON()
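A small, hedged sketch of what toJSON() returns, assuming df_final is small enough to collect (this usage is illustrative, not part of the original answer):
# toJSON() yields one JSON string per row.
json_rows = df_final.toJSON().collect()
for line in json_rows[:5]:
    print(line)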
A solution can be to collect the rows and then use json.dump:
import json

# Row objects are not JSON-serializable directly, so convert them to dicts first.
collected_rows = [row.asDict() for row in df_final.collect()]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(collected_rows, outfile)
Here is how you can do the equivalent of json.dump for a dataframe with PySpark 1.3+.
import json

df_list_of_jsons = df.toJSON().collect()
df_list_of_dicts = [json.loads(x) for x in df_list_of_jsons]
df_json = json.dumps(df_list_of_dicts)
sc.parallelize([df_json]).repartition(1).cache().saveAsTextFile("<HDFS_PATH>")
Note this will load the whole dataframe into driver memory, so it is only recommended for small dataframes.
If you want to use Spark to process the result as JSON files, then I think your output layout in HDFS is already right.
I assume you ran into the issue that you cannot read the data back cleanly from a normal Python script using:
with open('data.json') as f:
    data = json.load(f)
You should instead read the data line by line:
import json

data = []
with open("data.json", 'r') as datafile:
    for line in datafile:
        data.append(json.loads(line))
and then you can use pandas to create a dataframe:
import pandas as pd

df = pd.DataFrame(data)
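Alternatively, a minimal sketch assuming the same data.json file: pandas can read line-delimited JSON directly, which matches the format Spark writes.
import pandas as pd

# lines=True treats each line as a separate JSON record.
df = pd.read_json("data.json", lines=True)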

Save and append a file in HDFS using PySpark

I have a data frame in PySpark called df. I have registered this df as a temp table like below.
df.registerTempTable('mytempTable')
date=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
Now from this temp table I will get certain values, like the max_id of the column id:
min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']
max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
Now I will collect all these values like below.
test = ("{},{},{}".format(date,min_id,max_id))
I found that test is not a data frame but a plain str string:
>>> type(test)
<type 'str'>
Now I want to save this test as a file in HDFS. I would also like to append data to the same file in HDFS.
How can I do that using PySpark?
FYI I am using Spark 1.6 and don't have access to Databricks spark-csv package.
Here you go, you'll just need to concat your data with concat_ws and write it as text:
query = """select concat_ws(',', '{}', nvl(min(id), 0), nvl(max(id), 0))
           from mytempTable""".format(date)
sqlContext.sql(query).write.format("text").mode("append").save("/tmp/fooo")
Or, an even better alternative:
from pyspark.sql import functions as f
(sqlContext
.table("myTempTable")
.select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id")))
.coalesce(1)
.write.format("text").mode("append").save("/tmp/fooo"))
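A hedged alternative, given that the question already builds the string test in Python: wrap it in a one-row DataFrame and append it as text. This assumes Spark 1.6 as stated in the question; the column name and output path are placeholders.
# Build a one-row, single-string-column DataFrame from the formatted string.
row_df = sqlContext.createDataFrame([(test,)], ["value"])
row_df.coalesce(1).write.format("text").mode("append").save("/tmp/fooo")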

Apache SPARK with SQLContext:: IndexError

I am trying to execute a basic example provided in the "Inferring the Schema Using Reflection" section of the Apache Spark documentation.
I'm doing this on the Cloudera Quickstart VM (CDH5).
The example I'm trying to execute is as below:
# sc is an existing SparkContext.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# Load a text file and convert each line to a Row.
lines = sc.textFile("/user/cloudera/analytics/book6_sample.csv")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
# SQL can be run over DataFrames that have been registered as a table.
teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
# The results of SQL queries are RDDs and support all the normal RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)
for teenName in teenNames.collect():
    print(teenName)
I ran the code exactly as shown above, but I always get the error "IndexError: list index out of range" when I execute the last command (the for loop).
The input file book6_sample is available at book6_sample.csv.
Please suggest pointers on where I'm going wrong.
Thanks in advance.
Regards,
Sri
Your file has an empty line at the end, which is causing this error. Open your file in a text editor and remove that line; it should then work.
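If editing the file is not convenient, here is a hedged sketch of handling it in code instead: filter out blank (and too-short) lines before building the Rows. This reuses the path and imports from the question and is my assumption about the data, not part of the original answer.
from pyspark.sql import Row

lines = sc.textFile("/user/cloudera/analytics/book6_sample.csv")
# Drop empty lines, then split and keep only rows with at least two fields.
parts = lines.filter(lambda l: l.strip()).map(lambda l: l.split(","))
people = parts.filter(lambda p: len(p) >= 2).map(lambda p: Row(name=p[0], age=int(p[1])))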

How to read the csv and convert to RDD in sparkR

As I am an R programmer and want to use R as an interface to Spark, I installed the SparkR package in R.
I'm new to SparkR. I want to perform some operations on particular data in a CSV record, so I'm trying to read a CSV file and convert it to an RDD.
This is the code I tried:
sc <- sparkR.init(master="local")  # created the Spark context
data <- read.csv(sc, "/home/data1.csv")
# It throws an error telling me to use read.table
The data I have to load and convert: http://i.stack.imgur.com/sj78x.png
If I am doing this wrong, how do I read this CSV data and convert it to an RDD in SparkR?
TIA
I believe that the problem is the header line; if you remove it, it should work.
How do I convert csv file to rdd
--edited--
With this code you can test SparkR with CSVs, but you need to remove the header line from your CSV file first.
lines <- textFile(sc, "/home/data1.csv")
csvElements <- lapply(lines, function(line) {
  # line represents each CSV line, i.e. strsplit(line, ",") is useful here
})
In the recent SparkR version (2.0+)
read.df(path, source = "csv")
In Spark 1.x
read.df(sc, path, source = "com.databricks.spark.csv")
with
spark.jars.packages com.databricks:spark-csv_2.10:1.4.0
The code below will let you read a CSV with a header. All the best.
val csvrdd = spark.read.option("header", "true").csv(filename)
