How does Spark's sc.textFile work? - apache-spark

JavaRDD<String> input = sc.textFile("data.txt");
For the above sample code in Spark, I know it returns a distributed list of strings. But is each individual string in that list a line of data.txt, or the word tokens of a line?

A string in your RDD equals a line in data.txt.
If the data in your data.txt file is some kind of CSV data, you can use the spark-csv package, which will split the data into columns for you so you don't have to parse the lines yourself.
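As a minimal sketch in the Scala spark-shell (the file name and split pattern are just illustrative): each element of the RDD returned by textFile is a whole line, and you only get word tokens if you split the lines yourself, for example with flatMap:
val lines = sc.textFile("data.txt")        // one RDD element per line of data.txt
lines.take(2).foreach(println)             // prints the first two lines verbatim

// To get word tokens instead, split each line yourself:
val words = lines.flatMap(_.split("\\s+")) // one RDD element per whitespace-separated token
words.take(10).foreach(println)
The same applies to JavaRDD: each String element is one line of the file.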

Related

Spark remove special characters from column name read from a parquet file [duplicate]

I have parquet files which I have read using the following Spark command:
lazy val out = spark.read.parquet("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")
The column names of a lot of the columns contain the special character "(", e.g. WA_0_DWHRPD_Purge_Date_(TOD) and WA_0_DWHRRT_Record_Type_(80=Index). How can I remove this special character?
My end goal is to remove these special characters and write the parquet file back using the following command:
df_hive.write.format("parquet").save("hdfs:///tmp/oip/logprint_poc_cleaned/")
Also, I am using the Scala Spark shell.
I am new to Spark; I saw similar questions, but nothing works in my case. Any help is appreciated.
The first thing you can do is read the parquet files into a data frame, as you are already doing:
val out = spark.read.parquet("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")
Once you have created the data frame, fetch its schema and iterate over it to remove all the special characters, as below:
import org.apache.spark.sql.types.{StructField, StructType}

val schema = StructType(out.schema.map(x =>
  StructField(
    x.name.toLowerCase().replace(" ", "_").replace("#", "").replace("-", "_")
      .replace(")", "").replace("(", "").trim(),
    x.dataType, x.nullable)))
Now you can read the data back from the parquet files by specifying the schema that you have created.
val newDF = spark.read.format("parquet").schema(schema).load("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")
Now you can go ahead and save the data frame with the cleaned column names, as you wanted.
newDF.write.format("parquet").save("hdfs:///tmp/oip/logprint_poc_cleaned/")
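You can also sanity-check the cleaned names by printing the columns of the re-read data frame; the expected output below is only a sketch based on the two example columns from the question:
newDF.columns.foreach(println)
// wa_0_dwhrpd_purge_date_tod
// wa_0_dwhrrt_record_type_80=index
// If the parquet writer still complains, characters such as "=" may need the
// same replace() treatment as "(" and ")".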

Pyspark NLTK save output

I'm using Spark 2.3.1 and I'm running NLTK on thousands of input files.
From the input files I'm extracting unigram, bigram, and trigram words and saving each set in a different dataframe.
Now I want to save the dataframes into their respective files in HDFS (appending the output to the same file every time).
So at the end I have three CSV files named unigram.csv, bigram.csv, and trigram.csv, containing the results for thousands of input files.
If this scenario isn't possible with HDFS, can you suggest how to do it using local disk as the storage path?
Appending to a file in a normal programming language is not the same as what DataFrame write mode append does. Whenever we ask a DataFrame to save to a folder, it creates a new file for every append. The only way you can achieve a single appended file is to:
Read the old file into dfOld : Dataframe
Combine the old and new DataFrames: dfOld.union(dfNewToAppend)
Combine into a single output file with .coalesce(1)
Write to a new temporary location /tempWrite
Delete the old HDFS location
Rename the /tempWrite folder to your output folder name (a full sketch of this cycle follows the snippet below)
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs._

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Write your unigram Dataframe, then rename its single part file
fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/unigram.csv"))
// Write your bigram Dataframe
fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/bigram.csv"))
// Write your trigram Dataframe
fs.rename(new Path(".../achyuttest.csv/part-00000"), new Path("yourNewHDFSDir/trigram.csv"))
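For completeness, here is a rough sketch of the full read-union-coalesce-rename cycle for one of the three outputs. It reuses the spark and fs values from the snippet above; dfNewToAppend and the paths are hypothetical placeholders, not code from the original answer.
import org.apache.spark.sql.{DataFrame, SaveMode}

// dfNewToAppend holds the newly extracted unigrams for this batch (hypothetical name)
val dfOld: DataFrame = spark.read.option("header", "true").csv("yourNewHDFSDir/unigram.csv")
val combined = dfOld.union(dfNewToAppend).coalesce(1)

// Write to a temporary location first, then swap it into place
combined.write.mode(SaveMode.Overwrite).option("header", "true").csv("/tempWrite")

fs.delete(new Path("yourNewHDFSDir/unigram.csv"), true)   // delete the old HDFS output
fs.rename(new Path("/tempWrite/part-00000..."), new Path("yourNewHDFSDir/unigram.csv"))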

Escape backslash (\) while writing Spark dataframe into CSV

I am using Spark version 2.4.0. I know that backslash is the default escape character in Spark, but I am still facing the issue below.
I am reading a CSV file into a Spark dataframe (using PySpark) and writing the dataframe back to CSV.
I have some "\\" in my source CSV file (as shown below), where the first backslash represents the escape character and the second backslash is the actual value.
Test.csv (Source Data)
Col1,Col2,Col3,Col4
1,"abc//",xyz,Val2
2,"//",abc,Val2
I am reading the Test.csv file and creating a dataframe using the below piece of code:
df = sqlContext.read.format('com.databricks.spark.csv').schema(schema).option("escape", "\\").options(header='true').load("Test.csv")
Then I read the df dataframe and write it back to the Output.csv file using the below code:
df.repartition(1).write.format('csv').option("emptyValue", empty).option("header", "false").option("escape", "\\").option("path", 'D:\TestCode\Output.csv').save(header = 'true')
Output.csv
Col1,Col2,Col3,Col4
1,"abc//",xyz,Val2
2,/,abc,Val2
In the 2nd row of Output.csv, the escape character is getting lost along with the quotes ("").
My requirement is to retain the escape character in Output.csv as well. Any kind of help will be much appreciated.
Thanks in advance.
It looks like you are using the default behavior, .option("escape", "\\"). Change this to:
.option("escape", "'")
It should work.
Let me know if this solves your problem!

Write each row of a spark dataframe as a separate file

I have a Spark DataFrame with a single column, where each row is a long string (actually an XML file).
I want to go through the DataFrame and save the string from each row as a text file; they can be called simply 1.xml, 2.xml, and so on.
I cannot seem to find any information or examples on how to do this.
And I am just starting to work with Spark and PySpark.
Maybe I could map a function over the DataFrame, but the function would have to write a string to a text file, and I can't find how to do that.
When saving a dataframe with Spark, one file will be created for each partition. Hence, one way to get a single row per file would be to first repartition the data to as many partitions as you have rows.
There is a library on GitHub for reading and writing XML files with Spark. However, the dataframe needs to have a special format to produce correct XML. In this case, since you have everything as a string in a single column, the easiest way to save would probably be as CSV.
The repartition and saving can be done as follows:
rows = df.count()
df.repartition(rows).write.csv('save-dir')
I would do it this way in Java with the Hadoop FileSystem API. You can write similar code in Python.
List<String> strings = Arrays.asList("file1", "file2", "file3");
JavaRDD<String> stringrdd = new JavaSparkContext().parallelize(strings);
stringrdd.collect().forEach(x -> {
    try {
        Path outputPath = new Path(x);            // one output file per row
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        OutputStream os = fs.create(outputPath);
        os.write(x.getBytes());                   // write the row's content to its file
        os.close();
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
});

How to parse CSV files with double-quoted strings in Julia?

I want to read CSV files where the columns are separated by commas. The columns can be strings and if those strings contain a comma in their content, they are wrapped in double-quotes. Currently I'm loading my data using:
file = open("data.csv","r")
data = readcsv(file)
But this code would split the following string into 4 pieces, whereas it should only be 3:
1,"text, more text",3,4
Is there a way in Julia's Standard Library to parse CSV while respecting quoting or do I have to write my own custom solution?
The readcsv function in base is super-basic (just blindly splitting on commas).
You will probably be happier with readtable from the DataFrames.jl package: http://juliastats.github.io/DataFrames.jl/io.html
To use the package, you just need to Pkg.add("DataFrames") and then import it with using DataFrames.
The readcsv function in base (0.3 prerelease) can now read quoted columns.
julia> readcsv(IOBuffer("1,\"text, more text\",3,4"))
1x4 Array{Any,2}:
1.0 "text, more text" 3.0 4.0
It is much simpler than DataFrames, but it may be quicker if you just need the data as an array.
