Reading Avro file in Spark and extracting column values - apache-spark

I want to read an Avro file using Spark (I am using Spark 1.3.0, so I don't have DataFrames).
I read the Avro file using this piece of code:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext
private def readAvro(sparkContext: SparkContext, path: String) = {
  sparkContext.newAPIHadoopFile[
    AvroKey[GenericRecord],
    NullWritable,
    AvroKeyInputFormat[GenericRecord]
  ](path)
}
I execute this and get an RDD. Now, from the RDD, how do I extract the value of specific columns? For example, how do I loop through all records and get the value for a given column name?
[edit] As suggested by Justin below, I tried:
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](input)
rdd.map(record=> record._1.get("accountId")).toArray().foreach(println)
but I get an error
<console>:34: error: value get is not a member of org.apache.avro.mapred.AvroKey[org.apache.avro.generic.GenericRecord]
rdd.map(record=> record._1.get("accountId")).toArray().foreach(println)

AvroKey has a datum method to extract the wrapped value, and GenericRecord has a get method that accepts the column name as a string. So you can extract the columns using a map:
rdd.map(record => record._1.datum.get("COLNAME"))
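For example, applied to the RDD from the question (a sketch, assuming the records actually contain an accountId field; collect plays the role of the toArray call in the question):
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](input)
// unwrap the AvroKey with datum, then read the field by name
rdd.map(record => record._1.datum.get("accountId")).collect().foreach(println)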

Related

Spark RDD: after splitting the data, the data type is changed. How can I split without changing the data type?

I have loaded data from a text file into a Spark RDD; after splitting, the data type is changed. How can I split without changing the data type, or how can I convert the split data back to its original data types?
My code:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("Movie")
sc = SparkContext(conf = conf)
movies = sc.textFile("file:///SaprkCourse/movie/movies.txt")
data=movies.map(lambda x: x.split(","))
data.collect()
My input looks like this:
userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
After splitting, my complete data is changed to String type.
I need the output to keep the same data types as in the input text file, i.e. IntegerType, IntegerType, IntegerType, IntegerType.
Spark, when reading a text file, assigns StringType to all columns, so if you want to treat your columns as IntegerType you need to cast them.
It seems that your data is CSV, so you should use a SparkSession, read the data with the csv reader, and define your schema.
Scala code:
import org.apache.spark.sql.types.{StructType, IntegerType, TimestampType}

val schema = new StructType()
  .add("userId", IntegerType)
  .add("movieId", IntegerType)
  .add("rating", IntegerType)
  .add("timestamp", TimestampType)

spark.read.schema(schema).csv("file:///SaprkCourse/movie/movies.txt")
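A hedged usage sketch of the same read (the header option and the comments are my additions based on the sample data in the question, not part of the original answer):

val moviesDf = spark.read
  .option("header", "true") // the sample data starts with a userId,movieId,rating,timestamp header row
  .schema(schema)
  .csv("file:///SaprkCourse/movie/movies.txt")
moviesDf.printSchema() // note: rating values such as 4.0 will likely come out null as IntegerType; DoubleType may fit the data better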
If you want to keep reading the file as text, you can cast every column.
Scala:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, TimestampType}

val df = data
  .select(
    col("userId").cast(IntegerType),
    col("movieId").cast(IntegerType),
    col("rating").cast(IntegerType),
    col("timestamp").cast(TimestampType)
  )
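Note that data above has to be a DataFrame of string columns, while data in the question is an RDD of arrays. A minimal sketch of one way to bridge that gap (the header filter and the column names are assumptions, not from the original answer):

import spark.implicits._

val data = spark.sparkContext
  .textFile("file:///SaprkCourse/movie/movies.txt")
  .filter(!_.startsWith("userId")) // drop the header row (assumption: a single header line)
  .map(_.split(","))
  .map(a => (a(0), a(1), a(2), a(3)))
  .toDF("userId", "movieId", "rating", "timestamp")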

remove all the special characters from a csv file using spark

How do I remove all the special characters from a CSV file loaded into a Spark DataFrame, using Java Spark?
For example, below is the CSV file content, with spaces and special characters:
"UNITED STATES CELLULAR CORP. - OKLAHOMA",WIRELESS,"US Cellular"
Output I need:
UNITEDSTATESCELLULARCORPOKLAHOMA|WIRELESS|US Cellular (in lower case)
Thanks in advance.
You should use the String.replaceAll method (with a regex) to replace every character that is not alphanumeric with an empty string.
Use this as a UDF and apply it to all columns in the DataFrame.
The Java code should look like:
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.udf;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import java.util.Arrays;

// UDF that strips every non-alphanumeric character from a string value
UserDefinedFunction cleanUDF = udf(
    (String strVal) -> strVal.replaceAll("[^a-zA-Z0-9]", ""), DataTypes.StringType
);

// apply the UDF to every column, keeping the original column names
Column[] newColsLst = Arrays.stream(df.columns())
    .map(c -> cleanUDF.apply(new Column(c)).alias(c))
    .toArray(Column[]::new);

Dataset<Row> new_df = df.select(newColsLst);
Reference: How do I call a UDF on a Spark DataFrame using JAVA?

How to convert .CSV file to .Json file using Pyspark?

I am having a problem converting a .csv file to a multiline JSON file using PySpark.
I have a CSV file read via a Spark RDD, and I need to convert it to multiline JSON using PySpark.
Here is my code:
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonconversion").getOrCreate()
df = spark.read.format("csv").option("header", "True").load(csv_file)
df.show()
df_json = df.toJSON()

for row in df_json.collect():
    line = json.loads(row)
    result = []
    for key, value in list(line.items()):
        if key == 'FieldName':
            FieldName = line['FieldName']
            del line['FieldName']
            result.append({FieldName: line})
    res = result
    with open("D:/tasklist/jsaonoutput.json", 'a+') as f:
        f.write(json.dumps(res, indent=4, separators=(',', ':')))
I need the output in the below format:
{
    "Name": {
        "DataType": "String",
        "Length": 4,
        "Required": "Y",
        "Output": "Y",
        "Address": "N",
        "Phone Number": "N",
        "DoorNumber": "N/A",
        "Street": "N",
        "Locality": "N/A",
        "State": "N/A"
    }
}
My Input CSV file Looks like this:
I am new to PySpark. Any leads on modifying this code into working code would be much appreciated.
Thank you in advance.
Try the following code. It first creates a pandas DataFrame from the Spark DataFrame (unless you are doing something else with the Spark DataFrame, you can load the CSV file directly into pandas). From the pandas DataFrame, it creates groups based on the FieldName column and then writes them to a file, where json.dumps takes care of the formatting.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jsonconversion").getOrCreate()
df = spark.read.format("csv").option("header", "True").load(csv_file)
df.show()

df_pandas_grped = df.toPandas().groupby('FieldName')
final_dict = {}
for key, grp in df_pandas_grped:
    final_dict[str(key)] = grp.to_dict('records')

with open("D:/tasklist/jsaonoutput.json", 'w') as f:
    f.write(json.dumps(final_dict, indent=4))

Apply a function to a single column of a csv in Spark

Using Spark, I'm reading a CSV and want to apply a function to a column of the CSV. I have some code that works, but it's very hacky. What is the proper way to do this?
My code:
SparkContext().addPyFile("myfile.py")
spark = SparkSession\
    .builder\
    .appName("myApp")\
    .getOrCreate()

from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True,
                    mode="DROPMALFORMED")
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2], message=myFunction(line[3]))).toDF()
I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].
I'm using Spark version 2.0.1
You can simply use a user-defined function (udf) combined with withColumn:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
udf_myFunction = udf(myFunction, IntegerType())  # if the function returns an int
df = df.withColumn("message", udf_myFunction("_3"))  # "_3" is the name of the column you want to transform
This will add a new column to the dataframe df containing the result of myFunction(line[3]).

Spark Avro write RDD to multiple directories by key

I need to split an RDD by first letters (A-Z) and write the files into directories respectively.
The simple solution is to filter the RDD for each letter, but this requires 26 passes.
There is a response to a similar question for writing to text files here, but I cannot figure out how to do this for Avro files.
Has anyone been able to do this?
You can use MultipleOutputFormat to do this.
It is a two-step task:
First, you need a multiple output format for Avro. Below is the code for that:
package avro

import org.apache.hadoop.mapred.lib.MultipleOutputFormat
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.util.Progressable
import org.apache.avro.mapred.AvroOutputFormat
import org.apache.avro.mapred.AvroWrapper
import org.apache.hadoop.io.NullWritable
import org.apache.spark.rdd.RDD
import org.apache.hadoop.mapred.RecordWriter

class MultipleAvroFileOutputFormat[K] extends MultipleOutputFormat[AvroWrapper[K], NullWritable] {

  val outputFormat = new AvroOutputFormat[K]

  // Route each record to a subdirectory named after the first letter of its datum.
  override def generateFileNameForKeyValue(key: AvroWrapper[K], value: NullWritable, name: String) = {
    val letter = key.datum().asInstanceOf[String].substring(0, 1)
    letter + "/" + letter
  }

  override def getBaseRecordWriter(fs: FileSystem,
                                   job: JobConf,
                                   name: String,
                                   arg3: Progressable) = {
    outputFormat.getRecordWriter(fs, job, name, arg3).asInstanceOf[RecordWriter[AvroWrapper[K], NullWritable]]
  }
}
In your driver code you have to specify that you want to use the above output format. You also need to specify the output schema for the Avro data. Below is sample driver code which stores an RDD of strings in Avro format with the schema {"type":"string"}:
package avro

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.JobConf
import org.apache.avro.mapred.AvroJob
import org.apache.avro.mapred.AvroWrapper

object AvroDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf
    conf.setAppName(args(0))
    conf.setMaster("local[2]")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.registerKryoClasses(Array(classOf[AvroWrapper[String]]))

    val sc = new SparkContext(conf)
    val input = sc.parallelize(Seq("one", "two", "three", "four"), 1)
    val pairRDD = input.map(x => (new AvroWrapper(x), null))

    val job = new JobConf(sc.hadoopConfiguration)
    val schema = "{\"type\":\"string\"}"
    job.set(AvroJob.OUTPUT_SCHEMA, schema) // set schema for avro output

    pairRDD
      .partitionBy(new HashPartitioner(26))
      .saveAsHadoopFile(args(1), classOf[AvroWrapper[String]], classOf[NullWritable], classOf[MultipleAvroFileOutputFormat[String]], job, None)

    sc.stop()
  }
}
I hope you get a better answer than mine...
I've been in a similar situation myself, except with "ORC" instead of Avro. I basically threw up my hands and ended up calling the ORC file classes directly to write the files myself.
In your case, my approach would entail partitioning the data via "partitionBy" into 26 partitions, one for each first letter A-Z. Then call "mapPartitionsWithIndex", passing a function that outputs the i-th partition to an Avro file at the appropriate path. Finally, to convince Spark to actually do something, have mapPartitionsWithIndex return, say, a List containing the single boolean value "true"; and then call "count" on the RDD returned by mapPartitionsWithIndex to get Spark to start the show.
I found an example of writing an Avro file here: http://www.myhadoopexamples.com/2015/06/19/merging-small-files-into-avro-file-2/
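For reference, here is a minimal Scala sketch of that approach. It assumes the records are plain strings covered by the {"type":"string"} schema from the first answer, that every record starts with a letter A-Z, and that the executors can reach the output filesystem; it illustrates the idea rather than reproducing the code I actually ran:

import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.GenericDatumWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// One partition per first letter, so partition index i corresponds to letter ('A' + i).
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 26
  override def getPartition(key: Any): Int = key.toString.toUpperCase.charAt(0) - 'A'
}

def writeByFirstLetter(records: RDD[String], outputDir: String): Unit = {
  val schemaJson = "{\"type\":\"string\"}"
  records
    .keyBy(_.substring(0, 1))
    .partitionBy(new FirstLetterPartitioner)
    .values
    .mapPartitionsWithIndex { (idx, it) =>
      val letter = ('A' + idx).toChar
      val schema = new Schema.Parser().parse(schemaJson)
      val fs = FileSystem.get(new Configuration())
      val out = fs.create(new Path(s"$outputDir/$letter/part-$letter.avro"))
      val writer = new DataFileWriter[String](new GenericDatumWriter[String](schema))
      writer.create(schema, out)
      it.foreach(r => writer.append(r))
      writer.close()
      Iterator(true) // return something so count() below forces the write
    }
    .count() // trigger the job
}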
