Problem Statement:
We need to replace synonyms of words in each row with their equivalent words (from a large synonym list of ~40,000+ key-value pairs) on a large dataset (50,000 rows).
Example:
Input
Allen jeevi pramod Allen Armstrong
sandesh Armstrong jeevi
harsha Nischay DeWALT
Synonym list (key value pair)
//We have 40000 entries
Key | Value
------------------------------------
Allen | Apex Tool Group
Armstrong | Columbus McKinnon
DeWALT | StanleyBlack
The above synonym list has to be applied to the input, and the output should be in the format shown below.
Expected Output
Apex Tool Group jeevi pramod Apex Tool Group Columbus McKinnon
sandesh Columbus McKinnon jeevi
harsha Nischay StanleyBlack
We have tried three approaches; each of them has its own limitations.
Approach 1
Using UDF
public void test () {
List<Row> data = Arrays.asList(
RowFactory.create(0, "Allen jeevi pramod Allen Armstrong"),
RowFactory.create(1, "sandesh Armstrong jeevi"),
RowFactory.create(2, "harsha Nischay DeWALT")
);
StructType schema = new StructType(new StructField[] {
new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
List<Row> data2 = Arrays.asList(
RowFactory.create("Allen", "Apex Tool Group"),
RowFactory.create("Armstrong","Columbus McKinnon"),
RowFactory.create("DeWALT","StanleyBlack")
);
StructType schema2 = new StructType(new StructField[] {
new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
new StructField("sentence2", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>()
{
private static final long serialVersionUID = -5239951370238629896L;
@Override
public Boolean call(String t1, String t2) throws Exception {
return t1.contains(t2);
}
};
spark.udf().register("contains", contains, DataTypes.BooleanType);
UDF3<String, String, String, String> replaceWithTerm = new UDF3<String, String, String, String>() {
private static final long serialVersionUID = -2882956931420910207L;
@Override
public String call(String t1, String t2, String t3) throws Exception {
return t1.replaceAll(t2, t3);
}
};
spark.udf().register("replaceWithTerm", replaceWithTerm, DataTypes.StringType);
Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2, callUDF("contains", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2")))
.withColumn("sentence_replaced", callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2")))
.select(col("sentence_replaced"));
joined.show(false);
}
Input
Allen jeevi pramod Allen Armstrong
sandesh Armstrong jeevi
harsha Nischay DeWALT
Expected Output
Apex Tool Group jeevi pramod Apex Tool Group Columbus McKinnon
sandesh Columbus McKinnon jeevi
harsha Nischay StanleyBlack
Actual Output
Apex Tool Group jeevi pramod Apex Tool Group Armstrong
Allen jeevi pramod Allen Columbus McKinnon
sandesh Columbus McKinnon jeevi
harsha Nischay StanleyBlack
Issue with Approach 1: if a row contains multiple synonym keys, the join produces that many output rows, as shown in the actual output above.
We expect only one row, with all the replacements applied.
Approach 2
Using ImmutableMap with the replace function: here we kept the key-value pairs in a map built via ImmutableMap and called the replace function to perform the replacements,
but if a row contains multiple keys, the entire row is ignored without a single key being replaced.
try {
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
SparkSession spark = SparkSession.builder()
.appName("JavaTokenizerExample").getOrCreate();
HashMap<String, String> options = new HashMap<String, String>();
options.put("header", "true");
Dataset<Row> dataFileContent = sqlContext.load("com.databricks.spark.csv", options);
dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),"[^a-zA-Z0-9\\s+]",""));
dataFileContent = dataFileContent.na().replace("ManufacturerSource", ImmutableMap.<String, String>builder()
.put("Allen", "Apex Tool Group")
.put("Armstrong", "Columbus McKinnon")
.put("DeWALT", "StanleyBlack")
//Here we have 40000 entries
.build()
);
dataFileContent.show(10,false);
} catch (Exception e) {
e.printStackTrace();
}
The sample code is shown above; here are the input and outputs:
Input
Allen jeevi pramod Allen Armstrong
sandesh Armstrong jeevi
harsha Nischay DeWALT
Expected Output
Apex Tool Group jeevi pramod Apex Tool Group Columbus McKinnon
sandesh Columbus McKinnon jeevi
harsha Nischay StanleyBlack
Actual Output
Allen jeevi pramod Allen Armstrong
sandesh Columbus McKinnon jeevi
harsha Nischay StanleyBlack
Approach 3
Using replaceAll within a UDF
public static void main(String[] args) {
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("JoinFunctions").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
SparkSession spark = SparkSession.builder().appName("StringSimiliarityExample").getOrCreate();
Dataset<Row> sourceFileContent = sqlContext.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.load("source100.csv");
sourceFileContent.show(false);
StructType schema = new StructType(new StructField[] {
new StructField("label", DataTypes.IntegerType, false,
Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false,
Metadata.empty()) });
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema); // data as defined in Approach 1
UDF1<String, String> mode = new UDF1<String, String>() {
public String call(final String types) throws Exception {
return types.replaceAll("Allen", "Apex Tool Group")
.replaceAll("Armstrong", "Columbus McKinnon")
.replaceAll("DeWALT", "StanleyBlack");
// ...plus 40,000 more chained replaceAll calls
}
};
sqlContext.udf().register("mode", mode, DataTypes.StringType);
sentenceDataFrame.createOrReplaceTempView("people");
Dataset<Row> newDF = sqlContext.sql("SELECT mode(sentence), label FROM people").withColumnRenamed("UDF(sentence)", "sentence");
newDF.show(false);
}
Output
StackOverflowError.
Here we get a StackOverflowError, because the 40,000 chained replaceAll calls effectively behave like a deeply recursive call.
Kindly let us know if there are any other innovative approaches that could help resolve this issue.
None of these approaches will work, since you always have the issue of substring matches. For example:
ABC -> DE
ABCDE -> ABC
With text "ABCDEF HIJ KLM" what will the output be? It should be the same as the input, but your approach will at best output "DEDEF HIJ KLM" and at worst you will do a double replacement and get "DEF HIJ KLM". Either case is incorrect.
You could improve this by adding boundaries to replacements, perhaps using regex. A better way however would be to first tokenize your input correctly, apply token replacement (which can be exact match), and then un-tokenize back to original format. This may be as simple as splitting by space, but you should give proper though as to what token boundaries may exist. (Stops, hyphens, etc).
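As an illustration of that tokenize-and-replace idea, here is a minimal Java sketch. It is only a sketch under a few assumptions: the 40,000-entry synonym map fits in driver memory and is broadcast to the executors, splitting on whitespace is an acceptable tokenization, and the class name SynonymReplace and UDF name replaceTokens are purely illustrative; the sample sentences and synonym entries are taken from the question.
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class SynonymReplace {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SynonymReplace").master("local[*]").getOrCreate();

        // Build the synonym map once on the driver and broadcast it; 40,000 string pairs are small.
        Map<String, String> synonyms = new HashMap<>();
        synonyms.put("Allen", "Apex Tool Group");
        synonyms.put("Armstrong", "Columbus McKinnon");
        synonyms.put("DeWALT", "StanleyBlack");
        // ... load the remaining entries from your synonym source

        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
        Broadcast<Map<String, String>> bcSynonyms = jsc.broadcast(synonyms);

        // Tokenize on whitespace, replace each token by exact lookup, and re-join with spaces.
        UDF1<String, String> replaceTokens = sentence ->
                Arrays.stream(sentence.split("\\s+"))
                        .map(token -> bcSynonyms.value().getOrDefault(token, token))
                        .collect(Collectors.joining(" "));
        spark.udf().register("replaceTokens", replaceTokens, DataTypes.StringType);

        StructType schema = new StructType()
                .add("label", DataTypes.IntegerType, false)
                .add("sentence", DataTypes.StringType, false);
        Dataset<Row> sentences = spark.createDataFrame(Arrays.asList(
                RowFactory.create(0, "Allen jeevi pramod Allen Armstrong"),
                RowFactory.create(1, "sandesh Armstrong jeevi"),
                RowFactory.create(2, "harsha Nischay DeWALT")), schema);

        sentences.withColumn("sentence_replaced", callUDF("replaceTokens", col("sentence")))
                .select(col("sentence_replaced"))
                .show(false);

        spark.stop();
    }
}
Because each token is looked up with an exact match against the broadcast map, multiple keys in one sentence are all replaced in a single pass, no extra rows are produced by a join, and substring collisions like the ABC/ABCDE example above cannot occur.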
Related
public static void main(String[] args) {
SparkSession sessn = SparkSession.builder().appName("RDD2DF").master("local").getOrCreate();
List<Integer> lst = Arrays.asList(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
Dataset<Integer> DF = sessn.createDataset(lst, Encoders.INT());
System.out.println(DF.javaRDD().getNumPartitions());
JavaRDD<Integer> mappartRdd = DF.repartition(3).javaRDD().mapPartitions(it-> Arrays.asList(JavaConversions.asScalaIterator(it).length()).iterator());
}
From the above code, I am unable to convert the JavaRDD (mappartRdd) to a DataFrame in Java Spark.
I am using the following to convert the JavaRDD to a DataFrame/Dataset:
sessn.createDataFrame(mappartRdd, beanClass);
I tried multiple options and different overloaded versions of createDataFrame, but I am facing issues converting it to a DF. What is the bean class I need to provide for the code to work?
Unlike Scala, there is no function like toDF() to convert an RDD to a DataFrame in Java. Can someone assist in converting it as per my requirement?
Note: I am able to create a Dataset directly by modifying the above code as below.
Dataset<Integer> mappartDS = DF.repartition(3).mapPartitions(it-> Arrays.asList(JavaConversions.asScalaIterator(it).length()).iterator(), Encoders.INT());
But I want to know why my JavaRDD is not getting converted to a DF/DS if I use createDataFrame. Any help will be greatly appreciated.
This seems to be a follow-up to this SO question.
I think you are at the learning stage of Spark. I would suggest getting familiar with the APIs provided for Java: https://spark.apache.org/docs/latest/api/java/index.html
Regarding your question, if you check the createDataFrame API, it is as follows:
def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType): DataFrame = {
...
}
As you can see, it takes a JavaRDD[Row] and the related StructType schema as arguments. Hence, to create a DataFrame (which is the same as a Dataset<Row>), use the snippet below:
JavaRDD<Integer> mappartRdd = DF.repartition(3).javaRDD().mapPartitions(it-> Arrays.asList(JavaConversions.asScalaIterator(it).length()).iterator());
StructType schema = new StructType()
.add(new StructField("value", DataTypes.IntegerType, true, Metadata.empty()));
Dataset<Row> df = sessn.createDataFrame(mappartRdd.map(RowFactory::create), schema);
df.show(false);
df.printSchema();
/**
* +-----+
* |value|
* +-----+
* |6 |
* |8 |
* |6 |
* +-----+
*
* root
* |-- value: integer (nullable = true)
*/
We receive two files (a data file and a metadata file) from vendors for data ingestion.
Vendor 1
data file format
user_id has_insurance postal_code city
101 Y 20001 Newyork
102 N 40001 Boston
metadata file format
user_id,String
has_insurance,Boolean
postal_code,String
city, String
We will receive the same data fields from another vendor, but the field order in the data file might be different, as shown below.
Vendor 2
data file format
user_id postal_code city has_insurance
101 20001 Newyork Y
102 40001 Boston N
metadata file format
user_id,String
postal_code,String
city, String
has_insurance,Boolean
The metadata file will contain the field order. Is it possible to assign the schema dynamically based on the metadata file while reading the CSV file?
//function to derive spark datatype for the given field data type
def strToDataType(str: String): DataType = {
  if (str == "String") StringType
  else if (str == "Boolean") BooleanType
  else StringType
}
val metadataDf = spark.sparkContext.textFile("metadata_folder")
val headerSchema = StructType(metadataDf.map(_.split(",")).map(x => StructField(x(0),strToDataType(x(1)),true)))
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.schema(headerSchema) // defining based on the custom schema
.load("data_file.csv")
val headerSchema =
StructType(metadataDf.map(_.split(",")).map(x => StructField(x(0), strToDataType(x(1)), true)))
When I tried to create the schema dynamically using the above command, I got the error below. Could you please advise?
<console>:34: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.sql.types.StructField])
You can't use the CSV data source in this case, but you can parse the files yourself without too much difficulty. The basic steps are: read the metadata file as one text blob, then read the data file, separating data lines from header lines, and apply the schema.
I made some assumptions about your separators. One thing to remember is that you shouldn't assume any ordering of the lines in your DataFrame once you load it, because you don't know which data is being loaded on which worker. This code tries to identify data vs. metadata lines based on their content, so you may need to adjust those rules.
def loadVendor(dataPath: String, metaPath: String): DataFrame = {
val df = spark.read.text(dataPath)
// First read the metadata file, wholeTextFiles lets us get it all
// as a single string we so we can parse locally
val metaText = sc.wholeTextFiles(metaPath).first._2
val metaLines = metaText.
split("\n").
map(_.split(","))
// Identify header as line that has all the field names
val fields = Seq("user_id", "has_insurance", "postal_code", "city")
val headerDf = fields.foldLeft(df)((df2, fname) => df2.filter($"value".contains(fname)))
val headerLine = headerDf.first.getString(0)
val header = headerLine.split(" ")
// Identify data rows as ones that aren't special lines or the header
var data = df.
filter($"value" =!= headerLine).
filter(!$"value".startsWith("Vendor 1")).
filter(!$"value".startsWith("data file format"))
// Split the data fields on separator and assign column names. Assumed any whitespace is your separator
val rows = data.select(split($"value", raw"\W+") as "fields")
val named = header.zipWithIndex.map( { case (f, idx) => $"fields".getItem(idx).alias(f)} )
val table = rows.select(named:_*)
// Cast to the right types
val castCols = metaLines.map { case Array(cname, ctype) => col(cname).cast(ctype) }
val typed = table.select(castCols:_*)
// Return the columns sorted by name, so union'ing multiple DF's will line up
typed.select(table.columns.sorted.map(col):_*)
}
Here is the data printed
scala> df.show
+-------+-------------+-----------+-------+
| city|has_insurance|postal_code|user_id|
+-------+-------------+-----------+-------+
|Newyork| true| 20001| 101|
| Boston| false| 40001| 102|
+-------+-------------+-----------+-------+
And here is the schema printed
scala> df.printSchema
root
|-- city: string (nullable = true)
|-- has_insurance: boolean (nullable = true)
|-- postal_code: string (nullable = true)
|-- user_id: string (nullable = true)
I am new to Spark ML. Spark ML has a MinHash implementation for Jaccard distance; please see the docs: https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance. In the sample code, the input data for comparison are vectors, and I have no question about that sample code. But when I use text docs as input and then convert them to vectors via Word2Vec, I get 0 Jaccard distance. I don't know what's wrong in my code, or what I have not understood. Thanks in advance for any help.
SparkSession spark = SparkSession.builder().appName("TestMinHashLSH").config("spark.master", "local").getOrCreate();
List<Row> data1 = Arrays.asList(RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" "))));
List<Row> data2 = Arrays.asList(RowFactory.create(Arrays.asList("Hi I heard about Scala".split(" "))),
RowFactory.create(Arrays.asList("I wish python could also use case classes".split(" "))));
StructType schema4word = new StructType(new StructField[] {
new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty()) });
Dataset<Row> documentDF1 = spark.createDataFrame(data1, schema4word);
// Learn a mapping from words to Vectors.
Word2Vec word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(30).setMinCount(0);
Word2VecModel w2vModel1 = word2Vec.fit(documentDF1);
Dataset<Row> result1 = w2vModel1.transform(documentDF1);
List<Row> myDataList1 = new ArrayList<>();
int id = 0;
for (Row row : result1.collectAsList()) {
List<String> text = row.getList(0);
Vector vector = (Vector) row.get(1);
myDataList1.add(RowFactory.create(id++, vector));
}
StructType schema1 = new StructType(
new StructField[] { new StructField("id", DataTypes.IntegerType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()) });
Dataset<Row> df1 = spark.createDataFrame(myDataList1, schema1);
Dataset<Row> documentDF2 = spark.createDataFrame(data2, schema4word);
Word2VecModel w2vModel2 = word2Vec.fit(documentDF2);
Dataset<Row> result2 = w2vModel2.transform(documentDF2);
List<Row> myDataList2 = new ArrayList<>();
id = 10;
for (Row row : result2.collectAsList()) {
List<String> text = row.getList(0);
Vector vector = (Vector) row.get(1);
System.out.println("Text: " + text + " => \nVector: " + vector + "\n");
myDataList2.add(RowFactory.create(id++, vector));
}
Dataset<Row> df2 = spark.createDataFrame(myDataList2, schema1);
MinHashLSH mh = new MinHashLSH().setNumHashTables(5).setInputCol("features").setOutputCol("hashes");
MinHashLSHModel model = mh.fit(df1);
// Feature Transformation
System.out.println("The hashed dataset where hashed values are stored in the column 'hashes':");
model.transform(df1).show();
// Compute the locality sensitive hashes for the input rows, then perform
// approximate
// similarity join.
// We could avoid computing hashes by passing in the already-transformed
// dataset, e.g.
// `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
System.out.println("Approximately joining df1 and df2 on Jaccard distance smaller than 0.6:");
model.approxSimilarityJoin(df1, df2, 1.6, "JaccardDistance")
.select(col("datasetA.id").alias("id1"), col("datasetB.id").alias("id2"), col("JaccardDistance"))
.show();
spark.stop();
From Word2Vec, I got different vectors for different docs. I would expect some non-zero values for JaccardDistance when comparing two different docs, but instead I got all 0s. The following shows what I got when I ran the program:
Text: [Hi, I, heard, about, Scala] =>
Vector: [0.005808539432473481,-0.001387741044163704,0.007890049391426146,... ,04969391227]
Text: [I, wish, python, could, also, use, case, classes] =>
Vector: [-0.0022146602132124826,0.0032128597667906433,-0.00658524181926623,...,-3.716901264851913E-4]
Approximately joining df1 and df2 on Jaccard distance smaller than 0.6:
+---+---+---------------+
|id1|id2|JaccardDistance|
+---+---+---------------+
| 1| 11| 0.0|
| 0| 10| 0.0|
| 2| 11| 0.0|
| 0| 11| 0.0|
| 1| 10| 0.0|
| 2| 10| 0.0|
+---+---+---------------+
Jaccard similarity, as per the definition and the Spark implementation, is between two sets.
As per the Spark documentation:
Jaccard distance of two sets is defined by the cardinality of their
intersection and union:
d(A,B)=1−|A∩B|/|A∪B|
Therefore, when you apply Word2Vec to a document, it converts it into a vector-space embedding capturing the semantics of the text. Also, each element of the vectors in your example is a small non-zero real value rather than a binary indicator, which is an issue for MinHash with Jaccard distance. If you still want to pursue Word2Vec, go for cosine distance instead.
The correct preprocessing step for Jaccard distance would be something like CountVectorizer, or you could hash the tokens themselves and use a VectorAssembler.
MinHash expects binary vectors: non-zero values are treated as binary "1" values, so a dense Word2Vec vector makes every document look like the same all-ones set, which is why every distance comes out as 0.
For a working example, please refer this example provided by Uber: https://eng.uber.com/lsh/
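To make the CountVectorizer route concrete, here is a hedged Java sketch reusing the sentences, the id/features/hashes column names and the 0.6 threshold from the question. Fitting a single CountVectorizer on the union of both datasets is an assumption (so both sides share one token-to-index vocabulary), and setBinary(true) is used so the resulting vectors represent token sets rather than counts; class and variable names are illustrative.
import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;
import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;

public class MinHashOnTokens {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("MinHashOnTokens").config("spark.master", "local").getOrCreate();

        StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())});

        List<Row> data1 = Arrays.asList(
                RowFactory.create(0, Arrays.asList("Hi I heard about Spark".split(" "))),
                RowFactory.create(1, Arrays.asList("I wish Java could use case classes".split(" "))),
                RowFactory.create(2, Arrays.asList("Logistic regression models are neat".split(" "))));
        List<Row> data2 = Arrays.asList(
                RowFactory.create(10, Arrays.asList("Hi I heard about Scala".split(" "))),
                RowFactory.create(11, Arrays.asList("I wish python could also use case classes".split(" "))));

        Dataset<Row> df1 = spark.createDataFrame(data1, schema);
        Dataset<Row> df2 = spark.createDataFrame(data2, schema);

        // One shared vocabulary for both sides; binary counts turn each document into a token set.
        CountVectorizerModel cv = new CountVectorizer()
                .setInputCol("text").setOutputCol("features").setBinary(true)
                .fit(df1.union(df2));
        Dataset<Row> feat1 = cv.transform(df1);
        Dataset<Row> feat2 = cv.transform(df2);

        MinHashLSHModel model = new MinHashLSH()
                .setNumHashTables(5).setInputCol("features").setOutputCol("hashes")
                .fit(feat1);

        System.out.println("Approximately joining df1 and df2 on Jaccard distance smaller than 0.6:");
        model.approxSimilarityJoin(feat1, feat2, 0.6, "JaccardDistance")
                .select(col("datasetA.id").alias("id1"),
                        col("datasetB.id").alias("id2"),
                        col("JaccardDistance"))
                .show(false);

        spark.stop();
    }
}
With real token-set vectors like these, the "Hi I heard about ..." pair and the "I wish ... case classes" pair should come back with non-zero Jaccard distances instead of the all-zero results shown above.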
To make it simple, let's assume we have a dataframe containing the following data:
+----------+---------+----------+----------+
|firstName |lastName |Phone |Address |
+----------+---------+----------+----------+
|firstName1|lastName1|info1 |info2 |
|firstName1|lastName1|myInfo1 |dummyInfo2|
|firstName1|lastName1|dummyInfo1|myInfo2 |
+----------+---------+----------+----------+
How can I merge all rows, grouping by (firstName, lastName), and keep in the Phone and Address columns only the data starting with "my", to get the following:
+----------+---------+----------+----------+
|firstName |lastName |Phone |Address |
+----------+---------+----------+----------+
|firstName1|lastName1|myInfo1 |myInfo2 |
+----------+---------+----------+----------+
Maybe I should use the agg function with a custom UDAF? But how can I implement it?
Note: I'm using Spark 2.2 along with Scala 2.11.
You can use groupBy with the collect_set aggregation function and a udf function to pick the first string that starts with "my":
import org.apache.spark.sql.functions._
def myudf = udf((array: Seq[String]) => array.filter(_.startsWith("my")).head)
df.groupBy("firstName ", "lastName")
.agg(myudf(collect_set("Phone")).as("Phone"), myudf(collect_set("Address")).as("Address"))
.show(false)
which should give you
+----------+---------+-------+-------+
|firstName |lastName |Phone |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
I hope the answer is helpful
If only two columns are involved, filtering and a join can be used instead of a UDF:
val df = List(
("firstName1", "lastName1", "info1", "info2"),
("firstName1", "lastName1", "myInfo1", "dummyInfo2"),
("firstName1", "lastName1", "dummyInfo1", "myInfo2")
).toDF("firstName", "lastName", "Phone", "Address")
val myPhonesDF = df.filter($"Phone".startsWith("my"))
val myAddressDF = df.filter($"Address".startsWith("my"))
val result = myPhonesDF.alias("Phones").join(myAddressDF.alias("Addresses"), Seq("firstName", "lastName"))
.select("firstName", "lastName", "Phones.Phone", "Addresses.Address")
result.show(false)
Output:
+----------+---------+-------+-------+
|firstName |lastName |Phone |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
For many columns, when only one row per group is expected, a construction like this can be used:
val columnsForSearch = List("Phone", "Address")
val minExpressions = columnsForSearch.map(c => min(when(col(c).startsWith("my"), col(c)).otherwise(null)).alias(c))
df.groupBy("firstName", "lastName").agg(minExpressions.head, minExpressions.tail: _*)
Output is the same.
An example of a UDF with two parameters:
val twoParamFunc = (firstName: String, Phone: String) => firstName + ": " + Phone
val twoParamUDF = udf(twoParamFunc)
df.select(twoParamUDF($"firstName", $"Phone")).show(false)
In Spark, we create a case class to specify the schema, then create an RDD from a file and convert it to a DF, e.g.
case class Example(name: String, age: Long)
val exampleDF = spark.sparkContext
.textFile("example.txt")
.map(_.split(","))
.map(attributes => Example(attributes(0), attributes(1).toInt))
.toDF()
The question is: if the content of the txt file is something like "ABCDE12345FGHIGK67890", without any symbols or spaces, how do we extract a specified length of string for each schema field, e.g. extract 'BCD' for name and '23' for age? Is it possible to do this with map and split?
Thanks!
You can use substring to pull the data from specific indexes, as below:
case class Example (name : String, age: Int)
val example = spark.sparkContext
.textFile("test.txt")
.map(line => Example(line.substring(1, 4), line.substring(6,8).toInt)).toDF()
example.show()
Output:
+----+---+
|name|age|
+----+---+
| BCD| 23|
| BCD| 23|
| BCD| 23|
+----+---+
I hope this helps!
In the map function where you are splitting by commas, just put a function that converts the input string to a list of values in the required order.