I am using Scala to read from a csv file. The file is formatted to have 3 columns each separated by a \t character. The first 2 columns are unimportant and the third column contains a list of comma separated identifiers stored as as strings. Below is a sample of what the input csv would look like:
0002ba73 US 6o7,6on,6qc,6qj,6nw,6ov,6oj,6oi,15me,6pb,6p9
002f50e4 US 6om,6pb,6p8,15m9,6ok,6ov,6qc,6oo,15me
004b5edc US 6oj,6nz,6on,6om,6qc,6ql,6p6,15me
005cc990 US 6pb,6qf,15me,6og,6nx,6qc,6om,6ok
005fe1ea US 15me,6p0,6ql,6ok,6ox,6ol,6o5,6qj
00777555 US 6pb,15me,6nw,6rk,6qc,6ov,6qj,6o0,6oj,6ok,6on,6p6,6nx,15m9
00cbcc7d US 6oj,6qc,6qg,6pb,6ol,6p6,6ov,15me
010254a6 US 6qc,6pb,6nw,6nx,15me,6o0,6ok,6p8
011b905c US 6oj,6nw,6ov,15me,6qc,6ow,6ql,6on,6qi,6qe
011fffa6 US 15me,6ok,6oj,6p6,6pb,6on,6qc,6ov,6oo,6nw,6oc
I want to read in the csv, get rid of the first two columns, and create a List that contains one instance of each unique identifier code found in the third column, so running the code on the above data should return the result List(6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6p8, 15m9, 6ok, 6oo, 6nz, 6om, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)
I have the following code which returns a List containing every distinct value found anywhere in the csv file:
val in_file = new File("input_file.csv")
val source = scala.io.Source.fromFile(in_file, "utf-8")
val labels = try source.getLines.mkString("\t") finally source.close()
val labelsList: List[String] = labels.split("[,\t]").map(_.trim).toList.distinct
Using the above input, my code returns labelsList with a value of List(0002ba73-e60c-4ffb-9131-c1612b904658, US, 6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 002f50e4-48cc-4b14-bb80-0502068b6161, 6om, 6p8, 15m9, 6ok, 6oo, 004b5edc-c0cc-4ffd-bef3-980bd92b92e6, 6nz, 6ql, 6p6, 005cc990-83dc-4e63-a4b6-58f38241e8fd, 6qf, 6og, 6nx, 005fe1ea-b918-48a3-a495-1f8ac12935ba, 6p0, 6ox, 6ol, 6o5, 00777555-83d4-401e-861b-5892f3aa3e1c, 6rk, 6o0, 00cbcc7d-1b48-4c5c-8141-8fc8f62b7b07, 6qg, 010254a6-2ef0-4a24-aa4d-3cc6656a55de, 011b905c-fbf3-441a-8912-a94cc0fe8a1d, 6ow, 6qi, 6qe, 011fffa6-0b9f-4d88-8ced-ce1cc864984f, 6oc)
How can I get my code to run properly and ignore anything contained within the first 2 columns of the csv?
You can ignore the first two columns and then split the third by the comma.
Finally a toSet will get rid of the duplicate identifiers.
val f = Source.fromFile("input_file.csv")
val lastColumns = f.getLines().map(_.split("\t")(2))
val uniques = lastColumns.flatMap(_.split(",")).toSet
uniques foreach println
Using Scala 2.13 resource management.
util.Using(io.Source.fromFile("input_file.csv")){
_.getLines()
.foldLeft(Array.empty[String]){
_ ++ _.split("\t")(2).split(",")
}.distinct.toList
}
//res0: scala.util.Try[List[String]] =
// Success(List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc))
The .toList can be dropped if an Array result is acceptable.
This is what you can do , Am doing on a sample DF, you can replace with yours
val Df = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
val reqCols = Seq(2)
val finalDf = Df.select(reqCols map Df.columns map col: _*)
finalDf.show
Note : This is 0-based index, so pass 2 to get third column.
If you want distinct values from your desired column.you can use distinct along with mkstring
val Df = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
val reqCols = Seq(2)
val distinctValues = Df.select(reqCols map Df.columns map col: _*).distinct.collect.mkString(",").filterNot("[]".toSet)
println(distinctValues)
Dates are duplicate , above code is removing duplicates.
Another method using regex
val data = scala.io.Source.fromFile("source.txt").getLines()
data.toList.flatMap {
line => """\S+\s+\S+\s+(\S+)""".r.findAllMatchIn(line).map( x => x.group(1).split(",").toList)
}.flatten.distinct
// res0: List[String] = List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)
How to replace substrings of a string. For example, I created a data frame based on the following json format.
line1:{"F":{"P3":"1:0.01","P8":"3:0.03,4:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"2:0.01,3:0.02","P10":"5:0.02", ...},"I":"blah"}
I need to replace the substrings "1:", "2:", "3:", with "a:", "b:", "c:", and etc. So the result will be:
line1:{"F":{"P3":"a:0.01","P8":"c:0.03,d:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"b:0.01,c:0.02","P10":"e:0.02", ...},"I":"blah"}
Please consider that this is just an example the real replacement is substring replacement not character replacement.
Any guidance either in Scala or Pyspark is helpful.
from pyspark.sql.functions import *
newDf = df.withColumn('col_name', regexp_replace('col_name', '1:', 'a:'))
Details here:
Pyspark replace strings in Spark dataframe column
Let's say you have a collection of strings for possible modification (simplified for this example).
val data = Seq("1:0.01"
,"3:0.03,4:0.04"
,"2:0.01,3:0.02"
,"5:0.02")
And you have a dictionary of required conversions.
val num2name = Map("1" -> "A"
,"2" -> "Bo"
,"3" -> "Cy"
,"4" -> "Dee")
From here you can use replaceSomeIn() to make the substitutions.
data.map("(\\d+):".r //note: Map key is only part of the match pattern
.replaceSomeIn(_, m => num2name.get(m group 1) //get replacement
.map(_ + ":"))) //restore ":"
//res0: Seq[String] = List(A:0.01
// ,Cy:0.03,Dee:0.04
// ,Bo:0.01,Cy:0.02
// ,5:0.02)
As you can see, "5:" is a match for the regex pattern but since the 5 part is not defined in num2name, the string is left unchanged.
This is the way I solved it in PySpark:
def _blahblah_name_replacement(val, ordered_mapping):
for key, value in ordered_mapping.items():
val = val.replace(key, value)
return val
mapping = {"1:":"aaa:", "2:":"bbb:", ..., "24:":"xxx:", "25:":"yyy:", ....}
ordered_mapping = OrderedDict(reversed(sorted(mapping.items(), key=lambda t: int(t[0][:-1]))))
replacing = udf(lambda x: _blahblah_name_replacement(x, ordered_mapping))
new_df = df.withColumn("F", replacing(col("F")))
I have a list of csv files each with a bunch of category names as header columns. Each row is a list of users with a boolean value (0, 1) whether they are part of that category or not. Each of the csv files does not have the same set of header categories.
I want to create a composite csv across all the files which has the following output:
Header is a union of all the headers
Each row is a unique user with a boolean value corresponding to the category column
The way I wanted to tackle this is to create a tuple of a user_id and a unique category_id for each cell with a '1'. Then reduce all these columns for each user to get the final output.
How do I create the tuple to begin with? Can I have a global lookup for all the categories?
Example Data:
File 1
user_id,cat1,cat2,cat3
21321,,,1,
21322,1,1,1,
21323,1,,,
File 2
user_id,cat4,cat5
21321,1,,,
21323,,1,,
Output
user_id,cat1,cat2,cat3,cat4,cat5
21321,,1,1,,,
21322,1,1,1,,,
21323,1,1,,,,
Probably the title of the question is misleading in the sense that conveys a certain implementation choice as there's no need for a global lookup in order to solve the problem at hand.
In big data, there's a basic principle guiding most solutions: divide and conquer. In this case, the input CSV files could be divided in tuples of (user,category).
Any number of CSV files containing an arbitrary number of categories can be transformed to this simple format. The resulting CSV results of the union of the previous step, extraction of the total nr of categories present and some data transformation to get it in the desired format.
In code this algorithm would look like this:
import org.apache.spark.SparkContext._
val file1 = """user_id,cat1,cat2,cat3|21321,,,1|21322,1,1,1|21323,1,,""".split("\\|")
val file2 = """user_id,cat4,cat5|21321,1,|21323,,1""".split("\\|")
val csv1 = sparkContext.parallelize(file1)
val csv2 = sparkContext.parallelize(file2)
import org.apache.spark.rdd.RDD
def toTuples(csv:RDD[String]):RDD[(String, String)] = {
val headerLine = csv.first
val header = headerLine.split(",")
val data = csv.filter(_ != headerLine).map(line => line.split(","))
data.flatMap{elem =>
val merged = elem.zip(header)
val id = elem.head
merged.tail.collect{case (v,cat) if v == "1" => (id, cat)}
}
}
val data1 = toTuples(csv1)
val data2 = toTuples(csv2)
val union = data1.union(data2)
val categories = union.map{case (id, cat) => cat}.distinct.collect.sorted //sorted category names
val categoriesByUser = union.groupByKey.mapValues(v=>v.toSet)
val numericCategoriesByUser = categoriesByUser.mapValues{catSet => categories.map(cat=> if (catSet(cat)) "1" else "")}
val asCsv = numericCategoriesByUser.collect.map{case (id, cats)=> id + "," + cats.mkString(",")}
Results in:
21321,,,1,1,
21322,1,1,1,,
21323,1,,,,1
(Generating the header is simple and left as an exercise for the reader)
You dont need to do this as a 2 step process if all you need is the resulting values.
A possible design:
1/ Parse your csv. You dont mention whether your data is on a distributed FS, so i'll assume it is not.
2/ Enter your (K,V) pairs into a mutable parallelized (to take advantage of Spark) map.
pseudo-code:
val directory = ..
mutable.ParHashMap map = new mutable.ParHashMap()
while (files[i] != null)
{
val file = directory.spark.textFile("/myfile...")
val cols = file.map(_.split(","))
map.put(col[0], col[i++])
}
and then you can access your (K/V) tuples by way of an iterator on the map.