I want to write a Dataset to a CSV file, but I don't want the columns to be sorted in ascending order (or any order, for that matter).
For example, the model is: String id; String name; String age; plus 300 more fields.
The CSV that gets written has the schema: age, name, id, plus 300 more columns, in alphabetical order,
but I want the CSV columns in the same order as the model.
I could use .select() or .selectExpr(), but then I would have to list all 300+ fields.
Is there an easier way?
Currently using:
dataset.toDF().coalesce(1)
    .selectExpr("templateId", "batchId" /* , +300 more fields */)
    .write()
    .format("com.databricks.spark.csv")
    .option("nullValue", "")
    .mode(SaveMode.Overwrite)
    .save(path);
A workaround I followed for the above question:
Added the fields to a properties file (column.properties) under a single key, with the fields comma-separated (see the example just below).
Loaded that properties file into a broadcast map.
Used the broadcast map in the .selectExpr() call.
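For reference, the properties file holds the whole column list under one key, something like this (the key name csv.columns and the short field list are illustrative placeholders; the real file lists all 300+ model fields in declaration order):
# column.properties (illustrative)
csv.columns=templateId,batchId,id,name,age,...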
Code for loading the properties file into the map that gets broadcast:
public static Map<String, String> getColumnMap() {
    Map<String, String> colMap = new HashMap<>();
    String propFileName = "column.properties";
    InputStream inputStream =
            ConfigurationLoader.class.getClassLoader().getResourceAsStream(propFileName);
    if (inputStream != null) {
        Properties prop = new Properties();
        try {
            prop.load(inputStream);
            // copy the loaded properties into a String-to-String map
            for (String key : prop.stringPropertyNames()) {
                colMap.put(key, prop.getProperty(key));
            }
        } catch (IOException e) {
            // handle exception
        }
    }
    return colMap;
}
JavaSparkContext sc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Broadcast<Map<String, String>> broadcastColumn = sc.broadcast(getColumnMap());
Code for writing to CSV file:
dataset.toDF().coalesce(1)
    .selectExpr(broadcastColumn.getValue().get(TemplateConstants.COLUMN).split(","))
    .write()
    .format(ApplicationConstants.CSV_FORMAT)
    .option(ApplicationConstants.NULL_VALUE, "")
    .mode(SaveMode.Overwrite)
    .save(path);
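Another option worth considering, sketched below under the assumption that the 300+ fields are plain declared fields on the model class (called Model here as a placeholder): derive the column list from the class itself via reflection, so no properties file has to be maintained. Note that getDeclaredFields() order is not guaranteed by the Java language spec, although common JVMs do return fields in declaration order, so treat this as a convenience rather than a contract.
import java.lang.reflect.Field;
import java.util.Arrays;

// Sketch: build the selectExpr column list from the model's declared fields.
String[] columnsInModelOrder = Arrays.stream(Model.class.getDeclaredFields())
        .map(Field::getName)
        .toArray(String[]::new);

dataset.toDF().coalesce(1)
        .selectExpr(columnsInModelOrder)
        .write()
        .format(ApplicationConstants.CSV_FORMAT)
        .option(ApplicationConstants.NULL_VALUE, "")
        .mode(SaveMode.Overwrite)
        .save(path);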
I want to read a timestamp column from a Spark Dataset and cast it to a String in an appropriate format. The code looks like this:
import static org.apache.spark.sql.functions.*;
...
String result;
for (Row groupedRow : datasetGrouped.collectAsList()) {
    for (StructField sf : groupedRow.schema().fields()) {
        result = getDatasetFromRow(groupedRow)
                .withColumn("fieldName", functions.date_format(col("fieldToGet"), "dd.MM.yyyy"))
                .collectAsList().stream().findFirst().get().getAs("fieldName");
    }
}
...
private static Dataset<Row> getDatasetFromRow(Row row) {
    List<Row> rowListToGetDataset = new ArrayList<>();
    List<String> strListToGetDataset = new ArrayList<>();
    for (StructField sf : row.schema().fields()) {
        strListToGetDataset.add(row.getAs(sf.name()));
    }
    rowListToGetDataset.add(RowFactory.create(strListToGetDataset.toArray()));
    return SparkService.sqlContext().createDataFrame(rowListToGetDataset, row.schema());
}
This is ugly, and I'm looking for a solution that doesn't create an additional Dataset just to get the timestamp field cast to a String in the format I need.
The app uses the Java Spark API, so suggestions in Java please.
Spark ver: 2.3.1
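A sketch of the kind of approach this seems to call for (not from the original post; the datasetGrouped, fieldToGet and fieldName names are reused from the snippet above): apply date_format once to the whole Dataset, then read the formatted value straight out of each collected Row instead of rebuilding a one-row Dataset per Row.
// Sketch only: format the timestamp column once on the full Dataset.
Dataset<Row> withFormatted = datasetGrouped
        .withColumn("fieldName", functions.date_format(col("fieldToGet"), "dd.MM.yyyy"));

for (Row row : withFormatted.collectAsList()) {
    String result = row.getAs("fieldName"); // already a String in dd.MM.yyyy format
    // use result ...
}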
We are currently exploring Apache Spark (with Hadoop) for performing large scale
data transformation (in Java).
We are using the new looking (and experimental) DataSourceV2 interfaces to build our custom
output data files. A component of this is an implementation of the org.apache.spark.sql.sources.v2.writer.DataWriter
interface. It all works beautifully, except for one problem:
The org.apache.spark.sql.sources.v2.writer.DataWriter.write(record) method is often (but not always)
called twice for the same input record.
Here is what I hope is enough code for you to get the gist of what we're doing:
Basically we have many large sets of input data that we land via a Spark application
into Hadoop tables using code that looks something like:
final Dataset<Row> jdbcTableDataset = sparkSession.read()
.format("jdbc")
.option("url", sqlServerUrl)
.option("dbtable", tableName)
.option("user", jdbcUser)
.option("password", jdbcPassword)
.load();
final DataFrameWriter<Row> dataFrameWriter = jdbcTableDataset.write();
dataFrameWriter.save(hdfsDestination + "/" + tableName);
There are roughly fifty of these tables, for what it is worth. I know that there are no duplicates
in the data because jdbcTableDataset.count() and jdbcTableDataset.distinct().count()
return the same value.
The transformation process involves performing join operations on these tables and writing
the result to files in the (shared) file system in a custom format. The resulting rows contain a unique key,
a dataGroup column, a dataSubGroup column and about 40 other columns. The selected records are
ordered by dataGroup, dataSubGroup and key.
Each output file is distinguished by the dataGroup column, which is used to partition the write operation:
final Dataset<Row> selectedData = dataSelector.selectData();
selectedData
.write()
.partitionBy("dataGroup")
.format("au.com.mycompany.myformat.DefaultSource")
.save("/path/to/shared/directory/");
To give you an idea of the scale, the resulting selected data consists of fifty to sixty million
records, unevenly split across roughly 3000 dataGroup files. Large, but not enormous.
The partitionBy("dataGroup") neatly ensures that each dataGroup file is processed by a
single executor. So far so good.
My datasource implements the new looking (and experimental) DataSourceV2 interface:
package au.com.mycompany.myformat;
import java.io.Serializable;
import java.util.Optional;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.sources.DataSourceRegister;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.WriteSupport;
import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
import org.apache.spark.sql.types.StructType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class DefaultSource implements DataSourceRegister, WriteSupport, Serializable {

    private static final Logger logger = LoggerFactory.getLogger(DefaultSource.class);

    public DefaultSource() {
        logger.info("created");
    }

    @Override
    public String shortName() {
        logger.info("shortName");
        return "myformat";
    }

    @Override
    public Optional<DataSourceWriter> createWriter(String writeUUID, StructType schema, SaveMode mode, DataSourceOptions options) {
        return Optional.of(new MyFormatSourceWriter(writeUUID, schema, mode, options));
    }
}
There's a DataSourceWriter implementation:
public class MyFormatSourceWriter implements DataSourceWriter, Serializable {
...
}
and a DataWriterFactory implementation:
public class MyDataWriterFactory implements DataWriterFactory<InternalRow> {
...
}
and finally a DataWriter implementation. It seems that a DataWriter is created and sent to
each executor. Therefore each DataWriter will process many of the dataGroups.
Each record has a unique key column.
public class MyDataWriter implements DataWriter<InternalRow>, Serializable {

    private static final Logger logger = LoggerFactory.getLogger(MyDataWriter.class);

    ...

    MyDataWriter(File buildDirectory, StructType schema, int partitionId) {
        this.buildDirectory = buildDirectory;
        this.schema = schema;
        this.partitionId = partitionId;
        logger.debug("Created MyDataWriter for partition {}", partitionId);
    }

    private String getFieldByName(InternalRow row, String fieldName) {
        return Optional.ofNullable(row.getUTF8String(schema.fieldIndex(fieldName)))
                .orElse(UTF8String.EMPTY_UTF8)
                .toString();
    }
    /**
     * Rows are written here. Each row has a unique key column as well as a dataGroup
     * column. Right now we are frequently getting called with the same record twice.
     */
    @Override
    public void write(InternalRow record) throws IOException {
        String nextDataFileName = getFieldByName(record, "dataGroup") + ".myExt";
        // some non-trivial logic for determining the right output file
        ...
        // write the output record
        outputWriter.append(getFieldByName(record, "key")).append(',')
                .append(getFieldByName(record, "prodDate")).append(',')
                .append(getFieldByName(record, "nation")).append(',')
                .append(getFieldByName(record, "plant")).append(',')
        ...
    }
    @Override
    public WriterCommitMessage commit() throws IOException {
        ...
        outputWriter.close();
        ...
        logger.debug("Committed partition {} with {} data files for zip file {} for a total of {} zip files",
                partitionId, dataFileCount, dataFileName, dataFileCount);
        return new MyWriterCommitMessage(partitionId, dataFileCount);
    }

    @Override
    public void abort() throws IOException {
        logger.error("Failed to collect data for schema: {}", schema);
        ...
    }
}
Right now I'm working around this by keeping track of the last key that was processed and ignoring
duplicates.
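For what it's worth, a minimal sketch of that workaround (not the actual code; it relies on the records arriving sorted by key within each partition, which matches the ordering by dataGroup, dataSubGroup and key described above):
// Sketch of the duplicate-suppression workaround.
private String lastKeySeen = null;

@Override
public void write(InternalRow record) throws IOException {
    String key = getFieldByName(record, "key");
    if (key.equals(lastKeySeen)) {
        return; // same record delivered twice in a row; skip it
    }
    lastKeySeen = key;
    // ... existing write logic ...
}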
I have written a Kafka producer that tails the contents of a log file (CSV format). The Kafka consumer is a streaming application that creates a JavaDStream.
Using the foreachRDD method, I'm splitting each line of the file on the delimiter ',' and creating a Row object. I have specified a schema that has 7 columns.
Then I am creating a DataFrame from the JavaRDD and the schema.
But the problem is that not all the rows in the log file have the same number of columns.
So, is there any way to filter out the rows that do not satisfy the schema, or to create the schema dynamically based on the row content?
Here is the relevant part of the code:
JavaDStream<String> msgDataStream = directKafkaStream.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
});

msgDataStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> rdd) {
        JavaRDD<Row> rowRDD = rdd.map(new Function<String, Row>() {
            @Override
            public Row call(String msg) {
                String[] splitMsg = msg.split(",");
                Object[] vals = new Object[splitMsg.length];
                for (int i = 0; i < splitMsg.length; i++) {
                    vals[i] = splitMsg[i].replace("\"", "").trim();
                }
                Row row = RowFactory.create(vals);
                return row;
            }
        });

        // Create schema
        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("timeIpReq", DataTypes.StringType, true),
            DataTypes.createStructField("SrcMac", DataTypes.StringType, true),
            DataTypes.createStructField("Proto", DataTypes.StringType, true),
            DataTypes.createStructField("ACK", DataTypes.StringType, true),
            DataTypes.createStructField("srcDst", DataTypes.StringType, true),
            DataTypes.createStructField("NATSrcDst", DataTypes.StringType, true),
            DataTypes.createStructField("len", DataTypes.StringType, true)});

        // Get Spark 2.0 session
        Dataset<Row> msgDataFrame = session.createDataFrame(rowRDD, schema);
A simple way to remove rows that do not match the expected schema is to use flatMap with an Option type. And since your goal is to build a DataFrame, the same flatMap step can be used to apply a schema to the data, which in Scala is made easy by case classes.
// Create Schema
case class NetInfo(timeIpReq: String, srcMac: String, proto: String, ack: String, srcDst: String, natSrcDst: String, len: String)
val netInfoStream = msgDataStream.flatMap { msg =>
  val parts = msg.split(",")
  if (parts.size == 7) { // filter out messages with a mismatched number of fields
    val Array(time, src, proto, ack, srcDst, natSrcDst, len) = parts // use an extractor to bind each part to a variable
    Some(NetInfo(time, src, proto, ack, srcDst, natSrcDst, len)) // return a valid record
  } else {
    None // not a valid record; drop it
  }
}
netInfoStream.foreachRDD { rdd =>
  import sparkSession.implicits._
  val df = rdd.toDF() // DataFrame conversion is possible for RDDs with a schema (based on a case class)
  // do stuff with the dataframe
}
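If you would rather stay in Java, an equivalent pre-filter can be dropped into the question's existing foreachRDD code (a sketch reusing the rdd, schema and session names from above): discard malformed lines before building the Rows.
// Sketch: keep only lines that split into exactly 7 fields, then build Rows as before.
JavaRDD<Row> rowRDD = rdd
        .map(msg -> msg.split(","))
        .filter(parts -> parts.length == 7)
        .map(parts -> {
            Object[] vals = new Object[parts.length];
            for (int i = 0; i < parts.length; i++) {
                vals[i] = parts[i].replace("\"", "").trim();
            }
            return RowFactory.create(vals);
        });

Dataset<Row> msgDataFrame = session.createDataFrame(rowRDD, schema);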
Regarding:
not all the rows in the log file have the same number of columns.
Assuming that they all represent the same kind of data but with some columns potentially missing, the right strategy would be either to filter out the incomplete data (as exemplified above) or to use nullable fields in a fixed schema, if there is a deterministic way to know which fields are missing; a small padding sketch follows below. This requirement should be raised with the upstream applications that generate the data. It is common to represent missing values in CSV with empty comma sequences (e.g. field0,,field2,,,field5).
A dynamic schema to handle per-row differences would not make sense, as there would be no way to apply it to a complete DataFrame composed of rows with different schemas.
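For the nullable-schema strategy, a minimal Java sketch (it assumes the missing fields are always the trailing ones, which may not hold for your data; the 7-column schema from the question already marks every column as nullable):
// Sketch: pad short rows with nulls so every Row matches the 7-column schema.
JavaRDD<Row> paddedRowRDD = rdd.map(msg -> {
    String[] parts = msg.split(",");
    Object[] vals = new Object[7]; // one slot per schema column
    for (int i = 0; i < 7; i++) {
        vals[i] = i < parts.length ? parts[i].replace("\"", "").trim() : null;
    }
    return RowFactory.create(vals);
});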
I am reading HBase data and trying to run a Spark job on it. My table has around 70k rows, and each row has a column 'type' that can have the value post, comment or reply. Based on the type, I want to extract different pair RDDs, as shown below (for post).
JavaPairRDD<ImmutableBytesWritable, FlumePost> postPairRDD = hBaseRDD.mapToPair(
    new PairFunction<Tuple2<ImmutableBytesWritable, Result>, ImmutableBytesWritable, FlumePost>() {

        private static final long serialVersionUID = 1L;

        public Tuple2<ImmutableBytesWritable, FlumePost> call(Tuple2<ImmutableBytesWritable, Result> arg0)
                throws Exception {
            FlumePost flumePost = new FlumePost();
            ImmutableBytesWritable key = arg0._1;
            Result result = arg0._2;
            String type = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("t")));
            if (type.equals("post")) {
                return new Tuple2<ImmutableBytesWritable, FlumePost>(key, flumePost);
            } else {
                return null;
            }
        }
    }).distinct();
The problem here is that for every row with a type other than post I have to return a null value, which is undesirable. And the iteration runs over all 70k rows for each of the three types, wasting cycles. So my first question is:
1) What is an efficient way to do this?
After getting the 70k results I call distinct() to collapse the duplicated null values, so I end up with a single null entry in the RDD. I expect 20327 results but I get 20328.
2) Is there a way to remove this null entry from the pair RDD?
You can use the filter operation on the RDD.
Simply call:
.filter(new Function<Tuple2<ImmutableBytesWritable, FlumePost>, Boolean>() {
    @Override
    public Boolean call(Tuple2<ImmutableBytesWritable, FlumePost> v1) throws Exception {
        return v1 != null;
    }
})
before calling distinct() to filter out the nulls.
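In context, the chain would look something like this (a sketch based on the question's code; the only point being made is that filter goes between mapToPair and distinct):
JavaPairRDD<ImmutableBytesWritable, FlumePost> postPairRDD = hBaseRDD
        .mapToPair(/* the PairFunction from the question, returning null for non-post rows */)
        .filter(new Function<Tuple2<ImmutableBytesWritable, FlumePost>, Boolean>() {
            @Override
            public Boolean call(Tuple2<ImmutableBytesWritable, FlumePost> v1) throws Exception {
                return v1 != null; // drop the placeholder nulls
            }
        })
        .distinct();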
How can I create a DataFrame from a JavaRDD that contains Integers? I have tried something like the code below, but it is not working.
List<Integer> input = Arrays.asList(101, 103, 105);
JavaRDD<Integer> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, Integer.class);
I get a ClassCastException saying that org.apache.spark.sql.types.IntegerType$ cannot be cast to org.apache.spark.sql.types.StructType.
How can I achieve this?
Apparently (although not intuitively), this createDataFrame overload can only work for "Bean" types, which means types that do not correspond to any built-in Spark SQL type.
You can see this in the source code: the class you pass is matched to a Spark SQL type in JavaTypeInference.inferDataType, and the result is cast to a StructType (see dataType.asInstanceOf[StructType] in SQLContext.getSchema), but the built-in "primitive" types (like IntegerType) are NOT StructTypes. Looks like a bug or undocumented behavior to me...
WORKAROUNDS:
Wrap your Integers with a "bean" class (that's ugly, I know):
public static class MyBean {
    final int value;

    MyBean(int value) {
        this.value = value;
    }

    public int getValue() {
        return value;
    }
}
List<MyBean> input = Arrays.asList(new MyBean(101), new MyBean(103), new MyBean(105));
JavaRDD<MyBean> inputRDD = sc.parallelize(input);
DataFrame dataframe = sqlcontext.createDataFrame(inputRDD, MyBean.class);
dataframe.show(); // this works...
Convert to RDD<Row> yourself:
// convert to Rows:
JavaRDD<Row> rowRdd = inputRDD.map(new Function<Integer, Row>() {
    @Override
    public Row call(Integer v1) throws Exception {
        return RowFactory.create(v1);
    }
});
// create schema (this looks nicer in Scala...):
StructType schema = new StructType(new StructField[]{new StructField("number", IntegerType$.MODULE$, false, Metadata.empty())});
DataFrame dataframe = sqlcontext.createDataFrame(rowRdd, schema);
dataframe.show(); // this works...
Now in Spark 2.2 you can do the following to create a Dataset.
Dataset<Integer> dataSet = sqlContext().createDataset(javardd.rdd(), Encoders.INT());
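As a side note, if the values are already in a java.util.List there is an overload that skips the RDD step entirely (a quick sketch; the spark variable here is assumed to be a SparkSession):
// Sketch: build the Dataset straight from a List and convert to a DataFrame if needed.
List<Integer> input = Arrays.asList(101, 103, 105);
Dataset<Integer> ds = spark.createDataset(input, Encoders.INT());
Dataset<Row> df = ds.toDF("number"); // optional: name the single column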