Parsing data-forge DataFrame with missing data - node.js

I have a CSV file formatted like this and containing missing data:
time,col1,col2
0,12.3,99.2
1,,101.0
2,10.5,101.4
3,,102.5
4,11.9,
I'm using data-forge-js to read the data from CSV and convert it into floats:
const stringDF = dataForge.fromCSV(csvData);
const parsedDF = stringDF.parseFloats(stringDF.getColumnNames());
But this results in an empty dataframe because (I assume) there are missing values in the data. It works when there are no missing values. How can I insert NaNs or some stand-in value for missing data?

I was getting confused by the lazy evaluation of DataFrames in data-forge. I needed to call parsedDF.bake() to force evaluation, so that the parsed data actually shows up when the DataFrame is printed to the console.
Working solution here: https://runkit.com/shrinkinguniverse/parse-dataframe-with-missing-data-using-data-forge
const dataForge = require('data-forge');
const csvString = `time,col1,col2
0,12.3,99.2
1,,101.0
2,10.5,101.4
3,,102.5
4,11.9,`
const stringDF = dataForge.fromCSV(csvString);
let parsedDF = stringDF.parseFloats(stringDF.getColumnNames());
parsedDF = parsedDF.bake(); // Needed to force the lazily evaluated DataFrame to be evaluated.
console.log(parsedDF);

Related

How to identify or reroute bad XMLs when reading XMLs with Spark

Using Spark, I am trying to read a bunch of XML files from a path; one of the files is a dummy file which is not an XML.
I would like Spark to tell me, in some way, that one particular file is not valid.
Adding the "badRecordsPath" option writes the bad data to the specified location for JSON files, but the same does not work for XML. Is there some other way?
df = (spark.read.format('json')
      .option('badRecordsPath', '/tmp/data/failed')
      .load('/tmp/data/dummy.json'))
As far as I know, this is unfortunately not available in Spark's XML package in a declarative way, i.e. in the way you are expecting.
For JSON it works because FailureSafeParser was implemented in DataFrameReader, like below:
/**
 * Loads a `Dataset[String]` storing JSON objects (<a href="http://jsonlines.org/">JSON Lines
 * text format or newline-delimited JSON</a>) and returns the result as a `DataFrame`.
 *
 * Unless the schema is specified using `schema` function, this function goes through the
 * input once to determine the input schema.
 *
 * @param jsonDataset input Dataset with one JSON object per record
 * @since 2.2.0
 */
def json(jsonDataset: Dataset[String]): DataFrame = {
  val parsedOptions = new JSONOptions(
    extraOptions.toMap,
    sparkSession.sessionState.conf.sessionLocalTimeZone,
    sparkSession.sessionState.conf.columnNameOfCorruptRecord)

  val schema = userSpecifiedSchema.getOrElse {
    TextInputJsonDataSource.inferFromDataset(jsonDataset, parsedOptions)
  }

  ExprUtils.verifyColumnNameOfCorruptRecord(schema, parsedOptions.columnNameOfCorruptRecord)
  val actualSchema =
    StructType(schema.filterNot(_.name == parsedOptions.columnNameOfCorruptRecord))

  val createParser = CreateJacksonParser.string _
  val parsed = jsonDataset.rdd.mapPartitions { iter =>
    val rawParser = new JacksonParser(actualSchema, parsedOptions, allowArrayAsStructs = true)
    val parser = new FailureSafeParser[String](
      input => rawParser.parse(input, createParser, UTF8String.fromString),
      parsedOptions.parseMode,
      schema,
      parsedOptions.columnNameOfCorruptRecord)
    iter.flatMap(parser.parse)
  }
  sparkSession.internalCreateDataFrame(parsed, schema, isStreaming = jsonDataset.isStreaming)
}
You can implement the feature in a programmatic way (a rough sketch follows after these steps):
Read all the files in the folder using sc.textFile.
For each file, parse the entries using an XML parser.
If it is valid, redirect it to another path.
If it is invalid, write it to the bad record path.
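For illustration, here is a minimal sketch of those steps, not a tested solution: it assumes the scala-xml library is on the classpath, uses wholeTextFiles (rather than textFile) so each file can be validated as a single unit, and all paths are placeholders.
import scala.util.Try

// Read (path, content) pairs so each file can be validated as a whole.
val files = sc.wholeTextFiles("/tmp/data/xml")

// A file counts as valid if scala-xml can parse its full content.
val checked = files.map { case (path, content) =>
  (path, content, Try(scala.xml.XML.loadString(content)).isSuccess)
}.cache()

// Valid files are redirected to one location...
checked.filter { case (_, _, ok) => ok }
  .map { case (_, content, _) => content }
  .saveAsTextFile("/tmp/data/valid")

// ...and the paths of invalid files are written to the bad record path.
checked.filter { case (_, _, ok) => !ok }
  .map { case (path, _, _) => path }
  .saveAsTextFile("/tmp/data/failed")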

Reading a BQ table with AvroBigQueryInputFormat from Spark gives unexpected behavior (using Java)

A sample code skeleton is roughly as follows, where I am basically reading an RDD from BigQuery and selecting out all data points where the my_field_name value is null:
JavaPairRDD<String, GenericData.Record> input = sc
    .newAPIHadoopRDD(hadoopConfig, AvroBigQueryInputFormat.class, LongWritable.class, GenericData.Record.class)
    .mapToPair(tuple -> {
        GenericData.Record record = tuple._2;
        // Problematic!! I want to get my_field_name of this BQ row, but it just gives something that makes no sense.
        Object rawValue = record.get(my_field_name);
        String partitionValue = rawValue == null ? "EMPTY" : rawValue.toString();
        return new Tuple2<String, GenericData.Record>(partitionValue, record);
    }).cache();

JavaPairRDD<String, GenericData.Record> emptyData =
    input.filter(tuple -> StringUtils.equals("EMPTY", tuple._1));

emptyData.values().saveAsTextFile(my_file_path);
However, the output RDD seems totally unexpected. In particular, the value of my_field_name seems totally random. After a little debugging, it seems the filtering does what is expected, but the problem is that the value I extract from GenericData.Record (basically record.get(my_field_name)) seems totally random.
After I switched from AvroBigQueryInputFormat to GsonBigQueryInputFormat to read BQ as JSON instead, this code seems to work correctly.
However, ideally I really want to use Avro (which should be much faster than handling JSON), but its current behavior in my code is totally disturbing. Am I just using AvroBigQueryInputFormat wrong?

Difference between elements when reading from multiple files

I am trying to get the difference between each element after reading multiple CSV files. Each CSV file has 13 rows and 128 columns. I am trying to get the column-wise difference.
I read the files using
data = [pd.read_csv(f, index_col=None, header=None) for f in _temp]
I get a list of all samples.
According to this I have to use .diff() to get the difference. Which goes something like this
data.diff()
This works, but instead of getting the difference between each row within the same sample, I get the difference between each row of one sample and another sample.
Is there a way to separate this and let the difference happen within each sample?
Edit
OK, I am able to get the difference between the data elements by doing this:
_local = pd.DataFrame(data)
_list = []
_a = _local.index
for _aa in _a:
    _list.append(_local[0][_aa].diff())
flow = pd.DataFrame(_list, index=_a)
I am creating too many DataFrames, is there a better way to do this?
Here is a relatively efficient way to read your dataframes one at a time and calculate their differences, which are stored in a list df_diff.
df_diff = []
df_old = pd.read_csv(_temp[0], index_col=None)
for f in _temp[1:]:
    df = pd.read_csv(f, index_col=None)
    df_diff.append(df_old - df)
    df_old = df
Since your code works, you should really post it on https://codereview.stackexchange.com/
(PS: the leading "_" is not really Pythonic, please avoid it. It makes your code harder to read.)
_local = pd.DataFrame(data)
_list = [_local[0][_aa].diff() for _aa in _local.index]
flow = pd.DataFrame(_list, index=_local.index)

How to convert a MapPartitionsRDD to a list or vector, then export as a CSV file?

I used some training data to build a random forest model to predict diabetes. Now I want to use this model to make predictions for a set of patients whose features are stored in a CSV file. Below is my code:
val sc = new SparkContext("local[4]", "RandomForestsMultiplePredict")
//load the RandomForestModel from the file
val RFModel = RandomForestModel.load(sc, "RFModelPath")
//Transform the data in the file to RDD<Vector> format to predict
val data = sc.textFile("data/Diabetes_for_Test.csv")
val features = data.map(x => Vectors.dense(x.split(',').map(_.toDouble))).cache()
//predict the data
val result = RFModel.predict(features)
After I get the result, I want to convert the MapPartitionsRDD data to a list or vector, in order to export the prediction result to a new CSV file. How can I do it? Thanks.
From reading the API documentation of MapPartitionsRDD, I learned that there is a method that can be used:
result.toLocalIterator.toList
From @zero323's help, I found that
result.collect().toList
also works well. Thanks.
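If the goal is an actual CSV file rather than an in-memory list, one option (a hedged sketch, not from the original answers; the output path and column layout are placeholders) is to pair each feature vector with its prediction and save the lines directly from Spark:
// `features` and `result` are the RDDs defined in the question. zip is safe here
// because predict is a per-element map, so the two RDDs line up exactly.
val rows = features.zip(result).map { case (vector, prediction) =>
  (vector.toArray :+ prediction).mkString(",")
}

// coalesce(1) produces a single part file; drop it for large outputs.
rows.coalesce(1).saveAsTextFile("data/Diabetes_Predictions_csv")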

How to read multiple text files into a single RDD?

I want to read a bunch of text files from an HDFS location and perform mapping on them in an iteration using Spark.
JavaRDD<String> records = ctx.textFile(args[1], 1); is capable of reading only one file at a time.
I want to read more than one file and process them as a single RDD. How?
You can specify whole directories, use wildcards and even CSV of directories and wildcards. E.g.:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
As Nick Chammas points out this is an exposure of Hadoop's FileInputFormat and therefore this also works with Hadoop (and Scalding).
Use union as follows:
val sc = new SparkContext(...)
val r1 = sc.textFile("xxx1")
val r2 = sc.textFile("xxx2")
...
val rdds = Seq(r1, r2, ...)
val bigRdd = sc.union(rdds)
Then the bigRdd is the RDD with all files.
You can use a single textFile call to read multiple files. Scala:
sc.textFile(files.mkString(","))
You can use this approach.
First, you can get a Buffer/List of S3 paths:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest
def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]

  // S3 Client and List Object Request
  var s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  var listObjectsRequest = new ListObjectsRequest()

  // Your S3 Bucket
  listObjectsRequest.setBucketName(s3_bucket)
  // Your Folder path or Prefix
  listObjectsRequest.setPrefix(base_prefix)

  // Adding s3:// to the paths and adding them to a list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())

  // Removing the base directory name
  files.remove(0)
  // Creating a Scala List for the same
  files.asScala
}
Now pass this list object to the following piece of code (note: sc here is the SparkContext):
var df: RDD[String] = null
for (file <- files) {
  val fileDf = sc.textFile(file)
  if (df != null) {
    df = df.union(fileDf)
  } else {
    df = fileDf
  }
}
Now you have a final unified RDD, i.e. df.
Optionally, you can also repartition it into a single big RDD:
val bigRdd = df.repartition(1)
Repartitioning always works :D
In PySpark, I have found an additional useful way to parse files. Perhaps there is an equivalent in Scala, but I am not comfortable enough coming up with a working translation. It is, in effect, a textFile call with the addition of labels (in the below example the key = filename, value = 1 line from file).
"Labeled" textFile
input:
import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext

SparkContext.stop(sc)  # stop any existing context first (e.g. in a notebook)
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Spark_Full = sc.emptyRDD()  # start from an empty RDD and union each file into it
for filename in glob.glob(Data_File + "/*"):  # Data_File is the directory with the text files
    Spark_Full += sc.textFile(filename).keyBy(lambda x: filename)
Output: an array with each entry containing a tuple using the filename as the key and with value = each line of the file. (Technically, using this method you can also use a different key besides the actual file path name, perhaps a hashing representation to save on memory.) I.e.:
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
...]
You can also recombine either as a list of lines:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the file path):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
You can use
JavaPairRDD<String, String> records = sc.wholeTextFiles("path of your directory");
Here you will get the path of each file and the content of that file, so you can perform any action on a whole file at a time, which saves the overhead.
All the answers are correct with sc.textFile.
I was just wondering why not wholeTextFiles, for example, in this case...
val minPartitions = 2
val path = "/pathtohdfs"
sc.wholeTextFiles(path, minPartitions)
  .flatMap { case (path, text) =>
    ...
One limitation is that we have to load small files, otherwise performance will be bad and may lead to OOM.
Note:
The whole file should fit into memory.
Good for file formats that are NOT splittable by line, such as XML files.
There is a straightforward, clean solution available: use the wholeTextFiles() method. This takes a directory and forms key-value pairs; the returned RDD will be a pair RDD.
Below is the description from the Spark docs:
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.
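For illustration, a minimal sketch of what working with those (filename, content) pairs looks like (the directory path is a placeholder):
// Each element is a (filename, content) pair, so per-file logic is easy to express.
val pairs = sc.wholeTextFiles("/my/dir")
val lineCounts = pairs.mapValues(content => content.split("\n").length)
lineCounts.collect().foreach(println)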
TRY THIS
Interface used to write a DataFrame to external storage systems (e.g. file systems, key-value stores, etc). Use DataFrame.write() to access this.
New in version 1.4.
csv(path, mode=None, compression=None, sep=None, quote=None, escape=None, header=None, nullValue=None, escapeQuotes=None, quoteAll=None, dateFormat=None, timestampFormat=None)
Saves the content of the DataFrame in CSV format at the specified path.
Parameters:
path – the path in any Hadoop supported file system
mode –
specifies the behavior of the save operation when data already exists.
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error (default case): Throw an exception if data already exists.
compression – compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).
sep – sets the single character as a separator for each field and value. If None is set, it uses the default value, ,.
quote – sets the single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
escape – sets the single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, \
escapeQuotes – A flag indicating whether values containing quotes should always be enclosed in quotes. If None is set, it uses the default value true, escaping all values containing a quote character.
quoteAll – A flag indicating whether all values should always be enclosed in quotes. If None is set, it uses the default value false, only escaping values containing a quote character.
header – writes the names of columns as the first line. If None is set, it uses the default value, false.
nullValue – sets the string representation of a null value. If None is set, it uses the default value, empty string.
dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to the date type. If None is set, it uses the default value, yyyy-MM-dd.
timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to the timestamp type. If None is set, it uses the default value, yyyy-MM-dd'T'HH:mm:ss.SSSZZ.
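For reference, the same settings can be passed as string options on the Scala/Java DataFrameWriter as well; a minimal sketch (df, the output path and the option values are placeholders, not from the post):
// Write a DataFrame as CSV; each option mirrors a parameter listed above.
df.write
  .mode("overwrite")              // append / overwrite / ignore / error
  .option("header", "true")       // write column names as the first line
  .option("sep", ",")             // field separator
  .option("nullValue", "NA")      // stand-in string for nulls
  .option("compression", "gzip")  // none, bzip2, gzip, lz4, snappy, deflate
  .csv("/tmp/data/out_csv")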
rdd = sc.textFile('/data/{1.txt,2.txt}')
