Error when reading in .adf raster file - attributes

When reading in a raster dataset I get the error below. Previously I have been able to read this same raster dataset in R this way, maintaining access to the attribute table and the correct field names. I've tried replacing the files with backups to rule out corrupt files, and I still get the same error. Besides corrupt files, what might be causing it?
dat2 <- raster("data/data_LEMMA/lemma_clip/w001001.adf")
Error : GDAL Error 3: Failed reading table field info for table
lemma_clip.VAT File may be corrupt?
Warning message: In .rasterFromGDAL(x, band = band, objecttype, ...) :
Could not read RAT or Category names

This appears to be an ESRI GRID that works fine with Arc, but not with GDAL when it reads the Raster Attribute Table (RAT; or VAT in ESRI speak). It would be useful if you could make the data available so others can take a look and search for a solution.
A work-around is to not read the RAT; perhaps that is acceptable in this case.
dat2 <- raster("data/data_LEMMA/lemma_clip/w001001.adf", RAT=FALSE)

Related

AzureML TabularDatasetFactory.from_parquet_files() error handling column types

I'm reading in a folder of parquet files using azureml's TabularDatasetFactory method:
dataset = TabularDatasetFactory.from_parquet_files(path=[(datastore_instance, "path/to/files/*.parquet")])
but I am running into the issue that one of the columns is typed 'List' in the parquet files, and it seems TabularDatasetFactory.from_parquet_files() can't handle that type:
ExecutionError:
Error Code: ScriptExecution.StreamAccess.Validation
Validation Error Code: NotSupported
Validation Target: ParquetType
Failed Step: xxxxxx
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by ValidationException.
No conversion exists for column: '[REDACTED COLUMN NAME]', from Parquet SchemaType: 'List' to DataPrep ValueKind
So I'm wondering if there's a way to tell TabularDatasetFactory.from_parquet_files() specifically which columns to pull in, or a way to tell it to fall back to object/string for any unsupported column types. Or maybe there's a workaround: first reading in the files as a FileDataset, then selecting which columns in the files to use?
I do see the set_column_types parameter, but I don't know the columns until I read the data into a dataset, since I'm using datasets to explore what data is available in the folder paths in the first place.
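For reference, something along these lines is what I have in mind; the column name 'my_list_column' is just a placeholder, and I haven't verified that either call actually avoids the List validation error:
from azureml.data.dataset_factory import TabularDatasetFactory, DataType

# Option 1: ask for the awkward column to be read as a plain string
dataset = TabularDatasetFactory.from_parquet_files(
    path=[(datastore_instance, "path/to/files/*.parquet")],
    set_column_types={"my_list_column": DataType.to_string()},  # placeholder column name
)

# Option 2: skip eager validation and drop the column before materializing the data
dataset = TabularDatasetFactory.from_parquet_files(
    path=[(datastore_instance, "path/to/files/*.parquet")],
    validate=False,
).drop_columns(["my_list_column"])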

Invalid date:Error while import CSV to Cassandra using pySpark

I'm using a Jupyter notebook to run PySpark code that imports a CSV file into Cassandra v3.11.3, and I'm getting the error below.
... 1 more
(The full stack trace and the PySpark code were attached only as screenshots, not as text.)
Any inputs...
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the py4j wrapper method, and we would really need to see the underlying Java exception.
From what I can tell, it looks like you are also attempting to use some options on the C* write that are unsupported. For example, "mode" = "DROPMALFORMED" is not a valid C* connector option. DataFrame writer and reader options are source-specific, so you are unfortunately unable to mix and match.
This makes me think that the data being written actually has a malformed date string or two, and this code is dying when attempting to write the broken record. One way around this would be to do the date casting on the CSV read, which I believe does support DROPMALFORMED-style parsing options, as sketched below.
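Something along these lines is what I have in mind (this is an untested sketch: paths, column names, keyspace and table are placeholders to adapt, and it assumes a SparkSession named spark with the Cassandra connector on the classpath):
from pyspark.sql.types import StructType, StructField, StringType, DateType

# Define the expected schema so malformed date strings are caught at read time
schema = StructType([
    StructField("id", StringType()),
    StructField("event_date", DateType()),
])

df = (spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")     # drop rows the CSV parser cannot convert
      .option("dateFormat", "yyyy-MM-dd")  # adjust to the file's actual date format
      .schema(schema)
      .csv("path/to/input.csv"))

(df.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="my_keyspace", table="my_table")
 .mode("append")
 .save())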

Trouble importing VTK files into Paraview (Error reading ascii data)

I am very new to using Paraview, and I'm trying to import a few VTK files and view them. However, I'm receiving the following errors:
Generic Warning: In /Users/kitware/dashboards/buildbot-slave/8275bd07/build/superbuild/paraview/src/VTK/IO/Legacy/vtkDataReader.cxx, line 1436
Error reading ascii data. Possible mismatch of datasize with declaration.
ERROR: In /Users/kitware/dashboards/buildbot-slave/8275bd07/build/superbuild/paraview/src/VTK/IO/Legacy/vtkUnstructuredGridReader.cxx, line 346
vtkUnstructuredGridReader (0x7fb15582bd10): Unrecognized keyword: ,
I can't seem to figure out what's wrong; I've tried converting them to other formats to no avail.
I don't think there's a problem with the files; I can open them with ParaView 5.6. Maybe they were generated with a version of VTK that is more recent than the one used by your version of ParaView. You should install the latest version of ParaView (or at least 5.6).
The big file results in some visible geometry; the smaller one does not. But I get no error message, and everything seems OK.

error uploading csv file on cloud jupyter notebook

I have set up a Google Cloud account because I want my deep learning to run much faster in a Jupyter notebook, but I cannot find a way to read my CSV file.
I downloaded it with wget from my GitHub account, and afterwards I tried
dataset = pd.read_csv('/home/user/.jupyter/SIEMENSTRAIN.csv')
but I get the following error
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 12
Why? When I read it on my laptop in my Jupyter notebooks, everything runs fine.
Any suggestions?
I tried the recommended solutions for this error and got the following warning:
/home/user/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  if __name__ == '__main__':
When I ran dataset.head() this is what appeared
Any help please?
There are a number of possibilities that could be causing the problem. First, I would always make sure that your pandas version is up to date and compatible.
The more likely cause is that the CSV itself is not right, so pd.read_csv() is not able to parse it correctly (hence the parser error). This may have something to do with the headers, though I'm not sure what your original CSV file looks like. It's worth playing around with read_csv, for example:
df = pd.read_csv(fileName, sep=',', header=None)  # set sep to the file's actual delimiter
This adjusts two things: the delimiter, and whether pandas reads a header row from the CSV or not.
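If you don't know the delimiter, you can peek at the raw file first, or let pandas sniff the separator with the python engine. A minimal sketch, reusing the path from your question:
import pandas as pd

path = '/home/user/.jupyter/SIEMENSTRAIN.csv'

# Print the first few raw lines to see the actual delimiter and whether
# there is a header row.
with open(path) as f:
    for _ in range(3):
        print(f.readline().rstrip())

# sep=None lets pandas infer the separator; it requires the python engine.
df = pd.read_csv(path, sep=None, engine='python')
print(df.head())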
I go through some pd.read_csv() stuff in my book about Stock Prediction (another cool Machine Learning problem) and Deep Learning, feel free to check it out.
Good Luck!
I tried what you proposed and this is what I got
So, any suggestions?
I suppose that the path is ok, but it just won't be read properly, or am I wrong?

How to read sequence files exported from HBase

I used the following code to export an HBase table and save the output to HDFS:
hbase org.apache.hadoop.hbase.mapreduce.Export \
MyHbaseTable1 hdfs://nameservice1/user/ken/data/exportTable1
The output files are binary. If I use PySpark to read the folder:
test1 = sc.textFile('hdfs://nameservice1/user/ken/data/exportTable1')
test1.take(5)
It shows:
u'SEQ\x061org.apache.hadoop.hbase.io.ImmutableBytesWritable%org.apache.hadoop.hbase.client.Result\x00\x00\x00\x00\x00\x00\ufffd-\x10A\ufffd~lUE\u025bt\ufffd\ufffd\ufffd&\x00\x00\x04\ufffd\x00\x00\x00'
u'\x00\x00\x00\x067-2010\ufffd\t'
u'|'
u'\x067-2010\x12\x01r\x1a\x08clo-0101 \ufffd\ufffd\ufffd*(\x042\\6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0'
u'u'
I can tell that
'7-2010' in the 2nd line is the Rowkey,
'r' in the 4th line is the column family,
'clo-0101' in the 4th line is the column name,
'6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0' is the value.
I don't know where the 3rd and 5th lines came from. It seems like HBase Export followed its own rules when generating the file; if I decode it my own way, the data might get corrupted.
Question:
How can I convert this file back to a readable format? For example:
7-2010, r, clo-0101, 6.67|10|10|10|7.33|6.67|6.67|6.67|6.67|6.67|6.67|5.83|3.17|0|0|0.67|0.67|0.67|0.67|0|0|0|0|0
I have tried:
test1 = sc.sequenceFile('/user/youyang/data/hbaseSnapshot1/', keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)
test1.take(5)
and
test1 = sc.sequenceFile('hdfs://nameservice1/user/ken/data/exportTable1',
                        keyClass='org.apache.hadoop.hbase.mapreduce.TableInputFormat',
                        valueClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
                        keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
                        valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
                        minSplits=None,
                        batchSize=100)
No luck; the code did not work. ERROR:
Caused by: java.io.IOException: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the configuration 'io.serializations' is properly configured, if you're using custom serialization.
Any suggestions? Thank you!
I had this problem recently myself. I solved it by moving away from sc.sequenceFile and instead using sc.newAPIHadoopFile (or just hadoopFile if you're on the old API). The Spark SequenceFile reader appears to handle only keys/values that are Writable types (as stated in the docs).
If you use newAPIHadoopFile, it uses the Hadoop deserialization logic, and you can specify which serialization types you need in the config dictionary you give it:
hadoop_conf = {"io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization"}

sc.newAPIHadoopFile(
    <input_path>,
    'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat',
    keyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    valueClass='org.apache.hadoop.hbase.client.Result',
    keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
    valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
    conf=hadoop_conf)
Note that the value in hadoop_conf for "io.serializations" is a comma separated list which includes "org.apache.hadoop.hbase.mapreduce.ResultSerialization". That is the key configuration you need to be able to deserialize the Result. The WritableSerialization is also needed in order to be able to deserialize ImmutableBytesWritable.
You can also use sc.newAPIHadoopRDD, but then you also need to set a value for "mapreduce.input.fileinputformat.inputdir" in the config dictionary.
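For example, something like this (reusing the HDFS path from your question; treat it as an untested sketch):
hadoop_conf = {
    "io.serializations": "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization",
    # newAPIHadoopRDD takes no path argument, so the input dir goes in the conf
    "mapreduce.input.fileinputformat.inputdir": "hdfs://nameservice1/user/ken/data/exportTable1",
}

rdd = sc.newAPIHadoopRDD(
    'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat',
    keyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    valueClass='org.apache.hadoop.hbase.client.Result',
    keyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter',
    valueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter',
    conf=hadoop_conf)
rdd.take(5)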
