I have problems reading a large shapefile in the GeoDMS GUI, version 7.177.
I'm trying to read the BAG (Basisregistratie Adressen en Gebouwen, the Dutch registry of addresses and buildings; a giant geo file) into GeoDMS directly from the Kadaster. It was first converted from .xml to .csv, then from .csv to .shp (using the Python 'shapefile' library). When I make a selection of 10 000 buildings, everything goes well. When I want to read the whole BAG (some 16 000 000 buildings), however, GeoDMS can't seem to read the entire shapefile: after a while the CPU usage drops close to 0% and no further progress seems to be made.
Code:
/*
This program reads the BAG in .shp format and writes it to .dmsdata format for speedy processing in Minta
*/
container root
{
unit<dpoint> rdc_base; // RDC stands for Rijksdriehoekscoordinaten, Dutch state coordinate system
unit<dpoint> rdc := range(rdc_base,point(300000.0,0.0),point(625000.0,280000.0)); // default rdc: built on doubles
unit<uint32> bagRead:
storageName = 'c:/zandbak/intermediate/bagPND.dbf'
, dialogData = 'geometry'
, dialogType = 'map'
, storageReadOnly = 'true'
, isHidden = 'true'
{
attribute<rdc> geometry(polygon):
storageName = 'c:/zandbak/intermediate/bagPND.shp'
, storageReadOnly = 'true';
attribute<string> buildingId;
attribute<string> status;
attribute<string> year;
}
unit<uint32> bagWrite := subset(bagRead/buildingId==bagRead/buildingId)
, storageName = 'c:/zandbak/output/bagPND.fss'
, storageReadOnly = 'false'
, dialogData = 'geometry'
, dialogType = 'map'
{
attribute<uint32> nr_OrgEntity;
attribute<rdc> geometry(polygon) := bagRead/geometry[nr_OrgEntity];
attribute<string> buildingId := bagRead/buildingId[nr_OrgEntity];
attribute<string> status := bagRead/status[nr_OrgEntity];
attribute<string> year := bagRead/year[nr_OrgEntity];
}
}
I run this code in batch mode:
"C:\Program Files\ObjectVision\GeoDms7177\GeoDmsRun.exe" "C:\repository\vesta\bagpreprocessing\root.dms" /bagWrite
As noted, this code works fine for 10 000 buildings, but not for 16 million. Is there a way to read large shapefiles in GeoDMS?
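One thing worth ruling out first is whether the generated shapefile itself is readable end to end at this size (many tools enforce a 2 GB limit per .shp/.dbf file, which may come into play with 16 million polygons). A minimal sketch that streams through every record with the same pyshp ('shapefile') library used for the conversion, assuming pyshp 2.x and the paths from the script above:
import shapefile  # pyshp, the same library used for the .csv -> .shp conversion

# Stream through all records to check that the .shp/.dbf pair is complete and readable.
sf = shapefile.Reader("c:/zandbak/intermediate/bagPND")
print(len(sf), "records, shape type:", sf.shapeTypeName)
for i, shape_rec in enumerate(sf.iterShapeRecords()):
    if i % 1_000_000 == 0:
        print("read", i, "records so far")
print("all records read")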
Related
I have a file with around 85 million JSON records. The file size is around 110 GB. I want to read from this file in batches of 1 million records (in sequence). I am reading from this file line by line using a scanner and appending lines until I have 1 million records. Here is the gist of what I am doing:
var rawBatch []string
batchSize := 1000000

file, err := os.Open(filePath)
if err != nil {
    // error handling
}

scanner := bufio.NewScanner(file)
for scanner.Scan() {
    // Copy the scanner's bytes; the underlying buffer is reused on the next Scan.
    rec := string(scanner.Bytes())
    rawBatch = append(rawBatch, rec)
    if len(rawBatch) == batchSize {
        for i := 0; i < batchSize; i++ {
            var tRec parsers.TRecord
            err := json.Unmarshal([]byte(rawBatch[i]), &tRec)
            if err != nil {
                // Error thrown here
            }
        }
        // process the batch, then reset it
        rawBatch = nil
    }
}
file.Close()
Sample of a correct record:
type TRecord struct {
    Key1 string `json:"key1"`
    Key2 string `json:"key2"` // field must be exported for json.Unmarshal to populate it
}
{"key1":"15","key2":"21"}
The issue I am facing is that while reading these records, some of them are getting corrupted, for example a colon changing to a semicolon, or a double quote to a '#'. I am getting this error:
Unable to load Record: Unable to load record in:
{"key1":#15","key2":"21"}
invalid character '#' looking for beginning of value
Some observations:
Once we start reading, the contents of the file itself get corrupted.
For every batch of 1 million, I saw 1 (or max 2) records getting corrupted. Out of 84 million records, a total of 95 records were corrupted.
My code works for a file of around 42 GB (23 million records). With the larger data file, my code behaves erroneously.
':' is changing to ';', double quotes are changing to '#', and spaces are changing to '!'. In their binary representations, each of these pairs differs by a single bit (see the quick check below). Any chance that we might have some accidental bit manipulation?
Any ideas on why this is happening? And how can I fix it?
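For illustration, a quick check of the character codes confirms the single-bit observation (a small Python snippet, not part of the Go program above):
# Each original/corrupted character pair differs by exactly one bit in ASCII.
for good, bad in [(':', ';'), ('"', '#'), (' ', '!')]:
    diff = ord(good) ^ ord(bad)
    print(repr(good), hex(ord(good)), "->", repr(bad), hex(ord(bad)), "xor =", bin(diff))
# prints xor = 0b1 for all three pairs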
Details:
Go version used: go1.15.6 darwin/amd64
Hardware details: Debian GNU/Linux 9.12 (stretch), 224 GB RAM, 896 GB hard disk
As suggested by @icza in the comments:
That occasional, very rare 1 bit change suggests hardware failure (memory, processor cache, hard disk). I do recommend to test it on another computer.
I tested my code on some other machines. The code is running perfectly fine now. It looks like this occasional, rare bit change due to a hardware failure was the cause of the issue.
I would like to reproject a raster and keep working on that reprojected raster instead of loading it again from a file.
To project a raster I use either gdal:
import os
from osgeo import gdal, gdalconst

# Source
src = gdal.Open(vv_path, gdalconst.GA_ReadOnly)
src_proj = src.GetProjection()
src_geotrans = src.GetGeoTransform()
# We want a section of source that matches this:
match_ds = gdal.Open(sn2_red_path, gdalconst.GA_ReadOnly)
match_proj = match_ds.GetProjection()
match_geotrans = match_ds.GetGeoTransform()
wide = match_ds.RasterXSize
high = match_ds.RasterYSize
# Output / destination
dst_filename = os.path.join(sn1_processed_path,'vv.tif')
dst = gdal.GetDriverByName('Gtiff').Create(dst_filename, wide, high, 1, gdalconst.GDT_Float32)
dst.SetGeoTransform( match_geotrans )
dst.SetProjection( match_proj)
# Do the work
aa = gdal.ReprojectImage(src, dst, src_proj, match_proj, gdalconst.GRA_NearestNeighbour)
del dst # Flush
or rasterio from here
In both cases, the projected raster is saved to a file, and I have to load it again to process it. Is it possible to keep the projected raster as a variable as well?
You could use VRT datasets:
src = gdal.Open("reference.tif")
dst = gdal.Warp("warped.vrt", src, format="vrt", dstSRS="EPSG:3857")
This way only a small VRT file will be created, and you can use the dst dataset in downstream processing, at which point the warping will actually be performed.
You can even create the VRT itself in memory, so nothing is written to disk at all:
dst = gdal.Warp("", src, format="vrt", dstSRS="EPSG:3857")
If your dataset fits entirely in memory, you can create the actual dataset in memory using the vsimem virtual file system, which has the upside that the processing only has to be performed once if you want to use the result downstream in multiple functions:
dst = gdal.Warp("/vsimem/result_inmemory.tif", src, format="GTiff", dstSRS="EPSG:3857")
This way the processing is performed immediately, but you can then use the dataset object to, e.g., write it to disk and perform additional processing.
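Either way, the returned object behaves like any other GDAL dataset, so it can be consumed directly in memory. A minimal sketch of that (my own example, assuming the standard osgeo Python bindings and the in-memory VRT variant from above):
from osgeo import gdal

src = gdal.Open("reference.tif")
dst = gdal.Warp("", src, format="vrt", dstSRS="EPSG:3857")

# Pull the warped result into a NumPy array; the warp is evaluated at read time.
arr = dst.ReadAsArray()
print(arr.shape, dst.GetGeoTransform())

# The same dataset object can later be materialized to disk if needed (output name is illustrative).
gdal.Translate("warped_out.tif", dst)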
I set up a simple test to stream text files from S3 and got it to work when I tried something like
val input = ssc.textFileStream("s3n://mybucket/2015/04/03/")
and as log files went into the bucket, everything would work fine.
But if there was a subfolder, it would not find any files that got put into the subfolder (and yes, I am aware that hdfs doesn't actually use a folder structure)
val input = ssc.textFileStream("s3n://mybucket/2015/04/")
So I tried to simply use wildcards, like I have done before with a standard Spark application
val input = ssc.textFileStream("s3n://mybucket/2015/04/*")
But when I try this it throws an error
java.io.FileNotFoundException: File s3n://mybucket/2015/04/* does not exist.
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1483)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1523)
at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:176)
at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:134)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
.....
I know for a fact that you can use wildcards when reading file input in a standard Spark application, but it appears that streaming input neither handles wildcards nor automatically processes files in subfolders. Is there something I'm missing here?
Ultimately, what I need is a streaming job running 24/7 that monitors an S3 bucket that has logs placed in it by date.
So something like
s3n://mybucket/<YEAR>/<MONTH>/<DAY>/<LogfileName>
Is there any way to hand it the top-most folder and have it automatically read files that show up in any subfolder (because obviously the date will increase every day)?
EDIT
Upon digging into the documentation at http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources, I found that it states that nested directories are not supported.
Can anyone shed some light on why this is the case?
Also, since my files will be nested based upon their date, what would be a good way of solving this problem in my streaming application? It's a little complicated since the logs take a few minutes to get written to S3 and so the last file being written for the day could be written in the previous day's folder even though we're a few minutes into the new day.
Some "ugly but working solution" can be created by extending FileInputDStream.
Writing sc.textFileStream(d) is equivalent to
new FileInputDStream[LongWritable, Text, TextInputFormat](streamingContext, d).map(_._2.toString)
You can create a CustomFileInputDStream that extends FileInputDStream. The custom class copies the compute method from the FileInputDStream class and adjusts the findNewFiles method to your needs, changing the findNewFiles method from:
private def findNewFiles(currentTime: Long): Array[String] = {
try {
lastNewFileFindingTime = clock.getTimeMillis()
// Calculate ignore threshold
val modTimeIgnoreThreshold = math.max(
initialModTimeIgnoreThreshold, // initial threshold based on newFilesOnly setting
currentTime - durationToRemember.milliseconds // trailing end of the remember window
)
logDebug(s"Getting new files for time $currentTime, " +
s"ignoring files older than $modTimeIgnoreThreshold")
val filter = new PathFilter {
def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
}
val newFiles = fs.listStatus(directoryPath, filter).map(_.getPath.toString)
val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
logInfo("Finding new files took " + timeTaken + " ms")
logDebug("# cached file times = " + fileToModTime.size)
if (timeTaken > slideDuration.milliseconds) {
logWarning(
"Time taken to find new files exceeds the batch size. " +
"Consider increasing the batch size or reducing the number of " +
"files in the monitored directory."
)
}
newFiles
} catch {
case e: Exception =>
logWarning("Error finding new files", e)
reset()
Array.empty
}
}
to:
private def findNewFiles(currentTime: Long): Array[String] = {
try {
lastNewFileFindingTime = clock.getTimeMillis()
// Calculate ignore threshold
val modTimeIgnoreThreshold = math.max(
initialModTimeIgnoreThreshold, // initial threshold based on newFilesOnly setting
currentTime - durationToRemember.milliseconds // trailing end of the remember window
)
logDebug(s"Getting new files for time $currentTime, " +
s"ignoring files older than $modTimeIgnoreThreshold")
val filter = new PathFilter {
def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
}
val directories = fs.listStatus(directoryPath).filter(_.isDirectory)
val newFiles = ArrayBuffer[FileStatus]()
directories.foreach(directory => newFiles.append(fs.listStatus(directory.getPath, filter) : _*))
val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
logInfo("Finding new files took " + timeTaken + " ms")
logDebug("# cached file times = " + fileToModTime.size)
if (timeTaken > slideDuration.milliseconds) {
logWarning(
"Time taken to find new files exceeds the batch size. " +
"Consider increasing the batch size or reducing the number of " +
"files in the monitored directory."
)
}
newFiles.map(_.getPath.toString).toArray
} catch {
case e: Exception =>
logWarning("Error finding new files", e)
reset()
Array.empty
}
}
This will check for files in all first-degree subfolders; you can adjust it to use the batch timestamp in order to access the relevant "subdirectories".
I created the CustomFileInputDStream as I mentioned and activated it by calling:
new CustomFileInputDStream[LongWritable, Text, TextInputFormat](streamingContext, d).map(_._2.toString)
It seems to behave as expected.
When I write a solution like this, I must add some points for consideration:
You are breaking Spark's encapsulation and creating a custom class that you will have to maintain on your own as time passes.
I believe a solution like this is a last resort. If your use case can be implemented in a different way, it is usually better to avoid a solution like this.
If you have a lot of "subdirectories" on S3 and check each one of them, it will cost you.
It would be very interesting to understand whether Databricks doesn't support nested directories just because of the possible performance penalty, or whether there is a deeper reason I haven't thought about.
We had the same problem. We joined the subfolder names with commas.
List<String> paths = new ArrayList<>();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy/MM/dd");
try {
Date start = sdf.parse("2015/02/01");
Date end = sdf.parse("2015/04/01");
Calendar calendar = Calendar.getInstance();
calendar.setTime(start);
while (calendar.getTime().before(end)) {
paths.add("s3n://mybucket/" + sdf.format(calendar.getTime()));
calendar.add(Calendar.DATE, 1);
}
} catch (ParseException e) {
e.printStackTrace();
}
String joinedPaths = StringUtils.join(",", paths.toArray(new String[paths.size()]));
val input = ssc.textFileStream(joinedPaths);
I hope this solves your problem.
I'm trying to make a system which backs up and restores points for a game server, so it can safely restart without losing anything.
I have made a script to do just this and the actual backing up part works fine, but the restore part does not.
This is the script that runs if 'Backup(read)' is used (Backup(write) works perfectly as it is designed to do):
if (source and read) then
System.LogAlways("[System] Restoring serverdata from file 'backup.CHK'");
for line in source:lines() do
Backup = {};
Backup.Date = (Date or line:match("File Last Modified: (.-)"));
Backup.Time = (Time or line:match("time: (.-)"));
US = tonumber((US or line:match("us: (.-)")));
NK = tonumber((NK or line:match("nk: (.-)")));
local params = {class = "Player";
position = {x = 1, y = 1, z = -1000};
Respawn = { bRespawn = 0; nTimer =0; bUnique = 1; };
bUsable = 0;
orientation = {0, 90, 135};
name = "BackupEntity"; };
local ent = System.SpawnEntity(params);
g_gameRules.game:SetTeam(1, ent.id);
g_gameRules.game:SetSynchedEntityValue(playerId, 100, (NK/3));
g_gameRules.game:SetTeam(2, ent.id);
g_gameRules.game:SetSynchedEntityValue(playerId, 100, (US/3));
System.RemoveEntity(params);
end
source:close();
return;
end
I'm not sure what I'm doing wrong, and most sites that I have looked at don't help that much. The problem is that it's not reading any values from the file.
Any help will be appreciated :).
Edit:
The reason that we have to divide the score by 3 is that the server multiplies all scores by 3. If we did not divide by 3, the score would always be 3 times larger on each restore.
Example contents of the backup.CHK file:
The server is dependent on this file, and writes to it every hour. Please do not edit.
File Last Modified: 11/07/2013
This file was generated by the servers' autobackup system.
--------------------------
time: 22:51
us: 453445
nk: 454567
A couple of ideas of what might be causing the problem:
Use of (.-) lazy matching, which matches the shortest pattern possible -- this can include an empty string. Usually, you want to make the pattern as specific as possible while still matching the required possible inputs, e.g. (%d+) looks like an appropriate fit for us and nk.
The for line in source:lines() do loop reads one line at a time. That necessarily means not all the variables are going to be set inside the loop, yet everything from local params down uses those variables as if they were. It seems to me that section of code shouldn't even be in the loop.
Lastly, have you considered saving the backup file as just another Lua file? Doing so means you can let Lua do the heavy lifting for you, and you won't have to bother parsing it yourself. That also minimizes the risk of error.
I've written a simple Groovy script (below) to set the values of four of the ID3v1 and ID3v2 tag fields in mp3 files using the JAudioTagger library. The script successfully makes the changes, but it also deletes the first 5 to 10 seconds of some of the files; other files are unaffected. It's not a big problem, but if anyone knows a simple fix, I would be grateful. All the files are from the same source and all have v1 and v2 tags; I can find no obvious difference in the source files to explain it.
import org.jaudiotagger.*
java.util.logging.Logger.getLogger("org.jaudiotagger").setLevel(java.util.logging.Level.OFF)
Integer trackNum = 0
Integer totalFiles = 0
Integer invalidFiles = 0
validMP3File = true
def dir = new File(/D:\Users\Jeremy\Music\Speech Radio\Unlistened\Z Temp Files to MP3 Tagged/)
dir.eachFile({curFile ->
totalFiles ++
validMP3File = true // reset for each file, otherwise one unreadable file marks every later file invalid
try {
mp3File = org.jaudiotagger.audio.AudioFileIO.read(curFile)
} catch (org.jaudiotagger.audio.exceptions.CannotReadException e) {
validMP3File = false
invalidFiles ++
}
// Get the file name excluding the extension
baseFilename = org.jaudiotagger.audio.AudioFile.getBaseFilename(curFile)
// Check that it is an MP3 file
if (validMP3File) {
if (mp3File.getAudioHeader().getEncodingType() != 'mp3') {
validMP3File = false
invalidFiles ++
}
}
if (validMP3File) {
trackNum ++
if (mp3File.hasID3v1Tag()) {
curTagv1 = mp3File.getID3v1Tag()
} else {
curTagv1 = new org.jaudiotagger.tag.id3.ID3v1Tag()
}
if (mp3File.hasID3v2Tag()) {
curTagv2 = mp3File.getID3v2TagAsv24()
} else {
curTagv2 = new org.jaudiotagger.tag.id3.ID3v23Tag()
}
curTagv1.setField(org.jaudiotagger.tag.FieldKey.TITLE, baseFilename)
curTagv2.setField(org.jaudiotagger.tag.FieldKey.TITLE, baseFilename)
curTagv1.setField(org.jaudiotagger.tag.FieldKey.ARTIST, "BBC Radio")
curTagv2.setField(org.jaudiotagger.tag.FieldKey.ARTIST, "BBC Radio")
curTagv1.setField(org.jaudiotagger.tag.FieldKey.ALBUM, "BBC Radio - 20130205")
curTagv2.setField(org.jaudiotagger.tag.FieldKey.ALBUM, "BBC Radio - 20130205")
curTagv1.setField(org.jaudiotagger.tag.FieldKey.TRACK, trackNum.toString())
curTagv2.setField(org.jaudiotagger.tag.FieldKey.TRACK, trackNum.toString())
mp3File.setID3v1Tag(curTagv1)
mp3File.setID3v2Tag(curTagv2)
mp3File.save()
}
})
println """$trackNum tracks created from $totalFiles files with $invalidFiles invalid files"""
I'm still investigating and it appears that there is no problem with JAudioTagger. Before setting the tags, I use Total Recorder to reduce the quality of the download from 128 kbps, 44,100 Hz to 56 kbps, 22,050 Hz. This reduces the file size to less than half, and the quality is fine for speech radio.
If I run my script on the original files, none of the audio track is deleted. The deletion of the first part of the audio track only occurs with the files that have been processed by Total Recorder.
Looking at the JAudioTagger logging for these files, there does appear to be a problem with the header:
Checking further because the ID3 Tag ends at 0x23f9 but the mp3 audio doesnt start until 0x7a77
Confirmed audio starts at 0x7a77 whether searching from start or from end of ID3 tag
This check is not performed for files that have not been processed by Total Recorder.
The log of the header read operation also shows (for a 27 minute track):
trackLength:06:52
It looks as though I shall have to find a new MP3 file editor!
Instead of
mp3File.save()
could you try:
mp3File.commit()
No idea if it will help, but that seems to be the documented method?