Rasterio for reading large TIFF files - python-3.x

I am trying to convert a large TIFF file into GeoJSON. The problem I am facing is that since my TIFF file is several gigabytes (3.5 GB), the program runs out of memory.
Is there any way to process this in chunks and write a chunk to the output file every time?
import json
import rasterio
import rasterio.features
import rasterio.warp

with rasterio.open('MyBigFile.tif') as dataset:
    # Read the dataset's valid data mask as an ndarray.
    mask = dataset.dataset_mask()
    # Extract feature shapes and values from the array.
    for geom, val in rasterio.features.shapes(mask, transform=dataset.transform):
        # Transform shapes from the dataset's own coordinate
        # reference system to CRS84 (EPSG:4326).
        geom = rasterio.warp.transform_geom(
            dataset.crs, 'EPSG:4326', geom, precision=6)
        # Append each geometry as JSON (a dict cannot be written directly).
        with open('output.json', 'a') as f:
            f.write(json.dumps(geom))
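One way to keep memory bounded is to work window by window instead of on the whole raster. Below is a minimal sketch, assuming the goal from the code above (polygonizing the valid-data mask): it reads the mask one internal block at a time with read_masks, shifts the transform per window with window_transform, and streams each feature to the output file as it is produced. Note that shapes crossing block boundaries will be split, since each block is polygonized independently.

import json
import rasterio
import rasterio.features
import rasterio.warp

with rasterio.open('MyBigFile.tif') as dataset, open('output.json', 'w') as f:
    f.write('{"type": "FeatureCollection", "features": [\n')
    first = True
    # Iterate over the raster's internal blocks so only one chunk
    # of the mask is in memory at a time.
    for _, window in dataset.block_windows(1):
        mask = dataset.read_masks(1, window=window)
        # Use the window's own transform so coordinates stay correct.
        transform = dataset.window_transform(window)
        for geom, val in rasterio.features.shapes(mask, transform=transform):
            geom = rasterio.warp.transform_geom(
                dataset.crs, 'EPSG:4326', geom, precision=6)
            feature = {'type': 'Feature', 'properties': {'value': int(val)},
                       'geometry': geom}
            f.write(('' if first else ',\n') + json.dumps(feature))
            first = False
    f.write('\n]}')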

Related

Converting netCDF4 data into .CSV format in Python 3.8 does not output the desired file

I have a single netCDF4 file pertaining to historical average sea surface temperatures around a selected area of the Caribbean. The netCDF4 data spans a wide time scale (from 1850 to 2012) and I am trying to extract a snippet of this data (the decade from 1990 to 2000) and convert it to .CSV format.
So far, my code runs without any errors, but it outputs a Word document file (not a .CSV file) in my local .py3 directory. Within this file there is only a single line of text, ",temperature", i.e. no actual data is present.
I am new to Python and computer programming in general, with nearly no previous experience, so I have no clue what is wrong with my code. Any help is vastly appreciated.
Here is my code:
# Converting netCDF4 files to .csv
import netCDF4
from netCDF4 import Dataset
import numpy as np
import pandas as pd
# Reading in file
data = Dataset('/Users/terrysoennecken/Documents/Fish movement model data/Ocean Temp/adaptor.esgf_wps.retrieve-1628265122.7610238-22900-3-fb77a99c-36ca-48e6-ac6a-2895a39eddd0 2/historical.nc', 'r')
print(data)
tb = data.variables['time_bnds']
t = data.variables['time']
sst = data.variables['tos'] # Sea surface temp
lat = data.variables['latitude']
lon = data.variables['longitude']
# Storing relevant data into variables
time_data = data.variables['time'][:]
lon_data = data.variables['longitude'][:]
lat_data = data.variables['latitude'][:]
temp_data = data.variables['tos'][:]
# Creating a dataframe
start_date = time_data[1971]
end_date = time_data[1825]
time_scale = pd.date_range(start = start_date, end = end_date)
df = pd.DataFrame(1, columns = ['Temperature'], index = time_scale)
# Final dataframe saved to a .csv file
df.to_csv('Sea Surface Temperature around the Bahamas for the decade, 1990 - 2000')
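Two fixes address the symptoms: the DataFrame is filled with the constant 1 rather than the temperature values, and the filename passed to to_csv has no .csv extension, so the OS does not recognize the output as a CSV file. Below is a minimal sketch of the conversion (path shortened), assuming the variable names from the question ('time', 'tos', 'latitude', 'longitude') and that tos is laid out as (time, lat, lon); num2date builds real dates from the units and calendar attributes stored in the file.

from netCDF4 import Dataset, num2date
import numpy as np
import pandas as pd

data = Dataset('historical.nc', 'r')

time_var = data.variables['time']
sst = data.variables['tos'][:]      # assumed shape: (time, lat, lon)

# Convert the numeric time axis to datetimes using the file's metadata.
dates = num2date(time_var[:], units=time_var.units,
                 calendar=time_var.calendar)

# Collapse the spatial dimensions to one mean temperature per time step.
mean_sst = sst.reshape(sst.shape[0], -1).mean(axis=1)

df = pd.DataFrame({'Temperature': mean_sst},
                  index=pd.Index(dates, name='Date'))

# Keep only the decade of interest.
df = df[[(1990 <= d.year <= 2000) for d in df.index]]

# The .csv extension is what makes the output open as a CSV file.
df.to_csv('sst_bahamas_1990_2000.csv')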

Decompress nifti medical image in gz format using python

I want to decompress a batch of nii.gz files in Python so that they can be processed in sitk later on. When I decompress a single file manually by right-clicking it and choosing 'Extract..', the file is correctly interpreted by sitk (I do sitk.ReadImage(unzipped)). But when I try to decompress it in Python using the following code:
with gzip.open(segmentation_zipped, "rb") as f:
    bindata = f.read()
segmentation_unzipped = os.path.join(segmentation_zipped.replace(".gz", ""))
with gzip.open(segmentation_unzipped, "wb") as f:
    f.write(bindata)
I get an error when sitk tries to read the file:
RuntimeError: Exception thrown in SimpleITK ReadImage: C:\d\VS14-Win64-pkg\SimpleITK\Code\IO\src\sitkImageReaderBase.cxx:82:
sitk::ERROR: Unable to determine ImageIO reader for "E:\BraTS19_2013_10_1_seg.nii"
Also when trying to do it a little differently:
input = gzip.GzipFile(segmentation_zipped, 'rb')
s = input.read()
input.close()
segmentation_unzipped = os.path.join(segmentation_zipped.replace(".gz", ""))
output = open(segmentation_unzipped, 'wb')
output.write(s)
output.close()
I get:
RuntimeError: Exception thrown in SimpleITK ReadImage: C:\d\VS14-Win64-pkg\SimpleITK-build\ITK\Modules\IO\PNG\src\itkPNGImageIO.cxx:101:
itk::ERROR: PNGImageIO(0000022E3AF2C0C0): PNGImageIO failed to read header for file:
Reason: fread read only 0 instead of 8
Can anyone help?
There is no need to unzip the Nifti images; libraries such as Nibabel can handle them without decompression.
#==================================
import nibabel as nib
import numpy as np
import matplotlib.pyplot as plt
#==================================
# load image (4D) [X, Y, Z_slice, time]
nii_img = nib.load('path_to_file.nii.gz')
nii_data = nii_img.get_fdata()

# Derive the grid size from the data instead of hard-coding it.
number_of_slices = nii_data.shape[2]
number_of_frames = nii_data.shape[3]

fig, ax = plt.subplots(number_of_frames, number_of_slices, constrained_layout=True)
fig.canvas.manager.set_window_title('4D Nifti Image')
fig.suptitle('4D Nifti: {} slices, {} time frames'.format(number_of_slices, number_of_frames), fontsize=16)
#-------------------------------------------------------------------------------
mng = plt.get_current_fig_manager()
mng.full_screen_toggle()

for z in range(number_of_slices):
    for frame in range(number_of_frames):  # if your data is 3D, remove this loop
        ax[frame, z].imshow(nii_data[:, :, z, frame], cmap='gray', interpolation=None)
        ax[frame, z].set_title("layer {} / frame {}".format(z, frame))
        ax[frame, z].axis('off')
plt.show()
Or you can use SimpleITK as follows:
import SimpleITK as sitk
import numpy as np
# A path to a T1-weighted brain .nii image (SimpleITK also reads .nii.gz directly):
t1_fn = 'path_to_file.nii'
# Read the .nii image containing the volume with SimpleITK:
sitk_t1 = sitk.ReadImage(t1_fn)
# and access the numpy array:
t1 = sitk.GetArrayFromImage(sitk_t1)
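If you do need the decompressed file on disk, note that the first snippet in the question opens the output with gzip.open as well, so the data is re-compressed on write: the resulting .nii file is still a gzip stream, which is why sitk cannot identify it. A minimal sketch of the fix, using a plain open for the output (shutil.copyfileobj streams the data instead of holding it all in memory):

import gzip
import shutil

segmentation_zipped = r"E:\BraTS19_2013_10_1_seg.nii.gz"  # example path from the question
segmentation_unzipped = segmentation_zipped.replace(".gz", "")

with gzip.open(segmentation_zipped, "rb") as f_in:
    with open(segmentation_unzipped, "wb") as f_out:  # plain open, not gzip.open
        shutil.copyfileobj(f_in, f_out)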

Problem with processing a large (>1 GB) CSV file

I have a large CSV file and I have to sort it and write the sorted data to another CSV file. The CSV file has 10 columns. Here is my code for sorting.
data = [x.strip().split(',') for x in open(filename + '.csv', 'r').readlines() if x[0] != 'I']
data = sorted(data, key=lambda x: (x[6], x[7], x[8], int(x[2])))
with open(filename + '_sorted.csv', 'w') as fout:
    for x in data:
        print(','.join(x), file=fout)
It works fine with files below 500 MB but cannot process files larger than 1 GB. Is there any way to make this process memory-efficient? I am running this code on Google Colab.
Here is a link to a blog about using pandas for large datasets. In the examples from the link, they analyze datasets around 1 GB in size.
Simply type the following to import your CSV data into Python.
import pandas as pd
gl = pd.read_csv('game_logs.csv', sep = ',')
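If pandas alone is not enough, a classic external merge sort keeps memory bounded regardless of file size: sort fixed-size chunks, spill each sorted run to a temporary file, then merge the runs. A sketch, assuming the same sort key and header filter as the question's code (the names external_sort and spill are illustrative helpers, not library functions):

import contextlib
import csv
import heapq
import os
import tempfile

def sort_key(row):
    # Same key as the question: columns 6, 7, 8 as strings, column 2 as int.
    return (row[6], row[7], row[8], int(row[2]))

def spill(rows):
    # Write one sorted run to a temporary file and return its path.
    fd, path = tempfile.mkstemp(suffix='.csv')
    with os.fdopen(fd, 'w', newline='') as f:
        csv.writer(f).writerows(rows)
    return path

def external_sort(src, dst, chunk_rows=500_000):
    runs, rows = [], []
    with open(src, newline='') as f:
        for row in csv.reader(f):
            if row and not row[0].startswith('I'):  # same filter as the question
                rows.append(row)
            if len(rows) >= chunk_rows:
                runs.append(spill(sorted(rows, key=sort_key)))
                rows = []
    if rows:
        runs.append(spill(sorted(rows, key=sort_key)))
    # Merge the sorted runs; heapq.merge holds only one row per run in memory.
    with open(dst, 'w', newline='') as out, contextlib.ExitStack() as stack:
        readers = [csv.reader(stack.enter_context(open(r, newline='')))
                   for r in runs]
        csv.writer(out).writerows(heapq.merge(*readers, key=sort_key))
    for r in runs:
        os.remove(r)

external_sort(filename + '.csv', filename + '_sorted.csv')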

How to iterate through audio files when converting them into MFCCs

I am a beginner converting audio files into MFCCs. I have done it for one file but don't know how to iterate over the whole dataset. I have multiple folders inside a Training folder; one of them is 001(0), from which one wav file is converted below. I want to convert the wav files in all of the folders under Training.
import os
import numpy as np
import matplotlib.pyplot as plt
from glob import glob
import scipy.io.wavfile as wav
from python_speech_features import mfcc, logfbank
# Read the input audio file
(rate,sig) = wav.read('Downloads/DataVoices/Training/001(0)/001000.wav')
# Take the first 10,000 samples for analysis
sig = sig[:10000]
features_mfcc = mfcc(sig,rate)
# Print the parameters for MFCC
print('\nMFCC:\nNumber of windows =', features_mfcc.shape[0])
print('Length of each feature =', features_mfcc.shape[1])
# Plot the features
features_mfcc = features_mfcc.T
plt.matshow(features_mfcc)
plt.title('MFCC')
# Extract the Filter Bank features
features_fb = logfbank(sig, rate)
# Print the parameters for Filter Bank
print('\nFilter bank:\nNumber of windows =', features_fb.shape[0])
print('Length of each feature =', features_fb.shape[1])
# Plot the features
features_fb = features_fb.T
plt.matshow(features_fb)
plt.title('Filter bank')
plt.show()
You can use glob recursively with wildcards to find all of the wav files.
import glob

for f in glob.glob(r'Downloads/DataVoices/Training/**/*.wav', recursive=True):
    (rate, sig) = wav.read(f)
    # Rest of your code
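Building on that, here is a minimal sketch that collects the MFCCs for every file, assuming the same parameters as the single-file code above; the parent folder name (e.g. 001(0)) can serve as the label:

import os
import glob
import scipy.io.wavfile as wav
from python_speech_features import mfcc

features, labels = [], []
for path in glob.glob(r'Downloads/DataVoices/Training/**/*.wav', recursive=True):
    rate, sig = wav.read(path)
    sig = sig[:10000]  # first 10,000 samples, as in the question
    features.append(mfcc(sig, rate))
    # The parent folder name, e.g. '001(0)', serves as the label.
    labels.append(os.path.basename(os.path.dirname(path)))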

Trying to generate a Keras model with my own data instead of cifar10

I have followed this example:
https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
and had an issue with the following line (line #51):
((trainX, trainY), (testX, testY)) = cifar10.load_data()
as I would like to train it on my own data.
Is there any simple way to generate this kind of output without digging deep into cifar10's implementation?
I am pretty sure this is something people have already done, but I cannot find a sample/tutorial/example.
Thanks.
Assume your images are in .jpg format, your labels are in a CSV file called label.csv, and you have separated them into two folders, train and test.
Then you can do the following to get x_train:
import cv2    # library for reading images
import numpy as np
import glob   # library for listing files in a folder

x_train = []
for file in glob.glob("train/*.jpg"):
    im = cv2.imread(file)  # read each image from the folder
    x_train.append(im)
x_train = np.array(x_train)
And you can do the following to get y_train:
import csv

y_train = []
with open('train/label.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        y_train.append([int(row[0])])  # convert the string to int (otherwise the csv data is read as strings)
y_train = np.array(y_train)
You can do the same for your test folder; just change the names of the parameters and arguments.
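Putting both snippets together, you can mimic cifar10.load_data() with a small helper so line #51 of the tutorial keeps working unchanged. This is a sketch under two assumptions: each folder contains its own label.csv, and its rows are in the same order as the sorted image filenames; load_folder and load_data are illustrative names, not library functions.

import cv2
import csv
import glob
import numpy as np

def load_folder(folder):
    # sorted() keeps the image order aligned with the label order,
    # assuming label.csv rows follow the sorted filenames.
    # np.array also assumes all images share the same dimensions.
    x = [cv2.imread(f) for f in sorted(glob.glob(folder + "/*.jpg"))]
    y = []
    with open(folder + '/label.csv', 'r') as csvfile:
        for row in csv.reader(csvfile):
            y.append([int(row[0])])
    return np.array(x), np.array(y)

def load_data():
    return load_folder('train'), load_folder('test')

# Drop-in replacement for the tutorial's line #51:
((trainX, trainY), (testX, testY)) = load_data()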
