Creating a map with basemap, filling countries - python-3.x

I'm currently working on the final project for my coding class (my first coding class, so I'm still an amateur).
My idea is for the program to search the headlines of every newspaper in the world for a specific word (using bs4) and then build a dictionary with the average number of mentions per country, taking into account the number of newspapers in each country. Afterwards, and this is the part where I'm stuck, I want to put this on a map.
The whole program already works properly up to the point where I have a CSV of the following form:
'Country','Average'
'Afghanistan',10
'Albania',5
'Algeria',0
'Andorra',2
'Antigua and Barbuda',7
'Argentina',0
'Armenia',4
Now I want to create a world map where the higher the number, the redder (or any other color) the whole polygon of the country. So far I've found many examples that work well for placing points on a map, but I haven't found one that joins the CSV data presented above to the country shapes and then fills each country accordingly. Below is the part of the code that currently creates the world map:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.collections import PatchCollection
from matplotlib.colors import Normalize
from matplotlib.patches import Polygon
from mpl_toolkits.basemap import Basemap
# Now we proceed with the creation of the map
fig, ax = plt.subplots(figsize=(15,10)) # We define the size of the map
m = Basemap(resolution='c', # c, l, i, h, f or None
            projection='merc', # Mercator projection
            lat_0=24.20, lon_0=-6.67, # The center of the map, so that the whole world is shown without splitting Asia
            llcrnrlon=-180, llcrnrlat=-85, urcrnrlon=180, urcrnrlat=85) # The coordinates of the whole world
m.drawmapboundary(fill_color='#46bcec') # We choose a color for the boundary of the map
m.fillcontinents(color='#f2f2f2',lake_color='#46bcec') # We choose a color for the land and one for the lakes
m.drawcoastlines() # We choose to draw the lines of the map
m.readshapefile('Final project\\vincent_map_data-master\\ne_110m_admin_0_countries\\ne_110m_admin_0_countries', 'areas') # We import the shape file of the whole world
df_poly = pd.DataFrame({ # We define the polygon structure
    'shapes': [Polygon(np.array(shape), closed=True) for shape in m.areas],
    'area': [area['name'] for area in m.areas_info]
})
cmap = plt.get_cmap('Oranges')
pc = PatchCollection(df_poly.shapes, zorder=2)
norm = Normalize()
mapper = matplotlib.cm.ScalarMappable(norm=norm, cmap=cmap)
# We show the map
plt.show()
I opened the shapefile of the countries, and the countries are identified by the variable "sovereignty". There might be some nonsensical things in my code, since I've pieced it together from many places. Sorry about that.
If someone could help me out, I would really appreciate it.
Thanks
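One way to think about the missing step is as a lookup: for each polygon read from the shapefile, find that country's value in the CSV and turn it into a color. Below is a minimal sketch of that lookup, assuming the CSV really uses single quotes as shown and that the values are non-negative. The helper names load_averages and country_colors are hypothetical, and cmap can be any callable that maps a number in [0, 1] to a color, such as the object returned by plt.get_cmap('Oranges').

```python
import csv
import io


def load_averages(csv_text):
    """Parse the 'Country','Average' CSV shown in the question into {country: float}."""
    reader = csv.DictReader(io.StringIO(csv_text), quotechar="'")
    return {row["Country"]: float(row["Average"]) for row in reader}


def country_colors(area_names, averages, cmap, missing=(0.95, 0.95, 0.95, 1.0)):
    """Return one color per polygon, in the same order as area_names.

    Countries that are absent from the CSV get the `missing` color.
    """
    top = max(averages.values()) or 1.0  # avoid dividing by zero if every value is 0
    colors = []
    for name in area_names:
        if name in averages:
            colors.append(cmap(averages[name] / top))  # scale to [0, 1] for the colormap
        else:
            colors.append(missing)
    return colors


if __name__ == "__main__":
    text = "'Country','Average'\n'Afghanistan',10\n'Albania',5\n'Algeria',0\n"
    averages = load_averages(text)
    # A stand-in colormap: more mentions means more red.
    print(country_colors(["Albania", "Afghanistan"], averages, lambda t: (t, 0.0, 0.0, 1.0)))
```

With the code from the question, the resulting list could be passed to pc.set_facecolor(...) followed by ax.add_collection(pc), using df_poly['area'] as area_names. The names in the shapefile must match the names in the CSV exactly for the lookup to succeed, so it is worth printing the ones that fall through to the missing color.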

Related

combine overlapping labelled objects and modify label values

I have a Z-stack of 2D confocal microscopy images (2D slices) and I want to segment cells. The Z-stack of 2D images is effectively 3D data. The same cells appear in multiple slices along the Z-axis. I am interested in cell shape in XY, so I want to preserve the largest cell area across the Z-axis slices. I thought I would combine consecutive 2D slices after converting them to labelled binary images, but I am having a few issues and need some help to proceed.
I have two images, img_a and img_b. I first converted them to binary images using Otsu thresholding, then applied some morphological operations, and then used cv2.connectedComponentsWithStats() to obtain labelled objects. After labelling the images, I combined them using cv2.bitwise_or(), but this messes up the labels. You can see this in the attached processed image (cells highlighted by red circles): overlapping cells end up with multiple labels. However, I want to assign one unique label to every combined overlapping object.
What I want in the end is that, when I combine two labelled images, each combined overlapping object gets one single label (a unique value) and keeps the largest cell area from both images. Does anyone know how to do this?
Here is the code:
from matplotlib import pyplot as plt
from skimage import io, color, measure
from skimage.util import img_as_ubyte
from skimage.segmentation import clear_border
import cv2
import numpy as np
cells_a=img_a[:,:,1] # get the green channel
#Threshold image to binary using OTSU.
ret_a, thresh_a = cv2.threshold(cells_a, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)
# Morphological operations to remove small noise - opening
kernel = np.ones((3,3),np.uint8)
opening_a = cv2.morphologyEx(thresh_a,cv2.MORPH_OPEN,kernel, iterations = 2)
opening_a = clear_border(opening_a) # Remove objects touching the image border
numlabels_a, labels_a, stats_a, centroids_a = cv2.connectedComponentsWithStats(opening_a)
img_a1 = color.label2rgb(labels_a, bg_label=0)
## now do the same with image_b
cells_b=img_b[:,:,1] # get the green channel
#Threshold image to binary using OTSU.
ret_b, thresh_b = cv2.threshold(cells_b, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)
# Morphological operations to remove small noise - opening
opening_b = cv2.morphologyEx(thresh_b,cv2.MORPH_OPEN,kernel, iterations = 2)
opening_b = clear_border(opening_b) # Remove objects touching the image border
numlabels_b, labels_b, stats_b, centroids_b = cv2.connectedComponentsWithStats(opening_b)
img_b1 = color.label2rgb(labels_b, bg_label=0)
## Now combine the two images
combined = cv2.bitwise_or(labels_a, labels_b) ## combine both labelled images to get the maximum area per cell
combined_img = color.label2rgb(combined, bg_label=0)
plt.imshow(combined_img)
Images can be found here:
Based on the comments from Christoph Rackwitz and beaker, I started to look around for 3D connected-components labelling. I found a Python library that can handle this, installed it, and gave it a try. It seems to work pretty well: it assigns labels in each slice and keeps the same label for the same cell across slices. This is exactly what I wanted.
Here is the link to the library that I used to label objects in 3D:
https://pypi.org/project/connected-components-3d/
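For the two-image case in the question, note that cv2.bitwise_or() works on the bits of the label integers, so a pixel labelled 1 in one image and 2 in the other comes out as 3, which is where the extra labels come from. A minimal sketch of one alternative that stays in 2D, assuming both inputs are integer label images of the same shape with 0 as background: objects that share at least one pixel are joined with a small union-find, and the result covers the union of both footprints. The function name merge_label_images is hypothetical and not part of OpenCV or scikit-image.

```python
import numpy as np


def merge_label_images(labels_a, labels_b):
    """Merge two label images so that overlapping objects share one label."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    n_a = int(labels_a.max())
    n_b = int(labels_b.max())

    # Provisional ids: objects of a keep 1..n_a, objects of b become n_a+1..n_a+n_b.
    parent = list(range(n_a + n_b + 1))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Union every pair of objects that share at least one pixel.
    overlap = (labels_a > 0) & (labels_b > 0)
    for la, lb in set(zip(labels_a[overlap].tolist(), labels_b[overlap].tolist())):
        ra, rb = find(la), find(lb + n_a)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Lookup table from provisional id to consecutive final label (0 stays background).
    roots = [find(i) for i in range(n_a + n_b + 1)]
    final = {root: new for new, root in enumerate(sorted(set(roots)))}
    lut = np.array([final[r] for r in roots])

    shifted_b = np.where(labels_b > 0, labels_b + n_a, 0)
    # Where both images have an object, both ids map to the same final label.
    return np.where(labels_a > 0, lut[labels_a], lut[shifted_b])


if __name__ == "__main__":
    a = [[1, 1, 0, 0], [0, 0, 0, 2]]
    b = [[0, 1, 1, 0], [2, 0, 0, 0]]
    print(merge_label_images(a, b))
```

With the variables from the question this would be called as merge_label_images(labels_a, labels_b). For a whole stack, the 3D labelling approach described above is probably the simpler route.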

OSMNx : get coordinates of nodes/corners/edges of polygons/buildings

I am trying to retrieve the coordinates of all nodes/corners/edges of each commercial building in a list. E.g. for the Aldi supermarket in Macclesfield (UK), the OpenStreetMap UI shows 10 nodes (all the corners/edges of the supermarket), but I can only retrieve 2 of those 10 nodes from OSMnx. I need access to the complete list of nodes, but the results are truncated to 2 of the 10 nodes in this case. I am using the code below:
import osmnx as ox
test = ox.geocode_to_gdf('aldi, Macclesfield, Cheshire, GB')
ax = ox.project_gdf(test).plot()
test.geometry
or
gdf = ox.geometries_from_place('Grosvenor, Macclesfield, Cheshire, GB', tags)
gdf.geometry
Both return just two coordinates and truncate the other results that are available in the OpenStreetMap UI (you can see it in the first column of the attached image: geometry > POLYGON > only two coordinates, with the other results truncated...). I would appreciate some help on this, thanks in advance.
It's hard to guess what you're doing here because you didn't provide a reproducible example (e.g., tags is undefined). But I'll try to guess what you're going for.
I am trying to retrieve the coordinates of all nodes/corners/edges of commercial buildings
Here I retrieve all the tagged commercial building footprints in Macclesfield, then extract the first one's polygon coordinates. You could instead filter these by other attribute values as you see fit if you only want certain kinds of buildings. Proper usage of OSMnx's geometries module is described in the documentation.
import osmnx as ox
# get the building footprints in Macclesfield
place = 'Macclesfield, Cheshire, England, UK'
tags = {'building': 'commercial'}
gdf = ox.geometries_from_place(place, tags)
# how many did we get?
print(gdf.shape) # (57, 10)
# extract the coordinates for the first building's footprint
gdf.iloc[0]['geometry'].exterior.coords
Alternatively, if you want a specific building's footprint, you can look up its OSM ID and tell OSMnx to geocode that value:
gdf = ox.geocode_to_gdf('W251154408', by_osmid=True)
polygon = gdf.iloc[0]['geometry']
polygon.exterior.coords
gdf = ox.geocode_to_gdf('W352332709', by_osmid=True)
polygon = gdf.iloc[0]['geometry']
polygon.exterior.coords
list(polygon.exterior.coords)

Pandas dropped row showing in plot

I am trying to make a heatmap.
I get my data from a pipeline that classifies some rows as noisy, so I decided to make one plot that includes them and one plot without them.
The problem: in the plot without the noisy rows, blank lines appear (as many blank lines as rows removed).
Roughly, the code looks like this (I can expand parts if required; I am trying to keep it short).
If needed, I can provide a link to similar, publicly available data.
data_frame = load_df_fromh5(file) # load a data frame from the hdf5 output
noisy = [..] # a list of flags indicating which rows are noisy (1 = noisy)
# I believe the problem is here:
noisy = [i for (i, v) in enumerate(noisy) if v == 1] # make a list of the indices to remove
# drop the corresponding indices
df_cells_noisy = df_cells[~df_cells.index.isin(noisy)].dropna(how="any")
# I tried an alternative method:
not_noisy = [0 if e == 1 else 1 for e in noisy]
df = df[np.array(not_noisy, dtype=bool)]
# then I did a clustering using scipy
Z = hierarchy.linkage(df, method="average", metric="canberra", optimal_ordering=True)
df = df.reindex(hierarchy.leaves_list(Z))
# then I plot using the df variable
# quite a long function; I believe the problem is upstream.
plot(df)
The plotting function is quite long, but I believe it works well, because the problem only shows up with the data frame without noisy rows.
I believe pandas somehow keeps information about the deleted rows and that they are plotted as blank lines. Any help is welcome.
Context:
These are single-cell copy-number anomaly data (abnormalities in the number of copies of a genomic segment).
Rows represent individuals (here, individual cells); columns give the copy number for each genomic interval (2 for a normal segment, except on the sex chromosomes).
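One possible cause, offered as a guess since the plotting function is not shown: after rows are dropped, the data frame keeps its original index labels, while hierarchy.leaves_list(Z) returns positions from 0 to n-1. DataFrame.reindex() looks rows up by label, so any position that happens to equal the label of a dropped row comes back as a row of NaN, which a heatmap would draw as a blank line. A minimal sketch with made-up numbers, where leaf_order stands in for the output of hierarchy.leaves_list(Z):

```python
import pandas as pd

# Four cells, two genomic intervals; the second cell is flagged as noisy.
df = pd.DataFrame({"seg1": [2, 2, 3, 1], "seg2": [2, 4, 2, 2]})
noisy_flags = [0, 1, 0, 0]

keep = [flag == 0 for flag in noisy_flags]
clean = df.loc[keep]  # the index labels are now 0, 2, 3 (label 1 is gone)

# Stand-in for hierarchy.leaves_list(Z): positions in 0..len(clean)-1.
leaf_order = [2, 0, 1]

by_label = clean.reindex(leaf_order)   # label 1 no longer exists, so that row is all NaN
by_position = clean.iloc[leaf_order]   # positional lookup, no NaN rows

print(by_label)
print(by_position)
```

If this is what is happening, either clean.iloc[...] or calling reset_index(drop=True) on the filtered frame before clustering should remove the blank lines.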

Why is my notebook crashing when I run this for loop and what is the fix?

I have taken some code implementing a Kalman filter and am attempting to run it on each column of my data. What I would like to happen is:
The column data is fed into the filter
The filtered column data (xhat) is placed into another DataFrame (filtered)
The filtered column data (xhat) is used to produce a visual.
I have created a for loop to iterate through the column data, but when I run the cell, I crash the notebook. When it doesn't crash, I get this warning:
C:\Users\perso\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:45: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Thanks in advance for any help. I hope this question is detailed enough. I bombed on the last one.
'''A Python implementation of the example given in pages 11-15 of "An
Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
University of North Carolina at Chapel Hill, Department of Computer
Science, TR 95-041,
https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf'''
# by Andrew D. Straw
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# dataframe created to hold filtered data
filtered = pd.DataFrame()
# initial parameters
for column in data:
    n_iter = len(data.index) # number of iterations equal to the number of samples
    sz = (n_iter,) # size of array
    z = data[column] # observations
    Q = 1e-5 # process variance
    # allocate space for arrays
    xhat=np.zeros(sz) # a posteriori estimate of x
    P=np.zeros(sz) # a posteriori error estimate
    xhatminus=np.zeros(sz) # a priori estimate of x
    Pminus=np.zeros(sz) # a priori error estimate
    K=np.zeros(sz) # gain or blending factor
    R = 1.0**2 # estimate of measurement variance, change to see effect
    # initial guesses
    xhat[0] = z[0]
    P[0] = 1.0
    for k in range(1,n_iter):
        # time update
        xhatminus[k] = xhat[k-1]
        Pminus[k] = P[k-1]+Q
        # measurement update
        K[k] = Pminus[k]/( Pminus[k]+R )
        xhat[k] = xhatminus[k]+K[k]*(z[k]-xhatminus[k])
        P[k] = (1-K[k])*Pminus[k]
    # add new data to created dataframe (assign() returns a copy, so set the column directly)
    filtered[column] = xhat
    # create visualization of noise reduction
    plt.rcParams['figure.figsize'] = (10, 8)
    plt.figure()
    plt.plot(z,'k+',label='noisy measurements')
    plt.plot(xhat,'b-',label='a posteriori estimate')
    plt.legend()
    plt.title('Estimate vs. iteration step', fontweight='bold')
    plt.xlabel('column data')
    plt.ylabel('Measurement')
This seems like a pretty straightforward error. The warning indicates that you have opened more figures than the limit at which matplotlib issues a warning (a parameter you can change, but which is set to 20 by default). This is because in each iteration of your for loop, you create a new figure. Depending on the size of n_iter, you are opening potentially hundreds or thousands of figures. Each of these figures takes resources to generate and show, so you are putting a very large load on your system. As a result, it either runs very slowly or crashes altogether. In any case, the solution is to plot fewer figures.
I don't know exactly what you're plotting in your loop, but it seems like each iteration of your loop corresponds to one time step, and at each time step you'd like to plot the estimated and actual values. In this case, you need to define a figure and its options once, outside of the loop, rather than at each iteration. But a better way to do this is probably to generate all of the data you want to plot ahead of time, store it in an easy-to-plot data type like a list, and then plot it once at the end.
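A minimal sketch of that idea, assuming each column can be passed as a plain list of numbers: the filter from the question is wrapped in a function that only computes, and a second function creates a single figure with one subplot per column. The names kalman_1d and plot_columns are hypothetical.

```python
def kalman_1d(z, Q=1e-5, R=1.0):
    """Scalar Kalman filter from the question; returns the a posteriori estimates."""
    xhat = [float(z[0])]  # initial guess: the first observation
    P = 1.0               # initial error estimate
    for k in range(1, len(z)):
        # time update
        xhat_minus = xhat[-1]
        P_minus = P + Q
        # measurement update
        K = P_minus / (P_minus + R)
        xhat.append(xhat_minus + K * (z[k] - xhat_minus))
        P = (1 - K) * P_minus
    return xhat


def plot_columns(columns):
    """Draw all columns on ONE figure (one subplot per column) and return the figure."""
    import matplotlib.pyplot as plt  # imported here so the filter itself needs no plotting library

    fig, axes = plt.subplots(len(columns), 1, squeeze=False,
                             figsize=(10, 4 * len(columns)))
    for ax, (name, z) in zip(axes[:, 0], columns.items()):
        ax.plot(z, 'k+', label='noisy measurements')
        ax.plot(kalman_1d(z), 'b-', label='a posteriori estimate')
        ax.set_title(name)
        ax.legend()
    return fig


if __name__ == "__main__":
    print(kalman_1d([0.0, 1.0, 1.0, 1.0]))
```

With the DataFrame from the question, the input could be built as {c: data[c].tolist() for c in data}. If one figure per column is really wanted, calling plt.close(fig) after saving or showing each figure keeps the number of open figures down.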

Geospatial fixed radius cluster hunting in python

I want to take an input of millions of lat/long points (each with a numerical attribute) and then find all fixed-radius geospatial clusters where the sum of the attribute within the circle is above a defined threshold.
I started by using sklearn BallTree to sum the attribute within any defined circle, with the intention of then expanding this out to run across a grid or lattice of circles. The run time for one circle is around 0.01s, so this is fine for small lattices, but won't scale if I want to run 200m radius circles across the whole of the UK.
import numpy as np
import pandas
from sklearn.neighbors import BallTree
# example data (use 2m rows from postcode centroid file)
df = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000000)
# this will be our grid of points (or lattice); use points from the same file for the example
df2 = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000)
# reorder lat long columns for balltree input
columnTitles=["Y","X"]
df = df.reindex(columns=columnTitles)
df2 = df2.reindex(columns=columnTitles)
# assign new columns to existing dataframe. attribute will hold the data we want to sum over (set to 1 for now)
df['attribute'] = 1
df2['aggregation'] = 0
RADIANT_TO_KM_CONSTANT = 6367
class BallTreeIndex:
    def __init__(self, lat_longs):
        self.lat_longs = np.radians(lat_longs)
        self.ball_tree_index = BallTree(self.lat_longs, metric='haversine')
    def query_radius(self, query, radius):
        radius_km = radius/1000
        radius_radiant = radius_km / RADIANT_TO_KM_CONSTANT
        query = np.radians(np.array([query]))
        indices = self.ball_tree_index.query_radius(query, r=radius_radiant)
        return indices[0]
# index the base data
a = BallTreeIndex(df.iloc[:,0:2])
# begin to loop over the lattice to test performance
for i in range(0,100):
    b = df2.iloc[i,0:2]
    output = a.query_radius(b, 200)
    accumulation = sum(df.iloc[output, 2])
    df2.iloc[i,2] = accumulation
It feels as if the above code is really inefficient as I don't need to run the calculation across all circles on my lattice (as most will be well below my threshold - or will have no data points in at all).
Instead of this for loop, is there a better way of scaling this algorithm to give me the most dense circles?
I'm new to python, so any help would be massively appreciated!!
First, don't try to do this on a sphere! GB is small and we have a well-defined map projection that will work. So use the oseast1m and osnorth1m columns as X and Y. They are in metres, so there is no need to convert (roughly) to degrees and use haversine. That should help.
Next, add a spatial index to speed up lookups.
If you need more speed, there are various tricks, like loading a 2R strip across the country into memory and then running your circles across that strip, then moving down a grid step and updating that strip (checking Y values against a fixed value is quick, especially if you store the data sorted on Y and then X). If you need more speed still, look at any of the papers that Stan Openshaw (and sometimes I) wrote about parallelising the GAM. There are examples of implementing GAM in Python (e.g. this paper, this paper) that may also point to better ways.
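To illustrate the spatial-index suggestion, here is a minimal sketch of a simple grid index, assuming the points are (x, y, value) tuples in metres (for example, built from the oseast1m and osnorth1m columns). It is a simplification and not the GAM itself: it only tests circles centred on the middle of grid cells that contain data, so empty areas are skipped entirely, but a dense circle centred elsewhere could be missed. The function names are hypothetical.

```python
from collections import defaultdict
from math import floor


def build_grid(points, cell_size):
    """Bucket (x, y, value) points into square cells of side cell_size."""
    grid = defaultdict(list)
    for x, y, value in points:
        grid[(floor(x / cell_size), floor(y / cell_size))].append((x, y, value))
    return grid


def sum_within(grid, cell_size, cx, cy, radius):
    """Sum the values of all points within radius of (cx, cy), checking nearby cells only."""
    reach = int(radius // cell_size) + 1  # how many cells the circle can extend into
    gx, gy = floor(cx / cell_size), floor(cy / cell_size)
    r2 = radius * radius
    total = 0
    for ix in range(gx - reach, gx + reach + 1):
        for iy in range(gy - reach, gy + reach + 1):
            for x, y, value in grid.get((ix, iy), ()):
                if (x - cx) ** 2 + (y - cy) ** 2 <= r2:
                    total += value
    return total


def dense_circles(points, radius, threshold):
    """Return (cx, cy, total) for occupied-cell centres whose circle sum reaches threshold."""
    grid = build_grid(points, radius)
    hits = []
    for ix, iy in list(grid):
        cx, cy = (ix + 0.5) * radius, (iy + 0.5) * radius
        total = sum_within(grid, radius, cx, cy, radius)
        if total >= threshold:
            hits.append((cx, cy, total))
    return hits


if __name__ == "__main__":
    pts = [(10, 10, 1), (20, 20, 1), (150, 150, 1), (1000, 1000, 5)]
    print(sorted(dense_circles(pts, radius=200, threshold=3)))
```

Because distances are plain Euclidean distances in metres, no haversine conversion is needed. For millions of points, a library spatial index would likely be faster than pure Python, but the structure of the search stays the same.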

Resources