I have a simple dataframe like
colC zipcode count
val1 71023 1
val2 75454 3
val3 77034 2
val2 78223 3
val2 91791 4
These are all US zipcodes.
I want to plot the zipcodes and the counts of values in colC on a map. For instance, zipcode 75454 has val2 in colC, so it must have a different color than zipcode 71023, which has val1 in colC.
Additionally, I want to create a heatmap where the count column determines the intensity across the map.
I went through some of the geopandas documentation, but it looks like I have to convert the zipcodes to shapefiles or GeoJSON in order to define the boundaries. I have not been able to figure this step out.
Is geopandas the best tool to achieve this?
Any help is much appreciated
UPDATE
I was able to make some progress with the following:
import pandas as pd
import pandas_bokeh
import matplotlib.pyplot as plt
import pgeocode
import geopandas as gpd
from shapely.geometry import Point
from geopandas import GeoDataFrame
pandas_bokeh.output_notebook()
nomi = pgeocode.Nominatim('us')
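# read zipcodes as strings so leading zeros are preserved
edf = pd.read_csv('myFile.tsv', sep='\t', header=None, index_col=False, names=['colC','zipcode','count'], dtype={'zipcode': str})
# look up each zipcode once and reuse the result for both coordinates
geo = nomi.query_postal_code(edf['zipcode'].tolist())
edf['Latitude'] = geo['latitude'].values
edf['Longitude'] = geo['longitude'].values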
geometry = [Point(xy) for xy in zip(edf['Longitude'], edf['Latitude'])]
gdf = GeoDataFrame(edf, geometry=geometry)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
gdf.plot(ax=world.plot(figsize=(10, 6)), marker='o', color='red', markersize=15);
plt.savefig('world.jpg')
However, this gives me a map of the entire world. How can I reduce it to show just the US, since that is where all of my zipcodes are from?
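One possible way to stay with the matplotlib/geopandas plot and still show only the US is to zoom the axes to a bounding box after plotting. The sketch below uses plain matplotlib with made-up points. The helper name `zoom_to_conus` and the bounding-box values (a rough box around the contiguous US) are assumptions for illustration and not part of any library.

```python
import matplotlib.pyplot as plt

# Rough bounding box for the contiguous US (approximate, assumed values).
CONUS_LON = (-125.0, -66.0)
CONUS_LAT = (24.0, 50.0)


def zoom_to_conus(ax):
    """Restrict an existing map axes to the contiguous-US bounding box."""
    ax.set_xlim(*CONUS_LON)
    ax.set_ylim(*CONUS_LAT)
    return ax


# Stand-in for the geocoded zipcodes (hypothetical lon/lat values).
lons = [-93.5, -96.4, -95.3]
lats = [32.5, 33.3, 29.6]

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(lons, lats, color="red", s=15)
zoom_to_conus(ax)
```

With the code in the question, the same call would go after the plotting step: keep the axes returned by `world.plot(...)`, pass it to `gdf.plot(ax=...)`, then call `zoom_to_conus` on it.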
It turns out plotly is better suited to what I need:
import pandas as pd
import pgeocode
import plotly.graph_objects as go

nomi = pgeocode.Nominatim('us')
# read zipcodes as strings so leading zeros are preserved
edf = pd.read_csv('myFile.tsv', sep='\t', header=None, index_col=False, names=['colC','zipcode','count'], dtype={'zipcode': str})
# look up each zipcode once and reuse the result for both coordinates
geo = nomi.query_postal_code(edf['zipcode'].tolist())
edf['Latitude'] = geo['latitude'].values
edf['Longitude'] = geo['longitude'].values
fig = go.Figure(data=go.Scattergeo(
    lon=edf['Longitude'],
    lat=edf['Latitude'],
    text=edf['colC'],           # hover text shows the colC value
    mode='markers',
    marker_color=edf['count'],  # marker color encodes the count
))
fig.update_layout(
    title='colC Distribution',
    geo_scope='usa',            # restrict the map to the US
)
fig.show()
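The figure above colors markers by count. For the original goal of one color per colC value, one option is to build a color column first. This is a minimal sketch, assuming a small hypothetical palette and the sample rows from the question:

```python
import pandas as pd

# Sample rows from the question.
edf = pd.DataFrame({
    "colC": ["val1", "val2", "val3", "val2", "val2"],
    "zipcode": ["71023", "75454", "77034", "78223", "91791"],
    "count": [1, 3, 2, 3, 4],
})

# Hypothetical palette; any list of CSS color strings would do.
palette = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]

# factorize assigns an integer code to each distinct colC value,
# in order of first appearance.
codes, categories = pd.factorize(edf["colC"])
edf["color"] = [palette[code % len(palette)] for code in codes]

print(edf[["colC", "zipcode", "color"]])
```

The resulting column could then be passed as `marker_color = edf['color']` so that rows sharing a colC value share a color.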
I am trying to create a plot using Python in Spyder. I have sample data in Excel which I am able to import into Spyder. I want one column ('Frequency') on the X axis and the remaining columns ('C1', 'C2', 'C3', 'C4') plotted on the Y axis. How do I do this? The data and the way the plot looks in Excel are shown here: https://i.stack.imgur.com/eRug5.png
This is what I have so far. The commands below (also shown in the image) give an empty plot.
data = data.head()
#data.plot(kind='line', x='Frequency', y=['C1','C2','C3','C4'])
df = pd.DataFrame(data, columns=["Frequency","C1", "C2","C3","C4"])
df.plot(x = "Frequency",y=["C1", "C2","C3","C4"])
Here is an example; you can change the column names:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'X_Axis':[1,3,5,7,10,20],
'col_2':[.4,.5,.4,.5,.5,.4],
'col_3':[.7,.8,.9,.4,.2,.3],
'col_4':[.1,.3,.5,.7,.1,.0],
'col_5':[.5,.3,.6,.9,.2,.4]})
dfm = df.melt('X_Axis', var_name='cols', value_name='vals')
g = sns.catplot(x="X_Axis", y="vals", hue='cols', data=dfm, kind='point')
import pandas as pd
import matplotlib.pyplot as plt
path = r"C:\Users\Alisha.Walia\Desktop\Alisha\SAMPLE.xlsx"
data = pd.read_excel(path)
#df = pd.DataFrame.from_dict(data)
#print(df)
#prints out data from excl in tabular format
dict1 = (data.to_dict()) #print(dict1)
Frequency=data["Frequency "].to_list() #print (Frequency)
C1=data["C1"].to_list() #print(C1)
C2=data["C2"].to_list() #print(C2)
C3=data["C3"].to_list() #print(C3)
C4=data["C4"].to_list() #print(C4)
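plt.style.use('ggplot')  # set the style before plotting so it takes effect
plt.plot(Frequency, C1, label='C1')  # labels are what plt.legend() displays
plt.plot(Frequency, C2, label='C2')
plt.plot(Frequency, C3, label='C3')
plt.plot(Frequency, C4, label='C4')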
plt.title('SAMPLE')
plt.xlabel('Frequency 20Hz-200MHz')
plt.ylabel('Capacitance pF')
plt.xlim(5, 500)
plt.ylim(-20,20)
plt.legend()
plt.show()
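The snippet above indexes the column as "Frequency " with a trailing space, which suggests the Excel header carries stray whitespace. If so, that would explain the empty plot in the question: `pd.DataFrame(data, columns=[...])` selects columns by exact name, so a non-matching "Frequency" comes back as all NaN. A minimal sketch of normalizing the headers before calling `DataFrame.plot`, with made-up numbers standing in for the spreadsheet:

```python
import pandas as pd

# Stand-in for pd.read_excel(path); values are invented for illustration.
# Note the trailing space in "Frequency ".
data = pd.DataFrame({
    "Frequency ": [20, 50, 100, 200],
    "C1": [1.0, 1.2, 1.1, 0.9],
    "C2": [2.0, 2.1, 1.9, 1.8],
    "C3": [3.0, 2.8, 2.9, 3.1],
    "C4": [4.0, 4.2, 4.1, 3.9],
})

# Strip stray whitespace from every header so the names match what you type.
data.columns = data.columns.str.strip()

# One line per C column, all sharing Frequency as the x axis.
ax = data.plot(x="Frequency", y=["C1", "C2", "C3", "C4"])
ax.set_xlabel("Frequency")
ax.set_ylabel("Capacitance pF")
```

In an interactive session, follow this with `plt.show()` after importing `matplotlib.pyplot as plt`.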
I have working code that is utilizing dbscan to find tight groups of sparse spatial data imported with pd.read_csv.
I am maintaining the original spatial data locations and would like to annotate the labels returned by dbscan for each data point to the original dataframe and then write a csv with the same information.
So the code below is doing exactly what I would expect it to at this point, I would just like to extend it to import the label for each row in the original dataframe.
import argparse
import string
import os, subprocess
import pathlib
import glob
import gzip
import re
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from sklearn.cluster import DBSCAN
X = pd.read_csv(tmp_csv_name)
X = X.drop('Name', axis = 1)
X = X.drop('Type', axis = 1)
X = X.drop('SomeValue', axis = 1)
# only columns 'x' and 'y' now remain
db=DBSCAN(eps=EPS, min_samples=minSamples, metric='euclidean', algorithm='auto', leaf_size=30).fit(X)
labels = db.labels_  # one cluster label per row of X (-1 marks noise)
unique_labels = set(labels)
# maxX , maxY are manual inputs temporarily
while sizeX > 16 or sizeY > 16:
    # shrink the figure until both sides fit within 16 inches
    sizeX = sizeX * 0.8
    sizeY = sizeY * 0.8
fig, ax = plt.subplots(figsize=(sizeX,sizeY))
plt.xlim(0,maxX)
plt.ylim(0,maxY)
plt.scatter(X['x'], X['y'], c=colors, marker="o", picker=True)
# hackX , hackY are manual inputs temporarily
# which represent the boundaries defined in the original dataset
poly = patches.Polygon(xy=list(zip(hackX,hackY)), fill=False)
ax.add_patch(poly)
plt.show()
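For attaching the labels to the original rows, one possible approach is to fit on a selection of the coordinate columns instead of dropping the others, then assign `labels_` back as a new column. This is a minimal sketch with invented points; the column name `cluster` and the `eps`/`min_samples` values are assumptions chosen for the toy data, not taken from the code above.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for pd.read_csv(tmp_csv_name).
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e", "f"],
    "x": [0.0, 0.1, 0.2, 5.0, 5.1, 10.0],
    "y": [0.0, 0.1, 0.0, 5.0, 5.0, 10.0],
})

# Fit on the coordinate columns only; df itself keeps every column.
db = DBSCAN(eps=0.5, min_samples=2, metric="euclidean").fit(df[["x", "y"]])

# labels_ has one entry per input row, in input order; -1 marks noise.
df["cluster"] = db.labels_

# Called without a path, to_csv returns the CSV text;
# pass a file name instead to write it to disk.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Because the labels come back in the same order as the rows that were fitted, a plain column assignment is enough as long as no rows are filtered or reordered between `fit` and the assignment.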
import plotly.graph_objects as go
import plotly.express as px
fig = px.histogram(df, nbins = 5, x = "numerical_col", color = "cat_1", animation_frame="date",
range_x=["10000","500000"], facet_col="cat_2")
fig.update_layout(
margin=dict(l=25, r=25, t=20, b=20))
fig.show()
How can I fix the output? I would like multiple subplots based on cat_2 where the hue is cat_1.
You have not provided sample data, so I have simulated it based on the code you use to generate the figure.
I ran into one issue: range_x does not work as expected, because it affects the y-axis as well. Otherwise the approach works.
import plotly.graph_objects as go
import plotly.express as px
import numpy as np
import pandas as pd
# data not provided.... simulate some
DAYS = 5
ROWS = DAYS * 2000
df = pd.DataFrame(
{
"date_d": np.repeat(pd.date_range("1-Jan-2021", periods=DAYS), ROWS // DAYS),
"numerical_col": np.random.uniform(10000, 500000, ROWS),
"cat_1": np.random.choice(list("ABCD"), ROWS),
"cat_2": np.random.choice(list("UVWXYZ"), ROWS),
}
)
# animation frame has to be a string not a date...
df["date"] = df["date_d"].dt.strftime("%Y-%b-%d")
# always best to provide pre-sorted data to plotly
df = df.sort_values(["date", "cat_1", "cat_2"])
fig = px.histogram(
df,
nbins=5,
x="numerical_col",
color="cat_1",
animation_frame="date",
# range_x=[10000, 500000],
facet_col="cat_2",
)
fig.update_layout(margin=dict(l=25, r=25, t=20, b=20))
fig.show()
I have a geodataframe gdf that looks like this:
longitude latitude geometry
8628 4.890683 52.372383 POINT (4.89068 52.37238)
8629 4.890500 52.371433 POINT (4.89050 52.37143)
8630 4.889217 52.369469 POINT (4.88922 52.36947)
8631 4.889300 52.369415 POINT (4.88930 52.36942)
8632 4.889100 52.368683 POINT (4.88910 52.36868)
8633 4.889567 52.367416 POINT (4.88957 52.36742)
8634 4.889333 52.367134 POINT (4.88933 52.36713)
I was trying to convert these point geometries into a line. However, the following code below gives an error: AttributeError: 'Point' object has no attribute 'values'
line_gdf = gdf['geometry'].apply(lambda x: LineString(x.values.tolist()))
line_gdf = gpd.GeoDataFrame(line_gdf, geometry='geometry')
Any ideas ?
When you create a LineString from all Points in a geodataframe, you get only 1 line. Here is the code you can run to create the LineString:
from shapely.geometry import LineString
# only relevant code here
# use your gdf that has Point geometry
lineStringObj = LineString( [[a.x, a.y] for a in gdf.geometry.values] )
If you need a geodataframe of 1 row with this linestring as its geometry, proceed with this:
import pandas as pd
import geopandas as gpd
line_df = pd.DataFrame()
line_df['Attrib'] = [1,]
line_gdf = gpd.GeoDataFrame(line_df, geometry=[lineStringObj,])
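The AttributeError in the question arises because `Series.apply` passes one Point at a time to the lambda, and a single Point has no `.values`. A line needs the whole sequence of points at once. A minimal sketch with shapely only, using the first three coordinates from the question:

```python
from shapely.geometry import LineString, Point

points = [
    Point(4.890683, 52.372383),
    Point(4.890500, 52.371433),
    Point(4.889217, 52.369469),
]

# Collect every point's coordinates first, then build one line
# from the whole sequence.
line = LineString([(p.x, p.y) for p in points])

print(line.geom_type, len(line.coords))
```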
Edit1
Pandas has a powerful aggregate function that can be used to collect all the coordinates (longitude, latitude) for use by LineString() to create the required geometry.
I offer this runnable code, which demonstrates the approach, for the benefit of readers.
import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString
from shapely import wkt
from io import StringIO
import numpy as np
# Create a dataframe from CSV data
df5 = pd.read_csv(StringIO(
"""id longitude latitude
8628 4.890683 52.372383
8629 4.890500 52.371433
8630 4.889217 52.369469
8631 4.889300 52.369415
8632 4.889100 52.368683
8633 4.889567 52.367416
8634 4.889333 52.367134"""), sep=r"\s+")  # raw string avoids an invalid-escape warning
# Using pandas' aggregate function
# Aggregate longitude and latitude
stack_lonlat = df5.agg({'longitude': np.stack, 'latitude': np.stack})
# Create the LineString using aggregate values
lineStringObj = LineString(list(zip(*stack_lonlat)))
# (Previously used) Create a LineString from dataframe values
#lineStringObj = LineString( list(zip(df5.longitude.tolist(), df5.latitude.tolist())) )
# Another approach by #Phisan Santitamnont may be the best.
# Create a geodataframe `line_gdf` for the lineStringObj
# This has single row, containing the linestring created from aggregation of (long,lat) data
df6 = pd.DataFrame()
df6['LineID'] = [101,]
line_gdf = gpd.GeoDataFrame(df6, crs='epsg:4326', geometry=[lineStringObj,])
# Plot the lineString in red
ax1 = line_gdf.plot(color="red", figsize=[4,10]);
# Plot the original data: "longitude", "latitude" as kind="scatter"
df5.plot("longitude", "latitude", kind="scatter", ax=ax1);
As of 2022, I would like to propose a more concise, Pythonic alternative:
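from io import StringIO
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
from shapely.geometry import LineString
# Create a dataframe from CSV data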
df = pd.read_csv(StringIO(
"""id longitude latitude
8628 4.890683 52.372383
8629 4.890500 52.371433
8630 4.889217 52.369469
8631 4.889300 52.369415
8632 4.889100 52.368683
8633 4.889567 52.367416
8634 4.889333 52.367134"""), sep=r"\s+")  # raw string avoids an invalid-escape warning
ls = LineString( df[['longitude','latitude']].to_numpy() )
line_gdf = gpd.GeoDataFrame( [['101']],crs='epsg:4326', geometry=[ls] )
# Plot the lineString in red
ax = line_gdf.plot(color="red", figsize=[4,10]);
df.plot("longitude", "latitude", kind="scatter", ax=ax);
plt.show()
I have this data
10,000 12,350 11153
12,350 17,380 39524
17,380 24,670 29037
24,670 36,290 25469
Using matplotlib.pyplot, I would like to draw a bar chart where each bar starts at column0 and ends at column1. A bar would represent an interval (e.g. 10 to 12.35) and the bar height is column2 (11153). How could this be done?
Thank you
See the documentation for pyplot.bar(). For your question, you need to pass column0 as x (older Matplotlib releases called this parameter left), column2 as height, and column1-column0 as width:
import io
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
s = """10000 12350 11153
12350 17380 39524
17380 24670 29037
24670 36290 25469"""
df = pd.read_table(io.StringIO(s), sep=' ', header=None, dtype='int')
# align='edge' makes each bar start at column0; the default centers it there
plt.bar(df[0], df[2], width=df[1]-df[0], align='edge')
plt.show()
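To check that each bar really spans its interval, the bar patches can be inspected after plotting. This is a small sketch of the same idea without pandas, using the four rows from the question; it relies on `align="edge"`, since current Matplotlib centers bars on x by default.

```python
import matplotlib.pyplot as plt

starts = [10000, 12350, 17380, 24670]
ends = [12350, 17380, 24670, 36290]
heights = [11153, 39524, 29037, 25469]

# Each bar is as wide as its interval.
widths = [end - start for start, end in zip(starts, ends)]

fig, ax = plt.subplots()
# align="edge" places each bar's left edge at x.
bars = ax.bar(starts, heights, width=widths, align="edge", edgecolor="black")

# Print left edge, right edge and height of every bar.
for bar in bars:
    print(bar.get_x(), bar.get_x() + bar.get_width(), bar.get_height())
```

Adjacent intervals share an edge here, so the `edgecolor` outline is only there to keep neighboring bars visually distinct.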