How to calculate the area of a geospatial ZIP code polygon extracted from the US Census using tidycensus - geospatial

I would like to calculate the population density for ZIP codes in my state (North Carolina). I am able to extract the ZIP code populations and polygons from the US Census and plot the map of North Carolina using the following code:
library(tidycensus)
library(dplyr)
library(tmap)

geo <- get_acs(geography = 'zcta',   # get ZIP-code-level data
               state = 'NC',         # for NC
               year = 2019,          # for 2019
               table = 'B01003',     # from this US Census table
               geometry = TRUE) %>%  # and return the geospatial polygons for mapping
  dplyr::rename('population' = estimate, 'zipcode' = NAME) %>%
  select(-moe) %>%
  arrange(zipcode)

p <- tm_shape(geo) +
  tm_polygons('population')
p
This maps population by ZIP code. In order to calculate and map the population density by ZIP code, I need the area (in square miles or square kilometers) of each ZIP code polygon. I am struggling to find a way of (a) getting this data from the US Census site, (b) finding it elsewhere, or (c) using the polygon geometry to calculate it.
Any suggestions will be appreciated.

Another approach is to set keep_geo_vars = TRUE in get_acs(). This will return the area (in square meters) of the land and water within each ZCTA polygon. For calculating population density, you may prefer to use just the land area rather than the total area of each ZCTA polygon.
The land area variable is ALAND10 and the water area variable is AWATER10.
library(tidycensus)

get_acs(geography = 'zcta',
        state = 'NC',
        year = 2019,
        table = 'B01003',
        geometry = TRUE,
        keep_geo_vars = TRUE)
#> Getting data from the 2015-2019 5-year ACS
#> Simple feature collection with 808 features and 9 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32187 ymin: 33.84232 xmax: -75.46062 ymax: 36.58812
#> geographic CRS: NAD83
#> First 10 features:
#> ZCTA5CE10 AFFGEOID10 GEOID ALAND10 AWATER10 NAME variable
#> 1 28906 8600000US28906 28906 864608629 28813485 ZCTA5 28906 B01003_001
#> 2 28721 8600000US28721 28721 285413675 41953 ZCTA5 28721 B01003_001
#> 3 28365 8600000US28365 28365 498948199 2852124 ZCTA5 28365 B01003_001
#> 4 27317 8600000US27317 27317 139042432 9345547 ZCTA5 27317 B01003_001
#> 5 27562 8600000US27562 27562 139182043 11466187 ZCTA5 27562 B01003_001
#> 6 28748 8600000US28748 28748 218992045 0 ZCTA5 28748 B01003_001
#> 7 28025 8600000US28025 28025 282005597 384667 ZCTA5 28025 B01003_001
#> 8 28441 8600000US28441 28441 331231481 282711 ZCTA5 28441 B01003_001
#> 9 27893 8600000US27893 27893 285314738 3173744 ZCTA5 27893 B01003_001
#> 10 28101 8600000US28101 28101 2319755 131290 ZCTA5 28101 B01003_001
#> estimate moe geometry
#> 1 19701 736 MULTIPOLYGON (((-84.31137 3...
#> 2 10401 668 MULTIPOLYGON (((-82.93706 3...
#> 3 15533 1054 MULTIPOLYGON (((-78.29569 3...
#> 4 16169 875 MULTIPOLYGON (((-79.87638 3...
#> 5 2149 431 MULTIPOLYGON (((-78.99166 3...
#> 6 12606 1020 MULTIPOLYGON (((-82.88801 3...
#> 7 54425 1778 MULTIPOLYGON (((-80.62793 3...
#> 8 3396 588 MULTIPOLYGON (((-78.6127 34...
#> 9 39531 1258 MULTIPOLYGON (((-78.06593 3...
#> 10 970 245 MULTIPOLYGON (((-81.09122 3...

I was able to find an answer to my question that is very simple. Note that the polygons describing each ZIP code are found in the variable geometry in the geo data frame.
First, we calculate the area, which by default is returned in m^2 units.
library(sf)
geo$area.m2 <- st_area(geo$geometry)
Second, we convert to square miles.
library(units)
geo$area.miles2 <- set_units(geo$area.m2, miles^2)
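As a cross-check on the conversion factor (1 mile = 1609.344 m, so 1 mi² ≈ 2,589,988 m²), here is a quick Python sketch using the ALAND10 value for ZCTA 28906 from the output above; the variable names are just for illustration:

```python
# Convert a land area from square meters to square miles.
# 1 mile = 1609.344 m, so 1 square mile = 1609.344**2 square meters.
M2_PER_MI2 = 1609.344 ** 2          # ~2,589,988 m^2 per mi^2

aland_m2 = 864_608_629              # ALAND10 for ZCTA 28906 (from the output above)
aland_mi2 = aland_m2 / M2_PER_MI2   # land area in square miles

population = 19701                  # B01003 estimate for ZCTA 28906
density = population / aland_mi2    # people per square mile

print(round(aland_mi2, 1), round(density, 1))
```

This should agree with the st_area() / set_units() result for that ZCTA (up to the difference between the planar Census area and the geodesic area sf computes).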

Related

P value from pairwise_cor using widyr

I am using the widyr package in R to perform pairwise correlations between clusters of words. My goal is to examine correlations among clusters (restoration, recreation, ...), which indicate how often clusters appear together relative to how often they appear separately in my documents (social media text).
Everything worked fine for correlations following this tutorial (https://www.youtube.com/watch?v=mApnx5NJwQA) from 10:34 to 12:52:
# correlation of co-occurring words
correlatee <- data2 %>%
  pairwise_cor(word, line, sort = TRUE)
# A tibble: 72 x 3
item1 item2 correlation
<chr> <chr> <dbl>
1 physical recreation 0.321
2 recreation physical 0.321
3 restoration recreation 0.304
4 recreation restoration 0.304
5 physical restoration 0.283
6 restoration physical 0.283
7 affection aesthetics 0.240
8 aesthetics affection 0.240
9 restoration aesthetics 0.227
10 aesthetics restoration 0.227
# ... with 62 more rows
# ℹ Use `print(n = ...)` to see more rows
However, my question is: how can I get p-values for the correlations using pairwise_cor(), or in some other way for these pairwise comparisons?
There is a related question, but its code is not working for me: pairwise_cor() in R: p value?
Thank you very much.

Plot many plots for each unique Value of one column

I am working with a DataFrame that has around 20 columns, but only three columns are really of importance.
Index  ID        Date                 Time_difference
1      01-40-50  2021-12-01 16:54:00  0 days 00:12:00
2      01-10     2021-10-11 13:28:00  2 days 00:26:00
3      03-48-58  2021-11-05 16:54:00  2 days 00:26:00
4      01-40-50  2021-12-06 19:34:00  7 days 00:26:00
5      03-48-58  2021-12-09 12:14:00  1 days 00:26:00
6      01-10     2021-08-06 19:34:00  0 days 00:26:00
7      03-48-58  2021-10-01 11:44:00  0 days 02:21:00
There are 90 unique ID's and a few thousand rows in total. What I want to do is:
Create a plot for each unique ID
Each plot with a y-axis of 'Time_difference' and an x-axis of 'Date'
Each plot with a trendline
Optimally, a plot that shows the average of all other plots
Would appreciate any input as to how to start this! Thank you!
For future documentation, I solved it as follows:
First, transform the time_difference to a number of hours:
df['hour_difference'] = (df['time_difference'].dt.days * 24
                         + df['time_difference'].dt.seconds / 60 / 60)
Then create a list with all unique entries of the ID:
id_list = df['ID'].unique()
And last, the for-loop for the plotting:
import matplotlib.pyplot as plt

for i in id_list:
    df.loc[df['ID'] == i].plot(y=["hour_difference"], figsize=(15, 4))
    plt.title(i, fontsize=18)              # label the title
    plt.xlabel('Title name', fontsize=12)  # label the x-axis
    plt.ylabel('Title Name', fontsize=12)  # label the y-axis
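The trendline the question asked for can be fitted with np.polyfit. A minimal sketch, assuming the same column names ('ID', 'Date', 'hour_difference') and using a small made-up stand-in for df; it only computes the per-ID trendline values, which you would then plot over the data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the real frame has 'ID', 'Date' and the derived
# 'hour_difference' column from the answer above.
df = pd.DataFrame({
    'ID':   ['01-40-50', '01-40-50', '01-40-50', '01-10', '01-10', '01-10'],
    'Date': pd.to_datetime(['2021-12-01', '2021-12-03', '2021-12-06',
                            '2021-08-06', '2021-08-07', '2021-08-09']),
    'hour_difference': [0.2, 24.4, 168.4, 0.4, 12.0, 26.0],
})

slopes = {}
for i, grp in df.groupby('ID'):
    x = (grp['Date'] - grp['Date'].min()).dt.days.to_numpy()  # days since first obs.
    slope, intercept = np.polyfit(x, grp['hour_difference'], 1)  # degree-1 fit
    trend = slope * x + intercept   # trendline y-values for this ID
    slopes[i] = slope
```

To draw the trendline, plot `trend` against `grp['Date']` on the same axes as the data; the "average of all plots" can be built the same way from `df.groupby('Date')['hour_difference'].mean()`.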

Using ax.twinx() with sns.FacetGrid and sns.lineplot

I am trying to apply a shared x-axis on each facet plot with 'total_bill' on the left y-axis, and 'tip' on the right y-axis. Using the tips dataframe to demonstrate.
The following dataset tips is used:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime, timedelta

tips = sns.load_dataset("tips")
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(tips.shape[0] - 1), freq='D')
tips['date'] = days
Dataset preview:
tips.head()
   total_bill   tip     sex smoker  day    time  size                       date
0       16.99  1.01  Female     No  Sun  Dinner     2 2021-01-19 16:39:38.363600
1       10.34  1.66    Male     No  Sun  Dinner     3 2021-01-20 16:39:38.363600
2       21.01  3.50    Male     No  Sun  Dinner     3 2021-01-21 16:39:38.363600
3       23.68  3.31    Male     No  Sun  Dinner     2 2021-01-22 16:39:38.363600
4       24.59  3.61  Female     No  Sun  Dinner     4 2021-01-23 16:39:38.363600
I have tried:
import matplotlib.pyplot as plt

g = sns.FacetGrid(tips, row='smoker', col='time')
g.map(sns.lineplot, 'date', 'tip', color='b')

for ax in g.axes.flat:
    ax.twinx()
    for label in ax.get_xticklabels():
        label.set_rotation(60)

g.map(sns.lineplot, 'date', 'total_bill', color='g')
plt.show()
I am unable to figure out the best way to direct the second lineplot onto the ax.twinx() secondary right y-axis.
What I hope to achieve:
A sns.FacetGrid() of sns.lineplots with 'total_bill' on the left y-axis and 'tip' on the right y-axis.
A green line representing the 'total_bill' fluctuations and a blue line representing the 'tip' fluctuations, each to the scale of its respective y-axis.
You need to create a custom plotting function that creates the twin axes and plots the desired output on the newly created axes:
import matplotlib.pyplot as plt
import seaborn as sns

def twin_lineplot(x, y, color, **kwargs):
    ax = plt.twinx()
    sns.lineplot(x=x, y=y, color=color, **kwargs, ax=ax)

g = sns.FacetGrid(tips, row='smoker', col='time')
g.map(sns.lineplot, 'date', 'tip', color='b')
g.map(twin_lineplot, 'date', 'total_bill', color='g')
g.fig.autofmt_xdate()
plt.show()
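For reference, ax.twinx() simply creates a second Axes that shares the x-axis and gets its own independent right-hand y-scale, which is why it must be called inside the mapped function (once per facet). A minimal matplotlib-only sketch of that mechanism, with made-up data standing in for two facets:

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen so the sketch runs anywhere
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2)      # stand-ins for two facets
for ax in axes:
    ax.plot([1, 2, 3], [1.0, 1.7, 1.4], color='b')   # left y-axis series
    twin = ax.twinx()               # new Axes sharing x, independent right y
    twin.plot([1, 2, 3], [10, 25, 18], color='g')    # right y-axis series

print(len(fig.axes))                # 2 facets + 2 twins = 4 Axes
```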

How to access a column of grouped data to perform linear regression in pandas?

I want to perform a linear regression on the groups of a grouped data frame in pandas. The function I am calling throws a KeyError that I cannot resolve.
I have an environmental data set called dat that includes concentration data of a chemical in different tree species of various age classes in different country sites over the course of several time steps. I now want to do a regression of concentration over time steps within each group of (site, species, age).
This is my code:
```
import pandas as pd
import statsmodels.api as sm
dat = pd.read_csv('data.csv')
dat.head(15)
SampleName Concentration Site Species Age Time_steps
0 batch1 2.18 Germany pine 1 1
1 batch2 5.19 Germany pine 1 2
2 batch3 11.52 Germany pine 1 3
3 batch4 16.64 Norway spruce 0 1
4 batch5 25.30 Norway spruce 0 2
5 batch6 31.20 Norway spruce 0 3
6 batch7 12.63 Norway spruce 1 1
7 batch8 18.70 Norway spruce 1 2
8 batch9 43.91 Norway spruce 1 3
9 batch10 9.41 Sweden birch 0 1
10 batch11 11.10 Sweden birch 0 2
11 batch12 15.73 Sweden birch 0 3
12 batch13 16.87 Switzerland beech 0 1
13 batch14 22.64 Switzerland beech 0 2
14 batch15 29.75 Switzerland beech 0 3
def ols_res_grouped(group):
    xcols_const = sm.add_constant(group['Time_steps'])
    linmod = sm.OLS(group['Concentration'], xcols_const).fit()
    return linmod.params[1]

grouped = dat.groupby(['Site','Species','Age']).agg(ols_res_grouped)
```
I want to get the regression coefficient of concentration data over Time_steps but get a KeyError: 'Time_steps'. How can the sm method access group["Time_steps"]?
According to pandas' documentation, agg applies functions to each column independently, so your function only ever receives a single column and group['Time_steps'] raises the KeyError.
It might be possible to use NamedAgg, but I am not sure.
I think it is a lot easier to just use a for loop for this:
for _, group in dat.groupby(['Site', 'Species', 'Age']):
    coeff = ols_res_grouped(group)
    # if you want to put the coeff inside the dataframe
    dat.loc[group.index, 'coeff'] = coeff
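An alternative that avoids the loop: GroupBy.apply (unlike agg) passes each sub-frame with all of its columns to the function, so a per-group regression works directly. A sketch on a small stand-in for dat, using np.polyfit for the slope so the example has no statsmodels dependency:

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({                      # first six rows of the example data
    'Concentration': [2.18, 5.19, 11.52, 16.64, 25.30, 31.20],
    'Site':       ['Germany'] * 3 + ['Norway'] * 3,
    'Species':    ['pine'] * 3 + ['spruce'] * 3,
    'Age':        [1, 1, 1, 0, 0, 0],
    'Time_steps': [1, 2, 3, 1, 2, 3],
})

def ols_slope(group):
    # OLS slope of Concentration over Time_steps within one (Site, Species, Age) group
    return np.polyfit(group['Time_steps'], group['Concentration'], 1)[0]

# apply() receives the whole sub-frame, so both columns are visible
coeffs = dat.groupby(['Site', 'Species', 'Age']).apply(ols_slope)
print(coeffs)
```

The result is a Series indexed by the (Site, Species, Age) group keys, one regression coefficient per group.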

applying a function rowwise inside mutate(dplyr)

I have the data below, where Duration captures the number of years in the same house for each household.
Input df:
House_ID Duration
H29937 30 YEAR
H2996 30 YEAR
H156 25 YEAR
H10007 5 MONTH
I am trying to get the duration in months with the code below: if the second part of the extracted string is YEAR, convert the number in Duration to months by multiplying it by 12; otherwise just take the numeric part of Duration.
info_df <- mutate(info_df,
                  residence_Months = ifelse(str_split(Duration, " ", 2)[[1]][2] == "YEAR",
                                            as.numeric(str_split(Duration, " ", 2)[[1]][1]) * 12,
                                            as.numeric(str_split(Duration, " ", 2)[[1]][1])))
Expected output df:
House_ID Duration Residence_Months
H29937 30 YEAR 360
H2996 30 YEAR 360
H156 25 YEAR 300
H10007 5 MONTH 5
However, the code above gives the same value, 360, for all rows.
I am not sure where the error is occurring. Can someone please help me with this?
Note: I have tried the rowwise option as pointed out in other posts, but to no avail.
The problem is that str_split() returns a list with one element per row, so str_split(Duration, " ", 2)[[1]] always refers to the first row's split, which is why every row gets 360. Depending on your full data set, this may be better achieved with the lubridate package, but taking into account your example, you can do:
library(dplyr)
library(tidyr)

df <- tibble(House_ID = c("H29937", "H2996", "H156", "H10007"),
             Duration = c("30 YEAR", "30 YEAR", "25 YEAR", "5 MONTH"))

df %>%
  separate("Duration", c("duration", "unit")) %>%
  mutate(duration = as.integer(duration),
         Residence_Months = ifelse(unit == "YEAR", duration * 12, duration))
#> # A tibble: 4 x 4
#> House_ID duration unit Residence_Months
#> <chr> <int> <chr> <dbl>
#> 1 H29937 30 YEAR 360
#> 2 H2996 30 YEAR 360
#> 3 H156 25 YEAR 300
#> 4 H10007 5 MONTH 5
Created on 2019-07-18 by the reprex package (v0.3.0)
