Download .csv file from GitHub using an httr GET request

I am trying to create an automatic pull in R, using the GET function from the httr package, for a csv file hosted on GitHub.
Here is the table I am trying to download.
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv
I can make the connection to the file using the following GET request:
library(httr)
x <- httr::GET("https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
However I am unsure how I then convert that into a dataframe similar to the table on github.
Any assistance would be much appreciated.

I am new to R, but here is my solution.
You need to use the raw version of the csv file from GitHub (raw.githubusercontent.com)!
library(httr)
x <- httr::GET("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
# Save to file
bin <- content(x, "raw")
writeBin(bin, "data.csv")
# Read as csv
dat <- read.csv("data.csv", header = TRUE)
# Date columns are prefixed with "X" on import; strip only that leading X
colnames(dat) <- sub("^X", "", colnames(dat))
# Group by country name (to sum regions)
# Skip the first four columns, which contain metadata
countries = aggregate(dat[, 5:ncol(dat)], by=list(Country.Region=dat$Country.Region), FUN=sum)
# Here is the table of the most recent total confirmed cases
countries_total = countries[, c(1, ncol(countries))]
How I got this to work: "How to sum a variable by group"

This is as simple as:
res <- httr::GET("https://.../file.csv")
data <- httr::content(res, "parsed")
This requires the readr package, which httr uses to parse the csv response into a data frame.
See https://httr.r-lib.org/reference/content.html

Related

Custom Filepath Exporting a Pandas Dataframe

I am working with financial data, and I am cleaning the data in Python before exporting it as a CSV. I want this file to be reused, so I want to make sure that the exported files are not overwritten. I am including this piece of code to help with this:
# Fill this out; this will help identify the dataset after it is exported
latestFY = '21'
earliestFY = '19'
I want the user to change the earliest and latest fiscal year variables to reflect the data they are working with, so when the data is exported, it is called financialData_FY19_FY21, for example. How can I do this using the to_csv function?
Here is what I currently have:
mergedDF.to_csv("merged_financial_data_FY.csv", index = False)
Here is what I want the file path to look like: financialData_FY19_FY21 where the 19 and 21 can be changed based on the input above.
You can use an f-string to update the strings that will be your file paths.
latestFY = '21'
earliestFY = '19'
filename = f"financialData_FY{earliestFY}_FY{latestFY}.csv"
mergedDF.to_csv(filename, index=False)
See the Python documentation on formatted string literals (f-strings).
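If the same fiscal years might be exported more than once, a small helper can also guard against overwriting. This is a sketch of my own (the function name and the counter scheme are not part of pandas or the question):

```python
import os

def build_filename(earliest_fy, latest_fy, base="financialData"):
    """Build the export path from the fiscal-year inputs,
    appending a counter if the file already exists."""
    name = f"{base}_FY{earliest_fy}_FY{latest_fy}.csv"
    counter = 1
    while os.path.exists(name):
        name = f"{base}_FY{earliest_fy}_FY{latest_fy}_{counter}.csv"
        counter += 1
    return name

print(build_filename('19', '21'))  # financialData_FY19_FY21.csv
```

The returned string can then be passed straight to mergedDF.to_csv(...).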

Appending data from multiple excel files into a single excel file without overwriting using python pandas

Here is my current code below.
I have a specific range of cells (from a specific sheet) that I am pulling out of multiple (~30) excel files. I am trying to pull this information out of all these files to compile into a single new file appending to that file each time. I'm going to manually clean up the destination file for the time being as I will improve this script going forward.
What I currently have works fine for a single sheet, but I overwrite my destination every time I add a new file to the read-in list.
I've tried adding mode = 'a' and a couple of different ways to concat at the end of my function.
import pandas as pd
def excel_loader(fname, sheet_name, new_file):
    xls = pd.ExcelFile(fname)
    df1 = pd.read_excel(xls, sheet_name, nrows = 20)
    print(df1[1:15])
    writer = pd.ExcelWriter(new_file)
    df1.insert(51, 'Original File', fname)
    df1.to_excel(new_file)
names = ['sheet1.xlsx', 'sheet2.xlsx']
destination = 'destination.xlsx'
for name in names:
    excel_loader(name, 'specific_sheet_name', destination)
Thanks for any help in advance; I can't seem to find an answer to this exact situation on here. Cheers.
Ideally you want to loop through the files and read the data into a list, then concatenate the individual dataframes, then write the new dataframe once. This assumes the data being pulled is the same size/shape and the sheet name is the same. If the sheet name changes per file, look into the zip() function to pair each file name with its sheet name.
This should get you started:
names = ['sheet1.xlsx', 'sheet2.xlsx']
sheet_name = 'specific_sheet_name'  # the sheet to pull from every workbook
destination = 'destination.xlsx'
# read all files first
df_hold_list = []
for name in names:
    df = pd.read_excel(name, sheet_name, nrows=20)
    df['Original File'] = name  # keep track of the source file
    df_hold_list.append(df)
# concatenate dfs; axis=0 stacks the rows (appending), axis=1 would place them side by side
df1 = pd.concat(df_hold_list, axis=0, ignore_index=True)
# write the combined frame to the destination once
df1.to_excel(destination, index=False)
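Since the axis argument decides whether the frames are appended or placed side by side, here is a minimal in-memory sketch of the difference (file reading omitted; the toy frames are made up for illustration):

```python
import pandas as pd

# Two small frames standing in for the ranges read from two workbooks
df_a = pd.DataFrame({"value": [1, 2]})
df_b = pd.DataFrame({"value": [3, 4]})

# axis=0 stacks the rows, which is what "appending" means here
stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)
# axis=1 places the frames next to each other instead
side_by_side = pd.concat([df_a, df_b], axis=1)

print(stacked.shape)       # (4, 1)
print(side_by_side.shape)  # (2, 2)
```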

convert data string to list

I'm having some troubles processing some input.
I am reading data from a log file and store the different values according to the name.
So my input string consists of ip, name, time and a data value.
A log line looks like this and it has \t spacing:
134.51.239.54 Steven 2015-01-01 06:09:01 5423
I'm reading in the values using this code:
loglines = file.splitlines()
data_fields = loglines[0] # IP NAME DATE DATA
for loglines in loglines[1:]:
items = loglines.split("\t")
ip = items[0]
name = items[1]
date = items[2]
data = items[3]
This works quite well but I need to extract all names to a list but I haven't found a functioning solution.
When I use print name I get:
Steven
Max
Paul
I do need a list of the names like this:
['Steven', 'Max', 'Paul',...]
There is probably a simple solution that I haven't figured out yet; can anybody help?
Thanks
Just create an empty list and add the names as you loop through the file.
Also note that if that file is very large, file.splitlines() is probably not the best idea, as it reads the entire file into memory -- and then you basically copy all of that by doing loglines[1:]. Better use the file object itself as an iterator. And don't use file as a variable name, as it shadows the type.
with open("some_file.log") as the_file:
    data_fields = next(the_file)  # consumes first line
    all_the_names = []  # this will hold the names
    for line in the_file:  # loops over the rest
        items = line.split("\t")
        ip, name, date, data = items  # you can put all this in one line
        all_the_names.append(name)  # add the name to the list of names
Alternatively, you could use zip and map to put it all into one expression (using that loglines data), but you rather shouldn't do that: list(zip(*map(lambda s: s.split('\t'), loglines[1:])))[1] (in Python 3, zip returns an iterator, hence the extra list()).
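A plain list comprehension does the same extraction in one readable line. Here is a self-contained sketch (the sample log text is made up to match the format in the question):

```python
log = ("IP\tNAME\tDATE\tDATA\n"
       "134.51.239.54\tSteven\t2015-01-01 06:09:01\t5423\n"
       "134.51.239.55\tMax\t2015-01-02 07:10:02\t1234\n"
       "134.51.239.56\tPaul\t2015-01-03 08:11:03\t99\n")
loglines = log.splitlines()

# take the second tab-separated field of every line after the header
all_the_names = [line.split("\t")[1] for line in loglines[1:]]
print(all_the_names)  # ['Steven', 'Max', 'Paul']
```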

Automatically naming output txt file in Python

I have 4 lists called I_list, Itiso, ItHDKR and Itperez, and I would like to receive .txt output files with the data of these lists. I am trying to make Python automatically name the .txt output files based on some of my input data, so that the output files always have different names.
Now I am programming the following commands:
Horizontal_radiation = []
Isotropic_radiation = []
HDKR_radiation = []
Perez_radiation = []
Horizontal = open("outputHorizontal.txt", 'w')
Isotropic = open("outputIsotropic.txt", 'w')
HDKR = open("outputHDKR.txt", 'w')
Perez = open("outputPerez.txt", 'w')
for i in I_list:
    Horizontal_radiation.append(i)
for x in Itiso:
    Isotropic_radiation.append(x)
for y in ItHDKR:
    HDKR_radiation.append(y)
for z in Itperez:
    Perez_radiation.append(z)
Horizontal.write(str(Horizontal_radiation))
Isotropic.write(str(Isotropic_radiation))
HDKR.write(str(HDKR_radiation))
Perez.write(str(Perez_radiation))
Horizontal.close()
Isotropic.close()
HDKR.close()
Perez.close()
As you can see, the name of the .txt output file is fixed as "outputHorizontal.txt" (the first one). Is there any way to change this name according to an input? For example, one of my inputs is the latitude, 'lat'. I am trying to make the output file name depend on 'lat'; that way, every time I run the program the name would be different, because right now I always get the same name and the file is overwritten.
Thank you very much people, kind regards.
You can pass a string variable as the output file name. For example, you could move the open() calls after you add elements to the lists (and before you write them) and build the name from one of your inputs, such as the latitude:
Horizontal = open("outputHorizontal_lat{}.txt".format(lat), 'w')
Or just add a timestamp to the file name if it's all about not overwriting files:
from datetime import datetime
Horizontal = open("outputHorizontal_{}.txt".format(datetime.today().strftime("%Y%m%d_%H%M%S")), 'w')
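Putting this together, here is a minimal sketch. The lat value and the helper name are assumptions for illustration, not from the question; in the real script lat would come from the existing inputs:

```python
lat = 41.39  # example latitude input (assumed value)

def write_radiation(label, values, lat):
    """Write one radiation list to a .txt file whose name encodes the latitude."""
    filename = "output{}_lat{}.txt".format(label, lat)
    # the with-statement closes the file automatically, replacing the manual close() calls
    with open(filename, 'w') as f:
        f.write(str(values))
    return filename

fname = write_radiation("Horizontal", [1.0, 2.5, 3.1], lat)
print(fname)  # outputHorizontal_lat41.39.txt
```

The same call can be repeated for the Isotropic, HDKR and Perez lists, so each run with a different latitude produces a different set of file names.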

Removing levels from data frame read from CSV file - R

I tried loading the baseball statistics from this link. When I read it from the file using
data <- read.csv("MLB2011.csv")
it seems to be reading all fields as factor values. I tried dropping those factor values by doing:
read.csv("MLB2011.xls", as.is= FALSE)
.. but it looks like the values are still being read as factors. What can I do to have them loaded as simple character values and not factors?
You aren't reading a csv file; it is an Excel spreadsheet (.xls format) containing two worksheets, bat2011 and pitch2011.
You could use the XLConnect package to read it:
library(XLConnect)
# load the work book (connect to the file)
wb <- loadWorkbook("MLB2011.xls")
# read in the data from the bat2011 sheet
bat2011 <- readWorksheet(wb, sheet = 'bat2011')
readWorksheet has an argument colTypes which you could use to specify the column types.
Edit
If you have already saved the sheets as csv files, then
as.is = TRUE or stringsAsFactors = FALSE is the correct argument value for read.csv.
