could not figure out how to read the metadata - pyhdf

I could not figure out how to read the metadata contained in the following HDF file:
ftp://ladsweb.nascom.nasa.gov/allData/6/MOD07_L2/2014/126/MOD07_L2.A2014126.0640.006.2014126214544.hdf
I could successfully read the datasets and attributes as follows:
from pyhdf.SD import SD, SDC

infile = 'MOD07_L2.A2014126.0640.006.2014126214544.hdf'
indata = SD(infile, SDC.READ)
# list all SDS datasets in the file
datasets = indata.datasets()
print datasets
# read one dataset and its attributes
reqdata = indata.select('Processing_Flag')
attributes = reqdata.attributes()
print attributes
I hope someone can help me.

You can access the metadata, AKA file attributes, in a similar manner to the way you are accessing the data attributes.
all_metadata = indata.attributes()
print all_metadata
specific_metadata = getattr(indata, 'Pressure_Levels')
print specific_metadata
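If you want to dump all of the metadata at once, here is a minimal sketch along the same lines; it only uses the attributes(), datasets(), and select() calls already shown above, and makes no assumptions about which attribute or dataset names your file actually contains:
# print every file-level attribute (the global metadata)
for name, value in indata.attributes().items():
    print name, value
# print each dataset's own attributes
for ds_name in indata.datasets().keys():
    sds = indata.select(ds_name)
    print ds_name, sds.attributes()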


One person in my dataframe keeps showing up with \ufeff in their name when I print to the console

I have Python code that loads a group of exam results. Each exam is saved in its own CSV file.
import glob
import pandas as pd

files = glob.glob('Exam *.csv')
files1 = glob.glob('Exam 1*.csv')
frame = []
for file in files:
    frame.append(pd.read_csv(file, index_col=[0], encoding='utf-8-sig'))
for file in files1:
    frame.append(pd.read_csv(file, index_col=[0], encoding='utf-8-sig'))
There is one person in the whole dataframe whose entry in the name column shows up as
\ufeffStudents Name
It happens for every single exam. I tried using the encoding argument, but that isn't fixing the issue. I am out of ideas. Does anyone have any suggestions?
That character is the BOM or "Byte Order Mark."
There are several ways to resolve it.
First, I suggest adding the engine parameter (for example, engine='python') in pd.read_csv() when reading the CSV files.
pd.read_csv(file, index_col=[0], engine='python', encoding='utf-8-sig')
Secondly, you can simply remove it by replacing it with an empty string ('').
df['student_name'] = df['student_name'].apply(lambda x: x.replace("\ufeff", ""))
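Since the BOM can end up either in a column header or in the cell values, here is a small sketch that scrubs both; the student_name column above is just an example, so this version loops over whatever columns the dataframe actually has:
# strip the BOM from column headers, in case it landed there
df.columns = [col.replace('\ufeff', '') for col in df.columns]
# strip the BOM from every string (object) column
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.replace('\ufeff', '')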

Passing Key,Value into a Function

I want to check a YouTube video's views and keep track of them over time. I wrote a script that works great:
import requests
import re
import pandas as pd
from datetime import datetime
import time

def check_views(link):
    todays_date = datetime.now().strftime('%d-%m')
    now_time = datetime.now().strftime('%H:%M')
    # get the site
    r = requests.get(link)
    text = r.text
    tag = re.compile(r'\d+ views')
    views = re.findall(tag, text)[0]
    # get the digit number of views. It's returned in a list so I need to get that item out
    cleaned_views = re.findall(r'\d+', views)[0]
    print(cleaned_views)
    # append to the df
    df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
    # df = df.append([todays_date, now_time, int(cleaned_views)], axis=0)
    df.to_csv('views.csv')
    return df

df = pd.DataFrame(columns=['Date', 'Time', 'Views'])

while True:
    df = check_views('https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s')
    time.sleep(1800)
But now I want to use this function for multiple links. I want a different CSV file for each link. So I made a dictionary:
link_dict = {'link1': 'https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s',
             'link2': 'https://www.youtube.com/watch?v=ZPrAKuOBWzw'}
# this makes it easy for each csv file to be named for the corresponding link
The loop then becomes:
for key, value in link_dict.items():
    df = check_views(value)
That seems to work passing the value of the dict (link) into the function. Inside the function, I just made sure to load the correct csv file at the beginning:
# existing csv files
df = pd.read_csv(k + '.csv')
But then I'm getting an error when I go to append a new row to the df (“cannot set a row with mismatched columns”). I don't get that, since it works just fine in the code written above. This is the part giving me an error:
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
What am I missing here? It seems like a super messy way using this dictionary method (I only have 2 links I want to check but rather than just duplicate a function I wanted to experiment more). Any tips? Thanks!
Figured it out! The problem was that I was saving the df as a CSV and then trying to read that CSV back in later. When I saved the CSV, I didn't use index=False with df.to_csv(), so there was an extra column! When I was just testing with the dictionary, I was reusing the in-memory df, and even though I was saving it to a CSV, the script kept using the df to do the actual appending of rows.
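For anyone hitting the same thing, here is a minimal sketch of the fixed flow. It assumes the function is changed to also take the dictionary key so it knows which CSV to read and write, and that a '<key>.csv' file with the Date/Time/Views columns already exists; both of those are assumptions, not part of the original code:
def check_views(key, link):
    todays_date = datetime.now().strftime('%d-%m')
    now_time = datetime.now().strftime('%H:%M')
    r = requests.get(link)
    views = re.findall(r'\d+ views', r.text)[0]
    cleaned_views = re.findall(r'\d+', views)[0]
    # read the existing per-link CSV (assumes it already has Date, Time, Views columns)
    df = pd.read_csv(key + '.csv')
    df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
    # index=False keeps the column count stable when the file is read back in
    df.to_csv(key + '.csv', index=False)
    return df

for key, value in link_dict.items():
    check_views(key, value)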

Pass encoding option in Pyspark text method

I have a fixed-length file encoded in ISO-8859-1. Spark 2.4 is not honoring the encoding passed as an option.
Below is the sample code (the chars get corrupted).
g_df = g_spark.read.option("encoding", "ISO-8859-1").text(loc)
g_df.repartition(1).write.csv(path=loc, header="true", mode="overwrite", encoding="ISO-8859-1")
However, when I read it as a CSV file, the chars are stored as expected.
g_df = g_spark.read.option("encoding", "ISO-8859-1").csv(loc)
g_df.repartition(1).write.csv(path=loc, header="true", mode="overwrite", encoding="ISO-8859-1")
It looks like Spark does not support the encoding option for the text method.
As this is a fixed-length file, I cannot use the csv method.
Could you please suggest a way out?
The text method does not honor the encoding, so I tried the RDD APIs, which are working for me.
Below is the sample code:
from pyspark.sql import Row

encoded_rdd = sc.textFile(loc, use_unicode=False).map(lambda x: x.decode("iso-8859-1"))
encoded_df = encoded_rdd.map(Row("val")).toDF()
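Once the lines are decoded, the fixed-width fields still have to be sliced out of the single "val" column. Here is a minimal sketch using substring; the column names and the start positions/lengths are made-up placeholders, since the real record layout isn't given in the question:
from pyspark.sql import functions as F

# hypothetical layout: chars 1-10 = id, 11-40 = name, 41-48 = date
parsed_df = encoded_df.select(
    F.substring("val", 1, 10).alias("id"),
    F.substring("val", 11, 30).alias("name"),
    F.substring("val", 41, 8).alias("date"),
)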

Converting multiple .pdf files with multiple pages into a single .csv file

I am trying to convert .pdf data to a spreadsheet. Based on some research, some people recommended converting it to CSV first in order to avoid errors.
So I wrote the code below, which is giving me:
"TypeError: cannot concatenate object of type ''; only Series and DataFrame objs are valid"
The error appears at the pd.concat command.
import tabula
import pandas as pd
import glob

path = r'C:\Users\REC.AC'
all_files = glob.glob(path + "/*.pdf")
print(all_files)

df = pd.concat(tabula.read_pdf(f1) for f1 in all_files)
df.to_csv("output.csv", index=False)
Since this might be a common issue, I am posting the solution I found.
"""
df = []
for f1 in all_files:
df = pd.concat(tabula.read_pdf(f1))
"""
I believe that breaking the iteration into two steps generates the dataframe pd.concat needs, and therefore it works.
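Note that the loop above rebinds df on every pass, so with more than one PDF only the last file's tables end up in it. Here is a sketch that collects the tables from every file and concatenates them once; it assumes, as the error message suggests, that tabula.read_pdf returns a list of DataFrames (one per detected table):
all_tables = []
for f1 in all_files:
    # read_pdf returns a list of DataFrames, one per table found in the PDF
    all_tables.extend(tabula.read_pdf(f1))

df = pd.concat(all_tables, ignore_index=True)
df.to_csv("output.csv", index=False)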

Download .csv file from GitHub using an httr GET request

I am trying to create an automatic pull in R, using the GET function from the httr package, for a CSV file located on GitHub.
Here is the table I am trying to download.
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv
I can make the connection to the file using the following GET request:
library(httr)
x <- httr::GET("https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
However, I am unsure how to then convert that into a dataframe similar to the table on GitHub.
Any assistance would be much appreciated.
I am new to R but here is my solution.
You need to use the raw version of the csv file from github (raw.githubusercontent.com)!
library(httr)
x <- httr::GET("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
# Save to file
bin <- content(x, "raw")
writeBin(bin, "data.csv")
# Read as csv
dat = read.csv("data.csv", header = TRUE, dec = ",")
colnames(dat) = gsub("X", "", colnames(dat))
# Group by country name (to sum regions)
# Skip the four first columns containing metadata
countries = aggregate(dat[, 5:ncol(dat)], by=list(Country.Region=dat$Country.Region), FUN=sum)
# Here is the table of the most recent total confirmed cases
countries_total = countries[, c(1, ncol(countries))]
How I got this to work:
How to sum a variable by group
This is as simple as:
res <- httr::GET("https://.../file.csv")
data <- httr::content(res, "parsed")
This requires the readr package.
See https://httr.r-lib.org/reference/content.html
