Read CSV using pandas - python-3.x

I'm trying to read data from https://download.bls.gov/pub/time.series/bp/bp.measure using pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t')
However, I just need to get the dataset with the two columns: measure_code and measure_text. As this dataset as a title BP measure I also tried to read it like:
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
df = pd.read_csv(url, sep='\t', skiprows=1)
But in this case it returns a dataset with just one column and I'm not being able to slipt it:
>>> df.columns
Index([' measure_code measure_text'], dtype='object')
Any suggestion/idea on a better approach to get this dataset?

It's definitely possible, but the format has a few quirks.
As you noted, the column headers start on line 2, so you need skiprows=1.
The file is space-separated, not tab-separated.
Column values are continued across multiple lines.
Issues 1 and 2 can be fixed using skiprows and sep. Problem 3 is harder, and requires you to preprocess the file a little. For that reason, I used a slightly more flexible way of fetching the file, using the requests library. Once I have the file, I use regular expressions to fix problem 3, and give the file back to pandas.
Here's the code:
import requests
import re
import io
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/bp/bp.measure'
# Get the URL, convert the document from DOS to Unix linebreaks
measure_codes = requests.get(url) \
.text \
.replace("\r\n", "\n")
# If there's a linebreak, followed by at least 7 spaces, combine it with
# previous line
measure_codes = re.sub("\n {7,}", " ", measure_codes)
# Convert the string to a file-like object
measure_codes = io.BytesIO(measure_codes.encode('utf-8'))
# Read in file, interpreting 4 spaces or more as a delimiter.
# Using a regex like this requires using the slower Python engine.
# Use skiprows=1 to skip the header
# Use dtype="str" to avoid converting measure code to integer.
df = pd.read_csv(measure_codes, engine="python", sep=" {4,}", skiprows=1, dtype="str")
print(df)

Related

Store 2 different encoded data in a file in python

I have 2 types of encoded data
ibm037 encoded - a single delimiter variable - value is ###
UTF8 encoded - a pandas dataframe with 100s of columns.
Example dataframe:
Date Time
1 2
My goal is to write this data into a python file. The format should be:
### 1 2
In this way I need to have all the rows of the dataframe in a python file where the 1st character for every line is ###.
I tried to store this character at the first location in the pandas dataframe as a new column and then write to the file but it throws error saying that two different encodings can't be written to a file.
Tried another way to write it:
df_orig_data = pandas dataframe,
Record_Header = encoded delimiter
f = open("_All_DelimiterOfRecord.txt", "a")
for row in df_orig_data.itertuples(index=False):
f.write(Record_Header)
f.write(str(row))
f.close()
It also doesn't work.
Is this kind of data write even possible? How can I write these 2 encoded data in 1 file?
Edit:
StringData = StringIO(
"""Date,Time
1,2
1,2
"""
)
df_orig_data = pd.read_csv(StringData, sep=",")
Record_Header = "2 "
f = open("_All_DelimiterOfRecord.txt", "a")
for index, row in df_orig_data.iterrows():
f.write(
"\t".join(
[
str(Record_Header.encode("ibm037")),
str(row["Date"]),
str(row["Time"]),
]
)
)
f.close()
I would suggest doing the encoding yourself, and writing a bytes object to the file. This isn't a situation where you can rely on the built-in encoding do it.
That means that the program opens the file in binary mode (ab), all of the constants are byte-strings, and it works with byte-strings whenever possible.
The question doesn't say, but I assumed you probably wanted a UTF8 newline after each line, rather than an IBM newline.
I also replaced the file handling with a context manager, since that makes it impossible to forget to close a file after you're done.
import io
import pandas as pd
StringData = io.StringIO(
"""Date,Time
1,2
1,2
"""
)
df_orig_data = pd.read_csv(StringData, sep=",")
Record_Header = "2 "
with open("_All_DelimiterOfRecord.txt", "ab") as f:
for index, row in df_orig_data.iterrows():
f.write(Record_Header.encode("ibm037"))
row_bytes = [str(cell).encode('utf8') for cell in row]
f.write(b'\t'.join(row_bytes))
# Note: this is an UTF8 newline, not an IBM newline.
f.write(b'\n')

Passing Key,Value into a Function

I want to check a YouTube video's views and keep track of them over time. I wrote a script that works great:
import requests
import re
import pandas as pd
from datetime import datetime
import time
def check_views(link):
todays_date = datetime.now().strftime('%d-%m')
now_time = datetime.now().strftime('%H:%M')
#get the site
r = requests.get(link)
text = r.text
tag = re.compile('\d+ views')
views = re.findall(tag,text)[0]
#get the digit number of views. It's returned in a list so I need to get that item out
cleaned_views=re.findall('\d+',views)[0]
print(cleaned_views)
#append to the df
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
#df = df.append([todays_date, now_time, int(cleaned_views)],axis=0)
df.to_csv('views.csv')
return df
df = pd.DataFrame(columns=['Date','Time','Views'])
while True:
df = check_views('https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s')
time.sleep(1800)
But now I want to use this function for multiple links. I want a different CSV file for each link. So I made a dictionary:
link_dict = {'link1':'https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s',
'link2':'https://www.youtube.com/watch?v=ZPrAKuOBWzw'}
#this makes it easy for each csv file to be named for the corresponding link
The loop then becomes:
for key, value in link_dict.items():
df = check_views(value)
That seems to work passing the value of the dict (link) into the function. Inside the function, I just made sure to load the correct csv file at the beginning:
#Existing csv files
df=pd.read_csv(k+'.csv')
But then I'm getting an error when I go to append a new row to the df (“cannot set a row with mismatched columns”). I don't get that since it works just fine as the code written above. This is the part giving me an error:
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
What am I missing here? It seems like a super messy way using this dictionary method (I only have 2 links I want to check but rather than just duplicate a function I wanted to experiment more). Any tips? Thanks!
Figured it out! The problem was that I was saving the df as a csv and then trying to read back that csv later. When I saved the csv, I didn't use index=False with df.to_csv() so there was an extra column! When I was just testing with the dictionary, I was just reusing the df and even though I was saving it to a csv, the script kept using the df to do the actual adding of rows.

Pandas skip header until key found

I am importing a bunch of data files. The format of the files has changed over the years, accumulating more "header" material without any identifying "comment", which makes it hard to know how many lines to skip.
Is there a way in pandas to skip rows until the desired column names are encountered:
import pandas as pd
import os
my_names=['A','B','C']
max_head=30
my_file=os.path.join(my_file)
f=open(my_file,'r')
lines=f.readlines()
for i,line in enumerate(lines[:max_head]):
if line.strip().split()==my_names:
skiprows=i
a=pd.read_csv(my_file,skiprows=skiprows)
And if not, should there be? Something like:
pd.read_csv(my_file,start_names=my_names)
You can do this in two parts. First find out the row where the matching header is then use that in your pd.read_csv
def header(file_name):
with open(file_name) as f:
for n,line in enumerate(f):
if line.startswith("whatever_the_header_name_is"):
return n
You can now pass the above function into read_csv like this
pd.read_csv (file_name, header=header(file_name))

Converting multiple .pdf files with multiple pages into 1 single .csv file

I am trying to convert .pdf data to a spreadsheet. Based on some research, some guys recommended transforming it into csv first in order to avoid errors.
So, I made the below coding which is giving me:
"TypeError: cannot concatenate object of type ''; only Series and DataFrame objs are valid"
Error appears at 'pd.concat' command.
'''
import tabula
import pandas as pd
import glob
path = r'C:\Users\REC.AC'
all_files = glob.glob(path + "/*.pdf")
print (all_files)
df = pd.concat(tabula.read_pdf(f1) for f1 in all_files)
df.to_csv("output.csv", index = False)
'''
Since this might be a common issue, I am posting the solution I found.
"""
df = []
for f1 in all_files:
df = pd.concat(tabula.read_pdf(f1))
"""
I believe that breaking the item iteration in two parts would generate the dataframe it needed and therefore would work.

CSV to Pythonic List

I'm trying to convert a CSV file into Python list I have strings organize in columns. I need an Automation to turn them into a list.
my code works with Pandas, but I only see them again as simple text.
import pandas as pd
data = pd.read_csv("Random.csv", low_memory=False)
dicts = data.to_dict().values()
print(data)
so the final results should be something like that : ('Dan', 'Zac', 'David')
You can simply do this by using csv module in python
import csv
with open('random.csv', 'r') as f:
reader = csv.reader(f)
your_list = map(list, reader)
print your_list
You can also refer here
If you really want a list, try this:
import pandas as pd
data = pd.read_csv('Random.csv', low_memory=False, header=None).iloc[:,0].tolist()
This produces
['Dan', 'Zac', 'David']
If you want a tuple instead, just cast the list:
data = tuple(pd.read_csv('Random.csv', low_memory=False, header=None).iloc[:,0].tolist())
And this produces
('Dan', 'Zac', 'David')
I assumed that you use commas as separators in your csv and your file has no header. If this is not the case, just change the params of read_csv accordingly.

Resources