How to solve this data set - python-3.x

I am importing a Data set from quandl using API. Everything is perfect, however the time series I am importing is reversed. By this I mean if I used the .head method to print the first elements in the data set, I will get the latest Data set figures and printing the tail will get oldest figures
df = pd.read_csv("https://www.quandl.com/api/v3/datasets/CHRIS/CME_CD4.csv?api_key=H32H8imfVNVm9fcEX6kB",parse_dates=['Date'],index_col='Date')
df.head()

This should be a pretty easy fix if I understand. Credit to behzad.nouri on this answer Right way to reverse pandas.DataFrame?.
You just need to reverse the order of your dataframe using the line below.
df = df.reindex(index=df.index[::-1])

Related

Getting a list of VALUES rather than CLASSES in sqlalchemy

This problem seems very simple, but I'm having trouble finding an already existing solution on StackOverflow.
When I run a sqlalchemy command like the following
valid_columns = db.session.query(CrmLabels).filter_by(user_id=client_id).first()
I get back a CrmLabels object that is not iterable. If I print this object, I get a list
[Convert Source, Convert Medium, Landing Page]
But this is not iterable. I would like to get exactly what I've shown above, except as a list of strings
['Convert Source', 'Convert Medium', 'Landing Page']
How can I run a query that will return this result?
Below change should do it:
valid_columns = (
db.session.query(CrmLabels).filter_by(user_id=client_id)
.statement.execute() # just add this
.first()
)
However, you need to be certain about the order of columns, and you can use valid_columns.keys() to make sure the values are in the expected order.
Alternatively, you can create a dictionary using dict(valid_columns.items()).

gmplot Marker does not work after it marks 256 points

I am trying to mark a bunch of points on the map with gmplot and observed that after a certain point it stops marking and wipes out all the previously marked points. I debugged the gmplot.py module and saw that when the length of points array exceeds 256 this is happening without giving any error and warning.
self.points = [] on gmplot.py
Since I am very new to Python and OOPs concept, is there a way to override this and mark more than 256 points?
Are you using gmplot.GoogleMapPlotter.Scatter or gmplot.GoogleMapPlotter.Marker. I used either and was able to get 465 points for a project that I was working on. Is it possible it is an API key issue for you?
partial snippet of my code
import gmplot
import pandas as pd
# df is the database with Lat, Lon and formataddress columns
# change to list, not sure you need to do this. I think you can cycle through
# directly using iterrows. I have not tried that though
latcollection=df['Lat'].tolist()
loncollection=df['Lon'].tolist()
addcollection=df['formataddress'].tolist()
# center map with the first co-ordinates
gmaps2 = gmplot.GoogleMapPlotter(latcollection[0],loncollection[0],13,apikey='yourKey')
for i in range(len(latcollection)):
gmaps2.marker(latcollection[i],loncollection[i],color='#FF0000',c=None,title=str(i)+' '+ addcollection[i])
gmaps2.draw(newdir + r'\laplot_marker_full.html')
I could hover over the 465th point since I knew approximately where it was and I was able to get the title with str(464) <formataddress(464)>, since my array is indexed from 0
Make sure you check the GitHub site to modify your gmplot file, in case you are working with windows.

Is this example data cleaning code updating the pandas dataframe? [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 3 years ago.
In this article on predicting values with linear regression there's a cleaning step
# For beginning, transform train['FullDescription'] to lowercase using text.lower()
train['FullDescription'].str.lower()
# Then replace everything except the letters and numbers in the spaces.
# it will facilitate the further division of the text into words.
train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)
This isn't actually assigning the changes to the dataframe, is it? But if I try something like this...
train['FullDescription'] = train['FullDescription'].str.lower()
train['FullDescription'] = train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)
Then I get a warning...
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
What's the right way to apply these transformations? Are they in fact already being applied? A print(train['FullDescription']) seems to show me they're not.
EDIT: #EdChum and #jezrael are very much onto something about missing code. When I'm actually trying to run this, my data needs to be split into test and train sets.
from sklearn.model_selection import train_test_split
all_data = pandas.read_csv('salary.csv')
train, test = train_test_split(all_data, test_size=0.1)
That's what seems to be causing this error. If I make the next line
train = train.copy()
test = test.copy()
then everything is happy.
You may be wondering if I shouldn't then just apply this step to all_data, which works, but then lower down in the code train['Body'].fillna('nan', inplace=True) still causes an error. So it seems indeed the problem is with train_test_split not creating copies.
The right way to apply these transformations would be...
df.loc[:, 'FullDescription'] = ...
More informations about this would be here. This is a page from the pandas documentation, all the way to the bottom. Quoting...
def do_something(df):
foo = df[['bar', 'baz']] # Is foo a view? A copy? Nobody knows!
# ... many lines here ...
# We don't know whether this will modify df or not!
foo['quux'] = value
return foo
You can also find extra-reasons of Why to use .loc here. Long story short : Explicit is better than implicit. And while df['some_column'] is not immediatly clear about the intent, usingdf.loc['some_column'] is.
I don't really know how to explain it in a simple way, but if you have further questions or if you think I could make my answer more explicit/eloquent, tell me. :)

How to remove special usual characters from a pandas dataframe using Python

I have a file some crazy stuff in it. It looks like this:
I attempted to get rid of it using this:
df['firstname'] = map(lambda x: x.decode('utf-8','ignore'), df['firstname'])
But I wound up with this in my dataframe: <map object at 0x0000022141F637F0>
I got that example from another question and this seems to be the Python3 method for doing this but I'm not sure what I'm doing wrong.
Edit: For some odd reason someone thinks that this has something to do with getting a map to return a list. The central issue is getting rid of non UTF-8 characters. Whether or not I'm even doing that correctly has yet to be established.
As I understand it, I have to apply an operation to every character in a column of the dataframe. Is there another technique or is map the correct way and if it is, why am I getting the output I've indicated?
Edit2: For some reason, my machine wouldn't let me create an example. I can now. This is what i'm dealing with. All those weird characters need to go.
import pandas as pd
data = [['🦎Ale','Αλέξανδρα'],['��Grain','Girl🌾'],['Đỗ Vũ','ên Anh'],['Don','Johnson']]
df = pd.DataFrame(data,columns=['firstname','lastname'])
print(df)
Edit 3: I tired doing this using a reg ex and for some reason, it still didn't work.
df['firstname'] = df['firstname'].replace('[^a-zA-z\s]',' ')
This regex works FINE in another process, but here, it still leaves the ugly characters.
Edit 4: It turns out that it's image data that we're looking at.

How to filter a CSV file without Pandas? (Best Substitute for Pandas in Pythonista)

I am trying to do some data analysis on Pythonista 3 (iOS app for python), however because of the C libraries of pandas it does not compile in the iOS device.
Is there any substitute for Pandas?
Would numpy be an option for data of type string?
The data set I have at the moment is the history of messages between my friends and I.
The whole history is in one csv file. Each row has the columns 'day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'
The goal of the analysis is to produce a report of our chat for the past year.
I want be able to count number of messages each friend sent. I want to be able to plot a histogram of the hours in which the messages where sent by each friend.
Then, I want to do some word counting individually and as a group.
In Pandas I know how to do that. For example:
df = read_csv("messages.csv")
number_of_messages_friend1 = len(df[df.author_of_message == 'friend1']
How can I filter a csv file without Pandas?
Since Pythonista does have numpy, you will want to look at recarrays, which are numpy's approach to this type of problem. The following worked out of the box in Pythonista for me:
import numpy as np
df=np.recfromcsv('messages.csv')
len(df[df.author_of_message==b'friend1'])
Depending on your data format, tou may find that recsfromcsv "just works", since it tries to guess data types, or you might need to customize things a bit. See genfromtext for a number of options, such as explictly specifying data types or for using converters for converting string dates to datetime objects. recsfromcsv is just a convienece wrapper around genfromtext
https://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html#
Once in recarray, many of the simple indexing operations work the same as in pandas. Note you may need to do string compares using b-prefixed strings (bytes objects), unless you convert to unicode strings, as shown above.
Use the csv module from the standard library to read the messages.
You could store it into a list of collections.namedtuple for easy access.
import csv
messages = []
with open('messages.csv') as csvfile:
reader = csv.DictReader(csvfile, fieldnames=('day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'))
for row in reader:
messages.append(row)
That gives you all the messages as a list of dictionaries.
Alternatively you could use a normal csv reader combined with a collections.namedtuple to make a list of named tuples, which are slightly easier to access.
import csv
from collections import namedtuple
Msg = namedtuple('Msg', ('day_of_the_week', 'date', 'time_of_message', 'author_of_message', 'message_body'))
messages = []
with open('messages.csv') as csvfile:
msgreader = csv.reader(csvfile)
for row in msgreader:
messages.append(Msg(*row))
Pythonista now has competition on iOS. The pyto app provides python 3.8 with pandas. https://apps.apple.com/us/app/pyto-python-3-8

Resources