Pandas 0.17.1 dataframe manipulation: referring to a column and writing to csv - python-3.x

Hi friends, I am new to programming and I am having a difficult time accessing this dataframe that I read from an HTML page.
I just want to print a specific column, but because the columns are unnamed, everything I've tried to access them throws errors. This is what I've tried so far to print them out:
print [data{'Unnamed: 0'}]
[ Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 \
0 Pinnacle NaN NaN NaN NaN
1 10/25 02:06:10pm NaN #101 Miami NaN
2 10/25 02:06:10pm NaN #102 New England NaN
3 10/25 04:40:03pm NaN #101 Miami NaN
4 10/25 04:40:04pm NaN #102 New England 8½-05
5 10/25 04:40:12pm NaN #101 Miami NaN
6 10/25 04:40:12pm NaN #102 New England 8½ev
Also, I tried to write them to a CSV file and I get this error:
AttributeError: 'list' object has no attribute 'to_csv'
and here is the code I am using:
import pandas as pd
url = 'http://exoweb0.donbest.com/checkHTML/servlet/send_archive_history.servlet?by_casino=1&for_casino=37&league=1&game=0&date=20151029'
data = pd.read_html(url, header=0)
data.to_csv('Pinacle Lines.csv', index_col=0)
#print (data)

Try this:
data = pd.read_html(url, header=0)[0]
You get a list of DataFrames back when you call read_html, so you need to figure out which one you want. The edit above selects the first one; you might want to look at all of them.
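Putting it together, a minimal sketch (the choice of table index is an assumption; inspect the list first to find the table you need):
import pandas as pd

url = 'http://exoweb0.donbest.com/checkHTML/servlet/send_archive_history.servlet?by_casino=1&for_casino=37&league=1&game=0&date=20151029'

# read_html returns a list with one DataFrame per <table> found on the page
tables = pd.read_html(url, header=0)
print(len(tables))               # see how many tables were parsed

data = tables[0]                 # pick the table you want
print(data['Unnamed: 0'])        # column access uses square brackets, not braces

# note: to_csv takes `index`, not `index_col` (that one belongs to read_csv)
data.to_csv('Pinacle Lines.csv', index=False)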

Related

create pandas column using cell values of another multi indexed data frame [closed]

I have a multi-indexed data frame as in the attached image. Now I need to create a new data frame where each column name is one of the unique room numbers.
For example, one expected output from the code would be as follows:
N.B. I want to avoid for loops to save memory and time. What would be the optimal way to get the desired output?
I have tried using for loops and could get the desired output, but I am not sure it's a good idea for a large dataset. Here is the code snippet:
import numpy as np
import pandas as pd

d = np.array(['624: COUPLE , 507: DELUXE+ ,301: HONEYMOON',
              '624:FAMILY , 507: FAMILY+',
              '621:FAMILY , 517: FAMILY+',
              '696:FAMILY , 585: FAMILY+,624:FAMILY , 507: DELUXE'])
df = pd.Series(d)
df = df.str.extractall(r'(?P<room>[0-9]+):\s*(?P<grd>[^\s,]+)')
gh = df[df['room'] == '507'].index
rf = pd.DataFrame(index=range(0, 4), columns=['room#507', 'room#624'],
                  dtype='float')
for i in range(0, rf.shape[0]):
    for j in range(0, gh.shape[0]):
        if i == gh[j][0]:
            rf['room#507'][i] = df.grd[gh[j][0]][gh[j][1]]
Use DataFrame.reset_index with DataFrame.pivot:
df = df.str.extractall(r'(?P<room>[0-9]+):\s*(?P<grd>[^\s,]+)')
df = df.reset_index(level=1, drop=True).reset_index()
df = df.pivot('index', 'room', 'grd').add_prefix('room_')
print(df)
room room_301 room_507 room_517 room_585 room_621 room_624 room_696
index
0 HONEYMOON DELUXE+ NaN NaN NaN COUPLE NaN
1 NaN FAMILY+ NaN NaN NaN FAMILY NaN
2 NaN NaN FAMILY+ NaN FAMILY NaN NaN
3 NaN DELUXE NaN FAMILY+ NaN FAMILY FAMILY
Or DataFrame.set_index with Series.unstack:
df = df.str.extractall(r'(?P<room>[0-9]+):\s*(?P<grd>[^\s,]+)')
df = (df.reset_index(level=1, drop=True)
        .set_index('room', append=True)['grd']
        .unstack()
        .add_prefix('room_'))
print(df)
room room_301 room_507 room_517 room_585 room_621 room_624 room_696
0 HONEYMOON DELUXE+ NaN NaN NaN COUPLE NaN
1 NaN FAMILY+ NaN NaN NaN FAMILY NaN
2 NaN NaN FAMILY+ NaN FAMILY NaN NaN
3 NaN DELUXE NaN FAMILY+ NaN FAMILY FAMILY
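If you only need the two columns the question asked about, a small follow-up sketch (column names assumed from the pivoted frame above):
# keep only the requested rooms and match the question's naming
out = df[['room_507', 'room_624']].rename(columns={'room_507': 'room#507',
                                                   'room_624': 'room#624'})
print(out)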

How do I remove nan values from a dataframe in Python? dropna() does not seem to be working for me

How do I remove nan values from a dataframe in Python? I already tried dropna(), but that did not work for me. Also, is NaN different from nan? I am using Pandas.
When I print the data frame, the missing values print as nan instead of NaN.
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
Those values are the string 'nan' rather than real missing values, which is why dropna() ignores them. You can replace them with NaN using replace() and then use dropna().
import numpy as np
df = df.replace('nan', np.nan)
df = df.dropna()
Update:
Original dataframe:
1 2.11358 0.649067060588935
2 nan 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 nan
Applied df.replace('nan', np.nan):
1 2.11358 0.649067060588935
2 NaN 0.6094130485307419
3 2.10066 0.3653980276694516
4 2.10545 NaN
Applied df.dropna():
1 2.11358 0.649067060588935
3 2.10066 0.3653980276694516
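As an alternative (assuming every column in your frame is meant to be numeric), pd.to_numeric with errors='coerce' turns any unparseable value, including the string 'nan', into a real NaN in one pass:
import pandas as pd

# coerce every column to numeric; strings like 'nan' become real NaN,
# which dropna() then removes
df = df.apply(pd.to_numeric, errors='coerce').dropna()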

Trying to append a single row of data to a pandas DataFrame, but instead adds rows for each field of input

I am trying to add a row of data to a pandas DataFrame, but it keeps adding a separate row for each piece of data. I feel I am missing something very simple and obvious, but what it is I do not know.
import pandas
colNames = ["ID", "Name", "Gender", "Height", "Weight"]
df1 = pandas.DataFrame(columns = colNames)
df1.set_index("ID", inplace=True, drop=False)
i = df1.shape[0]
person = [{"ID":i},{"Name":"Jack"},{"Gender":"Male"},{"Height":177},{"Weight":75}]
df1 = df1.append(pandas.DataFrame(person, columns=colNames))
print(df1)
Output:
ID Name Gender Height Weight
0 0.0 NaN NaN NaN NaN
1 NaN Jack NaN NaN NaN
2 NaN NaN Male NaN NaN
3 NaN NaN NaN 177.0 NaN
4 NaN NaN NaN NaN 75.0
You are using too many curly braces. All of your data should be inside one pair of curly braces, which creates a single Python dictionary. Change that line to:
person = [{"ID":i,"Name":"Jack","Gender":"Male","Height":177,"Weight":75}]

How to compare PANDAS Columns in a Dataframes to find all entries appearing in different columns?

Full disclosure: I'm fairly new to Python and discovered pandas today.
I created a DataFrame from two CSV files: one is the results of a robot scanning barcode IDs, and the other is a list of instructions for the robot to execute.
import pandas as pd
#import csv file and read the column containing plate IDs scanned by Robot
scancsvdata = pd.read_csv(r"G:\scan.csv", header=None, sep=';', skiprows=1, usecols=[6])
#Rename Column to Plates Scanned
scancsvdata.columns = ["IDs Scanned"]
#Remove any Duplicate Plate IDs
scancsvdataunique = scancsvdata.drop_duplicates()
#import the Worklist to be executed CSV file and read the Source Column to find required Plates
worklistdataSrceID = pd.read_csv(r"G:\TestWorklist.CSV", usecols=["SrceID"])
#Rename SrceID Column to Plates Required
worklistdataSrceID.rename(columns={'SrceID':'IDs required'}, inplace=True)
#remove duplicates from Plates Required
worklistdataSrceIDunique = worklistdataSrceID.drop_duplicates()
#import the Worklist to be executed CSV file and read the Destination Column to find required Plates
worklistdataDestID = pd.read_csv(r"G:\TestWorklist.CSV", usecols=["DestID"])
#Rename DestID Column to Plates Required
worklistdataDestID.rename(columns={'DestID':'IDs required'}, inplace=True)
#remove duplicates from Plates Required
worklistdataDestIDunique = worklistdataDestID.drop_duplicates()
#Combine into one Dataframe
AllData = pd.concat([scancsvdataunique, worklistdataSrceIDunique, worklistdataDestIDunique], sort=True)
print(AllData)
The resulting Dataframe lists IDs scanned in Column 1 and IDs required in Column 2.
IDs Scanned IDs required
0 1024800.0 NaN
1 1024838.0 NaN
2 1024839.0 NaN
3 1024841.0 NaN
4 1024844.0 NaN
5 1024798.0 NaN
6 1024858.0 NaN
7 1024812.0 NaN
8 1024797.0 NaN
9 1024843.0 NaN
10 1024840.0 NaN
11 1024842.0 NaN
12 1024755.0 NaN
13 1024809.0 NaN
14 1024810.0 NaN
15 8656.0 NaN
16 8657.0 NaN
17 8658.0 NaN
0 NaN 1024800.0
33 NaN 1024843.0
0 NaN 8656.0
7 NaN 8657.0
15 NaN 8658.0
How would I go about ensuring that all the IDs in the 'IDs required' column appear in the 'IDs Scanned' column?
Ideally the result of the comparison above would be a generic message like 'All IDs found'.
If different csv files were used and the Dataframe was as follows
IDs Scanned IDs required
0 1024800.0 NaN
1 1024838.0 NaN
2 1024839.0 NaN
3 1024841.0 NaN
4 1024844.0 NaN
5 1024798.0 NaN
6 1024858.0 NaN
7 1024812.0 NaN
8 1024797.0 NaN
9 1024843.0 NaN
10 1024840.0 NaN
11 1024842.0 NaN
12 1024755.0 NaN
13 1024809.0 NaN
14 1024810.0 NaN
15 8656.0 NaN
16 8657.0 NaN
17 8658.0 NaN
0 NaN 2024800.0
33 NaN 2024843.0
0 NaN 8656.0
7 NaN 8657.0
15 NaN 8658.0
Then the result of the comparison would be the list of the missing IDs, 2024800 and 2024843.
To check True/False whether every required ID appears in the scanned column (note that `in` on a Series tests the index, not the values, so compare against .values; dropna() skips the NaN padding from the concat):
all(item in df["IDs Scanned"].values for item in df["IDs required"].dropna().unique())
To get a sorted list of the unique missing items:
sorted(set(df["IDs required"].dropna()) - set(df["IDs Scanned"]))
Or using pandas syntax to return a DataFrame filtered to rows where IDs required are not found in IDs Scanned:
df.loc[~df["IDs required"].isin(df["IDs Scanned"])]
missing_ids = df.loc[~df['IDs required'].isin(df['IDs Scanned']), 'IDs required']
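Combining these into the generic message the question asked for, a small sketch (dropna() again skips the NaN padding in the combined frame):
missing = df.loc[~df["IDs required"].isin(df["IDs Scanned"]), "IDs required"].dropna()
if missing.empty:
    print("All IDs found")
else:
    print("Missing IDs:", sorted(missing.unique()))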

Pandas append returns DF with NaN values

I'm appending data from a list to a pandas DataFrame, but I keep getting NaN in my entries.
Based on what I've read, I think I might have to specify the data type for each column in my code.
dumps = []
features_df = pd.DataFrame()
for i in range(int(len(ids) / 50)):
    dumps = sp.audio_features(ids[i * 50:50 * (i + 1)])
    for i in range(len(dumps)):
        print(list(dumps[0].values()))
        features_df = features_df.append(list(dumps[0].values()), ignore_index=True)
Expected results, something like-
[0.833, 0.539, 11, -7.399, 0, 0.178, 0.163, 2.1e-06, 0.101, 0.385, 99.947, 'audio_features', '6MWtB6iiXyIwun0YzU6DFP', 'spotify:track:6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/tracks/6MWtB6iiXyIwun0YzU6DFP', 'https://api.spotify.com/v1/audio-analysis/6MWtB6iiXyIwun0YzU6DFP', 149520, 4]
for one row.
Actual-
danceability energy ... duration_ms time_signature
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
For all rows
Calling append() in a tight loop isn't a great way to do this. Instead, you can construct an empty DataFrame with the columns you need and then use loc to assign each row at a specific label in the index.
For example:
import pandas as pd
df = pd.DataFrame(data=[], columns=['n'])
for i in range(100):
    df.loc[i] = i
print(df)
time python3 append_df.py
n
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
real 0m13.178s
user 0m12.287s
sys 0m0.617s
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
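A minimal sketch of that collect-then-build pattern, using the same toy example as the loc loop above:
import pandas as pd

rows = []
for i in range(100):
    rows.append({'n': i})      # accumulate plain dicts in a Python list

# build the DataFrame once at the end instead of growing it row by row
df = pd.DataFrame(rows, columns=['n'])
print(df)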
