I have a csv file with a number of columns in it. It is for students. I want to display only male students and their names. I used 1 for male students and 0 for female students. My code is:
import pandas as pd
data = pd.read_csv('normalizedDataset.csv')
results = pd.concat([data['name'], ['students']==1])
print results
I have got this error:
TypeError: cannot concatenate a non-NDFrame object
Can anyone help please. Thanks.
You can read only the columns you need by passing usecols to read_csv when you load the file. Then use loc to keep only the rows where students equals 1.
data = pd.read_csv('normalizedDataset.csv', usecols=['name', 'students'])
data = data.loc[data.students == 1, :]
BTW, your original error occurs because ['students']==1 compares a plain list with 1, which evaluates to False, so you are trying to concatenate a Series with False.
>>> ['students']==1
False
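As a minimal sketch of the filtering described above, the example below uses a small in-memory frame as a stand-in for normalizedDataset.csv. The names and the extra age column are made up for illustration; only the name and students columns come from the question.

```python
import pandas as pd

# Hypothetical stand-in for normalizedDataset.csv; the real file has more columns.
data = pd.DataFrame({
    "name": ["Adam", "Beth", "Carl", "Dina"],
    "students": [1, 0, 1, 0],   # 1 = male, 0 = female, as in the question
    "age": [20, 21, 22, 23],    # assumed extra column, not from the question
})

# The comparison has to be made on the column, not on the list ['students'].
is_male = data["students"] == 1     # boolean Series, one entry per row

# Select the matching rows and only the 'name' column in one step.
male_names = data.loc[is_male, "name"]

print(male_names.tolist())
```

With the real file, the same two lines should work after `data = pd.read_csv('normalizedDataset.csv')`.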
No need to concat, you're stripping things away, not building.
Try:
data[data['students']==1]['name']
To provide clarity on why you were getting the error:
The second thing you were trying to concat was:
['students']==1
which is not an NDFrame object (it simply evaluates to False). You'd want to replace that with:
data[data['students']==1]['students']
I have a text file (Player_hits.txt) that I am trying to pull player batting averages from. Similar to lines 179-189, I want to find an average. However, I do not want the average for the entire team. Instead, I want the average for each individual player on the team.
For instance, the text file is set up as such:
Player_hits.txt
In this file a 1 defines a hit and a 0 means the player did not get a hit. I am trying to pull an individual average for both players. (Alex = 0.500, Riley = 0.666)
If someone could help, that would be greatly appreciated!
Thanks!
Link to original code on repl.it: Baseball Stat-Tracking
JSONDecodeError Image
The json.decoder.JSONDecodeError is raised because json.loads() doesn't accept each line (for example "[1, 'Riley']\n") as valid JSON: JSON strings must use double quotes, not single quotes. You can use ast.literal_eval to read each line as a Python literal instead, which stores it as a list element such as [1, 'Riley'] in your p_hits list.
The second part is that you can convert the result to a DataFrame and group by the 'name' column. So jim has the right idea, but there are errors in that answer too (i.e. colmuns should be columns, and the items in the list need to be the strings ['hit', 'name'], not undeclared variables).
import pandas as pd
import ast
p_hits = []
with open('Player_hits.txt') as hits:
    for line in hits:
        l = ast.literal_eval(line)
        p_hits.append(l)
df = pd.DataFrame(p_hits, columns=['hit', 'name'])
Output (with an example dataset I made):
print(df.groupby(['name']).mean())
            hit
name
Matt   0.714286
Riley  0.285714
Todd   0.500000
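To make the difference between json.loads and ast.literal_eval concrete, here is a small self-contained sketch. The file contents are an assumption: they are simulated with io.StringIO and chosen so the averages match the ones the question expects (Alex = 0.500, Riley = 0.666).

```python
import ast
import io
import json

import pandas as pd

# Assumed file layout: one Python-style list per line, simulated in memory.
sample = io.StringIO(
    "[1, 'Alex']\n[0, 'Alex']\n[1, 'Riley']\n[1, 'Riley']\n[0, 'Riley']\n"
)

# json.loads rejects single-quoted strings, which is what triggers the error.
try:
    json.loads("[1, 'Riley']")
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval parses the same text as a Python literal.
p_hits = [ast.literal_eval(line) for line in sample if line.strip()]

df = pd.DataFrame(p_hits, columns=["hit", "name"])
averages = df.groupby("name")["hit"].mean()   # one batting average per player
print(averages)
```

Selecting the hit column before calling mean() returns a Series indexed by player name, so a single player can be looked up with averages['Alex'].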
import pandas as pd
import json
p_hits = []
with open('Player_hits.txt') as hits:
    for line in hits:
        l = json.loads(line)
        p_hits.append(l)
df = pd.DataFrame.from_records(p_hits, colmuns=[hit, name])
df.groupby(['name']).mean()
I am trying to build a simple random item generator for a game I am working on.
So far I am stuck trying to figure out how to store and access all of the data. I went with pandas using .csv files to store the data sets.
I want to add weighted probabilities to what items are generated so I tried to read the csv files and compile each list into a new set.
I got the program to pick a random set but got stuck when trying to pull a random row from that set.
I am getting an error when I use .sample() to pull the item row which makes me think I don't understand how pandas works. I think I need to be creating new lists so I can later index and access the various statistics of the items once one is selected.
Once I pull the item I was intending on adding effects that would change the damage and armor and such displayed. So I was thinking of having the new item be its own list then use damage = item[2] + 3 or whatever I need
error is: AttributeError: 'list' object has no attribute 'sample'
Can anyone help with this problem? Maybe there is a better way to set up the data?
here is my code so far:
import pandas as pd
import random
df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]
def get_item():
    item_class = [random.choices(df, weights=(45,40,15), k=1)] #this part seemed to work. When I printed item_class it printed one of the entire lists at the correct odds
    item = item_class.sample()
    print (item) #to see if the program is working

get_item()
I think you are confusing a list with its elements: random.choices() returns a list, and it is the DataFrame inside that list that has a .sample() method. This should work. I stubbed your dfs with simple ones.
import pandas as pd
import random
# Actual data. Comment it out if you do not have the csv files
df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]
# My stubs -- uncomment and use this instead of the line above if you want to run this specific example
# df = [pd.DataFrame({'weapons' : ['w1','w2']}), pd.DataFrame({'armor' : ['a1','a2', 'a3']}), pd.DataFrame({'aether' : ['e1','e2', 'e3', 'e4']})]
def get_item():
    # I removed [] from the line below -- choices() already returns a list of length 1
    item_class = random.choices(df, weights=(45,40,15), k=1)
    # I added [0] to choose the first element of item_class which is a list of length 1 from the line above
    item = item_class[0].sample()
    print (item) #to see if the program is working

get_item()
This prints a random row from a randomly chosen dataframe among the stubs I set up, such as:
  weapons
1      w2
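The question also asks how to read individual statistics from the selected item. A minimal sketch follows, assuming each CSV has name, damage and armor columns; these column names and the sample rows are invented for illustration and are not from the original files.

```python
import random

import pandas as pd

# Hypothetical stand-ins for weapons.csv, armor.csv and aether_infused.csv.
tables = [
    pd.DataFrame({"name": ["sword", "axe"], "damage": [5, 7], "armor": [0, 0]}),
    pd.DataFrame({"name": ["helm", "shield"], "damage": [0, 1], "armor": [3, 6]}),
    pd.DataFrame({"name": ["aether blade"], "damage": [12], "armor": [2]}),
]


def get_item():
    """Pick a table by weight, then return one random row as a Series."""
    table = random.choices(tables, weights=(45, 40, 15), k=1)[0]
    # sample(n=1) returns a one-row DataFrame; iloc[0] turns that row into a Series.
    return table.sample(n=1).iloc[0]


item = get_item()
# Fields of the Series can be read by column name instead of by position.
boosted_damage = item["damage"] + 3
print(item["name"], boosted_damage)
```

Reading by column name (item['damage']) rather than by position (item[2]) keeps the code working if the column order in the CSV files changes.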
I am a newbie in python and need to extract info from a csv file containing terrorism data.
I need to extract the top 5 cities in India with the most casualties, where Casualty = Killed (given in the CSV) + Wounded (given in the CSV).
The City column is also given in the CSV file.
The output should be in descending order of casualties, in the format below:
city_1 casualty_1
city_2 casualty_2
city_3 casualty_3
city_4 casualty_4
city_5 casualty_5
Link to CSV- https://ninjasdatascienceprod.s3.amazonaws.com/3571/terrorismData.csv?AWSAccessKeyId=AKIAIGEP3IQJKTNSRVMQ&Expires=1554719430&Signature=7uYCQ6pAb1xxPJhI%2FAfYeedUcdA%3D&response-content-disposition=attachment%3B%20filename%3DterrorismData.csv
import numpy as np
import csv
file_obj=open("terrorismData.csv",encoding="utf8")
file_data=csv.DictReader(file_obj,skipinitialspace=True)
country=[]
killed=[]
wounded=[]
city=[]
final=[]
#Making lists
for row in file_data:
    if row['Country']=='India':
        country.append(row['Country'])
        killed.append(row['Killed'])
        wounded.append(row['Wounded'])
        city.append(row['City'])
        final.append([row['City'],row['Killed'],row['Wounded']])
#Making numpy arrays out of lists
np_month=np.array(country)
np_killed=np.array(killed)
np_wounded=np.array(wounded)
np_city=np.array(city)
np_final=np.array(final)
#Fixing blank values in final arr
for i in range(len(np_final)):
    for j in range(len(np_final[0])):
        if np_final[i][j]=='':
            np_final[i][j]='0.0'
#Counting casualities(killed+wounded) and storing in 1st column of final array
for i in range(len(np_final)):
    np_final[i,1]=float(np_final[i,1])+float(np_final[i,2])
#Descending sort on casualities column
np_final=np_final[np_final[:,1].argsort()[::-1]]
I expect np_final to be sorted on the casualties column, but that is not happening because the casualty values are stored as strings.
Any help is appreciated.
I would suggest using pandas. It makes the data much easier to manipulate.
Read everything into a DataFrame. It should parse the numeric columns as numbers.
If you must use NumPy, you could simply cast the values to float or int while reading the data, and everything should work, provided there are no other bugs.
Something like this:
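for row in file_data:
    if row['Country']=='India':
        # Blank cells become 0.0; float() also accepts text such as '3.0', which int() rejects
        killed_n = float(row['Killed'] or 0)
        wounded_n = float(row['Wounded'] or 0)
        country.append(row['Country'])
        killed.append(killed_n)
        wounded.append(wounded_n)
        city.append(row['City'])
        # Keep final as a list: np.array() on mixed str/float rows converts everything to strings
        final.append([row['City'], killed_n, wounded_n])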
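For the pandas route suggested above, a minimal sketch might look like the following. It assumes that casualties should be summed per city (the question could also be read as ranking individual incidents), and it uses made-up rows with the column names from the question instead of the real terrorismData.csv.

```python
import pandas as pd

# Made-up rows for illustration; with the real file, use
# df = pd.read_csv("terrorismData.csv") instead.
df = pd.DataFrame({
    "Country": ["India", "India", "India", "India", "Iraq"],
    "City": ["Mumbai", "Delhi", "Mumbai", "Srinagar", "Baghdad"],
    "Killed": [10.0, 2.0, None, 4.0, 50.0],
    "Wounded": [20.0, None, 5.0, 1.0, 70.0],
})

# Keep only the Indian rows; copy() avoids modifying a view of df.
india = df[df["Country"] == "India"].copy()

# Treat blank (NaN) cells as zero before adding the two columns.
india["Casualty"] = india["Killed"].fillna(0) + india["Wounded"].fillna(0)

# Sum per city, then take the five largest totals in descending order.
top5 = india.groupby("City")["Casualty"].sum().nlargest(5)

for city_name, casualty in top5.items():
    print(city_name, int(casualty))
```

Because pandas parses the numeric columns as floats, the sort is numeric rather than lexicographic, which is the problem the NumPy string array ran into.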
I am aggregating a pandas DataFrame using numpy size and then want to write the results to an Excel file using writer.save. But I am getting the following error: NotImplementedError: Writing as Excel with a MultiIndex is not yet implemented.
My data looks something like this:
agt_id unique_id
abc123 ab12345
abc123 cd23456
abc123 de34567
xyz987 ef45678
xyz987 fg56789
My results should look like:
agt_id unique_id
abc123 3
xyz987 2
This is an example of my code:
df_agtvol = df_agt.groupby('agt_id').agg({'unique_id':[np.size]})
writer = pd.ExcelWriter(outfilepath, engine='xlsxwriter')
df_agtvol.to_excel(writer, sheet_name='agt_vols')
I have tried to reset the index by using:
df_agt_vol_final = df_agtvol.set_index([df_agtvol.index, 'agt_id'], inplace=True)
based on some research, but am getting a completely different error.
I am relatively new to working with Pandas dataframes, so any help would be appreciated.
You don't need a MultiIndex. You get one, on the columns, because np.size is wrapped in a list.
Although not explicitly documented, pandas treats every function in the list as a sub-level under 'unique_id'. This use case falls under the "nested dict of names -> dicts of functions" case in the linked documentation.
So
df_agtvol = df_agt.groupby('agt_id').agg({'unique_id':[np.size]})
Should be
df_agtvol = df_agt.groupby('agt_id').agg({'unique_id': np.size})
This is still more complicated than necessary: you can get the same result with the count method, which counts the non-null values in each remaining column.
df_agtvol = df_agt.groupby('agt_id').count()
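The effect of the list wrapper can be checked directly on the sample data from the question. This is only a sketch: it uses the string name 'count' rather than np.size, and it stops before the Excel step, which should work as in the question once the columns are flat.

```python
import pandas as pd

# Sample data copied from the question.
df_agt = pd.DataFrame({
    "agt_id": ["abc123", "abc123", "abc123", "xyz987", "xyz987"],
    "unique_id": ["ab12345", "cd23456", "de34567", "ef45678", "fg56789"],
})

# A list of functions produces two-level (MultiIndex) columns.
nested = df_agt.groupby("agt_id").agg({"unique_id": ["count"]})

# A single function keeps the columns flat.
flat = df_agt.groupby("agt_id").agg({"unique_id": "count"})

# Equivalent shortcut from the answer above.
counted = df_agt.groupby("agt_id").count()

print(type(nested.columns).__name__)   # MultiIndex
print(flat)
```

Note that count skips missing values; if unique_id can be empty and those rows should still be counted, df_agt.groupby('agt_id').size() counts every row instead.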
I am trying to clean a list of url's that has garbage as shown.
/gradoffice/index.aspx(
/gradoffice/index.aspx-
/gradoffice/index.aspxjavascript$
/gradoffice/index.aspx~
I have a csv file with over 190k records of different url's. I tried to load the csv into a pandas dataframe and took the entire column of url's into a list by using the statement
str = df['csuristem']
This clearly gave me all the values in the column. However, when I use the following code, it prints only about 40k records and starts somewhere in the middle. I don't know where I am going wrong. The program runs without errors but shows only part of the results. Any help would be much appreciated.
import pandas
table = pandas.read_csv("SS3.csv", dtype=object)
df = pandas.DataFrame(table)
str = df['csuristem']
for s in str:
    s = s.split(".")[0]
    print s
I am looking to get an output like this
/gradoffice/index.
/gradoffice/index.
/gradoffice/index.
/gradoffice/index.
Thank you,
Santhosh.
There is no need to loop. Call .str.split on the column, then use .str[0] to take the first part of each split string:
In [6]:
df['csuristem'].str.split('.').str[0]
Out[6]:
0    /gradoffice/index
1    /gradoffice/index
2    /gradoffice/index
3    /gradoffice/index
Name: csuristem, dtype: object
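The same call can be tried on the four sample URLs from the question. This sketch builds the column in memory rather than reading SS3.csv, and the with_dot variable is only an illustration of how to get the trailing dot shown in the desired output.

```python
import pandas as pd

# The four sample values from the question, in place of SS3.csv.
df = pd.DataFrame({"csuristem": [
    "/gradoffice/index.aspx(",
    "/gradoffice/index.aspx-",
    "/gradoffice/index.aspxjavascript$",
    "/gradoffice/index.aspx~",
]})

# Vectorised split: everything before the first '.' in each value.
stems = df["csuristem"].str.split(".").str[0]

# The question's desired output keeps the dot, so append it back.
with_dot = stems + "."

print(with_dot.to_string(index=False))
```

For the full 190k rows it may be easier to assign the result back to a column and write it out with to_csv than to print it, since a terminal typically keeps only a limited amount of scrollback.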