Python3 issue regarding index out of range - python-3.x

I am having this problem: when I run the code below against the file shown after it, I get an IndexError: list index out of range.
import sys

f = open(sys.argv[1], 'r')
file_contents = [x.split('\t')[2:5] for x in f.readlines()]

# Set the variables for average and total for cities
total = 0
city = set()
for line in file_contents:
    print(line[0])
This is the content of the file:
2012-01-01 09:00 San Jose Men's Clothing 214.05 Amex
2012-01-01 09:00 Fort Worth Women's Clothing 153.57 Visa
2012-01-01 09:00 San Diego Music 66.08 Cash
2012-01-01 09:00 Pittsburgh Pet Supplies 493.51 Discover
2012-01-01 09:00 Omaha Children's Clothing 235.63 MasterCard

You need to close the file after reading from it; the recommended practice is to open it with the with statement, which closes it automatically:
import sys

with open(sys.argv[1], 'r') as f:
    file_contents = [x.split(' ')[2:5] for x in f.readlines()]

# Set the variables for average and total for cities
total = 0
city = set()
for line in file_contents:
    print(line[0])
However, the actual issue you are having is splitting the lines by \t when the file is delimited by spaces; split on a blank space instead and it should give you what you need.
OUTPUT
09:00
09:00
09:00
09:00
09:00
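The failure is easy to reproduce in isolation: str.split returns the whole line as a single-element list when the delimiter never occurs, so the slice [2:5] comes back empty and indexing it raises. A minimal standalone sketch of that behaviour (sample line copied from the file above):

```python
line = "2012-01-01 09:00 San Jose Men's Clothing 214.05 Amex"

# The line contains no tab, so split('\t') yields one element and the
# slice [2:5] is empty -- indexing that empty list raises IndexError.
no_tabs = line.split('\t')[2:5]
print(no_tabs)  # []

# Splitting on spaces produces enough tokens for the slice to be non-empty.
tokens = line.split(' ')[2:5]
print(tokens)   # ['San', 'Jose', "Men's"]

# Defensive variant: skip lines that are too short to index safely.
for raw in [line, "", "too short"]:
    parts = raw.split(' ')
    if len(parts) > 2:
        print(parts[2])
```

The defensive length check is the general cure: it keeps blank or malformed lines from crashing the loop regardless of which delimiter the file really uses.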

Related

Creating multiple named dataframes by a for loop

I have a database that contains 60,000+ rows of college football recruit data. From there, I want to create separate dataframes where each one contains just one class year. This is what a sample of the dataframe looks like:
,Primary Rank,Other Rank,Name,Link,Highschool,Position,Height,weight,Rating,National Rank,Position Rank,State Rank,Team,Class
0,1,,D.J. Williams,https://247sports.com/Player/DJ-Williams-49931,"De La Salle (Concord, CA)",ILB,6-2,235,0.9998,1,1,1,Miami,2000
1,2,,Brock Berlin,https://247sports.com/Player/Brock-Berlin-49926,"Evangel Christian Academy (Shreveport, LA)",PRO,6-2,190,0.9998,2,1,1,Florida,2000
2,3,,Charles Rogers,https://247sports.com/Player/Charles-Rogers-49984,"Saginaw (Saginaw, MI)",WR,6-4,195,0.9988,3,1,1,Michigan State,2000
3,4,,Travis Johnson,https://247sports.com/Player/Travis-Johnson-50043,"Notre Dame (Sherman Oaks, CA)",SDE,6-4,265,0.9982,4,1,2,Florida State,2000
4,5,,Marcus Houston,https://247sports.com/Player/Marcus-Houston-50139,"Thomas Jefferson (Denver, CO)",RB,6-0,208,0.9980,5,1,1,Colorado,2000
5,6,,Kwame Harris,https://247sports.com/Player/Kwame-Harris-49999,"Newark (Newark, DE)",OT,6-7,320,0.9978,6,1,1,Stanford,2000
6,7,,B.J. Johnson,https://247sports.com/Player/BJ-Johnson-50154,"South Grand Prairie (Grand Prairie, TX)",WR,6-1,190,0.9976,7,2,1,Texas,2000
7,8,,Bryant McFadden,https://247sports.com/Player/Bryant-McFadden-50094,"McArthur (Hollywood, FL)",CB,6-1,182,0.9968,8,1,1,Florida State,2000
8,9,,Sam Maldonado,https://247sports.com/Player/Sam-Maldonado-50071,"Harrison (Harrison, NY)",RB,6-2,215,0.9964,9,2,1,Ohio State,2000
9,10,,Mike Munoz,https://247sports.com/Player/Mike-Munoz-50150,"Archbishop Moeller (Cincinnati, OH)",OT,6-7,290,0.9960,10,2,1,Tennessee,2000
10,11,,Willis McGahee,https://247sports.com/Player/Willis-McGahee-50179,"Miami Central (Miami, FL)",RB,6-1,215,0.9948,11,3,2,Miami,2000
11,12,,Antonio Hall,https://247sports.com/Player/Antonio-Hall-50175,"McKinley (Canton, OH)",OT,6-5,295,0.9946,12,3,2,Kentucky,2000
12,13,,Darrell Lee,https://247sports.com/Player/Darrell-Lee-50580,"Kirkwood (Saint Louis, MO)",WDE,6-5,230,0.9940,13,1,1,Florida,2000
13,14,,O.J. Owens,https://247sports.com/Player/OJ-Owens-50176,"North Stanly (New London, NC)",S,6-1,195,0.9932,14,1,1,Tennessee,2000
14,15,,Jeff Smoker,https://247sports.com/Player/Jeff-Smoker-50582,"Manheim Central (Manheim, PA)",PRO,6-3,190,0.9922,15,2,1,Michigan State,2000
15,16,,Marco Cooper,https://247sports.com/Player/Marco-Cooper-50171,"Cass Technical (Detroit, MI)",OLB,6-2,235,0.9918,16,1,2,Ohio State,2000
16,17,,Chance Mock,https://247sports.com/Player/Chance-Mock-50163,"The Woodlands (The Woodlands, TX)",PRO,6-2,190,0.9918,17,3,2,Texas,2000
17,18,,Roy Williams,https://247sports.com/Player/Roy-Williams-55566,"Permian (Odessa, TX)",WR,6-4,202,0.9916,18,3,3,Texas,2000
18,19,,Matt Grootegoed,https://247sports.com/Player/Matt-Grootegoed-50591,"Mater Dei (Santa Ana, CA)",OLB,5-11,205,0.9914,19,2,3,USC,2000
19,20,,Yohance Buchanan,https://247sports.com/Player/Yohance-Buchanan-50182,"Douglass (Atlanta, GA)",S,6-1,210,0.9912,20,2,1,Florida State,2000
20,21,,Mac Tyler,https://247sports.com/Player/Mac-Tyler-50572,"Jess Lanier (Hueytown, AL)",DT,6-6,320,0.9912,21,1,1,Alabama,2000
21,22,,Jason Respert,https://247sports.com/Player/Jason-Respert-55623,"Northside (Warner Robins, GA)",OC,6-3,300,0.9902,22,1,2,Tennessee,2000
22,23,,Casey Clausen,https://247sports.com/Player/Casey-Clausen-50183,"Bishop Alemany (Mission Hills, CA)",PRO,6-4,215,0.9896,23,4,4,Tennessee,2000
23,24,,Albert Means,https://247sports.com/Player/Albert-Means-55968,"Trezevant (Memphis, TN)",SDE,6-6,310,0.9890,24,2,1,Alabama,2000
24,25,,Albert Hollis,https://247sports.com/Player/Albert-Hollis-55958,"Christian Brothers (Sacramento, CA)",RB,6-0,190,0.9890,25,4,5,Georgia,2000
25,26,,Eric Moore,https://247sports.com/Player/Eric-Moore-55973,"Pahokee (Pahokee, FL)",OLB,6-4,226,0.9884,26,3,3,Florida State,2000
26,27,,Willie Dixon,https://247sports.com/Player/Willie-Dixon-55626,"Stockton Christian School (Stockton, CA)",WR,5-11,182,0.9884,27,4,6,Miami,2000
27,28,,Cory Bailey,https://247sports.com/Player/Cory-Bailey-50586,"American (Hialeah, FL)",S,5-10,175,0.9880,28,3,4,Florida,2000
28,29,,Sean Young,https://247sports.com/Player/Sean-Young-55972,"Northwest Whitfield County (Tunnel Hill, GA)",OG,6-6,293,0.9878,29,1,3,Tennessee,2000
29,30,,Johnnie Morant,https://247sports.com/Player/Johnnie-Morant-60412,"Parsippany Hills (Morris Plains, NJ)",WR,6-5,225,0.9871,30,5,1,Syracuse,2000
30,31,,Wes Sims,https://247sports.com/Player/Wes-Sims-60243,"Weatherford (Weatherford, OK)",OG,6-5,310,0.9869,31,2,1,Oklahoma,2000
31,33,,Jason Campbell,https://247sports.com/Player/Jason-Campbell-55976,"Taylorsville (Taylorsville, MS)",PRO,6-5,190,0.9853,33,5,1,Auburn,2000
32,34,,Antwan Odom,https://247sports.com/Player/Antwan-Odom-50168,"Alma Bryant (Irvington, AL)",SDE,6-7,260,0.9851,34,3,2,Alabama,2000
33,35,,Sloan Thomas,https://247sports.com/Player/Sloan-Thomas-55630,"Klein (Spring, TX)",WR,6-2,188,0.9847,35,6,5,Texas,2000
34,36,,Raymond Mann,https://247sports.com/Player/Raymond-Mann-60804,"Hampton (Hampton, VA)",ILB,6-1,233,0.9847,36,2,1,Virginia,2000
35,37,,Alphonso Townsend,https://247sports.com/Player/Alphonso-Townsend-55975,"Lima Central Catholic (Lima, OH)",DT,6-6,280,0.9847,37,2,3,Ohio State,2000
36,38,,Greg Jones,https://247sports.com/Player/Greg-Jones-50158,"Battery Creek (Beaufort, SC)",RB,6-2,245,0.9837,38,6,1,Florida State,2000
37,39,,Paul Mociler,https://247sports.com/Player/Paul-Mociler-60319,"St. John Bosco (Bellflower, CA)",OG,6-5,300,0.9833,39,3,7,UCLA,2000
38,40,,Chris Septak,https://247sports.com/Player/Chris-Septak-57555,"Millard West (Omaha, NE)",TE,6-3,245,0.9833,40,1,1,Nebraska,2000
39,41,,Eric Knott,https://247sports.com/Player/Eric-Knott-60823,"Henry Ford II (Sterling Heights, MI)",TE,6-4,235,0.9831,41,2,3,Michigan State,2000
40,42,,Harold James,https://247sports.com/Player/Harold-James-57524,"Osceola (Osceola, AR)",S,6-1,220,0.9827,42,4,1,Alabama,2000
For example, if I don't use a for loop, this line of code is what I use if I just want to create one dataframe:
recruits2022 = recruits_final[recruits_final['Class'] == 2022]
However, I want to have a named dataframe for each recruiting class.
In other words, recruits2000 would be a dataframe for all rows that have a class value equal to 2000, recruits2001 would be a dataframe for all rows that have a class value to 2001, and so forth.
This is what I tried recently, but had no luck saving the dataframe outside of the for loop.
databases = ['recruits2000', 'recruits2001', 'recruits2002', 'recruits2003', 'recruits2004',
'recruits2005', 'recruits2006', 'recruits2007', 'recruits2008', 'recruits2009',
'recruits2010', 'recruits2011', 'recruits2012', 'recruits2013', 'recruits2014',
'recruits2015', 'recruits2016', 'recruits2017', 'recruits2018', 'recruits2019',
'recruits2020', 'recruits2021', 'recruits2022', 'recruits2023']
for i in range(len(databases)):
    year = pd.to_numeric(databases[i][-4:], errors='coerce')
    db = recruits_final[recruits_final['Class'] == year]
    db.name = databases[i]
    print(db)
    print(db.name)
    print(year)

recruits2023
I would get this error instead of what I wanted:
NameError Traceback (most recent call last)
<ipython-input-49-7cb5d12ab92f> in <module>()
29
30 # print(db.name)
---> 31 recruits2023
32
33
NameError: name 'recruits2023' is not defined
Is there something that I am missing to get this for loop to work? Any assistance is truly appreciated. Thanks in advance.
Instead, use a dictionary of dataframes built with groupby:
dict_dfs = dict(tuple(df.groupby('Class')))
Access your individual dataframes using
dict_dfs[2022]
You overwrite the variable db at each iteration, and recruits2023 is never defined as a variable, so you can't reference it like that.
You can use a dict to store your data:
recruits = {}
for year in recruits_final['Class'].unique():
    recruits[year] = recruits_final[recruits_final['Class'] == year]
>>> recruits[2000]
Primary Rank Other Rank Name Link ... Position Rank State Rank Team Class
0 1 NaN D.J. Williams https://247sports.com/Player/DJ-Williams-49931 ... 1 1 Miami 2000
1 2 NaN Brock Berlin https://247sports.com/Player/Brock-Berlin-49926 ... 1 1 Florida 2000
2 3 NaN Charles Rogers https://247sports.com/Player/Charles-Rogers-49984 ... 1 1 Michigan State 2000
3 4 NaN Travis Johnson https://247sports.com/Player/Travis-Johnson-50043 ... 1 2 Florida State 2000
...
38 40 NaN Chris Septak https://247sports.com/Player/Chris-Septak-57555 ... 1 1 Nebraska 2000
39 41 NaN Eric Knott https://247sports.com/Player/Eric-Knott-60823 ... 2 3 Michigan State 2000
40 42 NaN Harold James https://247sports.com/Player/Harold-James-57524 ... 4 1 Alabama 2000
>>> recruits.keys()
dict_keys([2000])
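Both answers come down to the same pattern: store the per-class frames under dictionary keys instead of trying to mint variable names at runtime. A self-contained sketch on toy data (the names and years below are invented placeholders, not the real recruit data):

```python
import pandas as pd

# Stand-in for recruits_final with just the columns that matter here.
recruits_final = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Class": [2000, 2000, 2001],
})

# One dataframe per class year, keyed by the year itself.
recruits = {year: group for year, group in recruits_final.groupby("Class")}

print(sorted(recruits))     # [2000, 2001]
print(len(recruits[2000]))  # 2
```

Looking up recruits[2000] then plays the role the name recruits2000 was meant to play, and a loop over recruits.items() replaces the list of 24 hand-written names entirely.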

Count Occurrences for Objects in a Column of Lists for Really Large CSV File

I have a huge CSV file (8 GB) containing multiple columns. One of the columns is a column of lists that looks like this:
YEAR WIN_COUNTRY_ISO3
200 2017 ['BEL', 'FRA', 'ESP']
201 2017 ['BEL', 'LTU']
202 2017 ['POL', 'BEL']
203 2017 ['BEL']
204 2017 ['GRC', 'DEU', 'FRA', 'LVA']
205 2017 ['LUX']
206 2017 ['BEL', 'SWE', 'LUX']
207 2017 ['BEL']
208 2017 []
209 2017 []
210 2017 []
211 2017 ['BEL']
212 2017 ['SWE']
213 2017 ['LUX', 'LUX']
214 2018 ['DEU', 'LUX']
215 2018 ['ESP', 'PRT']
216 2018 ['AUT']
217 2018 ['DEU', 'BEL']
218 2009 ['ESP']
219 2009 ['BGR']
Each 3-letter code represents a country. I would like to create a frequency table so I can count the occurrences of each country in the entire column. Since the file is really large and my PC can't load the whole CSV as a dataframe, I try to read the file lazily, iterate through the lines, get the last column (WIN_COUNTRY_ISO3), and add each code in it to a dictionary of counts.
import sys
from itertools import islice

n = 100
i = 0
col_dict = {}
with open(r"filepath.csv") as file:
    for nline in iter(lambda: tuple(islice(file, n)), ()):
        row = nline.splitline
        WIN_COUNTRY_ISO3 = row[-1]
        for iso3 in WIN_COUNTRY_ISO3:
            if iso3 in col_dict.keys():
                col_dict[iso3] += 1
            else:
                col_dict[iso3] = 1
        i += 1
        sys.stdout.write("\rDoing thing %i" % i)
        sys.stdout.flush()
print(col_dict)
However, this process takes a really long time. I tried to iterate through multiple lines at a time using
for nline in iter(lambda: tuple(islice(file, n)), ())
Q1:
However, this doesn't seem to work and Python still processes the file line by line. Does anybody know the most efficient way to generate the count of each country for a really large file like mine?
The resulting table would look like this:
Country Freq
BEL 4543
FRA 4291
ESP 3992
LTU 3769
POL 3720
GRC 3213
DEU 3119
LVA 2992
LUX 2859
SWE 2802
PRT 2584
AUT 2374
BGR 1978
RUS 1770
TUR 1684
I would also like to create the frequency table by each year (in the YEAR column) if anybody can help me with this. Thank you.
Try this:
from collections import defaultdict
import re

result = defaultdict(int)
f = open(r"filepath.csv")
next(f)  # skip the header line
for row in f:
    # strip whitespace, digits, quotes and brackets, leaving only the codes
    data = re.sub(r'[\s\d\'\[\]]', '', row)
    if data:
        for x in data.split(','):
            result[x] += 1
f.close()
print(result)
If you can handle awk, here's one:
$ cat program.awk
{
    while (match($0, /'[A-Z]{3}'/)) {
        a[substr($0, RSTART+1, RLENGTH-2)]++
        $0 = substr($0, RSTART+RLENGTH)
    }
}
END {
    for (i in a)
        print a[i], i
}
Execute it:
$ awk -f program.awk file
Output:
1 AUT
3 DEU
3 ESP
1 BGR
1 LTU
2 FRA
1 PRT
5 LUX
8 BEL
1 POL
1 GRC
1 LVA
2 SWE
The program matches against $0, the whole record (row), so it might pick up false hits from elsewhere in the record. You could tighten this with proper field separation, but since the raw file wasn't available I can't take it further; see the GNU awk documentation for FS and perhaps FPAT.
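Neither answer covers the per-year table the question also asks for. One hedged way to get both counts in a single streaming pass is csv plus ast.literal_eval, assuming the file is comma-separated with YEAR first and WIN_COUNTRY_ISO3 last (adjust the indices if the real layout differs); the in-memory buffer below is just a stand-in for the real 8 GB file:

```python
import ast
import csv
import io
from collections import Counter, defaultdict

# In-memory stand-in; for the real file, replace io.StringIO(...)
# with open(path, newline="") and stream it the same way.
buf = io.StringIO(
    "YEAR,WIN_COUNTRY_ISO3\n"
    "2017,\"['BEL', 'FRA']\"\n"
    "2017,\"['BEL']\"\n"
    "2018,\"['DEU', 'BEL']\"\n"
)

totals = Counter()              # country -> overall count
by_year = defaultdict(Counter)  # year -> (country -> count)

reader = csv.reader(buf)
next(reader)                    # skip the header row
for row in reader:
    countries = ast.literal_eval(row[-1])  # "['BEL', 'FRA']" -> ['BEL', 'FRA']
    totals.update(countries)
    by_year[row[0]].update(countries)

print(totals.most_common())   # [('BEL', 3), ('FRA', 1), ('DEU', 1)]
print(dict(by_year["2017"]))  # {'BEL': 2, 'FRA': 1}
```

This keeps memory flat regardless of file size, since only the counters grow, and totals.most_common() gives the final frequency table directly.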

check amount of time between different rows of data (time) and date and name of employee

I have a df with this info ['Name', 'Department', 'Date', 'Time', 'Activity'],
so for example looks like this:
Acosta, Hirto 225 West 28th Street 9/18/2019 07:25:00 Punch In
Acosta, Hirto 225 West 28th Street 9/18/2019 11:57:00 Punch Out
Acosta, Hirto 225 West 28th Street 9/18/2019 12:28:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 06:57:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 12:00:00 Punch Out
Adams, Juan 225 West 28th Street 9/16/2019 12:28:00 Punch In
Adams, Juan 225 West 28th Street 9/16/2019 15:30:00 Punch Out
Adams, Juan 225 West 28th Street 9/18/2019 07:04:00 Punch In
Adams, Juan 225 West 28th Street 9/18/2019 11:57:00 Punch Out
I need to calculate the time between the punch in and the punch out in the same day for the same employee.
So far I have only managed to clean the data, like this:
self.raw_data['Time'] = pd.to_datetime(self.raw_data['Time'], format='%H:%M').dt.time
sorted_db = self.raw_data.sort_values(['Name', 'Date'])
sorted_db = sorted_db[['Name', 'Department', 'Date', 'Time', 'Activity']]
Any suggestions will be appreciated.
So I found the answer to my problem and wanted to share it.
First, I separate the "Punch In" and "Punch Out" times into two columns:
def process_info(self):
    # filter data and organize it
    self.raw_data['in'] = self.raw_data[self.raw_data['Activity'].str.contains('In')]['Time']
    self.raw_data['pre_out'] = self.raw_data[self.raw_data['Activity'].str.contains('Out')]['Time']
Then I sort the data by date and name:
sorted_data = self.raw_data.sort_values(['Date', 'Name'])
After that I use shift(-1) to move the "Out" times up one row so they line up with the "In" times:
sorted_data['out'] = sorted_data.shift(-1)['Time']
Finally, I drop the extra rows created in the first step by keeping only those where 'pre_out' is null:
filtered_data = sorted_data[sorted_data['pre_out'].isnull()]
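The same shift-based pairing can be seen end to end on toy data. This is a sketch rather than the poster's exact class code, and it assumes punches strictly alternate In/Out within each name and date:

```python
import pandas as pd

# Invented sample mirroring the question's layout.
df = pd.DataFrame({
    "Name":     ["Adams, Juan"] * 4,
    "Date":     ["9/16/2019"] * 4,
    "Time":     ["06:57:00", "12:00:00", "12:28:00", "15:30:00"],
    "Activity": ["Punch In", "Punch Out", "Punch In", "Punch Out"],
})

# Combine date and time, sort, then line each Punch In up with the next punch.
df["dt"] = pd.to_datetime(df["Date"] + " " + df["Time"])
df = df.sort_values(["Name", "dt"])
df["next_punch"] = df.groupby(["Name", "Date"])["dt"].shift(-1)

worked = df[df["Activity"] == "Punch In"].copy()
worked["hours"] = (worked["next_punch"] - worked["dt"]).dt.total_seconds() / 3600

print(worked[["Name", "Date", "hours"]])  # 5.05 h and ~3.03 h shifts
```

Grouping by both Name and Date before shifting keeps one employee's last punch of the day from being paired with the next employee's (or next day's) first punch.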

Python individual rows for each month's rent in term

I'm stuck on one piece of Python code.
From an XML file, we're parsing data successfully in the following code, excluding the while loops and associated variables. We need to load a table into SQL with the entire rent schedule, by month, for the life of the lease. Rent is always billed on the first of the month, but the amount escalates at different times and by different amounts depending on the lease. The objective is to return one row per billing month with the date of each month's rent to be billed (YYYY-MM-DD). If the lease is for 60 months and there is a rent escalation in the 25th month, we'll need to show 60 rows, with the amount repeating 24 times for the first two years and 36 times for the remainder. The scenario needs to be flexible enough to adapt to annual increases for some leases, and a few other variable conditions.
Can someone point out where I've gone wrong in my While Loop to get the desired results?
import xml.etree.ElementTree as ET
import pyodbc
import dateutil.relativedelta as rd
import dateutil.parser as pr

tree = ET.parse('DealData.xml')
root = tree.getroot()

for deal in root.findall("Deals"):
    for dl in deal.findall("Deal"):
        dealid = dl.get("DealID")
        for dts in dl.findall("DealTerms/DealTerm"):
            dtid = dts.get("ID")
            dstart = pr.parse(dts.find("CommencementDate").text)
            dterm = dts.find("LeaseTerm").text
            darea = dts.find("RentableArea").text
            for brrent in dts.findall("BaseRents/BaseRent"):
                brid = brrent.get("ID")
                begmo = int(brrent.find("BeginIn").text)
                if brrent.find("Duration").text is not None:
                    duration = int(brrent.find("Duration").text)
                else:
                    duration = 0
                brentamt = brrent.find("Rent").text
                brper = brrent.find("Period").text
                perst = dstart + rd.relativedelta(months=begmo - 1)
                perend = perst + rd.relativedelta(months=duration - 1)
                billmocount = begmo
                while billmocount < duration:
                    monthnum = billmocount
                    billmocount += 1
                billmo = perst
                while billmo < perend:
                    billper = billmo
                    billmo += rd.relativedelta(months=1)
                if dealid == "706880":
                    print(dealid, dtid, brid, begmo, dstart, dterm, darea, brentamt, brper,
                          duration, perst, perend, monthnum, billper)
The results I'm getting look like this:
706880 4278580 45937180 1 2018-01-01 00:00:00 60 6200 15.0 rsf/year 36 2018-01-01 00:00:00 2020-12-01 00:00:00 35 2020-11-01 00:00:00
706880 4278580 45937181 37 2018-01-01 00:00:00 60 6200 18.0 rsf/year 24 2021-01-01 00:00:00 2022-12-01 00:00:00 35 2022-11-01 00:00:00
The problem I was running into was simply the indentation of the print statement. Indenting it so that it executes inside the inner while loop gave me the expected results:
while billmo < perend:
    billper = billmo
    billmo += rd.relativedelta(months=1)
    if dealid == "706880":
        print(dealid, dtid, brid, begmo, dstart, dterm, darea, brentamt, brper,
              duration, perst, perend, monthnum, billper)
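As a sanity check on the loop logic, the intended expansion (one row per billing month, with the rent stepping up at each escalation) can be sketched without the XML plumbing; the (begin_month, duration, rent) tuples below are invented stand-ins for the parsed BaseRent elements:

```python
import dateutil.parser as pr
import dateutil.relativedelta as rd

dstart = pr.parse("2018-01-01")
# Stand-ins for parsed BaseRent data: 24 months at 15.0, then 36 months at 18.0.
base_rents = [(1, 24, 15.0), (25, 36, 18.0)]

schedule = []
for begmo, duration, rent in base_rents:
    # First billing date of this rent period, offset from lease commencement.
    perst = dstart + rd.relativedelta(months=begmo - 1)
    for m in range(duration):
        billper = perst + rd.relativedelta(months=m)
        schedule.append((billper.date().isoformat(), rent))

print(len(schedule))               # 60 rows for the 60-month lease
print(schedule[23], schedule[24])  # escalation between month 24 and month 25
```

Using for ... in range(duration) instead of a manually advanced while counter makes it harder to end a month short, which is what the perend comparison in the original code risked.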

Convert .txt file into multi index dataframe pandas

I have a very unorganized dataset located in a text file say file.txt
The sample looks something like so
TYPE Invoice C AC DATE TIME Total Invoice Qty1 ITEMVG By Total 3,000.00
Piece Item
5696 01/03/2018 09:21 32,501.35 1 Golden Plate ÞÔÞæÇä ÈÞÑ 6,517.52
1 áÈä ÑæÇÈí ÊÚäÇíá 2 ßÛ 4,261.45
1 Magic chef pop corn 907g 3,509.43
1 áÈäÉ ÊÚäÇíá ÔÝÇÝÉ 1 ßíáæ 9,525.60
1 KHOURY UHT 1 L 2,506.74
1 ÎÈÒ ÔãÓíä ÕÛíÑ 1,002.69
2 Almera 200Tiss 2,506.74
1.55 VG Potato 1,550.17
0.41 VG Eggplant 619.67
1 Delivery Charge 501.35
5697 01/03/2018 09:31 15,751.35 0.5 Halloum 1K. 4,476.03
0.59 Cheese double Cream 3,253.75
3 ãæáÇä ÏæÑ ÎÈÒ æÓØ 32 3,760.11
3 ãæáÇä ÏæÑ ÎÈÒ æÓØ 32 3,760.11
1 Delivery Charge 501.35
I want to import it into a pandas dataframe using a multi-index. Can someone help me with this?
So far I can only read it in as raw text:
# Obtain the Unorganized data from txt
file1=open('file.txt','r')
UnOrgan=file1.read()
You should be able to just read it in using read_table.
import pandas as pd
df = pd.read_table(<your file>, sep="\t", header=<row with column info>)
I'm guessing that the separator is a tab.
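For reference, pandas spells the parameter header (singular), giving the row index that holds the column names; with an actually tab-delimited file the call looks like this (in-memory toy data standing in for file.txt, with invented column names):

```python
import io
import pandas as pd

# Tab-separated stand-in; pass your real file path instead of the buffer.
buf = io.StringIO(
    "Qty\tItem\tTotal\n"
    "1\tKHOURY UHT 1 L\t2506.74\n"
    "2\tAlmera 200Tiss\t2506.74\n"
)
df = pd.read_table(buf, sep="\t", header=0)

print(df.shape)          # (2, 3)
print(list(df.columns))  # ['Qty', 'Item', 'Total']
```

If the sample above isn't truly tab-delimited, read_table alone won't recover the structure; building the multi-index would then take a custom line-by-line parse first.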
