Looping difficulties with 2 CSV files - python-3.x

OK, this is my last question about CSV files and looping.
Here is what I want to do with my loops.
This is the CSV file of students, which I have read into lists:
File 1
['Needie Seagoon', '57', '', '83', '55', '78', '', '91', '73', '65', '56', '', '', '']
['Eccles', '', '98', '91', '80', '', '66', '', '', '', '77', '78', '48', '77']
['Bluebottle', '61', '', '88', '80', '60', '', '45', '52', '91', '85', '', '', '']
['Henry Crun', '92', '', '58', '50', '57', '', '67', '45', '77', '72', '', '', '']
['Minnie Bannister', '51', '', '97', '52', '53', '', '68', '58', '70', '69', '', '', '']
['Hercules Grytpype-Thynne', '', '78', '62', '75', '', '67', '', '', '', '48', '56', '89', '67']
['Count Jim Moriarty', '51', '', '68', '51', '66', '', '55', '72', '50', '74', '', '', '']
['Major Dennis Bloodnok', '', '54', '47', '59', '', '48', '', '', '', '66', '58', '53', '83']
I then have another csv file with the max scores of each course:
File 2
CITS1001 95
CITS1401 100
CITS1402 97
CITS2002 99
CITS2211 94
CITS2401 95
CITS3001 93
CITS3002 93
CITS3003 91
CITS3200 87
CITS3401 98
CITS3402 93
CITS3403 88
What I want to do, and have been trying very hard to achieve, is divide each student's score by the max score of the corresponding course.
So for each student, I want the values running horizontally along the row to be divided by the max scores running vertically down the second file.
For example:
['Needie Seagoon', '57', '', '83', '55', '78', '', '91', '73', '65', '56', '', '', '']
I want 57/95, skip, 83/100, 55/97, and so on.
I want to do this for every name in the file. This code might be familiar to some of you, but I know I'm doing something wrong.
def normalise(students_file, units_list):
    file1 = open(students_file, 'r')
    data1 = file1.read().splitlines()
    file2 = open(units_list, 'r')
    data2 = file2.read().splitlines()
    for line in data1:
        line = line.split(",")
        for row in data2:
            row = row.split(",")
            for n in range(1, len(row), 2):
                for i in range(1, len(line), 1):
                    if line[i] == '':
                        pass
                    else:
                        answer = int(line[i]) / int(row[n])
    file1.close()
    file2.close()
I'll show you some of the output (it goes on for a very long time).
output:
1st loop
0.6
none
0.8736842105263158
0.5789473684210527
0.8210526315789474
none
0.9578947368421052
0.7684210526315789
0.6842105263157895
0.5894736842105263
none
none
none
2nd loop
0.57
none
0.83
0.55
0.78
none
0.91
0.73
0.65
0.56
none
none
I understand that I could use readlines(), but when I do, I can't strip the \n, and using end='' makes the code messy. The output shows that every value in a student's row is divided by 95, then the loop goes back to the start and divides every value by 100, and so on. How can I make the first value divide by 95, the second by 100, and so on?
Sorry for the long explanation/question, but I get told to explain myself more.
Thanks.
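One possible way to get the pairing described above is to walk the scores and the max scores in step with zip, rather than nesting one loop inside the other. This is only a sketch: the function name normalise_rows and the max_scores list are assumptions made for illustration, and it assumes each score column lines up with the unit in the same position (so a blank still uses up its unit's slot).

```python
def normalise_rows(student_rows, max_scores):
    """Divide each non-empty score by the max score in the same column.

    student_rows: lists like ['Name', '57', '', '83', ...] (strings).
    max_scores:   list of ints, one per score column (hypothetical input).
    Blank scores are kept as None so the columns stay aligned.
    """
    result = []
    for row in student_rows:
        name, scores = row[0], row[1:]
        normalised = [
            int(score) / max_score if score != '' else None
            for score, max_score in zip(scores, max_scores)
        ]
        result.append([name] + normalised)
    return result


# Max scores taken from File 2, in file order.
max_scores = [95, 100, 97, 99, 94, 95, 93, 93, 91, 87, 98, 93, 88]
students = [
    ['Needie Seagoon', '57', '', '83', '55', '78', '', '91', '73', '65', '56', '', '', ''],
]

normalised = normalise_rows(students, max_scores)
print(normalised[0])
```

Because zip advances both lists together, each score is divided exactly once, by its own unit's maximum, instead of once per unit.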

Related

Spark timeout on writing to parquet

I get a timeout when running this notebook in Databricks. The last step, writing to Parquet, runs for approximately 15-18 minutes before the timeout error occurs. I'm not sure where it goes wrong.
from pyspark.sql.functions import explode, sequence
# Create hours string
spark.sql(f"select explode(array('00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23')) as hh").createOrReplaceTempView('hours')
# Create minutes string
spark.sql(f"select explode(array('00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59')) as mm").createOrReplaceTempView('minutes')
# Create seconds string
spark.sql(f"select explode(array('00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59')) as ss").createOrReplaceTempView('seconds')
# Create Time string, add hour, minute, second
spark.sql(f"select CAST(CONCAT(hours.hh, ':', minutes.mm, ':', seconds.ss) as string) as Time, explode(sequence(0,23,1)) as Hour from hours cross join minutes cross join seconds ").createOrReplaceTempView('time1')
spark.sql(f"select *, explode(sequence(0,59,1)) as Minute from time1").createOrReplaceTempView('time2')
spark.sql(f"select *, explode(sequence(0,59,1)) as Second from time2").createOrReplaceTempView('time3')
# Add TimeID
spark.sql(f"select row_number() over (order by TIME) as TimeID, * from time3").createOrReplaceTempView('src')
# Add HourDescription
spark.sql(f"select *, CONCAT(CASE date_part('HOUR', Time) WHEN 0 THEN '00' WHEN 1 THEN '01' WHEN 2 THEN '02' WHEN 3 THEN '03' WHEN 4 THEN '04' WHEN 5 THEN '05' WHEN 6 THEN '06' WHEN 7 THEN '07' WHEN 8 THEN '08' WHEN 9 THEN '09' END, ':00') as HourDescription from src").createOrReplaceTempView('src1')
# Add HourBucket
spark.sql(f"select *, CONCAT(HourDescription, ' - ', CONCAT(CASE date_part('HOUR', Time) WHEN 0 THEN '01' WHEN 1 THEN '02' WHEN 2 THEN '03' WHEN 3 THEN '04' WHEN 4 THEN '05' WHEN 5 THEN '06' WHEN 6 THEN '07' WHEN 7 THEN '08' WHEN 8 THEN '09' WHEN 9 THEN '10' WHEN 10 THEN '11' WHEN 11 THEN '12' WHEN 12 THEN '13' WHEN 13 THEN '14' WHEN 14 THEN '15' WHEN 15 THEN '16' WHEN 16 THEN '17' WHEN 17 THEN '18' WHEN 18 THEN '19' WHEN 19 THEN '20' WHEN 20 THEN '21' WHEN 21 THEN '22' WHEN 22 THEN '23' WHEN 23 THEN '00' END, ':00')) as HourBucket from src1").createOrReplaceTempView('src2')
# Add DayPart
spark.sql(f"select *, CASE WHEN (Hour >= 0 AND Hour < 6) THEN 'Night' WHEN (Hour >= 6 AND Hour < 12) THEN 'Morning' WHEN (Hour >= 12 AND Hour < 18) THEN 'Afternoon' ELSE 'Evening' END as DayPart FROM src2").createOrReplaceTempView('src3')
# Add BusinessHour
spark.sql(f"select *, CASE WHEN (Hour >= 8 AND Hour < 18) THEN 'Yes' ELSE 'No' END as BusinessHour FROM src3").createOrReplaceTempView('src_final')
#Write to Parquet
df = sqlContext.sql("select * from src_final");
df.write.parquet("/mnt/xxx/xx/xxx/")
I figured it out. The explode(sequence(...)) calls were very expensive, especially once the one for minutes was added: each explode multiplies the number of rows instead of labeling the rows that already exist. I fixed the code like this:
# Create Time string, add hour, minute, second
spark.sql(f"select CAST(CONCAT(hours.hh, ':', minutes.mm, ':', seconds.ss) as string) as Time, cast(hours.hh as int) as Hour, cast(minutes.mm as int) as Minute, cast(seconds.ss as int) as Second from hours cross join minutes cross join seconds ").createOrReplaceTempView('time')
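To see why the fix above helps, the row counts can be worked out in plain Python without Spark. The sketch below is only an illustration under that assumption: it builds the same hours x minutes x seconds cross join with itertools.product, derives Hour, Minute and Second from the strings as the fixed query does, and compares the size with what three extra explode steps would produce. The variable names are made up for this example.

```python
from itertools import product

# Same zero-padded strings as the 'hours', 'minutes' and 'seconds' views.
hours = [f"{h:02d}" for h in range(24)]
minutes = [f"{m:02d}" for m in range(60)]
seconds = [f"{s:02d}" for s in range(60)]

# Equivalent of the fixed query: one row per time of day, with the
# numeric parts cast from the strings instead of exploded sequences.
rows = [
    (f"{hh}:{mm}:{ss}", int(hh), int(mm), int(ss))
    for hh, mm, ss in product(hours, minutes, seconds)
]

rows_fixed = len(rows)

# In the original query every explode(sequence(...)) multiplied the rows:
# once by 24 (Hour), then by 60 (Minute), then by 60 (Second).
rows_original = rows_fixed * 24 * 60 * 60

print(rows_fixed, rows_original)
```

The fixed version keeps one row per second of the day, while the original asked Spark to sort and write several billion rows.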

How to filter a string from all columns of a CSV file using Python

I have a CSV file and I need to check all columns for ? and remove the rows that contain it.
Below is an example:
Column1 Column 2 Column 3
1 ? 3
2 ?.. 1
? 2 ?.
? 4 4
I tried the code below, but it does not work:
data = readData("text.csv")
print(data)
def Filter(string, substr):
    return [str for str in string if
            any(sub not in str for sub in substr)]
string = data
substr = ['?', '?.', '? ', '? ']
filter_data = Filter(string, substr)
My code to get the output as tuples is below:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def readData(filename):
    data = pd.read_csv(filename, skipinitialspace=True)
    return [d for d in data.itertuples(index=False, name=None)]
data = readData("problem2.csv")
print(data)
[('18.0', 8, '307.0 ', '130.0 ', '3504.', '12.0', 70, 1, 'chevrolet chevelle malibu'), ('15.0', 8, '350.0 ', '165.0 ', '3693.', '11.5', 70, 1, 'buick skylark 320'), ('18.0', 8, '318.0 ', '150.0 ', '?.', '11.0', 70, 1, 'plymouth satellite'), ('16.0', 8, '304.0 ', '150.0 ', '3433.', '12.0', 70, 1, 'amc rebel sst'), ('17.0', 8, '302.0 ', '140.0 ', '3449.', '10.5', 70, 1, 'ford torino'), ('15.0', 8, '429.0 ', '198.0 ', '4341.', '10.0', 70, 1, 'ford galaxie 500'), ('14.0', 8, '454.0 ', '220.0 ', '4354.', '9.0', 70, 1, 'chevrolet impala'), ('14.0', 8, '440.0 ', '215.0 ', '4312.', '8.5', 70, 1, 'plymouth fury iii'),
Next, I want to remove the rows with '?' in any column and get the output in the same tuple format.
My input file is as follows:
mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name
18,8,307,130,3504,12,70,1,chevy malibu
18,8,308,140,?.,14,70,1,plymoth satellite
18,8,309,150,?,15,70,1,ford torino
18,8,310,150,? ,16,70,1,ford galaxy
18,8,310,150, ?,17,70,1,pontiac catalina
18,8,310,150,3505,18,70,1,ford maverick
The code to replace any of the following occurrences ['?','?.',' ?','? '] is as follows:
import csv
qs = ['?', '?.', ' ?', '? ']
with open('abc.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        row = ['' if r in qs else r for r in row]
        print(row)
The output of this will be as follows:
['mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name']
['18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu']
['18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite']
['18', '8', '309', '150', '', '15', '70', '1', 'ford torino']
['18', '8', '310', '150', '', '16', '70', '1', 'ford galaxy']
['18', '8', '310', '150', '', '17', '70', '1', 'pontiac catalina']
['18', '8', '310', '150', '3505', '18', '70', '1', 'ford maverick']
As you can see, the values in rows 3 through 6 were replaced with ''.
I ran it with one more sample dataset:
mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name
18,8,307,130,3504,12,70,1,chevy malibu
18,8,308,140,?.,14,70,1,plymoth satellite
18,8,309,?,3506,15,70,1,ford torino
18,8,310,160,? ,16,70,1,ford galaxy
18,8,311,170,3508, ?,70,1,pontiac catalina
18,8,312,180,3509,18,70,1,ford maverick
Output is:
['mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name']
['18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu']
['18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite']
['18', '8', '309', '', '3506', '15', '70', '1', 'ford torino']
['18', '8', '310', '160', '', '16', '70', '1', 'ford galaxy']
['18', '8', '311', '170', '3508', '', '70', '1', 'pontiac catalina']
['18', '8', '312', '180', '3509', '18', '70', '1', 'ford maverick']
In this scenario, the ? appears in various columns, and the code still handles it.
If you want to process all the rows in one go, you can read all the lines into one variable and process that.
qs = {'?.': '', ' ?': '', '? ': '', '?': ''}
with open('abc.txt') as csv_file:
    lines = csv_file.readlines()
for i, text in enumerate(lines):
    # apply each replacement in turn to the current line
    for a, b in qs.items():
        text = text.replace(a, b)
    lines[i] = text
print(lines)
Your output data will be as follows:
['mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name\n', '18,8,307,130,3504,12,70,1,chevy malibu\n', '18,8,308,140,,14,70,1,plymoth satellite\n', '18,8,309,,3506,15,70,1,ford torino\n', '18,8,310,160,,16,70,1,ford galaxy\n', '18,8,311,170,3508,,70,1,pontiac catalina\n', '18,8,312,180,3509,18,70,1,ford maverick\n']
Tuple output
It looks like you are expecting tuples as output.
Here's the code to do that:
import csv
qs = {'?.': '', ' ?': '', '? ': '', '?': ''}
final_list = []
with open('abc.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        row = ['' if r in qs else r for r in row]
        final_list.append(tuple(row))
print(final_list)
The output will be as follows:
[('mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name'), ('18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu'), ('18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite'), ('18', '8', '309', '', '3506', '15', '70', '1', 'ford torino'), ('18', '8', '310', '160', '', '16', '70', '1', 'ford galaxy'), ('18', '8', '311', '170', '3508', '', '70', '1', 'pontiac catalina'), ('18', '8', '312', '180', '3509', '18', '70', '1', 'ford maverick')]
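The code above blanks out the ? cells, while the question asks for the affected rows to be removed. If dropping the rows is what is wanted, a variation along these lines might work. It is a sketch only: the helper name drop_marked_rows and the MARKERS set are made up for this example, and it reads the second sample dataset from an in-memory string rather than from abc.txt.

```python
import csv
import io

# Second sample dataset from above, held in memory for the example.
SAMPLE = """\
mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name
18,8,307,130,3504,12,70,1,chevy malibu
18,8,308,140,?.,14,70,1,plymoth satellite
18,8,309,?,3506,15,70,1,ford torino
18,8,310,160,? ,16,70,1,ford galaxy
18,8,311,170,3508, ?,70,1,pontiac catalina
18,8,312,180,3509,18,70,1,ford maverick
"""

# Cells are compared after stripping whitespace, so ' ?' and '? ' match '?'.
MARKERS = {'?', '?.'}


def drop_marked_rows(csv_file):
    """Return the rows as tuples, skipping any row with a placeholder cell."""
    reader = csv.reader(csv_file, delimiter=',')
    return [
        tuple(row)
        for row in reader
        if not any(cell.strip() in MARKERS for cell in row)
    ]


kept_rows = drop_marked_rows(io.StringIO(SAMPLE))
print(kept_rows)
```

To run it against a real file, pass the object returned by open('abc.txt') instead of the io.StringIO wrapper.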

Open a text file containing a list of lists as a DataFrame

I am new to Python. I have a text file, 'asv.txt', with the following content:
[['10', '50', '', ' Ind ', '', ''], ['40', '30', '', ' Ind ', 'Mum', ''], ['50', '10', '', ' Cd ', '', '']]
How do I read it as a CSV or as a DataFrame?
import ast
import pandas as pd
# Read file (or just copy text)
with open('asv.txt') as f:
    data = f.read()
# Convert str to list with ast
data = ast.literal_eval(data)
# Load dataframe using the "data" argument, which can accept a list and treats it as rows
df = pd.DataFrame(data=data)
Or much simpler for this specific case:
df = pd.DataFrame(data=[['10', '50', '', ' Ind ', '', ''], ['40', '30', '', ' Ind ', 'Mum', ''], ['50', '10', '', ' Cd ', '', '']])

How to add a character to a chararray element that already has characters in Python 3

In python 2.7 I can do...
>>> import numpy
>>> flag=numpy.chararray(10) + ' '
>>> flag
chararray(['', '', '', '', '', '', '', '', '', ''],
dtype='|S6')
>>> flag[5] = 'a'
>>> flag
chararray(['', '', '', '', '', 'a', '', '', '', ''],
dtype='|S6')
>>> flag[5]=flag[5]+'b'
>>> flag
chararray(['', '', '', '', '', 'ab', '', '', '', ''],
dtype='|S6')
But this did not work in Python 3.
By the way, how can I save the "flag" array together with a numeric array to a text file? Like this:
1 1
1 1
1 1
1 1
1 1
1 1 ab
1 1
1 1
1 1
1 1
I tried np.savetxt, but it doesn't work.
Many thanks.
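In Python 3 the session above likely fails because numpy.chararray(10) holds byte strings while ' ' is a unicode str, and the two cannot be concatenated. One possible workaround, sketched below, is to use an ordinary NumPy array with a unicode dtype and to convert the numbers to strings before saving everything with a '%s' format. The names numbers, table and buffer, and the in-memory buffer itself, are assumptions made for this example.

```python
import io

import numpy as np

# Ten empty unicode strings, each able to hold up to 6 characters.
flag = np.full(10, '', dtype='U6')
flag[5] = 'a'
flag[5] = flag[5] + 'b'   # element is a Python-compatible str, so + works

# Hypothetical numeric data to save next to the flags.
numbers = np.ones((10, 2), dtype=int)

# Convert numbers to strings so all columns share one string dtype.
table = np.column_stack([numbers.astype(str), flag])

# Write to an in-memory buffer here; pass a filename to write to disk.
buffer = io.StringIO()
np.savetxt(buffer, table, fmt='%s')
print(buffer.getvalue())
```

Rows with an empty flag end with a trailing space, since the third column is written as an empty string.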

What does the following code do, in simple terms?

Can you tell me in simple terms what this code does?
board = [['' for x in range(BOARD_SIZE)] for y in range(BOARD_SIZE)]
This code creates a list of BOARD_SIZE lists. Each of these lists will contain BOARD_SIZE empty strings. So if BOARD_SIZE is 3 then the board will be:
board = [ ['', '', ''],
          ['', '', ''],
          ['', '', ''] ]
For BOARD_SIZE equal to 3, you can write the same board out literally in a single line:
board = [['', '', ''], ['', '', ''], ['', '', '']]
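For comparison, the comprehension can also be unrolled into ordinary nested loops. This is just an illustrative sketch; BOARD_SIZE is set to 3 here as an assumption so the result matches the board shown above.

```python
BOARD_SIZE = 3  # assumed value for the example

# Loop version of:
#   [['' for x in range(BOARD_SIZE)] for y in range(BOARD_SIZE)]
board = []
for y in range(BOARD_SIZE):        # one pass per row
    row = []
    for x in range(BOARD_SIZE):    # one pass per cell in the row
        row.append('')
    board.append(row)

# Each row is its own list, so changing one cell leaves other rows alone.
board[0][0] = 'X'
print(board)
```

The outer loop corresponds to the outer `for y in ...` of the comprehension, and the inner loop to the inner `for x in ...`.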
