Pandas returns 50x1 matrix instead of 50x7? (read_csv gone wrong) - python-3.x

I'm quite new to Python. I'm trying to load a .csv file with pandas, but it returns a 50x1 matrix instead of the expected 50x7. I'm not sure whether this is because my data contains numbers with "," (although I thought the quotechar attribute would solve that problem).
EDIT: I should perhaps mention that including the attribute sep=',' doesn't solve the issue.
My code looks like this
df = pd.read_csv('data.csv', header=None, quotechar='"')
print(df.head)
print(len(df.columns))
print(len(df.index))
Any ideas? Thanks in advance
Here is a subset of the data as text
10-01-2021,813,116927,"2,01",-,-,-
11-01-2021,657,117584,"2,02",-,-,-
12-01-2021,462,118046,"2,03",-,-,-
13-01-2021,12728,130774,"2,24",-,-,-
14-01-2021,17895,148669,"2,55",-,-,-
15-01-2021,15206,163875,"2,81",5,5,"0,0001"
16-01-2021,4612,168487,"2,89",7,12,"0,0002"
17-01-2021,2536,171023,"2,93",717,729,"0,01"
18-01-2021,3883,174906,"3,00",2147,2876,"0,05"
Here is the output of the head-function
0
0 27-12-2020,6492,6492,"0,11",-,-,-
1 28-12-2020,1987,8479,"0,15",-,-,-
2 29-12-2020,8961,17440,"0,30",-,-,-
3 30-12-2020,11477,28917,"0,50",-,-,-
4 31-12-2020,6197,35114,"0,60",-,-,-
5 01-01-2021,2344,37458,"0,64",-,-,-
6 02-01-2021,8895,46353,"0,80",-,-,-
7 03-01-2021,6024,52377,"0,90",-,-,-
8 04-01-2021,2403,54780,"0,94",-,-,-

Using your data I got the expected result (even without quotechar='"').
Could you maybe show us your output?
import pandas as pd
df = pd.read_csv('data.csv', header=None)
print(df)
> 0 1 2 3 4 5 6
> 0 10-01-2021 813 116927 2,01 - - -
> 1 11-01-2021 657 117584 2,02 - - -
> 2 12-01-2021 462 118046 2,03 - - -
> 3 13-01-2021 12728 130774 2,24 - - -
> 4 14-01-2021 17895 148669 2,55 - - -
> 5 15-01-2021 15206 163875 2,81 5 5 0,0001
> 6 16-01-2021 4612 168487 2,89 7 12 0,0002
> 7 17-01-2021 2536 171023 2,93 717 729 0,01
> 8 18-01-2021 3883 174906 3,00 2147 2876 0,05
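For anyone who wants to reproduce this without a file on disk, here is a minimal sketch that feeds three of the sample rows from the question to read_csv through io.StringIO. The variable names (sample, df, df_num) are made up for illustration. The second call additionally assumes you want the quoted "2,01"-style values parsed as numbers, which the decimal parameter of read_csv is meant for.

```python
import io

import pandas as pd

# Three rows copied from the question's sample data.
sample = '''10-01-2021,813,116927,"2,01",-,-,-
15-01-2021,15206,163875,"2,81",5,5,"0,0001"
18-01-2021,3883,174906,"3,00",2147,2876,"0,05"
'''

# The default quotechar is already '"', so the quoted commas do not split fields.
df = pd.read_csv(io.StringIO(sample), header=None)
print(df.shape)

# Optional: treat "," as the decimal mark so column 3 becomes numeric.
df_num = pd.read_csv(io.StringIO(sample), header=None, decimal=',')
print(df_num.dtypes)
```

If this gives 7 columns but your real file does not, the file on disk probably differs from the text pasted in the question (for example, every line may be wrapped in an extra pair of quotes).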

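You need to define the separator explicitly. Note that delimiter is only an alias for sep in read_csv, and recent pandas versions reject a call that passes both, so pass just one of them:
df = pd.read_csv('data.csv', header=None, sep=',', quotechar='"')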

Related

Counting the number of times the values are more than the mean for a specific column in Dataframe

I'm trying to find the number of times the value in a certain column (in this case under "AveragePrice") is more than its mean & median. I calculated the mean using the below:
mean_AveragePrice = avocadodf["AveragePrice"].mean(axis = 0)
median_AveragePrice = avocadodf["AveragePrice"].median(axis = 0)
How do I count the number of times the values were more than the mean?
Sample of the Dataframe:
Date AveragePrice Total Volume PLU4046 PLU4225 PLU4770 Total Bags
0 27/12/2015 1.33 64236.62 1036.74 54454.85 48.16 8696.87
1 20/12/2015 1.35 54876.98 674.28 44638.81 58.33 9505.56
2 13/12/2015 0.93 118220.22 794.70 109149.67 130.50 8145.35
3 06/12/2015 1.08 78992.15 1132.00 71976.41 72.58 5811.16
4 29/11/2015 1.28 51039.60 941.48 43838.39 75.78 6183.95
5 22/11/2015 1.26 55979.78 1184.27 48067.99 43.61 6683.91
6 15/11/2015 0.99 83453.76 1368.92 73672.72 93.26 8318.86
7 08/11/2015 0.98 109428.33 703.75 101815.36 80.00 6829.22
8 01/11/2015 1.02 99811.42 1022.15 87315.57 85.34 11388.36
import numpy as np
mean_AveragePrice = avocadodf["AveragePrice"].mean(axis = 0)
median_AveragePrice = avocadodf["AveragePrice"].median(axis = 0)
where_bigger = np.where((avocadodf["AveragePrice"] > mean_AveragePrice) & (avocadodf["AveragePrice"] > median_AveragePrice), 1, 0 )
where_bigger.sum()
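So you already have the mean and the median, and now you only need the comparison; np.where will help you out. Note that the condition above counts the values that are greater than both the mean and the median.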
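As a possible alternative, a boolean comparison on the column can be summed directly, since True counts as 1. This is only a sketch: the small avocadodf below is rebuilt from the AveragePrice values in the question's sample, and the names above_mean and above_median are made up for illustration.

```python
import pandas as pd

# AveragePrice values copied from the sample rows in the question.
avocadodf = pd.DataFrame(
    {"AveragePrice": [1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02]}
)

prices = avocadodf["AveragePrice"]

# Comparing a Series with a scalar gives a boolean Series; sum() counts the True values.
above_mean = int((prices > prices.mean()).sum())
above_median = int((prices > prices.median()).sum())

print(above_mean, above_median)
```

This keeps the two counts separate, whereas the np.where version combines both conditions into one count.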

How to create a dataframe with different random numbers on each column?

I'm trying to generate different random numbers for each column, but every column ends up with the same values. How can I fix this, ideally in one line?
CODE:
yuju = pd.DataFrame()
column_price_x = [random.uniform(65.5,140.5) for i in range(20)]
for i in range(1990,2020):
    yuju[i] = column_price_x
yuju
RESULT
EXPECTED:
Different numbers value for each column
How can I deal with it?
It's much easier than you think
In [12]: import numpy as np
In [13]: df = pd.DataFrame(np.random.rand(5,5))
In [14]: df
Out[14]:
0 1 2 3 4
0 0.463645 0.818606 0.520964 0.016413 0.286529
1 0.701693 0.556813 0.352911 0.738017 0.148805
2 0.899378 0.626350 0.821576 0.917648 0.404706
3 0.985617 0.336138 0.443910 0.690457 0.627859
4 0.121281 0.784853 0.799065 0.102332 0.156317
np.random.rand samples from the standard uniform distribution (over [0, 1))
Edit
if you want uniform distribution over given numbers, use np.random.uniform
In [16]: pd.DataFrame(np.random.uniform(low=65.5,high=140.5,size=(5,5))
...: )
Out[16]:
0 1 2 3 4
0 124.356069 96.718934 100.587485 136.670313 124.134073
1 68.109675 105.677037 86.084935 109.284336 108.393333
2 120.445978 125.036895 92.557137 105.864824 95.297450
3 91.027931 140.040051 94.362951 80.870850 70.106912
4 107.404708 92.472469 84.748544 82.116756 129.313166
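Applied to the shape in the question (20 rows, one column per year from 1990 to 2019), the same idea fits in one line. This is a sketch that assumes NumPy's newer Generator API (np.random.default_rng) is available; the name rng is just for illustration.

```python
import numpy as np
import pandas as pd

# Unseeded generator, so the values differ on every run.
rng = np.random.default_rng()

# One call draws all 20 x 30 values independently; columns are labelled 1990..2019.
yuju = pd.DataFrame(rng.uniform(65.5, 140.5, size=(20, 30)), columns=range(1990, 2020))

print(yuju.shape)
```

Because the whole matrix is drawn at once, no list is reused between columns, which is what caused the identical columns in the original loop.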
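Here is the solution: draw a fresh list of random numbers in each iteration, so that every column gets its own values.
import random
import pandas as pd
yuju = pd.DataFrame()
for year in range(1990, 2020):
    # a new list per column; the inner loop variable no longer reuses the outer name
    yuju[year] = [random.uniform(65.5, 140.5) for _ in range(20)]
yuju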
output
1990 1991 1992 1993 1994 1995 1996 1997 ...
0 73.117785 104.158470 76.704672 136.295814 106.008801 88.129275 96.843800 118.172649 ... 106.08
1 77.146977 131.584449 112.781430 113.071448 118.806880 140.301281 132.196554 136.222878 ... 74.85
2 67.976294 90.571586 137.313729 126.388545 134.941530 119.544528 119.692859 124.883332 ... 82.48
3 76.577618 102.765745 137.014399 84.696234 70.087628 86.180974 121.070030 87.991356 ... 71.67
4 104.675987 134.869611 120.221701 69.652423 105.650834 107.308007 122.372708 80.037225 ... 90.58
5 107.093326 124.649323 138.961846 84.312784 98.964176 87.691698 120.426266 79.888018 ... 97.46
6 97.375159 97.607740 119.027947 77.545403 81.365235 119.204719 75.426836 132.545121 ... 120.15
7 81.099338 94.315767 123.389789 85.734648 134.746295 99.196135 65.963834 72.895016 ... 135.63
8 129.577824 118.482358 137.838454 83.338883 68.603851 138.657750 85.155046 73.311065 ... 91.12
9 129.321333 134.598491 138.810883 119.487502 75.794849 125.314185 118.499014 126.969947 ... 74.86
10 122.704160 118.282868 114.196318 69.668442 112.237553 68.953530 115.395672 114.560736 ... 88.21
11 112.653109 109.635751 78.470715 81.973892 111.413094 76.918852 76.318205 129.423737 ... 103.06
12 80.984595 136.170595 83.258407 112.248942 96.730922 84.922575 104.984614 127.646325 ... 103.24
13 82.658896 97.066191 95.096705 107.757428 93.767250 93.958438 115.113325 98.931509 ... 105.32
14 85.173060 77.257117 72.668875 87.061919 130.088992 80.001858 104.526423 85.237558 ... 87.86
15 68.428850 79.948204 107.060400 92.962859 133.393354 93.806838 99.258857 138.314982 ... 86.80
16 115.105281 110.567551 119.868457 139.482290 103.235046 128.805920 140.131489 107.568099 ... 98.16
17 71.318147 119.965667 97.135972 90.174975 125.738171 115.655945 86.333461 114.574965 ... 134.80
18 134.000260 121.417473 104.832999 129.277671 139.932955 122.623911 92.369881 109.523118 ... 137.47
19 104.444951 111.712214 130.602922 119.446700 88.256841 110.316280 74.611164 88.364896 ... 115.32

Pandas dataframe.read_csv, quotechar does not work

I am not getting the expected output.
I am trying to convert a CSV file to a dataframe, but it is not working:
sales=pd.read_csv('Downloads/item.csv',sep=',',delimeter='"',error_bad_lines=False,quotechar='"')
This is my CSV file sample:
"account_number,name,item_code,category,quantity,unit price,net_price,date "
"093356,Waters-Walker,AS-93055,Shirt,5,82.68,413.40,2013-11-17 20:41:11"
"659366,Waelchi-Fahey,AS-93055,Shirt,18,99.64,1793.52,2014-01-03 08:14:27"
"563905,""Kerluke, Reilly and Bechtelar"",AS-93055,Shirt,17,52.82,897.94,2013-12-04 02:07:05"
"995267,Cole-Eichmann,GS-86623,Shoes,18,15.28,275.04,2014-04-09 16:15:03"
"524021,Hegmann and Sons,LL-46261,Shoes,7,78.78,551.46,2014-06-18 19:25:10"
"929400,""Senger, Upton and Breitenberg"",LW-86841,Shoes,17,38.19,649.23,2014-02-10 05:55:56"
Please take a look at the bold characters in the CSV file; they are enclosed in "".
Here is my proposal:
df = pd.read_csv('file.csv')
col_name = 'account_number,name,item_code,category,quantity,unit price,net_price,date'
z = pd.concat([df[col_name].str.split(r'(,(?=\S)|:)', expand=True)], axis=1)
z['date'] = z[14]+z[15]+z[16]+z[17]+z[18]
z = z.drop(columns=[1,3,5,7,9,11,13, 14,15,16,17,18])
z.columns = col_name.split(',')
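The crucial part is the regex r'(,(?=\S)|:)': a comma that is not followed by whitespace. It also splits on : because the pattern contains the alternative |:, and the capture group makes the separators themselves show up as extra columns (hence the drop). Removing that alternative should make the manual concatenation of the date unnecessary.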
Output:
account_number ... date
0 093356 ... 2013-11-17 20:41:11
1 659366 ... 2014-01-03 08:14:27
2 563905 ... 2013-12-04 02:07:05
3 995267 ... 2014-04-09 16:15:03
4 524021 ... 2014-06-18 19:25:10
5 929400 ... 2014-02-10 05:55:56
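Another way to look at this file: every line is itself one quoted CSV field that contains a CSV row, so it can be parsed in two passes without a regex. The sketch below assumes the file has exactly the layout shown in the question (three of its lines are inlined here as raw); the names raw, outer and inner are made up for illustration.

```python
import csv
import io

import pandas as pd

# Three lines copied from the sample in the question.
raw = '''"account_number,name,item_code,category,quantity,unit price,net_price,date "
"093356,Waters-Walker,AS-93055,Shirt,5,82.68,413.40,2013-11-17 20:41:11"
"563905,""Kerluke, Reilly and Bechtelar"",AS-93055,Shirt,17,52.82,897.94,2013-12-04 02:07:05"
'''

# Pass 1: strip the outer quotes; each line is a single field holding an inner CSV row.
outer = [row[0] for row in csv.reader(io.StringIO(raw)) if row]

# Pass 2: parse the inner rows; keep account numbers as text so leading zeros survive.
inner = pd.read_csv(io.StringIO("\n".join(outer)), dtype={"account_number": str})

# The header in the file ends with a space ("date "), so tidy the column names.
inner.columns = inner.columns.str.strip()

print(inner[["account_number", "name", "date"]])
```

For a real file, the first pass would read from open('Downloads/item.csv', newline='') instead of the inlined string.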

read_html resulting in first row as column header name despite header = None

url = "http://www.espn.com/nba/standings"
dfs = pd.read_html(url, header = None)
dfs[1]
resulting in:
1* --MILMilwaukee Bucks
0 2y --TORToronto Raptors
1 3x --PHIPhiladelphia 76ers
2 4x --BOSBoston Celtics
3 5x --INDIndiana Pacers
0 2y --TORToronto Raptors
1* --MILMilwaukee Bucks shouldn't be a column name
I feel like I am doing something wrong (haven't used Pandas in a while), but from what I have read header = None should work.
I tried this as well, and header = None did not work in my case either (I am still looking for the reason why). Instead, you can use header = 0, which works well.
data = pd.read_html("test.html",header = 0)
print(data)
Output:
[ Programming Language Creator Year
0 C Dennis Ritchie 1972
1 Python Guido Van Rossum 1989
2 Ruby Yukihiro Matsumoto 1995]
This will work for you. ;)

svm train output file has fewer lines than the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 No. features and 2 No. classes [0,1].
i.e
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), this has 12 fewer lines than that of the input file. The line count here shows 450, although there are obviously also 9 lines at the beginning showing the various parameters generated
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 12 lines in total from the original input of 453 have gone. I am new to svm and was hoping that someone could shed some light on why this might have happened?
Thanks in advance
Updated.........
I now believe that in generating the model, it has removed lines whereby the labels and all the parameters are exactly the same.
To explain............... My input is a set of miRNAs which have been classified as 1 and 0 depending on their involvement in a particular process or not (i.e 1=Yes & 0=No). The input file looks something like.......
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Whereby, lines one and three are exactly the same and as a result will be removed from the output model. My question is then both why the output model would do this and how I can get around this (whilst using the same features)?
Whilst both SOME OF the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.
NOTE: The input file does not have a feature for the miRNA name (which would clearly show the differences between the lines). However, in terms of the features used (i.e. nucleotide percentage content), some of the miRNAs have exactly the same percentage content of A, U, G and C. As a result they seem to be viewed as duplicates and removed from the output model, even though they are not duplicates (hence there are fewer lines in the output model).
the format of the input file is:
Where:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like this (see the first two lines below: they appear identical, but each line represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and python 2.7.5
My first observation is that you should align your input file properly. The libsvm code does not look for exactly 4 features; it identifies them by the string literals you have provided that separate the features from the labels. I suggest converting your input file manually to create the desired input argument.
Try the following code in python to run
Requirements - h5py, if your input is from matlab. (.mat file)
pip install h5py
import h5py
f = h5py.File('traininglabel.mat', 'r')# give label.mat file for training
variables = f.items()
labels = []
c = []
import numpy as np
for var in variables:
    data = var[1]
    lables = (data.value[0])
trainlabels= []
for i in lables:
    trainlabels.append(str(i))
finaltrain = []
trainlabels = np.array(trainlabels)
# turn the float labels '0.0' / '1.0' into the class labels '0' / '1'
for i in range(0,len(trainlabels)):
    if trainlabels[i] == '0.0':
        trainlabels[i] = '0'
    if trainlabels[i] == '1.0':
        trainlabels[i] = '1'
    print(trainlabels[i])
f = h5py.File('training_features.mat', 'r') #give features here
variables = f.items()
lables = []
file = open('traindata.txt', 'w+')
for var in variables:
    data = var[1]
    lables = data.value
# one output line per sample: the label followed by its feature values
for i in range(0,1000): #no of training samples in file features.mat
    file.write(str(trainlabels[i]))
    file.write(' ')
    for j in range(0,49):
        file.write(str(lables[j][i]))
        file.write(' ')
    file.write('\n')
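A side note on the line counts in the question: a libsvm model file stores only the support vectors, not every training row, so it does not have to be as long as the input file. The sketch below just redoes the arithmetic from the header quoted in the question; summarize_model_header is a hypothetical helper written for this illustration, not part of libsvm.

```python
# Header copied from the model file shown in the question.
header = """svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV"""


def summarize_model_header(header_text, n_training_rows):
    """Relate the line counts of a libsvm model file to the training file."""
    lines = header_text.splitlines()
    # "key value..." pairs; the bare "SV" marker line has no value and is skipped.
    fields = dict(line.split(None, 1) for line in lines if " " in line)
    total_sv = int(fields["total_sv"])
    per_class = [int(x) for x in fields["nr_sv"].split()]
    return {
        "header_lines": len(lines),
        "total_sv": total_sv,
        "model_lines": len(lines) + total_sv,
        "non_support_rows": n_training_rows - total_sv,
        "per_class_sums_match": sum(per_class) == total_sv,
    }


summary = summarize_model_header(header, 453)
print(summary)
```

With 9 header lines and 441 support vectors the model file has 450 lines, and the 12 "missing" rows are simply training points that did not become support vectors, which may or may not coincide with the duplicated rows.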
