How can I read and process 100 bytes at a time from a large CSV file? - python-3.x

The csv below is only a snippet of my main data file.
customer.csv
customer_id,order_id,number_of_items
10,4736,9
5,3049,1
1,4689,3
6,4114,9
1,4524,15
2,3727,16
3,3507,7
7,3988,3
5,4993,16
6,1945,4
7,3081,7
3,3707,2
5,1739,12
9,4167,17
7,3242,12
2,3109,10
10,2197,20
10,3528,13
8,4917,2
5,1713,19
8,4224,4
7,2160,2
10,2044,19
10,2956,8
3,3906,2
5,2288,16
7,1854,20
7,4404,2
9,1622,2
7,3685,2
10,2755,10
3,3390,10
6,1424,6
3,2127,15
4,1221,15
9,2994,14
1,1413,13
7,2771,7
3,4579,13
10,2208,4
CURRENTLY ALL I HAVE
import os
os.path.getsize("customer.csv")  # outputs 424 bytes
HOW I THINK I NEED TO PROCEED
I think I need to do something with opening the csv and reading bytes? Then look at each row bitwise?
Please note, I am not looking specifically for someone to just give me an answer on how to do this (although that would be appreciated). Therefore, if someone could just point me in the right direction or give me some topics to look into that would be great. Side note, I know I am supposed to use encoding and decoding somewhere for this task.

This script uses the csv module to load the data from customer.csv and computes the per-customer average using the built-in statistics module:
import csv
from statistics import mean

with open('customer.csv', newline='') as csvfile:
    data = csv.DictReader(csvfile)
    # group the customers by customer_id
    customers = {}
    for order in data:
        customers.setdefault(order['customer_id'], []).append(int(order['number_of_items']))

# print the average:
print('{:<15} {}'.format('customer_id', 'average'))
for k, v in customers.items():
    print('{:<15} {:.2f}'.format(k, mean(v)))
Prints:
customer_id     average
10              11.86
5               12.80
1               10.33
6               6.33
2               13.00
3               8.17
7               6.88
9               11.00
8               3.00
4               15.00
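The original question also asks about reading the file 100 bytes at a time. A minimal sketch of that, assuming the file is UTF-8 encoded: open it in binary mode, read fixed-size chunks, and decode them with an incremental decoder so a multi-byte character split across a chunk boundary is not mangled:
import codecs

# an incremental decoder keeps partial multi-byte sequences between chunks
decoder = codecs.getincrementaldecoder('utf-8')()

with open('customer.csv', 'rb') as f:
    while True:
        chunk = f.read(100)           # read at most 100 bytes
        if not chunk:
            break
        text = decoder.decode(chunk)  # bytes -> str
        # process the decoded text here, e.g. accumulate it and split on newlines
        print(repr(text))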

Related

How to search for specific text in csv within a Pandas, python

Hello, I want to find the account text "#" in the title column and save it to a new csv. Pandas can do it; I tried to make it work but it didn't.
This is my csv: http://www.sharecsv.com/s/c1ed9790f481a8d452049be439f4e3d8/Newnormal.csv
This is my code:
import pandas as pd
data = pd.read_csv("Newnormal.csv")
data.dropna(inplace = True)
sub ='#'
data["Indexes"]= data["title"].str.find(sub)
print(data)
I want results like this:
From, to, title
Xavier5501, KudiiThaufeeq, RT #KudiiThaufeeq: Royal Rape, Royal Harassment, Royal Cocktail Party, Royal Pedo, Royal Bidding, Royal Maalee Bayaan, Royal Slavery..et
Thank you.
reduce the records to only those that have a "#" in title
define a new column to, which is the text between "#" and ":"
you are left with some records where this leaves NaN in the to column; I've just filtered these out
df = pd.read_csv("Newnormal.csv")
df = df[df["title"].str.contains("#")==True]
df["to"] = df["title"].str.extract(r".*([#][A-Z,a-z,0-9,_]+[:])")
df = df[["from","to","title"]]
df[~df["to"].isna()].to_csv("ToNewNormal.csv", index=False)
df[~df["to"].isna()]
output
from to title
1 Xavier5501 #KudiiThaufeeq: RT #KudiiThaufeeq: Royal Rape, Royal Harassmen...
2 Suzane24979006 #USAID_NISHTHA: RT #USAID_NISHTHA: Don't step outside your hou...
3 sandeep_sprabhu #USAID_NISHTHA: RT #USAID_NISHTHA: Don't step outside your hou...
4 oliLince #Timothy_Hughes: RT #Timothy_Hughes: How to Get a Salesforce Th...
7 rismadwip #danielepermana: RT #danielepermana: Pak kasus covid per hari s...
... ... ... ...
992 Reptoid_Hunter #sapiofoxy: RT #sapiofoxy: I literally can't believe we ha...
994 KPCResearch #sapiofoxy: RT #sapiofoxy: I literally can't believe we ha...
995 GreySparkUK #VoxSmartGlobal: RT #VoxSmartGlobal: The #newnormal will see mo...
997 Gabboa10 #HuShameem: RT #HuShameem: One of #PGO_MV admin staff test...
999 wanjirunjendu #ntvkenya: RT #ntvkenya: AAK's Mugure Njendu shares insig...

Resampling Time Series Data (Pandas Python 3)

Trying to convert data at daily frequency to weekly frequency.
In:
weeklyaapl = pd.DataFrame()
weeklyaapl['Open'] = aapl.Open.resample('W').iloc[0]
#here I am trying to take the first value of the aapl.Open,
#that falls within the week.
Out:
ValueError: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
I want the true open (the first open that prints for the week) (the open of the first day in that week).
It instead wants me to take the mean of the daily open values for a given week using .mean(), which is not the information I need.
Can't seem to interpret the error, documentation isn't helping either.
I think you need:
aapl.resample('W').first()
Output:
Open High Low Close Volume
Date
2010-01-10 30.49 30.64 30.34 30.57 123432050
2010-01-17 30.40 30.43 29.78 30.02 115557365
2010-01-24 29.76 30.74 29.61 30.72 182501620
2010-01-31 28.93 29.24 28.60 29.01 266424802
2010-02-07 27.48 28.00 27.33 27.82 187468421
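If the goal is the true weekly open together with a sensible weekly high/low/close, a per-column aggregation is a common alternative; a sketch, assuming aapl has the columns shown above:
weekly = aapl.resample('W').agg({'Open': 'first',    # first open of the week
                                 'High': 'max',      # highest high of the week
                                 'Low': 'min',       # lowest low of the week
                                 'Close': 'last',    # last close of the week
                                 'Volume': 'sum'})   # total volume for the week
For just the open, aapl['Open'].resample('W').first() is enough.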

How do read a SEC txt-file into a pandas dataframe?

I am trying to use SEC (U.S. Securities and Exchange Commission) data. The SEC provides useful data in a txt format. I am using the
Financial Statement Data Sets for the second quarter of 2017. You can find the data I use here.
I am trying to read the txt files into a pandas dataframe. I tried the following ways:
sub = pd.read_fwf('sub.txt')
sub_1 = pd.read_csv('sub.txt')
I get no error with using Pandas' read_fwf function - but the output is utter rubbish. Here is the head of the dataframe:
adsh cik name sic countryba stprba cityba zipba bas1 bas2 baph countryma stprma cityma zipma mas1 mas2 countryinc stprinc ein former changed afs wksi fye form period fy fp filed accepted prevrpt detail instance nciks aciks Unnamed: 1
0 0000002178-17-000038\t2178\tADAMS RESOURCES & ... NaN
1 0000002488-17-000107\t2488\tADVANCED MICRO DEV... NaN
I do get an error when using read_csv: Error tokenizing data. C error: Expected 2 fields in line 7, saw 3
Any ideas on how to read the data into a pandas dataframe?
It looks like the files are tab separated - that's why you're seeing \t in the results. pandas read_csv defaults to comma separated values, so you have to change the separator. This is controlled by the sep parameter. In addition, you will need to provide the proper encoding (errors are thrown when trying to read the num, pre, and tag files). Generally ISO-8859-1 is a good choice.
#import pandas
import pandas as pd
#read in the .txt file and choose a separator and encoding standard
df = pd.read_csv('sub.txt', sep='\t', encoding='ISO-8859-1')
#output the results
print(df)
adsh cik name \
0 0000002178-17-000038 2178 ADAMS RESOURCES & ENERGY, INC.
1 0000002488-17-000107 2488 ADVANCED MICRO DEVICES INC
2 0000002969-17-000019 2969 AIR PRODUCTS & CHEMICALS INC /DE/
3 0000002969-17-000024 2969 AIR PRODUCTS & CHEMICALS INC /DE/
4 0000003499-17-000010 3499 ALEXANDERS INC
5 0000003545-17-000043 3545 ALICO INC
6 0000003570-17-000073 3570 CHENIERE ENERGY INC
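The answer notes that the num, pre, and tag files throw errors without the proper encoding; a small sketch that reads all four files of the data set the same way (assuming they sit in the working directory):
import pandas as pd

frames = {}
for name in ['sub', 'num', 'pre', 'tag']:
    # every file in the SEC data set is tab separated and ISO-8859-1 encoded
    frames[name] = pd.read_csv(name + '.txt', sep='\t', encoding='ISO-8859-1')

print(frames['sub'].head())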

simple Granger Causality test using statsmodels.tsa.grangercausalitytests

I have several time-series files (540 rows x 6 columns) on which I would like to run a simple Granger Causality test using statsmodels.tsa.grangercausalitytests:
from statsmodels.tsa.stattools import grangercausalitytests
My pandas dataframe (df) contains the data in the format shown below.
I tried to print the tests using the Open and Close columns with the following:
print(grangercausalitytests([df[Open], df[Close]], maxlag=15, addconst=True, verbose=True))
but it does not work. Is there a way to perform the Granger test on each column (Open, High, Low) against Close, i.e. Open and Close, High and Close, Low and Close?
Epochtime Open High Low Close Vol
1486094520, 808.11000, 808.11000, 808.11000, 808.11000, 100
1486094580, 809.45000, 809.45000, 809.45000, 809.45000, 100
1486094820, 809.99000, 809.99000, 809.99000, 809.99000, 100
1486095540, 811.45000, 811.45000, 811.45000, 811.45000, 100
1486095840, 811.30000, 811.30000, 811.01000, 811.01000, 300
1486095900, 810.76000, 810.76000, 810.76000, 810.76000, 100
1486096200, 812.00000, 812.00000, 812.00000, 812.00000, 100
It requires a 2-dimensional array; try this:
print(grangercausalitytests(df[['Open', 'Close']], maxlag=15, addconst=True, verbose=True))
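To cover all three pairs the question asks about, a simple loop over the columns works; a sketch, assuming the same df as above:
from statsmodels.tsa.stattools import grangercausalitytests

# grangercausalitytests checks whether the second column Granger-causes the first
for col in ['Open', 'High', 'Low']:
    print('--- {} and Close ---'.format(col))
    grangercausalitytests(df[[col, 'Close']], maxlag=15, addconst=True, verbose=True)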

svm train output file has less lines than that of the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 features, and 2 classes [0, 1].
i.e
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), it has 12 fewer lines than the input file. The line count here shows 450, 9 of which are lines at the beginning showing the various parameters generated,
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
So 12 lines in total from the original input of 453 are gone. I am new to svm and was hoping that someone could shed some light on why this might have happened.
Thanks in advance.
Updated:
I now believe that in generating the model, it has removed lines where the labels and all the feature values are exactly the same.
To explain: my input is a set of miRNAs which have been classified as 1 or 0 depending on whether or not they are involved in a particular process (i.e. 1 = Yes, 0 = No). The input file looks something like:
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Here, lines one and three are exactly the same and as a result are removed from the output model. My question is then both why the output model does this and how I can get around it (whilst using the same features).
Whilst some of the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.
NOTE: The input file does not have a feature for the miRNA name (which would clearly show the differences between lines). However, in terms of the features used (nucleotide percentage content), some of the miRNAs have exactly the same percentage content of A, U, G and C, so they are viewed as duplicates and removed from the output model even though they are not (hence there are fewer lines in the output model).
The format of the input file is:
Where:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like the following (see the very first two lines below: they appear identical, yet each represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and Python 2.7.5.
Align your input file properly, is my first observation. The code for libsvm doesn't look for exactly 4 features; it identifies them by the string literals you have provided separating the features from the labels. I suggest manually converting your input file to create the desired input argument.
Try the following Python code.
Requirements: h5py, if your input is from matlab (a .mat file):
pip install h5py
import h5py
import numpy as np

# --- read the training labels from the .mat file ---
f = h5py.File('traininglabel.mat', 'r')  # give label.mat file for training
variables = f.items()
labels = []
c = []
for var in variables:
    data = var[1]
    lables = (data.value[0])

trainlabels = []
for i in lables:
    trainlabels.append(str(i))

finaltrain = []
trainlabels = np.array(trainlabels)
# the .mat file stores the labels as floats, convert them to '0'/'1' strings
for i in range(0, len(trainlabels)):
    if trainlabels[i] == '0.0':
        trainlabels[i] = '0'
    if trainlabels[i] == '1.0':
        trainlabels[i] = '1'
    print trainlabels[i]

# --- read the feature matrix and write the training file ---
f = h5py.File('training_features.mat', 'r')  # give features here
variables = f.items()
lables = []
file = open('traindata.txt', 'w+')
for var in variables:
    data = var[1]
    lables = data.value

# one line per sample: the label followed by its feature values
for i in range(0, 1000):  # no. of training samples in features.mat
    file.write(str(trainlabels[i]))
    file.write(' ')
    for j in range(0, 49):
        file.write(str(lables[j][i]))
        file.write(' ')
    file.write('\n')
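As a side note, the original svm_input.txt is already in libsvm's label index:value format, so it could also be trained directly through libsvm's bundled Python bindings; a sketch, assuming the svmutil module from libsvm-3.22's python directory is importable:
from svmutil import svm_read_problem, svm_train, svm_save_model

# load the libsvm-format training file and train a C-SVC model with an RBF kernel
y, x = svm_read_problem('svm_input.txt')
model = svm_train(y, x, '-s 0 -t 2')
svm_save_model('svm_train_model.txt', model)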
