how to classify a large csv file of signals without headers in python? - python-3.x

i had a large csv file (3000*20000) of data without headers i added one columns to represent the classes. how i can fit the data to the model when the features has no headers and it can not be added manually due to the large number of columns.
is there i way to automatically iterate each columns in a row?
when i had a small file of 4 columns i used the following code:
import pandas as pd
pd = pd.ExcelFile("bcs.xlsx")
col = [0, 1, 2, 3]
data = pd.parse(pd.sheet_names[0], parse_cols = col)
pdc = list(data["pdc"])
pds = list(data["pds"])
pdsh = list(data["pdsh"])
pd_class = list(data["class"])
features = []
for i in range(len(pdc)):
features.append([pdc[i],pds[i],pdsh[i]])
labels = []
labels = pd_class
But with a 3000 by 20000 file i don't know how to identify the features and labels/target

Let's say you have a csv like that:
1,2,3,4,0
1,2,3,4,1
1,2,3,4,1
1,2,3,4,0
where the first 4 columns are features and the last one is the label or class you want. You can read the file with pandas.read_csv and create a dataframe for you features and one for your labels which you can fit next, to your model.
import pandas as pd
#CSV localPath
mypath ='C:\\...'
#The names of the columns you want to have in your dataframe
colNames = ['Feature1','Feature2','Feature3','Feature4','class']
#Read the data as dataframe
df = pd.read_csv(filepath_or_buffer = mypath,
names = colNames , sep = ',' , header = None)
#Get the first four columns as features
features = df.ix[:,:4]
#and last columns as label
labels = df['class']

Related

Append values to Dataframe in loop and if conditions

Need help please.
I have a dataframe that reads rows from Excel and appends to Dataframe if certain columns exist.
I need to add an additional Dataframe if the columns don't exist in a sheet and append filename and sheetname and write all the file names and sheet names for those sheets to an excel file. Also I want the values to be unique.
I tried adding to dfErrorList but it only showed the last sheetname and filename and repeated itself many times in the output excel file
from xlsxwriter import Workbook
import pandas as pd
import openpyxl
import glob
import os
path = 'filestoimport/*.xlsx'
list_of_dfs = []
list_of_dferror = []
dfErrorList = pd.DataFrame() #create empty df
for filepath in glob.glob(path):
xl = pd.ExcelFile(filepath)
# Define an empty list to store individual DataFrames
for sheet_name in xl.sheet_names:
df = pd.read_excel(filepath, sheet_name=sheet_name)
df['sheetname'] = sheet_name
file_name = os.path.basename(filepath)
df['sourcefilename'] = file_name
if "Project ID" in df.columns and "Status" in df.columns:
print('')
*else:
dfErrorList['sheetname'] = df['sheetname'] # adds `sheet_name` into the column
dfErrorList['sourcefilename'] = df['sourcefilename']
continue
list_of_dferror.append((dfErrorList))
df['Status'].fillna('', inplace=True)
df['Added by'].fillna('', inplace=True)
list_of_dfs.append(df)
# # Combine all DataFrames into one
data = pd.concat(list_of_dfs, ignore_index=True)
dataErrors = pd.concat(list_of_dferror, ignore_index=True)
dataErrors.to_excel(r'error.xlsx', index=False)
# data.to_excel("total_countries.xlsx", index=None)

Pandas - how to create a new dataframe from the columns and values of an old dataframe?

I have a CSV file in which I have tweets with the following column names: File, User, Date 1, month, day, Tweet, Permalink, Retweet count, Likes count, Tweet value, Language, Location.
I want to create a new data frame with tweets from certain cities. I can do it but only for the last city on my list (Girona). So it doesn't add all the rows. Here is my code:
import pandas as pd
import os
path_to_file = "populismo_merge.csv"
df = pd.read_csv(path_to_file, encoding='utf-8', sep=',')
values = df[df['Location'].str.contains("A Coruña",na=False)]
values = df[df['Location'].str.contains("Alava",na=False)]
values = df[df['Location'].str.contains("Albacete",na=False)]
values = df[df['Location'].str.contains("Alicante",na=False)]
values = df[df['Location'].str.contains("Almería",na=False)]
values = df[df['Location'].str.contains("Asturias",na=False)]
values = df[df['Location'].str.contains("Avila",na=False)]
values = df[df['Location'].str.contains("Badajoz",na=False)]
values = df[df['Location'].str.contains("Barcelona",na=False)]
values = df[df['Location'].str.contains("Burgos",na=False)]
values = df[df['Location'].str.contains("Cáceres",na=False)]
values = df[df['Location'].str.contains("Cádiz",na=False)]
values = df[df['Location'].str.contains("Cantabria",na=False)]
values = df[df['Location'].str.contains("Castellón",na=False)]
values = df[df['Location'].str.contains("Ceuta",na=False)]
values = df[df['Location'].str.contains("Ciudad Real",na=False)]
values = df[df['Location'].str.contains("Córdoba",na=False)]
values = df[df['Location'].str.contains("Cuenca",na=False)]
values = df[df['Location'].str.contains("Formentera",na=False)]
values = df[df['Location'].str.contains("Girona",na=False)]
values.to_csv(r'populismo_ciudad.csv', index = False)
Many thanks!!!
Use isin:
import pandas as pd
import os
path_to_file = "populismo_merge.csv"
df = pd.read_csv(path_to_file, encoding='utf-8', sep=',')
cities = ['A Coruña', 'Alava', 'Albacete', 'Alicante', 'Almería',
'Asturias', 'Avila', 'Badajoz', 'Barcelona', 'Burgos',
'Cáceres', 'Cádiz', 'Cantabria', 'Castellón', 'Ceuta',
'Ciudad Real', 'Córdoba', 'Cuenca', 'Formentera', 'Girona']
values = df[df['Location'].isin(cities)]
values.to_csv(r'populismo_ciudad.csv', index = False)
You are overwriting the values variable each time. A more concise answer would be along the lines of.
values= df[df['LocationName'].isin(["A Coruña", "Alava", ......)]

pandas read_csv with data and headers in alternate columns

I have a generated CSV file that
doesn't have headers
has header and data occur alternately in every row (headers do not change from row to row).
E.g.:
imageId,0,feat1,30,feat2,34,feat,90
imageId,1,feat1,0,feat2,4,feat,89
imageId,2,feat1,3,feat2,3,feat,80
IMO, this format is redundant and cumbersome (I don't see why anyone would generate files in this format). The saner/normal CSV of the same data (which I can directly read using pd.read_csv():
imageId,feat1,feat2,feat
0,30,34,90
1,0,4,89
2,3,3,80
My question is, how do I read the original data into a pd dataframe? For now, I do a read_csv and then drop all alternate columns:
df=pd.read_csv(file, header=None)
df=df[range(1, len(df.columns), 2]
Problem with this is I don't get the headers, unless I make it a point to specify them.
Is there a simpler way of telling pandas that the format has data and headers in every row?
Select columns by indexing in DataFrame.iloc and set new columns names with get first row and pair values (assuming pair columns have same values like in sample data):
#default headers
df = pd.read_csv(file, header=None)
df1 = df.iloc[:, 1::2]
df1.columns = df.iloc[0, ::2].tolist()
print (df1)
imageId feat1 feat2 feat
0 0 30 34 90
1 1 0 4 89
2 2 3 3 80
I didn't measure but I would expect that it could be a problem to read the entire file (redundant headers and actual data) before filtering for the interesting stuff. So I tried to exploit the optional parameters nrows and usecols to (hopefully) limit the amount of memory needed to process the CSV input file.
# --- Utilities for generating test data ---
import random as rd
def write_csv(file, line_count=100):
with open(file, 'w') as f:
r = lambda : rd.randrange(100);
for i in range(line_count):
line = f"imageId,{i},feat1,{r()},feat2,{r()},feat,{r()}\n"
f.write(line)
file = 'text.csv'
# Generate a small CSV test file
write_csv(file, 10)
# --- Actual answer ---
import pandas as pd
# Read columns of the first row
dfi = pd.read_csv(file, header=None, nrows=1)
ncols = dfi.size
# Read data columns
dfd = pd.read_csv(file, header=None, usecols=range(1, ncols, 2))
dfd.columns = dfi.iloc[0, ::2].to_list()
print(dfd)

How to create a .kml file from a dataframe?

I need to create a .kml file from a dataframe with more than 800 districts.
This is what I have done so far:
1) Read a .csv file (FIG 1) using PANDAS
2) Creat a new dataframe by choosing only the first 3 columns (longitude, latitude, altitude)
3) Create a list of tuples from the dataframe
4) Create a .kml file and do some styling (colors)
All this proceure work great ONLY when there is 1 district. Now I need to do the same but with more than 800 districts. In FIG 2, it is shown an example with 2 districts (ACTONO and AILSACRAIGO).
When converting the dataframe to a list of tuples, how can make "python" know that there are many districts?
I believe these lines have to be improved:
a) Here I will need a list of tuples (one for each district)
#Converting the dataframe to a list of tuples
tuples = [tuple(x) for x in df_modify.values]
b) And here, "outboundaryies" will have to change for each of the tuples
pol = kml.newpolygon(name= 'ACTONO', description= 'Acton County',
outerboundaryis=tuples, extrude=extrude, altitudemode=altitudemode)
This is all the code:
CODE FOR .CSV WITH 1 DISTRICT
import pandas
import simplekml
kml = simplekml.Kml()
#Using PANDAS to read .csv and chosing the first 3 columns
df = pandas.read_csv('C:\\Users\\disa_ONTshp.csv')
df_modify=df.iloc[:, [0,1,2]]
#Converting the dataframe to a list of tuples
tuples = [tuple(x) for x in df_modify.values]
#Creating a .kml file
extrude=1
altitudemode = simplekml.AltitudeMode.relativetoground
pol = kml.newpolygon(name= 'ACTONO', description= 'Acton County',
outerboundaryis=tuples, extrude=extrude, altitudemode=altitudemode)
#Styling colors
pol.style.linestyle.color = simplekml.Color.green
pol.style.linestyle.width = 5
pol.style.polystyle.color = simplekml.Color.changealphaint(100,
simplekml.Color.green)
#Saving
kml.save("Polygon Styling.kml")
FIG 1 (1 DISTRICT)
FIG 2 (2 DISTRICTS)
This is the answer. What I needed was a dictionary of dataframes.
import pandas as pd
import simplekml
import pprint
import numpy as np
kml = simplekml.Kml()
###LOADING THE .csv FILE WITH ALL THE COORDINATES (USING QGIS)
df = pd.read_csv('C:\\Users\\file.csv')
###ADDING A COLUMN "altitude" WITH RANDOM VALUES FROM 200 TO 2000
df['altitude']=df.groupby('name').name.transform(lambda x: np.random.randint(200,2000))
###CALLING THE COLUMNS OF INTEREST
df=df[['longitude', 'latitude', 'altitude', 'name']]
###CREATING A DICTIONARY OF DATAFRAMES (ONE FOR EACH DISTRICT)
dict_dataframes=dict(tuple(df.groupby('name')))
###CALLING EACH DATAFRAME FROM THE DICTIONARY
for name, df in dict_dataframes.items():
###CREATING A LIST OF TUPLES WITH THE COLUMNS OF THE DATAFRAME
tuples = [tuple(x) for x in df.values]
extrude=1
altitudemode = simplekml.AltitudeMode.relativetoground
pol = kml.newpolygon(name = name, description="District of " + name, outerboundaryis=tuples, extrude=extrude, altitudemode=altitudemode)
pol.style.linestyle.color = simplekml.Color.honeydew
pol.style.linestyle.width = 3
pol.style.polystyle.color = simplekml.Color.changealphaint(100, simplekml.Color.navy)
###SAVING THE FILE
kml.save('C:\\Users\\3d_file.kml')

Parsing data with variable number of columns

I have several .txt files with 140k+ lines each. They all have three types of data, which are a mix of string and floats:
- 7 col
- 14 col
- 18 col
What is the best and fastest way to parse such data?
I tried to use numpy.genfromtxt with usecols=np.arange(0,7) but obviously cuts out the 14 and 18 col data.
# for 7 col data
load = np.genfromtxt(filename, dtype=None, names=('day', 'tod', 'condition', 'code', 'type', 'state', 'timing'), usecols=np.arange(0,7))
I would like to parse the data as efficiently as possible.
The solution is rather simple and intuitive. We check if the number of columns in each row is equal to the specified number and append it to an array. For better analysis/modification of our data, we can then convert it to a Pandas DataFrame or Numpy as desired, below I show conversion to DataFrame. The number of columns in my dataset are 7, 14 and 18. I want my data labeled, so I can use Pandas' columns to label from an array.
import pandas as pd
filename = "textfile.txt"
labels_array1 = [] # 7 labels
labels_array2 = [] # 14 labels
labels_array3 = [] # 18 labels
with open(filename, "r") as f:
lines = f.readlines()
for line in lines:
num_items = len(line.split())
if num_items==7:
array1.append(line.rstrip())
elif num_items==14:
array2.append(line.rstrip())
elif num_items==18:
array3.append(line.rstrip())
else:
print("Detected a line with different columns.", num_items)
df1 = pd.DataFrame([sub.split() for sub in array1], columns=labels_array1)
df2 = pd.DataFrame([sub.split() for sub in array2], columns=labels_array2)
df3 = pd.DataFrame([sub.split() for sub in array3], columns=labels_array3)

Resources