Store the value from pandas dataframe without index or header - python-3.x

I am trying to get values from a CSV file using Python and pandas. To avoid the index and headers I am using .values with iloc, but the output is stored in [] brackets. I don't want the brackets, just the value. I don't want to print it; I want to use it for other operations.
My code is :
import pandas as pd
ctr_x = []
ctr_y = []
tl_list = []
br_list = []
object_list = []
img = None
obj = 'red_hat'
df = pd.read_csv('ring_1_05_sam.csv')
ctr_x = df.iloc[10:12, 0:1].values #to avoid headers and index
ctr_y = df.iloc[10:12, 1:2].values #to avoid headers and index
If I print ctr_x and ctr_y to check that the correct values are recorded,
the output I get is:
[[1536.25]
[1536.5 ]]
[[895.25]
[896. ]]
So in short, I am getting the correct values, but I don't want the brackets. Can anyone suggest an alternative to my method? Note: I don't want to print the values but store them (without index and headers) for further operations.

When you use a column slice, pandas returns a DataFrame. Try
type(df.iloc[10:12, 0:1])
pandas.core.frame.DataFrame
This in turn returns a 2-D array when you use
df.iloc[10:12, 0:1].values
If you want a one-dimensional array, you can use integer indexing, which returns a Series:
type(df.iloc[10:12, 0])
pandas.core.series.Series
And its .values gives a one-dimensional array:
df.iloc[10:12, 0].values
So use
ctr_x = df.iloc[10:12, 0].values
ctr_y = df.iloc[10:12, 1].values
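As a minimal sketch of the difference (using a small made-up frame in place of ring_1_05_sam.csv):

import pandas as pd

# Hypothetical stand-in for the CSV data in the question
df = pd.DataFrame({'x': [1536.25, 1536.5], 'y': [895.25, 896.0]})

print(df.iloc[0:2, 0:1].values.shape)  # (2, 1) -- 2-D array from a DataFrame slice
print(df.iloc[0:2, 0].values.shape)    # (2,)   -- 1-D array from a Series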

Related

Fill csv data lists with for loop

I am manipulating .csv files. I have to loop through each column of numeric data in the file and append each column to a different list. The code I have is the following:
import csv

salto_linea = "\n"
csv_file = "02_CSV_data1.csv"

with open(csv_file, 'r') as csv_doc:
    doc_reader = csv.reader(csv_doc, delimiter=",")
    mpg = []
    cylinders = []
    displacement = []
    horsepower = []
    weight = []
    acceleration = []
    year = []
    origin = []
    lt = [mpg, cylinders, displacement, horsepower,
          weight, acceleration, year, origin]
    for i, ln in zip(range(0, 9), lt):
        print(f"{i} -> {ln}")
        for row in doc_reader:
            y = row[i]
            ln.append(y)
In the outer loop, I try to have range() serve as an index so that the nested loop walks down the first column (the first element of each row in the CSV) and feeds it into the first list in 'lt'. The problem is that I do fill that first column, but then range() simply advances in the outer loop and the nesting ends; I expected that with i = 1 the nested loop would run again and traverse the next column, and so on. I also tried a while loop with a counter that increments each iteration to serve as the index, but that didn't work either.
How can I fill the sublists in 'lt' with the data that is inside the CSV file?
Without seeing the contents of the CSV file itself, the best way of reading the data into a table is with the pandas module, which can be done in one line of code.
import pandas as pd
df = pd.read_csv('02_CSV_data1.csv')
This reads all the data into a DataFrame, and you can work with this.
Alternatively, amend the for loop like this:
for row in doc_reader:
    for i, ln in enumerate(lt):
        ln.append(row[i])
For bigger data, I would prefer pandas, which has vectorised methods.
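As a hedged sketch of the pandas route (assuming the file has no header row and the eight columns appear in the order listed in the question -- both assumptions, since the file contents aren't shown):

import pandas as pd

# Column names are assumed; adjust to the real file.
cols = ['mpg', 'cylinders', 'displacement', 'horsepower',
        'weight', 'acceleration', 'year', 'origin']
df = pd.read_csv('02_CSV_data1.csv', header=None, names=cols)

# One Python list per column, mirroring 'lt' from the question
lt = [df[c].tolist() for c in cols]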

Can we get columns names sorted in the order of their tf-idf values (if exists) for each document?

I'm using sklearn's TfidfVectorizer. I'm trying to get the column names in a list, in decreasing order of their tf-idf values, for each document. So basically, if a document contains only stop words, then we don't need any column names.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

msg = ["My name is Venkatesh",
       "Trying to get the significant words for each vector",
       "I want to get the list of words name in the decresasing order of their tf-idf values for each vector",
       "is to my"]
stopwords = ['is', 'to', 'my', 'the', 'for', 'in', 'of', 'i', 'their']
tfidf_vect = TfidfVectorizer(stop_words=stopwords)
tfidf_matrix = tfidf_vect.fit_transform(msg)
pd.DataFrame(tfidf_matrix.toarray(),
             columns=tfidf_vect.get_feature_names_out())
I want to generate a column with the list of word names in decreasing order of their tf-idf values.
So the column would be like this:
['venkatesh','name']
['significant','trying','vector','words','each','get']
['decreasing','idf','list','order','tf','values','want','each','get','name','vector','words']
[] # empty list Since the document consists only stopwords
Above is the primary result I'm looking for. It would be great if we could also get, for each document, a sorted dict with the tf-idf values as keys and the list of words associated with that tf-idf value as values.
So the result would be like the below:
{'0.785288':['venkatesh'],'0.619130':['name']}
{'0.47212':['significant','trying'],'0.372225':['vector','words','each','get']}
{'0.314534':['decreasing','idf','list','order','tf','values','want'],'0.247983':['each','get','name','vector','words']}
{} # empty dict Since the document consists only stopwords
I think this code does what you want and avoids using pandas:
from itertools import groupby

sort_func = lambda v: v[0]  # sort by the first value in the tuple

all_dicts = []
for row in tfidf_matrix.toarray():
    sorted_vals = sorted(zip(row, tfidf_vect.get_feature_names_out()), key=sort_func, reverse=True)
    all_dicts.append({val: [g[1] for g in group]
                      for val, group in groupby(sorted_vals, key=sort_func)
                      if val != 0})
You could make it even less readable and put it all in a single comprehension! :-)
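For instance, continuing from the code above, checking the second document should give something along the lines of the expected output (values shown approximately):

print(all_dicts[1])
# roughly {0.47212: ['significant', 'trying'], 0.372225: ['each', 'get', 'vector', 'words']}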
The combination of the following function and the to_dict() method on the DataFrame can give you the desired output.
def ret_dict(_dict):
    # Get a list of unique values
    list_keys = list(set(_dict.values()))
    processed_dict = {key: [] for key in list_keys}
    # Prepare dictionary
    for key, value in _dict.items():
        processed_dict[value].append(str(key))
    # Sort the keys (as you want)
    sorted_keys = sorted(processed_dict, key=lambda x: x, reverse=True)
    sorted_keys = [keys for keys in sorted_keys if keys > 0]
    # Return the dictionary with sorted keys
    sorted_dict = {k: processed_dict[k] for k in sorted_keys}
    return sorted_dict
Then:
res = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vect.get_feature_names_out())
list_dict = res.to_dict('records')
processed_list = []
for _dict in list_dict:
    processed_list.append(ret_dict(_dict))
processed_list contains the output you desire. For instance: processed_list[1] would output:
{0.47212002654617047: ['significant', 'trying'], 0.3722248517590162: ['each', 'get', 'vector', 'words']}

How to create a dataframe from extracted hashtags?

I have used the below code to extract hashtags from tweets.
def find_tags(row_string):
    tags = [x for x in row_string if x.startswith('#')]
    return tags

df['split'] = df['text'].str.split(' ')
df['hashtags'] = df['split'].apply(lambda row: find_tags(row))
df['hashtags'] = df['hashtags'].apply(lambda x: str(x).replace('\\n', ',').replace('\\', '').replace("'", ""))
df.drop('split', axis=1, inplace=True)
df
However, when I count them using the below code, I get output that counts each character.
from collections import Counter
d = Counter(df.hashtags.sum())
data = pd.DataFrame([d]).T
data
I think the problem lies with the code that I am using to extract the hashtags, but I don't know how to solve this issue.
Change find_tags so the replace happens inside the list comprehension together with split, and for counting values use Series.explode with Series.value_counts:
def find_tags(row_string):
    return [x.replace('\\n', ',').replace('\\', '').replace("'", "")
            for x in row_string.split() if x.startswith('#')]
df['hashtags'] = df['text'].apply(find_tags)
and then:
data = df.hashtags.explode().value_counts().rename_axis('val').reset_index(name='count')
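A minimal usage sketch with made-up tweets (the 'text' column name follows the question; the data is hypothetical):

import pandas as pd

def find_tags(row_string):
    return [x.replace('\\n', ',').replace('\\', '').replace("'", "")
            for x in row_string.split() if x.startswith('#')]

df = pd.DataFrame({'text': ['hello #world #python', 'more #python here']})
df['hashtags'] = df['text'].apply(find_tags)
data = df.hashtags.explode().value_counts().rename_axis('val').reset_index(name='count')
print(data)
#        val  count
# 0  #python      2
# 1   #world      1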

Only import ascending values from CSV to List - Python

I'm trying to only import part of a CSV file to a list.
In short, the CSV I receive contains two columns [depth and speed]. Depth always starts at zero, gets larger, and then goes back down to zero again.
I would like to add the first part of the CSV (depth 0-13+) to one list, and the second part of the CSV (13-0) to another list.
I assume a for loop would be the way to go, but I don't know how to check each row for ascending/descending numbers.
pullData = open("svp3.csv", "r").read()
dataArray = pullData.split('\n')
depthArrayY = []
speedArrayX = []
depthArrayLength = len(depthArrayY)

for eachLine in dataArray:
    if len(eachLine) > 1:
        x, y = eachLine.split(',')
        speedArrayX.append(round(float(x), 2))
        depthArrayY.append(round(float(y), 2))
I'd suggest using pandas; it will allow you to do much more when you need to work with imported data.
import pandas as pd
df = pd.read_csv('svp3.csv')
tmp = df[df.depth <= df.depth.shift(-1)].values
depth_increase = tmp[:,0]
speed_while_depth_increase = tmp[:,1]
tmp = df[df.depth > df.depth.shift(-1)].values
depth_decrease = tmp[:,0]
speed_while_depth_decrease = tmp[:,1]
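A quick, hypothetical check of the shift-based split (a tiny frame standing in for svp3.csv, with the assumed column order of depth then speed):

import pandas as pd

df = pd.DataFrame({'depth': [0, 5, 13, 7, 0],
                   'speed': [1.0, 1.2, 1.5, 1.3, 1.1]})

up = df[df.depth <= df.depth.shift(-1)]
down = df[df.depth > df.depth.shift(-1)]
print(up.depth.tolist())    # [0, 5] -- rows where the next depth is deeper
print(down.depth.tolist())  # [13, 7] -- the final row lands in neither list, since comparing with NaN is False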
I assumed that your CSV has the depth column first, then the speed column.
The depth column has values from 0 up to a certain max value, say 14, then back down from 13 to 0: depth column -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
and I populated the speed column with some random values.
The following code makes use of the pandas library and splits the depth column into two lists of ascending and descending values, using the simple logic of storing the current max value to determine where the ascending part of the column ends.
import pandas as pd

data = pd.read_csv('svp3.csv')
max_val = -10000
depthArrayAscendingY = []
speedArrayX = []
depthArrayDescendingY = []

for a in data.values:
    if a[0] > max_val:
        depthArrayAscendingY.append(a[0])
        speedArrayX.append(a[1])
        max_val = a[0]
    else:
        depthArrayDescendingY.append(a[0])
        speedArrayX.append(a[1])
Baleato's answer to this question is more efficient and cleaner than this one; you should definitely check their answer.

Pandas dataframe column float inside string (i.e. "float") to int

I'm trying to clean some data in a pandas df and I want the 'volume' column to go from a float to an int.
EDIT: The main issue was that the dtype of the 'volume' values I was looking at was actually str. So the column first needed to be converted to float before being changed to int.
I deleted the two other solutions I was considering and left the one I used. The top snippet is the one with the errors, and the bottom one is the solution.
import pandas as pd
import numpy as np

# Call the df
t_df = pd.DataFrame(client.get_info())
# isolate only the 'symbol' column in t_df
tickers = t_df.loc[:, ['symbol']]

def tick_data(tickers):
    for i in tickers:
        tick_df = pd.DataFrame(client.get_ticker())
        tick = tick_df.loc[:, ['symbol', 'volume']]
        tick.iloc[:,['volume']].astype(int)
        if tick['volume'].dtype != np.number:
            print('yes')
        else:
            print('no')
    return tick
Below is the revised code:
import pandas as pd

# Call the df
def ticker():
    t_df = pd.DataFrame(client.get_info())
    # isolate only the 'symbol' column in t_df
    tickers = t_df.loc[:, ['symbol']]
    for i in tickers:
        # pulls out market data for each symbol
        tickers = pd.DataFrame(client.get_ticker())
        # isolates the symbol and volume
        tickers = tickers.loc[:, ['symbol', 'volume']]
        # floats volume
        tickers['volume'] = tickers.loc[:, ['volume']].astype(float)
        # volume to int
        tickers['volume'] = tickers.loc[:, ['volume']].astype(int)
        # keeps only symbols with volume >= 20,000, returns only symbol
        tickers = tickers.loc[tickers['volume'] >= 20000, 'symbol']
    return tickers
You have a few issues here.
In your first example, iloc only accepts integer locations for the rows and columns in the DataFrame, which is what generates your error. That is,
tick.iloc[:,['volume']].astype(int)
doesn't work. If you want label-based indexing, use .loc:
tick.loc[:,['volume']].astype(int)
Alternatively, use bracket-based indexing, which allows you to take a whole column directly without using slice syntax (:) on the rows:
tick['volume'].astype(int)
Next, astype(int) returns a new value; it does not modify in place. So what you want is
tick['volume'] = tick['volume'].astype(int)
As for your dtype check: you don't want == np.number, but you don't want an is check either, since that only returns True if the dtype is literally np.number, not a subclass like np.int64. Use np.issubdtype or pd.api.types.is_numeric_dtype, i.e.:
if np.issubdtype(tick['volume'].dtype, np.number):
or:
if pd.api.types.is_numeric_dtype(tick['volume'].dtype):
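A tiny sketch of the difference between the checks (hypothetical data):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])  # dtype is int64

print(s.dtype == np.number)                    # False -- int64 is not literally np.number
print(np.issubdtype(s.dtype, np.number))       # True  -- int64 is a subdtype of np.number
print(pd.api.types.is_numeric_dtype(s.dtype))  # True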
