How to append a column to an array - python-3.x

I'm trying to add a new column from a CSV to a table built from the same CSV. I'm trying to use append, but it isn't working; it says "'numpy.ndarray' object has no attribute 'append'".
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

path = r"D:\python projects\volcano_data_2010.csv"
data = pd.read_csv(path)
data_used = data.iloc[:, [1, 2, 8, 9]].values

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
data_used = imp.fit_transform(data_used)  # so far ok
data_used = data_used.append([data.iloc[:, 7].values])  # AttributeError raised here
print(data_used)

The append method only exists on Python lists. Since your data is a NumPy array, you should use np.append, but note that without an axis argument it flattens both inputs and appends to the end:
a1 = np.append(data_used, data.iloc[:,7])
If you want to append it as a column, use the np.column_stack function:
a2 = np.column_stack((data_used, data.iloc[:,7]))
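A quick shape check with toy data makes the difference concrete (demo and extra are made-up stand-ins for data_used and the extra column):
import numpy as np
demo = np.ones((3, 4))             # stands in for data_used
extra = np.array([7.0, 8.0, 9.0])  # stands in for data.iloc[:, 7]
print(np.append(demo, extra).shape)          # (15,) - flattened 1-D result
print(np.column_stack((demo, extra)).shape)  # (3, 5) - column added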

Related

Dask: Add list to a column value like pandas does

I am a bit new to Dask. I have a large CSV file and a large list; the number of rows in the CSV equals the length of the list. I am trying to create a new column in the Dask dataframe from that list. In pandas it is pretty straightforward, but in Dask I am having a hard time creating the new column. I am avoiding pandas because my data is 15GB+.
Please see my attempts below.
CSV data
name,text,address
john,some text here,MD
tim,some text here too,WA
Code tried
import dask.dataframe as dd
import numpy as np
ls = ['one','two']
ddf = dd.read_csv('../data/test.csv')
ddf.head()
Try #1:
ddf['new'] = ls # TypeError: Column assignment doesn't support type list
Try #2: What should be passed here for condlist?
ddf['new'] = np.select(choicelist=ls) # TypeError: _select_dispatcher() missing 1 required positional argument: 'condlist'
Looking for this output:
name text address new
0 john some text here MD one
1 tim some text here too WA two
Try creating a dask dataframe (or dask array) and then assigning it as the column, like this:
# original suggestion:
# ls = dd.from_array(np.array(['one', 'two']))
# ddf['new'] = ls
# as tested by the OP, using a dask array instead:
import dask.array as da
ls = da.array(['one', 'two', 'three'])
ddf['new'] = ls
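One caveat: assigning a dask array as a column only lines up cleanly when the array's chunks match the dataframe's partition lengths. A minimal self-contained sketch, using a small in-memory frame as a stand-in for the 15GB CSV (the names pdf and new_col are made up for illustration):
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

pdf = pd.DataFrame({'name': ['john', 'tim'],
                    'text': ['some text here', 'some text here too'],
                    'address': ['MD', 'WA']})
ddf = dd.from_pandas(pdf, npartitions=1)
# chunk size (2) matches the single partition's length
new_col = da.from_array(np.array(['one', 'two']), chunks=2)
ddf['new'] = new_col
print(ddf.compute())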

Pandas: Create a new column based on another column which is a list of objects

I am trying to create/add a new column to the following dataframe (df_new):
I want this new column (df['category']) to be fed from df['tags'].
The tags column is a list of objects, and the value I want to retrieve is the category; if there is no category, I want to set it to unknown.
This is a sample of my JSON file
{"submissionTime":"2019-02-25T09:26:00","b_data":{"bName":"Masato","b_Acc":[{"id":0,"transactions":[{"date":"2019-12-19","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-12-03","text":"LINE FEE","amount":-460.21,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-12-31","text":"INTEREST","amount":-871.62,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-12-31","text":"LOAN SERVICE FEE","amount":-120,"type":"Loan Related Fees","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-12-18","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-12-02","text":"LINE FEE","amount":-498.34,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-11-29","text":"INTEREST","amount":-794.4,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-11-19","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-11-01","text":"LINE FEE","amount":-484.87,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-10-31","text":"INTEREST","amount":-882.04,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-10-21","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-10-01","text":"LINE FEE","amount":-503.59,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-09-30","text":"INTEREST","amount":-916.98,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-09-30","text":"LOAN SERVICE FEE","amount":-120,"type":"Loan Related Fees","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-09-19","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-09-02","text":"LINE FEE","amount":-489.65,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-08-30","text":"INTEREST","amount":-892.13,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]}]}]}}
and this is how far I have gotten:
import json
import numpy as np
import pandas as pd

with open('question.json') as json_data:
    d = json.load(json_data)

df = pd.json_normalize(d['b_data']['b_Acc'])
frames = []
# https://pandas.pydata.org/pandas-docs/stable/merging.html
for index, row in df.iterrows():
    frames = frames + row['transactions']
df_new = pd.DataFrame(frames)
df['category'] = df_new['tags'].apply(pd.Series)[0]
This could potentially work if category were always the first element of that array; however, in row 0 the first element is institution, and in the second row it is creditDebit (which I would like to be unknown, since there is no category).
This will do what you did in the for loop:
s = pd.DataFrame(pd.DataFrame(df.transactions.tolist()).stack().str['tags'].tolist())
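That rebuilds the transactions frame, but the category lookup itself still needs a fallback for rows without a category tag. A minimal sketch of that step, assuming df_new is the frame built in the question (extract_category is a hypothetical helper, not a pandas API):
def extract_category(tags):
    # return the 'category' value from the first tag that has one,
    # or 'unknown' when no tag carries a category key
    for tag in tags:
        if 'category' in tag:
            return tag['category']
    return 'unknown'

df_new['category'] = df_new['tags'].apply(extract_category)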

How can I do something similar to VLOOKUP in Excel with urlparse?

I need to compare two sets of data from CSVs: one (csv1) with a column 'listing_url', the other (csv2) with columns 'parsed_url' and 'url_code'. I would like to use the result of urlparse on csv1 (specifically the netloc) to compare against csv2's 'parsed_url' and output the matching value from 'url_code' to a CSV.
from urllib.parse import urlparse
import re
import pandas as pd

scr = pd.read_csv('csv2', squeeze=True, usecols=['parsed_url', 'url_code'])[['parsed_url', 'url_code']]
data = pd.read_csv('csv1')
L = data.values.T[0].tolist()
T = pd.Series([scr])
for i in L:
    n = urlparse(i)
    nf = pd.Series([(n.netloc)])
I'm stuck trying to convert the data into objects I can use map with, if that's even the best approach; I don't know.
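A VLOOKUP-like lookup in pandas is a left merge: parse the netloc into its own column, then merge on it. A sketch under the column-name assumptions above (the 'netloc' column and the 'matched.csv' output name are made up for illustration):
from urllib.parse import urlparse
import pandas as pd

data = pd.read_csv('csv1')
scr = pd.read_csv('csv2', usecols=['parsed_url', 'url_code'])
# the netloc of each listing_url becomes the lookup key
data['netloc'] = data['listing_url'].apply(lambda u: urlparse(u).netloc)
merged = data.merge(scr, left_on='netloc', right_on='parsed_url', how='left')
merged[['listing_url', 'url_code']].to_csv('matched.csv', index=False)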

Changing specific strings into floats in multidimensional array

I saved all the data into an array of strings, but I want to convert the strings in that array to floats without changing the header (the first row) or the first column of the array. How should I change my code?
import numpy as np
import csv

with open('MI_5MINS_INDEX.csv', encoding="utf-8") as f:
    data = list(csv.reader(f))
for line in data:
    line.remove('')
ary = np.array(data)
ary.astype(float)  # returns a new array, and fails on the non-numeric header
Use pandas read_csv() and it will work as you wish: the header row is parsed separately, and the numeric columns are converted automatically.
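A minimal sketch of that idea, assuming the first row is the header and the first column holds row labels (the index_col=0 part is an assumption about the file's layout):
import pandas as pd

df = pd.read_csv('MI_5MINS_INDEX.csv', encoding='utf-8', index_col=0)
values = df.to_numpy(dtype=float)  # only the numeric body is converted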

Pandas dataframe column float inside string (i.e. "float") to int

I'm trying to clean some data in a pandas df and I want the 'volume' column to go from a float to an int.
EDIT: The main issue was that the dtype of the float variable I was looking at was actually str. So it first needed to be cast to float before being converted to int.
I deleted the two other solutions I was considering and left the one I used. The top block is the one with the errors, and the bottom one is the solution.
import pandas as pd
import numpy as np

# Call the df
t_df = pd.DataFrame(client.get_info())
# isolate only the 'symbol' column in t_df
tickers = t_df.loc[:, ['symbol']]

def tick_data(tickers):
    for i in tickers:
        tick_df = pd.DataFrame(client.get_ticker())
        tick = tick_df.loc[:, ['symbol', 'volume']]
        tick.iloc[:, ['volume']].astype(int)  # raises: iloc needs integer positions
        if tick['volume'].dtype != np.number:
            print('yes')
        else:
            print('no')
    return tick
Below is the revised code:
import pandas as pd

def ticker():
    # Call the df
    t_df = pd.DataFrame(client.get_info())
    # isolate only the 'symbol' column in t_df
    tickers = t_df.loc[:, ['symbol']]
    for i in tickers:
        # pulls out market data for each symbol
        tickers = pd.DataFrame(client.get_ticker())
        # isolates the symbol and volume
        tickers = tickers.loc[:, ['symbol', 'volume']]
        # casts volume to float
        tickers['volume'] = tickers.loc[:, ['volume']].astype(float)
        # then volume to int
        tickers['volume'] = tickers.loc[:, ['volume']].astype(int)
        # keeps only symbols with volume >= 20,000, returns only symbol
        tickers = tickers.loc[tickers['volume'] >= 20000, 'symbol']
    return tickers
You have a few issues here.
In your first example, iloc only accepts integer locations for the rows and columns in the DataFrame, which is generating your error. I.e.
tick.iloc[:,['volume']].astype(int)
doesn't work. If you want label-based indexing, use .loc:
tick.loc[:,['volume']].astype(int)
Alternatively, use bracket-based indexing, which allows you to take a whole column directly without using slice syntax (:) on the rows:
tick['volume'].astype(int)
Next, astype(int) returns a new Series; it does not modify in place. So what you want is:
tick['volume'] = tick['volume'].astype(int)
As for your numeric-dtype check, you don't want to compare against np.number with == (or !=), and `is` won't work either: that only returns True if the dtype is literally np.number, not a subclass like np.int64. Use np.issubdtype or pd.api.types.is_numeric_dtype, i.e.:
if np.issubdtype(tick['volume'].dtype, np.number):
or:
if pd.api.types.is_numeric_dtype(tick['volume'].dtype):
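As an aside, pd.to_numeric compresses the str -> float -> int chain into one step and lets you choose what happens to unparseable rows. A small sketch with made-up data:
import pandas as pd

tick = pd.DataFrame({'symbol': ['AAA', 'BBB'], 'volume': ['123.0', '456.7']})
# coerce turns bad strings into NaN; fill them before the int cast,
# since astype(int) raises on NaN
tick['volume'] = pd.to_numeric(tick['volume'], errors='coerce').fillna(0).astype(int)
print(tick)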
