I am a bit new to Dask. I have a large csv file and a large list; the number of rows in the csv equals the length of the list. I am trying to create a new column in a Dask dataframe from that list. In pandas this is pretty straightforward, but in Dask I am having a hard time creating the new column. I am avoiding pandas because my data is 15GB+.
Please see my attempts below.
csv Data
name,text,address
john,some text here,MD
tim,some text here too,WA
Code tried
import dask.dataframe as dd
import numpy as np
ls = ['one','two']
ddf = dd.read_csv('../data/test.csv')
ddf.head()
Try #1:
ddf['new'] = ls # TypeError: Column assignment doesn't support type list
Try #2: What should be passed here for condlist?
ddf['new'] = np.select(choicelist=ls) # TypeError: _select_dispatcher() missing 1 required positional argument: 'condlist'
Looking for this output:
name text address new
0 john some text here MD one
1 tim some text here too WA two
Try creating a dask array and then assigning it, like this -
#ls = dd.from_array(np.array(['one','two']))
#ddf['new'] = ls
# As tested by OP
import dask.array as da
ls = da.array(['one','two'])
ddf['new'] = ls
I am trying to create/add a new column in the following dataframe (df_new):
I want this new column (df['category']) to be fed from df['tags'].
The tags column is a list of objects; the value I want to retrieve is the category, and if there is no category I want to set it to unknown.
This is a sample of my JSON file
{"submissionTime":"2019-02-25T09:26:00","b_data":{"bName":"Masato","b_Acc":[{"id":0,"transactions":[{"date":"2019-12-19","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-12-03","text":"LINE FEE","amount":-460.21,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-12-31","text":"INTEREST","amount":-871.62,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-12-31","text":"LOAN SERVICE FEE","amount":-120,"type":"Loan Related Fees","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-12-18","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-12-02","text":"LINE FEE","amount":-498.34,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-11-29","text":"INTEREST","amount":-794.4,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-11-19","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-11-01","text":"LINE FEE","amount":-484.87,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-10-31","text":"INTEREST","amount":-882.04,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-10-21","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-10-01","text":"LINE FEE","amount":-503.59,"type":"Overdrawn 
Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-09-30","text":"INTEREST","amount":-916.98,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-09-30","text":"LOAN SERVICE FEE","amount":-120,"type":"Loan Related Fees","tags":[{"category":"Fees"},{"creditDebit":"debit"}]},{"date":"2019-09-19","text":"PERIODICAL PAYMENT","amount":3397,"type":"","tags":[{"institution":"University of MC"},{"lenderType":"private"},{"category":"birdy"},{"creditDebit":"credit"}]},{"date":"2019-09-02","text":"LINE FEE","amount":-489.65,"type":"Overdrawn Fees","tags":[{"category":"Overdrawn"},{"creditDebit":"debit"}]},{"date":"2019-08-30","text":"INTEREST","amount":-892.13,"type":"Interest Charge","tags":[{"category":"Fees"},{"creditDebit":"debit"}]}]}]}}
and this is how I have been able to do so far:
import json
import numpy as np
import pandas as pd
with open('question.json') as json_data:
    d = json.load(json_data)

df = pd.json_normalize(d['b_data']['b_Acc'])
frames = []
#https://pandas.pydata.org/pandas-docs/stable/merging.html
for index, row in df.iterrows():
    frames = frames + row['transactions']
df_new = pd.DataFrame(frames)
df['category'] = df_new['tags'].apply(pd.Series)[0]
This could potentially work if category were always the first element of that array; however, in row 0 the first element is institution, and in the second row it is creditDebit (which I would like to be unknown, as there is no category).
This will do what you did in the for loop:
s=pd.DataFrame(pd.DataFrame(df.transactions.tolist()).stack().str['tags'].tolist())
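From there, a sketch of pulling the category out of each tags list with an 'unknown' fallback (assuming, as in the JSON above, that each tag is a single-key dict; the two sample rows here are stand-ins for the real df_new):

```python
import pandas as pd

# minimal stand-in for df_new built from the JSON above (assumption)
df_new = pd.DataFrame({'tags': [
    [{'institution': 'University of MC'}, {'category': 'birdy'}],
    [{'creditDebit': 'debit'}],            # no category tag in this row
]})

def get_category(tags):
    # each tag is a one-key dict; return the first 'category' value found
    for tag in tags:
        if 'category' in tag:
            return tag['category']
    return 'unknown'

df_new['category'] = df_new['tags'].apply(get_category)
```

This scans each list instead of assuming category sits at a fixed position, which is exactly what breaks the `apply(pd.Series)[0]` approach.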
I need to compare two sets of data from csvs: one (csv1) with a column 'listing_url', the other (csv2) with columns 'parsed_url' and 'url_code'. I would like to use the result of urlparse on csv1 (specifically the netloc) to compare against csv2's 'parsed_url' and output the matching value from 'url_code' to a csv.
from urllib.parse import urlparse
import re, pandas as pd
scr = pd.read_csv('csv2', squeeze=True, usecols=['parsed_url','url_code'])[['parsed_url','url_code']]
data = pd.read_csv('csv1')
L = data.values.T[0].tolist()
T = pd.Series([scr])
for i in L:
    n = urlparse(i)
    nf = pd.Series([(n.netloc)])
I'm stuck trying to convert the data into objects I can use map with, if that's even the best thing to use, I don't know.
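One way to avoid the manual loop is to put the netloc into its own column and merge the two frames on it. A sketch under stated assumptions: the column names come from the question, but the stand-in values are made up:

```python
import pandas as pd
from urllib.parse import urlparse

# stand-ins for csv1 and csv2 (values are assumptions for illustration)
data = pd.DataFrame({'listing_url': ['https://example.com/a',
                                     'https://other.org/b']})
scr = pd.DataFrame({'parsed_url': ['example.com', 'other.org'],
                    'url_code': [101, 202]})

# extract the netloc from each listing_url, then merge on it
data['netloc'] = data['listing_url'].apply(lambda u: urlparse(u).netloc)
out = data.merge(scr, left_on='netloc', right_on='parsed_url', how='left')

# out[['listing_url', 'url_code']].to_csv('matched.csv', index=False)
```

`how='left'` keeps every row from csv1, leaving `url_code` as NaN where no netloc matches.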
I saved all the data into an array full of strings, but I want to change the strings in that array to floats without changing the header (the first row) and the first column of the array. How should I change my code?
import numpy as np
import csv
with open('MI_5MINS_INDEX.csv', encoding="utf-8") as f:
    data = list(csv.reader(f))
for line in data:
    line.remove('')
ary = np.array(data)
ary.astype(float)
Use pandas' read_csv() and it will work as you wish: the header row becomes the column names, passing index_col=0 makes the first column the index, and the remaining values are parsed as floats automatically.
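A minimal sketch of that suggestion, using an in-memory stand-in for MI_5MINS_INDEX.csv (the column names and values here are assumptions):

```python
import io
import pandas as pd

# small stand-in for MI_5MINS_INDEX.csv (assumed shape: header row,
# a non-numeric first column, and float data everywhere else)
csv_text = "time,open,close\n09:00,1.5,2.5\n09:05,3.0,4.0\n"

# the header row becomes the column names, the first column becomes
# the index, and the remaining cells are parsed as floats
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

Unlike `np.array(data).astype(float)`, this never tries to coerce the header or the first column, so no manual slicing is needed.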
I'm trying to clean some data in a pandas df, and I want the 'volume' column to go from a float to an int.
EDIT: The main issue was that the dtype of the variable I thought was a float was actually str. So it first needed to be converted to float before being changed to int.
I deleted the two other solutions I was considering and left the one I used. The top snippet is the one with the errors, and the bottom one is the solution.
import pandas as pd
import numpy as np

#Call the df
t_df = pd.DataFrame(client.get_info())
#isolate only the 'symbol' column in t_df
tickers = t_df.loc[:, ['symbol']]

def tick_data(tickers):
    for i in tickers:
        tick_df = pd.DataFrame(client.get_ticker())
        tick = tick_df.loc[:, ['symbol', 'volume']]
        tick.iloc[:,['volume']].astype(int)
        if tick['volume'].dtype != np.number:
            print('yes')
        else:
            print('no')
    return tick
Below is the revised code:
import pandas as pd

def ticker():
    #Call the df
    t_df = pd.DataFrame(client.get_info())
    #isolate only the 'symbol' column in t_df
    tickers = t_df.loc[:, ['symbol']]
    for i in tickers:
        #pulls out market data for each symbol
        tickers = pd.DataFrame(client.get_ticker())
        #isolates the symbol and volume
        tickers = tickers.loc[:, ['symbol', 'volume']]
        #floats volume
        tickers['volume'] = tickers.loc[:, ['volume']].astype(float)
        #volume to int
        tickers['volume'] = tickers.loc[:, ['volume']].astype(int)
        #keeps only symbols with volume >= 20,000, returns only symbol
        tickers = tickers.loc[tickers['volume'] >= 20000, 'symbol']
    return tickers
You have a few issues here.
In your first example, iloc only accepts integer locations for the rows and columns in the DataFrame, which is generating your error. I.e.
tick.iloc[:,['volume']].astype(int)
doesn't work. If you want label-based indexing, use .loc:
tick.loc[:,['volume']].astype(int)
Alternately, use bracket-based indexing, which allows you to take a whole column directly without using slice syntax (:) on the rows:
tick['volume'].astype(int)
Next, astype(int) returns a new Series; it does not modify in place. So what you want is
tick['volume'] = tick['volume'].astype(int)
As for your dtype check, you don't want to compare with == (or !=) against np.number, and you don't want to use is either, since that only matches np.number itself and not a concrete subtype like np.int64. Use np.issubdtype, or pd.api.types.is_numeric_dtype, i.e.:
if np.issubdtype(tick['volume'].dtype, np.number):
or:
if pd.api.types.is_numeric_dtype(tick['volume'].dtype):
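A quick self-contained check of both forms, with a string-typed volume column standing in for the real ticker data (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# stand-in for the real ticker frame (assumption: volume arrives as strings)
tick = pd.DataFrame({'symbol': ['AAA', 'BBB'], 'volume': ['100', '25000']})

# float first (the values are strings), then int, assigned back in place
tick['volume'] = tick['volume'].astype(float).astype(int)

# both subclass-aware checks recognise the resulting int64 dtype as numeric
numeric_np = np.issubdtype(tick['volume'].dtype, np.number)    # True
numeric_pd = pd.api.types.is_numeric_dtype(tick['volume'])     # True
```

Either check now behaves the same for any concrete numeric dtype (int64, float32, …), which the plain `!= np.number` comparison never did.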