Below is a small sample of a very large dataframe I am working with.
import pandas as pd
import numpy as np
NaN = np.nan
data = {'Start_x': ['Tom', NaN, NaN, NaN, NaN],
        'Start_y': [NaN, 'Nick', NaN, NaN, NaN],
        'Start_z': [NaN, NaN, 'Alison', NaN, NaN],
        'Start_a': [NaN, NaN, NaN, 'Mark', NaN],
        'Start_b': [NaN, NaN, NaN, NaN, 'Oliver'],
        'Sex': ['Male', 'Male', 'Female', 'Male', 'Male']}
df = pd.DataFrame(data)
df
I want the final result to look like the output shown below: the five 'Start' columns have to be merged into a single new column, while the 'Sex' column should stay as it is.
Any help is greatly appreciated. Thank you!
One option is to backfill the Start columns along the rows and then take the first column:
df['New_Column'] = df.filter(like='Start').bfill(axis=1).iloc[:, 0]
df
Start_x Start_y Start_z Start_a Start_b Sex New_Column
0 Tom NaN NaN NaN NaN Male Tom
1 NaN Nick NaN NaN NaN Male Nick
2 NaN NaN Alison NaN NaN Female Alison
3 NaN NaN NaN Mark NaN Male Mark
4 NaN NaN NaN NaN Oliver Male Oliver
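An equivalent approach, since each row holds exactly one non-NaN Start value: stack() drops the NaNs, leaving that single value per row. A minimal sketch assuming the same df as above:
# stack() drops NaNs; droplevel(1) removes the column label so the
# result aligns back on the original row index
df['New_Column'] = df.filter(like='Start').stack().droplevel(1)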
Here is my code for a random forest in sklearn. I have already manually filled the NAs with 0s, and this code has run on the same data before, so the problem must be something fixable in the code rather than in the data itself.
Code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth': [2, 3, 4, 5, 7],
              'n_estimators': [1, 10, 25, 50, 100, 256, 512],
              'random_state': [42]}

def perform_grid_search(X_data, y_data):
    """Function to perform a grid search."""
    rf = RandomForestClassifier(criterion='entropy')
    clf = GridSearchCV(rf, parameters, cv=4, scoring='roc_auc', n_jobs=3)
    clf.fit(X_data, y_data)
    print(clf.cv_results_['mean_test_score'])
    return clf.best_params_['n_estimators'], clf.best_params_['max_depth']
and when running:
# next function
n_estimator, depth = perform_grid_search(X_train, y_train)
c_random_state = 42
print(n_estimator, depth, c_random_state)
this error comes back:
model_selection_search.py:922: UserWarning: One or more of the test scores are non-finite:
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
  warnings.warn(
followed by:
ValueError: could not convert string to float: ''
Please let me know what is going wrong, as this has completely broken my workflow!
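For what it's worth, that ValueError usually means some feature column still contains empty strings rather than numbers, which makes every CV fit fail and produces the all-nan scores. A minimal diagnostic sketch, assuming X_train is a pandas DataFrame:
import numpy as np

# columns pandas could not parse as numeric are left with dtype 'object'
print(X_train.select_dtypes(include='object').columns)

# one possible fix (an assumption about the data): treat '' as missing,
# fill with 0 as was done for the other NAs, and cast to float
X_train = X_train.replace('', np.nan).fillna(0).astype(float)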
I am writing a script to scrape a series of tables in a pdf into python using tabula-py.
This works and I do get the data, but each record is split across multiple rows, which makes it useless in practice.
I would like to merge each group of rows into the row where the first column ('Tag') is not NaN.
I was about to iterate over the whole thing and merge the rows manually, but pandas is a powerful tool and I simply don't have the pandas vocabulary to search for the right method. Any help is much appreciated.
My Code
import tabula

filename = 'tags.pdf'
tagTableStart = 2   # 784
tagTableEnd = 39    # 822
tableHeadings = ['Tag', 'Item', 'Length', 'Description', 'Value']
pageRange = "%d-%d" % (tagTableStart, tagTableEnd)
print("Scanning pages %s" % pageRange)

# extract all the tables in that page range
tables = tabula.read_pdf(filename, pages=pageRange)
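Note that tabula.read_pdf returns a list of DataFrames, one per detected table, so they would typically be combined before any reshaping. A sketch, assuming every extracted table shares the headings above:
import pandas as pd

df = pd.concat(tables, ignore_index=True)
df.columns = tableHeadings  # assumes each table has exactly these five columns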
How the data is stored in the DataFrame:
(Empty fields are NaN)
  Tag  Item Length Description Value
0  AA  Some      2   Very Very   NaN
1 NaN  Text    NaN   Very long   NaN
2 NaN   NaN    NaN       Value   NaN
3  AB  More      4  Other Very  aaaa
4 NaN  Text    NaN   Very long  bbbb
5 NaN   NaN    NaN       Value  cccc
How I want the data:
This is almost as it is displayed in the PDF (I couldn't figure out how to make text multi-line in the SO editor):
Tag  Item        Length  Description                   Value
AA   Some\nText  2       Very Very\nVery long\nValue
AB   More\nText  4       Other Very\nVery long\nValue  aaaa\nbbbb\ncccc
Actual sample output (obfuscated)
Tag Item Length Description Value
0 AA PYTHROM-PARTY-I 20 Some Current defined values are :
1 NaN NaN NaN texst Byte1:
2 NaN NaN NaN NaN C
3 NaN NaN NaN NaN DD
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN DD
6 NaN NaN NaN NaN DD
7 NaN NaN NaN NaN DD
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN B :
10 NaN NaN NaN NaN JLSAFISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
11 NaN NaN NaN NaN ISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
12 NaN NaN NaN NaN upon ISFLIHAJSLIhdsflhdliugdyg89o7fgy
13 NaN NaN NaN NaN asdsadct on the dasdsaf the
14 NaN NaN NaN NaN actsdfion.
15 NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN SLKJDBFDLFKJBDSFLIUFy7dfsdfiuojewv
17 NaN NaN NaN NaN csdfgfdgfd.
18 NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN fgfdgdfgsdfgfdsgdfsgfdgfdsgsdfgfdg
20 BB PRESENT-AMOUNT-BOX 11 Lorem Ipsum NaN
21 CC SOME-OTHER-VALUE 1 sdlkfgsdsfsdf 1
22 NaN NaN NaN device NaN
23 NaN NaN NaN ueghkjfgdsfdskjfhgsdfsdfkjdshfgsfliuaew8979vfhsdf NaN
24 NaN NaN NaN dshf87hsdfe4ir8hod9 NaN
Create groups by forward-filling the 'Tag' column, so that each continuation row falls into the group of the last non-NaN tag, then join the rows of each group:
agg_func = dict(zip(df.columns, [lambda s: '\n'.join(s).strip()] * len(df.columns)))
out = df.fillna('').groupby(df['Tag'].ffill(), as_index=False).agg(agg_func)
Output:
>>> out
Tag Item Length Description Value
0 AA Some\nText 2 Very Very\nVery long\nValue
1 AB More\nText 4 Other Very\nVery long\nValue aaaa\nbbbb\ncccc
agg_func is equivalent to writing:
{'Tag': lambda s: '\n'.join(s).strip(),
'Item': lambda s: '\n'.join(s).strip(),
'Length': lambda s: '\n'.join(s).strip(),
'Description': lambda s: '\n'.join(s).strip(),
'Value': lambda s: '\n'.join(s).strip()}
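The same mapping can also be built with a dict comprehension, which some find easier to read:
agg_func = {col: lambda s: '\n'.join(s).strip() for col in df.columns}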
I'm working with Python 3 on macOS 10.11.6 (El Capitan).
I have a .csv dataset consisting of about 3,700 time series sets (of unequal lengths). The data are currently formatted as follows:
Current Format
trade_date price_usd ticker
0 2016-01-01 434.33000 BTC
1 2016-01-02 433.44000 BTC
2 2016-01-03 430.01000 BTC
3 2016-01-04 433.09000 BTC
4 2016-01-05 431.96000 BTC
... ... ... ...
2347227 2020-10-19 74.13000 BRAIN
2347228 2020-10-20 71.97000 BRAIN
2347229 2020-10-21 76.64000 BRAIN
2347230 2020-10-22 80.90000 BRAIN
2347231 2020-10-19 0.15004 DAOFI
Ignoring the default numerical index for the moment, notice that the datetime column, trade_date, repeats its sequence of values with each new ticker group. My goal is to transform the data so that each ticker name becomes a column header under which its daily prices are listed in the correct order, keyed by a non-repeating datetime index:
Target Format
trade_date ticker1 ticker2 ... tickerN
day1 t1p1 t2p1 ... tNp1
day2 t1p2 t2p2 ... etc...
.
.
.
dayK
Thus far I've tried various approaches, including stack()/unstack() and groupby(), as well as custom functions that iterate through the values and assign them into a pre-structured empty DataFrame, but to no avail (see the failed attempt below).
New, empty target data frame with the ticker symbols as columns and the trade_date range as index:
BTC ETH XRP MKR LTC USDT BCH XLM EOS BNB ... MTLX INDEX WOA HAUT THRM YFED NMT DOKI BRAIN DAOFI
2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2016-01-05 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Failed attempt to populate the above ...
for element in crypto_df['ticker']:
    if element == new_df.column and crypto['trade_date'] == new_df.index:
        df['ticker'] = element

new_df.head()
My ultimate goal is to produce a multi-series time series forecast using FBProphet because of its ability to handle multiple time series forecasts in a "single" model.
One last thought: one could create a separate data frame for each ticker and then rejoin them along the datetime index, adding the new columns along the way, but that seems rather round-about (I've literally just done this for a couple thousand .csv files of equities data). I'd still like to find a more direct solution, if there is one; surely this scenario will arise again in the future!
Thanks for any thoughts ...
You can set_index and unstack:
print(df.set_index(["trade_date", "ticker"]).unstack("ticker"))
price_usd
ticker BRAIN BTC DAOFI
trade_date
2016-01-01 NaN 434.33 NaN
2016-01-02 NaN 433.44 NaN
2016-01-03 NaN 430.01 NaN
2016-01-04 NaN 433.09 NaN
2016-01-05 NaN 431.96 NaN
2020-10-19 74.13 NaN 0.15004
2020-10-20 71.97 NaN NaN
2020-10-21 76.64 NaN NaN
2020-10-22 80.90 NaN NaN
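The result above carries a MultiIndex on the columns (price_usd over ticker). If flat ticker columns are needed for downstream tools such as FBProphet, the extra level can be dropped; a small sketch:
wide = df.set_index(["trade_date", "ticker"]).unstack("ticker").droplevel(0, axis=1)
print(wide.columns)  # now just the ticker names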
First use .groupby(), then use .unstack():
import pandas as pd
from io import StringIO
text = """
trade_date price_usd ticker
2016-01-01 434.33000 BTC
2016-01-02 433.44000 BTC
2016-01-02 430.01000 Google
2016-01-03 433.09000 BTC
2016-01-03 431.96000 Google
"""
df = pd.read_csv(StringIO(text), sep=r'\s+', header=0)
df.groupby(['trade_date', 'ticker'])['price_usd'].mean().unstack()
Resulting dataframe:
ticker         BTC  Google
trade_date
2016-01-01  434.33     NaN
2016-01-02  433.44  430.01
2016-01-03  433.09  431.96
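The .mean() here only serves to collapse duplicate (trade_date, ticker) pairs; if every pair is unique, pivot produces the same wide frame directly:
df.pivot(index='trade_date', columns='ticker', values='price_usd')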
I have a list of values [0.1, 0.43, 0.58] and a dataframe df with several columns. I added three new columns filled with NaN, and I want to replace the NaNs in one particular row with the values from the list, one list value per new column, in that exact order.
The dataframe has 4 original columns (index not shown) plus the 3 new ones:
Name A B C New1 New2 New3
Elem1 NaN NaN NaN NaN NaN NaN
Elem2 NaN NaN NaN NaN NaN NaN
Elem3 NaN NaN NaN NaN NaN NaN
Expected result:
Name A B C New1 New2 New3
Elem1 NaN NaN NaN NaN NaN NaN
Elem2 NaN NaN NaN 0.1 0.43 0.58
Elem3 NaN NaN NaN NaN NaN NaN
If l is your list, then:
df.loc[df.Name=='Elem2', 'New1':'New3'] = l
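A minimal self-contained version of that assignment, using a frame shaped like the sample above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Elem1', 'Elem2', 'Elem3'],
                   'A': np.nan, 'B': np.nan, 'C': np.nan,
                   'New1': np.nan, 'New2': np.nan, 'New3': np.nan})
l = [0.1, 0.43, 0.58]

# with .loc, the label slice 'New1':'New3' is inclusive on both ends
df.loc[df.Name == 'Elem2', 'New1':'New3'] = l
print(df)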
I am trying to clean NaN values in an open-source dataset.
I am using Python 3, Jupyter, and pandas.
import shutil
import urllib.request
import zipfile
import pandas as pd

url = 'https://resources.lendingclub.com/LoanStats3c.csv.zip'
file_name = 'LoanStats3c.csv.zip'

# download the archive and extract the csv
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
with zipfile.ZipFile(file_name) as zf:
    zf.extractall()

loan = pd.read_csv('LoanStats3c.csv', skiprows=1, parse_dates=True, index_col='id')
loan.describe()
# remove all columns with all NAs
loan = loan.dropna(axis=1, how = 'all')
loan.describe()
# remove all rows with any NAs
loan = loan.dropna(axis = 0)
loan.describe()
But, the results are all columns with all NAs:
loan_amnt funded_amnt funded_amnt_inv installment annual_inc dti \
count 0.0 0.0 0.0 0.0 0.0 0.0
mean NaN NaN NaN NaN NaN NaN
std NaN NaN NaN NaN NaN NaN
min NaN NaN NaN NaN NaN NaN
25% NaN NaN NaN NaN NaN NaN
50% NaN NaN NaN NaN NaN NaN
75% NaN NaN NaN NaN NaN NaN
max NaN NaN NaN NaN NaN NaN
Why are all the rows with valid values gone, leaving only all-NA columns?
Thanks.
When you use .dropna() like that, every occurrence of NaN triggers a deletion from the dataframe:
loan.dropna(axis=1, how='all')
deletes the columns whose values are all NaN, while
loan.dropna(axis=0)
deletes every row with at least one NaN value.
I looked at the file, and I'm pretty sure every row has at least one NaN column.
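A tiny demo of the two calls on illustrative data:
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})
print(demo.dropna(axis=1, how='all'))  # drops column 'b', which is all NaN
print(demo.dropna(axis=0))             # drops every row containing a NaN -> empty here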
Finally, when you call .describe() on the now-empty dataframe, the values shown are just descriptive statistics of that emptiness. If you want to see the real DF, use print(df), or in Jupyter simply leave the variable at the end of the cell:
# some code
# some code
# some code
variable = pd.DataFrame([])
# print(variable)
variable
That will show you the value of the variable.