log transformation of whole dataframe using numpy - python-3.x

I have a dataframe in python which is made using the following code:
import pandas as pd
df = pd.read_csv('myfile.txt', sep="\t")
df1 = df.iloc[:, 3:]
Now in df1 there are 24 columns. I would like to transform the values to log2 and make a new dataframe in which the 24 columns hold the log values of the original dataframe. To do so I used numpy.log like the following line:
import numpy
df2 = numpy.log(df1)
This code does not return what I would like to get. Do you know how to fix it?
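For what it's worth, numpy.log computes the natural logarithm, while numpy.log2 gives base-2 values. A minimal sketch of the base-2 transform, assuming df1 contains only positive numeric columns:
import numpy as np
# base-2 logarithm of every value; column names and index are preserved
df2 = np.log2(df1)
# equivalent, more explicit alternative using pandas' elementwise apply
df2_alt = df1.apply(np.log2)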

Related

Passing data from a for loop to a dataframe

I have just started out learning a few things in Python, and I am stuck on the following.
import yfinance as yf
import pandas as pd
import yahoo_fin.stock_info as si

ticker = ['20MICRONS.NS', '21STCENMGM.NS', '3IINFOTECH.NS', '3MINDIA.NS', '3PLAND.NS']
for i in ticker:
    try:
        quote = si.get_quote_table(i)
        df = pd.DataFrame.from_dict(quote.items())
        df = df.append(quote.items(), ignore_index=True)
    except (ValueError, IndexError, TypeError):
        continue
print(df)
Just for example: the ticker list has more than 4 entries, and whenever the loop exits this data should have been added (appended) to the dataframe.
But for some reason the dataframe is not appending these values.
Thanks in advance
You defined df within the loop, which means that a new dataframe is initialised in df at each iteration. Define a new dataframe in df before the loop and append to the df in the loop. Please add the information that you provided in the comments to the question.
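A minimal sketch of that idea, collecting one dataframe per ticker and concatenating once at the end (the 'field'/'value' column names and the exact shape of get_quote_table's output are assumptions on my part):
import pandas as pd
import yahoo_fin.stock_info as si

ticker = ['20MICRONS.NS', '21STCENMGM.NS', '3IINFOTECH.NS', '3MINDIA.NS', '3PLAND.NS']

frames = []  # one small dataframe per ticker
for i in ticker:
    try:
        quote = si.get_quote_table(i)  # assumed to return a dict of quote fields
        row = pd.DataFrame(list(quote.items()), columns=['field', 'value'])
        row['ticker'] = i  # remember which ticker these rows belong to
        frames.append(row)
    except (ValueError, IndexError, TypeError):
        continue

df = pd.concat(frames, ignore_index=True)  # build the final dataframe once, outside the loop
print(df)
Collecting into a list and calling pd.concat once also sidesteps DataFrame.append, which has been removed in recent pandas versions.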

Is there a way to order the columns (sliced with iloc) in pandas as the main dataframe?

I'm trying to separate columns by slicing them because I need to assign a dtype to each one. So I grouped them by dtype and assigned the respective dtype, and now I want to join or concat them back so that the result has the same column order as the main dataframe. Note that doing it by column name is not possible because the names may change.
Example:
import pandas as pd
file = pd.read_csv(f, encoding='utf8') #It has 11 columns
intg = file.iloc[:,[0,2,4,6,8,9,11]].astype("Int64")
obj = file.iloc[:,[1,3,5,7,10]].astype(str)
After doing this I need to join them with the same order as the main file, that is from 0 to 11.
To join these 2 chunks you can use the join method on one of them, then reindex based on your original dataframe's columns. It should look something like this:
import pandas as pd
file = pd.read_csv(f, encoding='utf8') #It has 11 columns
intg = file.iloc[:,[0,2,4,6,8,9,11]].astype("Int64")
obj = file.iloc[:,[1,3,5,7,10]].astype(str)
out = intg.join(obj).reindex(file.columns, axis="columns")
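An equivalent alternative (my own suggestion, not part of the original answer) is to concatenate along the column axis, which also aligns the two slices on their shared index:
out = pd.concat([intg, obj], axis="columns").reindex(file.columns, axis="columns")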

Is there a way to list the rows and columns in a pandas DataFrame that are empty strings?

I have a 1650x40 dataframe that is a matrix of people who worked on projects each day. It looks like this:
import pandas as pd
df = pd.DataFrame([['bob','11/1/19','X','','',''], ['pete','11/1/19','X','','',''],
['wendy','11/1/19','','','X',''], ['sam','11/1/19','','','',''],
['cara','11/1/19','','','X','']],
columns=['person', 'date', 'project1','project2','project3','project4'])
I am trying to sanity check the data by:
listing any columns that do not have an X in them (in this case 'project2' and 'project4')
listing any rows that do not have an X in them (in this case 'sam')
Desired outcome:
Something like df.show_empty(columns) returns ['project2','project4'] and df.show_empty(rows) returns ['sam']
Obviously this method would need some way to tell it that the first two columns are not expected to be empty and should be ignored.
My desired outcome above would return lists of column headings (or row indexes) so that I could go back and check my data and application to find out why there's no entry in the relevant cell (I am guessing there's a good chance that more than one row or column are affected). This seems like it should be trivial but I'm really stuck with figuring this out.
Thanks for any help offered!
For me, it is easier to use apply to accomplish this task. The working code is shown below:
import pandas as pd
import numpy as np

df = pd.DataFrame([['bob','11/1/19','X','','',''], ['pete','11/1/19','X','','',''],
                   ['wendy','11/1/19','','','X',''], ['sam','11/1/19','','','',''],
                   ['cara','11/1/19','','','X','']],
                  columns=['person', 'date', 'project1','project2','project3','project4'])

df = df.replace('', np.nan)                           # empty strings become NaN so count() can be used
colmns = df.apply(lambda x: x.count()==0, axis=0)     # True for columns whose values are all NaN
df[colmns.index[colmns]]                              # the all-empty columns
df[df.apply(lambda x: x[2:].count()==0, axis=1)]      # the all-empty rows, ignoring the first two columns
df = df.replace('', np.nan) will replace the '' with NaN, so that we can use the count() function.
colmns = df.apply(lambda x: x.count()==0, axis=0): this finds the columns that are all NaN.
df[df.apply(lambda x: x[2:].count()==0, axis=1)]: this ignores the first two columns and finds the rows that are all NaN.
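If you want the lists of names directly, as in the desired outcome, here is a minimal sketch that works on the dataframe with or without the NaN replacement (it assumes the 'person' column is the row label you want back):
# project columns (everything after the first two) that never contain an 'X'
empty_cols = [c for c in df.columns[2:] if not (df[c] == 'X').any()]
# people whose row contains no 'X' in any project column
empty_rows = df.loc[~(df.iloc[:, 2:] == 'X').any(axis=1), 'person'].tolist()
print(empty_cols)  # ['project2', 'project4']
print(empty_rows)  # ['sam']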

How to add a column to a PySpark dataframe which contains the nth quantile of another column in the dataframe

I have a very large CSV file which has been imported as a PySpark dataframe: df. The dataframe contains many columns, including a column ireturn. I want to compute the 0.99 and 0.01 percentiles of this column and then add two new columns to df, new_col_99 and new_col_01, which contain the 0.99 and 0.01 percentiles, respectively. I wrote the following code, which works for small dataframes, but I get errors when I apply it to my large dataframe.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("name of the file", inferSchema = True, header = True)
precentile_99 = df.selectExpr('percentile(val1, 0.99)').head(1)[0][0]
precentile_01 = df.selectExpr('percentile(val1, 0.01)').head(1)[0][0]
from pyspark.sql.functions import lit
df = df.withColumn("new_col_99", lit(precentile_99))
df = df.withColumn("new_col_01", lit(precentile_01))
I tried to replace head with collect, but it did not work either. I get this error:
Logging error ---
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:49850)
Traceback (most recent call last):...
I have also tried the following:
percentile = df.approxQuantile('ireturn',[0.01,0.99],0.25)
df = df.withColumn("new_col_01", lit(percentile[0]))
df = df.withColumn("new_col_99", lit(percentile[1]))
The above code takes about 15-20 min to run, but the result is wrong (the values in the column ireturn are all less than 1, yet it returns the 0.99 percentile as 6789...).
Late, but hopefully answers your concerns. You can get the result this way:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("name of the file", inferSchema = True, header = True)
df = df.withColumn("new_col_99", F.expr('percentile(val1, 0.99) over()'))
df = df.withColumn("new_col_01", F.expr('percentile(val1, 0.01) over()'))
For large datasets percentile_approx may be better:
df = df.withColumn("new_col_99", F.expr('percentile_approx(val1, 0.99) over()'))
df = df.withColumn("new_col_01", F.expr('percentile_approx(val1, 0.01) over()'))

Trying to name multiple indexes, getting KeyError

Good morning,
I'm using python 3.6. I'm trying to name my index (see the last line in the code below) because I plan on joining to another DataFrame. The DataFrame should be multi-indexed. The index is the first two columns ('currency' and 'rtdate') and the data looks like this:
rate
AUD 2010-01-01 0.897274
2010-02-01 0.896608
2010-03-01 0.895943
2010-04-01 0.895277
2010-05-01 0.894612
This is the code that I'm running:
import pandas as pd
import numpy as np
import datetime as dt
df=pd.read_csv('file.csv',index_col=0)
df.index = pd.to_datetime(df.index)
new_index = pd.date_range(df.index.min(),df.index.max(),freq='MS')
df=df.reindex(new_index)
df=df.interpolate().unstack()
rate = pd.DataFrame(df)
rate.columns = ['rate']
rate.set_index(['currency','rtdate'],drop=False)
Running this throws an error message:
KeyError: 'currency'
What am I missing?
Thanks for the assistance
I think you need to set the names of the MultiIndex levels with rename_axis first, and then use reset_index to turn those levels into columns:
So you'd end up with this:
rate = df.interpolate().unstack().rename_axis(['currency','rtdate']).reset_index()
instead of this:
df=df.interpolate().unstack()
rate = pd.DataFrame(df)
rate.columns = ['rate']
rate.set_index(['currency','rtdate'],drop=False)
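If you still want the result indexed by currency and rtdate afterwards (as in your original attempt), keep in mind that set_index returns a new dataframe rather than modifying it in place, so assign the result. A small sketch building on the answer above (renaming the unstacked value column to 'rate' is my addition, to match your intended output):
rate = df.interpolate().unstack().rename_axis(['currency', 'rtdate']).reset_index()
rate = rate.rename(columns={0: 'rate'})         # the unstacked values come back as column 0
rate = rate.set_index(['currency', 'rtdate'])   # set_index is not in-place; assign the result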
