I want to read a pickle file in Python 3.5. I am using the following code.
The following is my output; I want to load it as a pandas DataFrame.
When I try to convert it to a DataFrame using df = pd.DataFrame(df), I get the error below.
ValueError: arrays must all be same length
Link to data: https://drive.google.com/file/d/1lSFBPLbUCluWfPjzolUZKmD98yelTSXt/view?usp=sharing
I think you need a dict comprehension with concat:

import pickle

import pandas as pd
from pandas.io.json import json_normalize

with open("imdbnames40.pkl", 'rb') as fh:
    d = pickle.load(fh)

df = pd.concat({k: json_normalize(v, 'scores', ['best']) for k, v in d.items()})
print(df.head())
                         ethnicity  score              best
'Aina Rapoza 0               Asian   0.89             Asian
             1      GreaterAfrican   0.05             Asian
             2     GreaterEuropean   0.06             Asian
             3  IndianSubContinent   0.11  GreaterEastAsian
             4    GreaterEastAsian   0.89  GreaterEastAsian
Then, if you need a column from the first level of the MultiIndex:
df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
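Note that on newer pandas versions (1.0+), json_normalize is available at the top level and the pandas.io.json import is deprecated; a minimal sketch of the equivalent call on a newer version:

df = pd.concat({k: pd.json_normalize(v, 'scores', ['best']) for k, v in d.items()})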
I am new to Python. I have a list with a lot of numbers, and I would like to use np.histogram and pandas to generate a histogram-like CSV file:
import numpy as np
import pandas as pd
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
print(bin_edges)
And I get the following output:
[27 97 24 27 11]
[-19.12 -8.406 2.308 13.022 23.736 34.45 ]
Then I tried to write the data into a CSV file:
bin = 5
min = np.delete(bin_edges, bin)
max = np.delete(bin_edges, 0)
df = pd.DataFrame({'Min': min, 'Max': max, 'Count': counts})
df.to_csv('data.csv', index=False, sep='\t')
However, I have got the following file ...
Min Max Count
-19.12 -8.405999999999999 27
-8.405999999999999 2.3080000000000034 97
2.3080000000000034 13.02200000000001 24
13.02200000000001 23.736000000000008 27
23.736000000000008 34.45 11
Is there any way that I can limit the number of decimal places?
Many thanks,
You can use the float_format parameter of the to_csv() function.
df.to_csv(
    'data.csv',
    index=False,
    sep='\t',
    float_format='%.3f')
In the example above the output is to 3 decimal places. See the pandas docs and the python docs for more info.
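Alternatively, if you want the rounded values in the DataFrame itself rather than only in the CSV output, a minimal sketch using DataFrame.round, assuming the df built in the question:

# Round only the bin-edge columns, then write the frame out as before.
df_rounded = df.round({'Min': 3, 'Max': 3})
df_rounded.to_csv('data.csv', index=False, sep='\t')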
I want to work on my Excel file tb.xlsx, group the data by a column named 'Hybrid type', and then store the new dataframe in another Excel file.
import numpy as np
import pandas as pd

df = pd.read_excel("D:\\tb.xlsx")
group = df.groupby("Hybrid type")
print(group)

df1 = pd.DataFrame(columns=df.columns)
for Hybridtype, frame in group:
    df2 = pd.DataFrame(frame)
    df1.append(df2, ignore_index=True)

print(df1)
df1.to_excel("Montu.xlsx", sheet_name="Sheet1")
When I run this, it gives the following output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000272FD41B108>
Empty DataFrame
Columns: [Electricity generation price per unit, Primary load demand, Hybrid type, Biomass type, Location, Country, System Type, Ref.]
Index: []
I think the append() command is not working properly here, as my df1 dataframe is empty.
Can someone please help me with my code?
Here you go. In contrast to list.append() in native Python, DataFrame.append() is not an in-place operation. It's easy to forget ;)
import numpy as np
import pandas as pd

df = pd.read_excel("D:\\tb.xlsx")
group = df.groupby("Hybrid type")
print(group)

df1 = pd.DataFrame(columns=df.columns)
for Hybridtype, frame in group:
    df2 = pd.DataFrame(frame)
    df1 = df1.append(df2, ignore_index=True)

print(df1)
df1.to_excel("Montu.xlsx", sheet_name="Sheet1")
A few days back I also ran into a similar issue. Since I don't know the contents of your xlsx file, I used a sample file to reproduce the issue.
Issue:
import numpy as np
import pandas as pd

df = pd.read_excel("/home/aakash/Downloads/file_example_XLS_100.xls")
group = df.groupby("Country")
print(group)

df1 = pd.DataFrame(columns=df.columns)
for Hybridtype, frame in group:
    df2 = pd.DataFrame(frame)
    df1.append(df2, ignore_index=True)

print(df1)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe9b48eba90>
Empty DataFrame
Columns: [0, First Name, Last Name, Gender, Country, Age, Date, Id]
Index: []
Solution:
As per the official documentation, the DataFrame.append function appends rows of another frame to the end of the caller, returning a new object.
So the modified code should be:
import numpy as np
import pandas as pd

df = pd.read_excel("/home/aakash/Downloads/file_example_XLS_100.xls")
group = df.groupby("Country")
print(group)

df1 = pd.DataFrame(columns=df.columns)
for Hybridtype, frame in group:
    df2 = pd.DataFrame(frame)
    df1 = df1.append(df2, ignore_index=True)  ## Modified code
print(df1)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe9b4380f60>
0 First Name Last Name Gender Country Age Date Id
0 3 Philip Gent Male France 36 21/05/2015 2587
1 12 Franklyn Unknow Male France 38 15/10/2017 2579
2 16 Shavonne Pia Female France 24 21/05/2015 1546
3 17 Shavon Benito Female France 39 15/10/2017 3579
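Note that on newer pandas versions DataFrame.append is deprecated (1.4) and removed (2.0), so the same result is usually obtained with pd.concat; a minimal sketch, assuming the same tb.xlsx layout as in the question:

import pandas as pd

df = pd.read_excel("D:\\tb.xlsx")

# Collect one frame per group and concatenate them once at the end.
frames = [frame for _, frame in df.groupby("Hybrid type")]
df1 = pd.concat(frames, ignore_index=True)

df1.to_excel("Montu.xlsx", sheet_name="Sheet1")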
I am trying to vectorize a dask.dataframe with the dask-ml HashingVectorizer. I want the vectorization results to stay on the cluster (distributed system). That's why I am using client.persist when I try to transform the data. But for some reason, I am getting the error below.
Traceback (most recent call last):
  File "/home/dodzilla/my_project/components_with_adapter/vectorizers/base_vectorizer.py", line 112, in hybrid_feature_vectorizer
    CLUSTERING_FEATURES=self.clustering_features)
  File "/home/dodzilla/my_project/components_with_adapter/vectorizers/text_vectorizer.py", line 143, in vectorize
    X = self.client.persist(fitted_vectorizer.transform, combined_data)
  File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/client.py", line 2860, in persist
    assert all(map(dask.is_dask_collection, collections))
AssertionError
I can't share the data but all of the necessary information about the data is as below:
>>>type(combined_data)
<class 'dask.dataframe.core.Series'>
>>>type(combined_data.compute())
<class 'pandas.core.series.Series'>
>>>combined_data.compute().shape
12
A minimal working example can be found below. In the code snippet, combined_data holds the merged columns, meaning all of the columns are merged into one column. The data has 12 rows, and all of the values inside the rows are strings. This is the code where I am getting the error:
from stop_words import get_stop_words
from dask_ml.feature_extraction.text import HashingVectorizer as daskHashingVectorizer
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client


def convert_dataframe_to_single_text(documents):
    """
    Combine all of the columns into 1 column.
    """
    if type(documents) is dask.dataframe.core.DataFrame:
        cols = documents.columns
        documents['combined'] = documents[cols].apply(func=(lambda row: ' '.join(row.values.astype(str))), axis=1,
                                                      meta=('str'))
        document_texts = documents.drop(cols, axis=1)
    else:
        raise TypeError('Wrong type of data. Expected Pandas DF or Dask DF but received ', type(documents))
    return document_texts


# Init the client.
client = Client('localhost:8786')

# Get stopwords
stopwords = get_stop_words(language="english")

# Create dask dataframe from pandas dataframe
data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': ["twenty", "twentyone", "nineteen", "eighteen"]}
df = pd.DataFrame(data)
df = dd.from_pandas(df, npartitions=1)

# Init the vectorizer
vectorizer = daskHashingVectorizer(stop_words=stopwords, alternate_sign=False,
                                   norm=None, binary=False,
                                   n_features=10000)

# Combine all of the columns into 1 column.
combined_data = convert_dataframe_to_single_text(df)

# Fit the vectorizer.
fitted_vectorizer = client.persist(vectorizer.fit(combined_data))

# Transform the data.
X = client.persist(fitted_vectorizer.transform, combined_data)
I hope the information is enough.
Important note: I am not getting any kind of error when I use client.compute, but from what I understand that does not keep the work on the cluster of machines; instead it runs on the local machine, and it returns a csr matrix rather than a lazily evaluated dask.array.
This is not how client.persist is supposed to be used. The functions I was looking for were client.submit and client.map... In my case client.submit solved my issue.
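For reference, a minimal sketch of the client.submit approach, reusing the fitted_vectorizer and combined_data names from the question:

# Run the transform on the cluster; client.submit returns a Future
# whose result stays on a worker until it is requested.
future = client.submit(fitted_vectorizer.transform, combined_data)

# Pull the result back to the local machine only when it is actually needed.
X = future.result()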
I have a dataframe in python which is made using the following code:
import pandas as pd
df = pd.read_csv('myfile.txt', sep="\t")
df1 = df.iloc[:, 3:]
Now df1 has 24 columns. I would like to transform the values to their log2 values and make a new dataframe with 24 columns containing the log values of the original dataframe. To do so I used numpy.log as in the following line:
df2 = (numpy.log(df1))
This code does not return what I would like to get. Do you know how to fix it?
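A minimal sketch, assuming the goal is base-2 logarithms: np.log is the natural logarithm, so np.log2 is likely the element-wise function you want here:

import numpy as np
import pandas as pd

df = pd.read_csv('myfile.txt', sep="\t")
df1 = df.iloc[:, 3:]

# Element-wise base-2 logarithm of the remaining 24 numeric columns.
df2 = np.log2(df1)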
Good morning,
I'm using Python 3.6. I'm trying to name my index (see the last line in the code below) because I plan on joining to another DataFrame. The DataFrame should be multi-indexed. The index is the first two columns ('currency' and 'rtdate'), and the data looks like this:
                    rate
AUD 2010-01-01  0.897274
    2010-02-01  0.896608
    2010-03-01  0.895943
    2010-04-01  0.895277
    2010-05-01  0.894612
This is the code that I'm running:
import pandas as pd
import numpy as np
import datetime as dt
df=pd.read_csv('file.csv',index_col=0)
df.index = pd.to_datetime(df.index)
new_index = pd.date_range(df.index.min(),df.index.max(),freq='MS')
df=df.reindex(new_index)
df=df.interpolate().unstack()
rate = pd.DataFrame(df)
rate.columns = ['rate']
rate.set_index(['currency','rtdate'],drop=False)
Running this throws an error message:
KeyError: 'currency'
What am I missing?
Thanks for the assistance
I think you need to set the names of the MultiIndex levels using rename_axis first, and then reset_index to move the MultiIndex levels into columns (the KeyError occurs because after unstack, 'currency' and 'rtdate' are index levels rather than columns).
So you'd end up with this:
rate = df.interpolate().unstack().rename_axis(('currency', 'rtdate')).reset_index(name='rate')
instead of this:
df=df.interpolate().unstack()
rate = pd.DataFrame(df)
rate.columns = ['rate']
rate.set_index(['currency','rtdate'],drop=False)
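Once rate has been built with the corrected line above, the planned join can re-create the MultiIndex from the named columns; a minimal sketch, where other_df is a hypothetical DataFrame indexed by the same ('currency', 'rtdate') levels:

# Re-create the MultiIndex from the named columns and join on it.
rate = rate.set_index(['currency', 'rtdate'])
joined = rate.join(other_df)  # other_df is hypothetical, indexed the same way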