How to preserve the datatype 'list' of a data frame while reading from csv or writing to csv - python-3.x

I want to preserve the datatype of a data frame column that holds lists while writing it to a csv file. When I read it back, I need the values to be lists again.
I have tried
pd.read_csv('namesss.csv', dtype={'letters': list})
but it says
dtype <class 'list'> not understood
This is an example:
df = pd.DataFrame({'name': ['jack', 'johnny', 'stokes'],
                   'letters': [['j', 'k'], ['j', 'y'], ['s', 's']]})
print(type(df['letters'][0]))
df
<class 'list'>
     name letters
0    jack  [j, k]
1  johnny  [j, y]
2  stokes  [s, s]
df.to_csv('namesss.csv')
print(type(pd.read_csv('namesss.csv')['letters'][0]))
<class 'str'>

You can use the ast module to turn the strings back into lists:
import ast
df2 = pd.read_csv('namesss.csv')
df2['letters'] = [ast.literal_eval(x) for x in df2['letters']]
In [1]: print(type(df2['letters'][0]))
<class 'list'>
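An equivalent one-pass approach (a sketch, assuming the same namesss.csv written by df.to_csv above) is to let read_csv apply the parser while reading, via its converters argument:
import ast
import pandas as pd

# converters runs ast.literal_eval on each cell of 'letters' during parsing,
# so no second pass over the column is needed
df2 = pd.read_csv('namesss.csv', converters={'letters': ast.literal_eval})
print(type(df2['letters'][0]))  # <class 'list'>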

Related

How to read an Excel table with one column?

I have a table in Excel with one column that I want to read into a list.
At first I tried it like this:
>>> df = pandas.read_excel('emails.xlsx', sheet_name=None)
>>> df
OrderedDict([('Sheet1', Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> for k, v in df.items():
... print(type(v), v)
...
<class 'pandas.core.frame.DataFrame'> Chadisayed#gmx.com
0 wonderct#mail.ru
1 fcl#fcl-bd.com
2 galina#dorax-investments.com
>>> df = df.items()[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'odict_items' object is not subscriptable
I tried it differently:
>>> df = pandas.read_excel('emails.xlsx', index_col=0)
>>> df
Empty DataFrame
Columns: []
Index: [wonderct#mail.ru, fcl#fcl-bd.com, galina#dorax-investments.com]
[419 rows x 0 columns]
>>> foo = []
>>> for i in df.index:
... foo.append(i)
...
>>> foo
['wonderct#mail.ru', 'fcl#fcl-bd.com', 'galina#dorax-investments.com']
It almost worked, but the first element is missing. What else can I do? Is there really no way to read the Excel file simply line by line?
Try this (header=None stops read_excel from treating the first row as column names, which is why your first element went missing):
df = pd.read_excel('temp.xlsx', header=None)
target_list = list(df[0].values)
Use:
target_list = pandas.read_excel('emails.xlsx', index_col=None, names=['A'])['A'].tolist()
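Both answers work for the same reason: with header=None (or explicit names) and without index_col=0, the first row is kept as data instead of being consumed as the header or index. A quick check, assuming the same emails.xlsx from the question:
import pandas as pd

# header=None keeps the first row as data instead of turning it into column labels
emails = pd.read_excel('emails.xlsx', header=None)[0].tolist()
print(len(emails))   # every row, including the one that was previously lost
print(emails[0])     # the address that was missing before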

Spacy - Convert Token type into list

I have a few elements, obtained after performing an operation in spaCy, with the following type.
Input -
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output:
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
I want all the elements in the list to be of str type for iteration.
Expected output -
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output
<class 'str'>
<class 'str'>
<class 'str'>
Please suggest an optimized way.
A spaCy Token has an attribute called text.
Here's a complete example:
import spacy
nlp = spacy.load('en_core_web_sm')
t = u"India Australia Brazil"
li = nlp(t)
for i in li:
    print(i.text)
or if you want the list of tokens as list of strings:
list_of_strings = [i.text for i in li]
Thanks for the solution and for sharing your knowledge. It works very well to convert a spacy doc/span to a string or list of strings to further use them in string operations.
You can also use this:
for i in li:
    print(str(i))
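For what it's worth, str(token) and token.text should give the same string, so either spelling works for building a plain list of strings (a minimal sketch, assuming the en_core_web_sm model is installed):
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"India Australia Brazil")

as_text = [t.text for t in doc]   # via the .text attribute
as_str = [str(t) for t in doc]    # via str()
print(as_text == as_str)          # True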

AttributeError on split when trying to apply the split method in flatMap after converting a DF to an RDD

I am using the code snippet below to read a sample file using the Spark context:
>>> textFile = sc.textFile("hdfs:///user/hive/warehouse/sample.txt")
>>> textFile.flatMap(lambda word:word.split(" ")).collect()
Assume this gives output something like the following:
[u'hi', u'there,', u'I', u'am', u'working', u'on', u'something', u'random.']
Now I am using the code snippet below to read the same sample file via a DataFrame, then trying to convert it to an RDD and apply flatMap as before:
>>> df = spark.read.text("hdfs:///user/hive/warehouse/sample.txt")
>>> df.rdd.flatMap(lambda word:word.split(" ")).collect()
This fails with an AttributeError on split.
I went on to check the datatype of df.rdd and textFile using the code snippet below:
>>> type(df.rdd)
<class 'pyspark.rdd.RDD'>
>>> type(textFile)
<class 'pyspark.rdd.RDD'>
Both are identical.
Now when I check the type of the individual elements of these RDDs using the code snippet below, I observe a difference.
>>> textFile.map(lambda x:type(x)).collect()
[<type 'unicode'>]
>>> df.rdd.map(lambda x:type(x)).collect()
[<class 'pyspark.sql.types.Row'>]
Why is there a discrepancy?
You should convert each Row to a list after you convert the DataFrame to an RDD:
>>> textFile = sc.textFile("hdfs://localhost:8020/test/ali/sample.txt")
>>> textFile.flatMap(lambda word:word.split(" ")).collect()
['hi', 'there,', 'I', 'am', 'working', 'on', 'something', 'random.']
>>>
>>> df = spark.read.text("hdfs://localhost:8020/test/ali/sample.txt")
>>> df.rdd.flatMap(lambda x: list(x)).flatMap(lambda word:word.split(" ")).collect()
['hi', 'there,', 'I', 'am', 'working', 'on', 'something', 'random.']
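Another option (a sketch, not part of the answer above) is to unpack the Row explicitly. spark.read.text puts each line of the file in a single column named value, so you can pull that string out of the Row before splitting it:
# each element of df.rdd is a Row(value=u'...'); extract the string, then split it
df = spark.read.text("hdfs:///user/hive/warehouse/sample.txt")
words = df.rdd.flatMap(lambda row: row.value.split(" ")).collect()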

Why does the index name always appear in a parquet file created with pandas?

I am trying to create a parquet file from a pandas DataFrame, and even though I delete the index name, it still appears when I re-read the parquet file. Can anyone help me with this? I want index.name to be set to None.
>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df
key
0 1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1
>>> del df.index.name
>>> df
key
0 1
>>> df.to_parquet('test.parquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1
It works as expected using pyarrow (with fastparquet the index name keeps reappearing):
>>> df = pd.DataFrame({'key': 1}, index=[0])
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> del df.index.name
>>> df
key
0 1
>>> df.to_parquet('test.parquet', engine='fastparquet')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
index
0 1 ---> INDEX NAME APPEARS EVEN AFTER DELETING USING fastparquet
>>> del df.index.name
>>> df.to_parquet('test.parquet', engine='pyarrow')
>>> df = pd.read_parquet('test.parquet')
>>> df
key
0 1 --> INDEX NAME IS NONE WHEN CONVERSION IS DONE WITH pyarrow
This works with pyarrow with the following:
df = pd.DataFrame({'key': 1}, index=[0])
df.to_parquet('test.parquet', engine='pyarrow', index=False)
df = pd.read_parquet('test.parquet', engine='pyarrow')
df.head()
As #alexopoulos7 mentioned, the to_parquet documentation states that you can use the "index" argument as a parameter. It seems to work, perhaps because I'm explicitly stating engine='pyarrow'.
I have been playing with both libraries, pyarrow and fastparquet, trying to write a parquet file without preserving the index, since I need the data to be read from Redshift as an external table.
For me, what worked was the fastparquet library:
df.to_parquet(destination_file, engine='fastparquet', compression='gzip', write_index=False)
If you follow the official to_parquet documentation you will see that it mentions the "index" parameter, but this throws an error if the argument does not exist in the engine being used. Currently, I have found that only fastparquet has such an option, and it is named "write_index".
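As a side note, del df.index.name can be replaced with the more conventional df.index.name = None. A minimal sketch of the pyarrow round trip without the index, assuming a recent pandas where to_parquet accepts the index argument:
import pandas as pd

df = pd.DataFrame({'key': 1}, index=[0])
df.to_parquet('test.parquet', engine='pyarrow', index=False)  # drop the index on write

df2 = pd.read_parquet('test.parquet', engine='pyarrow')
print(df2.index.name)   # None, so no 'index' label comes back
df2.index.name = None   # also works for clearing a stray name after reading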

Can't seem to use pandas to_csv and read_csv to properly read a numpy array

The problem seems to stem from reading the csv back in with read_csv: I get a type issue when I try to perform operations on the ndarray. The following is a minimal working example.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
np.array(df['numpy']).mean()
Out[151]: array([ 0.83151197, 0.00444986])
Which is what I would expect. However, if I write the result to a file and then read the data back into a pandas DataFrame, the types are broken.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
df.to_csv('C:/temp/test5.csv')
df5 = pd.read_csv('C:/temp/test5.csv', dtype={'numpy': object})
np.array(df5['numpy']).mean()
TypeError: unsupported operand type(s) for /: 'str' and 'long'
The following is the output of the df5 object:
df5
Out[186]:
Unnamed: 0 numpy
0 0 [0.83151197 0.00444986]
These are the file contents:
,numpy
0,[ 0.83151197 0.00444986]
The only way I have figured out how to get this to work is to read the data and manually convert the type, which seems silly and slow.
[float(num) for num in df5['numpy'][0][1:-1].split()]
Is there any way to avoid the above?
pd.DataFrame({'col_name': data}) expects a 1D array-like object as data:
In [63]: pd.DataFrame({'numpy': [0.83151197,0.00444986]})
Out[63]:
numpy
0 0.831512
1 0.004450
In [64]: pd.DataFrame({'numpy': np.array([0.83151197,0.00444986])})
Out[64]:
numpy
0 0.831512
1 0.004450
You've wrapped the numpy array with [], so you passed a list of numpy arrays:
In [65]: pd.DataFrame({'numpy': [np.array([0.83151197,0.00444986])]})
Out[65]:
numpy
0 [0.83151197, 0.00444986]
Replace df = pd.DataFrame({'numpy': [x]}) with df = pd.DataFrame({'numpy': x})
Demo:
In [56]: x = np.array([0.83151197,0.00444986])
...: df = pd.DataFrame({'numpy': x})
# ^ ^
...: df.to_csv('d:/temp/test5.csv', index=False)
...:
In [57]: df5 = pd.read_csv('d:/temp/test5.csv')
In [58]: df5
Out[58]:
numpy
0 0.831512
1 0.004450
In [59]: df5.dtypes
Out[59]:
numpy float64
dtype: object
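If you really do need a whole array inside a single cell and want it to survive the CSV round trip, one possible workaround (a sketch, not part of the answer above) is to serialize each array as JSON on write and parse it back with a converter on read:
import json
import numpy as np
import pandas as pd

x = np.array([0.83151197, 0.00444986])

# store the array as a JSON string in a single cell
df = pd.DataFrame({'numpy': [json.dumps(x.tolist())]})
df.to_csv('test5.csv', index=False)

# parse the JSON back into an ndarray while reading
df5 = pd.read_csv('test5.csv', converters={'numpy': lambda s: np.array(json.loads(s))})
print(df5['numpy'][0].mean())  # ~0.41798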
