Here is my code:
#import modules
import pandas as pd
# read the data file into a DataFrame (a raw string avoids backslash-escape issues)
infile = pd.read_csv(r'..\Infiles\StarWars_Data.txt')
# get user input for selecting a Series name
Series_name = input("Please enter the name of one of the series for closer inspection: ")
#Select the Series
Series_Data = infile[Series_name]
# Chain the value_counts(), describe() and to_csv() methods
Series_Data.value_counts()\
    .describe()\
    .to_csv(r'..\Outfiles\StarWars_Results.txt')
I expect it to perform value_counts() (which returns the counts of the unique values in a Series), describe() (which gives summary statistics on a Series), and to_csv() (which writes the result to a specified CSV file).
For some reason to_csv() is writing the output of describe() but not of value_counts(). How do I write the data from both value_counts() and describe() to the same file?
IIUC: in your chain, describe() runs on the result of value_counts(), so only the describe() output ever reaches to_csv(). Do you want this?
pd.concat([df['words'].value_counts(),
           df['words'].describe()])\
  .to_csv(r'..\Outfiles\StarWars_Results.txt')
MCVE:
s = pd.Series([*'AABBBBCDDEEEEEEE'])
pd.concat([s.value_counts(), s.describe()]).rename_axis('key').to_csv('a.txt')
!type a.txt
Output:
key,0
E,7
B,4
A,2
D,2
C,1
count,16
unique,5
top,E
freq,7
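Applied back to the variables in your question (a sketch, assuming the same paths and column selection as above):
import pandas as pd

infile = pd.read_csv(r'..\Infiles\StarWars_Data.txt')
Series_name = input("Please enter the name of one of the series for closer inspection: ")
Series_Data = infile[Series_name]

# Concatenate both results, then write them out in a single call.
pd.concat([Series_Data.value_counts(), Series_Data.describe()])\
    .rename_axis('key')\
    .to_csv(r'..\Outfiles\StarWars_Results.txt')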
Question: how do I select rows (pseudocode) where columns['Name'] == 'Name_A' ('Name_A' is just an example) and columns['Time'] is in (2021-11-21 00:00:00, 2021-11-22 00:00:00)?
I have stored about 4 billion rows of data in an HDF5 file.
Now I want to select some of that data.
My code looks like this:
import pandas as pd
ss = pd.HDFStore("xh_data_L9.hdf5") #<class 'pandas.io.pytables.HDFStore'>
print(type(ss))
print(ss.keys())
s_1 = ss.select('alldata',start=0,stop=500) # data example
print(s_1)
ss.close()
I found that HDFStore.select has this signature:
HDFStore.select(key, where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, auto_close=False)
# Neither of these attempts runs successfully:
s_3 = ss.select('alldata', where="Time>2021-11-21 00:00:00 & Time<2021-11-22 00:00:00)")
s_3 = ss.select('alldata', ['Name'] == 'Name_A')
I have googled some methods, but I don't know how to use "where".
I found the cause: it depends on whether data_columns was set when the file was created.
# An HDF5 file created this way does not have data_columns:
ss.append('store', df_temp, index=True)
# An HDF5 file created this way does have data_columns:
store.append("store", df_temp, format="table", data_columns=True)
# Check whether the file includes data_columns:
import pandas as pd
ss = pd.HDFStore("store.hdf5")
print(ss.info())
If the result includes "dc->[Time,Name,Value]", the columns can be queried:
ss.select("store", where="Name='Name_A'")
# Single quotation marks are required around string values.
The following is the official documentation for data_columns:
data_columns : list of columns, or True, default None
    List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed.
See here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#query-via-data-columns
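Putting it together for the original question, here is a minimal sketch; the df_temp below is hypothetical sample data standing in for the real 4-billion-row frame, and the key 'alldata' matches the question:
import pandas as pd

# Hypothetical sample data with the Name/Time/Value columns from above.
df_temp = pd.DataFrame({
    "Name": ["Name_A", "Name_B"],
    "Time": pd.to_datetime(["2021-11-21 12:00:00", "2021-11-23 08:00:00"]),
    "Value": [1.0, 2.0],
})

# Write with data_columns=True so Name and Time are queryable on disk.
with pd.HDFStore("xh_data_L9.hdf5") as store:
    store.append("alldata", df_temp, format="table", data_columns=True)

# Select by name and time range; quote string and timestamp literals.
with pd.HDFStore("xh_data_L9.hdf5") as ss:
    s_3 = ss.select(
        "alldata",
        where="Name == 'Name_A' & Time >= '2021-11-21 00:00:00' & Time < '2021-11-22 00:00:00'",
    )
    print(s_3)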
I am a bit new to Dask. I have a large CSV file and a large list; the CSV has as many rows as the list has elements. I am trying to create a new column in the Dask dataframe from the list. In pandas this is pretty straightforward, but in Dask I am having a hard time creating the new column. I am avoiding pandas because my data is 15GB+.
Please see my attempts below.
CSV data:
name,text,address
john,some text here,MD
tim,some text here too,WA
Code tried
import dask.dataframe as dd
import numpy as np
ls = ['one','two']
ddf = dd.read_csv('../data/test.csv')
ddf.head()
Try #1:
ddf['new'] = ls # TypeError: Column assignment doesn't support type list
Try #2: What should be passed here for condlist?
ddf['new'] = np.select(choicelist=ls) # TypeError: _select_dispatcher() missing 1 required positional argument: 'condlist'
Looking for this output:
name text address new
0 john some text here MD one
1 tim some text here too WA two
Try creating a dask array and then assigning it, like this:
# Original suggestion:
# ls = dd.from_array(np.array(['one', 'two']))
# ddf['new'] = ls

# As tested by the OP:
import dask.array as da
ls = da.array(['one', 'two'])
ddf['new'] = ls
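A minimal end-to-end sketch, assuming the two-row test.csv from the question; the array's length must match the dataframe's total row count, and its chunks must line up with the dataframe's partitions:
import dask.dataframe as dd
import dask.array as da

ddf = dd.read_csv('../data/test.csv')

# One label per CSV row, as in the desired output.
ls = da.array(['one', 'two'])
ddf['new'] = ls

print(ddf.compute())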
I have an Excel sheet with ISIN and URL as column headers, as shown below:
ISIN     URL
ISIN1    https://mylink3.pdf
ISIN2    https://mylink2.pdf
I need to create a dictionary of the values of this sheet and have used the below code:
import pandas as pd
my_dic = pd.read_excel('PDFDwn.xlsx', index_col=0).to_dict()
print(my_dic)
The output that I receive is as below.
{'URL': {'ISIN1': 'https://mylink3.pdf', 'ISIN2': 'https://mylink2.pdf'}}
whereas the expected output should be as below, without the 'URL' layer:
{'ISIN1': 'https://mylink3.pdf', 'ISIN2': 'https://mylink2.pdf'}
Try this:
print(df.set_index('ISIN')['URL'].to_dict())
Output:
{'ISIN2': 'https://mylink2.pdf', 'ISIN1': 'https://mylink3.pdf'}
Applied to your example (reading without index_col=0, so that 'ISIN' remains a regular column for set_index to use):
my_dic = pd.read_excel('PDFDwn.xlsx').set_index('ISIN')['URL'].to_dict()
Simple solution: your current result simply nests everything under the 'URL' key, so take that entry:
my_dic['URL']
Output:
{'ISIN1': 'https://mylink3.pdf', 'ISIN2': 'https://mylink2.pdf'}
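A compact end-to-end version of that simple variant, assuming the same PDFDwn.xlsx layout as in the question:
import pandas as pd

nested = pd.read_excel('PDFDwn.xlsx', index_col=0).to_dict()
my_dic = nested['URL']  # unwrap the column-level key
print(my_dic)  # {'ISIN1': 'https://mylink3.pdf', 'ISIN2': 'https://mylink2.pdf'}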
Here is my problem:
I have a csv file containing articles data set with columns: ID, CATEGORY, TITLE, BODY.
In python, I read the file to a pandas data frame like this:
import pandas as pd
df = pd.read_csv('my_file.csv')
Now I need to transform this df somehow into a corpus object, let's call it my_corpus. But how exactly can I do it? I assume I need to use something like:
from nltk.corpus.reader import CategorizedCorpusReader
my_corpus = some_nltk_function(df) # <- what is the function?
At the end I can use NLTK methods to analyze the corpus. For example:
import nltk
my_corpus.fileids() # <- I expect values from column ID
my_corpus.categories() # <- I expect values from column CATEGORY
my_corpus.words(categories='cat_A') # <- I expect values from column TITLE and BODY
my_corpus.sents(categories=['cat_A', 'cat_B', 'cat_C']) # <- I expect values from column TITLE and BODY
Please advise.
I guess you need to do two things.
First, you need to convert each row of your dataframe df into a corpus file. The following function should do it for you:
def CreateCorpusFromDataFrame(corpusfolder, df):
    for index, r in df.iterrows():
        id = r['ID']
        title = r['TITLE']
        body = r['BODY']
        category = r['CATEGORY']
        fname = str(category) + '_' + str(id) + '.txt'
        # One file per row, named category_id.txt so the category
        # can be recovered from the file name later.
        with open(corpusfolder + '/' + fname, 'a') as corpusfile:
            corpusfile.write(str(body) + " " + str(title))

CreateCorpusFromDataFrame('yourcorpusfolder', df)
Second, you need to read the files from yourcorpusfolder and then do the NLTK processing you require:
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
my_corpus = CategorizedPlaintextCorpusReader('yourcorpusfolder/',
                                             r'.*', cat_pattern=r'(.*)_.*')
my_corpus.fileids() # <- I expect values from column ID
my_corpus.categories() # <- I expect values from column CATEGORY
my_corpus.words(categories='cat_A') # <- I expect values from column TITLE and BODY
my_corpus.sents(categories=['cat_A', 'cat_B']) # <- I expect values from column TITLE and BODY
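One assumption worth noting: sents() relies on NLTK's default Punkt sentence tokenizer, so you may need a one-time download before calling it:
import nltk
nltk.download('punkt')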
Some helpful references :
https://groups.google.com/forum/#!topic/nltk-users/YFCKjHbpUkY
Need to set categorized corpus reader in NLTK and Python, corpus texts in one file, one text per line
I am a Python beginner. I am trying to write multiple lists into separate columns in a CSV file.
In my CSV file, I would like to have:
2.9732676520000001 0.0015852047556142669 1854.1560636319559
4.0732676520000002 0.61902245706737125 2540.1258143280334
4.4032676520000003 1.0 2745.9167395368572
Following is the code that I wrote.
df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]

with open('output/' + file, 'w') as f:
    for dt, int_norm, CSs in zip(df, int_peak, CS):
        f.write('{0:f},{1:f},{2:f}\n'.format(dt, int_norm, CSs))
This isn't running properly; I'm getting the error message "non-empty format string passed to object.__format__". I'm having a hard time catching what is going wrong. Could anyone spot what's wrong with my code?
You are better off using pandas:
import pandas as pd

df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]
file_name = "your_file_name.csv"

# pandas can convert a list of lists to a DataFrame;
# each inner list becomes a row, so the result is transposed
# to get the desired column layout.
df = pd.DataFrame([df, int_peak, CS])
df = df.transpose()

# Write the data to "output/" + file_name without the DataFrame's
# index and without a header; fixed-point float formatting matches
# the desired output.
df.to_csv("output/" + file_name, index=False, header=False, float_format='%f')
The output CSV looks like this:
2.973268,0.001585,1854.156064
4.073268,0.619022,2540.125814
4.403268,1.000000,2745.916740
However, to fix your own code, you need to use a file name variable other than file. I changed that in your code as follows:
df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]
file_name = "your_file_name.csv"

with open('/tmp/' + file_name, 'w') as f:
    for dt, int_norm, CSs in zip(df, int_peak, CS):
        f.write('{0:f},{1:f},{2:f}\n'.format(dt, int_norm, CSs))
and it works. The output is as follows:
2.973268,0.001585,1854.156064
4.073268,0.619022,2540.125814
4.403268,1.000000,2745.916740
If you need to write only a few selected columns to the CSV, use the columns parameter (here 'Name' and 'ID' stand in for your own column labels):
csv_data = df.to_csv(columns=['Name', 'ID'])
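For instance, with the transposed DataFrame built above, whose columns are labelled 0, 1 and 2, a sketch writing only the first and third columns would be:
# Keep only the first and third columns in the output file.
df.to_csv("output/" + file_name, columns=[0, 2], index=False, header=False)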