Python Pandas: Split Column & Unpivot with Varying Results - python-3.x

I am running NLTK on a dataset and am looking to clean the end result by creating a single column containing the single words. A desired before/after view is shown below.
I figure that this would be a combination of splitting Tokenized and melting the dataframe. What is confusing me is how to handle differing counts of words for each comment. Any thoughts on what would solve this?

You can do this with DataFrame.explode, which repeats each row once per list element and so handles the differing word counts per comment:
df1 = df.explode("Tokenized")
df1 = df1.rename(columns={"Tokenized": "Single Word"})
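As a minimal sketch of the before/after (the Comment column and sample words are assumptions, since the original view isn't reproduced here):
import pandas as pd

# Hypothetical input: one row per comment, Tokenized holds the NLTK word list.
df = pd.DataFrame({
    "Comment": ["great product", "would buy again"],
    "Tokenized": [["great", "product"], ["would", "buy", "again"]],
})

# explode repeats each Comment once per word, regardless of list length.
df1 = df.explode("Tokenized").rename(columns={"Tokenized": "Single Word"})
print(df1)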

Related

How to drop columns from a pandas DataFrame that have elements containing a string?

This is not about dropping columns whose name contains a string.
I have a dataframe with 1600 columns. Several hundred are garbage. Most of the garbage columns contain a phrase such as invalid value encountered in double_scalars (XYZ), where XYZ is a placeholder for the column name.
I would like to delete all columns that contain, in any of their elements, the string invalid.
Purging columns containing strings in general would work too. What I want is to clean the dataframe up so I can fit a machine learning model to it, so removing any/all columns that are not boolean or real would work.
This must be a duplicate question, but I can only find answers to how to remove a column with a specific column name.
You can use df.select_dtypes(include=[float, bool]) or df.select_dtypes(exclude=['object']).
Link to docs https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html
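For instance, a quick sketch of the dtype-based filter on toy data (the column names here are made up):
import pandas as pd

df = pd.DataFrame({
    "score": [0.1, 0.2],
    "flag": [True, False],
    "junk": ["invalid value encountered in double_scalars (junk)", "x"],
})

# Object (string) columns are dropped; numeric and boolean columns survive.
clean = df.select_dtypes(exclude=['object'])
print(clean.columns.tolist())   # ['score', 'flag']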
Use apply to build a mask that checks whether any element of each column contains invalid, and then pass that mask to the second position of .loc:
df = df.loc[:, ~df.apply(lambda col: col.astype(str).str.contains('invalid')).any()]
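A similar sketch for the masking approach (again with made-up columns):
import pandas as pd

df = pd.DataFrame({
    "score": [0.1, 0.2],
    "junk": ["invalid value encountered in double_scalars (junk)", "x"],
})

# Per-column boolean: True if any element contains 'invalid'.
mask = df.apply(lambda col: col.astype(str).str.contains('invalid')).any()
df = df.loc[:, ~mask]          # keep only the columns without the marker string
print(df.columns.tolist())     # ['score']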

find incorrect word from dataframes in spark scala

I am learning Scala/Spark. For practice, I am trying to find incorrect words in a dataframe. Below is the input dataframe, which has no header and is pipe ('|') delimited.
1|I am hair for spark practss
2|I thank I am doing vary badd
3|But stay tuned whoo knooow
4|That's foor now
I have another dataframe listing the incorrect and correct words (the first column is the incorrect word and the second is the correct word). Below is a sample, also with no header and pipe delimited. How can I find the list of incorrect words per line of the input dataframe?
hair|here
thank|think
whoo|who
knooow|know
foor|for
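No Scala answer was given here, but since the rest of this page's examples are Python, here is a hedged PySpark sketch of one approach (file paths and column names are assumptions); the same split/explode/join logic carries over to the Scala API:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed file names; both files are headerless and pipe delimited.
sentences = spark.read.csv("sentences.txt", sep="|").toDF("id", "text")
corrections = spark.read.csv("corrections.txt", sep="|").toDF("bad", "good")

# One row per (line, word), then keep only the words that appear in the
# incorrect-word column of the corrections table.
words = sentences.withColumn("word", F.explode(F.split(F.col("text"), " ")))
incorrect = (words.join(corrections, words.word == corrections.bad)
                  .groupBy("id")
                  .agg(F.collect_list("bad").alias("incorrect_words")))
incorrect.show(truncate=False)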

How do I get additional column name information in a pandas group by / nlargest calculation?

I am comparing pairs of strings using six fuzzywuzzy ratios, and I need to output the top three scores for each pair.
This line does the job:
final2_df = final_df[['nameHiringOrganization', 'mesure', 'name', 'valeur']].groupby(['nameHiringOrganization', 'name'])['valeur'].nlargest(3)
However, the Excel output table lacks the 'mesure' column, which contains the ratio's name. This is annoying, because then I'm not able to identify which of the six ratios works best for any given pair.
I thought selecting the columns at the beginning might work (final_df[['columns', ...]]), but it doesn't seem to.
Any thought on how I might add that info?
Many thanks in advance!
I think another possible solution is to sort by the three columns with DataFrame.sort_values and then use GroupBy.head:
final2_df = (final_df.sort_values(['nameHiringOrganization', 'name', 'valeur'],
ascending=[True, True, False])
.groupby(['nameHiringOrganization', 'name'])
.head(3))
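A small sketch with invented data, showing that head(3) returns whole rows, so the 'mesure' column survives (unlike the nlargest result, which is a bare Series of 'valeur'):
import pandas as pd

# Column names follow the question; the values are made up.
final_df = pd.DataFrame({
    "nameHiringOrganization": ["acme"] * 4,
    "name": ["job1"] * 4,
    "mesure": ["ratio", "partial_ratio", "token_sort_ratio", "token_set_ratio"],
    "valeur": [90, 85, 95, 70],
})

final2_df = (final_df.sort_values(['nameHiringOrganization', 'name', 'valeur'],
                                  ascending=[True, True, False])
                     .groupby(['nameHiringOrganization', 'name'])
                     .head(3))
print(final2_df)   # top three scores per pair, 'mesure' intact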

how to search a text file in python 3

I have a text file that has lists in it. How would I search for an individual list? I have tried using loops to find it, but every time I get an error, since I don't know exactly what to search for.
I tried using an if statement to find it, but it returns -1.
thanks for the help
I was doing research on this last night. You can use pandas for this. See here: Load data from txt with pandas. One of the answers talks about lists in text files.
You can use:
data = pd.read_csv('output_list.txt', sep=" ", header=None)
data.columns = ["Name", "b", "c", "etc."]
Add sep=" " to your code, with a blank space between the quotes, so pandas can detect the spaces between values and sort them into columns. data.columns is for naming your columns.
With a JSON or XML format, text files become more searchable. In my research I've decided to go with an XML approach. Here is a link to a blog that explains how to use Python with XML: http://www.austintaylor.io/lxml/python/pandas/xml/dataframe/2016/07/08/convert-xml-to-pandas-dataframe.
If you want to search the data frame try:
import pandas as pd

txt_file = r'C:\path\to\your\txtfile.txt'  # raw string so the backslashes are not treated as escapes
df = pd.read_table(txt_file, sep=",")
row = df.loc[df['Name'] == 'bob']  # select the row(s) where the Name column equals 'bob'
print(row)
Depending on how your text file is formatted, this will not work for every file. A pandas dataframe effectively gives the data a CSV-like structure, which makes the process repeatable and the results testable. Again, I recommend moving to a JSON or XML format before bringing pandas dataframes into your solution; you can then produce consistent, testable results.

Pandas read_table with duplicate names

When reading a table while specifying duplicate column names - let's say two different names - pandas 0.16.1 will copy the last two columns of the data over and over again.
In [1]:
df = pd.read_table('Datasets/tbl.csv', header=0, names=['one','two','one','two','one'])
df
tbl.csv contains a table with 5 different columns. The last two are repeated instead of all five being shown.
Out[1]:
one two one two one
0 0.132846 0.120522 0.132846 0.120522 0.132846
1 -0.059710 -0.151850 -0.059710 -0.151850 -0.059710
2 0.003686 0.011072 0.003686 0.011072 0.003686
3 -0.220749 -0.029358 -0.220749 -0.029358 -0.220749
The actual table has different values in every column. Here, the same two columns (corresponding to the last two in the file) are repeated. No error or warning is given.
Do you think this is a bug or is it intended? I find it very dangerous to silently change an input like that. Or is it my ignorance?
Using duplicate values in indexes is inherently problematic. They lead to ambiguity: code that you think works fine can suddenly fail on DataFrames with non-unique indexes. argmax, for instance, can lead to a similar pitfall when DataFrames have duplicates in the index.
It's best to avoid putting duplicate values in (row or column) indexes if you can. If you need a non-unique index, use it with care, and double-check the effect duplicate values have on the behavior of your code.
In this case, you could use
df = pd.read_csv('Datasets/tbl.csv', header=0)  # read the file's own header row as names first
df.columns = ['one','two','one','two','one']    # then overwrite with the duplicate names
instead.
