I'm looking for a way to clean the following data:
I would like to output something like this:
with the tokenized words in the first column and their associated labels in the second.
Is there a particular strategy with Pandas and NLTK to obtain this type of output in one go?
Thank you in advance for your help or advice.
Given the first table, it's simply a matter of splitting the first column and repeating the second column:
import pandas as pd
data = [['foo bar', 'O'], ['George B', 'PERSON'], ['President', 'TITLE']]
df1 = pd.DataFrame(data, columns=['col1', 'col2'])
print(df1)
# Build one Series per row: index = the split tokens, values = the repeated label
df2 = pd.concat([pd.Series(row['col2'], row['col1'].split(' '))
                 for _, row in df1.iterrows()]).reset_index()
df2 = df2.rename(columns={'index': 'col1', 0: 'col2'})
print(df2)
The output:
col1 col2
0 foo bar O
1 George B PERSON
2 President TITLE
col1 col2
0 foo O
1 bar O
2 George PERSON
3 B PERSON
4 President TITLE
As for splitting the first column, you want to look at the split method, which supports regular expressions and should allow you to handle the various language delimiters:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
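For instance, here is a minimal sketch that pairs str.split (a multi-character pattern is treated as a regex) with DataFrame.explode (pandas 0.25+) and avoids the iterrows loop entirely:

# Split col1 into token lists, then explode to one token per row;
# explode repeats col2 automatically for every token
df2 = (df1.assign(col1=df1['col1'].str.split(r'\s+'))
          .explode('col1')
          .reset_index(drop=True))
print(df2)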
If the first table is not given, there is no way to do this in one go with pandas, since pandas has no built-in NLP capabilities.
I have a sample dataframe as given below.
import numpy as np
import pandas as pd

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance', 'Music'], ['Dance', 'Sports'], ['Hiking', 'Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each column holds its values wrapped in lists. I want to remove those lists while preserving the datatypes of the items inside, for all columns.
The final output should look something like the one shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex (passing a list of columns to explode requires pandas 1.3+), then apply ', '.join to Interest:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
Both options return:
df
ID Age Sex Interest
0 A 20 Male Dance, Music
1 B 21 Male Dance, Sports
2 C 19 Female Hiking, Surfing
EDIT
Option 3. If you have many columns which contain lists with possible missing values such as np.nan, you can get the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If all lists in column c contain a single item:
    if (df[c].str.len() == 1).all():
        df[c] = df[c].str[0]
    else:
        df[c] = df[c].apply(', '.join)
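One caveat with the sample data above: ', '.join raises a TypeError on np.nan, so a column mixing lists and missing values would break the loop. Here is a minimal NaN-safe variant of the same idea (a sketch that passes missing entries through rather than filling them):

for c in list_cols:
    # Ignore NaN entries when checking whether every list has exactly one item
    lengths = df[c].str.len().dropna()
    if (lengths == 1).all():
        df[c] = df[c].str[0]   # .str[0] leaves np.nan entries as NaN
    else:
        # Join only real lists; pass np.nan (a float) through unchanged
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)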
For this dataframe: how can I trim all leading and trailing spaces from every column in a loop?
df = spark.createDataFrame(
    [
        (' a', '10 ', ' b '),  # create your data here, be consistent in the types.
    ],
    ['col1', 'col2', 'col3']  # add your column labels here
)
df.show(5)
I know how to do that by specifying each column, like below, but I need to do it for all columns in a loop, because in the real case I will not know the column names or how many columns there are.
from pyspark.sql.functions import trim
df = df.withColumn("col2", trim(df.col2))
df.show(5)
You can use a list comprehension to apply trim to all columns:
from pyspark.sql.functions import trim, col
df2 = df.select([trim(col(c)).alias(c) for c in df.columns])
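If you prefer the explicit loop phrasing from the question, the same transformation can be written as a fold over df.columns; this is a sketch, functionally equivalent to the select above (select builds a single projection, so it is generally the preferred form):

from functools import reduce
from pyspark.sql.functions import trim, col

# Fold trim over every column: each step replaces one column with its trimmed version
df2 = reduce(lambda acc, c: acc.withColumn(c, trim(col(c))), df.columns, df)
df2.show(5)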
This is my first time asking a question. I have a dataframe that looks like the one below:
import pandas as pd
data = [['AK', 'Co', 2957],
        ['AK', 'Ot', 15],
        ['AK', 'Petr', 86848],
        ['AL', 'Co', 167],
        ['AL', 'Ot', 10592],
        ['AL', 'Petr', 1667]]
my_df = pd.DataFrame(data, columns=['State', 'Energy', 'Elec'])
print(my_df)
I need to find, for each State, the rows holding the maximum and minimum values of the third column. I did browse through a few Stack Overflow questions but couldn't find the right way to solve this.
My output should look like below:
data = [['AK', 'Ot', 15],
        ['AK', 'Petr', 86848],
        ['AL', 'Co', 167],
        ['AL', 'Ot', 10592]]
my_df = pd.DataFrame(data, columns=['State', 'Energy', 'Elec'])
print(my_df)
Note: please let me know where I am falling short before giving the question a negative mark.
This link helped me: Python pandas dataframe: find max for each unique values of an another column
Try idxmin and idxmax with a .loc filter.
new_df = (
    my_df.loc[
        my_df.groupby(["State"])
             .agg(ElecMin=("Elec", "idxmin"), ElecMax=("Elec", "idxmax"))
             .stack()
    ]
    .reset_index(drop=True)  # drop the original row labels to match the output below
)
print(new_df)
State Energy Elec
0 AK Ot 15
1 AK Petr 86848
2 AL Co 167
3 AL Ot 10592
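A slightly terser equivalent, aggregating on the Elec series directly instead of using named aggregation (a sketch of the same idea):

# idxmin/idxmax per State give the row labels of the extremes
idx = my_df.groupby("State")["Elec"].agg(["idxmin", "idxmax"]).stack()
new_df = my_df.loc[idx].reset_index(drop=True)
print(new_df)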
I need to find the names of columns that contain one of these words: COMPLETE, UPDATED, or PARTIAL.
This is my code, which is not working.
import pandas as pd
df = pd.DataFrame({'col1': ['', 'COMPLETE', ''],
                   'col2': ['UPDATED', '', ''],
                   'col3': ['', 'PARTIAL', '']})
print(df)
items = ["COMPLETE", "UPDATED", "PARTIAL"]
if x in items:
    print(df.columns)
This is the desired output:
I tried to draw inspiration from this question, Get column name where value is something in pandas dataframe, but I couldn't wrap my head around it.
We can do it with isin, where, and stack:
s = df.where(df.isin(items)).stack().reset_index(level=0, drop=True).sort_index()
s
col1 COMPLETE
col2 UPDATED
col3 PARTIAL
dtype: object
Here's one way to do it.
# check each column for any matches from the items list.
matched = df.isin(items).any(axis=0)
# produce a list of column labels with a match.
matches = list(df.columns[matched])
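With the sample df above, this yields:

print(matched)
# col1    True
# col2    True
# col3    True
# dtype: bool
print(matches)
# ['col1', 'col2', 'col3']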
I have a dataframe with 1M+ rows. A sample of the dataframe is shown below:
df
ID Type File
0 123 Phone 1
1 122 Computer 2
2 126 Computer 1
I want to split this dataframe based on Type and File. If the total count of distinct Type values is 2 (Phone and Computer) and the total number of distinct File values is 2 (1, 2), then the total number of splits will be 4.
In short, total splits is as given below:
total_splits=len(set(df['Type']))*len(set(df['File']))
In this example, total_splits=4. Now, I want to split the dataframe df in 4 based on Type and File.
So the new dataframes should be:
df1 (having data of type=Phone and File=1)
df2 (having data of type=Computer and File=1)
df3 (having data of type=Phone and File=2)
df4 (having data of type=Computer and File=2)
The splitting should be done inside a loop.
I know we can split a dataframe based on one condition (shown below), but how do you split it based on two?
My Code:
import pandas as pd

data = {'ID': ['123', '122', '126'],
        'Type': ['Phone', 'Computer', 'Computer'],
        'File': [1, 2, 1]}
df = pd.DataFrame(data)
types = list(set(df['Type']))
total_splits = len(set(df['Type'])) * len(set(df['File']))
cnt = 1
for i in range(0, total_splits):
    for j in types:
        locals()["df" + str(cnt)] = df[df['Type'] == j]
        cnt += 1
The result of the above code gives 2 dataframes, df1 and df2. df1 will have data of Type='Phone' and df2 will have data of Type='Computer'.
But this is just half of what I want to do. Is there a way we can make 4 dataframes here based on 2 conditions ?
Note: I know I can first split on 'Type' and then split the resulting dataframe based on 'File' to get the output. However, I want to know of a more efficient way of performing the split instead of having to create multiple dataframes to get the job done.
EDIT
This is not a duplicate question as I want to split the dataframe based on multiple column values, not just one!
You can do this with groupby:
dfs = {}
for k, d in df.groupby(['Type', 'File']):
    type_, file_ = k   # e.g. ('Phone', 1); trailing underscores avoid shadowing builtins
    # do whatever you want here
    # d is the sub-dataframe corresponding to (type_, file_)
    dfs[k] = d
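You can then pull any split out of the dict by its (Type, File) key, for example:

# the Type == 'Phone' and File == 1 subset
print(dfs[('Phone', 1)])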
You can also create a mask:
df['mask'] = df['File'].eq(1) * 2 + df['Type'].eq('Phone')
Then, for example:
df[df['mask'].eq(3)]
gives you the first dataframe you want, i.e. Type == 'Phone' and File == 1 (File.eq(1) contributes 2 and Type.eq('Phone') contributes 1, so that subset has mask value 3), and so on for mask values 2, 1, and 0.
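If you want all four subsets at once, here is a sketch collecting them into a dict keyed by mask value:

# mask values: 3 = Phone & File 1, 2 = Computer & File 1,
#              1 = Phone & File 2, 0 = Computer & File 2
subsets = {m: df[df['mask'].eq(m)] for m in range(4)}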