Issue with parsing text file in pandas - python-3.x

I have the following text file which I would like to load in python:
cabin embarked boat body
0 B5 S 2 NaN
1 C22 C26 S 11 NaN
2 C22 C26 S NaN NaN
3 C22 C26 S NaN 135.0
4 C22 C26 S NaN NaN
5 E12 S 3 NaN
6 D7 S 10 NaN
7 A36 S NaN NaN
8 C101 S D NaN
Based on the response to a similar question that I received, I tried the following:
df = pd.read_fwf("test.csv", header=0, index_col=0)
and it worked fine.
But the following doesn't work:
pd.read_csv("test.csv", sep=r"\s{2,}", header=0, index_col=0, engine="python")
I get the following error:
ValueError: Expected 4 fields in line 2, saw 5
Given the fact that sep="\s{2,}" considers the fields to be separated by 2 or more whitespaces,
line 2 (0 B5 S 2 NaN),
should have been parsed without any problem. Also, I see only 4 fields in line 2 (excluding the row index which is taken care of by index_col=0); which is the 5th field that the error is referring to?

cabin embarked is only one space apart and gets parsed as a single string.
pd.read_csv is given some latitude and figures that there is an empty space for an index.
cabin embarked boat body
# ^ ^ ^ ^
# field 1 field 2 field 3 field 4
# this row establishes expectations
0 B5 S 2 NaN
^ ^ ^ ^ ^
field 1 field 2 field 3 fd 4 field 5
And that's the error. Row 1 set the expectation of 4 fields, and row 2 shows 5.
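The field counts described above can be reproduced roughly with re.split. This is only an approximation of the splitting step, not pandas' actual parser, and the spacing in the two sample strings is an assumption, since the original column alignment is not preserved here.

```python
import re

# Assumed spacing: "cabin" and "embarked" are one space apart,
# every other gap is two or more spaces wide.
header = "     cabin embarked  boat  body"
row = "0       B5        S     2   NaN"

# Same separator pattern as in the question: two or more whitespace characters.
pattern = re.compile(r"\s{2,}")

header_fields = pattern.split(header)
row_fields = pattern.split(row)

# The header yields an empty leading field plus three names,
# because "cabin embarked" is never split on its single space.
print(len(header_fields), header_fields)
# The data row yields five separate fields.
print(len(row_fields), row_fields)
```

With a single space between two header names, the regex separator cannot tell them apart, which is why a fixed-width reader such as pd.read_fwf is the better fit for this file.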

Related

pyspark - assign non-null columns to new columns

I have a dataframe with the following schema in pyspark:
   user_id  datadate  page_1.A  page_1.B  page_1.C  page_2.A  page_2.B  page_2.C  page_3.A  page_3.B  page_3.C
0      111  20220203       NaN       NaN       NaN       NaN       NaN       NaN       1.0       1.0       2.0
1      222  20220203         5         5         5       5.0       5.0       5.0       NaN       NaN       NaN
2      333  20220203         3         3         3       3.0       3.0       4.0       NaN       NaN       NaN
So it contains columns like user_id and datadate, plus a few columns for each page (there are 3 pages), which are the result of 2 joins. In this example, I have page_1, page_2, and page_3, and each has 3 columns: A, B, C. Additionally, for each row, the columns of a given page will either be all null or all populated, as in my example.
I don't care which page the values come from; for each row, I just want the [A, B, C] values that are not null.
example for a wanted result table:
user_id datadate A B C
0 111 20220203 1 1 2
1 222 20220203 5 5 5
2 333 20220203 3 3 3
so the logic will be something like:
df[A] = page_1.A or page_2.A or page_3.A, whichever is not null
df[B] = page_1.B or page_2.B or page_3.B, whichever is not null
df[C] = page_1.C or page_2.C or page_3.C, whichever is not null
for all of the rows..
and of course, I would like to do it in an efficient way.
Thanks a lot.
You can use the SQL function greatest, which returns the greatest value in a list of columns and skips nulls.
You can find the documentation here: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.greatest.html
from pyspark.sql import functions as F
(df.withColumn('A', F.greatest(F.col('page_1.A'), F.col('page_2.A'), F.col('page_3.A')))
 .withColumn('B', F.greatest(F.col('page_1.B'), F.col('page_2.B'), F.col('page_3.B')))
 .withColumn('C', F.greatest(F.col('page_1.C'), F.col('page_2.C'), F.col('page_3.C')))
 .select('user_id', 'datadate', 'A', 'B', 'C'))

Display row with False values in validated pandas dataframe column [duplicate]

I was validating 'Price' column in my dataframe. Sample:
ArticleId SiteId ZoneId Date Quantity Price CostPrice
53 194516 9 2 2018-11-26 11.0 40.64 27.73
164 200838 9 2 2018-11-13 5.0 99.75 87.24
373 200838 9 2 2018-11-27 1.0 99.75 87.34
pd.to_numeric(df_sales['Price'], errors='coerce').notna().value_counts()
True 17984
False 13
Name: Price, dtype: int64
I'd love to display the rows with False values so I know what's wrong with them. How do I do that?
Thank you.
You could print your rows when price isnull():
print(df_sales[df_sales['Price'].isnull()])
ArticleId SiteId ZoneId Date Quantity Price CostPrice
1 200838 9 2 2018-11-13 5 NaN 87.240
pd.to_numeric(df['Price'], errors='coerce').isna() returns a Boolean Series, which can be used as a mask to select the rows that fail the conversion.
This includes NaN values and rows containing non-numeric strings.
import pandas as pd
# test data
df = pd.DataFrame({'Price': ['40.64', '99.75', '99.75', pd.NA, 'test', '99. 0', '98 0']})
Price
0 40.64
1 99.75
2 99.75
3 <NA>
4 test
5 99. 0
6 98 0
# find the value of the rows that are causing issues
problem_rows = df[pd.to_numeric(df['Price'], errors='coerce').isna()]
# display(problem_rows)
Price
3 <NA>
4 test
5 99. 0
6 98 0
Alternative
Create an extra column and then use it to select the problem rows
df['Price_Updated'] = pd.to_numeric(df['Price'], errors='coerce')
Price Price_Updated
0 40.64 40.64
1 99.75 99.75
2 99.75 99.75
3 <NA> NaN
4 test NaN
5 99. 0 NaN
6 98 0 NaN
# find the problem rows
problem_rows = df.Price[df.Price_Updated.isna()]
Explanation
Overwriting the column with pd.to_numeric() and then checking for NaNs will not tell you why the rows had to be coerced, because the original values are lost.
# update the Price column
df.Price = pd.to_numeric(df['Price'], errors='coerce')
# check for NaN
problem_rows = df.Price[df.Price.isnull()]
# display(problem_rows)
3 NaN
4 NaN
5 NaN
6 NaN
Name: Price, dtype: float64
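Building on the point above, here is a minimal sketch that keeps the original column intact and separates values that were already missing from values that are present but not parseable. The sample data and the variable names (coerced, was_missing, malformed) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; only the column name 'Price' comes from the question.
df = pd.DataFrame({"Price": ["40.64", "99.75", None, "test", "99. 0"]})

# Convert without overwriting the original column.
coerced = pd.to_numeric(df["Price"], errors="coerce")

# Rows that were already missing before conversion.
was_missing = df["Price"].isna()

# Rows that had a value, but one that could not be parsed as a number.
malformed = coerced.isna() & ~was_missing

print(df[was_missing])
print(df[malformed])
```

Because df["Price"] is left untouched, the offending strings remain visible in the filtered output.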

How to compare PANDAS Columns in a Dataframes to find all entries appearing in different columns?

Full disclosure. I'm fairly new to Python and discovered PANDAS today.
I created a Dataframe from two csv files, one which is the results of a robot scanning barcode IDs and one which is a list of instructions for the robot to execute.
import pandas as pd
#import csv file and read the column containing plate IDs scanned by Robot
scancsvdata = pd.read_csv(r"G:\scan.csv", header=None, sep=';', skiprows=(1),usecols=[6])
#Rename Column to Plates Scanned
scancsvdata.columns = ["IDs Scanned"]
#Remove any Duplicate Plate IDs
scancsvdataunique = scancsvdata.drop_duplicates()
#import the Worklist to be executed CSV file and read the Source Column to find required Plates
worklistdataSrceID = pd.read_csv(r"G:\TestWorklist.CSV", usecols=["SrceID"])
#Rename SrceID Column to Plates Required
worklistdataSrceID.rename(columns={'SrceID':'IDs required'}, inplace=True)
#remove duplicates from Plates Required
worklistdataSrceIDunique = worklistdataSrceID.drop_duplicates()
#import the Worklist to be executed CSV file and read the Destination Column to find required Plates
worklistdataDestID = pd.read_csv(r"G:\TestWorklist.CSV", usecols=["DestID"])
#Rename DestID Column to Plates Required
worklistdataDestID.rename(columns={'DestID':'IDs required'}, inplace=True)
#remove duplicates from Plates Required
worklistdataDestIDunique = worklistdataDestID.drop_duplicates()
#Combine into one Dataframe
AllData = pd.concat ([scancsvdataunique, worklistdataSrceIDunique, worklistdataDestIDunique], sort=True)
print (AllData)
The resulting Dataframe lists IDs scanned in Column 1 and IDs required in Column 2.
IDs Scanned IDs required
0 1024800.0 NaN
1 1024838.0 NaN
2 1024839.0 NaN
3 1024841.0 NaN
4 1024844.0 NaN
5 1024798.0 NaN
6 1024858.0 NaN
7 1024812.0 NaN
8 1024797.0 NaN
9 1024843.0 NaN
10 1024840.0 NaN
11 1024842.0 NaN
12 1024755.0 NaN
13 1024809.0 NaN
14 1024810.0 NaN
15 8656.0 NaN
16 8657.0 NaN
17 8658.0 NaN
0 NaN 1024800.0
33 NaN 1024843.0
0 NaN 8656.0
7 NaN 8657.0
15 NaN 8658.0
How would I go about ensuring that all the IDs in the 'IDs required' column appear in the 'IDs Scanned' column?
Ideally the results of the comparison above would be a generic message like 'All IDs found'.
If different csv files were used and the Dataframe was as follows
IDs Scanned IDs required
0 1024800.0 NaN
1 1024838.0 NaN
2 1024839.0 NaN
3 1024841.0 NaN
4 1024844.0 NaN
5 1024798.0 NaN
6 1024858.0 NaN
7 1024812.0 NaN
8 1024797.0 NaN
9 1024843.0 NaN
10 1024840.0 NaN
11 1024842.0 NaN
12 1024755.0 NaN
13 1024809.0 NaN
14 1024810.0 NaN
15 8656.0 NaN
16 8657.0 NaN
17 8658.0 NaN
0 NaN 2024800.0
33 NaN 2024843.0
0 NaN 8656.0
7 NaN 8657.0
15 NaN 8658.0
Then the result of the comparison would be the list of the missing IDs, 2024800 and 2024843.
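To check True/False whether all the required items are in the scanned column (note that `item in series` tests the index, not the values, so compare against .values, and drop NaN first):
all(item in df["IDs Scanned"].values for item in df["IDs required"].dropna().unique())
To get a list of the unique missing items (dropping NaN first, since NaN values never compare equal and would survive the set difference):
sorted(set(df["IDs required"].dropna()) - set(df["IDs Scanned"].dropna()))
Or using pandas syntax to return a DataFrame filtered to rows where IDs required are not found in IDs Scanned:
df.loc[~df["IDs required"].isin(df["IDs Scanned"])]
To get only the missing IDs as a Series:
missing_ids = df.loc[~df['IDs required'].isin(df['IDs Scanned']), 'IDs required']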
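To produce the kind of result the question asks for (a generic message or the list of missing IDs), the set difference can be wrapped in a small helper. This is a sketch only: the function name check_required_ids and the sample frame are hypothetical, and the column names are taken from the question.

```python
import numpy as np
import pandas as pd

def check_required_ids(df, scanned_col="IDs Scanned", required_col="IDs required"):
    """Return 'All IDs found' or a sorted list of required IDs that were not scanned."""
    # Drop NaN placeholders left behind by pd.concat before comparing.
    scanned = set(df[scanned_col].dropna())
    required = set(df[required_col].dropna())
    missing = sorted(required - scanned)
    return "All IDs found" if not missing else missing

# Hypothetical sample in the same shape as the concatenated frame.
df = pd.DataFrame({
    "IDs Scanned": [1024800.0, 1024843.0, 8656.0, np.nan, np.nan],
    "IDs required": [np.nan, np.nan, np.nan, 2024800.0, 8656.0],
})

result = check_required_ids(df)
print(result)
```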

Remove "x" number of characters from a string in a pandas dataframe?

I have a pandas dataframe df looking like this:
a b
thisisastring 5
anotherstring 6
thirdstring 7
I want to remove characters from the left of the strings in column a based on the number in column b. So I tried:
df["a"] = df["a"].str[df["b"]:]
But this will result in:
a b
NaN 5
NaN 6
NaN 7
Instead of:
a b
sastring 5
rstring 6
ring 7
Any help? Thanks in advance!
Using zip with string slice
df.a=[x[y:] for x,y in zip(df.a,df.b)]
df
Out[584]:
a b
0 sastring 5
1 rstring 6
2 ring 7
You can do it with apply, to apply this row-wise:
df.apply(lambda x: x.a[x.b:],axis=1)
0 sastring
1 rstring
2 ring
dtype: object
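For reference, a self-contained version of the zip approach, using the sample data from the question. As far as I can tell, the .str accessor applies a single slice to every row, so a per-row start has to be paired with each string explicitly.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "a": ["thisisastring", "anotherstring", "thirdstring"],
    "b": [5, 6, 7],
})

# Pair each string with its own offset and slice row by row.
df["a"] = [text[n:] for text, n in zip(df["a"], df["b"])]

print(df)
```

The apply variant gives the same strings; to store them, assign the result back with df["a"] = df.apply(...).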

Replacing values in specific columns in a Pandas Dataframe, when number of columns are unknown

I am brand new to Python and Stack Exchange. I have been trying to replace invalid values (x < -3 and x > 12) with np.nan in specific columns.
I don't know how many columns I will have to deal with, so I need general code that takes this into account. I do know, however, that the first two columns are ids and names respectively. I have searched Google and Stack Exchange but haven't been able to find a solution that meets my specific objective.
My question is: how would one replace values found in the third column and onwards?
My dataframe looks like this;
Data
I tried this line:
Data[Data > 12.0] = np.nan.
this replaced the first two columns with nan
1st attempt
I tried this line:
Data[(Data.iloc[(range(2,Columns))] >=12) & (Data.iloc[(range(2,Columns))]<=-3)] = np.nan
where,
Columns = len(Data.columns)
This is clearly wrong replacing all values in rows 2 to 6 (Columns = 7).
2nd attempt
Any thoughts would be greatly appreciated.
Python 3.6.1 64bits, Qt 5.6.2, PyQt5 5.6 on Darwin
You're looking for the applymap() method.
import pandas as pd
import numpy as np
# get the columns after the second one
cols = Data.columns[2:]
# apply mask to those columns
new_df = Data[cols].applymap(lambda x: np.nan if x > 12 or x < -3 else x)
Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html
This approach assumes your columns after the second contain float or int values.
You can set values in specific columns of a dataframe by using iloc and slicing the columns that you need. Then we can set the values using where.
A short example using some random data:
df = pd.DataFrame(np.random.randint(0,10,(4,10)))
0 1 2 3 4 5 6 7 8 9
0 7 7 9 4 2 6 6 1 7 9
1 0 1 2 4 5 5 3 9 0 7
2 0 1 4 4 3 8 7 0 6 1
3 1 4 0 2 5 7 2 7 9 9
Now we select the region to update using iloc, slicing from the column at position 2 to the last column:
df.iloc[:,2:] = df.iloc[:,2:].where((df < 7) & (df > 2))
This sets every value in those columns that is not between 3 and 6 to NaN.
0 1 2 3 4 5 6 7 8 9
0 7 7 NaN 4.0 NaN 6.0 6.0 NaN NaN NaN
1 0 1 NaN 4.0 5.0 5.0 3.0 NaN NaN NaN
2 0 1 4.0 4.0 3.0 NaN NaN NaN 6.0 NaN
3 1 4 NaN NaN 5.0 NaN NaN NaN NaN NaN
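For your data, the code would be this (the comparison is limited to the sliced columns, because comparing the id and name columns with numbers can raise an error):
Data.iloc[:,2:] = Data.iloc[:,2:].where((Data.iloc[:,2:] <= 12) & (Data.iloc[:,2:] >= -3))
Operator clarification
where() keeps the values for which the condition is True and replaces the rest with NaN, so the condition above describes the valid range:
-3 <= Data <= 12, gives everything between those numbers
If we instead describe the invalid values, the & operator no longer works:
Data < -3 and Data > 12, a number cannot be both less than -3 and greater than 12 at the same time.
So we use the or operator | instead, together with mask(), which replaces the values for which the condition is True:
Data.iloc[:,2:] = Data.iloc[:,2:].mask((Data.iloc[:,2:] > 12) | (Data.iloc[:,2:] < -3))
So the data is replaced on a conditional basis:
Data < -3 or Data > 12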
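A self-contained sketch of replacing out-of-range values from the third column onward, regardless of how many columns follow. The frame and its column names (id, name, g1, g2) are invented for illustration; the bounds -3 and 12 come from the question.

```python
import pandas as pd

# Hypothetical data: two identifier columns followed by any number of numeric columns.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "g1": [7.0, -4.0, 12.0],
    "g2": [13.0, 0.0, -3.0],
})

# Everything from the third column onward, however many columns there are.
cols = data.columns[2:]
values = data[cols]

# mask() replaces entries where the condition is True with NaN.
cleaned = data.copy()
cleaned[cols] = values.mask((values < -3) | (values > 12))

print(cleaned)
```

Comparing only the numeric slice keeps the id and name columns out of the condition, so they are never touched.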

Resources