This question already has answers here:
Pandas: get second character of the string, from every row
(2 answers)
Closed 4 years ago.
I have a data frame and want to parse the 9th character into a second column. I'm missing the syntax somewhere though.
# develop the data
import pandas as pd

df = pd.DataFrame(columns=["vin"],
                  data=['LHJLC79U58B001633', 'SZC84294845693987', 'LFGTCKPA665700387', 'L8YTCKPV49Y010001',
                        'LJ4TCBPV27Y010217', 'LFGTCKPM481006270', 'LFGTCKPM581004253', 'LTBPN8J00DC003107',
                        '1A9LPEER3FC596536', '1A9LREAR5FC596814', '1A9LKEER2GC596611', '1A9L0EAH9C596099',
                        '22A000018'])
df['manufacturer'] = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'D', 'D']

def check_digit(df):
    df['check_digit'] = df['vin'][8]
    print(df['check_digit'])
For some reason, this puts the 8th row VIN in every line.
In your code, doing this:
df['check_digit'] = df['vin'][8]
is only selecting the single element at index 8 in the column 'vin'. Try this instead:
for i in range(len(df['vin'])):
    df.loc[i, 'check_digit'] = df['vin'][i][8]
As a rule of thumb, whenever you are stuck, simply check the type of the variable returned. It solves a lot of small problems.
EDIT: As pointed out by @Georgy in the comments, using a loop wouldn't be pythonic, and a more efficient way of solving this would be:
df['check_digit'] = df['vin'].str[8]
The .str does the trick. For future reference on that, I think you would find this helpful.
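For illustration, here is a minimal, self-contained sketch of that vectorized indexing, using two of the VINs from the question (the printed frame is roughly what pandas shows):

import pandas as pd

# small frame built from two VINs taken from the question's data
df = pd.DataFrame({'vin': ['LHJLC79U58B001633', '1A9LPEER3FC596536']})

# .str[8] grabs the 9th character of every row at once
df['check_digit'] = df['vin'].str[8]
print(df)
#                  vin check_digit
# 0  LHJLC79U58B001633           5
# 1  1A9LPEER3FC596536           3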
The correct way is:
def check_digit(df):
    df['check_digit'] = df['vin'].str[8]
    print(df)
I am currently working on pandas structure in Python. I wrote a function that extracts data from Pandas data frame and stores it in lists. The code is working but I feel like there is a part that I could write in one for loop instead four for loops. I will give you an example below. The idea of this part of the code is to extract four columns from a pandas data frame into four lists. I did it with 4 separate for loops but I want to have one loop that does the thing.
col1, col2, col3, col4 = [], [], [], []
for j in abc['col1']:
    col1.append(j)
for k in abc['col2']:
    col2.append(k)
for l in abc['col3']:
    col3.append(l)
for n in abc['col4']:
    col4.append(n)
And my idea is to write a one for loop that does all the code. I tried to do something like this, but it doesn't work.
col1, col2, col3, col4 = [], [], [], []
for j, k, l, n in abc[['col1', 'col2', 'col3', 'col4']]:
    col1.append(j)
    col2.append(k)
    col3.append(l)
    col4.append(n)
Can you help me with this idea to wrap four for loops into the one? I would appreciate your help!
You don't need to use loops at all; you can just convert each column into a list directly.
list_1 = df["col"]to_list()
Have a look at this previous question.
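If it helps, here is a minimal sketch of that approach applied to all four columns at once; the frame abc and its values are made up for illustration:

import pandas as pd

abc = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
                    'col3': [5, 6], 'col4': [7, 8]})

# one list per column, no row loop needed
col1, col2, col3, col4 = (abc[c].to_list() for c in ['col1', 'col2', 'col3', 'col4'])
print(col1, col2, col3, col4)  # [1, 2] [3, 4] [5, 6] [7, 8]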
Treating a pandas dataframe like a list usually works, but it is very bad for performance. I'd consider using the iterrows() function instead.
This would work as in the following example:
col1, col2, col3, col4 = [], [], [], []
for index, row in df.iterrows():
    col1.append(row['col1'])
    col2.append(row['col2'])
    col3.append(row['col3'])
    col4.append(row['col4'])
It's probably easier to use pandas .values and then numpy.ndarray.tolist():
col = ['col1', 'col2', 'col3']
data = [None] * len(col)
for i in range(len(col)):
    data[i] = df[col[i]].values.tolist()
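For example, with a small hypothetical frame, the loop above produces one plain Python list per column name:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})

col = ['col1', 'col2', 'col3']
data = [None] * len(col)
for i in range(len(col)):
    data[i] = df[col[i]].values.tolist()

print(data)  # [[1, 2], [3, 4], [5, 6]]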
This question already has answers here:
Is there a built in function for string natural sort?
(23 answers)
Closed 3 years ago.
I have a list of strings that I am trying to organize numerically. It looks like this:
List=['Core_0_0.txt', 'Core_0_1.txt','Core_0_2.txt',...'Core_1_0.txt','Core_2_3.txt', ]
but when I sort it with sorted(List), it doesn't sort the list properly.
It's very important that I keep the values as strings, and they must be ordered by the numbers, i.e. 0_1, 0_2, 0_3, ..., 31_1; they all have the form Core_X_X.txt. How would I do this?
If you can assume all your entries will look like *_N1_N2.txt, you can use the str.split method along with a sorting key function to sort your list properly. It might look something like this:
sorted_list = sorted(List, key = lambda s: (int(s.split("_")[1]), int(s.split("_")[2].split(".")[0])))
Essentially, this internally creates tuples like (N1, N2) where your file is named *_N1_N2.txt and sorts based on the N1 value. If there's a tie, it then sorts by the N2 value.
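For example, on a small subset of the question's file names, the key produces the numeric ordering:

files = ['Core_0_0.txt', 'Core_0_2.txt', 'Core_0_1.txt', 'Core_2_3.txt', 'Core_1_0.txt']

# sort by (N1, N2) extracted from 'Core_N1_N2.txt'
sorted_files = sorted(files, key=lambda s: (int(s.split("_")[1]),
                                            int(s.split("_")[2].split(".")[0])))
print(sorted_files)
# ['Core_0_0.txt', 'Core_0_1.txt', 'Core_0_2.txt', 'Core_1_0.txt', 'Core_2_3.txt']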
Your question is a possible duplicate of another question, whose answer I am posting here again.
You just need to change 'alist' to your 'List'.
import re

def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [atoi(c) for c in re.split(r'(\d+)', text)]

alist = [
    "something1",
    "something12",
    "something17",
    "something2",
    "something25",
    "something29"]

alist.sort(key=natural_keys)
print(alist)
yields
['something1', 'something2', 'something12', 'something17', 'something25', 'something29']
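Applied to the question's file names (assuming natural_keys is defined as above), the same key also handles multi-digit parts correctly:

files = ['Core_0_0.txt', 'Core_0_2.txt', 'Core_0_10.txt', 'Core_0_1.txt']
files.sort(key=natural_keys)
print(files)
# ['Core_0_0.txt', 'Core_0_1.txt', 'Core_0_2.txt', 'Core_0_10.txt']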
This question already has answers here:
Pandas Passing Variable Names into Column Name
(3 answers)
Closed 4 years ago.
I am trying to read unique values for columns in a list, but I am unable to put the variable in correctly so that it becomes a command. If I run c_data.ABC.unique() directly, then I get the list of unique values in the ABC column. Please suggest what is going wrong.
import pandas as pd

c_data = pd.read_csv("/home/fileName.csv")
list = ['ABC', 'DEF']
for f in list:
    cl = "c_data.{}.unique()".format(f)
    print(cl)
Output:
c_data.ABC.unique()
c_data.DEF.unique()
You should definitely check on these indexing basics in pandas. About your question: you can use the most basic indexing with brackets [] and a string column name, for example c_data['ABC'], so you can iterate like this:
c_data = pd.read_csv("/home/fileName.csv")
list = ['ABC', 'DEF']
for f in list:
    print(c_data[f].unique())
If you want/need to use format method, you can just replace column name with formatted string:
c_data = pd.read_csv("/home/fileName.csv")
list = ['ABC', 'DEF']
for f in list:
    print(c_data['{0}'.format(f)].unique())
Also, you can use bracket indexing with a list of strings, which will give you another DataFrame. Then you can iterate over the DataFrame itself, which yields the column names:
c_data = pd.read_csv("/home/fileName.csv")
f_data = c_data[['ABC', 'DEF']]
for f in f_data:
    print(f_data[f].unique())
This question already has answers here:
How to convert index of a pandas dataframe into a column
(9 answers)
Closed 1 year ago.
I'm not sure where I've gone astray, but I cannot seem to reset the index on a dataframe.
When I run test.head(), I get the output below:
As you can see, the dataframe is a slice, so the index is out of bounds.
What I'd like to do is to reset the index for this dataframe. So I run test.reset_index(drop=True). This outputs the following:
That looks like a new index, but it's not. Running test.head() again, the index is still the same. Attempting to use lambda.apply or iterrows() creates problems with the dataframe.
How can I really reset the index?
reset_index by default does not modify the DataFrame; it returns a new DataFrame with the reset index. If you want to modify the original, use the inplace argument: df.reset_index(drop=True, inplace=True). Alternatively, assign the result of reset_index by doing df = df.reset_index(drop=True).
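A minimal sketch of both options, using a made-up frame with a non-contiguous index:

import pandas as pd

test = pd.DataFrame({'a': [10, 20, 30]}, index=[5, 7, 9])

# Option 1: assign the returned frame back
test = test.reset_index(drop=True)

# Option 2: modify in place instead
# test.reset_index(drop=True, inplace=True)

print(test.index)  # RangeIndex(start=0, stop=3, step=1)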
BrenBarn's answer works.
The following also worked, via this thread, which isn't troubleshooting so much as an articulation of how to reset the index:
test = test.reset_index(drop=True)
As an extension of in code veritas's answer... instead of doing del at the end:
test = test.reset_index()
del test['index']
You can set drop to True.
test = test.reset_index(drop=True)
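A quick sketch (with a made-up frame) of why drop=True saves the del: without it, the old index is kept as a new 'index' column:

import pandas as pd

test = pd.DataFrame({'a': [1, 2]}, index=[3, 8])

print(test.reset_index().columns.tolist())           # ['index', 'a']  -> needs del test['index']
print(test.reset_index(drop=True).columns.tolist())  # ['a']           -> nothing to clean up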
I would add to in code veritas's answer:
If you already have an index column specified, then you can save the del, of course. In my hypothetical example:
df_total_sales_customers = pd.DataFrame({'Sales': total_sales_customers['Sales'],
                                          'Customers': total_sales_customers['Customers']},
                                         index=total_sales_customers.index)
df_total_sales_customers = df_total_sales_customers.reset_index()
This question already has answers here:
Python extract pattern matches
(10 answers)
Closed 3 years ago.
I'm pulling some data off the web using Python in a Jupyter notebook. I have pulled down the data, parsed it, and created the data frame. I need to extract a number out of a string in the data frame, and I am using this regex to do it:
for note in df["person_notes"]:
print(re.search(r'\d+', note))
and the outcome is the following:
<_sre.SRE_Match object; span=(53, 55), match='89'>
How can I get just the matched number; in this line it would be 89. I tried to convert the whole line to str() and then use replace(), but not all lines have the same span=(number, number). Thank you in advance!
You can use the start() and end() methods on the returned match objects to get the correct positions within the string:
for note in df["person_notes"]:
match = re.search(r'\d+', note)
if match:
print(note[match.start():match.end()])
else:
# no match found ...
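As a side note, if the goal is just the matched digits, match.group() returns them directly, and pandas also offers a vectorized alternative; here is a sketch with made-up note strings (only the '89' comes from the question):

import re
import pandas as pd

df = pd.DataFrame({'person_notes': ['flagged after 89 days', 'no number here']})

# per row: match.group() gives the matched text itself
for note in df['person_notes']:
    match = re.search(r'\d+', note)
    print(match.group() if match else None)  # prints 89, then None

# vectorized: extract the first run of digits per row into a new column
df['number'] = df['person_notes'].str.extract(r'(\d+)', expand=False)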