Use the startswith function for numbers in Python - python-3.x

I have a column vector with numbers and characters like this:
Data
123456
789101
159482
Airplane
Car
Blue
159874
I need to filter just the numeric values.
I tried to use the Data.int.startswith function, but I believe this function doesn't exist.
Thanks.

Not sure exactly what you are asking, but if you mean that you want to filter out a list of ints from the string, you can do the following:
string = """Data
123456
789101
159482
Airplane
Car
Blue
159874""" #The data you provided
def isInt(s):  # returns True if the string can be parsed as an int
    try:
        int(s)
        return True
    except ValueError:
        return False
# Loop through the lines in the string, keeping those that are integers.
print([i for i in string.splitlines() if isInt(i)])
This will print the following list (the items are still strings; use int(i) in the comprehension to get integers):
['123456', '789101', '159482', '159874']
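If the values are held in a pandas Series (an assumption; the question does not show how Data is stored), a similar filter can be sketched with the .str accessor. The Series below is a hypothetical stand-in for the question's column:

```python
import pandas as pd

# Hypothetical Series standing in for the "Data" column from the question.
data = pd.Series(
    ["123456", "789101", "159482", "Airplane", "Car", "Blue", "159874"],
    name="Data",
)

# str.isdigit() is True only when every character in the string is a digit.
numeric_only = data[data.str.isdigit()]

# Convert the surviving strings to integers.
numeric_values = numeric_only.astype(int).tolist()
print(numeric_values)  # [123456, 789101, 159482, 159874]
```

Note that str.isdigit() rejects negative numbers and decimals; if those can occur, pd.to_numeric(data, errors="coerce") followed by dropna() may be a better fit.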

Related

Efficient and pythonic way to search through a long string

I have created some code to search through a string and return True if there is an emoji in the string. The strings are found in a column in a pandas dataframe, and one can assume the string and the length of the dataframe could be arbitrarily long. I then create a new column in my dataframe with these boolean results.
Here is my code:
import emoji
contains_emoji = []
for row in df['post_text']:
    emoji_found = False
    for char in row:
        if emoji.is_emoji(char):
            emoji_found = True
            break
    contains_emoji.append(emoji_found)
df['has_emoji'] = contains_emoji
In an effort to get slicker, I was wondering if anyone could recommend a faster, shorter, or more pythonic way of searching like this?
Use emoji.emoji_count():
import emoji
import pandas as pd
# Create example dataframe
df = pd.DataFrame({'post_text':['🌍', '😂', 'text 😃', 'abc']})
# Create column based on emoji within text
df['has_emoji'] = df['post_text'].apply(lambda x: emoji.emoji_count(x) > 0)
# print dataframe
print(df)
OUTPUT:
  post_text  has_emoji
0         🌍       True
1         😂       True
2    text 😃       True
3       abc      False
Why not just:
df["has_emoji"] = df.post_text.apply(emoji.emoji_count) > 0
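You can use str.contains with a regex pattern that matches a range of emoji code points. Note that the range below covers only the emoticons around U+1F600, so emoji outside it (such as 🌍) will not match: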
df['has_emoji'] = df['post_text'].str.contains(r'[\U0001f600-\U0001f650]')
For reference here is a link to the source code for emoji.emoji_count(): https://github.com/carpedm20/emoji/blob/master/emoji/core.py

masking string and phone number for dataframe in python pandas

Here I am trying to mask a data frame/dataset that has columns with both integer and string values, like this:
sno,Name,Type 1,Type 2,phonenumber
1,Bulbasaur,Grass,Poison,9876543212
2,Ivysaur,Grass,Poison,9876543212
3,Venusaur,Grass,Poison,9876543212
This is the code I am using. It masks string values fine, but it does not mask integers:
import pandas as pd
filename = "path/to/file"
columnname= "phonenumber"
valuetomask = "9876543212"
column_dataset1 = pd.read_csv(filename)
print(column_dataset1)
# if(choice == "True"):
#masking for particular string/number in a column
column_dataset1[columnname]=column_dataset1[columnname].mask(column_dataset1[columnname] == valuetomask,"XXXXXXXXXX")
print(column_dataset1)
# masking last four digits
column_dataset1[columnname]=column_dataset1[columnname].str[:-4]+"****"
print(column_dataset1)
The above code works perfectly for strings, but when I give it the "phonenumber" column (or any integer column) it does not work.
Note: I need to do full masking (the whole value should be masked) and partial masking (i.e. the last three digits/characters or the first three digits/characters from the above file) for any file that is given.
Convert to str and replace last four digits:
>>> df['phonenumber'].astype(str).str.replace(r'\d{4}$' , '****', regex=True)
0 987654****
1 987654****
2 987654****
Name: phonenumber, dtype: object
Which is the same as what @babakfifoo suggested:
>>> df['phonenumber'].astype(str).str[:-4] + '****'
0 987654****
1 987654****
2 987654****
Name: phonenumber, dtype: object
Convert your phone numbers to string and then try masking:
mask_len = 5  # number of digits to mask from the right side
column_dataset1['phonenumber'] = (
    column_dataset1['phonenumber'].astype(str)  # convert to string
    .str[:-mask_len] + "*" * mask_len  # mask the last digits
)
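The question also asks for full masking and for masking the first or last few characters. A minimal sketch of one helper that covers all three cases is shown below; mask_value and its parameters are hypothetical names, not part of pandas:

```python
def mask_value(value, count=None, from_end=True, mask_char="*"):
    """Mask a value after converting it to a string.

    count=None masks the whole value; otherwise `count` characters are
    masked at the end (from_end=True) or at the start (from_end=False).
    """
    text = str(value)  # works for integers as well as strings
    if count is None or count >= len(text):
        return mask_char * len(text)
    if count <= 0:
        return text
    if from_end:
        return text[:-count] + mask_char * count
    return mask_char * count + text[count:]


print(mask_value(9876543212))                      # **********
print(mask_value(9876543212, 4))                   # 987654****
print(mask_value("Bulbasaur", 3, from_end=False))  # ***basaur
```

Because the helper converts with str() first, it could be applied to any column regardless of dtype, for example with column_dataset1[columnname].map(lambda v: mask_value(v, 4)).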

How to find a string in a list of tuples by defining a function

This may be simple, but I would like to check for an existing string (or strings) inside a list of tuples, and then return the corresponding tuple(s) that the string appears in. I also want the search to be case-insensitive so that it matches regardless of capitalization.
I would like to define a function that will do this. This is what I have tried:
test_scores = [('Math midterm, 87', 'math final, 92'),
               ('english essay, 100', 'english midterm, 87', 'english final, 99'),
               ('science midterm, 95', 'science final, 100')]
def searchScores(searchString):
    for i in range(len(test_scores)):
        for j in range(len(test_scores[i])):
            if test_scores[i][j].casefold() == searchString.casefold():
                print(test_scores[i])
I want to be able to input my search like so:
searchScores('math')
searchScores('87')
which should return:
('Math midterm, 87','math final, 92')
('english essay, 100','english midterm, 87','english final, 99')
However, this returned nothing when I passed in a string to check...
Let me know if any clarification is needed. Thanks!
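I don't know why it didn't show any output for you, but assuming test_scores is a list of tuples of strings, this worked for me. Note that == only matches when the search string equals an entire element, which is why searching for 'math' finds nothing in the original data.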
test_scores = [("abc","def"),("ghi","jkl"),("mno","pqr")]
def searchScores(searchString):
    for tup in test_scores:
        for ele in tup:
            if ele.lower() == searchString.lower():
                print(tup)
st = input("Enter : ")
searchScores(st)
The function in this code will search each tuple for the search string.
test_scores =[('math', '87'), ('english', '33'), ('67', 'math')]
def search_scores(scores, searchString):
    lower_scores = [(score[0].lower(), score[1].lower()) for score in scores]
    items = list(filter(lambda x: searchString.lower() in x, lower_scores))
    return items
print(search_scores(test_scores, 'math'))
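For the data in the question, where the search term is only part of each element (for example 'math' inside 'Math midterm, 87'), a substring test is needed rather than an equality test. A minimal sketch, using the question's test_scores and returning the original tuples unchanged:

```python
test_scores = [('Math midterm, 87', 'math final, 92'),
               ('english essay, 100', 'english midterm, 87', 'english final, 99'),
               ('science midterm, 95', 'science final, 100')]


def search_scores(scores, search_string):
    # casefold() gives a case-insensitive comparison key.
    needle = search_string.casefold()
    # Keep a tuple if the needle occurs inside any of its elements.
    return [tup for tup in scores
            if any(needle in item.casefold() for item in tup)]


print(search_scores(test_scores, 'math'))
print(search_scores(test_scores, '87'))
```

Searching for 'math' returns only the first tuple, while '87' returns the first two, which matches the output the question asks for.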

Indexing the list in python

record=['MAT', '90', '62', 'ENG', '92','88']
course='MAT'
Suppose I want to get the marks for MAT or ENG; what do I do? I only know how to find the index of the course, which is new[4:10].index(course). I don't know how to get the marks.
Try this:
i = record.index('MAT')
grades = record[i+1:i+3]
In this case i is the index/position of the 'MAT' or whichever course, and grades are the items in a slice comprising the two slots after the course name.
You could also put it in a function:
def get_grades(course):
    i = record.index(course)
    return record[i+1:i+3]
Then you can just pass in the course name and get back the grades.
>>> get_grades('ENG')
['92', '88']
>>> get_grades('MAT')
['90', '62']
>>>
Edit
If you want to get a string of the two grades together instead of a list with the individual values you can modify the function as follows:
def get_grades(course):
    i = record.index(course)
    return ' '.join("'{}'".format(g) for g in record[i+1:i+3])
You can use the list's index method (see https://stackoverflow.com/a/176921/) and then read the positions that follow, but I think you should use a dictionary.
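A minimal sketch of the dictionary approach, assuming every course name in record is followed by exactly two marks (marks_by_course is a hypothetical name):

```python
record = ['MAT', '90', '62', 'ENG', '92', '88']

# Step through the list three items at a time: course name, then two marks.
marks_by_course = {
    record[i]: record[i + 1:i + 3]
    for i in range(0, len(record), 3)
}

print(marks_by_course['MAT'])  # ['90', '62']
print(marks_by_course['ENG'])  # ['92', '88']
```

Once the dictionary is built, each lookup is a single key access instead of a search through the list.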

Creating a dictionary of dictionaries from csv file

I am trying to write a function, classify(csv_file), that creates a default dictionary of dictionaries from a csv file. The first "column" (first item in each row) is the key for each entry in the dictionary, and the second "column" (second item in each row) contains the values.
However, I want to alter the values by calling on two functions (in this order):
trigram_c(string): that creates a default dictionary of trigram counts within the string (which are the values)
normal(tri_counts): that takes the output of trigram_c and normalises the counts (i.e. converts the count for each trigram into a number).
Thus, my final output will be a dictionary of dictionaries:
{value: {trigram1 : normalised_count, trigram2: normalised_count}, value2: {trigram1: normalised_count...}...} and so on
My current code looks like this:
def classify(csv_file):
    l_rows = list(csv.reader(open(csv_file)))
    classified = dict((l_rows[0], l_rows[1]) for rows in l_rows)
For example, if the csv file was:
Snippet1, "It was a dark stormy day"
Snippet2, "Hello world!"
Snippet3, "How are you?"
The final output would resemble:
{Snippet1: {'It ': 0.5352, 't w': 0.43232}, Snippet2: {'Hel' : 0.438724,...}...} and so on.
(Of course there would be more than just two trigram counts, and the numbers are just random for the purpose of the example).
Any help would be much appreciated!
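First of all, please check the classify function, because I can't run it. Here is a corrected version:
import csv
def classify(csv_file):
    # Close the file automatically once the rows have been read.
    with open(csv_file, newline='') as f:
        l_rows = list(csv.reader(f))
    # Use each row's own first and second items as key and value.
    classified = dict((row[0], row[1]) for row in l_rows)
    return classified
It returns a dictionary whose keys come from the first column and whose values are the strings from the second column.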
So you should iterate over every dictionary entry and pass its value to the trigram_c function. I didn't understand how you calculate the trigram counts, but if, for example, you just count the number of times each trigram appears in the string, you could use the function below. If you want to count differently, you just need to update the code in the for loop.
def trigram_c(string):
    trigram_dict = {}
    start = 0
    end = 3
    for i in range(len(string)-2):
        # you could implement your logic in this loop
        trigram = string[start:end]
        if trigram in trigram_dict.keys():
            trigram_dict[trigram] += 1
        else:
            trigram_dict[trigram] = 1
        start += 1
        end += 1
    return trigram_dict
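To tie the pieces together, here is a minimal sketch of the full pipeline. The question does not say how normal should normalise, so dividing each count by the total number of trigrams is an assumption, and classify_rows is a hypothetical helper added so the logic can be tried without a file on disk:

```python
import csv
import io
from collections import Counter


def trigram_c(string):
    # Count every run of three consecutive characters.
    return Counter(string[i:i + 3] for i in range(len(string) - 2))


def normal(tri_counts):
    # Assumed normalisation: each count divided by the total count.
    total = sum(tri_counts.values())
    if total == 0:
        return {}
    return {tri: count / total for tri, count in tri_counts.items()}


def classify_rows(rows):
    # rows: iterable of [key, text] lists, e.g. produced by csv.reader.
    return {row[0]: normal(trigram_c(row[1])) for row in rows if len(row) >= 2}


def classify(csv_file):
    # skipinitialspace lets the quoted text follow a comma and a space.
    with open(csv_file, newline="") as handle:
        return classify_rows(csv.reader(handle, skipinitialspace=True))


# Hypothetical one-line sample in the same layout as the question's file.
sample = io.StringIO('Snippet2, "abab"\n')
result = classify_rows(csv.reader(sample, skipinitialspace=True))
print(result)  # {'Snippet2': {'aba': 0.5, 'bab': 0.5}}
```

Counter does the same counting as the explicit loop above; strings shorter than three characters simply produce an empty dictionary.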

Resources