Python basics: How to select a substring from a series in a dataframe - python-3.x
I am not sure how to select a substring from a series in a dataframe in order to extract some needed text.
Example: I have two series in the dataframe and am trying to extract the last portion of the string in the QRY series, which will contain the string "AND".
So if I have "This is XYZ AND y = 1" then I need to extract "AND y = 1".
For this I've chosen rfind("AND"), since "AND" can occur anywhere in the string but I need the highest index, and I then want to extract the substring that begins at that highest-index "AND".
Sample for one string
strg = "This is XYZ AND y = 1"
print(strg[strg.rfind("AND"):])
-- This works on a single string, but on a dataframe it fails with: cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'>
Data set:
import pandas as pd
data = {"CELL":["CELL1","CELL2","CELL3"], "QRY": ["This is XYZ AND y = 1","No that is not AND z = 0","Yay AND a= -1"]}
df = pd.DataFrame(data,columns = ["CELL","QRY"])
print(df.QRY.str.rfind("AND"))
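One way to get the substring itself for every row, rather than just the index that str.rfind returns, is to apply the single-string slice per row. A minimal sketch (assuming, as in the sample data, that every QRY value contains "AND"; otherwise rfind returns -1 and the slice would keep only the last character):
import pandas as pd

data = {"CELL": ["CELL1", "CELL2", "CELL3"],
        "QRY": ["This is XYZ AND y = 1", "No that is not AND z = 0", "Yay AND a= -1"]}
df = pd.DataFrame(data, columns=["CELL", "QRY"])

# slice each string from the last occurrence of "AND" to the end
df["QRY_TAIL"] = df["QRY"].apply(lambda s: s[s.rfind("AND"):])
print(df["QRY_TAIL"])
# 0    AND y = 1
# 1    AND z = 0
# 2    AND a= -1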
Related
How to split data from a list with more than one item and append below to a DataFrame - PYTHON - pandas
I wrote this code to extract data from PDFs and create a list in Excel with 'number of PO' / 'Item' / 'Data' / archive name. But when the PDF contains the PO number and item more than once, the data is appended as a list within a list. That is fine until I put the lists into a pandas DataFrame: it creates cells holding more than one value, and I need to split that data out into new rows below, in order.
lista_Pedido = []
lista_Data = []
lista_Item = []
nome_arquivo = []
for f in os.listdir():
    col_3 = [f]
    nome_arquivo.append(col_3)
    reader = PdfReader(f)
    page = reader.pages[0]
    pdf_atual = page.extract_text(f)
    col_1 = re.findall(r'\w+(?<=PO: 45)\d+', pdf_atual)
    lista_Pedido.append(col_1)
    col_12 = re.findall(r'(?<=Item )\d+', pdf_atual)
    lista_Item.append(col_12)
    col_2 = re.findall(r'[?<=(Date of delivery: )|?<=(Data de fornecimento: )]\s+\d+/+\d+/+\d+', pdf_atual)
    lista_Data.append(col_2)

df = pd.DataFrame(data=(), columns=['Pedido','Item','Data'])
df['Item'] = (lista_Item)
df['Data'] = (lista_Data)
df['arquivo'] = (nome_arquivo)
Wrong result: a list with more than one value per cell. I need to split the values and append them below, following the order of the list.
The reason you are getting a list of lists is that re.findall returns a list. If you would like to add the results as individual items, you can do the following:
col_1 = re.findall(r'\w+(?<=PO: 45)\d+', pdf_atual)
lista_Pedido.extend(col_1)
Or:
col_1 = re.findall(r'\w+(?<=PO: 45)\d+', pdf_atual)
for result in col_1:
    lista_Pedido.append(result)
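As a quick standalone illustration of the difference (with made-up values, not taken from the question):
results = ['45001', '45002']   # hypothetical re.findall output
flat, nested = [], []
flat.extend(results)    # adds each item individually
nested.append(results)  # adds the whole list as one item
print(flat)    # ['45001', '45002']
print(nested)  # [['45001', '45002']]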
Spark keeping words in column that match a list
I currently have a list and a Spark DataFrame:
['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
I am having a tough time figuring out a way to create a new column in the DataFrame that takes the first matching word from the tags column for each row and puts it in the newly created column for that row. For example, let's say the first row in the tags column has only "murder" in it; I would want that to show in the new column. Then, if the next row had "boring", "silly" and "cult" in it, I would want it to show "cult" in the new column, since it matches the list. If the last row in the tags column had "revenge" and "cult" in it, I would want it to show only "revenge", since it's the first word that matches the list.
from pyspark.sql import functions as F

df = spark.createDataFrame([('murder',), ('boring silly cult',), ('revenge cult',)], ['tags'])
mylist = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']

pattern = '|'.join([f'({x})' for x in mylist])
df = df.withColumn('first_from_list', F.regexp_extract('tags', pattern, 0))

df.show()
# +-----------------+---------------+
# |             tags|first_from_list|
# +-----------------+---------------+
# |           murder|         murder|
# |boring silly cult|           cult|
# |     revenge cult|        revenge|
# +-----------------+---------------+
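One caveat: the joined pattern has no word boundaries, so a list word could also match inside a longer tag (e.g. 'cult' inside 'occult'). A minimal refinement, assuming the tags are whitespace-separated words:
# anchor the alternatives on word boundaries and extract group 1
pattern = r'\b(' + '|'.join(mylist) + r')\b'
df = df.withColumn('first_from_list', F.regexp_extract('tags', pattern, 1))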
You could use a PySpark UDF (user-defined function). First, let's write a Python function to find the first match between a list (in this case the list you provided) and a string, that is, the value of the tags column:
def find_first_match(tags):
    genres = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
    # return the first tag that appears in the genre list
    for tag in tags.split():
        if tag in genres:
            return tag
    return ''
Then we need to convert this function into a PySpark UDF so that we can use it in combination with the .withColumn() operation:
from pyspark.sql.functions import udf

find_first_matchUDF = udf(lambda z: find_first_match(z))
Now we can apply the UDF to generate a new column. Assuming df is the name of your DataFrame:
from pyspark.sql.functions import col

new_df = df.withColumn("first_match", find_first_matchUDF(col("tags")))
This approach only works if all tags in your tags column are separated by whitespace.
P.S. You can avoid the second step by using the decorator form:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def find_first_match(tags):
    genres = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
    for tag in tags.split():
        if tag in genres:
            return tag
    return ''

new_df = df.withColumn("first_match", find_first_match(col("tags")))
How to average groups of columns
Given the following pandas DataFrame, I am trying to get to point b (shown in image 2), where I want to use the row 'class' to identify column names and average columns with the same class. I have been trying to use setdefault to create a dictionary, but I am not having much luck. I aim to achieve the final result shown in fig 2. Since this is a representative example (the actual DataFrame is huge), please let me know of a loop-based approach if possible. Any help or pointers in the right direction is immensely appreciated.
Imports and test DataFrame:
import pandas as pd
from string import ascii_lowercase  # for test data
import numpy as np  # for test data

np.random.seed(365)
df = pd.DataFrame(np.random.rand(5, 6) * 1000, columns=list(ascii_lowercase[:6]))
df.index.name = 'Class'

                a           b           c           d           e           f
Class
0      941.455743  641.602705  684.610467  588.562066  543.887219  368.070913
1      766.625774  305.012427  442.085972  110.443337  438.373785  752.615799
2      291.626250  885.722745  996.691261  486.568378  349.410194  151.412764
3      891.947611  773.542541  780.213921  489.000349  532.862838  189.855095
4      958.551868  882.662907   86.499676  243.609553  279.726092  215.662172
Create a DataFrame of column-pair means:
# use list slicing to select even and odd columns
even_cols = df.columns[0::2]
odd_cols = df.columns[1::2]

# zip the two lists into pairs
# zip creates tuples, but pandas requires a list of columns, so we map the tuples into lists
col_pairs = list(map(list, zip(even_cols, odd_cols)))

# in a list comprehension, iterate through each column pair, take the mean, and concat the results into a dataframe
df_means = pd.concat([df[pairs].mean(axis=1) for pairs in col_pairs], axis=1)

# in a list comprehension, create column header names with a string join
df_means.columns = [' & '.join(pair) for pair in col_pairs]

# display(df_means)
            a & b       c & d       e & f
Class
0      791.529224  636.586267  455.979066
1      535.819101  276.264655  595.494792
2      588.674498  741.629819  250.411479
3      832.745076  634.607135  361.358966
4      920.607387  165.054615  247.694132
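If the grouping is driven by a class label per column rather than by adjacent pairs, a dictionary mapping column names to classes can feed pandas' groupby directly. A minimal sketch (the col_class mapping here is a made-up example, not taken from the question):
import pandas as pd
import numpy as np

np.random.seed(365)
df = pd.DataFrame(np.random.rand(5, 6) * 1000, columns=list('abcdef'))

# hypothetical mapping of each column to its class
col_class = {'a': 'x', 'b': 'x', 'c': 'y', 'd': 'y', 'e': 'z', 'f': 'z'}

# transpose, group the rows (former columns) by class, average, transpose back
df_means = df.T.groupby(col_class).mean().T
print(df_means)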
Try this:
df['A B'] = df[['A', 'B']].mean(axis=1)
Scala How to Find All Unique Values from a Specific Column in a CSV?
I am using Scala to read from a CSV file. The file is formatted to have 3 columns, each separated by a \t character. The first 2 columns are unimportant and the third column contains a list of comma-separated identifiers stored as strings. Below is a sample of what the input CSV would look like:
0002ba73 US 6o7,6on,6qc,6qj,6nw,6ov,6oj,6oi,15me,6pb,6p9
002f50e4 US 6om,6pb,6p8,15m9,6ok,6ov,6qc,6oo,15me
004b5edc US 6oj,6nz,6on,6om,6qc,6ql,6p6,15me
005cc990 US 6pb,6qf,15me,6og,6nx,6qc,6om,6ok
005fe1ea US 15me,6p0,6ql,6ok,6ox,6ol,6o5,6qj
00777555 US 6pb,15me,6nw,6rk,6qc,6ov,6qj,6o0,6oj,6ok,6on,6p6,6nx,15m9
00cbcc7d US 6oj,6qc,6qg,6pb,6ol,6p6,6ov,15me
010254a6 US 6qc,6pb,6nw,6nx,15me,6o0,6ok,6p8
011b905c US 6oj,6nw,6ov,15me,6qc,6ow,6ql,6on,6qi,6qe
011fffa6 US 15me,6ok,6oj,6p6,6pb,6on,6qc,6ov,6oo,6nw,6oc
I want to read in the CSV, get rid of the first two columns, and create a List that contains one instance of each unique identifier code found in the third column, so running the code on the above data should return the result:
List(6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6p8, 15m9, 6ok, 6oo, 6nz, 6om, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)
I have the following code, which returns a List containing every distinct value found anywhere in the CSV file:
val in_file = new File("input_file.csv")
val source = scala.io.Source.fromFile(in_file, "utf-8")
val labels = try source.getLines.mkString("\t") finally source.close()
val labelsList: List[String] = labels.split("[,\t]").map(_.trim).toList.distinct
Using the above input, my code returns labelsList with a value of:
List(0002ba73-e60c-4ffb-9131-c1612b904658, US, 6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 002f50e4-48cc-4b14-bb80-0502068b6161, 6om, 6p8, 15m9, 6ok, 6oo, 004b5edc-c0cc-4ffd-bef3-980bd92b92e6, 6nz, 6ql, 6p6, 005cc990-83dc-4e63-a4b6-58f38241e8fd, 6qf, 6og, 6nx, 005fe1ea-b918-48a3-a495-1f8ac12935ba, 6p0, 6ox, 6ol, 6o5, 00777555-83d4-401e-861b-5892f3aa3e1c, 6rk, 6o0, 00cbcc7d-1b48-4c5c-8141-8fc8f62b7b07, 6qg, 010254a6-2ef0-4a24-aa4d-3cc6656a55de, 011b905c-fbf3-441a-8912-a94cc0fe8a1d, 6ow, 6qi, 6qe, 011fffa6-0b9f-4d88-8ced-ce1cc864984f, 6oc)
How can I get my code to run properly and ignore anything contained within the first 2 columns of the CSV?
You can ignore the first two columns and then split the third by the comma. Finally, a toSet will get rid of the duplicate identifiers.
import scala.io.Source

val f = Source.fromFile("input_file.csv")
val lastColumns = f.getLines().map(_.split("\t")(2))
val uniques = lastColumns.flatMap(_.split(",")).toSet
uniques foreach println
Using Scala 2.13 resource management:
util.Using(io.Source.fromFile("input_file.csv")){
  _.getLines()
   .foldLeft(Array.empty[String]){
     _ ++ _.split("\t")(2).split(",")
   }.distinct.toList
}
//res0: scala.util.Try[List[String]] =
//  Success(List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc))
The .toList can be dropped if an Array result is acceptable.
This is what you can do. I am demonstrating on a sample DataFrame; you can replace it with yours:
val Df = Seq(("7369", "SMITH", "2010-12-17", "800.00"),
             ("7499", "ALLEN", "2011-02-20", "1600.00")).toDF("empno", "ename", "hire_date", "sal")
val reqCols = Seq(2)
val finalDf = Df.select(reqCols map Df.columns map col: _*)
finalDf.show
Note: this is a 0-based index, so pass 2 to get the third column.
If you want distinct values from your desired column, you can use distinct along with mkString:
val Df = Seq(("7369", "SMITH", "2010-12-17", "800.00"),
             ("7499", "ALLEN", "2011-02-20", "1600.00"),
             ("7499", "ALLEN", "2011-02-20", "1600.00")).toDF("empno", "ename", "hire_date", "sal")
val reqCols = Seq(2)
val distinctValues = Df.select(reqCols map Df.columns map col: _*).distinct.collect.mkString(",").filterNot("[]".toSet)
println(distinctValues)
The dates are duplicated; the code above removes the duplicates.
Another method, using a regex:
val data = scala.io.Source.fromFile("source.txt").getLines()
data.toList.flatMap { line =>
  """\S+\s+\S+\s+(\S+)""".r.findAllMatchIn(line).map(x => x.group(1).split(",").toList)
}.flatten.distinct
// res0: List[String] = List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)
Create Python string placeholder (%s) n times
I am looking to automatically generate the following string in Python 2.7, using a loop based on the number of columns in a pandas DataFrame:
INSERT INTO table_name (firstname, lastname) VALUES (534737, 100.115)
This assumes that the DataFrame has 2 columns. Here is what I have:
# Generate test numbers for table:
df = pd.DataFrame(np.random.rand(5,2), columns=['firstname','lastname'])

# Create list of tuples from numbers in each row of DataFrame:
list_of_tuples = [tuple(x) for x in df.values]
Now, I create the string. Manually, this works:
add_SQL = "INSERT INTO table_name (firstname, lastname) VALUES %s" % (list_of_tuples[4],)
In this example, I only used 2 column names: 'firstname' and 'lastname'. But I must do this with a loop, since I have 156 column names; I cannot do this manually.
What I need: I need to automatically generate the placeholder %s the same number of times as the number of columns in the pandas DataFrame. Here the DataFrame has 2 columns, so I need an automatic way to generate %s twice. Then I need to create a tuple with 2 entries, without the ''.
My attempt:
sss = ['%s' for x in range(0,len(list(df)))]
add_SQL = "INSERT INTO table_name (" + sss + ") VALUES %s" % (len(df), list_of_tuples[4])
But this is not working. Is there a way for me to generate this string automatically?
Here is what I came up with; it is based on dwanderson's approach in the 2nd comment of the original post (question):
table_name = name_a  # name of table

# Loop through all columns of the dataframe and generate one string per column:
cols_n = df.columns.tolist()
placeholder = ",".join(["%s"] * df.shape[1])  # df.shape[1] gives the number of columns
column_names = ",".join(cols_n)
insrt = "INSERT INTO %s " % table_name

for qrt in range(0, df.shape[0]):
    add_SQL_a_1 = insrt + "(" + column_names + ") VALUES (" + placeholder + ")"  # part 1/2
    add_SQL_a_2 = add_SQL_a_1 % list_of_tuples[qrt]  # part 2/2
This way, the final string is in part 2/2. For some reason, it would not let me do this all in one line and I can't figure out why.
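The one-line version most likely fails because of operator precedence: % binds more tightly than +, so without parentheses the % is applied only to the final ")" literal instead of to the whole template. A minimal sketch of the fix, reusing the variable names from the answer above:
# wrap the concatenation in parentheses so % formats the complete template
add_SQL = (insrt + "(" + column_names + ") VALUES (" + placeholder + ")") % list_of_tuples[qrt]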