Python basics: How to select a substring from a series in a dataframe - python-3.x
I am not sure how to select a substring from a series in a dataframe in order to extract some needed text.
Example: I have two series in the dataframe and am trying to extract the last portion of the string in the QRY series, which will contain the string "AND".
So if I have "This is XYZ AND y = 1" then I need to extract "AND y = 1".
For this I've chosen rfind("AND"), since "AND" can occur anywhere in the string but I need the highest index, and I then want to extract the substring that begins at that highest-index "AND".
Sample for one string
strg = "This is XYZ AND y = 1"
print(strg[strg.rfind("AND"):])
-- This works on a single string, but on a dataframe it fails with: cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'>
Data set:
import pandas as pd
data = {"CELL":["CELL1","CELL2","CELL3"], "QRY": ["This is XYZ AND y = 1","No that is not AND z = 0","Yay AND a= -1"]}
df = pd.DataFrame(data,columns = ["CELL","QRY"])
print(df.QRY.str.rfind("AND"))
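One way to get the substring itself for every row, rather than just the index that str.rfind returns, is to apply the single-string slice per row. A minimal sketch (assuming, as in the sample data, that every QRY value contains "AND"; otherwise rfind returns -1 and the slice would keep only the last character):
import pandas as pd

data = {"CELL": ["CELL1", "CELL2", "CELL3"],
        "QRY": ["This is XYZ AND y = 1", "No that is not AND z = 0", "Yay AND a= -1"]}
df = pd.DataFrame(data, columns=["CELL", "QRY"])

# slice each string from the last occurrence of "AND" to the end
df["QRY_TAIL"] = df["QRY"].apply(lambda s: s[s.rfind("AND"):])
print(df["QRY_TAIL"])
# 0    AND y = 1
# 1    AND z = 0
# 2    AND a= -1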
Related
How to split data from a list with more than one item and append below to a DataFrame - PYTHON - pandas
I wrote this code to extract data from PDFs and create a list in Excel with 'number of PO' / 'Item' / 'Data' / archive name. But when the PDF contains the PO number and item more than once, the data is appended as a list within a list. That is fine until I put the lists into a pandas DataFrame: it creates cells holding more than one value, and I need to split that data out into new rows below, in order.
lista_Pedido = []
lista_Data = []
lista_Item = []
nome_arquivo = []
for f in os.listdir():
    col_3 = [f]
    nome_arquivo.append(col_3)
    reader = PdfReader(f)
    page = reader.pages[0]
    pdf_atual = page.extract_text(f)
    col_1 = re.findall(r'\w+(?<=PO: 45)\d+', pdf_atual)
    lista_Pedido.append(col_1)
    col_12 = re.findall(r'(?<=Item )\d+', pdf_atual)
    lista_Item.append(col_12)
    col_2 = re.findall(r'[?<=(Date of delivery: )|?<=(Data de fornecimento: )]\s+\d+/+\d+/+\d+', pdf_atual)
    lista_Data.append(col_2)

df = pd.DataFrame(data=(), columns=['Pedido','Item','Data'])
df['Item'] = (lista_Item)
df['Data'] = (lista_Data)
df['arquivo'] = (nome_arquivo)
Wrong result: a list with more than one value per cell. I need to split the values and append them below, following the order of the list.
The reason you are getting a list of lists is that re.findall returns a list. If you would like to add the results as individual items, you can do the following:
col_1 = re.findall(r'\w+(?<=PO: 45)\d+', pdf_atual)
lista_Pedido.extend(col_1)
Or:
col_1 = re.findall(r'\w+(?<=PO: 45)\d+', pdf_atual)
for result in col_1:
    lista_Pedido.append(result)
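As a quick standalone illustration of the difference (with made-up values, not taken from the question):
results = ['45001', '45002']   # hypothetical re.findall output
flat, nested = [], []
flat.extend(results)    # adds each item individually
nested.append(results)  # adds the whole list as one item
print(flat)    # ['45001', '45002']
print(nested)  # [['45001', '45002']]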
Spark keeping words in column that match a list
I currently have a list and a Spark DataFrame:
['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
I am having a tough time figuring out a way to create a new column in the DataFrame that takes the first matching word from the tags column for each row and puts it in the newly created column for that row. For example, let's say the first row in the tags column has only "murder" in it; I would want that to show in the new column. Then, if the next row had "boring", "silly" and "cult" in it, I would want it to show "cult" in the new column, since it matches the list. If the last row in the tags column had "revenge" and "cult" in it, I would want it to show only "revenge", since it's the first word that matches the list.
from pyspark.sql import functions as F

df = spark.createDataFrame([('murder',), ('boring silly cult',), ('revenge cult',)], ['tags'])
mylist = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']

pattern = '|'.join([f'({x})' for x in mylist])
df = df.withColumn('first_from_list', F.regexp_extract('tags', pattern, 0))

df.show()
# +-----------------+---------------+
# |             tags|first_from_list|
# +-----------------+---------------+
# |           murder|         murder|
# |boring silly cult|           cult|
# |     revenge cult|        revenge|
# +-----------------+---------------+
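One caveat: the joined pattern has no word boundaries, so a list word could also match inside a longer tag (e.g. 'cult' inside 'occult'). A minimal refinement, assuming the tags are whitespace-separated words:
# anchor the alternatives on word boundaries and extract group 1
pattern = r'\b(' + '|'.join(mylist) + r')\b'
df = df.withColumn('first_from_list', F.regexp_extract('tags', pattern, 1))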
You could use a PySpark UDF (user-defined function). First, let's write a Python function to find the first match between a list (in this case the list you provided) and a string, that is, the value of the tags column:
def find_first_match(tags):
    genres = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
    # return the first tag that appears in the genre list
    for tag in tags.split():
        if tag in genres:
            return tag
    return ''
Then we need to convert this function into a PySpark UDF so that we can use it in combination with the .withColumn() operation:
from pyspark.sql.functions import udf

find_first_matchUDF = udf(lambda z: find_first_match(z))
Now we can apply the UDF to generate a new column. Assuming df is the name of your DataFrame:
from pyspark.sql.functions import col

new_df = df.withColumn("first_match", find_first_matchUDF(col("tags")))
This approach only works if all tags in your tags column are separated by whitespace.
P.S. You can avoid the second step by using the decorator form:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def find_first_match(tags):
    genres = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
    for tag in tags.split():
        if tag in genres:
            return tag
    return ''

new_df = df.withColumn("first_match", find_first_match(col("tags")))
How to average groups of columns
Given the following pandas DataFrame, I am trying to get to point b (shown in image 2), where I want to use the row 'class' to identify column names and average columns with the same class. I have been trying to use setdefault to create a dictionary, but I am not having much luck. I aim to achieve the final result shown in fig 2. Since this is a representative example (the actual DataFrame is huge), please let me know of a loop-based approach if possible. Any help or pointers in the right direction is immensely appreciated.
Imports and test DataFrame:
import pandas as pd
from string import ascii_lowercase  # for test data
import numpy as np  # for test data

np.random.seed(365)
df = pd.DataFrame(np.random.rand(5, 6) * 1000, columns=list(ascii_lowercase[:6]))
df.index.name = 'Class'

                a           b           c           d           e           f
Class
0      941.455743  641.602705  684.610467  588.562066  543.887219  368.070913
1      766.625774  305.012427  442.085972  110.443337  438.373785  752.615799
2      291.626250  885.722745  996.691261  486.568378  349.410194  151.412764
3      891.947611  773.542541  780.213921  489.000349  532.862838  189.855095
4      958.551868  882.662907   86.499676  243.609553  279.726092  215.662172
Create a DataFrame of column-pair means:
# use list slicing to select even and odd columns
even_cols = df.columns[0::2]
odd_cols = df.columns[1::2]

# zip the two lists into pairs
# zip creates tuples, but pandas requires a list of columns, so we map the tuples into lists
col_pairs = list(map(list, zip(even_cols, odd_cols)))

# in a list comprehension, iterate through each column pair, take the mean, and concat the results into a dataframe
df_means = pd.concat([df[pairs].mean(axis=1) for pairs in col_pairs], axis=1)

# in a list comprehension, create column header names with a string join
df_means.columns = [' & '.join(pair) for pair in col_pairs]

# display(df_means)
            a & b       c & d       e & f
Class
0      791.529224  636.586267  455.979066
1      535.819101  276.264655  595.494792
2      588.674498  741.629819  250.411479
3      832.745076  634.607135  361.358966
4      920.607387  165.054615  247.694132
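If the grouping is driven by a class label per column rather than by adjacent pairs, a dictionary mapping column names to classes can feed pandas' groupby directly. A minimal sketch (the col_class mapping here is a made-up example, not taken from the question):
import pandas as pd
import numpy as np

np.random.seed(365)
df = pd.DataFrame(np.random.rand(5, 6) * 1000, columns=list('abcdef'))

# hypothetical mapping of each column to its class
col_class = {'a': 'x', 'b': 'x', 'c': 'y', 'd': 'y', 'e': 'z', 'f': 'z'}

# transpose, group the rows (former columns) by class, average, transpose back
df_means = df.T.groupby(col_class).mean().T
print(df_means)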
Try this:
df['A B'] = df[['A', 'B']].mean(axis=1)
Scala How to Find All Unique Values from a Specific Column in a CSV?
I am using Scala to read from a CSV file. The file is formatted to have 3 columns, each separated by a \t character. The first 2 columns are unimportant and the third column contains a list of comma-separated identifiers stored as strings. Below is a sample of what the input CSV would look like:
0002ba73 US 6o7,6on,6qc,6qj,6nw,6ov,6oj,6oi,15me,6pb,6p9
002f50e4 US 6om,6pb,6p8,15m9,6ok,6ov,6qc,6oo,15me
004b5edc US 6oj,6nz,6on,6om,6qc,6ql,6p6,15me
005cc990 US 6pb,6qf,15me,6og,6nx,6qc,6om,6ok
005fe1ea US 15me,6p0,6ql,6ok,6ox,6ol,6o5,6qj
00777555 US 6pb,15me,6nw,6rk,6qc,6ov,6qj,6o0,6oj,6ok,6on,6p6,6nx,15m9
00cbcc7d US 6oj,6qc,6qg,6pb,6ol,6p6,6ov,15me
010254a6 US 6qc,6pb,6nw,6nx,15me,6o0,6ok,6p8
011b905c US 6oj,6nw,6ov,15me,6qc,6ow,6ql,6on,6qi,6qe
011fffa6 US 15me,6ok,6oj,6p6,6pb,6on,6qc,6ov,6oo,6nw,6oc
I want to read in the CSV, get rid of the first two columns, and create a List that contains one instance of each unique identifier code found in the third column, so running the code on the above data should return the result:
List(6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6p8, 15m9, 6ok, 6oo, 6nz, 6om, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)
I have the following code, which returns a List containing every distinct value found anywhere in the CSV file:
val in_file = new File("input_file.csv")
val source = scala.io.Source.fromFile(in_file, "utf-8")
val labels = try source.getLines.mkString("\t") finally source.close()
val labelsList: List[String] = labels.split("[,\t]").map(_.trim).toList.distinct
Using the above input, my code returns labelsList with a value of:
List(0002ba73-e60c-4ffb-9131-c1612b904658, US, 6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 002f50e4-48cc-4b14-bb80-0502068b6161, 6om, 6p8, 15m9, 6ok, 6oo, 004b5edc-c0cc-4ffd-bef3-980bd92b92e6, 6nz, 6ql, 6p6, 005cc990-83dc-4e63-a4b6-58f38241e8fd, 6qf, 6og, 6nx, 005fe1ea-b918-48a3-a495-1f8ac12935ba, 6p0, 6ox, 6ol, 6o5, 00777555-83d4-401e-861b-5892f3aa3e1c, 6rk, 6o0, 00cbcc7d-1b48-4c5c-8141-8fc8f62b7b07, 6qg, 010254a6-2ef0-4a24-aa4d-3cc6656a55de, 011b905c-fbf3-441a-8912-a94cc0fe8a1d, 6ow, 6qi, 6qe, 011fffa6-0b9f-4d88-8ced-ce1cc864984f, 6oc)
How can I get my code to run properly and ignore anything contained within the first 2 columns of the CSV?
You can ignore the first two columns and then split the third by the comma. Finally, a toSet will get rid of the duplicate identifiers.
import scala.io.Source

val f = Source.fromFile("input_file.csv")
val lastColumns = f.getLines().map(_.split("\t")(2))
val uniques = lastColumns.flatMap(_.split(",")).toSet
uniques foreach println
Using Scala 2.13 resource management:
util.Using(io.Source.fromFile("input_file.csv")){
  _.getLines()
   .foldLeft(Array.empty[String]){
     _ ++ _.split("\t")(2).split(",")
   }.distinct.toList
}
//res0: scala.util.Try[List[String]] =
//  Success(List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc))
The .toList can be dropped if an Array result is acceptable.
This is what you can do. I am demonstrating on a sample DataFrame; you can replace it with yours:
val Df = Seq(("7369", "SMITH", "2010-12-17", "800.00"),
             ("7499", "ALLEN", "2011-02-20", "1600.00")).toDF("empno", "ename", "hire_date", "sal")
val reqCols = Seq(2)
val finalDf = Df.select(reqCols map Df.columns map col: _*)
finalDf.show
Note: this is a 0-based index, so pass 2 to get the third column.
If you want distinct values from your desired column, you can use distinct along with mkString:
val Df = Seq(("7369", "SMITH", "2010-12-17", "800.00"),
             ("7499", "ALLEN", "2011-02-20", "1600.00"),
             ("7499", "ALLEN", "2011-02-20", "1600.00")).toDF("empno", "ename", "hire_date", "sal")
val reqCols = Seq(2)
val distinctValues = Df.select(reqCols map Df.columns map col: _*).distinct.collect.mkString(",").filterNot("[]".toSet)
println(distinctValues)
The dates are duplicated; the code above removes the duplicates.
Another method, using a regex:
val data = scala.io.Source.fromFile("source.txt").getLines()
data.toList.flatMap { line =>
  """\S+\s+\S+\s+(\S+)""".r.findAllMatchIn(line).map(x => x.group(1).split(",").toList)
}.flatten.distinct
// res0: List[String] = List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)
Create Python string placeholder (%s) n times
I am looking to automatically generate the following string in Python 2.7, using a loop based on the number of columns in a pandas DataFrame:
INSERT INTO table_name (firstname, lastname) VALUES (534737, 100.115)
This assumes that the DataFrame has 2 columns. Here is what I have:
# Generate test numbers for table:
df = pd.DataFrame(np.random.rand(5,2), columns=['firstname','lastname'])

# Create list of tuples from numbers in each row of DataFrame:
list_of_tuples = [tuple(x) for x in df.values]
Now, I create the string. Manually, this works:
add_SQL = "INSERT INTO table_name (firstname, lastname) VALUES %s" % (list_of_tuples[4],)
In this example, I only used 2 column names: 'firstname' and 'lastname'. But I must do this with a loop, since I have 156 column names; I cannot do this manually.
What I need: I need to automatically generate the placeholder %s the same number of times as the number of columns in the pandas DataFrame. Here the DataFrame has 2 columns, so I need an automatic way to generate %s twice. Then I need to create a tuple with 2 entries, without the ''.
My attempt:
sss = ['%s' for x in range(0,len(list(df)))]
add_SQL = "INSERT INTO table_name (" + sss + ") VALUES %s" % (len(df), list_of_tuples[4])
But this is not working. Is there a way for me to generate this string automatically?
Here is what I came up with; it is based on dwanderson's approach in the 2nd comment of the original post (question):
table_name = name_a  # name of table

# Loop through all columns of the dataframe and generate one string per column:
cols_n = df.columns.tolist()
placeholder = ",".join(["%s"] * df.shape[1])  # df.shape[1] gives the number of columns
column_names = ",".join(cols_n)
insrt = "INSERT INTO %s " % table_name

for qrt in range(0, df.shape[0]):
    add_SQL_a_1 = insrt + "(" + column_names + ") VALUES (" + placeholder + ")"  # part 1/2
    add_SQL_a_2 = add_SQL_a_1 % list_of_tuples[qrt]  # part 2/2
This way, the final string is in part 2/2. For some reason, it would not let me do this all in one line and I can't figure out why.
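The one-line version most likely fails because of operator precedence: % binds more tightly than +, so without parentheses the % is applied only to the final ")" literal instead of to the whole template. A minimal sketch of the fix, reusing the variable names from the answer above:
# wrap the concatenation in parentheses so % formats the complete template
add_SQL = (insrt + "(" + column_names + ") VALUES (" + placeholder + ")") % list_of_tuples[qrt]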