A mapreduce task:
Key1 in file_one is a1,a2,a3,a10,a11,a12;
Key2 in file_two is persona1, persona1, persona2, persona3, persona12,persona12,persona3, persona11, persona10.
Merge_file = JOIN file_one BY Key1, file_two BY Key2; (how do I write this?)
And since the second key has duplicates, does it matter?
thanks
My suggestion would be to create a new column for each dataset and join on that, e.g.:
A = foreach file_one generate *, SUBSTRING(key1, 1, 100) as join_key1;
B = foreach file_two generate *, SUBSTRING(key2, 7, 100) as join_key2;
C = join A by join_key1, B by join_key2;
Given a list of integers, I would like to insert every integer into a new row in a Postgres table, ideally in an efficient way (i.e. not looping through and inserting one by one). arr = [1,2,3,4,5]. What I've tried is converting this to a list of tuples, arr2 = [(i,) for i in arr], and then feeding this into Postgres with cur.execute("INSERT INTO my_table (my_value) VALUES (%s)", arr2), but I am receiving an error: Not all arguments converted during string formatting. What exactly am I doing wrong here?
Full code
import psycopg2
conn = psycopg2.connect(host="myhost", database="mydb", user="postgres", password="password", port="5432")
cur = conn.cursor()
arr = [1,2,3,4,5]
arr2 = [(i,) for i in arr]
cur.execute("INSERT INTO my_table (my_value) VALUES (%s)", arr2
I am not familiar yet with psycopg2 (working on it, but a ways to go), so I'll give the pure SQL version. Postgres has a pretty good set of built-in array functions, one being UNNEST(). That function takes an array as a parameter and returns the individual entries. So you just need to provide an array to the query (see demo).
insert into my_table(my_column)
select unnest( array [1,2,3,4,5] );
Borrowing (i.e. copying) your code, perhaps:
import psycopg2
conn = psycopg2.connect(host="myhost", database="mydb", user="postgres", password="password", port="5432")
cur = conn.cursor()
arr = [1,2,3,4,5]
cur.execute("insert into my_table (my_column) select unnest (array [%s])", arr
But I am not sure if that gets the Postgres Array structure; it neede the [].
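For what it's worth, psycopg2 adapts a plain Python list to a PostgreSQL ARRAY on its own, so a sketch of the same idea (reusing the question's my_table / my_value and connection details) could look like this:
import psycopg2
conn = psycopg2.connect(host="myhost", database="mydb", user="postgres", password="password", port="5432")
cur = conn.cursor()
arr = [1, 2, 3, 4, 5]
# psycopg2 sends the list as ARRAY[1,2,3,4,5]; unnest() expands it into one row per element.
# A cast such as %s::int[] can be added if the array type turns out to be ambiguous.
cur.execute("INSERT INTO my_table (my_value) SELECT unnest(%s)", (arr,))
conn.commit()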
What exactly am I doing wrong here?
You are trying to insert a list of integers into a single row.
Instead, use execute_values() to insert many rows in a single query. Do not forget to commit the insert:
#...
cur = conn.cursor()
arr = [1,2,3,4,5]
arr2 = [(i,) for i in arr]
from psycopg2.extras import execute_values
execute_values(cur, "INSERT INTO my_table (my_value) VALUES %s", arr2)
conn.commit() # important!
I have this (simplified) Spark dataset with these columns:
"col1", "col2", "col3", "col4"
And I would like to add a new column: "result".
The value of "result" is the return value of a function that takes all the other columns ("col1", "col2", ...) values as parameters.
map/foreach can't change the iterated row, and UDF functions don't take a whole row as a parameter, so I will have to collect all the column names as input, and I will also have to specify each column type in the UDF registration part.
Notes:
The dataset doesn't have a lot of rows, so I don't mind a low-performance solution.
The dataset does have a lot of columns with different types, so specifying all the columns in the UDF registration part doesn't seem like the most elegant solution.
The project is written in Java, so I'm using the Java API to interact with Spark.
How can I achieve that behavior?
You actually could add a new column with a map.
df.map { row =>
val col1 = row.getAs[String]("col1")
val col2 = row.getAs[String]("col2")
// etc, extract all your columns
....
val newColumn = col1 + col2
// do what you need to do to obtain the value for the new column
(col1, col2, ..., newColumn)
}.toDF("col1", "col2", ..., "new")
In terms of the Java API this will be much the same, with some adjustments:
data.map((MapFunction<Row, Tuple3<String, String, String>>) row -> {
String col1 = row.getAs("col1");
String col2 = row.getAs("col2");
// whatever you need
String newColumns = col1 + col2;
return new Tuple3<>(col1, col2, newColumns);
}, Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING()))
.toDF("col1", "col2", ..., "new")
Alternatively, you could collect all your columns into an array and then process this array in your UDF.
val transformer = udf { arr: Seq[Any] =>
// do your stuff but beware of types
}
data.withColumn("array", array($"col1", $"col2", ..., $"colN"))
.select($"col1", $"col2",..., transformer($"array") as "newCol")
I've found a solution for my question:
String[] allColumnsAsStrings = dataset.columns();
final Column[] allColumns = Arrays.stream(allColumnsAsStrings).toArray(Column[]::new);
UserDefinedFunction addColumnUdf = udf((Row row) -> {
double score;
// Calculate stuff based on the row values
// ...
return score;
}, DataTypes.DoubleType
);
dataset = dataset.withColumn("score", addColumnUdf.apply(functions.struct(allColumns)));
I am trying to create a sqlite3 statement in Python 3 to collect data from two tables called FreightCargo & Train, where a train ID is the input value. I want to use Pandas since it's easy to read the tables.
I have created the code below, which is working perfectly fine, but it's static and looks up only one hard-coded value in the statement.
import pandas as pd
SQL = '''SELECT F.Cargo_ID, F.Name, F.Weight, T.Train_ID, T.Assembly_date
FROM FreightCargo F LEFT JOIN [Train] T
ON F.Cargo_ID = T.Cargo_ID
WHERE Train_ID = 2;'''
cursor = conn.cursor()
cursor.execute( SQL )
names = [x[0] for x in cursor.description]
rows = cursor.fetchall()
Temp = pd.DataFrame( rows, columns=names)
Temp
I want to be able to fill in that value from user input. The outcome of the query will then be determined by what the user has entered. For example, the user is asked for a train_id, which is a primary key in a table, and the records related to that train will be listed.
I expanded the code, but I am getting an error: ValueError: operation parameter must be str
Train_ID = input('Train ID')
SQL = '''SELECT F.Cargo_ID, F.Name, F.Weight, T.Train_ID, T.Assembly_date
FROM FreightCargo F LEFT JOIN [Train] T
ON F.Cargo_ID = T.Cargo_ID
WHERE Train_ID = ?;''', (Train_ID)
cursor = conn.cursor()
cursor.execute( SQL )
names = [x[0] for x in cursor.description]
rows = cursor.fetchall()
Temp = pd.DataFrame( rows, columns=names)
Temp
The problem lies in your definition of the SQL variable.
You are creating a tuple of two elements: the query string and the parameter. If you print(SQL) you will see something like ('''SELECT ... ?;''', 'your_users_input').
When you pass this to cursor.execute(sql[, parameters]), it expects a string as the first argument, plus the "optional" parameters. Your parameters are not really optional here, since your query contains a ? placeholder, and they must be a collection, for example a tuple.
You can unpack your SQL tuple with cursor.execute(*SQL), which passes each element as a separate argument (provided the parameters are themselves a tuple, e.g. (Train_ID,)), or you can move the parameters to the execute call:
Train_ID = input('Train ID')
SQL = '''SELECT F.Cargo_ID, F.Name, F.Weight, T.Train_ID, T.Assembly_date
FROM FreightCargo F LEFT JOIN [Train] T
ON F.Cargo_ID = T.Cargo_ID
WHERE Train_ID = ?;'''
cursor = conn.cursor()
cursor.execute( SQL, (Train_ID,) )
names = [x[0] for x in cursor.description]
rows = cursor.fetchall()
Temp = pd.DataFrame( rows, columns=names)
Temp
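Since the goal is to read the result into pandas anyway, a shorter variant (just a sketch, reusing the same conn and query) is to let pandas run the parameterized query itself with pd.read_sql_query, which forwards params to the cursor:
import pandas as pd
Train_ID = input('Train ID')
SQL = '''SELECT F.Cargo_ID, F.Name, F.Weight, T.Train_ID, T.Assembly_date
FROM FreightCargo F LEFT JOIN [Train] T
ON F.Cargo_ID = T.Cargo_ID
WHERE Train_ID = ?;'''
# pandas fills the ? placeholder safely and picks up the column names for us.
Temp = pd.read_sql_query(SQL, conn, params=(Train_ID,))
Temp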
I quite often write a function to return different dataframes based on the parameters I enter. Here's an example dataframe:
np.random.seed(1111)
df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B','Group C','Group D'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000),
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2018',
freq='M'), 10000)})
I then created a function to perform sub-totals for me like this:
def some_fun(DF1, agg_column, myList=[], *args):
y = pd.concat([
DF1.assign(**{x:'[Total]' for x in myList[i:]})\
.groupby(myList).agg(sumz = (agg_column,'sum')) for i in range(1,len(myList)+1)]).sort_index().unstack(0)
return y
I then write out lists that I'll pass as arguments to the function:
list_one = [pd.Grouper(key='Date',freq='A'),'Category','Product']
list_two = [pd.Grouper(key='Date',freq='A'),'Category','Sub-Category','Sub-Category-2']
list_three = [pd.Grouper(key='Date',freq='A'),'Sub-Category','Product']
I then have to run each list through my function creating new dataframes:
df1 = some_fun(df,'Units_Sold',list_one)
df2 = some_fun(df,'Dollars_Sold',list_two)
df3 = some_fun(df,'Units_Sold',list_three)
I then use a function to write each of these dataframes to an Excel worksheet. This is just an example - I perform this same exercise 10+ times.
My question - is there a better way to perform this task than to write out df1, df2, df3 with the function applied by hand? Should I be looking at using a dictionary or some other data type to do this more pythonically with a function?
A dictionary would be my first choice:
variations = [('Units_Sold', list_one), ('Dollars_Sold', list_two),
              ..., ('Title', some_list)]
df_variations = {}
for i, v in enumerate(variations):
name = v[0]
data = v[1]
df_variations[i] = some_fun(df, name, data)
You might further consider setting the keys to unique / helpful titles for the variations that go beyond something like 'Units_Sold', which isn't unique in your case.
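For example, a quick sketch with made-up descriptive titles (pick whatever names identify each report for you):
variations = {
    'units_by_category_product': ('Units_Sold', list_one),
    'dollars_by_subcategories': ('Dollars_Sold', list_two),
    'units_by_subcategory_product': ('Units_Sold', list_three),
}
# Build every sub-total table in one pass, keyed by its title.
df_variations = {name: some_fun(df, agg_column, cols)
                 for name, (agg_column, cols) in variations.items()}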
IIUC,
As Thomas has suggested, we can use a dictionary to parse through your data; with some minor modifications to your function, the dictionary can hold all the required data and be passed through to your function.
The idea is to pass two types of keys: the list of columns and the arguments to your pd.Grouper call.
data_dict = {
"Units_Sold": {"key": "Date", "freq": "A"},
"Dollars_Sold": {"key": "Date", "freq": "A"},
"col_list_1": ["Category", "Product"],
"col_list_2": ["Category", "Sub-Category", "Sub-Category-2"],
"col_list_3": ["Sub-Category", "Product"],
}
def some_fun(dataframe, agg_col, dictionary,column_list, *args):
key = dictionary[agg_col]["key"]
frequency = dictionary[agg_col]["freq"]
myList = [pd.Grouper(key=key, freq=frequency), *dictionary[column_list]]
y = (
pd.concat(
[
dataframe.assign(**{x: "[Total]" for x in myList[i:]})
.groupby(myList)
.agg(sumz=(agg_col, "sum"))
for i in range(1, len(myList) + 1)
]
)
.sort_index()
.unstack(0)
)
return y
Test.
df1 = some_fun(df,'Units_Sold',data_dict,'col_list_3')
print(df1)
sumz
Date 2016-12-31 2017-12-31 2018-12-31
Sub-Category Product
X Product 1 18308 17839 18776
Product 2 18067 19309 18077
Product 3 17943 19121 17675
[Total] 54318 56269 54528
Y Product 1 20699 18593 18103
Product 2 18642 19712 17122
Product 3 17701 19263 20123
[Total] 57042 57568 55348
Z Product 1 19077 17401 19138
Product 2 17207 21434 18817
Product 3 18405 17300 17462
[Total] 54689 56135 55417
[Total] [Total] 166049 169972 165293
As you want to automate the writing of the 10+ worksheets, we can again do that with a dictionary that drives the calls to your function:
matches = {'Units_Sold': ['col_list_1','col_list_3'],
'Dollars_Sold' : ['col_list_2']}
Then a simple for loop writes everything to a single Excel file, one worksheet per result; change this to match your required behavior.
writer = pd.ExcelWriter('finished_excel_file.xlsx')
for key,value in matches.items():
for items in value:
dataframe = some_fun(df,key,data_dict,items)
dataframe.to_excel(writer,f'{key}_{items}')
writer.save()
Please bear with me on this one. I have three RDDs (coming from Hadoop). All three have unique keys, such as ipaddress and boxnumber, on which they can be matched/joined. Here is some sample data from all tables. Table A's boxnumber column has to be converted to a number before it can be matched.
Table A:
ipaddress|boxnumber|cardnumber
94.254.57.16|59774DEa1|0D1EDF40
94.154.57.176|5F7377Ga9|0D3F796D
Table B:
cardno,boxnumber
1500914,2000096
1500413,2211469
Table C:
ipaddress|kanal|bitrate|kanaltimespent|date|country
94.254.57.16|sky|2023|003DF6A.ts|12-02-2016|chile
94.154.57.176|ITV|3425|003DF6A.ts|23-04-2014|egypt
My first attempt in Java:
//TABLE A
JavaSparkContext sc = SetupSparkContext("SparkSample");
JavaRDD<ExtractTable_A> ta_RDD= ExtractTable_A.getRDD(sc);
JavaPairRDD<String, ExtractTable_A> A_PairRDD = ta_RDD.mapToPair(new PairFunction<ExtractTable_A, String, ExtractTable_A>()
{
@Override
public Tuple2<String, ExtractTable_A> call(ExtractTable_A extractTable_A) throws Exception
{
String[] A = extractTable_A.toString().split("\\|");
return new Tuple2<>(A[0],extractTable_A);
}
});
//TABLE B
JavaRDD<ExtractTable_B> tb_RDD = ExtractTable_B.getRDD(sc);
JavaPairRDD<String, ExtractTable_B> B_PairRDD = tb_RDD.mapToPair(new PairFunction<ExtractTable_B, String, ExtractTable_B>()
{
@Override
public Tuple2<String, ExtractTable_B> call(ExtractTable_B extractTable_B) throws Exception
{
String[] B = extractTable_B.toString().split(",");
return new Tuple2<>(B[1],extractTable_B);
}
});
//TABLE C
JavaRDD<ExtractTable_C> tc_RDD = ExtractTable_C.getRDD(sc);
JavaPairRDD<String, ExtractTable_C> C_PairRDD = tc_RDD.mapToPair(new PairFunction<ExtractTable_C, String, ExtractTable_C>()
{
@Override
public Tuple2<String, ExtractTable_C> call(ExtractTable_C extractTable_C) throws Exception
{
String[] C = extractTable_C.toString().split("\\|");
return new Tuple2<>(C[0], extractTable_C);
}
});
// At this point I need to join and create a .txt output file
The final result should be a file with these headers:
KANAL|BITRATE|TIMESPENT|DATE|COUNTRY
===update===
I have managed to join Table A and Table B, but now I am stuck on how to join Table C to Table A.
//Joined table A and B
JavaPairRDD<String, Tuple2<ExtractTableA, ExtractTableB>> join_1 = A_PairRDD.join(B_PairRDD);
. . .
//Joined table A and C
JavaPairRDD<String, Tuple2<ExtractTableA, ExtractTableC>> Join_2 = A_PairRDD.join(B_PairRDD);
// Output results from TableA and TableB
join_1.map(in -> {
return new ResultStringBuilder("|")
.append(Long.parseLong((in._2()._1().getCardno().trim()),16))
.append(Long.parseLong((in._2()._1().getBoxno().trim()),16))
.append(in._2()._2().getBoxno())
// *** HERE I NEED TO ALSO APPEND THE COLUMNS FROM TableC
.toString();
})
.saveAsTextFile("c:\\outfile");
Remember that when you are working with the Spark API, you always want to create a new RDD when you modify anything in the RDD structure, because RDDs are immutable.
In order to do a three-way join in this case, you need to create a new JavaPairRDD after you join the first two tables, because you want a PairRDD with a new key-value pair: the keys you can join on are different for Tables A, B and C.
There are two ways to do this (either join A-B first or A-C first).
The way you could join tables is like this:
Table A - Table B (PairRDD with key : boxnumber or cardnumber or maybe both)
After you join Table A and Table B, you need to create a new PairRDD with key ipaddress because you want to join with Table C.
// joinedAB is the RDD resulting from the join of Table A and Table B,
// keyed by boxnumber; re-key it by the ipaddress taken from the Table A
// record (assuming ExtractTable_A exposes a getter for it).
JavaPairRDD<String, Tuple2<ExtractTable_A, ExtractTable_B>> joinedABForC =
    joinedAB.mapToPair(t -> new Tuple2<>(t._2()._1().getIpaddress(), t._2()));
// now joinedABForC has ipaddress as the RDD's key
// join rdd joinedABForC with Table C
After we have moved the join column into the key of the PairRDD, you can join it with Table C, and the three-way join is done:
Joined Table AB - Table C (PairRDD with key: ipaddress)