Save data as text file from spark to hdfs - apache-spark

I processed data using pySpark and sqlContext using the following query:
(sqlContext.sql("select LastUpdate, Count(1) as Count from temp_t group by LastUpdate")
.rdd.coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count"))
It is stored in the following format:
Row(LastUpdate=u'2016-03-14 12:27:55.01', Count=1)
Row(LastUpdate=u'2016-02-18 11:56:54.613', Count=1)
Row(LastUpdate=u'2016-04-13 13:53:32.697', Count=1)
Row(LastUpdate=u'2016-02-22 17:43:37.257', Count=5)
But I want to store the data in a Hive table as
LastUpdate Count
2016-03-14 12:27:55.01 1
. .
. .
Here is how I create the table in Hive:
CREATE TABLE Data_Count(LastUpdate string, Count int )
ROW FORMAT DELIMITED fields terminated by '|';
I tried many options but was not successful. Please help me on this.

Why not load the data into Hive directly, without first saving the file and then loading it into Hive?
from datetime import datetime, timedelta
from pyspark.sql import HiveContext, Row

# Assumes an existing SparkContext named sc (as in the pyspark shell)
hiveCtx = HiveContext(sc)
# Create sample data
currTime = datetime.now()
currRow = Row(LastUpdate=currTime)
delta = timedelta(days=1)
futureTime = currTime + delta
futureRow = Row(LastUpdate=futureTime)
lst = [currRow, currRow, futureRow, futureRow, futureRow]
# Parallelize the list and convert it to a DataFrame
myRdd = sc.parallelize(lst)
df = myRdd.toDF()
df.registerTempTable("temp_t")
# Aggregate and save the result directly as a Hive table
aggRDD = hiveCtx.sql("select LastUpdate,Count(1) as Count from temp_t group by LastUpdate")
# On Spark 1.4+ the writer API is used; older 1.3 code called aggRDD.saveAsTable("Data_Count")
aggRDD.write.saveAsTable("Data_Count")

You created a table; now you need to fill it with the data you generated.
I believe this can be run from a Spark HiveContext:
LOAD DATA INPATH '/apps/hive/warehouse/Count' INTO TABLE Data_Count
Alternatively, you may want to build an external table over the data:
CREATE EXTERNAL TABLE IF NOT EXISTS Data_Count(
LastUpdate DATE,
Count INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/apps/hive/warehouse/Count';
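Note that the files written by saveAsTextFile in the question contain the text form of each Row, which a '|'-delimited table cannot parse. A minimal sketch of one way around this, assuming each row is turned into a delimited string before it is saved. `to_delimited_line` is a hypothetical helper, not part of Spark, and the Spark call appears only as a comment because it needs a running cluster:

```python
def to_delimited_line(row, sep="|"):
    """Join the fields of a row (any tuple-like, such as a pyspark Row) with sep."""
    return sep.join(str(field) for field in row)


# Stand-ins for the Row objects shown in the question (a Row is a tuple subclass).
sample_rows = [
    ("2016-03-14 12:27:55.01", 1),
    ("2016-02-22 17:43:37.257", 5),
]
lines = [to_delimited_line(r) for r in sample_rows]
print("\n".join(lines))

# With Spark available, the same helper could be mapped over the RDD, e.g.:
# df.rdd.map(to_delimited_line).coalesce(1).saveAsTextFile("/apps/hive/warehouse/Count")
```

Lines written this way should match the fields terminated by '|' clause of the table definitions above.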

Related

Does PySpark run operation out-of-sequence due to optimization?

I'm confused about the result my code is giving me. Here is the code I wrote:
def update_cassandra(df: DataFrame, aggr: str):
    aggr_map_dict = {
        'Giornaliera': 'day',
        'Settimanale': 'week',
        'Bi-Settimanale': 'bi_week',
        'Mensile': 'month'
    }
    # Date range covered by the incoming csv data
    max_min_dates = df.agg(F.max(df['data']), F.min(df['data'])).collect()[0]
    upper_date = max_min_dates[0]
    lower_date = max_min_dates[1]
    df = (df.select('data', 'punto_di_interesse', 'id_telco', 'presenze', 'presenze_uniche', 'presenze_00_06','presenze_06_08', 'presenze_08_10', 'presenze_10_12', 'presenze_12_14', 'presenze_14_16', 'presenze_16_18', 'presenze_18_20', 'presenze_20_22', 'presenze_22_24')
    )
    print('contenuto del csv')  # "content of the csv"
    display(df.where(F.col('punto_di_interesse')== 'CC - Neapolis'))
    # Rows already stored in Cassandra for the same date range
    telco_day_aggr = read_from_cassandra_dev(f'telco_{aggr_map_dict[aggr]}_aggr').where(F.col('data').between(lower_date,upper_date))
    if telco_day_aggr.count() == 0:
        telco_day_aggr = create_empty_df()
    print('telco_day_aggr as is')
    display(telco_day_aggr.where(F.col('punto_di_interesse')== 'CC - Neapolis'))
    union_df = df.union(telco_day_aggr)
    print('unione del AS-IS e del csv')  # "union of the AS-IS data and the csv"
    display(union_df.where(F.col('punto_di_interesse')== 'CC - Neapolis'))
    # Sum every presence column per (date, point of interest, telco id)
    output_df = (union_df.groupBy('data', 'punto_di_interesse', 'id_telco')
        .agg(
            F.sum('presenze').alias('presenze'),
            F.sum('presenze_uniche').alias('presenze_uniche'),
            F.sum('presenze_00_06').alias('presenze_00_06'),
            F.sum('presenze_06_08').alias('presenze_06_08'),
            F.sum('presenze_08_10').alias('presenze_08_10'),
            F.sum('presenze_10_12').alias('presenze_10_12'),
            F.sum('presenze_12_14').alias('presenze_12_14'),
            F.sum('presenze_14_16').alias('presenze_14_16'),
            F.sum('presenze_16_18').alias('presenze_16_18'),
            F.sum('presenze_18_20').alias('presenze_18_20'),
            F.sum('presenze_20_22').alias('presenze_20_22'),
            F.sum('presenze_22_24').alias('presenze_22_24')
        )
    )
    return output_df

aggregate_df = aggregate_table(df_daily, 'Giornaliera')
write_on_cassandra_dev(aggregate_df, 'telco_day_aggr')
What I want to achieve is a sort of update for Cassandra, because of the Cassandra drivers. The operations I have in mind are:
1) read the csv from blob storage and store it in a dataframe (the df variable, input of the method)
2) using the max and min dates of this csv file, query the table in Cassandra and save the result in another variable
3) concatenate the two dataframes
4) sum the values with the groupBy
5) write the new dataframe to Cassandra, overwriting the existing rows with the new ones
It seems to me that, somehow, the content of the dataframe "df" is written before I can read "telco_day_aggr", so that the union and groupBy steps have no effect. In other words, my Cassandra table contains only the content of df.
I can provide additional information if needed.
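To make the expected result explicit, the intended steps can be sketched in plain Python. This only illustrates the union-then-sum logic on made-up rows with a single value column; `merge_and_sum` is a hypothetical helper, and nothing here touches Spark or Cassandra:

```python
from collections import defaultdict


def merge_and_sum(new_rows, existing_rows, key_fields, value_fields):
    """Concatenate two lists of dict rows and sum value_fields per key."""
    totals = defaultdict(lambda: dict.fromkeys(value_fields, 0))
    for row in list(new_rows) + list(existing_rows):
        key = tuple(row[k] for k in key_fields)
        for field in value_fields:
            totals[key][field] += row[field]
    # Rebuild one dict per key, in first-seen order.
    return [dict(zip(key_fields, key), **sums) for key, sums in totals.items()]


keys = ["data", "punto_di_interesse", "id_telco"]
# Made-up rows standing in for the csv and for the rows read from Cassandra.
csv_rows = [
    {"data": "2023-01-01", "punto_di_interesse": "CC - Neapolis", "id_telco": 1, "presenze": 10},
]
cassandra_rows = [
    {"data": "2023-01-01", "punto_di_interesse": "CC - Neapolis", "id_telco": 1, "presenze": 5},
    {"data": "2023-01-02", "punto_di_interesse": "CC - Neapolis", "id_telco": 1, "presenze": 7},
]
merged = merge_and_sum(csv_rows, cassandra_rows, keys, ["presenze"])
print(merged)
```

If the Cassandra table ends up holding only the csv values (10 rather than 15 for the first key here), the rows read from Cassandra did not take part in the sum.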

How to extract specific time interval on working days with sql in apache spark?

I loaded a csv file into a SQL table in Databricks, which uses Apache Spark. I need to filter a column of that table which has content like:
01.01.2018,15:25
01.01.2018,00:10
01.01.2018,13:20
...
...
and keep only the rows that fall on working days, with a time between 8:30 and 9:30 a.m. How should I do that? Should I first split the column into two columns? I found how to do parts of this with data that I enter into Databricks directly, but this data is part of a SQL table.
Also, some commands from classical SQL do not work on Apache Spark, that is, in Databricks.
This is the code for reading the data:
# File location and type
file_location = "/FileStore/tables/NEZ_OPENDATA_2018_20190125-1.csv"
file_type = "csv"
# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
display(df)
# Create a view or table
temp_table_name = "NEZ_OPENDATA_2018_20190125"
df.createOrReplaceTempView(temp_table_name)
%sql
/* Query the created temp table in a SQL cell */
select * from `NEZ_OPENDATA_2018_20190125`
permanent_table_name = "NEZ_OPENDATA_2018_20190125"
df.write.format("parquet").saveAsTable(permanent_table_name)
Reading the file as text is probably more appropriate, since each line is a single timestamp made up of the date and the time. You can then filter on the day of week and the time using the relevant PySpark functions. Note that dayofweek returns 1 for Sunday, 2 for Monday, and so on.
import pyspark.sql.functions as F
file_location = "/FileStore/tables/NEZ_OPENDATA_2018_20190125-1.csv"
df = spark.read.text(file_location).toDF('timestamp')
result = df.select(
    F.to_timestamp('timestamp', 'dd.MM.yyyy,HH:mm').alias('timestamp')
).filter(
    F.dayofweek('timestamp').isin([2,3,4,5,6]) & (
        ( (F.hour('timestamp') == 8) & (F.minute('timestamp').between(30,59)) ) |
        ( (F.hour('timestamp') == 9) & (F.minute('timestamp').between(0,30)) )
    )
)
If you want to show the output, you can do result.show() or display(result).
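To sanity-check the filter on a few values without a cluster, the same condition can be written in plain Python. This is only a sketch: `is_working_morning` is a hypothetical helper, "working day" is taken to mean Monday to Friday (public holidays are not considered), and both ends of the window are inclusive, as in the Spark filter above:

```python
from datetime import datetime, time


def is_working_morning(raw, start=time(8, 30), end=time(9, 30)):
    """Return True if raw ('dd.MM.yyyy,HH:mm') is Mon-Fri and within [start, end]."""
    ts = datetime.strptime(raw, "%d.%m.%Y,%H:%M")
    # Python's weekday() is 0 for Monday .. 6 for Sunday,
    # unlike Spark's dayofweek, which is 1 for Sunday .. 7 for Saturday.
    return ts.weekday() < 5 and start <= ts.time() <= end


samples = ["01.01.2018,15:25", "01.01.2018,08:45", "06.01.2018,09:00", "02.01.2018,09:30"]
kept = [s for s in samples if is_working_morning(s)]
print(kept)
```

Comparing whole time values avoids splitting the window into an hour-8 part and an hour-9 part.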

Table in Pyspark shows headers from CSV File

I have a csv file with the contents below, which has a header in the first line.
id,name
1234,Rodney
8984,catherine
I was able to create a table in Hive that skips the header and reads the data appropriately.
Table in Hive
CREATE EXTERNAL TABLE table_id(
`tmp_id` string,
`tmp_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-testing/test/data/'
tblproperties ("skip.header.line.count"="1");
Results in Hive
select * from table_id;
OK
1234 Rodney
8984 catherine
Time taken: 1.219 seconds, Fetched: 2 row(s)
But when I use the same table in pyspark (running the same query), the header line from the file also shows up in the results, as below.
>>> spark.sql("select * from table_id").show(10,False)
+------+---------+
|tmp_id|tmp_name |
+------+---------+
|id |name |
|1234 |Rodney |
|8984 |catherine|
+------+---------+
Now, how can I keep the header from showing up in the results in pyspark?
I'm aware that we can read the csv file and add .option("header",True) to achieve this, but I want to know if there's a way to do something similar in pyspark while querying tables.
Can someone suggest a way? Thanks 🙏 in advance!
You can use the two kinds of properties below:
SerDe properties and table properties. With them you will be able to access the table from both Hive and Spark, skipping the header in both environments.
CREATE EXTERNAL TABLE `student_test_score_1`(
student string,
age string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'delimiter'=',',
'field.delim'=',',
'header'='true',
'skip.header.line.count'='1',
'path'='hdfs:<path>')
LOCATION
'hdfs:<path>'
TBLPROPERTIES (
'spark.sql.sources.provider'='CSV')
This is a known issue, SPARK-11374, which was closed as Won't Fix.
In the query you can add a where clause that selects all records except the 'id' and 'name' header row.
spark.sql("select * from table_id where tmp_id <> 'id' and tmp_name <> 'name'").show(10,False)
#or
spark.sql("select * from table_id where tmp_id != 'id' and tmp_name != 'name'").show(10,False)
Another way would be reading the files from HDFS with .option("header","true").
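One caveat with the where clause shown above: `tmp_id <> 'id' and tmp_name <> 'name'` also drops any data row in which just one of the two columns happens to equal the header text. A stricter condition drops a row only when both columns match. The sketch below checks that logic on plain tuples; the sample rows are made up, SQL NULL handling is not modelled, and the SQL string at the end is an assumption about how the condition could be phrased, not something run here:

```python
HEADER = ("id", "name")


def loose_filter(rows):
    """Mirror of: tmp_id <> 'id' AND tmp_name <> 'name'."""
    return [r for r in rows if r[0] != HEADER[0] and r[1] != HEADER[1]]


def strict_filter(rows):
    """Mirror of: NOT (tmp_id = 'id' AND tmp_name = 'name')."""
    return [r for r in rows if not (r[0] == HEADER[0] and r[1] == HEADER[1])]


# The last row is a made-up data row whose name column equals the header text.
rows = [("id", "name"), ("1234", "Rodney"), ("8984", "catherine"), ("5555", "name")]
print(loose_filter(rows))
print(strict_filter(rows))

# Possible Spark SQL form of the strict condition (assumption, not run here):
STRICT_QUERY = "select * from table_id where not (tmp_id = 'id' and tmp_name = 'name')"
```

The loose version silently loses the ("5555", "name") row, while the strict version keeps it and still removes the header.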

PySpark Pushing down timestamp filter

I'm using PySpark version 2.4 to read some tables using jdbc with a Postgres driver.
df = spark.read.jdbc(url=data_base_url, table="tablename", properties=properties)
One column is a timestamp column and I want to filter it like this:
df_new_data = df.where(df.ts > last_datetime )
This way the filter is pushed down as a SQL query, but the datetime format is not right. So I tried this approach:
df_new_data = df.where(df.ts > F.date_format( F.lit(last_datetime), "y-MM-dd'T'hh:mm:ss.SSS") )
but then the filter is not pushed down anymore.
Can someone clarify why this is the case?
When loading data from a database table, if you want to push a query down to the database and get back only a few result rows, you can provide a query instead of the table name and get just its result as a DataFrame. This way the database engine processes the query and returns only the results to Spark.
The table parameter identifies the JDBC table to read. You can use anything that is valid in a SQL query FROM clause. Note that a subquery used this way must be given an alias.
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
df.show()
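Applied to the timestamp filter from the question, the subquery string could be built as follows. This is a sketch under assumptions: `build_pushdown_query` is a hypothetical helper, the table and column names are taken from the question, and the timestamp is formatted as a plain `YYYY-MM-DD HH:MM:SS.ffffff` literal, which Postgres should accept. The Spark call is left as a comment because it needs a live database:

```python
from datetime import datetime


def build_pushdown_query(table, ts_column, last_datetime, alias="t"):
    """Return a parenthesised subquery usable as the `table` argument of spark.read.jdbc."""
    # Format the timestamp as a literal the database can parse; this assumes
    # table and ts_column are trusted identifiers, not user input.
    literal = last_datetime.strftime("%Y-%m-%d %H:%M:%S.%f")
    return f"(select * from {table} where {ts_column} > '{literal}') {alias}"


query = build_pushdown_query("tablename", "ts", datetime(2020, 1, 2, 3, 4, 5))
print(query)

# With a Spark session and JDBC properties available (not run here):
# df_new_data = spark.read.jdbc(url=data_base_url, table=query, properties=properties)
```

Because the comparison is part of the SQL text sent to the database, it does not depend on whether Spark decides to push the filter down.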

SparkSQL: Am I doing it right?

Here is how I use Spark SQL in a little application I am working on.
I have two HBase tables, say t1 and t2.
My input is a csv file; I parse each line and query (Spark SQL) the table t1. I write the output to another file.
Then I parse the second file, query the second table, apply certain functions to the result, and output the data.
The table t1 has the purchase details, and t2 has the list of items that each user added to the cart, along with the time frame.
Input -> CustomerID (a list of them in a csv file)
Output -> A csv file in the particular format mentioned below.
CustomerID, Details of the item he bought, First item he added to cart, All the items he added to cart until purchase.
For an input of 1100 records, it takes two hours to complete the whole process!
I was wondering if I could speed up the process, but I am stuck.
Any help?
How about this DataFrame approach:
1) Create a dataframe from the CSV.
how-to-read-csv-file-as-dataframe
or something like this example:
val csv = sqlContext.sparkContext.textFile(csvPath).map {
  case(txt) =>
    try {
      val reader = new CSVReader(new StringReader(txt), delimiter, quote, escape, headerLines)
      val parsedRow = reader.readNext()
      Row(mapSchema(parsedRow, schema) : _*)
    } catch {
      case e: IllegalArgumentException => throw new UnsupportedOperationException("converted from Arg to Op except")
    }
}
2) Create another DataFrame from the HBase data (if you are using Hortonworks) or Phoenix.
3) Do the join and apply functions (maybe a udf, or when/otherwise, etc.); the result can be a dataframe again.
4) Join the result dataframe with the second table and output the data as CSV, as in the example pseudocode below.
It should be possible to prepare a dataframe with custom columns and corresponding values and save it as a CSV file.
You can try this kind of thing in the spark shell as well.
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("inferSchema","true").
load("cars93.csv")
val df2=df.filter("quantity <= 4.0")
val col=df2.col("cost")*0.453592
val df3=df2.withColumn("finalcost",col)
df3.write.format("com.databricks.spark.csv").
option("header","true").
save("output-csv")
Hope this helps.. Good luck.
