Python KafkaConsumer start consuming messages from a timestamp - python-3.x

I'm planning to skip the start of the topic and only read messages from a certain timestamp to the end. Any hints on how to achieve this?

I'm guessing you are using kafka-python (https://github.com/dpkp/kafka-python) as you mentioned "KafkaConsumer".
You can use the offsets_for_times() method to retrieve the offset that matches a timestamp. https://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html#kafka.KafkaConsumer.offsets_for_times
Following that just seek to that offset using seek(). https://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html#kafka.KafkaConsumer.seek
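A minimal sketch of that two-step flow, assuming a hypothetical topic name, broker address, and starting timestamp:

from kafka import KafkaConsumer, TopicPartition

# Hypothetical broker and topic; timestamps are milliseconds since the Unix epoch.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partitions = [TopicPartition("my-topic", p) for p in consumer.partitions_for_topic("my-topic")]
consumer.assign(partitions)

start_ms = 1618500000000  # only read messages produced at or after this timestamp
offsets = consumer.offsets_for_times({tp: start_ms for tp in partitions})

for tp, offset_and_ts in offsets.items():
    if offset_and_ts is not None:  # None means no message at or after the timestamp
        consumer.seek(tp, offset_and_ts.offset)

for message in consumer:
    print(message.offset, message.value)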
Hope this helps!

I got around to it; however, I'm not sure about the values I got from the method.
I have a KafkaConsumer (ck), and I got the partitions for the topic with the assignment() method. With that, I can create a dictionary mapping each partition to the timestamp I'm interested in (in this case 100).
Side question: should I use 0 in order to get all the messages?
I can use that dictionary as the argument to offsets_for_times(). However, the values I got back are all None:
zz = dict(zip(ck.assignment(), [100] * len(ck.assignment())))
z = ck.offsets_for_times(zz)
z.values()
dict_values([None, None, None])

Related

Compress dataframe to one json string - apache-spark

I have a dataframe that, when I write it to JSON, produces several hundred lines of JSON that are all exactly the same. I am trying to compress it to one JSON line. Is there an out-of-the-box way to accomplish this?
import pyspark
import pyspark.sql.functions as F

def collect_metrics(df) -> pyspark.sql.DataFrame:
    # "count" is a column here, so use df["count"] rather than the count() method
    neg_value = df.where(df["count"] < 0).count()
    return df.withColumn("loader_neg_values", F.lit(neg_value))

def main(args):
    df_metrics = collect_metrics(df)
    df_metrics.write.json(args.metrics)
In the end the goal is to write one JSON line, and the output has to be a JSON file, not compressed.
It seems like you have hundreds of (duplicated) lines but you only want to keep one. You can use limit(1) in that case:
df_metrics.limit(1).write.json(args.metrics)
You want something like this:
df_metrics.limit(1).repartition(1).write.json(args.metrics)
.repartition(1) guarantees a single output part file, and .limit(1) guarantees a single output row.
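A runnable sketch of that combination, assuming a hypothetical SparkSession, sample data, and output path (note that write.json() always produces a directory; repartition(1) keeps it to a single part file):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical metrics dataframe: hundreds of identical rows
df_metrics = spark.range(300).select(F.lit(0).alias("loader_neg_values"))

# One row, one part file, written as JSON under the given directory
df_metrics.limit(1).repartition(1).write.mode("overwrite").json("/tmp/metrics_json")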

Logging from pandas udf

I am trying to log from a pandas udf called within a python transform.
Because the code is called on the executor, it does not show up in the driver's logs.
I have been looking at some options on SO, but so far the closest option is this one.
Any idea on how to surface the logs in the driver logs, or in any other log files available under build, is welcome.
import logging
from pyspark.sql.functions import pandas_udf, PandasUDFType

logger = logging.getLogger(__name__)

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def my_udf(my_pdf):
    logger.info('calling my udf')
    do_some_stuff()

results_df = my_df.groupby("Name").apply(my_udf)
As you said, the work done by the UDF is done by the executor, not the driver, and Spark captures the logging output from the top-level driver process. If you are using a UDF within your PySpark query and need to log data, create and call a second UDF that returns the data you wish to capture, and store it in a column to view once the build is finished:
import logging
from pyspark.sql import functions as F

logger = logging.getLogger(__name__)

def some_transformation(some_input):
    logger.info("log output related to the overall query")

    @F.udf("integer")
    def custom_function(integer_input):
        return integer_input + 5

    @F.udf("string")
    def custom_log(integer_input):
        return "Original integer was %d before adding 5" % integer_input

    df = (
        some_input
        .withColumn("new_integer", custom_function(F.col("example_integer_col")))
        .withColumn("debugging", custom_log(F.col("example_integer_col")))
    )
    return df
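A hypothetical usage sketch (input_df and the column names are assumptions, not part of the original answer): once the build finishes, the debugging column carries one log message per row and can be inspected in the output dataset.

# input_df is a hypothetical dataframe with an "example_integer_col" column
output_df = some_transformation(input_df)
output_df.select("example_integer_col", "new_integer", "debugging").show(5, truncate=False)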
I also explain another option, if you are more familiar with pandas, here:
How to debug pandas_udfs without having to use Spark?
Edit: I have a complete answer here: In Palantir Foundry, how do I debug pyspark (or pandas) UDFs since I can't use print statements?
It is not ideal (as it stops the code) but you can do
raise Exception(<variable_name>)
inside the pandas_udf and it gives you the value of the named variable.

If timestamp includes a certain time, then trigger an action

I am trying to trigger an action, based on part of the timestamp.
Code (error is in line 3):
for ticker in tickers:
    for i in range(1, len(ohlc_dict[ticker])):
        if ohlc_dict[ticker]['Date'][i] == "21:50:00":
            ohlc_dict:
Now there are 2 issues with this:
1: It doesn't recognize 'Date' (KeyError: 'Date').
2: It should get triggered at this time on every single day (if the timestamp includes "21:50:00" THEN ...).
Just would like to add an answer to my own question. I was over-complicating things by wanting to do it all in that one line. Instead, I went back to creating the dataframe and added this line:
ohlc_dict[ticker]['time'] = ohlc_dict[ticker].index.str[10:]
This adds one extra column to my DataFrame containing just the time, solving both issues.
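A minimal sketch of that approach with made-up data (the ticker, index format, and slice offset are assumptions; adjust the slice to match your index's date format):

import pandas as pd

tickers = ["AAPL"]  # hypothetical ticker
ohlc_dict = {
    "AAPL": pd.DataFrame(
        {"Close": [101.0, 102.5]},
        index=["2021-04-15 21:45:00", "2021-04-15 21:50:00"],
    )
}

for ticker in tickers:
    df = ohlc_dict[ticker]
    # keep only the time part of the string index
    df["time"] = df.index.str[11:]
    for i in range(len(df)):
        if df["time"].iloc[i] == "21:50:00":
            print(f"{ticker}: trigger the action for row {i}")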

Conditional loading of partitions from file-system

I am aware that there have been questions regarding wildcards in pySpark's .load() function, like here or here.
Anyhow, none of the questions/answers I found dealt with my variation of it.
Context
In pySpark I want to load files directly from HDFS because I have to use the Databricks avro library for Spark 2.3.x. I'm doing so like this:
partition_stamp = "202104"
df = spark.read.format("com.databricks.spark.avro") \
.load(f"/path/partition={partition_stamp}*") \
.select("...")
As you can see the partitions are deriving from timestamps in the format yyyyMMdd.
Question
Currently I only get the partitions for April 2021 (partition_stamp = "202104").
However, I need all partitions starting from April 2021.
Written in pseudo-code, I'd need a solution something like this:
.load(f"/path/partition >= {partition_stamp}*")
Since there are actually several hundred partitions, any approach that requires hard-coding them is not an option.
So my question is: Is there a function for conditional file-loading?
As I learned, there are only the following options to dynamically process paths inside the .load() function:
*: Wildcard for any character or sequence of characters until the end of the line or a new sub-directory ('/') -> (/path/20200*)
[1-3]: Regex-like inclusion of a defined character-range -> (/path/20200[1-3]/...)
{1,2,3}: Set-like inclusion of a defined set of characters -> (/path/20200{1,2,3}/...)
Thus, to answer my question: There is no built-in function for conditional file-loading.
Anyhow, I want to provide you my solution:
import pandas as pd  # utilize pandas date functions

# start_date and end_date are the given boundaries (datetime-like or "YYYY-MM-DD" strings)
partition_stamp = ",".join(set(
    str(_range.year) + "{:02}".format(_range.month)
    for _range in pd.date_range(start=start_date, end=end_date, freq='D')
))

df = spark.read.format("com.databricks.spark.avro") \
    .load(f"/path/partition={{{partition_stamp}}}*") \
    .select("...")
This way the restriction to timestamps of format yyyyMM is generated dynamically for a given start and end date, and the string-based .load() remains usable.
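For illustration, here is what the generated stamp looks like for a hypothetical range (Q2 2021); the set ordering is arbitrary, which does not matter for the {a,b,c} glob:

import pandas as pd

start_date, end_date = "2021-04-01", "2021-06-30"  # hypothetical range
partition_stamp = ",".join(set(
    str(_range.year) + "{:02}".format(_range.month)
    for _range in pd.date_range(start=start_date, end=end_date, freq='D')
))
print(partition_stamp)                              # e.g. 202105,202104,202106
print(f"/path/partition={{{partition_stamp}}}*")    # the glob passed to .load()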

How to solve this data set

I am importing a dataset from Quandl using its API. Everything works, however the time series I am importing is reversed. By this I mean that if I use the .head() method to print the first elements of the dataset, I get the latest figures, while printing the tail gives the oldest figures.
import pandas as pd

df = pd.read_csv("https://www.quandl.com/api/v3/datasets/CHRIS/CME_CD4.csv?api_key=H32H8imfVNVm9fcEX6kB", parse_dates=['Date'], index_col='Date')
df.head()
This should be a pretty easy fix if I understand correctly. Credit to behzad.nouri and this answer: Right way to reverse pandas.DataFrame?
You just need to reverse the order of your dataframe using the line below.
df = df.reindex(index=df.index[::-1])
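A tiny demonstration with made-up data (the dates and values are hypothetical): the newest date comes first, as in the Quandl export, and reindexing with the reversed index restores chronological order.

import pandas as pd

df = pd.DataFrame(
    {"Settle": [98.4, 98.3, 98.1]},
    index=pd.to_datetime(["2021-04-16", "2021-04-15", "2021-04-14"]),
)
df.index.name = "Date"

df = df.reindex(index=df.index[::-1])
print(df.head())  # the oldest date (2021-04-14) now comes first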
