apache spark - Does DStream's RDD pull entire data created for the batch interval at one shot?
I have gone through this Stack Overflow question. As per the answer, Spark Streaming creates a DStream with only one RDD per batch interval.
For example:
My batch interval is 1 minute, and the Spark Streaming job is consuming data from a Kafka topic.
My question is: does the RDD available in the DStream pull/contain the entire data for the last 1 minute? Are there any criteria or options we need to set to pull all the data created in the last 1 minute?
If I have a Kafka topic with 3 partitions, and all 3 partitions contain data for the last 1 minute, will the DStream pull/contain all the data created in the last 1 minute in all the Kafka topic partitions?
Update:
In which case does a DStream contain more than 1 RDD?
A Spark Streaming DStream is consuming data from a Kafka topic that is partitioned, say into 3 partitions on 3 different Kafka brokers.
Does the RDD available in the DStream pull/contain the entire data for the last 1 minute?
Not quite. The RDD only describes the offsets to read data from when its tasks are submitted for execution. It is just like the other RDDs in Spark, which are only (?) a description of what to do and where to find the data to work on when their tasks are submitted.
If, however, you use "pulls/contains" in a looser way, to express that at some point the records (from the partitions at the given offsets) are going to be processed, then yes, you're right: the entire minute is mapped to offsets, and the offsets are in turn mapped to the records that Kafka hands over for processing.
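The idea that a batch is only a description of offsets until its tasks run can be sketched with a small toy model. This is not Spark's or Kafka's API: `OffsetRange`, `plan_batch`, `read_range` and the sample data are all hypothetical names invented for illustration, and the topic is modelled as plain in-memory lists.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical stand-in for the per-partition range one batch describes."""
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(last_read: Dict[int, int], latest: Dict[int, int]) -> List[OffsetRange]:
    """Describe one batch: per partition, from the last offset read up to the
    latest offset known when the batch is planned. No records are fetched here."""
    return [OffsetRange(p, last_read[p], latest[p]) for p in sorted(latest)]


def read_range(log: Dict[int, List[str]], r: OffsetRange) -> List[str]:
    """Fetch the records for one range; in this model this is the 'task' step."""
    return log[r.partition][r.from_offset:r.until_offset]


# Toy topic with 3 partitions (assumed data, list index == offset).
topic_log = {
    0: ["a0", "a1", "a2"],
    1: ["b0", "b1"],
    2: ["c0", "c1", "c2", "c3"],
}
last_read = {0: 1, 1: 0, 2: 2}  # where the previous batch stopped
latest = {p: len(records) for p, records in topic_log.items()}

batch = plan_batch(last_read, latest)  # only a description of what to read
records = [rec for r in batch for rec in read_range(topic_log, r)]
```

In this sketch `batch` covers all three partitions, and the records are touched only when `read_range` is called, which mirrors the distinction the answer draws between describing offsets and processing records.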
In all the Kafka topic partitions?
Yes. It's Kafka, not Spark Streaming / DStream / RDD, that handles it. The DStream's RDDs request records from the topic(s) and their partitions per offsets, from the last time it queried up to now.
The minute for Spark Streaming might be slightly different from the minute for Kafka, since the DStream's RDDs contain records per offsets, not records per time.
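A toy comparison may help show why selecting by offsets and selecting by time can disagree. The sketch below assumes a single partition whose records carry a timestamp set before they were appended, and assumes the first batch was planned when only offsets 0 and 1 existed; `by_offsets`, `by_time` and the data are hypothetical and do not correspond to any Spark or Kafka call.

```python
from typing import List, Tuple

# Each record is (offset, event_time_seconds). Assumed toy data: the record at
# offset 3 carries a timestamp inside the first minute but was appended late.
partition_log: List[Tuple[int, float]] = [
    (0, 10.0),
    (1, 58.0),
    (2, 61.0),
    (3, 59.5),
]


def by_offsets(log: List[Tuple[int, float]], from_offset: int, until_offset: int):
    """Select records the way an offset range does: [from_offset, until_offset)."""
    return [rec for rec in log if from_offset <= rec[0] < until_offset]


def by_time(log: List[Tuple[int, float]], start: float, end: float):
    """Select records by their timestamp: [start, end)."""
    return [rec for rec in log if start <= rec[1] < end]


# First batch, planned when the latest offset was 2 (an assumption of this model).
first_batch = by_offsets(partition_log, 0, 2)
# What "everything from the first minute" would mean if selected by timestamp.
first_minute = by_time(partition_log, 0.0, 60.0)
```

Here the late record at offset 3 belongs to the first minute by timestamp but lands in a later batch by offset, which is the kind of mismatch the answer hints at.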
In which case does a DStream contain more than 1 RDD?
Never.