Apache Spark - Does a DStream's RDD pull the entire data created for the batch interval in one shot?

- January 15, 2014

I have gone through this Stack Overflow question; per the answer, a DStream creates one RDD per batch interval.

For example:

My batch interval is 1 minute, and the Spark Streaming job is consuming data from a Kafka topic.

My question is: does the RDD available in the DStream pull/contain the entire data for the last 1 minute? Are there any criteria or options I need to set to pull all the data created in the last 1 minute?

If I have a Kafka topic with 3 partitions, and all 3 partitions contain data for the last 1 minute, will the DStream pull/contain all the data created in the last 1 minute in all the Kafka topic partitions?

Update:

In which case does a DStream contain more than 1 RDD?

A Spark Streaming DStream is consuming data from a Kafka topic that is partitioned, say into 3 partitions on 3 different Kafka brokers.

Does the RDD available in the DStream pull/contain the entire data for the last 1 minute?

Not quite. The RDD only describes the offsets to read data from when its tasks are submitted for execution. It is just like the other RDDs in Spark, which are only (?) a description of what to do and where to find the data to work on when their tasks are submitted.

If, however, you use "pulls/contains" in a looser way to express that at some point the records (from the partitions at the given offsets) are going to be processed, then yes, you're right: the entire minute is mapped to offsets, and the offsets are in turn mapped to records that Kafka hands over for processing.
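To make the "description only" point concrete, here is a minimal plain-Python sketch. It is a toy model, not the Spark or Kafka API: `OffsetRange`, `FakeBroker` and `BatchDescription` are hypothetical names invented for illustration, and the topic name and offsets are made up. It only shows that building the batch description reads nothing, and that records are fetched when the work actually runs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical stand-in for one partition's slice of a batch."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive

    def count(self) -> int:
        return self.until_offset - self.from_offset


class FakeBroker:
    """Toy in-memory log per (topic, partition) that counts fetch calls."""

    def __init__(self, logs: Dict[Tuple[str, int], List[str]]):
        self.logs = logs
        self.fetches = 0

    def fetch(self, r: OffsetRange) -> List[str]:
        self.fetches += 1
        return self.logs[(r.topic, r.partition)][r.from_offset:r.until_offset]


class BatchDescription:
    """Models a batch 'RDD': it stores offset ranges only, never records."""

    def __init__(self, ranges: List[OffsetRange]):
        self.ranges = list(ranges)

    def compute(self, broker: FakeBroker) -> List[str]:
        # Records are fetched only here, i.e. when the "tasks" actually run.
        return [rec for r in self.ranges for rec in broker.fetch(r)]


broker = FakeBroker({
    ("events", 0): ["a0", "a1", "a2"],
    ("events", 1): ["b0", "b1"],
    ("events", 2): ["c0", "c1", "c2", "c3"],
})
batch = BatchDescription([
    OffsetRange("events", 0, 1, 3),
    OffsetRange("events", 1, 0, 2),
    OffsetRange("events", 2, 2, 4),
])
fetches_before = broker.fetches   # describing the batch has read nothing yet
records = batch.compute(broker)   # one fetch per partition happens here
```

In this model the batch covers all three partitions, but only as `(from_offset, until_offset)` pairs until `compute` is called.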

In all the Kafka topic partitions?

Yes. It is Kafka, not Spark Streaming / DStream / RDD, that handles it. The DStream's RDDs request records from the topic(s) and their partitions per offsets, from the last time it queried until now.

The minute for Spark Streaming might be slightly different from Kafka's, since the DStream's RDDs contain records for offsets, not records per time.
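The "from the last time it queried until now" rule can be sketched in a few lines of plain Python. This is an illustrative model under stated assumptions, not Spark's implementation: `next_batch` is a hypothetical helper, the offsets are invented, and starting an unseen partition at offset 0 is an assumption made only to keep the example small.

```python
from typing import Dict, Tuple


def next_batch(last_until: Dict[int, int],
               latest: Dict[int, int]) -> Dict[int, Tuple[int, int]]:
    """Return {partition: (from_offset, until_offset)} for one batch.

    from_offset is where the previous batch stopped (0 if the partition
    has not been seen, an assumption of this sketch); until_offset is the
    latest offset observed when the batch is cut.
    """
    return {p: (last_until.get(p, 0), latest[p]) for p in sorted(latest)}


# Hypothetical latest offsets observed at two consecutive batch ticks.
tick1 = {0: 120, 1: 95, 2: 130}
tick2 = {0: 180, 1: 95, 2: 210}

batch1 = next_batch({}, tick1)
batch2 = next_batch({p: until for p, (_, until) in batch1.items()}, tick2)
```

Here the second batch covers offsets 120 to 180 on partition 0 and 130 to 210 on partition 2, while partition 1 contributes an empty range because nothing new was observed. The batch boundary is wherever the offsets stood when the batch was cut, so a record produced close to the one-minute mark may land in either neighbouring batch.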

In which case does a DStream contain more than 1 RDD?

Never. Each batch interval yields exactly one RDD.
