Loops: sort huge files in Spark and iterate through the whole data set
I am new to Spark, and I still don't understand whether the developer needs to be aware of parallelism or whether it is transparent to the developer. My problem is the following: I need to sort records (by cityId, then timestamp) from a large number of large files, and then iterate through the sorted records in a loop, performing a calculation. I cannot simply group them, because the order of the records (by timestamp) within the same city is important for the calculation. How can this be done in Spark?
If I use an RDD, iterating through it means doing it on one machine, because I need the whole data set, and the data cannot fit on a single machine. Right?
I have read a lot about RDDs, but I am still missing a part: if the data cannot fit on one machine but is in the same RDD, and I want to iterate through the data in one loop, do I need to control something? Is there a way to have one loop across the cluster?
In general:
- If the computation is sequential (one loop over all records), Spark won't provide any advantage.

Here (since the order is only important within the groups defined by city):
- partition the data by city,
- sort each partition by timestamp,
- perform the sequential computation on each partition, as in the sketch below.
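A minimal sketch of those three steps using the RDD API. The record layout (`Record`), the input path, the CSV parsing, and the per-city delta calculation are all assumptions standing in for the real data and logic; `repartitionAndSortWithinPartitions` and `mapPartitions` are the standard Spark primitives that do the partition-and-sort and the per-partition loop:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.SparkSession

// Hypothetical record layout; adjust to the real file format.
case class Record(cityId: Int, timestamp: Long, value: Double)

// Routes a (cityId, timestamp) key by cityId only, so every record of a
// city lands in the same partition; the timestamp is left to the sort.
class CityPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (cityId: Int, _) =>
      // Non-negative modulo of the city id, ignoring the timestamp.
      ((cityId % numPartitions) + numPartitions) % numPartitions
  }
}

object SortedCityLoop {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SortedCityLoop").getOrCreate()
    val sc = spark.sparkContext

    // Assumed input: CSV lines "cityId,timestamp,value" (placeholder path).
    val records = sc.textFile("hdfs:///path/to/input/*").map { line =>
      val fields = line.split(",")
      Record(fields(0).toInt, fields(1).toLong, fields(2).toDouble)
    }

    // Key by (cityId, timestamp): the partitioner groups by city, and the
    // implicit tuple ordering sorts each partition by city, then timestamp.
    val keyed = records.map(r => ((r.cityId, r.timestamp), r))
    val sorted = keyed.repartitionAndSortWithinPartitions(
      new CityPartitioner(records.getNumPartitions))

    // Sequential loop within each partition; partitions run in parallel
    // on different executors across the cluster.
    val deltas = sorted.mapPartitions { iter =>
      var prev: Option[Record] = None
      iter.flatMap { case (_, rec) =>
        // Example calculation (hypothetical): gap between consecutive
        // timestamps of the same city.
        val out = prev.collect {
          case p if p.cityId == rec.cityId =>
            (rec.cityId, rec.timestamp - p.timestamp)
        }
        prev = Some(rec)
        out
      }
    }

    deltas.saveAsTextFile("hdfs:///path/to/output")
    spark.stop()
  }
}
```

Because `mapPartitions` hands you an iterator that streams through the partition, you never need a whole partition (let alone the whole data set) in memory on one machine, as long as you don't materialize the iterator. Each partition's loop runs on whichever executor holds that partition, which is effectively "one loop across the cluster", run once per city group. With the DataFrame API, `df.repartition(col("cityId")).sortWithinPartitions("cityId", "timestamp")` followed by `mapPartitions` achieves the same layout.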