loops - Huge files: sort with Spark and iterate through the whole data set -


I am new to Spark and still don't understand whether a developer needs to be aware of parallelism, or whether it is transparent to the developer. My problem is the following: I need to sort records (by cityId and timestamp) spread across a large number of large files, and then iterate through the sorted records in a loop to perform a calculation. I cannot simply group the records, because the order of records (by timestamp) within the same city matters for the calculation. How can I do this in Spark?
If I use an RDD, iterating through it means doing so on one machine, because the loop needs the whole data set and the data cannot fit on a single machine. Is that right?
I have read a lot about RDDs but I am still missing a part: if the data cannot fit on one machine but lives in one RDD, and I want to iterate through it in one loop, do I need to control something? Is there a way to have one loop across the cluster?

In general:

  • If the computation is strictly sequential (one loop over all records), Spark won't provide an advantage.

Here (since the order only matters within groups defined by city), it can:

  • partition the data by city
  • sort each partition by timestamp
  • perform the sequential computation on each partition independently (see the sketch after this list)
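A minimal sketch of this approach in Scala, using the RDD API's repartitionAndSortWithinPartitions. The Record fields (cityId, timestamp, value) and the per-record calculation are hypothetical placeholders; the essential idea is a custom partitioner that routes each city to exactly one partition while the sort key also includes the timestamp:

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record layout; substitute your own fields.
case class Record(cityId: Int, timestamp: Long, value: Double)

// Partition on cityId only, so every record of a city lands on the same
// partition, while the full key (cityId, timestamp) orders records within it.
class CityPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (cityId: Int, _) =>
      val mod = cityId % numPartitions
      if (mod < 0) mod + numPartitions else mod
  }
}

object SortedCityPass {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sorted-city-pass"))

    // Stand-in for reading the actual large files.
    val records: RDD[Record] = sc.parallelize(Seq(
      Record(1, 100L, 1.0), Record(1, 50L, 2.0), Record(2, 75L, 3.0)
    ))

    val result = records
      .map(r => ((r.cityId, r.timestamp), r)) // composite key for the sort
      .repartitionAndSortWithinPartitions(new CityPartitioner(8))
      .mapPartitions { iter =>
        // Each partition now holds complete cities, with each city's records
        // in timestamp order, so a plain sequential loop is safe here.
        iter.map { case ((cityId, _), rec) =>
          (cityId, rec.value) // placeholder for the real per-record calculation
        }
      }

    result.collect().foreach(println)
    sc.stop()
  }
}
```

The same shape should also work with the DataFrame API (repartition on the city column followed by sortWithinPartitions and mapPartitions), but the RDD version makes the partition-then-sort contract explicit.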
