Loops: sort huge files in Spark and iterate through the whole data set
I am new to Spark, and I still don't understand whether the developer needs to be aware of parallelism or whether it is transparent to the developer. My problem is the following: I need to sort records (by cityId, then timestamp) from a large number of large files, and then iterate through the sorted records in a loop, performing a calculation. I cannot simply group them, because the order of the records (by timestamp) within the same city is important for the calculation. How can this be done in Spark?
If I use an RDD, iterating through it means doing it on one machine, because I need the whole data set, and the data cannot fit on a single machine. Right?
I have read a lot about RDDs, but I am still missing a part: if the data cannot fit on one machine but is in the same RDD, and I want to iterate through the data in one loop, do I need to control something? Is there a way to have one loop across the cluster?
In general:
- If the computation is sequential (one loop over all records), Spark won't provide any advantage.

Here (since the order is only important within the groups defined by city):
- partition the data by city,
- sort each partition by timestamp,
- perform the sequential computation on each partition, as in the sketch below.
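A minimal sketch of those three steps using the RDD API. The record layout (`Record`), the input path, the CSV parsing, and the per-city delta calculation are all assumptions standing in for the real data and logic; `repartitionAndSortWithinPartitions` and `mapPartitions` are the standard Spark primitives that do the partition-and-sort and the per-partition loop:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.SparkSession

// Hypothetical record layout; adjust to the real file format.
case class Record(cityId: Int, timestamp: Long, value: Double)

// Routes a (cityId, timestamp) key by cityId only, so every record of a
// city lands in the same partition; the timestamp is left to the sort.
class CityPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (cityId: Int, _) =>
      // Non-negative modulo of the city id, ignoring the timestamp.
      ((cityId % numPartitions) + numPartitions) % numPartitions
  }
}

object SortedCityLoop {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("SortedCityLoop").getOrCreate()
    val sc = spark.sparkContext

    // Assumed input: CSV lines "cityId,timestamp,value" (placeholder path).
    val records = sc.textFile("hdfs:///path/to/input/*").map { line =>
      val fields = line.split(",")
      Record(fields(0).toInt, fields(1).toLong, fields(2).toDouble)
    }

    // Key by (cityId, timestamp): the partitioner groups by city, and the
    // implicit tuple ordering sorts each partition by city, then timestamp.
    val keyed = records.map(r => ((r.cityId, r.timestamp), r))
    val sorted = keyed.repartitionAndSortWithinPartitions(
      new CityPartitioner(records.getNumPartitions))

    // Sequential loop within each partition; partitions run in parallel
    // on different executors across the cluster.
    val deltas = sorted.mapPartitions { iter =>
      var prev: Option[Record] = None
      iter.flatMap { case (_, rec) =>
        // Example calculation (hypothetical): gap between consecutive
        // timestamps of the same city.
        val out = prev.collect {
          case p if p.cityId == rec.cityId =>
            (rec.cityId, rec.timestamp - p.timestamp)
        }
        prev = Some(rec)
        out
      }
    }

    deltas.saveAsTextFile("hdfs:///path/to/output")
    spark.stop()
  }
}
```

Because `mapPartitions` hands you an iterator that streams through the partition, you never need a whole partition (let alone the whole data set) in memory on one machine, as long as you don't materialize the iterator. Each partition's loop runs on whichever executor holds that partition, which is effectively "one loop across the cluster", run once per city group. With the DataFrame API, `df.repartition(col("cityId")).sortWithinPartitions("cityId", "timestamp")` followed by `mapPartitions` achieves the same layout.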