Does nodetool cleanup affect Apache Spark rdd.count() of a Cassandra table?


I've been tracking the growth of some big Cassandra tables using Spark's rdd.count(). Until now the behavior was consistent with expectations: the number of rows kept growing.
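For context, the counts come from something like the following minimal sketch using the Spark Cassandra Connector. The connection host, keyspace, and table names are placeholders, not the ones from my cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds sc.cassandraTable(...)

object TableGrowthCount {
  def main(args: Array[String]): Unit = {
    // Placeholder connection settings -- adjust host/keyspace/table for your cluster.
    val conf = new SparkConf()
      .setAppName("cassandra-table-growth")
      .set("spark.cassandra.connection.host", "10.0.0.1")

    val sc = new SparkContext(conf)

    // Full table scan: the connector splits the token ring into Spark partitions
    // and counts the rows it reads back.
    val rowCount = sc.cassandraTable("my_keyspace", "my_big_table").count()
    println(s"row count: $rowCount")

    sc.stop()
  }
}
```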

Today I ran nodetool cleanup on one of the seed nodes and, as usual, it ran for 50+ minutes.

Now rdd.count() returns about a third of the rows it did before.

Did I destroy data by running nodetool cleanup? Or is the Spark count unreliable and counting ghost keys? I got no errors during cleanup and the logs don't show anything out of the usual. It seemed like a successful operation, until now.

Update 2016-11-13

It turns out the Cassandra documentation set me up for the loss of 25+ million rows of data.

The documentation is explicit:

Use nodetool status to verify that the node is bootstrapped and all other nodes are up (UN) and not in any other state. After all new nodes are running, run nodetool cleanup on each of the previously existing nodes to remove the keys that no longer belong to those nodes. Wait for cleanup to complete on one node before running nodetool cleanup on the next node.

Cleanup can be safely postponed for low-usage hours.

Well, I checked the status of the other nodes via nodetool status and they were all normal (UN). But here's the catch: you also need to run nodetool describecluster, where you might find that the schemas are not synced.

My schemas were not synced, and I ran cleanup anyway, when the nodes were UN and running, exactly as the documentation says. The Cassandra documentation does not mention nodetool describecluster after adding new nodes.

So I merrily added the nodes, waited until they were UN (Up/Normal), and ran cleanup.

As a result, 25+ million rows of data are gone. I hope this helps others avoid a dangerous pitfall. The DataStax documentation sets you up to destroy data by recommending cleanup as a step in the process of adding new nodes.

In my opinion, the cleanup step should be taken out of the new-node procedure documentation altogether. It should be mentioned elsewhere, as a general cleanup practice, not in the same section as adding new nodes. It's like recommending rm -rf / as one of the steps of virus removal. It will surely remove the virus...

Thank you, Aravind R. Yarram, for the reply. I came to the same conclusion as your reply and came here to update this. I appreciate the feedback.

I am guessing you might have either added/removed nodes from the cluster or decreased the replication factor before running nodetool cleanup. Until you run cleanup, I guess Cassandra still reports the old key ranges as part of rdd.count(), since the old data still exists on those nodes.

Reference: https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolscleanup.html
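One way to sanity-check whether a discrepancy like this comes from the Spark-side scan or from Cassandra itself is to compare the plain RDD count with the connector's pushed-down count. A minimal sketch, assuming a Spark Cassandra Connector version that provides cassandraCount() and using placeholder host/keyspace/table names:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CountCrossCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-count-crosscheck")
      .set("spark.cassandra.connection.host", "10.0.0.1") // placeholder host

    val sc = new SparkContext(conf)
    val rdd = sc.cassandraTable("my_keyspace", "my_big_table") // placeholder names

    // Rows streamed into Spark and counted on the Spark side.
    val sparkSideCount = rdd.count()

    // Count pushed down to Cassandra, executed per token range.
    val cassandraSideCount = rdd.cassandraCount()

    println(s"Spark-side count:     $sparkSideCount")
    println(s"Cassandra-side count: $cassandraSideCount")

    sc.stop()
  }
}
```

If both numbers agree, the drop is on the Cassandra side rather than an artifact of how Spark scans the table.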

