
Gradual Degradation Indexing HCP >200M objects

Question asked by Jason vanValkenburgh on May 18, 2017
Latest reply on May 22, 2017 by Frank Fodera

Hi,

 

We have about 500M e-mails in a single HCP namespace (8-node HCP and 8-node HCI) to index to an 8-node external Solr Cloud instance.  Following the tuning practices listed on this site and in consultation with HDS, we have been indexing HCP at the rate of ~41-43M e-mails a day, without issue. 

 

We're using the HCP MQE connector with a batch size of 25K, workflow batch size of 25K, weighting of 20 extra cores, etc. - and these settings have been working great.  We were seeing messages indicating batches of 25K documents were getting processed.

 

After about 220M e-mails and ~4-5 days in, we received a task failure (GC overhead limit reached) - no biggie - so we resumed the workflow and bumped up the heap on the job driver to be safe (the workers show plenty of memory).

 

The period when the job failed can be seen below (~8am) on the HCP monitoring console for network traffic.  But, at ~5pm - with no events as far as we can tell - we started to see a massive drop (the drop ~11am was a pause to restart HCP as a diagnostic).

 

 

Since this event, as you can see, throughput from HCP has continued to decline.  Reviewing the HCP HTTP gateway logs shows objects being retrieved from ecoStorage in ~30-50ms, so we don't believe HCP is the problem, but we paused our job, restarted the HCP cluster to be safe (that has cleared up issues before), and resumed indexing.  Throughput is still low (now ~4-7M/day vs. the ~42M/day rate we last saw).
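To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python using only the figures quoted in this post. The 220M "indexed so far" count is approximate, the 5.5M/day degraded rate is just an assumed midpoint of the 4-7M/day range, and the function names are made up for illustration.

```python
# Back-of-the-envelope throughput math from the figures quoted in this post.
SECONDS_PER_DAY = 24 * 60 * 60

TOTAL_EMAILS = 500_000_000    # approximate size of the namespace
INDEXED_SO_FAR = 220_000_000  # approximate count when the task failed
BATCH_SIZE = 25_000           # connector and workflow batch size


def docs_per_second(docs_per_day: float) -> float:
    """Average documents indexed per second at a given daily rate."""
    return docs_per_day / SECONDS_PER_DAY


def batches_per_hour(docs_per_day: float) -> float:
    """Full 25K batches completed per hour at a given daily rate."""
    return docs_per_day / 24 / BATCH_SIZE


def days_to_finish(docs_per_day: float) -> float:
    """Days needed to index the remaining e-mails at a given daily rate."""
    return (TOTAL_EMAILS - INDEXED_SO_FAR) / docs_per_day


if __name__ == "__main__":
    rates = {
        "healthy (~42M/day)": 42_000_000,
        "degraded (~5.5M/day, assumed midpoint)": 5_500_000,
    }
    for label, rate in rates.items():
        print(
            f"{label}: {docs_per_second(rate):.0f} docs/s, "
            f"{batches_per_hour(rate):.1f} batches/h, "
            f"{days_to_finish(rate):.1f} days to finish"
        )
```

At the healthy rate this works out to roughly 486 documents per second, about 70 full batches an hour, and under a week for the remaining ~280M e-mails. At the assumed degraded rate it is roughly 64 documents per second and about 51 days to finish, which is why we need to find the cause.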

 

We then tried restarting HCI; there was no improvement, and throughput is now even worse.  We also notice that messages in the UI seem to indicate only 1-2 jobs running at a time, and "processing a batch of 1 document."  We've gotten fewer than 25K documents over several hours.

The overall workload on the cluster looks low:

 

 

The questions I have are:

(a) is this expected behavior, and if so under what circumstances?

(b) what would the next steps be to isolate the cause?

(c) are there any recommended guides on using Kibana?  We set up Kibana and assume metrics are available there that could help identify, for example, whether Solr is slow to accept documents, but we're struggling with the tool.

 

Any pointers are appreciated,

Jason
