Tagger escapes at Stream index: 34160469 #10
@petulla it might be worth checking the Solr logs for any errors there?
This is the error. Any ideas? The same document ID throws the error every time.
Full readout:
Hmm, it looks like we are running into a hard-coded bound on the size of the index here, not sure if we can do much about it! We probably need to report that upstream to Solr. I haven't got much time to investigate this right now though. If you want a quick fix, try narrowing down the scope of the profile (by selecting smaller classes of Wikidata items to include), which should decrease the size of the index and hopefully avoid this bug. Sorry that I cannot give a more satisfactory fix!
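For readers wanting to try the narrowing workaround: the sketch below shows the general idea of class-based filtering of Wikidata entities, i.e. keeping only items that are an instance of (P31) an allowed class. Note this is an illustrative sketch using the standard Wikidata JSON dump layout, not OpenTapioca's actual profile format; the `matches_profile` helper and the choice of Q5 (human) are hypothetical.

```python
def matches_profile(entity, allowed_classes):
    """Return True if a Wikidata entity is an instance (P31) of one of
    the allowed classes. `entity` follows the Wikidata JSON dump layout:
    claims -> P31 -> mainsnak -> datavalue -> value -> id."""
    for claim in entity.get("claims", {}).get("P31", []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue and datavalue.get("value", {}).get("id") in allowed_classes:
            return True
    return False

# Hypothetical narrowed scope: keep humans (Q5) only, dropping
# organizations and places to shrink the resulting index.
ALLOWED = {"Q5"}

human = {"id": "Q42", "claims": {"P31": [
    {"mainsnak": {"datavalue": {"value": {"id": "Q5"}}}}]}}
print(matches_profile(human, ALLOWED))  # → True
```

Streaming `latest-all.json.bz2` through a filter like this (or, equivalently, restricting the classes listed in the profile JSON) reduces the number of indexed documents and may keep the index under the bound that triggers the error.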
Hm, I'm confused, because I assumed you had run this on the full Wikipedia dataset.
I have indeed run this on the full Wikidata dump, but that was a while ago now and Wikidata grows all the time, so it is totally possible that this error appeared in the meantime. Yes, I would change
Can you try running a recent dump and seeing if it works for you? I'm trying Facebook's recent NEL codebase now, but I may need to return to this, and I'm concerned a fix may take several hours at minimum.
I do intend to re-run this myself on a recent dump in the coming months, I will report back here once this is done. |
I raised another issue related to this. I can't get past indexing at stream index 34160469. The
tapioca index-dump wiki_collection latest-all.json.bz2 --profile profiles/human_organization_place.json
step fails at this point every time. Any idea what might be happening? The previous steps completed successfully. Is there a pre-trained model I can use in place of any of the steps for testing?
Solr 8.2
Python 3.7.4
Mac OS Mojave