Tue, Aug 30, 2016
After many months of hard work, we kind of, sort of have a document ingestion pipeline that seems to work. By this I mean that we have a minimal set of configuration and ingestion tools that compile and run without crashing, and that seem to ingest files mostly as expected. We still need to do a lot of testing, tuning, and evaluation, but I thought it would be helpful to take this time to walk through the process of bringing up an index from a set of chunk files extracted from Wikipedia.