Basic Lucene Content Search configuration?

Hi,

I’m having trouble making Lucene Search work properly in Pydio.

I have created a workspace with .txt and .docx files in it. I understand that .docx files are not indexed, but the .txt files are not indexed either.

I enabled “Lucene Indexing” in the Workspace configuration, then went to the workspace and chose “Index Content”.

When I go to search > advanced > content search, I get no results for my queries. However I can search and find files when I choose filename search.

Then I read in the documentation: "You must make sure to add a meta source “index.lucene” to the repositories that you want to be indexed."

What does that mean, and could that be the cause? I simply do not know what is meant by repository (I know what a repo is, but is that in Pydio terms the workspace?)

I would really like to see a monkey-see, monkey-do description of how to make Lucene Search index the content of my text files. The current documentation seems like you have to be “in-the-know” to understand it. A Lucene / Pydio noob like me is dropped on the floor. :slight_smile:

Anyone?

  • I have enabled Lucene Search on Workspace.
  • Created a .txt file with test content to search on.
  • Pressed “Index Content” on Workspace
  • Searched on content using Content Search (More search options).

No result!

Did you configure the Lucene engine in the plugins?

This is how the configuration looks. The plugin is enabled.

And on the workspace I’m testing:

What I may be doing wrong is the “be sure to add index.lucene to the repository” part. I didn’t really understand that, so I tried adding an empty index.lucene file to the workspace path. I also tried adding a file called .ajxp_meta containing the line “index.lucene”. Neither worked.

For .doc and .docx files you must include the unoconv configuration;
for PDF files you need pdftotext.
Treating a .doc or .docx file as a .txt file may lead to corrupted indexes…
The documentation of the Lucene plugin (shown on the configuration page) clearly states that indexing those file types requires extra components.

Yeah, I understood that actually. However, as mentioned, I have also created a .txt file with some words in it, and that is not indexed either.

The .txt file should have been indexed, but if the index is already compromised the results can be wrong.
I’m pretty sure that in my installation (made from scratch) Lucene is actually working on PDF and .doc files, not only on .txt ones.

Hmmm… Can you help me understand the index.lucene part?

Indexes are saved in the cache directory of your installation: ./data/cache/indexes (depending on your settings).
If an index gets corrupted, searching on that index no longer works.
I’m not sure whether an index command actually drops and recreates the index or just updates the differences.
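
If a corrupted index is suspected, one cautious option is to move the existing index files aside rather than delete them, then re-run indexation. This is only a rough sketch: it assumes the ./data/cache/indexes location mentioned above, and the function name and backup naming are made up for illustration.

```shell
#!/bin/sh
# Move the contents of an index directory into a timestamped sibling
# directory, leaving the (now empty) index directory in place.
# The path passed in is an assumption; adjust it to your installation.
backup_indexes() {
    index_dir="${1%/}"
    [ -d "$index_dir" ] || { echo "no such directory: $index_dir" >&2; return 1; }
    backup_dir="${index_dir}.bak.$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup_dir" || return 1
    # Move every entry (including dotfiles) one level below index_dir.
    find "$index_dir" -mindepth 1 -maxdepth 1 -exec mv {} "$backup_dir"/ \;
    # Print where the old files went so they can be restored if needed.
    echo "$backup_dir"
}

# Usage (hypothetical, run from the Pydio install root):
# backup_indexes ./data/cache/indexes
```

The index directory itself is kept so its ownership and permissions stay as they were; only its contents are moved.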

Interesting, I located the folder. Deleted all the indexes, re-ran indexation and I can see new files were created.

2_1.del
_2.sti
_3.sti
read.lock.file
segments_6
write.lock.file
_2.cfs
_3.cfs
optimization.lock.file
read-lock-processing.lock.file
segments.gen

However, the only results I get from search are by filename. Content search simply doesn’t work for me.

Pydio documentation says: “You must make sure to add a meta source “index.lucene” to the repositories that you want to be indexed.”

I’m still looking for an answer to understand that part. How do I add a meta source to the repository?

Go into Settings and select the workspace; on the left there is a section named Additional features.
Make sure that:
Lucene Search engine is listed in that section.

After that, click Lucene Search engine and verify that the Index Content switch is enabled.

Everything mentioned is enabled. If that is all you had to do, to make it work, then I guess I’m out of luck.

Okay, found the solution. I feel like screaming.

Apparently a workspace is not a repository. However, a folder inside a workspace is a repository (I have no idea why this is not mentioned in the docs).

So if I create a folder, put a txt file in it, the contents are indexed.
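
For anyone wanting to reproduce that test from the server side, here is a minimal sketch. The workspace path is whatever your workspace points to on disk, and the folder name, file name, and function name are arbitrary examples, not anything Pydio requires.

```shell
#!/bin/sh
# Create a subfolder inside the workspace with a small text file in it,
# so there is known content to search for after indexing.
make_test_doc() {
    workspace_path="${1%/}"
    test_dir="$workspace_path/lucene-test"   # arbitrary example folder name
    mkdir -p "$test_dir" || return 1
    printf 'pydio lucene content search test\n' > "$test_dir/sample.txt"
    # Print the path of the file that was written.
    echo "$test_dir/sample.txt"
}

# Usage (hypothetical path):
# make_test_doc /path/to/your/workspace
```

After that, run “Index Content” on the workspace again and try a content search for a word from the file, such as “lucene”. The file has to be readable by the user the web server runs as.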

Thanks for your support romoloman. You already helped me once, with the Collabora Online. <3

Just a quick update. Since the indexer doesn’t parse .docx files, it won’t even index their filenames. So I ended up deciding to disable content indexing. Now I can at least search on all filenames.

For document files (docx, doc, odt, xls, ods, ppt, odp) the solution is simple:

UNOCONV + XPDF INTEGRATION
If you can install the unoconv utility on your server, along with the OpenOffice or LibreOffice headless suite, and the xpdf utility, the plugin will be able to extract and index textual contents from office documents (Word, Excel, PowerPoint and all their closed or open-source variants).
On Ubuntu 16.04:
apt-get install unoconv xpdf

For PDF files you need to install pdftotext.
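
Once the packages are installed, it can help to confirm that the tools are actually found on the PATH before looking at Pydio itself. A minimal sketch; the function name is made up, and www-data as the web server user is an assumption that depends on your distribution.

```shell
#!/bin/sh
# Report whether each named tool can be found on the current PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if path=$(command -v "$tool"); then
            echo "found: $tool -> $path"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
# check_tools unoconv pdftotext
#
# To check as the web server user (www-data is an assumption):
# sudo -u www-data sh -c 'command -v unoconv; command -v pdftotext'
#
# A manual extraction test, with sample.pdf being any PDF you have:
# pdftotext sample.pdf -
```

If the manual extraction prints the document text, the tool itself works and the remaining suspects are the paths and permissions configured in Pydio.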

Yeah, I already tried that, unfortunately without success. I did everything according to the configuration: the binaries are installed, the paths are entered in Pydio, and Pydio runs as a user that can access the binaries. I click launch indexation, go to search and more, and enter a string in the content field. No results.