Basic Lucene Content Search configuration?

Raker · March 26, 2018, 12:34pm

Hi,

I’m having trouble with making Lucene Search work properly in Pydio.

I have created a workspace with .txt and .docx files in it (I understand that .docx files are not indexed); however, it doesn't index the .txt files either.

I enable “Lucene Indexing” in the Workspace configuration. I go to the workspace and I choose “Index Content”.

When I go to search > advanced > content search, I get no results for my queries. However I can search and find files when I choose filename search.

Then I read in the documentation: "You must make sure to add a meta source “index.lucene” to the repositories that you want to be indexed."

What does that mean, and could that be the cause? I simply do not know what is meant by repository (I know what a repo is, but is that in Pydio terms the workspace?)

I would really like to see a monkey-see, monkey-do description of how to make Lucene Search index the content of my text files. The current documentation seems like you have to be “in-the-know” to understand it. A Lucene / Pydio noob like me is dropped on the floor.

Raker · March 28, 2018, 9:24am

Anyone?

I have enabled Lucene Search on the workspace.
Created a .txt file with test content to search on.
Pressed “Index Content” on the workspace.
Searched on content using Content Search (More search options).

No result!

romoloman · March 28, 2018, 9:43am

Did you configure the Lucene engine in the plugins?

Raker · March 28, 2018, 12:13pm

This is how the configuration looks. The plugin is enabled.

And on the workspace I’m testing:

What I may be doing wrong is the “be sure to add index.lucene to the repository” part. I didn’t really understand that. So I tried adding an empty file named index.lucene to the workspace path. I also tried adding an empty file called .ajxp_meta and adding the line “index.lucene” to it. Neither worked.

romoloman · March 28, 2018, 12:22pm

For .doc and .docx files you must fill in the unoconv configuration;
for PDF files you need pdftotext.
Treating a .doc or .docx as a .txt file may lead to corrupted indexes…
The documentation of the Lucene plugin (shown on the configuration page) clearly states that indexing those file types requires extra components.

Raker · March 28, 2018, 12:36pm

Yeah, I understood that actually. However, as mentioned, I have also created a .txt file with some words in it, and that is not indexed either.

romoloman · March 28, 2018, 12:43pm

The .txt file should have been indexed, but if the index is already compromised the results can be wrong.
I’m pretty sure that in my installation (made from scratch) Lucene is actually working on PDF and .doc files too, not only on .txt ones.

Raker · March 28, 2018, 12:46pm

Hmmm… Can you help me understand the index.lucene part?

romoloman · March 28, 2018, 12:58pm

Indexes are saved in the cache directory of your installation: ./data/cache/indexes (depending on your settings)
If an index gets corrupted, search on that index does not work anymore.
I’m not sure whether an index command actually drops and recreates the index or just updates the differences.
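Since it is unclear whether re-indexing rebuilds the index or only updates it, one cautious option is to move the existing index files out of the way before launching a new indexation. The following is a minimal Python sketch of that idea, not anything Pydio ships. The function name and the backup location are hypothetical, and the index path is the one quoted above, which may differ on your installation.

```python
import shutil
from pathlib import Path


def move_index_aside(index_dir, backup_dir):
    """Move every entry of index_dir into backup_dir and return the moved names.

    The index directory itself is left in place (empty), so its ownership
    and permissions are untouched. Returns [] if index_dir does not exist.
    """
    src = Path(index_dir)
    dst = Path(backup_dir)
    if not src.is_dir():
        return []
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(src.iterdir()):
        shutil.move(str(entry), str(dst / entry.name))
        moved.append(entry.name)
    return moved
```

For example, `move_index_aside("./data/cache/indexes", "/tmp/pydio-index-backup")` (both paths are assumptions) would empty the index folder while keeping a copy you can restore if the rebuild does not help.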

Raker · March 28, 2018, 1:28pm

Interesting, I located the folder. Deleted all the indexes, re-ran indexation and I can see new files were created.

2_1.del
_2.sti
_3.sti
read.lock.file
segments_6
write.lock.file
_2.cfs
_3.cfs
optimization.lock.file
read-lock-processing.lock.file
segments.gen

However, the only results I get from search are by filename. Content search simply doesn’t work for me.

Raker · March 28, 2018, 1:30pm

Pydio documentation says: “You must make sure to add a meta source “index.lucene” to the repositories that you want to be indexed.”

I’m still looking for an answer to understand that part. How do I add a meta source to the repository?

romoloman · March 28, 2018, 2:44pm

Go into Settings and select the workspace; on the left there is a section named Additional features.
You must be sure that:
Lucene Search engine is listed in that section.

After that, click Lucene Search Engine and verify that the Index Content switch is enabled.

Raker · March 28, 2018, 7:44pm

Everything mentioned is enabled. If that is all you had to do, to make it work, then I guess I’m out of luck.

Raker · March 28, 2018, 7:54pm

Okay, found the solution. I feel like screaming.

Apparently a workspace is not a repository. However, a folder inside a workspace is a repository (I have no idea why this is not mentioned in the docs).

So if I create a folder, put a txt file in it, the contents are indexed.
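If you want to check which of your test files sit directly in the workspace root and which are inside a folder, a small script can list them. This is only a sketch based on what I observed above (files inside a folder were indexed, files at the root were not); the function name is made up and the workspace path is whatever your workspace points to on disk.

```python
from pathlib import Path


def split_by_location(workspace_root, suffix=".txt"):
    """Return (files directly in the workspace root, files inside subfolders).

    Paths are returned relative to workspace_root, with forward slashes.
    """
    root = Path(workspace_root)
    at_root, nested = [], []
    for path in sorted(root.rglob(f"*{suffix}")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # A single path component means the file lives in the root itself.
        if len(rel.parts) == 1:
            at_root.append(rel.as_posix())
        else:
            nested.append(rel.as_posix())
    return at_root, nested
```

Anything in the first list is a candidate for moving into a folder before re-running the indexation.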

Thanks for your support romoloman. You already helped me once, with the Collabora Online. <3

Raker · March 28, 2018, 8:43pm

Just a quick update. Since the indexer doesn’t parse .docx files, it won’t even index their filenames. So I ended up deciding to disable content indexing. Now I can at least search on all filenames.

romoloman · March 29, 2018, 6:07am

For office files (docx, doc, odt, xls, ods, ppt, odp) the solution is simple:

UNOCONV + XPDF INTEGRATION
If you can install the unoconv utility on your server, along with the OpenOffice or LibreOffice headless suite, and the xpdf utility, the plugin will be able to extract and index textual contents from office documents (Word, Excel, PowerPoint and all their closed or open-source variants).
On Ubuntu 16.04:
apt-get install unoconv xpdf

For PDF files you need to install pdftotext.
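To see quickly whether the tools are visible on the server, something like the Python sketch below can help. The extension-to-tool table only restates what is written in this thread and is not Pydio's internal table; the function names are hypothetical. `shutil.which` is the standard-library lookup on the `PATH` of the user running the script.

```python
import shutil

# Illustration only: extensions and tools as named in this thread.
REQUIRED_TOOL = {
    "doc": "unoconv", "docx": "unoconv", "odt": "unoconv",
    "xls": "unoconv", "ods": "unoconv", "ppt": "unoconv", "odp": "unoconv",
    "pdf": "pdftotext",
}


def tool_for(filename):
    """Return the external tool assumed to be needed for filename, or None."""
    if "." not in filename:
        return None
    ext = filename.rsplit(".", 1)[-1].lower()
    return REQUIRED_TOOL.get(ext)


def missing_tools(filenames, which=shutil.which):
    """Return the sorted names of needed tools that cannot be found on PATH."""
    needed = {tool_for(name) for name in filenames} - {None}
    return sorted(tool for tool in needed if which(tool) is None)
```

Calling `missing_tools(["report.docx", "manual.pdf"])` returns an empty list when both binaries are found. Plain .txt files need no extra tool, so they never show up in the result.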

Raker · April 6, 2018, 8:59am

Yeah, already tried that. Unfortunately without success. Did everything according to the configuration. Binaries are installed. Paths are entered in Pydio. Pydio runs with a user that can access the binaries. I click launch indexation, go to search and more, and enter a string in the content field. No results.
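For anyone wanting to double-check the "user can access the binaries" part, here is a rough Python sketch of the check I mean. It has to be run as the same account the web server uses (on Ubuntu that is usually www-data, but that is an assumption about your setup), and the paths you pass in are whatever you typed into the Pydio plugin settings. The function name is made up.

```python
import os


def check_binary(path):
    """Report whether path is a regular file the current user may execute."""
    is_file = os.path.isfile(path)
    return {
        "path": path,
        "exists": is_file,
        # os.access tests permissions for the user running this script.
        "executable": is_file and os.access(path, os.X_OK),
    }
```

A result with `"exists": True` but `"executable": False` would point to a permissions problem rather than a missing package.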
