Monday, June 29, 2015

django-cms and haystack: update index with a django signal

Haystack provides the HAYSTACK_SIGNAL_PROCESSOR setting and it works well, especially the RealtimeSignalProcessor for django models.

But the problem I had was updating the search index when the user publishes a django-cms page.
In fact, if you use the HAYSTACK_SIGNAL_PROCESSOR setting, haystack will update your index with the draft copy of the page - whether you save a plugin or the page itself - and that's not good for the user experience... or for our indexes! :-)

Digging into this issue, I wrote a signal handler that wraps the RealtimeSignalProcessor and doesn't break the django-cms publish command if there's no connection to the search engine.

# models.py
# other imports
import logging

from cms.signals import post_publish

logger = logging.getLogger(__name__)
# ...

def real_time_signal_processor(instance, **kwargs):
    from elasticsearch.exceptions import ConnectionError
    from django.conf import settings
    from haystack import signals
    from haystack.utils import loading

    connections = loading.ConnectionHandler(settings.HAYSTACK_CONNECTIONS)
    connection_router = loading.ConnectionRouter()
    if hasattr(settings, 'HAYSTACK_ROUTERS'):
        connection_router = loading.ConnectionRouter(settings.HAYSTACK_ROUTERS)
    try:
        signals.RealtimeSignalProcessor(connections, connection_router).handle_save(kwargs['sender'], instance)
    except ConnectionError as e:
        # log and swallow so the publish command doesn't break
        logger.error(e)


post_publish.connect(real_time_signal_processor)

Just out of curiosity: this models.py is part of a django app named search, where I put all the search engine code (custom backends, forms, etc.). I think it's good practice, and nice, to have an isolated app that serves search purposes.

That's it!

Cheers

[Update] The code above catches the exception from the elasticsearch client. With a generic exception you could use this signal with any search backend, but IMHO catching a generic Exception is not good practice, so use the appropriate one! :-)
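As a sketch of that advice, the wrap-and-log part can be factored into a small helper; the helper name and the `exceptions` parameter are my own, not from the original code:

```python
import logging

logger = logging.getLogger(__name__)

def safe_signal_handler(handler, exceptions=(Exception,)):
    """Wrap a signal handler so search backend errors don't break publishing.

    Pass the backend's connection error class(es) as `exceptions`
    instead of the generic Exception whenever you can.
    """
    def wrapper(instance, **kwargs):
        try:
            handler(instance, **kwargs)
        except exceptions as e:
            # log and swallow: publishing must succeed even if search is down
            logger.error(e)
    return wrapper
```

You would then connect the wrapped handler, e.g. `post_publish.connect(safe_signal_handler(my_handler, exceptions=(ConnectionError,)))`, with `my_handler` being your own indexing function.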

Wednesday, June 24, 2015

Dealing with elasticsearch reindex and haystack

Prerequisites: some knowledge of haystack and elasticsearch, and obviously django, is required.

When we use e.g. synonyms or stopwords in our indices, we need to reindex our data in order to pick up the new settings. The elasticsearch documentation raises this problem and suggests how to fix it, see Reindex Your Data. Well, let's do it with haystack!

Suppose we have these haystack settings:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'search.backend.ElasticsearchEngineCustom',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
        'BATCH_SIZE': 1000
    }
}

As you can see I created a custom backend, but it's not the focus of this article; please read the links below to get an idea.
As a sample, for this article the backend module looks like:

from haystack.backends.elasticsearch_backend import (ElasticsearchSearchBackend,
                                                     ElasticsearchSearchEngine)

class ElasticsearchEngineBackendCustom(ElasticsearchSearchBackend):
    def __init__(self, connection_alias, **connection_options):
        super(ElasticsearchEngineBackendCustom, self).__init__(connection_alias, **connection_options)

        self.setup_complete = True

class ElasticsearchEngineCustom(ElasticsearchSearchEngine):
    backend = ElasticsearchEngineBackendCustom

Setting self.setup_complete to True avoids the 'put index' call in the haystack setup and lets us use INDEX_NAME as the index alias name.

Now we have to manage the whole haystack index setup via management commands, overriding rebuild_index and creating a reindex_index command. Let's do it.

In a module named utils, I created current_index, a utility function that returns the current index in use and the next version number for it (we'll use that number in the management commands). Below the code:
 
INDEX_TEMPLATE = "{}_v{}"

def number_sequence():
    n = 0
    while True:
        yield n
        n += 1

def current_index(es_client, index_name):
    version = number_sequence()
    index = INDEX_TEMPLATE.format(index_name, next(version))

    if not es_client.indices.exists_alias(name=index_name):
        return INDEX_TEMPLATE.format(index_name, 0), 1

    while not es_client.indices.exists(index=index):
        index = INDEX_TEMPLATE.format(index_name, next(version))

    return index, next(version)

The guard on exists_alias is used to know whether rebuild_index has been run at least once: if no alias is present, there is no index. The version number appended to the name starts from 0, so the default index created by the rebuild_index command will be haystack_v0.
One objection is the use of while not...: the function scans from zero up to the current version number, which could be large, and hits elasticsearch once per number; it would be more efficient to store the two index names somewhere. Take this as a "quick win" solution and improve it!
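To see the naming scheme and the scan in action without a live cluster, here is a pure-python simulation where the es_client.indices calls are replaced by a set lookup and a flag (the set of existing index names is made up):

```python
INDEX_TEMPLATE = "{}_v{}"

def number_sequence():
    n = 0
    while True:
        yield n
        n += 1

def current_index_simulated(existing_indexes, index_name, has_alias=True):
    """Same logic as current_index, with a set standing in for elasticsearch."""
    version = number_sequence()
    index = INDEX_TEMPLATE.format(index_name, next(version))

    if not has_alias:
        # rebuild_index has never run: default to version 0, next version is 1
        return INDEX_TEMPLATE.format(index_name, 0), 1

    while index not in existing_indexes:
        index = INDEX_TEMPLATE.format(index_name, next(version))

    return index, next(version)
```

For example, `current_index_simulated({"haystack_v3"}, "haystack")` probes v0..v3 and returns `("haystack_v3", 4)`: the next rotation will create haystack_v4.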

Let's write the reindex_index command; the steps are:
  • Delete the new index, ignoring errors
  • Create the new index with the current index settings and field mappings
  • Reindex data from the current index to the new index
  • Zero downtime: remove the current index from the alias and add the new index
  • Delete the current index, which is now old; the new one is in use

from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex
from django.core.management.base import BaseCommand
from django.conf import settings
from search.utils import current_index, INDEX_TEMPLATE

class Command(BaseCommand):

    def __init__(self):
        super(Command, self).__init__()

        self.es_client = Elasticsearch(hosts=settings.HAYSTACK_CONNECTIONS['default']['URL'])
        self.index_alias = settings.HAYSTACK_CONNECTIONS['default']['INDEX_NAME']
        self.current_index, version = current_index(self.es_client, self.index_alias)
        self.new_index = INDEX_TEMPLATE.format(self.index_alias, version)
        # Update settings with fields mapping
        self.index_settings = settings.ELASTICSEARCH_INDEX_SETTINGS
        self.index_settings.update(self.es_client.indices.get_mapping()[self.current_index])

    def handle(self, *args, **options):
        self.es_client.indices.delete(index=self.new_index, ignore=[404, 400])
        self.es_client.indices.create(index=self.new_index, body=self.index_settings)
        reindex(self.es_client, self.current_index, self.new_index)
        update_aliases = {
            "actions": [
                {"remove": {"index": self.current_index, "alias": self.index_alias}},
                {"add": {"index": self.new_index, "alias": self.index_alias}}
            ]
        }
        self.es_client.indices.update_aliases(body=update_aliases)
        self.es_client.indices.delete(index=self.current_index)

        print(u"Successfully reindexed.")


Now we need to override the rebuild_index command by:
  • Rewriting the clear_index call
  • Creating the alias using the INDEX_NAME from the haystack settings:
    • delete all indexes that match haystack_v*
    • create the "v0" index with the haystack settings
    • create the alias haystack pointing to haystack_v0
  • Calling update_index

from elasticsearch import Elasticsearch

from django.conf import settings
from django.core.management import call_command
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend
from haystack.management.commands import rebuild_index
from search.utils import INDEX_TEMPLATE

class Command(rebuild_index.Command):

    def __init__(self):
        super(Command, self).__init__()
        self.es_client = Elasticsearch(hosts=settings.HAYSTACK_CONNECTIONS['default']['URL'])
        self.index_name = INDEX_TEMPLATE.format(settings.HAYSTACK_CONNECTIONS['default']['INDEX_NAME'], 0)
        self.index_alias = settings.HAYSTACK_CONNECTIONS['default']['INDEX_NAME']

    def _create_index_alias(self):
        # delete every haystack_v* index
        self.es_client.indices.delete(index=self.index_name.replace("0", "*"))
        self.es_client.indices.create(index=self.index_name,
                                      body=ElasticsearchSearchBackend.DEFAULT_SETTINGS)
        print(u"Created index {}".format(self.index_name))
        self.es_client.indices.put_alias(index=self.index_name, name=self.index_alias)
        print(u"Added index {} to alias {}.".format(self.index_name, self.index_alias))

    def handle(self, **options):
        self._create_index_alias()
        call_command('update_index', **options)

Instead of ElasticsearchSearchBackend.DEFAULT_SETTINGS you should use your own elasticsearch settings; the links above explain how to do it.
A nice improvement would be a yes/no option, since the rebuild command is destructive.
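A minimal sketch of such a guard (the prompt text and the accepted answers are my own choice, not part of the original commands):

```python
def confirmed(answer):
    """Return True only for an explicit yes; anything else aborts."""
    return answer.strip().lower() in ("y", "yes")
```

At the top of handle() you would then do something like `if not confirmed(raw_input("This will destroy the index. Continue? [y/N] ")): return` (input() on python 3) before touching any index.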

Any suggestions will be appreciated! That's it. Enjoy! :-)

Cheers

Thursday, June 18, 2015

Django Haystack Elasticsearch: index pdf files

In this article I would like to explain how to index pdf files into a haystack elasticsearch backend; to follow along you need some knowledge of django and haystack.
Elasticsearch configuration is not covered.

Beginning 

IMHO, the haystack documentation is not very clear about a "FileIndex" - or rather, it is, but only for the Solr backend, see Rich Content Extraction; for the elasticsearch backend you need to get your hands dirty :-)

Issue
  • The pdf files to index are located in the django media directory, under the document folder and its subfolders.

What we need
  1. Retrieve all the files and put some data into a list of dictionaries
  2. A pdf file model, since haystack requires a model in order to build an index
  3. A custom haystack elasticsearch backend, where we override the extract_file_contents method
  4. A search_indexes.py file
Solving 1.

This is simple: walk through the directories and store the full path into a list of dictionaries.
I leave the exercise to the reader.
The goal is to obtain a result like this:

[
  {"path": "/path/to/media/my_fantastic_pdf.pdf", "url": "media/url/my_fantastic_pdf.pdf"},
  {"path": "...", "url": "..."},
   ...
]
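If you want a starting point for that homework, here is a sketch assuming the files live under a root directory in a document folder and the urls are built from a media/ prefix (the `root`, `url_prefix` and `folder` names are my assumptions):

```python
import os

def retrieve_files(root, url_prefix="media/", folder="document"):
    """Walk root/folder and collect a path/url dict for every pdf found."""
    docs = []
    for dirpath, _dirnames, filenames in os.walk(os.path.join(root, folder)):
        for name in sorted(filenames):
            if not name.lower().endswith(".pdf"):
                continue
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root)
            docs.append({"path": full_path,
                         "url": url_prefix + rel_path.replace(os.sep, "/")})
    return docs
```

In a django project you would call it as `retrieve_files(settings.MEDIA_ROOT, settings.MEDIA_URL)`.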

Solving 2.

First of all we need a model, not managed:

from django.db import models

class PdfFileInfo(models.Model):
    path = models.CharField(max_length=250)
    url = models.CharField(max_length=250)

    objects = PdfFileInfoManager()

    def get_absolute_url(self):
        return self.url

    class Meta:
        managed = False

As you can see we don't have a real table, so we need to create a custom QuerySet and Manager to make up for this lack.
Searching around the net I found the article how to quack like a QuerySet, which explains how to have a copy of the original django QuerySet plus some nice tricks.

Below the code of QuerySet:

class PdfFileInfoQuerySet(object):

    def __init__(self):
        # avoid circular dependencies
        from .models import PdfFileInfo

        self.pdf_files = []
        docs = retrieve_files()  # remember it's your homework :-P
        for pk, doc in enumerate(docs):
            doc['id'] = pk 
            self.pdf_files.append(PdfFileInfo(**doc))

    def __iter__(self):
        for pdf_file in self.pdf_files:
            yield pdf_file

    def __repr__(self):
        return repr(self.pdf_files)

    def __getitem__(self, k):
        if not isinstance(k, (slice, int, long)):
            raise TypeError
        assert ((not isinstance(k, slice) and (k >= 0))
                or (isinstance(k, slice) and (k.start is None or k.start >= 0)
                    and (k.stop is None or k.stop >= 0))), "Negative indexing is not supported."
        if isinstance(k, slice):
            return self.pdf_files[k]
        else:
            return self.pdf_files[k:k + 1][0]

    def count(self):
        return len(self.pdf_files)

    def all(self):
        return self._clone()

    def filter(self, *args, **kwargs):
        return self._clone()

    def exclude(self, *args, **kwargs):
        return self._clone()

    def order_by(self, *ordering):
        return self._clone()

    def _clone(self):
        qs = PdfFileInfoQuerySet()
        qs.pdf_files = self.pdf_files[:]
        return qs

Note on a pitfall: assigning the pk to the model ensures that the indexer will create all the documents in the index; otherwise it would create only one document (the last item).
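The pitfall is easy to reproduce in isolation: haystack derives each document id from app label, model name and pk, so objects sharing a pk overwrite each other. A toy simulation (the id format mimics haystack's default but is hand-rolled here):

```python
def doc_id(app_label, model_name, pk):
    """Mimic haystack's default document id: '<app>.<model>.<pk>'."""
    return "{}.{}.{}".format(app_label, model_name, pk)

index = {}

# Without distinct pks every object maps to the same id: last one wins.
for pk, path in [(None, "a.pdf"), (None, "b.pdf")]:
    index[doc_id("search", "pdffileinfo", pk)] = path

# With distinct pks (as assigned via enumerate in the QuerySet above)
# every file gets its own document.
for pk, path in enumerate(["a.pdf", "b.pdf"]):
    index[doc_id("search", "pdffileinfo", pk)] = path
```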

And for the Manager:

class PdfFileInfoManager(models.Manager):
    def all(self):
        return PdfFileInfoQuerySet()

Solving 3.

Creating a custom backend... the "easy" part :-)
I chose pyPdf for extracting the pdf contents. [Python recipe]

import pyPdf

from haystack.backends.elasticsearch_backend import (ElasticsearchSearchBackend,
                                                     ElasticsearchSearchEngine)

class ElasticsearchEngineBackendCustom(ElasticsearchSearchBackend):
    # ... 
    def extract_file_contents(self, file_obj):

        pdf = pyPdf.PdfFileReader(file_obj)

        content = ""
        for num_page in range(0, pdf.getNumPages()):
            content += pdf.getPage(num_page).extractText() + "\n"

        content = (" ".join(content.replace(u"\xa0", " ").strip().split())).encode("ascii", "xmlcharrefreplace")

        pdf_info = {
            'contents': content
        }

        return pdf_info

class ElasticsearchEngineCustom(ElasticsearchSearchEngine):
    backend = ElasticsearchEngineBackendCustom

You can find some other info about extending the backend in my stackoverflow answer.
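The whitespace and encoding normalization inside extract_file_contents can be tried in isolation (the sample string is made up):

```python
raw = u"Hello\xa0 world\n\n  caf\xe9"

# replace non-breaking spaces, collapse runs of whitespace to single
# spaces, then escape non-ascii characters as xml character references
text = u" ".join(raw.replace(u"\xa0", u" ").strip().split())
ascii_text = text.encode("ascii", "xmlcharrefreplace")
```

Here ascii_text becomes b"Hello world caf&#233;", safe to hand to any ascii-only pipeline.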

Solving 4.

Cool, now we have the basis to build the index as the haystack documentation says.

from django.template import Context, loader
from haystack import indexes

from .models import PdfFileInfo

class FileIndex(indexes.SearchIndex, indexes.Indexable):
    # ...
    def prepare(self, obj):
        data = super(FileIndex, self).prepare(obj)

        extracted_data = self._get_backend(None).extract_file_contents(open(obj.path, "rb"))

        t = loader.select_template(('search/indexes/file_text.txt',))
        data['text'] = t.render(Context({'object': obj, 'extracted': extracted_data}))

        return data

    def get_model(self):
        return PdfFileInfo

    def index_queryset(self, using=None):
        return PdfFileInfo.objects.all()

The template search/indexes/file_text.txt is very simple:

{{ extracted.contents|striptags|safe }}

That's it: run the rebuild_index command and watch the indexer in action.

This is a working example; it may require some adjustments for your purposes. For example, I think that with a little effort you could also index a file attached to one of your django models.

Any suggestions will be appreciated!
Cheers!

Saturday, March 15, 2014

duckdns and localtunnel

After quite some time... a post, just to keep track of some experiments.

BTW, I was looking for a service to share a web application from the RaspberryPi and I found a couple of interesting solutions.

Dynamic DNS

DuckDNS is a free dynamic DNS service.
It requires configuring the router for IP and ports.
As of today it offers up to four third-level domains per user, not bad.
Note: it also works with Fastweb; the new routers allow configuration via MyFastPage.

Experiment: a simple file upload form and a Flask application to handle the POST.

Localhost Anywhere

localtunnel is a "tunneling" service that avoids configuring DNS and the router.
Easy to install, it requires nodejs.
I recommend using nvm to manage nodejs installations.
Very useful during development to share a project running locally with the outside world.

Experiment: a "hello world!" from a Tornado web application... on the other hand, there's always the C10K problem to consider... :-P


Cheers