fields module
Contains functions and classes related to fields.
Schema class
-
class whoosh.fields.Schema(**fields)
Represents the collection of fields in an index. Maps field names to
FieldType objects which define the behavior of each field.
Low-level parts of the index use field numbers instead of field names for
compactness. This class has several methods for converting between the
field name, field number, and field object itself.
All keyword arguments to the constructor are treated as fieldname =
fieldtype pairs. The fieldtype can be an instantiated FieldType object,
or a FieldType sub-class (in which case the Schema will instantiate it
with the default constructor before adding it).
For example:
s = Schema(content = TEXT,
           title = TEXT(stored = True),
           tags = KEYWORD(stored = True))
-
add(name, fieldtype, glob=False)
Adds a field to this schema.
Parameters:
- name – The name of the field.
- fieldtype – An instantiated fields.FieldType object, or a
FieldType subclass. If you pass an instantiated object, the schema
will use that as the field configuration for this field. If you
pass a FieldType subclass, the schema will automatically
instantiate it with the default constructor.
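For illustration, a short sketch of both forms (the field names here are hypothetical; the glob=True pattern for dynamic field names follows the Whoosh documentation):
    from whoosh.fields import Schema, TEXT, KEYWORD, NUMERIC

    schema = Schema(content=TEXT)

    # Pass an instantiated FieldType to control its configuration...
    schema.add("tags", KEYWORD(stored=True))

    # ...or pass the class and let the schema instantiate it.
    # With glob=True, any field name matching "*_count" is accepted
    # dynamically at indexing time.
    schema.add("*_count", NUMERIC, glob=True)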
-
copy()
- Returns a shallow copy of the schema. The field instances are not
deep copied, so they are shared between schema copies.
-
has_vectored_fields()
- Returns True if any of the fields in this schema store term vectors.
-
items()
- Returns a list of (“fieldname”, field_object) pairs for the fields
in this schema.
-
names(check_names=None)
Returns a list of the names of the fields in this schema.
Parameter: check_names – (optional) sequence of field names to check
whether the schema accepts them as (dynamic) field names; acceptable
names will also be included in the result list.
Note: check_names may also contain static field names; these do not
create duplicates in the result list. Names the schema does not accept
are omitted from the result list.
-
scorable_names()
- Returns a list of the names of fields that store field
lengths.
-
separate_spelling_names()
- Returns a list of the names of fields that require special handling
when generating spelling graphs: either because they store graphs but
aren’t indexed, or because their analyzer performs morphological
transformations such as stemming.
-
stored_names()
- Returns a list of the names of fields that are stored.
-
vector_names()
- Returns a list of the names of fields that store vectors.
-
class whoosh.fields.SchemaClass(**fields)
Allows you to define a schema using declarative syntax, similar to
Django models:
class MySchema(SchemaClass):
    path = ID
    date = DATETIME
    content = TEXT
You can use inheritance to share common fields between schemas:
class Parent(SchemaClass):
    path = ID(stored=True)
    date = DATETIME

class Child1(Parent):
    content = TEXT(positions=False)

class Child2(Parent):
    tags = KEYWORD
This class overrides __new__ so instantiating your sub-class always
results in an instance of Schema.
>>> class MySchema(SchemaClass):
... title = TEXT(stored=True)
... content = TEXT
...
>>> s = MySchema()
>>> type(s)
<class 'whoosh.fields.Schema'>
All keyword arguments to the constructor are treated as fieldname =
fieldtype pairs. The fieldtype can be an instantiated FieldType object,
or a FieldType sub-class (in which case the Schema will instantiate it
with the default constructor before adding it).
For example:
s = Schema(content = TEXT,
           title = TEXT(stored = True),
           tags = KEYWORD(stored = True))
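A declarative schema can be used anywhere a Schema instance is expected; per the Whoosh documentation, the class itself can also be passed to index.create_in(). A minimal sketch (the directory and field names are hypothetical):
    import os
    from whoosh import index
    from whoosh.fields import SchemaClass, TEXT, ID

    class ArticleSchema(SchemaClass):
        path = ID(stored=True, unique=True)
        title = TEXT(stored=True)
        content = TEXT

    os.makedirs("indexdir", exist_ok=True)
    # create_in accepts the declarative class as well as an instance
    ix = index.create_in("indexdir", ArticleSchema)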
FieldType base class
-
class whoosh.fields.FieldType(format, analyzer, vector=None, scorable=False, stored=False, unique=False, multitoken_query='default', sortable=False)
Represents a field configuration.
The FieldType object supports the following attributes:
- format (formats.Format): the storage format for the field’s contents.
- analyzer (analysis.Analyzer): the analyzer to use to turn text into
terms.
- vector (formats.Format): the storage format for the field’s vectors
(forward index), or None if the field should not store vectors.
- scorable (boolean): whether searches against this field may be scored.
This controls whether the index stores per-document field lengths for
this field.
- stored (boolean): whether the content of this field is stored for each
document. For example, in addition to indexing the title of a document,
you usually want to store the title so it can be presented as part of
the search results.
- unique (boolean): whether this field’s value is unique to each document.
For example, ‘path’ or ‘ID’. IndexWriter.update_document() will use
fields marked as ‘unique’ to find the previous version of a document
being updated.
- multitoken_query is a string indicating what kind of query to use when
a “word” in a user query parses into multiple tokens. The string is
interpreted by the query parser. The strings understood by the default
query parser are “first” (use first token only), “and” (join the tokens
with an AND query), “or” (join the tokens with OR), “phrase” (join
the tokens with a phrase query), and “default” (use the query parser’s
default join type).
The constructor for the base field type simply lets you supply your own
configured field format, vector format, and scorable and stored values.
Subclasses may configure some or all of this for you.
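As a sketch of how a subclass might configure itself, here is a hypothetical field type modeled on how the built-in ID type is set up: it indexes the whole value as a single lowercased token. (LOWER_ID is not part of Whoosh; the analyzer and format classes used are real.)
    from whoosh import analysis, formats
    from whoosh.fields import FieldType

    class LOWER_ID(FieldType):
        # Hypothetical: a case-insensitive variant of the ID field type.
        def __init__(self, stored=False, unique=False):
            # One token per value, lowercased by the analyzer
            self.analyzer = analysis.IDAnalyzer(lowercase=True)
            # Existence format: records only which documents contain a term
            self.format = formats.Existence()
            self.stored = stored
            self.unique = unique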
-
clean()
- Clears any cached information in the field and any child objects.
-
has_morph()
- Returns True if this field by default performs morphological
transformations on its terms, e.g. stemming.
-
index(value, **kwargs)
Returns an iterator of (btext, frequency, weight, encoded_value)
tuples for each unique word in the input value.
The default implementation uses the analyzer attribute to tokenize
the value into strings, then encodes them into bytes using UTF-8.
-
parse_query(fieldname, qstring, boost=1.0)
- When self_parsing() returns True, the query parser will call
this method to parse basic query text.
-
parse_range(fieldname, start, end, startexcl, endexcl, boost=1.0)
- When self_parsing() returns True, the query parser will call
this method to parse range query text. If this method returns None
instead of a query object, the parser will fall back to parsing the
start and end terms using process_text().
-
process_text(qstring, mode='', **kwargs)
Analyzes the given string and returns an iterator of token texts.
>>> field = fields.TEXT()
>>> list(field.process_text("The ides of March"))
["ides", "march"]
-
self_parsing()
- Subclasses should override this method to return True if they want
the query parser to call the field’s parse_query() method instead
of running the analyzer on text in this field. This is useful where
the field needs full control over how queries are interpreted, such
as in the numeric field type.
-
separate_spelling()
Returns True if this field requires special handling of the words
that go into the field’s word graph.
The default behavior is to return True if the field is “spelled” but
not indexed, or if the field is indexed but the analyzer has
morphological transformations (e.g. stemming). Exotic field types may
need to override this behavior.
This method should return False if the field does not support spelling
(i.e. the spelling attribute is False).
-
sortable_terms(ixreader, fieldname)
Returns an iterator of the “sortable” tokens in the given reader and
field. These values can be used for sorting. The default implementation
simply returns all tokens in the field.
This can be overridden by field types such as NUMERIC where some values
in a field are not useful for sorting.
-
spellable_words(value)
Returns an iterator of each unique word (in sorted order) in the
input value, suitable for inclusion in the field’s word graph.
The default behavior is to call the field analyzer with the keyword
argument no_morph=True, which should make the analyzer skip any
morphological transformation filters (e.g. stemming) to preserve the
original form of the words. Exotic field types may need to override
this behavior.
-
supports(name)
Returns True if the underlying format supports the given posting
value type.
>>> field = TEXT()
>>> field.supports("positions")
True
>>> field.supports("characters")
False
-
to_bytes(value)
- Returns a bytes representation of the given value, appropriate to be
written to disk. The default implementation assumes a unicode value and
encodes it using UTF-8.
-
to_column_value(value)
- Returns an object suitable to be inserted into the document values
column for this field. The default implementation simply calls
self.to_bytes(value).
-
tokenize(value, **kwargs)
- Analyzes the given string and returns an iterator of Token objects
(note: for performance reasons, the same Token object is actually
yielded over and over with different attributes).
Pre-made field types
-
class whoosh.fields.ID(stored=False, unique=False, field_boost=1.0, spelling=False, sortable=False, analyzer=None)
Configured field type that indexes the entire value of the field as one
token. This is useful for data you don’t want to tokenize, such as the path
of a file.
Parameter: stored – Whether the value of this field is stored with the
document.
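A short sketch of the typical use of ID together with unique (directory and field names are hypothetical): because "path" is unique, IndexWriter.update_document() can find and replace the earlier version of a document.
    import os
    from whoosh import index
    from whoosh.fields import Schema, ID, TEXT

    schema = Schema(path=ID(stored=True, unique=True), content=TEXT)
    os.makedirs("ixdir", exist_ok=True)
    ix = index.create_in("ixdir", schema)

    with ix.writer() as w:
        w.add_document(path="/a/b.txt", content="old text")
    with ix.writer() as w:
        # "path" is unique, so this replaces the earlier document
        w.update_document(path="/a/b.txt", content="new text")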
-
class whoosh.fields.IDLIST(stored=False, unique=False, expression=None, field_boost=1.0, spelling=False)
Configured field type for fields containing IDs separated by whitespace
and/or punctuation (or anything else, using the expression param).
Parameters:
- stored – Whether the value of this field is stored with the
document.
- unique – Whether the value of this field is unique per-document.
- expression – The regular expression object to use to extract
tokens. The default expression breaks tokens on CRs, LFs, tabs,
spaces, commas, and semicolons.
-
class whoosh.fields.STORED
- Configured field type for fields you want to store but not index.
-
class whoosh.fields.KEYWORD(stored=False, lowercase=False, commas=False, vector=None, scorable=False, unique=False, field_boost=1.0, spelling=False, sortable=False)
Configured field type for fields containing space-separated or
comma-separated keyword-like data (such as tags). The default is to not
store positional information (so phrase searching is not allowed in this
field) and to not make the field scorable.
Parameters:
- stored – Whether to store the value of the field with the
document.
- commas – Whether this is a comma-separated field. If this is False
(the default), it is treated as a space-separated field.
- scorable – Whether this field is scorable.
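For example, a stored, comma-separated, case-normalized tag field (a sketch; the field names are hypothetical):
    from whoosh.fields import Schema, STORED, KEYWORD

    schema = Schema(title=STORED,
                    tags=KEYWORD(stored=True, lowercase=True, commas=True))
    # With commas=True, a value such as "search engine, Python"
    # indexes two keywords: "search engine" and "python"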
-
class whoosh.fields.TEXT(analyzer=None, phrase=True, chars=False, vector=None, stored=False, field_boost=1.0, multitoken_query='default', spelling=False, sortable=False, lang=None)
Configured field type for text fields (for example, the body text of an
article). The default is to store positional information to allow phrase
searching. This field type is always scorable.
Parameters:
- analyzer – The analysis.Analyzer to use to index the field
contents. See the analysis module for more information. If you omit
this argument, the field uses analysis.StandardAnalyzer.
- phrase – Whether to store positional information to allow phrase
searching.
- chars – Whether to store character ranges along with positions.
If this is True, “phrase” is also implied.
- vector – A whoosh.formats.Format object to use to store
term vectors, or True to store vectors using the same format as
the inverted index, or None or False to not store vectors.
By default, fields do not store term vectors.
- stored – Whether to store the value of this field with the
document. Since this field type generally contains a lot of text,
you should avoid storing it with the document unless you need to,
for example to allow fast excerpts in the search results.
- spelling – Whether to generate word graphs for this field to make
spelling suggestions much faster.
- sortable – If True, make this field sortable using the default
column type. If you pass a whoosh.columns.Column instance
instead of True, the field will use the given column type.
- lang – automatically configure a
whoosh.analysis.LanguageAnalyzer for the given language.
This is ignored if you also specify an analyzer.
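A typical configuration (a sketch, with hypothetical field names): store the title for display, and index the body with a stemming analyzer so that, for example, "searching" matches "search".
    from whoosh.analysis import StemmingAnalyzer
    from whoosh.fields import Schema, TEXT

    schema = Schema(title=TEXT(stored=True),
                    body=TEXT(analyzer=StemmingAnalyzer()))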
-
class whoosh.fields.NUMERIC(numtype=int, bits=32, stored=False, unique=False, field_boost=1.0, decimal_places=0, shift_step=4, signed=True, sortable=False, default=None)
Special field type that lets you index integer or floating point
numbers in relatively short fixed-width terms. The field converts numbers
to sortable bytes for you before indexing.
You specify the numeric type of the field (int or float) when you
create the NUMERIC object. The default is int. For int, you can
specify a size in bits (32 or 64). For both int and float
you can specify a signed keyword argument (default is True).
>>> schema = Schema(path=STORED, position=NUMERIC(int, 64, signed=False))
>>> ix = storage.create_index(schema)
>>> with ix.writer() as w:
... w.add_document(path="/a", position=5820402204)
...
You can also use the NUMERIC field to store Decimal instances by specifying
a type of int or long and the decimal_places keyword argument.
This simply multiplies each number by (10 ** decimal_places) before
storing it as an integer. Of course this may throw away decimal precision
(by truncating, not rounding) and imposes the same maximum value limits as
int/long, but these may be acceptable for certain applications.
>>> from decimal import Decimal
>>> schema = Schema(path=STORED, position=NUMERIC(int, decimal_places=4))
>>> ix = storage.create_index(schema)
>>> with ix.writer() as w:
... w.add_document(path="/a", position=Decimal("123.45"))
...
Parameters:
- numtype – the type of numbers that can be stored in this field,
either int or float. If you use Decimal,
use the decimal_places argument to control how many decimal
places the field will store.
- bits – When numtype is int, the number of bits to use to
store the number: 8, 16, 32, or 64.
- stored – Whether the value of this field is stored with the
document.
- unique – Whether the value of this field is unique per-document.
- decimal_places – specifies the number of decimal places to save
when storing Decimal instances. If you set this, you will always
get Decimal instances back from the field.
- shift_step – The number of bits of precision to shift away at
each tiered indexing level. Values should generally be 1-8. Lower
values yield faster searches but take up more space. A value
of 0 means no tiered indexing.
- signed – Whether the numbers stored in this field may be
negative.
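Because NUMERIC is self-parsing, the query parser hands query and range text for the field to the field type itself. A sketch (field names hypothetical):
    from whoosh.fields import Schema, STORED, NUMERIC
    from whoosh.qparser import QueryParser

    schema = Schema(path=STORED, size=NUMERIC(int, 32, sortable=True))
    parser = QueryParser("size", schema)
    # Range syntax is handed to the field via parse_range()
    q = parser.parse("size:[100 to 5000]")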
-
class whoosh.fields.DATETIME(stored=False, unique=False, sortable=False)
Special field type that lets you index datetime objects. The field
converts the datetime objects to sortable text for you before indexing.
Since this field is based on Python’s datetime module it shares all the
limitations of that module, such as the inability to represent dates before
year 1 in the proleptic Gregorian calendar. However, since this field
stores datetimes as an integer number of microseconds, it could easily
represent a much wider range of dates if the Python datetime implementation
ever supports them.
>>> schema = Schema(path=STORED, date=DATETIME)
>>> ix = storage.create_index(schema)
>>> w = ix.writer()
>>> w.add_document(path="/a", date=datetime.now())
>>> w.commit()
Parameters:
- stored – Whether the value of this field is stored with the
document.
- unique – Whether the value of this field is unique per-document.
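DATETIME is also self-parsing. To let users type human-friendly dates in queries, Whoosh provides an optional DateParserPlugin for the query parser; a sketch (field names hypothetical):
    from whoosh.fields import Schema, STORED, DATETIME
    from whoosh.qparser import QueryParser
    from whoosh.qparser.dateparse import DateParserPlugin

    schema = Schema(path=STORED, date=DATETIME(sortable=True))
    parser = QueryParser("date", schema)
    parser.add_plugin(DateParserPlugin())
    # With the plugin, freeform date terms parse into datetime ranges
    q = parser.parse("date:yesterday")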
-
class whoosh.fields.BOOLEAN(stored=False, field_boost=1.0)
Special field type that lets you index boolean values (True and False).
The field converts the boolean values to text for you before indexing.
>>> schema = Schema(path=STORED, done=BOOLEAN)
>>> ix = storage.create_index(schema)
>>> w = ix.writer()
>>> w.add_document(path="/a", done=False)
>>> w.commit()
Parameter: stored – Whether the value of this field is stored with the
document.
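BOOLEAN is self-parsing as well; in query text the field accepts textual boolean values such as "true"/"false" or "yes"/"no". A sketch (field names hypothetical):
    from whoosh.fields import Schema, STORED, BOOLEAN
    from whoosh.qparser import QueryParser

    schema = Schema(path=STORED, done=BOOLEAN(stored=True))
    q = QueryParser("done", schema).parse("done:yes")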
-
class whoosh.fields.NGRAM(minsize=2, maxsize=4, stored=False, field_boost=1.0, queryor=False, phrase=False, sortable=False)
Configured field that indexes text as N-grams. For example, with a field
type NGRAM(3,4), the value “hello” will be indexed as tokens
“hel”, “hell”, “ell”, “ello”, “llo”. This field type chops the entire text
into N-grams, including whitespace and punctuation. See NGRAMWORDS
for a field type that breaks the text into words first before chopping the
words into N-grams.
Parameters:
- minsize – The minimum length of the N-grams.
- maxsize – The maximum length of the N-grams.
- stored – Whether to store the value of this field with the
document. Since this field type generally contains a lot of text,
you should avoid storing it with the document unless you need to,
for example to allow fast excerpts in the search results.
- queryor – if True, combine the N-grams with an Or query. The
default is to combine N-grams with an And query.
- phrase – store positions on the N-grams to allow exact phrase
searching. The default is off.
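Because every substring of length minsize to maxsize is indexed, an NGRAM field matches queries on fragments of the original text. A sketch (field names hypothetical):
    from whoosh.fields import Schema, STORED, NGRAM

    # A query for "ell" matches documents whose text contains "hello",
    # since "ell" is one of the indexed 3-grams
    schema = Schema(path=STORED, grams=NGRAM(minsize=3, maxsize=4))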
-
class whoosh.fields.NGRAMWORDS(minsize=2, maxsize=4, stored=False, field_boost=1.0, tokenizer=None, at=None, queryor=False, sortable=False)
Configured field that chops text into words using a tokenizer,
lowercases the words, and then chops the words into N-grams.
Parameters:
- minsize – The minimum length of the N-grams.
- maxsize – The maximum length of the N-grams.
- stored – Whether to store the value of this field with the
document. Since this field type generally contains a lot of text,
you should avoid storing it with the document unless you need to,
for example to allow fast excerpts in the search results.
- tokenizer – an instance of whoosh.analysis.Tokenizer
used to break the text into words.
- at – if ‘start’, only takes N-grams from the start of the word.
If ‘end’, only takes N-grams from the end. Otherwise the default
is to take all N-grams from each word.
- queryor – if True, combine the N-grams with an Or query. The
default is to combine N-grams with an And query.
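The at="start" option keeps only the leading N-grams of each word, which makes NGRAMWORDS usable as a simple "search as you type" prefix index. A sketch (field names hypothetical):
    from whoosh.fields import Schema, STORED, NGRAMWORDS

    # Indexes "py", "pyt", "pyth", ... for the word "python", so
    # typing a prefix of any word matches the document
    schema = Schema(title=STORED,
                    typed=NGRAMWORDS(minsize=2, maxsize=10, at="start"))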
Exceptions
-
exception whoosh.fields.FieldConfigurationError
-
exception whoosh.fields.UnknownFieldError