reading module
This module contains classes that allow reading from an index.
Classes
-
class whoosh.reading.IndexReader
Do not instantiate this object directly. Instead use Index.reader().
-
all_doc_ids()
- Returns an iterator of all (undeleted) document IDs in the reader.
-
all_stored_fields()
- Yields the stored fields for all documents (including deleted
documents).
-
all_terms()
- Yields (fieldname, text) tuples for every term in the index.
-
close()
- Closes the open files associated with this reader.
-
codec()
- Returns the whoosh.codec.base.Codec object used to read
this reader’s segment. If this reader is not atomic
(reader.is_atomic() == False), returns None.
-
column_reader(fieldname, column=None, reverse=False, translate=False)
Parameters: |
- fieldname – the name of the field for which to get a reader.
- column – if passed, use this Column object instead of the one
associated with the field in the Schema.
- reverse – if True, reverses the order of keys returned by the
reader’s sort_key() method. If the column type is not
reversible, this will raise a NotImplementedError.
- translate – if True, wrap the reader to call the field’s
from_bytes() method on the returned values.
|
Returns: | a whoosh.columns.ColumnReader object.
|
-
corrector(fieldname)
- Returns a whoosh.spelling.Corrector object that suggests
corrections based on the terms in the given field.
-
doc_count()
- Returns the total number of UNDELETED documents in this reader.
-
doc_count_all()
- Returns the total number of documents, DELETED OR UNDELETED,
in this reader.
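The relationship between doc_count(), doc_count_all(), is_deleted(), has_deletions() and all_doc_ids() can be illustrated with a toy model. This is only a sketch: ToyReader is a hypothetical stand-in, not a Whoosh class, and it assumes document numbers run contiguously from 0.

```python
class ToyReader:
    """Hypothetical stand-in that mimics the counting methods described above."""

    def __init__(self, total, deleted=()):
        self._total = total            # every document ever added
        self._deleted = set(deleted)   # document numbers marked deleted

    def doc_count_all(self):
        # Deleted or undeleted
        return self._total

    def doc_count(self):
        # Undeleted only
        return self._total - len(self._deleted)

    def is_deleted(self, docnum):
        return docnum in self._deleted

    def has_deletions(self):
        return bool(self._deleted)

    def all_doc_ids(self):
        # Iterator over undeleted document numbers only
        return (n for n in range(self._total) if n not in self._deleted)


toy = ToyReader(total=5, deleted=[1, 3])
print(toy.doc_count_all(), toy.doc_count(), list(toy.all_doc_ids()))
```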
-
doc_field_length(docnum, fieldname, default=0)
- Returns the number of terms in the given field in the given
document. This is used by some scoring algorithms.
-
doc_frequency(fieldname, text)
- Returns how many documents the given term appears in.
-
expand_prefix(fieldname, prefix)
- Yields terms in the given field that start with the given prefix.
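As an illustration of prefix expansion, the sketch below scans a sorted list of terms. It assumes the terms of a field are available in sorted order; expand_prefix_sketch is a hypothetical helper and says nothing about how a real backend stores or walks its terms.

```python
from bisect import bisect_left


def expand_prefix_sketch(sorted_terms, prefix):
    """Yield every term in sorted_terms that starts with prefix."""
    # Binary search to the first term that is >= the prefix
    i = bisect_left(sorted_terms, prefix)
    # Matching terms are contiguous in a sorted list, so stop at the first miss
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        yield sorted_terms[i]
        i += 1


field_terms = ["rack", "read", "reader", "render", "rest", "zoo"]
print(list(expand_prefix_sketch(field_terms, "re")))
```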
-
field_length(fieldname)
- Returns the total number of terms in the given field. This is used
by some scoring algorithms.
-
field_terms(fieldname)
- Yields all term values (converted from on-disk bytes) in the given
field.
-
first_id(fieldname, text)
- Returns the first ID in the posting list for the given term. This
may be optimized in certain backends.
-
frequency(fieldname, text)
- Returns the total number of instances of the given term in the
collection.
-
generation()
- Returns the generation of the index being read, or -1 if the backend
is not versioned.
-
has_deletions()
- Returns True if the underlying index/segment has deleted
documents.
-
has_vector(docnum, fieldname)
- Returns True if the given document has a term vector for the given
field.
-
has_word_graph(fieldname)
- Returns True if the given field has a “word graph” associated with
it, allowing suggestions for correcting mis-typed words and fast fuzzy
term searching.
-
indexed_field_names()
- Returns an iterable of strings representing the names of the indexed
fields. This may include additional names not explicitly listed in the
Schema if you use “glob” fields.
-
is_deleted(docnum)
- Returns True if the given document number is marked deleted.
-
iter_docs()
- Yields a series of (docnum, stored_fields_dict)
tuples for the undeleted documents in the reader.
-
iter_field(fieldname, prefix='')
- Yields (text, terminfo) tuples for all terms in the given field.
-
iter_from(fieldname, text)
- Yields ((fieldname, text), terminfo) tuples for all terms in the
reader, starting at the given term.
-
iter_postings()
- Low-level method, yields all postings in the reader as
(fieldname, text, docnum, weight, valuestring) tuples.
-
iter_prefix(fieldname, prefix)
- Yields (text, terminfo) tuples for all terms in the given field with
a certain prefix.
-
leaf_readers()
- Returns a list of (IndexReader, docbase) pairs for the child readers
of this reader if it is a composite reader. If this is not a composite
reader, it returns [(self, 0)].
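One way to read the (IndexReader, docbase) pairs is that docbase is the offset added to a leaf-local document number to get the document number in the composite reader. The sketch below works under that assumption; all three helper functions are hypothetical and operate on plain integers rather than real readers.

```python
from bisect import bisect_right


def build_docbases(leaf_doc_counts):
    """Return the starting offset of each leaf, given each leaf's total doc count."""
    bases, total = [], 0
    for count in leaf_doc_counts:
        bases.append(total)
        total += count
    return bases


def to_global(docbase, local_docnum):
    """Map a leaf-local document number to a composite document number."""
    return docbase + local_docnum


def to_local(docbases, global_docnum):
    """Map a composite document number to (leaf index, leaf-local number)."""
    leaf = bisect_right(docbases, global_docnum) - 1
    return leaf, global_docnum - docbases[leaf]


bases = build_docbases([3, 5, 2])   # three leaves holding 3, 5 and 2 documents
print(bases, to_global(bases[1], 1), to_local(bases, 8))
```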
-
lexicon(fieldname)
- Yields all bytestrings in the given field.
-
max_field_length(fieldname)
- Returns the maximum length of the field across all documents. This
is used by some scoring algorithms.
-
min_field_length(fieldname)
- Returns the minimum length of the field across all documents. This
is used by some scoring algorithms.
-
most_distinctive_terms(fieldname, number=5, prefix='')
- Returns the top ‘number’ terms with the highest tf*idf scores as
a list of (score, text) tuples.
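To make the tf*idf ranking concrete, here is a sketch that scores terms as frequency * log(doc_count / doc_frequency). That is one common formulation and is an assumption here: the exact weighting Whoosh applies is not stated in this section. The function name and the term_stats mapping are hypothetical.

```python
import heapq
import math


def most_distinctive_sketch(term_stats, doc_count, number=5, prefix=""):
    """Return the top `number` (score, text) tuples, highest score first.

    term_stats maps text -> (collection frequency, document frequency).
    """
    scored = (
        (freq * math.log(doc_count / df), text)   # assumed tf*idf variant
        for text, (freq, df) in term_stats.items()
        if text.startswith(prefix) and df > 0
    )
    return heapq.nlargest(number, scored)


stats = {"the": (100, 10), "whoosh": (8, 2), "index": (12, 5)}
top = most_distinctive_sketch(stats, doc_count=10, number=2)
print([text for _score, text in top])
```

A term that appears in every document ("the" above) scores zero under this formula however frequent it is, which is why it drops out of the list.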
-
most_frequent_terms(fieldname, number=5, prefix='')
- Returns the top ‘number’ most frequent terms in the given field as a
list of (frequency, text) tuples.
-
postings(fieldname, text)
Returns a Matcher for the postings of the
given term.
>>> pr = reader.postings("content", "render")
>>> pr.skip_to(10)
>>> pr.id
12
Parameters: |
- fieldname – the field name or field number of the term.
- text – the text of the term.
|
Return type: | whoosh.matching.Matcher
|
-
segment()
- Returns the whoosh.index.Segment object used by this reader.
If this reader is not atomic (reader.is_atomic() == False), returns
None.
-
storage()
- Returns the whoosh.filedb.filestore.Storage object used by
this reader to read its files. If the reader is not atomic
(reader.is_atomic() == False), returns None.
-
stored_fields(docnum)
Returns the stored fields for the given document number.
Parameters: |
- numerickeys – use field numbers as the dictionary keys instead of
field names.
|
-
term_info(fieldname, text)
- Returns a TermInfo object allowing access to various
statistics about the given term.
-
terms_from(fieldname, prefix)
- Yields (fieldname, text) tuples for every term in the index starting
at the given prefix.
-
terms_within(fieldname, text, maxdist, prefix=0)
Returns a generator of words in the given field within maxdist
Damerau-Levenshtein edit distance of the given text.
Important: the terms are returned in no particular order. The only
criterion is that they are within maxdist edits of text. You
may want to run this method multiple times with increasing maxdist
values to ensure you get the closest matches first. You may also have
additional information (such as term frequency or an acoustic matching
algorithm) you can use to rank terms with the same edit distance.
Parameters: |
- maxdist – the maximum edit distance.
- prefix – require suggestions to share a prefix of this length
with the given word. This is often justifiable since most
misspellings do not involve the first letter of the word.
Using a prefix dramatically decreases the time it takes to generate
the list of words.
- seen – an optional set object. Words that appear in the set will
not be yielded.
|
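The advice above about calling the method with increasing maxdist values can be sketched without an index. The code below is a brute-force illustration only: it computes the restricted Damerau-Levenshtein (optimal string alignment) distance against every term in a plain list, whereas a real backend may use a faster structure such as a word graph. All function names are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            best = min(prev[j] + 1,          # deletion
                       cur[j - 1] + 1,       # insertion
                       prev[j - 1] + cost)   # substitution or match
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                best = min(best, prev2[j - 2] + 1)   # adjacent transposition
            cur.append(best)
        prev2, prev = prev, cur
    return prev[-1]


def terms_within_sketch(terms, text, maxdist, prefix=0, seen=None):
    """Yield terms within maxdist edits of text, in no particular order."""
    seen = set() if seen is None else seen
    for term in terms:
        if term in seen or term[:prefix] != text[:prefix]:
            continue
        if osa_distance(term, text) <= maxdist:
            yield term


def closest_first(terms, text, maxdist, prefix=0):
    """Widen the distance step by step so that closer matches come first."""
    seen, ranked = set(), []
    for dist in range(maxdist + 1):
        for term in terms_within_sketch(terms, text, dist, prefix, seen):
            seen.add(term)
            ranked.append((dist, term))
    return ranked


words = ["render", "rendre", "renders", "tender", "random"]
print(closest_first(words, "render", maxdist=2, prefix=1))
```

With prefix=1, "tender" is rejected before any distance is computed, which is the saving the prefix parameter is meant to provide.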
-
vector(docnum, fieldname, format_=None)
Returns a Matcher object for the
given term vector.
>>> docnum = searcher.document_number(path=u'/a/b/c')
>>> v = searcher.vector(docnum, "content")
>>> v.all_as("frequency")
[(u"apple", 3), (u"bear", 2), (u"cab", 2)]
Parameters: |
- docnum – the document number of the document for which you want
the term vector.
- fieldname – the field name or field number of the field for which
you want the term vector.
|
Return type: | whoosh.matching.Matcher
|
-
vector_as(astype, docnum, fieldname)
Returns an iterator of (termtext, value) pairs for the terms in the
given term vector. This is a convenient shortcut to calling vector()
and using the Matcher object when all you want are the terms and/or
values.
>>> docnum = searcher.document_number(path=u'/a/b/c')
>>> searcher.vector_as("frequency", docnum, "content")
[(u"apple", 3), (u"bear", 2), (u"cab", 2)]
Parameters: |
- docnum – the document number of the document for which you want
the term vector.
- fieldname – the field name or field number of the field for which
you want the term vector.
- astype – a string containing the name of the format you want the
term vector’s data in, for example “weights”.
|
-
word_graph(fieldname)
- Returns the root whoosh.fst.Node for the given
field, if the field has a stored word graph (otherwise raises an
exception). You can check whether a field has a word graph using
IndexReader.has_word_graph().
-
class whoosh.reading.MultiReader(readers, generation=None)
- Do not instantiate this object directly. Instead use Index.reader().
-
class whoosh.reading.TermInfo(weight=0, df=0, minlength=None, maxlength=0, maxweight=0, minid=None, maxid=0)
Represents a set of statistics about a term. This object is returned by
IndexReader.term_info(). These statistics may be useful for
optimizations and scoring algorithms.
-
doc_frequency()
- Returns the number of documents the term appears in.
-
max_id()
- Returns the highest document ID this term appears in.
-
max_length()
- Returns the length of the longest field value the term appears
in.
-
max_weight()
- Returns the number of times the term appears in the document in
which it appears the most.
-
min_id()
- Returns the lowest document ID this term appears in.
-
min_length()
- Returns the length of the shortest field value the term appears
in.
-
weight()
- Returns the total frequency of the term across all documents.
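To show how these statistics relate to one another, the sketch below derives them from a toy posting list. It assumes each posting is a (docnum, weight, field_length) tuple with one posting per document; term_stats_sketch is a hypothetical helper and does not reflect how TermInfo objects are actually built or stored.

```python
def term_stats_sketch(postings):
    """Summarise a non-empty list of (docnum, weight, field_length) postings."""
    docnums = [docnum for docnum, _w, _l in postings]
    weights = [weight for _d, weight, _l in postings]
    lengths = [length for _d, _w, length in postings]
    return {
        "weight": sum(weights),          # total frequency across all documents
        "doc_frequency": len(postings),  # number of documents containing the term
        "min_length": min(lengths),      # shortest field value containing the term
        "max_length": max(lengths),      # longest field value containing the term
        "max_weight": max(weights),      # most occurrences in any single document
        "min_id": min(docnums),          # lowest document ID
        "max_id": max(docnums),          # highest document ID
    }


info = term_stats_sketch([(2, 3.0, 10), (5, 1.0, 4), (9, 2.0, 7)])
print(info)
```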
Exceptions
-
exception whoosh.reading.TermNotFound