See also whoosh.qparser which contains code for parsing user queries into query objects.
The following abstract base classes are subclassed to create the “real” query operations.
Abstract base class for all queries.
Note that this base class implements __or__, __and__, and __sub__ to allow slightly more convenient composition of query objects:
>>> Term("content", u"a") | Term("content", u"b")
Or([Term("content", u"a"), Term("content", u"b")])
>>> Term("content", u"a") & Term("content", u"b")
And([Term("content", u"a"), Term("content", u"b")])
>>> Term("content", u"a") - Term("content", u"b")
And([Term("content", u"a"), Not(Term("content", u"b"))])
Applies the given function to this query’s subqueries (if any) and then to this query itself:
def boost_phrases(q):
    if isinstance(q, Phrase):
        q.boost *= 2.0
    return q

myquery = myquery.accept(boost_phrases)
This method automatically creates copies of the nodes in the original tree before passing them to your function, so your function can change attributes on nodes without altering the original tree.
This method is less flexible than using Query.apply() (in fact it’s implemented using that method) but is often more straightforward.
Returns a set of all terms in this query tree.
This method exists for backwards-compatibility. Use iter_all_terms() instead.
Parameter: phrases – Whether to add words found in Phrase queries.
Return type: set
If this query has children, calls the given function on each child and returns a new copy of this node with the new children returned by the function. If this is a leaf node, simply returns this object.
This is useful for writing functions that transform a query tree. For example, this function changes all Term objects in a query tree into Variations objects:
def term2var(q):
    if isinstance(q, Term):
        return Variations(q.fieldname, q.text)
    else:
        return q.apply(term2var)

q = And([Term("f", "alfa"),
         Or([Term("f", "bravo"),
             Not(Term("f", "charlie"))])])
q = term2var(q)
Note that this method does not automatically create copies of nodes. To avoid modifying the original tree, your function should call the Query.copy() method on nodes before changing their attributes.
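The difference between accept() (copies nodes for you) and apply() (does not) can be sketched with a toy node class. The `Node` class and `double_phrase_boost` function are hypothetical and only model the behaviour described in this section; Whoosh's own code may differ.

```python
import copy


class Node:
    """Toy query node; a hypothetical stand-in for a Whoosh query class."""

    def __init__(self, name, children=(), boost=1.0):
        self.name = name
        self.children = list(children)
        self.boost = boost

    def apply(self, fn):
        # Leaf nodes are returned unchanged (and uncopied).
        if not self.children:
            return self
        # Compound nodes return a shallow copy holding the new children.
        clone = copy.copy(self)
        clone.children = [fn(child) for child in self.children]
        return clone

    def accept(self, fn):
        # Transform the children first, then hand a copy of this node to fn.
        def visit(node):
            return node.accept(fn)

        return fn(copy.copy(self.apply(visit)))


def double_phrase_boost(node):
    if node.name == "phrase":
        node.boost *= 2.0
    return node


original = Node("and", [Node("phrase"), Node("term")])
boosted = original.accept(double_phrase_boost)
print(original.children[0].boost, boosted.children[0].boost)
```

In this model the callback mutates only copies, so the original tree keeps its boost of 1.0 while the returned tree carries the doubled value.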
Returns an iterator of docnums matching this query.
>>> with my_index.searcher() as searcher:
...     list(my_query.docs(searcher))
[10, 34, 78, 103]
Parameter: searcher – A whoosh.searching.Searcher object.
Returns a set of all byteterms in this query tree that exist in the given ixreader.
Return type: set
Returns an iterator of (fieldname, text) pairs for all terms in this query tree.
>>> qp = qparser.QueryParser("text", myindex.schema)
>>> q = qp.parse("alfa bravo title:charlie")
>>> # List the terms in a query
>>> list(q.iter_all_terms())
[("text", "alfa"), ("text", "bravo"), ("title", "charlie")]
>>> # Get a set of all terms in the query that don't exist in the index
>>> r = myindex.reader()
>>> missing = set(t for t in q.iter_all_terms() if t not in r)
>>> missing
set([("text", "alfa"), ("title", "charlie")])
>>> # All terms in the query that occur in fewer than 5 documents in
>>> # the index
>>> [t for t in q.iter_all_terms() if r.doc_frequency(t[0], t[1]) < 5]
[("title", "charlie")]
Parameter: phrases – Whether to add words found in Phrase queries.
Returns a Matcher object you can use to retrieve documents and scores matching this query.
Return type: whoosh.matching.Matcher
Returns a recursively “normalized” form of this query. The normalized form removes redundancy and empty queries. This is called automatically on query trees created by the query parser, but you may want to call it yourself if you’re writing your own parser or building your own queries.
>>> q = And([And([Term("f", u"a"),
...               Term("f", u"b")]),
...          Term("f", u"c"), Or([])])
>>> q.normalize()
And([Term("f", u"a"), Term("f", u"b"), Term("f", u"c")])
Note that this returns a new, normalized query. It does not modify the original query “in place”.
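As a rough illustration of what "removing redundancy and empty queries" can mean, here is a toy normalizer over plain tuples. The tuple representation and the rule that collapses a one-child compound are assumptions made for this sketch, not a description of Whoosh's internals.

```python
def normalize(node):
    """Toy normalizer: nodes are term strings or (op, [children]) tuples."""
    if isinstance(node, str):
        return node
    op, children = node
    flat = []
    for child in children:
        child = normalize(child)
        if child is None:
            continue  # drop empty subqueries
        if not isinstance(child, str) and child[0] == op:
            flat.extend(child[1])  # merge And-in-And or Or-in-Or
        else:
            flat.append(child)
    if not flat:
        return None  # an empty compound normalizes to "nothing"
    if len(flat) == 1:
        return flat[0]  # assumed simplification for this sketch
    return (op, flat)


tree = ("and", [("and", ["a", "b"]), "c", ("or", [])])
normalized = normalize(tree)
print(normalized)
```

Like the method described above, the function builds a new structure and leaves `tree` untouched.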
Returns a copy of this query with oldtext replaced by newtext (if oldtext was anywhere in this query).
Note that this returns a new query with the given text replaced. It does not modify the original query “in place”.
Returns a set of queries that are known to be required to match for the entire query to match. Note that other queries might also turn out to be required but not be determinable by examining the static query.
>>> a = Term("f", u"a")
>>> b = Term("f", u"b")
>>> And([a, b]).requires()
set([Term("f", u"a"), Term("f", u"b")])
>>> Or([a, b]).requires()
set([])
>>> AndMaybe(a, b).requires()
set([Term("f", u"a")])
>>> a.requires()
set([Term("f", u"a")])
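The results in the example follow from a simple recursive rule, sketched here on a toy tuple representation. The representation and the `"andmaybe"` tag are hypothetical, and the sketch ignores special cases a real implementation may handle.

```python
def requires(node):
    """Toy version: nodes are term strings or (op, [children]) tuples."""
    if isinstance(node, str):
        return {node}
    op, children = node
    if op == "and":
        # Every child of an And must match, so union the requirements.
        result = set()
        for child in children:
            result |= requires(child)
        return result
    if op == "andmaybe":
        # Only the first child is mandatory.
        return requires(children[0])
    # "or": no individual child is required on its own.
    return set()


print(requires(("and", ["a", "b"])))
print(requires(("or", ["a", "b"])))
print(requires(("andmaybe", ["a", "b"])))
```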
Yields zero or more (fieldname, text) pairs queried by this object. You can check whether a query object targets specific terms before you call this method using Query.has_terms().
To get all terms in a query tree, use Query.iter_all_terms().
Yields zero or more analysis.Token objects corresponding to the terms searched for by this query object. You can check whether a query object targets specific terms before you call this method using Query.has_terms().
The Token objects will have the fieldname, text, and boost attributes set. If the query was built by the query parser, the Token objects will also have startchar and endchar attributes indexing into the original user query.
To get all tokens for a query tree, use Query.all_tokens().
Parameter: exreader – a reader to use to expand multiterm queries such as prefixes and wildcards. The default is None, meaning do not expand.
Returns a COPY of this query with the boost set to the given value.
If a query type does not accept a boost itself, it will try to pass the boost on to its children, if any.
Matches documents containing the given term (fieldname+text pair).
>>> Term("content", u"render")
Matches documents containing words similar to the given term.
Matches documents containing a given phrase.
Matches documents that match ALL of the subqueries.
>>> And([Term("content", u"render"),
...      Term("content", u"shade"),
...      Not(Term("content", u"texture"))])
>>> # You can also do this
>>> Term("content", u"render") & Term("content", u"shade")
Matches documents that match ANY of the subqueries.
>>> Or([Term("content", u"render"),
...     And([Term("content", u"shade"), Term("content", u"texture")]),
...     Not(Term("content", u"network"))])
>>> # You can also do this
>>> Term("content", u"render") | Term("content", u"shade")
Excludes any documents that match the subquery.
>>> # Match documents that contain 'render' but not 'texture'
>>> And([Term("content", u"render"),
...      Not(Term("content", u"texture"))])
>>> # You can also do this
>>> Term("content", u"render") - Term("content", u"texture")
Matches documents that contain any terms that start with the given text.
>>> # Match documents containing words starting with 'comp'
>>> Prefix("content", u"comp")
Matches documents that contain any terms that match a “glob” pattern. See the Python fnmatch module for information about globs.
>>> Wildcard("content", u"in*f?x")
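To see which terms a glob such as `in*f?x` would accept, the pattern can be tried against a plain list with Python's fnmatch module. This only illustrates the glob syntax; the word list is made up, and it says nothing about how Whoosh scans the index.

```python
from fnmatch import fnmatchcase

# Hypothetical terms standing in for the contents of a field.
indexed_terms = ["index", "infix", "inflex", "interfax", "prefix"]
pattern = "in*f?x"

# "*" matches any run of characters, "?" matches exactly one character.
matching = [term for term in indexed_terms if fnmatchcase(term, pattern)]
print(matching)
```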
Matches documents containing any terms in a given range.
>>> # Match documents where the indexed "id" field is greater than or equal
>>> # to 'apple' and less than or equal to 'pear'.
>>> TermRange("id", u"apple", u"pear")
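A term range like the 'apple' to 'pear' example can be pictured as a slice of a sorted term list. The sketch below uses a made-up list and the bisect module, and assumes both endpoints are inclusive as stated in the example comment.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted terms for the "id" field.
sorted_terms = ["ant", "apple", "banana", "cherry", "pear", "plum"]

lo = bisect_left(sorted_terms, "apple")   # first term >= "apple"
hi = bisect_right(sorted_terms, "pear")   # one past the last term <= "pear"
in_range = sorted_terms[lo:hi]
print(in_range)
```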
A range query for NUMERIC fields. Takes advantage of tiered indexing to speed up large ranges by matching at a high resolution at the edges of the range and a low resolution in the middle.
>>> # Match numbers from 10 to 5925 in the "number" field.
>>> nr = NumericRange("number", 10, 5925)
This is a very thin subclass of NumericRange that only overrides the initializer and __repr__() methods to work with datetime objects instead of numbers. Internally this object converts the datetime objects it’s created with to numbers and otherwise acts like a NumericRange query.
>>> DateRange("date", datetime(2010, 11, 3, 3, 0),
...           datetime(2010, 11, 3, 17, 59))
A query that matches every document containing any term in a given field. If you don’t specify a field, the query matches every document.
>>> # Match any documents with something in the "path" field
>>> q = Every("path")
>>> # Match every document
>>> q = Every()
The unfielded form (matching every document) is efficient.
The fielded form is more efficient than a prefix query with an empty prefix or a ‘*’ wildcard, but it can still be very slow on large indexes. It requires the searcher to read the full posting list of every term in the given field.
Instead of using this query, it is much more efficient to include, when you create the index, a single term that appears in all documents that have the field you want to match.
For example, instead of this:
# Match all documents that have something in the "path" field
q = Every("path")
Do this when indexing:
# Add an extra field that indicates whether a document has a path
schema = fields.Schema(text=fields.TEXT, path=fields.ID, has_path=fields.ID)
# When indexing, set the "has_path" field only on documents that have
# something in the "path" field
writer.add_document(text=text_value1)
writer.add_document(text=text_value2, path=path_value2, has_path="t")
Then to find all documents with a path:
q = Term("has_path", "t")
Parameter: fieldname – the name of the field to match, or None or * to match all documents.
Merges overlapping and touching spans in the given list of spans.
Note that this modifies the original list.
>>> spans = [Span(1,2), Span(3)]
>>> Span.merge(spans)
>>> spans
[<1-3>]
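The merging behaviour can be sketched on plain (start, end) tuples. This is an illustration only: it assumes, based on the example above, that "touching" means the next span starts at the position right after the previous one ends.

```python
def merge_spans(spans):
    """Merge overlapping or adjacent (start, end) pairs in place."""
    spans.sort()
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    spans[:] = merged  # replace the contents of the caller's list


spans = [(1, 2), (3, 3), (7, 9), (8, 12)]
merge_spans(spans)
print(spans)
```

The slice assignment at the end is what makes the function modify the list passed in rather than return a new one.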
Abstract base class for span-based queries. Each span query type wraps a “regular” query that implements the basic document-matching functionality (for example, SpanNear wraps an And query, because SpanNear requires that the two sub-queries occur in the same documents). The wrapped query is stored in the q attribute.
Subclasses usually only need to implement the initializer to set the wrapped query, and matcher() to return a span-aware matcher object.
Matches spans that end within the first N positions. This lets you, for example, match only terms near the beginning of the document.
Note: for new code, use SpanNear2 instead of this class. SpanNear2 takes a list of sub-queries instead of requiring you to create a binary tree of query objects.
Matches queries that occur near each other. By default, only matches queries that occur right next to each other (slop=1) and in order (ordered=True).
For example, to find documents where “whoosh” occurs next to “library” in the “text” field:
from whoosh import query, spans
t1 = query.Term("text", "whoosh")
t2 = query.Term("text", "library")
q = spans.SpanNear(t1, t2)
To find documents where “whoosh” occurs at most 5 positions before “library”:
q = spans.SpanNear(t1, t2, slop=5)
To find documents where “whoosh” occurs at most 5 positions before or after “library”:
q = spans.SpanNear(t1, t2, slop=5, ordered=False)
You can use the phrase() class method to create a tree of SpanNear queries to match a list of terms:
q = spans.SpanNear.phrase("text", ["whoosh", "search", "library"],
                          slop=2)
Parameter: mindist – the minimum distance allowed between the queries.
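The effect of slop and ordered can be pictured with word positions alone. The function below is a hypothetical simplification: it treats each occurrence as a single position, measures distance as the difference between positions, and ignores mindist.

```python
def occurs_near(first_positions, second_positions, slop=1, ordered=True):
    """Return True if some pair of positions is within `slop` of each other."""
    for a in first_positions:
        for b in second_positions:
            distance = b - a
            if not ordered:
                distance = abs(distance)
            # With ordered=True a negative distance (b before a) never passes.
            if 1 <= distance <= slop:
                return True
    return False


# Hypothetical document: "whoosh" at positions 0 and 20, "library" at 17.
whoosh, library = [0, 20], [17]
print(occurs_near(whoosh, library))
print(occurs_near(whoosh, library, slop=5))
print(occurs_near(whoosh, library, slop=5, ordered=False))
```

Only the last call succeeds in this model, because "library" sits three positions before the second "whoosh" and that pairing is allowed only when order does not matter.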
Matches queries that occur near each other. By default, only matches queries that occur right next to each other (slop=1) and in order (ordered=True).
New code should use this query type instead of SpanNear.
(Unlike SpanNear, this query takes a list of subqueries instead of requiring you to build a binary tree of query objects. This query should also be slightly faster due to less overhead.)
For example, to find documents where “whoosh” occurs next to “library” in the “text” field:
from whoosh import query, spans
t1 = query.Term("text", "whoosh")
t2 = query.Term("text", "library")
q = spans.SpanNear2([t1, t2])
To find documents where “whoosh” occurs at most 5 positions before “library”:
q = spans.SpanNear2([t1, t2], slop=5)
To find documents where “whoosh” occurs at most 5 positions before or after “library”:
q = spans.SpanNear2([t1, t2], slop=5, ordered=False)
Parameter: mindist – the minimum distance allowed between the queries.
Matches spans from the first query only if they don’t overlap with spans from the second query. If there are no non-overlapping spans, the document does not match.
For example, to match documents that contain “bear” at most 2 places after “apple” in the “text” field but don’t have “cute” between them:
from whoosh import query, spans
t1 = query.Term("text", "apple")
t2 = query.Term("text", "bear")
near = spans.SpanNear(t1, t2, slop=2)
q = spans.SpanNot(near, query.Term("text", "cute"))
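The filtering rule can be sketched on plain (start, end) tuples. The span values are made up for illustration, and the overlap test is an assumption about what "overlap" means here (inclusive positions).

```python
def spans_not(first_spans, second_spans):
    """Keep spans from first_spans that overlap no span in second_spans."""

    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    return [span for span in first_spans
            if not any(overlaps(span, other) for other in second_spans)]


near_spans = [(3, 5), (10, 12)]  # hypothetical "apple ... bear" spans
cute_spans = [(4, 4)]            # hypothetical positions of "cute"
remaining = spans_not(near_spans, cute_spans)
print(remaining)
```

In this model the document still matches because one span survives; if `remaining` were empty, it would not.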
Matches documents that match any of a list of sub-queries. Unlike query.Or, this class merges together matching spans from the different sub-queries when they overlap.
Parameter: subqs – a list of queries to match.
Matches documents where the spans of the first query contain any spans of the second query.
For example, to match documents where “apple” occurs at most 10 places before “bear” in the “text” field and “cute” is between them:
from whoosh import query, spans
t1 = query.Term("text", "apple")
t2 = query.Term("text", "bear")
near = spans.SpanNear(t1, t2, slop=10)
q = spans.SpanContains(near, query.Term("text", "cute"))
Matches documents where the spans of the first query occur before any spans of the second query.
For example, to match documents where “apple” occurs anywhere before “bear”:
from whoosh import query, spans
t1 = query.Term("text", "apple")
t2 = query.Term("text", "bear")
q = spans.SpanBefore(t1, t2)
Matches documents that satisfy both subqueries, but only uses the spans from the first subquery.
This is useful when you want to place conditions on matches but not have those conditions affect the spans returned.
For example, to get spans for the term alfa in documents that also must contain the term bravo:
SpanCondition(Term("text", u"alfa"), Term("text", u"bravo"))
A query that allows you to search for “nested” documents, where you can index (possibly multiple levels of) “parent” and “child” documents using the group() and/or start_group() methods of a whoosh.writing.IndexWriter to indicate that hierarchically related documents should be kept together:
schema = fields.Schema(type=fields.ID, text=fields.TEXT(stored=True))

with ix.writer() as w:
    # Say we're indexing chapters (type=chap) and each chapter has a
    # number of paragraphs (type=p)
    with w.group():
        w.add_document(type="chap", text="Chapter 1")
        w.add_document(type="p", text="Able baker")
        w.add_document(type="p", text="Bright morning")
    with w.group():
        w.add_document(type="chap", text="Chapter 2")
        w.add_document(type="p", text="Car trip")
        w.add_document(type="p", text="Dog eared")
        w.add_document(type="p", text="Every day")
    with w.group():
        w.add_document(type="chap", text="Chapter 3")
        w.add_document(type="p", text="Fine day")
The NestedParent query wraps two sub-queries: the “parent query” matches a class of “parent documents”. The “sub query” matches nested documents you want to find. For each “sub document” the “sub query” finds, this query acts as if it found the corresponding “parent document”.
>>> with ix.searcher() as s:
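...     all_chapters = query.Term("type", "chap")
...     day_paragraphs = query.Term("text", "day")
...     q = query.NestedParent(all_chapters, day_paragraphs)
...     r = s.search(q)
...     for hit in r:
...         print(hit["text"])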
...
Chapter 2
Chapter 3
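One way to picture the child-to-parent mapping is with document numbers alone. The sketch assumes that documents are numbered in the order they were added and that each parent precedes its children; the function name is hypothetical.

```python
from bisect import bisect_right


def parents_of(parent_docnums, child_hits):
    """Map each matching child to the closest preceding parent docnum."""
    found = []
    for child in child_hits:
        i = bisect_right(parent_docnums, child)
        if i == 0:
            continue  # no parent precedes this document
        parent = parent_docnums[i - 1]
        if parent not in found:
            found.append(parent)
    return found


# Chapter example under the numbering assumption above: chapters are
# documents 0, 3 and 7; "Every day" is document 6 and "Fine day" is 8.
chapters = [0, 3, 7]
day_hits = [6, 8]
print(parents_of(chapters, day_hits))
```

Documents 3 and 7 are "Chapter 2" and "Chapter 3", which agrees with the output shown in the example.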
This is the reverse of a NestedParent query: instead of taking a query that matches children but returns the parent, this query matches parents but returns the children.
This is useful, for example, to search for an album title and return the songs in the album:
from whoosh import fields, query
from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser

schema = fields.Schema(type=fields.ID(stored=True),
                       artist=fields.TEXT(stored=True),
                       album_name=fields.TEXT(stored=True),
                       track_num=fields.NUMERIC(stored=True),
                       track_name=fields.TEXT(stored=True),
                       lyrics=fields.TEXT)
ix = RamStorage().create_index(schema)

# Indexing
with ix.writer() as w:
    # For each album, index a "group" of a parent "album" document and
    # multiple child "track" documents.
    with w.group():
        w.add_document(type="album",
                       artist="The Cure", album_name="Disintegration")
        w.add_document(type="track", track_num=1,
                       track_name="Plainsong")
        w.add_document(type="track", track_num=2,
                       track_name="Pictures of You")
        # ...
    # ...

# Find songs where the song name has "heaven" in the title and the
# album the song is on has "hell" in the title
qp = QueryParser("lyrics", ix.schema)
with ix.searcher() as s:
    # A query that matches all parents
    all_albums = qp.parse("type:album")

    # A query that matches the parents we want
    albums_with_hell = qp.parse("album_name:hell")

    # A query that matches the desired albums but returns the tracks
    songs_on_hell_albums = query.NestedChildren(all_albums, albums_with_hell)

    # A query that matches tracks with heaven in the title
    songs_with_heaven = qp.parse("track_name:heaven")

    # A query that finds tracks with heaven in the title on albums
    # with hell in the title
    q = query.And([songs_on_hell_albums, songs_with_heaven])
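Why NestedChildren needs both a query for all parents and a query for the wanted parents can be seen in a toy model over document numbers. It assumes each parent is immediately followed by its children; the function and the docnums are hypothetical.

```python
def children_of(all_parents, wanted_parents, doc_count):
    """Return docnums lying between each wanted parent and the next parent."""
    parents = sorted(all_parents)
    children = []
    for parent in wanted_parents:
        later = [p for p in parents if p > parent]
        # A group ends at the next parent, or at the end of the index.
        end = later[0] if later else doc_count
        children.extend(range(parent + 1, end))
    return children


# Hypothetical index of 7 documents: albums at docnums 0 and 4,
# with their tracks stored right after them.
tracks = children_of(all_parents=[0, 4], wanted_parents=[4], doc_count=7)
print(tracks)
```

The full list of parents is what tells the model where one group ends and the next begins.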