I need to count how many entity of kind
Blog has boolean property
accepted set to True, but I suddenly realized that
OFFSET in query is no use for me (In fact, it is not really useful).
In SDK 1.1.0,
OFFSET does what you think on
Development Server if you first use GAE and have experience of SQL, but it's still different than on Production Server.
Basically, if you have 1002 entities in
Blog and you want to get the 1002nd entity. The follows will not get you that entity:
q = Blog.all()
# Doing filter here
# Order here
# Then fetch
r = q.fetch(1, 0)[0] # 1st
r = q.fetch(1, 1)[0] # 2nd
r = q.fetch(1, 999)[0] # 1000th
r = q.fetch(1, 1000)[0] # 1001st
r = q.fetch(1, 1001)[0] # 1002nd
You will get an exception on the last one like:
BadRequestError: Offset may not be above 1000.
BadRequestError: Too big query offset.
First one is on Production Sever, second is on Development Server.
The
OFFSET takes effective after:
- filter data (WHERE clause)
- sort data (ORDER clause)
- truncate to first 1001 entities (even though count() only returns 1000 at most)
After filtering, sorting, truncating to first 1001 entities, then you can have your
OFFSET. If you have read
Updaing Your Model's Schema, it warns you:
A word of caution: when writing a query that retrieves entities in batches, avoid OFFSET (which doesn't work for large sets of data) and instead limit the amount of data returned by using a WHERE condition.
The only way is to filtering data (
WHERE clause), you will need a
unique property if you need to walk through all entities.
An amazing thing is you don't need to create new property, there is already one in all of you Kinds, the
__key__ in query, the
Key.
The benefits of using it:
- No additional property,
- No additional index (Because it's already created by default), and
- Combination of two above, you don't need to use additional datastore quota. Index and Property use quota.
Here is a code snippet that I use to count
Blog entities, you should be able to adapt it if you need to process data:
def get_count(q):
r = q.fetch(1000)
count = 0
while True:
count += len(r)
if len(r) < 1000:
break
q.filter('__key__ >', r[-1])
r = q.fetch(1000)
return count
q = db.Query(blog.Blog, keys_only=True)
q.order('__key__')
total_count = get_count(q)
q = db.Query(blog.Blog, keys_only=True)
q.filter('accepted =', True)
q.order('__key__')
accepted_count = get_count(q)
q = db.Query(blog.Blog, keys_only=True)
q.filter('accepted =', False)
q.order('__key__')
blocked_count = get_count(q)
Note that
- Remove keys_only=True if you need to process data. And you will need to use r[-1].key() to filter.
- Add a resuming functionality because it really uses a lot of CPU time if it works on large set of data.