As of 2016-02-26, there will be no more posts for this blog. s/blog/pba/
Showing posts with label datastore. Show all posts

This article, Updating Your Model's Schema, is already great and clear, but it does not have a complete code example. I decided to make one and write down some explanations. Just in case I might need it later.

It has one two stages to remove a property from a data model:
  1. Inherit from db.Expando if the model does not inherit from that.
  2. Remove the obsolete property from model definition.
  3. Delete the attribute, the property, of each entity del entity.obsolete
  4. Inherit from db.Model if the model originally inherited from.

How to actually do it:

Assume a model look like:
class MyModel(db.Model):
foo = db.TextProperty()
obsolete = db.TextProperty()

Re-define the model to:
class MyModel(db.Expando):
#class MyModel(db.Model):
foo = db.TextProperty()
# obsolete = db.TextProperty()

Make sure the model inherit from db.Expando and comment out (or just delete the line) the obsolete property.

Here is the example code to delete the attribute, the property:

from google.appengine.runtime import DeadlineExceededError

def del_obsolete(self):

count = 0
last_key = ''
try:
q = MyModel.all()
cont = self.request.get('continue')
if cont:
q.filter('__key__ >=', db.Key(cont))
q.order('__key__')
entities = q.fetch(100)
while entities:
for entity in entities:
last_key = str(entity.key())
try:
del entity.obsolete
except AttributeError:
pass
entity.put()
count += 1
q.filter('__key__ >', entities[-1].key())
entities = q.fetch(100)
except DeadlineExceededError:
self.response.out.write('%d processed, please continue to %s?continue=%s' % (count, self.request.path_url, last_key))
return
self.response.out.write('%d processed, all done.' % count)

Note that this snippet is to be used as a webapp.RequestHandler's get method, so it has self.response.

It use entities' keys to walk through every entity, it is efficient and safe. But you may also want to put your application under maintenance, preventing other code to add new entities, even though the values of keys seem to be increased only for new entities, but you really don't need to waste CPU time since new entities has no obsolete property.

Because it have to go through all entities and therefore it takes a lot of time to process. A mechanism to continue the process on the rest of entities is necessary. The code will catch google.appengine.runtime.DeadlineExceededError if it can not finish in one request, it then return a link which allows you to continue if you follow it. If you have lots of entities, you may want to use task instead of manual continuation. You may also want to set up the maximal amount of processing entities like 1000 entities in one request.

Once it has done its job, change the model definition back to db.Model and remove obsolete property line:
class MyModel(db.Model):
foo = db.TextProperty()


That's it.

I need to count how many entity of kind Blog has boolean property accepted set to True, but I suddenly realized that OFFSET in query is no use for me (In fact, it is not really useful).

In SDK 1.1.0, OFFSET does what you think on Development Server if you first use GAE and have experience of SQL, but it's still different than on Production Server.

Basically, if you have 1002 entities in Blog and you want to get the 1002nd entity. The follows will not get you that entity:
q = Blog.all()
# Doing filter here
# Order here
# Then fetch
r = q.fetch(1, 0)[0] # 1st
r = q.fetch(1, 1)[0] # 2nd
r = q.fetch(1, 999)[0] # 1000th
r = q.fetch(1, 1000)[0] # 1001st
r = q.fetch(1, 1001)[0] # 1002nd

You will get an exception on the last one like:
BadRequestError: Offset may not be above 1000.
BadRequestError: Too big query offset.
First one is on Production Sever, second is on Development Server.

The OFFSET takes effective after:
  1. filter data (WHERE clause)
  2. sort data (ORDER clause)
  3. truncate to first 1001 entities (even though count() only returns 1000 at most)
After filtering, sorting, truncating to first 1001 entities, then you can have your OFFSET. If you have read Updaing Your Model's Schema, it warns you:
A word of caution: when writing a query that retrieves entities in batches, avoid OFFSET (which doesn't work for large sets of data) and instead limit the amount of data returned by using a WHERE condition.
The only way is to filtering data (WHERE clause), you will need a unique property if you need to walk through all entities.

An amazing thing is you don't need to create new property, there is already one in all of you Kinds, the __key__ in query, the Key.

The benefits of using it:
  • No additional property,
  • No additional index (Because it's already created by default), and
  • Combination of two above, you don't need to use additional datastore quota. Index and Property use quota.
Here is a code snippet that I use to count Blog entities, you should be able to adapt it if you need to process data:
def get_count(q):
r = q.fetch(1000)
count = 0
while True:
count += len(r)
if len(r) < 1000:
break
q.filter('__key__ >', r[-1])
r = q.fetch(1000)
return count

q = db.Query(blog.Blog, keys_only=True)
q.order('__key__')
total_count = get_count(q)

q = db.Query(blog.Blog, keys_only=True)
q.filter('accepted =', True)
q.order('__key__')
accepted_count = get_count(q)

q = db.Query(blog.Blog, keys_only=True)
q.filter('accepted =', False)
q.order('__key__')
blocked_count = get_count(q)

Note that
  • Remove keys_only=True if you need to process data. And you will need to use r[-1].key() to filter.
  • Add a resuming functionality because it really uses a lot of CPU time if it works on large set of data.

I just download the data from one of my App Engine application by following Uploading and Downloading, I used this new and experimental bulkloader.py to download data into a sqlite3 database. You don't need to create the Loader/Exporter classes with this new method

It does explain how to download and upload, but, as for, uploading is only for production server. You have to look into the command line options, it's not complicated.

Here is a complete example to dump data:
$ python googleappengine/python/bulkloader.py --dump --kind=Kind --url=http://app-id.appspot.com/remote_api --filename=app-id-Kind.db /path/to/app.yaml/
[INFO ] Logging to bulkloader-log-20091111.001712
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Opening database: bulkloader-progress-20091111.001712.sql3
[INFO ] Opening database: bulkloader-results-20091111.001712.sql3
[INFO ] Connecting to brps.appspot.com/remote_api
Please enter login credentials for app-id.appspot.com
Email: [email protected]
Password for [email protected]:
.[INFO ] Kind: No descending index on __key__, performing serial download
.......................................................................................................................................................................................
.................................
[INFO ] Have 2160 entities, 0 previously transferred
[INFO ] 2160 entities (0 bytes) transferred in 134.6 seconds

And the following is for upload to Development Server using the sqlite3 database which we just download (not the CSV):
$ python googleappengine/python/bulkloader.py --restore --kind=Kind --url=http://localhost:8080/remote_api --filename=app-id-Kind.db --app_id=app-id
[INFO ] Logging to bulkloader-log-20091111.004013
[INFO ] Throttling transfers:
[INFO ] Bandwidth: 250000 bytes/second
[INFO ] HTTP connections: 8/second
[INFO ] Entities inserted/fetched/modified: 20/second
[INFO ] Opening database: bulkloader-progress-20091111.004013.sql3
Please enter login credentials for localhost
Email: [email protected] <- This does not matter, type anything
Password for [email protected]: <- Does not matter
[INFO ] Connecting to localhost:8080/remote_api
[INFO ] Starting import; maximum 10 entities per post
........................................................................................................................................................................................................................
[INFO ] 2160 entites total, 0 previously transferred
[INFO ] 2160 entities (0 bytes) transferred in 31.3 seconds
[INFO ] All entities successfully transferred

You will need to specify the app id, which must match the Development server is running on.

This may be no need once the bulkloader.py is stable.

I just tried to add two entity counts to my app's statistics page. Then I found out, the statistics APIreleased on 10/13/2009, version 1.2.6is not available for development server.

You can run the following code without errors:
from google.appengine.ext.db import stats
global_stat = stats.GlobalStat.all().get()

But global_stat is always None.

So I ended up with a code as follows:
db_blog_count = memcache.get('db_blog_count')
if db_blog_count is None:
blog_stat = stats.KindStat.all().filter('kind_name =', 'Blog').get()
if blog_stat is None:
db_blog_count = 'Unavailable'
else:
db_blog_count = blog_stat.count
memcache.set('db_blog_count', db_blog_count, 3600)

The documentation didn't explicit mention whether if the statistics is available for development server or notmaybe I didn't read carefully, neither did Release Notes.

PS. I know the code is awful, str / int types mixed, terrible. But I am lazy to add and if clause in template file to check if db_blog_count is None or something like -1, or anything represents the data is not available.

PS2. The code should be just if blog_stat: (fourth line) and swap the next two statements if you know what I meant.

Recently, my GAE application started to get few timeouts on operations on datastore.

Here is a sample traceback:
datastore timeout: operation took too long.
Traceback (most recent call last):
File "/base/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 498, in __call__
handler.get(*groups)
File "/base/data/home/apps/brps/1.330624965687476780/index.py", line 104, in get
p = post.get(blog_id, post_id)
File "/base/data/home/apps/brps/1.330624965687476780/brps/post.py", line 85, in get
p = db.run_in_transaction(transaction_update_relates, blog_id, post_id, relates)
File "/base/python_lib/versions/1/google/appengine/api/datastore.py", line 1451, in RunInTransaction
raise _ToDatastoreError(err)
File "/base/python_lib/versions/1/google/appengine/api/datastore.py", line 1637, in _ToDatastoreError
raise errors[err.application_error](err.error_detail)
Timeout: datastore timeout: operation took too long.

Here is how you can catch it:
from google.appengine.api.datastore_errors import Timeout

try:
pass
except Timeout:
pass