As of 2016-02-26, there will be no more posts for this blog. s/blog/pba/

More than two years ago, I posted "Finding large emails in Gmail using Python IMAP with XOAuth", which was really not an easy method if you don't know how to run a Python script.

Now, Gmail finally supports new operators for this task:

size:
    Search for messages larger than the specified size in bytes.
    Example: size:1000000
    Meaning: All messages larger than 1MB (1,000,000 bytes) in size.

larger:, smaller:
    Similar to size: but allows abbreviations for numbers.
    Example: larger:10M
    Meaning: All messages of at least 10M bytes (10,000,000 bytes) in size.
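
To see how the abbreviated form maps onto a plain byte count, here is a minimal Python sketch. The helper names to_bytes and size_query are made up for illustration, and the "K" suffix is an assumption on my part since only "M" appears in the examples above; the decimal multipliers follow the reading of 10M as 10,000,000 bytes.

```python
# Decimal multipliers, following the reading of 10M as 10,000,000 bytes.
# "K" is an assumption; only "M" appears in the examples.
MULTIPLIERS = {'K': 1000, 'M': 1000 * 1000}


def to_bytes(size):
    """Convert an abbreviated size such as '10M' to a number of bytes."""
    size = size.strip().upper()
    if size and size[-1] in MULTIPLIERS:
        return int(size[:-1]) * MULTIPLIERS[size[-1]]
    return int(size)


def size_query(abbrev):
    """Build a size: search term from an abbreviated size."""
    return 'size:%d' % to_bytes(abbrev)


print(size_query('1M'))  # size:1000000
```

So larger:1M and size:1000000 should find roughly the same messages; the abbreviated form is just easier to type.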

And this is my test:

It also supports date ranges, so there is no need to browse through pages for old emails anymore. It's fast; it only took more than two years to develop.

du -sh is one of the common ways I use the du command; it gets the total disk usage of the current directory. Another one is du -hd1, which gets the disk usage of each subdirectory, listing them one by one instead of giving only a grand total.

But how about the total size of just the individual files you are interested in? Not intending to show off my AWK scripting skill, but I did use AWK to sum up byte counts from the find or ls command when things got too complicated, i.e. involving several directories. To be honest, that shows no skill at all; a 10-minute AWK noob can do that. It only reveals how unfamiliar I was with the du command, and clearly I didn't RTFM. From its manpage:

-c, --total produce a grand total

It's as simple as that, and I didn't even know it before. So, basically, you can do:

find -L -name 'PATTERN' -print0 | du -ch --files0-from -

Or simply, if filenames do not contain spaces:

du -ch $(find -L -name 'PATTERN')

That's all you need, although you still need some knowledge of find. The -L is for following symbolic links (symlinks); you can omit it if you don't even know what they are, since you probably don't need it. For files in the current directory, you can use du as if it were the ls command, for example:

du -ch *.txt
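
For comparison, here is roughly what the find-and-du approach adds up, as a minimal Python sketch. The function name total_size and its parameters are made up for illustration. It sums apparent file sizes (st_size), so the figure can differ from du, which reports disk usage unless you pass --apparent-size, and it does not descend into symlinked directories the way find -L does.

```python
import fnmatch
import os


def total_size(top, pattern):
    """Sum the sizes in bytes of files under `top` whose names match `pattern`."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        for name in fnmatch.filter(filenames, pattern):
            # os.stat follows symlinks that point to files
            total += os.stat(os.path.join(dirpath, name)).st_size
    return total
```

For example, total_size('.', '*.txt') gives the byte total of all .txt files under the current directory.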

That's all.


Gmail supports new operators for size range searching, see my blog post about them. (2012-11-14)

After I posted about using Google's Python XOAuth library to get the unread mail count and list, I finally found a good reason to use IMAP: you can search based on the message size, which you can't do in the web interface.

typ, data = imap_conn.search(None, '(SMALLER %d) (LARGER %d)' % (MAXSIZE * 1000, MINSIZE * 1000))

That is great but not awesome, because Gmail's IMAP server does not support the SORT command, which is an IMAP4rev1 extension command according to the Python documentation.
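
Without SORT, one workaround is to ask the server for each message's RFC822.SIZE and sort on the client side. Below is a minimal sketch, not part of the script in this post. It assumes data is the list returned by imap_conn.fetch(ids, '(RFC822.SIZE)'), where each item looks like '1 (RFC822.SIZE 12345)'; the helper name sort_by_size is hypothetical, and the exact response shape may vary between servers.

```python
import re

# Matches responses such as "1 (RFC822.SIZE 12345)"
SIZE_RE = re.compile(r'(\d+) \(.*?RFC822\.SIZE (\d+)')


def sort_by_size(data):
    """Return (message number, size in bytes) pairs, largest first."""
    sizes = []
    for item in data:
        if isinstance(item, bytes):
            item = item.decode('ascii', 'replace')
        if not isinstance(item, type(u'')):
            continue  # skip anything that is not a plain response line
        m = SIZE_RE.match(item)
        if m:
            sizes.append((int(m.group(1)), int(m.group(2))))
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)
```

The sorted message numbers can then be passed to another fetch call to list only the biggest ones first.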

The entire source code is similar to the one in my previous post:

#!/usr/bin/env python
# Copyright 2010 Yu-Jie Lin
# BSD license

import email
import email.header
import imaplib
import sys

import xoauth

scope = ''
consumer = xoauth.OAuthEntity('anonymous', 'anonymous')
imap_hostname = ''

# How many messages will be fetched for listing?
MAX_FETCH = 20

try:
  import config
except ImportError:
  class Config():
    pass
  config = Config()

def get_access_token():

  request_token = xoauth.GenerateRequestToken(
      consumer, scope, nonce=None, timestamp=None,
      google_accounts_url_generator=config.google_accounts_url_generator)

  oauth_verifier = raw_input('Enter verification code: ').strip()
  try:
    access_token = xoauth.GetAccessToken(
        consumer, request_token, oauth_verifier, config.google_accounts_url_generator)
  except ValueError:
    # Could indicate failure of authentication because verifier is incorrect
    print 'Incorrect verification code?'
    sys.exit(1)
  return access_token

def main():

  # Checking user email and access token
  if not hasattr(config, 'user') or not hasattr(config, 'access_token'):
    config.user = raw_input('Please enter your email address: ')
    config.google_accounts_url_generator = xoauth.GoogleAccountsUrlGenerator(config.user)
    access_token = get_access_token()
    config.access_token = {'key': access_token.key, 'secret': access_token.secret}
    # XXX save token, this is not a good way, I'm too lazy to use something
    # like shelve.
    f = open('', 'w')
    f.write('user = %s\n' % repr(config.user))
    f.write('access_token = %s\n' % repr(config.access_token))
    print '\n\ written.\n\n'

  config.google_accounts_url_generator = xoauth.GoogleAccountsUrlGenerator(config.user)
  access_token = xoauth.OAuthEntity(config.access_token['key'], config.access_token['secret'])

  # Generate xoauth string
  class ImBad():
    # I'm bad because I'm going to shut xoauth's mouth up. So you won't see these debug messages:
    # signature base string:
    # GET&
    # xoauth string (before base64-encoding):
    # GET oauth_co...
    def write(self, msg): pass
  sys.stdout = ImBad()
  xoauth_string = xoauth.GenerateXOauthString(
      consumer, access_token, config.user, 'IMAP',
      xoauth_requestor_id=None, nonce=None, timestamp=None)
  sys.stdout = sys.__stdout__

  MINSIZE = int(raw_input('Larger than in KB [1000]? ') or 1000)
  MAXSIZE = int(raw_input('Smaller than in KB [5000]? ') or 5000)
  if MINSIZE > MAXSIZE:
    print >> sys.stderr, 'Wrong size range!'
    sys.exit(1)
  imap_conn = imaplib.IMAP4_SSL(imap_hostname)
  imap_conn.authenticate('XOAUTH', lambda x: xoauth_string)
  imap_conn.select('[Gmail]/All Mail', readonly=True)
  typ, data = imap_conn.search(None, '(SMALLER %d) (LARGER %d)' % (MAXSIZE * 1000, MINSIZE * 1000))
  # No SORT command on Gmail IMAP server
  #typ, data = imap_conn.sort('(REVERSE SIZE)', 'UTF-8', '(LARGER %d)' % SIZE)
  unreads = data[0].split()
  print '%d messages are between %d and %d KB.' % (len(unreads), MINSIZE, MAXSIZE)
  ids = ','.join(unreads[:MAX_FETCH])
  if ids:
    print 'Listing %d messages:' % min(len(unreads), MAX_FETCH)
    typ, data = imap_conn.fetch(ids, '(RFC822.HEADER)')
    for item in data:
      if isinstance(item, tuple):
        raw_msg = item[1]
        msg = email.message_from_string(raw_msg)
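        # Some emails' headers are encoded, for example: '=?UTF-8?B?...'
        # Assumption: the original arguments were lost; sender and subject are a guess
        print '\033[1;35m%s\033[0m: \033[1;32m%s\033[0m' % (
            email.header.decode_header(msg['from'] or '')[0][0],
            email.header.decode_header(msg['subject'] or '')[0][0])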

if __name__ == '__main__':
  main()

The output would look like:

% python2.5 ./
Larger than in KB [1000]?
Smaller than in KB [5000]?

23 messages are between 1000 and 5000 KB.

Listing 20 messages:
[messages here]

The search can take quite a long time to complete, up to minutes, so please be patient.

I wanted to find those big emails because I couldn't figure out why 9,085 emails could take up as much as 543 MB in my Gmail. The biggest mail is 15,189 KB, 2.80% of the used space. The second and third take 9,366 KB and 7,659 KB, together 3.14%.
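
The percentages can be checked with a few lines of Python. This is a quick sketch that assumes 1 MB is counted as 1,000 KB; that assumption is mine, and with 1,024 the figures come out slightly lower.

```python
USED_KB = 543 * 1000  # 543 MB of used space, assuming 1 MB = 1,000 KB


def share(size_kb):
    """Percentage of the used space taken by size_kb kilobytes."""
    return 100.0 * size_kb / USED_KB


print('%.2f%%' % share(15189))        # biggest mail: 2.80%
print('%.2f%%' % share(9366 + 7659))  # second and third together: 3.14%
```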

YouTube provides a few predefined sizes for embed code, but usually they won't fit your blog perfectly. I wrote a simple script to calculate the size based on the width that you want; it will give you the dimensions you should set.

As of writing, the control panel at the bottom of the video is 25 pixels high. If you have the border turned on, it adds 20 pixels to both the width and the height.
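
The calculation can be sketched in Python as below. This is not the original script; the function name embed_size and the default 16:9 aspect ratio are assumptions, while the 25-pixel control panel and the 20-pixel border come from the figures above.

```python
CONTROL_PANEL = 25  # height in pixels of the control panel below the video
BORDER = 20         # extra pixels on both width and height when border is on


def embed_size(width, aspect=(16, 9), border=False):
    """Return (width, height) for an embed that is `width` pixels wide overall."""
    extra = BORDER if border else 0
    video_width = width - extra
    video_height = int(round(video_width * aspect[1] / float(aspect[0])))
    return width, video_height + CONTROL_PANEL + extra


print(embed_size(480))               # (480, 295)
print(embed_size(500, border=True))  # (500, 315)
```

For a 4:3 video, pass aspect=(4, 3) instead.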

A sample embedded code looks like:
<object width="width" height="height">
<param name="movie" value=""></param>
<param name="allowFullScreen" value="true"></param>
<param name="allowscriptaccess" value="always"></param>
<embed src=";hl=en&fs=1&rel=0"
type="application/x-shockwave-flash" allowscriptaccess="always"
allowfullscreen="true" width="width" height="height"></embed>
</object>

You need to replace the width and height in two places: once on the object element and once on the embed element.