Donkeypuss's Wonderful World of Technology

Having used Opera’s mail client on my laptop recently (I got sick of Mozilla), I was very pleased with its simplicity. It felt like setting up Thunderbird took forever,with options scattered in two places, but Opera’s defaults suited me perfectly.

I decided to take the plunge and switch to Opera for my primary email client. That means importing thousands of old emails from Thunderbird. The import process was easy and seemed to go by without problems. After it was complete, I noticed several messages had been inadvertently duplicated—seemingly at random, and on all the accounts I imported. About 14,000 of the almost 100,000 emails were duplicates.

I figured out a relatively convoluted and somewhat hacked-together way to remove them. I learned some things about Opera M2 in the process:

M2 keeps a database with some basic information from each email, but the messages themselves are in individual files which are neatly sorted by date. When you click or double-click on a message Opera loads it from these individual files. However, it only reads as much of the file as it knows its size to be—that is, adding text to the end of the body in the file will not show up in the mail client. The size, subject, (and probably sender and receiver) are stored in a separate database file—this is the one that is accessed to add the messages to the list. So Â changing the subject line, for example, in the individual files does nothing to the subject as it appears in Opera. If you delete the message file, Opera does not remove it from its database. Instead, it looks as though it thinks it only downloaded the headers and the message is still on the server. So I figured there was no way to delete the emails other than from within Opera itself.

If you’re going to try this, don’t be stupid—backup your entire Opera mail directory like I did. Also, remember these scripts were developed by examples in my email collection, so it may not work for everyone. And like any good programmer, I only added checks for errors that actually occurred when I ran the scripts.

First, I wrote a Python script to calculate the MD5 checksum of the messages. It wasn’t as simple as calculating it for the entire file because Opera adds some “X” fields to the top of the header. By trial and error I arrived at the noted fields as being possibly different between otherwise identical emails. I think it’s safe to just discard the first 7 lines and calculate the checksum with the rest, but it didn’t seem to go significantly faster. Remember you are opening every single email you have, so expect this to take a while.

Next I loaded the resulting message-checksum list into a spreadsheet program, sorted it by checksum, and saved the results.

The next script loads the sorted list and looks for messages with identical checksums. It assumes there is only a maximum of 2 identical messages, though I believe the entire process would still work with n duplicates. It then saves identical pairs in another file. Note that along with comparing the checksums, it also compares the messages’ directories. I ran into at least one case where two actually “distinct” messages had the same checksum. Note that if you have a lot of short messages on the same day and account, there’s a good chance they’ll end up being marked as duplicates, and all but one will be removed.

Next I had to devise a way of marking the duplicate messages so that I could filter them in Opera and then delete them. This turned out to be more difficult than I had hoped for, because the subject line is not read from the message files (it is in the database) and because anything added to the end of the message file will not show up in Opera because it only reads up to the size that it has recorded for that message in the database. So the only way to achieve this was to find the beginning of the body and add some distinctive set of characters. Then the end of the message (or an attachment) would be cut off at the end, but we don’t care since we’re going to delete these guys anyway. I found that in my several years of email collection, going from (I think) MSN mail to Outlook Express to Outlook to Thunderbird to Opera, a few of the older guys got mangled into bare headers without a message. I noticed some of these had been duplicated, too, but my scripts do nothing to them.

Finding the beginning of the body is a bit tricky. The header is separated from the message by a blank line. But if there are attachments, a MIME header follows. The body is then separated by a blank line after that. This script takes that into account. You can uncomment the “raw_input” line to allow you to stop processing in case you want to see what it will do to the first duplicate. Also, I highly recommend you run the script once through with the line that adds the marker to the messages commented out to make sure no errors occur during processing. I suppose it’s no big deal, but if for some reason the script stops (or you have to stop it) before it’s done, the messages that have been processed will be reprocessed the next time you run the script, thus adding the marker twice. I don’t think that’s a problem, but beware.

After all the messages have been tagged, you just need to set up a filter in Opera to search message bodies for whatever you set “dupemarker” to. Once the filter has gone through all the messages, you can check that the number of messages found is less than or equal to the number of pairs messages with the same checksum (it may be less than if some messages had the same checksum but were in different accounts, as happened at least once in my case, or if they are blank, which also happened at least a few times). In my case, I had 14550 duplicate pairs of which 14391 were marked by the script.

Once the filter is done, select all the messages, and delete! You can keep them in the trash for a while if you’d like. Or, you can just keep them in the filter until you’re sure nothing went wrong. Remember that although Opera will not show the end of the tagged messages, or claim that attachments cannot be loaded because of parsing errors, the information is still there. If you remove the marker from the messages, Opera will be able to read them properly. This is due to the fact that Opera keeps the size of the messages separate from the messages themselves.

Good luck.

How to remove duplicate emails in Opera M2

Post a Comment

Categories

Archives

Advertisements

People

Projects

Reference

Software

Tools