Jayd Saucedo


Receipt Data Analyzer
It's been quite a while since I posted, partly because I've been busy with other things and the time I spend on side projects has dropped off quite a bit. It hasn't stopped entirely; it has just become a much, much slower process. It's no surprise that summer is when I start delving into side projects more. The side project I want to talk about today is something I started toying with around March or so. The idea came from the fact that I really don't have much of an idea what a good price for certain groceries is. I go to the store, see that bananas are $0.55 apiece, and can't tell if that's good or bad. I simply don't remember what the going rate for bananas is or what would count as a good deal. So I thought to myself, "Wouldn't it be great if grocery stores had a public database of all of their items, so we could compare prices across stores without spending the time and energy to scope out each store in person?" Well, the reality is that grocery stores are probably never going to do that, because it wouldn't really benefit them.

If they're not going to provide a database of items, I'll just have to make my own! So does this mean I'm going to go into the store and record the price of every single item, and keep doing that because prices change frequently? No, that would be an awful waste of time. It may be practical some day when the technology exists to send robots to do it. For now, I'm just going to go to the store, buy the things I'm actually interested in, and then use the receipt data to build my database.

So, how much data is on a receipt? Not a lot!

This is a receipt from a recent trip to Costco. Really, there are only two important columns of information here: the name of the item and the price. This is a rather small receipt. I only bought 7 things, so the idea that I could manually enter these into a database isn't too far-fetched. Except that this is Costco, so 7 things is a LOT of food. If I had gone to a non-bulk grocery store I might have gotten 30 things (I actually do have a scan of such a receipt, but the image would be too big and cumbersome), which means I'd have to be pretty dedicated to enter all that manually.

Thankfully, I'm a lazy programmer and I do my best to find solutions that require the least effort. Optical character recognition (OCR) is the solution for this kind of tedium: it turns the receipt image into the equivalent text. I used pytesser for OCR in my program, which is itself just a Python wrapper around Tesseract. It's not always perfect, but if you take a few seconds to read over the results and fix them, it's still less work than inputting all the data manually.

The PHP code for the upload simply looks like:

$py = "C:\\Program Files (x86)\\python27\\python.exe";
$scr = "C:\\Program Files\\wamp\\www\\site\\test\\receipt\\pytesser\\pytesser.py";

if(isset($_FILES['photo'])){
	$name = $_FILES['photo']['name'];
	$tmp = $_FILES['photo']['tmp_name'];
	//TODO: verify that it is an image
	exec("\"$py\" \"$scr\" \"$tmp\"", $output, $return_var);
	// Debug output, left commented out:
	// echo "\n".print_r($output, true)."\n";
	// echo "\n".$tmp;
	echo implode("\n", $output);
}

Now that we have that text, we can try to interpret it. The computer doesn't know what each column is, so you have to tell it. My program lets you enter a regular expression that describes the different sections of data, with one capture group per column. To interpret the Costco receipt, the regex is as simple as:
\w\s*(\d+)\s*(.+)\s*(\d+\.\d+)
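
To make the capture groups concrete, here is a minimal Python sketch of applying that pattern line by line. The sample lines are made up (real OCR output will differ), and `parse_line` and `COSTCO_LAZY` are hypothetical names that are not part of my program. The second pattern is only a suggested variant: because `(.+)` is greedy, the original pattern can let the item name swallow the leading digit of a two-digit dollar amount, and a lazy name followed by required whitespace avoids that on these samples.

```python
import re

# The Costco pattern from the post, unchanged.
COSTCO = re.compile(r"\w\s*(\d+)\s*(.+)\s*(\d+\.\d+)")

# Variant (an assumption, not from the post): a lazy name plus required
# whitespace before the price, so the name cannot absorb price digits.
COSTCO_LAZY = re.compile(r"\w\s*(\d+)\s+(.+?)\s+(\d+\.\d+)")


def parse_line(pattern, line):
    """Return (item_number, name, price), or None if the line does not match."""
    m = pattern.search(line)
    if m is None:
        return None
    number, name, price = m.groups()
    return number, name.strip(), float(price)


# Hypothetical OCR output; real receipts will differ.
SAMPLE = ["E 123456 BANANAS 1.39", "E 987654 PAPER TOWELS 15.99", "SUBTOTAL"]

for line in SAMPLE:
    print(parse_line(COSTCO, line), parse_line(COSTCO_LAZY, line))
```

On the second sample line the original pattern reports the name as "PAPER TOWELS 1" and the price as 5.99, while the variant reports "PAPER TOWELS" and 15.99, so it is worth reading over the parsed prices as well as the OCR text.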

Now, the regex for a WinCo receipt looks more like this:

(.+)\s(\d{10,11})\s(\d*\.\d+)\s(TF|FS|TX)
(\d*\.\d+)\s*\w+\s*@\s*\d*\s\w*\s*/\s(\d?\d?\.\d+)\n(.+)\s(\d{4,5})\s(\d*\.\d+)\s(FS|TX|TF)
(\d)\s*@\s*(\d*\.\d+)\n(.+)\s*(\d{10,11})\s*(\d*\.\d+)\s(FS|TX|TF)

That might look terrifying, but it's not really as bad as it looks once you understand how regular expressions work. What you're actually looking at is a set of 3 different regular expressions, each describing a different situation on a receipt. For example, if you bought fruit by the pound, the receipt would probably say that it was 3 lbs of fruit at $0.50 a pound for a total of $1.50. That would be different than if you bought some cereal and the receipt simply said "Cereal $1.50" without specifying the price per unit or the quantity.
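
As a rough illustration of how a set like this can be applied, here is a Python sketch that walks through the receipt text and tries each pattern in turn. The three patterns are copied from above; the labels, the `classify` function, and the sample receipt text are all assumptions made up for the example, not what my program actually does. The two-line patterns are tried first, because the single-line pattern would otherwise match the second line of a two-line entry on its own.

```python
import re

# The three WinCo patterns from the post, unchanged. The labels are hypothetical.
PATTERNS = [
    ("by_weight", re.compile(
        r"(\d*\.\d+)\s*\w+\s*@\s*\d*\s\w*\s*/\s(\d?\d?\.\d+)\n"
        r"(.+)\s(\d{4,5})\s(\d*\.\d+)\s(FS|TX|TF)")),
    ("multiple", re.compile(
        r"(\d)\s*@\s*(\d*\.\d+)\n(.+)\s*(\d{10,11})\s*(\d*\.\d+)\s(FS|TX|TF)")),
    ("single", re.compile(
        r"(.+)\s(\d{10,11})\s(\d*\.\d+)\s(TF|FS|TX)")),
]


def classify(text):
    """Return a list of (label, groups) for every entry some pattern matches."""
    rows = []
    pos = 0
    while pos < len(text):
        for label, pattern in PATTERNS:
            m = pattern.match(text, pos)
            if m:
                rows.append((label, tuple(g.strip() for g in m.groups())))
                pos = m.end()
                break
        else:
            # Nothing matched at this position: skip to the start of the next line.
            newline = text.find("\n", pos)
            pos = len(text) if newline == -1 else newline + 1
    return rows


# Made-up receipt text; a real WinCo receipt may be laid out differently.
SAMPLE = "\n".join([
    "CEREAL 1234567890 1.50 TX",
    "3.00 lb @ 1 lb / 0.50",
    "BANANAS 4011 1.50 FS",
    "2 @ 0.75",
    "YOGURT 0987654321 1.50 FS",
    "TOTAL 4.50",
])

for label, groups in classify(SAMPLE):
    print(label, groups)
```

Keeping the label next to each row is what lets every regex in the set get its own section and its own column headings later on.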

After you parse the data into its different sections, my program puts it all into a table laid out in the same fashion as the receipt: each row is a line of the receipt, each column can be labeled using a drop-down box, and each cell has an input field holding the data from the receipt. Each regular expression in the set gets its own section, since the column labels may need to describe its columns differently. You can also delete rows and columns if you find the data frivolous.

You can also insert new columns to describe things that were not on the receipt. Sure, it's useful to know that the cereal was $1.50, but say you go to another store and see the same cereal in a slightly bigger box for $2.00. Is there just $0.50 more cereal in that box, or is it a better price? It's impossible to know if you don't have the quantity of the $1.50 cereal, so you may choose to enter that. This is where the program might get tedious. Who wants to go through their groceries and figure out the quantity of everything? I don't, but I also don't have a better solution yet, so I guess I'll do it anyway.
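
The cereal comparison comes down to price per unit. Here is a small Python sketch, assuming the two boxes are 12 oz and 18 oz. Those sizes are invented for the example, and `unit_price` and `better_deal` are hypothetical helper names.

```python
def unit_price(price, quantity):
    """Price per unit of quantity (for example, dollars per ounce)."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return price / quantity


def better_deal(a, b):
    """Each argument is (label, price, quantity); return the cheaper label per unit."""
    return min(a, b, key=lambda item: unit_price(item[1], item[2]))[0]


small = ("small box", 1.50, 12)  # 12 oz is an assumed size
large = ("large box", 2.00, 18)  # 18 oz is an assumed size

print(round(unit_price(small[1], small[2]), 3))  # dollars per ounce, small box
print(round(unit_price(large[1], large[2]), 3))  # dollars per ounce, large box
print(better_deal(small, large))
```

With those assumed sizes the small box works out to about 12.5 cents an ounce and the large one to about 11.1 cents, so the $2.00 box would be the better price. Without the quantity column there is nothing to divide by.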

The next step was to get all that data into the database, which was a bit tricky because you basically create your own form when you enter your regex and generate your own table. I won't get into the details here, but it did require a little bit of thought. My next step is to make all this data editable in case you want to go back and change or delete stuff, and after that I can start messing with data lookup techniques for when you're out and about shopping.
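
To give a feel for why the storage is tricky, here is one possible layout, sketched in Python with SQLite. It is not the schema my program uses. Since every receipt format can have a different set of columns, this sketch stores one database row per table cell, tagged with the column label picked from the drop-down. The `cell` table, the label names, and both functions are assumptions made up for the example.

```python
import sqlite3


def save_rows(conn, receipt_id, labels, rows):
    """Store each cell as (receipt, row, label, value) so columns can vary."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cell ("
        "receipt_id INTEGER, row_index INTEGER, label TEXT, value TEXT)"
    )
    cells = [
        (receipt_id, i, label, value)
        for i, row in enumerate(rows)
        for label, value in zip(labels, row)
    ]
    conn.executemany("INSERT INTO cell VALUES (?, ?, ?, ?)", cells)
    conn.commit()
    return len(cells)


def prices_for(conn, name):
    """Look up every recorded price for an item name, oldest receipt first."""
    query = (
        "SELECT p.value FROM cell AS n JOIN cell AS p "
        "ON n.receipt_id = p.receipt_id AND n.row_index = p.row_index "
        "WHERE n.label = 'name' AND n.value = ? AND p.label = 'price' "
        "ORDER BY p.receipt_id"
    )
    return [float(value) for (value,) in conn.execute(query, (name,))]


conn = sqlite3.connect(":memory:")
save_rows(conn, 1, ["name", "price"], [("BANANAS", "1.39"), ("CEREAL", "1.50")])
save_rows(conn, 2, ["qty", "name", "price"], [("2", "BANANAS", "1.29")])
print(prices_for(conn, "BANANAS"))
```

The trade-off with a layout like this is that lookups need a self-join, but a new receipt format never requires changing the table definition.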

Hopefully one day I can buy my bananas confident that I am getting a good deal.