"The art of mathematics consists in finding that special case that contains all the seeds of generality," David Hilbert

Wednesday, September 29, 2010

Secure Voting Protocols

How much confidence do you have that your vote was counted in the last election? If you had to guess, what do you suppose the overall accuracy of a typical U.S. election is? I have to wonder if we do any better than +/- 2.5%, which strikes me as awful.

The punchscan voting system is a way of running an optical scan paper ballot election that makes it possible for any interested voter to independently verify the vote totals by issuing receipts to voters that allow them to check that their vote was counted as they intended it to be, without making it possible for them to use the receipt to prove to anyone else how they voted (well, there are risks here, but in later sections I argue that they are acceptable risks).

In this article, I render my own explanation of how the nuts and bolts of the protocol work. I consider a variation on the protocol that uses letters written next to the candidate bubbles (instead of the two ballot sheet in which holes on the top sheet cover the letters on the bottom sheet) which simplifies the initial exposition significantly, but still covers the meat of the protocol.

Running an Election with Receipts

The full method is a bit much to swallow in one bite, so we introduce the features of this method one at a time (starting with the notion of a voter receipt) by imagining an election in which 5 ballots are cast. We consider a total of 5 voting methods, with the last one showing all of the methods of the Punchscan system.

In all of these scenarios:

  • There are 3 candidates in the election, numbered 0, 1, and 2.
  • 5 ballots are cast.
  • When you go to vote, you are handed a ballot with a serial number. You fill in a circle next to the candidate you want.
  • Your ballot is scanned
  • EA stands for "Election Authority".

Scheme 1: Publish Ballot ID's and Choices

P Table
idchoice
10
21
31
41
52
  • Assume the 5 votes were cast as shown above (one for candidate 0, 3 for candidate 1, and 1 for candidate 2).
  • The election authority publishes the P Table (the list of ballot id's and choices).
  • Good
    • cheating is hard
      • since you know how which ballot ID you were given, and how you voted, you can tell whether your vote was counted correctly or not.
      • if the EA tried to cheat by changing rows in this table, the voters whose votes were stolen in this way would know and could cry foul.
      • if even a small percentage of voters bothered to verify their vote from the published record, any attempt on the government's part to steal the election would be easily detected
  • Bad
    • vote selling
      • Anyone with power over you can force you to vote for X
      • They would ask you to tell them, shortly after you vote (and before the P table is published, when you could just look up another vote for X and give them that serial number), the ID number on your ballot.
      • After the EA publishes the P Table, they can check to see how you voted.
      • If you don't vote they way they want you to, they can fire you, give you a bad grade in class, divorce you, etc.

Scheme 2: Scramble P before Publishing

We introduce 2 new tables:

  • D Table: this table scrambles the rows of P to produce another table R.
    • The pid column is the id of the row of the P table the vote (choice) came from.
    • The rid column is the id of the row in the R table to vote (choice) is copied to.
  • R Table: this is used to report the votes.
The EA, after the election, publishes the R table, keeping the P and D tables secret.

    • Secrecy has been restored, as randomizing the order of the votes hides the connection between the ballot choices and serial numbers.
    • Cheating is now trivial, as the voter has no way to check their receipts, and without any information about the P and D tables, the R table can not be verified by the public.
    • In the following schemes, we will restore this ability while preserving the secrecy of the ballot.

Below is our original P table, together with one possible set of D and R tables to scramble the ballot order:

P Table
idchoice
10
21
31
41
52

D Table
idpidrid
132
223
344
415
551

R Table
idchoice
12
21
31
41
50

Scheme 3: Coding the Ballots

We can restore some of the ability of the public to check the accuracy of the process by introducing coded marks on the ballots. Recall that there are 3 candidates, numbered 0, 1, and 2. A typical ballot looks like this:

    • ballot #1034
    • ( ) candidate 0
    • ( ) candidate 1
    • ( ) candidate 2

We now introduce letters to the left of the bubbles in a random order on each ballot. An example (cast) ballot would look like this:

    • ballot #1034
    • b ( ) candidate 0
    • c (X) candidate 1
    • a ( ) candidate 2

Since there are only 3 mark symbols introduced (a, b, and c), we can number these 0, 1, and 2 and track them as numbers. In printing the ballots, we choose a random shift of 0, 1, or 2 to assign the marks to the choices. A shift of 1 gives us a ballot like this:

    • mark 1 candidate 0
    • mark 2 candidate 1
    • mark 0 candidate 2

While a shift of 2 would give us this ballot:

    • mark 2 candidate 0
    • mark 0 candidate 1
    • mark 1 candidate 2

The P, D, and R tables, after the election is held, now look like this:

P Table
idchoiceshiftmark
1022
2112
3120
4112
5202

D Table
idpidrid
132
223
344
415
551

R Table
idchoice
12
21
31
41
50

The EA publishes the id and mark columns of P table (which we can denote P{id,mark}) and the full R table (which we can denote R{id,choice}). Now the public can use their receipts to check that the id and mark from their ballot matches the published P{id,mark} table, without revealing to anyone else (except the EA) how they voted. The EA can still make up any R table they want, since no information about the D table is published.

Assuming a trustworthy EA, the voters are now in a position to detect faulty optical scanning machines that record the wrong marks, and other mistakes that could result in the P table not being filled in correctly with the marks from the ballots.

Scheme 4: Adding Random Shifts to the D Table

We now make it possible for the EA to reveal information about the D table, that would allow more public scrutiny into the election, by introducing two additional shifts of the ballot mark, which we will call the left shift and the right shift. The EA can still easily make up any vote totals it wants; the voters can only help an honest EA detect mistakes in the vote counting.

The columns of the D table are now these:

  • id - the row number
  • pid - the row number in the P table the vote comes from
  • rid - the row number in the R table the vote is copied to
  • left - a random shift, applied to the mark on the ballot, to determine the mark in the D table. Thus, D[id].mark = (P[pid].mark + D[id].left) mod 3
  • right - a shift, applied to the D mark, that shifts it back to the original choice in the P table. Thus, (D[id].mark + D[id].right) mod 3 = P[pid].choice = R[rid].choice.
  • mark - the mark from the P table shifted by D[id].left.

The tables, after the election, now look like this:

P Table
idchoiceshiftmark
1022
2112
3120
4112
5202

D Table
idpidridleftrightmark
132010
223201
344110
415221
551120

R Table
idchoice
12
21
31
41
50

Suppose there were mistakes made by the (honest) EA in filling in the D and R tables. How might the EA be able to enlist the help of the public in spotting those mistakes? If the EA were required to carryout the D table operations by hand and not use any computers, there would be a real possibility of human error.

For each row of the D table, there are two steps to filling it, and the corresponding row in the R table, in:

  • copying P info into D
    • D[id].mark = (P[pid].mark + D[id].left) mod 3
  • copying D info into R
    • R[rid].choice = (D[id].mark + D[id].right) mod 3

Recall that the EA has already published P{id,mark}. The EA now publishes D{id,mark}. So far, the public can't determine P{choice}, and it can't validate the two operations above.

Notice that for each row in D, the EA could reveal the id, pid, and left columns without revealing P[id].choice. We will call D{id,pid,left} the left half of the row. Notice further the the EA could reveal the id, rid, and right columns of D without giving away P{choice}. We will call D{id,rid,right} the right half of the table.

If the EA reveals the left half of a D row, the public can validate the first P->D operation above. If is reveals the right half, then the public can validate the D->R operation.

The EA, then, does this:

  • for each row in the D table, it randomly chooses the left half or right half.
  • it reveals that information to the public.

To recap the election protocol as presented so far, the steps are:

  • the EA generates P{id,shift} and prints ballots with numbers to the left (the marks) of each bubble and numbers to the right (the choices).
  • the EA generates D{id,pid,rid,left,right}
  • The voters cast ballots, filling in P{mark,choice}
  • The EA publishes P{id,mark} and the voters check their receipts.
  • For each id in D{id}, the EA carries out these operations:
    • D[id].mark = (P[pid].mark + D[id].left) mod 3
    • R[rid].choice = (D[id].mark + D[id].right) mod 3
  • the EA publishes R{id,choice}, and D{id,mark}.
  • For each for in D, the EA opens the left {id,pid,left} or right {id,rid,right} half.
  • The public verifies half of these operations:
    • D[id].mark = (P[pid].mark + D[id].left) mod 3
    • R[rid].choice = (D[id].mark + D[id].right) mod 3

What are the chances that any given error in transferring a choice from the P table to the R table will go undetected? There are two steps (left and right), and the information required to validate is only revealed for one of them, so there is a probability of 1/2 that any given error will go undetected. The probability of 2 errors going undetected is 1/4, and 3 is 1/8. Every additional error halves the probability of none of the errors in the set being detected, so for N errors we have a probability of 1/(2**N).

Cheating, on the other hand, is still trivial for the EA. For any given row, they can simply hide the cheating in the half they don't open. Even if a body of auditors were to make the left/right choice for every row in D, the EA could simply change the requested information before "revealing" it.

Suppose, for example, that a ballot was voted and tallied through the D and R tables thusly:

  • P.choice= 1
  • P.shift= 1
  • P.mark= 2
  • D.left= 1
  • D.mark= 0
  • D.right= 1
  • R.choice= 1

The EA decides to change this vote to choice 2 on the right side, and publish this information (bear in mind that the public doesn't know which rows in (P, D, and R are tied to each other yet, even though these 3 pieces of information appear in the published tables)

  • P.mark= 2
  • D.mark= 0
  • R.choice= 2

If the EA is asked to reveal the left half, nothing will seem amiss. If the EA is asked to reveal the right half, then they can simply claim:

  • D.right= 2

even though, when they originally setup the D table, they had:

  • D.right= 1

What's needed here is a way to force the EA to reveal enough information about the D table so that they can't change the pid, rid, left, or right values after they initially pick them, so that once the left/right choice is made they have no choice but to reveal the same values they chose initially, or be caught lying red-handed.

We need to be able to force them to commit to a secret value before revealing it.

Commitments

We pause in our presentation of election schemes show one technique for doing that in this section.

Suppose Alice and Bob wanted to choose a place to eat for lunch. Alice calls heads, Bob flips a coin, the coin lands tails, and Alice accepts defeat.

What if Alice and Bob wanted to do this over email? How would Alice know, after she emails him "heads" and he replies "tails you lose", that he actually flipped a coin instead of making it up?

Bob can generate two messages: one that encodes a choice of heads or tails (but doesn't reveal which one), and a second one that decodes the first (revealing the choice). He can generate the first message in such a way that he can't change his choice later (that is, if he generates a message committing to tails, he can't generate a second message the decodes that one to heads). A scheme for generating messages like this is called a "commitment" scheme.

One such commitment scheme uses a hash function, which we'll call f() for now. f() takes as input text string (of any length) as input and produces another text string as output (always 160 bits, or 20 characters long, for a total of 2**160 possible values). The key property of f() we will be using here is that it is very hard, given a particular output value, hv (hv = "hash value"), to find an input string x such that f(x)=hv. It is so hard, in fact, that the easiest way to attempt that is to generate x values at random and evaluate f(x) until you find one where f(x)=hv. On average, you will have to search a number of text strings that is a good fraction of 2**160, and thus the sun will burn out before you can reasonable expect to find one.

Armed with our hash function f(), Alice and Bob can use the following protocol to flip a coin over email, with no reasonable possibility of either side cheating:

  • Bob flips a coin and gets tails. He then generates a random number N and computes h1= f(N+'tails') (where the string 'tails' is concatented to the string of digits representing N). He emails Alice h1.
  • Alice replies "I call heads"
  • Bob replies "Here is N. The coin was tails, you lose"
  • Alice computes h1= h(N+'tails') and sees that it matches h1 from Bob's earlier email.

If Alice had called tails, Bob would have to try close to 2**160 different values of N before he would have a good chance of finding one where f(N+'heads')=h1, matching the first email he sent Alice.

Commonly known hash functions are MD5, SHA1, and SHA2 (which is actually a set of related functions: SHA-224, SHA-256, SHA-384, SHA-512).

SHA256 produces output strings 160 bits long. Here are some example values (we are representing the output values as hexadecimal strings, which only conveys 4 bits per character, resulting in 160/4 = 40 hex characters of output):

  • SHA256('What is the sound of one hand clapping?') = 12e62ed5e0ea7da28f8726ef7b2eda8d8457d075
  • SHA256('What is the sound of one hand Clapping?') = 20e69b1f5698714dd57af1faebb0201c075f598d

Notice that in the second string we only changed a single character (we changed 'c' to 'C' in 'clapping'), yet most of the hex characters changed as a result. Generally, changing a single bit of the input is enough to change roughly half of the bits in the output, which is what you would expect from a function that was assigning output values to input values randomly (thus forcing a full search to work backwards from a known output).

Scheme 5: Committing to the secrets in P and D

We can now present the full punchscan voting scheme.

To remove the ability of the EA to change the pid, rid, left, or right values in the D table, and thus produce any R table they want without fear of detection, we add commit columns to the P and D tables.

The full specification of the P and D tables are now:

P:

  • id - the row number (also the serial number of the ballot).
  • choice - a number (0, 1, or 2) choosing a possible candidate.
  • shift - a number (0, 1, or 2) that is added to choice to calculate mark.
  • mark - a number (0, 1, or 2) that records choice while hiding it. While this is represented as a number in the table, it would appear on the ballot as a letter (a, b, or c) or other symbol when printed.
  • n - the random number used in the commit
  • commit - f(n+ id + shift)

D:

  • id - the row number
  • pid - the row number in the P table the vote comes from
  • rid - the row number in the R table the vote is copied to
  • left - a random shift, applied to the mark on the ballot, to determine the mark in the D table. Thus, D[id].mark = (P[pid].mark + D[id].left) mod 3
  • right - a shift, applied to the D mark, that shifts it back to the original choice in the P table. Thus, (D[id].mark + D[id].right) mod 3 = P[pid].choice = R[rid].choice.
  • mark - the mark from the P table shifted by D[id].left.
  • ln - the random number used to calculate lcommit
  • lcommit - f(ln + id + pid + left)
  • rn - the random number used to calculate rcommit
  • rcommit - f(ln + id + rid + right)

The Election Protocol is now:

  • The EA generates P{id,shift,n,commit} and prints ballots with numbers to the left (the marks) of each bubble and numbers to the right (the choices).
  • The EA generates D{id,pid,rid,left,right,ln,lcommit,rn,rcommit}
  • The EA publishes P{id,commit}, and D{id,ln,lcommit,rn,rcommit}
  • The auditors randomly choose half of the ballots that were printed
    • The EA marks these ballots as spolied, opens them for inspection, and publishes the corresponding rows of the P and D tables fully (P{id,shift,n}, D{id,pid,rid,left,right,ln,rn}).
    • The auditors verify the P and D commitments, and that the P{id,shift} matches the spoiled ballots.
    • Note that in anticipation of this step, the EA prints at least twice as many ballots as it thinks it will need to run the elestion.
  • The voters cast ballots, filling in P{mark,choice}.
  • The EA publishes P{id,mark} and the voters check their receipts.
  • The EA tallies the votes and publishes the R table (R{id,choice}).
      • This involves carrying out these operations for each id in D{id}:
      • D[id].mark = (P[pid].mark + D[id].left) mod 3
      • R[rid].choice = (D[id].mark + D[id].right) mod 3
  • the EA publishes R{id,choice}, and D{id,mark}.
  • The auditors randomly make a left/right choice for every row in the D table and publish this list.
  • In accordance with this list, the EA opens the left {id,pid,left} or right {id,rid,right} half.
  • The public then
      • verifies the revealed left halves: D[id].mark = (P[pid].mark + D[id].left) mod 3
      • and right halves: R[rid].choice = (D[id].mark + D[id].right) mod 3

Threats

The Diskless Workstation and the Election Seed

How does the the Election Authority keep its secrets? Any attempt to store the private portions of the P and D tables on a computer, or to write them down, is highly suspect because of how valuable that information is to any group that wants to force vote selling.

One way to keep the amount of information that has to be safeguarded small is to use a pseudo random number generator that is initialized from a single 160 bit seed value to generate the private portions of the P and D tables. If this value, in turn, were generated from a passphrase entered by an EA trustee, then there would be no need to write anything down. The trustee could simply keep the passphrase in her head. In addition, you could arrange to split the information among a group of trustees by having them, at the time of each EA publishing event (where the tables have to be generated in full, then used to calculate a set of columns that need to be published), enter their passphrases to a workstation that then generates the 160 bit seed from the set of the passphrases taken jointly. Now the knowledge is split between each of the trustees and can only be used when they meet to act together.

The most likely point of attack is now the workstation that is used to recreate the seed and all of the private P, and D table information.

If one of the trustees provides the workstation that they all use, how can the other trustees be sure the workstation is not storing the seed value where it can later be retrieved and sold for its value in vote buying?

Even if all of the election software is run from a bootable DVD on a workstation with no (apparent) hard drive or flash devices attached, and the exact ISO image of the DVD is public knowledge (that is, the entire OS and election software is a publicly vetted open source compilation), and the trustees each calculate a checksum on the DVD in their own laptops before booting the diskless workstation from it, how can all of the trustees verify that the "trusted" workstation doesn't have a hidden flash device hidden on a chip on the motherboard that contains the real OS used to boot the machine, and which saves the seed value so that it can be harvested and sold later?

I think the best that you can do is to, on the day of each EA meeting, have the trustees jointly generate a random number which is then used to select a computer store within 10 miles of the meeting place (whatever the radius, there should be enough stores on the list that it becomes impractical for a group of trustees to corrupt that store enough to plant a bugged PC there). The trustees then drive together to the selected store, buy a diskless PC, drive it to their meeting place, validate the live DVD that are going to use, and boot the PC from it. After the publish event of that day, the trustees should destroy the PC.

How do the trustees jointly generate the number used to select the store from the list? I think the list of stores should be published the day before the meeting, and the random number chosen in a predictable way from public data that only becomes available right before the meeting (for example, the closing prices of a large number of companies of a given stock market). Anyone can then run the calculation themselves to see which random number was used to pick the store, removing the possibility of the EA trustees colluding to pick just the right store, at which they had previously arranged to stash a compromised PC.

Another way to mitigate the threat of disclosure of the election seed is to run a separate punchscan election at each polling location. That is, each polling location would have a different set of trustees, which would perform all of the table generation steps and audit responses independently of the groups at other polling locations. As each EA would now be in charge of at most several thousand votes, engaging in vote selling on a large scale would require corrupting a large number of EA's.

Who Prints the Ballots?

How does the EA arrange to print the ballots before election day? Suppose the EA needs 10,000 ballots for their polling location. At one of their meetings, after booting their diskless workstation, they generate a set of 10,000 PDFs, which are burned onto a DVD. Then then take this DVD to Kinko's for printing.

There are a number of problems with this procedure.

The P{shift} column can be compiled by looking at each ballot image, noting the order of the marks next to the column of bubbles, and writing it down. If you simply handed the DVD to a person at Kinko's, they could make a copy and then sell it to anyone interested in vote buying.

The only solution to this problem I can think of is to handle the printing using the same procedure for obtaining the diskless workstation. In addition to buying the diskless workstation from a random computer store, the printer is also bought from the same store on the day of the EA meeting. The ballots are printed by the trustees in full view of each other, and the printer destroyed along with the workstation after the last ballot is printed.

The printed ballots are then treated with the same chain of custody procedures as a ballot box filled with cast ballots on election day.