Tube Matching with Python, Part 3 — Finding the Least Squares

Since my last post, a lot has changed.  I re-thought the overall organization of my code and made significant changes, struggled through the process of getting familiar with Pandas, and finally got all the functions refined enough to give me the answers I was looking for.


I finally have a way to find the closest matches for any tube I have on hand, but I can also figure out exactly which combinations produce the best possible sets out of the entire batch.  I think this will likely result in a marginal but verifiable performance increase in my M97 compressors, which is fantastic.

All that remains at this point is the data visualization — there’s a lot of data there that I want to see represented in a graphical way, and I haven’t decided whether to use pandas’ built-in graphing, matplotlib, seaborn, bokeh, pygal, or any of the other available tools.  The ChiPy mentorship ends in three weeks (!) so I’ll be experimenting with the graphing libraries any chance I can squeeze in a couple of hours, and hopefully present something that has the appearance of being finished in time for the deadline!

Code Reorganization

When I started my project, my mentor was not particularly enthusiastic about my plans to use tkinter’s GUI elements in my program, and instead favored a web-based interface.  I had originally envisioned a web-based interface as well, but by that time I was picturing a little app that would only ever run on the PC that I have dedicated to tube measurements in my lab.  The web interface seemed unnecessary, and the idea of sculpting my own interface appealed to me, so I continued to explore tkinter and ended up using the filedialog and simpledialog objects as described in Blog #2.

My original 2-module approach, with an importer module and an analyzer module, made sense logically when I started but was beginning to seem unnecessary now.  At some point it dawned on me that my code should make the importing of the files an invisible, under-the-hood process that the user should be unburdened by. I’m already using the command line to manage my git repository, launch Jupyter, etc. so why not also navigate to the proper folder and pwd, then copy/paste directly into Jupyter?  It’s so simple that I can’t justify the use of tkinter anymore.

Restructuring the Two Modules

The problem with putting everything into one Jupyter Notebook is having to look at all the code for the Importer module, which is about 90 lines.  Instead, I decided to make it a separate module and import it at the beginning of my Jupyter file.

from electronaut_uTracer_Import import *

So actually my program maintains the original two-module idea, but implements them both into one Jupyter Notebook, making it simpler to use.  I’m much happier with this set-up.

Scratching the surface of Pandas

After a couple months of studying Python, I was feeling happy with the progress I was making and was feeling like it was all a lot easier than I had anticipated.  Then came Pandas!  For whatever reason pandas seemed far less intuitive, the documentation rarely seemed to answer my questions (or I’m too impatient to read through it all), and the errors didn’t necessarily make it obvious what the problem is.  Thank god for Stack Overflow!  There are a bunch of things I wish I had known when I started, and I’ll be talking about those toward the end of this blog with the hope that any other pandas newcomers will find them helpful.


After the importing process has been completed, the main dataframe that holds all the tube data is created.

The first thing I did was to run the .describe() method to produce a new dataframe showing the statistics of the data.

For reasons that I don’t fully understand, I had to explicitly tell the .describe() method to interpret the data as floats in order to get the right results, otherwise I would get a dataframe with only the ‘Count’, ‘Unique’, ‘Top’, and ‘Freq’ rows.

This created a new problem, however: columns that should be ints, like the tube_ID column, were now floats.  I resolved this by explicitly switching those columns back to ints.  Obviously, this seems like a glaring code smell, but I’m not sure how to fix it and frankly, it’s not that important.  Another rainy-day item for the infinitely long to-do list!

Now I have a nice new dataframe showing the statistics of all the tube data.

Least Squares

Each tube has 13 current measurements, and I needed a way to compare them all to every other tube in the set.  I decided to try the ‘least squares’ technique, which involves computing the difference between two measurements, squaring that number, repeating that for every other measurement, then adding all the squares together to get a final score.  The lowest score of the bunch is the best match to that particular tube.

Basically, I’m comparing every tube in the batch to every other tube and building a giant dataframe filled with differences, squared.  The len of this new dataframe will always be the number of tubes I imported, squared.

This process actually takes quite a long time.  In the case of importing a batch of 167 tube data files, it took 20.65 seconds to calculate 446244 values, resulting in 27,889 rows!

As can be seen in the image above, the ‘match tube’ is compared to the ‘ref tube’ to compute the squares.  The first row shown above is the result of comparing tube number 1 to itself — there are no differences, so all the values are zero.  Subsequent rows show the squared differences when tube 1 is compared to other tubes, and the squares are summed and shown in the error_sum category.

From here, finding a match for any particular tube is simple — just filter the dataframe to include only the ref_tube I want to find matches for, and pick the tubes with the lowest error_sum value.

I made a function to do this:

This function works great, if I don’t mind hard-coding the tube I’m trying to match and the number of matches I want in the set.

I can come up with an overall ‘score’ by summing all the mismatch errors for the new tube set, as can be seen above.  In the case of finding matches for tube #1, the best matches produced a total tube_set_score of 306.

Is that a good score?  How does that score compare to other possible sets?  At this point these questions remained unanswered, but I could see their importance just by playing around with the function above.  For example, if I find matches to tube #2, the tube_set_score is more then ten times better!

What I needed was a way to iterate through the entire dataframe, calculating the score of every possible tube set, then produce a list of tube sets and their scores in order from best to worst.

This was trickier than I had originally expected it to be, and I spent a lot of time pacing around my lab wracking my brain, trying to figure it out.  The procedure would be:

  1. Calculate every possible tube set, score them all, and sort them in a new dataframe.
  2. Take the very best tube set from the list, append a list with its Tube ID numbers and set score, then remove all references to those tubes from the dataframe.
  3. Rinse and repeat, finding more sets until the number of tubes remaining in the dataframe is too small to make a complete set at the user-defined set size.


This totally worked!  When I run this on my sample data set of 167 tubes, the following results are produced:

I could have thrown a party, I was so excited!

The code spits out the above report as it calculates tube sets, but it also produces a new dataframe with all the tube numbers and their scores, which is shown below.



Too Many Numbers!

As I said at the beginning of this post, I now have all the information I need to match the tubes into sets.  However, at this point I just have to trust that it’s correct, blindly.  That freaks me out.  Instead, I want to see graphical proof that the tubes’ characteristics are indeed matched to each other, and that will require me to move beyond pandas and start learning one of the available graphing libraries.

Wish me luck!