Python and SQLite – First Steps

The company I work for uses LTO to archive data. A few years ago, the key IT personnel left, and with them went the full working knowledge of the archiving system (TBS), which was heavily based on Linux scripts (I’ve spotted some Perl).

The retrieval system consists of a user-friendly browser-based HTML front-end that communicates with the archiving daemon, which publishes search results and then generates code to handle retrieval requests.

The archiving/storing system, on the other hand, is entirely another matter: it seems to be all Linux shell scripting, and no one knew how to use it. We could retrieve stuff, but we couldn’t archive anything back to tape.

The company consulted third-party IT professionals and firms about it, and the decision was made to buy PreRollPost (PRP) as our archiving system. I haven’t really used PRP, and I don’t have any opinion on it. But for all its opacity, TBS allowed view access, search queries, and retrieval requests to the database from any networked computer. PRP, by contrast, can only be operated locally, so all actions and operations must be done on the one computer it is installed on.

In the past, when producers came to me asking to pull an archive, or to check whether one existed, I would do a search in TBS from my own workstation. Now that isn’t possible, so an archive operator role (ARCOP) had to be created. In an effort to get back to the same efficient workflow, one of the guys in charge of the new PRP system decided he’d create a Google spreadsheet and enter the database information by hand. It’s incredibly tedious and, I reckon, prone to mistakes. Due to the company’s slow adoption of the new archive system, he’s also backlogged with stuff that needs archiving.

I’m not in the IT department; I’m a CG technical supervisor, and I hesitate to assist in these things because I’m expected, first and foremost, to be a CG operator. The producers would sooner throw me a motion graphics job than any workflow project for the archive system. Actual jobs are visible, and thus get more attention. They won’t pay anyone to fix things that aren’t readily visible — they don’t deem them that important — until those things become the cause of visible disruptions. So launching into a coding project always risks not being able to finish it; we’re not a software company, after all.

When I saw the spreadsheet, my first thought was to make a simple helper script for the ARCOP: retrieve the relevant information directly from PRP’s SQLite database, put it as CSV-formatted text into a buffer or file, and it would be a copy-paste matter after that. I began by converting the .sqlite to .csv using DB Browser for SQLite, which was indispensable both in analyzing PRP’s database structure and in learning and troubleshooting SQLite commands (more on that later). When I realised that I didn’t want the ARCOP to have to convert .sqlite to .csv every time he needed to update the spreadsheet, I rewrote the code to use the sqlite3 Python module and access the .sqlite directly.

All was going fine; I copied the .sqlite file from the PRP computer to my dev workstation as a local copy. I saw that the PRP database stored full source paths, which meant I could extrapolate some details: the project title, the tape number, and whether an entry should be filtered out as irrelevant. But a surprise came when I re-checked the PRP database and found I had been working from an older version. The main difference was huge: full paths were no longer stored; each node was checked in with its own unique node number and a reference to its parent node. This meant that if I wanted the full path to a particular node, or the full path a node belongs to, I would have to recurse through the hierarchy myself using SQL commands.
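The parent-pointer scheme can also be walked with a short Python loop. A minimal sketch, assuming the column names visible in DB Browser (Z_PK, ZPARENTNODE, ZNAME) and assuming a root node has a NULL parent — that last part is my guess, not something the schema documents:

```python
import sqlite3

def full_path(con, node_pk):
    """Follow ZPARENTNODE pointers up to the root, rebuilding the full path."""
    parts = []
    pk = node_pk
    while pk is not None:
        row = con.execute(
            "SELECT ZPARENTNODE, ZNAME FROM ZPRPFILESYSTEMNODE WHERE Z_PK = ?",
            (pk,)).fetchone()
        if row is None:
            break  # dangling parent reference; stop with what we have
        pk, name = row
        parts.append(name)
    # names were collected leaf-first, so reverse before joining
    return "/".join(reversed(parts))
```

One query per level is fine for interactive lookups; for bulk work, the recursive CTE below does the same traversal inside SQLite.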

It was at this time that an assistant producer rang me asking whether a certain project existed. I looked at the ARCOP’s spreadsheet, and it wasn’t there. I was getting a bit better at using DB Browser, so I decided to look directly in the database. I put the keywords into one of the columns’ filters and found it; tracing its parent nodes one by one until I reached the root, I concatenated the full PRP location on a piece of paper and sent it to the ARCOP for retrieval.

That was when I decided that creating a helper script to assist the ARCOP in entering data into the spreadsheet was an obsolete idea: there’s no need to recreate a database in a spreadsheet when I already had the ability to access the database programmatically. The issue was that I wasn’t familiar with SQLite at all. And, until recently, I hadn’t really touched GUI programming in Python. These were the major knowledge hurdles for me, but there seemed to be no other way, unless I simply dropped the endeavour, which didn’t appeal to me.

The idea now changed to creating a prog that would provide a simple search query mechanism for the database. It would have a ‘search all keywords’ function, and it would need to filter out most nodes except directories and .tar files (the container format used for most projects). Then it would present the search results in a list and a string field, so that the user could copy-paste the location into an email when requesting an archive pull.

After some brief research I decided on wxPython for my GUI. I spotted a helper prog called wxFormBuilder that helped me immensely in quickly understanding, through its generated code, what wxPython was doing. Through many Stack Overflow articles, I also learned how to create a multi-column list to be populated with search results.

I think the hardest part of the project was SQLite, with which I’d had no previous experience. The main difficulty lay in understanding the flow: whether there was an actual concept of variables, what Common Table Expressions were conceptually, and just how different SQL commands are from programming languages.

WITH RECURSIVE
    find_parent(Z_PK, ZPARENTNODE, ZNAME, ZISDIRECTORY, ZBYTECOUNT, ZNOTES, ZBACKUPDATABASE, ZMODIFICATIONDATE, ZCREATIONDATE) AS (
        SELECT Z_PK, ZPARENTNODE, ZNAME, ZISDIRECTORY, ZBYTECOUNT, ZNOTES, ZBACKUPDATABASE, ZMODIFICATIONDATE, ZCREATIONDATE
        FROM ZPRPFILESYSTEMNODE WHERE ZNAME LIKE '%fsp%'
        UNION
        SELECT n.Z_PK, n.ZPARENTNODE, n.ZNAME, n.ZISDIRECTORY, n.ZBYTECOUNT, n.ZNOTES, n.ZBACKUPDATABASE, n.ZMODIFICATIONDATE, n.ZCREATIONDATE
        FROM ZPRPFILESYSTEMNODE AS n, find_parent
        WHERE find_parent.ZPARENTNODE = n.Z_PK
    ),
    find_child(Z_PK, ZPARENTNODE, ZNAME, ZISDIRECTORY, ZBYTECOUNT, ZNOTES, ZBACKUPDATABASE, ZMODIFICATIONDATE, ZCREATIONDATE) AS (
        SELECT Z_PK, ZPARENTNODE, ZNAME, ZISDIRECTORY, ZBYTECOUNT, ZNOTES, ZBACKUPDATABASE, ZMODIFICATIONDATE, ZCREATIONDATE
        FROM ZPRPFILESYSTEMNODE WHERE ZNAME LIKE '%fsp%'
        UNION
        SELECT n.Z_PK, n.ZPARENTNODE, n.ZNAME, n.ZISDIRECTORY, n.ZBYTECOUNT, n.ZNOTES, n.ZBACKUPDATABASE, n.ZMODIFICATIONDATE, n.ZCREATIONDATE
        FROM ZPRPFILESYSTEMNODE AS n, find_child
        WHERE find_child.Z_PK = n.ZPARENTNODE
    )
-- the WITH clause needs a final SELECT to actually return rows
SELECT * FROM find_parent
UNION
SELECT * FROM find_child;


The snippet above is one of the test SQL commands I was debugging in DB Browser. It is the template I used to get a node’s parent nodes and child nodes. I admit there is still a haze between my brain and SQL commands. Time will tell whether I need to delve into it a bit more.

Over the weekend, I struggled with many aspects of the search function and its results. One of the main issues was how much filtering I should (or could) do with fast SQL commands versus when to let Python’s convenience functions do the work. In the end, the SQL command returns all traversed trees containing any of the keywords, and Python adds another level of filtering based on the user’s options.
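That split can be sketched roughly like this: the recursive CTE does the downward traversal inside SQLite, and Python applies the user-facing filters afterwards. A minimal sketch, not the prog’s actual code — column names follow the snippet earlier in the post, and treating a non-zero ZISDIRECTORY as “directory” is my assumption:

```python
import sqlite3

# Seed the recursion with nodes whose name matches the keyword,
# then walk down to every descendant inside SQLite.
SEARCH_SQL = """
WITH RECURSIVE find_child(Z_PK, ZPARENTNODE, ZNAME, ZISDIRECTORY) AS (
    SELECT Z_PK, ZPARENTNODE, ZNAME, ZISDIRECTORY
    FROM ZPRPFILESYSTEMNODE WHERE ZNAME LIKE ?
    UNION
    SELECT n.Z_PK, n.ZPARENTNODE, n.ZNAME, n.ZISDIRECTORY
    FROM ZPRPFILESYSTEMNODE AS n, find_child
    WHERE find_child.Z_PK = n.ZPARENTNODE
)
SELECT Z_PK, ZPARENTNODE, ZNAME, ZISDIRECTORY FROM find_child
"""

def search(con, keyword, tars_and_dirs_only=True):
    """SQL does the tree traversal; Python applies the user-option filter."""
    rows = con.execute(SEARCH_SQL, ("%" + keyword + "%",)).fetchall()
    if not tars_and_dirs_only:
        return rows
    # keep directories and the .tar containers most projects live in
    return [r for r in rows if r[3] or r[2].lower().endswith(".tar")]
```

Pushing the traversal into the CTE keeps the row count small before Python ever sees it; the Python predicate stays simple and easy to change as user options grow.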

There are still things to do in the prog itself, such as fixing the sortable columns (the itemDataMap attribute is probably not being populated properly) and adding the ability to choose which database to connect to. I also need to decide whether the prog should run as a script or as a compiled .app/.exe. And back at the office, the PRP database needs to be made accessible to networked computers.

I’m pretty happy that I got further in a few days of development than I expected. But the weekend push was necessary: there are too many distractions at work to really absorb technical information, especially on a topic that’s completely new to me.