Archive for the ‘Project’ Category

Introducing NightSky2

Saturday, March 19th, 2011

I have had a small hobby project in the pipeline for a while now (several, actually; I am very good at starting new things, and now I am working up the skill to actually finish them as well), and I have finally gotten around to finishing this one.

It has been near completion for a while, but twice (that I can remember) I managed to find fault with it, tear it all down, and start anew.

None of this matters now, for now I am going live. The source is available through a mercurial repository at bitbucket.org, as well as through tarballs and zip-archives.

So without further ado, I give you: NightSky2

So then, what is NightSky2? It’s an homage to a website that taught me the (very) basics of identifying constellations in a wonderfully pedagogical way.

When I tried to find it again last year, I found that I couldn’t. It had disappeared off the net. It made me quite sad, because it was truly a great introductory source of knowledge, and I wanted to show it to a dear friend of mine.

Scouring through the Internet, I finally remembered something that mk told me one morning on the ride in to FSCONS about a site named archive.org. I found remnants of it there, the latest (partial) working set having been mirrored in 2008.

But it gave me enough of my waning knowledge back to be able to build a site of my own, so… here we are :)

Hope it is of use to someone.

Update: Added link to repository.

:wq

Introducing jQuery.xmastree

Saturday, January 15th, 2011

With this post I hereby announce the first release of a little jQuery-script I call xmastree. This script has NOTHING to do with Christmas decorations.

I could have given the project the more specific name jQuery.HullOpeningIndicator, but that is both long and ugly, and I fear it would also have trapped people into thinking that it could only be useful for indicating whether openings in a submarine’s hull were open or closed.

In simplified terms, an xmastree on board a submarine is a panel which displays the state of every opening into said submarine. In a submarine, the only two reasonable states are “open” and “closed”.

As such, a submarine xmastree (hull opening indicator) visualizes a series of binary states.

And that is what jQuery.xmastree attempts to mimic. Of course, this xmastree makes no assumptions about what the states represent, or how they should be visualized (that is configurable and left to the user to decide). In fact, jQuery.xmastree makes no assumption that the data being visualized is binary, either. It most certainly doesn’t need to be; there are no such limitations in the code.

The demo uses ASCII for visualization, but one could just as easily modify data.json to have xmastree output <img> tags, and thus images.
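I won’t reproduce the plugin’s actual API or data.json format here, but as a conceptual illustration of the idea, here is a little Python sketch of the same principle: a mapping from states to arbitrary visual markers, where nothing forces the states to be binary (the openings and markers are, of course, made up):

# Conceptual sketch only -- the names and markers below are invented for
# illustration and are not part of jQuery.xmastree's actual API.
OPENINGS = {
    "forward hatch": "closed",
    "aft hatch": "open",
    "snorkel": "partially open",   # nothing forces the states to be binary
}

# How each state is rendered is configuration, not a built-in assumption.
MARKERS = {"closed": "[X]", "open": "[ ]", "partially open": "[~]"}

def render(openings, markers):
    """Print each opening with the marker configured for its state."""
    for name, state in sorted(openings.items()):
        print("%s %s: %s" % (markers.get(state, "[?]"), name, state))

render(OPENINGS, MARKERS)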

This project is released under the GNU GPL version 3 or later, and the source may be found here.

:wq

Back from obscurity

Thursday, November 19th, 2009

Ouch! Seems like there might be some people actually interested in the project I made in the Advanced Free Software Tools course.

I haven’t had much time (read: I haven’t made time) to fiddle with it in ages. Perhaps it is time to revive it. The concepts were all pretty solid, although I am ashamed to say that there is really no documentation at all about how it works #procrastinationfail

I think what ground me to a halt the last time was how to have Django dynamically add more form fields based on a model. I guess I’ll just have to do some more research.
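For my own reference, a rough sketch of the direction I suspect that research will take, using Django’s model formset factory (the app, model and field names below are made-up placeholders, not actual project code):

# Sketch only: "myapp" and "Entry" are hypothetical stand-ins for whatever
# model the dynamically generated form fields should be based on.
from django.forms.models import modelformset_factory

from myapp.models import Entry  # hypothetical app and model

# Build a formset class from the model with three extra blank forms;
# the number of forms can be decided at runtime instead of hard-coding fields.
EntryFormSet = modelformset_factory(Entry, fields=("title", "body"), extra=3)

def edit_entries(request):
    if request.method == "POST":
        formset = EntryFormSet(request.POST)
        if formset.is_valid():
            formset.save()
    else:
        formset = EntryFormSet()
    # ... hand the formset to a template for rendering ...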

Thanks hesa for pushing me into the deep end of the pool, I need that from time to time ;D

Putting technologies to use in peculiar ways

Wednesday, March 4th, 2009

I just read a Daily WTF and, I can’t be sure why, possibly because they were generating invoices, an activity my mind has for some reason linked to PDFs, I had a flashback to term 5 at ITU, where our project group collected a bunch of data through a web-based questionnaire and stored it in a database.

Then there was the question of retrieving the information and presenting it in our document (a PDF, generated by LaTeX). If I remember correctly, I solved that by ugly-hacking together a PHP script which, depending on which script you called from the webserver, presented you with either a CSV file or a LaTeX-formatted file. To be completely honest, “stream” would be the better description: something the browser interpreted as a file and rendered.

In any case, I have a little suspicion that this wasn’t one of the intended domains for PHP, but it did the job well nonetheless.

Midnight hacking, part 2

Saturday, February 28th, 2009

I have since the last post come up with a name for this little project: “Vocabulary fingerprinting”. This post should be part 3 or 4, but obviously it fell out of my memory to write the second post after my second midnight session. The first refactorings I made brought down execution time a bit. Not as much as I had hoped for, but a bit. When I then added more regular expressions to improve the accuracy of what should be stored in the database, execution time was impacted negatively. It didn’t quite go back to the first value of ~120 minutes, but is now holding steady at around 113 minutes. I will have to profile the code to get closer to the root cause, but my suspicions are a) the regular expressions (I believe I now have in the vicinity of 15 different regexes, which are executed once on every line I have written, and some on every word) and b) SQLite (disk I/O).

I do have some ideas about how to speed it up. In the case of the regexes I’d have to rearrange the order in which they are executed, and make some of them more generic (having 3 or 4 regexes to sort out all the various smileys… no, not good). For the probable disk I/O bottleneck… well, I can’t really get around writing to the disk, but I can store a bunch of words in memory and then write them all to disk at once… I don’t know why that would affect anything, but it just feels like it would help.
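What I have in mind for the batching is roughly this (a sketch only; the table layout is illustrative and not necessarily my actual schema):

import sqlite3

def store_counts(db_path, words):
    """Count words in memory first, then write everything in one transaction."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1   # all counting happens in RAM

    conn = sqlite3.connect(db_path)
    with conn:  # a single transaction, so the disk is hit in one burst
        conn.execute("CREATE TABLE IF NOT EXISTS vocabulary "
                     "(word TEXT PRIMARY KEY, count INTEGER NOT NULL DEFAULT 0)")
        conn.executemany(
            "INSERT OR IGNORE INTO vocabulary (word, count) VALUES (?, 0)",
            ((w,) for w in counts))
        conn.executemany(
            "UPDATE vocabulary SET count = count + ? WHERE word = ?",
            ((c, w) for w, c in counts.items()))
    conn.close()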

I have to confess that it has actually been a couple of days since I last worked, or thought about working, on this project, but today when I woke up I got a new idea. I haven’t yet decided if the idea is any good.

Since I am parsing instant-messaging logs, each line represents one message. Since messages differ in length, both in number of words and number of characters, it could be interesting to store an “average line length” value in the database as well, since this could make identification easier (think ballpark estimates: “chatty” or “quiet”). But then the code would become very domain specific. You couldn’t try to identify a person by feeding the code an article, since articles are a great mass of text, with each line formatted to be roughly equal in length (at or around 80 characters per line, probably).

One could of course craft two modules, one article-specific and one instant-messaging-specific, and have them share a database (i.e. the “vocabulary” table would be shared, but meta-data about how the writer composes his texts would be stored in two different tables, depending on what we want to compare with). This could work, since it is possible to identify “chattiness” in articles as well, the only difference being that the word count ranges for determining “chattiness” are larger (150-2000, 2000-6000, etc). (I have a feeling that I could easily end up in the latter range…)

Finally, I tried to implement a measurement facility in the code. Something along the lines of:

Given ten measurement points, evenly distributed across the files being parsed, in this case 2900 files / 10 points = every 290 files, record the current time (timestamp) at each point.

In the end, what I want is an output for every 290 files, telling me the estimated time to completion, based on the average time it has taken the code to accomplish its task so far.

I thought that it would be pretty straightforward: just a simple case of determining how many measurement points are left, creating time-deltas (mp1 – starttime, mp2 – mp1, mp3 – mp2 …), adding those together, dividing by the number of time-deltas, and multiplying that by the number of measurement points left.
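In code, the estimator I had in mind looks roughly like this (a reconstruction from memory, not the actual project code; names are illustrative):

import time

def estimated_seconds_left(measurement_times, total_points):
    """Estimate remaining time from the measurement points recorded so far.

    measurement_times[0] is the start time; each later entry is the
    timestamp recorded at a measurement point (every 290 files).
    """
    deltas = [later - earlier
              for earlier, later in zip(measurement_times, measurement_times[1:])]
    average = sum(deltas) / len(deltas)        # average time per measurement interval
    points_left = total_points - len(deltas)   # intervals not yet completed
    return average * points_left

# Hypothetical use inside the parsing loop:
#   times = [time.time()]
#   ... every 290 files: times.append(time.time())
#   print("Estimated minutes left: %.1f" % (estimated_seconds_left(times, 10) / 60))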

I checked it with my brother (who is way more math smart than me) and he thinks it seems legit… but the code doesn’t work. It reports about 30 minutes to completion at every measurement point, up until the last one, where it drops to 8 minutes…

His thought was that maybe the initial files are so small compared to the files at the end, that the estimations are frakked up. I guess I will have to either randomize the order of the list containing the filenames, or just reverse it, to see if it makes any difference.

That’s all for now.

Mercurial and hooks

Thursday, February 19th, 2009

I found myself today with a problem. I have a development server on which I run tests and build things. As of today it also houses a new mercurial repository with a bunch of PHP files inside it. My original idea was to link the needed files from the repository into the wwwroot. This of course will not work, as no complete files are (to my knowledge) stored inside the repository. So, after having committed, I would want the repository to push the new changes out to a local clone, which I could then link to from the wwwroot.

This was actually fairly easy. Inside the repository you find a hidden directory “.hg”. Within it there should exist a file “hgrc” (it didn’t in my case so I created it).

My first attempt, following these instructions, didn’t quite work out. I don’t really know why, but checking the local clone made it evident that it had not updated as it should have.

What I tried was:

[hooks]
changegroup = hg push /path/to/clone

which left me with an error message on the client: “abort: unexpected response: ‘pushing to /path/to/clone/[repo-name]\n’”. My next attempt was to use a shell-script instead. The second attempt failed as well, this time because I stuck the shell-script inside the .hg directory and tried to call it with a relative path from hgrc (I guess hg isn’t executed from that directory, so it fell flat on its face).

Third and final attempt: the same shell-script, moved to a directory on the $PATH, and I push from my remote (workstation) repository. The client still receives an error message, “abort: unexpected response: ‘pulling from /path/to/repository/[repo-name]\n’”, but at least this time the clone on the server has been updated.

The shell-script was a quick and dirty hack:

#!/bin/sh
cd /path/to/clone
hg pull -u
exit 0

but it worked like a charm. This is in no way extensible (although I guess one could make it work if the hook-scripts are named carefully); it would be a much better solution to have each project-specific hook located inside the project repository instead…

Anyway, my Google-Fu fails me in my searches for how to get around the client error message. It obviously isn’t aborting since the clone, pulling from the server, is getting the changes. If you know, I’d be happy to hear from you.

Update:

My Google-Fu eventually came through, and I found this conversation, in which the proposed solution worked superbly. My hgrc now looks like this:

[hooks]
changegroup = /path/to/shell-script > /dev/null

Midnight hacking

Thursday, February 12th, 2009

Last Saturday… sorry, early Sunday, way past any reasonable bedtime, I was twisting and turning, finding it impossible to fall asleep. Reading a magazine didn’t work; in fact it might just have had the opposite effect. It got my brain working, and all of a sudden an idea entered my mind.

I can’t understand it myself, so don’t bother asking, there will be no coherent or reasonable answer, but I got the idea to pull my Pidgin log-files, all 2900 of them, dating back to 2008.01.01, and have a program go through all of them, cataloging and counting the outgoing words.

Maybe it was some urge to code, maybe my subconscious has a plan for the code which it has yet to reveal to me, I couldn’t tell you, but the more I thought about it, the more the idea appealed to me. Within half an hour I knew roughly how I wanted to do it.

The premise was: 2900 HTML-formatted log-files describing interactions between me and one or more external parties. Pidgin stores each sent message on a separate line, so except for some meta-data about when the conversation took place, located at the top of the file, there was one line per message.

I wanted the code to be… I hesitate to call the resulting code “modular”, but “dynamic” might be a better word. So: no hard-coded values about what alias to look for. This worked out fine, as I soon realized I also needed a file name for the SQLite database which would store the data.

The script is called with two parameters: an alias and a path to the directory in which the logs can be found. This is also where I cheated. I should have made the script recursively walk into any sub-directory in that path, looking for HTML files, but I opted instead to move all the files from their separate sub-directories into one large directory. Nautilus gives me an angry stare every time I even hint at wanting to open that directory, but handling sub-directories will come in a later revision.
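When I do get around to it, something along these lines should do the trick (a sketch, not what the script currently does):

import os

def find_html_files(root):
    """Recursively collect all HTML log files below root."""
    html_files = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".html"):
                html_files.append(os.path.join(dirpath, name))
    return html_files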

So, given an alias (the unique identifier which every line that shall be consumed should have) and a path, list all HTML files found in that path. Once this list has been compiled, begin working through it, opening one file at a time, and for each line in that file, determine if the line should be consumed or discarded.

Since each line contains HTML formatting, as well as the alias and a timestamp, it would be prudent to scrape those away; regular expressions to the rescue. Notice the trailing “s”: simple code is better than complex code, and a couple of fairly readable regular expressions are better than one monster of an expression. So away go HTML formatting, smileys, timestamps and the alias. What should now, theoretically, be left is a string of words.

So that string is split up into words and fed into the SQLite database.
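The scraping and splitting looks roughly like this, although the real expressions are more numerous and more specific to Pidgin’s markup, so treat it as an illustrative sketch rather than the actual code:

import re

# Each expression handles one well-defined nuisance; together they are still
# far more readable than one monster regex.
TAG_RE = re.compile(r"<[^>]+>")                       # HTML formatting
TIMESTAMP_RE = re.compile(r"\(\d{2}:\d{2}:\d{2}\)")   # (HH:MM:SS)
SMILEY_RE = re.compile(r"[:;][-']?[)(DPp]")           # a few common smileys
WORD_RE = re.compile(r"[\w']+", re.UNICODE)

def extract_words(line, alias):
    """Strip markup, timestamp, and alias from one log line; return the words."""
    line = TAG_RE.sub(" ", line)
    line = TIMESTAMP_RE.sub(" ", line)
    line = SMILEY_RE.sub(" ", line)
    line = line.replace(alias + ":", " ")
    return [w.lower() for w in WORD_RE.findall(line)]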

I was happy; this was my first attempt at working with SQLite, and thus my first attempt at getting Python to work with SQLite. It worked like a charm. Three separate queries were used: one trying to select the word being stored; if the select returned a result, the returned count was incremented by one and updated; if no result was returned, a simple insert was called.
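Concretely, the flow per word was something like this (a reconstruction, not the exact code; the table layout is illustrative):

import sqlite3

def store_word(conn, word):
    """Naive per-word storage: one SELECT, then either an UPDATE or an INSERT."""
    cur = conn.execute("SELECT count FROM vocabulary WHERE word = ?", (word,))
    row = cur.fetchone()
    if row is not None:
        conn.execute("UPDATE vocabulary SET count = ? WHERE word = ?",
                     (row[0] + 1, word))
    else:
        conn.execute("INSERT INTO vocabulary (word, count) VALUES (?, 1)", (word,))
    conn.commit()   # hitting the disk for every single word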

This is of course the naive and sub-optimal way to do it, but right then I was just so thrilled about coding something that I didn’t want to risk leaving “the zone”. Needless to say, doing two queries per word means hitting the hard drive twice per word, EVERY word, for EVERY matching line, in EVERY one of the 2900 files. Yeah, I think there is room for some improvement here.

But I have to admit, I am impressed: in roughly four hours, give or take half an hour, I managed to put together a program which worked surprisingly well. The one regret I have right now is that I didn’t write anything in the way of tests. No unit tests, no performance tests, no nothing. Of course, had I done that, I probably would have gotten bored half way through and fallen asleep. And tests can be written after the fact.

Well, I wrote one simple progress measurement. The loop going through the files is wrapped in the function enumerate, so I got hold of an index indicating which file was being processed, and for each file having been closed (processed and done) I printed the message “File %d done!”. From this I was able to clock the script at finishing roughly 20 files a minute (the measurements were taken at ten minute intervals), but this is rather imprecise as no file equals another in line or word length.

It was truly inspiring to realize how much can be done in so little time. The next step, besides the obvious room for improvement and optimization, is to use this little project as a real-life exercise to test how much I have learned by reading Martin Fowler’s Refactoring – Improving the Design of Existing Code.

The ability to walk down into sub-directories should of course be added, but the most interesting thing at the moment is going to be finding a way to detect typos. The regexp rule for how to detect and split up words is a little… “stupid” at the moment.

Initially (after having slept through the day following that session) I thought about the typos, how to detect them, and how one might be able to use something like Levenshtein distance, but again, this would entail I/O-heavy operations and also start impacting the processor. There is probably some Python binding for Aspell one could use; I will have to look into that.
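For reference, the distance itself isn’t much code (a textbook dynamic-programming version, not something wired into the project); the expensive part would be comparing every word against an entire dictionary:

def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a                     # make a the longer string
    previous = range(len(b) + 1)        # distances from the empty prefix
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[len(b)]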

So, finally, why? What’s the reason? The motivation?

Well, somewhere in the back of my mind I remember having read an article which discussed the ability to identify people by their writing. So if I publish enough text on this blog, and then, on another, more anonymous blog, I publish something else, the words I use, or the frequency with which I use them, should give me away. In order to prove or disprove that hypothesis, a signature would need to be identified, in the form of a database containing words and their frequencies (in such a case it might even be beneficial NOT to attempt to correct spelling errors, as they might indeed also be a “tell”), and then a program written which attempts to determine whether or not it is probable that a given text was written by me.

While talking about the idea with a friend, she asked me about privacy concerns (I can only assume that she didn’t feel entirely satisfied with the thought of me sitting on a database with her words and their frequencies) and that is a valid concern. Besides the ethical ramifications of deriving data from and about people I call my friends, there is a potential flaw in trying to generate signatures for my friends from the partial data feed I am privy to. I base this potential flaw on the fact that I know that my relationship, experience and history with my various friends make for a diversified use of my vocabulary. In short, what language I use is determined by the person I speak with.

And now I have spewed out too many words again… *doh*

Django, command_extensions and pygraphviz

Wednesday, November 26th, 2008

Trying to find a way to comply with last week’s assignment (profiling your software), I today found out that the command_extensions for Django could provide some help (runprofileserver). However, that is not why I am currently writing.

The reason for this post is another command, graph_models, which can be used as such:

wildcard /home/wildcard/voxsite
$ python manage.py graph_models -a -g -o my_project_visualized.png

This does, however, require a few things to work, namely python-pygraphviz and graphviz-dev (if you’re an Ubuntu user, at least). But this is pretty cool: now I have automatically generated class diagrams of my project.

\o/

Lessons learnt: Python and importing

Friday, November 21st, 2008

This will probably not be something you will do every day, but some day you might need to import a module from an altogether different directory, one not on the Python path. Let’s for instance say that you have a script in your home folder:

~/some_script.py

This script needs to import another module, and, as in my case, you are only given the file system path to the directory in which you can find said module. What to do?

/opt/some_module.py

The solution is rather simple. some_script.py will need to import sys, in order to get a hold of the sys.path variable, to which we can append the path.

import sys
sys.path.append('/opt/')
import some_module

Tah-dah. Once the script has been executed and dies, sys.path is restored, so no extra fiddling is needed. The one gotcha I encountered, which made this problem take way longer than it should have:

I was wrapping this code up in a function, which made the import local to that function and not visible in the rest of the script. So binding the functions/variables you need from the imported module to local dittos, and then moving those around instead, is advised.
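In code, that advice boils down to something like this (a sketch; the attribute bound at the end is a made-up example):

import sys

def load_module(path, name):
    """Import a module from an arbitrary directory and return it."""
    if path not in sys.path:
        sys.path.append(path)
    return __import__(name)

# The import above is local to load_module, so bind what you need explicitly
# and pass that around, rather than expecting the module to be globally visible.
some_module = load_module('/opt/', 'some_module')
some_function = some_module.some_function  # hypothetical attribute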

What kind of ugly beast of a script did I need something as convoluted as this for? A script which tries to verify that a piece of installed software has been installed correctly, and at the correct place with respect to other software whose location I cannot know from the onset (this is for Vox Anonymus, and I simply needed to check that the Django site-specific settings file had been correctly updated and could find Vox Anon).

Documentation, best practices?

Friday, November 14th, 2008

I am, as part of the AFST course, working on a free software project, Vox Anonymus. One of the requirements for the software is that it should come complete with documentation on how to install it (not at all unreasonable by any measure).

But I find myself asking how to handle this install information. It needs to be included in the INSTALL file, as well as on the website. At the same time, I feel the urge not to repeat myself: DRY (Don’t Repeat Yourself).

I’ve read up on some techniques, reStructuredText, python-docutils, etc., but I have been unable to find a suitable solution which would convert some simple text format to both (x)html and a reasonable plain text representation for the INSTALL file.

The simplest solution would probably be to use some mark-up language, and a formatting system, and then let the source file be the INSTALL file, from which the html file can be generated. This would leave some “mark-up artifacts” for the prospective users of the application.

Second easiest solution: have the html file be the source file, and generate the INSTALL file by stripping the tags out of the html file. While this would be acceptable, two things bother me:

  • It could potentially take some work in order to make the stripping / reformatting perform properly (with regards to newlines, indentation, etc)
  • Going against YAGNI (You Ain’t Gonna Need It), what if there is a future format I would wish to support?

(I will have to admit though, whilst browsing through “Beginning Python” by Magnus Lie Hetland (Apress) I discovered a chapter (20) outlining a simple system for doing just this, and it sparked my curiosity, so I might have been more than a little “influenced” to reject all other ideas ;))

The third option, then, the path I have, at least for the moment, settled on, is to create a miniature mark-up syntax, with accompanying formatter scripts, to allow for generation of both plain text and html, and extensibility for the future.

The final tipping point is that I can have more automation this way. With the first approach, automation would have had to be a tacked-on ugly hack. With the second approach, a couple of simple sed commands in a shell script would have done the reformatting, but the rest I would still have had to solve manually. With the third option, it can be brought into the core functionality:

The tarball I generate is given a filename consisting of the name of the project as well as the version of the project, as found in setup.py. Hence, the web-page download link needs to be updated every now and then. If I am generating the html anyway, it would make sense to have Python also generate an up-to-date link to the tarball.
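In its simplest form, something like this would do (a sketch; how the name and version are actually read out of setup.py is left out, and the values below are made up):

def download_link(name, version):
    """Build the download link from the project name and version."""
    tarball = "%s-%s.tar.gz" % (name, version)
    return '<a href="downloads/%s">Download %s %s</a>' % (tarball, name, version)

# Hypothetical values, standing in for what would be read from setup.py:
print(download_link("voxanonymus", "0.2.1"))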

Overall, this seems to have the makings of a good solution (all things considered), as well as being a good learning experience. Win-win.

But this was actually not what my post was supposed to be about. In the title, notice the question mark? With it I was not implying that I might be on to the best practice, but rather asking a question of you, the readers. How would you have done it? Because there are bound to be better ways, with better motivations, than what I have cobbled together. There just have to be, since people have been putting together INSTALL instructions and other application documentation for free software for at least a good… what? thirty – forty? years. And there are bound to be those who don’t like repeating themselves.