Posts Tagged ‘SQLite’

2011w32

Sunday, August 14th, 2011

FSCONS

This week has mostly been allocated to FFKP/FSCONS work. I learnt about “Special Pages” on mediawiki wikis. That was a great find, especially as I could be called upon to administer the FSCONS wiki ;)

Furthermore, I was “volunteered” to draft a staffing proposal (i.e. make an estimate about how many people we need to execute another successful FSCONS).

Scheduling and management are about as far from the things I excel at as you can get (mostly for lack of practice, which in turn stems from me not being interested in any of it), so naturally it didn’t take two minutes from my submission to the mailing list until greg had dutifully heeded my plea to catch as many bugs as possible. But hey, better now than in three months :)

For a short period of time I tried attacking the problem with planner, but I am no friend of project management software, so that struck out. I ended up using LibreOffice Calc, which also gave me some headaches, but did eventually carry me all the way to the present version of the proposal.

What I’ve learnt about Calc:

  • ROUND() will properly round values up and down as expected
  • If you want a function which always rounds upwards, that’s CEILING
  • Parameters for functions are separated not by commas, but by semicolons

What I’ve come up with is that I have 162 hours that need to be staffed by volunteers. No volunteer should be expected to work more than 8 hours during the conference (which spans, staffing-wise, from Friday evening to Sunday night), which means that we need 21 volunteers.

On top of that we need one “room host” per track, that’s 8 hosts; one camera person per track, another 8 people; and finally a senior person from the core team of organizers always present in the “command centre”.

We have five of those. 21 + 8 + 8 + 5 = 42. The answer! Yes, I’m a geek ;D
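Just to sanity-check the arithmetic outside of Calc, here it is as a quick Python sketch (the Calc-style formula in the comment is only there to illustrate the CEILING function and the semicolon-separated parameters mentioned above):

    import math

    hours = 162                 # total volunteer hours to staff
    max_hours = 8               # no volunteer works more than 8 hours
    # Calc equivalent (note the semicolon): =CEILING(162/8; 1)
    volunteers = int(math.ceil(hours / float(max_hours)))   # 21

    room_hosts = 8              # one per track
    camera_people = 8           # one per track
    core_team = 5               # command centre rotation

    print("%d volunteers, %d people in total"
          % (volunteers, volunteers + room_hosts + camera_people + core_team))
    # 21 volunteers, 42 people in total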

SMS messages in a Nokia N900

razor tried to find a funny sms the other day (which he eventually did), but I couldn’t help thinking that it would have gone faster if we’d known where the messages were stored, so we could have searched that database for specific keywords instead.

They are stored in an SQLite database in ~/.rtcom-eventlogger/el.db
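So, for next time, something like this quick sketch ought to do the keyword hunting. The table and column names (Events, free_text) are my assumptions about the rtcom-eventlogger schema, so double-check them with .schema in the sqlite3 shell first:

    import os.path
    import sqlite3
    import sys

    # Sketch: grep the N900 message store for a keyword.
    # "Events" and "free_text" are assumed names -- verify against the
    # actual schema before relying on this.
    keyword = sys.argv[1]
    db = sqlite3.connect(os.path.expanduser("~/.rtcom-eventlogger/el.db"))
    query = "SELECT free_text FROM Events WHERE free_text LIKE ?"
    for (text,) in db.execute(query, ("%" + keyword + "%",)):
        print(text)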

Links

  • Placekitten: a service to set cute cats as placeholders in your designs. Too funny to resist ;D
  • Blueslug
  • how to make the internet not suck (as much)
  • FuzzyWuzzy: Fuzzy String Matching in Python
  • remar.kz
  • Vim recipes

Extending timetrack

Saturday, May 21st, 2011

There are at least three features I feel are currently lacking in my timetrack suite, and two of them should be easier to add than the third.

Monthly breakdown and tagging

Soonish there should be another add-on, presenting the hours broken down on a per-month basis.

This would however necessitate an update of the timetrack storage format (I am leaning towards using SQLite).

Tagging is the other simple feature I feel is missing, and again, it would be a much simpler feat to accomplish if stored using SQLite.

The downside to this, of course, would be the dependency on SQLite. I really don’t like to introduce more dependencies than is necessary.

I am, unfortunately, not smart enough to figure out a better (plaintext) format that would be able to accommodate tags, at least not without making the parsing a bloody mess.
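Just to get a feel for it, this is roughly the kind of schema I have in mind if I do switch to SQLite; the table and column names below are placeholders, not a decided format:

    import sqlite3

    # A sketch of a possible timetrack schema: sessions plus a tags table
    # keyed on the session id. Names are placeholders only.
    db = sqlite3.connect("timetrack.db")
    db.executescript("""
    CREATE TABLE IF NOT EXISTS sessions (
        id      INTEGER PRIMARY KEY,
        project TEXT NOT NULL,
        start   TIMESTAMP NOT NULL,
        stop    TIMESTAMP
    );
    CREATE TABLE IF NOT EXISTS tags (
        session_id INTEGER NOT NULL REFERENCES sessions(id),
        tag        TEXT NOT NULL
    );
    """)

A per-month breakdown would then be little more than a GROUP BY strftime('%Y-%m', start) away, and tags become a simple join.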

Automatic session detection

In addition to that, my introductory post to timetrack yielded a comment from archie which got me thinking. It really would be nice if the sessions started on their own.

I am thinking that for actual coding sessions, this shouldn’t be all that impossible.

For planning and design work (I am thinking mental modelling and time spent just grasping the concepts) it would be harder, and if I do go down the SQLite route I suspect I’d need to create another script just for simple addition into the database after the fact.

However, for file-bound operations one could try to see if an approach similar to what fsniper does couldn’t be used. The technical details of fsniper are described as:

“fsniper uses inotify to watch for when a file is closed after being written to. This means utilities such as touch will cause the event to trigger.” (http://files.minuslab.net/doc.html)

Most suggested uses of fsniper have revolved around doing something to the file that was just modified, but from what I can tell, there shouldn’t be any reason one couldn’t execute any old script, doing just about anything, for any purpose, when a file in the watched directory has been modified.

This would all hinge on one small detail though: that there is some inotify event similar to the one fsniper listens for, but triggered when a file is opened for writing. (This might, however, be indistinguishable from a file being opened just for reading, and then it would trigger on just about anything…)

Of course, this would also mean that we need some way of graphically asking the user whether a session should be started (the script won’t be executed from a visible shell), and I am thinking about Zenity for that.

But the best thing about this is that this solution, with inotify, something fsniper-ish and Zenity, would represent optional dependencies (iff/when I manage to get some working code for it).
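Something along these lines is what I picture, as a rough sketch only: it assumes pyinotify and Zenity are installed, the watched directory and the timetrack command are made up, and a real version would need to avoid prompting on every single write:

    import subprocess
    import pyinotify

    WATCH_DIR = "/home/user/projects"   # made-up directory to watch

    class Handler(pyinotify.ProcessEvent):
        def process_IN_MODIFY(self, event):
            # Ask graphically whether a session should be started;
            # `zenity --question` exits with 0 when the user answers yes.
            answer = subprocess.call(
                ["zenity", "--question",
                 "--text", "Start a timetrack session for %s?" % event.pathname])
            if answer == 0:
                subprocess.call(["timetrack", "start"])   # made-up command

    wm = pyinotify.WatchManager()
    wm.add_watch(WATCH_DIR, pyinotify.IN_MODIFY, rec=True)
    pyinotify.Notifier(wm, Handler()).loop()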

:wq

Midnight hacking, part 2

Saturday, February 28th, 2009

I have since the last post come up with a name for this little project: “Vocabulary fingerprinting”. This post should be part 3 or 4, but obviously it fell out of my memory to write the second post after my second midnight session.

The first refactorings I made brought down execution time a bit. Not as much as I had hoped for, but a bit. When I then added more regular expressions to improve the accuracy of what should be stored in the database, execution time was impacted negatively. It didn’t quite go back to the first value of ~120 minutes, but it is now holding steady at around 113 minutes.

I will have to profile the code to get to the root of it, but my suspicions are a) the regular expressions (I believe I now have in the vicinity of 15 different regexes which are being executed once on every line I have written, and some on every word) and b) SQLite (disk I/O).

I do have some ideas about how to speed it up. In the case of the regexes I’d have to rearrange the order in which they are executed, and make some of them more generic (having 3 or 4 regexes to sort out all the various smileys… no, not good). As for the probable disk I/O bottleneck… well, I can’t really get around writing to the disk, but I can store a bunch of words in memory and then write them all to disk at once. I don’t know why that would affect anything, but it just feels like it would help.
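Sketched out, the batching would look something like this; the point, if there is one, would be doing a single commit per batch instead of touching the database for every single word (the vocabulary table’s column names are stand-ins):

    import sqlite3
    from collections import Counter

    def flush(db, counts):
        """Write a batch of in-memory word counts to disk in one transaction.

        Assumes something like: CREATE TABLE vocabulary
        (word TEXT PRIMARY KEY, count INTEGER) -- names are stand-ins.
        """
        items = list(counts.items())
        # Make sure every word has a row ...
        db.executemany(
            "INSERT OR IGNORE INTO vocabulary (word, count) VALUES (?, 0)",
            [(word,) for word, _ in items])
        # ... then bump all the counters at once.
        db.executemany(
            "UPDATE vocabulary SET count = count + ? WHERE word = ?",
            [(n, word) for word, n in items])
        db.commit()
        counts.clear()

    # Usage sketch: counts = Counter(); counts.update(words) per parsed line;
    # call flush(db, counts) every few hundred files.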

I have to confess that it has actually been a couple of days since I last worked, or thought about working, on this project, but today when I woke up I got a new idea. I haven’t yet decided if the idea is any good.

Since I am parsing instant-messaging logs, each line represents one message. Since messages differ in length, both in terms of the number of words and the number of characters, it could be interesting to store an “average line length” value in the database as well, since this could make identification easier (think ballpark estimates: “chatty” or “quiet”). But then the code would become very domain specific. You couldn’t try to identify a person by feeding the code an article, since that would be a great mass of text, with each line formatted to be roughly of equal length (on or around 80 characters per line, probably).

One could of course craft two modules, one article-specific and one instant-messaging-specific, and have them share a database (i.e. the “vocabulary” table would be shared, but meta-data about how the writer composes his texts would be stored in two different tables, depending on what we want to compare with). This could work, since it is possible to identify “chattiness” in articles as well, the only difference being that the word count ranges for determining “chattiness” are larger (150-2000, 2000-6000, etc). (I have a feeling that I could easily end up in the latter range…)

Finally, I tried to implement a measurement facility in the code. Something along the lines of:

Given ten measurement points, evenly distributed across the files being parsed (in this case 2900 files / 10 points = one point every 290 files), record the current time (a timestamp) at each point.

In the end, what I want is an output for every 290 files, telling me the estimated time to completion, based on the average time it has taken the code to accomplish its task so far.

I thought that it would be pretty straightforward: just a simple case of determining how many measurement points are left, creating time-deltas (mp1 – starttime, mp2 – mp1, mp3 – mp2 …), adding those together, dividing by the number of time-deltas, and multiplying that average by the number of measurement points left.
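For reference, the calculation I have in mind boils down to something like this (a sketch of the idea, not the actual, misbehaving, code; the timestamps would come from time.time()):

    def estimate_remaining(start_time, checkpoints, points_total=10):
        """Estimate time to completion from the checkpoints recorded so far.

        start_time  -- timestamp taken before the first file
        checkpoints -- timestamps recorded at each measurement point so far
        """
        previous = [start_time] + checkpoints[:-1]
        deltas = [now - before for now, before in zip(checkpoints, previous)]
        average = sum(deltas) / len(deltas)
        remaining_points = points_total - len(checkpoints)
        return average * remaining_points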

I checked it with my brother (who is way more math smart than I am) and he thinks it seems legit… but the code doesn’t work. It reports about 30 minutes to completion at every measurement point up until the last one, where it drops to 8 minutes…

His thought was that maybe the initial files are so small compared to the files at the end that the estimates get frakked up. I guess I will have to either randomize the order of the list containing the filenames, or just reverse it, to see if it makes any difference.

That’s all for now.

Midnight hacking

Thursday, February 12th, 2009

Last Saturday… sorry, early Sunday, way past any reasonable bedtime, I was twisting and turning, finding it impossible to fall asleep. Reading a magazine didn’t work; in fact it might just have had the opposite effect. It got my brain working, and all of a sudden an idea entered my mind.

I can’t understand it myself, so don’t bother asking (there will be no coherent or reasonable answer), but I got the idea to pull my Pidgin log files, all 2900 of them, dating back to 2008.01.01, and have a program go through all of them, cataloging and counting the outgoing words.

Maybe it was some urge to code, maybe my subconscious has a plan for the code which it has yet to reveal to me, I couldn’t tell you, but the more I thought about it, the more the idea appealed to me. Within half an hour I knew roughly how I wanted to do it.

The premise was: 2900 HTML-formatted log-files describing interactions between me and one or more external parties. Pidgin stores each sent message on a separate line, so except for some meta-data about when the conversation took place, located at the top of the file, there was one line per message.

I wanted the code to be, well, I hesitate to call the resulting code “modular”, but “dynamic” might be a better word. So no hard-coded values about what alias to look for. This worked out fine, as I soon realized I also needed a file name for the SQLite database which would store the data.

The script is called with two parameters, an alias and a path to the directory in which the logs can be found. This is also where I cheated. I should have made the script recursively walk into any sub directory in that path, looking for HTML-files, but I opted instead to move all the files from their separate sub directories into one large directory. Nautilus gives me an angry stare every time I even hint at wanting to open that directory, but handling sub directories will come in a later revision.

So, given an alias (the unique identifier that every line to be consumed should contain) and a path, list all HTML files found in that path. Once this list has been compiled, begin working through it, opening one file at a time, and for each line in that file, determine if the line should be consumed or discarded.

Since each line contains HTML formatting, as well as the alias and a timestamp, it would be prudent to scrape these away; regular expressions to the rescue. Notice the trailing “s”: simple code is better than complex code, and a couple of fairly readable regular expressions are better than one monster of an expression. So away go the HTML formatting, the smileys, the timestamps and the alias. What should now, theoretically, be left is a string of words.

So that string is split up into words and fed into the SQLite database.
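Roughly, and only roughly, the scrubbing works like this; the patterns below are simplified stand-ins, the real script uses more of them, and pickier ones:

    import re

    # Simplified stand-ins for the real patterns: several small, readable
    # expressions rather than one monster regex.
    HTML_TAG  = re.compile(r"<[^>]+>")
    TIMESTAMP = re.compile(r"\(\d{2}:\d{2}:\d{2}\)")
    SMILEY    = re.compile(r"[:;]-?[()DPp]")

    def words_from_line(line, alias):
        if alias not in line:
            return []                     # not one of my lines: discard it
        line = HTML_TAG.sub(" ", line)
        line = TIMESTAMP.sub(" ", line)
        line = SMILEY.sub(" ", line)
        line = line.replace(alias, " ")
        return line.split()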

I was happy: this was my first attempt at working with SQLite, and thus my first attempt at getting Python to work with SQLite, and it worked like a charm. Three separate queries were used: the first tries to select the word being stored; if the select returns a result, the stored value is incremented by one and updated; if no result is returned, a simple insert is called instead.
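Reconstructed from memory rather than pasted from the script, the per-word pattern is more or less this (the table and column names are stand-ins):

    def store_word(db, word):
        # Naive version: at least two round-trips to the database per word.
        row = db.execute(
            "SELECT count FROM vocabulary WHERE word = ?", (word,)).fetchone()
        if row:
            db.execute("UPDATE vocabulary SET count = ? WHERE word = ?",
                       (row[0] + 1, word))
        else:
            db.execute("INSERT INTO vocabulary (word, count) VALUES (?, 1)",
                       (word,))
        db.commit()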

This is of course the naive and sub-optimal way to do it, but right then I was just so thrilled about coding something that I didn’t want to risk leaving “the zone”. Needless to say, doing two queries per word means hitting the hard drive two times per word, for EVERY word, on EVERY matching line, in EVERY one of the 2900 files. Yeah, I think there is room for some improvement here.

But I have to admit, I am impressed: in roughly four hours, give or take half an hour, I managed to put together a program which worked surprisingly well. The one regret I have right now is that I didn’t write anything in the way of tests. No unit tests, no performance tests, no nothing. Of course, had I done that, I probably would have gotten bored half way through and fallen asleep. And tests can be written after the fact.

Well, I wrote one simple progress measurement. The loop going through the files is driven by the function enumerate, so I had an index indicating which file was being processed, and for each file having been closed (processed and done) I printed the message “File %d done!”. From this I was able to clock the script at finishing roughly 20 files a minute (the measurements were taken at ten-minute intervals), but this is rather imprecise as no file equals another in line or word length.

It was truly inspiring to realize how much can be done in so little time. The next step, besides the obvious room for improvement and optimization, is to use this little project as a real-life exercise to test how much I have learned from reading Martin Fowler’s Refactoring – Improving the Design of Existing Code.

The ability to walk down into sub directories should of course be added, but the most interesting thing at the moment is going to be finding a way to detect typos. The regexp rule for how to detect and split up words is a little… “stupid” at the moment.

Initially (after having slept through the day following that session) I thought about the typos, and how to detect them, and how one might be able to use something like Levenshtein distance, but again, this would entail I/O-heavy operations and also start impacting the processor. There is probably some Python binding for Aspell one could use; I will have to look into that.
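Before going down the Aspell road, the standard library’s difflib might be enough for a first stab; a sketch, where the word list passed in is whatever dictionary happens to be handy:

    import difflib

    def likely_typos(words, dictionary, cutoff=0.85):
        """Yield (word, suggestion) for words close to, but not in, the dictionary.

        A cheap stand-in for a proper Levenshtein/Aspell check; it is also
        slow, since every unknown word is compared against the whole list.
        """
        known = set(dictionary)
        candidates = list(known)
        for word in words:
            if word in known:
                continue
            close = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
            if close:
                yield word, close[0]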

So, finally, why? What’s the reason? The motivation?

Well, somewhere in the back of my mind I remember having read an article which discussed the possibility of using writing style to identify people. So if I publish enough text on this blog, and then, on another, more anonymous blog, I publish something else, the words I use, or the frequency with which I use them, should give me away. In order to prove or disprove that hypothesis, a signature would need to be identified, in the form of a database containing words and their frequencies (in such a case it might even be beneficial NOT to attempt to correct spelling errors, as they might indeed also be a “tell”), and then a program written which attempts to determine whether or not it is probable that a given text was written by me.
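The comparison step, once such signatures exist, could start out as something as blunt as cosine similarity between word-frequency profiles; a sketch of the idea, nothing statistically rigorous:

    import math
    from collections import Counter

    def similarity(text_a, text_b):
        """Cosine similarity between the word-frequency profiles of two texts."""
        a = Counter(text_a.lower().split())
        b = Counter(text_b.lower().split())
        dot = sum(a[word] * b[word] for word in set(a) & set(b))
        norm = (math.sqrt(sum(n * n for n in a.values()))
                * math.sqrt(sum(n * n for n in b.values())))
        return dot / norm if norm else 0.0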

While talking about the idea with a friend, she asked me about privacy concerns (I can only assume that she didn’t feel entirely satisfied with the thought of me sitting on a database with her words and their frequencies) and that is a valid concern. Besides the ethical ramifications of deriving data from and about people I call my friends, there is a potential flaw in trying to generate signatures for my friends from the partial data feed I am privy to. I base this potential flaw on the fact that I know that my relationship, experience and history with my various friends make for a diversified use of my vocabulary. In short, what language I use is determined by the person I speak with.

And now I have spewed out too many words again… *doh*