And I thought keeping software engineers in line was a hard job. I definitely do not envy the job of a director.
Xtext is a pretty cool plugin/SDK for Eclipse that makes it relatively easy to build a full-featured editor for your favorite language. I recently played around with it (read: wasted a few days) and created a simple editor for my current language of choice, IcedCoffeeScript. (If you’re really curious/bored, the source is on github.)
Almost all of the information you find on Xtext assumes you will define a complete grammar for your language, which is a bit too much hassle if you’re just starting out and all you really want is some pretty syntax. Fortunately, it’s pretty straightforward to get a syntax-only parser if you want one – simply define a grammar with all of your terminals:
terminal KW_X: "x";
terminal KW_Y: "y";
terminal KW_Z: "z";
terminal ANY: .;
and a single top-level rule:
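A minimal sketch of what that catch-all rule might look like (the rule and feature names here are illustrative, not taken from the project):

```
Document:
    (tokens+=(KW_X | KW_Y | KW_Z | ANY))*;
```

The rule just consumes every terminal in sequence, so the parser never rejects input – all the interesting work happens in the highlighting layer.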
You can then supply a highlighting configuration to specify how various terms should be highlighted, and a TokenMapper that maps from actual tokens to term classes. (Yes, I know the terminology gets a bit confusing/baroque.) Here are some links to the various parts I used in my project:
Whenever I encounter a paper like this – Towards automatic optimization of MapReduce programs – and there are a lot of them, I find myself sighing inwardly. (Heck, we even had a student of ours who ended up tweaking a bunch of these knobs: http://www.news.cs.nyu.edu/node/146).
This seems to be a common refrain in Java programs, but Hadoop especially – rather than either choosing a sensible constant, or adapting a value at runtime, let’s foist all of the work onto the user. But the way it’s phrased is clever – we’re not avoiding the decision, we’re just making it so the user can configure things however they want. I’ve done this a lot myself – it’s just so easy to add a flag to your command line or your config file and pride yourself on a job well (not) done.
The key issue here is that, as a user, I don’t know what to put in for these values, I don’t know what’s important to change, and so I’m the absolute worst person to be responsible for these things.
Seriously, why are you giving the user these parameters to tweak?
What inevitably happens is we don’t know what any of these things actually mean when it comes to making things faster, so we end up searching the internet for the magic numbers to plug in, rerunning our jobs a whole bunch and wasting a crap-load of time.
This is not a desirable user experience. I mean, here’s the interface a car exposes to me:
There’s a “go faster” pedal and a “go slower” pedal. These correspond to all sorts of complicated, dynamic magic inside of the engine compartment, but I don’t need to know about them – the system handles it for me. Moreover, it can adjust parameters at runtime, in response to the behavior of the car – unlike most of our lazy computer programs!
If only our programs could be more like cars (though hopefully with better gas mileage).
Occasionally I find I need a package that isn’t in my distribution, or I need to rebuild from source for whatever reason. In the past, I’ve always been conflicted about that age old question:
Where do I install this bugger?
The default for most packages, /usr/local, is fine for most purposes, but then there are annoyances – if I want to use this package on other machines, then I’d be better off putting it under ~ (/home/power), since our cluster has a shared NFS mount. But then I’m filling up my home directory with random bins and sbins and includes, and if/when I need to uninstall something I always get confused and blow away the wrong thing (since all of the binaries end up in /home/power/bin).
Installing into subdirectories (/home/power/my-package) has its own annoyances – I have to make sure to update my $PATH every time, and I start to get confused when there are too many things in my home directory (I don’t know why, I just do!).
Fortunately, I’ve come across a nice solution to all this. I install everything into /home/power/pkg:
power@kermit> ls -l pkg
...
drwxrwxr-x 7 power power 4096 Nov 19 19:42 openmpi
drwxrwxr-x 6 power power 4096 Jul 24 18:48 oprofile
drwxr-xr-x 6 power power 4096 Feb 19  2011 paperpile
drwxrwxr-x 4 power power 4096 Nov  6  2011 parallel
drwxrwxr-x 5 power power 4096 May  9  2012 perl5
drwxrwxr-x 6 power power 4096 Apr 15  2012 postgresql
drwxr-xr-x 7 power power 4096 Feb  9  2012 pypy-1.8
drwxr-xr-x 7 power power 4096 Jun  7 12:50 pypy-1.9
drwxr-xr-x 6 power power 4096 Apr 27  2012 python-2.7
And add the following to my bashrc:
for d in /home/power/pkg/*/bin; do
  export PATH=$PATH:$d
done

for d in /home/power/pkg/*/lib; do
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$d
done
Now every time I install a package under ‘pkg’, my PATH is automatically updated to discover it. And if I do need to remove a package, I just blow away the directory.
I won’t claim that this is brilliant or original. But it works for me, so… it’s nice.
I’ve been using bash for far too long (> 10 years now), and I was reminded the other day (after helping a colleague with some coding) of some of the random tricks you pick up over time. I’ve put them into an (almost completely wrong) timeline. Are there any other common idioms that I’m forgetting to put in here?
“Oh man, look at that, I can hit up-arrow and repeat the previous command!”
You’re so excited by the novelty that you don’t pay attention to the fact that it’s slower than retyping the old command.
After again hitting up-arrow 30+ times to find an old command, you accidentally come upon reverse i-search (CTRL+R) and realize what it means.
You start using quick substitution:
$ echo foo
$ ^foo^bar
And are feeling pretty much like a master.
HERE documents are now completely under your control, and you have started writing scripts to try to automate everything, even operations that you know you’ll only have to do once.
You spend at least one entire day fiddling with themes, in the name of “productivity”.
You’re feeling more confident, and moreover, the reckless abandon of your Bash youth seems to have passed. After a brief spell with the vi key bindings, you’re back to the emacs bindings, but feeling invigorated by your exploration. You realize that there is a separation between inputrc and bashrc, but you don’t really have time to investigate further. After all, you just added
set completion-query-items 10000
set completion-ignore-case On
to .inputrc and are far too excited about the idea of never being asked:
Display all 3425 possibilities? (y or n)?
export HISTSIZE=1000000
export HISTFILESIZE=1000000
shopt -s histappend
How could it have taken you so long to search for this? No longer will having multiple terminals open cause you to lose your hard earned history. You anticipate the point in time where you will have accumulated so many commands in your history file that you will never have to type a new one.
You dabble with zsh after seeing a friend’s super-colorful console. You give up after you realize that zsh is missing most of the awesome TAB-completions you are by now accustomed to. By accident, you try tab-completing an scp command and are floored by the fact that it actually completes based on the remote filesystem. You start trying to write your own completion scripts, but realize these are things better left to experts. It is slowly dawning on you that you are not an expert.
You’ve also become more confident in your escapes – you feel not the slightest bit scared about using arithmetic now:
MYPORT=$((PORTBASE + 1))
And you use $( ) command substitution

result=$(echo foo)

to differentiate yourself from those silly backtick users who don’t know what they’re missing:
ROOT=$(dirname $(dirname $(readlink -f $0)))
You accidentally hit CTRL+X CTRL+E again, but this time you notice the magic keystrokes that got you here. An $EDITOR window for modifying the command line? How cool is this? Now your Awk scripts will become even more powerful (and ever more incomprehensible).
Your scripts have started to become Zen-like koans of existential beauty. Your full knowledge of the power of trap EXIT allows you to impress your neighbors, whose adulation you accept with a wry smile. You know when to CTRL-\ (SIGQUIT) and when to CTRL-C (SIGINT) – you use force only when required.
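In case the trap EXIT idiom is new to you, here’s a minimal sketch (the names and paths are illustrative): the handler runs whether the script finishes normally, dies with an error, or is interrupted with CTRL-C.

```shell
#!/bin/bash
# Illustrative sketch of 'trap ... EXIT': clean up a scratch
# directory no matter how the script ends.
scratch=$(mktemp -d)

cleanup() {
  rm -rf "$scratch"
}

# The EXIT trap fires on normal exit, on 'exit 1', and after SIGINT.
trap cleanup EXIT

echo "working in $scratch"
```

Because the cleanup is registered once, up front, you can add early exits and error paths later without ever leaking the scratch directory.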
You have come to the realization that you are just a beginner, and have so much more to learn.
I spent a little time cleaning up mycloud recently – it’s a Python library that gives you a simple map/mapreduce interface without any setup (just SSH access).
I’ve been using it a lot for little processing tasks – it saves me a lot of time over running things using just my machine, or having to switch over to writing Hadoop code.
Source is on github, and the package is available on PyPi for easy installation:
pip install [--user] mycloud
Begin verbatim README dump
MyCloud makes parallelizing your existing Python code using local machines easy – all you need is SSH access to a machine and you too can be part of this whole cloud revolution!
Starting your cluster:
import mycloud

cluster = mycloud.Cluster(['machine1', 'machine2'])

# or use defaults from ~/.config/mycloud
# cluster = mycloud.Cluster()
Map over a list:
result = cluster.map(compute_factors, range(1000))
Use the MapReduce interface to easily handle processing of larger datasets:
from mycloud.mapreduce import MapReduce, group
from mycloud.resource import CSV

input_desc = [CSV('/path/to/my_input_%d.csv' % i) for i in range(100)]
output_desc = [CSV('/path/to/my_output_file.csv')]

def map_identity(kv_iter, output):
  for k, v in kv_iter:
    output(k, int(v))

def reduce_sum(kv_iter, output):
  for k, values in group(kv_iter):
    output(k, sum(values))

mr = MapReduce(cluster, map_identity, reduce_sum, input_desc, output_desc)
result = mr.run()

for k, v in result.reader():
  print k, v
It is, keep in mind, written entirely in Python.
Some simple operations I’ve used it for (6 machines, 96 cores):
- Sorting a billion numbers: ~5 minutes
- Preprocessing 1.3 million images (resizing and SIFT feature extraction): ~1 hour
Mycloud has built-in support for processing the following file types:
- Text (lines)
Adding support for your own is simple – just write a resource class describing how to get a reader and writer (see resource.py for details).
Sometimes you’re developing something in Python (because that’s what you do), and you decide you’d like to parallelize it. Our current options are multiprocessing (limiting us to a single machine) and Hadoop streaming (limiting us to strings and Hadoop’s input formats).
Also, because I could.
The default size chosen by imshow yields unpleasantly small images. Fortunately, you can easily change it using the rather strangely named gcf() (“get current figure”) function:
import pylab as P
...
f = P.gcf()
f.set_figheight(16)
f.set_figwidth(16)
Tip for future office developers: save money and make people happier by not completely trying to override Mother Nature!
That’s right – when it’s 40-50 degrees F outside, please please please stop heating my office to 80! And when it’s 90 outside, please stop cooling my office to 50… the planet and I will thank you.
I’m toying with the idea of creating BigTable/HBase extension that exposes tables as a gigantic virtual spreadsheet.
Then, following the spreadsheet paradigm, I should be able to enter formulas for columns/cells and have them be calculated dynamically based on the data. This would be similar in concept to a database view.
Now, if someone could just go ahead and make an open-source version of Spanner it would simplify this a lot for me…
I knew this had to exist, since otherwise generating logarithmic plots in matplotlib would be a pain in the butt. Still, it took a bit of searching, although perhaps the name alone should have clued me in.
import numpy as N
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
steps = N.log10(N.logspace(0.9, 1 - 1e-5))
ax.set_yscale('log', basey=10)  # note: basey, not basex, for the y-axis
ax.plot(steps, f(steps), '-')   # f is whatever function you're plotting
Also, a shout-out for the IPython inline graphs (ipython notebook --pylab inline). Beautiful, and I can copy-paste them into emails and google docs!