Do you have a few computers lying around the house or lab that you want to make use of for your Python computing? Too lazy to set up a real MapReduce system like Hadoop? Well, this could be your solution! (It probably won’t be, but, hey, you never know.)
mycloud lets you make use of your spare cycles, all from within Python. All you need is the SSH connection that you’re already using to manage your systems. Here’s an example:
Starting your cluster:
import mycloud

# list each machine and the number of cores to use
cluster = mycloud.Cluster([('machine1', 4),
                           ('machine2', 4)],
                          fs_prefix='/path/to/store/results')
Invoke a function over a list of inputs:
result = cluster.map(my_expensive_function, range(1000))
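If you want to check what a call like this should return before pointing it at real machines, a local stand-in can help. The sketch below uses only the standard library and is not how mycloud works internally. `my_expensive_function` and `local_map` are hypothetical names made up for illustration, and it assumes (unverified) that `cluster.map` hands results back in input order, like the builtin `map`.

```python
from concurrent.futures import ThreadPoolExecutor


def my_expensive_function(x):
    # Hypothetical stand-in for real work.
    return x * x


def local_map(fn, inputs, workers=4):
    # Local analogue only: spread the calls over a pool of worker threads.
    # Executor.map yields results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, inputs))


result = local_map(my_expensive_function, range(1000))
```

Here `result` is a plain list with one entry per input, which makes it easy to compare against whatever the cluster version gives you.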
Use the MapReduce interface to easily handle processing of larger datasets:
from mycloud.resource import CSV
input_desc = [CSV('my_input_%d.csv' % i) for i in range(100)]
output_desc = [CSV('my_output_file.csv')]
def map_identity(k, v):
    yield (k, int(v[0]))

def reduce_sum(k, values):
    yield (k, sum(values))

mr = mycloud.mapreduce.MapReduce(cluster,
                                 map_identity,
                                 reduce_sum,
                                 input_desc,
                                 output_desc)
result = mr.run()
for k, v in result[0].reader():
    print('%s %s' % (k, v))
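To see what the mapper and reducer above compute, here is a minimal local sketch that runs the same two functions without mycloud. `run_local_mapreduce` is a hypothetical helper, not part of the library. It assumes the usual MapReduce contract (mapper output is grouped by key before the reducer sees it) and that each input value is a CSV row given as a list of strings. Neither assumption has been checked against mycloud itself.

```python
from collections import defaultdict


def map_identity(k, v):
    # v is assumed to be a CSV row (list of strings); keep its first field.
    yield (k, int(v[0]))


def reduce_sum(k, values):
    yield (k, sum(values))


def run_local_mapreduce(mapper, reducer, records):
    # Hypothetical helper: map every record, group the output by key,
    # then reduce each group. Keys are processed in sorted order.
    grouped = defaultdict(list)
    for k, v in records:
        for out_k, out_v in mapper(k, v):
            grouped[out_k].append(out_v)
    results = []
    for k in sorted(grouped):
        results.extend(reducer(k, grouped[k]))
    return results


records = [('a', ['1']), ('b', ['2']), ('a', ['3'])]
totals = run_local_mapreduce(map_identity, reduce_sum, records)
```

With these three records, `totals` holds one summed pair per distinct key.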
The code is mostly undocumented, but please contribute documentation and/or complaints as you see fit!
You can access the source code here: https://bitbucket.org/rjpower/mycloud
Or you can install it from PyPI with pip:
pip install mycloud