This Blog moved to GitHub / Hugo

After many months of inactivity, I’ve moved this blog to Hugo / GitHub in the hope that a change of tools (i.e. the ability to post from the command line) may inspire me to keep it more up to date.

http://jarvist.github.io/

Embarrassing embarrassingly parallel

I gave up OpenMP’ing my Monte Carlo code about a year ago. I was trying to do it at the level of MC moves, and there are massive issues there: the different threads try to sample and update the same simulation space, leading to data races and segfaults as the threads’ updates crash into one another.

One method I’ve seen used in the literature is domain decomposition – where each thread runs MC on its own segment of the simulation volume. This is quite a headache to set up, and then you have issues with simultaneous updates, and the inability to exchange particles / states across the boundaries. All scary issues.

So I gave up. I am not a computer scientist. I could not understand even the terms of reference; lambda-calculus reductions and all that jazz. In Simon Peyton Jones’ terms, “I am a worm.”

I revisited it this evening, with the intention of seeing whether I could at least separate the threads’ random number generators. However, it instantly struck me what I should do. Each MC-move attempt calls a ‘site-energy’ function for the test move, which runs a nest of for loops over nearby space (i.e. within a cutoff) to get the total energy change for that move.
The change-in-energy variable ‘dE’ is a natural reduction variable: none of these loops modifies the simulation volume, and in a production run, where you are integrating over a large volume of space for accurate energies, the majority of the computer time is spent in them.

One #pragma omp parallel for reduction(+:dE) before the start of the for-loop nest, a recompile with -fopenmp, and it now maxes out the 4 hyperthreaded cores on my laptop.
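
Concretely, the loop nest looks something like this – a minimal sketch of the idea, where all the names (site_energy, pair_energy, lattice, NSITE, CUTOFF) are illustrative, not from my actual code:

    #include <stdio.h>

    #define NSITE 64   /* lattice dimension */
    #define CUTOFF 10  /* interaction cutoff, in lattice units */

    double lattice[NSITE][NSITE][NSITE]; /* site states; only read below */

    /* Energy between site (x,y,z) and the site at offset (dx,dy,dz),
       with periodic wrap-around. A stand-in for the real Hamiltonian. */
    double pair_energy(int x, int y, int z, int dx, int dy, int dz)
    {
        if (dx == 0 && dy == 0 && dz == 0)
            return 0.0;
        return lattice[x][y][z] *
               lattice[(x + dx + NSITE) % NSITE]
                      [(y + dy + NSITE) % NSITE]
                      [(z + dz + NSITE) % NSITE];
    }

    double site_energy(int x, int y, int z)
    {
        double dE = 0.0;

        /* Each iteration only reads the simulation volume and adds into a
           thread-private copy of dE; OpenMP sums the copies at the end. */
        #pragma omp parallel for reduction(+:dE)
        for (int dx = -CUTOFF; dx <= CUTOFF; dx++)
            for (int dy = -CUTOFF; dy <= CUTOFF; dy++)
                for (int dz = -CUTOFF; dz <= CUTOFF; dz++)
                    dE += pair_energy(x, y, z, dx, dy, dz);

        return dE;
    }

    int main(void)
    {
        printf("dE = %f\n", site_energy(32, 32, 32)); /* all-zero lattice */
        return 0;
    }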

Scaling will be more limited on bigger machines, and a speedup will only be seen where this energy summation – rather than the single-threaded updates to the simulation volume and the random number generator – dominates the CPU time required (Amdahl’s law: with a parallelisable fraction p of the runtime, the speedup on N cores is capped at 1/((1-p) + p/N)).

But still. Embarrassing embarrassingly parallel, because I could have and should have seen this a year ago!

Organising computational job workflow, as you go: freck

The workflow in computational materials modelling is often highly complex, and bespoke to the particular calculations you are doing. The flow is highly derivative (later jobs depend on the results of prior jobs), but can also be quite chaotic and multiply branching. Unsurprisingly, your files easily end up very messy, with significant confusion about what steps were required to arrive at the present working setup.

Version control is an absolute godsend: being able to flow down the list of settings changes is extremely useful, and being able to simply open up the Git repository on GitHub as a way of archiving your research data is fantastic. However, this still leaves problems to be solved – the main one being actually constructing the folder structure sensibly. (A lesser issue is that intermediate and binary files in electronic structure calculations can be enormous – multi-gigabyte – and so far too large for Git.)

To aid this I’ve written a very lightweight shell script called ‘freck’ (an obsolete English word meaning to move swiftly or nimbly). Each time it is run in a working directory, it generates a folder named 0001-, 0002-, etc., suffixed with a user-supplied name, and then moves all the present working files into that directory. It also saves a ‘breadcrumb’ file recording when and where you ran it. I would like it to also capture your present shell history (i.e. all the commands you have used to work on the data), but the problem is that the script runs within its own shell, and it’s not obvious how to access the calling shell’s history easily or portably.
(Running history > history.log yourself beforehand is very quick, though.)

The suggested workflow is that you then run a ‘git add’ on the small input / output files, and commit these to a repository with a sensible commit message.
Then copy the files needed to continue your job back into the present working directory, and carry on.
But it’s flexible enough that you can also just use it to separate your work when running on a cluster.

I’m sure I’m not the only one who’s ended up with a folder structure looking hellishly like this:-
~/PROJECT_FOO/geom_opt/geom_opt_restart/geom_opt_restart_higher_convergence/moar_kpoints/cation_geom/cation_geom_restart/frequency_calc/ etc.

Instead you end up with something much more sensible, time-ordered and single-depth: https://github.com/jarvist/CDFT-C2H4
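
For illustration, something like this (the directory names here are made up, not those of the linked repository):
~/PROJECT_FOO/0001-geom-opt/ 0002-geom-opt-restart/ 0003-cation-geom/ 0004-frequency-calc/ etc.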

Anyhow, check it out here – I’m always interested in any + all feedback!
https://github.com/jarvist/hpc-bin/blob/master/freck

Writing scientific papers

I am probably the last person you would want to consult for tips on writing a scientific publication. I’m slow, I procrastinate, my English is not the best, and much work lies on the cutting-room floor – skeletons of half-assembled papers. Still, ‘that which we are, we are’; though my skills are lacking, I’m very interested in improving, and I certainly enjoy reading a well-written paper.

It is an unfortunate fact that the average quality of scientific writing is decreasing. Some of this can be explained by the laudable continuing expansion of science across class boundaries and across the world (a greater number of authors writing in English as an additional language). Yet I think the main reason is the pressure to publish: we write so many more papers, and there’s enormous pressure to oversell.

My parents were visiting this weekend. “How is work?” my mother asked. “I should be writing more.” – the scientist’s lament. “Oh, but don’t you only write a paper when you make a breakthrough?” To be honest, I think we would have a better scientific corpus if this were still the case – carefully work on a hard problem for a few years until we gain some traction, then write it up for the benefit of anyone who could take it elsewhere.

(Along these lines, Ross McKenzie has a recent blog post ‘In praise of modest goals‘.)

So the first thing to do is to overcome the cynicism! When I get particularly disaffected and want to crawl under my duvet and drift off to sleep reading Feynman, I return to Simon Peyton Jones’ talk on how to write a great research paper. His enthusiasm and motivation are infectious. I am a worm. A worm with an infectious mind virus, and that is good. (Regarding his choice of font, do read: http://www.mcsweeneys.net/articles/im-comic-sans-asshole )

Sabine Hossenfelder wrote a recent blog post on How to write your first scientific paper, which covers the construction and sectioning of a physics paper.

I’ve personally found the Penguin Writer’s Manual very useful for the nitty-gritty of ‘that’ vs ‘which’. I’ve also read Strunk and White (The Elements of Style), though I’m not sure how much I took from it. The Economist style guide (available online) is also very useful as a reference.

I much prefer to write with LaTeX. Sometimes I first go via Markdown (a lightweight markup language that looks a lot like how you would naturally format a text email), so I don’t need to think about LaTeX commands while forming the text, and then convert to LaTeX (with pandoc) for the inclusion of figures & all the revisions that follow thereafter. The whole project goes in a ‘git’ repository, giving version control and, effectively, offsite backup (to GitHub / Bitbucket). To get the diffs to work well, it’s easiest to add a newline at the end of every sentence (I usually hard-break my lines at 80 characters as I write in Vim).

I have contributed to papers in Microsoft Word (usually originating with collaborators). It works OK, and the track changes / comment tool can be really useful, though a distressing number of collaborators seem to send you back a ‘clean’ document with the tracking dropped. Whether that’s due to ignorance of the tools, compatibility with different versions of Word, or an attempt to bury any information about which of your changes they reverted, I do not know. Certainly it makes me respect them less academically and professionally.

I typically plot data in GNUPLOT or XMGRACE, sometimes also in Python’s Matplotlib (though this is much less deterministic – the output depends on the specific version you have installed, and it can be extremely frustrating to reproduce a tweaked diagram). I’ve found some useful websites of tips + tricks for plotting, useful beyond the immediate answer to a particular problem.

I’ve consolidated all my useful bookmarks on the above, and put them in a shared Google Chrome bookmark folder: Writing scientific papers

Hex Grids

Solid state materials pack together in a myriad of ways. The crystal structures are typically referred to by the name of the archetype compound. This ranges from fairly common English words such as ‘diamond’ and ‘rock-salt’ to ‘zinc-blende’, ‘wurtzite’ and, of course, ‘perovskite’.

The atomic packing within these structures is fully described by their space group, of which there are 230 to choose from. These space groups take the point groups compatible with crystallographic (periodic) structure, and combine them with the different possible lattice vectors.

http://img.chem.ucl.ac.uk/sgp/large/sgp.htm

Some structures are fiendishly complicated; others are effectively cubic grids (perhaps with lattice vectors which are neither orthogonal nor equal in length). Even very complicated structures often have a sublattice which packs cubically, or with hexagonal close packing.

One of the perhaps surprising things is that hexagonal packing and cubic packing are extremely similar to one another – in particular, face-centred cubic (FCC) and hexagonal close packing both have a coordination number of 12. Each sphere (atom) has 12 nearest neighbours.

This is very useful when you are building a computer model of a large structure. You can store the data about site occupancy in a simple array[][][], and calculate the various necessary metrics (in particular, real-space distance vectors) via simple (i.e. free, in terms of computer time) arithmetic.
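
As a minimal sketch of that arithmetic (the names, and the choice of FCC primitive vectors, are illustrative): a site’s array indices (i,j,k) map to real space as i·a1 + j·a2 + k·a3, so a separation vector is just integer differences times the lattice vectors.

    #include <stdio.h>

    #define A 1.0 /* lattice constant, arbitrary units */

    /* FCC primitive lattice vectors a1, a2, a3 as rows */
    static const double avec[3][3] = {
        { 0.0,     0.5 * A, 0.5 * A },
        { 0.5 * A, 0.0,     0.5 * A },
        { 0.5 * A, 0.5 * A, 0.0     }
    };

    /* Real-space separation of two sites whose array indices differ by
       (di,dj,dk): no trigonometry, and no square root until you
       actually need a scalar distance. */
    void separation(int di, int dj, int dk, double r[3])
    {
        for (int x = 0; x < 3; x++)
            r[x] = di * avec[0][x] + dj * avec[1][x] + dk * avec[2][x];
    }

    int main(void)
    {
        double r[3];
        separation(1, 0, 0, r); /* one of the 12 FCC nearest neighbours */
        printf("(%.2f, %.2f, %.2f)\n", r[0], r[1], r[2]);
        return 0;
    }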

I came across an article on hexagonal grids today; much of it was startlingly familiar, but in the way of giving you a framework and lexicon for what you’d previously hacked together with only a partial glimpse of true understanding.

(It’s written from the perspective of a games developer, but it’s directly applicable to scientific computation – just don’t tell the Research Councils how similar our austere science is to a game engine!)

http://www.redblobgames.com/grids/hexagons/

via HN discussion https://news.ycombinator.com/item?id=8941588

Snippet: SSH at Bath Uni

I polished up some slides I gave at our group meeting last summer, on using ‘ssh’ proficiently at Bath University. I thought they may be of use to others, so have published them online.

They mainly cover setting up an ssh config (the examples are probably useful for most people who use HPC facilities), and tunnelling via the LCPU facility (fairly Bath-specific, though other academic institutions might offer a similar Unix machine in a DMZ to bounce through).

Google Present Slides

https://github.com/jarvist/filthy-dotfiles/blob/master/ssh-config – my ssh-config, with Bath (& jarvist!) specific settings.

Snippet: New Julia paper from the developers

Looks to be a great reference, with plenty of code snippets getting bits of numerical methods working, finishing on the random-matrix semicircle law, with @parallel speedup! Wish this had been around when I was figuring out my little toy model… https://github.com/jarvist/LongSnakeMoan/blob/master/Sturm-DoS/Sturm_Drang.jl

http://arxiv.org/abs/1411.1607

via the ever useful, if distracting, Hacker News:
https://news.ycombinator.com/item?id=8576411