# Run GIGI Split and Merge Scripts 


### Runs GIGI with multiple threads by splitting the input and merging the output.  

#### Requirements
A reasonably modern version of g++ is needed only if you have to recompile the binaries. We haven't checked how far back you can go, but the default version provided by most OS package managers should be fine.  

#### Getting GIGI-Quick
##### With Git
Run the following command to clone the repository with git (git is a version control program started by Linus Torvalds, available from https://git-scm.com/downloads):

```git clone https://cse-git.qcri.org/Imputation/GIGI-Quick.git```  

##### With a browser
Go to this URL: https://cse-git.qcri.org/Imputation/GIGI-Quick/tree/master

Click on the icon with the download arrow above the "Last Update" column on the right-hand side. 
There are several download options with different compression formats.  If you get run_GIGI this way, you will need to decompress it before proceeding.

#### Installation
The downloaded files include executables compiled on 64-bit Red Hat Linux. If this is not your system, you may need to recompile them, 
but this is usually unnecessary unless your system has a different architecture (e.g., 32-bit x86, PowerPC, ARM).  
To recompile, start from the folder where you downloaded run_GIGI:  

```cd GIGI-Quick/SPLIT/```  

```g++ -O2 GIGISplit.cpp -o gigisplit```

```cd ../MERGE/  ```

```g++ -O2 GIGIMerge.cpp -o gigimerge```  

```cd ../GIGI/src/GIGI_v1.06.1```  

```g++ -O2 GIGI.cpp -o ../../GIGI```

That's it: GIGI-Quick is installed, and the main command to run it is run_GIGI.  
If you like, you can now add GIGI-Quick to your path (the examples assume that you have). To do this, add the following line to your .bashrc (located in your home folder):

```export PATH=${PATH}:/path/to/folder/where/you/put/run_GIGI```

Then source your .bashrc to apply the changes right away:

```source ~/.bashrc```
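
To confirm that the change took effect, you can check whether the folder now appears in your PATH. The following is only a small sketch: `path_contains` is a hypothetical helper name, not part of GIGI-Quick, and the folder shown is the same placeholder used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GIGI-Quick): return 0 if directory $1 is an
# entry of the colon-separated list $2 (defaults to the current $PATH).
path_contains() {
    local dir="${1%/}" list="${2-$PATH}"
    case ":${list}:" in
        *":${dir}:"*|*":${dir}/:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Placeholder location; replace it with the folder that holds run_GIGI.
gigi_dir="/path/to/folder/where/you/put/run_GIGI"
if path_contains "$gigi_dir"; then
    echo "run_GIGI folder is on PATH"
else
    echo "run_GIGI folder is not on PATH yet"
fi
```

Running `command -v run_GIGI` is a quicker check when you only want to know whether the shell can find the script.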

The folder structure of GIGI-Quick should be kept intact: GIGI-Quick depends on relative paths to locate the bundled scripts and executables other than run_GIGI.  
To add run_GIGI to the path system-wide for all users, you can create a symlink in /usr/bin pointing to the run_GIGI script: 

```ln -s /path/to/run_GIGI/script /usr/bin/run_GIGI```

#### Usage

Note: The parameter file is the same as you would normally use for GIGI, but if you are using the long format, pass the "-l" option.  
      The examples shown below use the file "param-v1_06.txt" because it is included in the repository, so they can be run by simply copying and pasting the example line.  

run_GIGI parameter_file -o [OUTPUT FOLDER] -n [RUN NAME] -t [THREADS] -m [MEMORY IN MB] [-l] [-v] -q [THREADS] -r [START] [END] [-V] [-h]

-o [OUTPUT FOLDER] : This is the path to use for the outputs from the run_GIGI scripts, including temporary files.  
-n [RUN NAME]      : This is a path relative to the [OUTPUT FOLDER] to use to keep the outputs from more than one run of run_GIGI separated.  
-t [THREADS]       : The number of threads to use for run_GIGI, and also the number of chunks to split the input into.  
-m [MEMORY IN MB]  : The amount of RAM that run_GIGI will restrict its use to, not yet implemented  
-l                 : Specifies that the input is in the long format.  
-V                 : Verbose mode.  Output from run_GIGI is quiet by default; with -V you can see much more of what it is doing and what variables are set to at various stages.  
-v                 : Display the version of GIGI-Quick and exit.  
-h                 : Display this help text.  
-r [START] [END]   : Run on only a selected region, from START to END.  This region is selected before any further splitting.  
-q [THREADS]       : Run in queued mode.  This mode runs up to THREADS instances of GIGI at a time and attempts to keep the total amount of memory in use below 
                     [MEMORY IN MB], using an estimate of the amount of memory GIGI may need.  If -m [MEMORY IN MB] isn't given, it uses the amount of available memory 
                     as shown by 'free'.  Older kernels do not report this value, so there we fall back to an estimate (amount free + amount of buff/cache) that is no longer accurate for modern systems: 
                     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34e431b0ae398fc54ea69ff85ec700722c9da773  Also, -t is ignored when -q is given.  
-e [MEMORY IN MB]  : Manual estimate of how much memory GIGI will need for queued mode in case the calculated estimate is too inaccurate
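
The memory logic described for -q can be sketched roughly as follows. This is an illustration under stated assumptions, not the code run_GIGI uses: `available_mem_mb` and `max_concurrent` are hypothetical names, the first reads a /proc/meminfo-style file (preferring MemAvailable and falling back to free + buffers + cache on older kernels), and the second is one plausible way to cap the number of simultaneous instances given a per-instance estimate greater than zero.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print available memory in MB from a /proc/meminfo-style
# file (default: /proc/meminfo). Prefers MemAvailable; on older kernels that
# lack it, falls back to MemFree + Buffers + Cached.
available_mem_mb() {
    awk '
        $1 == "MemAvailable:" { avail = $2; have_avail = 1 }
        $1 == "MemFree:"      { mem_free = $2 }
        $1 == "Buffers:"      { buffers = $2 }
        $1 == "Cached:"       { cached = $2 }
        END {
            kb = have_avail ? avail : mem_free + buffers + cached
            printf "%d\n", kb / 1024
        }
    ' "${1:-/proc/meminfo}"
}

# Hypothetical rule (an assumption, not run_GIGI's scheduler): how many
# instances fit in the memory budget, capped at THREADS, never less than 1.
# Arguments: THREADS BUDGET_MB ESTIMATE_MB (ESTIMATE_MB must be > 0).
max_concurrent() {
    local threads=$1 budget_mb=$2 estimate_mb=$3
    local fit=$(( budget_mb / estimate_mb ))
    if (( fit < 1 )); then fit=1; fi
    if (( fit > threads )); then fit=$threads; fi
    echo "$fit"
}
```

For instance, `max_concurrent 3 "$(available_mem_mb)" 200` would print how many 200 MB instances fit in the currently available memory, up to 3.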

Examples: 
```bash
  ./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt    # Output in the current folder, with no run-name subfolder; threads and memory determined automatically
  ./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS -n test_run    # Output in ./OUTPUTS/test_run
  ./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS -n test_run -V    # Output in ./OUTPUTS/test_run, verbose mode (print more detailed information)
  ./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS -n test_run -l    # Output in ./OUTPUTS/test_run for a parameter file in the long format; do not copy and paste this one, because the included param-v1_06.txt is not in the long format
  ./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -t 2    # Limit to 2 threads (and hence 2 chunks)
  ./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -m 1000    # Limit memory use to 1 GB; please read the section on memory and cgroups
  ./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -lmt 1000 2    # Long-format input, memory limited to 1 GB (please read the section on memory and cgroups), and 2 threads; do not copy and paste this one, because the included param-v1_06.txt is NOT in the long format
  ./run_GIGI INPUTS/Sample_Input/param-v1_06.txt -o RUN_FOLDER/ -n test_run -m 20 -q 3 -V -r 3 70    # Output in ./RUN_FOLDER/test_run, limit memory to 20 MB, use queued mode with up to 3 threads at a time, and run only on the region from 3 to 70. Note: the memory estimate used in queued mode does not account for the restricted region
```

If there is a problem that makes GIGI stop before completion, then the output files are left as they are in order to allow users to rerun only failed portions as needed.  
If you are unsure where the failure occurred, the safest approach is to remove the output files before rerunning (e.g. rm -R [OUTPUT FOLDER]/[RUN NAME]); as always, use rm with caution.  
For example, if the 2nd example above failed, I would run "rm -R ./OUTPUTS/test_run" before rerunning.  
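
Because an empty variable can turn an rm -R into something much more destructive, it may be worth wrapping the cleanup in a small guard. The sketch below is only a suggestion: `clean_run` is a hypothetical helper, not part of GIGI-Quick, and it simply refuses to run unless both the output folder and the run name are given.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GIGI-Quick): remove the output of one run,
# refusing to act when either argument is empty.
clean_run() {
    local out_dir=$1 run_name=$2
    if [ -z "$out_dir" ] || [ -z "$run_name" ]; then
        echo "usage: clean_run OUTPUT_FOLDER RUN_NAME" >&2
        return 1
    fi
    local target="${out_dir%/}/${run_name}"
    if [ -d "$target" ]; then
        rm -R -- "$target"
    else
        echo "nothing to remove at $target" >&2
    fi
}
```

With this in place, `clean_run ./OUTPUTS test_run` removes ./OUTPUTS/test_run and nothing else.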

The -n option is largely redundant, as it is equivalent to using the -o option with a longer path giving the subfolder, e.g.  

```./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS -n test_run```  

is equivalent to:  

```./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS/test_run```  

The inclusion of -n is mostly a semantic convenience.  

##### Logs

Since the output was cleaned up and the verbose option (-V) was added, you may notice that even in verbose mode you no longer see the output of split, gigi, and merge.  These are now written to their own individual log files in the output directory/run subdirectory.  

e.g. ./run_GIGI ./INPUTS/Sample_Input/param-v1_06.txt -o ./OUTPUTS -n test_run will have logs in ./OUTPUTS/test_run/LOGS
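
To take a quick look at what the individual steps reported, you could print the tail of every file in that folder. This is a convenience sketch only: `show_logs` is a hypothetical helper, and it makes no assumption about how the log files are named.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GIGI-Quick): print the last N lines
# (default 5) of every file in the LOGS folder of a run directory.
show_logs() {
    local log_dir="${1%/}/LOGS" lines="${2:-5}" f
    if [ ! -d "$log_dir" ]; then
        echo "no LOGS folder in $1" >&2
        return 1
    fi
    for f in "$log_dir"/*; do
        [ -f "$f" ] || continue
        echo "==> $f <=="
        tail -n "$lines" "$f"
    done
}
```

For the example above, that would be `show_logs ./OUTPUTS/test_run`.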

#### Memory and cgroups

We handle memory restrictions using cgroups.  After looking at a number of different memory-limiting mechanisms, we saw this as the best solution, but it has some caveats.  One is that root/sudo access is required to create the initial cgroup.  If you are on a shared machine, we encourage you to discuss this with your system administrator before using cgroups.  On most shared clusters, we encourage you to use the built-in memory-limiting mechanisms of your submission system (e.g. qsub, SLURM, Torque) instead of limiting memory through run_GIGI; most of these themselves make use of cgroups (e.g. SLURM https://slurm.schedmd.com/cgroups.html and HTCondor http://help.uis.cam.ac.uk/supporting-research/research-support/camgrid/camgrid/technical3/cgroups).    
If you are using this on your own system where you have root/sudo access, you will need to make sure that cgroups are set up and that your distribution's equivalent of the libcgroup library, which provides the cgcreate and cgexec commands, is installed.  
If you have a very old kernel (e.g. maybe 7+ years old), you may need to install a newer one that has cgroups (technically, they are part of the Linux kernel).  
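
A quick way to see whether the two commands are already available is to look them up on your PATH. The snippet below is a minimal sketch; `missing_commands` is a hypothetical helper name and the hint it prints is generic, since package names differ between distributions.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GIGI-Quick): print each command that is
# not found on PATH and return 1 if any are missing.
missing_commands() {
    local cmd status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            status=1
        fi
    done
    return "$status"
}

if missing_commands cgcreate cgexec; then
    echo "cgcreate and cgexec found"
else
    echo "install your distribution's libcgroup / cgroup-tools package"
fi
```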


Here is a list of common distributions with links to help/documentation on cgroups:

Red Hat: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch-Using_Control_Groups.html

Arch: https://wiki.archlinux.org/index.php/cgroups (note that libcgroup is an AUR package; to install such packages, see https://wiki.archlinux.org/index.php/Arch_User_Repository)

Debian/Ubuntu: https://www.devinhoward.ca/technology/2015/feb/implementing-cgroups-ubuntu-or-debian

Fedora: https://docs.fedoraproject.org/en-US/Fedora/17/html/Resource_Management_Guide/ch-Using_Control_Groups.html

Fedora: https://docs.fedoraproject.org/en-US/Fedora/15/html/Resource_Management_Guide/sec-Creating_Cgroups.html

OpenSuSE: https://www.suse.com/documentation/opensuse114/book_tuning/data/sec_tuning_cgroups_usage.html


Once you have a functional cgcreate command to create cgroups, you can make them permanent (unfortunately in a different syntax) by editing /etc/cgconfig.conf on Linux distributions using systemd (most of them).  

Ubuntu: https://askubuntu.com/questions/836469/install-cgconfig-in-ubuntu-16-04

Other distributions: already covered in many of the links above.

If your distro isn't covered, it is still worth looking at the above guides: most things will be similar in your distro, though they may not be exactly the same (e.g. package names or the package manager could differ).  


Here is some distribution agnostic information on cgroups: http://man7.org/linux/man-pages/man7/cgroups.7.html

https://www.kernel.org/doc/Documentation/cgroup-v1/

cgroups will eventually be replaced with cgroups v2, but most of its controllers are not yet functional: https://www.kernel.org/doc/Documentation/cgroup-v2.txt

Technically, you can create the cgroups we need with mount and mkdir commands, but our code depends on cgcreate and cgexec.  You could of course write your own cgcreate and cgexec scripts and add them to your path instead of using the programs in cgroup-tools, but we wouldn't recommend that route.  


##### Example
Essentially, the goal here is to set up a user-writable cgroup that run_GIGI (running as your user) can use to create its own subcgroup. 

On Ubuntu in BASH you can do this as follows: 

First we install cgroup-tools to get cgcreate, cgexec, etc.:
```bash
sudo apt-get install cgroup-tools
```
Then we create a cgroup that your user has access to:
```bash
sudo cgcreate -a ${USER} -g memory,cpu:user_cgroup
```
We can see that it was created by checking the contents of /sys/fs/cgroup/memory and/or /sys/fs/cgroup/cpu.

Both should now contain a folder user_cgroup whose contents your user has write permission to:

```bash
ls -la /sys/fs/cgroup/memory/user_cgroup

ls -la /sys/fs/cgroup/cpu/user_cgroup
```

When run normally as your user with -m, run_GIGI will make its own subcgroup of this cgroup (do not run run_GIGI with sudo).  
These are not persistent cgroups (that is, they will disappear on reboot).  
To make persistent ones, please see the distribution documentation above; for most distributions this involves editing the configuration file /etc/cgconfig.conf.  
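
Before running run_GIGI with -m, you may want to confirm that the cgroup from this example is still present and writable, for instance after a reboot. The check below is a small sketch: `cgroup_writable` is a hypothetical helper, and the paths assume the cgroup v1 layout and the user_cgroup name used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GIGI-Quick): return 0 if the given cgroup
# directory exists and the current user can write to it.
cgroup_writable() {
    local dir=$1
    [ -d "$dir" ] && [ -w "$dir" ]
}

# Paths assume the cgroup v1 mount points and the user_cgroup created above.
for controller in memory cpu; do
    dir="/sys/fs/cgroup/${controller}/user_cgroup"
    if cgroup_writable "$dir"; then
        echo "$dir: writable"
    else
        echo "$dir: missing or not writable"
    fi
done
```

If either line reports "missing or not writable", rerun the cgcreate command shown earlier.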


##### Future
We may soon also add the ability to control swap usage through the cgroups for run_GIGI (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html).  Some distributions need a kernel parameter set at boot to allow this (Debian, Ubuntu, Arch, ??).  
See the issue and solution here: http://matthewkwilliams.com/index.php/2016/03/17/docker-cgroups-memory-constraints-and-java-cautionary-tale/
The same method in a shorter read here: https://unix.stackexchange.com/questions/147158/how-to-enable-swap-accounting-for-memory-cgroup-in-archlinux
If you have a different bootloader, adding that same option to your boot command should work, but you'll need to consult the documentation for your bootloader to see how to do this.  
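
After rebooting, you can check whether the option actually reached the kernel by inspecting the boot command line. The sketch below assumes the option in question is `swapaccount=1`, which is our reading of the pages linked above rather than something this README states; `has_swapaccount` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of GIGI-Quick): return 0 if the kernel command
# line string in $1 contains swapaccount=1 as a separate word.
# Assumption: swapaccount=1 is the boot option discussed in the links above.
has_swapaccount() {
    case " $1 " in
        *" swapaccount=1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the command line the running kernel was booted with.
if [ -r /proc/cmdline ]; then
    if has_swapaccount "$(cat /proc/cmdline)"; then
        echo "swap accounting option is set"
    else
        echo "swap accounting option is not set"
    fi
fi
```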


##### Caveat and Solution with GRUB/Other Bootloader
Be careful when editing this boot line: mistakes may cause your machine to fail to boot Linux.  This will not harm your data, but you may need to manually fix or reinstall your bootloader.  
Useful resource for that situation: https://help.ubuntu.com/community/Grub2/Troubleshooting#Editing_the_GRUB_2_Menu_During_Boot  
You can also edit the line during boot, as described there, to test it without making it permanent, thereby avoiding any GRUB issue more serious than a single failed boot.  
If you have messed up your GRUB and can't figure out how to get it back, the most reliable method I have used to restore GRUB is to reinstall via chroot: https://help.ubuntu.com/community/Grub2/Installing#via_ChRoot


Going forward, part of this may become easier for Ubuntu users.  From 14.04 onwards there should be a user-writable cgroup by default.  It is created by systemd automatically, and I'm not sure how consistent the location is: https://help.ubuntu.com/lts/serverguide/cgroups-delegation.html

I think the best way to make use of this may be through cgmanager; we will explore this possibility.