Common CMake issues

I am creating this blog as a reference for the common CMake issues that I have during my projects. I end up solving them at the time but when the issue hits me again in a few months, I kind of not remember the details of how I tackled it the previous time. As many a times these issues are specific to the machine that I am on  – the modules, platforms, versions, environment variables etc, the common solution of google search is only partially helpful. This blog is intended as a solution for the issues. 


The C compiler 
is not able to compile a simple test program.

This is happening on Jetbrains and not when I try to build from terminal. Intel C compiler was complaining as I didn’t have licence for it on my local machine. So, I am now building it with gnu/6.3.0 . There are two solutions for it. First, specify the compiler in CLion’s settings. The second, and the one that I am using right now, is to build it first on a terminal with the gcc compiler that I want. And since it gets stored in the CMakeCache.txt, CMake just reads it from there itself. It will create problem if I try to delete the cache and try to build again, but now I know how to handle it.


This issue comes up while trying to run cmake on a machine which doesn’t have nvidia driver installed. So, to solve this one, just switch over to a machine that has nvida drivers, not just CUDA toolkit.



C++ 11/14/17 standard flags don’t get propagated over to nvcc


set_property(TARGET target PROPERTY CUDA_STANDARD  11 ) 



no kernel image is available for execution on the device

This error message comes up while launching the kernel if the cuda code is not compiled for the specific architecture of the gpu device. Solution for the problem is to add the following nvcc flags

–gencode arch=compute_61,code=compute_61. 

This is for the Pascal architecture with compute capability 6.1. For k80, the compute capability is 3.7. You can add as many cc as you want in order to have backward compatibility, but at the same time your performance may hamper. It’s better to know your architecture’s cc and build for as few as you can.



CMake doesn’t identify the correct C/CXX version which was set using the module. It recognizes the one at /usr/local/c++

Solution :
This should be set by the module itself. However, as it is not set, we can set it by 
export CC=<path-to-icc>
export CXX=<path-to-icpc>



CMake is not able to pass the -G flag to debug the cuda portion of the code

Solution :

I haven’t found the exact method yet. But what seems to be workin for now is to simply use

set cuda break_on_launch all

after reaching the host level code where the device kernel is being called.

Sponsored Post Learn from the experts: Create a successful blog with our brand new courseThe Blog is excited to announce our newest offering: a course just for beginning bloggers where you’ll learn everything you need to know about blogging from the most trusted experts in the industry. We have helped millions of blogs get up and running, we know what works, and we want you to to know everything we know. This course provides all the fundamental skills and inspiration you need to get your blog started, an interactive community forum, and content updated annually.

Nature of warp execution

In a series of blog posts, we’ll cover the nature of warp execution, measure the performance and discuss ways of optimizing the code in order to make optimum utilization of the underlying hardware. 

Warps are the basic unit of execution in a Streaming Multiprocessor(SM). What this means is that all threads (32) in a warp are executed by the SM at the same time in a SIMT (Single Instruction Multiple Threads) fashion. When we launch a kernel, we generally divide the threads into grid and blocks. That is the logical view from a programmer’s perspective. Hardware however sees them differently. All threads in a block are divided linearly into groups (warps) of  32 threads each. For example, lets look at the following kernel call

kernel<<< ((bx, by, bz),(tx, ty, tz) )>>>(…..)

So, each block has tt (= tx*ty*tz) threads. These will just be divided into ceil(tt / 32) warps. This means that if tt is not a multiple of 32, the last warp will have less than 32 threads running. A warp won’t span over more than one block. This is the first room for optimization while launching the kernel. We’ll discuss more about this later.

As all the threads in a warp are executed in SIMT fashion, each thread executes the same instruction at the same time. So, if two threads in a warp have to be in different clauses based on an if-then-else condition, threads not executing a particular section will be stalled while the other threads execute that section. This is known as thread divergence and contributes to inefficiency in parallelization. Unlike the CPUs the GPU cores are not optimized for branch prediction and hence the divergence cost is more taxing in this case. Care should be taken to adjust branch granularity to be a multiple of warp size to avoid warp divergence. 

Setting up Visual Studio 17 for CUDA project

In this blog I will try to explain the basic setup of VS2017 for a CUDA project.  Here’s what we’ll need.

  1. Make sure that your graphics card is CUDA-compatible. Most of the newer ones are compatible while the older ones aren’t. This is how you can check it – Goto Device Manager -> Display Adapters. Check if you have a device named “NVIDIA …” in that list. I have “NVIDIA GeForce GT 750M”. It’s not new but gets the work done. Next goto nvidia-developer-page  and check if your device is listed.
  2. Check the Compute Capability of your device. This dictates the kind of work that you can do your graphics card. The compute capability of my device is only 3.0 (This is an old machine, but is fine for my personal use). CC identifies the features supported by the device (the GPU hardware) and is used by the applications at runtime to determine the hardware features and/or instructions available on the GPU device.
    Compute Capability Architecture
    1._ Tesla
    2._ Fermi
    3._ Kepler
    5._ Maxwell
    6._ Pascal
    7._ Volta
    7.5 Turing

    There are no 4._ compute capability devices and the latest architecture – Turing – has a similar CC as Volta.

  3. I assume that you have VS2017 already installed. If not, go ahead and install it first.
  4. Next, install the CUDA Toolkit from here. First install the base installer and then the patches if any.
  5. Start VS2017. Goto File -> New Project. Choose CUDA 9.0 from the left pane. Name the solution whatever you want. I am naming it testCUDA9. A new project with a file is created. addWithCuda function has a bunch of ‘goto’s in it and no one would advice you to use it. However, that’s not the point of this article.
  6. Build the project. Goto Build -> Build Solution. The project might not build and fail with the error,

    #error — unsupported Microsoft Visual Studio version! Only the versions 2012, 2013, 2015 and 2017 are supported!

    This can be handled by changing the visual studio toolset you are using to build the project. Click Project -> testCuda9 Properties. Under General -> Platform Toolset, choose Visual Studio 2015 (v140).

  7.  Build the project again. This time, you should get something like

    ========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

  8. Next, click Debug -> Start without Debugging. A new window will open up with the results of summation performed on the GPU device.

That’s it !!

Cartesian sum of two sets

Q. Consider two sets A and B, each having n integers in the range from 0 to 10n. We wish to compute the cartesian sum of the A and B, defined by

C = {x+y: x e A and y e B}.

Note that the integers in C are in the range from 0 to 20n. We want to find the elements of C and the number of times each elemetn of C is realized as a sum of elements in A and B. Show how to solve the problem in O(n lg n) time.
[Cormen pg 906, q 30.1-7]

Solution :

Lets first understand the problem. A and B each have n elements which can range from 0 to 10n. C is the cartersian sum of the two sets so by the defintion fo the set given earlier, it has as many as n^2 elments

for a in set X :
for b in set Y :
insert into set C (a+b)

So, this brute force approach gives us a O(n^2) time algorithm. We are asked to find a more efficient algorithm for this – one that can solve the problem in O(n lgn ) time instead.

Solution to the problem lies in spotting the analogy of the problem with that of fast-fourier transform. FFT provides us a way to multiply two polynomials of degree-bound n in O(n lg n ) time rather than the brute force O(n^2) method for multiplying the two polynomials.

So, lets construct a polynomial A of degree 10n as follows

A(x) = \Sum_{i=0}^{10n-1} a_i x^i

where a_i = freq(i) in set A.

Similarly construct set B.

We can multiply the two polynomials in O(n ln n) time.The non-zero coefficient of the terms are the frequency of the elements in C, and the power of the non-zero-coefficient terms are the elements of C.

Handy gdb commands

Here are some of the gdb commands for the people who are just getting started with gdb.

  1. Compile your code with the debug option enabled. In case of gnu compilers for c, c++, fortran, it generally boils down to just compiling with -g option.
  2. start gdb with the command
    gdb <executable-file-name>
    The important to look for is whether gdb is able to read the symbols correctly. The last line from the previous output should be :
    ‘Reading symbols from <exec-file>..done’
  3. Now you are in the gdb world! You will be able to see the
    prompt on your screen. You can run the gdb commands here.
  4. Start with the simplest ‘run’ command.
    (gdb) run
    This should run exactly same as when you ran your program from the command line. If you were passing some other arguments in order to run your program, pass them here too. For ex :
    (gdb)run < testFile.txt
  5. The previous command will run the program from the beginning till as far it can go. You will generally prefer to break it at some line such that the program runs till that point in one go and you can then be able to step through the problematic region of your code trying to figure out what went wrong there. You can do this by :
    (gdb) break main.c:90Breakpoint 1 at 0x1385b4b: file main.c, line 90.

    This command applies a breakpoint at main.c  on line 90. If you try to run your program, it only goes till line 90 and transfers control back to you.
    Breakpoint 1, main(..) at main.c:90
    90     i=i+1

  6. You can now print the value stored in any variable using the print command.
    (gdb)print k
    $1 = 10
  7. To execute the next line, use next
    91   a[i]=i*2
  8. There’s another command to execute one line at a time.
    The difference between next and step is that step will take you inside the function if it is called in the line that is currently executed. In contrast, next will execute the the entire function and your prompt will be at the next line in the program that you are running rather than the hopping into the function called.
  9. Using the command continue, runs the program till the end or till the next breakpoint encountered or till the next segmentation fault.
  10. Another useful command to know where in the code you are actually at, is :
    This simply spits out 5 lines before and after where you are actually at.
  11. until is a useful command as well. It is similar to next, as in,  next allows you to hop over the function call and until lets you hop over a loop structure (for, while etc) [in simple words]
  12. For each of these commands you can also use, their short-form.
    r for run
    b for break
    c for continue
    n for next
    s for step
    u for until
    Also, if you just press Enter, the previous command that you gave is executed again.These are the very basics of using gdb to get the uninitiated started. gdb has much richer features available, which you can search online or read with help command. Just play around with it, it’s pretty intuitive. And as your code base starts getting complicated, gdb is going to be indispensable. All the best !

Graph Theory


What is a graph ?

It is a triple consisting of a vertex set V(G), an edge set E(G) and a relation that associates with each edge two vertices (not necessarily distinct) called its end points.


Some common terms :

A loop
Multiple edge.
Simple Graph
Finite Graph


Some simple graphs

Complete Graph.



H is a subgraph of G: Then V(H) ⊆ V(G), E(H) ⊆ E(G).The assignment of end points to edges in H is the same as that in G.


Induced graph
H is an induced subgraph of G on S, where S ⊆ V(G): Then V(H) = S and E(H) is the set of edges of G such that both the end points belong to S.


A graph G is connected if each pair of vertices belongs to a path.


An Isomorphism from a simple graph G to a simple graph H is a
bijection f : V(G) → V(H) such that (u,v) ∈ E(G) if and only if
(f (u), f (v)) ∈ E(H)


Forest and Tree
A graph without any cycle is acyclic.
A forest is an acyclic graph
A tree is a connected acyclic graph.


Bipartite Graph
A graph G is bipartite if V(G) is the union of two disjoint
(possibly empty) independent sets called partite sets of G.
(A subset S ⊆ V(G) is an independent set if the induced subgraph
on S contains no edges. )


A tree is a bipartite graph.
Can we say that the complete graph Kn is bipartite ?
The complete bipartite graph.


Vertex Cover
A set S ⊆ V(G) is a vertex cover of G (of the edges of G) if every
edge of G is incident with a vertex in S.
A vertex cover of the minimum cardinality is called a minimum
vertex cover. We will denote this set by MVC(G).


What is the cardinality of MVC in the complete graph Kn ?
What about the complete bipartite graph Km,n ?
The cycle Cn, when n is even and odd ?


The cardinality of a biggest independent set in G is called the independence number (or stability number) of G and is denoted by α(G).

Is there any relation between |MVC(G)| and α(G) ?
If we remove a VC from G, the rest is an independent set.
So, if we remove MVC from G, the rest, i.e. V − MVC is an
independent set.
So, α(G) ≥ n − |MVC(G)|. Thus |MVC(G)| ≥ n − α(G).
Similiarly if we remove any independent set from G, the rest
is VC, and so |MVC| ≤ n − α(G).
Thus we get |MVC| = n − α(G).
If we denote —MVC(G)— by β(G), then we have
β(G) + α(G) = n.


A set M of independent edges in a graph is called a matching.
M is a matching of U ⊆ V(G), if every vertex in U is incident
with an edge in M.
Then a vertex in U is a matched vertex. The vertices which
are not incident with any edge of M is unmatched.


A matching M is a perfect matching of G, if every vertex in G is
matched by M.
If G has n vertices, what is the cardinality of a perfect matching M
of G ?


The cardinality of the biggest matching in G can be denoted by

What is the value of α'(G) for:
Cycle Cn
Path Pn
Complete Graph Kn
Complete Bipartite graph Km,n

Hello world!

Welcome to After you read this, you should delete and write your own post, with a new title above. Or hit Add New on the left (of the admin dashboard) to start a fresh post.

Here are some suggestions for your first post.

  1. You can find new ideas for what to blog about by reading the Daily Post.
  2. Add PressThis to your browser. It creates a new blog post for you about any interesting  page you read on the web.
  3. Make some changes to this page, and then hit preview on the right. You can alway preview any post or edit you before you share it to the world.