1.6 unreleased
==============

Notable User Facing Changes
---------------------------

- CUDA kernels using constant __local blocks are now ABI incompatible with
  the previous release. Users need to delete their pocl cache.
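
  Clearing the cache can be scripted. The sketch below is only an
  illustration: it assumes the kernel cache lives in the directory named by
  POCL_CACHE_DIR, or otherwise under $XDG_CACHE_HOME/pocl/kcache (falling
  back to $HOME/.cache/pocl/kcache). The helper names are hypothetical;
  verify the location for your installation before deleting anything.

```python
import os
import shutil


def pocl_cache_dir(env=None):
    """Return the directory assumed to hold pocl's kernel cache."""
    env = os.environ if env is None else env
    # Assumption: an explicitly set POCL_CACHE_DIR takes precedence.
    if env.get("POCL_CACHE_DIR"):
        return env["POCL_CACHE_DIR"]
    # Assumption: the default follows the XDG cache directory convention.
    base = env.get("XDG_CACHE_HOME") or os.path.join(env.get("HOME", ""), ".cache")
    return os.path.join(base, "pocl", "kcache")


def clear_pocl_cache(path):
    """Delete the cache directory; return True if something was removed."""
    if not os.path.isdir(path):
        return False
    shutil.rmtree(path)
    return True
```

  Calling clear_pocl_cache(pocl_cache_dir()) once after upgrading should be
  enough; kernels are rebuilt on their next use.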


Optimizations
-------------

- Improved the PTX code generation for __local blocks. Previously, constant
  __local blocks and __local arguments were using one dynamic shared CUDA
  memory block with offsets computed at runtime. Now, if there are no __local
  arguments, separate static shared CUDA memory is used. If there are
  __local arguments, the constant __local blocks are indexed with compile
  time constants. This improves performance through better SASS code
  generation because it avoids what appears to be a pointer aliasing issue.

  Running SHOC benchmark GEMM with size class 4 on an NVIDIA Titan X gives
  the following performance improvements:

    sgemm_n:        23.2%
    sgemm_t:        18.5%
    sgemm_n_pcie:   23.3%
    sgemm_t_pcie:   19.5%
    dgemm_n:        51.4%
    dgemm_t:         2.8%
    dgemm_n_pcie:   51.6%
    dgemm_t_pcie:    6.9%

Notable Bug Fixes
-----------------
- Fix LLVM loop vectorizing remarks printing (POCL_VECTORIZER_REMARKS=1).
- Fix an issue in which the loop vectorizer produced code with invalid
  memory reads (issue #757).

Security
--------

- Added a build parameter HARDENING_ENABLE that applies hardening flags
  available in some modern compilers to produce a more secure libpocl.so,
  at some cost in performance.


1.5 April 2020
==============

Notable User Facing Changes
---------------------------
- Support for LLVM 10.
- POCL_TRACE_EVENT, POCL_TRACE_EVENT_OPT and POCL_TRACE_EVENT_FILTER
  environment variables were renamed to
  POCL_TRACING, POCL_TRACING_OPT and POCL_TRACING_FILTER, respectively.
- Refactored the implementation of convert_T() OpenCL functions to better meet
  autovectorization criteria under LLVM, thus utilizing device's SIMD ISA
  capabilities where available; e.g. on an ARM64 Cortex-A72 convert_int8(short8)
  is 5.5x faster now when measured in a tight loop.
- A lot of fixes.
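
  Launch scripts that still set the old tracing variables can be adapted
  mechanically. The sketch below is a hypothetical helper, not part of pocl;
  it only applies the three renames listed above to an environment mapping.

```python
# Old name -> new name, as listed in the 1.5 release notes.
RENAMED = {
    "POCL_TRACE_EVENT": "POCL_TRACING",
    "POCL_TRACE_EVENT_OPT": "POCL_TRACING_OPT",
    "POCL_TRACE_EVENT_FILTER": "POCL_TRACING_FILTER",
}


def migrate_tracing_env(env):
    """Return a copy of env with the pre-1.5 tracing variables renamed."""
    migrated = dict(env)
    for old, new in RENAMED.items():
        if old in migrated:
            value = migrated.pop(old)
            # Keep an explicitly set new-style variable if both are present.
            migrated.setdefault(new, value)
    return migrated
```

  The result can be passed as the env argument of subprocess.run() when
  starting the OpenCL application.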

Usability
---------
- Simple per-kernel execution time statistics, printed via atexit(), for
  quick and easy low-impact per-device profiling (relies purely on event
  time stamps). Enable them by setting the POCL_TRACING env variable to 'cq'.
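
  As a minimal sketch of how this might be used from a test or benchmark
  script, the hypothetical wrapper below starts an application with
  POCL_TRACING set to 'cq'. Only the variable name and value come from the
  note above; the function name and its arguments are assumptions.

```python
import os
import subprocess


def run_with_kernel_stats(command, base_env=None):
    """Run an OpenCL application with pocl's per-kernel statistics enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["POCL_TRACING"] = "cq"  # value taken from the release note above
    # pocl itself reports the statistics when the child process exits.
    return subprocess.run(command, env=env, check=False)
```

  The wrapper leaves the rest of the environment untouched, so other POCL_*
  settings of the caller still apply.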

1.4 September 2019
==================

Highlights
----------
- pocl-accel: An example driver and support infrastructure for OpenCL 1.2
  CL_DEVICE_TYPE_CUSTOM hardware accelerators which implement a memory
  mapped control interface.

- Improved SPIR and SPIR-V support. clCreateProgramWithIL() is now
  implemented, and kernel library (for CPU target) support for SPIR mangling
  has been improved.

Kernel Compiler
---------------
- Specialize work-group functions for global offset (0,0,0).
- A pocl installation with clang, hwloc statically linked in
  is now relocatable.
- Clang/LLVM versions older than 6.0 are no longer supported.
- Create specialized work-group functions for small (defined by a device
  driver specific limit) grid dimensions.
- Add Range Metadata to various ID queries etc. to improve vectorizing
  index computation to smaller lane widths and other optimizations.
- Passes only the launched kernel to work-group generation and code gen, thus
  speeding up the compilation process.

Notable Internal Changes
------------------------
- Allow devices to utilize the ROCm-Device-Libs ocml
  builtins for their builtin libraries if seen fit.
  https://github.com/RadeonOpenCompute/ROCm-Device-Libs/tree/master/ocml
  was mirrored in lib/kernel, which makes it easy to cherry-pick
  implementations into targets' kernel libraries.

- libltdl is replaced with libdl on UNIX platforms.

Notable Bug Fixes
-----------------
- Fix a race condition in device initialization, which
  caused issues in applications that cause reinitialization
  of pocl device drivers (appeared in Glow's OCLTest).

Device Driver Specific
----------------------
- hsa-native: Downgraded the advertised version to 1.2
  which is closer to the truth (fixes OCLTest of Glow).
- hsa-native: Add support for byval (struct) argument passing.
- hsa-native: Allow offsets in block copy.

1.3 April 2019
==============

Highlights
----------
- Support for Clang/LLVM 8.0.
- Support ICD on OSX.

Misc.
-----
- Ability to have size_t (basically derived from the largest supported
  object) smaller than CL_ADDRESS_BITS. This is an unofficial optional
  extension as the OpenCL standard mandates it to be the same.
- POCL_EXTRA_BUILD_FLAGS can be used to force add extra build flags such as
  '-g' to all clBuildProgram() calls.
- Allow building pocl without CPU backend drivers. When set to off,
  CPU will not appear in the list of OpenCL devices reported by pocl.
  Controllable via ENABLE_HOST_CPU_DEVICES=off cmake option.
- Build logs are now produced also for illegal options passed to the kernel
  build e.g. via the options parameter of clBuildProgram().
- hsa-native: Device side printf-support and alternative < 1.2 non-standard
  C99 printf exposing support.
- pocl's binary format has been slightly updated (changes are listed in
  the top of pocl_binary.c file) to version 7, but pocl can still read
  also the previous version 6 format.
- Allow local-size-specializing also SPMD-targeted kernels to enable compile
  time optimization of code depending on the local dimensions.
- Support older GLIBC versions.
- HSA: Initial experimental support for native-ISA compilation on top of HSA
  runtime. Tested and works currently only on phsa-runtime. Can be enabled
  with ENABLE_HSAIL=off cmake option.
- Add option to disable installing of OpenCL headers.

Notable Bug Fixes
-----------------
- Fixed kernel debug symbol generation.
- HSA: fix kernel caching.
- Fix issue #661: clCreateImage doesn't fail with unsupported image type.
- Fix issue #668: handle non-kernel functions with barriers properly.
- Fix issue #671: Unable to build pocl with CUDA support with LLVM 7 and host GCC 8.2.
- Fix image format/size handling with multiple devices in context.
- Fix padding issue with context arrays that manifested as unaligned
  access errors after autovectorization.

Notable Internal Changes
------------------------
- Add group ids as hidden kernel arguments instead of digging them up
  from the context struct.
- Ability to generate the final binary via separate assembly text + assembler
  call. Useful for supporting LLVM targets without direct binary emission
  support.
- Use Clang's Driver API for launching the final linkage step. This way we
  utilize the toolchain registry with correct linkage steps required for
  the target at hand.
- Add 'device_aux_functions' to the driver layer attributes. This can be used
  to retain device-specific functions required by the target across the
  pruning of unused globals.
- The "default kernels" hack, which was used to store kernel metadata, has
  been removed. Kernel metadata is now stored only once, in the cl_program
  struct; every new cl_kernel struct holds only a pointer.
- Major 'pthread' CPU driver cleanup.
- Major Workgroup.cc cleanup.

1.2 September 2018
==================

- LLVM 7.0 is now supported.

- Version 2.0 of hwloc library is supported.

- device-side printf; more consistent printf output.

1.1 March 2018
==============

Highlights
----------

- LLVM 6.0 is now supported.

- Reintroduced experimental SPIR LLVM bitcode support to pocl.
  Requires LLVM 5 or newer. New experimental feature: SPIR-V support;
  requires a working llvm-spirv converter. Currently only loading
  of SPIR-V binaries by pocl is supported, not output.
  See docs/features.rst for more details.

- Refactored pocl cache now does away with LLVM file locks and relies
  entirely on system calls for proper synchronization. Additionally,
  cache file writes are now fdatasync()ed.

- Improved kernel compilation time (with cold cache). Improvement
  depends on sources - it's bigger for large programs with many kernels.
  Luxmark now compiles in seconds instead of dozens of seconds;
  internal pocl tests run in 30-50% less time.

- LLVM Scalarizer pass is now only called for SPMD devices. Performance
  change varies across tests, but positive seems to outweigh negative.

- Implemented uninitialization callback for device drivers. This is
  triggered when the last cl_context is released. Currently only the
  CPU driver implements the callback.

- Removed libpoclu from installed files; this library contains helpers
  for pocl's internal tests, and from installed files was only used by
  poclcc, which has been updated to not rely on it.

- POCL_MAX_WORK_GROUP_SIZE is now respected by all devices. This variable
  limits the reported maximum WG sizes & dimensions; tuning max WG size
  may improve performance due to cache locality improvement.

- CL_PLATFORM_VERSION now contains much more information about how
  pocl was built.

- For users still building with Vecmathlib, performance should be back
  to levels of pocl 0.14 (there was a huge drop caused by a change
  in -O0 optimization level of LLVM 5.0).

- Improved support for ARM and ARM64 architectures. All internal tests
  now pass (on Cortex-A53 and Cortex-A15), although it's still far
  from full conformance.
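
  The cache change described above (synchronization through system calls,
  with cache file writes followed by fdatasync()) resembles a common
  pattern that can be sketched as follows. This is not pocl's code: the
  temporary file name and the final atomic rename are assumptions added for
  illustration, and os.fdatasync() is only available on Unix-like systems.

```python
import os


def write_cache_file(path, payload):
    """Write payload (bytes) to path, syncing the data to disk first."""
    tmp_path = path + ".tmp"          # hypothetical temporary name
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()                     # push Python's buffer to the OS
        os.fdatasync(f.fileno())      # force the file data to stable storage
    # Assumption: publish the finished file with an atomic rename so that
    # readers never observe a partially written entry.
    os.replace(tmp_path, path)
```

  Syncing before the rename means a crash leaves either the old entry or a
  complete new one, never a truncated file under the final name.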


1.0 December 2017
=================

Highlights
----------
- Improved automatic local work-group sizing on kernel enqueue, taking
  into account standard constraints, SIMD width for vectorization as
  well as the number of compute units available on the device.
- Support for NVIDIA GPUs via a new CUDA backend (currently experimental).
- Removed support for BBVectorizer.
- LLVM 5.0 is now supported.
- A few build options have been added for distribution builds,
  see README.packaging.
- Somewhat improved scalability in the CPU driver. CPUs with many cores
  and programs using a lot of WIs with small kernels can run somewhat faster.
- The OpenCL 1.2 conformance tests now pass with selected CPUs. There are some
  caveats though - see the documentation.
- When conformance is enabled, some kernel library functions might be
  slower than in previous releases.
- Pocl now reports OpenCL 1.2 instead of 2.0, except HSA enabled builds.
- Updated format of pocl binaries, which is NOT backwards compatible.
  You'll need to clean any kernel caches.
- Fixed several memory leaks.
- Unresolved symbols (missing/misspelled functions etc) in a kernel will
  result in error in clBuildProgram() instead of pocl silently ignoring
  them and then aborting at dlopen().
- New env variable POCL_MEMORY_LIMIT=<num> limits the Global memory size
  reported by pocl to <num> gigabytes.
- New env variable POCL_AFFINITY (defaults to 0): if enabled, sets
  the affinity of each CPU driver pthread to a single core.
- Improved AVX512 support (with LLVM 5.0). Note that even with LLVM 5.0
  there are still a few bugs (see pocl issue #555); AVX512 + LLVM 4.0 are
  a lot more broken, and probably not worth trying.
- POCL_DEBUG env var has been revamped. You can now limit debuginfo to
  these categories (or their combination): all,error,warning,general,
  memory,llvm,events,cache,locking,refcounts,timing,hsa,tce,cuda
  The old setting POCL_DEBUG=1 now equals error+warning+general.


0.14 April 2017
===============

Highlights
----------
- Support for LLVM/Clang versions 3.9 and 4.0. Version 3.9 was the first
  release to include all frontend features for OpenCL 2.0.
- Ability to build pocl in a mode without online compilation support, so
  that it can run on hosts without LLVM using binaries compiled offline,
  e.g. using poclcc.
- pocl's binary format now can contain all the necessary bits to
  execute the programs on a host without online compiler support.
- Initial support for out-of-order execution of command queues.
- It's now possible to cross-compile pocl when building an offline
  compiler build.
- New driver api extension to support out-of-order and asynchronous
  devices/drivers.
- Pthread and HSA drivers are now fully asynchronous.
- CMake is now the only supported build system; autotools was removed.
- LTTng tracing support

OpenCL Runtime/Platform API support
-----------------------------------
- implemented clEnqueueBarrierWithWaitList
- implemented clEnqueueMigrateMemObjects

Other
-----
- Support for reqd_work_group_size attribute in the binary format and poclcc:
  Generates a static sized work-group function to help optimizations
  such as autovectorization.
- HSA: added support for phsa (https://github.com/HSAFoundation/phsa)
- A lot of bug and memory leak fixes. Some notable ones:
  - Issue #1, passing aggregates as kernel value parameters, can be
    now fixed with an LLVM patch.
  - Now it's possible to build pocl without using the fake address
    space ids, which were a source of many annoying issues.

0.13 April 2016
===============

Highlights
-----------
- Support for LLVM/Clang 3.8
- initial (partial) OpenCL 2.0 support
  (only Shared Virtual Memory and Atomics are supported ATM)
- CMake build system almost on parity with autotools
  (TCE, all external testsuites)
- CMake build is now able to build multiple kernel libraries
  for different CPUs and let pocl select a suitable one at runtime

Bugfixes
---------
- clEnqueueCopyImage() now works properly
- improved file locking (much less disk access to kernel cache)
- Address spaces of structs are handled properly

Other
------
- removed custom buffer alloc from pthread device
- removed IBM Cell support
- removed support for older LLVM versions (before 3.7)
- significantly higher performance with a lot of small kernel enqueues
  (due to improved file locking)
- vecmathlib now supports AVX2
- a few more HSA kernel library implementations: l/tgamma, erf(c), hypot
- implemented OpenCL 2.0 API calls: clEnqueueSVM*, clSVMalloc/free,
  clEnqueueFillBuffer, clSetKernelExecInfo, clSetKernelArgSVMPointer,
  clCreateCommandQueueWithProperties - no device side queues yet
- OpenCL 2.0 atomics (C11 atomics subset) for x86-64 and HSA
- new testsuites: AMD SDK 3.0, Intel SVM
- New CMake-only testsuites: ASL, clBLAS, clFFT, arrayfire
- more debugging info (timing, mem stats)
- ansi colors with POCL_DEBUG=1 if the output is a terminal


0.12 October 2015
=================

Highlights
----------
- Support for HSA-compliant devices (kernel agents). The GPU of AMD Kaveri
  now works through pocl with a bunch of test cases in the AMD SDK 2.9 example
  suite.
- New and improved kernel cache system that enables caching
  kernels with #includes.
- Support for LLVM/Clang 3.7.
- Little endian MIPS32 now passes almost all pocl testsuite tests.

OpenCL Runtime/Platform API support
-----------------------------------
- Transferred buffer read/write/copy offset calculation to device driver side.
  - these driver api functions have changed; got offset as a new argument.
- Maximum allocation is not limited to 1/4th of total memory size.
- Maximum image dimensions grow to fit maximum allocation.
- clGetDeviceInfo() reports better information about CPU vendor and cache.
- experimental clCreateSubDevices() for pthread CPU device.

OpenCL C Builtin Function Implementations
-----------------------------------------
- Implemented get_image_dim().

Bugfixes
--------
- Avoid infinite loops when users recycle an event waiting list.
- Correctly report the base address alignment.
- Lots of others.

Misc
----
- Tests now using new cl2.hpp, removing dependency on OpenGL headers


0.11 March 2015
===============

Highlights
----------
- Support for LLVM/Clang 3.6
- Kernel compiler cache.
- Android support.

Kernel compiler
---------------
- Do not add implicit barriers to kernels without WG barriers
  to avoid WI context data overheads.
- Setting the POCL_VECTORIZER_REMARKS env to 1 prints out LLVM vectorizer 
  remarks during kernel compilation.
- Implicit work-group vectorizer improvements.

OpenCL Runtime/Platform API support
-----------------------------------
- Minimal initial implementation for clCreateSubDevices()

Bugfixes
--------
- Fix falsely detecting operations with side-effects (especially atomic
  operations) as uniform. This caused deadlock/race situations due to 
  illegal implicit barrier injection.
- Fix several reference counting issues.
- Memory leak fixes.
- ARM/openSUSE build fixes.
- Plenty of CMake fixes.

New test/example cases
----------------------
- Several Halide examples using its OpenCL backend added.
- CloverLeaf

Misc.
-----
- The old WIVectorizer, a fork of BBVectorizer, was removed due to bit rot and
  the general hackiness of it.
- Experimental Windows/Visual Studio support (in progress).
- Initial support for MIPS architecture (with known issues).
- Runtime debug printouts that can be enabled via POCL_DEBUG=1.
- Streamlined the buffer allocation and fixed several issues with it.

 
0.10 September 2014
===================

This lists only the most interesting changes. Please
refer to the version control log for a full listing.

Highlights
----------
- Support for LLVM/Clang 3.5 
- Support for building using CMake (experimental with known issues).

Bugfixes
--------
- TCE: kernel building was broken when running pocl
  from install location 
- thread-safety (as required since OpenCL 1.1) improved

Kernel compiler
---------------
- Final code generation now done via LLVM API calls instead of
  calling the llc binary.
- Sensible linking of functions from the monolithic kernel built-in
  library. Major compilation speedup for smaller kernels.

OpenCL C Builtin Function Implementations
-----------------------------------------
- Improved support for halfN functions.
- ilogb and ldexp available with vecmathlib

OpenCL Runtime/Platform API support
-----------------------------------
- Implement clCreateKernelsInProgram()
- OpenCL-C shuffle() and shuffle2() implementation added
- Device probing modified to allow for device driver to detect device during
  runtime. POCL_DEVICES still supported.
- Checks in clSetKernelArg() for argument validity
- Checks in clEnqueueNDRangeKernel() for arguments to be all set
- Implement clGetKernelArgInfo()
- clEnqueueCopyImage() 

Misc
----
- ViennaCL testsuite updated to 1.5.1

0.9 January 2014
================

This lists only the most interesting changes. Please
refer to the version control log for a full listing.

Highlights
----------
- Major improvements to the kernel compiler's vectorization 
  performance. Twofold speedups in some benchmarks
- Support for most of the piglit CL tests

OpenCL Runtime/Platform API support
-----------------------------------
- clCreateImage2D() and clCreateImage3D() implementation moved to 
  clCreateImage()
- Image creation now uses clCreateBuffer() 
- clBuildProgram: Propagate the supported -cl* compiler options to Clang's 
  OpenCL frontend. 
- clFinish: works with commands with event wait lists.
- Preliminary support for OpenCL 2.0 blocks
- Added support for clEnqueueNativeKernel()

Builtin Function Implementations (OpenCL 1.2 Section 6.12)
----------------------------------------------------------
- Refactored read/write_image()-functions to support refactored device image 
  object. (Only functions used by SimpleImage test)
- Introduced new macro based implementation for read/write_image()-functions
- Added sampler implementation for CLK_ADDRESS_CLAMP and 
  CLK_ADDRESS_CLAMP_TO_EDGE (Only integer coords supported)
- Most of the printf() format strings now work. Missing features:
  - long on 32-bit architectures

Performance Improvements
------------------------
- Kernel compiler now tries to avoid replicating uniform variables,
  this leads to less context data to be saved per work-item and cleaner
  kernel bitcode for later optimizations
- Use a precompiled header for OpenCL C builtin declarations to speed up
  the kernel compilation
- Kernel compiler vectorization optimizations:
  - Inject implicit barriers both to loop starts and ends to
    horizontally vectorize the inner loop.
  - Reduce "peeling" by minimizing the conditional barrier region
    by injecting implicit barrier close to the branch points for
    conditional barrier cases.
  - Breaking of vector datatypes for more efficient loop 
    vectorization.
  - Support LLVM 3.4 parallel loop metadata.

Misc
----
- Explicitly specify the target architecture/CPU for the
  kernel compiler.
- Kernel compiler frontend defaults to implementation using LLVM API
  directly instead of the scripts. 
- __OPENCL_VERSION__ defined to 120
- poclu: helpers for converting between the C float and OpenCL cl_half
  types
- Static and cmake-builds of LLVM can now be used. 

Bugfixes
--------
- Correct isequal, isnan, and similar routines

0.8  August 2013
================

This lists only the most interesting changes. Please
refer to the version control log for a full listing.

Overall
-------

- Added support for LLVM/Clang 3.3.
- Dropped support for LLVM/Clang v3.1.
- Removed the dependency on llvm-ld (which had been copied into the
  pocl tree as pocl-llvm-ld). Now uses llvm-link instead.
- Project renamed to Portable Computing Language (pocl).
- Luxmark v2.0 now works.
- x86_64 can now use efficient math built-in function
  implementations from the vecmathlib project to avoid libm 
  calls and to exploit the SIMD instructions more efficiently
  in case of vector datatypes in the kernel.
- Parallelize kernel inner loops "horizontally", if possible.
  This converts possibly sequential inner kernel loops to parallel 
  loops by effectively performing "loop interchange" of the
  work-item loop and the kernel's inner loop.
- Added VexCL tests to the test suite. All but one of them
  work with pocl.
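
  The "horizontal" inner loop parallelization mentioned above can be
  pictured with a conceptual model. The sketch below is not pocl's compiler
  code; it assumes a kernel whose inner loop has the same trip count for
  every work-item and has no dependencies between work-items, and it only
  shows that swapping the two loops preserves the result.

```python
def run_sequential_inner_loops(data):
    """Work-item loop outermost: each work-item runs its whole inner loop."""
    out = [0] * len(data)
    for wi in range(len(data)):          # work-item loop
        for k in range(len(data[wi])):   # kernel's inner loop
            out[wi] += data[wi][k] * k
    return out


def run_interchanged(data):
    """Inner loop outermost: all work-items advance one iteration together."""
    out = [0] * len(data)
    trip_count = len(data[0]) if data else 0
    for k in range(trip_count):          # kernel's inner loop, hoisted
        for wi in range(len(data)):      # work-item loop, now innermost
            out[wi] += data[wi][k] * k
    return out
```

  After the interchange the innermost loop iterates over independent
  work-items, which is the shape a loop vectorizer can map to SIMD lanes.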

Major bugfixes
--------------
- Fixed passing NULL as a buffer argument to clSetKernelArg
  (this time with a regression test added).
- Constant BitCast expressions broken to variables to avoid
  crashing when copying a kernel with casts on automatic
  local pointers.
- Fixes for i386/i686. Tested on Pentium4/Ubuntu 10.04 LTS.
- Lots of API error checking added (found by the Piglit testing suite).
- Fixed bug in select producing incorrect results when the third
  conditional argument is an unsigned scalar or vector.
- Replaced deprecated SSE 4.1 assembly mnemonics in x86-64 min/max
  kernel functions that have since been removed in more recent
  versions of gas and llvm-as.
- SPIR/LLVM IR 'byval' attributes are now handled correctly on
  kernel function arguments, allowing for structs and oversized
  vectors to be passed in with value semantics.
- Fixed to work with the latest Khronos OpenCL headers for 1.2.
  Some issues fixed with the new cl.hpp.
- The ICD dispatch table was too small which might have caused
  "interesting" behavior when calling the later functions in
  the table and not using ocl-icd as the dispatcher.
- Several kernel compiler bugs fixed.
- A multithreaded host application could free the same object
  multiple times due to a race issue.

Platform Layer implementations (OpenCL 1.2 Chapter 4)
-----------------------------------------------------
- Return correctly formatted CL_DEVICE_VERSION and
  CL_DEVICE_OPENCL_C_VERSION.
- clGetDeviceInfo: Use the 'cpufreq' sys interface of Linux for 
  querying the CPU clock frequency, if available.

The OpenCL Runtime (OpenCL 1.2 Chapter 5)
-----------------------------------------
- clGetEventInfo: Querying the command type, command queue,
  and the reference count of the event.

Builtin Function Implementations (OpenCL 1.2 Section 6.12)
----------------------------------------------------------
- convert_type* builtins now generated with a Python script by
  Victor Oliveira.
- length() fingerprint was assuming two arguments instead of one.
- The kernel bitcode library is now optimized when built in pocl. Speeds
  up kernel optimization for cases which use the kernel functions
  a lot.
- Fix mul_hi() implementation

ICD
---
- Fixed pocl tests to work when executed through the Khronos
  supplied icd loader (needs a patch applied to the loader to be able to
  override the .icd search path).

Misc.
-----
- Fix to the helper script search logic:
  Search from the BUILDDIR only if env POCL_BUILDING is defined.
  Otherwise search from PKGDATADIR first, then from the PATH.
- Fixed memory leaks in clCreateContext* and clCreateKernel
- Ensured that stored arguments are adequately aligned in
  clSetKernelArg and clEnqueueNDRangeKernel.

0.7  January 2013
=================

This lists only the most interesting changes. Please
refer to the version control log for a full listing.

Overall
-------
- Support for LLVM 3.2.
- Multi-WI work group functions can be now generated
  using loops which are only partially unrolled. Reduces
  code size explosion with large WGs in comparison to
  the full replication method.
- PowerPC 64 support (tested on Cell/Debian Sid/PS3).
- PowerPC 32 support (tested on Cell/Debian Sid/PS3).
- ARM v7 support (on Linux)
- Beginning of Cell SPU support (very experimental!).
- Most of the AMD APP SDK OpenCL examples now work and have been
  added to the pocl test suite.
- Most of the Parboil benchmark cases added to the test
  suite.

Kernel Compiler Passes
----------------------
- Several miscompilations and compiler crashes fixed.
- Multiple bugs fixed from the work group vectorizer.
- Updated metadata format pocl uses to pass information
  to vectorization and TCE backend to simplify debugging.
- Kernel pointer arguments are no longer always marked 'noalias' (restricted).
  Doing this previously was a misunderstanding of the specs.
- ConstantGEPs to static variables generated from automated
  locals caused problems. Now converting them to normal GEPs
  using a pass from the SAFECode project.

OpenCL Platform Layer implementations (OpenCL 1.2 Chapter 4)
------------------------------------------------------------
- clGetDeviceInfo now uses the hwloc lib for device property
  queries. Many new queries implemented.
- clGetKernelInfo (initial implementation)
- clGetMemObjectInfo (initial implementation)
- clGetCommandQueueInfo (initial implementation)
- clReleaseDevice
- clRetainDevice
- Proper freeing of devices in clReleaseContext

The OpenCL Runtime Implementations (OpenCL 1.2 Chapter 5)
---------------------------------------------------------
- clBuildProgram: support for passing options to the compiler.
- clEnqueueMarker

OpenCL C Builtin Function Implementations (OpenCL 1.2 Section 6.12)
-------------------------------------------------------------------
- Atomic Functions (6.12.11)
- get_global_offset() was not linked correctly

Framework
---------

- Made it possible to override the .cl -> .bc build command
  called by clBuildProgram per device.

Device Drivers
--------------

- pthread/basic:
  * extract CPU clock frequency from /proc/cpuinfo, if available
  * return cl_khr_fp64 if doubles supported by the CPU
- ttasim: support for explicitly calling custom/special operations
  through the vendor extensions API

Misc.
-----

- Fixes for MacOSX builds.
- Fixed passing NULL as a buffer argument to clSetKernelArg
- Fixed a major bug when launching the same kernel multiple times:
  the arguments were not copied to the command object.
- Fixed several issues with ICD; it is now considered stable to be
  used by default.

0.6   August 2012
=================

Kernel library
--------------

- Added initial optimized kernel library for X86_64/SSE.
- Preliminary support for ARM architectures on Linux
  (briefly tested on MeeGo/Nokia N9).

Pthread device driver
---------------------

- Multithreading at the work group granularity using pthreads.
- Tries to figure out the optimal maximum number of
  threads for the system based on the available hardware
  threads. Currently works only in Linux using the
  /proc/cpuinfo interface.
- Region-based customized memory allocator for speeding up buffer
  allocations.

Kernel compiler
---------------

- Most of the tricky work group barrier cases (barriers inside
  for-loops etc) now supported.
- Support for local variables, also automatic locals.
- Reuse previous compilation results, if available.
- Automatic vectorization of work groups (multiple work items
  in parallel).

Miscellaneous
-------------
- Installable Client Driver (icd) support.
- Event profiling support (incomplete, works only for kernel and
  buffer read/write/map/unmap events).

Known issues
------------

- Non-pointer struct kernel arguments fail due to varying ABIs
  * https://bugs.launchpad.net/pocl/+bug/987905
- Always produces "fully unrolled" chains of work items for
  work groups, causing code size explosion for large WGs.